Advances in AI, especially convolutional neural networks, allow us to process raw sensory information, recognize objects, and categorize them into classes at higher levels of abstraction (pedestrians, cars, trees, etc.). Taking these categories into account allows an autonomous vehicle to understand the scene, reason about its own future actions as well as those of other road users, and predict their possible interactions. This section compares commonly used methods and discusses their advantages and weaknesses.
Traditional perception pipelines used hand-crafted algorithms for feature extraction and rule-based classification (e.g., edge detection, optical flow, color segmentation). While effective in controlled conditions, these systems failed to generalize to the vast variability of real-world driving: lighting changes, weather conditions, sensor noise, and unexpected objects.
The advent of deep learning revolutionized perception by enabling systems to learn features automatically from large datasets rather than relying on manually designed rules. Deep neural networks, trained on millions of labeled examples, can capture complex, nonlinear relationships between raw sensor inputs and semantic concepts such as vehicles, pedestrians, and traffic lights.
In an autonomous vehicle, AI-based perception performs several core tasks:
* detecting and classifying objects such as vehicles, pedestrians, and traffic lights,
* segmenting the scene into semantic regions (road, sidewalk, vegetation),
* estimating the 3D position and shape of objects from cameras, LiDAR, and radar,
* tracking objects over time and predicting their motion,
* fusing information from multiple sensors into a consistent representation of the surroundings.
Deep learning architectures form the computational backbone of AI-based perception systems in autonomous vehicles. They enable the extraction of complex spatial and temporal patterns directly from raw sensory data such as images, point clouds, and radar returns. Different neural network paradigms specialize in different types of data and tasks, yet modern perception stacks often combine several architectures into hybrid frameworks.
Convolutional Neural Networks (CNNs) are the most established class of models in computer vision. They process visual information through layers of convolutional filters that learn spatial hierarchies of features, from edges and corners to textures and object parts. CNNs are particularly effective for object detection, semantic segmentation, and image classification tasks. Prominent CNN-based architectures used in autonomous driving include:
* ResNet and EfficientNet for general feature extraction,
* Faster R-CNN and YOLO families for real-time object detection,
* U-Net and DeepLab for dense semantic segmentation.
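As a concrete illustration, the following sketch runs a pretrained CNN detector (torchvision's Faster R-CNN) on a single camera frame; the specific model, the hypothetical input file, and the 0.5 confidence threshold are illustrative assumptions rather than recommendations.

```python
# Minimal sketch: camera-frame object detection with a pretrained CNN detector.
# The choice of torchvision's Faster R-CNN and the 0.5 score threshold are
# illustrative assumptions, not requirements of any particular driving stack.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()  # inference mode

image = to_tensor(Image.open("camera_frame.jpg").convert("RGB"))  # hypothetical input frame

with torch.no_grad():
    predictions = model([image])[0]  # list of images in, list of dicts out

# Keep only reasonably confident detections (threshold is an assumption).
keep = predictions["scores"] > 0.5
boxes = predictions["boxes"][keep]    # (N, 4) pixel coordinates x1, y1, x2, y2
labels = predictions["labels"][keep]  # COCO class indices (e.g., person, car)
print(boxes.shape, labels.tolist())
```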
While cameras capture two-dimensional projections, LiDAR and radar sensors produce three-dimensional point clouds that require specialized processing.
3D convolutional networks, such as VoxelNet and SECOND, discretize space into voxels and apply convolutional filters to learn geometric features.
Alternatively, point-based networks like PointNet and PointNet++ operate directly on raw point sets without voxelization, preserving fine geometric detail.
These models are critical for estimating the shape and distance of objects in 3D space, especially under challenging lighting or weather conditions.
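To make the point-based idea concrete, here is a minimal PointNet-style sketch: a shared per-point MLP followed by max pooling yields a permutation-invariant descriptor of the whole cloud. The layer sizes and the plain (x, y, z) input are simplifying assumptions.

```python
# Minimal PointNet-style encoder sketch: a shared per-point MLP followed by
# max pooling, which yields a permutation-invariant global feature per cloud.
# Layer sizes and the 3-channel (x, y, z) input are simplifying assumptions.
import torch
import torch.nn as nn

class TinyPointEncoder(nn.Module):
    def __init__(self, out_dim: int = 256):
        super().__init__()
        # 1x1 convolutions act as an MLP shared across all points.
        self.shared_mlp = nn.Sequential(
            nn.Conv1d(3, 64, kernel_size=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=1), nn.ReLU(),
            nn.Conv1d(128, out_dim, kernel_size=1),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 3) -> (batch, 3, num_points) for Conv1d
        features = self.shared_mlp(points.transpose(1, 2))
        # Max over the point dimension gives one global descriptor per cloud.
        return features.max(dim=2).values  # (batch, out_dim)

encoder = TinyPointEncoder()
cloud = torch.randn(2, 1024, 3)  # two synthetic point clouds of 1024 points
print(encoder(cloud).shape)      # torch.Size([2, 256])
```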
Transformer networks, initially developed for natural language processing, have been adapted for vision and multimodal perception.
They rely on self-attention mechanisms, which allow the model to capture long-range dependencies and contextual relationships between different parts of an image or between multiple sensors.
In autonomous driving, transformers are used for feature fusion, bird’s-eye-view (BEV) mapping, and trajectory prediction.
Notable examples include DETR (Detection Transformer), BEVFormer, and TransFusion, which unify information from cameras and LiDARs into a consistent spatial representation.
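A minimal DETR-style sketch of this idea is shown below: a fixed set of learnable object queries cross-attends to flattened image-feature tokens through a standard transformer decoder, and each query is decoded into a class score and a box. The dimensions, number of queries, and attention heads are illustrative assumptions, not the published DETR configuration.

```python
# Minimal DETR-style sketch: learnable object queries attend to flattened CNN
# feature-map tokens through a transformer decoder; each query is then decoded
# into a class score and a box. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn

d_model, num_queries, num_classes = 256, 100, 10

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)
object_queries = nn.Parameter(torch.randn(num_queries, d_model))
class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
box_head = nn.Linear(d_model, 4)                  # (cx, cy, w, h)

# Pretend these are flattened CNN features: batch of 1, 40x40 spatial tokens.
image_tokens = torch.randn(1, 1600, d_model)
queries = object_queries.unsqueeze(0).expand(1, -1, -1)

decoded = decoder(tgt=queries, memory=image_tokens)  # (1, 100, 256)
class_logits = class_head(decoded)                   # (1, 100, num_classes + 1)
boxes = box_head(decoded).sigmoid()                  # normalized box parameters
print(class_logits.shape, boxes.shape)
```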
Driving is inherently a dynamic process, requiring understanding of motion and temporal evolution. Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models, are used to process sequences of observations and capture temporal dependencies. They are common in object tracking and motion prediction, where maintaining consistent identities and velocities of moving objects over time is essential. More recent architectures use temporal convolutional networks or transformers to achieve similar results with greater parallelism and stability.
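The sketch below illustrates the recurrent approach under simplifying assumptions: an LSTM encodes a short history of 2D positions for one agent, and a linear head regresses a few future positions. The history length, horizon, and hidden size are arbitrary choices for illustration.

```python
# Minimal sketch: an LSTM encodes a short history of 2D positions for an agent
# and a linear head regresses the next few positions. History length, horizon,
# and hidden size are illustrative assumptions.
import torch
import torch.nn as nn

class LSTMMotionPredictor(nn.Module):
    def __init__(self, hidden: int = 64, horizon: int = 5):
        super().__init__()
        self.horizon = horizon
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon * 2)  # future (x, y) offsets

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, past_steps, 2) observed positions
        _, (h_last, _) = self.lstm(history)
        offsets = self.head(h_last[-1]).view(-1, self.horizon, 2)
        # Predict positions relative to the last observed point.
        return history[:, -1:, :] + offsets.cumsum(dim=1)

predictor = LSTMMotionPredictor()
past = torch.randn(4, 8, 2)   # 4 agents, 8 observed steps each
print(predictor(past).shape)  # torch.Size([4, 5, 2])
```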
Graph Neural Networks extend deep learning to relational data, representing scenes as graphs where nodes correspond to agents or landmarks and edges encode spatial or behavioral relationships.
This structure makes GNNs well suited for modeling interactions among vehicles, pedestrians, and infrastructure elements.
Models such as VectorNet, Trajectron++, and Scene Transformer use GNNs to learn dependencies between agents, supporting both scene understanding and trajectory forecasting.
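The following sketch shows a single message-passing step over an agent-interaction graph in plain PyTorch (rather than any of the cited models): each agent aggregates messages from its neighbours and updates its own state. The feature sizes and the toy edge list are illustrative assumptions.

```python
# Minimal sketch of one message-passing step over an agent-interaction graph:
# each agent aggregates messages from its neighbours and updates its state.
# Feature sizes and the toy edge list are illustrative assumptions.
import torch
import torch.nn as nn

class InteractionLayer(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.message = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, nodes: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # nodes: (num_agents, dim); edges: (num_edges, 2) as (source, target) indices
        src, dst = edges[:, 0], edges[:, 1]
        messages = self.message(torch.cat([nodes[src], nodes[dst]], dim=-1))
        # Sum incoming messages per target node.
        aggregated = torch.zeros_like(nodes).index_add_(0, dst, messages)
        return self.update(torch.cat([nodes, aggregated], dim=-1))

layer = InteractionLayer()
agents = torch.randn(4, 32)                     # 4 agents (e.g., cars, pedestrians)
edges = torch.tensor([[0, 1], [1, 0], [2, 3]])  # who influences whom
print(layer(agents, edges).shape)               # torch.Size([4, 32])
```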
Modern perception systems often combine multiple architectural families into unified frameworks. For instance, a CNN may extract image features, a point-based network may process LiDAR geometry, and a transformer may fuse both into a joint representation. These hierarchical and multimodal architectures enable robust perception across varied environments and sensor conditions, providing the high-level scene understanding required for safe autonomous behavior.
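The compact wiring sketch below illustrates this composition under assumed feature sizes: a small stand-in CNN encodes the camera image, a per-point MLP encodes the LiDAR cloud, and cross-attention fuses image appearance cues into the LiDAR-derived tokens. All modules and dimensions are placeholders for the much larger components used in practice.

```python
# Compact wiring sketch of a hybrid stack under assumed feature sizes: a small
# CNN encodes the camera image, a per-point MLP encodes the LiDAR cloud, and
# cross-attention fuses image tokens into the LiDAR-derived queries.
# All module choices and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

d = 128
image_backbone = nn.Sequential(  # stand-in for a real CNN backbone
    nn.Conv2d(3, d, kernel_size=7, stride=4, padding=3), nn.ReLU(),
)
point_encoder = nn.Sequential(nn.Linear(3, d), nn.ReLU(), nn.Linear(d, d))
fusion = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

image = torch.randn(1, 3, 128, 128)  # camera frame
cloud = torch.randn(1, 512, 3)       # LiDAR points (x, y, z)

image_tokens = image_backbone(image).flatten(2).transpose(1, 2)  # (1, 1024, d)
point_tokens = point_encoder(cloud)                              # (1, 512, d)

# Each LiDAR token queries the image tokens for complementary appearance cues.
fused, _ = fusion(query=point_tokens, key=image_tokens, value=image_tokens)
print(fused.shape)  # torch.Size([1, 512, 128]) -- joint multimodal features
```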
Scene understanding is the process by which an autonomous agent interprets its environment as a coherent model, integrating the environment map, objects, semantics, and dynamics into a structured representation that supports decision-making. It is the bridge between raw perception and higher-level autonomy functions such as planning, prediction, and control.
The goal of scene understanding is to transform fragmented sensor detections into a meaningful, temporally consistent model of the surrounding scene.
Scene understanding often relies on multi-layered representations:
* a geometric layer describing where entities and surfaces are located in space,
* a semantic layer describing what those entities are (vehicles, pedestrians, lanes, signs),
* a relational layer describing how entities interact with one another and with the static environment,
* a temporal layer describing how the scene evolves and how agents are likely to move.
The relational layer captures how entities within a traffic scene interact with one another and with the static environment. While lower layers (geometric and semantic) describe what exists and where it is, the relational layer describes how elements relate — spatially, functionally, and behaviorally.
Spatial relations describe, for example, mutual distance, relative velocity, and potential conflicts between trajectories. Functional relations describe how one entity modifies, limits, or restricts the behavior of another; for example, traffic lanes constrain the movement of vehicles and railings restrict the movement of pedestrians.
These relations can be represented explicitly by scene graphs, where nodes represent entities and edges represent relationships, or encoded implicitly in neural networks such as vision-language models.
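As an illustration of the explicit variant, the sketch below builds a tiny scene graph with typed edges for spatial and functional relations; the use of networkx and the particular entity, attribute, and relation names are illustrative assumptions.

```python
# Minimal sketch of an explicit scene graph: entities are nodes with semantic
# attributes, and typed edges capture spatial and functional relations. The use
# of networkx and the attribute names are illustrative assumptions.
import networkx as nx

scene = nx.MultiDiGraph()

# Entities detected online or known from the map.
scene.add_node("ego", kind="vehicle", speed_mps=12.0)
scene.add_node("ped_1", kind="pedestrian", speed_mps=1.4)
scene.add_node("lane_3", kind="lane")
scene.add_node("crosswalk_7", kind="crosswalk")

# Spatial relations: geometry-derived, updated every frame.
scene.add_edge("ped_1", "ego", relation="in_front_of", distance_m=18.0)

# Functional relations: static or map-derived constraints on behaviour.
scene.add_edge("lane_3", "ego", relation="constrains_motion_of")
scene.add_edge("crosswalk_7", "ped_1", relation="grants_right_of_way_to")

# A planner-side query: which entities currently constrain the ego vehicle?
constraints = [u for u, v, data in scene.in_edges("ego", data=True)
               if data["relation"] == "constrains_motion_of"]
print(constraints)  # ['lane_3']
```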
Scene understanding must maintain temporal stability across frames. Flickering detections or inconsistent semantic labels can lead to unstable planning. Techniques include temporal smoothing, cross-frame data association to maintain consistent object identities, and memory networks that preserve contextual information across time.
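A minimal sketch of cross-frame data association is given below: current detections are greedily matched to previous-frame tracks by bounding-box IoU so that object identities persist over time. The IoU threshold and box format are assumptions, and production trackers typically add motion models and more robust assignment (e.g., Hungarian matching).

```python
# Minimal sketch of cross-frame data association: greedily match current
# detections to previous-frame tracks by bounding-box IoU so object identities
# stay consistent over time. The 0.3 threshold and box format are assumptions.
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    # Boxes as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks: dict, detections: list, threshold: float = 0.3) -> dict:
    """Map detection index -> existing track id (or a new id) by greedy IoU."""
    assignments, used = {}, set()
    next_id = max(tracks, default=-1) + 1
    for i, det in enumerate(detections):
        scores = {tid: iou(det, box) for tid, box in tracks.items() if tid not in used}
        best = max(scores, key=scores.get, default=None)
        if best is not None and scores[best] >= threshold:
            assignments[i] = best
            used.add(best)
        else:
            assignments[i] = next_id
            next_id += 1
    return assignments

previous = {0: np.array([100, 100, 150, 180]), 1: np.array([300, 120, 360, 200])}
current = [np.array([104, 102, 154, 183]), np.array([500, 90, 540, 170])]
print(associate(previous, current))  # {0: 0, 1: 2} -- one matched track, one new track
```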
The temporal part of scene understanding is tightly coupled with motion prediction, i.e., forecasting the future trajectories of all dynamic agents. Two primary approaches are physics-based models (e.g., constant-velocity or bicycle models), which are simple and interpretable but limited in complex interactions, and learning-based models, in which data-driven networks capture contextual dependencies and multiple possible futures (e.g., MultiPath, Trajectron++, VectorNet).
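As a baseline illustration of the physics-based approach, the sketch below estimates an agent's velocity from its last two observed positions and extrapolates at constant velocity; the time step and toy trajectory are assumptions.

```python
# Minimal sketch of the physics-based baseline: estimate an agent's velocity
# from its last two observed positions and extrapolate at a constant velocity.
# The 0.1 s time step and the toy history are illustrative assumptions.
import numpy as np

def constant_velocity_forecast(history: np.ndarray, horizon: int, dt: float = 0.1) -> np.ndarray:
    """history: (steps, 2) observed (x, y); returns (horizon, 2) future positions."""
    velocity = (history[-1] - history[-2]) / dt       # finite-difference estimate
    steps = np.arange(1, horizon + 1).reshape(-1, 1)  # 1, 2, ..., horizon
    return history[-1] + steps * velocity * dt

# An agent moving roughly straight ahead at ~10 m/s.
observed = np.array([[0.0, 0.0], [1.0, 0.05], [2.0, 0.1]])
print(constant_velocity_forecast(observed, horizon=3))
# [[3.   0.15]
#  [4.   0.2 ]
#  [5.   0.25]]
```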