====== AI-based Perception and Scene Understanding ======
| + | Advances in AI, especially the convolutional neural network, allow us to process raw sensory information and recognize objects and categorize them into classes with higher levels of abstraction (pedestrians, | ||
| + | |||
| + | Traditional perception pipelines used hand-crafted algorithms for feature extraction and rule-based classification (e.g., edge detection, optical flow, color segmentation). While effective for controlled conditions, these systems failed to generalize to the vast variability of real-world driving — lighting changes, weather conditions, sensor noise, and unexpected objects. | ||
| + | |||
| + | The advent of deep learning revolutionized perception by enabling systems to learn features automatically from large datasets rather than relying on manually designed rules. Deep neural networks, trained on millions of labeled examples, can capture complex, nonlinear relationships between raw sensor inputs and semantic concepts such as vehicles, pedestrians, | ||
| + | |||
| + | In an autonomous vehicle, AI-based perception performs several core tasks: | ||
| + | |||
| + | * | ||
| + | * | ||
| + | * | ||
| + | * Scene understanding – integrating spatial, semantic, and temporal information into a coherent representation. | ||
| + | * | ||
| + | |||
| + | ===== Deep Learning Architectures ===== | ||
| + | |||
| + | Deep learning architectures form the computational backbone of AI-based perception systems in autonomous vehicles. | ||
| + | They enable the extraction of complex spatial and temporal patterns directly from raw sensory data such as images, point clouds, and radar returns. | ||
| + | Different neural network paradigms specialize in different types of data and tasks, yet modern perception stacks often combine several architectures into hybrid frameworks. | ||
| + | |||
| + | === Convolutional Neural Networks (CNNs) === | ||
| + | Convolutional Neural Networks are the most established class of models in computer vision. | ||
| + | They process visual information through layers of convolutional filters that learn spatial hierarchies of features — from edges and corners to textures and object parts. | ||
| + | CNNs are particularly effective for **object detection**, | ||
| + | Prominent CNN-based architectures used in autonomous driving include: | ||
| + | |||
| + | * '' | ||
| + | * '' | ||
| + | * '' | ||
| + | |||
| + | === 3D Convolutional and Point-Based Networks === | ||
| + | While cameras capture two-dimensional projections, | ||
| + | 3D convolutional networks, such as '' | ||
| + | Alternatively, | ||
| + | These models are critical for estimating the shape and distance of objects in 3D space, especially under challenging lighting or weather conditions. | ||
| + | |||
| + | {{ : | ||
| + | |||
| + | === Transformer Architectures === | ||
| + | Transformer networks, initially developed for natural language processing, have been adapted for vision and multimodal perception. | ||
| + | They rely on **self-attention mechanisms**, | ||
| + | In autonomous driving, transformers are used for **feature fusion**, **bird’s-eye-view (BEV) mapping**, and **trajectory prediction**. | ||
| + | Notable examples include '' | ||
| + | |||
| + | |||
| + | |||
| + | === Recurrent and Temporal Models === | ||
| + | Driving is inherently a dynamic process, requiring understanding of motion and temporal evolution. | ||
| + | Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models, are used to process sequences of observations and capture temporal dependencies. | ||
| + | They are common in **object tracking** and **motion prediction**, | ||
| + | More recent architectures use temporal convolutional networks or transformers to achieve similar results with greater parallelism and stability. | ||
| + | |||
| + | {{ : | ||
| + | |||
| + | === Graph Neural Networks (GNNs) === | ||
| + | Graph Neural Networks extend deep learning to relational data, representing scenes as graphs where nodes correspond to agents or landmarks and edges encode spatial or behavioral relationships. | ||
| + | This structure makes GNNs well suited for modeling **interactions** among vehicles, pedestrians, | ||
| + | Models such as '' | ||
| + | |||
| + | Modern perception systems often combine multiple architectural families into unified frameworks. | ||
| + | For instance, a CNN may extract image features, a point-based network may process LiDAR geometry, and a transformer may fuse both into a joint representation. | ||
| + | These hierarchical and multimodal architectures enable robust perception across varied environments and sensor conditions, providing the high-level scene understanding required for safe autonomous behavior. | ||
| + | |||
| + | ==== Data Requirements ==== | ||
| + | |||
| + | The effectiveness of ''' | ||
| + | As deep neural networks do not rely on explicit programming, | ||
| + | |||
| + | Robust perception requires exposure to the full range of operating conditions that a vehicle may encounter. | ||
| + | Datasets must include variations in: | ||
| + | |||
| + | * **Sensor modalities** – data from cameras, LiDAR, radar, GNSS, and IMU, reflecting the multimodal nature of perception. | ||
| + | * **Environmental conditions** – daytime and nighttime scenes, different seasons, weather effects such as rain, fog, or snow. | ||
| + | * **Geographical and cultural contexts** – urban, suburban, and rural areas; diverse traffic rules and road signage conventions. | ||
| + | * **Behavioral diversity** – normal driving, aggressive maneuvers, and rare events such as jaywalking or emergency stops. | ||
| + | * **Edge cases** – rare but safety-critical situations, including near-collisions or sensor occlusions. | ||
| + | |||
| + | A balanced dataset should capture both common and unusual situations to ensure that perception models generalize safely beyond the training distribution. | ||
| + | Because collecting real-world data for every possible scenario is impractical and almost impossible, simulated or synthetic data are often used to supplement real-world datasets. | ||
| + | Photorealistic simulators such as '' | ||
| + | Synthetic data helps to fill gaps in real-world coverage and supports transfer learning, though domain adaptation is often required to mitigate the so-called '' | ||
| + | |||
| + | === Annotation and Labeling === | ||
| + | Supervised learning models rely on accurately annotated datasets, where each image, frame, or point cloud is labeled with semantic information such as object classes, bounding boxes, or segmentation masks. | ||
| + | Annotation quality is critical: inconsistent or noisy labels can propagate systematic errors through the learning process. | ||
| + | Modern annotation pipelines combine human labeling with automation — using pre-trained models, interactive tools, and active learning to accelerate the process. | ||
| + | High-precision labeling is particularly demanding for LiDAR point clouds and multi-sensor fusion datasets, where 3D geometric consistency must be maintained across frames. | ||
| + | |||
| + | |||
| + | === Ethical and Privacy Considerations === | ||
| + | Data used in autonomous driving frequently includes imagery of people, vehicles, and property. | ||
| + | To comply with privacy regulations and ethical standards, datasets must be anonymized by blurring faces and license plates, encrypting location data, and maintaining secure data storage. | ||
| + | Fairness and inclusivity in dataset design are equally important to prevent bias across geographic regions or demographic contexts. | ||
| + | |||
| + | |||
| + | ==== Scene Understanding ==== | ||
| + | |||
| + | Scene understanding is a process by which an autonomous agent interprets its environment as a coherent model — integrating environment map, objects, semantics, and dynamics into a structured representation that supports decision-making. | ||
| + | It is the bridge between raw perception and higher-level autonomy functions such as planning, prediction, and control. | ||
| + | |||
| + | The goal of scene understanding is to transform fragmented sensor detections into a meaningful, temporally consistent model of the surrounding scene. | ||
| + | |||
| + | |||
| + | Scene understanding often relies on multi-layered representations: | ||
| + | |||
| + | * | ||
| + | * | ||
| + | * | ||
| + | * | ||
| + | * | ||
| + | |||
| + | The relational layer captures how entities within a traffic scene interact with one another and with the static environment. | ||
| + | While lower layers (geometric and semantic) describe what exists and where it is, the relational layer describes how elements relate — spatially, functionally, | ||
| + | |||
| + | Spatial relation describes e.g. mutual distance, relative velocity, and possible collision of trajectories. Functional relations describe when one entity modifies, limits, or restricts functions of another, e.g., traffic lanes modify the movement of vehicles, railing restricts the movement of pedestrians, | ||
| + | |||
| + | These relations can be explicitly represented by scene graphs, where nodes represent entities and edges represent relationships, | ||
| + | |||
| + | {{: | ||
| + | |||
| + | Scene understanding must maintain temporal stability across frames. Flickering detections or inconsistent semantic labels can lead to unstable planning. | ||
| + | | ||
| + | |||
| + | The temporal part of the scene understanding is tightly coupled with motion prediction and forecasting future trajectories of all dynamic agents. | ||
| + | Two primary approaches are Physics-based models (e.g., constant-velocity, | ||