

AI-based Perception and Scene Understanding



Advances in AI, especially convolutional neural networks, allow us to process raw sensory information, recognize objects, and categorize them into classes at higher levels of abstraction (pedestrians, cars, trees, etc.). Taking these categories into account allows autonomous vehicles to understand the scene, reason about the future actions of the vehicle as well as of other road users, and make predictions about their possible interactions. This section compares commonly used methods and discusses their advantages and weaknesses.

Scene Understanding

Scene understanding is a process by which an autonomous agent interprets its environment as a coherent model, integrating an environment map, objects, semantics, and dynamics into a structured representation that supports decision-making. It is the bridge between raw perception and higher-level autonomy functions such as planning, prediction, and control.

The goal of scene understanding is to transform fragmented sensor detections into a meaningful, temporally consistent model of the surrounding scene.

Scene understanding often relies on multi-layered representations (a minimal data sketch of such a layered model follows the list):

  • Geometric Layer – 3D occupancy grids or bird’s-eye-view (BEV) maps of the environment.
  • Semantic Layer – class labels for objects and surfaces.
  • Relational Layer – links, relations and dependencies among entities.
  • Temporal Layer – short-term evolution and motion prediction.
  • Behavioral Layer – inferred intentions and possible maneuvers.
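The sketch below shows one way these layers might be combined into a single data structure. It is only an illustration; the class and field names (SceneObject, SceneModel, etc.) are assumptions, not a standard API.

  # Minimal sketch of a layered scene representation; all names are illustrative.
  from dataclasses import dataclass, field
  from typing import Dict, List, Tuple

  import numpy as np


  @dataclass
  class SceneObject:
      track_id: int
      label: str                      # semantic layer: class label
      position: Tuple[float, float]   # geometric layer: BEV coordinates (m)
      velocity: Tuple[float, float]   # temporal layer: estimated motion (m/s)
      predicted_path: List[Tuple[float, float]] = field(default_factory=list)
      intent: str = "unknown"         # behavioral layer: inferred maneuver


  @dataclass
  class SceneModel:
      occupancy_grid: np.ndarray                       # geometric layer (BEV grid)
      objects: Dict[int, SceneObject] = field(default_factory=dict)
      # relational layer: (subject_id, relation, object_id), e.g. (3, "yields_to", 7)
      relations: List[Tuple[int, str, int]] = field(default_factory=list)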

The relational layer captures how entities within a traffic scene interact with one another and with the static environment. While lower layers (geometric and semantic) describe what exists and where it is, the relational layer describes how elements relate — spatially, functionally, and behaviorally.

Spatial relations describe, for example, mutual distance, relative velocity, and potential conflicts between trajectories. Functional relations describe cases where one entity modifies, limits, or restricts the behavior of another, e.g., traffic lanes constrain the movement of vehicles, railings restrict the movement of pedestrians, etc.
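As a simple illustration of spatial relations, the sketch below computes mutual distance, closing speed, and a rough time-to-collision between two agents. The formulation is one common simplification, and the function name is an assumption.

  # Illustrative spatial-relation features between two agents (not a prescribed method).
  import numpy as np


  def spatial_relation(pos_a, vel_a, pos_b, vel_b):
      """Return distance (m), closing speed (m/s), and a rough time-to-collision (s)."""
      pos_a, vel_a = np.asarray(pos_a, float), np.asarray(vel_a, float)
      pos_b, vel_b = np.asarray(pos_b, float), np.asarray(vel_b, float)

      offset = pos_b - pos_a
      distance = np.linalg.norm(offset)
      rel_vel = vel_b - vel_a

      # Closing speed: component of relative velocity along the line of sight.
      closing_speed = -np.dot(rel_vel, offset) / max(distance, 1e-6)
      ttc = distance / closing_speed if closing_speed > 0 else float("inf")
      return distance, closing_speed, ttc


  # Example: ego at the origin at 15 m/s, lead vehicle 30 m ahead at 10 m/s.
  print(spatial_relation((0, 0), (15, 0), (30, 0), (10, 0)))  # ~ (30.0, 5.0, 6.0)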

These relations can be explicitly represented by scene graphs, where nodes represent entities and edges represent relationships, or encoded implicitly in different types of neural networks, e.g., vision-language models.
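A minimal scene-graph sketch is given below, using the networkx library for the graph structure. The node names, labels, and relation types are illustrative choices, not a standard schema.

  # Minimal scene-graph sketch; entities are nodes, relations are directed edges.
  import networkx as nx

  graph = nx.DiGraph()

  # Nodes: entities detected in the scene (the semantic layer supplies the labels).
  graph.add_node("car_1", label="car", lane="ego_lane")
  graph.add_node("pedestrian_1", label="pedestrian")
  graph.add_node("crosswalk_1", label="crosswalk")

  # Edges: spatial and functional relations between entities.
  graph.add_edge("pedestrian_1", "crosswalk_1", relation="approaching")
  graph.add_edge("crosswalk_1", "car_1", relation="requires_yield")

  # Downstream modules can query the graph, e.g. everything constraining car_1.
  for src, dst, data in graph.in_edges("car_1", data=True):
      print(f"{src} --{data['relation']}--> {dst}")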

Scene understanding must maintain temporal stability across frames. Flickering detections or inconsistent semantic labels can lead to unstable planning. Techniques include temporal smoothing, cross-frame data association to maintain consistent object identities, or memory networks that preserve contextual information across time.
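One simple form of temporal smoothing is to average class scores over time for each tracked object, which suppresses single-frame label flicker. The sketch below is an illustrative exponential-moving-average variant; the weight 0.8 is an arbitrary example value.

  # Hedged sketch of per-track temporal smoothing of class scores.
  import numpy as np


  class LabelSmoother:
      """Reduces frame-to-frame label flicker for one tracked object."""

      def __init__(self, num_classes: int, alpha: float = 0.8):
          self.alpha = alpha
          self.state = np.full(num_classes, 1.0 / num_classes)  # uniform prior

      def update(self, class_scores: np.ndarray) -> int:
          # Blend the new detection scores into the running estimate.
          self.state = self.alpha * self.state + (1.0 - self.alpha) * class_scores
          return int(np.argmax(self.state))  # smoothed label index


  # Example: a noisy "car vs. pedestrian" detector that flickers for one frame.
  smoother = LabelSmoother(num_classes=2)
  for scores in ([0.9, 0.1], [0.2, 0.8], [0.85, 0.15]):  # middle frame flickers
      print(smoother.update(np.asarray(scores)))          # stays on class 0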

The temporal part of scene understanding is tightly coupled with motion prediction, i.e., forecasting the future trajectories of all dynamic agents. Two primary approaches are physics-based models (e.g., constant-velocity or bicycle models), which are simple and interpretable but limited in complex interactions, and learning-based models, where data-driven networks capture contextual dependencies and multiple possible futures (e.g., MultiPath, Trajectron++, VectorNet).
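As a concrete example of the physics-based family, the sketch below implements a constant-velocity forecast; real systems would typically use richer kinematic models or the learned predictors mentioned above. The function name and parameters are illustrative.

  # Illustrative constant-velocity predictor (the simplest physics-based model).
  import numpy as np


  def constant_velocity_forecast(position, velocity, horizon_s=3.0, dt=0.5):
      """Extrapolate a straight-line trajectory over the prediction horizon."""
      position = np.asarray(position, float)
      velocity = np.asarray(velocity, float)
      steps = np.arange(dt, horizon_s + dt, dt)        # future time offsets (s)
      return position + steps[:, None] * velocity     # (num_steps, 2) BEV waypoints


  # Example: vehicle at (10, 2) m moving at 8 m/s along the x-axis.
  print(constant_velocity_forecast((10.0, 2.0), (8.0, 0.0)))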
