Self-driving technologies hold the potential to revolutionize the transportation system. However, current autonomous vehicles predominantly rely on single-agent perception, which is constrained by a limited viewpoint and often suffers from occlusions, sparse sensor measurements, and potential sensor failures. These limitations pose serious safety threats to modern autonomous driving systems and hinder their large-scale deployment. To overcome these challenges, this thesis advances the field of cooperative perception, enabling connected agents to collaborate and share sensory information, thereby enhancing the robustness and scalability of connected automation.
The first part of this thesis introduces OPV2V, the first large-scale simulation dataset for Vehicle-to-Vehicle cooperative perception. Building on this foundation, it presents a hetero-modal cooperative perception framework, HM-ViT, which employs a novel heterogeneous 3D graph transformer to jointly model inter-agent and intra-agent interactions. This approach enables effective collaboration among agents with distinct sensor modalities, significantly improving the scalability of cooperative perception.
The second part delves into enhancing the robustness of cooperative perception. It introduces V2X-ViT, a unified vision transformer designed to address common V2X challenges such as heterogeneity, localization errors, and time delays. Additionally, it presents V2XP-ASG, an adversarial scene generation framework that identifies challenging scenarios for cooperative perception through adversarial attacks. These scenarios are subsequently used to fine-tune the model, further improving its robustness.
The final part transitions to the real-world deployment and development of cooperative perception. It introduces V2X-Real, the first large-scale real-world dataset for Vehicle-to-Everything cooperative perception, featuring full sensor modalities and various V2X collaboration modes. Furthermore, it presents V2X-ReaLO, the first online framework for deploying cooperative perception in real-world systems. This framework is benchmarked to provide insights for future research and development in this field.