3D object detection systems based on deep neural networks have become a core component of self-driving vehicles. 3D object detection helps a vehicle understand the geometry of physical objects in 3D space, which is important for predicting their future motion. While there has been remarkable progress in image-based 2D object detection and instance segmentation, 3D object detection is less explored in the literature.
This dissertation is concerned with various challenges in 3D object detection for self-driving vehicles. We mainly discuss how to improve both the detection performance of 3D object detection systems and the computational efficiency of the detection pipeline.
The scope of this research spans 2D camera image vision, 3D LiDAR point cloud processing, sensor-fusion-based detection, the efficiency of detection pipelines, and novel data augmentation for 3D LiDAR point clouds. While the primary focus is on improving detection performance in terms of precision and recall, the core metrics of the object detection task, we also emphasize the practicality of the proposed methods.
In Chapter 2, we discuss sensor-fusion-based 3D object detection. Sensor-fusion-based detection has become essential to the safety of self-driving vehicles, and finding the best combination of multiple sensors is crucial for building more accurate and efficient detection systems. We mainly discuss how to use a 2D camera vision system and a 3D LiDAR sensor together for accurate and efficient 3D object detection.
We first explore various methods for monocular pose estimation, a research area in computer vision that predicts the location of objects in 3D space from a single camera image. While most 3D object detection systems based on monocular pose estimation have not yet achieved sufficient performance, the approach proposed in Chapter 2 uses a geometric consistency assumption to narrow the huge 3D search space down to a much smaller one. It then applies PointNet to further refine the 3D bounding box coordinates.
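To make the idea of shrinking the 3D search space concrete, the sketch below shows one standard geometric constraint; it is an illustration under stated assumptions, not RoarNet's actual procedure. Under a pinhole camera, an object of assumed real height H that appears h pixels tall lies at depth roughly z = f·H/h, and its 3D center lies approximately on the ray through the 2D box center. The function names, the fixed prior height, and the depth offsets are hypothetical choices.

```python
import numpy as np


def depth_from_box_height(focal_px: float, object_height_m: float, box_height_px: float) -> float:
    """Pinhole relation h = f * H / z, solved for the depth z (meters)."""
    return focal_px * object_height_m / box_height_px


def backproject(u: float, v: float, z: float, K: np.ndarray) -> np.ndarray:
    """Back-project pixel (u, v) at depth z into camera coordinates (x, y, z)."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])


def candidate_centers(box2d, K, object_height_m, depth_offsets):
    """Sample 3D center candidates along the ray through the 2D box center.

    Assumption: the 2D box center approximates the projection of the 3D center,
    and the object's real height is a known prior (hypothetical value).
    """
    x1, y1, x2, y2 = box2d
    u, v = (x1 + x2) / 2, (y1 + y2) / 2
    # Use the vertical focal length because the constraint comes from box height.
    z0 = depth_from_box_height(K[1, 1], object_height_m, y2 - y1)
    return np.stack([backproject(u, v, z0 + dz, K) for dz in depth_offsets])


# Hypothetical intrinsics and a 100-pixel-tall box for a roughly 1.5 m tall car.
K = np.array([[700.0, 0.0, 600.0], [0.0, 700.0, 180.0], [0.0, 0.0, 1.0]])
centers = candidate_centers((550, 130, 650, 230), K, 1.5, [-1.0, 0.0, 1.0])
```

Instead of searching all of 3D space, only a handful of candidates along one ray remain, which a point-based network could then refine.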
The approach proposed in Chapter 2, RoarNet, achieves one of the best performances on the 3D object detection task of the KITTI dataset, a standard benchmark for object detection in self-driving.
In Chapter 3, we discuss a 3D object detection pipeline based on LiDAR point clouds. We analyze the weaknesses of most 3D object detection systems, which use an end-to-end pipeline for training and testing. One interesting observation about recent 3D object detection systems is that many predictions lying close to ground-truth objects are nevertheless classified as false positives because of low-quality box regression: their overlap with the ground truth falls below the IoU threshold required for a match.
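The false-positive effect follows from how matches are counted: the KITTI 3D benchmark requires an IoU of at least 0.7 with a ground-truth box for cars. The minimal sketch below simplifies to axis-aligned boxes (the benchmark itself uses boxes rotated about the vertical axis) and uses made-up box dimensions, to show a prediction whose center is only about half a meter off still failing that threshold.

```python
from dataclasses import dataclass


@dataclass
class Box3D:
    """Axis-aligned 3D box: center (x, y, z) and size (l, w, h), in meters."""
    x: float
    y: float
    z: float
    l: float
    w: float
    h: float


def _overlap_1d(c1: float, s1: float, c2: float, s2: float) -> float:
    """Length of the overlap of two intervals given by center and extent."""
    lo = max(c1 - s1 / 2, c2 - s2 / 2)
    hi = min(c1 + s1 / 2, c2 + s2 / 2)
    return max(0.0, hi - lo)


def iou_3d(a: Box3D, b: Box3D) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = (_overlap_1d(a.x, a.l, b.x, b.l)
             * _overlap_1d(a.y, a.w, b.y, b.w)
             * _overlap_1d(a.z, a.h, b.z, b.h))
    union = a.l * a.w * a.h + b.l * b.w * b.h - inter
    return inter / union


KITTI_CAR_IOU_THRESHOLD = 0.7

gt = Box3D(0.0, 0.0, 0.0, 4.0, 1.6, 1.5)
pred = Box3D(0.5, 0.2, 0.0, 4.0, 1.6, 1.5)  # center off by about 0.54 m
iou = iou_3d(gt, pred)
is_true_positive = iou >= KITTI_CAR_IOU_THRESHOLD
print(f"IoU = {iou:.3f}, true positive: {is_true_positive}")
# prints: IoU = 0.620, true positive: False
```

The detection is spatially "right" but still counts as a false positive, which is why improving box regression alone can raise precision and recall.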
In this chapter, we introduce a practical method to improve the performance of 3D object detection systems. The proposed approach, epBRM_V1, aims to improve the quality of 3D bounding box regression and thereby the overall performance of 3D object detection. It requires less than one hour of training and adds only 12 ms of latency, yet it raises the performance of standard detection methods to the level of state-of-the-art 3D object detection methods.
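A common first step in plug-in box refinement stages is to crop the LiDAR points around a proposal box and express them in that box's local frame, so a small network can regress residual corrections to the box. Whether epBRM_V1 does exactly this is not stated here; the sketch below is a generic version under that assumption, and the function name and `margin` parameter are hypothetical.

```python
import numpy as np


def crop_to_box_frame(points, center, size, yaw, margin=0.5):
    """Return points inside an enlarged box, expressed in the box's local frame.

    points: (N, 3) array in LiDAR coordinates (z up)
    center: (3,) box center; size: (l, w, h); yaw: rotation about z in radians
    margin: hypothetical enlargement (meters) on each side, to keep nearby context
    """
    shifted = np.asarray(points) - np.asarray(center)
    # Rotate by -yaw so the box length axis aligns with local x.
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    local = shifted @ rot.T
    half = (np.asarray(size) + 2 * margin) / 2
    mask = np.all(np.abs(local) <= half, axis=1)
    return local[mask]


# A box at x = 10 m, rotated 90 degrees, so its length points along +y.
pts = np.array([[10.0, 1.5, 0.0], [10.0, 0.0, 5.0], [13.0, 0.0, 0.0]])
local_pts = crop_to_box_frame(pts, center=(10.0, 0.0, 0.0), size=(4.0, 2.0, 1.5), yaw=np.pi / 2)
```

Canonicalizing the points this way makes the regression target independent of where the proposal sits in the scene, which is one reason such a module can be trained quickly and attached to an existing detector.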
In Chapter 4, we develop epBRM_V2, which overcomes the limitations of epBRM_V1 and further improves the recall of the 3D object detection system. We improve on the previous baseline, epBRM_V1, in two respects: 1) a more sophisticated network structure for box regression that improves the representational power of the 3D LiDAR point cloud features, and 2) a novel data augmentation method for 3D LiDAR point clouds that uses tracklet information. These two simple modifications yield approximately 95% recall and 80% mAP, improvements of 10.0% in recall and 2.7% in mAP over epBRM_V1.