Learning Robust Representations in Random Forest and Deep Neural Networks for Semantic Segmentation
- Author(s): Kang, Byeongkeun
- Advisor(s): Nguyen, Truong Q.
- et al.
Because semantic segmentation provides both the class and the location of objects in a captured scene, it is a core algorithm in many computer vision applications, including autonomous driving, robot navigation, surveillance camera systems, and human-machine interaction. Most of these applications demand high accuracy, robustness, and efficiency so that a captured scene is understood accurately and in a timely manner, in order to avoid accidents, provide meaningful warnings, and communicate naturally. We address these needs using two popular approaches: random forests and deep neural networks.
We start by introducing a cascaded random forest for binary class segmentation. The framework first detects regions of interest and then segments the foreground within those regions. Because detection reduces the area that the segmentation forest must process, the cascaded scheme improves both efficiency and accuracy. We then explore learning more robust representations in a random forest. Since predetermined constraints in typical feature extractors restrict the learning and extraction of optimal features, we present a random forest framework that learns the weights, shapes, and sparsities of feature extractors. We propose an unconstrained filter, an iterative optimization algorithm for learning, and a processing pipeline for inference. Experimental results demonstrate that the proposed method achieves real-time semantic segmentation using limited computational and memory resources.
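The detect-then-segment cascade described above can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `cascaded_segment`, the fixed block size, and the two callables `detect_block` and `segment_pixels` are hypothetical stand-ins for the trained detection and segmentation forests, chosen only to show why the second stage sees fewer pixels.

```python
import numpy as np

def cascaded_segment(image, detect_block, segment_pixels, block=8):
    """Two-stage cascade: run a cheap detector per block, then segment
    pixels only inside the blocks that the detector accepts.

    detect_block and segment_pixels are hypothetical stand-ins for the
    detection forest and the segmentation forest, respectively.
    """
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    evaluated = 0  # number of pixels passed to the second stage
    for y in range(0, h, block):
        for x in range(0, w, block):
            patch = image[y:y + block, x:x + block]
            if detect_block(patch):  # stage 1: region of interest?
                mask[y:y + block, x:x + block] = segment_pixels(patch)  # stage 2
                evaluated += patch.size
    return mask, evaluated

# Toy stand-ins (assumptions): simple thresholds instead of learned forests.
image = np.zeros((16, 16))
image[2:5, 3:6] = 1.0
mask, evaluated = cascaded_segment(
    image,
    detect_block=lambda p: p.max() > 0.5,
    segment_pixels=lambda p: p > 0.5,
)
```

In this toy input only one of the four blocks passes the detector, so the per-pixel stage runs on a quarter of the image while the remaining blocks are labeled background directly.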
Moreover, we present a method for learning and extracting depth-adaptive features in a deep neural network, which is a step toward depth-invariant feature learning and extraction. Since typical neural networks receive inputs from predetermined locations regardless of the distance from the camera, it is challenging to generalize the features of objects at various distances. Hence, we propose the depth-adaptive multiscale convolution layer, which consists of the adaptive perception neuron and the in-layer multiscale neuron. The adaptive perception neuron adjusts the receptive field at each spatial location using the depth information, and the in-layer multiscale neuron learns features at multiple scales. Experimental results show that the proposed method outperforms state-of-the-art methods without any additional layers or pre/post-processing.
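To make the idea of a depth-dependent receptive field concrete, the sketch below applies a 3x3 filter whose sampling spacing varies per pixel with depth. It is an illustration under stated assumptions, not the proposed layer: the function name `depth_adaptive_conv3x3`, the scaling rule `ref_depth / depth`, the rounding to integer spacings, and the `max_dilation` cap are all hypothetical choices, and the multiscale neuron is not modeled.

```python
import numpy as np

def depth_adaptive_conv3x3(image, depth, kernel, ref_depth=1.0, max_dilation=4):
    """Apply a 3x3 filter whose tap spacing at each pixel depends on depth.

    Assumed rule (hypothetical): spacing = round(ref_depth / depth), so
    nearer pixels, where objects appear larger, get a wider receptive field.
    """
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.float64)
    spacing = np.rint(ref_depth / np.maximum(depth, 1e-6))
    spacing = np.clip(spacing, 1, max_dilation).astype(int)
    ys, xs = np.mgrid[0:h, 0:w]
    for ky in (-1, 0, 1):
        for kx in (-1, 0, 1):
            # Sample the neighbor at a per-pixel offset; clamp at the borders.
            sy = np.clip(ys + ky * spacing, 0, h - 1)
            sx = np.clip(xs + kx * spacing, 0, w - 1)
            out += kernel[ky + 1, kx + 1] * image[sy, sx]
    return out

# Usage with toy data: a kernel that copies the right-hand neighbor.
image = np.arange(25, dtype=np.float64).reshape(5, 5)
shift_right = np.zeros((3, 3))
shift_right[1, 2] = 1.0
near = depth_adaptive_conv3x3(image, np.full((5, 5), 0.5), shift_right)  # spacing 2
far = depth_adaptive_conv3x3(image, np.full((5, 5), 1.0), shift_right)   # spacing 1
```

With the same weights, the near-depth call reads the neighbor two pixels away while the far-depth call reads the adjacent pixel, which is the sense in which the receptive field adapts to distance without adding layers.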
Lastly, we present applications of segmentation including sign language fingerspelling recognition and hand articulation tracking. We also present a potential data augmentation method using generative adversarial networks.