Automatically tracking people and their body poses in unconstrained videos is a core problem in computer vision. It serves as a foundation for high-level reasoning tasks such as activity recognition and human-computer interaction. We consider two standard tracking tasks: tracking a human as an enclosing bounding box, or as an articulated body (pose).
Each task has its own challenges. The accuracy of bounding-box tracking has improved significantly over the past decade, but detecting small people remains challenging simply because of the lack of signal. The accuracy of pose tracking is noticeably lower, especially for arms, mainly because of the fundamental difficulty of detecting indistinctive parts.
The algorithms for solving these problems are based on machine learning methodology. A common pipeline is to project raw images into an invariant feature space, train a classifier (or regressor), and infer bounding boxes or poses from the trained model.
In this thesis, we aim to improve accuracy on both tasks by proposing novel features, inference algorithms, and training schemes. For tracking bounding boxes, we focus on multiresolution features and motion features designed to robustly detect small people. For tracking poses, we focus on combinatorial inference on part models and highly tuned appearance models.
We demonstrate our approaches on standard datasets and benchmarks for pedestrian detection and human pose estimation. In particular, our pedestrian detectors achieve the top performance on the Caltech Pedestrian Detection Benchmark among more than a dozen recently developed detectors. We also achieve strong performance on (upper-body) pose estimation datasets.