Parallel Application Library for Object Recognition
- Author(s): Su, Bor-Yiing
- Advisor(s): Keutzer, Kurt W.
- et al.
Computer vision research enables machines to understand the world. Humans usually interpret and analyze the world through what they see - the objects they capture with their eyes. Similarly, machines can better understand the world by recognizing objects in images. Object recognition is therefore a major branch of computer vision. To achieve the highest accuracy, state-of-the-art object recognition systems must extract features from hundreds to millions of images, train models with enormous data samples, and deploy those models on query images. As a result, these systems are computationally-intensive. In order to make such complicated algorithms practical to apply in real life, we must accelerate them on modern massively-parallel platforms.
However, parallel programming is complicated and challenging, and takes years to master. In order to help object recognition researchers employ parallel platforms more productively, we propose a parallel application library for object recognition. Researchers can simply call the library functions, and need not understand the technical details of parallelization and optimization. To pave the way for such a library, we perform pattern mining on 31 important object recognition systems, and conclude that 15 application patterns are necessary to cover the computations in these systems. In other words, if we support these 15 application patterns in our library, we can parallelize all 31 object recognition systems. In order to optimize any given application pattern in a systematic way, we propose using patterns and software architectures to explore the design space of algorithms, parallelization strategies, and platform parameters.
In this dissertation, we exhaustively examine the design space for three application patterns, and achieve significant speedups on these patterns - 280x speedup on the eigensolver application pattern, 12-33x speedup on the breadth-first-search graph traversal application pattern, and 5-30x speedup on the contour histogram application pattern. To improve the portability and flexibility of the proposed library, we also initiate the OpenCL for OpenCV project. This project aims to provide a collection of autotuners that optimize the performance of application patterns on many different parallel platforms. We have developed two autotuners in this project. clSpMV is an autotuner for sparse matrix vector multiplication (SpMV) computation - it tunes the representation of a sparse matrix and the corresponding SpMV kernel, and is 40% faster than the vendor-optimized parallel implementation. clPaDi is an autotuner for the pair-wise distance computation application pattern - it allows users to customize their own distance functions, and finds the best blocking size for each function. clPaDi performs 320-650 giga floating point operations per second on modern GPU platforms. By employing these optimized functions in a state-of-the-art object recognition system, we have achieved 110-120x speedup compared to the original serial implementation. Now it takes only three seconds to identify objects in a query image - a much more practical and useful processing time. Our research makes it possible to deploy complicated object recognition algorithms in real applications.
With these encouraging results, we are confident that the methodology we illustrate in this dissertation is applicable to optimizing all application patterns. If we expand the parallel application library to support all 15 application patterns, the library will be a key toolkit for both existing and future object recognition systems.