The well-known wave-front parallelization is proposed for parallel H.264/AVC video
processing. Under this approach, groups of independent macro-blocks (MBs) are
simultaneously processed, one group after another. A barrier mechanism is employed
to synchronize processing of the independent MBs. This approach, however, has a
substantial synchronization overhead that significantly affects the throughput
performance. A novel dynamic scheduling scheme with recursive tail submit provides
good throughput by exploiting macro-block level parallelism and alleviating the
synchronization overhead and thread contention. Nevertheless, it fails to
achieve optimal performance because it uses a global queue and is unaware
of the cache locality of the underlying multi-core architecture. I propose an adaptive
dynamic scheduling scheme that employs distributed queues and dynamically schedules
tasks in a cache locality-aware and load-balancing fashion.
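
The grouping behind wave-front parallelization can be illustrated with a minimal sketch. It assumes the commonly cited H.264 dependency pattern, in which a macro-block depends on its left, top-left, top, and top-right neighbours; the function names and the wave index x + 2y are illustrative choices, not taken from the schemes discussed above.

```python
from collections import defaultdict


def dependencies(x, y, mb_cols):
    """Neighbouring MBs assumed to be needed before MB (x, y) can be processed."""
    candidates = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
    return [(cx, cy) for cx, cy in candidates if 0 <= cx < mb_cols and cy >= 0]


def wavefront_groups(mb_cols, mb_rows):
    """Partition a frame of mb_cols x mb_rows MBs into groups of independent MBs.

    Under the assumed dependency pattern, all MBs sharing the index x + 2*y
    are mutually independent, and every dependency has a smaller index.
    """
    groups = defaultdict(list)
    for y in range(mb_rows):
        for x in range(mb_cols):
            groups[x + 2 * y].append((x, y))
    return [groups[w] for w in sorted(groups)]


if __name__ == "__main__":
    for wave, mbs in enumerate(wavefront_groups(4, 3)):
        # In a barrier-based scheme, all MBs in one wave run in parallel,
        # and a barrier separates consecutive waves.
        print(wave, mbs)
```

Each returned group corresponds to one barrier-delimited step, which makes the cost visible: a frame of W by H macro-blocks needs W + 2(H - 1) barriers under this assumption, and the first and last waves contain very few macro-blocks.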
As a graphics accelerator, a GPGPU is able to off-load compute-intensive
functions. In H.264 video encoding, hierarchical search is widely proposed for motion
estimation, the most expensive stage. A GPGPU is suitable, especially with full search-based
approaches, as the process can be efficiently parallelized. However, the fixed pyramid
structure of these approaches lacks a mechanism to select the best multiple-candidate scheme
for diverse video encoding characteristics. I propose a profile-based fixed multiple-candidate
motion vector selection scheme and an efficient dynamic multiple-candidate
motion vector selection scheme that selects the best multiple-candidate motion
vector scheme at runtime.
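
To make the role of multiple candidates concrete, the following is a minimal two-level sketch of hierarchical motion estimation with a full search at the coarse level. All names, the SAD cost, the 2x2 averaging, and the parameters `k`, `coarse_radius`, and `refine_radius` are assumptions for illustration; here `k` is simply a fixed argument, and the profile-based and dynamic selection of the candidate count proposed above is not shown.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of cur at (bx, by)
    and the block of ref displaced by (dx, dy)."""
    return sum(
        abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
        for j in range(n)
        for i in range(n)
    )


def in_bounds(ref, bx, by, dx, dy, n):
    """True if the displaced n x n block lies fully inside ref."""
    return (0 <= by + dy and by + dy + n <= len(ref)
            and 0 <= bx + dx and bx + dx + n <= len(ref[0]))


def downsample(frame):
    """Halve the resolution by averaging each 2x2 pixel neighbourhood."""
    h, w = len(frame) // 2, len(frame[0]) // 2
    return [[(frame[2 * y][2 * x] + frame[2 * y][2 * x + 1]
              + frame[2 * y + 1][2 * x] + frame[2 * y + 1][2 * x + 1]) // 4
             for x in range(w)] for y in range(h)]


def full_search(cur, ref, bx, by, n, centers, radius):
    """Evaluate every displacement within `radius` of any center.
    Returns (cost, dx, dy) tuples sorted by ascending cost."""
    seen, results = set(), []
    for cx, cy in centers:
        for dy in range(cy - radius, cy + radius + 1):
            for dx in range(cx - radius, cx + radius + 1):
                if (dx, dy) in seen or not in_bounds(ref, bx, by, dx, dy, n):
                    continue
                seen.add((dx, dy))
                results.append((sad(cur, ref, bx, by, dx, dy, n), dx, dy))
    results.sort()
    return results


def hierarchical_search(cur, ref, bx, by, n=8, k=2,
                        coarse_radius=4, refine_radius=1):
    """Two-level search: keep the k best coarse candidates, refine each."""
    cur_lo, ref_lo = downsample(cur), downsample(ref)
    coarse = full_search(cur_lo, ref_lo, bx // 2, by // 2, n // 2,
                         [(0, 0)], coarse_radius)
    # Scale the k best coarse motion vectors up to full resolution.
    centers = [(2 * dx, 2 * dy) for _, dx, dy in coarse[:k]]
    cost, dx, dy = full_search(cur, ref, bx, by, n, centers, refine_radius)[0]
    return (dx, dy), cost
```

Every displacement evaluated inside `full_search` is independent of the others, which is the property that makes full search-based approaches a natural fit for GPGPU execution. A larger `k` costs more refinement work but reduces the chance that the true motion vector is discarded at the coarse level, which is the trade-off a candidate selection scheme has to manage.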