As the latest video compression standard, H.264/AVC achieves better compression performance than its predecessors. Many new features are used to achieve much better rate-distortion efficiency and subjective quality, at the cost of high computational complexity and intensive memory access. These heavy demands on memory and computational resources lead to long processing cycles and high power consumption, which makes real-time H.264/AVC encoding hard to implement.
To address these difficulties, this thesis focuses on fast algorithms, data reuse, and parallel architectures for the H.264/AVC encoder. For data reuse, we propose a partially forward processing algorithm (PFPA), which reuses reference information to avoid loading the same reference data more than once. For fast algorithms, we study the statistical features of fractional motion estimation (FME) and propose an FME mode reduction scheme. For parallel architectures, we propose two solutions, one for block-level and one for macroblock-level (MB-level) parallelization. At the block level, we propose an FME parallel architecture that is efficient in both memory access and processing cycles: it reduces memory accesses by about 67% and processing cycles by about 50% compared with most state-of-the-art architectures. At the MB level, we propose a wavefront architecture. In theory, this architecture can extend a multi-core encoder to a system with any desired number of cores without sacrificing encoding quality.
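As an illustration of MB-level wavefront parallelism, the following minimal Python sketch groups the macroblocks of a frame into waves that could be processed concurrently. It assumes the commonly used H.264 neighbour dependencies (left, top-left, top, and top-right macroblocks); the function names and the frame size are hypothetical, and the architecture proposed in this thesis may schedule macroblocks differently.

```python
from collections import defaultdict


def dependencies(x, y, mb_cols):
    """Neighbours assumed to be needed before MB (x, y) can be encoded:
    left, top-left, top and top-right (an assumption, not the thesis's design)."""
    candidates = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
    return [(cx, cy) for cx, cy in candidates if 0 <= cx < mb_cols and cy >= 0]


def wavefront_schedule(mb_cols, mb_rows):
    """Assign MB (x, y) to wave x + 2*y, so that every assumed dependency
    falls in an earlier wave and MBs within a wave are independent."""
    waves = defaultdict(list)
    for y in range(mb_rows):
        for x in range(mb_cols):
            waves[x + 2 * y].append((x, y))
    return [waves[t] for t in range(mb_cols + 2 * (mb_rows - 1))]


if __name__ == "__main__":
    # Hypothetical tiny frame of 4 x 3 macroblocks.
    for t, wave in enumerate(wavefront_schedule(4, 3)):
        print(t, wave)
```

Under these assumptions, the number of macroblocks in the widest wave bounds how many cores can be kept busy at once, which is why larger frames leave room for more cores.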
Both the JM model and Tensilica XTMP are used to verify the proposed architectures. Architecture implementation details are discussed, and cycle-accurate test results show good performance improvements with very small overhead. From dual-core to three-core and quad-core, the P-Core performance overheads are 0.8% and 1.3% for I-frames, and 1.7% and 2.4% for P-frames. The corresponding speed-ups are 1.49 and 1.97 for I-frames, and 1.47 and 1.95 for P-frames. System up-scaling methodologies are also covered at the end of this thesis.
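The reported overheads and speed-ups can be cross-checked with a short calculation. The sketch below assumes that the overhead is the relative loss against ideal linear scaling from the dual-core baseline (ideal speed-up n/2 for n cores); this interpretation and the helper name are assumptions, not definitions taken from the thesis.

```python
def expected_speedup(cores, overhead, baseline_cores=2):
    """Speed-up over the baseline if scaling were linear, reduced by the
    given fractional overhead (assumed interpretation of the figures)."""
    ideal = cores / baseline_cores
    return ideal * (1.0 - overhead)


# (frame type, cores, reported overhead, reported speed-up), from the text above.
reported = [
    ("I", 3, 0.008, 1.49),
    ("I", 4, 0.013, 1.97),
    ("P", 3, 0.017, 1.47),
    ("P", 4, 0.024, 1.95),
]

if __name__ == "__main__":
    for frame, cores, overhead, speedup in reported:
        estimate = expected_speedup(cores, overhead)
        print(f"{frame}-frames, {cores} cores: estimated {estimate:.3f}, reported {speedup}")
```

Under this assumption, each estimate agrees with the reported speed-up to within rounding, which suggests the two sets of figures describe the same measurements.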