Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for
high resolution and high quality video compression technologies such as H.264. Such solutions not only provide
exceptional quality but also efficiency, low power, and low latency, previously unattainable in software-based
designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve low latency,
low power, and real-time performance in some consumer devices, many applications require a flexible and
scalable software-defined solution.
The deblocking filter in an H.264 encoder/decoder poses difficult implementation challenges because of heavy data
dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and
difficult to reconfigure for different needs. Scaling up to higher quality requirements such as 10-bit pixel
depth or a 4:2:2 chroma format often reduces the throughput of a parallel architecture designed for a lower feature set. A
scalable architecture for deblocking filtering, built on a massively parallel processor, means that the
same encoder or decoder can be deployed in a variety of applications, at different video resolutions, for different power
requirements, and at higher bit depths and richer chroma subsampling formats such as YUV 4:2:2 or 4:4:4.
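The conditional nature of the computation mentioned above can be illustrated with a minimal sketch of the per-sample filtering decision for one block edge. It follows the general shape of the H.264 normal-strength (bS < 4) path, but it is simplified: the thresholds alpha and beta and the clipping value tc are taken as plain inputs here rather than derived from the quantization parameter, only p0 and q0 are updated, and the function name and parameters are hypothetical rather than taken from this work.

```python
def clip3(lo, hi, v):
    """Clamp v to the inclusive range [lo, hi]."""
    return max(lo, min(hi, v))


def filter_edge_sample(p1, p0, q0, q1, bs, alpha, beta, tc, bit_depth=8):
    """Filter the p0|q0 sample pair across a block edge (simplified sketch).

    p1, p0 lie on one side of the edge and q0, q1 on the other.
    alpha, beta and tc are assumed inputs, not looked up from QP tables.
    """
    # Boundary strength 0 means the edge is never filtered.
    if bs == 0:
        return p0, q0
    # Filter only if the step across the edge is small enough to be a
    # blocking artifact rather than a real image edge.
    if not (abs(p0 - q0) < alpha and abs(p1 - p0) < beta and abs(q1 - q0) < beta):
        return p0, q0
    max_val = (1 << bit_depth) - 1  # 255 for 8-bit, 1023 for 10-bit
    delta = clip3(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    return clip3(0, max_val, p0 + delta), clip3(0, max_val, q0 - delta)
```

Because every sample pair takes a data-dependent branch, and because the output of one edge feeds the decision for the next, this kind of kernel is harder to map onto many cores than a fixed arithmetic pipeline. The bit_depth argument hints at why a higher pixel depth changes the clipping range without changing the control flow.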
Low-power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like
that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor
elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. The software
programming model for massively parallel processors offers a flexible implementation and a power efficiency close to that
of ASIC solutions.
This work describes a scalable parallel architecture for an H.264-compliant deblocking filter for multi-core platforms
such as HyperX technology. Parallel techniques such as parallel processing of independent macroblocks, sub-blocks,
and pixel rows are examined in this work. The deblocking architecture consists of a basic cell called the deblocking
filter unit (DFU) and a dependent data buffer manager (DFM). The DFU can be instantiated several times to meet
different performance needs. The DFM supplies the data required by however many DFUs are in use, and also manages all
the neighboring data required for future processing by the DFUs. This approach achieves the scalability, flexibility, and
performance required in deblocking filters.
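As a rough illustration of macroblock-level parallelism with a variable number of DFUs, the sketch below groups macroblocks into waves and spreads each wave across the available units. It assumes, for illustration only, that a macroblock can be filtered once its left and top neighbors are done, so all macroblocks with the same x + y are independent. The function name, the round-robin assignment, and the dependency rule are hypothetical and are not taken from the architecture described here.

```python
from collections import defaultdict


def wavefront_schedule(mb_cols, mb_rows, num_dfus):
    """Return a list of waves; each wave is a list with one entry per DFU,
    holding the (x, y) macroblocks that DFU would process in that wave.

    Assumption: macroblock (x, y) depends only on (x-1, y) and (x, y-1).
    """
    waves = defaultdict(list)
    for y in range(mb_rows):
        for x in range(mb_cols):
            # Macroblocks on the same anti-diagonal do not depend on each other.
            waves[x + y].append((x, y))

    schedule = []
    for w in sorted(waves):
        assignment = [[] for _ in range(num_dfus)]
        # Round-robin the independent macroblocks over the DFU instances.
        for i, mb in enumerate(waves[w]):
            assignment[i % num_dfus].append(mb)
        schedule.append(assignment)
    return schedule


if __name__ == "__main__":
    # A 3x2 macroblock frame handled by 2 DFUs.
    for step, wave in enumerate(wavefront_schedule(3, 2, 2)):
        print(step, wave)
```

Changing num_dfus alters only how each wave is divided, not the order of the waves. That is the sense in which a manager such as the DFM could serve a different number of filter units without changing the filter kernel itself.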