Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for
high resolution and high quality video compression technologies such as H.264. Such solutions not only provide
exceptional quality but also efficiency, low power, and low latency, previously unattainable in software-based
designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve low latency,
low power, and real-time performance in some consumer devices, many applications require a flexible and
scalable software-defined solution.
The deblocking filter in an H.264 encoder/decoder poses difficult implementation challenges because of heavy data
dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and
difficult to reconfigure for different needs. Scaling up to higher quality requirements such as 10-bit pixel
depth or a 4:2:2 chroma format often reduces the throughput of a parallel architecture designed for a lower feature set. A
scalable architecture for deblocking filtering, built on a massively parallel processor, means that the
same encoder or decoder can be deployed in a variety of applications, at different video resolutions, under different power
requirements, and at higher bit depths and better color subsampling patterns such as YUV 4:2:2 or 4:4:4.
Low power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like
that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor
elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software
programming model for massively parallel processors offers a flexible implementation and a power efficiency close to that
of ASIC solutions.
This work describes a scalable parallel architecture for an H.264 compliant deblocking filter for multi-core platforms
such as HyperX technology. Parallelism at the level of independent macroblocks, sub-blocks, and pixel rows is
examined in this work. The deblocking architecture consists of a basic cell called the deblocking
filter unit (DFU) and a dependent data buffer manager (DFM). The DFU can be instantiated several times to meet
different performance needs. The DFM serves the data required by however many DFUs are in use, and also manages all
the neighboring data that the DFUs will need for future processing. This approach achieves the scalability, flexibility, and
performance required in deblocking filters.
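As a rough illustration of macroblock-level parallelism with a variable number of DFUs, the sketch below groups macroblocks into waves whose members have no pending dependencies, then splits each wave into batches sized to the number of available units. The dependency set (left, top, and top-right neighbors), the function names, and the batching policy are all assumptions made for this example; the abstract does not specify how the DFM actually schedules work.

```python
from collections import defaultdict

# Hypothetical dependency offsets (dx, dy): left, top, and top-right neighbors.
# Assumed for illustration only; the source does not state the exact set.
DEPS = ((-1, 0), (0, -1), (1, -1))

def wave_schedule(mb_cols, mb_rows, deps=DEPS):
    """Assign each macroblock the earliest wave in which all of its
    in-frame dependencies have already finished."""
    wave = {}
    for y in range(mb_rows):          # raster order guarantees that every
        for x in range(mb_cols):      # dependency is visited before its user
            ready = [wave[(x + dx, y + dy)] + 1
                     for dx, dy in deps
                     if 0 <= x + dx < mb_cols and 0 <= y + dy < mb_rows]
            wave[(x, y)] = max(ready, default=0)
    waves = defaultdict(list)
    for mb, w in wave.items():
        waves[w].append(mb)
    return [waves[w] for w in sorted(waves)]

def dispatch(waves, num_dfus):
    """Split each wave into batches of at most num_dfus macroblocks,
    one macroblock per (hypothetical) DFU instance per step."""
    steps = []
    for wave in waves:
        for i in range(0, len(wave), num_dfus):
            steps.append(wave[i:i + num_dfus])
    return steps

if __name__ == "__main__":
    for step, batch in enumerate(dispatch(wave_schedule(4, 3), num_dfus=2)):
        print(step, batch)
```

Changing `num_dfus` alters only the batching, not the wave structure, which loosely mirrors the idea that the same filter cell can be replicated to meet different throughput targets.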
Video compression algorithms such as H.264 offer much potential for parallel processing that is not always exploited by
the technology of a particular implementation. Consumer mobile encoding devices often achieve real-time performance
and low power consumption through parallel processing in Application Specific Integrated Circuit (ASIC) technology,
but many other applications require a software-defined encoder. High quality compression features needed for some
applications such as 10-bit sample depth or 4:2:2 chroma format often go beyond the capability of a typical consumer
electronics device. An application may also need to efficiently combine compression with other functions such as noise
reduction, image stabilization, real time clocks, GPS data, mission/ESD/user data or software-defined radio in a low
power, field upgradable implementation.
Low power, software-defined encoders may be implemented using a massively parallel memory-network processor array
with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to
operate more efficiently than conventional DSP or CPU technology. A dataflow programming methodology may be
used to express all of the encoding processes including motion compensation, transform and quantization, and entropy
coding. This is a declarative programming model in which the parallelism of the compression algorithm is expressed as
a hierarchical graph of tasks with message communication. Data parallel and task parallel design patterns are supported
without the need for explicit global synchronization control.
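The idea of tasks that coordinate only through messages can be sketched in a few lines. The example below is a generic illustration using Python threads and queues, not the programming interface of any particular processor array; the names `task` and `run_pipeline` and the end-of-stream marker are hypothetical, and the stage functions passed in are stand-ins rather than real encoding steps.

```python
import queue
import threading

_DONE = object()  # end-of-stream marker forwarded along the graph

def task(fn, inbox, outbox):
    """Start a worker that applies fn to each message from inbox and
    forwards the result to outbox until the end-of-stream marker arrives."""
    def run():
        while True:
            msg = inbox.get()
            if msg is _DONE:
                outbox.put(_DONE)
                return
            outbox.put(fn(msg))
    worker = threading.Thread(target=run)
    worker.start()
    return worker

def run_pipeline(stages, items):
    """Chain stages with queues; tasks synchronize only through messages,
    with no global barrier or shared lock."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    workers = [task(fn, queues[i], queues[i + 1])
               for i, fn in enumerate(stages)]
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for worker in workers:
        worker.join()
    return results

if __name__ == "__main__":
    # Two stand-in stages; a real encoder would use transform, quantization, etc.
    print(run_pipeline([lambda v: v + 1, lambda v: v * 2], [1, 2, 3]))
```

Each stage here runs as soon as its input message is available, so the stages overlap in time without any explicit global synchronization step.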
An example is described of an H.264 encoder developed for a commercially available, massively parallel memory-network