KEYWORDS: Image processing, Clocks, Field programmable gate arrays, Anisotropic filtering, C++, Nonlinear image processing, Computer simulations, Data modeling, Digital signal processing, Data processing
The pseudo-log image transform belongs to a class of image processing kernels that generate memory references which are nonlinear functions of loop indices. Due to the nonlinearity of the memory references, the usual design methodologies do not allow efficient hardware implementation for nonlinear kernels. For optimized hardware implementation, these kernels require the creation of a customized memory hierarchy and efficient data/memory management strategy. We present the design and real-time hardware implementation of a pseudo-log image transform IP (hardware image processing engine) using a memory management framework. The framework generates a controller which efficiently manages input data movement in the form of tiles between off-chip main memory, on-chip memory, and the core processing unit. The framework can jointly optimize the memory hierarchy and the tile computation schedule to reduce on-chip memory requirements, to maximize throughput, and to increase data reuse for reducing off-chip memory bandwidth requirements. The algorithmic C++ description of the pseudo-log kernel is profiled in the framework to generate an enhanced description with a customized memory hierarchy. The enhanced description of the kernel is then used for high-level synthesis (HLS) to perform architectural design space exploration in order to find an optimal implementation under given performance constraints. The optimized register transfer level implementation of the IP generated after HLS is used for performance estimation. The performance estimation is done in a simulation framework to characterize the IP with different external off-chip memory latencies and a variety of data transfer policies. Experimental results show that the designed IP can be used for real-time implementation and that the generated memory hierarchy is capable of feeding the IP with a sufficiently high bandwidth even in the presence of long external memory latencies.
The pseudo-log image transform is essentially a logarithmic transformation that simulates the distribution of
the eye’s photoreceptors and finds application in many important areas of real-time image and video processing
such as motion detection and estimation in robots and foveated, space-variant cameras. It belongs to a family
of non-linear image processing kernels in which references made to memory are non-linear functions of loop
indices. Non-linear kernels need some form of memory management in order to achieve the required throughput,
to minimize on-chip memory and to maximize possible data reuse. In this paper we present the design of a
pseudo-log image processing hardware accelerator IP, integrated with different interpolation filtering techniques,
using a memory management framework. The framework can automatically generate a memory hierarchy around
the IP and a data transfer controller that facilitates data exchange with main memory. The memory hierarchy
reduces on-chip memory requirements, optimizes throughput and increases data reuse. The design of the IP is
fully performed at the algorithmic level in C/C++. The algorithmic description is profiled within the framework
to create a customized memory hierarchy, also described at the synthesizable algorithmic level. Finally, high-level
synthesis is used to perform hardware design space exploration and performance estimation. Experiments
show that the generated memory hierarchy is able to feed the IP with a very high bandwidth even in the presence
of long external memory latencies.
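To make the notion of memory references that are non-linear functions of loop indices concrete, the following is a minimal C++ sketch. It assumes a plain log-polar mapping with nearest-neighbour sampling as a stand-in for the pseudo-log transform; the exact mapping and the interpolation filters of the IP are not given in the abstract. All names (`LogPolarParams`, `sourceAddress`, `logPolarTransform`) and parameters are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical parameter set; none of these names come from the paper.
struct LogPolarParams {
    int srcWidth;   // Cartesian input width in pixels
    int srcHeight;  // Cartesian input height in pixels
    int rings;      // output rows: logarithmically spaced radii
    int sectors;    // output columns: uniformly spaced angles
    double rMin;    // innermost sampled radius, must be > 0
    double rMax;    // outermost sampled radius
};

// Row-major input address read for output pixel (ring, sector).
// Returns -1 when the sample falls outside the input image.
// The address depends on the loop indices through pow, cos and sin,
// so it is not an affine function of (ring, sector).
inline long sourceAddress(const LogPolarParams& p, int ring, int sector) {
    const double kPi = 3.14159265358979323846;
    const double t = (p.rings > 1) ? static_cast<double>(ring) / (p.rings - 1) : 0.0;
    const double radius = p.rMin * std::pow(p.rMax / p.rMin, t);
    const double theta = 2.0 * kPi * sector / p.sectors;
    const long x = std::lround(p.srcWidth / 2.0 + radius * std::cos(theta));
    const long y = std::lround(p.srcHeight / 2.0 + radius * std::sin(theta));
    if (x < 0 || x >= p.srcWidth || y < 0 || y >= p.srcHeight) {
        return -1;
    }
    return y * p.srcWidth + x;
}

// Nearest-neighbour log-polar resampling of an 8-bit grayscale image.
// Samples that fall outside the input are left at 0.
inline std::vector<std::uint8_t> logPolarTransform(const std::vector<std::uint8_t>& src,
                                                   const LogPolarParams& p) {
    std::vector<std::uint8_t> dst(static_cast<std::size_t>(p.rings) * p.sectors, 0);
    for (int ring = 0; ring < p.rings; ++ring) {
        for (int sector = 0; sector < p.sectors; ++sector) {
            const long addr = sourceAddress(p, ring, sector);
            if (addr >= 0) {
                dst[static_cast<std::size_t>(ring) * p.sectors + sector] =
                    src[static_cast<std::size_t>(addr)];
            }
        }
    }
    return dst;
}
```

With a 64x64 input, 8 rings, 16 sectors and radii from 1 to 30, consecutive rings along sector 0 are 1 address apart near the centre but 12 addresses apart at the periphery. This growing, data-independent but non-affine stride is what defeats simple line buffers and motivates a tile-based memory hierarchy.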
Forward and backward projections are two computationally costly steps in tomographic image reconstruction, such
as in Positron Emission Tomography (PET). To reduce reconstruction time, a hardware projection/backprojection
pair has been built following algorithm-architecture adequacy principles. Thanks to an original memory access
strategy based on a 3D adaptive and predictive memory cache, the external memory wall has been overcome.
Thus, for both projector architectures, several units run efficiently. Each unit reaches a computational throughput
close to 1 operation per cycle.
In this paper, we show how, starting from our hardware projection/backprojection pair, an analytic (3D-RP) and an
iterative (3D-EM) reconstruction algorithm can be implemented on a System on Programmable Chip (SoPC).
First, a hardware/software partitioning is done based on the different steps of each algorithm. The
reconstruction system is then composed of two hardware configurations of the programmable logic resources (FPGA),
which correspond mainly to the projection and backprojection steps, respectively.
Our projector/backprojector has been validated with software 3D-RP and 3D-EM reconstructions on simulated
PET-SORTEO data. The reconstruction time of these reconstruction systems is evaluated based
on the measured performance of our projector IPs and the estimated performance of the additional simple
hardware IPs. The expected reconstruction time is compared with that of the software tomography distribution STIR.
A speed-up of 7 can be expected for the 3D-RP algorithm and a speed-up of 3.5 for the 3D-EM algorithm. For
both algorithms, the expected cycle efficiency of the architecture is much greater than that of the software implementation: 120 times for 3D-RP and 60 times for 3D-EM.