The Compute Unified Device Architecture (CUDA), introduced in 2007 by NVIDIA, is a recent programming
model that makes use of the unified shader design of the most recent graphics processing units (GPUs). The
programming interface allows algorithms to be implemented in standard C with a few extensions,
without requiring any knowledge of graphics programming with OpenGL, DirectX, or shading languages.
We apply this novel technology to the Simultaneous Algebraic Reconstruction Technique (SART), which is
an advanced iterative image reconstruction method in cone-beam CT. So far, the computational complexity of
this algorithm has prohibited its use in most medical applications. However, since today's GPUs provide a high
level of parallelism and are highly cost-efficient processors, they are well suited to performing iterative
reconstruction in a way that meets medical requirements.
In this paper we present an efficient implementation of the most time-consuming parts of the iterative reconstruction
algorithm: forward- and back-projection. We also explain the strategy required to parallelize the
algorithm for the CUDA 1.1 and CUDA 2.0 architectures. Furthermore, our implementation introduces a
technique that accelerates the reconstruction relative to a standard SART implementation on the GPU using
CUDA. Thus, we present an implementation that can be used in a time-critical clinical environment.
Finally, we compare our results with current applications on multi-core workstations, with respect to both
reconstruction speed and their respective advantages and disadvantages. Our implementation, which uses
hardware-accelerated texture interpolation, exhibits a speed-up of more than a factor of 64 compared to a
state-of-the-art CPU.