This paper presents initial performance results for a software toolkit that implements GPU-based parallel computation of digitally reconstructed radiographs (DRRs) from volumetric imaging data for 2D-3D registration. The computational parallelism is achieved using NVIDIA’s CUDA implementation of general-purpose computing on the graphics processing unit. The sample volumetric imaging data shown here is from CT imaging of a cadaveric foot, but the toolkit can be applied equally well to other volumetric imaging data. An efficient implementation requires launching hundreds of simultaneous, independent computational threads and giving each thread fast access to the global memory it reads from and writes to. We have implemented fast DRR generation by launching one computational thread for each pixel in the image, and we achieve efficient memory access by using 3D texture memory to store the volumetric data and constant memory to store global information such as intensifier coordinates. The Thrust software library was used to store individual bone DRRs, which enables efficient memory transfer and the use of built-in device operators during image compositing and similarity quantification. By storing individual DRRs, the toolkit can support independent kinematics for up to 32 segmented objects. We show that the algorithm scales with the number of processors and compare timings for three commercially available GPUs. To demonstrate that the toolkit can produce useful results, we present initial fast DRR computations for a full 160 × 339 × 439 stack of floating-point density data, rendered to a high-resolution 1152 × 896 pixel screen in 1.3 ms and to a 512 × 512 pixel screen in less than 0.6 ms.
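
The per-pixel decomposition described above can be illustrated with a minimal CPU reference sketch in C++. It is not the toolkit's CUDA code: the names `Volume`, `drr_pixel`, and `render_drr` are hypothetical, and the sketch assumes a parallel projection along the z axis with nearest-neighbour sampling rather than the perspective geometry and 3D texture fetches a real intensifier model would use. What it shows is that each pixel's line integral depends only on the read-only volume, which is why one thread per pixel is a natural mapping.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense volume of floating-point densities, stored x-fastest.
struct Volume {
    int nx, ny, nz;
    std::vector<float> data;
    float at(int x, int y, int z) const {
        return data[(static_cast<std::size_t>(z) * ny + y) * nx + x];
    }
};

// Work for ONE output pixel; on a GPU this would be the body of one thread.
// Assumption: parallel projection along z, nearest-neighbour sampling.
float drr_pixel(const Volume& v, int px, int py, int width, int height) {
    // Map the pixel centre to the nearest voxel column, clamped to the volume.
    int x = std::min(static_cast<int>((px + 0.5f) * v.nx / width), v.nx - 1);
    int y = std::min(static_cast<int>((py + 0.5f) * v.ny / height), v.ny - 1);
    float sum = 0.0f;
    for (int z = 0; z < v.nz; ++z) {
        sum += v.at(x, y, z);  // accumulate density along the ray
    }
    return sum;
}

// Sequential stand-in for the kernel launch: every iteration is independent,
// reads only the volume, and writes only its own pixel.
std::vector<float> render_drr(const Volume& v, int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<float> image(static_cast<std::size_t>(width) * height);
    for (int py = 0; py < height; ++py) {
        for (int px = 0; px < width; ++px) {
            image[static_cast<std::size_t>(py) * width + px] =
                drr_pixel(v, px, py, width, height);
        }
    }
    return image;
}
```

In a CUDA version of this sketch, the two loops in `render_drr` would presumably be replaced by a 2D grid of threads, with `px` and `py` derived from the block and thread indices and the volume read through a texture object instead of `Volume::at`.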