Cone-beam reconstruction (CBR) is useful for producing volume images from projections in many fields including
medicine, biomedical research, baggage scanning, paleontology, and nondestructive manufacturing inspection. CBR
converts a set of two-dimensional (2-D) projections into a three-dimensional (3-D) image of the projected object. The
most common algorithm used for CBR is referred to as the Feldkamp-Davis-Kress (FDK) algorithm; this involves
filtering and cone-beam backprojection steps for each projection of the set. Over the past decade we have observed or
studied FDK on platforms based on many different processor types, both single-processor and parallel-multiprocessor
architectures. In this paper we review the different platforms, in terms of design considerations that include speed,
scalability, ease of programming, and cost. In the past few years, the availability of programmable special-purpose processors (i.e., graphics processing units [GPUs] and the Cell Broadband Engine [BE]) has resulted in platforms that meet all of the desirable considerations simultaneously.
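The two FDK steps named above, filtering and cone-beam backprojection, can be sketched as follows. This is a minimal illustration under simplifying assumptions (circular source orbit, flat detector with unit pixel pitch, nearest-neighbor interpolation, hypothetical geometry parameter D), not any of the production implementations discussed in the paper.

```python
import numpy as np

def fdk_filter(proj, D=100.0):
    """Cosine-weight and ramp-filter one flat-detector projection.

    proj is indexed [detector row v, detector column u]; pixel pitch = 1.
    D is the source-to-isocenter distance (hypothetical units).
    """
    nv, nu = proj.shape
    u = np.arange(nu) - (nu - 1) / 2.0
    v = np.arange(nv) - (nv - 1) / 2.0
    w = D / np.sqrt(D**2 + u[None, :]**2 + v[:, None]**2)  # Feldkamp cosine weight
    ramp = np.abs(np.fft.fftfreq(nu))                      # ideal ramp filter
    return np.real(np.fft.ifft(np.fft.fft(proj * w, axis=1) * ramp, axis=1))

def fdk_backproject(filtered, vol, beta, D=100.0):
    """Accumulate one filtered projection into vol (nearest-neighbor)."""
    nz, ny, nx = vol.shape
    nv, nu = filtered.shape
    xs = np.arange(nx) - (nx - 1) / 2.0
    ys = np.arange(ny) - (ny - 1) / 2.0
    zs = np.arange(nz) - (nz - 1) / 2.0
    x, y = np.meshgrid(xs, ys, indexing="xy")
    s = x * np.cos(beta) + y * np.sin(beta)      # along the source-detector axis
    t = -x * np.sin(beta) + y * np.cos(beta)     # transverse coordinate
    U = D - s                                    # source-to-voxel distance along axis
    w2 = (D / U) ** 2                            # FDK distance weighting
    iu = np.clip(np.round(D * t / U + (nu - 1) / 2.0).astype(int), 0, nu - 1)
    for iz, z in enumerate(zs):
        iv = np.clip(np.round(D * z / U + (nv - 1) / 2.0).astype(int), 0, nv - 1)
        vol[iz] += w2 * filtered[iv, iu]
```

In a full reconstruction these two steps run once per projection, with the backprojection dominating the cost; that is the loop every platform in this survey ultimately accelerates.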
The Maximum Likelihood Expectation Maximization (MLEM) algorithm has been shown to produce the highest-quality Digital Breast Tomosynthesis (DBT) images. MLEM, however, is computationally intensive: single-processor reconstruction times for each breast were on the order of several hours. For DBT to be clinically useful, faster reconstruction times using cost-effective software/hardware solutions are needed. We have implemented the MLEM reconstruction algorithm for use with DBT on a graphics processing unit (GPU). Compared to a single optimized 2.8 GHz Pentium system, this enabled a 113-fold speedup in processing time while maintaining high image quality. Subsequently, we added various processing steps to the reconstruction algorithm in order to improve image quality and diagnostic properties. Since the performance of commercial GPUs increases rapidly, with little change in cost, the increased sophistication in processing does not entail an increase in system cost. The use of GPUs for reconstruction represents a technical breakthrough in the cost-effective application of MLEM to Digital Breast Tomosynthesis.
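The generic MLEM update that makes this algorithm so computationally intensive can be written compactly. The sketch below uses a toy dense system matrix A purely for illustration; the actual DBT implementation works with an on-the-fly projector on the GPU, not a stored matrix.

```python
import numpy as np

def mlem(A, y, n_iter=100, eps=1e-12):
    """Generic MLEM iteration for emission-type reconstruction.

    A : (n_measurements, n_voxels) system matrix (toy dense model)
    y : measured data
    The multiplicative update preserves non-negativity of the estimate.
    """
    x = np.ones(A.shape[1])                  # flat initial estimate
    sens = A.T @ np.ones(A.shape[0])         # sensitivity image (column sums)
    for _ in range(n_iter):
        proj = A @ x                         # forward projection
        ratio = y / np.maximum(proj, eps)    # measured / estimated
        x = x * (A.T @ ratio) / np.maximum(sens, eps)  # backproject and update
    return x
```

Each iteration contains one forward projection and one backprojection over the whole volume, which is why dozens of iterations on a single CPU took hours and why the GPU's parallel projectors pay off so dramatically.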
Proc. SPIE. 6142, Medical Imaging 2006: Physics of Medical Imaging
KEYWORDS: Digital signal processing, Surface plasmons, Detection and tracking algorithms, Sensors, Mercury, Computing systems, Signal processing, Computed tomography, Reconstruction algorithms, Personal protective equipment
Over the last few decades, the medical imaging community has vigorously debated different approaches to implementing reconstruction algorithms for spiral CT. Numerous alternatives have been proposed. Whether approximate, exact, or iterative, these implementations generally include a backprojection step. Specialized compute platforms have been designed to perform this compute-intensive algorithm within a timeframe compatible with hospital-workflow requirements. Solving the performance problem in a cost-effective way has driven designers to use a combination of digital signal processor (DSP) chips, general-purpose processors, application-specific integrated circuits (ASICs), and field-programmable gate arrays (FPGAs). The Cell processor by IBM offers an interesting alternative for implementing the backprojection, especially since it offers a good level of parallelism and vast I/O capabilities. In this paper, we consider the implementation of a straight backprojection algorithm on the Cell processor to design a cost-effective system that matches the performance requirements of clinically deployed systems. The effects on performance of system parameters such as pitch and detector size are also analyzed to determine the ideal system size for modern CT scanners.
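The property that makes straight backprojection attractive for the Cell's parallel cores is that image slabs are mutually independent. The sketch below shows this decomposition for a 2-D parallel-beam case; it is an illustrative stand-in (plain Python/NumPy, serial slab loop, nearest-neighbor lookup), not the Cell SPE code, and the function names are hypothetical.

```python
import numpy as np

def straight_backproject_slab(sino, thetas, rows):
    """Backproject a parallel-beam sinogram into one horizontal image slab.

    sino   : (n_angles, n_det) sinogram
    thetas : projection angles in radians
    rows   : image-row indices owned by this worker
    Slabs touch disjoint output rows, so each worker (e.g. one Cell SPE)
    can own a slab with no synchronization during the accumulation.
    """
    n = sino.shape[1]
    xs = np.arange(n) - (n - 1) / 2.0
    out = np.zeros((len(rows), n))
    for i, r in enumerate(rows):
        y = (n - 1) / 2.0 - r
        for a, th in enumerate(thetas):
            # detector coordinate hit by each pixel of this row at angle th
            t = xs * np.cos(th) + y * np.sin(th)
            idx = np.clip(np.round(t + (n - 1) / 2.0).astype(int), 0, n - 1)
            out[i] += sino[a, idx]
    return out

def straight_backproject(sino, thetas, n_workers=4):
    """Master side: statically partition rows, then gather the slabs."""
    n = sino.shape[1]
    slabs = np.array_split(np.arange(n), n_workers)
    return np.vstack([straight_backproject_slab(sino, thetas, s) for s in slabs])
```

On the Cell, the per-slab loop would be DMA-fed SIMD code in each SPE's local store; the decomposition itself is the same.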
Cone-beam reconstruction (CBR) is growing in importance, but current computer systems are slower than desirable for clinical use. We have built a high-speed system for high-quality, 3D imaging. We partitioned the problem into input, filtering, backprojection, postprocessing, and output components. We mapped most of the components to standard RACE++ processing nodes. The backprojection component is very compute-intensive; we mapped it to a field-programmable gate array (FPGA)-based adjunct processor. We built a prototype FPGA card, optimized for flexibility, and implemented the backprojection in that FPGA. This strategy allows for redesigning the backprojection function when necessary, and keeps the other details of the CBR algorithm in easily programmable processors. We present a system that performs Feldkamp CBR of 300 projections into a 512³ cubical image in 38.7 seconds. The system is designed to be scalable, so that Feldkamp CBR can be performed in 21.4 seconds with two adjunct processors, and Feldkamp CBR of other regions of interest or dimensions could be performed in proportionately shorter times. Further optimization and faster processing parts will also contribute to continual speed improvements. This system is flexible and can be extended to perform other imaging functions, such as real-time planar angiography, with the same hardware.
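The quoted timings (38.7 s with one adjunct processor, 21.4 s with two) are consistent with a simple model of a fixed serial portion plus a backprojection portion that divides evenly across adjunct processors. As a rough sanity check, assuming that model:

```python
def split_serial_parallel(t1, t2):
    """Solve t1 = s + p and t2 = s + p/2 for serial time s and parallel time p."""
    p = 2.0 * (t1 - t2)
    return t1 - p, p

def predicted_time(s, p, n):
    """Amdahl-style runtime estimate with n adjunct processors."""
    return s + p / n
```

Plugging in the quoted timings gives roughly 4.1 s of serial work and 34.6 s of parallelizable backprojection, which would predict about 12.8 s with four adjunct processors -- an extrapolation from the model, not a measured figure.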
Proc. SPIE. 4681, Medical Imaging 2002: Visualization, Image-Guided Procedures, and Display
KEYWORDS: Digital signal processing, Imaging systems, Sensors, Image processing, Error analysis, Computing systems, Image acquisition, Field programmable gate arrays, Medical imaging, Algorithm development
Adjunct processors have traditionally been used for certain tasks in medical imaging systems. Often based on application-specific integrated circuits (ASICs), these processors formed X-ray image-processing pipelines or constituted the backprojectors in computed tomography (CT) systems. We examine appropriate functions to perform with adjunct processing and draw some conclusions about system design trade-offs. These trade-offs have traditionally focused on the required performance and flexibility of individual system components, with increasing emphasis on time-to-market impact. Typically, front-end processing close to the sensor has the most intensive processing requirements. However, the performance capabilities of each level are dynamic, and the system architect must keep abreast of the current capabilities of all options to remain competitive. Designers are searching for the most efficient implementation of their particular system requirements. We cite algorithm characteristics that point to effective solutions by adjunct processors. We have developed a field-programmable gate array (FPGA) adjunct-processor solution for a Cone-Beam Reconstruction (CBR) algorithm that offers significant performance improvements over a general-purpose processor implementation. The same hardware could efficiently perform other image processing functions such as two-dimensional (2D) convolution. The potential performance, price, operating power, and flexibility advantages of an FPGA adjunct processor over ASIC, DSP, or general-purpose processing solutions are compelling.
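Two-dimensional convolution, mentioned above as another candidate for the same FPGA hardware, has the regular, data-parallel structure that adjunct processors favor. The direct form below is the kind of software reference one might validate an FPGA pipeline against; it is an illustrative sketch, not the FPGA design itself.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Direct 2-D convolution with zero padding ('same' output size).

    An FPGA would pipeline this as a sliding window fed from line buffers,
    producing one output sample per clock; this direct software form
    computes the identical result for verification.
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    k = np.flip(kernel)                    # true convolution flips the kernel
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * k)
    return out
```

The fixed multiply-accumulate pattern and absence of data-dependent branching are exactly the algorithm characteristics cited above as pointing toward adjunct-processor solutions.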
Proc. SPIE. 3336, Medical Imaging 1998: Physics of Medical Imaging
KEYWORDS: Digital signal processing, Data storage, Magnetic resonance imaging, Image processing, Mercury, Image restoration, Computing systems, Data acquisition, Embedded systems, Functional magnetic resonance imaging
Due to the dynamic nature of brain studies in functional magnetic resonance imaging (fMRI), fast pulse sequences such as echo planar imaging (EPI) and spiral are often used for higher temporal resolution. Hundreds of frames of two-dimensional (2-D) images or multiple three-dimensional (3-D) images are often acquired to cover a larger space and time range. Therefore, fMRI often requires much larger data storage, a faster data transfer rate, and higher processing power than conventional MRI. In Mercury Computer Systems' PCI-based embedded computer system, the computer architecture allows the concurrent use of a DMA engine for data transfer and the CPU for data processing. This architecture allows a multicomputer to distribute processing and data with minimal time spent transferring data. Different types and numbers of processors are available to optimize system performance for the application. The fMRI reconstruction was first implemented in Mercury's PCI-based embedded computer system by using one digital signal processing (DSP) chip, with the host computer running under the Windows NT platform. Double buffers in SRAM or cache were created for concurrent I/O and processing. The fMRI reconstruction was then implemented in parallel using multiple DSP chips. Data transfer and interprocessor synchronization were carefully managed to optimize algorithm efficiency. The image reconstruction times were measured with different numbers of processors ranging from 1 to 10. With one DSP chip, the time for reconstructing 100 fMRI images measuring 128 × 64 pixels was 1.24 seconds, which is already faster than most existing commercial MRI systems. This PCI-based embedded multicomputer architecture, which has a nearly linear improvement in performance, provides high performance for fMRI processing. In summary, this embedded multicomputer system allows the choice of computer topologies to fit the specific application to achieve maximum system performance.
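The double-buffer scheme described above, where the DMA engine fills one buffer while the processor works on the other, can be sketched with ordinary threads and queues. This is a conceptual stand-in (Python threads playing the role of the DMA engine), not Mercury's API, and the function name is hypothetical.

```python
import threading
import queue

def double_buffered_pipeline(frames, process):
    """Double-buffered acquire/process pipeline.

    One thread plays the role of the DMA engine, filling buffers;
    the caller's thread processes them. With two buffers in flight,
    the transfer of frame k+1 overlaps the processing of frame k.
    """
    free = queue.Queue()
    full = queue.Queue()
    free.put([None])                       # buffer A
    free.put([None])                       # buffer B
    results = []

    def dma():
        for f in frames:
            buf = free.get()               # wait for an empty buffer
            buf[0] = f                     # "transfer" the frame into it
            full.put(buf)
        full.put(None)                     # end-of-stream marker

    t = threading.Thread(target=dma)
    t.start()
    while (buf := full.get()) is not None:
        results.append(process(buf[0]))    # compute while DMA fills the other
        free.put(buf)                      # recycle the buffer
    t.join()
    return results
```

With exactly two buffers, neither side ever waits longer than one stage time, which is why the measured throughput approaches the slower of transfer and compute rather than their sum.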
Digital vascular computer systems are used for radiology and fluoroscopy (R/F), angiography, and cardiac applications. In the United States alone, about 26 million procedures of these types are performed annually: about 81% R/F, 11% cardiac, and 8% angiography. Digital vascular systems have a very wide range of performance requirements, especially in terms of data rates. In addition, new features are added over time as they are shown to be clinically efficacious. Application-specific processing modes such as roadmapping, peak opacification, and bolus chasing are particular to some vascular systems. New algorithms continue to be developed and proven, such as Cox and deJager's precise registration methods for masks and live images in digital subtraction angiography. A computer architecture must have high scalability and reconfigurability to meet the needs of this modality. Ideally, the architecture could also serve as the basis for a nonvascular R/F system.
Medical imaging applications have growing processing requirements, and scalable multicomputers are needed to support these applications. Scalability -- performance speedup equal to the increased number of processors -- is necessary for a cost-effective multicomputer. We performed tests of performance and scalability on 1 through 16 processors on a RACE multicomputer using Parallel Application System (PAS) software. Data transfer and synchronization mechanisms introduced a minimum of overhead to the multicomputer's performance. We implemented magnetic resonance (MR) image reconstruction and multiplanar reformatting (MPR) algorithms, and demonstrated high scalability; the 16-processor configuration was 80% to 90% efficient, and the smaller configurations had higher efficiencies. Our experience is that PAS is a robust and high-productivity tool for developing scalable multicomputer applications.
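The efficiency figures quoted above follow the usual definitions: speedup is the single-processor time over the n-processor time, and efficiency is speedup divided by n. With illustrative numbers (not from the paper), a job taking 32 s on one processor and 2.5 s on 16 processors has speedup 12.8 and efficiency 0.8 -- the lower end of the quoted 80-90% range.

```python
def speedup(t1, tn):
    """Speedup of an n-processor run relative to the single-processor run."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Parallel efficiency: achieved speedup divided by the ideal speedup n."""
    return speedup(t1, tn) / n
```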
The increasing computational demands of medical imaging will exceed the capacity of standard microprocessors. For the most computationally intense problems, such as real-time scanning, parallel processing will be required. We evaluate the performance of a master-slave model of coarse-grained parallel processing on examples of reconstruction and postprocessing problems. We use a commercially available multicomputer system in configurations of one through eight processors with distributed, shared memory. We examine a variety of 2D medical imaging problems ranging from pointwise operations, such as window-level, to global operations, such as 2D FFT. Parallel processing with the master-slave model is most efficient when data transfer among processors is minimized. This can be done by a combination of high-performance computer architecture and well-designed processing algorithms.
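The window-level operation mentioned above is the favorable case for the master-slave model: it is pointwise, so the only communication is the initial scatter and final gather. The sketch below uses a thread pool as a stand-in for the slave processors (the original system used distributed-memory nodes, not threads); the names and window/level convention are illustrative.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def window_level(tile, level, width):
    """Pointwise window/level: map [level - width/2, level + width/2] to [0, 255]."""
    lo = level - width / 2.0
    return np.clip((tile - lo) * (255.0 / width), 0, 255).astype(np.uint8)

def master_slave(image, level, width, n_slaves=4):
    """Master splits the image into row strips; slaves process them in parallel.

    Pointwise operations need no halo exchange between strips, so the only
    data transfer is the scatter and the gather -- the minimal-communication
    case in which the master-slave model is most efficient.
    """
    strips = np.array_split(image, n_slaves, axis=0)              # scatter
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        done = pool.map(lambda s: window_level(s, level, width), strips)
    return np.vstack(list(done))                                  # gather
```

A global operation such as a 2D FFT would instead require exchanging intermediate rows and columns between slaves, which is why its master-slave efficiency is lower than that of pointwise operations.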