*in vivo*were achieved, demonstrating the satisfactory image quality realized with this system. Another experiment, simultaneous PAT and US imaging of contrast agent flowing through an artificial vessel, was conducted to verify the performance of this system for imaging fast biological events. The GPU-based image reconstruction software code for this dual-modality system is open source and available for download from http://sourceforge.net/projects/patrealtime.

## 1.

## Introduction

After 20 years of development and study,^{1}^{–}^{9} biomedical photoacoustic (PA) imaging technologies are now ready for many clinical and preclinical applications. Offering a similar depth plane resolution as ultrasound (US) but with different optical contrast, photoacoustic tomography (PAT) is a natural complement to and can be integrated with US to realize dual-modality imaging. It should be a substantial advance to have such a system image the same object with PA and US functions simultaneously, both in a truly real-time manner. That is, all steps including data acquisition, image reconstruction, and image display are performed concurrently at a usefully real-time frame rate or rates. In this way, the practitioners do not have to shift back and forth between the two modalities. This might be particularly useful for diagnosis, treatment guidance, or treatment assessment when biological events occur at a less than steady repetition rate, or when a short term, usually single event occurs, such as bolus injection or intervention. Not only morphological information based on the two different contrasts is acquired, but also rich functional information, such as blood flow, perfusion, volume, and oxygenation level, all crucial physiological markers of many diseases including cancer.

Olafsson et al.^{10} proposed a real-time PAT method for imaging mouse cancer. Although the data acquisition is real time, the image reconstruction still needs to be conducted offline. The work conducted by Kolkman et al.^{11} is similar. Although PA and US images can be acquired in real time, display is offline. Li et al.^{12} introduced a study of the hemodynamics within the entire cerebral cortex of a mouse using real-time PAT, without US at the same time, at a limited frame rate of $1.6\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\mathrm{s}}^{-1}$. Another system designed by Zemp et al.^{13} achieved a truly real-time imaging for PAT only by using a multicore computer. With a real-time reconstruction and display system, Kim et al.^{14} would propose a better PA and US merging image system for noninvasive sentinel lymph node imaging. The main challenge to achieve a truly real-time PA and US imaging simultaneously is the large computational cost for the reconstruction of high resolution tomographic PA images.

In this article, we propose an optimized back-projection (BP) algorithm to accelerate the reconstruction of PA images. Other than reducing the computational cost, the optimized reconstruction algorithm also facilitates parallel computation and, hence, can be executed on a recent graphics processing unit (GPU) card. Based on this development, a PA and US dual-modality imaging system has been built on an experimental US unit working with a PC containing a GPU card. Using a linear array probe, the performance of this system to acquire PA and US images simultaneously, both in a truly real-time manner, has been examined through the experiments on both tissue phantoms and human peripheral joints *in vivo*.

## 2.

## Background

For reconstruction of a two-dimensional (2-D) PA image, the following equation is used:^{15}

## Eq. (1)

$${p}_{0}(\overline{r})={\int}_{{\mathrm{\Omega}}_{0}}\frac{d{\mathrm{\Omega}}_{0}}{{\mathrm{\Omega}}_{0}}[2p({\overline{r}}^{\prime},t)-2t\frac{\partial p({\overline{r}}^{\prime},t)}{\partial t}]{|}_{t=|\overline{r}-{\overline{r}}^{\prime}|/c},$$## Eq. (2)

$$d{\mathrm{\Omega}}_{0}=\frac{d{S}_{0}}{{|\overline{r}-{\overline{r}}^{\prime}|}^{2}}\frac{\langle \overline{r}-{\overline{r}}^{\prime},{\overline{n}}_{0}\rangle}{|\overline{r}-{\overline{r}}^{\prime}|},$$## Eq. (3)

$${p}_{0}(\overline{r})=2\sum _{r\text{'}}d(\overline{r},{\overline{r}}^{\prime})\left\{p\right(\frac{\overline{r}-{\overline{r}}^{\prime}}{c})-\frac{\overline{r}-{\overline{r}}^{\prime}}{\mathrm{\Delta}\overline{r}}[p\left(\frac{\overline{r}-{\overline{r}}^{\prime}}{c}\right)-p(\frac{\overline{r}-{\overline{r}}^{\prime}}{c}-\mathrm{\Delta}t)\left]\right\},$$## Eq. (4)

$$d(\overline{r},{\overline{r}}^{\prime})=4ab\rho \text{\hspace{0.17em}}\mathrm{sinc}\left(\frac{2a}{\lambda}\mathrm{sin}\text{\hspace{0.17em}}\phi \right)\mathrm{sin}\text{\hspace{0.17em}}c\left(\frac{2b}{\lambda}\mathrm{sin}\text{\hspace{0.17em}}\psi \right),$$Using Eq. (3) to reconstruct PA images, we must perform a traversal calculation on all points repeatedly. The calculation complexity is $O(X)\times O(Y)\times O(N)$ when the reconstruction area contains $X\times Y$ pixels and the number of probe elements is $N$. To achieve real-time reconstruction of PA images, we need a computer that offers both a high level parallel computing structure and a large memory to store source data. According to our experiment, using a recent 12-core CPU-based computer (Apple Mac Pro, 2 six-core Xeon processors, 24GB DDR3 memory, cost about $5,000), it still requires 3.3 s to reconstruct a $1024\times 512$-pixel PA image from 128-channel signals each with a data length of 2048.

## 3.

## Optimized Parallel PAT Method and the Implementation

Instead of using sample data from all probe elements to reconstruct each PA image by pixels, we optimize the BP algorithm by building a time series of PA images progressively by each probe element so that more calculations can be done in parallel. The matrix form of Eq. (3) can be expressed as

## Eq. (5)

$${\mathbf{P}}_{0}(\overline{r})=2\sum _{n=1}^{N}\mathbf{D}(\overline{r},{\overline{r}}^{\prime})\times \mathbf{P}(\overline{r},{\overline{r}}^{\prime}),$$For a given linear probe and a given image plane, $\overline{r}-{\overline{r}}^{\prime}$ is fixed for each pair of probe element and pixel. Therefore, $(\overline{r}-{\overline{r}}^{\prime})/c$, $(\overline{r}-{\overline{r}}^{\prime})/\mathrm{\Delta}\overline{r}$, and $[(\overline{r}-{\overline{r}}^{\prime})/c]-\mathrm{\Delta}t$ in Eq. (3) can all be calculated in advance. To speed up the computation of $\mathbf{P}(\overline{r},{\overline{r}}^{\prime})$ for each probe element, we create an $[X,Y]$ dimension look-up table matrix $\mathbf{Q}(\overline{r},{\overline{r}}^{\prime})$, whose elements’ values are defined as $(\overline{r}-{\overline{r}}^{\prime})/c$. The matrix $\mathbf{Q}(\overline{r},{\overline{r}}^{\prime})$ gives the correspondent address of each image pixel for the signal $p({\overline{r}}^{\prime},t)$ from this element and can be calculated before the image reconstruction. Instead of calculating all the elements’ values in $\mathbf{P}(\overline{r},{\overline{r}}^{\prime})$, they can be filled quickly according to the address information in $\mathbf{Q}(\overline{r},{\overline{r}}^{\prime})$ once the PA signal $p({\overline{r}}^{\prime},t)$ from each probe element is known.

The total computational capability of a GPU chip is drastically enhanced by the large number of the GPU cores in the chip although one GPU core has limited capability compared with one CPU core.^{16} Figure 1 shows the NVidia GPU (GeForce GTX690 GPU card, 3072 CUDA cores, NVidia, cost less than $1,000) structure. The maximum number of parallel computing threads in an NVidia GeForce GTX690 is $\mathrm{2,147,483,647}\times 1024\times 2$. This number will decrease as the computation complexity or memory required in each thread increases. Every independent calculation core in the GPU needs its own copy of the data to be used. The traditional BP algorithm reconstructs PA images by pixels, which means that the sample data from all probe elements have to be copied to the GPU memory for each GPU core. In this way, the parallel reconstruction of all image pixels requires enormous memory. Restricted by the memory size of the GPU, we can only build the PA image by rows (or columns) using a traditional BP algorithm, as illustrated in Fig. 2.

With the optimized BP algorithm, we can build the whole PA image progressively as shown in Fig. 2. The total memory consumption is lowered because every GPU core only needs the data from one probe element. The comparison among different reconstruction methods for the same $1024\times 512$ PA image from 128-channel signals, each with 2048 data length, is shown in Table 1.

## Table 1

Performance in PA image reconstruction (image size: 1024×512).

Reconstruction method | CPU time | GPU time | Total time | Memory usage | Cores usage |
---|---|---|---|---|---|

CPU (without parallel) | 29.5 s | — | 29.5 s | 8 MB from host memory | 1 CPU core |

CPU (with parallel) | 3.3 s | — | 3.3 s | 96 MB from host memory | 12 CPU cores |

GPU (traditional BP) by rows | 13 ms | 352 ms | 365 ms | 256 MB from GPU | 512 GPU cores |

GPU (traditional BP) whole image | Cannot be realized due to the insufficiency in memory | Unavailable 256 GB from GPU | 524,288 GPU cores | ||

GPU (optimized BP) whole image | 8 ms | 50 ms | 58 ms | 2 GB from GPU | 524,288 GPU cores |

As shown in Table 1, the total GPU parallel calculation time when employing the optimized BP algorithm decreases by a factor of 500 compared to the result using a single CPU and traditional BP algorithm. The optimized algorithm increases the efficiency by approximately 500% compared to the traditional unoptimized algorithm. Considering the reconstruction of each image takes 58 ms, our system can achieve a frame rate of up to $17\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\mathrm{s}}^{-1}$ in the PA imaging mode. The current frame rate for our 2-D PAT is usually limited by the laser’s repetition rate of 10 Hz at full power.

## 4.

## Real-Time Dual PA and US Tomography Modality

The schematic of the imaging system is shown in Fig. 3. A tunable optical parametric oscillator (OPO) laser (Vibrant B, Opotek Inc., Carlsbad, CA) pumped by the second harmonic output of a Nd:YAG pulsed laser (Brilliant B, Quantel, Bozeman, MT) is used to illuminate the sample with a repetition rate of 10 Hz and pulse duration of 5.5 ns. The laser system is tuned to 720 nm which gives the maximum output energy of $60\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{mJ}/\text{pulse}$. The Verasonics US unit (Model V3, Redmond, WA) is programmed to work along with the multicore computer equipped with the NVidia GeForce GTX690 GPU card. The probe connected to the Verasonics US imaging system for this work is a linear array (CL15-7, Philips North America Corporation, Andover, MA) with 128 elements, 11.25 MHz center frequency, and 75% $-6\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{dB}$ bandwidth.

The internal pulse generator in the laser outputs a TTL trigger signal at 10 Hz. By using a function generator (3314A, HP) to create a square wave synchronized to the trigger from the laser, the Verasonics US unit works on the PA mode triggered by the rising edge of the square wave. Triggered by the falling edge, which is 40 ms after the rising edge, the US unit turns on the transmission and the rest of the US mode. By switching the transmission on and off, exactly synchronized with the laser pulses, the system employs the PA and US modes alternatively, both at a frame rate of 10 Hz. The reconstructed PA and US images from the same imaging plane are displayed simultaneously on the screen. PA and US images are reconstructed using hardware parallel computing techniques. The US image reconstruction and display are performed entirely with the multicore CPU. The PA image reconstruction is conducted mostly on the multicore GPU and takes only about 20% of the CPU computing resource. Therefore, there is no conflict in the computational and display procedures between the two imaging modes.

To examine the image quality that can be achieved with this system, noninvasive imaging of human peripheral joints *in vivo* was conducted. Figure 4 illustrates the result from the proximal interphalangeal joint of the index finger of a normal volunteer. The light energy density on the skin surface is approximately ${4\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{mJ}\text{\hspace{0.17em}}\mathrm{cm}}^{-2}$, significantly lower than the American National Standards Institute (ANSI) safety limit. The image reconstructed using the GPU and the optimized BP algorithm, as illustrated in Fig. 4(a), shows no obvious differences from that computed using the CPU and the traditional BP algorithm illustrated in Fig. 4(b). This indicates that the new technique can drastically elevate the computational speed without sacrificing image quality.

Figure 4(c) shows the PA image reconstructed with the optimized BP algorithm followed by envelope detection using the equation of ${\mathbf{P}}_{e0}(\overline{r})=\sqrt{{\mathbf{P}}_{0}{(\overline{r})}^{2}+{\widehat{\mathbf{P}}}_{0}{(\overline{r})}^{2}}$, where ${\widehat{\mathbf{P}}}_{0}(r)$ is the Hilbert transform of ${\mathbf{P}}_{0}(\overline{r})$. Figure 4(d) shows the gray scale B-scan US image from the same joint consisting of 128 beams.^{17} We can discern comparable morphological features between the PA and US images, including the contour of the bone and the tendon.

To validate the performance of this system to image moving events in biological samples, artificial vessels made from an optically transparent soft tube with a diameter of 3 mm were employed. By immersing the body of the soft tube in a water tank, periodic red ink droplets and air bubbles were pushed through the tube manually by using a syringe, with a speed of about $0.5\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{mm}/\mathrm{s}$. Using this dual-modality system, we imaged the moving of the ink droplets and the air bubbles, as illustrated in Video 1 and Fig. 5, at a frame rate of 10 Hz. In images with the optical and ultrasonic contrasts, respectively, we can see the ink droplets with satisfactory contrast-to-noise ratio. A video showing the real-time imaging results in both PA and US modes can also be found online at http://sourceforge.net/projects/patrealtime.

## 5.

## Conclusion

Based on different contrasts, US and PAT each can reveal different tissue morphological features and functional properties. A dual-modality system facilitating both imaging functions should offer more comprehensive information in biological tissues for improved diagnostic imaging and therapeutic planning and monitoring. In this work, after optimizing the code for BP-based image reconstruction and by taking advantage of the excellent parallel computational capability of the GPU, we have for the first time achieved truly real-time PA and US dual-modality imaging using a practical experimental system. On the PA imaging mode, when the image size is $1024\times 512$ pixels and the data volume of signals is 128 channels by 2048 points, the maximum frame rate achievable with this system is $17\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\mathrm{s}}^{-1}$, sufficient for visually continuous real-time imaging. When the image size or the data volume is reduced, and the pulse repetition rate of the laser is increased, the dual-modality frame rate is increased correspondingly. The GPU-based image reconstruction software code for this dual-modality system is made freely available. We expect that such a system, enabling the evaluation of fast functional activities by accessing the intrinsic optical contrast in tissues, could lead to new discoveries in clinic and basic research. This technique may also prove useful in guiding therapies or interventions and in evaluating the function of organs or tissues by quantifying the delivery or perfusion of optically absorbing contrast agents. Because US imaging of the same object can also be conducted in real-time without being affected, the US, which has been accepted as a standard tool, can assist in guiding PA procedures and helping in the interpretation of PA image findings. Our next step for updating this dual-modality system is to synthesize PA and US images into one image^{18} in real time to merge individual information from both imaging methods.

## Acknowledgments

This research was supported by the National Natural Science Foundation of China under grant numbers 61201425, 61100111, and 11028408, and the National Institutes of Health of the United States under grant numbers R01AR060350, R01CA91713, and R01AR055179.