## 1.

## Introduction

As fluorescence labeling technology develops,
^{1, 2} an increasing number of biologists want to observe the distribution of fluorescence targets *in vivo*. Therefore, fluorescence molecular tomography (FMT)
^{3, 4} has been developed. Because FMT can be used to quantitatively reveal the fluorescence marker's fluorescence yield
^{5, 6} and lifetime^{7} *in vivo*, FMT is a promising, small-animal imaging method for drug development and cancer research.
^{8, 9}

The reconstruction algorithm is the core technology of FMT. The algorithm can be summarized as follows.
^{4, 10} First, the distribution of the photon density of fluorescence with a different source, which is called Green's function, is calculated. Then, by applying the Fréchet derivative, a Jacobian matrix is formed with the calculated Green's functions. Finally, the distribution of the fluorescence yield is inversely calculated from the Jacobian matrix by an optimization algorithm. The first step in the algorithm is called the forward problem, which investigates the propagation of light in tissue; the second and third steps are called the inverse problem, which reconstructs the distribution of the fluorescence yield. A traditional reconstruction algorithm has been proposed based on a diffusion approximation (DA) and using the finite-element method (FEM).
^{11, 12, 13, 14} The traditional method is very fast, but it is valid only when the reduced scattering coefficients of the tissue are far greater than the absorption coefficients. Although the FMT reconstruction method based on the high order of the radiative transport equation (RTE) is suitable for heterogeneous media with complex distributions of optical coefficients, low computational efficiency limits its application. For example, Joshi proposed a radiative transport-based frequency-domain fluorescence tomography with the most accurate angular discretization of RTE, but it requires more than 3.5 h on a 16-node Beowulf cluster.^{15}

Monte Carlo (MC) methods were introduced in FMT. They are used to calculate the Green's functions with which the Jacobian matrix is formed to reconstruct the distribution of the fluorescence yield. Because MC methods are the gold standard for simulating the light propagation in tissue and are valid for all kinds of tissue,
^{16, 17, 18} they are suitable for small-animal imaging, which has a complex distribution of optical coefficients. Kumar proposed a reconstruction method based on a MC method that reconstructs the distribution of the fluorescence lifetime in time-domain fluorescence tomography for small-animal imaging. A phantom experiment showed that this method can clearly separate two fluorochromes with 6-mm spacing.^{19} Zhang also introduced MC methods into the reconstruction of fluorescence tomography for small-animal imaging.^{20} However, because of the low computational efficiency and the large number of Green's functions requiring calculation, more than 6 h are required to reconstruct the result, even with a parallel computing cluster of central processing units (CPUs).

Graphics processing units (GPUs) have been introduced in MC simulations to accelerate the simulation of the propagation of light in tissue when the distribution of the optical coefficient is already determined. The GPU is the core of a graphics card. In the past, all software ran on CPU; GPUs were only used in image processing and output until general-purpose computing on graphics processing units were proposed and unleashed the power of GPUs for parallel computation. The peak floating-point operations per/s (flop/s) of marketavailable graphics hardware can reach 1000 Gflop/s, which is 10 times higher than that of a Harpertown CPU with a 3.2-GHz frequency.^{21} By using the power of the GPU for parallel computation, MC methods, which simulate the propagation of light in tissue, can be accelerated to speeds that are 100 to 1000 times faster than those obtained with the CPU only. For example, Alerstam used a single GPU to accelerate an MC simulation by a factor of 1000 for multilayered tissues;^{22} Lo proposed a multi-GPU-accelerated MC simulation for photodynamic therapy treatment;^{23} Fang and Boas sped up an MC simulation with a single GPU by a factor of 100 to 300 for complex 3-D turbid media;^{24} and Fang also released their code named Monte Carlo eXtreme (MCX).^{25}

Until now, the FMT reconstruction method based on DA with FEM is still limited in high-scattering tissue; however, many kinds of tissues in small animals do not satisfy the condition of high-scattering, such as rabbit muscle, colon adenocarcinomas,^{26} the cerebral spinal fluid layer in the brain, and cysts in the human breast.^{24} FMT reconstruction methods based on the high orders of both RTE and MC are suitable for heterogeneity media with a complex distribution of optical coefficients, but poor computational efficiency has tremendously limited their applications. GPU can tremendously accelerate the speed of MC; however, it has not yet been used in FMT reconstruction. Furthermore, single GPU-accelerated FMT reconstruction cannot satisfy the demand for fast reconstruction because of the large number of source-detector pairs.

In this paper, we propose a fast MC-based FMT reconstruction, which is suitable for 3-D heterogeneity media with refractive-index-unmatched boundaries and a complex distribution of optical coefficients. Phantom experiments with either high or low-scattering are used to demonstrate the accuracy and speed of the method. By analyzing the reconstructed localization and concentration of the fluorochromes, we compare the accuracy of our method with a traditional method based on DA with FEM. By the load-balancing strategy, we optimize the performance of the GPU cluster for faster reconstruction and compare with that based on CPU and single GPU.

## 2.

## Method

## 2.1.

###
*Theory of FMT*

In this section, a brief introduction to the theory of FMT reconstruction based on MC simulations accelerated by a GPU cluster will be given.

The distribution of fluorescence photons in tissue can be expressed as follows:
^{3, 7}

## Eq. 1

[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} \phi _f (R_d,R_s) = \int {g^{\lambda em} (R_d,r)} x(r)\phi ^{\lambda ex} (r,R_s)dr, \end{equation}\end{document} $${\phi}_{f}({R}_{d},{R}_{s})=\int {g}^{\lambda em}({R}_{d},r)x\left(r\right){\phi}^{\lambda ex}(r,{R}_{s})dr,$$_{f}(

*R*

_{d},

*R*

_{s}) is the photon density of fluorescence at

*R*

_{d}with a source

*R*

_{s},

*g*

^{λem}(

*R*

_{d},

*r*) is Green's function with a source at

*r*(whose wavelength is λ

*em*) and a detector at

*R*

_{d}, ϕ

^{λex}(

*r*,

*R*

_{s}) is the excitation photon density with a source at

*R*

_{s}(whose wavelength is λ

*ex*) and a detector at

*r*and

*x*(

*r*) can be expressed as follows:

## Eq. 2

[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} x(r) = \eta \mu _{\alpha f} (r)\frac{{1 - j\omega \tau (r)}}{{1 + (\omega \tau (r))^2 }}, \end{equation}\end{document} $$x\left(r\right)=\eta {\mu}_{\alpha f}\left(r\right)\frac{1-j\omega \tau \left(r\right)}{1+{\left(\omega \tau \left(r\right)\right)}^{2}},$$_{αf}(

*r*) is fluorescence yield, η is the quantum efficiency, μ

_{αf}(

*r*) is the absorption coefficient of the fluorochrome, τ(

*r*) is the lifetime of the fluorochrome at

*r*,

^{11}and ω is the modality frequency of the light source. Because our system is a continuous wavelength system, therefore in this paper, ω = 0 and

*x*(

*r*) = ημ

_{αf}(

*r*).

When the light source is a narrow collimation laser, the light source can be expressed as a δ function, so ϕ^{λex}(*r*, *R*
_{s}) = *g*
^{λex}(*r*, *R*
_{s}) (Refs. 7 and 27), and, according to reciprocity,^{28} *g*
^{λem}(*R*
_{d}, *r*) = *g*
^{λem}(*r*, *R*
_{d}). Equation 1 can be expressed as follows:

## Eq. 3

[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} \phi _f (R_d, R_s) = \int {g^{\lambda em} (r,R_d)} x(r)g^{\lambda ex} (r,R_s)dr. \end{equation}\end{document} $${\phi}_{f}({R}_{d},{R}_{s})=\int {g}^{\lambda em}(r,{R}_{d})x\left(r\right){g}^{\lambda ex}(r,{R}_{s})dr.$$^{4, 7}

## Eq. 4

[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} \hspace*{-8pt}J(R_d,R_s) = \frac{{\partial [\phi _f (R_s,R_d)]}}{{\partial x}} =\! \int {g^{\lambda em} (r,R_d)g^{\lambda ex} (r,R_s)} dr, \end{equation}\end{document} $$J({R}_{d},{R}_{s})=\frac{\partial \left[{\phi}_{f}({R}_{s},{R}_{d})\right]}{\partial x}=\int {g}^{\lambda em}(r,{R}_{d}){g}^{\lambda ex}(r,{R}_{s})dr,$$*J*(

*R*

_{d},

*R*

_{s}) is the Jacobian function with a source at

*R*

_{s}and a detector at

*R*

_{d}.

The number of reconstructed fluorescence yields of FMT was determined by the number of discrete points (*Np*). To reduce the ill-posed number of coefficients for FMT, the measurement points referring to the number of sources (*Ns*) multiplied by the number of detectors (*Nd*) must be greater than the number of reconstructed fluorescence yield, which is equal to *Np*. Therefore, the number of source-detector pairs is very large. When there are *Ns* sources, *Nd* detectors, and *Np* discrete points in the experiment, the Jacobian matrix can be expressed as follows:

## Eq. 5

[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} J = \left[ {\begin{array}{*{20}c} {J\big(R_d^1,R_s^1\big)} \\ {......} \\ {J\big(R_d^i,R_s^j\big)}\vspace*{3pt} \\ {J\big(R_d^{Nd},R_s^{Ns}\big)} \\ \end{array}} \right]. \end{equation}\end{document} $$J=\left[\begin{array}{c}J\left({R}_{d}^{1},{R}_{s}^{1}\right)\\ ......\\ J\left({R}_{d}^{i},{R}_{s}^{j}\right)\\ J\left({R}_{d}^{Nd},{R}_{s}^{Ns}\right)\end{array}\right].$$In this paper, the Green's function in Eq. 4 was calculated by the kernel of MCX. According to Eq. 5, there are *Ns* + *Nd* Green's functions to be calculated to construct a (*Ns* × *Nd*) × *Np* Jacobian matrix.

Combining the Jacobian matrix in Eq. 5, and according to Tikhonov regularization,
^{29, 30} the distribution of the fluorescence yield can be reconstructed. Tikhonov regularization can be expressed as

## Eq. 6

[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} \min (||Jx - y||^2 + \lambda ||\Gamma x||), \end{equation}\end{document} $$\mathrm{min}\left(\right||Jx-y|{|}^{2}+\lambda \left|\right|\Gamma x\left|\right|),$$*J*is the Jacobian matrix shown in Eq. 5, Γ is the Tikhonov matrix (generally, Γ =

*I*where

*I*is the identity matrix) and λ = 10

^{−20}, where

*λ*is the relaxation factor, which was determined by the

*L*-curve method.

^{31}The variable

*y*is the photon density of fluorescence in the experiment, and

*x*= ημ

_{af}represents the reconstructed fluorescence yield.

^{11}The conjugate gradient method was used to solve Eq. 6.

^{32}

## 2.2.

###
*Acceleration by a GPU Cluster*

To further improve the speed of the reconstruction method based on MC simulations, with the goal of satisfying the demand for fast reconstruction in FMT, a GPU cluster was constructed with MPICH2, which is a high-performance and widely portable implementation of the MPI standard (both MPI 1.0 and MPI 2.0). The entire calculation task of the Green's functions needed by the inverse problem was distributed to the GPU cluster. The hardware configuration is listed in Table 1. Three personal computers in a local area network equipped with a total of six GPUs of the G200 framework (which supports both double and single precision) were used to construct a GPU cluster. A flowchart of the GPU cluster-accelerated FMT reconstruction method based on MC simulations is shown in Fig. 1.

## Table 1

The hardware of the GPU cluster.

CPU | Graphic card | Processing power (peak) Gflop/s (Ref. 40) | |
---|---|---|---|

Computer 1 | Pentium(R) Dual-Core E5300 | NVIDIA GTX 295 with two GPUs | 1788.48 |

Computer 2 | Intel(R) Core™ i7 920 | NVIDIA GTX 295 with two GPUs | 1788.48 |

Computer 3 | Intel(R) Xeon(R) X5570 | NVIDIA Quadro FX4800 with one GPU | 693.50 |

NVIDIA Tesla C1060 with one GPU | 933.12 |

The processes in Fig. 1 can be briefly summarized as follows:

(1) The host computer receives the file containing the position and direction of the sources and the detectors, the average optical coefficients, and the phantom size from the experiment. The computer then transmits this information to each computing node in the GPU cluster.

(2) The CPU compute nodes receive all of the experimental information from the host computer and distribute the tasks of Green's function computation into different GPUs. Because the calculation time for Green's functions is different for different GPUs, the time consumption is determined by the GPU that would have the lowest performance if the computing tasks were equally distributed into each GPU of the cluster. Therefore, a load-balancing strategy is used to achieve maximum parallel efficiency. All tasks are distributed into GPUs according to the following equation:

where## Eq. 7

[TeX:] \documentclass[12pt]{minimal}\begin{document}\begin{equation} N_i = T_{{\rm task}} *\frac{{P\_{\rm GPU}_i }}{{\sum\limits_i^{N_{{\rm GPU}} } {P\_{\rm GPU}_i } }}, \end{equation}\end{document} $${N}_{i}={T}_{\mathrm{task}}*\frac{P\_{\mathrm{GPU}}_{i}}{\sum _{i}^{{N}_{\mathrm{GPU}}}P\_{\mathrm{GPU}}_{i}},$$*N*_{i}is the number of computing tasks for Green's functions for the*i*’th GPU and*T*_{task}is the total number of Green's function tasks. [TeX:] $P\_{\rm GPU}_i$ $P\_{\mathrm{GPU}}_{i}$ is the processing power of the*i*’th GPU, which is listed in Table 1.*N*_{GPU}is the total number of GPUs. The load-balancing of CPU is not considered because the main compute task in the compute node of CPU is to calculate the Green's function by GPU, so CPU is rarely used.(3) According to Eq. 7, the start and end indices of the computing task (

*i*_{start},*i*_{end}) for each GPU are calculated. The MC simulation is used to calculate the Green's function from*i*_{start}to*i*_{end}, and the fluences produced by MC are saved in a file named i.mc2. The MC simulation refers to MCX.(4) When the entire list of Green's functions has been calculated by the GPU cluster, all of the files that record the results of the Green's functions are transformed to the host computer to construct the Jacobian matrix required for reconstruction.

(5) Tikhonov regularization is used to reconstruct the distribution of the fluorescence yield. The expression is shown in Eq. 6.

^{24}In addition, all of the codes based on the CPU were compiled in Visual Studio 2008 with the −02 (MaxSpeed) option on a Windows Server 2008 R2 X64 system. The codes based on the GPU were compiled with the NVCC 3.0 (beta) which is a compiler in CUDA.

## 3.

## Phantom Experiments

In this section, two phantom experiments for both low- and high-scattering media are used to demonstrate the accuracy and speed of MC-based FMT reconstruction accelerated by a GPU cluster.

## 3.1.

###
*Reconstruction Accuracy*

Four glass tubes (2.2-mm in diameter) with different concentrations of Dir-Boa fluorescence dye^{33} were placed in a rectangular box (13.5×40.5×60 mm) with a mixed solution of Intralipid and India ink. The concentration in the four glass tubes ranged from 1.8 μmol/*L* to 1.2 μmol/*L*. Two phantom experiments with different kinds of optical coefficients for the mixed solution of Intralipid and India ink were conducted. The glass tubes in the phantom experiment are heterogeneous media because the optical coefficients of the solution of Dir-Boa fluorescence dye (high-absorption, low-scattering) are different from those for the mixed solution of Intralipid and India ink, and the refractive index of glass is also mismatched with the mixed solution. The heterogeneity of the fluorescence solution is considered in the calculation of the Green's function for MC-based FMT reconstruction but not in that based on DA with FEM because it is not accurate for DA when μ_{a} > μ_{s}′.

*Case 1*. The absorption coefficient of the first experiment was 0.23 mm^{−1}, and the reduced scattering coefficient was 0.68 mm^{−1}. In this case, μ_{α} ≈ μ_{s}′, which indicates a low-scattering case. In theory, the DA will be invalid. However, the MC simulation was accurate.

*Case 2*. The absorption coefficient of the second experiment was 0.07 mm^{−1}, and the reduced scattering coefficient was 1.8 mm^{−1}. In this case, μ_{α} ≪ μ_{s}
^{′}, which indicates the high-scattering case. In theory, the DA should be as accurate as the MC simulation.

These two types of optical coefficients refer to the muscle tissue and the skin of a rabbit at a wavelength of 790 nm.^{26} A dual-modality imaging system combined with micro-computer tomography (micro-CT) and FMT ^{34} was then used to obtain the fluorescence images. A micro-CT slice was used to prove the localization accuracy of the reconstruction.

The reconstruction area (1.35×40.5×20 mm) was smaller than the real size of the phantom because the fluorescence signal only appeared in that area. The spatial resolution of the FMT is about 1 mm;^{35} therefore, by applying a 1-mm^{3} discrete accuracy, the reconstruction area was separated into 14×41×20 voxels. In this experiment, there were 99 sources and 220 detectors, so 319 Green functions were calculated. Therefore, the amount of data from the experiment (99×220 = 21,780) was larger than the number of reconstructed fluorescence yield, which was equal to the number of voxels (14×41×20 = 11,480). As a result, the ill-posed FMT reconstruction could be relieved.

Six GPUs were used to calculate the Green's functions, and 10^{5} photon moves were simulated at each thread of the GPU. There were 2560 threads in total and 256 threads in each block of the GPU. We repeated this calculation 12 times to increase the total number of simulation photons and to avoid the kernel launch time-out error.^{25} In each calculation of a Green's function, there were 1.36 × 10^{7} photons to be simulated on average, and 4.3410^{9} photons were simulated for the total reconstruction process.

The reconstruction method based on a DA with the FEM refers to NIRFAST
^{36, 37}and TOAST.
^{27, 38} By applying a 1-mm^{3} mesh, the reconstruction area was discrete, with 12,054 points and 50,100 tetrahedrons.

Tikhonov regularization was solved with a conjugate gradient method that was stopped at the 18th iteration, when the difference between two consecutive iterations was less than 10^{−6}. The reconstructed distribution of the fluorescence yield is shown in Fig. 2.

The average value of the reconstructed fluorescence coefficient of the four fluorochromes for both case 1 and case 2 was calculated from Figs. 2b, 2c, 2e, 2f. These values were calculated with different FMT reconstruction methods based on both MC simulations and DA. Because the fluorescence yield was in direct proportion to the concentration of the fluorochromes, the reconstructed concentrations of four fluorochromes were calculated through a linear fit of the fluorescence yield against the real concentration of the fluorochromes. The results are shown in Fig. 3.

The error between the linear fit and the real reconstructed concentration was used to determine the coefficient of the linear fit, which refers to the linearity of the reconstructed concentration. The resulting values are listed in Table 2. The sum of squares error (SSE) refers to the error between the linear fit and the reconstructed concentration, and *R*-squared indicates the linearity of the fit. Under ideal conditions, the SSE is zero and *R*-squared is 1.

## Table 2

The coefficients of the linear fit.

Case 1 | Case 2 | |||
---|---|---|---|---|

FEM | MC | FEM | MC | |

SSE | 0.0657 | 0.0031 | 0.0093 | 0.0055 |

R-squared | 0.7076 | 0.9837 | 0.9543 | 0.9755 |

## 3.2.

### Acceleration Performance

In this section, the effect of the load-balancing method based on the calculation power of GPU is compared with equal-load-balancing. The acceleration performance for different numbers of GPUs in the GPU cluster is compared for both case 1 and case 2.

We distribute all the tasks into GPU clusters with two different balancing methods: load-balancing with Eq. 7 and equal loading, which equally distributes the task into GPUs. We use the experimental data for case 1 and case 2. The configuration of the MC and GPU is mentioned in Sec. 3.2, and the reconstruction time is recorded for comparison. The results are shown in Table 3.

## Table 3

Performance of load-balancing.

Case 1 | Case 2 | |||
---|---|---|---|---|

Number of GPUs | Equal load(s) | Load-balancing(s) | Equal load(s) | Load-balancing(s) |

4 | 867 | 704 | 1133 | 920 |

6 | 585 | 465 | 697 | 554 |

From Table 3, we find that load-balancing based on Eq. 7 increases the computational efficiency.

We separately tested the performance of the FMT reconstruction method based on MC simulations with a GPU cluster containing one, two, four, or six GPUs. In every thread of the GPU, 10^{5} photon moves were simulated. There were 7680 threads in total in computer 1 and computer 2, but there were 6144 threads in total in computer 3. Every block in the GPU contained 256 threads. Each GPU in a GTX295 graphics card contains 30 independent multiprocessors (MPs), which contain 16384 registers, and MCX occupies 54 registers at each thread. Therefore, 256 threads could be run simultaneously in each MP. For the entire GPU, 30×256 = 7680 threads could be run simultaneously. However, the GPU on the NVIDIA Quadro FX4800 graphics card of computer 3 only contained 24 MPs, so 24×256 = 6144 threads in total could be run simultaneously. With these settings, the computing power of a graphics card with a G200 framework can be fully unleashed. Each calculation of a Green's function was repeated 12 times to avoid kernel launch time-out errors^{25} and to increase the total number of simulated photons.

For each calculation of a Green's function, there were 4×10^{7} simulated photons on average and 1.3×10^{10} in total for the entire reconstruction. Because the total number of photons in the reconstruction process was only determined by the total number of calculated Green's functions, the photon moves in each thread, the total number of threads, and the different numbers of GPUs in the GPU cluster did not influence the total number of simulation photons. However, there were small differences in the total number of photons when the GPU of the Quadro FX4800 in computer 3 was used because the total number of available threads for the Quadro FX4800 was only 6144.

The influence of the number of GPUs in the GPU cluster on the amount of time consumed for the FMT reconstruction method based on MC simulations is shown in Fig. 4. The time data for one and two GPUs in the GPU cluster came from the computing node of computer 1. Four GPUs refers to the computing nodes of computers 1 and 2, and six GPUs in the GPU cluster refers to computing nodes of computers 1, 2, and 3.

We also compared the acceleration ratio between the speeds of the CPU code and the GPU cluster code. The speed of the CPU code was estimated by the tMCimg code,^{39}which was compiled with –O2(maxspeed) in Visual Studio 2008. Because it takes a long time to use the tMCimg in FMT reconstruction, we only tested the speed of tMCimg for one calculation of a Green's function. This speed was approximately equal to the speed for the whole reconstruction because the Green's function calculation consume 95% of the time needed for the total reconstruction. The results showed that for case 1, the calculation speed of a Green's function based on CPU code was 75 photons/ms, and for case 2, the speed was 7.6 photons/ms. The performance using different numbers of GPUs in the GPU cluster is shown in Fig. 5, and the acceleration ratio compared with the CPU is shown in Table 4.

## Table 4

The acceleration ratio for different numbers of GPUs.

Number of GPUs | 1 | 2 | 4 | 6 |
---|---|---|---|---|

Acceleration ratio in case 1 | 66x | 132x | 246x | 373x |

Acceleration ratio in case 2 | 580x | 871x | 1859x | 3088x |

## 4.

## Discussion and Conclusion

For the low-scattering case (μ_{a} ≈ μ_{s}′), Figs.
2a, 2b, 2c, and Fig. 3a illustrate that the GPU-cluster-accelerated FMT reconstruction method based on MC simulations accurately reconstructed the localization and concentration of fluorochromes. The traditional reconstruction method based on a DA with the FEM failed in this case. This superiority of our method in reconstructing the concentration is quantitatively demonstrated in Table 2.

In the experiment for the high-scattering case (μ_{α} ≪ μ_{s}
^{′}), which has typically been reconstructed with the traditional reconstruction method based on a DA with the FEM, the GPU cluster–accelerated FMT reconstruction method based on MC simulations also accurately reconstructed the localization and concentration of fluorochromes. Figures
2d, 2e, 2f, Fig. 3b, and Table 2 show that, in the high-scattering case, the two methods achieve almost identical results. The only disparity is a very small difference in the linearity of the reconstructed concentration because the heterogeneity of the fluorescence solution is considered in the MC-based FMT reconstruction method accelerated by GPU clusters and the size of the fluorescence solution is small, which cannot cause a large error in the calculation of the Green's function. This small difference indicates the advantage of our method for heterogeneous media with refractive-index-unmatched boundaries.

The GPU cluster-accelerated FMT reconstruction method based on MC simulations can accurately reconstruct the localization and concentration of fluorochromes for heterogeneous media under both low- and high-scattering conditions. In contrast, the traditional FMT reconstruction method based on a DA with the FEM was only valid for the case in which μ_{α} ≪ μ_{s}
^{′}.

Figure 4 shows that the FMT reconstruction method based on MC simulations accelerated by a single GPU for case 1 and case 2 requires 44 and 49 min to finish the whole process, which does not satisfy the demand for fast reconstruction in FMT. However, the FMT reconstruction method based on MC simulations accelerated with six GPUs for case 1 and case 2 requires only 7.7 and 9.2 min, which is the same as the time required by the FMT reconstruction method based on a DA with the FEM.^{12} We also found that the more GPUs in the GPU cluster, the less time was required for the FMT reconstruction based on MC simulations, which predicts that a GPU cluster with about 15 Nvidia GTX 480 graphic cards (about 3 times faster than Nvidia GTX 295) may require only 1 min to finish the entire FMT-reconstruction process. Note, however, that the larger reduced scattering coefficient required more processing time in the FMT reconstruction based on MC simulations because a larger reduced scattering coefficient means that the photons need more steps to escape from the tissue, which requires more time to calculate.

The more GPUs in the GPU cluster, the faster the speed of the FMT reconstruction method based on MC simulations. Compared with the CPU code, the FMT reconstruction method based on MC simulations accelerated with 6 GPUs was 3088 times faster for case 2, but only 373 times faster for case 1. This difference is caused by the speed of the FMT reconstruction method based on MC simulations decreasing more quickly for the CPU code than for the GPU cluster when the reduced scattering coefficient increased.

In summary, with the MPI standard and load-balancing method, we propose a fast FMT reconstruction method which is based on MC simulation and accelerated by a GPU cluster. Through two phantom experiments on both high- and low-scattering media, we prove that this method can accurately and rapidly reconstruct the fluorochrome localization and concentration for 3-D heterogeneous media with refractive-index-unmatched boundaries and complex distributions of optical coefficients.

## Acknowledgments

We gratefully acknowledge the financial support of the National Major Scientific Research Program of China (Grant No. 2011CB910400) and the National Natural Science Foundation of China (Grant No. 60828009).