GPU-accelerated mesh-based Monte Carlo photon transport simulations

The mesh-based Monte Carlo (MMC) algorithm is increasingly used as the gold-standard for developing new biophotonics modeling techniques in 3-D complex tissues, including both diffusion-based and various Monte Carlo (MC) based methods. Compared to multi-layered and voxel-based MCs, MMC can utilize tetrahedral meshes to gain improved anatomical accuracy, but also results in higher computational and memory demands. Previous attempts of accelerating MMC using graphics processing units (GPUs) have yielded limited performance improvement and are not publicly available. Here we report a highly efficient MMC – MMCL – using the OpenCL heterogeneous computing framework, and demonstrate a speedup ratio up to 420× compared to state-of-the-art single-threaded CPU simulations. The MMCL simulator supports almost all advanced features found in our widely disseminated MMC software, such as support for a dozen of complex source forms, wide-field detectors, boundary reflection, photon replay and storing a rich set of detected photon information. Furthermore, this tool supports a wide range of GPUs/CPUs across vendors and is freely available with full source codes and benchmark suites at http://mcx.space/#mmc.


Introduction
Modeling photon-tissue interactions accurately and efficiently is essential for an array of emerging biophotonics applications, such as diffuse optical tomography (DOT), fluorescence molecular tomography (FMT) and functional near-infrared spectroscopy (fNIRS). Due to the complex nature of photon-tissue interactions, significant effort has been dedicated towards developing computationally efficient methods that not only properly consider the underlying physical processes, such as light scattering, absorption and emission, but also accurately model the complex-shaped anatomical boundaries that delineate tissues.
Over the past decade, Monte Carlo (MC) based modeling has seen increasing use, thanks to two recent breakthroughs in MC algorithm development. The first breakthrough takes advantage of the parallel computing capability of modern * Address correspondence to: Qianqian Fang, E-mail: q.fang@neu.edu graphics processing units (GPUs) and dramatically shortens the computational time by 2 to 3 orders of magnitude. 1,2 The second breakthrough allows MC to model complex tissue boundaries using increasingly sophisticated representations, such as 3-D voxels, 1, 3 non-uniform grids, and triangular 4 and tetrahedral meshes. 5,6 Combined with the generality and intuitive domain settings, these new advances has made MC not only a choice for gold-standard solutions, but also a powerful research tool increasingly involved in routine optical data processing and instrument parameter optimizations.
Nearly a decade ago, our group proposed one of the first mesh-based MC (MMC) methods. 6 Compared with the traditional voxel-based MC, MMC reports better accuracy because meshes are more anatomically accurate in modeling arbitrarily-shaped 3-D tissues, which are often delineated by curved boundaries. Since then, a number of works have been published to further extend MMC for faster and more accurate simulations. In several works, 7, 8 the ray-tracing calculation and random number generation were vectorized and ported to single-instruction multiple-data (SIMD). This data level parallelism significantly improves the simulation speed on modern CPUs. Yao et al. 9 reported a generalized MMC to efficiently support wide-field illumination and camerabased detection. In Yan et al., 10 a dual-grid MMC, where a coarsely tessellated mesh and a fine grid are used for raytracing and output storage, respectively, managed to enhance both performance and solution accuracy. Leino et al. 11 developed an open-source MMC software which incorporates MATLAB interface for improved usability.
Despite the steadily growing user community, a highly efficient GPU-accelerated MMC implementation remains missing. While there has been a number of attempts to accelerate MMC using GPU computing, only limited success has been reported. For example, Powell et al. 12 reported a CUDA-based GPU-MMC for acoustic-optics (AO) modeling. The authors reported that, in the context of different application focuses, an ∼8x speedup was achieved when comparing their GPU algorith with our single-threaded MMC simulation. In another work, 13 Zoller et al. reported a GPU-based MMC (MCtet) to model anisotropic light propagation in aligned structures. Unfortunately, the software codes in both works are not publicly available to allow further testing or comparison. While these results are encouraging, compared to highly accelerated voxel-based MC, 1, 14 the lower magnitude in speedup presents an opportunity for further improvement.
The difficulties of accelerating MMC in modern GPU processors are associated with both the computation and memory characteristics of MMC photon modeling. Compared to voxel-and layer-based MC algorithms, MMC requires more geometric data (mesh nodes, elements, and surfaces) to advance a photon per propagation step. Because GPUs typically have scarce high-speed shared memory and constant memory, 15,16 the memory latency becomes a major barrier towards allowing MMC to benefit from the parallel hardware. Also, the ray-tracing calculations in MMC involve more complex tests to calculate ray-triangle intersections within the enclosing tetrahedra. This imposes higher computational demands compared to MC models that only handle simple geometries.
Here, we report a highly accelerated MMC imple-mentation, MMCL, developed using the Open Computing Language (OpenCL) framework. MMCL supports almost all advanced simulation features seen in our open-source CPU MMC, such as the support of wide-field complex sources and detectors, 9 photon replay, 17 dual-grid simulations 10 and storage capabilities for rich sets of detected photon data. Thanks to the excellent portability of OpenCL, the MMCL is capable of running on a wide range of commodity GPUs. From our tests, MMCL has shown about 2 orders of magnitude speedup compared to our highly optimized single-thread CPU implementation. Although we recognize the 2-to 5-fold speed disadvantage for OpenCL compared to CUDA on NVIDIA hardware, as shown in our previous study, 14 in this work, our decision of prioritizing the development of OpenCL MMC is largely motivated by 1) the open-source ecosystem of OpenCL allowing the developed software to be rapidly and widely disseminated through public software repositories, and 2) the upcoming highperformance GPUs from Intel and AMD being expected to attract more development attention to OpenCL libraries and drivers, likely resulting in boosts in performance.
In the following sections, we will first discuss the key algorithm steps and optimizations that enable high-throughput MC simulations, and then report our validation and speed benchmarks using simulation domains covering a wide range of complexities and optical properties. Finally, we discuss our plans to further develop this technique.

GPU-accelerated photon propagation in tetrahedral meshes
As we discussed previously, 7 at the core of MC light transport modeling is a ray-tracing algorithm that propagates photons through complex media. In our publicly available MMC software, 6 we have implemented 4 different raytracers to advance photons from one tetrahedron to the next. These ray-tracers are based on 1) the Plücker coordinates, 6 2) a fast SIMD based ray-tracer, 7 3) a Badouel ray-tracing algorithm, 5, 18 and 4) an SIMD-based branchless-Badouel ray-tracer. 19 As we demonstrated before, 7 the branchless-Badouel ray-tracing algorithm reported the best performance among the above methods; it also requires the least amount of memory and computing resources. For these reasons, we specifically choose the branchless-Badouel ray-tracer in this work. In our CPU based MMC software, we explored SIMD parallelism using SIMD Extensions 4 (SSE4). 7 In this work, we ported our manually tuned SSE4 computation to OpenCL, resulting in both improved code-readability and efficiency. The ray-tracing calculation in our GPU MMC algorithm can be represented by the below 5-step 4-component-vector operations: where ×, /, max are element-wise multiplication, division and maximum value, respectively; v = {v x , v y , v z } and p = {p x , p y , p z } are the current photon direction and position, respectively; N x,y,z denotes the x/y/z components (4 elements in each vector) of the surface normal vectors N i (i = 1, 2, 3, 4) at the 4 facets of the current tetrahedron; where the 3-D position P 0 i can be any node of the i-th face as the dot product with N i is a constant per face; 18 0 is an all-zero vector; and hmin performs the "horizontal" minimum to extract the lowest value from the T vector. Both N x,y,z and D are pre-computed. Vectors T and S in Eqs. 1 to 4 are intermediate variables, and T in Eq. 5 denotes the distances from p to the intersection points of the 4 facets of the current tetrahedron (intersections in the − v direction are ignored). The index in T where the distance has the lowest value, i.e. t in Eq. 5, indicates the triangle that the ray intersects first. Notice that above calculations consist of only short-vector (3 or 4 elements) operations with no branching. Such calculations can be efficiently optimized on the modern GPUs or CPUs, resulting in high computational throughput. The above vectored formulation is also illustrated in Fig.1. The "photon ray" with origin p and direction v intersects with the 4 facets at 4 distances, t i : a positive distance indicates intersection in the forward direction and a negative distance indicates an intersection in the − v direction. The task of finding the intersection point becomes finding the minimum positive distance in T = {t i } i=1,2,3,4 . This is efficiently achieved by first replacing all negative values in T by +∞ in Eq. 4 and then taking a horizontal minimum in Eq. 5.
The above MMCL algorithm readily supports a variety of wide-field sources and detectors via our mesh retesselation approach. 9 By storing various photon-packet related parameters, such as partial-pathlength, scattering event count, momentum transfer etc, MMCL is also capable of exporting a rich set of detected photon information, similar to its CPU counterpart. Our previously developed "photon replay" approach 17 for constructing the Jacobian matrix is also implemented in this OpenCL code.

Dual-grid MMC GPU optimization
Recently, we reported a significantly improved MMC algorithm, the dual-grid MMC or DMMC, 10 combining the strengths of both voxel-and mesh-based simulations. It reduces the ray-tracing overhead significantly by utilizing a coarse tessellation of the surface nodes without losing anatomical accuracy while a dense voxelated storage grid provides higher spatial resolution and lower discretization error. In this work, we have successfully ported the DMMC algorithm from SIMD to an OpenCL-based computing model. The outputs are written in a voxelated grid in the GPU/CPU, similar to our voxel-based MC methods. 14 To reduce global memory operations, we only write to memory when a photon packet attempts to move out of a voxel in the output grid.

GPU memory characterization and optimization
Compared to voxel-based Monte Carlo algorithms, MMC has different memory characteristics. These can greatly impact the computational efficiency on GPUs because high speed memory is limited on GPUs. To advance a photon packet by one step in a tetrahedral mesh, MMC requires reading more geometric data, including node coordinates, node indices of the current element, normal vectors of the tetrahedron facets etc. Although most of these data can be pre-computed on the host (CPU) and copied to the GPU, such data are too big to be stored into the high-speed shared or constant memory. Therefore, one has to store the bulk of the mesh data in the "global memory" which is known to have high latency (roughly 100x slower than the shared memory). 16 It is important to minimize such latency for an efficient GPU MMC implementation.
The key to overcoming this memory latency issue on the GPU is to launch a large number of threads and ensure that the GPU streaming multi-processors (SM) have abundant active thread-blocks (divided into "warps" in the NVIDIA literature and "wavefronts" in OpenCL literature). Therefore, while some of the threads wait for data from the global memory, the GPU scheduler can switch on other threads and keep the SMs busy. When there are sufficient wavefronts in the waiting queue, the global memory latency can be effectively hidden. Typically, a minimum of 10-20 wavefronts per SM is required 15,16 to effectively hide the memory latency on the NVIDIA and AMD GPUs. However, the maximum wavefronts per SM is strongly dependent on how the kernel is programmed, especially in regards to the number of registers and size of shared memory as SMs have only limited resources and must share them among all active wavefronts. As a result, the key to accelerate the GPU based MMC is to 1) launch a large number of threads that produce sufficiently large waiting queues per SM, and 2) minimize the register/shared memory needs per thread so that many wavefronts can simultaneously run on the SM. Following these insights, we are able to dramatically improve the simulation speed on tested GPU devices, independent of its vendors.

Simulations on heterogeneous computing platforms
As discussed in our previous investigation, 14 the high scalability of OpenCL permits simultaneous use of multiple computing devices. This includes both simultaneous use of different generations of GPU architecture and mix between GPUs and CPUs. In such a heterogeneous computing environment, it is crucial to ensure that the simulation algorithm has a flexible workload distribution strategy to assign appropriate workloads to each GPU device according to their capabilities. Several device-level load-balancing strategies have been implemented in this work. A manual workload distribution can be specified by a user. The manual workload partition can be guided using relative speeds obtained from a small workload running on each device. In addition, we also provide a heuristic method to automatically determine an effective workload partition according to the persistent thread (PT) 20 count on each invoked device. The PT count is determined by the number of threads that can "fully occupy" all SMs on the device. For example, for an Intel GPU, the PT count is computed by block size × 7 × #EU (execution unit, 1 EU can run 7 threads 21 ); for AMD GPUs, we estimate PT count by block size × 40 × #CU (compute unit, 1 CU can run 40 active wavefronts 16 ).

Results
In this section, we first validate our massively parallel MMCL algorithm using standard benchmarks and then systematically characterize and compare speedup ratios over a range of CPU/GPU devices using single-and multi-threaded MMC on CPUs as references. The simulation speeds are reported for both conventional single-grid and dual-grid MMC (DMMC). For all simulations, 10 8 photons are simulated with atomic operations 1 enabled. All benchmarks were performed on Ubuntu Linux 16.04 with the latest stable GPU drivers. All simulation input data and scripts are provided in our open-source code repository on Github (http://github.com/fangq/mmc) for reproducibility and future comparisons.
In the first set of benchmarks, we focus on validating the accuracy of MMCL in two simple geometries: B1 (cube60) -a cubic homogeneous domain (see Fig. 2 in Ref. 6 ) and B2 (sphshells) -a heterogeneous domain made of multi-layered spherical shells (Fig. 1c in Ref. 10 ).
Briefly, the B1 benchmark contains a 60 × 60 × 60 mm 3 homogeneous domain with an absorption coefficient µ a = 0.005 mm −1 , scattering coefficient µ s = 1 mm −1 , anisotropy g = 0.01, and refractive index n = 1.0. The medium outside of the cubic domain is considered as air. In the B2 benchmark, the optical properties and dimensions of each layer are described previously. 10 In both cases, a pencil beam source injects photons at [30, 30, 0] mm, with the initial direction pointing along the +z-axis. Boundary reflection is considered in B2 but not in B1.
Both MMC and MMCL share the same tetrahedral mesh for each simulation. For B1 and B2, two mesh densities were created to separately compare MMC and MMCL in the single-grid and dual-grid 10 (denoted as B1D and B2D) simulation methods. The node and element numbers for the generated meshes along with the pre-accelerated MMC simulation speeds are summarized in Table 1. We want to mention that MMC simulation speed is correlated more strongly with the mesh density (normalized by the local scattering coefficient) within the high-fluence regions than the total mesh element/node sizes. For a fixed domain, however, increased mesh density results in higher element/node numbers. Tab. 1 Summary of the meshes and baseline simulation speeds (in photon/ms, the higher the faster) for the selected benchmarks. The baseline speeds were measured using a single-thread (MMC-1) or 8-thread (MMC-8) SSE4-enabled MMC on an i7-7700K CPU.
In Fig. 2(a), we show the cross-sectional contour plots of the fluence distributions (mm −2 ) in log 10-scale using the fine-mesh model along the plane y = 30.5 mm. We also overlap the result from the DMMC output with those from MMCL to show the agreements between different MC methods.
In the 2 nd set of tests, we expand our comparisons to more challenging cases involving realistic complex domains. Two simulations are compared: B3 (colin27) -simulations on a complex brain atlas -Colin27 (Fig. 4 in 6 ), and B4 (skinvessel) -the skin-vessel benchmark. 22 Briefly, the B3 benchmark contains a 4-layer brain mesh model derived from an atlas; 6 B4 contains a 3-layer skin-model with an embedded vessel. The generated meshes for the two complex examples are also summarized in Table 1. The optical properties for B3 are described in Ref., 6  In Fig. 2(c-d), we show the cross-sectional contour plots from the two complex cases. For B3, we show the sagittal slice along the source plane; for the B4 benchmark, the cross-section is perpendicular to the y-axis aligned vessel. The curved tissue boundaries are shown as black-dashed lines.
Next, our focus is to benchmark the simulation speeds across a wide range of CPU/GPU processors using MMCL, and compare those with the baseline, i.e. single-threaded ("MMC-1" in Table 1) or multi-threaded ("MMC-8" in Table 1) SSE4-enabled MMC on the CPU. The branchless Badouel ray-tracing algorithm 7, 19 and a parallel xorshift128+ 23 random number generator are used across all MMC and MMCL simulations.
In Fig. 3(a), we plot the speedup ratios over singlethreaded MMC (MMC-1) across a list of benchmarks and processors. In Fig. 3(b), we also included a comparison to MCX-CL 14 -a voxel-based MC software using OpenCL. The voxelated simulation domain in MCX-CL matches the DMMC output grid of the corresponding MMCL simulations. The simulation speed numbers (in photon/ms) corresponding to Fig. 3(a) are also summarized in Table 2.
In addition, we also test MMCL using multiple GPU devices simultaneously. Two AMD GPUs of different computing capabilities -Vega10 (Vega64) and Vega20 (Vega II)are used. When running the B1D test on both GPUs, we distributed the workload based on the ratio between their single-GPU speeds. When using both Vega GPUs, we have obtained a speed of 4,583 photon/ms in the B1D benchmark; the speeds using Vega10 and Vega20 alone are 2,171 and 2,580 (in photon/ms), respectively.

Discussions and Conclusion
As expected, the contour plots generated from MMC and MMCL are nearly indistinguishable from each other in all four tested cases in Fig. 2. Similarly, in Figs. 2(a-b), the dualgrid MMC (DMMC) and MMCL (D-MMCL) outputs also excellently match the single-grid outputs. In Fig. 2(b), The previously observed fluence differences inside the spherical shell between mesh-based simulations and voxel-based MC simulations (MCX-CL) 10 are also reproduced in MMCL outputs, emphasizing the importance of using mesh models when simulating domains with curved boundaries and highcontrast heterogeneities.
Many observations can be made from the speed results reported in Fig. 3(a). First, the high scalability of our GPU accelerated MMC algorithm is indicated by the increasing speedups obtained from more recent and capable GPUs. The simulation speed achieved on the NVIDIA Titan V GPU is 117x to 421x higher than that of the CPU-based MMC on a single thread; the highest acceleration on AMD GPUs is achieved on the Vega20 GPU, reporting a 20x to 77x speedup. Moreover, the high portability of the algorithm is evident by the wide range of NVIDIA/AMD/Intel CPUs and GPUs tested, owing to OpenCL's wide support. There is a steady and significant increase in computing speed between differ-  . 3 Speeds of MMCL in 6 benchmarks (B1-B4: single-grid; B1D/B2D: dual-grid): (a) speedup ratios over a single-threaded (on i7-7700K) SSE4 MMC, and (b) speeds in dual-grid simulations compared to MCX-CL. In (a), we also report the speed (photon/ms, light-blue) and speedups over the single-(red) and multi-threaded (green) MMC in the labels for Benchmark-B3 (Colin27). Tab. 2 Simulation speed (in photon/ms, the higher the faster) of MMCL in 6 benchmarks (B1-B4: single grid; B1D and B2D: dual-grid); we also report the voxel-based MCX-CL speed in benchmarks B1D and B2D. The master script to reproduce the above results can be found in the "mmc/examples/mmclbench/" folder of our software. ent generations of GPUs made by these vendors, similar to our previous findings in the voxel-based MC algorithm. 14 Before comparing the speed differences in mesh and voxel-based simulations, as shown in Fig. 3(b), one must be aware that these two algorithms result in differing levels of accuracy, as suggested in Fig. 2(b), even when using the same grid space for output. Nevertheless, several interesting findings can be drawn from this figure. In all NVIDIA GPUs, MMCL is about 30% to 50% of the speed compared to voxel-based MCX-CL. 14 On AMD GPUs, MMCL is only ∼8% of MCX-CL's speed. However, Intel CPUs show an opposite result -MMCL is about 12% to 31% faster than MCX-CL in B1D. The sub-optimal speed on AMD GPUs is also noticeable in Fig. 3(a), where the speedup from Vega20 is only a fraction of the speedup from NVIDIA RTX2080, despite the former having 30% higher theoretical throughput.

Device
To understand this issue further, we performed a profiling analysis and discovered that the sub-optimal speed on AMD GPUs results from extensive register allocation by the compiler -the AMD compiler produces ∼200 vector registers compared to only 69 by the NVIDIA compiler. The high register count limits the total active blocks to only 4 on Vega20 compared to 28 on NVIDIA GPUs, making it extremely difficult for the GPU to "hide" memory latency. 16 We are currently collaborating with AMD to investigate this issue.
Moreover, based on our previous observations in voxel-based MC using OpenCL and CUDA on NVIDIA devices, 14 we noticed that CUDA-based MC simulation is about 2-to-5-fold faster than the OpenCL implementation due to driver support differences. As a result, we anticipate that if we further port MMCL to the CUDA programming language, one may achieve further speed improvement, with the resulting software limited to NVIDIA GPUs only. In summary, we report a massively-parallel mesh-based Monte Carlo algorithm that offers a combination of both high speed and accuracy. The OpenCL implementation allows one to run high-throughput MC photon simulations on a wide range of CPUs and GPUs, showing excellent scalability to accommodate increasingly more powerful GPU architectures. We describe the insights that we have learned regarding GPU memory utilization and vendor differences. In addition, we provide speed benchmarks ranging from simple homogeneous domains to highly sophisticated real-world models. We report the speed comparisons between CPUs and GPUs made by AMD, NVIDIA and Intel, and show excellent portability between different devices and architectures. Our accelerated open-source software, including MATLAB/Octave support, is freely available at http://mcx.space/#mmc.

Disclosures
No conflicts of interest, financial or otherwise, are declared by the authors.