
1. Introduction

The development of innovative biophotonics techniques relies on accurate and efficient photon propagation models, especially when imaging complex human anatomy. The importance of developing fast and accurate light propagation algorithms for general media is further highlighted by the increasing utility of model-based methods in optical image acquisition and image processing. The radiative transport equation (RTE) most realistically describes light propagation in general random media, such as human tissues. Nevertheless, directly solving the RTE is computationally expensive and memory intensive due to the high dimensionality of the solution. On the other hand, the diffusion equation (DE) provides a good approximation to the RTE in scattering-dominant media^{1,2} and can be computed efficiently using finite-element (FE)-based numerical solvers.^{3,4} However, it has been shown that solving the DE in regions that contain low-scattering media, such as cerebrospinal fluid (CSF) in the brain and other void-like regions, can lead to erroneous solutions.^{5,6} Unlike other RTE solvers that rely on variational principles, the Monte Carlo (MC) method is a stochastic solver that runs a large number of independent random trials of photon packets to obtain light intensity distributions.^{7} Although the steps needed to simulate a single photon's random movement are relatively simple to implement, tens of millions of photons are often needed to obtain results of sufficient quality, taking up to several hours of computing time when a traditional serial MC algorithm is used.^{8} To improve computational efficiency, a number of hybrid models have been studied over the years, combining DE-based solutions for diffusive regions with MC solutions for low-scattering regions.^{9,10} Over the past decade, the rapid advancements in graphics processing units (GPUs) have offered a new opportunity to accelerate MC simulations.
Massively parallel MC algorithms have been proposed for simple homogeneous,^{11} layered,^{12} and three-dimensional (3D) heterogeneous media.^{13–15} Because random photons are independent of one another, GPU-based MC algorithms have demonstrated significant speed improvements, ranging from a few hundred-fold to over a thousand-fold, compared with serial MC modeling.^{11,13,15} This has shortened MC runtimes from hours to seconds. Despite this dramatic improvement in speed, the desire to simulate an even larger number of photons in extended, heterogeneous volumes and to model arrays of sources and detectors in tomographic settings continues to motivate researchers toward further reduction of MC runtimes. While it is always feasible to reduce MC modeling time by simulating fewer photons, stochastic noise can become dominant in those low-photon simulations, resulting in a loss of accuracy. To reduce the intrinsic stochastic noise without running a large number of photons, applying signal processing techniques to “denoise” a low-photon MC solution has been investigated in limited areas of research, such as radiation dosage estimation^{16–20} and, more recently, computer graphics rendering.^{21–24} Conventional MC denoising techniques have primarily focused on removing global noise.^{21} Only in recent years have noise-adaptive filters, which account for spatially dependent noise models, been proposed.^{21,25} Nonetheless, denoising MC simulations of electron beams or ionizing photon beams in the context of radiation dose calculations typically involves more complicated filter designs and is computationally demanding;^{19} the efficacy of these filters varies substantially depending on both anatomy and noise level.^{19} Recent progress in MC denoising in computer graphics focuses on machine-learning (ML)-based denoising techniques;^{24,26} however, these methods were largely designed for denoising two-dimensional (2D), low bit-depth
image data. In comparison, MC photon simulations in diffusive media typically produce fluence maps that are smooth, have high dynamic range and spatially varying noise, and are typically represented by floating-point numbers. These major differences in image and noise characteristics render most existing ML-based denoisers unsuited for processing fluence images without modification. To the best of our knowledge, there are no reported studies on denoising images produced from low-energy photon transport MC simulations. The noise models of MC photon simulation outputs, such as fluence and partial pathlengths,^{8} are generally known to be complex and not well understood.^{27} However, it is generally agreed that the dominant form of noise in fluence images, after simulating many individual photons or photon packets,^{7} follows a scaled Poisson^{27,28} or Gaussian^{29} distribution. In our opinion, an ideal MC denoising filter should possess the following characteristics: (C1) it must be effective in removing noise of the expected MC noise distributions (Poisson or Gaussian), (C2) it must be adaptive to spatially varying noise, (C3) it should not remove sharp features from the underlying images or introduce a systematic bias, and (C4) it must be computationally fast, so that achieving the desired image quality via filtering is faster than running more photons. Properties C1 and C2 in the above wish list relate to the fact that Poisson noise, or shot noise,^{28} is the dominant noise form in MC photon transport simulations. In regions traversed by more than a few dozen photons, the noise is well approximated by a Gaussian distribution; in low-photon regions, the noise follows a Poisson distribution. Shot noise is also intensity-dependent: the standard deviation of the noise at a given spatial location equals the square root of the number of photons traversing it.^{30,31} Therefore, spatial adaptivity is crucial.
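The square-root relationship between photon count and noise standard deviation can be illustrated with a toy Poisson model (a numeric sketch with hypothetical per-voxel counts, not data from this work):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model of MC fluence noise: the photon count recorded at a voxel is
# (approximately) Poisson-distributed across repeated runs, so its standard
# deviation grows as the square root of the mean count. At high counts the
# distribution is also well approximated by a Gaussian.
for mean_count in (4, 100, 10_000):
    samples = rng.poisson(mean_count, size=200_000)
    print(f"mean={mean_count:6d}  sample std={samples.std():8.2f}  "
          f"sqrt(mean)={mean_count**0.5:8.2f}")
```

Because the noise level scales with the local count, a single global smoothing strength cannot be correct everywhere, which motivates property C2.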
Property C3 is important for preserving the high-contrast profiles next to a point source and the fluence discontinuity across the boundary of an inclusion with a refractive index contrast. Property C4 is currently quite challenging to achieve, especially given the drastically accelerated MC simulation speeds attained over the past decade using GPUs. Among common filters proposed for 3D image denoising, simple Gaussian filters are fast to compute (C4) and effective in high-photon regions (C1), but they are neither spatially adaptive (C2) nor edge-preserving (C3). Gaussian filters combined with the Anscombe transform (AT) extend effectiveness to low-photon regions but remain limited in adaptivity and edge preservation. The nonlocal means (NLM) filter^{30} was shown to be highly effective in filtering Gaussian-type noise (C1), with the additional benefit of excellent edge preservation (C3). In recent years, an adaptive NLM (ANLM) filter was proposed^{32–34} for processing MRI images with spatially varying noise (C2). Similar characteristics are found in the more recently developed block-matching and four-dimensional filtering (BM4D) algorithm.^{35,36} However, the slow computation speeds (C4) of the ANLM and BM4D filters restrict their use, especially for processing GPU-accelerated MC simulations.^{35} GPU-accelerated 3D adaptive filters can potentially bring excellent computational efficiency (C4) to these state-of-the-art 3D filters and make them suitable for denoising GPU MC simulations. Granata et al.^{37} reported significant speed improvements using a GPU-based ANLM filter. However, a number of simplifications were made compared with the original ANLM filter,^{34} including removal of the preselection of nonlocal patches, 2D instead of 3D estimation of the noise variance,^{34,38} and reduced data precision (2-byte integers for MRI data).
For a typically sized volume, the filtering speed requires further improvement in order to be useful in most typical MC simulations (${10}^{6}$ to ${10}^{8}$ photons). Although GPU-BM3D filters^{39,40} have reported excellent speed, they are designed for filtering 2D images and are not suited for 3D denoising. To the best of our knowledge, there is no publication on GPU-BM4D filters. Developing a more general GPU-based 3D noise-adaptive filter with higher working precision and efficiency could benefit a wide variety of medical image data processing tasks, including improving 3D MC simulation outcomes. In this work, we describe a significantly improved GPU-accelerated ANLM filter and study its application to denoising 3D MC photon transport simulation images. The new filter shows a twofold to threefold speed improvement over state-of-the-art GPU implementations and can work at higher data precision. We have also systematically quantified the image quality improvement obtained by denoising MC-generated image data. We show that the denoising step can produce an average 5-dB SNR improvement, equivalent to the result of running 3- to 3.5-fold more photons. The robustness of the proposed method is demonstrated using 3D simulations at various photon counts in both homogeneous and heterogeneous domains. The remainder of this paper is organized as follows. In Sec. 2, we present the basic formulation of the ANLM filter and the details of our improved 3D GPU ANLM filter. The procedures to apply the proposed filter to enhance MC image quality, as well as the evaluation metrics, are also described. In Sec. 3, we compare filtered and unfiltered MC simulations, including results from both homogeneous and heterogeneous domains at various photon counts, and quantify the improvement using the developed metrics. In addition, we compare the computation time of ANLM filtering on the GPU versus the CPU.
Overall runtimes combining GPU-based MC simulations with GPU-based ANLM filters are calculated and discussed for three benchmark problems. In Sec. 4, we summarize our findings and describe future research plans.

2. Method

2.1. Adaptive Nonlocal Means Filters and Feature Comparisons

The original CPU-based ANLM filter^{34} contains a number of key features, such as calculation of the weighted averages of nonlocal patches, preselection of nonlocal patches^{38} for better image quality, spatial noise adaptivity, and wavelet subband mixing.^{38,41} A comparison between the CPU-based ANLM filter,^{34} the previously published GPU-ANLM filter,^{37} and the GPU-ANLM filter developed in this work is shown in Table 1.

Table 1. A feature-matrix comparison between the published and proposed ANLM filters. The algorithm-related features are explained in Sec. 2.1. The GPU-related features are explained in Sec. 2.2.
The algorithm details listed in Table 1 can be found in Ref. 34. Briefly, in a 3D image, the value of any voxel located at ${x}_{i}$ in the filtered image, ${u}^{\prime}({x}_{i})$, is determined by the weighted average of all the voxels, ${x}_{j}$, in a 3D cubic search volume ${V}_{i}$ in the neighborhood of ${x}_{i}$, i.e.,

Eq. (1)

$${u}^{\prime}({x}_{i})=\sum_{{x}_{j}\in {V}_{i}}w({x}_{i},{x}_{j})u({x}_{j}),$$

where $w({x}_{i},{x}_{j})$ is the weight for voxel ${x}_{j}$ and is calculated using two small cubic volumes, one centered at ${x}_{i}$, referred to as the “local patch” ${P}_{i}$, and the other centered at ${x}_{j}$, referred to as the “nonlocal patch” ${P}_{j}$; the local and nonlocal patches have the same size, which is smaller than the size of ${V}_{i}$. The weight can then be expressed as

Eq. (2)

$$w({x}_{i},{x}_{j})=\frac{1}{{Z}_{i}}\mathrm{exp}\left[-\frac{{\Vert u({P}_{i})-u({P}_{j})\Vert}_{2}^{2}}{h{({x}_{i})}^{2}}\right],$$

where ${Z}_{i}$ is a normalization constant ensuring that the weights sum to one, and the smoothing parameter $h({x}_{i})$ is tied to the spatially varying noise variance

Eq. (3)

$${\sigma}^{2}({x}_{i})=\underset{{x}_{j}\in {V}_{i}}{\mathrm{min}}[R({x}_{i}),R({x}_{j})],$$

where $R(x)$ denotes the local noise-level estimate computed in the preprocessing step. A few modifications to the above algorithm were introduced by Manjón et al.^{34} and Coupé et al.^{38} First, the “blockwise” implementation updates all voxels in the local patch instead of only the center voxel, as shown in Eq. (1). While this method allows skipping of every other voxel, it requires data exchange between different voxels within the local patch, which makes it difficult to implement on the GPU. In both Ref. 37 and this work, the voxel-based update scheme is used. Second, the nonlocal patches with low similarity to the local patch can be ignored, referred to as “preselection,”^{34} to improve image quality and accelerate computation. Preselection is not considered in the previously published GPU-ANLM work^{37} but is fully implemented in this work.
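The voxel-wise update of Eqs. (1) and (2) can be sketched as follows. This is a minimal illustration only: boundary handling, patch preselection, and the adaptive noise estimate of Eq. (3) are omitted, and `h2` is a hypothetical fixed stand-in for $h({x}_{i})^{2}$.

```python
import numpy as np

def anlm_voxel(u, i, r_v=3, r_p=1, h2=1.0):
    """Filter one interior voxel of a 3D image u with a voxel-wise NLM update.

    A simplified sketch of Eqs. (1)-(2); i must lie at least r_v + r_p
    voxels away from every image boundary.
    """
    zi, zj, zk = i
    # local patch P_i centered at x_i
    Pi = u[zi - r_p:zi + r_p + 1, zj - r_p:zj + r_p + 1, zk - r_p:zk + r_p + 1]
    wsum, usum = 0.0, 0.0
    # scan all (2*r_v+1)^3 voxels x_j in the search volume V_i
    for a in range(-r_v, r_v + 1):
        for b in range(-r_v, r_v + 1):
            for c in range(-r_v, r_v + 1):
                x, y, z = zi + a, zj + b, zk + c
                # nonlocal patch P_j centered at x_j
                Pj = u[x - r_p:x + r_p + 1, y - r_p:y + r_p + 1,
                       z - r_p:z + r_p + 1]
                # Eq. (2): exponentially weighted squared patch distance
                w = np.exp(-np.sum((Pi - Pj) ** 2) / h2)
                wsum += w
                usum += w * u[x, y, z]
    # Eq. (1), with the normalization 1/Z_i folded into the division
    return usum / wsum
```

On a constant image every patch distance is zero, all weights are equal, and the update returns the original value, which is the expected behavior of a weighted average with normalized weights.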
Third, a wavelet transformation combining two different patch sizes can be used to improve filtering performance.^{41} Previously, the two filtering steps were performed sequentially on the CPU.^{34} In this work, we process the two steps on the GPU in a streamlined fashion without redundant data transfer between the host and the device. Moreover, the ANLM filters reported in Refs. 34 and 37 were designed to denoise MRI images, which feature Rician noise. Although our reported work primarily targets the Gaussian/Poisson noise in MC outputs, we also added support for Rician noise, so the filter can readily be used for processing MRI images.

2.2. GPU-accelerated Adaptive Nonlocal Means Algorithm and Optimizations

A workflow diagram for the developed GPU-accelerated ANLM filter is shown in Fig. 1. Overall, the GPU-ANLM filter is performed in two steps: (1) preprocessing: compute $R(x)$, $\overline{u({B}_{x})}$, and the image variance within ${B}_{x}$, i.e., ${\sigma}^{2}({B}_{x})$; the former two are needed in Eq. (3), and the latter two are needed for patch preselection; (2) ANLM filtering: update image values according to Eqs. (1) and (2). In this work, both steps are implemented as parallel computations on the GPU. As demonstrated previously,^{37,42} caching image data in the high-speed shared memory of the GPU is crucial for efficient GPU implementations. In this work, a number of additional optimization steps have been implemented beyond the work of Granata et al.^{37} to further improve computational efficiency.
These improvements include: (1) loading the preprocessing outputs, $R(x)$, into shared memory at the beginning of step 2 (referred to as “O1”); (2) using 3D blocks instead of 2D blocks^{37} to maximize cache reusability between threads (referred to as “O2”); (3) precomputing $R(x)$, $\overline{u({B}_{x})}$, and ${\sigma}^{2}({B}_{x})$ on the GPU instead of the CPU (referred to as “O3”); and (4) streamlining the two filtering steps for wavelet subband mixing^{34} on the GPU without redundant data transfer (referred to as “O4”). The partition of the raw image volume into the GPU thread/block space and its mapping to shared memory are shown in Fig. 2. Here we highlight the benefit of moving from a 2D thread block^{37} to a 3D thread block design. For example, to filter a total of 4096 voxels using a filter of ${r}_{V}=3$, ${r}_{P}=2$, an isotropic 3D block configuration of ${T}_{3\mathrm{D}}^{3}$ threads, where ${T}_{3\mathrm{D}}={T}_{x}={T}_{y}={T}_{z}={4096}^{1/3}=16$, requires reading data from a total of ${[{T}_{3\mathrm{D}}+2({r}_{V}+{r}_{P})]}^{3}=\mathrm{17,576}$ voxels. Here, $2({r}_{V}+{r}_{P})$ represents the additional layers of marginal voxels in each dimension, referred to as “aprons” in Fig. 2(b), that must be loaded into shared memory along with the voxels needed for the filtering calculation. If one instead uses an isotropic 2D block configuration of ${T}_{2\mathrm{D}}^{2}$ threads, where ${T}_{2\mathrm{D}}=\sqrt{4096}=64$, this operation needs data from ${[{T}_{2\mathrm{D}}+2({r}_{V}+{r}_{P})]}^{2}\times [1+2({r}_{V}+{r}_{P})]=\mathrm{60,236}$ voxels, about 3.5 times the size of the 3D block case. Thus, the 3D thread block design effectively reduces shared memory usage by shrinking the apron size.

2.3. Metrics

We use the signal-to-noise ratio (SNR) to evaluate the image quality improvement due to the denoising filter. Note that the MC simulation noise is spatially and photon-count dependent.
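As a side note, the shared-memory footprint arithmetic quoted in Sec. 2.2 for the 3D versus 2D thread block designs can be checked with a short script (a sketch using the block sizes and filter radii stated above):

```python
# Shared-memory footprint (in voxels) for filtering a 4096-voxel block,
# following the apron arithmetic of Sec. 2.2 with r_V = 3, r_P = 2.
r_v, r_p = 3, 2
apron = 2 * (r_v + r_p)          # extra marginal layers per dimension

t3d = round(4096 ** (1 / 3))     # 16 threads per axis in a 3D block
vox_3d = (t3d + apron) ** 3      # (16 + 10)^3 = 17,576 voxels

t2d = int(4096 ** 0.5)           # 64 threads per axis in a 2D block
vox_2d = (t2d + apron) ** 2 * (1 + apron)  # 74^2 * 11 = 60,236 voxels

print(vox_3d, vox_2d, round(vox_2d / vox_3d, 2))  # ratio is about 3.4x
```

The ratio of roughly 3.4 matches the "about 3.5 times" figure above and explains why the 3D block layout conserves shared memory.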
The SNR can be calculated by running multiple (here, $N=1000$) independently seeded MC simulations, each with $K$ photons, computing the mean, ${\mu}_{K}(r)$, and standard deviation, ${\sigma}_{K}(r)$, at any given voxel located at $r$, and then converting to dB as

Eq. (4)

$${\mathrm{SNR}}_{K}(r)=20\text{\hspace{0.17em}}{\mathrm{log}}_{10}\frac{{\mu}_{K}(r)}{{\sigma}_{K}(r)}.$$

On the other hand, if we assume the MC noise has shot-noise-like behavior, then by increasing the photon number from $N$ to $c\times N$, where $c>1$, we can anticipate an overall SNR improvement (in dB) of $\mathrm{\Delta}\mathrm{SNR}=20\text{\hspace{0.17em}}{\mathrm{log}}_{10}\sqrt{\frac{c\times N}{N}}=10\text{\hspace{0.17em}}{\mathrm{log}}_{10}c$. For example, a 10-fold increase in photon number in an MC simulation brings a roughly 10-dB SNR improvement. Once ${\mathrm{\Delta}\mathrm{SNR}}_{\mathrm{filter}}$ is calculated, the above equation allows us to estimate a filter-equivalent photon number multiplier

Eq. (5)

$${M}_{F}={10}^{{\mathrm{\Delta}\mathrm{SNR}}_{\mathrm{filter}}/10}.$$

The larger the ${M}_{F}$ value, the better the performance of the filter. Furthermore, when assessing improvements due to the various optimization strategies proposed in Sec. 2.2, we incrementally add the optimization steps and calculate the average runtime after each addition. To compare the speed improvement from the CPU-based to the GPU-based ANLM filters, we also run multiple ($N=3$) independent trials for each simulation setting and calculate the average runtime differences. Although it is desirable to compare the GPU-ANLM filter reported in this work with the one published previously,^{37} this prior work does not contain the full ANLM implementation^{34} (see Table 1). Therefore, when an optimization strategy involves a feature not available in Ref. 37, we fall back to the original ANLM algorithm^{34} to make the comparison.
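The metric pipeline of Eqs. (4) and (5) can be sketched as follows (a minimal example operating on synthetic repetitions rather than this work's simulation data):

```python
import numpy as np

def snr_db(runs):
    """Per-voxel SNR in dB from repeated, independently seeded runs.

    runs: array of shape (N, ...) holding N repetitions of a fluence image.
    Implements Eq. (4): SNR = 20 * log10(mean / std) at each voxel.
    """
    mu = runs.mean(axis=0)
    sigma = runs.std(axis=0)
    return 20.0 * np.log10(mu / sigma)

def photon_multiplier(delta_snr_db):
    """Filter-equivalent photon number multiplier M_F from Eq. (5)."""
    return 10.0 ** (delta_snr_db / 10.0)

# A 5-dB SNR gain corresponds to roughly 3.2-fold more photons.
print(photon_multiplier(5.0))
```

Under the shot-noise assumption, a filter producing a 5.2- to 5.5-dB median SNR gain therefore acts like a 3.3- to 3.5-fold increase in photon count.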
3. Results

To evaluate the improvement of MC image quality using the denoising filter described above, a cubic domain of $100\times 100\times 100$ voxels with $1\text{\hspace{0.17em}}{\mathrm{mm}}^{3}$ isotropic voxels is used. Three benchmarks are tested: (B1) a homogeneous domain with absorption coefficient ${\mu}_{a}=0.02/\mathrm{mm}$, scattering coefficient ${\mu}_{s}=10/\mathrm{mm}$, anisotropy $g=0.9$, and refractive index $n=1.37$; (B2) the same as B1, with the addition of a $40\times 40\times 40\text{\hspace{0.17em}}{\mathrm{mm}}^{3}$ cubic absorber centered at (50, 50, 30) mm with $5\times$ absorption contrast, i.e., ${\mu}_{a}=0.1/\mathrm{mm}$; and (B3) the same as B2, but featuring a cubic inclusion with a fivefold refractive index contrast instead of absorption, i.e., $n=6.85$. The last benchmark is designed to test the edge-preservation capability of the filter. In all benchmarks, a pencil beam pointing along the $+z$ axis is located at (50, 50, 0) mm. Photon counts ranging from ${10}^{5}$ to ${10}^{8}$, in 10-fold increments, are simulated. For each photon count, 1000 independently seeded simulations are performed. For the ANLM filter, the patch radii ${r}_{P}$ for the two independent filtering passes used in wavelet subband mixing^{34} are 1 and 2, respectively. The box-filter domain radius ${r}_{B}$ is set to 1, as used in Ref. 34. The search radius ${r}_{V}$ is set to 3, resulting in a total search volume of ${(2{r}_{V}+1)}^{3}=343$ voxels. We also tested a larger search radius, ${r}_{V}=5$ (not shown), and observed only a marginal SNR improvement. For better computational efficiency, ${r}_{V}=3$ is used in all examples here. The MC simulations were performed using Monte Carlo eXtreme (MCX).^{13} Two GPUs, an NVIDIA GTX 980Ti and a GTX 1080Ti, are used to test both the MC simulation and the ANLM filtering. The CPU used for these simulations is an Intel i7-6700K. In Figs.
3(a)–3(d), we show the MC output images of benchmark B1 (homogeneous) at two selected photon counts, ${10}^{6}$ and ${10}^{7}$. The raw continuous-wave (CW) MC fluence outputs along the plane $y=50\text{\hspace{0.17em}}\mathrm{mm}$ are shown in Figs. 3(a) and 3(c) for the two photon counts, and their corresponding denoised versions are shown in Figs. 3(b) and 3(d), respectively. Similarly, in Figs. 3(e)–3(h), the cross sections along the same plane ($y=50\text{\hspace{0.17em}}\mathrm{mm}$) for the absorbing-inclusion (B2) and refractive-inclusion (B3) cases are reported before and after denoising for simulations with ${10}^{7}$ photons. To quantitatively assess the image quality improvement, in Fig. 4(a) we show the SNR profiles (in dB) at various photon counts, before and after filtering, along the midline of Fig. 3. Only the homogeneous domain (benchmark B1) is shown here as a representative example. To assess the potential bias introduced by the denoising filter, we also plot the mean fluence values in Fig. 4(b). In this case, the mean fluence profiles for all three benchmarks (B1, green; B2, blue; B3, red) are compared for a sample simulation at a photon count of ${10}^{6}$. The light and dark color-shaded regions represent the variations of the fluence within one standard deviation (calculated from 1000 repeated simulations) before and after filtering, respectively. To demonstrate the edge-smoothing effect of a conventional Gaussian filter, we also show the mean fluence along the same cross section for the refractive-inclusion (B3) case after applying a $5\times 5\times 5$ Gaussian filter with $\sigma =0.67\text{\hspace{0.17em}}\mathrm{mm}$. In addition, we found that the SNR improvement due to filtering is not strongly correlated with the presence of heterogeneities. As shown in Fig.
4(c), the SNR difference inside the refractive inclusion (B3) is similar to that in the background region, as well as in the homogeneous case (B1), despite the differences in fluence intensity. To demonstrate the application of our GPU-ANLM filter in more complex tissue structures, a 19.5-year-old brain atlas with a voxel-domain size of $166\times 209\times 223$ and $1\text{\hspace{0.17em}}{\mathrm{mm}}^{3}$ voxels is simulated.^{44} The atlas is segmented into five layers: scalp and skull, CSF, gray matter, and white matter. The optical properties for the scalp/skull and CSF are taken from the literature,^{6} as are those for the gray and white matter.^{45} A pencil beam is placed on the scalp with its direction perpendicular to the head surface. A total of ${10}^{8}$ photons are simulated. An ANLM filter with the same parameters as in the above benchmarks is applied. In Fig. 5, we show the coronal section of the MC fluence output before and after ANLM filtering. Next, we investigate the computational speed improvements obtained by incrementally incorporating the optimization strategies outlined in Sec. 2.2. All simulation runtimes are divided into two parts: the total pre-/postprocessing runtimes and the GPU ANLM filter kernel runtimes. In Fig. 6, the runtime comparison is reported as a stacked-bar plot for the four optimization methods and the three benchmarks. As a reference, we designed a baseline (B) simulation by combining features from the previously published CPU and GPU ANLM filters: we use the previously reported GPU filter settings^{37} where implemented; otherwise, we use the settings from the CPU-ANLM filter.^{34} Features not described in the former include the nonlocal patch preselection and the wavelet subband mixing. In Fig. 6, the blue bars represent cumulative runtimes for the pre- and postprocessing steps; the orange bars denote the total runtimes of the ANLM GPU kernel. For this particular comparison, we use a photon count of ${10}^{7}$.
The reported runtimes are averaged over three runs.

4. Discussions and Conclusion

From visual inspection of Fig. 3, we find that applying the proposed GPU ANLM filter results in a noticeable improvement in the smoothness of the MC fluence images. Comparing Figs. 3(a)–3(c), the denoised image in (b) shows a quality improvement similar to that obtained from the increased photon count shown in (c). A similar improvement can also be found between (d) and (c). We also found that the image smoothness improvements in regions near the source are not as significant as those distal to the source. This is indicative of the adaptiveness of the filter: the denoising filter smooths the image less in regions of high intensity (i.e., high SNR) than in regions with relatively high levels of noise. Additional findings can be drawn from Figs. 3(e)–3(h). For both of the tested heterogeneous domains, one with an absorbing inclusion, shown in (e) and (f), and the other with a refractive inclusion, shown in (g) and (h), the denoised images also demonstrate a noticeable improvement in overall image smoothness. Despite the overall smoothed contours, the image features due to the inclusions are well preserved when comparing the images before and after filtering. The images in Figs. 3(g) and 3(h) are particularly notable because the sharp edges, resulting from the discontinuity of fluence caused by the refractive index mismatch, show little sign of blurring after denoising. From Fig. 4(a), it is clear that the ANLM filter (dotted lines) helps improve the SNR at all tested photon counts. However, such improvements appear to be limited to regions distal to the source, labeled as the “effective region” in Fig. 4(a): the higher the photon count, the greater the distance between the effective region and the source. This finding has mixed implications.
On the positive side, it confirms the adaptiveness of the ANLM filter, as observed above, and ensures that regions with (relatively) high SNRs receive minimal distortion. On the other hand, one can also anticipate that the effectiveness of the adaptive filter gradually diminishes when processing MC outputs from increasing numbers of photons. This behavior is clearly different from running 10-fold more photons (solid lines), for which the SNR enhancement appears to be relatively uniform. The tested CPU-BM4D filter (dashed lines) shows a wider effective region than ANLM, except near the source, despite a much slower speed (20 s total runtime on a CPU) due to the lack of a GPU implementation. The average SNR improvement of the BM4D algorithm is also slightly higher than that of ANLM. While we demonstrate significant improvements in MC photon simulations via denoising with the ANLM filter, it is not our intent to claim that it is optimal for this task. Based on this figure, exploring and accelerating other contemporary 3D adaptive filters, such as BM4D, could be a promising direction for future research. To quantify the improvement in SNR due to the denoising filter, we heuristically determined the effective region boundary using an SNR improvement threshold, above which the improvement is counted. In addition, to minimize the bias due to inaccurate SNR values in the low-photon region (for example, $\mathrm{SNR}<0\text{\hspace{0.17em}}\mathrm{dB}$), we use the median instead of the mean to calculate the average improvement.^{46} The median SNR improvement over the entire volume ranges from 4.5 to 5.4 dB across photon counts; this improvement increases to 5.2 to 5.5 dB if we only consider the effective regions, where $\mathrm{\Delta}\mathrm{SNR}>3\text{\hspace{0.17em}}\mathrm{dB}$. According to Eq. (5), such SNR improvements are equivalent to those produced by increasing the photon number by about 3.5-fold (i.e., ${M}_{F}=3.5$).
We also estimated the median dB increments from ${10}^{5}$ to ${10}^{6}$, ${10}^{6}$ to ${10}^{7}$, and ${10}^{7}$ to ${10}^{8}$ photons, which are 8.7, 9.6, and 9.8 dB, respectively. These results are close to the 10-dB increase expected for shot noise. Similarly, we estimated that the BM4D filter yields a median 6.8- to 7.4-dB increase in SNR in regions where $\mathrm{\Delta}\mathrm{SNR}>3\text{\hspace{0.17em}}\mathrm{dB}$, which corresponds to an ${M}_{F}$ of around 5. The averaged fluence profiles after denoising align well with those of the raw MC outputs, as shown in Fig. 4(b), in both homogeneous and heterogeneous cases. This confirms that the proposed denoising filter does not add noticeable bias to the data. In particular, the sharp fluence change near the boundary of the refractive inclusion is well preserved after ANLM filtering, as seen in Fig. 4(b) (red lines). The efficacy of the denoising filter can also be seen from the notable shrinkage of signal variations in the light- and dark-shaded areas, particularly in regions far from the source. The ANLM filter gives a significantly better result than the 3D Gaussian filter, which tends to smooth all sharp features, as seen in Fig. 4(b) and its inset. A similar improvement can be found in Fig. 5 when using a complex head model. From Fig. 4(c), it also appears that the SNR improvement from denoising is not strongly influenced by either the fluence magnitude or the background optical property variations. This allows us to extrapolate these assessments to more complex domains. From Fig. 6, the total runtimes of the filtering step were reduced from 2.5 s to around 1 s, a nearly $2.5\times$ speedup. On average, the filtering runtime (orange) speedup from the utilization of shared memory (O1) is around 11%; the addition of the 3D block configuration (O2) further reduces the GPU filtering kernel runtimes by 50% from the baseline.
Furthermore, by moving the preprocessing step from the CPU to the GPU (O3), we cut the preprocessing runtimes (blue) by 53%. Finally, the use of streamlined GPU-based wavelet subband mixing (O4) further reduces the postprocessing time by another 13%, yielding a nearly 60% total time reduction. The overall runtimes, as well as the speedup ratios due to the various optimization strategies, appear to be independent of the types of inclusions in the media. Two major observations can be made from the runtime data in Table 2. First, by comparing the runtimes of the CPU-based ANLM filter obtained from Refs. 34 and 47 with those of our GPU-based ANLM filter, we observe a threefold to fourfold improvement in speed, with slight variations across different media configurations and photon numbers. Second, the relatively constant filtering runtimes, combined with the GPU MC runtime growing proportionally with photon number, suggest that the overall efficiency of combining an MC simulation with a denoising step depends on the simulated photon number: the larger the photon number, the greater the overall improvement in speed. According to our discussion of Fig. 4(a) above, the denoising filter produces an SNR improvement similar to that of running a simulation with 3.5-fold more photons. By multiplying the MCX runtimes by a factor of 3.5 and comparing the results with the sum of the MCX and GPU-ANLM filter runtimes, we conclude that the denoised MC simulation can reduce processing time at photon counts above ${10}^{7}$, with the maximum acceleration achieved at the highest photon counts. It is interesting to note that traditional CPU-based MC simulation has long been known for its slow computation.^{8} Combining the proposed denoising filter with a sequential MC simulation could maximize the speed improvement (about $3.5\times$) at most photon counts if one continues to use CPUs for MC simulations.
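The break-even reasoning above can be made explicit with a small sketch: denoising pays off whenever running ${M}_{F}$-fold more photons would take longer than running the original simulation plus the filter. The runtimes below are hypothetical placeholders, not measurements from Table 2.

```python
def denoising_pays_off(t_mc, t_filter, m_f=3.5):
    """True if MC simulation + filtering beats running m_f-fold more photons.

    t_mc: runtime of the baseline MC simulation (seconds)
    t_filter: runtime of the denoising filter (seconds)
    m_f: filter-equivalent photon multiplier (~3.5 for the ANLM filter)
    """
    return t_mc + t_filter < m_f * t_mc

# With a roughly constant filter cost (hypothetical numbers), denoising
# wins once t_mc exceeds t_filter / (m_f - 1), here 1.0 / 2.5 = 0.4 s.
print(denoising_pays_off(t_mc=0.2, t_filter=1.0))   # short MC run
print(denoising_pays_off(t_mc=5.0, t_filter=1.0))   # long MC run
```

This captures why the benefit grows with photon count: the filter cost is nearly fixed, while the MC runtime (and hence the avoided 3.5-fold rerun) scales with the number of photons.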
Highly parallel and efficient GPU MC codes, while highly desirable for biophotonics simulations, raise the threshold at which this MC denoising method becomes effective. Despite that, according to Table 2, our proposed method can benefit a wide range of MC simulation settings, in particular when combined with traditional MC simulations.

Table 2. The average total runtimes (in seconds) of the denoised MC simulations for different benchmarks and photon counts. Tests are performed using both NVIDIA 980Ti and 1080Ti GPUs. ${T}_{\mathrm{MC}}$ and ${T}_{f}$ stand for the MC simulation and ANLM filtering runtimes, respectively.
In summary, we reported a GPU-accelerated noise-adaptive nonlocal means (ANLM) filter to denoise low-photon MC simulation images and demonstrated an over 5-dB SNR improvement, equivalent to the SNR enhancement from running roughly 3.5 times more photons. The developed denoising filter shows excellent edge-preservation characteristics and low bias to the underlying fluence distribution, independent of photon numbers and heterogeneities in the media. By developing a number of GPU optimization strategies, our GPU ANLM filter achieves an overall 2.5-fold speed improvement over previously reported GPU-ANLM implementations and a threefold to fourfold speed improvement over previously published CPU ANLM implementations. For a given domain size, we observed that the proposed denoised MC simulation gives the highest acceleration when the MC simulation runtime is dominant, such as at large photon counts or in media with high scattering coefficients. The reported whole-head denoising result also shows the promise of applying this technique to simulations in complex domains. With support for both Gaussian and Rician noise, our GPU ANLM filter can also be broadly applied to denoising other types of 3D volumetric data, including MRI scans. Both our GPU MC simulation software (MCX^{13} and MCXCL^{15}) and the GPU-ANLM filter are open-source and can be downloaded from http://mcx.space/mcfilter/.

Acknowledgments

This research was supported by the National Institutes of Health (NIH) Grant Nos. R01GM114365, R01CA204443, and R01EB026998.

References

D. Contini, F. Martelli, and G. Zaccanti,
“Photon migration through a turbid slab described by a model based on diffusion approximation. I. Theory,” Appl. Opt. 36(19), 4587–4599 (1997). https://doi.org/10.1364/AO.36.004587
A. Ishimaru, “Diffusion of light in turbid material,” Appl. Opt. 28(12), 2210–2215 (1989). https://doi.org/10.1364/AO.28.002210
S. Arridge et al., “A finite element approach for modeling photon transport in tissue,” Med. Phys. 20(2), 299–309 (1993). https://doi.org/10.1118/1.597069
M. Schweiger et al., “The finite element method for the propagation of light in scattering media: boundary and source conditions,” Med. Phys. 22(11), 1779–1792 (1995). https://doi.org/10.1118/1.597634
A. H. Hielscher, R. E. Alcouffe and R. L. Barbour, “Comparison of finite-difference transport and diffusion calculations for photon migration in homogeneous and heterogeneous tissues,” Phys. Med. Biol. 43(5), 1285–1302 (1998). https://doi.org/10.1088/0031-9155/43/5/017
A. Custo et al., “Effective scattering coefficient of the cerebral spinal fluid in adult head models for diffuse optical imaging,” Appl. Opt. 45(19), 4747–4755 (2006). https://doi.org/10.1364/AO.45.004747
L. Wang, S. L. Jacques and L. Zheng, “MCML—Monte Carlo modeling of light transport in multi-layered tissues,” Comput. Methods Programs Biomed. 47(2), 131–146 (1995). https://doi.org/10.1016/0169-2607(95)01640-F
D. A. Boas et al., “Three dimensional Monte Carlo code for photon migration through complex heterogeneous media including the adult human head,” Opt. Express 10(3), 159–170 (2002). https://doi.org/10.1364/OE.10.000159
L. Wang and S. L. Jacques, “Hybrid model of Monte Carlo simulation and diffusion theory for light reflectance by turbid media,” J. Opt. Soc. Am. A 10(8), 1746–1752 (1993). https://doi.org/10.1364/JOSAA.10.001746
T. Hayashi, Y. Kashio and E. Okada, “Hybrid Monte Carlo-diffusion method for light propagation in tissue with a low-scattering region,” Appl. Opt. 42(16), 2888–2896 (2003). https://doi.org/10.1364/AO.42.002888
E. Alerstam, T. Svensson and S. Andersson-Engels, “Parallel computing with graphics processing units for high-speed Monte Carlo simulation of photon migration,” J. Biomed. Opt. 13(6), 060504 (2008). https://doi.org/10.1117/1.3041496
E. Alerstam et al., “Next-generation acceleration and code optimization for light transport in turbid media using GPUs,” Biomed. Opt. Express 1(2), 658–675 (2010). https://doi.org/10.1364/BOE.1.000658
Q. Fang and D. A. Boas, “Monte Carlo simulation of photon migration in 3D turbid media accelerated by graphics processing units,” Opt. Express 17(22), 20178–20190 (2009). https://doi.org/10.1364/OE.17.020178
N. Ren et al., “GPU-based Monte Carlo simulation for light propagation in complex heterogeneous tissues,” Opt. Express 18(7), 6811–6823 (2010). https://doi.org/10.1364/OE.18.006811
L. Yu et al., “Scalable and massively parallel Monte Carlo photon transport simulations for heterogeneous computing platforms,” J. Biomed. Opt. 23(1), 010504 (2018). https://doi.org/10.1117/1.JBO.23.1.010504
J. O. Deasy, “Denoising of electron beam Monte Carlo dose distributions using digital filtering techniques,” Phys. Med. Biol. 45(7), 1765–1779 (2000). https://doi.org/10.1088/0031-9155/45/7/305
J. O. Deasy, M. V. Wickerhauser and M. Picard, “Accelerating Monte Carlo simulations of radiation therapy dose distributions using wavelet threshold denoising,” Med. Phys. 29(10), 2366–2373 (2002). https://doi.org/10.1118/1.1508112
M. Fippel and F. Nüsslin, “Smoothing Monte Carlo calculated dose distributions by iterative reduction of noise,” Phys. Med. Biol. 48(10), 1289–1304 (2003). https://doi.org/10.1088/0031-9155/48/10/304
I. El Naqa et al., “A comparison of Monte Carlo dose calculation denoising techniques,” Phys. Med. Biol. 50(5), 909–922 (2005). https://doi.org/10.1088/0031-9155/50/5/014
B. De Smedt et al., “Denoising of Monte Carlo dose calculations: smoothing capabilities versus introduction of systematic bias,” Med. Phys. 33(6, Pt. 1), 1678–1687 (2006). https://doi.org/10.1118/1.2198188
N. K. Kalantari and P. Sen, “Removing the noise in Monte Carlo rendering with general image denoising algorithms,” Comput. Graphics Forum 32(2), 93–102 (2013). https://doi.org/10.1111/cgf.2013.32.issue-2pt1
M. Boughida and T. Boubekeur, “Bayesian collaborative denoising for Monte Carlo rendering,” Comput. Graphics Forum 36(4), 137–153 (2017). https://doi.org/10.1111/cgf.2017.36.issue-4
P. Sen et al., “Denoising your Monte Carlo renders: recent advances in image-space adaptive sampling and reconstruction,” in ACM SIGGRAPH Courses, 11 (2015).
S. Bako et al., “Kernel-predicting convolutional networks for denoising Monte Carlo renderings,” ACM Trans. Graphics 36(4), 1–14 (2017). https://doi.org/10.1145/3072959
Y. Yuan, L. Yu and Q. Fang, “Denoising in Monte Carlo photon transport simulations using GPU-accelerated adaptive non-local mean filter,” in Biophotonics Congress: Biomedical Optics Congress (2018).
C. R. A. Chaitanya et al., “Interactive reconstruction of Monte Carlo image sequences using a recurrent denoising autoencoder,” ACM Trans. Graphics 36, 1–12 (2017). https://doi.org/10.1145/3072959
F. Beaujean, H. C. Eggers and W. E. Kerzendorf, “Bayesian modelling of uncertainties of Monte Carlo radiative-transfer simulations,” Mon. Not. R. Astron. Soc. 477(3), 3425–3432 (2018). https://doi.org/10.1093/mnras/sty812
G. Bohm and G. Zech, “Statistics of weighted Poisson events and its applications,” Nucl. Instrum. Methods Phys. Res. A 748, 1–6 (2014). https://doi.org/10.1016/j.nima.2014.02.021
M. Kromer and S. A. Sim, “Time-dependent three-dimensional spectrum synthesis for Type Ia supernovae,” Mon. Not. R. Astron. Soc. 398(4), 1809–1826 (2009). https://doi.org/10.1111/mnr.2009.398.issue-4
A. Buades, B. Coll and J.-M. Morel, “Nonlocal image and movie denoising,” Int. J. Comput. Vision 76(2), 123–139 (2008). https://doi.org/10.1007/s11263-007-0052-1
J. R. Janesick, Photon Transfer, SPIE Press, San Jose (2007).
J. V. Manjón et al., “MRI denoising using non-local means,” Med. Image Anal. 12(4), 514–523 (2008). https://doi.org/10.1016/j.media.2008.02.004
N. Wiest-Daesslé et al., “Rician noise removal by non-local means filtering for low signal-to-noise ratio MRI: applications to DT-MRI,” Lect. Notes Comput. Sci. 5242, 171–179 (2008). https://doi.org/10.1007/978-3-540-85990-1
J. V. Manjón et al., “Adaptive non-local means denoising of MR images with spatially varying noise levels,” J. Magn. Reson. Imaging 31(1), 192–203 (2010). https://doi.org/10.1002/jmri.22003
M. Maggioni et al., “Nonlocal transform-domain filter for volumetric data denoising and reconstruction,” IEEE Trans. Image Process. 22(1), 119–133 (2013). https://doi.org/10.1109/TIP.2012.2210725
M. Maggioni and A. Foi, “Nonlocal transform-domain denoising of volumetric data with groupwise adaptive variance estimation,” Proc. SPIE 8296, 82960O (2012). https://doi.org/10.1117/12.912109
D. Granata, U. Amato and B. Alfano, “MRI denoising by nonlocal means on multi-GPU,” J. Real-Time Image Process., 1–11 (2016). https://doi.org/10.1007/s11554-016-0566-2
P. Coupé et al., “An optimized blockwise nonlocal means denoising filter for 3D magnetic resonance images,” IEEE Trans. Med. Imaging 27(4), 425–441 (2008). https://doi.org/10.1109/TMI.2007.906087
S. Sarjanoja, J. Boutellier and J. Hannuksela, “BM3D image denoising using heterogeneous computing platforms,” in Conf. on Design and Architectures for Signal and Image Processing (DASIP), 1–8 (2015). https://doi.org/10.1109/DASIP.2015.7367257
D. Honzátko and M. Kruliš, “Accelerating block-matching and 3D filtering method for image denoising on GPUs,” J. Real-Time Image Process., 1–15 (2017). https://doi.org/10.1007/s11554-017-0737-9
P. Coupé et al., “3D wavelet subbands mixing for image denoising,” Int. J. Biomed. Imaging 2008, 1–11 (2008). https://doi.org/10.1155/2008/590183
F. P. X. De Fontes et al., “Real time ultrasound image denoising,” J. Real-Time Image Process. 6(1), 15–22 (2011). https://doi.org/10.1007/s11554-010-0158-5
NVIDIA Corporation, “CUDA C programming guide,” (2017).
J. E. Richards et al., “A database of age-appropriate average MRI templates,” NeuroImage 124, 1254–1259 (2016). https://doi.org/10.1016/j.neuroimage.2015.04.055
A. Yaroslavsky et al., “Optical properties of selected native and coagulated human brain tissues in vitro in the visible and near infrared spectral range,” Phys. Med. Biol. 47(12), 2059–2073 (2002). https://doi.org/10.1088/0031-9155/47/12/305
D. S. Moore and S. Kirkland, The Basic Practice of Statistics, 2nd ed., W. H. Freeman, New York (2007).
J. V. Manjón and P. Coupé, “naonlm3D—MRI image denoising software version 2,” (2012). https://sites.google.com/site/pierrickcoupe/
Biography

Yaoshen Yuan is a doctoral candidate at Northeastern University. He received his BE degree from Southeast University, China, in 2014 and his MSE degree from Tufts University in 2016. His research interests include Monte Carlo simulation for photon transport, GPU algorithm enhancement, and signal processing.

Leiming Yu is a doctoral candidate at Northeastern University. His research interests focus on performance optimization for parallel computing systems and workload scheduling for GPU clouds. He received his MS degree from the University of Bridgeport, United States, in 2010.

Zafer Doğan is a postdoctoral research associate at Northeastern University, jointly affiliated with Harvard University. His main research interests lie at the intersection of data analytics, optimization, inverse problems, and machine learning, with applications in emerging optical imaging modalities and nonlinear tomography. He received his PhD and MSc degrees in electrical engineering from EPFL, Switzerland, in 2015 and 2011, respectively, and his BSc degree in electrical and electronics engineering from METU, Turkey, in 2009.

Qianqian Fang is currently an assistant professor in the Bioengineering Department, Northeastern University, Boston, USA. He received his PhD degree from Dartmouth College in 2005. He then joined Massachusetts General Hospital and became an assistant professor in 2012, before joining Northeastern University in 2015. His research interests include translational medical imaging systems, low-cost point-of-care devices for resource-limited regions, and high-performance computing tools that facilitate the development of next-generation imaging platforms.