## 1.

## Introduction

Acousto-optics (AO), or ultrasound-modulated optical tomography, is a hybrid technique which combines the optical contrast of biological tissues at visible or near-infrared wavelengths with the spatial resolution of focused ultrasound fields to provide highly localized measurements of the optical properties of biological tissues. Such measurements are of clinical significance since they may indicate underlying pathologies, e.g., increased vascularization due to tumor angiogenesis, or serve as an indicator of the progress of treatments which themselves alter the optical properties of tissues, e.g., thermal necrosis by therapeutic high-intensity focused ultrasound (HIFU).^{1}

## 1.1.

### Acousto-Optic Modeling

Leutz and Maret^{2} first developed a model of the AO effect in turbid media based upon the approach of diffusing wave spectroscopy (DWS)^{3}^{,}^{4} with the addition of sinusoidally oscillating optical scatterers. Kempe et al.^{5} extended the analysis of Leutz and Maret to consider a narrow beam of ultrasound and compared this model to experimental data. Mahan et al.^{6} proposed the acoustic tagging process to be that of Brillouin scattering, proceeding to develop an expression of the intensity of the modulated signal and the signal-to-noise ratio in the case of an ultrasound field focused into a small volume. Theoretical understanding of the technique was advanced by Wang,^{7} who jointly modeled the two coherent mechanisms of interaction previously investigated: the acoustically induced displacement of optical scatterers and the change in the index of refraction due to compression and rarefaction. Theoretical research continued with the extension of Wang’s model to the case of anisotropically scattering media^{8} and to account for strong correlations in the interaction at successive scattering sites.^{9} Blonigen et al.^{10} further investigated the nature of the phase shifts applied to ultrasound modulated light in the context of a photorefractive crystal detection regime. The inherent limitations of theory based upon the DWS approach were overcome by Sakadžić and Wang through the development of a correlation transfer equation (CTE)^{11} which provides a treatment of the acoustic continuous-wave (CW) AO problem in the style of a transport equation. A diffusion approximation (DA) is presented in a subsequent work,^{12} and later the CTE is reformulated to account for an ultrasonic pulse.^{13} Recently Bratchenia et al.^{14} developed sensitivity maps of the acousto-optic technique based upon a pair of coupled diffusion equations accounting for the unmodulated and modulated distributions of light. Throughout the initial theoretical investigations of the technique, a number of CPU-based Monte-Carlo (MC) models were demonstrated to support the analytical solutions under development.^{8}^{,}^{9}^{,}^{11}12.^{–}^{13}^{,}^{15}

## 1.2.

### GPU-Accelerated Monte-Carlo Technique

In recent years, considerable advances have been made in executing MC simulations on the highly parallel architecture of graphics processing units (GPUs). The first demonstration by Allerstam et al.^{16} considered the problem of light transport in an homogeneous semi-infinite medium; increases in speed over a CPU implementation of up to three orders of magnitude were achieved. Fang and Boas^{17} subsequently developed a general purpose code to perform MC simulations in heterogeneous voxelized geometries. Under specific circumstances speed increases over a CPU implementation of up to $\times 350$ were noted.

We recently demonstrated the acceleration of a MC AO simulation,^{18} achieving performance improvements of up to $\times 100$ over a CPU implementation equivalent to that presented in Ref. 15. We also demonstrated that our GPU-based AO simulations involved an order of magnitude of extra computational effort over a pure optical simulation.

## 1.3.

### Aims

Given the demonstrable improvements in efficiency which can be achieved using a GPU-based parallel implementation, our aim is the development of a robust, efficient and flexible simulation tool that has the potential to provide significant guidance in the application and optimization of the AO technique as both an imaging and a sensing modality for a broad range of tissue geometries and detection mechanisms. A fast forward model may also serve as the basis of future developments in inversion techniques for AO.

While our previous implementation^{18} was limited to homogeneous semi-infinite slab geometries with plane-wave ultrasound, here we develop a code capable of simulating the AO effect in optically heterogeneous domains, with arbitrary (monochromatic) acoustic field distributions. Previous CPU-based simulations reported in the literature^{11}^{,}^{13} have implemented a similar level of flexibility by the use of a voxelized simulation domain for the spatial distribution of the optical and acoustic parameters. Here we will demonstrate the implementation of optical heterogeneity via a mesh-based system which avoids the deleterious effects of a voxelized geometry,^{19} and exploits the computational power of the GPU to avoid approximations in the treatment of the AO phase shifts demonstrated in previous work.^{11}

It is our intention to release the described simulation code publicly in the near future; if you wish to be contacted upon its release, please contact the corresponding author by E-mail.

## 1.4.

### Overview

We begin in Sec. 2 with an overview of the relevant theory of AO, and the variance-reduction technique employed for point-source, point-detector (PSPD) simulations. The implementation of the simulation program and its post processing techniques are detailed in Sec. 3, following a brief overview of the pertinent considerations of parallel programming for GPU architectures. In Sec. 4 we demonstrate the validation of the simulation program, consider its performance relative to other simulation codes, and demonstrate one application in the calculation of explicit spatial sensitivity maps corresponding to experimental studies currently being undertaken. This work closes in Sec. 5 with a discussion of the limitations and possibilities for future development of the simulation code.

## 2.

## Theory

## 2.1.

### Acousto-Optic Interaction within Turbid Media

Three modes of AO interaction have been proposed in the literature: displacement of optical scatterers in a turbid medium by the acoustic field,^{2} perturbation of the refractive index of a medium due to mechanical strain induced by the acoustic field,^{7} and perturbation of the optical properties of the medium due to compression and rarefaction of the medium by the acoustic field.^{6} The latter mechanism could in principle be detected with incoherent light, but this effect has been too weak to be detected in practice (without the use of fluorescence). We consider only the two former mechanisms, which require the use of coherent light.

## 2.1.1.

#### Variation of the index of refraction

An acoustic wave propagating through a medium generates a strain which alters the local permittivity of the medium. The adiabatic piezo-optic coefficient $\eta =\delta n/\delta p$ ($\approx 1.466\times 1{0}^{-10}\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\mathrm{Pa}}^{-1}$ in water) relates the change in the index of refraction $n$ to the acoustic pressure $p$.^{20} Consider a plane electromagnetic wave propagating between two points $\stackrel{}{{\overrightarrow{r}}_{a}}$ and $\stackrel{}{{\overrightarrow{r}}_{b}}$ with wavenumber ${k}_{0}={\omega}_{o}/{c}_{0}$, where ${\omega}_{o}$ is the angular frequency and ${c}_{0}$ the speed of light *in vacuo*. The incremental phase angle accrued on this path over that of an optically homogeneous medium of refractive index ${n}_{0}$ is given by

## (1)

$${\varphi}_{n}(\stackrel{}{{\overrightarrow{r}}_{a}},\stackrel{}{{\overrightarrow{r}}_{b}},t)={k}_{0}{n}_{0}\eta {\int}_{{\overrightarrow{r}}_{a}}^{{\overrightarrow{r}}_{b}}p(\overrightarrow{r},t)\mathrm{d}\overrightarrow{r}.$$## (2)

$$\stackrel{}{{\stackrel{\sim}{\varphi}}_{n}}({\overrightarrow{r}}_{a},{\overrightarrow{r}}_{b},t)={k}_{0}{n}_{0}\eta {\int}_{{\overrightarrow{r}}_{a}}^{{\overrightarrow{r}}_{b}}{P}_{0}(\overrightarrow{r})\phantom{\rule{0ex}{0ex}}\mathrm{exp}[j({\varphi}_{a}(\overrightarrow{r})-{\omega}_{a}t)]\mathrm{d}\overrightarrow{r}.$$## 2.1.2.

#### Displacement of optical scatterers

An acoustic wave propagating through a turbid medium displaces the endogenous optical scatterers from their rest positions. The acoustic particle displacement at an arbitrary point in space and time is

## (3)

$${\overrightarrow{\xi}}_{a}(\overrightarrow{r},t)=p(\overrightarrow{r},t)\stackrel{}{\stackrel{}{{\hat{\overrightarrow{k}}}_{a}}}/({\omega}_{a}\rho {\upsilon}_{a}),$$## (4)

$${\stackrel{\sim}{\overrightarrow{\xi}}}_{s}(\overrightarrow{r},t)={P}_{0}(\overrightarrow{r})\widehat{\overrightarrow{{k}_{a}}}/({\omega}_{a}\rho {\upsilon}_{a})\mathrm{exp}[j({\varphi}_{a}(\overrightarrow{r})-{\omega}_{a}t)],$$^{10}

## (5)

$${\stackrel{\sim}{\overrightarrow{\xi}}}_{s}(\overrightarrow{r},t)=S{P}_{0}(\overrightarrow{r})\widehat{\overrightarrow{{k}_{a}}}/({\omega}_{a}\rho {\upsilon}_{a})\phantom{\rule{0ex}{0ex}}\mathrm{exp}[j({\varphi}_{a}(\overrightarrow{r})-{\omega}_{a}t+{\varphi}_{s})].$$## (6)

$$\stackrel{}{{\tilde{\varphi}}_{d}}(\stackrel{}{{\overrightarrow{r}}_{a}},\stackrel{}{{\overrightarrow{r}}_{b}},t)={n}_{0}\stackrel{}{{\overrightarrow{k}}_{0}}\xb7[{\stackrel{\sim}{\overrightarrow{\xi}}}_{s}(\stackrel{}{{\overrightarrow{r}}_{b}},t)-{\stackrel{\sim}{\overrightarrow{\xi}}}_{s}(\stackrel{}{{\overrightarrow{r}}_{a}},t)].$$## 2.1.3.

#### Acousto-optic interaction in a multiple scattering regime

We may think of the electric field at a point within, or on the boundary of a coherently illuminated turbid medium to consist of the summation of an unscattered field resulting from the incident radiation and numerous partial fields which result from the multiple scattering process,

## (7)

$$\stackrel{\sim}{E}(\overrightarrow{r},t)={\stackrel{\sim}{E}}_{i}(\overrightarrow{r},t)+\sum _{m}{\stackrel{\sim}{E}}_{m}(\overrightarrow{r},t),$$^{21}which sum together to approach the solution to the underlying transport equation. Our problem thus reduces to defining the manner in which these rays propagate, and their associated phase perturbations due to the acousto-optic interaction.

Over $N-1$ sequential scattering events at $\overrightarrow{r}{}_{j}$, $\overrightarrow{r}{}_{j},\phantom{\rule{0ex}{0ex}}j=1,\dots ,N-1$, the phase increment of Eq. (6) becomes a summation $\sum _{j=1}^{N-1}\stackrel{}{{\tilde{\varphi}}_{d}}({\overrightarrow{r}}_{j-1},{\overrightarrow{r}}_{j},t)$. With ${\overrightarrow{\xi}}_{s}({\overrightarrow{r}}_{0},t)=0$, we can write the phase increment upon incidence with a scatterer located at $\overrightarrow{r}$,

## (8)

$${\stackrel{\sim}{\varphi}}_{d}(\overrightarrow{r},t)=-{n}_{0}({\overrightarrow{k}}_{0,j+1}-\stackrel{}{{{\overrightarrow{k}}_{}}_{0,j}})\xb7{\stackrel{\sim}{\overrightarrow{\xi}}}_{s}(\overrightarrow{r},t),$$## (9)

$${\stackrel{\sim}{\varphi}}_{s}(t)=\sum _{j=1}^{N-1}{\stackrel{\sim}{\varphi}}_{d}(\stackrel{}{{\overrightarrow{r}}_{j}},t)+\sum _{j=1}^{N}{\stackrel{\sim}{\varphi}}_{n}(\stackrel{}{{\overrightarrow{r}}_{j-1}},\stackrel{}{{\overrightarrow{r}}_{j}},t),$$^{10}

## (10)

$${\varphi}_{s}(t)=\mathrm{Re}\text{\hspace{0.17em}}\{{m}_{s}\text{\hspace{0.17em}}\mathrm{exp}[j({\phi}_{s}-{\omega}_{a}t)]\}.$$## 2.2.

### Observables and Measures

Following multiple scattering in a turbid medium, the field variable $E(\overrightarrow{r},t)$ will be a random variable fluctuating over time as effects such as Brownian motion alter the contributions from its constituent scattering paths. Nonetheless, we may investigate the effects of the AO modulation by means of the temporal-field autocorrelation function, ${G}_{1}(\tau )$, which is derived from the observable temporal-intensity autocorrelation function, ${G}_{2}(\tau )$, by the Siegert relationship,

## (11)

$$|{G}_{1}(\tau )|={[{G}_{2}(\tau )-1]}^{1/2}=|{\int}_{-\infty}^{\infty}\stackrel{\sim}{E}(\overrightarrow{r},t){\stackrel{\sim}{E}}^{*}(\overrightarrow{r},t-\tau )\mathrm{d}t|.\phantom{\rule{0ex}{0ex}}$$## (12)

$${G}_{1}(\tau )\approx \sum _{p}{\int}_{-\infty}^{\infty}{\stackrel{\sim}{E}}_{p}(\overrightarrow{r},t){\stackrel{\sim}{E}}_{p}^{*}(\overrightarrow{r},t-\tau )\mathrm{d}t.$$^{22}

## (13)

$${G}_{1}(\tau )=\sum _{p}{a}_{p}^{2}[{J}_{0}^{2}({m}_{s,p})+2\sum _{n=1}^{\infty}{J}_{n}^{2}({m}_{s,p})\mathrm{cos}(n{\omega}_{a}\tau )],$$## 2.3.

### An Adjoint Method for AO

MC models of light transport are unable to directly simulate PSPD regimes which commonly occur in studies involving the use of coherent light, since the random walk process is unlikely ever to impinge directly upon the detection point. One approach to estimating the value of a quantity at a point with the MC method is the use of the adjoint method, as applied in the case of standard photon and neutron transport.^{23}^{,}^{24} We now expand this concept to the case of AO where energy will be distributed not only at the input optical frequency, but at sidebands shifted by multiples of the acoustic frequency.

A pencil beam incident normal to the boundary of the simulation domain at some point ${\overrightarrow{r}}_{s}$ illuminates the system with a given irradiance, $I({\overrightarrow{r}}_{s})$. We denote the probability of optical power at some point ${\overrightarrow{r}}_{a}$ in the $k$’th sideband reaching some point ${\overrightarrow{r}}_{b}$ in the $n$’th side band as ${P}_{k}^{n}({\overrightarrow{r}}_{b},{\overrightarrow{r}}_{a})$. The probability distribution of optical power at each sideband throughout the domain resulting from the source at ${\overrightarrow{r}}_{s}$ is given by the unitless normalized fluence rate ${P}_{0}^{n}(\overrightarrow{r},{\overrightarrow{r}}_{s})={\mathrm{\Phi}}^{n}(\overrightarrow{r})/{I}_{0}({\overrightarrow{r}}_{s})$, where ${\mathrm{\Phi}}^{n}(\overrightarrow{r})$ is the fluence rate at a point $\overrightarrow{r}$ in the medium in the $n$’th side-band. In replacing our detector with an adjoint source of equivalent directivity at ${\overrightarrow{r}}_{d}$, we may derive another set of optical power probability distributions ${P}_{0}^{n}(\overrightarrow{r},{\overrightarrow{r}}_{d})$.

The Chapman-Kolmogorov equation states^{25}

## (14)

$$P({\overrightarrow{r}}_{d},{\overrightarrow{r}}_{s})=\int P({\overrightarrow{r}}_{d},\overrightarrow{r})P(\overrightarrow{r},{\overrightarrow{r}}_{s}){\mathrm{d}}^{3}\overrightarrow{r}.$$## (15)

$${P}_{0}^{n}({\overrightarrow{r}}_{d},{\overrightarrow{r}}_{s})=\int \sum _{k=0}^{\infty}{P}_{k}^{n}({\overrightarrow{r}}_{d},\overrightarrow{r}){P}_{0}^{k}(\overrightarrow{r},{\overrightarrow{r}}_{s}){\mathrm{d}}^{3}\overrightarrow{r}.$$## (16)

$${W}_{0}({\overrightarrow{r}}_{d})={I}_{0}(\stackrel{}{{\overrightarrow{r}}_{s}})\int {P}_{0}^{0}({\overrightarrow{r}}_{d},\overrightarrow{r}){P}_{0}^{0}(\overrightarrow{r},{\overrightarrow{r}}_{s}){\mathrm{d}}^{3}\overrightarrow{r},$$## (17)

$${W}_{1}(\stackrel{}{{\overrightarrow{r}}_{d}})={I}_{0}(\stackrel{}{{\overrightarrow{r}}_{s}})\int {P}_{0}^{1}({\overrightarrow{r}}_{d},\overrightarrow{r}){P}_{0}^{0}(\overrightarrow{r},{\overrightarrow{r}}_{s})\phantom{\rule{0ex}{0ex}}+{P}_{1}^{1}({\overrightarrow{r}}_{d},\overrightarrow{r}){P}_{0}^{1}(\overrightarrow{r},{\overrightarrow{r}}_{s}){\mathrm{d}}^{3}\overrightarrow{r}.$$## (18)

$${\mathrm{MD}}^{(1)}\approx \frac{\int {P}_{0}^{1}({\overrightarrow{r}}_{d},\overrightarrow{r}){P}_{0}^{0}(\overrightarrow{r},{\overrightarrow{r}}_{s})+{P}_{0}^{0}({\overrightarrow{r}}_{d},\overrightarrow{r}){P}_{0}^{1}(\overrightarrow{r},{\overrightarrow{r}}_{s}){\mathrm{d}}^{3}\overrightarrow{r}}{\int {P}_{0}^{0}({\overrightarrow{r}}_{d},\overrightarrow{r}){P}_{0}^{0}(\overrightarrow{r},{\overrightarrow{r}}_{s}){\mathrm{d}}^{3}\overrightarrow{r}}.$$## 3.

## Implementation

## 3.1.

### GPU Parallel Programming in the CUDA Framework

The design of an efficient GPU-accelerated parallel algorithm requires careful consideration of the underlying computational architecture. An extensive discussion of best practices is provided by the authors of the CUDA architecture.^{26} Our implementation follows three broad strategies.

1. Exposing the parallelism of the algorithm: the problem at hand is inherently parallel as each photon packet is propagated independently of every other. In our algorithm, cases will occur when individual photon packets require different treatments. For example some may be reflected or refracted at an optical boundary, whereas others may be incident upon a scattering object. It is important to minimize this divergence since each execution branch will be serialized by the hardware.

2. Use of special-purpose memory: While such devices are capable of a very high arithmetic instruction throughput, this comes at the expense of the caching of memory. Various special-purpose memory spaces exist on the device which are optimized to store particular types of data.

3. Use of intrinsic arithmetic functions: CUDA exposes a number of functions which map directly to the hardware level; such instructions are considerably faster than the standard C-library implementations, but often have reduced accuracy. Judicious use of these functions (particularly the trigonometrical replacements) can considerably increase performance.

## 3.2.

### Architecture

The simulation program is built around the light-propagation and phase-accrual algorithms of the propagation kernel. Many thousands of instances of the kernel each execute in parallel on the GPU, the configuration and execution of which is managed by the driver. The driver consists of a MATLAB frontend to the simulation executable. The user configures the simulation by building a set of MATLAB variables and structures with the data required to define the simulation domain; this includes the geometry, medium properties, acoustic-field distribution and various simulation constants. The simulation is executed by passing this configuration structure to the MATLAB driver function. The driver validates the input data, constructs any dependent data, and calls the simulation executable. The executable uploads the data to the graphics card and repeatedly calls the kernel to propagate a certain number of photon packets until the desired number have been propagated. The output data is collected and stored in a MATLAB data file. The architecture of the simulation program is depicted in Fig. 1.

## 3.2.1.

#### Geometry

Many MC simulation programs which support heterogeneous optical properties, including GPU-accelerated programs^{17} and those which consider AO,^{12} implement heterogeneity by discretizing the domain into a voxelized grid. As a photon packet takes a step through multiple volume elements (voxels), each must be examined to ascertain if there has been a change in optical properties. While this approach has the benefit of being conceptually straightforward, representations of complex curved features require a very fine voxelization. Even with a fine voxelization, many of the problems discussed in Ref. 19 cannot be overcome.

A different approach is taken here whereby the simulation domain is represented by a tetrahedral mesh. This representation is typically encountered in applications of, e.g., the finite element method, but has also been applied to a CPU-based MC photon propagation problem.^{27} By employing this method, photons move around within tetrahedral volume elements which subdivide the simulation geometry. On each photon step, we must check for intersection with any of the four facets of the current element; whilst we must still interrogate a dataset to determine the location of these facets, the amount of data to be retrieved, and the nature of the processing to be performed, is now fixed across each thread, thus avoiding issues of divergence and potentially reducing the amount of data which must be retrieved from memory relative to a voxelized solution (dependent upon the complexity of the model). Many of the advantages of this approach are discussed in Ref. 28.

The meshed geometry is represented by four sets of data. The first is the node list, which records the three-dimensional (3-D) Cartesian co-ordinate of every node, or vertex, in the mesh. The node list is used by the element connectivity matrix, which defines individual tetrahedral elements by indexing four such nodes. In addition to this data, we also provide a list of element neighbors that for each element provides an index into the four neighboring elements. Finally, the properties table indicates the medium of which an element is composed by providing an index into the medium array.

Each of the geometry tables is stored on the device in texture memory, which, whilst residing in the slow global memory, is spatially cached. This will convey significant advantages at the start of simulations when all photon packets reside in the same element.

## 3.2.2.

#### Geometry intersection

The intersection of a line segment with a triangle is a well understood geometric operation. The most basic form of this test involves testing for intersection of the line with the plane defined by the three points of the triangle; many fast algorithms are available in the literature, one of the more popular being that of Möller and Trumbore.^{29} When such tests are implemented with finite precision, errors in the intersection tests may occur; this can often be tolerated in graphics rendering, for which such ray-tracing algorithms are typically developed. In this application the consequences of such an error is that a photon packet will end an iteration of the algorithm at a position which does not correspond to its current element; all subsequent intersection tests will fail and the photon will propagate indefinitely, or be fully absorbed within an incorrect element. This problem is exacerbated by the use of single-precision floating-point arithmetic within the kernel (the most efficient data type for current GPU architectures) rather than the double precision typically employed in a CPU-based algorithm.

A more robust approach is based upon the triple scalar product. In this case the majority of the tests are based upon the sign, rather than on the absolute value of the arithmetic operations. This approach is mathematically equivalent to the use of Plücker coordinates (as described by Fang^{27}), but avoids the associated complexity of analysis and notation. The signed volume of a tetrahedron defined by four vertices ${\overrightarrow{p}}_{0}\u2013{\overrightarrow{p}}_{3}$ is calculated by^{30}

## (19)

$$V(\stackrel{}{{\overrightarrow{p}}_{0}},\overrightarrow{p}{}_{1},\stackrel{}{{\overrightarrow{p}}_{2}},\stackrel{}{{\overrightarrow{p}}_{3}})=\frac{1}{6}[({\overrightarrow{p}}_{1}-{\overrightarrow{p}}_{0})\times ({\overrightarrow{p}}_{2}-{\overrightarrow{p}}_{0})\xb7({\overrightarrow{p}}_{3}-{\overrightarrow{p}}_{0})],$$For each of the four facets we calculate three signed volumes $V({\overrightarrow{r}}_{a},{\overrightarrow{r}}_{b},{\overrightarrow{p}}_{1},{\overrightarrow{p}}_{2})$, $V({\overrightarrow{r}}_{a},{\overrightarrow{r}}_{b},{\overrightarrow{p}}_{2},{\overrightarrow{p}}_{0})$, and $V({\overrightarrow{r}}_{a},{\overrightarrow{r}}_{b},{\overrightarrow{p}}_{0},{\overrightarrow{p}}_{1})$. In the case that the sign of each of these volumes is equal, the ray intersects the tetrahedron. To determine if the photon packet will intersect the tetrahedron given its current step size, we calculate the two signed volumes ${V}_{a}=V({\overrightarrow{p}}_{0},{\overrightarrow{p}}_{1},{\overrightarrow{p}}_{2},{\overrightarrow{r}}_{a})$ and ${V}_{b}=V({\overrightarrow{p}}_{0},{\overrightarrow{p}}_{1},{\overrightarrow{p}}_{2},{\overrightarrow{r}}_{b})$; in the case that the signs of these volumes are opposite, and that ${V}_{a}>0$, intersection occurs at the parametric distance ${V}_{a}/({V}_{a}+{V}_{b})$. If any interactions occur, the smallest intersection distance is used to scale the current photon step size.

## 3.2.3.

#### Acoustic field

As discussed in the theory, simulation of the AO modulation process requires knowledge of the local acoustic pressure amplitude, ${P}_{0}$, the local acoustic wave-vector, ${\overrightarrow{k}}_{a}$, and the local acoustic phase offset relative to some point, ${\varphi}_{a}$. We pack this data into a four-element array by storing, for each point, $[{P}_{0}{\widehat{k}}_{ax},{P}_{0}{\widehat{k}}_{ay},{P}_{0}{\widehat{k}}_{az},{\varphi}_{a}]$, where ${\widehat{k}}_{ax}$ is the $x$ component of the normalized wavevector. The magnitude of ${k}_{a}$ is stored in constant memory to permit recovery of the wave-vector when required.

We store the acoustic field data in a 3-D texture to exploit hardware caching, hardware-based trilinear interpolation (interpolation of the acoustic-field dataset is only possible when the phase angle of the field is unwrapped; if discontinuities around $2\pi $ exist, interpolation will produce undesired effects), and normalized addressing. It is thus possible to read all needed parameters of the acoustic field at a given location in one texture access.

## 3.2.4.

#### Medium

Each medium is represented by a structure stored in constant memory containing the absorption coefficient, ${\mu}_{a}$, the scattering coefficient, ${\mu}_{s}$, the anisotropy factor, $g$, the index of refraction, $n$, the piezo-optic coefficient, $\eta $, and the speed of sound, ${\upsilon}_{a}$.

## 3.2.5.

#### Random number generator

On numerous occasions within the light-transport algorithm a uniformly distributed random number must be generated. This application uses an implementation by Alerstam^{16} of Marsaglia’s multiply-with-carry random number generator.^{31} As described in the references, the algorithm requires only eight bytes of state information and has a period of greater than ${2}^{60}$.

## 3.3.

### The Monte-Carlo Procedure

Many of the fundamental calculations performed are of a standard form for MC light-transport codes; Wang’s description^{32} of his *de facto* standard code, MCML, will serve as a reference for these processes. We implement the following general procedure for light transport:

1. A photon packet is launched into the simulation domain according to the location and angular distribution of the source. In the case of mismatched refractive indices, the photon packet-weight is reduced according to specular reflection [Eqs. (3.2)–(3.6) in Ref. 32].

2. The potential photon packet step size is sampled from the appropriate distribution [Eq. (3.13) in Ref. 32].

3. If the sampled step size will cause the the photon packet to intersect the boundary between two optically dissimilar regions (see Sec. 3.2.2), the step size is truncated to the distance to the boundary.

4. If the photon packet intersected an optical boundary:

a. The angle of incidence between the photon packet and boundary is calculated. With knowledge of the differing refractive indices, the angle of transmission is calculated.

b. The internal reflectance is calculated [Eq. (3.30) in Ref. 32] or set to unity in the case of total internal reflection. Reflection or transmission is determined [Eq. (3.31) in Ref. 32].

c. The next direction of the photon packet is calculated according to Snell’s law, depending if reflection or transmission (refraction) has occurred.

Otherwise, the photon packet has been incident upon a scatterer:

a. The next direction of the photon packet is calculated according to the Henyey-Greenstein phase function.

^{33}b. Some proportion of the current photon-packet weight is absorbed into the medium [Eq. (3.18) in Ref. 32]. The fluence at the absorption event is recorded on the nodes of the current tetrahedral element according to linear weightings of the distance of the photon packet to each node, with a proportion ${J}_{0}^{2}({m}_{s})$ distributed to the zeroth side band, and a proportion $2{J}_{1}^{2}({m}_{s})$ to the first. Following simulation the fluence rates are normalized by the total input optical power, and the voronoi volume of each node in the mesh.

c. The photon packet is subjected to the roulette procedure for variance reduction [Sec. (3.9) in Ref. 32].

5. The photon packet is moved according to the (potentially truncated) step size [Eq. (3.17) in Ref. 32], and the phase increment due to the AO modulation of the refractive index of the medium is calculated. The simulation offers two methods to calculate the index of refraction phase increment based upon Eq. (2).

In the first formulation we approximate Eq. (2) with a numerical integration over the path of the current photon-packet step. We divide the photon-packet path length into $N$ subintervals of equal widths $h=\phantom{\rule{0ex}{0ex}}|{\overrightarrow{r}}_{b}-{\overrightarrow{r}}_{a}|/N$ and employ the composite midpoint rule,

where ${\overrightarrow{{r}_{}}}_{k}=\stackrel{}{{\overrightarrow{r}}_{a}}+(k+0.5)({\overrightarrow{r}}_{b}-{\overrightarrow{r}}_{a})/n$. The number of subintervals is defined at run-time as a particular number of points per acoustic wavelength and thus differs for each photon packet under propagation; the divergence of this approach leads to sub optimal performance. Furthermore, a large number of reads from the acoustic dataset are required to minimize the error in the numerical integral. This approach is still viable owing to the performance benefits of the parallel implementation and provides excellent accuracy irrespective of the nature of the underlying acoustic field.## (20)

$$\stackrel{\sim}{{\varphi}_{n}}({\overrightarrow{r}}_{a},{\overrightarrow{r}}_{b},t)\approx {k}_{0}{n}_{0}\eta h\sum _{k=0}^{N-1}{P}_{0}(\stackrel{}{{\overrightarrow{r}}_{k}})\phantom{\rule{0ex}{0ex}}\mathrm{exp}[j({\varphi}_{a}(\stackrel{}{{\overrightarrow{r}}_{k}})-{\omega}_{a}t)],$$An alternative approach to the calculation of the index of refraction phase increments is to assume that over the length of a single photon-packet step the acoustic field can be represented by a plane wave of given pressure amplitude, propagation direction and phase offset as defined at a point ${\overrightarrow{{r}_{}}}_{c}$,

## (21)

$${p}_{c}(\overrightarrow{r},t)={P}_{0}(\stackrel{}{{\overrightarrow{r}}_{c}})\text{\hspace{0.17em}}\mathrm{exp}[j({\varphi}_{a}({\overrightarrow{r}}_{c})\phantom{\rule{0ex}{0ex}}+{\overrightarrow{k}}_{a}({\overrightarrow{r}}_{c})\text{\hspace{0.17em}}\xb7(\overrightarrow{r}-{\overrightarrow{r}}_{c})-{\omega}_{a}t)].$$This expression is substituted into Eq. (1) and the integration completed over the straight line path of the photon-packet step from ${\overrightarrow{r}}_{a}$ to ${\overrightarrow{r}}_{b}$,

## (22)

$$\stackrel{}{{\tilde{\varphi}}_{n}}(\stackrel{}{{\overrightarrow{r}}_{a}},\stackrel{}{{\overrightarrow{r}}_{b}},t)\phantom{\rule{0ex}{0ex}}\approx \frac{{k}_{0}{n}_{0}\eta {P}_{0}(\stackrel{}{{\overrightarrow{r}}_{c}})\mathrm{exp}[j({\varphi}_{a}({\overrightarrow{r}}_{c})-{\overrightarrow{k}}_{a}({\overrightarrow{r}}_{c})\xb7{\overrightarrow{r}}_{c}-{\omega}_{a}t)]}{j{\overrightarrow{k}}_{a}(\stackrel{}{{\overrightarrow{r}}_{c}})\xb7({\overrightarrow{r}}_{b}-{\overrightarrow{r}}_{a})}\phantom{\rule{0ex}{0ex}}\times (\mathrm{exp}[j({\overrightarrow{k}}_{a}(\stackrel{}{{\overrightarrow{r}}_{c}})\xb7{\overrightarrow{r}}_{b})]-\mathrm{exp}[j({\overrightarrow{k}}_{a}({\overrightarrow{r}}_{c})\xb7{\overrightarrow{r}}_{a})]).$$In the case that the acoustic wave-vector is a slowly changing function of space, this approximation can offer significant performance improvements over the numerical integration when the photon-packet step size is large compared to an acoustic wavelength. This is achieved due to reduced memory access, and its non divergent execution.

6. The photon-packet direction is updated according to the calculated next direction. If the photon packet is incident upon a scatterer, the phase increment due to the AO modulation of the scatterer location is calculated by application of Eq. (8).

7. If the photon packet has been completely absorbed, its propagation is terminated. If the photon packet has escaped the simulation domain, the location, weight, absolute path length and phase-modulation terms [${m}_{s}$, ${\phi}_{a}$ of Eq. (10)] are recorded in a buffer residing in the device memory of the GPU, subsequently being transferred to the host and saved to disk. In all other cases the photon packet continues to propagate by returning to step two.

## 4.

## Results

## 4.1.

### Validation

The optical model of the simulation code has been validated against numerous analytical formulations based upon the DA to the radiative transfer equation (RTE). Simulations have been performed to compare the code to the total diffuse transmittance and total reflectance of van de Hulst,^{34} the total reflectance as calculated in the multi layer diffusion model of Liemert and Kienle,^{35} the temporal point-spread functions (TPSFs) presented by Martelli et al.,^{36} and the internal fluence for an optically inhomogeneous sphere inside a semi-infinite slab as derived by Boas et al.^{37} In each case, excellent agreement is found in regions where the DA is valid.

The optical model has also been compared against the output of MCML.^{32} A variety of simulations were performed in optically heterogeneous multi layered geometries. In all cases, excellent agreement was found in the total diffuse transmittance, total reflectance, and the radially resolved output.

Fewer analytical formulations are available with which to validate the AO model. Sakadžić and Wang^{8} present a set of figures [Figs. 2(a)–2(d)] demonstrating the ${\mathrm{MD}}^{(1)}$ as derived by the analytical solution for AO in a slab geometry with plane-wave ultrasound. These figures have been recreated and demonstrate excellent agreement over each demonstrated permutation of the acoustic and optical set up. In subsequent work Sakadžić and Wang^{12} employ their correlation diffusion equation to calculate the ${\mathrm{MD}}^{(1)}$ within a highly scattering, optically homogeneous semi-infinite slab of width $x=20\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{mm}$, insonified by a cylinder of plane-wave ultrasound.

We recreate this simulation by generating a cuboid mesh of width $x=20\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{mm}$. The depth and height of the mesh ($y=100\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{mm}$, $z=40\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{mm}$) are chosen to minimize boundary effects. A cylinder of ultrasound is created of radius 3.175 mm. Ten million photon packets were launched into the domain at $x=0$, $y=10\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{mm}$ and allowed to propagate until they were completely absorbed into the medium or exited the domain via one of the boundaries. The simulation program recorded the internal fluence distribution on each node throughout the medium at the input optical frequency, ${\mathrm{\Phi}}^{(0)}(\overrightarrow{r})$, and at the first shifted side band, ${\mathrm{\Phi}}^{(1)}(\overrightarrow{r})$. The internal first side-band modulation depth was calculated as ${\mathrm{IMD}}^{(1)}=\phantom{\rule{0ex}{0ex}}{\mathrm{\Phi}}^{(1)}(\overrightarrow{r})/{\mathrm{\Phi}}^{(0)}(\stackrel{\to}{r})$ prior to interpolation to a regular grid using standard MATLAB functions.

Figure 3 plots the internal first-side band-modulation depth ${\mathrm{IMD}}^{(1)}$ inside the slab in the plane of the source. Figure 2 compares the MC simulation ${\mathrm{IMD}}^{(1)}$ against that of the analytical solution along the transmission and reflection planes of the slab. Excellent agreement is found over the region where the DA is valid. Near to the source location we note a large dip in the ${\mathrm{IMD}}^{(1)}$, as would be expected when the large unmodulated intensity is considered. The MC results agree excellently with those presented in the original work.^{12}

In each of the simulations performed for the validation procedure, the explicit acoustic integration technique was employed since this is exact in the case of plane-wave insonification. Use of the numerical integration method with six points per wavelength produced identical results for this particular set of acoustic and optical properties.

## 4.2.

### Performance

To illustrate the performance of the simulation code, we compare its optical performance against a recent CPU mesh-based MC model, MMCM.^{27} We execute the simulation of the inhomogeneous sphere in a scattering slab using the same mesh (MMCM Mesh 2) as used by Fang,^{27} who reports a speed of 5.6 photon packets per millisecond ($\mathrm{pp}/\mathrm{ms}$) for a single thread, scaling to $21.2\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{pp}/\mathrm{ms}$ when executed over four threads. Our GPU implementation processes $39.8\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{pp}/\mathrm{ms}$ while executing on a single nVidia GTX 250 graphics card.

The GPU implementation executes at almost eight times the speed of the single core CPU implementation, or almost twice that of the four-core parallel implementation of Fang.^{27} The GPU implementation may be executed on a server incorporating multiple GPUs, where performance will scale linearly according to the number of devices available. While calculation of the AO phase shifts was disabled at run time, the overhead of variable definitions at compile time still has considerable performance implications owing to the number of threads which may execute simultaneously. Removal of the AO aspects of the simulation code should be expected to significantly improve performance if only an optical simulation is required.

Very little data has been published with which to compare the performance of the code in the AO case. We recently presented the results of a GPU-based slab geometry plane-wave AO code and compared these to a CPU implementation^{18} of the same. We now compare the performance of the current simulation code to that of both the CPU and GPU codes demonstrated previously.

A semi-infinite slab of depth 2 cm (represented in the current simulation code as a slab of lateral extent 10 cm) with optical properties ${n}_{0}=1.33$, ${\mu}_{a}=0\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\mathrm{cm}}^{-1}$, $g=0$, $\eta =0.321$ and varying ${\mu}_{s}$ is illuminated on one surface and the AO ${\mathrm{MD}}^{(1)}$ determined given insonification by a plane-wave acoustic field of displacement amplitude $A=1.7\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{nm}$ and varying frequency. Table 1 details the speed of the simulations in numbers of photon packets per millisecond. The previously reported simulation figures are given in the column labeled GPU Orig., the speed of the current simulation is described for two frequencies and for both the numerical integration of phase of Eq. (20), and the explicit integration of Eq. (22).

## Table 1

Execution speed (photon packets/ms) of alternative acousto-optic simulation programs and configurations. Numbers in parentheses demonstrate the speed-up relative to the CPU implementation.

μs(cm−1) | Execution speed (photon packets/ms) | |||||
---|---|---|---|---|---|---|

1 MHz | 10 MHz | |||||

CPU18 | GPU Orig.18 | GPU (Num. Int.) | GPU (Explicit) | GPU (Num. Int) | GPU (Explicit) | |

5 | 9.43 | 1449.27 (×154) | 232.15 (×25) | 251.10 (×27) | 71.85 (×8) | 294.43 (×31) |

10 | 4.60 | 1030.93 (×224) | 166.17 (×36) | 193.24 (×42) | 62.08 (×13) | 191.86 (×42) |

20 | 2.25 | 568.18 (×253) | 106.22 (×47) | 109.80 (×49) | 49.51 (×22) | 109.24 (×49) |

50 | 0.89 | 253.44 (×285) | 44.54 (×50) | 46.77 (×53) | 30.38 (×34) | 46.19 (×52) |

100 | 0.44 | 87.57 (×199) | 20.83 (×47) | 20.64 (×47) | 16.38 (×37) | 19.52 (×44) |

200 | 0.22 | 29.18 (×133) | 8.71 (×40) | 7.96 (×36) | 8.29 (×38) | 8.18 (×37) |

To understand the resultant performance data, we must consider the limiting factors in each of the simulation algorithms. The performance of the original CPU and GPU codes is primarily a function of the scattering coefficient, ${\mu}_{s}$, since this dictates the number of calculations which will be performed for a given photon packet’s simulation (as discussed further in our earlier work^{18}).

The new simulation code retains the same performance dependency while introducing two further factors which may affect performance.

1. If the average size of the mesh describing the simulation domain becomes smaller than the average step size of a photon packet (i.e., at low values of ${\mu}_{s}$ and with a dense mesh), photon-packet steps will routinely be truncated and geometry intersection tests will begin to dominate the processing workload. In this region of operation, the performance will be dependent upon the dimensions of the mesh.

2. When operating in the AO numerical integration mode a large photon-packet step size relative to the acoustic wavelength will result in multiple memory reads and sets of arithmetic operations to determine the value of the integral [i.e., $N>1$ in Eq. (20)]. In this region, performance will depend upon the relative sizes of the average step size of a photon packet and the acoustic wavelength. This limitation does not affect the AO explicit integration mode, which currently performs only one calculation per photon packet step.

The aforementioned qualities of the performance of the new simulation code are evident in the performance data of Table 1. Under insonification by ultrasound at a frequency of 1 MHz (when the acoustic wavelength is long compared to a mean free path) geometry intersection tests (and their associated memory access) dominate the execution time and the new simulation code performs similarly under both integration regimes.

The increased flexibility and associated complexity of the new simulation decrease the performance of the new code relative to the previously reported GPU implementation. The new code achieves a performance varying between $1/8$th and $1/{3}^{}$th of the previously reported GPU code; as ${\mu}_{s}$ increases, the speeds of the two GPU codes approach one another, since fewer geometrical intersection tests are processed in the new code relative to the total computational workload.

Under 10 MHz insonification the explicit integration mode of the simulation maintains similar execution speed to that of 1 MHz insonification, though the numerical integration shows a significant relative degradation in performance at low values of ${\mu}_{s}$.

In all but one of the examples investigated here the new simulation code is at least an order of magnitude faster than the CPU implementation; though as discussed, performance could be degraded by introducing a finer mesh, or in the case of the numerical integration mode, a higher frequency of insonification.

## 4.3.

### Sensitivity Mapping

In recent work, Gunadi and Leung^{38} demonstrated the measurement of spatial sensitivity maps for the AO sensing technique. In this work an optical absorber of ${\mu}_{a}=42\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\mathrm{cm}}^{-1}$, ${\mu}_{s}=12\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\mathrm{cm}}^{-1}$, and dimensions $5\times 5\times 12\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{mm}$ was moved around a tank of intralipid of negligible absorption and equal scattering to that of the absorber. In the first case a transmission-mode geometry was investigated, with an optical source and detector separated by 35 mm. The absorber was moved around the central region in 1-mm steps over a $25\times 25$ grid, in each case the detected optical intensity (in the absence of ultrasound) and the AO ${\mathrm{MD}}^{(1)}$ (in the presence of ultrasound) were recorded. A reference measurement was taken in the absence of an absorber such that the sensitivity can be represented as $J=({X}_{\mathrm{ref}}-X)/{X}_{\mathrm{ref}}\times 100\%$, where $\mathrm{X}$ may be optical intensity or AO ${\mathrm{MD}}^{(1)}$, and ${\mathrm{X}}_{\mathrm{ref}}$ is the corresponding reference measurement.

We recreate this experiment in simulation though reduce the absorber size to a $1\times 1\times 12\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{mm}$ cube. The linear acoustic field is generated by distributing monopoles over the surface of a spherical cap of equal dimension to the transducer used in practice, and summing their contributions in the field. The amplitude is scaled to match that recorded by Gunadi and Leung.^{38} The mesh and absorptive inclusion is built on the fly using the iso2mesh toolbox. Following simulation, the adjoint method described in Sec. 2.3 is used to calculate the detected ${\mathrm{MD}}^{(1)}$. Figure 4(a) and 4(b) depict the optical and AO spatial sensitivity recovered by 1250 independent MC simulations, each of two million photon packets.

While the smaller absorptive inclusion leads to differences in the absolute sensitivity demonstrated by measurement and simulation, the form of the sensitivity distribution retains the same qualities as described by the experimental observations of Gunadi and Leung.^{38} As would be expected, the optical sensitivity remains concentrated around the source and detector locations. The AO sensitivity demonstrates localization to the acoustic field and, more significantly, a very low level of sensitivity close to the source and detector.

The second case studied in Ref. 38 is a reflection-mode geometry with source-detector spacing of 30 mm. The ultrasound field is located centrally at a depth of 20 mm from the input and detection surface. The same optical absorber was moved around a scanning region of $30\times 20\text{\hspace{0.17em}}\text{\hspace{0.17em}}\mathrm{mm}$, located centrally and starting at a depth of 5.5 mm from the input surface (owing to the dimensions of the experimental absorber). This experiment was recreated in simulation; Fig. 4(c) and 4(d) depict the optical and AO spatial sensitivity recovered by 1800 independent MC simulations of two million photon packets.

Once again, the absolute sensitivities of the simulation differ slightly to that of the experimental work owing to the differing absorber dimensions. The optical sensitivity has the expected ‘banana’ shape between the source and detector, with sensitivity rapidly declining at distances greater than around 10 mm. The AO sensitivity is concentrated at a depth of around 17 mm from the input/output plane; the peak relative sensitivity is not within the ultrasound focus region, as the lack of light at these depths outweighs the higher acoustic pressures. Close to the source/detector plane a region of negative AO sensitivity is found. Negative sensitivity suggests that an absorptive perturbation has the effect of increasing the measurement quantity (${\mathrm{MD}}^{(1)}$). This can occur when the perturbation reduces the amount of ‘unmodulated’ light reaching the measurement position by an amount proportionally greater than it reduces light which has been modulated into the first side band. The form of the simulated sensitivities shows excellent agreement with the experimental work of Gunadi and Leung.^{38}

## 5.

## Conclusions and Future Work

The implementation of a highly parallel AO MC simulation code has been demonstrated. The optical model has been extensively validated against various analytical and numerical solutions, and the AO model against a complex AO analytical example.

The performance of the simulation code has been compared against CPU and GPU implementations of pre-existing optical and AO simulation codes. While our previous GPU implementation of an AO simulation code provides better performance, this is at the cost of significantly reduced flexibility in the nature of the simulation domain. The flexibility and performance of the code have also been demonstrated in the generation of spatial sensitivity maps which support experimental work currently under way.

The primary limitation of the current model is that it is designed purely for monochromatic acoustic excitation. The implementation of a polychromatic acoustic-field distribution would permit the simulation of modulated continuous-wave inputs, or pulsed ultrasound, as employed to gain spatial selectivity in the axis of the ultrasound. This could potentially be implemented in the style of Sakadžić’s voxelized CPU simulation.^{13}

Further analysis of the relative validity of the explicit and numerical integration methods may allow the program to determine automatically the appropriate technique based upon the form of the supplied acoustic field.

It would also be advantageous to develop a theoretical perturbation model for AO which follows from the adjoint formulation presented in this work. This would then require only two MC simulations to be executed in order to provide an approximation to the actual spatial sensitivity of the technique. The efficiency of the simulation tool described here would permit such a perturbation model to be validated against multiple explicit sensitivity maps such as those described earlier.

## Acknowledgments

The authors would like to thank Erik Alerstam for his GPU-accelerated CUDAMCML code and Sava Sakadžić for his solution to the correlation diffusion equation for the example provided in the validation of this work. Our thanks is also extended to Ben Cox, Jeremy Hebden and Simon Arridge for their useful discussions and assistance in this work. This research is funded by EPSRC (Grant No. EP/G005036/1).