Efficient inversion strategies for estimating optical properties with Monte Carlo radiative transport models

Abstract. Significance: Indirect imaging problems in biomedical optics generally require repeated evaluation of forward models of radiative transport, for which Monte Carlo is accurate yet computationally costly. We develop an approach to reduce this bottleneck, which has significant implications for quantitative tomographic imaging in a variety of medical and industrial applications. Aim: Our aim is to enable computationally efficient image reconstruction in (hybrid) diffuse optical modalities using stochastic forward models. Approach: Using Monte Carlo, we compute a fully stochastic gradient of an objective function for a given imaging problem. Leveraging techniques from the machine learning community, we then adaptively control the accuracy of this gradient throughout the iterative inversion scheme to substantially reduce computational resources at each step. Results: For example problems of quantitative photoacoustic tomography and ultrasound-modulated optical tomography, we demonstrate that solutions are attainable using a total computational expense that is comparable to (or less than) that which is required for a single high-accuracy forward run of the same Monte Carlo model. Conclusions: This approach demonstrates significant computational savings when approaching the full nonlinear inverse problem of optical property estimation using stochastic methods.


Introduction
Inverse problems arise in many areas within biomedical optics, both for global characterization of optical properties of media and for image reconstruction, among other applications. 1 Inverse problems are often considered as optimization problems, solved by deriving the gradient of an objective function and iteratively descending through the solution space. This process requires repeated solutions of forward and corresponding adjoint problems that are often computationally demanding in their own right. If the forward problem is given by the solution to a partial differential equation (PDE), then one appealing approach is to solve the forward and inverse problems simultaneously so that the forward problem is only approximately solved at intermediate stages in the algorithm (i.e., before it has finally converged); this approach (which has its basis in optimal control) is known as PDE-constrained optimization. [2][3][4] In this work, we seek an equivalent coupled-physics imaging (CPI) modalities, namely, quantitative photoacoustic tomography (QPAT) and ultrasound-modulated optical tomography (UMOT). 14,15 Both of these problems are nonlinear and entail the RTE for an accurate description, but exhibit different degrees of ill-posedness and resolution; thus they serve to demonstrate the generality of our approach. We evaluate the performance of various Monte Carlo inversions using simulated QPAT and UMOT data in Sec. 4, and discuss what practical lessons can be taken from this in Sec. 5.

Modeling and Inversion Problems in Optical Tomography
A common problem in biomedical optics involves finding the internal distribution of some optical properties x within a medium using various measurements made around and/or within the medium y obs . To do this, we can employ some forward model of the underlying physical problem A, which produces an output y, given some estimate of the internal properties x, E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 1 ; 1 1 6 ; 5 7 2 y ¼ Ax; (1) where in this case, the forward model A could represent the RTE and all relevant aspects of the optical setup (geometry of sources and detectors). In cases where A is not directly invertible, then to solve for an unknown distribution of properties x, we can formulate a cost function as a measure of the quality of an estimate. This could for example be the L2-norm of the residual between the real measured data, y obs , and our forward modeled data, y.
E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 2 ; 1 1 6 ; 4 8 0 From this point, the problem now becomes one of minimization, where we will qualify our solution x Ã as that which minimizes the cost function, x Ã ¼ arg min x FðxÞ. It is worthy to note that the ground-truth parameters x true may differ from the minimizer x Ã leading to reconstruction error. This minimization problem can be approached via iterative GD, where we start with some estimate x 0 , and each successive iterate, x n , is determined by subtracting a (scaled) gradient of our cost function ∇F (relative to the internal optical properties) from the previous iterate E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 3 ; 1 1 6 ; 3 6 7 x n ¼ x n−1 − α n ∇Fðx n−1 Þ; where α n is the step size which scales the update term. If we have access to some computation or set of computations (sometimes referred to as a "first-order oracle") which we can call to compute Fðx n−1 Þ and ∇Fðx n−1 Þ, then this algorithm can be implemented and is said to converge if lim n→∞ Fðx n Þ ¼ 0. In practice, the descent may be terminated early once the cost function reaches some acceptable value, for example, when the norm of the difference between observed and model data is of the same order as measurement noise, a criterion known as the discrepancy principle. 16

Stochastic-Gradient Descent
In a stochastic setting, for instance, when our forward model A is a Monte Carlo model of radiative transport, then the true cost F and gradient ∇F given in Eq. (3) are not directly available. Instead, we may only have access to estimates of the cost function and gradient (provided by a stochastic first-order oracle). In Sec. 3, we detail the nature of these stochastic Monte Carlo computations in the radiative transport setting. In the interest of generality, for now, we simply assume such models exist, and we can make a call to a "stochastic oracle" to attain F S n and ∇F S n which we assume are nonbiased approximations., i.e., learning, this may refer to a particular training example (or group of training examples) to be used during one learning iteration. 17 In Monte Carlo modeling of radiative transport, the sample refers to the set of virtual photons (and their associated random number seeds) that are initiated in the simulation to represent an optical source, which are subsequently used to estimate Fðx n Þ and ∇Fðx n Þ. The stochastic version of gradient descent (SGD) thus attempts to minimize a sampled objective function, F S n , by updating the previous iterate with a scaled sampled gradient.
E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 5 ; 1 1 6 ; 6 6 3 x n ¼ x n−1 − α n ∇F S n ðx n−1 Þ: As with any computation, a call to a stochastic oracle at each iteration comes with a certain computational cost. The particular cost may depend on a number of factors, including the sample size, jS n j. This is one of the reasons why the study of SGD is of such importance in modern machine learning, where training datasets may be of an enormous size, which means that computing a gradient based on all available data at each iteration could be very costly. Rather, individual samples (jS n j ¼ 1) or batches of samples (jS n j > 1) may be used instead at each iteration. While this degrades the quality of any individual gradient estimate compared to using all available data, if the variance of these estimates is maintained below an acceptable value, the overall trade-off may be net positive. What this means in a Monte Carlo radiative transport context is that we may be able to allow low-quality gradient estimations (simulating only a small number photons) for a large part of the inversion process when estimating optical properties, saving on per iteration computational resources, and leading to an overall efficiency improvement. This is in contrast to typical implementations of iterative Monte Carlo solvers in the biomedical optics community, where each iteration is computed with large numbers of photons that are deemed sufficient to produce "stable" and "smooth" (low variance) forward model data. [18][19][20][21][22][23][24] In some cases, where a linearized approximation is assumed for the inverse problem, the cost of rerunning the forward model can be avoided using techniques such as perturbation Monte Carlo (PMC) methods. 11,25,26 However, for the full nonlinear problem, although PMC can be used for calculation of the problem Jacobian, this has to be recomputed at each iteration of, for example, a Gauss-Newton optimization scheme. 27 In this study, if we are to accept a level of variance and imperfection in our forward/adjoint models, this of course raises the question of how much variance is acceptable in order for SGD to be successful? Furthermore, what measure of the variance is the best indicator in terms of efficiency/performance for common Monte Carlo solvers? To begin to answer this, it is important to first note that fixed-step SGD does not in general converge to a solution. 28,29 That is, if α n is fixed for all n, eventually there will come a point where the next update of the estimate [with the term α n ∇F S n ðx n−1 Þ] will reliably "undo" the work of the prior step, which will effectively halt the descent. The point at which this occurs depends on the variance of ∇F S n . We can see this by rewriting the sampled stochastic gradient estimate as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 6 ; 1 1 6 ; 2 8 1 where ϵ is a random vector with E½ϵ S n ðx n Þ ¼ 0 for all n. As GD progresses successfully, the "true" gradient ∇F will eventually begin reducing in size as we near the minimum. Once the magnitude of the true gradient reduces to a point at which it is comparable to the randomness of ϵ S n , the problem arises. The larger the expected magnitude of ϵ S n , the sooner the minimization of the cost function reaches this limiting scenario, where further iterations will only lead to a random walk about this point. To prevent this from happening, we may take one of two actions (or a combination thereof): (i) reduce the step size at each iteration such that we can avoid "backtracking" in the descent, more on this in Sec. 2.3 or (ii) gradually improve the accuracy of our sampled gradient such that the variance of the sampled gradient remains below some threshold value compared to the norm of the true gradient ∇F. In other words, we may wish to ensure the inequality E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 7 ; 1 1 6 ; 1 1 5 where γ tot is a positive coefficient describing the acceptable threshold. The aforementioned inequality is known as the "norm test." 30 It is worthy to note that, since for any vector of random variables the variance of its length is the sum of the variances parallel and orthogonal to any fixed vector, this test equally penalizes the components of randomness parallel and perpendicular to the true gradient. Recent studies, however, have demonstrated that controlling the component of randomness parallel to ∇F is potentially a more relevant objective, as the component of the sampled gradient orthogonal to the true gradient is zero in expectation. An alternative measure of acceptable variance in ∇F S n has thus been introduced as the "inner product test," 30 which only aims to restrict the component of variance in the sampled gradient parallel to the true gradient ∇F.
E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 8 ; 1 1 6 ; 6 1 5 This inner product test imposes a less restrictive limitation of the overall variance in the sampled gradients, particularly in cases where the variance may be higher in directions orthogonal to the true gradient than in the direction parallel to ∇F. However, either of these metrics will be able to exploit the fact that an increased expected error, E½kϵ S n k, will correlate to a cheaper computation of the estimated gradient. Thus, setting larger values of γ tot or γ k in the inequalities will correspond to cheaper computational requirements for each step, but also a more pronounced random walk component to the GD. In many cases, it may be found that the penalty paid by increasing the random walk component is acceptable (up to a point) compared to the penalty paid in computational cost for reducing the expected norm of ϵ to a negligible value. For example, using Monte Carlo RTE simulations to compute ∇F with a negligible level of variance (i.e., setting γ tot ≪ 1) may take billions of simulated photons at each step. Whereas, it may be possible to compute a gradient that passes the norm test or inner product with larger values of γ tot or γ k with many orders of magnitude less photons, particularly during the early stages of GD, where we may be far from the minimum. The ideal choice of γ tot or γ k will depend on the specific application.

Adaptive Sample Size
We have discussed two different measures of the variance in the sampled gradient ∇F S n that we wish to investigate in the context of Monte Carlo estimation of media properties, viz., the norm test [Eq. (7)] and the inner product test [Eq. (8)]. To satisfy the inequalities defining these tests as the GD progresses, we will be required to reduce the variance in the sampled gradients ∇F S n whenever the norm test or inner product test fail. This can be done by increasing the sample size (number of photons used jS n j) when making a call to the stochastic oracle. Two practical considerations are still required: first, how to compute the "true" gradient ∇F, which is needed to evaluate the norm test and inner product test; and second, by how much we should increase the sample size in a situation where one of the tests fails?
The true gradient ∇F is only calculable in the limit that an infinite number of photons are used in the Monte Carlo model. This limit can equivalently be represented as an average over independent repeated outputs of the sampled gradient E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 9 ; 1 1 6 ; 2 0 1 ∇Fðx n Þ ¼ lim Using a finite value of N rep in the evaluation of Eq. (9) provides an approximation to the true gradient, and when this is used to compute the norm and inner product tests [Eqs. (7) and (8)], the inequalities will fail before they would if N rep ¼ ∞, thus acting as a conservative approximation. It is noted that this method of approximating the true gradient is computationally taxing. However, in practice, the inner product test and norm tests can still be conducted efficiently if they are only computed occasionally (not at every iteration) of the descent. For example, using N rep ¼ 100 repeated computations of the sampled gradient to conduct the tests once every 100 iterations (thus only updating our sample size every 100 iterations) would only double the total number of simulated photons required for the inversion. In this study, we evaluate these metrics once every 10 iterations using N rep ¼ 100 repeated sampled gradients. While this is a significant computational burden, we do so in this study as we are interested in assessing the best case scenarios for such methods. It is worthy to note that although we compute the aforementioned approximation to the true gradient to evaluate the inner product and norm tests, we only ever update our estimate using the sampled gradient.
In terms of increasing the sample size in the event where the inner product and/or norm tests fail, this can be done in a number of ways. A simple method we will employ in this study is to scale the current sample size by some factor κðnÞ to increase the number of photons used in the next iteration E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 0 ; 1 1 6 ; 6 0 3 jS nþ1 j ¼ κðnÞjS n j: One option for κðnÞ is to use the same factor by which the variance exceeds our imposed limit at a given point in the descent. For instance, upon failure of the inner product test for a chosen value of γ k , we can increase the sample size on the next iteration using κðnÞ ¼ V 2 k ðx n Þ∕γ 2 k . However, we also investigate other forms of κðnÞ in Sec. 4, which better cope with statistical variations that can lead to overestimating the required sample size increase.

Step Size
In cases where we are not taking actions to bound the error in the sampled gradient (such as enforcing successful outcomes of an inner product test or norm test), fixed-step SGD may only converge to a region around the solution. Reducing the step size sufficiently at each step is usually required to allow convergence. 31 However, it can be shown that if we are bounding the error in the sampled gradient, e.g., by increasing the sample size, then fixed-step SGD may converge so long as the following is satisfied for all n: 30 E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 1 ; 1 1 6 ; 3 9 9 where L is the Lipschitz constant for F. The Lipschitz constant for a functional F is a measure of its rate of change with respect to its parameter and can be defined, for example, as the smallest constant such that ∇ 2 F⪯LId, where Id is the identity matrix, and we assume that F is twice continuously differentiable. It can also be interpreted as the largest eigenvalue of the Hessian of F. 32 As intuition may indicate, when the sample size (e.g., number of simulated photons) increases toward the maximum number of samples jS n j → jS max j (jS max j ¼ ∞ in the case of Monte Carlo RTE simulations), the expected error in the sampled gradient approaches zero, jϵ S n j → 0, as do the measures of variance in the sampled gradients (V 2 tot → 0, V 2 k → 0), as defined in Eqs. (7) and (8). In other words, as the stochasticity in the problem reduces to zero, we approach the classical step size of the deterministic problem given by α ¼ 1 L . 32 In this study, we aim to satisfy the aforementioned step size criteria for an assumed value of the Lipschitz constant L, which we will choose conservatively depending on the particular scenario. However, as we are primarily interested in reaching the best possible solution for a given allocation of computational resources, convergence to a region around the unique solution may be sufficient for our purposes. For this reason, we will also investigate larger step size criteria, which violate Eq. (11), yet exhibit good performance in our scenarios of interest.
Taking the above considerations into account, we present a basic method for SGD using adaptive sample sizes in Algorithm 1 (simplified from Ref. 30). The algorithm imposes a limit on the total number of photons to be simulated using Monte Carlo transport models throughout the entire descent, N ph .

Stochastic Forward and Adjoint Models
In this section, we cover the computation of the stochastic forward model and stochastic gradient approximation, referred to as the first-order stochastic oracle. We will cover the basic radiative transport forward problem, and the gradient computations involved in our example problems of absorption estimation in QPAT and UMOT. The specific details of these models are not required to understand the main premise of this paper, but serve as a demonstration in a context familiar to many in the biomedical optics community, where Monte Carlo models of optical transport are employed to estimate medium properties.

Forward Model
For any optical source Qðr;ŝÞ, either incident on a medium or present within it, we wish to model the resulting radiance, ϕðr;ŝÞ, describing the radiant flux at each position r, and in each direction s. This can be achieved using the RTE.
where T and S denote the attenuation and scattering operators, respectively, which together compose the RTE operator, L. For notational convenience, we assume that Eq. (12) is combined with appropriate boundary conditions, which we do not write explicitly here; see Ref. 33 for more details. Here, μ a is the absorption coefficient, μ s is the scattering coefficient, and p is the scattering phase function. Using the defined operators, Eq. (12) can be rewritten in a more compact form: Algorithm 1 Inversion using Monte Carlo sampled gradients with adaptive sample size.
Choose initial photon sample size jS 1 j, and desired value of γ k or γ tot if run test? then compute sampled gradient, ∇F S n , and approximate true gradient, ∇F [using Eq. (9)] To obtain (stochastic) estimates of the radiance resulting from a given source, and thus to obtain an estimate of any derived data function yðϕÞ, we can implement a Monte Carlo solver, L −1 MC . In this study, we have adapted a hardware-accelerated version (utilizing graphics processing units) of the commonly employed "Monte Carlo multilayer" program used to simulate radiative transport within a layered planar medium. 34,35 The basic operation of this program is unchanged from the original release. Simulated photons are initiated by sampling from a given source function, Q, and scattering/absorption events are pseudorandomly generated along each photon's trajectory until either: (i) the photon leaves the domain or (ii) the photon drops its weight below some threshold value. In this study, the scattering directions are sampled from the Henyey-Greenstein scattering phase function. The expected accuracy of the computed radiance using Monte Carlo solvers L −1 MC depends on the total number of photons used, i.e., the sample size jS n j. As jS n j → ∞, the radiance approaches the deterministic solution of the RTE. Importantly, however, Monte Carlo models allow an estimate of the radiance to be achieved with any number of photons with jS n j ≥ 1. The expected computational requirements (number of floating point operations) of the Monte Carlo solver L −1 MC also scales with the number of photons simulated, and it is this trade-off between accuracy of the forward model (and corresponding adjoint model) and computational cost that we will be investigating.

Gradient Computation: Adjoint Model
To compute the gradient of our cost function ∇F with respect to the optical properties of the medium, we make use of an adjoint RTE model. Although direct methods of finding the derivative of a Monte Carlo method can also be developed, 12 adjoint methods have more applicability in general, and also allow closer comparison with optimization techniques used in machine learning. For further details of forward and adjoint methods in the RTE, we refer to Ref. 36; for specific details of CPI problems, we refer to Ref. 37. We first consider a change to Eq. (12) where μ a → μ a þ μ δ a , μ s → μ s þ μ δ s , for the same source Q, which results in a change in radiance ϕ → ϕ þ ϕ δ . This implies E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 4 ; 1 1 6 ; 3 9 4 ðT μ a þμ δ a ;μ s þμ δ E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 5 ; 1 1 6 ; 3 3 2 We also define the fluence, Φ, as the angular integral of the radiance: To proceed beyond this point, we must now consider the specific form of the data function relevant to a particular modality of interest. We begin with the first of our two example modalities, QPAT.

QPAT case
In QPAT, the medium is illuminated with a pulsed optical source, Q (see Fig. 1). The distributed optical energy is absorbed at various points within the sample, giving rise to internal acoustic waves. These acoustic waves can be detected at the surface of the medium by a sensor and processed to locate the initial pressure distribution p 0 within the medium. [38][39][40] This internal pressure distribution is related to the spatial distribution of absorbed optical energy, h, where E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 7 ; 1 1 6 ; 1 1 3 and where Φ is the optical fluence of Eq. (16). We have omitted the Grüneisen parameter for clarity of exposition, though this parameter can be included in practice. Assuming that we can recover the absorbed optical energy, h, the problem remains to find the distribution of μ a ðrÞ within the medium. 41,42 It is worth noting that although the optical source is pulsed, it is acceptable to use a continuous-wave (time-independent) model to describe ϕ and Φ because the time scale of the acoustic wave propagation is orders of magnitude slower than the optical propagation. 43 First, restating our cost function in terms of the QPAT data function, h, we have E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 8 ; 1 1 6 ; 5 0 7 We then write the Fréchet derivative of F QPAT as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 9 ; 1 1 6 ; 4 5 0 where μ δ a is a small change in absorption. In this paper, we will neglect changes in scattering, however, the below formalism is still general for the gradient with respect to absorption. The gradient term with respect to scattering coefficient is described in Ref. 42 and will be included in future investigations. Writing the Fréchet derivative of h as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 0 ; 1 1 6 ; 3 6 8 and defining Φ δ ¼ DΦμ δ a , we arrive at E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 1 ; 1 1 6 ; 3 2 5 Next, we define the adjoint radiance, ϕ Ã , as the solution to E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 2 ; 1 1 6 ; 2 7 9 where the right-hand side describes the "adjoint source" which is isotropic inŝ. We then substitute the above into Eq. (21) to give E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 3 ; 1 1 6 ; 2 2 3 where we exploited the fact that the right-hand side of Eq. (22) does not depend on direction.
Using the definition of the adjoint operator, and the fact that the change in radiance is zero on the boundary ∂Ω yields Finally, we make use of the perturbation expression Eq. (15), while again, here, we neglect any change in scattering. This gives E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 5 ; 1 1 6 ; 9 5 DF QPAT ¼ −hΦðh obs − hÞ; μ δ a i L 2 ðΩÞ þ hϕ Ã ϕ; μ δ a i L 2 ðΩ×S n−1 Þ ; To compute a stochastic approximation of this gradient, we can thus use the forward model Monte Carlo solver L −1 MC to provide estimates of ϕ and Φ, and an adjoint Monte Carlo solver L −1Ã MC to produce ϕ Ã from an adjoint source term Q adj ¼ μ a ðh obs − hÞ, as defined in Eq. (22). Due to the symmetry of the problem, the adjoint solver is identical to the forward solver and follows the same basic operating principles. The only difference is that here the adjoint source Q adj ¼ μ a ðh obs − hÞ may in fact be negative in some locations. This is handled by splitting the source term into two parts, one purely positive, Q þ adj , and one purely negative, Q − adj . Two simulations are then run (where the total number of photons to be used is split between the two simulations accordingly), and the results summed to produce ϕ Ã . Algorithm 2 describes the basic operation for computing a sampled gradient, ∇F S n , for QPAT using the above derivation. This will be used in conjunction with Algorithm 1 to conduct an inversion with adaptive sample size for each iterate, jS n j.

UMOT case
Referring to Fig. 2, in UMOT, we have an optical light source Q q incident on a medium, as well as an optical detector J m . In addition, an ultrasound source is incident on the medium, where the focus ηðrÞ is scanned through the sample. 44,45 Assuming for simplicity an ideal (delta function) ultrasound focus, the data of interest in this case are found to be of the form 46 E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 7 ; 1 1 6 ; 4 3 9 bðrÞ ¼ ηðrÞΦ q ðrÞΦ m ðrÞ; (27) where Φ q is the fluence resulting from the optical source Q q , and Φ m is the resulting fluence from a virtual source Q m which is reciprocal to the detector J m . 46 From this point, we proceed in similar fashion as in Sec. 3.2.1, where now our data fitting error is given by E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 8 ; 1 1 6 ; 3 7 0 and its Fréchet derivative as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 9 ; 1 1 6 ; 3 1 5 In this case, the Fréchet derivative of b becomes Db ¼ ηΦ q · DΦ m þ ηΦ m · DΦ q ; leading to E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 3 1 ; 1 1 6 ; 7 0 1 Here, we need to define two adjoint radiances, ϕ Ã;1 and ϕ Ã;2 , as the solution to E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 3 2 ; 1 1 6 ; 6 5 3 E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 3 3 ; 1 1 6 ; 6 0 9 L Ã ϕ Ã;2 ¼ ηΦ m ðb obs − bÞ; and substituting into Eq. (31) to give E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 3 4 ; 1 1 6 ; 5 8 6 by the same arguments as for QPAT we get Thus, similar to the QPAT case, here we are able to compute a stochastic approximation of this gradient using the forward model Monte Carlo solver L −1 MC to provide ϕ q and ϕ m from our two sources, and an adjoint Monte Carlo solver L −1Ã MC to produce ϕ Ã;1 and ϕ Ã;2 from the adjoint source terms Q 1 adj ¼ ηϕ q ðb obs − bÞ and Q 2 adj ¼ ηϕ m ðb obs − bÞ, as defined in Eqs. (32) and (33). Here as well, adjoint source terms are split into two parts, one purely positive, Q þ adj , and one purely negative, Q − adj , with the photon budget being split accordingly. Algorithm 3 describes the basic operation for computing a sampled gradient, ∇F S n , for UMOT using the above derivation. This will be used in conjunction with Algorithm 1 to conduct an inversion with adaptive sample size for each iterate, jS n j.

Fluence Monte Carlo
It should be noted that numerous Monte Carlo radiative transport solvers do not explicitly output the radiance, as this requires additional programming to store the angular ordinates at each location. Commonly, only the fluence will be available, which is the angular integral of the radiance Eq. (16). In such cases, the aforementioned integrals for the gradients of interest Eqs. (26) and (37) can be computed under the assumption of approximately angularly isotropic radiances, Algorithm 2 Monte Carlo sampled QPAT gradient.
where for example ∫ ϕ Ã ϕdŝ becomes Φ Ã Φ. The accuracy of this approximation of course depends on the true angular dependence of the radiances, where the approximation is poorest in regions close to directional light sources, but improves further away. The higher the scattering asymmetry g of the medium, the slower the approximation improves as a function of distance from these sources. In many cases, however, this is a satisfactory assumption and is employed in the below example cases.

Results
In this section, we present the results of a number of investigations using our two example problems of QPAT and UMOT. We will demonstrate the implementation of the forward-adjoint Monte Carlo solvers described above, along with adaptive sampling strategies to estimate the absorption coefficient of a medium via SGD. Here, we investigate media with a semi-infinite slab geometry, with numerous layers in the z direction having different optical properties, but otherwise homogeneous in the x and y directions. The application to layered geometry in this demonstration was chosen for simplicity to provide an easily recognizable setting to test these adaptive sampling methods. Furthermore, while apparently simplistic, layered geometries are still of practical interest for applications including instrument calibration and validation, and the imaging of biological structures with small curvature but significant heterogeneity in depth. The latter example includes studies such as functional (cognitive) imaging when localized to small activation regions. Application of these new methods in more complicated 3D geometries will be carried out in future work. Each of the medium layers can be described in terms of thickness, scattering coefficient, absorption coefficient, (background) refractive index, and scattering asymmetry parameter. We will assume all parameters of the layered medium are known a priori with the exception of the absorption coefficient, which we will attempt to solve for. For the examples in this study, we set the total slab thickness to 2 cm, and the inversion is conducted with a resolution of 0.25 mm, (80 layers). The true "measured" data in all problems are generated using a single forward model Monte Carlo simulation using a large sample size of 10 9 photons. With this sample size, the variance of the measured forward data h obs , b obs is found to be negligible in this setup, and as such can be treated as effectively equivalent to the deterministic solution of the RTE.
To conduct an inversion, we stipulate a total photon budget, N ph , for which all combined sample sizes in the descent must not exceed, i.e., P n jS n j ≤ N ph . Once the total photon budget is expended, we terminate the descent. This is to emulate an imposed restriction on computational resources required to reach a solution. While each iteration (involving forward and adjoint runs of the Monte Carlo) has a nonzero computational overhead, optimization of these Monte Carlo programs for repeated iteration (such as employed in Ref. 23) allows this overhead to become negligibly small. This means that the required computational resources of the inversion (and therefore required computation time) are proportional to the total number of simulated photons used throughout the descent, i.e., the photon budget N ph . The inversions are carried out using Algorithm 1, along with Algorithms 2 and 3 to compute the gradients for QPAT and UMOT, respectively. In Algorithm 1, we will compute the metrics V 2 tot and V 2 k and conduct the norm test and inner product test once every 10 iterations to evaluate the quality of our computed gradients (using N rep ¼ 100 independent repeated samples of the gradient), and to update the step size and sample size. It is worthy to note that as this is an investigation of how such methods might perform in best case scenarios, we do not include the photons used to compute these metrics as counting against the total allowed photon budget.
There are three different strategies we have employed to control the step size and sample size as the inversion progresses, see Table 1 for a summary. Strategy 1 uses a fixed-step size as described in Eq. (11) for a chosen value of γ tot . The sample size is adaptive and attempts to enforce successful outcomes of the norm test (V 2 tot ≤ γ 2 tot ), by increasing the sample size when the norm test is violated. In the event of a violation of this inequality, the fractional increase in the sample size is equivalent to the factor by which the norm test fails, V 2 tot ∕γ 2 tot . Strategy 2 uses an adaptive step size which still satisfies Eq. (11); however, it selects the largest step size possible for this criterion each time the metrics are evaluated. In this strategy, the sample size is also adaptive and attempts to enforce successful outcomes of the inner product test (V 2 k ≤ γ 2 k ) by increasing the sample size when the inner product test is violated. In the event of a violation of this inequality, the fractional increase in the sample size is equivalent to the factor by which the inner product test fails, V 2 k ∕γ 2 k . In strategy 3, we attempt to accelerate the descent using a larger adaptive step size with V tot in the denominator in place of V 2 tot . Upon failure of the inner product test, the sample size is increased by fraction V k ∕γ k , and differs from strategy 2 to reduce the speed at which the photon budget is depleted. This is an attempt to reduce premature increase of the sample size caused by volatility in the computation of the norm and inner product metrics.
Finally, we introduce an error function for the estimated absorption distribution, μ a , as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 3 8 ; 1 1 6 ; 3 2 3 This metric would not be available under normal circumstances (as we do not know the ground truth μ true a ), however it is useful to monitor in terms of the underlying performance of each strategy. Furthermore, as we will see, the sampled data cost function F S n is itself heavily dependent on the number of photons (sample size) used in the forward Monte Carlo and is thus not an ideal indicator of proximity to the true solution.

QPAT
We begin with our example QPAT problem. The starting sample size in all cases shown is jS 1 j ¼ 200 photons per iteration (100 for each forward run, and 100 for each adjoint run in accordance with Algorithm 2), and the total photon budget for the inversion was set to N ph ¼ 2 × 10 6 photons. The Lipschitz constant was set at L ¼ 2.5, as this displayed stable descent in our test problems using large photon budgets (low-variance case). The initial estimate of the absorption distribution in the medium is μ a ¼ 0.2 cm −1 in all layers. The scattering coefficient of all layers was set to μ s ¼ 40 cm −1 , and the scattering asymmetry parameter was set to g ¼ 0.9. The ground-truth absorption μ true a is shown in Fig. 3(a), along with the final retrieved absorption distributions obtained via strategies 1, 2, and 3 using the stated values of γ tot and γ k . Figure 3(b) shows the corresponding measured data and the final forward modeled data for each Table 1 Table showing the different inversion strategies used. Strategy 1 has a constant step size, with adaptive sample size. Strategies 2 and 3 both have adaptive step sizes and adaptive sample sizes. It is worthy to note that in accordance with Algorithm 1, the sample size is only increased upon a failure of the relevant test. If the test passes, then jS nþ1 j ¼ jS n j.

Strategy
Step size, α n Sample size, jS nþ1 j ¼ κðnÞjS n j of the strategies. Figure 4 shows the outcome of each strategy in terms of the sampled data cost function F S n and the absorption error function F μ a . It can be seen that the ranking of these methods in terms of the lowest achieved value of the sampled cost function F S n does not correlate directly to the best outcomes in terms of the error in the estimated absorption F μ a . This is due to the above-mentioned dependence of the sampled data cost function on the sample size used in the forward model, where for example the case of strategy 2 only appears to perform poorly in terms of F S n due to its small sample size used throughout the inversion. This is more clearly shown in Fig. 3(b), where the final forward modeled data from strategy 2 are noisier than the other strategies due to the low sample size at the end of the inversion, where this noise would clearly impact the sampled cost function. It is worthy to note that the relevant step sizes and sample sizes for each of these three examples are shown in Fig. 5. Before finding the best parameter for strategy 1, we trialed a range of values of γ tot over the range (0.1, 200). With lower values, the adaptive sample size was required to increase rapidly to maintain high-quality (low variance) sample gradients. This resulted in the photon budget being depleted early, terminating the descent after around 100 iterations, which did not perform well. Too large a value of γ k and the norm test never failed, meaning the sample size was never required to increase and the inversion progressed for the maximum 10,000 iterations permitted by the photon budget. However, as strategy 1 has a fixed value of γ 2 tot in the denominator of the step size, large values also result in  step sizes that were too small to perform well. A value of γ tot ¼ 4 was found to strike a balance between these two extremes and was the best performer using strategy 1. Strategy 2 has an adaptive step size which selects the largest possible step size that still satisfies Eq. (11), instead of selecting a constant step size that accounts for the worst case scenario, as in strategy 1. For this reason, we found that the largest value of γ k ¼ 20 was the best performer for this strategy, where the photon budget remained at 200 photons for each of the 10,000 iterations. For strategy 3, the best performer was a value of γ k ¼ 10, where larger values appeared to allow too much variance in the gradient, leading to unstable descents. In all strategies, the recovered absorption distribution matched the ground-truth absorption more closely in the regions of the sample closest to the light source at z ¼ 0. This is due to the decay of the fluence as a function of depth, as we can see the QPAT signal is highest at shallow depths in Fig. 3(b). The deeper regions of the sample were the last to approach the ground truth in each of the three strategies. From Fig. 6, we see the values of our two metrics V k and V tot . In all cases, both measures of the variance begin at low values, indicating that even with low numbers of photons being simulated, the computed gradients are of reasonable quality, likely due to the poor initial first guess being far from the true solution. Each of the measures of variance increase as the inversion progresses until they begin to violate the norm test or inner product test depending on the strategy. It is seen that the strategy 1 example attempts to keep V tot ≤ 4, however, due to some level of variation in the metrics themselves, this condition can be seen to be violated regularly, requiring regular updates to the sample size. For strategy 2, the imposed limit of V k ≤ 20 is never violated,  and thus the sample size is never required to increased. We also see that strategy 3 manages to keep V k ≤ 10 for the majority of the descent. In addition to these experiments shown in Figs. 3-6, we also trialed a number of other conditions including media with isotropic scatterers (i.e., with g ¼ 0), various scattering coefficients, and various initial estimates of the absorption. In all cases explored, the methods showed similar behavior as above, but with some differences in the ideal values of γ tot and γ k for each strategy. The outcomes of a range of these experiments are shown in Table 2 for various problem parameters. Strategy 3 was used in all cases in the table, with the same Lipschitz constant (L ¼ 2.5), starting sample size (jS 1 j ¼ 200 photons), photon budget (N ph ¼ 2 × 10 6 photons), and ground-truth absorption distribution μ true a as used in the above examples. The final attained values of the sampled data cost function F S n and absorption error F μ a are similar in all cases with the exception of the high asymmetry and low scattering case (g ¼ 0.9 and μ s ¼ 4 cm −1 ). In this case, the reduced scattering coefficient is only μ s 0 ¼ μ s ð1 − gÞ ¼ 0.4 cm −1 , meaning much lower overall attenuation of the light through the sample. This results in a more uniform data function, h, where the simulated photons probe the domain more uniformly, and allows the problem to converge significantly faster than in the higher attenuating cases demonstrated in Figs. 3-6. It is also worth noting that in the regime with low scattering and high scattering asymmetry, it is generally problematic for the performance of approximate transport models such as the diffusion approximation, and the results here highlight the flexibility of RTE based approaches, and the efficiency of the proposed adaptive sampling techniques.
Finally, interesting behavior was observed when using certain initial guesses of the absorption. An example of this is shown in Fig. 7, where we show the resulting cost functions for a starting estimate of μ a ¼ 1 cm −1 (significantly overestimating the absorption at all depths), and medium properties of g ¼ 0.9 and μ s ¼ 40 cm −1 . In this case, we see that the descent appears to encounter local minima in the data cost function F S n at various points during the descent, depending on the particular strategy used. However, the algorithm manages to escape these local minima and converge to a better solution. This is seen to be the case for all three strategies shown in Fig. 7.

UMOT
Next, we demonstrate similar experiments performed using the UMOT modality described in Sec. 3.2.2 for the transmission geometry. In this setup, we used the same medium slab size as the QPAT example, and the same optical properties apart from the absorption distribution. The starting sample size in all cases shown is jS 1 j ¼ 4000 photons per iteration, 1000 for each forward Table 2 Final outcomes of QPAT inversions with various medium optical properties and starting values of μ a . Values of F S n and F μ a are the final values at the end of each inversion after the stated number of iterations. In each case, strategy 3 was employed, with a starting sample size of jS 1 j ¼ 200 photons per iteration, and a total photon budget of N ph ¼ 2 × 10 6 photons. Slab thickness is 2 cm in all cases, with the same ground-truth μ true a distribution as shown in Fig. 3(a).
Starting μ a (cm −1 ) 0.01 0.2 1.0 Medium properties run (per each of the two sources), and 1000 for each of the two adjoint sources as outlined in Algorithm 3. The total photon budget for the inversion was set to N ph ¼ 4 × 10 8 photons. The Lipschitz constant was set at L ¼ 50, as this displayed stable descent in our test problems using large photon budgets (low-variance case). The initial estimate of the absorption distribution in the medium is μ a ¼ 0.1 cm −1 in all layers. The ground-truth absorption μ true a is shown in Fig. 8(a), along with the final retrieved absorption distributions obtained via strategies 1, 2, and 3 using the stated values of γ tot and γ k . Figure 8(b) shows the true measured UMOT data, b obs , along with the forward modeled data from the final estimated medium for each strategy. Figure 9 shows the outcome of each strategy in terms of the sampled data cost function, F S n , and the absorption error function, F μ a . The relevant step sizes and sample sizes for each of these three examples are shown in Fig. 10, and the values of the metrics measuring the variance in the sampled gradients are presented in Fig. 11. Similar to the QPAT modality, we found that strategy 3 performed the best in terms of the final achieved value of the error in the absorption estimate F μ a .
In addition to the results shown in Figs. 8-11, in Table 3 we present a summary of results for a range of different medium optical parameters and starting estimates of the absorption. In all cases, strategy 3 was used, and the starting photon budget was the same as in the previous UMOT examples (jS 1 j ¼ 4000 photons), with a total photon budget of N ph ¼ 4 × 10 8 photons. For each  Measured data , (a) (b) Fig. 8 UMOT inversion: (a) ground-truth absorption distribution, μ true a , and recovered absorption distribution μ a using each of the three strategies with the stated values of γ tot or γ k . (b) Associated measured data from ground-truth medium and simulated forward data at the end of the inversion using each strategy.   of the inversions presented in this table, we conducted the inner product test once every 50 iterations, using N rep ¼ 50 repeated evaluations of the gradient. The resulting inversions display similar error in these cases to the above examples where we used N rep ¼ 100 repeated evaluations of the sampled gradient once every 10 iterations to run the inner product test. This demonstrates that the described methods can still be successful when dedicating fewer computational resources to the inner product or norm test metrics, which control the adaptive sample size and step size.

Discussion and Conclusions
The results shown in Sec. 4 demonstrate that the adaptive sampling strategies performed well in both our example problems of QPAT and UMOT. We were able to achieve low error estimates of the medium absorption using a total computational expenditure that was either comparable to or significantly lower than the resources required to simulate a single low-variance run of the forward problem. In each demonstration, the adaptive sampling strategies maintained low photon numbers throughout the early stages of the inversion. Photon numbers were only increased when required to keep the variance in the gradients below the stipulated limits. These adaptive sampling strategies thus enabled significant computational savings compared to a naïve implementation, which might seek to use low-variance (high quality) computations of the gradient at every iteration. For instance, if we were to use a constant step size of 1∕L and the same number of photons per iteration as that which was used to generate the measured data (10 9 photons), then we find we still required hundreds of iterations to reach a similar quality estimate of the absorption as seen in the above problems. This means that the computational requirements of the low-variance approach would be proportional to N ph ¼ 10 11 photons. Comparing this to N ph ¼ 2 × 10 6 photons used in the QPAT examples or N ph ¼ 4 × 10 8 photons used in the UMOT examples, the required computational resources/time to attain our solutions with these adaptive sampling methods is multiple orders of magnitude lower compared to the naïve low-variance approach.
In this work, we have emphasized the similarities between our approach and that of SGD, as employed in the context of machine learning. However, it should be noted that there are significant differences between the two settings. In machine learning, the measured data are assumed to consist of a large number of samples to be fit to a deterministic model to minimize a suitable loss function, and each stochastic gradient is generated by a random subset of these data forming the descent direction of a subfunction. The same method has also been applied in Table 3 Final outcomes of UMOT inversions with various medium optical properties and starting values of μ a . Values of F S n and F μ a are the final values at the end of each inversion after the stated number of iterations. In each case, strategy 3 was employed, with a starting sample size of jS 1 j ¼ 4000 photons per iteration, and a total photon budget of N ph ¼ 4 × 10 8 photons. Slab thickness is 2 cm in all cases, with the same ground-truth μ true a distribution as shown in Fig. 8 alternative image reconstruction techniques where the data can be more naturally considered as consisting of a large number of random samples from some underlying distribution, for example, in positron emission tomography. 47 By contrast, our image reconstruction approach considers the complete measured data on each iteration, with stochasticity arising from the approximation within the forward model: we are effectively subsampling the gradient in terms of the parameter space, rather than data space. This is to say that at each iteration we utilize a subset of some notionally complete model, rather than of the data. The motivation by which each approach is employed is consistent: stochasticity is intentionally introduced to whichever part of the objective function introduces the greatest computational demand. This suggests a third possible approach, where the computational load of the (sub) gradient computation can be lowered through some stochastic division of both the data and the model; this might be relevant in imaging modalities with discrete counting data, such as time-domain and/or dynamic diffuse optical tomography. Our work suggests a number of interesting future developments: • In the examples shown here the "observed" data were effectively "noise free" by virtue of running the forward Monte Carlo on a very large number of photons. Thus an interesting topic for further study will be to evaluate these methods on noisy forward data, wherein the data fitting term should not be iterated to convergence, but where regularization should be introduced either by early stopping (i.e., by setting a minimum threshold for the data error) or by adding an explicit penalty term.
• Related to the previous point, we further note that our objective function employed a leastsquares data fitting term in this study. Depending upon the nature of the noise in the data and that of the stochastic forward model, more suitable metrics may include the Kullback-Leibler discrepancy (for Poisson likelihood) or a generalized measure of the distance between samples of probability distributions (Wasserstein distance 48 ).
• Our results demonstrate a consistent tendency for the adaptive sampling method to exhibit a geometric increase in the sample size as the descent progresses. This suggests that our adaptive approach could be employed to find a particular set of sampling parameters that perform well in a given regime, including the starting photon budget jS 1 j, rate of increase of the sample size κðnÞ, and rate of change of the step size α n . If a suitable set of such parameters could be found, they could help determine a fully prescribed sampling strategy. Once calibrated for a given problem of choice, this would avoid the need to explicitly compute the variance of the sampled gradients during the descent, and lead to even greater efficiency and speed in the inverse problem.
• Further topics of interest include more advanced methods of variance reduction (e.g., recursive gradients 49 ); adaptive estimates of the Lipschitz constant as described in Ref. 30; alternative optimization strategies such as back-tracking line-search, or primal dual methods; 50 the use of preconditioning and/or second-order optimization methods; 51 and an in-depth comparison of these nonlinear adaptive models to the alternative approaches such as PMC. 27 In summary, we have successfully demonstrated a means by which stochastic forward models, not directly amenable to standard variational methods for optimization, can be employed efficiently in nonlinear image reconstruction. We expect this concept to lead to many new directions of research in optical image reconstruction.

Disclosures
No conflicts of interest, financial or otherwise, are declared by the authors.