Deep learning in photoacoustic tomography: current approaches and future directions

Abstract. Biomedical photoacoustic tomography, which can provide high-resolution 3D soft tissue images based on optical absorption, has advanced to the stage at which translation from the laboratory to clinical settings is becoming possible. The need for rapid image formation and the practical restrictions on data acquisition that arise from the constraints of a clinical workflow are presenting new image reconstruction challenges. There are many classical approaches to image reconstruction, but ameliorating the effects of incomplete or imperfect data through the incorporation of accurate priors is challenging and leads to slow algorithms. Recently, the application of deep learning (DL), or deep neural networks, to this problem has received a great deal of attention. We review the literature on learned image reconstruction, summarize the current trends, and explain how these approaches fit within, and to some extent have arisen from, a framework that encompasses classical reconstruction methods. In particular, we show how these techniques can be understood from a Bayesian perspective, providing useful insights. We also provide a concise tutorial demonstration of three prototypical approaches to learned image reconstruction. The code and data sets for these demonstrations are available to researchers. It is anticipated that it is in in vivo applications—where data may be sparse, fast imaging critical, and priors difficult to construct by hand—that DL will have the most impact. With this in mind, we conclude with some indications of possible future research directions.


Introduction
The potential of biomedical photoacoustic tomography (PAT) to obtain high-resolution images based on optical absorption and, moreover, provide images that depend quantitatively on endogenous or exogenous molecular contrast, has resulted in rapidly growing interest in the modality. For example, the ability to obtain accurate, spatially resolved estimates of blood oxygenation would have significant impact both clinically and for pre-clinical applications. There are two aspects to PAT image reconstruction: an acoustic inversion from the measured acoustic time series to the initial acoustic pressure distribution, 1,2 and a spectroscopic optical inversion to recover optical absorption coefficients or quantities derived from them. 3 The acoustic inverse problem can be solved exactly in closed form in the ideal circumstance that complete data is available and the medium has a uniform and known sound speed. In most practical scenarios, however, there are divergences from this ideal case, eg. heterogeneities in the sound speed or bandlimited detection over an incomplete set of measurement points, making the acoustic inversion challenging. (The use of linear arrays for in vivo imaging is a case in point.) When, in addition, a solution is required to the optical inversion, the image reconstruction task becomes more challenging still as the forward operator is nonlinear. Iterative model-based approaches have been devised that manage this greater complexity by providing a flexible way to frame the problem and incorporate prior knowledge of the kind of solution expected. [4][5][6] However, such approaches, while appealing, are typically computationally intensive and time-consuming, which precludes their use in many applications.
In contrast to purely model-based approaches, data-driven techniques, and in particular Deep Learning (DL), are increasingly widely used for tomographic image reconstruction. [7][8][9][10][11][12] These techniques, which primarily originate from computer vision and are known to excel at segmentation and classification tasks, are frequently treated as 'black boxes'. This is widely considered undesirable in biomedical imaging and inverse problems, and recent work has started to provide insights into why certain network architectures work well for certain tasks, [13][14][15] and also to provide justifications for the use of DL approaches in the solution of inverse problems including image reconstruction. We will refer to the application of DL within the image reconstruction pipeline as learned image reconstruction.
The rising interest in learned image reconstruction has led to a transition from classical analytical methods to such data-driven approaches. While much of this work has focused on established imaging modalities such as MRI 9,16,17 and CT, 7,8,18 this transition is also clearly discernible in the recent literature on PAT image reconstruction. In this paper we will review the recent work done in this area and place these new approaches into a broader context by drawing connections to classical analytical reconstructions. We also provide a tutorial-style introduction to the use of DL in PAT image reconstruction, including describing and demonstrating several different approaches for learned image reconstruction. Code is available for these examples, free to download, allowing researchers to reproduce them, and providing them with a starting point for their own learned reconstructions.i PAT is a particularly suitable area in which to review these methods for several reasons. There is a very active experimental community interested in a wide range of applications, from data-intensive, large-scale, 3D imaging to 2D high-frame-rate uses. This results in a wide variety of different approaches for data collection and presents many different challenges in the reconstruction pipeline, including those of limited data, computationally expensive forward operators, uncertainty in model parameters, and the lack of training data; the latter being especially a problem for in vivo applications. This leads to a final point, which is that the community is not only in a transition from classical to data-driven approaches, but also in a transition from proof-of-concept studies to applying the techniques in challenging clinical and pre-clinical scenarios.
Indeed, these two transitions may prove to be symbiotic: data-driven approaches are rarely needed in proof-of-concept studies in which complete data is available and time is not of the essence, but many of the problems facing in vivo use are not easily tackled within a classical framework. We hope that by describing a framework for learned reconstructions, and by presenting an overview of the diverse work done, this review can provide guidance for possible future directions for image reconstruction as PAT transitions from the bench to the clinic.

Scope of Tutorial Review
There are multiple ways in which DL could be used within the context of PAT, so to keep this review to a reasonable length it is necessary to limit its scope. This review will concentrate on DL as applied to tomographic reconstruction in photoacoustics. In other words, the focus will be on using DL networks, sometimes in combination with classical approaches, to reconstruct photoacoustic (PA) images from projections (which here are acoustic time series); this includes pre- and post-processing approaches intended to improve reconstruction quality. With this as the focus, there are several applications of DL to PA imaging that must regrettably wait for a future review. First, this review will be limited to PA tomography and will not cover the use of DL in relation to PA microscopy (PAM), the principal difference being, for our purposes, that in PAT it is necessary to reconstruct the image from a set of measured projections but in PAM the image can be measured directly. Second, this review will not cover work where DL approaches have been used subsequent to a final reconstruction. This includes, for example, applications where DL has been used to segment or classify PAT images, or regions of images, into, say, diseased or healthy. Third, this review will not cover the use of DL to make diagnostic judgements, eg. to answer questions such as "Does this image indicate diabetes, rheumatism, cancer, etc?".
Papers on DL for PAT reconstruction are currently appearing at a steady rate and we anticipate that trend will continue. This review attempts to cover all relevant papers or preprints appearing up to the end of June 2020.

i Codes and training/test data for all experiments discussed will be made available at: https://github.com/asHauptmann/PAT_CODES

2 Forward and Inverse Problems in Photoacoustic Tomography

Physics of photoacoustic signal generation
The photoacoustic effect is the name given to the phenomenon by which the absorption of an optical pulse generates an acoustic pulse. A light pulse incident on soft biological tissue will be scattered around in the tissue, eventually either leaving the tissue or being absorbed by absorbing molecules in the tissue, known as chromophores (hemoglobin being one of the most important). The energy of the excited chromophore is then converted into heat. This all occurs on a timescale (∼ns) which is much shorter than the timescale required for the tissue to move (for the local mass density to change, ∼µs), so the heating is isochoric and therefore accompanied by an increase in pressure. Tissue is elastic, so the regions of higher pressure will act as sources of acoustic waves. Because of the difference in timescales, the pressure increase is usually treated as occurring instantaneously, and PA wave generation and propagation is modelled as the initial value problem:

∂²p/∂t² = c²∇²p,   p(x, t = 0) = f(x),   ∂p/∂t (x, t = 0) = 0,   (1)

where x ∈ R³ is the spatial variable, t ∈ R≥0 is time, and p(x, t) is the acoustic pressure. The medium properties to which the acoustic wave is sensitive, sound speed and mass density, will in general vary with position. However, for propagation through soft tissue the variations are often small and are rarely known in advance so the medium is usually treated as acoustically homogeneous. (Acoustic absorption, not described by Eq. (1), may also become important in some applications.) The initial condition, f(x) ≥ 0, is termed the initial acoustic pressure distribution and is related to the optical properties of the tissue by the equation

f(x) = Γ µa(x, λ) φ(x, λ),   (2)

where λ is the optical wavelength, µa is the optical absorption coefficient (dimensions of reciprocal length), φ(x) is the optical fluence (dimensions of energy per unit area), and Γ is a dimensionless constant which accounts for the efficiency of the acoustic generation (sometimes called the Grüneisen parameter, which it equals in some circumstances).
The dependence of φ on wavelength has been made explicit in Eq. (2), but φ also depends on the absorption and scattering throughout the tissue, making Eq. (2) nonlinear in the absorption coefficient µa. The positivity of the initial pressure distribution, f(x), arises from the fact that µa φ is the energy density due to absorption of the light and Γ is positive for most materials, especially soft tissue.
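As a minimal numerical illustration of Eq. (2), the initial pressure can be evaluated pointwise from the absorption and fluence. All values below are made-up example numbers, not measured tissue properties:

```python
import numpy as np

# Pointwise evaluation of Eq. (2): f = Gamma * mu_a * phi.
# All numbers are illustrative, not measured tissue values.
def initial_pressure(gamma, mu_a, phi):
    """Initial acoustic pressure distribution from Eq. (2)."""
    return gamma * mu_a * phi

mu_a = np.array([[0.1, 0.5], [0.2, 0.0]])   # absorption coefficient, mm^-1
phi = np.array([[10.0, 8.0], [9.0, 7.0]])   # fluence (illustrative units)
f = initial_pressure(0.2, mu_a, phi)        # Grueneisen parameter Gamma = 0.2
```

Note that f is non-negative wherever µa and φ are, consistent with the positivity remark above.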

Tissue optics
The nature of the dependence of the fluence φ on the absorption and scattering is usually modelled in biological tissue using transport theory, 19,20 ie. making the assumption that coherent optical effects can be safely ignored. Under this assumption, the light field is described in terms of energy by the radiance, ψ(x, t, ŝ, λ), which is the rate of energy flow per unit area per unit solid angle in direction ŝ ∈ S² at position x at time t (units of power per unit area per unit solid angle). When there are no significant inelastic processes such as fluorescence present, the radiance at each wavelength obeys the following integro-differential equation, known as the radiative transfer equation, which can be thought of as a statement of the principle of the conservation of energy:

(1/v) ∂ψ/∂t + ŝ·∇ψ + (µa + µs)ψ = µs ∫_{S²} ϑ(ŝ, ŝ′) ψ(x, t, ŝ′, λ) dŝ′ + q,   (3)

where v is the speed of light, q is a source term, µs is the scattering coefficient, and ϑ(ŝ, ŝ′) is the scattering 'phase' function, a probability density function describing the likelihood of a photon travelling in direction ŝ′ being scattered into the direction ŝ. The fluence, for a given wavelength, can be found by integrating the radiance at that wavelength over all angles and time:

φ(x, λ) = ∫₀^∞ ∫_{S²} ψ(x, t, ŝ, λ) dŝ dt.   (4)

The quantity of interest in quantitative PAT is sometimes the absorption coefficient, but more often it is a related quantity. For example, the absorption coefficient is related to the concentrations of the chromophores present by 21

µa(x, λ) = Σ_q C_q(x) α_q(λ),   (5)

where C_q is the concentration of the qth chromophore and α_q(λ) is its molar absorption coefficient spectrum. A quantity of considerable clinical interest is blood oxygen saturation, 22 which is related to the concentrations of two particular endogenous chromophores, oxy- and deoxy-hemoglobin, C_HbO and C_Hb respectively, by

sO₂(x) = C_HbO(x) / (C_HbO(x) + C_Hb(x)).   (6)
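The linear mixing in Eq. (5) and the saturation definition in Eq. (6) can be sketched numerically at two wavelengths. The 2×2 spectral matrix below uses made-up illustrative values, not tabulated hemoglobin spectra:

```python
import numpy as np

# Sketch of Eqs. (5) and (6): mu_a at each wavelength is a linear mix of
# chromophore spectra, and sO2 follows from the recovered concentrations.
# The spectral matrix contains made-up numbers, not hemoglobin tables.
alpha = np.array([[1.4, 0.6],    # one row per wavelength,
                  [0.7, 1.1]])   # columns: [Hb, HbO]
C_true = np.array([0.3, 0.7])    # concentrations [C_Hb, C_HbO]
mu_a = alpha @ C_true            # Eq. (5) evaluated at two wavelengths

C_hat = np.linalg.solve(alpha, mu_a)     # invert the spectroscopic step
so2 = C_hat[1] / (C_hat[0] + C_hat[1])   # Eq. (6)
```

With noise-free µa and as many wavelengths as chromophores, the concentrations are recovered exactly; the difficulties discussed later arise when the input images are themselves imperfect.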

Photoacoustic measurements
In PAT, measurements of the PA-generated acoustic waves are made on a surface S surrounding a region Ω ⊃ supp(f) containing the object to be imaged, f (see Fig. 1). Note that S is not a boundary, ie. it is assumed not to affect the acoustic field. The measurement operator M will typically consist of a filtering operator, W, which accounts for the angle-dependent frequency response of the detectors, and a spatial sampling operator, S, which selects the part of the acoustic field to be detected, such that

g = S W p|_S + ε,   (7)

where ε is additive measurement noise. (In some imaging systems, eg. in those using LED excitation, the signal-to-noise ratio can be very low and it is necessary to average many times.) A variety of different sampling operators have been considered for PAT, including detection at a set of points, {x_s ∈ S}, for which S is a simple geometric surface such as a plane, 23 cylinder, 24 sphere, 25 ellipsoid 26 or polyhedron, 27 measurements of spatial integrals of the acoustic field along planes or lines 28 or patterns, 29 2D measurements using a ring of detectors focused in a plane, 30 and measurements made with a linear array of elements also focused in a plane (as used for conventional ultrasound imaging). 31 PA signals are by their nature broadband, often more broadband than the ultrasound detectors used to measure them, so the detected frequency range is usually restricted. Furthermore, due to the finite size of real ultrasound detectors, they also filter the spatial wavenumbers. (As the detection area increases, the detectors become more directional, ie. the acceptance angle narrows.) The filtering operator, W, accounts for both the frequency and wavenumber filtering effects.
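The action of a bandlimiting filtering operator W can be sketched by filtering a broadband time series with an assumed detector passband. The Gaussian response, centre frequency, and bandwidth below are illustrative choices, not a model of any particular transducer:

```python
import numpy as np

# Toy model of the filtering operator W: a broadband PA time series is
# reduced to the detector's passband. The Gaussian response (centre
# frequency f0, bandwidth bw) is an assumed illustrative detector model.
def apply_bandlimit(g, dt, f0=5e6, bw=3e6):
    """Filter the time series g with a Gaussian passband centred on f0."""
    freqs = np.fft.rfftfreq(g.size, dt)
    response = np.exp(-0.5 * ((freqs - f0) / bw) ** 2)
    return np.fft.irfft(np.fft.rfft(g) * response, n=g.size)

dt = 1e-8                                         # 100 MHz sampling interval
t = np.arange(512) * dt
g = ((t > 1.0e-6) & (t < 1.2e-6)).astype(float)   # idealised broadband pulse
g_meas = apply_bandlimit(g, dt)                   # what the detector records
```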
When the detectors are ideal, W = Id, the identity, and neglecting noise, Poisson's solution 32 to the initial value problem in Eq. (1) shows that the relationship between the measured time series g(x_s, t) and the initial acoustic pressure f(x) can be written in the form:

(1/t) ∫₀ᵗ g(x_s, t′) dt′ = (1/(4πc²t²)) ∫_{|x_s−x|=ct} f(x) dA,   (8)

where dA is an area element of the surface given by the spherical shell |x_s − x| = ct. This shows that the time average of g between times 0 and t equals the spatial average of the initial pressure f(x) over a spherical shell of radius ct centered at x_s. More concisely, we can write g_sm = R_sm f, where R_sm is the spherical mean Radon transform, and g_sm = t⁻¹ ∫₀ᵗ g(x_s, t′) dt′ is the spherical mean data. Some of the literature relevant to PAT image reconstruction considers the data in this form. 1,25
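A discrete 2D analogue of the spherical mean data, circular means over arcs centred on a detector, can be sketched as follows. The grid, detector position, and disc phantom are illustrative choices, and nearest-neighbour sampling stands in for proper interpolation:

```python
import numpy as np

# Discrete sketch of circular-mean data in 2D: for one detector position,
# average the image f over circles of radius r = c*t. Nearest-neighbour
# sampling and the disc phantom are illustrative simplifications.
def circular_means(f, dx, detector, radii, n_angles=360):
    """Average image f over circles of the given radii about the detector."""
    theta = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    ny, nx = f.shape
    means = []
    for r in radii:
        xs = detector[0] + r * np.cos(theta)
        ys = detector[1] + r * np.sin(theta)
        ix = np.clip(np.round(xs / dx).astype(int), 0, nx - 1)
        iy = np.clip(np.round(ys / dx).astype(int), 0, ny - 1)
        means.append(f[iy, ix].mean())   # nearest-neighbour sampling
    return np.array(means)

# Disc phantom of radius 10 centred in a 128 x 128 grid.
n, dx = 128, 1.0
yy, xx = np.mgrid[0:n, 0:n] * dx
f = ((xx - 64) ** 2 + (yy - 64) ** 2 <= 10 ** 2).astype(float)
# Detector at (64, 20): only circles passing near the disc give nonzero means.
g_sm = circular_means(f, dx, detector=(64.0, 20.0), radii=[20.0, 44.0, 80.0])
```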

Acoustic, optical and spectroscopic operators
Before discussing PAT inverse problems, it will be helpful to define three operators describing the forward or direct problems (see Fig. 2). First, the operator A, a linear mapping from the initial acoustic pressure distribution f to the measurements g under additive measurement noise ε, which is based on Eqs. (1) and (7):

g = A f + ε.   (9)

A maps from image space X_f to data space Y. Second, the operator F, a nonlinear mapping from the absorption coefficient, µa, to the initial pressure distribution, f, which is based on Eqs. (2), (3) and (4):

f = F(µa).   (10)

F maps from image space X_µa to image space X_f. Finally, a third operator maps chromophore concentrations to absorption coefficients, from X_C to X_µa, based on Eq. (5):

µa = L C.   (11)

PAT Inverse Problems
There are two main inverse problems in PAT, corresponding to the acoustic and optical forward operators described above. First, an acoustic inversion from the measured time series to an estimate of the initial acoustic pressure distribution, ie. an estimate of A⁻¹g, and second, an optical inversion which attempts to recover quantitatively accurate estimates of optical coefficients (or related properties), eg. an estimate of F⁻¹(f). It can be shown 1 that the acoustic inverse problem is well-posed when sufficient data has been measured, but what constitutes sufficient data? The data measured by a closely-spaced array of omnidirectional, broadband, noise-free, point detectors arranged such that all the rays passing through every point in the imaged object reach at least one of the detectors would be sufficient. For example, if ideal detectors are positioned on the surface of a hemisphere at a spacing of λ_min/2, where λ_min is the shortest acoustic wavelength generated (to satisfy the spatial Nyquist criterion), the sound speed is constant everywhere, and the object lies inside the hemisphere's convex hull - the 'visible' region 33 - then the acoustic inversion will be well-posed. Given the stringency of these requirements, it is unsurprising that real experimental settings will often diverge from this ideal, leading to inversions that are no longer well-posed. One challenge for the reconstruction, then, relates to dealing with incomplete or imperfect measurement data. There are also challenges relating to the forward operator. These issues are outlined below, as it is in tackling them that DL may be able to make the most useful contributions.
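The λ_min/2 spacing requirement can be made concrete with a back-of-the-envelope calculation; the sound speed and maximum detected frequency below are illustrative values:

```python
# Back-of-the-envelope spatial Nyquist criterion for detector spacing:
# elements no more than lambda_min / 2 apart, with lambda_min = c / f_max.
# Sound speed and maximum detected frequency are illustrative values.
c = 1500.0                        # sound speed in soft tissue, m/s
f_max = 7.5e6                     # highest significant PA frequency, Hz
lambda_min = c / f_max            # shortest acoustic wavelength: 0.2 mm
max_spacing = lambda_min / 2.0    # required element spacing: 0.1 mm
```

For these values the spacing requirement is 0.1 mm, which illustrates why fully sampled detection surfaces quickly demand thousands of elements.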

Incomplete or imperfect data
For the acoustic inversion, the data could be incomplete or imperfect for several reasons:
• Noise is present in any real measurement.
• Detector responses are never perfectly broadband or omni-directional. Compromising on these characteristics is sometimes necessary in order to achieve sufficient detection sensitivity.
• Limited-view detection, in other words insufficient coverage of the object, perhaps because of the limitations of the available hardware, eg. a 2D linear array to image a 3D object, or due to restricted access to the object.
• Undersampling in space or time, perhaps in order to achieve faster data acquisition, or due to hardware constraints.
In the optical inversion, when considered separately from the acoustic inversion, the input data are images of the initial pressure distribution f obtained from the acoustic inversion. There are two ways in which this data can be deficient:
• Artifacts may be present in the image data due to an imperfect acoustic reconstruction.
• Wavelengths. The data must contain images at a set of wavelengths chosen such that the spectroscopic aspect of the optical problem, L⁻¹, is well-posed.
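Whether a chosen set of wavelengths makes the spectroscopic step well-posed can be probed with a simple conditioning check of the spectral matrix in Eq. (5); the spectra below are made-up illustrative numbers:

```python
import numpy as np

# Proxy check for the well-posedness of the spectroscopic inversion: the
# condition number of the wavelengths-by-chromophores spectral matrix.
# Both matrices below contain made-up illustrative spectra.
def spectral_condition(alpha):
    """Condition number of the spectral matrix in Eq. (5)."""
    return np.linalg.cond(alpha)

good = np.array([[1.4, 0.6], [0.7, 1.1]])      # spectrally distinct pair
bad = np.array([[1.0, 1.001], [2.0, 2.001]])   # nearly proportional spectra
```

A large condition number warns that noise in the µa images will be strongly amplified into the recovered concentrations, ie. the chosen wavelengths do not separate the chromophores well.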

Inaccurate forward operators
Eqs. (1)-(4) are broadly considered to capture the physical phenomena relevant to PAT, but to solve practical problems they must be implemented as numerical models. There are two ways in which these models can differ from the ideal.
• Simplifying approximations are often made to the forward operators in order to reduce the complexity of the computations necessary to implement them as numerical models. For example, the radiative transfer equation, Eq. (3), is often approximated using a diffusion approximation, and the wave equation, Eq. (1), is sometimes substituted with a simpler model, eg. based on rays.
• Inaccurate model inputs. Although the focus in experimental settings is usually on the accuracy of the measurement data, the accurate determination of the auxiliary parameters on which the forward operators depend is often just as crucial. For example, the acoustic operator A depends on the sound speed and how it varies within the tissue, which is rarely known to a high degree of precision. And the optical operator F not only requires knowledge of how the tissue was illuminated but also of the tissue's scattering properties, both of which can be hard to determine accurately.
This latter problem, the difficulty of accurately measuring the necessary model inputs, in particular the sound speed and the optical scattering distributions, has led some researchers to consider them as additional unknowns in the inverse problem. 34,35 These inversions have been shown to be less well-posed 36, 37 than the inversions of A and F, and additional data or constraints are usually required to find meaningful solutions.

Statistical framework: noise models and priors
The question naturally arises as to what can be done to improve the image reconstruction when the data is imperfect or the forward model is only known approximately. An approach which sounds like common sense, but in practice can be challenging, is to try to find the reconstruction f that is most probable given the data g. This requires a statistical framework, 38 which also provides a way to incorporate in the reconstruction any other information that is already known about the final image, the data, or the operator, in order to constrain the solution to one that has a higher probability of being correct. Specifically, we want to find the posterior probability distribution π(f|g),ii or some related quantity that characterises the most probable reconstructions. In the Bayesian framework, we can incorporate our prior knowledge about the problem via Bayes' formula

π(f|g) = π(g|f) π(f) / π(g),   (12)

where π(f) incorporates prior knowledge of the solution f, and π(g|f), called the likelihood, incorporates the known noise statistics using the forward operator A. For example, if the noise in Eq. (9) is normally distributed with zero mean and variance σ², we can express the likelihood as 38

π(g|f) ∝ exp(−‖A f − g‖²₂ / (2σ²)).   (13)

Even though in many applications we might not be able to explore the full posterior distribution π(f|g), the Bayesian framework can provide guidance for the interpretation of specific image reconstruction approaches. For instance, computing the maximum a posteriori (MAP) estimate relates to finding the minimiser in variational approaches, as we will discuss later in Sec. 3.1.4. Classically, both the likelihood and the prior are explicitly modelled and might therefore be limited in their expressiveness. Hence a natural question arises in the context of this study: can we use learning-based models instead of analytical models to generalise this approach?
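For a linear forward operator, a Gaussian likelihood as in Eq. (13), and a zero-mean Gaussian prior, the MAP estimate reduces to regularised least squares and can be computed in closed form. The toy operator, noise level, and prior width below are illustrative stand-ins for the PAT problem:

```python
import numpy as np

# Toy MAP estimate: with Gaussian likelihood (Eq. (13)) and a zero-mean
# Gaussian prior of variance tau^2, maximising the posterior amounts to
# solving (A^T A / sigma^2 + I / tau^2) f = A^T g / sigma^2.
# The operator A and all parameter values are illustrative.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))        # toy linear forward operator
f_true = rng.standard_normal(10)
sigma, tau = 0.05, 1.0                   # noise level and prior width
g = A @ f_true + sigma * rng.standard_normal(30)

H = A.T @ A / sigma**2 + np.eye(10) / tau**2
f_map = np.linalg.solve(H, A.T @ g / sigma**2)
```

As σ → 0 the data term dominates and the MAP estimate approaches the least-squares solution; as the prior narrows, the estimate is pulled towards the prior mean.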
In particular, two ways in which learning-based methods could be incorporated are: i) learning a prior π(f) that describes the unknown initial acoustic pressure distribution better; ii) compensating, in the likelihood π(g|f), for model uncertainties or complex noise statistics.
In DL it is conceptually easier to address the estimation of a prior, as it relates to the training set, as we will discuss in the later sections; it is not so straightforward to incorporate model uncertainties into the likelihood estimation. A useful direction on how to tackle this is given by the well-established approach of Bayesian approximation error modelling. [38][39][40] In this approach, modelling errors in the forward operator A are modelled as normally distributed and explicitly corrected in the likelihood term, Eq. (13). This approach has been applied in PAT to compensate for uncertainty in the measurement parameters (model uncertainty). 41,42

ii Here, the notation π(f) is used to denote the probability density function of f, and π(f|g) the conditional probability density of f given g.

Image reconstructions
While the two inverse problems described above, the acoustic and optical inversions, are the fundamental image reconstruction problems in PAT, variations on them are often used in practice. Common PAT reconstruction problems that appear in the literature are:
• reconstructing an image of the initial acoustic pressure distribution from the measured time series data, A⁻¹g,
• reconstructing an image of the optical absorption coefficient from initial acoustic pressure distribution images, F⁻¹(f),
• reconstructing an image of the optical absorption coefficient directly from the measured time series data, (AF)⁻¹(g),
• reconstructing images of quantities related to optical absorption, eg. chromophore concentrations or blood oxygenation, from a multiwavelength set of initial acoustic pressure images, (FL)⁻¹(f),
• reconstructing images of quantities related to optical absorption directly from a multiwavelength set of time series data, (AFL)⁻¹(g).
Research has already begun on applying DL to several of these tasks; this literature will be reviewed in Sec. 5. The next two sections will give an overview of the classical approaches to PAT image reconstruction, and a short tutorial on the kinds of DL that are being used for image reconstruction.
3 Classical Approaches to PAT Image Reconstruction

DL can be used to complement or augment current approaches to PAT reconstruction, or replace parts of them. For this reason, as well as to provide context, this section describes several widely used 'classical' - ie. not learning-based - approaches that have been used for solving the PAT inverse problems. This section is not intended to be a comprehensive review of classical methods for PAT image reconstruction, for which the literature is large, but to provide the background needed for later reference.

Acoustic Reconstruction
Here we consider the acoustic inversion of PAT, ie. the linear problem of solving Eq. (9) for f, the initial pressure distribution, given g, the measurement data. We will denote a generic reconstruction operator, or data-to-image mapping, by A† : Y → X_f throughout this review. Let us now discuss specific choices for such a mapping.

Backprojection and beamforming
Algorithms based on the idea of backprojection are widely used in PAT. This terminology comes from X-ray tomography, in which the forward operator (the linear ray transform) maps from image to data space by integrating the target along a set of straight lines for each detector, and the backprojection operator maps from data to image space by putting the data back along those straight lines and summing over all detectors. In the X-ray case, these dual operations are also adjoint; the backprojection operator is the adjoint of the forward operator. 43 In PAT the situation is slightly different. The forward operator A maps from image space X_f to data space Y by integrating through f along a set of spherical shells of radius t = |x − x_s|/c centered on the detector points x_s (see Sec. 2.1.3). Correspondingly, the backprojection operator A# maps a function of x_s and t from data space Y to image space X_f by putting the data back onto the same spherical shells with the mapping t → |x − x_s|/c, and summing over all detector points, x_s. For some function h(x_s, t), which might be the measurement data or some function of it, the backprojection operator is

(A# h)(x) = ∫_S h(x_s, t)|_{t=|x−x_s|/c} dS,   (14)

where dS is an area element on the measurement surface S. On the other hand, the adjoint operator A*, 44 while also a backprojection of this type, is not a backprojection of the data g itself (see also Sec. 3.1.4). When the data is processed before backprojection (or sometimes the image is processed after backprojection) the resulting algorithm is often referred to as a filtered backprojection. Filtered backprojection formulas for PAT have been found for a variety of measurement surface geometries. 1,25,27,45,46 Perhaps the most well-known, called the 'universal backprojection' algorithm, 45 gives exact reconstructions for detector points covering a spherical, cylindrical or planar measurement surface, and can be written as

f(x) = (1/Ω_s) ∫_S [2g(x_s, t) − 2t ∂g(x_s, t)/∂t]_{t=|x−x_s|/c} (cos α / |x − x_s|²) dS,   (16)

where α is the angle between the inward normal to S and the vector (x − x_s), and Ω_s is the solid angle of S as seen from a point x ∈ Ω, eg. Ω_s = 4π when S is a sphere. A 2D version of this has also been derived, 47 containing a weighting factor κ(x, x_s); κ(x, x_s) = |x − x_s| for the universal backprojection algorithm, but it has also been treated as a learnable parameter (Sec. 5.1.2). Linear array transducers of the kind used in conventional ultrasound imaging are increasingly being used for PAT, with backprojection-type formulas commonly used for image reconstruction. In this context, image reconstruction is sometimes referred to as 'beamforming' and the backprojection operation A# is descriptively dubbed 'delay-and-sum'. Linear arrays are typically short, consisting of just 128 bandlimited detection elements focused in a plane, so the image reconstruction is very ill-posed. Many variations of backprojection-type algorithms with different pre- and post-processing steps have been explored to try to maximise the image quality given these severe constraints. 48,49 In DL approaches to PAT image reconstruction, backprojection/beamforming-type algorithms have been widely used to map from data space Y to image space X_f, before post-processing networks and after pre-processing networks respectively (see Secs. 4.3.2, 5.1 and 5.2).
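The delay-and-sum form of the backprojection operator A# can be sketched in 2D: each pixel sums, over all detectors, the data sample at the time of flight t = |x − x_s|/c. The geometry and the ideal point-source data below are illustrative:

```python
import numpy as np

# Minimal 2D delay-and-sum backprojection: each pixel accumulates the
# sample at time of flight t = |x - x_s| / c from every detector.
# Geometry, sound speed, and sampling are illustrative choices.
def delay_and_sum(data, detectors, grid_x, grid_y, c, dt):
    """data: (n_detectors, n_t); detectors: list of (x, y) positions."""
    xx, yy = np.meshgrid(grid_x, grid_y)
    image = np.zeros_like(xx)
    n_t = data.shape[1]
    for d, (sx, sy) in enumerate(detectors):
        dist = np.hypot(xx - sx, yy - sy)
        idx = np.clip(np.round(dist / (c * dt)).astype(int), 0, n_t - 1)
        image += data[d][idx]     # put the data back onto spherical shells
    return image

# Ideal point-source data on a sparse linear array along x = 0.
c, dt, n_t = 1.0, 1.0, 64
grid = np.arange(32.0)
detectors = [(0.0, float(y)) for y in range(0, 32, 4)]   # 8 elements
source = (20.0, 16.0)
data = np.zeros((len(detectors), n_t))
for d, (sx, sy) in enumerate(detectors):
    tof = int(round(np.hypot(source[0] - sx, source[1] - sy) / (c * dt)))
    data[d, tof] = 1.0
image = delay_and_sum(data, detectors, grid, grid, c, dt)
```

All eight backprojected arcs intersect at the source pixel, which therefore accumulates the largest value; the residual arcs elsewhere illustrate the streak artifacts typical of sparse, limited-view delay-and-sum images.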

Series solutions
The first analytical solution to Eq. (1) was found in the form of an infinite series, and more have since been derived. 23,24,[50][51][52] A formula for the case of detection points lying on a plane is of particular interest because it is in the form of a Fourier transform, which can be computed efficiently using the Fast Fourier Transform. 23 The solution relies on the fact that any acoustic wavefield, p(x, t), can be written as a sum of travelling plane waves whose temporal frequency, ω, and wavevector, k = (k₁, k₂, k₃), are linked by the dispersion relation ω = c|k| = c√(k₁² + k₂² + k₃²). The solution takes the form

f(x) = F⁻¹{f̃(k)},   f̃(k₁, k₂, ω) = B(k₁, k₂, ω) C{F{g}},   (18)

where B(k₁, k₂, ω) = √((ω/c)² − k₁² − k₂²)/ω, F and C are Fourier and Cosine transforms respectively, and f̃(k) is obtained by algebraic transformation from f̃(k₁, k₂, ω) using the dispersion relation. In DL, this method has been used as a component in learned iterative reconstructions (see Sec. 5.4.2). When used with linear array transducers, this method and its variants are sometimes referred to as 'Fourier beamforming'.

Time reversal
Perhaps the most physically intuitive algorithm is based on the concept of time reversal. 53-55 Consider a measurement surface S surrounding a region Ω ⊃ supp(f). Imagine the photoacoustically-generated waves propagating outwards and being measured as they pass through the surface S. After a suitably long time T the acoustic field in Ω will be zero (guaranteed in a 3D homogeneous medium by Huygens' principle 56). If the measured pressure, g(x_s, t), were now reproduced on S in time-reversed order, starting with g(x_s, T), then the acoustic field in Ω created by the in-going waves would reproduce the out-going wavefield exactly but backwards in time. In particular, the field at t = 0 would be the initial acoustic pressure distribution f(x). Based on this idea, time reversal image reconstruction uses a numerical acoustic model to solve the following time-varying boundary value problem for the time-reversed field p_r(x, t_r), from time t_r = 0 to T:

∂²p_r/∂t_r² = c²∇²p_r,   p_r(x, 0) = 0,   ∂p_r/∂t_r (x, 0) = 0,   p_r(x_s, t_r) = g(x_s, T − t_r) for x_s ∈ S.   (19)

The solution at t_r = T gives the reconstruction: p_r(x, T) = f(x) for x ∈ Ω.
In DL studies, the time reversal approach is sometimes used for comparison with network approaches, but care must be exercised here to ensure a fair comparison. To help elucidate two problems with time reversal, note that the time-varying Dirichlet condition, p_r(x_s, t_r) = g(x_s, T − t_r), is equivalent to reintroducing the measurement data as a source term within a reflective cavity defined by the measurement surface S. 53 First, then, time reversal is not a good choice when using data detected on a sparse array of points, because during the time reversal procedure the points act like point scatterers. Second, when the true sound speed is spatially-varying but the reconstruction uses a homogeneous sound speed, the reflective effect of the boundary condition can trap artifacts in the image region. 57 In these scenarios, time reversal may not be the best method for comparison. Furthermore, when the sound speed is spatially-varying, resulting in multiple scattering, the requirement that the acoustic field in Ω fall to zero in a finite time T is no longer satisfied. One solution 58 is to use the following iterative scheme:

f⁽ⁿ⁺¹⁾ = f⁽ⁿ⁾ + A_TR(g − A f⁽ⁿ⁾),   f⁽⁰⁾ = A_TR g,   (20)

where A_TR signifies the time-reversal operator.
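The structure of this iteration can be sketched with a toy matrix operator; here a scaled adjoint Aᵀ/L stands in for A_TR (an assumption for illustration only; in PAT, A_TR is the numerical time-reversal operator):

```python
import numpy as np

# Sketch of the iterative scheme f_{n+1} = f_n + A_TR (g - A f_n) with a
# toy matrix operator. A_TR is stood in for by a scaled adjoint A^T / L,
# an illustrative choice that guarantees the correction step contracts.
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 15))         # toy linear forward operator
f_true = rng.standard_normal(15)
g = A @ f_true                            # noise-free, complete data
L = np.linalg.norm(A, 2) ** 2             # Lipschitz constant of A^T A
A_TR = A.T / L                            # stand-in for time reversal

f = np.zeros(15)
for _ in range(2000):
    f = f + A_TR @ (g - A @ f)            # one re-propagation correction
```

With complete, noise-free data the residual g − A f⁽ⁿ⁾ shrinks at every step and the iterates converge to f; with incomplete data the limit depends on the null space of A, which motivates the regularised formulations below.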

Variational approaches
The iterative time reversal algorithm points to a more general approach to reconstruction, as it looks very similar to the gradient descent scheme

f^(n+1) = f^(n) + λ A*(g − A f^(n)),   (21)

which solves the least-squares minimisation problem

f* = argmin_f (1/2) ‖A f − g‖²₂,   (22)

where f* denotes the optimal solution. (The similarities between Eqs. (21) and (20) become even clearer if we observe that the adjoint operator, A*, can be implemented numerically in a similar way to the time-reversal operator, A_TR, except that the pressure time series are reintroduced to the domain in time-reversed order by adding them to the existing field rather than enforcing the pressure at the detector points. 44 ) The idea of posing the image reconstruction as a numerical optimisation is appealing, 4-6, 59-61 because it provides a very flexible framework both for how the forward operator is defined (e.g. the sound speed could be spatially-varying) and for tackling ill-posedness in the inverse problem. Eq. (22) will have a unique solution when g is a complete set of ideal data. However, if g is incomplete or imperfect, then Eq. (22) may not have a unique solution, or overfitting (in which the model starts to fit the noise in the data) may become a concern. Early stopping of the iteration in Eq. (21) is one way to avoid overfitting, but a more general approach to restricting the solution space is to add another term to the functional in Eq. (22) that expresses prior information about the kind of solution that is expected, e.g. non-negativity of solutions and smoothness or sparsity conditions. The problem then becomes

f* = argmin_f (1/2) ‖A f − g‖²₂ + α R(f),   (23)

where the regularisation parameter α balances the importance placed on the first term - the data consistency term - and the second term, R, which encodes the prior information about f. There is an extensive literature on methods to solve minimisations such as this. 62,63 If the regularisation term R(f) is differentiable, one could simply employ the gradient descent of Eq. (21) with the additional gradient term α ∂R/∂f.
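As a concrete illustration of the gradient descent scheme of Eq. (21), the following sketch runs plain gradient descent on a small least-squares problem, with the forward operator A represented as a dense matrix. This is a toy stand-in for a discretised PAT operator; all sizes and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discretised forward operator A (data size 40, image size 25)
A = rng.standard_normal((40, 25))
f_true = rng.standard_normal(25)          # "initial pressure" image
g = A @ f_true                            # ideal, complete data

# Gradient descent: f_{n+1} = f_n - lam * A^T (A f_n - g)
lam = 1.0 / np.linalg.norm(A, 2) ** 2     # step size from the operator norm
f = np.zeros(25)
for _ in range(5000):
    f = f - lam * A.T @ (A @ f - g)
```

With complete, noise-free data the iterates converge to the true image; with incomplete or noisy data this is where early stopping or a regularisation term becomes necessary.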
If R is not differentiable, another approach to computing solutions iteratively is the proximal gradient method, in which one computes the iteration

f^(n+1) = prox_{R,ηα}( f^(n) − η A*(A f^(n) − g) ),   (24)

where A*(A f^(n) − g) is the gradient of the data consistency term and prox_{R,ηα}, the proximal operator, takes the updated image estimate and projects it into the constrained set defined by the regularisation, or in other words the space in which the solution is thought to exist. It is formally defined as the minimisation problem

prox_{R,ηα}(y) = argmin_f (1/2) ‖f − y‖²₂ + ηα R(f).

The formulation as a minimisation problem in Eq. (23) is directly connected to the Bayesian formulation in Eq. (12) and corresponds to maximising the posterior distribution π(f|g) to find the most likely reconstruction f. This represents a point estimator known as the maximum a posteriori (MAP) estimate. 38 In this context, the negative logarithm of the prior distribution directly relates to the regularising term, − log π(f) ∝ R(f). This general framework provided by the variational approach has inspired several learned iterative approaches to PAT reconstruction (see Secs. 4.3.3 and 5.4).
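For the common choice R(f) = ‖f‖₁, the proximal operator is soft-thresholding, and Eq. (24) becomes the well-known ISTA iteration. A minimal sketch with a dense toy operator (illustrative sizes and parameters, not a real PAT model):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 50))          # underdetermined toy operator
f_true = np.zeros(50)
f_true[[3, 17, 41]] = [2.0, -1.5, 1.0]     # sparse "image"
g = A @ f_true

eta = 1.0 / np.linalg.norm(A, 2) ** 2      # step size
alpha = 0.5                                 # regularisation parameter

def prox_l1(y, t):
    """Soft-thresholding: the proximal operator of t*||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

f = np.zeros(50)
for _ in range(10000):
    f = prox_l1(f - eta * A.T @ (A @ f - g), eta * alpha)
```

Even though the data here is incomplete (30 measurements for 50 unknowns), the sparsity prior restricts the solution space enough to recover the sparse image, up to a small bias introduced by the L1 penalty.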

Matrix formulation
The acoustic forward operator A is linear and so can in principle be discretised and written as a (large) matrix. When this matrix can actually be explicitly computed, the image reconstruction problem has been reduced to a matrix inversion, and all the machinery of linear algebra, and the associated methods of regularisation, can be brought to bear to solve it. This includes the variational approaches above in Sec. 3.1.4. For instance, if one considers a quadratic regularisation in Eq. (23), such as R(f) = ‖f‖²₂, then the solution can be computed in closed form and is given by

f* = (A*A + α Id)^{−1} A* g,

where Id denotes the identity, and which is sometimes called a Tikhonov-regularised solution.
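With an explicit matrix available, the Tikhonov-regularised solution is a single linear solve. A minimal numpy sketch (toy sizes and noise level):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((40, 25))          # toy discretised forward operator
g = A @ rng.standard_normal(25)
g += 0.01 * rng.standard_normal(40)        # measurement noise

alpha = 0.1                                 # regularisation parameter
# f* = (A^T A + alpha * Id)^(-1) A^T g
f_star = np.linalg.solve(A.T @ A + alpha * np.eye(25), A.T @ g)
```

The same solution can equivalently be obtained as the least-squares solution of the stacked system [A; sqrt(alpha) Id] f = [g; 0], which is often better conditioned numerically.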
There are many methods that can be used to discretise the forward operator, from pseudospectral methods 4 to semi-analytical approaches. 64 However, whether it is convenient -or even possible -to compute and store A explicitly as a matrix will depend on the number of detectors and the size of the image, and whether sparsity or other structures in the matrix can be exploited. 65 In fact, we will make use of a matrix representation for A in the tutorial part of this review (Sec. 4.5), as the problem under consideration is sufficiently small.

Optical Reconstructions
This section briefly summarises classical approaches to solving the nonlinear Eq. (10) for the absorption coefficient or related quantities. From Eq. (2) we can see formally that µ_a = F^{−1}(f) = f/(Γφ(µ_a)). An empirically determined value for Γ is sometimes used, or, when the final quantity of interest is a ratio of concentrations, see Eq. (6), Γ is assumed to be constant with wavelength and cancels out. The dependence of the fluence φ on µ_a, however, needs to be considered carefully.

Non-iterative approaches
A simple approach to dealing with the dependence of φ on µ_a, but one with questionable accuracy, is to ignore the dependence and apply the spectroscopic inversion directly to the PA data, L^{−1} f. This is sometimes known as linear unmixing. Despite its obvious flaws, this stance has been taken (usually implicitly) in many experimental papers in which the PA spectrum at a point, f(λ), has been assumed to be proportional to the absorption spectrum at that point, µ_a(λ). The difference between f(λ) and µ_a(λ), which linear unmixing ignores, is known as spectral coloring. A better approach, but still one whose accuracy needs to be demonstrated on a case-by-case basis, is to approximate the fluence using estimated average background absorption and scattering values, and suppose that this fluence remains unchanged by small changes in the optical absorption coefficient. In some cases, the fluence distribution can be measured directly using a second imaging modality in addition to PAT. 66,67 However, this requires complementary hardware to make the additional measurements, and it is difficult to achieve the same spatial resolution for φ as for f (or one may as well measure just the fluence distribution and not do PAT at all).

Fixed-point iterations
If the scattering is known, then the absorption coefficient can be found using a model of light transport, such as Eq. (3) or a suitable approximation, to calculate both the fluence and the absorption coefficient iteratively using the fixed-point iteration 68

µ_a^(n+1) = f / (Γ φ(µ_a^(n))).
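To illustrate the fixed-point idea on a deliberately simple model, suppose the fluence at a single point obeys a Beer-Lambert-type decay, φ(µ_a) = φ₀ exp(−µ_a d). This one-point model, and all the values in it (φ₀, d, Γ, the target µ_a), are invented stand-ins for a real light-transport model:

```python
import numpy as np

# Toy single-point model: f = Gamma * mu_a * phi(mu_a),
# with a Beer-Lambert-type fluence phi(mu_a) = phi0 * exp(-mu_a * d).
Gamma, phi0, d = 1.0, 1.0, 0.5
mu_true = 0.2
f = Gamma * mu_true * phi0 * np.exp(-mu_true * d)   # simulated PA "image" value

# Fixed-point iteration: mu^(n+1) = f / (Gamma * phi(mu^(n)))
mu = f / (Gamma * phi0)        # initial guess: ignore fluence attenuation
for _ in range(50):
    mu = f / (Gamma * phi0 * np.exp(-mu * d))
```

For these values the iteration map is a contraction, so the estimate converges quickly to the true absorption coefficient; for strongly absorbing or scattering media convergence of such schemes is not guaranteed.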

Variational approaches
As with the acoustic inversion described in Sec. 3.1.4, casting the optical inversion as a minimisation problem allows the various constraints and prior information to be included systematically.
Here, the inverse problem for the absorption coefficient is stated as

µ_a* = argmin_{µ_a} (1/2) ‖F(µ_a) − f‖²₂ + α R(µ_a),

and a similar expression can be written for the oxygen saturation or other quantities of interest. As, from Eq. (2), F(µ_a) := Γ µ_a φ(µ_a), the functional gradient of the data consistency term is given by

F'(µ_a)* (F(µ_a) − f),   with F'(µ_a) = Γ( φ(µ_a) + µ_a Dφ(µ_a) ),

where Dφ is the Fréchet derivative of φ, the form of which will depend on the particular model of light transport used. 34,42,69

Tutorial Introduction to Deep Learning for PAT Image Reconstruction

What Role could Deep Learning Play?
How can DL help to solve the challenges posed by the twin problems of incomplete data and inaccurate forward models outlined in Sec. 2.2? Or are there other ways in which DL can be used to enhance PAT image reconstruction? There are many areas in which DL could make an impact. For example, a DL network could be used to
• correct for missing or corrupted data in the measured time series data (pre-processing),
• reconstruct images from incomplete or imperfect data given the forward operator (effectively learning prior information to regularise the solution),
• approximate a forward operator (e.g. when it is difficult to write an accurate and computationally-efficient forward model explicitly),
• approximate an inverse operator (even when the data is perfect and the forward operator known, this may speed up the image reconstruction),
• remove artifacts and noise from reconstructed images (post-processing),
• segment images,
• classify or label images or regions of images.
(As mentioned in the Introduction, the last two points are out of the scope of this review.) An important attribute of a DL network is the speed with which it can process an input. For small networks this can be very fast, which may be useful in settings where reconstructions are required on short time scales, such as real-time or dynamic imaging. However, the speed of evaluation will depend on the size of the network and the size of the input data. It is also important to keep in mind that the final reconstruction speed will still depend on how the forward operator is utilised in the processing pipeline.
The motivation to use DL in image reconstruction, which these conceptual advantages provide, can readily be followed by action thanks to the availability of easy-to-use DL tools, such as TensorFlow 70 and PyTorch, 71 which make employing these new methods straightforward. Furthermore, the tendency of the machine/deep learning community to provide open-source algorithms and data accelerates the development of new methods and makes it simpler for researchers to try new approaches. Consequently, we provide the code accompanying this review along with a basic example of training and test data.

Brief Introduction to Deep Learning
This review concentrates on the application of Deep Learning, by which we mean in particular deep neural networks, to image reconstruction tasks in PAT. Specifically, we will concentrate throughout this section on the acoustic inverse problem, so we are interested in finding a mapping from the measurement data g ∈ Y to the initial acoustic pressure f ∈ X_f,

A†_θ : Y → X_f,   f ≈ A†_θ(g).   (28)

The driving incentive is the hope that a learned reconstruction operator can give better (faster, more accurate) reconstructions than classical approaches. The mapping in Eq. (28) may be a composition of model-based parts, involving a known operator describing the acquisition geometry and physics, and purely learning-based components. Before we can review common approaches and network architectures, we will give a short introduction to DL and the main network components. We will concentrate here on a high-level overview to help develop an intuition for the operations involved. For a more extensive review, see, for instance, Refs. 72-74.

Deep neural networks
A deep neural network, denoted here by the nonlinear operator Λ_θ, maps an input vector to an output vector. The network consists of several 'layers', each of which is a composition of an affine linear function with learnable parameters and a nonlinear function (often called the 'activation function', but referred to as a nonlinearity here). The term Deep Learning refers, roughly, to networks that consist of many layers, in contrast to shallow networks consisting of only a few layers.
Let us now formalise the notion of a layer. Given an input vector h^0 ∈ R^J, a linear map given as a matrix C ∈ R^{I×J}, a bias vector b ∈ R^I, and a point-wise nonlinear function ϕ : R → R, one layer L in a network is given by iii

h^1 = L(h^0) = ϕ(C h^0 + b).   (29)

The individual neurons in such a neural network are the mappings to single elements of the output vector; see Fig. 3 for an illustration. If we write Eq. (29) out element-wise, the result of the i-th neuron is the i-th element of the output vector, h^1_i, and each neuron sums over all input elements of h^0 with a bias b_i:

h^1_i = ϕ( Σ_j C_{i,j} h^0_j + b_i ).   (30)

The network type is essentially defined by the linear mappings in each layer, i.e. by the structure of the matrix C.

iii In the literature, the term 'layer' is used somewhat ambiguously. Here, it will be used to refer to both an operation and its output, not just the output. One exception is the input layer, which refers to just the input data with no prior operation.
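Eq. (29) in code is a one-liner. A minimal numpy sketch of a single fully connected layer, using a ReLU nonlinearity as ϕ (all sizes and weights are illustrative):

```python
import numpy as np

def layer(h, C, b, phi=lambda x: np.maximum(x, 0.0)):
    """One network layer, Eq. (29): h1 = phi(C h + b), with ReLU as phi."""
    return phi(C @ h + b)

rng = np.random.default_rng(3)
J, I = 4, 3                       # input and output dimensions
C = rng.standard_normal((I, J))   # learnable weight matrix
b = rng.standard_normal(I)        # learnable bias vector
h0 = rng.standard_normal(J)       # input vector

h1 = layer(h0, C, b)              # output of the layer, shape (3,)
```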

Fully connected layers
The basic choice for C is a dense matrix, which gives a fully connected layer, as all the inputs are related to all the outputs. We then obtain a simple L-layered network by the composition of L fully connected layers L_l, for l = 1, . . . , L:

Λ_θ = L_L ∘ L_{L−1} ∘ · · · ∘ L_1.

For example, if we write this out for a two-layer network, we get the relation

h^2 = Λ_θ(h^0) = ϕ( C_2 ϕ( C_1 h^0 + b_1 ) + b_2 ).

In the general case, the trainable parameter set θ of the network Λ_θ is given by the matrices and the bias vectors, that is θ = {C_l, b_l} for l = 1, . . . , L. This basic network architecture, consisting of multiple fully connected layers, is the basis for many deep neural networks. When using fully connected layers in imaging applications, the input, either an image or other measured signals, must be reshaped into a vector for the input layer. If one then aims to extract some relevant low-dimensional information from the input, the dimensions of successive layers will be gradually decreased until the desired output dimensions are reached. An often-used network architecture worth mentioning in this class is termed an autoencoder. Here, the input is first encoded using a contracting path to extract a low-dimensional representation of relevant features and then subsequently decoded using an expanding path to represent a clean version of the input signal. Input h^0 and output h^L typically have the same or similar dimensions.
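A toy autoencoder-shaped stack of fully connected layers, contracting and then expanding, can be composed directly from the layer definition above. The dimensions and (random) weights below are arbitrary, and no training is performed; the point is only the shape of the composition:

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
rng = np.random.default_rng(4)

def dense(h, C, b):
    """One fully connected layer: phi(C h + b) with ReLU."""
    return relu(C @ h + b)

# Contracting path 16 -> 8 -> 4, expanding path 4 -> 8 -> 16
dims = [16, 8, 4, 8, 16]
params = [(rng.standard_normal((o, i)) * 0.5, np.zeros(o))
          for i, o in zip(dims[:-1], dims[1:])]

h = rng.standard_normal(16)          # input signal
code = h
for C, b in params[:2]:
    code = dense(code, C, b)         # low-dimensional representation, shape (4,)
out = code
for C, b in params[2:]:
    out = dense(out, C, b)           # output with the same shape as the input
```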

Convolutional neural networks
Often the values in image pixels or voxels are related in some way to those in neighbouring pixels or voxels, e.g. both may be part of the same image feature. For applications of DL to imaging, therefore, it seems wise to take spatial relations, and especially local relations, into account. A fully connected layer does not explicitly maintain these spatial relations, as all inputs are connected to all outputs without reference to their respective spatial positions. In other words, the linear mapping C in Eq. (29) does not have any pre-determined structure. It is possible, however, to think of structures for C that do retain spatial information and can use local features in the input, such as edges, to encode such features more efficiently in the output. Convolutions, especially with small filters - 3 × 3, say - are a popular and very successful choice for such operations, as they are translation equivariant iv and hence encode the same local features under translation of the image, and are agnostic to the image size. Additionally, localised filters have the advantage of leading to linear mappings with sparse structure that can be implemented efficiently without an explicit matrix representation. In this case, instead of learning the whole matrix C, one needs to learn only the filter coefficients. Usually multiple such filters are used, each one referred to as a 'channel' here. Networks using this idea are called convolutional neural networks, or CNNs.
Consider an application to imaging in R². The input is either a single- or multichannel image h^0 ∈ R^{n×n×J}, where j ∈ {1, . . . , J} indexes the input channels, and similarly the output is an image h^1 ∈ R^{n×n×I}, where i ∈ {1, . . . , I} indexes the output channels. (These images are square, but this can straightforwardly be extended to nonsquare images.) The affine linear mapping is then defined by a set of I filters ω_i, each with one component ω_{i,j} per input channel, and biases b ∈ R^I, where each output channel has one bias. The convolutional layer that maps between the two multichannel images h^0 and h^1 is then defined for each output channel as

h^1_i = ϕ( Σ_j ω_{i,j} * h^0_j + b_i ),   (33)

where * denotes a discrete convolution (see Fig. 4). The set of parameters in this case is given by the coefficients of the filters ω_{i,j} and the biases b_i. Each output channel has one scalar bias, and each input-output channel pair is connected by one specific filter. Thus we could consider an analogy here: in a CNN, each channel in the convolutional layer, Eq. (33), acts similarly to a neuron in a fully connected layer, Eq. (30), with the filter components ω_{i,j} analogous to the point-wise weights C_{i,j}; compare Figs. 3 and 4. We also note that the convolutional layer, Eq. (33), could be written in the general form of Eq. (29) by vectorising the input and representing the convolution as a (sparse) matrix. An important feature of CNNs is the fact that each layer maps between multichannel images of the same (or similar) resolution. v Therefore, a CNN is a natural choice to represent data-to-data or image-to-image mappings, rather than mappings between spaces with different dimensions, such as data-to-image. Nevertheless, in many applications there are reasons why it might be desirable to downsample input images during the processing (e.g. memory constraints, sparser representation, wider receptive field) and hence many architectures include downsampling operations, called pooling layers, which reduce the image resolution using mean or maximum filters, for instance. This will become clearer in the following section on network architectures, specifically in Sec. 4.3.2. By combining convolutional and fully connected layers, we can define the majority of network architectures that are used in the literature for image reconstruction. The specific networks depend on the task for which they are employed, and hence we will discuss the particular architectures later in this section. Let us now focus on how the network parameters are learned.

iv We say that a function is translation equivariant if translating the input and then applying the function is equivalent to applying the function followed by translation of the output.
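A direct numpy sketch of the convolutional layer of Eq. (33), with J input and I output channels and 3 × 3 filters, using zero-padded 'same' convolution so the resolution is preserved (all sizes and weights are illustrative; the helper implements cross-correlation, as is conventional in DL frameworks):

```python
import numpy as np

def conv2d_same(img, k):
    """Zero-padded 'same' 2D cross-correlation of img (n x n) with a 3x3 kernel k."""
    n = img.shape[0]
    p = np.pad(img, 1)
    out = np.zeros_like(img)
    for a in range(3):
        for b in range(3):
            out += k[a, b] * p[a:a + n, b:b + n]
    return out

def conv_layer(h0, w, b, phi=lambda x: np.maximum(x, 0.0)):
    """Eq. (33): h1_i = phi(sum_j w[i, j] * h0_j + b_i)."""
    I, J = w.shape[:2]
    return np.stack([phi(sum(conv2d_same(h0[j], w[i, j]) for j in range(J)) + b[i])
                     for i in range(I)])

rng = np.random.default_rng(5)
h0 = rng.standard_normal((2, 8, 8))      # J = 2 input channels, 8 x 8 image
w = rng.standard_normal((4, 2, 3, 3))    # I = 4 output channels, 3 x 3 filters
b = np.zeros(4)
h1 = conv_layer(h0, w, b)                # shape (4, 8, 8): resolution preserved
```

Note how few parameters are involved: 4 × 2 × 9 filter coefficients plus 4 biases, independent of the image size, in contrast to a dense layer between the same spaces.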

The learning task
After defining the network architecture, the parameters of the network need to be determined. This is done by learning them from a set of training data. Before this can be done, we need to define the actual learning task that will determine the network's mapping properties. That is, we want to train the network to perform a specific task, such as reconstructing or denoising an image, or, in other applications, segmentation or classification. More precisely, given a network Λ_θ, we need to find an optimal set of parameters θ*, such that our network fulfils the desired mapping property, i.e. it does what we want. The training of the network is nothing more than an optimisation problem to find the optimal set of parameters θ*, which can be formulated in various ways, as we will summarise shortly. Specifically, we will consider the reconstruction task of recovering the initial pressure f from the measurement data g, given the parametrised reconstruction operator A†_θ, such that Eq. (28) is fulfilled.
Supervised training The first idea that comes to mind is to minimise a distance function between the desired output - the known ground-truth - and the actual output of the network. This leads to supervised training, in which the optimisation problem is formulated with knowledge of a desired ground-truth in order to find the parameters of A†_θ. For the optimisation we need pairs of measurement data g_i and corresponding ground-truth images f_i; the set of pairs {(g_i, f_i)}, i = 1, . . . , N, is called the training set. Next we need to define how to measure the closeness of the resulting reconstruction to the ground-truth. For that purpose one typically formulates a loss function in the L_p-norm, such as

ℓ(θ) = Σ_{i=1}^N ‖A†_θ(g_i) − f_i‖_p^p.   (34)

The learning task is to find an optimal set of parameters θ* in the space of possible parameters Θ that minimises Eq. (34) with respect to the given training set:

θ* = argmin_{θ∈Θ} ℓ(θ).   (35)

In fact, one is not limited to loss functions of the form of Eq. (34), and depending on the learning task other, more suitable, choices can be made. Additionally, one can add regularisation terms to the loss function, either on the output of the network or even on the parameters, for instance requiring sparsity by minimising the L_1-norm ‖θ‖₁, where the 1-norm here acts element-wise as usual.
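As a minimal illustration of Eqs. (34)-(35), the sketch below trains the simplest possible "network", a single linear layer B, to invert a toy forward matrix A by full-batch gradient descent on the squared L_2 loss. All sizes and hyperparameters are invented for the example; a real learned reconstruction would use a deep network and a DL framework:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((30, 20)) / np.sqrt(30)   # toy forward operator
F = rng.standard_normal((20, 100))                # 100 ground-truth "images" f_i (columns)
G = A @ F                                          # corresponding measurement data g_i

# "Network": a single linear layer B mapping data to image space,
# trained to minimise the loss  sum_i ||B g_i - f_i||_2^2
B = np.zeros((20, 30))
lr = 0.002
for _ in range(20000):
    R = B @ G - F              # residuals over the whole training set
    B -= lr * (R @ G.T)        # gradient step (the factor 2 is absorbed into lr)
```

Replacing the full-batch residual with the residual over a random subset of the columns in each step would turn this into the stochastic gradient method described below.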
Finding a set of optimal parameters θ* as formulated in Eq. (35) leads to an optimisation problem and hence can be solved with suitable optimisation techniques. In DL, gradient-based methods are typically used, where the gradients for the update are computed via backpropagation. 75,76 The most common optimisation strategies are stochastic gradient methods, where the stochasticity refers to randomisation in the choice of the subset of training samples (batches) used in each update, such as the popular adaptive moments estimation algorithm Adam. 77

Alternative training regimes Although the majority of learned image reconstruction approaches applied to PAT to date have been fully supervised, one current direction within the DL community is the investigation of possible alternative training regimes. In particular, these are concerned with cases in which only a small number of input and ground-truth pairs are available. Such approaches are typically referred to as semi-supervised or self-supervised training. These developments will not be covered extensively in this review, but we will discuss in the conclusions some possible directions for moving away from fully supervised training. Roughly speaking, what these approaches have in common is that instead of requiring closeness to a known ground-truth for all data pairs, we define an auxiliary measure of the goodness of reconstructions. For instance, one could think of a data consistency term, ‖A A†_θ(g) − g‖²₂, that is used in a similar way to the concept of cycle consistency in the computer vision community. 78 Related directions use the concept of adversarial networks, in which a discriminator is used to evaluate how well reconstructions resemble 'realistic' ones during the training procedure.
In summary, regardless of the chosen training regime, defining the learning task leads to an optimisation problem, where we aim to find an optimal set of parameters for the network architecture with respect to a chosen measure and training set.

Architectures for Learned Reconstruction
The reconstruction task in PAT can be addressed in various ways, as outlined in Sec. 3, and since learning-based reconstruction algorithms are often inspired by these classical methods there is a wide range of possible approaches. In an attempt to classify learned reconstructions, we could divide the possible approaches into three classes by the number of times the physical model, the forward operator A or a related operator, is involved in the reconstruction process: never, once, and multiple times. Four common strategies that are directly related to classical schemes are illustrated schematically in Fig. 5. (The middle two strategies in Fig. 5 fall into the same class in this classification.) In the following, we discuss these three classes of approach on a conceptual level, giving one example of a standard architecture for each. As mentioned above, we will concentrate here on the acoustic reconstruction problem, Eq. (28); extensions and applications to the optical reconstruction problem will be discussed in the literature review in Sec. 5.6.

The fully-learned approach
In the fully-learned approach, the whole learned reconstruction operator A†_θ is given by one network architecture, i.e. A†_θ = Λ_θ,
where Λ_θ : Y → X_f. At first sight, such fully-learned approaches seem promising, as they eliminate the need for a potentially expensive reconstruction operator. However, the 'no free lunch' concept applies here, as this improved reconstruction speed comes with a major limitation, which will be discussed below. First, though, we discuss the potential advantages. The forward operator A : X_f → Y is non-local in nature. For instance, a point source f ∈ X_f has a spatially global effect on the measurement data g ∈ Y (although it is localised in time). Similar non-locality of data-image relations is observed in most tomographic inverse problems. A fully connected layer has coefficients connecting each input to each output, and they can all be different, so it can cope with non-locality in the data-image relation, and can represent any linear mapping. In particular, the linear forward operator A could be learned by a fully connected network. (It could even be learned by one fully connected layer with no nonlinearity, although that would just be A represented as a dense matrix, which could be computed directly rather than learned.) Also, an inverse mapping such as the backprojection A^# in Eq. (14) can be learned by a dense layer, and in particular by a composition of dense layers with nonlinearities. In a CNN, on the other hand, a layer acts only locally, meaning an output pixel is only related to nearby input pixels. A fully connected network might therefore seem, at first glance, a better choice than a CNN for this task.
(Although some ability to learn non-localities can be regained by using multi-scale CNNs such as the U-Net, as described below.) Another potential advantage of a fully-learned approach, depending on the particular architecture, is that it can provide reconstructions quickly, with low latency, as no explicit model evaluation is required. The use of a fully connected network, however, has a major limitation similar to the problem faced by matrix representations of operators for high-dimensional problems, in that we need to learn a dense matrix of size M × T, where M is the total number of pixels, or voxels, and T is the product of the number of detectors and the number of sampling points in time. Consider, for example, a three-dimensional setting with m × m × m = M voxels and m × m × t = T measurement points, where m = 64 and t = 128. A single dense layer mapping between data and image space, stored in single precision (32 bit), would then occupy ∼500 GB. Reasonable applications of this approach are thus limited in practice to two-dimensional problems. Also, the large number of learnable parameters necessitates a large training set to avoid overfitting to the training samples. Additionally, as the fully connected layer associates each point in the input with every output node, the trained network depends on having consistent dimensions in the data space as well as the image space, and hence on the acquisition geometry. For PAT this means that a separate network needs to be trained if the measurement setup changes, such as the number or location of the sensors or the time-sampling points, or if there is a change in the sound speed distribution.
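The memory estimate can be checked in a few lines; the 64³ image and 64 × 64 × 128 data sizes are the example values from the text:

```python
# Memory required for a dense layer mapping PAT data to a 3D image
m, t = 64, 128
M = m * m * m             # number of voxels: 262144
T = m * m * t             # number of data samples: 524288
bytes_needed = M * T * 4  # single precision, 4 bytes per weight
print(bytes_needed / 2**30, "GiB")   # 512 GiB, i.e. ~500 GB
```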
One could append a small CNN to the fully connected layers to exploit spatial features in their output and produce the final reconstructions. This thought leads to the architecture known as AUTOMAP, 11 originally devised for magnetic resonance imaging. A version of this kind of network is shown in Fig. 6. In our case, the input to the network is given by the time series of measured acoustic pressure, which must be flattened, i.e. reshaped into a vector, before being passed to the fully connected layers. The vector output of the fully connected layers is reshaped into an image and post-processed by a small convolutional network to produce the final output (Fig. 6). This is just one architecture that uses fully connected layers, and there are many variations on this theme, some of which are discussed in Sec. 5.3.1. This network can be thought of as a learned regularised backprojection operator (the fully connected layers) followed by a post-processing network to improve the image (the convolutional layers). This way of seeing the network leads us directly to the next approach, in which a classical backprojection operation is first performed with knowledge of the physical model and the network acts on the output in image space.

Reconstruction and post-processing
A major limitation of the fully-learned approach is its inflexibility with regard to the acquisition geometry and acoustic properties, i.e. each network is specific to a fixed arrangement of the detectors and a fixed sound speed. This can be overcome by using an explicitly model-based (classical) reconstruction from the measured time series to image space as an initial step. This allows for potentially higher image resolutions to be used, as the memory burden of the fully connected layers has been removed, and potentially facilitates efficient initial reconstructions using approximate and computationally cheaper models. In other words, if we substitute the fully connected part in Fig. 6 with an explicitly known reconstruction operator A†, we arrive at the approach of an initial analytical reconstruction followed by a learned post-processing step. More precisely, let A† : Y → X_f be an analytically known reconstruction operator that is ideally robust (small changes in the input give small changes in the output). For example, A† could be A^# or A* or another approximation to A^{−1}. Then one can train a CNN to remove the reconstruction artifacts that arise from using A†. 7,8,79 In the PAT case these artifacts can range from blurred-out edges and noise to more severe undersampling and limited-view artifacts. The learned inverse mapping is now given as

A†_θ = Λ_θ ∘ A†,

where the network Λ_θ : X_f → X_f maps image space to itself. The main advantage of this approach lies in the analytical knowledge of the reconstruction operator, so the network can be designed to focus instead on exploiting the structure in the reconstruction artifacts in order to remove them. Computationally, the evaluation time of the neural network is usually negligible, and reconstruction times are typically limited by the complexity of the reconstruction operator.
It is important to note that the evaluation of the reconstruction operator can be decoupled from the training process and used only when creating the training data; hence this approach is also advantageous in the training phase if the reconstruction operator is expensive to evaluate. For learned post-processing, one typically employs a high-capacity and particularly expressive network, i.e. one with many layers and learnable parameters, capable of learning complicated image priors. The most prominent architectures for this application are based on the U-Net, 80 which can be roughly described as a multi-scale convolutional autoencoder. More precisely, instead of applying convolutions only on the full resolution image, the network includes down-sampling layers that reduce the image size in order to extract larger spatial features. The extracted coarse features are then subsequently upsampled to construct the final image. Intuitively, this process can be related to the principle of multi-resolution analysis, such as the wavelet decomposition, 81,82 where the input image is decomposed into a fine-to-coarse basis. For image reconstruction tasks, instead of passing the reconstructed image f_0 = A†g directly through a network to produce the output image, the learning task is typically reformulated as a residual problem in which a correction to the initial image is learned. This is motivated by the notion that the network can be used to identify noise and artifacts to remove from the image. Such networks are often referred to as residual networks, such as a residual U-Net, 8 for which a basic architecture (with 3 scales vi ) is illustrated in Fig. 7. In the encoder part of the network, the left side, each scale employs two convolutional layers, followed by a downsampling by a factor of 2. This downsampling is done by a max-pooling operation, which takes the maximum value in each 2 × 2 window and so halves the image size in each dimension.
The numbers on top of each bar indicate the number of channels; as can be seen, the number of channels is increased as the resolution is decreased. For the decoder part, we follow a similar approach of using two convolutions in each scale followed by an upsampling by a factor of two with a transposed convolutional layer. The final result is then added to the input via the residual connection. A particular design choice in the U-Net is the use of skip connections that connect the encoder and decoder parts at each scale by a concatenation. The reason for using these skip connections is two-fold. Computationally, they stabilise the training procedure by avoiding the problem of vanishing gradients in very deep networks. Additionally, the skip connections help to preserve the finer structures in the higher resolution scales. It is interesting that even though CNNs are translation equivariant, the U-Net is able to learn local dependencies due to the decomposition into the coarser scales and the resulting large receptive field. Thus the post-processing approach proves powerful even in applications with strong local dependencies, such as limited-view problems. (vi The classic U-Net architecture 80 has 5 scales; we use fewer here due to the small image size in the experiments.) On the downside, such large-capacity networks tend to overfit when training data is scarce, although they still need considerably fewer training samples than the fully-learned approach. We will discuss this further in the experimental part in Sec. 4.5. Additionally, the output depends solely on the quality of the initial reconstruction and the capability of the network to correct for its shortcomings. Therefore, without further modifications, we cannot guarantee that the reconstructed image is consistent with the data; one possibility to overcome this will be discussed next.
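Two of the ingredients described above, the 2 × 2 max-pooling used for downsampling and the residual correction of an initial reconstruction, can be sketched in a few lines of NumPy. The `toy_denoiser` below is a hypothetical stand-in for a trained CNN, not part of any published architecture:

```python
import numpy as np

def max_pool_2x2(img):
    """Downsample by a factor of 2, taking the max over each 2x2 window."""
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def toy_denoiser(f0):
    # Hypothetical stand-in "network": predicts the negative of a crude
    # noise estimate, so the residual correction removes that noise.
    return -(f0 - np.round(f0))

f0 = np.array([[0.9, 1.1], [0.05, -0.1]])  # noisy initial reconstruction
f = f0 + toy_denoiser(f0)                  # residual correction f = f0 + net(f0)
```

In the real residual U-Net the correction is learned from training pairs; the pooling operation is what lets the network see large-scale, non-local artifact structure.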

Model-based learned iterative reconstruction
In order to improve the data consistency of the reconstructions, one can use the forward operator multiple times in the reconstruction procedure, not only for an initial reconstruction. We call such approaches learned iterative schemes, as neural networks are interlaced with evaluations of the forward operator A, its adjoint A * , and possibly other hand-crafted, explicitly known operators. Typically, such learned iterative schemes outperform other learned reconstruction approaches in reconstruction quality, 9, 10, 18, 83 but come with a higher computational complexity. They also allow for the use of smaller networks, as the reduced network capacity is compensated for by providing more informative inputs to the network. We will introduce the concept with a simple learned gradient-like scheme. 10,84 For instance, minimising the data consistency term D(f ; g) = ½ ‖A f − g‖₂² in a gradient descent scheme, as in Eq. (22), could be formulated as a network with the updates f (n+1) = f (n) − θ A * (A f (n) − g). (40) Comparing with Eq. (21), we see that the only learnable parameter of this network is the step length θ ∈ R. Extending this idea, we can devise a learned gradient scheme by using a CNN Λ θ : X f × X f → X f to compute the update in Eq. (40), and iterate the process such that f (n+1) = f (n) + Λ θ n (f (n) , ∇ f D(f (n) ; g)). (41) Here, each network Λ θ n has its own set of parameters. The iterative process in Eq. (41) then defines a reconstruction operator when stopped after N iterates, A † θ (g) = f (N ) , with an initialisation such as the adjoint of the measurement data, f (0) = A * g. The initialisation is essential in this approach, as it maps from the data space Y to the image space X f , whereas the networks only map from X f to X f . Each network Λ θ n−1 is a learned updating operator for the n th iterate, and we can see a conceptual similarity of Eq. (41) to the proximal gradient descent scheme in Eq. (24), which suggests interpreting the learned updating operator as similar to a proximal mapping.
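As a minimal illustration of the gradient-descent update with a single scalar step length, the following NumPy sketch runs the iteration on a toy matrix A (a random stand-in, not a real PAT operator); in the learned scheme this fixed step would become a trainable parameter:

```python
import numpy as np

# Gradient descent on D(f; g) = 0.5 * ||A f - g||^2 with a single step
# length theta, i.e. the update f <- f - theta * A^T (A f - g).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))          # toy forward matrix (stand-in)
f_true = rng.standard_normal(4)
g = A @ f_true                           # noiseless toy measurements

theta = 1.0 / np.linalg.norm(A, 2) ** 2  # step length small enough to converge
f = A.T @ g                              # initialisation f(0) = A* g
for _ in range(5000):
    f = f - theta * (A.T @ (A @ f - g))  # gradient of the data term
```

With full column rank and noiseless data this converges to the true image; with incomplete PAT data it does not, which is exactly where learned updates can compensate.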
Such learned iterative approaches are also known as model-based learned reconstructions, as the learned reconstruction operator A † θ repeatedly uses the explicit forward and adjoint operators. Clearly, this leads to an increased complexity depending on the number of operator evaluations required, but the additional knowledge supplied to the networks allows the use of smaller architectures to achieve similar, or even superior, reconstruction quality compared to the previously discussed approaches.
A basic network architecture for this task is illustrated in Fig. 8. The whole unrolled reconstruction is shown, for two iterations, in Fig. 8 (Top), and the architecture of the residual blocks, based on the residual network ResNet, 85 is shown in Fig. 8 (Bottom). At each iteration, the current reconstruction, f (n) , and the corresponding gradient of the data consistency term, ∇ f D(f (n) ; g) = A * (A f (n) − g), are concatenated into a two-channel input to the learned updating operator Λ θ n . This input layer is followed by 3 convolutional layers with 32 channels, 3 × 3 filters, ReLU nonlinearity, and bias. The final layer (3 × 3, linear, no bias) reduces the 32 channels to a single residual update that is added to f (n) , realising the learned update in Eq. (41). The last convolutional layer does not use a nonlinearity because the residual updates need to be able to be both positive and negative. After each residual block, the intermediate result f (n+1) is used to compute the new gradient ∇ f D(f (n+1) ; g), which is then passed on to the next residual block. By using smaller networks than in the post-processing approach, with both the current iterate and the gradient information as inputs, the networks rely less on prior knowledge from the training data and instead learn a desirable combination of both inputs. In fact, the gradient of the data consistency term contains information on where the image needs improvement to fit the observed data. Just as importantly, smaller networks are less prone to overfitting and so require less training data. This aspect is further emphasised by a recent study showing that using explicitly known operators in the network architecture does indeed reduce the training error. 86 As the operator is used repeatedly in the reconstruction process, this also allows for some flexibility in the acquisition geometries to which the trained network can be applied.
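The unrolled structure can be sketched as follows. Here `learned_update` is a hypothetical stand-in that simply scales the negative gradient; in the real scheme it is a small CNN applied to the two-channel input (f (n) , ∇ f D):

```python
import numpy as np

# Unrolled sketch of the learned iterative scheme:
#   f(n+1) = f(n) + Lambda_n(f(n), grad D(f(n); g)).
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))          # toy forward matrix (stand-in)
f_true = rng.standard_normal(4)
g = A @ f_true

def learned_update(f_n, grad, theta_n):
    # Two-channel input (iterate, gradient) -> residual update.
    # A trained CNN would combine both inputs; this stand-in uses only grad.
    return -theta_n * grad

f = A.T @ g                              # initialisation f(0) = A* g
step = 1.0 / np.linalg.norm(A, 2) ** 2   # one "parameter set" per iterate
residuals = [np.linalg.norm(A @ f - g)]
for n in range(8):                       # N = 8 unrolled iterations
    grad = A.T @ (A @ f - g)             # gradient of the data term
    f = f + learned_update(f, grad, step)
    residuals.append(np.linalg.norm(A @ f - g))
```

Each unrolled block sees a freshly computed gradient, which is what repeatedly injects the physics of the forward operator into the reconstruction.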
Many extensions of the basic learned gradient scheme have been proposed in the literature; these, and their applications, will be discussed in Sec. 5.4.1.

Generating Training Data
The training set defines the features learned by the network. In effect, it specifies the probability distribution describing our images of interest, in other words the prior of possible images π(f ) in the Bayesian framework, Eq. (12). This directly addresses the first point raised in Sec. 2.2.3 of how to learn a better prior. For many biomedical applications it is difficult to handcraft informative priors that represent the structures of interest, e.g. blood vessels, and so the alternative approach of learning a prior from a set of sample images is appealing. The choice of the training data set then becomes highly important, as it defines the prior distribution. A suitable training set will have two primary features: good representation of the relevant structures, and enough variety to represent the image distribution. In established medical imaging modalities, such as magnetic resonance imaging, one can use large databases of highly sampled gold-standard reconstructions as ground-truth images. In PAT, such a database is not currently available and, furthermore, many scanner geometries are not able, even in principle, to collect complete data. There are therefore essentially two options for creating a training set: simulate synthetic data as realistically as possible, or take a high-quality (albeit imperfect) reconstruction obtained with a classical inversion method as the ground truth. It is important to emphasise that the training set, together with the training regime, determines the reconstruction quality one can expect. For instance, in a fully supervised setting with only reconstructions from classical inversion methods as the ground truth, the network cannot be expected to provide better reconstruction quality than the classical approach, although it may be able to compute the images more quickly.
On the other hand, if augmented training sets or semi-supervised approaches are employed, more complicated priors might be learned and classical methods may be outperformed in reconstruction quality.
Synthetic training data The first step in the creation of synthetic training data is to define the ground-truth images f i ∈ X f for i = 1, . . . , I. From these ground-truth images we can then simulate the corresponding synthetic measurement data g i ∈ Y according to Eq. (9); note that this includes the simulation of measurement noise. The pairs of ground-truth image and synthetic measurement data then define the training set {(g i , f i )} I i=1 . In PAT we are often interested in imaging vasculature, so we need a way to create a large enough set of images with relevant vessel structures. A standard way to obtain such structures is to use other imaging modalities that provide images or volumes of vessels and then to segment them to extract the relevant vessels as a ground-truth image. For the following experiments we designed two datasets from different image databases. The first set was created from a set of lung CT scans vii via vessel segmentation and projection to two dimensions, and the second set was taken from retina scans viii with a segmentation provided. As Fig. 9 shows, these two sets have very different characteristics, one having piece-wise constant features and the other smoother features; in other words, the prior distributions π(f ) are different. As we will see, this difference has a major impact on the reconstructions obtained, depending on how the two training sets are used in training and testing. (As an alternative to segmenting vessel structures from other modalities, one could imagine creating ground-truth images by, for example, using vessel-growing algorithms 87 to create a large set of synthetic training data.)
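The creation of synthetic training pairs can be sketched as follows, with the noise level set to 1% of the maximum measurement amplitude as used later in the experiments. The forward matrix and ground-truth images here are random stand-ins:

```python
import numpy as np

# Synthetic training-pair creation: g_i = A f_i + eps_i, where eps_i is
# Gaussian noise with standard deviation 1% of the max measurement amplitude.
rng = np.random.default_rng(2)
A = rng.standard_normal((50, 16))               # toy forward matrix (stand-in)
images = [rng.random(16) for _ in range(10)]    # toy ground-truth images f_i

training_set = []
for f_i in images:
    clean = A @ f_i
    sigma = 0.01 * np.abs(clean).max()          # 1% of the max amplitude
    g_i = clean + sigma * rng.standard_normal(clean.shape)
    training_set.append((g_i, f_i))             # one supervised training pair
```

In the real pipeline the matrix-vector product is replaced by a full acoustic simulation, but the structure of the loop is the same.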

Fig 9 Example phantoms: piece-wise phantom; smooth phantom.

Experimental training data An alternative to synthetic training data is to use measured data for the creation of the training set, i.e. start with measurement data g i ∈ Y and create a reference reconstruction f i ∈ X f . In this scenario, ideally one would have complete measurement data available that can be used to create a high-quality reference reconstruction, for instance with a variational approach, Sec. 3.1.4. One can then either train a network on the pairs (g i , f i ) to speed up reconstruction times, or retrospectively undersample the measurement data to obtain g̃ i and train with pairs (g̃ i , f i ) to improve reconstructions from undersampled measurements. In our experience, in applications to real measurement data it is essential to include some experimental data in the training procedure, as structures and noise can vary significantly between synthetic and experimental data.

Transfer training A third option is to combine synthetic and experimental data. This is usually a good idea if one does not have sufficient measurement data available. Here, one can exploit a concept known as transfer training or update training; we refer the reader to two discussions on the topic. 88,89 In our case, the underlying idea is to perform pre-training with a large set of synthetic training data that represents a good prior for the targets we are interested in. Then, after this first training phase on the synthetic data, we update the obtained parameters with a shorter training on the limited set of available measurement data. This fine-tunes the network parameters to the characteristics of the experimental data, for instance by adjusting thresholding behaviour to the noise level. This update is typically done with a reduced learning rate. Sometimes, rather than updating all parameters of the network, the majority are fixed and only the first and/or last layers are updated with the new data.
Finally, we remark that self-supervised training regimes, as mentioned in Sec. 4.2.4, might be promising in the transition to experimental measurement data, although this area has not yet been widely explored.
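The pre-train-then-fine-tune idea can be illustrated on a linear least-squares toy problem: long training on plentiful "synthetic" pairs, then a short update with a reduced learning rate on a few "experimental" pairs. All quantities here are random stand-ins for illustration only:

```python
import numpy as np

# Transfer (update) training sketch on a toy linear model y = X w.
rng = np.random.default_rng(3)
w_true = np.array([2.0, -1.0])

def make_pairs(n, noise):
    X = rng.standard_normal((n, 2))
    y = X @ w_true + noise * rng.standard_normal(n)
    return X, y

def gradient_descent(w, X, y, lr, steps):
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)   # mean-squared-error gradient
    return w

w = np.zeros(2)
X_syn, y_syn = make_pairs(500, noise=0.01)
w = gradient_descent(w, X_syn, y_syn, lr=1e-1, steps=500)   # pre-training
X_exp, y_exp = make_pairs(20, noise=0.05)
w = gradient_descent(w, X_exp, y_exp, lr=1e-2, steps=100)   # fine-tune, lower lr
```

The reduced learning rate in the second phase is what keeps the small experimental set from overwriting what was learned in pre-training; freezing all but the first/last layers is the network analogue of updating only some components of w.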

Comparison of Learned Image Reconstruction Approaches
In this section, we use the synthetic data sets introduced above to examine the performance of the three learned image reconstruction approaches described in Sec. 4.3 with respect to accuracy and robustness. We consider a 2D limited-view scenario with a line detector at the top of the domain, as illustrated in Fig. 10. For simplicity, we create a matrix representation of the acoustic forward model, as described in Sec. 3.1.5, by sampling the forward operator with k-Wave. 90 Since this section serves in part as a tutorial, we describe in detail the individual steps required to set up the experiments. Experimental design Here we describe the steps necessary to train and evaluate the reconstruction-and-post-processing approach outlined in Sec. 4.3.2. This will cover all the concepts needed to set up the examples for the other learned reconstruction approaches.
1. Data acquisition geometry and definition of the forward operator: The essential first step is the definition of the imaging setup under consideration, which also defines the forward operator A. Here we chose a limited-view planar acquisition geometry in a two-dimensional domain, see Fig. 10, and we use k-Wave 90 for the PA time series simulation in MATLAB. For flexibility in the training data creation and reconstruction, we use a matrix representation of the forward operator by sampling each pixel in the image domain; this is done in the script: createForwMat.m The resulting matrix is saved to disk ix so it can be loaded within the learning framework in Python for data creation and reconstruction in the following.
2. Training data creation: First we choose the set of ground-truth images {f i } I i=1 we want to use for the training, i.e. one or both of the sets described above and shown in Fig. 9. Then we create the corresponding synthetic measurement data by using the matrix form of A to compute g i = A f i + ε i , where ε i is normally distributed noise with standard deviation equal to 1% of the maximum measurement amplitude. We then create the initial reconstruction using the adjoint, f rec i = A * g i ; recall that in the matrix representation, the adjoint corresponds simply to the transpose. The training set {(f rec i , f i )} I i=1 for supervised training has now been generated; test and validation sets can be created in the same way. These preparations are done in the script: callNetwork.py For the other experiments, with the fully-learned approach and the learned iterative scheme, we can use the generated measurement data directly as input, and hence the training set is given by {(g i , f i )} I i=1 .
3. Network selection and training regime: Given the training set, we can now set up the network and define the training regime. All the relevant functions can be imported from the supplied script: PATnets.py For this case we choose a classic residual U-Net x for the network Λ θ and, as a loss function for supervised training, the classic squared ℓ₂-norm, so that the optimisation problem reads θ * = arg min θ Σ I i=1 ‖Λ θ (f rec i ) − f i ‖₂². The optimisation is performed with the Adam algorithm, initial learning rate 10⁻⁴, batch size 4, and a total of 5 × 10⁴ iterations.
4. Evaluation: The trained parameters θ * can be saved for later evaluation, or a direct evaluation can be performed. For evaluation, we load the test set and record the average reconstruction quality.
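The supervised objective in step 3 can be illustrated on a toy linear version: the "network" is a single learnable matrix acting on adjoint reconstructions, plain gradient descent stands in for Adam, and all quantities are random stand-ins:

```python
import numpy as np

# Toy version of supervised training: minimise sum_i ||W f_rec_i - f_i||^2
# over a learnable matrix W (a stand-in for the residual U-Net).
rng = np.random.default_rng(4)
A = rng.standard_normal((30, 8))        # toy forward matrix (stand-in)
F = rng.standard_normal((8, 40))        # 40 toy ground-truth images as columns
G = A @ F                               # noiseless measurement data
F_rec = A.T @ G                         # adjoint (= transpose) reconstructions

W = np.eye(8)                           # learnable parameters "theta"
lr = 1.0 / np.linalg.norm(F_rec, 2) ** 2
initial_err = np.linalg.norm(W @ F_rec - F)
for _ in range(3000):                   # gradient descent instead of Adam
    W = W - lr * (W @ F_rec - F) @ F_rec.T   # gradient of the squared l2 loss
final_err = np.linalg.norm(W @ F_rec - F)
```

The real training loop has the same shape, with W replaced by a nonlinear network, the gradient supplied by backpropagation, and mini-batches of size 4 fed to Adam.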
Reconstructions: robustness and generalisation In this section we evaluate the three reconstruction approaches described in Sec. 4.3, fully-learned, post-processing, and learned iterative reconstruction, following the four-step process outlined above. In particular, we examine how these methods compare in generalisability with respect to changes in the data sets they are trained on. For that purpose we consider three scenarios for the training and test sets: i.) Consistent sets: Trained on the retina data (Fig. 9b), 1000 samples, and tested on a separate but consistent test set from the retina data, 151 samples. This corresponds to a scenario where the priors are the same, π test (f ) = π train (f ).
ii.) Different test set: Trained on the retina data (Fig. 9b) and tested on the lung CT set (Fig. 9a), 151 samples. This corresponds to a scenario where the priors are different, π test (f ) ≠ π train (f ).
iii.) Combined set: Trained on both data sets with a total of 3760 samples and tested on a separate combined test set with 308 samples. Here the priors are consistent, but more complicated than in (i). We emphasise that this training set is also larger.
We trained the three reconstruction approaches for each scenario using the same training regime, as outlined above, with minor tuning as necessary to ensure the parameters are close to optimal. This ensured the results were representative and allowed useful conclusions to be drawn. Nevertheless, as we will see below, not all of these architectures are conceptually the right choice for the scenarios under consideration, and it was not possible to improve their performance significantly through further parameter tuning. Note that the fully-learned approach uses an ℓ₁-regulariser, ‖θ‖₁, on the learned parameters to reduce overfitting.

Fig 11 Reconstructions obtained for test case i.) with consistent priors π test (f ) = π train (f ) (columns: phantom, fully-learned, post-processing, learned iterative). Trained and tested on the piece-wise constant phantoms. The fully-learned approach does not perform satisfactorily due to strong overfitting to the training data, whereas the other two approaches produce quantitatively and qualitatively superior results, but still exhibit errors in the reconstruction.
Let us first discuss the obtained reconstructions from a visual perspective. The results for the first case i.) are displayed in Fig. 11. Most striking is the result obtained by the fully-learned approach, which clearly falls short in reconstruction quality compared to the other two approaches. We attribute this primarily to the limited size of the training data and hence to the network strongly overfitting, even though regularisation was used to reduce this; the training error plots in Fig. 12 make this clear. In contrast, the other two approaches correctly learned some form of representation of the prior π train from the training data. Consequently, their reconstructions for case i.) are visually close to the ground truth. In particular, the reconstruction quality is good close to the detector, but towards the edges of the domain, where limited-view artifacts are stronger, the reconstructions lose quality.

Fig 12 Training curves for case i.) for the fully-learned approach and learned gradient descent, exported from the tracking tool TensorBoard. While both approaches tend to overfit the training data during training, the fully-learned approach suffers more than the learned iterative reconstruction.

Fig 13 Reconstructions obtained for test case ii.) with inconsistent priors π test (f ) ≠ π train (f ) (columns: phantom, fully-learned, post-processing, learned iterative). Trained on the piece-wise constant phantoms and tested on smooth phantoms. All methods struggle to produce satisfactory results, and the piece-wise constant prior from the training data is visibly reproduced by each method.
For the second case ii.), the results are shown in Fig. 13. It is clear at first sight that the networks produce results according to the piece-wise constant prior learned from the training data, as one would expect. Additionally, all the algorithms show a deterioration in reconstruction quality compared to the consistent case i.). For the post-processing and learned iterative schemes, features close to the detector are to some extent correctly reconstructed, but both struggle further away. The fully-learned approach, due again to strong overfitting, produces a result with very limited resemblance to the ground truth. One interesting feature is that the networks, and especially the learned iterative scheme, tend to smear out features where there is uncertainty in the reconstruction.

Fig 14 Reconstructions obtained for test case iii.) with consistent priors π test (f ) = π train (f ) (columns: fully-learned, post-processing, learned iterative). Trained and tested on combined phantoms with piece-wise constant as well as smooth features. The reconstruction quality of the fully-learned approach improved slightly compared to the other test cases due to the larger training set, but it is clearly outperformed by both methods that use the model in the reconstruction pipeline.
In the final case iii.) the training samples are combined, so the size of the training data set is increased. The results are shown in Fig. 14. There is a clear improvement over the second case, as the test data is consistent with the mixed prior, and both approaches that use a model in the reconstruction do fairly well in reconstructing the target. A slight influence of the mixed prior is visible in the results: the reconstructions of the piece-wise constant phantom exhibit some smoother features related to the smooth phantoms. Lastly, the fully-learned approach seems to struggle with the mixed priors, but the reconstruction is still arguably closer to the ground truth than in the other cases, as the increased training data reduced the overfitting; nevertheless, the result is still not satisfactory. These observations are supported by the quantitative values in Table 1, which shows the mean and standard deviation over the whole test data for each case. We computed the peak signal-to-noise ratio (PSNR), a logarithmic measure of the relative mean squared error and hence related to the quantity minimised in training. Additionally, we computed the structural similarity index measure (SSIM) as an indication of the perceived similarity of the reconstructions. The post-processing and learned iterative schemes perform better in this test, but with a strong deterioration when the prior distribution of the test data is changed, as is also seen visually for case ii.). For case iii.), neither method using a model showed a strong improvement over case i.); in fact both deteriorate in terms of SSIM, as the priors are not perfectly reproduced, although PSNR is either stable or improves for the post-processing. For the fully-learned approach, PSNR improved considerably with the larger training set size, although SSIM slightly deteriorated, most likely due to the difficulty of reproducing the prior correctly.
This further indicates the large quantity of data needed for the fully-learned approach to work well. Similar observations were made by Baguer et al.: 91 in their overview they show that a fully-learned approach only performs well with large amounts of data, which in part explains the poor performance in this limited-data setting.
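A minimal implementation of the PSNR used in Table 1 is straightforward (SSIM requires a windowed computation and is omitted here):

```python
import numpy as np

# Peak signal-to-noise ratio: a logarithmic function of the mean squared
# error relative to the peak value of the reference image.
def psnr(reference, reconstruction):
    mse = np.mean((reference - reconstruction) ** 2)
    return 10.0 * np.log10(reference.max() ** 2 / mse)

ref = np.ones((4, 4))
rec = ref + 0.1          # uniform error of 0.1, so MSE = 0.01
print(psnr(ref, rec))    # ≈ 20 dB
```

Because PSNR is a monotone function of the mean squared error, it directly reflects the ℓ₂ training loss; SSIM instead captures perceived structural similarity, which is why the two metrics can disagree, as in case iii.) above.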

Deep Learning in PAT -Literature Review
The majority of the journal articles in which DL techniques have been applied to PAT image reconstruction are concerned with the acoustic part of the reconstruction and there are fewer papers tackling the optical part. Most of the subsections below will therefore focus on the acoustic reconstruction. The papers concerned with DL approaches applied to the optical reconstructions will be reviewed in Sec. 5.6. We also draw attention to related reviews on the matter of optical imaging and/or learned image reconstruction. 12, 92-94

Post-Processing
Early approaches to learned image reconstruction concentrated on the reconstruction-and-post-processing approach outlined in Sec. 4.3.2. The work by Antholzer et al. 79,95,96 investigated using filtered backprojection (Sec. 3.1.1) to reconstruct an initial image and then training a U-Net, with 5 scales, to do the post-processing. This was in a sparse and limited-view data setting, and followed the residual learning approach 8 given by Eq. (39). Similar to our observations in Sec. 4.5, the authors report that consistent training and test data, i.e. π test (f ) ≈ π train (f ), is crucial for optimal performance of the trained network; 79 this seems to be especially so for limited-view detection geometries. This observation was confirmed and clearly demonstrated in the study by Guan et al., 97 who proposed a dense U-Net to ameliorate this negative effect. Other extensions have been proposed too: using a leaky ReLU nonlinearity, 98 or using the first iterate of a model-based iterative approach (Sec. 3.1.4) instead of a backprojection-type reconstruction. 99 In another study, 100 the authors propose combining a reconstruction obtained with the adjoint and the first iterate of an iterative algorithm in a learned fusion process.
In comparison studies, U-Net-based networks generally performed better than other architectures, e.g. a simple 3-layer CNN 95 or VGG, 98 and better than applying a U-Net directly to the measurement data g, 101 especially with respect to robustness. Interestingly, Antholzer et al. 96 compared their results to a classical ℓ₁-regularisation approach for compressed sensing and report that when the system matrix is randomly sampled, so that the undersampling artifacts change as well, the classical variational approach clearly outperforms the network-based post-processing approach. This reinforces the observation that consistent training and test data are needed for this approach to be successful.

Application to in-vivo imaging
In an extensive study, the U-Net-based post-processing approach was successfully applied to in vivo measurements 102 and showed clear improvements over backprojection-based algorithms when the data was undersampled or detected over a partial aperture (limited-view problem). Hariri et al. 103 showed that this approach can improve in vivo imaging when using low-fluence sources. The observation of improved visual performance for in vivo applications was also reported in other studies. 104, 105

Extensions of the post-processing approach
The primary problem with the post-processing approach is that the result depends on a network that is determined only by the information content of the training data and not by the physics of the problem. To tackle this, Antholzer et al. 79,95 propose a null-space projection to ensure data consistency after post-processing. In other words, only components in the null space N(A) of the forward operator A are added to the reconstruction, and as such do not change the data consistency term ‖A f − g‖₂². The solution therefore takes the form f = A † g + P N(A) Λ θ (A † g), where P N(A) denotes the orthogonal projection onto the null space. In Schwab et al., 106,107 the authors combine post-processing by a U-Net with a learning-based filter in the backprojection step (κ in Eq. (17)) to improve initial reconstructions from limited-view measurements.
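The null-space projection is easy to verify numerically: with P = I − A⁺A (A⁺ the pseudoinverse), any correction passed through P leaves the measured data unchanged. The matrix and "network output" below are random stand-ins:

```python
import numpy as np

# Null-space projection: adding P_null @ correction to a reconstruction
# cannot change A f, so the data-consistency term is preserved exactly.
rng = np.random.default_rng(5)
A = rng.standard_normal((6, 10))                # underdetermined: N(A) nontrivial
P_null = np.eye(10) - np.linalg.pinv(A) @ A     # projector onto N(A)

f0 = np.linalg.pinv(A) @ (A @ rng.random(10))   # an initial reconstruction
correction = rng.standard_normal(10)            # stand-in for Lambda_theta(f0)
f = f0 + P_null @ correction                    # data consistency preserved
```

The network is thus free to improve the image only in directions the measurements cannot see, which is exactly where the prior information is needed.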
Recently, LED-based excitation systems have become popular, but because of their low power output many averages (thousands) are required to improve the signal-to-noise ratio. The resulting long-duration measurements are sensitive to motion artifacts. To compensate for this, Anas et al. 108 proposed using a recurrent neural network, a convolutional LSTM network, 109,110 to exploit the temporal dependencies in the noisy measurements. They report a considerable improvement over single-frame post-processing. In our opinion, this explicit consideration of the temporal aspect with recurrent units is more promising for low-power systems than simple post-processing with a U-Net. 111 With a similar motivation of expanding the information available before post-processing, Kim et al. 112 propose to use the delay part of delay-and-sum without taking the sum (Eq. (14) without the integral). The resulting three-dimensional input is then processed and collapsed by a U-Net to produce the final reconstruction.

Beyond fully-supervised training regimes
The possibility of providing an uncertainty estimate for images reconstructed by the post-processing approach was investigated by Godefroy et al. 113 The authors propose to train a U-Net with Monte Carlo (MC) dropout to provide reconstructions together with an uncertainty estimate: a set of images is sampled with the MC dropout procedure, yielding a reconstruction (the mean of these images) and a standard deviation indicating instabilities in the reconstruction.
Finally, we note that the approaches discussed so far were all trained in a supervised manner by minimising an explicit loss function given by the ℓ₁ or ℓ₂ error. In a recent study, Vu et al. 114 explore the possibility of using a generative adversarial network (GAN) to process the image. In this setting, the U-Net is interpreted as the generator producing a clean PA image, and the discriminator acts as a learned loss function evaluating reconstruction quality. GAN-based approaches open the way to applications where no paired training data are available.

Pre-Processing
In a similar manner to the previous approach of using a network for post-processing reconstructions, one can instead focus the learning task on the data side and then use a classical reconstruction algorithm (Sec. 3) to obtain the PA image; see Fig. 5. In this sense we reformulate the learned post-processing reconstruction operator in Eq. (37) to its pre-processing analogue, A † θ (g) = A † (Λ θ (g)). Here the network Λ θ can act as a denoising and artifact removal step on the data side to make the inversion step easier (essentially it changes the learning task from an inversion step to a denoising step).

Artifact removal for source localisation
Defining a clear purpose for an application enables the formulation of task-specific processing algorithms, for instance in the tracking applications explored by Allman et al. [115][116][117] Here the aim is to localise a point-like source, and to this end it is essential to distinguish clearly the true signal from noise and artifacts. The authors propose an object detection and classification approach to separate artifacts from the true signal. Their approach is based on a network architecture known as Faster R-CNN 118 that produces a classification between signal and artifact, a confidence score, and locations as bounding boxes. After a subsequent artifact removal step, the final PA image is reconstructed using beamforming (Sec. 3.1.1). The authors show that their networks for accurate source localisation, trained on simulated data, can be transferred successfully to experimental data, 115 as well as to ex vivo and in vivo 119 measurements.

Sampling and bandwidth enhancement
The PAT reconstruction problem is well-posed if perfect measurement data is available (see Sec. 2.2). One approach to pre-processing is therefore to aim to produce ideal data for the inversion from the non-ideal measurement data. This was investigated in the work of Awasthi et al. 120,121 The authors considered a sparse-data (but full-view) scenario with limited-bandwidth detectors and trained a network to produce high-quality data from the degraded input. In particular, the network attempted to upsample the data from 100 detectors to 200, to denoise it, and to increase its bandwidth. The improved data was then reconstructed by filtered backprojection (Sec. 3.1.1). Two architectures were used for Λ θ : a simple 7-layer CNN 120 and a U-Net-based architecture. 121 In general the U-Net architecture performed better, but it is interesting that for low noise levels the simple CNN architecture was highly competitive. Translation to in vivo measurements without retraining was successful for both methods. 120,121 Conceptually, such a pre-processing approach can be understood as learning a representation of the likelihood π(g|f ) conditioned on a training set of images f . Nevertheless, the reconstruction quality is essentially limited by the quality of the pre-processed measurement data, and hence we believe this approach is only viable in fairly simple measurement scenarios, such as the tracking applications discussed above. 115,117
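The simplest hand-crafted baseline for the detector-upsampling task is linear interpolation across the detector positions, which the learned networks aim to improve upon (along with denoising and bandwidth extension). A toy sketch at a single time sample, with a sinusoid standing in for the measured pressure trace:

```python
import numpy as np

# Upsample sparse detector data from 100 to 200 positions by linear
# interpolation; a learned pre-processing network replaces this step.
x_sparse = np.linspace(0.0, 1.0, 100)       # 100 detector positions
x_dense = np.linspace(0.0, 1.0, 200)        # target grid of 200 positions
g_sparse = np.sin(2 * np.pi * x_sparse)     # toy measurement at one time step
g_dense = np.interp(x_dense, x_sparse, g_sparse)
```

A learned network can outperform this baseline because it exploits the joint structure across detectors and time, rather than treating each time sample independently.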

Fully-Learned
When considering a fully-learned reconstruction, it is important to keep in mind that the measurement data g ∈ Y lies in a different spatio-temporal space than the reconstructed images f ∈ X f , and as such a mapping between the spaces Y → X f needs to be constructed. In Sec. 4.3.1 we discussed the non-local nature of this mapping and noted that, in principle, a fully connected layer can account for it. While the mapping may therefore be done by a fully connected layer, we nevertheless saw clearly in Sec. 4.5 that with a limited amount of data fully connected layers are hard to train to achieve high-quality reconstructions. Additionally, we observed that the CNN following the fully connected layers did most of the visual 'heavy lifting' for the final reconstruction. This observation is in line with what has been reported in the literature, as discussed below in Sec. 5.3.1.
Following this idea, Shang et al. 122 proposed a two-step approach, where first a fully connected layer is trained to transform measurements into the image space, and then a U-Net is trained to process the result while the weights of the fully connected layer are fixed.
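The two-step structure can be sketched on a toy linear problem, with a closed-form least-squares fit standing in for training the fully connected layer and a fixed smoothing filter standing in for the subsequent U-Net (all operators and dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, n_train = 64, 32, 200              # data dim, image dim, training pairs
A = rng.standard_normal((m, n))          # stand-in forward operator
f_train = rng.standard_normal((n, n_train))
g_train = A @ f_train                    # paired training data (g_i, f_i)

# Step 1: train the fully connected layer W: Y -> X_f on its own
# (here in closed form by least squares; in practice by SGD).
W = f_train @ np.linalg.pinv(g_train)

# Step 2: freeze W and train a post-processing network on the images W @ g.
# A fixed moving-average filter stands in for the U-Net here.
def postprocess(f):
    k = np.ones(3) / 3.0
    return np.convolve(f, k, mode="same")

f_test = rng.standard_normal(n)
g_test = A @ f_test
f_rec = postprocess(W @ g_test)          # image-space estimate, then 'U-Net'
```

Fixing W while the second network trains is what makes the scheme tractable: the expensive Y → X_f map is learned once, and all subsequent capacity goes into image-space refinement.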

Convolutional approaches
Even though there is no clear theoretical justification for using a CNN directly to transform a spatiotemporal signal from Y into an image in X_f, as CNNs learn spatially-invariant mappings, many studies in fact explore this scenario. The strength of convolution-based networks lies in their capability to exploit local relations in the data, and as such they can deal efficiently with noise in the input. The issue of spatial invariance can be overcome by using multiple pooling layers to increase the receptive field of the network, so that the representation on the coarse scales effectively encodes the locality of the information. This implies that large multi-scale networks are needed to transform the signal into the sought-after PAT image effectively. An early study by Waibel et al. 101,123 showed that using an asymmetric U-Net to reconstruct the PA image directly from raw sensor data is feasible in a limited-view setting. In comparison to a post-processing approach using a U-Net it was competitive in terms of mean reconstruction error, but exhibited a higher variance in reconstruction error. To overcome this, various solutions have been investigated in the literature, including enlarging the network to increase its capacity. 124,125 Others propose introducing a pre-processing step to provide more informative input to the network, either by a hand-crafted interpolation 126 or by learned pre-processing with a separate CNN. 127 Note that in the latter case the transformation after the pre-processing is in fact done by a dense layer, and hence this is the closest to the AUTOMAP architecture discussed in Sec. 4.3.1. In both cases, the pre-processing step seems to be essential to provide an input, reduced in dimensionality, to the network performing the transform to the image space. Additionally, the authors of 127 motivate the pre-processing architecture based on the universal backprojection, Eq. (16), and provide the time series as well as its time derivative to the network. Lan et al. 128 reduce 120 time series to one by summing them with delays, then feed this single time series into an LSTM network followed by a fully connected layer and a subsequent CNN to form the reconstructed image.
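The delay-and-sum reduction used as the front end of such a pipeline can be sketched as follows (integer delays and an impulse phantom are assumed purely for illustration; real delays come from the acquisition geometry):

```python
import numpy as np

def delay_and_sum(sensor_data, delays):
    """Collapse (n_detectors, n_time) sensor data to a single time series
    by advancing each channel by its (integer-sample) delay and summing,
    so that signals from a common source align before the summation."""
    out = np.zeros(sensor_data.shape[1])
    for channel, tau in zip(sensor_data, delays):
        out += np.roll(channel, -int(tau))  # advance channel by its delay
    return out

data = np.zeros((120, 256))               # 120 channels, 256 time samples
delays = np.arange(120) % 20              # hypothetical per-channel delays
for i, tau in enumerate(delays):
    data[i, 50 + tau] = 1.0               # impulse arrives later when delayed
summed = delay_and_sum(data, delays)      # all impulses align at sample 50
```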
Following the discussion in Sec. 5.2.1, there are situations in which the full reconstruction problem can be simplified to the case where only a source location must be found. This can be achieved by, for example, using a feature detection network 129 or by first forming a reconstructed image using an extended U-Net and then converting it to a numerical value for the source location. 130,131

Discussion of fully-learned approaches
In summary, the more advanced fully-learned approaches seem to provide a slight improvement over reconstruction followed by post-processing with a U-Net. However, the fully-learned approach does not explicitly include the acquisition geometry and sound speed in the inversion procedure. While this generality might conceivably be useful, it means that for the network to be robust to changes in these experimental parameters, the training data must account for the full range over which they might vary. As we see it, the fully-learned approach might therefore be useful in cases where a measurement device is available with corresponding data-image pairs, (g, f), to be used as training data, but the acquisition geometry and other underlying parameters needed for reconstruction are not known. (I.e., if it is a 'black box' with examples of known inputs and outputs but the parameters implicit in A are not known.) A fully-learned approach would then provide a way to improve the imaging pipeline without having to go through the potentially difficult procedure of determining the instrument characteristics. Finally, following our observation in Sec. 4.5, the fully-learned approach needs substantially more training data than other approaches that involve A explicitly. This might constitute a major limitation when transitioning to experimental measurement data, which is inherently scarce. Nevertheless, pre-processing approaches, as in 126,127, are potentially promising in reducing the appetite for training data.

Learned Iterative Reconstructions
Learned iterative schemes, as described in Sec. 4.3.3, are model-based reconstructions that use known forward and adjoint models within a learned update. Given the reconstruction operator A†_θ in Eq. (42), defined by the iterates in Eq. (41), we can formulate the training task in an end-to-end manner. That is, given paired training data (g_i, f_i) ∈ Y × X_f, an optimal parameter θ* is found by solving the optimisation problem in Eq. (35), where the loss function is given as L(θ) = Σ_i ‖A†_θ(g_i) − f_i‖². Computing the gradient of the loss function with respect to θ requires performing back-propagation through all of the unrolled iterates n = 0, . . . , N − 1. This requires storage as well as evaluation of the forward and adjoint operators in each training step for each iterate, and hence can be computationally burdensome, so it has mostly been demonstrated in 2D imaging scenarios. In Shan et al. 132 the basic learned iterative reconstruction approach was applied with an extension to simultaneously reconstruct the sound speed as well, which constitutes a learned version of the approach in 133 . Following the illustration in Fig. 8, the authors suggested also adding a residual connection updating a sound speed estimate together with the reconstruction.
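The structure of the unrolled scheme in Eq. (41) can be sketched on a toy linear problem, with the learned update Λ_θ reduced to its simplest possible form, a learned per-iterate step size, so that the role of the forward and adjoint operators in each iterate stays visible:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, N = 40, 20, 8                       # data dim, image dim, iterates
A = rng.standard_normal((m, n)) / np.sqrt(m)   # stand-in forward operator

def unrolled_reconstruction(g, thetas):
    # Learned gradient scheme: each iterate applies a learned update to the
    # current image and the data-consistency gradient A*(A f - g). Here the
    # 'network' Lambda_theta is just a learned per-iterate step size.
    f = A.T @ g                           # initialisation f^(0) = A* g
    for theta in thetas:
        grad = A.T @ (A @ f - g)          # forward and adjoint per iterate
        f = f - theta * grad              # Lambda_theta(f, grad)
    return f

f_true = rng.standard_normal(n)
g = A @ f_true
# 'Trained' step sizes: a constant stable step stands in for learned ones.
thetas = np.full(N, 1.8 / np.linalg.norm(A.T @ A, 2))
f_rec = unrolled_reconstruction(g, thetas)
```

In the real scheme each scalar theta is replaced by a CNN with thousands of parameters, but the per-iterate cost of applying A and A*, which dominates 3D training, is already apparent in this skeleton.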

Learned primal dual in 2D
For reconstructions in PAT, the work by Boink et al. [134][135][136] has demonstrated the robustness of these learned iterative schemes on a number of in silico phantoms as well as in an experimental study. The authors consider an extension of the learned gradient schemes introduced above called learned primal dual 18 (LPD), based on the successful primal-dual hybrid gradient method 137 (also known as the Chambolle-Pock algorithm). The LPD method can be formulated in a similar manner to Eq. (41) by learning updating operators in the primal space X_f and the dual space Y: h^{(n+1)} = Γ_θ(h^{(n)}, A f^{(n)}, g) and f^{(n+1)} = Λ_θ(f^{(n)}, A* h^{(n+1)}). In this case the network Γ_θ operates in data space Y, whereas the network Λ_θ operates in image space X_f. See also the illustration in Fig. 15, in which it is clearly seen to be an extension of the learned iterative scheme in Fig. 5 (Bottom). In their work, [134][135][136] the authors examine the robustness of LPD with respect to changes in the target, including the contrast, background, structural changes, and noise level. They found that if the network is trained only on the basic training data it generalises fairly well with respect to noise (1 dB degradation in PSNR) and structural changes (3 dB), but is most sensitive to changes in background (7 dB) and contrast (11 dB). 135 Additionally, the authors combine their learned reconstruction with a joint segmentation that is learned with the same network as an additional output, which is shown to provide increased robustness compared to reconstruction by filtered backprojection followed by segmentation with a U-Net.
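The two-space structure of LPD can be sketched by replacing the learned operators Γ_θ and Λ_θ with the plain PDHG updates for a least-squares data fit (a hand-crafted stand-in; the learned method substitutes trained networks for these two updates):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 40, 20
A = rng.standard_normal((m, n)) / np.sqrt(m)
f_true = rng.standard_normal(n)
g = A @ f_true

L = np.linalg.norm(A, 2)
sigma = tau = 0.9 / L                     # PDHG steps: sigma*tau*L^2 < 1

def lpd_skeleton(g, n_iter=500):
    f = np.zeros(n)                       # primal variable in X_f
    f_bar = f.copy()
    h = np.zeros(m)                       # dual variable in Y
    for _ in range(n_iter):
        # Dual update (role of Gamma_theta): acts in data space Y.
        h = (h + sigma * (A @ f_bar - g)) / (1 + sigma)
        # Primal update (role of Lambda_theta): acts in image space X_f.
        f_new = f - tau * (A.T @ h)
        f_bar = 2 * f_new - f             # over-relaxation step of PDHG
        f = f_new
    return f

f_rec = lpd_skeleton(g)
```

Keeping one variable in each space is what lets LPD learn corrections both to the data and to the image within a single scheme.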

Learned iterative reconstructions in 3D
As already indicated, learned iterative reconstruction methods are ideally (and typically in 2D) trained in an end-to-end manner. While this can provide an optimal set of network parameters, if suitable optimisation procedures have been used, it also comes with two computational challenges. First, the memory footprint of storing and manipulating the network tends to be large and to exceed single-GPU configurations, making it necessary to use costly (and often less readily available) multi-GPU clusters. More significantly, however, during training the loss function must be evaluated many times, and each evaluation involves applying the forward and adjoint operators for each iterate. This quickly leads to unreasonable training times for three-dimensional images, especially when considering large volume sizes, i.e., many voxels, and accurate forward models.
To overcome this limitation, Hauptmann et al. 83 proposed greedy training for learned gradient schemes for 3D PAT. That is, instead of looking for a reconstruction operator that is optimal end-to-end, only iterate-wise optimality is required. For the learned gradient scheme in Eq. (41) this amounts to the following loss function for the n-th unrolled iterate: L_n(θ_n) = Σ_i ‖Λ_{θ_n}(f_i^{(n)}, A*(A f_i^{(n)} − g_i)) − f_i‖², given the output of the previous iterate f^{(n)} := Λ_{θ_{n−1}}(f^{(n−1)}, A*(A f^{(n−1)} − g)) and initialisation f^{(0)} = A* g. It is important to note that, as only iterate-wise optimality is required and the parameters θ_n are not jointly minimised over all iterates, such a greedy scheme constitutes an upper bound on the minimised loss function for end-to-end networks. Nevertheless, this renders the training procedure feasible since training can be separated from the evaluation of the model; the gradient of the data consistency term A*(A f^{(n)} − g) used in Eq. (50) can be computed before the parameter optimisation is performed. In their study, 83 the authors show that in this way a learned iterative reconstruction algorithm can be trained for realistic 3D volumes of size 240 × 240 × 80 in a limited-view acquisition geometry. The results suggest that improved reconstructions can be obtained compared to both post-processing with a U-Net and iterative reconstruction with total variation regularisation. Application to in vivo measurement data was presented after transfer training was performed, as outlined in Sec. 4.4. As the authors use an accurate, full-wave, solver for the forward and adjoint operators, reconstruction times were still slow, on the order of minutes, but with a 4× speed-up compared to iterative reconstruction with total variation.
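The greedy regime can be sketched as follows: the outputs of each trained iterate, together with the precomputed data-consistency gradients, form the training set for the next iterate, and no backpropagation through A or A* is ever needed (a scalar step size again stands in for the network):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, n_train, N = 40, 20, 100, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)   # stand-in forward operator
F = rng.standard_normal((n, n_train))     # ground-truth images f_i (columns)
G = A @ F                                 # corresponding data g_i

def greedy_theta(f, grad, f_true):
    # 'Train' one iterate: the scalar step minimising the iterate-wise loss
    # ||(f - theta*grad) - f_true||^2 over the whole training set, in
    # closed form; a stand-in for fitting the network Lambda_theta_n by SGD.
    return np.sum(grad * (f - f_true)) / np.sum(grad * grad)

# Greedy training: each stage is fitted on the outputs of the previous,
# already-trained stage, so the forward/adjoint evaluations are computed
# once per stage and never appear inside the parameter optimisation.
thetas = []
f = A.T @ G                               # f^(0) = A* g for all samples
for _ in range(N):
    grad = A.T @ (A @ f - G)              # precomputed data-consistency term
    theta = greedy_theta(f, grad, F)
    thetas.append(theta)
    f = f - theta * grad                  # stage outputs feed the next stage
```

This separation of model evaluation from parameter fitting is exactly what makes training feasible for large 3D volumes.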
In a follow-up study, 138 the authors considered the use of a faster but approximate forward model to overcome the slow reconstruction times. Here the fast k-space method discussed in Sec. 3.1.2 was used for the inverse as well as the forward propagation model, but as the forward model includes a singularity 139 this results in an approximate gradient only. Following the greedy training scheme, Eq. (50), the networks learned to reduce the resulting artifacts and produce a useful update. Using this fast approximate forward model the authors achieve a reconstruction time on the order of seconds, more precisely an 8× speed-up compared to their previous learned approach 83 and a 32× speed-up compared to iterative reconstruction with total variation. Results are presented for in vivo measurements of a human subject.
Finally, Yang et al. 140 extend the previous study using an approximate model 138 by using recurrent inference machines 84,141 for the network architecture. This way the authors are able to improve reconstruction results for in silico experiments in 2D by 2dB in PSNR. In conclusion, learned iterative approaches seem to provide an improvement in reconstruction quality compared to other learned reconstructions discussed in this review, but come with the major limitation of reconstruction speed due to the repeated application of the forward model and its adjoint.

Hybrid Approaches
From the previous sections it is apparent that most approaches, while having clear advantages, come with their own shortcomings. To try to mitigate these, a few groups have investigated hybrid approaches. For instance, to overcome the missing model dependence in the fully-learned approach the work by Lan et al. [142][143][144] proposes augmenting the end-to-end approaches 124, 125 by additionally feeding the network a reconstructed image, either directly into the network at a suitable location 144 or with a separate processing branch. 142, 143

Augmented analytical approaches
Another route is to incorporate learned methods into classical inversion approaches more explicitly than the learned iterative approaches in Sec. 5.4, for instance by introducing a learned regulariser R_θ into the variational formulation of Eq. (23), such that the functional to be minimised becomes E(f) = ‖A f − g‖² + λ R_θ(f), and explicit minimisation of E(f) can be performed with an iterative algorithm. This approach has been proposed as the NETT framework 145 and applied to PAT. 146 The strength of this approach lies in the emphasis on the model in the data consistency term and in convergence guarantees under certain conditions, 145 but time-consuming iterative minimisation with the explicit forward and adjoint models is still needed, similar to the learned gradient schemes. Another possibility for an augmented analytical approach is presented by Schwab et al., 147 who consider a data-driven extension of the truncated singular value decomposition, where a network is trained to produce the singular vectors corresponding to small singular values to improve reconstruction quality. We emphasise that such augmented analytical approaches are especially important where reconstruction convergence guarantees are needed, such as in critical clinical applications, but they seem to fall short in visual performance compared to the most advanced learned reconstruction approaches.
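The variational structure can be sketched with a simple quadratic smoothness penalty standing in for the learned regulariser (a NETT-style scheme would evaluate the network's gradient at the marked line instead):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 30, 20
A = rng.standard_normal((m, n)) / np.sqrt(m)   # stand-in forward operator
f_true = np.convolve(rng.standard_normal(n), np.ones(5) / 5, mode="same")
g = A @ f_true + 0.01 * rng.standard_normal(m)

D = np.eye(n) - np.eye(n, k=1)            # finite-difference operator
lam = 0.1

def grad_E(f):
    # Gradient of E(f) = ||A f - g||^2 + lam * R(f), with the quadratic
    # smoothness penalty R(f) = ||D f||^2 standing in for the learned
    # regulariser R_theta; a network's gradient would be evaluated here.
    return 2 * A.T @ (A @ f - g) + 2 * lam * D.T @ (D @ f)

f0 = A.T @ g                              # initial image estimate
step = 0.4 / np.linalg.norm(A.T @ A + lam * D.T @ D, 2)
f = f0.copy()
for _ in range(300):
    f = f - step * grad_E(f)              # explicit minimisation of E(f)
```

The data consistency term ‖A f − g‖² keeps the model explicitly in the loop at every iteration, which is the source of both the convergence guarantees and the computational cost.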

Optical Inversions
There is not, to date, a large literature using DL to tackle the optical inversions in PAT image reconstruction (see Secs. 2.2.4 and 3.2). The work that does exist assumes the acoustic inversion has already been solved, which is to say the initial acoustic pressure distribution f is either given as the basic measured quantity or has already been estimated by solving A^{-1} g. The inverse problems subsequently tackled fall largely into two classes: solving F^{-1}(f) to estimate optical absorption coefficients, or solving (FL)^{-1}(f) to estimate chromophore concentrations or, more often than not, blood oxygen saturation, sO2. The primary task of the networks in these cases is to account for the effect of the fluence, which is felt in two related ways: voxelwise, it makes the PA spectra different from the absorption spectra (spectral coloring); spatially, the PAT image is no longer linearly related to the absorption coefficient distribution. These are related because the absorption coefficient at one voxel can affect the PAT image at another through the fluence. While this nonlocality of the operator F can be strong, e.g., a large absorber close to the light source may 'shadow' a large part of the image region, for small absorbers the effect can be quite localised. The first application of machine learning to this problem 148 used 'fluence contribution maps' that made this assumption. In the DL approaches discussed below the use of U-Net-type architectures is common, and it is known that their multi-scale nature can help mitigate the spatial invariance implicit in CNNs (see Sec. 5.3.1).

U-Net-based optical inversions
Cai et al., 149 in an early contribution, used a variation on the U-Net, named the ResU-Net, to obtain estimates of sO2 and a contrast agent from 2D multiwavelength PAT images. In this architecture, all the convolutional stages of a standard U-Net are replaced by residual blocks. 85 In a similar approach, Yang et al. 150 proposed another U-Net variant, DR2U-Net, the principal difference being that the residual blocks contain recurrent loops. Both these networks were shown to outperform linear unmixing, L^{-1} f, which ignores the effect of the fluence, in simple in silico tests. Chen et al. 151 trained a U-Net to recover a 2D optical absorption coefficient distribution from a single-wavelength 2D PAT image. The loss term included a TV regulariser. The network was initially trained and tested with simple simulated examples and then demonstrated on 2D experimentally measured data. The measured training set was augmented by rotating the images in steps of 1 degree. The one result shown is promising, but the geometric simplicity and similarity of the training and test cases mean the general applicability of the network remains unclear.
Exploiting the fact that the U-Net was designed for segmentation of biomedical images, 80 Luke et al. 152 combine two U-Nets, one for segmentation and one for estimating blood sO2, into a single 'O-Net' with common input and output layers. The network input consists of two 2D slices from two 3D images obtained at different wavelengths, and the output is two 2D images: a segmentation and a map of sO2. The network gives promising results on simulated data, where it is shown to work better than linear unmixing, but the digital phantoms are simple geometric shapes. To address this concern, Bench et al. 153 perform a similar inversion but using 3D multiwavelength training images generated from vessel-like phantoms within a multi-layered tissue. These images also contained limited-view artifacts from the acoustic reconstruction, and therefore incorporated many aspects that would be present in real in vivo data. In these simulations, the vessel sO2 estimates were accurate to within 1% on average, with a standard deviation of 6.3%.
Yang et al. 154 also use more realistic simulated data based on a 3D digital breast phantom, using a 3D light model, and acoustically processing 2D slices as input to the network to mimic the limited-view measurements made by a linear array transducer. Their network architecture, called an EDA-Net, uses the idea of 'iterative deep aggregation' 155 to enhance the basic U-Net. In this architecture, every skip-connection is replaced with multiple nodes at the same scale, each of which is fed from below by (nonlinear) upsampling. This network was shown to perform slightly better than ResU-Net and U-Net++ 156 and much better than linear unmixing.
In a detailed study, Gröhl et al. 123 use U-Nets to estimate the absorption coefficient in various ways. In two fluence-estimation approaches, asymmetric and symmetric U-Nets were used to estimate the fluence map, φ, from the time series data g and from the initial pressure distribution f, respectively. (This was subsequently divided out of f to estimate µ_a.) Also, a one-step approach was described in which an asymmetric U-Net was used to estimate µ_a directly from limited-view and limited-bandwidth time series data, i.e., solving (AF)^{-1} g directly. This one-step approach fared worse than the fluence estimation approaches in the in silico tests, but the comparison is perhaps unfair. Unlike the fluence estimation approaches, which just have to learn a mapping from one image space, X_f, to another, X_µa, this inversion requires the network to also learn the mapping from Y to X_f from incomplete data.
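The second step of the fluence-estimation approach, dividing the estimated fluence out of the initial pressure, can be sketched as follows (a unit Grüneisen parameter and toy values are assumed purely for illustration):

```python
import numpy as np

Gamma = 1.0                               # Grueneisen parameter (assumed known)

def absorption_from_fluence(f_img, phi_est, eps=1e-6):
    # Once a network has produced a fluence estimate phi, divide it out of
    # the initial pressure f = Gamma * phi * mu_a to recover the absorption
    # coefficient; eps guards against division by vanishing fluence.
    return f_img / (Gamma * np.clip(phi_est, eps, None))

mu_a = np.array([0.1, 0.5, 0.2])          # toy absorption values
phi = np.array([1.0, 0.6, 0.3])           # fluence decaying with depth
f_img = Gamma * phi * mu_a                # simulated initial pressure
mu_a_rec = absorption_from_fluence(f_img, phi)
```

The division itself is trivial; all the difficulty of the optical inversion is concentrated in producing a good estimate of φ.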

Learned uncertainty estimation
All the U-Net variants in Sec. 5.6.1 above have been shown to give a degree of accuracy when demonstrated on simulated data (some more realistic than others) that, if repeatable with in vivo data, would be useful in applications. Moving to in vivo data, however, is a challenge, as discussed below in Sec. 5.6.4. One of the difficulties with translating sO2 estimation techniques, for example, to a clinical setting is knowing how much confidence one should have in the estimates. This problem is tackled by Gröhl et al., 123 who trained a U-Net to act as an error-estimating network, using {(PAT image, error image)} pairs, to give an estimate of the uncertainty in the µ_a estimates. The uncertainty correlated well with the actual error in the images in this in silico study. The use of a meta-network to observe the performance of a given estimator and output confidence levels for its estimates is very interesting, given the difficulties inherent in translating quantitative PAT algorithms to in vivo cases.

Learned spectral unmixing
In contrast to the U-Net-based approaches discussed above, which exploit the spatial information about the fluence that is present in the PA images, pixelwise approaches attempt to solve the optical inversion using the spectral data alone.
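The linear unmixing baseline that these pixelwise methods aim to improve on can be sketched as follows (the absorption spectra below are hypothetical placeholders, not tabulated values):

```python
import numpy as np

# Hypothetical molar absorption spectra at four wavelengths; columns are
# HbO2 and Hb. (Placeholder numbers -- real values come from tabulated
# reference spectra.)
L_mat = np.array([[2.0, 1.0],
                  [1.2, 1.5],
                  [0.8, 2.2],
                  [1.5, 1.4]])

def linear_unmix_so2(spectrum):
    # Pixelwise linear unmixing: solve L c ~= p for the concentrations c by
    # least squares, then sO2 = c_HbO2 / (c_HbO2 + c_Hb). This ignores the
    # fluence, which is exactly the effect the learned methods account for.
    c, *_ = np.linalg.lstsq(L_mat, spectrum, rcond=None)
    c = np.clip(c, 0.0, None)
    return c[0] / max(c.sum(), 1e-12)

c_true = np.array([0.7, 0.3])             # ground-truth sO2 of 0.7
p = L_mat @ c_true                        # fluence- and noise-free spectrum
so2 = linear_unmix_so2(p)
```

On fluence-free data this recovers sO2 exactly; spectral coloring of real PA spectra is what breaks it and motivates the learned alternatives.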
Durairaj et al. 157 propose a two-stage autoencoder architecture (Sec. 4.2.2) to estimate chromophore concentrations and molar absorption spectra simultaneously. One potentially significant advantage of this approach is that autoencoder networks by their nature do not require ground-truth data for the training. As discussed in Sec. 4.2.2, by having a smaller hidden layer than input and output layers, autoencoders aim to find a compressed representation of the input. Durairaj et al. chose the hidden layer to have as many dimensions as there are chromophores contributing to the data, in the hope that the values at the hidden layer are estimates of the chromophore concentrations (endmember abundances in their terminology) and the network weights are estimates of the molar absorption spectra (endmember spectra). Because this approach aims to solve the ill-posed problem of finding both the concentrations and spectra simultaneously, it requires strong prior information. As well as a positivity condition, which is well-justified, they impose the condition that the chromophore concentrations sum to one. However, this is unrealistic, as there will also be non-absorbing molecules present in real tissue. Furthermore, it is not clear how this approach can account for the effect of the fluence on the PAT images, and it is therefore unclear to what extent the approach outlined in this preliminary simulation study will be useful in practice.
A different approach to learned spectral unmixing was taken by Gröhl et al. 158 who used a fully connected network with 8 hidden layers to convert pixelwise PAT spectra into estimates of sO 2 . The training data was taken from 2D simulated PAT images of vessels, and when the network was tested with simulated data it gave promising results. With some bravado, this network was then tested in vivo on images of a porcine brain and human forearm, and in the case of the pig brain 'seems to provide physiologically more plausible estimations' than linear unmixing.

Training data
As a concluding remark on this section, we note that several classical approaches to quantitative PAT have been demonstrated over the past decade (see 159,160 and their references and citations), but it has proved difficult to translate these methods to work convincingly with measurement data obtained in vivo, largely due to the challenge of obtaining all the auxiliary input parameters with sufficient accuracy under experimental conditions. DL holds the promise of overcoming this problem by learning the model, thereby not requiring auxiliary inputs, but a new difficulty arises: obtaining a large collection of experimentally measured in vivo data with a known ground truth to use for the training. As discussed in Sec. 4.4, there are two approaches: simulating the data, or reconstructing ground-truth images using a 'gold standard' classical reconstruction technique. The papers discussed in this section have used the former approach of simulating the data, typically using a Monte Carlo method such as MCX 161 for modelling the light propagation and a collocation method such as k-Wave for modelling the acoustic propagation. 90 The degree to which the simulations are realistic will determine how well a network trained with this data will work on data measured in vivo, and therefore the confidence with which any conclusions can be drawn from a study using such an approach. In conclusion, then, the use of DL to tackle the quantitative PAT problem appears to hold promise, but the translation to practical, in vivo, cases remains a significant challenge.

Conclusions and Future Directions
The diversity of the work that has been done on learned image reconstruction in PAT in just the last few years, and the increasing rate at which it is being produced, suggests that the field will continue to develop for some time. In particular, we notice that a move has already begun from straightforward proof-of-concept applications of deep learning to more sophisticated approaches. Nevertheless, many issues remain to be addressed. On the one hand, there are model-agnostic reconstruction pipelines using fully-learned approaches, which receive a great deal of attention due to their low latency. On the other hand, as described above, there are learned reconstructions that use a physical model in combination with a network, which have been shown to be more stable and to require less training data, but are (considerably) slower in providing a reconstruction. This is in part because accurate numerical models of the physics are often slow compared to networks. A major question therefore remains: is it possible to obtain network speed without sacrificing the stability and accuracy that come from explicitly incorporating a model?
Another challenge, which hangs over learned image reconstruction in all biomedical applications, is how to ensure oddities (like a tumour) appear accurately in the image even though nothing quite like them was in the training data. In other words, how do we ensure the distribution of the training data matches that of the imaged target? And if it does not, will there be problems, as suggested by the results from the tutorial, Sec. 4.5? Could this problem be ameliorated by enforcing additional constraints, such as data consistency?
To conclude this review, we describe a few current research directions that address these questions, either by considering new training regimes or by combining physical models with neural networks in different ways.
• Data consistency is important. Many approaches are still missing a data-consistency term and hence the reconstructions obtained might look realistic but there is no way to assess their correctness. As we have discussed, there are a few approaches which do consider such data consistency during the reconstruction and hence provide a possible direction for further developments, such as the null space approaches discussed in Sec. 5.1.2 or learned iterative reconstructions in Sec. 5.4. Another possible way to tackle this limitation is by using networks that consider uncertainty or provide an uncertainty estimate on top of the reconstruction. First steps in this direction have been taken for PAT, 113 see also Sec. 5.6.2, but there is also rising interest in other fields to incorporate such uncertainty estimates into a learned reconstruction framework, [162][163][164] which could be taken as inspiration.
• Lack of in vivo training data. For experimental scenarios, especially in vivo, using simulated training data is risky because it is hard to ensure the training set distribution matches that of the target.
As the majority of the algorithms discussed above used fully supervised training, these approaches are primarily limited by the available ground-truth data. As this is seldom a viable option when developing imaging pipelines for in vivo applications, it may be that different training regimes will be needed, such as the semi-supervised approaches discussed in Sec. 4.2.4, for instance by including a data-consistency term in the transfer to experimental data (also known as cycle consistency) or by using discriminator-based (GAN) approaches. 114 Another possibility might be to consider the framework of physics-informed neural networks, 165 in which the physical model, given by a partial differential equation, is incorporated directly into the loss function. In this case, rather than the network needing to learn the whole physical operator from the data, as in the fully-learned cases presented above, the network learns much of the physics by virtue of the terms in the loss function.
• 3D nature of PAT. The high computational complexity caused by the inherently three-dimensional nature of PAT is another challenge for learned approaches, as computational models tend to be time-consuming and simply storing the data requires large amounts of memory. Possible methods to overcome this have been discussed in some recent papers, for instance using invertible networks, 166,167 which do not require the storage of intermediate states in the network to compute the gradients for training. Another idea for scaling learned iterative schemes to 3D is to compute the forward model at multiple lower resolutions in the reconstruction process. 168
• Model Augmentation and Correction. The learned schemes that use a model in conjunction with a network are typically slow, and also face the additional problem of uncertainty in model parameters, especially the sound speed and, for the optical inversion, the scattering (see Sec. 2.2.2).
There may be advantages, therefore, in considering different ways to incorporate some of the behaviour of the model equation directly into the network, for instance by designing or constraining networks based on the discretisation of the forward model; similar work has already been done for diffusion equations. 15,169 In this way, it is possible to explicitly embed the properties of the model into the network architecture, with a computationally more efficient (network-based) solver.
Another possibility is to use approximate models that are faster or easier to compute in place of the true (expensive) model, and train a network to learn a correction. 170,171 The error may arise from an efficient, but inaccurate, numerical discretisation of the correct model 138,172 or because the accurate model has been replaced with a more-easily solvable approximation. 171 We believe that this direction could be particularly fruitful for PAT as model information is essential to provide stability and robustness in the inversion, but we need to overcome the two major limitations: computational speed and the inherent uncertainty in the model parameters. Nevertheless, these improvements come with a major increase in training times for such networks.
• Trade-offs and choices. There are so many options that trade-offs and choices will need to be made in practice. This is not a problem per se, but rather an opportunity. There are many possible ways in which a network can be incorporated into the reconstruction pipeline, and the approach best suited to a particular application will depend on the nature of that application. It is the responsibility of the designer of the image reconstruction algorithm to consider the trade-offs and constraints (e.g., Is reconstruction speed or a data-consistency guarantee more important? Does the algorithm need to work well with more than one hardware system? What hardware is available for the computations?) and construct the algorithm accordingly. This plethora of choice is good, because it gives sufficient flexibility for properly crafted, well-thought-through algorithms to be designed to be optimal for specific tasks. The key to realising that is developing an understanding of the strengths and weaknesses of particular architectures and approaches. We are only at the beginning of this journey, but we hope this paper has illuminated at least a little of the way along the path.