Distribution-informed and wavelength-flexible data-driven photoacoustic oximetry

Abstract. Significance Photoacoustic imaging (PAI) promises to measure spatially resolved blood oxygen saturation but suffers from a lack of accurate and robust spectral unmixing methods to deliver on this promise. Accurate blood oxygenation estimation could have important clinical applications from cancer detection to quantifying inflammation. Aim We address the inflexibility of existing data-driven methods for estimating blood oxygenation in PAI by introducing a recurrent neural network architecture. Approach We created 25 simulated training dataset variations to assess neural network performance. We used a long short-term memory network to implement a wavelength-flexible network architecture and proposed the Jensen–Shannon divergence to predict the most suitable training dataset. Results The network architecture can flexibly handle the input wavelengths and outperforms linear unmixing and the previously proposed learned spectral decoloring method. Small changes in the training data significantly affect the accuracy of our method, but we find that the Jensen–Shannon divergence correlates with the estimation error and is thus suitable for predicting the most appropriate training datasets for any given application. Conclusions A flexible data-driven network architecture combined with the Jensen–Shannon divergence to predict the best training data set provides a promising direction that might enable robust data-driven photoacoustic oximetry for clinical use cases.


Introduction
Blood oxygen saturation (sO 2 ) is an important indicator of individual health status used routinely in patient management. 1 Photoacoustic imaging (PAI) is a promising medical imaging modality for real-time non-invasive spatially-resolved measurement of sO 2 , 2 with early clinical applications 3 shown in, for example, inflammatory bowel disease 4 and cardiovascular diseases. 5In cancer, alterations in localized sO 2 levels have been linked to angiogenesis and hypoxia; 6 key hallmarks of cancer that are known to affect treatment outcomes 7 and are measurable through PAI.
Unfortunately, it remains difficult to apply PAI to derive quantitative values for sO 2 from multispectral PA measurements. 8,9 inear unmixing (LU) remains the de facto standard for sO 2 estimation from PAI measurements 10 because of its simplicity and flexibility, but has well-understood limitations in its applicability and accuracy. 11The limitations of LU are significant in the context of artefacts arising from: the optical processes (e.g.non-linear light fluence distribution 12 leading to spectral colouring), acoustic processes (e.g.reflection artefacts 13 ), or reconstruction algorithms (e.g.model mismatch in sound speed 14 ).Using LU is particularly challenging in the presence of highly absorbing tissues, such as the epidermis, 15 which can introduce reflection artefacts and a spectral bias that leads to an overestimation of sO 2 , which increases with darker skin tone. 16ta-driven unmixing schemes have shown promise to alleviate some of the shortcomings of LU [17][18][19] but suffer from three major drawbacks: (1) inflexibility to receiving different input data after training; 20 (2) performance determined by the composition of the training dataset; 21 and (3) limited testing on diverse and representative use cases. 22In comparison to LU, data-driven methods are often inflexible regarding the input data after training, impacting generalizability, making them difficult to use, and requiring laborious tailoring to a specific application and imaging system.Thus, data-driven sO 2 estimation methods can struggle to translate from promising findings in silico to in vivo data. 23,24 he lack of high-quality annotated training data and reliable validation data has made it difficult to implement data-driven methods robustly.7][28] Further, many data-driven approaches use single-pixel input spectra for their inversion, as with LU, even though 3D inversions would be preferred for realistic use cases. 29Solving the full 3D problem is typically computationally intensive, limiting its success in vivo. 30To make the inverse problem more tractable without full 3D information, some approaches use priors in the inversion scheme, 31 differential image analysis, 9 or multiple measurements with differences in illumination. 32 this work, we set out to address the three aforementioned limitations.We improve the flexibility of data-driven sO 2 estimation using a Long Short Term Memory (LSTM) network that enables input wavelength flexibility.We propose a method to inform the choice of training data, which can either be used to choose the best pre-trained model for a target application or to inform the choice of simulation parameters when creating a training data set to underpin a new model.We test these methods on diverse data sources across simulations, phantoms, small animals and humans.

Materials and Methods
We begin by investigating sensitivity to changes in training dataset simulation parameters, by defining a baseline dataset set (BASE) with typical assumptions on the tissue geometry and functional parameter ranges, then adapt it into 25 variations (Fig 1 A).We use the Jensen-Shannon divergence to determine the ideal training dataset for a given use case.We propose a deep learning network architecture based on a long short-term memory (LSTM) network that is flexible regarding the input wavelengths (Fig 1 B) and use a testing strategy that comprises computational studies in silico, phantoms in gello, 33 and in vivo data (Fig 1 C).

In silico datasets
25 in silico datasets were simulated in Python using the SIMPA toolkit 34 (Fig1 A).A Monte Carlo model (MCX 35 ) was used for the optical forward model with a 50 mJ Gaussian beam using 10 7 photons and a 20 mm radius, simulating at 41 wavelengths from 700 nm to 900 nm in 5 nm steps, and assuming an anisotropy of 0.9.For the datasets that include acoustic modelling, a 2D k-space pseudo-spectral time domain method implemented in the k-Wave toolbox 36 was used.We assumed a uniform speed of sound of 1500 ms −1 , a density of 1000 gcm −3 , and disregarded acoustic absorption.For most datasets, a generic linear ultrasound detector array was placed in the centre of the simulated volume.The generic array consists of 100 detection elements with a pitch of 0.18 mm, a centre frequency of 4 MHz, a bandwidth of 55%, and a sampling rate of 40 MHz.For the datasets mimicking the setup of commercial instruments, we used the built-in device definitions provided in SIMPA.
Baseline dataset parameters.[39][40] A 19.2 x 19.2 x 19.2 mm cube was simulated with a 0.3 mm voxel size, resulting in 64 3 voxels per volume.The background started from the second voxel from the top and was modelled as muscle tissue with a blood volume fraction of 1% 20 with 70% oxygenation. 41We added 0-9 cylinders with a diameter randomly chosen from U[0.3 mm, 2.0 mm], where U denotes a uniform distribution.The tissue was divided into 3 x 3 equal-sized compartments in the imaging plane and a vessel was included with a 50% probability.Each vessel purely contained blood with a randomly drawn sO 2 value from U[0%, 100%].To prevent vessel overlap, they were located within their respective compartments' boundaries.
We defined 24 variations of BASE (Table 1).The variations encompassed a range of tissue backgrounds, vessel sizes, illumination geometries and resolutions, as well as the addition of a skin layer, performing acoustic simulation, and using digital twins of two commercial instruments (MSOT Acuity Echo and MSOT InVision 256-TF; iThera Medical GmbH, Munich, Germany).The MSOT Acuity Echo is a handheld clinical PAI device with 256 detection elements and an angular coverage of 120 • , whereas the MSOT InVision 256-TF is a tomographic pre-clinical PAI device with 256 detection elements and a 270 • view angle.The full details for the digital twin parameters of these devices can be found in prior work 26,34 and are available within the SIMPA toolkit.
Table 1 Summary of datasets used in the study.Each row identifies the changes in dataset creation parameters performed for each of the training datasets relative to the BASE dataset (specified in Methods subsection 2.1).The left column shows an identifier assigned to each dataset that is used throughout the manuscript and the right column summarises the deviation from BASE.U denotes a uniform distribution.
Data preprocessing.We extracted pixel-wise PA spectra from simulations of initial pressure or reconstructed signal amplitude if acoustic simulations were performed.We only extracted spectra from blood vessels where the signal intensity at 800 nm was at least 10% of the maximum signal intensity in the dataset.If less than 10% of voxels were chosen this way, we selected the 10% of voxels with the highest signal amplitude from the dataset.We enforced this selection criterion to effectively exclude voxels with low signal-to-noise ratio (SNR) caused by optical attenuation, as previous works have shown that idealised simulation can contain spectra where the spectral colouring is unrealistically strong in regions with low signal amplitude. 20,32  number of extracted spectra from the different dataset variations ranged from 25 thousand to 60 million, compared to BASE with 7 million spectra; the mean over all datasets was just above 6 million.The large difference in extractable spectra is primarily caused by two factors: (1) typical simulations yield 3D p 0 distributions, but adding acoustic forward modelling leads to a 2D reconstructed image; and (2) some datasets only include a single vessel in the centre of the phantom tube, mimicking a blood flow phantom 42 (described below).To mitigate performance differences caused by this discrepancy, we stratified the dataset sizes by randomly sampling 300,000 spectra with replacement per dataset, which represents a balanced compromise between undersampling larger datasets and oversampling smaller ones. 43 performed z-score normalization on each spectrum, setting the mean (µ) to 0 and variance (σ 2 ) to 1, which discards signal intensity information and eliminates the need for quantitative simulation calibration.Since we performed the same spectra-wise z-score normalisation on experimental data, this normalisation allows us to apply sO 2 estimation algorithms trained in silico to experimental in gello and in vivo data.

Deep Learning Algorithm
To address the limited flexibility of data-driven oximetry methods, 20,40 a custom unmixing network architecture was composed that contains a Long Short-Term Memory (LSTM) network.Due to the recurrent nature of an LSTM, it can process sparse spectra containing zeros at arbitrary positions (see Fig 1 B).
The network input size was fixed at 41, representing the maximum number of wavelengths (between 700 nm and 900 nm in 5 nm steps) we consider during inference.The number is a trade-off between maximising the spectral resolution of the input features and simulation time efficiency.The network started with an LSTM layer with a hidden size of 100.A masking layer was introduced to identify zeroed values, creating a mask that instructs the layer to ignore missing values.The LSTM output was then flattened into a fixed-length encoding.Following the LSTM, a threelayer fully connected neural network (FCNN) was used with input size 4200, hidden size 1000, and output size 1.A leaky rectified linear unit was used after each layer and the activation function after the final layer was a sigmoid function to constrain sO 2 predictions between 0 and 1 (Fig 1 B).
The deep learning networks were trained for 100 epochs, where each epoch included the entire training set.The parameters were optimised with the Adam optimizer from the Keras framework 44 using an initial learning rate of 10 −3 and a mean absolute error loss.The learning rate was halved upon a 5-epoch plateau of the validation loss.

In gello blood flow phantom imaging
Two variable oxygenation blood flow phantoms were imaged using a previously described protocol. 42Briefly, agar-based cylindrical phantoms with a radius of 10 mm were created and a polyvinyl chloride tube (inner diameter 0.8 mm, outer diameter 0.9 mm) was embedded in the centre before the agar was allowed to set at room temperature.The base mixture comprised 1.5 % (w/v) agar (Sigma Aldrich 9012-36-6) in Milli-Q water and was heated until dissolved.After cooling to approximately 40 • C, 2.08% (v/v) of pre-warmed intralipid (20% emulsion, Merck, 68890-65-3) and 0.74% (v/v) Nigrosin solution (0.5 mg/mL in Milli-Q water) were added, mixed, and poured into the mold.Imaging was performed using the MSOT InVision 256TF (iThera Medical GmbH, Munich, Germany) according to a previously established standard operating procedure. 45PA images of the phantom were acquired in the range between 700 nm and 900 nm with 20 nm increments.The first phantom was imaged using deuterium oxide (D 2 O, heavy water) as the coupling medium within the system, while the second was immersed in normal water (H 2 O) during imaging.

In vivo human forearm imaging
Human forearm imaging was performed as part of the PAISKINTONE study, which started in June 2023 following approval by the East of England -Cambridge South Research Committee (Ref: 23/EE/0019).The study was conducted in accordance with the Declaration of Helsinki and written informed consent was obtained from all study participants.Participants were excluded if they could not give consent, were under the age of 20 or over 80, or had a body mass index (BMI) outside the range between 18.5 and 30.Imaging was performed using the MSOT Acuity Echo (iThera Medical GmbH, Munich, Germany) using laser light between 660 nm and 1300 nm, averaging over 10 scans each and analysis was performed at five wavelengths (700 nm, 730 nm, 760 nm, 800 nm, 850 nm).One forearm scan from N=7 randomly chosen subjects with Fitzpatrick Type 1 or 2 were selected for the purposes of testing the method proposed in this work.The authors manually segmented the radial artery in each scan using the medical imaging interaction toolkit (MITK). 46

In vivo mouse imaging
All animal procedures were conducted under project and personal licences (PPL no PE12C2B96, PIL no I53057080), issued under the United Kingdom Animals (Scientific Procedures) Act, 1986 and compliance was approved locally by the CRUK Cambridge Institute Biological Resources Unit.Nine healthy 9-week-old female C57BL/6 albino mice were imaged using the MSOT In-Vision 256TF (iThera Medical GmbH, Munich, Germany) according to a previously established standard operating procedure. 45Additionally, six 28-week-old healthy female BALB/c nude mice were imaged while inhaling 100% CO 2 as their terminal procedure.In both cases, imaging was performed at 10 wavelengths equally spaced between 700 and 900 nm averaging over 10 scans each.The mouse body, kidneys, spleen, spine, and aorta were manually segmented by the authors using MITK.

Performance evaluation
The performance of the LSTM-based method was evaluated using the median absolute error (ϵsO 2 ) between the estimate ( ŝO 2 ) and the ground truth/reference sO 2 : Ground truth values are available for the in silico and in gello datasets.For the in vivo measurements, reference values were based on literature.We assumed sO 2 of mixed murine blood to be 60% -70% 41 and of arterial murine blood sO 2 under anaesthesia to be 94% -98%. 47For the CO 2 terminal procedure, we assumed that CO 2 binds to haemoglobin, forming carbaminohaemoglobin, which leads to oxygen unloading 48 and has an absorption spectrum similar to deoxyhaemoglobin, 49,50 thus continuously decreasing the actual 51 and measured global blood sO 2 .In humans, arterial blood sO 2 of 95%-100% was assumed. 52

Simulation gap measure
To predict the best-fitting training data set for a target application, one could use the sO 2 estimates of a trained algorithm to calculate error metrics, such as the absolute estimation error, but this is only possible when ground truth or reference sO 2 values for a representative dataset are available.For in vivo applications this is typically not the case and additionally, unsupervised methods for performance prediction in the context of PA oximetry remain largely unexplored.Our alternative solution to this problem uses the Jensen-Shannon divergence 53 (D JS ).To this end, we compute D JS between the training data distributions and the target data distribution and calculate the Pearson correlation coefficient between the resulting D JS and ϵsO 2 of the LSTM estimates.D JS measures the distance between unpaired samples drawn from two probability distributions P and Q. D JS is a symmetric version of the Kullback-Leiber divergence 54 (D KL ) and is defined as where M = 1 2 (P + Q) and the discrete D KL is defined as the relative entropy between two probability distributions: To apply these measures, it is important to consider (1) handling the multidimensional probability distributions arising from multi-wavelength measurements and (2) the transformation of two sample distributions into the same sample space.We calculate an aggregate D JS by calculating the mean over the distance for each wavelength in the spectrum: where N λ is the number of all available wavelengths Λ.To standardise the sample space, a z-score normalisation is performed for each spectrum and a histogram with 100 bins ranging from -3σ to 3σ is created.A Python implementation of the Jensen-Shannon distance, available in the Scipy (v1.10.1)package, 55 was used.

3.1
The LSTM-based method achieves accurate results across a range of available input wavelengths.
The accuracy of the LSTM-based network architecture was first tested on the BASE data set, when varying the numbers of available wavelengths (N λ ) for training or inference.ϵsO 2 was extracted when training and testing the network on a certain fixed N λ , ranging from 3 to 41 wavelengths (Fig 2 A).We found that ϵsO 2 increased as more wavelengths were used for training.Nevertheless, ϵsO 2 for N λ = 10 was only slightly higher than for N λ = 41 wavelengths.For N λ < 10, ϵsO 2 starts to rapidly increase, which aligns with prior literature 31 and is potentially exacerbated due to the 'vanishing gradient problem' 56   B LSTM trained at a given N λ can be applied to data with different N λ but yields best results when N λ of the test spectra matches that of the training spectra (indicated by the green vertical line).

3.2
In silico cross-validation reveals the effect of changing simulation parameters.
For each of the 24 simulation parameter variations of BASE, 500 data points with random spatial vessel distributions and a fixed random number generator seed for reproducibility were generated.
To investigate the sensitivity of data-driven sO 2 estimates to changes in training dataset parameters, we first trained LSTM-based networks using all 41 available wavelengths on each simulated training dataset and one on a mixed dataset (ALL).We then performed cross-validation on all datasets by applying every trained network to each of the datasets and calculating the median estimation error ϵsO 2 (Fig 3 A).The ϵsO 2 values range from 0.5% to over 35%.As expected, the best performance occurs when testing on the training set, however, it is important to note that ϵsO 2 is not zero (instead ranging from 0.5% -5%), suggesting that the network has not overfitted the training data.We calculated a Uniform Manifold Approximation and Projection (UMAP) 57 embedding of 200,000 randomly chosen spectra from all datasets.With UMAP, we visualised the location of the spectra  of representative datasets on this embedding (Fig 3 B).Examining this visualisation indicates that some changes in the dataset parameters result in highly different spectra (e.g.: adding a layer of water on top of the tissue), while others lead to only minor variations (e.g.: changing the background oxygenation).Labelling the UMAP embedding with the corresponding ground truth sO 2 values reveals a correlation from low to high oxygenation along the horizontal UMAP axis (Fig 3  C).
From the in silico cross-validation heatmap (Fig. 3 A), we can derive several key observations concerning the design of simulated data: 1. Variation in background sO 2 has minimal effect with the used 1% blood volume fraction used, however, this could become more significant at higher blood volume fractions.
2. Resolution matters: Performance improves with higher spatial resolution simulations in the training data, suggesting fine details in the spectral data are important for accurate sO 2 estimation.

Illumination matters:
When changing from a Gaussian to a point source illumination the error increased.
4. Chromophore inclusion: When the test dataset contained melanin, but not the training data, the estimation error increased by an average of 5.8 percentage points.When designing a training dataset, all chromophores relevant to the target application should thus be included.

Acoustic modelling causes systematic changes:
Using acoustic modelling and image reconstruction introduced systematic spectral changes that increase ϵsO 2 .This can have detrimental effects on the estimation accuracy and should be considered during training data simulation.

6.
Training on a combined dataset is better: Including random samples from all training datasets yielded more accurate estimates for all test datasets in silico.It should already be noted that this finding was not reproducible on the experimental datasets, suggesting that the LSTM-based method was not able to generalise better by training on a combined dataset.

3.3
The Jensen-Shannon divergence correlates with the estimation error and can therefore be used to identify the best training data set.
Given the variance in performance introduced by the choice of training data, it is desirable to automatically determine the best training dataset for a given algorithm and target application.The Jensen-Shannon divergence (D JS ) allows one to quantify the distance between the data distribution of each dataset and the target data.We calculated the correlation between D JS and the median absolute sO 2 estimation error (ϵsO 2 ).When applying all networks, each trained on a distinct training dataset, to the BASE dataset, we found that D JS correlates strongly with ϵsO 2 (Pearson correlation coefficient R=0.76).The RES 0.15 dataset, which is the dataset simulated at the highest resolution, and not the BASE dataset, achieved the best D JS score.This may be because BASE dataset is a subset of the RES 0.15 dataset, resulting in low ϵsO 2 (Fig 4 A).
Extending the analysis to experimentally acquired in gello D 2 O flow phantom data showed a similar correlation (R=0.74).The network trained on SMALL achieved the best score with D JS = 0.35 and ϵsO 2 = 4.2% (Fig 4 B).The application of D JS to the H 2 O flow phantom experiment initially revealed no correlation (R=-0.1),however, networks trained on ILLUM POINT and MSOT ACOUS SKIN were outliers and after removing these, the correlation was comparable to other datasets (R=0.72,Fig 4 C).The network trained on WATER 4cm achieved the best score with D JS = 0.47 and ϵsO 2 = 12.6% (Fig 4 B).The presence of outliers emphasises the importance of expert oversight when applying summary metrics such as D JS .
D JS correlates with ϵsO 2 across multiple simulated and experimental data sets, providing evidence that the Jensen-Shannon divergence can predict algorithm performance.This is particularly relevant for previously unseen datasets where the true sO 2 is unknown.For each training dataset, a D JS value can be computed by drawing random samples from both the training and unseen dataset as outlined in section 2.7.An LSTM-based network pre-trained on the dataset with the lowest corresponding D JS would then be chosen for data analysis since lower D JS correlates with a lower ϵsO 2 .
The same strategy could also be used to guide an optimisation process to tailor the simulation parameters to create a new training dataset that matches the target application.

3.4
In gello testing shows that the LSTM method outperforms learned spectral decolouring.
Algorithm performance on the oxygenation flow phantom was compared with LU as the de facto state of the art and with a previously proposed learned spectral decolouring method. 20We show three example PA intensity images at 700 nm of the D 2 O flow phantom at three time points t=[0 mins, 44 mins, 70 mins] (Fig 5 A), annotated with reference oxygenation (sO ref 2 ) calculated from pO 2 reference measurements using the Severinghaus equation. 58omparing the estimates of all methods trained on the SMALL training dataset by plotting the estimated sO 2 over time (Fig 5 B) reveals that the LSTM-based method is, on average, more than twice as accurate as the learned spectral decolouring (LSD) method and four times as accurate compared to LU.We show the LU estimates for the three example images (Fig 5 C), demonstrating the restricted dynamic range of LU estimates from t=0 min to t=44 min, ranging from 80% to 32%, compared to a ground truth of 99% to 1%.Example images for LSD (Fig 5 D) and the LSTMbased method (Fig 5 E) are shown as well, where the latter can recover the widest dynamic range, extending from 97% to 5%.For both methods, we chose SMALL as the training data set, as it was assigned the lowest D JS score.with expectations based on human physiology.The combination of all datasets (ALL) results in an extremely low sO 2 estimate in the radial artery (median sO 2 < 50%), which contradicts the in silico cross-validation results and indicates overfitting of the method to the training datasets.
For mouse images, the aorta and the area around the spinal cord are examined (Fig 6 G), assuming from literature a physiological arterial sO 2 of 92% -98% and for the spinal cord, a mixed arterial and venous blood with sO 2 of 60% -70%.WATER 4cm is objectively the best matching dataset and was assigned the highest score according to D JS (Fig 6 H).All data-driven methods significantly increase the sO 2 estimate in the aorta, and lie within the desired bounds in the spinal cord (Fig 6 I).The BASE (Fig 6 L) and WATER 4cm (Fig 6 J) datasets estimate a broad distribution and yield a higher sO 2 estimate in the aorta and a larger spread between sO 2 in the aorta and spinal cord.The limitations of the ILLUM POINT (Fig 6 K) dataset are even more evident in the mouse data, where even more pixels are either assigned 0% or 100%.
3.6 Dynamic in vivo testing demonstrates that LSTM method can reveal physiological processes.
To provide a quantifiable decrease in the in vivo sO 2 levels in mice that could test the capability of the LSTM-based method, we imaged N=6 mice when experiencing asphyxiation breathing 100% CO 2 . 59sO 2 estimates were extracted from the major visible organs in the scan (spleen, kidneys, spinal cord, and aorta) at 3 minutes before and 10 minutes after CO 2 asphyxiation.2).Notably, the direction of predicted effects on sO 2 levels aligns well between LU and data-driven unmixing methods.Still, the effect sizes are up to three times greater when utilizing data-driven approaches, demonstrating a wider dynamic range.Intriguingly, in the case of the aorta, LU predicts an increase in sO 2 levels despite the expected global decrease in sO 2 levels due to CO 2 exposure.This may be caused by the aforementioned increase in absorption coefficient in the periphery leading to an increase in spectral colouring in depth.

Discussion
We present an LSTM-based method for estimating sO 2 from multispectral PA images.We demonstrate that it can yield superior inference results compared to the previously proposed learned spectral decolouring method while at the same time being usable in a flexible manner, making it a promising candidate to replace LU as the de facto state of the art.We also show that the performance of the trained networks is highly dependent on the training data and that changes in simulation parameters can lead to drastically different data distributions.We thus propose to use the Jensen-Shannon divergence (D JS ) to complement the LSTM-based method.D JS correlates with the median absolute sO 2 estimation error and can thus be used to select the best-fitting training dataset or to optimise the training data distribution to fit the target application.We highlight how the interplay of the LSTM-based method with D JS can be used on a diversity of in vivo human and mouse data acquired with different scanners and demonstrate that the LSTMbased method can reveal significant dependencies in sO 2 changes that conventional LU would fail to identify.The LSTM-based method further consistently outperforms LU, with estimated sO 2 values aligning better with ground truth measurements in gello and literature references in vivo.
LU also shows significant outliers in regions where imaging artefacts are present, which are either not present or less pronounced when using the LSTM-based method.
Our in silico cross-validation reveals that acoustic modelling and image reconstruction introduces systematic spectral changes not explained by the initial pressure spectrum alone.Thus, an accurate digital model of the clinically used device is crucial during data simulation to ensure the best algorithm performance.While the combined dataset showed promising results in silico, these were not replicated in the experimental datasets, which may suggest that the network is overfitting and able to differentiate between the different simulated datasets.
D JS appears to be a valuable measure for determining the optimal training dataset for the LSTMbased method, as it correlates with the median absolute sO The in vivo experiments with CO 2 asphyxiation showed an increase in signal amplitude at 800 nm in the periphery with a decrease in signal in the centre of the mouse.These phenomena suggest an overall increase in the absorption coefficient, which could be caused by a range of factors, including: blood coagulation, 61 erythrocyte aggregation, 62 the presence of bicarbonate ions (HCO − 3 ) in the blood formed by the dissociation of carbonic acid into bicarbonate and hydrogen ions, 48 or an increase in blood volume due to blood stasis.Vasoconstriction of the capillaries leading to pallor mortis could also play a role in the better visibility of the superficial organs. 63aving applied the LSTM-based method combined with D JS to a diverse range of data, we believe they provide a promising route to replacing LU for pixel-based PA oximetry.Combining the flexibility in application of LU with the increase in accuracy of deep learning-based unmixing methods is attractive.The CO 2 experiment suggests that LU can underestimate the effect size of sO 2 changes due to its compressed dynamic range and susceptibility to artefacts.We thus recommend that LU should be complemented by a deep learning-based estimation method.The codes and data of this study are available open-source and the method will be integrated into the PATATO toolkit, 64 facilitating its widespread testing and future application.
Nonetheless, there remain limitations to this study that should be the subject of further research.In this work, we investigated D JS predictions on a predefined set of datasets.Based on the results, we believe that D JS might be suitable to be used within a minimisation scheme to either manually or automatically determine the best choice of simulation parameters for a given data set, but this remains to be investigated.To account for spectral colouring artefacts more robustly, the full 2D -or better 3D -tissue context should be taken into account within the neural network, as quantitative photoacoustic imaging is only possible with the full 3D context. 12,21 dditionally, we have shown that good training data are key for deep learning-based methods for PA oximetry, as such, simulating data as realistically as possible is important.One direction towards this is to make use of domain adaptation methods 27,28,65 that adapt simulated training data to appear more realistic.

Conclusion
The presented LSTM-based approach for sO 2 estimation from multispectral PA images surpasses the performance of LU and a previously reported data-driven oximetry method, making it a promising candidate to replace LU as the state of the art.We address the impact of training data variations by introducing the Jensen-Shannon divergence (D JS ) as a valuable complement, enabling selection of optimal datasets and fine-tuning for specific applications.Our LSTM-based method consistently outperforms LU, aligning well with ground truth measurements and literature references, while mitigating outliers in regions prone to imaging artifacts.The combination of the flexibility of the novel LSTM-based method with D JS for training data optimisation is a promising direction to make data-driven oximetry methods robustly applicable for clinical use cases.
Academic Hardware Grant Program and utilised two Quadro RTX 8000.The authors would like to thank Dr Mariam-Eleni Oraiopoulou for the helpful discussions.

Fig 1
Fig 1 Overview of the methods used.A Photon transport was simulated with a Monte Carlo light model for each of 25 distinct datasets adapted from a baseline tissue assumption (BASE).Spectra with simulated amplitude for each wavelength were extracted from these simulations.B A deep learning network based on a long short-term memory (LSTM) network was introduced to enable greater flexibility regarding input wavelengths for analysis.The hidden state of the LSTM was passed to fully connected layers, which output the estimated blood oxygenation sO 2 .C The performance of the LSTM-based method when trained on datasets with different tissue simulation parameters was tested across different datasets, ranging from in silico simulations, in gello phantom measurements, to in vivo measurements.This figure was created with Inkscape using BioRender assets.

Fig 2
Fig 2 The LSTM-based method shows wavelength flexibility.A LSTMs were trained with varying numbers of wavelengths (N λ ) to show that with an increasing number of wavelengths, the accuracy of the predictions increases.B LSTM trained at a given N λ can be applied to data with different N λ but yields best results when N λ of the test spectra matches that of the training spectra (indicated by the green vertical line).

Fig 3
Fig 3 Dimensionality reduction and cross-validation reveal systematic differences between training datasets.A An LSTM-based network trained on each dataset is then applied to every other dataset and all ϵsO 2 (median absolute error in percentage points) can be visualised as a performance matrix.Dataset names are shortened for visibility but are detailed in Table 1.B Uniform Manifold Approximation and Projection (UMAP) projections of the four representative example datasets onto an embedding of all training data.C Mapping the ground truth sO 2 onto the same projection reveals a correlation along the first UMAP axis.

Fig 5
Fig 5 Estimation of flow phantom data highlights performance dependence on training dataset.Three example images of the D 2 O flow phantom are shown at different time points (0, 22, 44 min) displaying A the photoacoustic signal intensity at 700 nm with a red contour marking the blood-carrying tube and B the sO 2 estimations from different methods.We visually compare the performance of linear unmixing (LU, C), learned spectral decoloring (LSD, D) and the lstm-based method (LSTM, E) by plotting the sO 2 estimations over the same image section and time points shown in A.

3. 5 7 D 7 DFig 6
Fig 6 Jensen-Shannon divergence (D JS ) proves valuable for in vivo data.LU and LSTM applied to measurements of the human forearm (A-F) and mice abdomens (G-L).Panels A and G show the photoacoustic signal at 800 nm and panels B and H show the spread of D JS estimates for the training datasets.Panels C and I show boxplots of the highlighted regions of interest over all N=7 subjects.The horizontal lines show expected sO 2 values for arterial blood (red) and mixed blood (blue).sO 2 images are shown for models trained on BASE (D, J) and the best (E, K) and worst (F, L) dataset as predicted by D JS .On the bottom right of these images, the value distribution is shown as a grey histogram with the mean values of the regions of interest highlighted in their respective colour.

Fig 7
Fig 7 Data-driven methods estimate an increased sO 2 dynamic range during CO 2 delivery compared to linear unmixing.A single representative mouse is shown here.Panels A-D show the photoacoustic image (A) and sO 2 estimation results (B-D) 3 minutes before asphyxiation and panels E-H show the photoacoustic image (E) and sO 2 estimation results (F-H) 10 minutes after asphyxiation.We show sO 2 estimates for linear unmixing (B, F), the baseline dataset (BASE, C, G), and the best dataset as predicted by the Jensen-Shannon divergence (WATER 4cm D, H).Panels A and E show the outlines of the full-organ segmentations, and all other panels B-D, F-H show outlines of the segmented regions used in Table2.
in LSTMs, which arises when a substantial portion of the input parameter space consists of zeroes.Next, a network trained at a fixed N λ (in this case N λ = 20) was tested on data with a different N λ (Fig2 B).Accuracy was found to decrease rapidly if fewer wavelengths were used for testing, but the error remains low if slightly more wavelengths are used.Nevertheless, the results show that the LSTM-based network performs best if the N λ used during inference matches the N λ during training.
Fig 4 The Jensen-Shannon divergence (D JS ) can predict estimation performance.A D JS correlates with the median absolute sO 2 estimation error ϵsO 2 when applying all networks, each trained on a distinct training dataset, to the baseline dataset (BASE).B D JS for the D 2 O flow phantom data, which shows a similar correlation with ϵsO 2 .C After removing two outliers, D JS shows the same degree of correlation with ϵsO 2 for the H 2 O flow phantom data.

Table 2 sO
2 decreases during CO 2 asphyxiation.Reported are the mean ± standard deviation of sO 2 measurement before CO 2 asphyxiation (Before) and the change in the mean (∆sO 2 ) after the procedure.Values are reported at a maximum depth of 3 mm into the mouse body for: linear unmixing (LU), an LSTM-based network trained on the baseline training dataset (BASE), and on the best fitting dataset according to the Jensen-Shannon divergence (WATER 4cm).p-values for ∆sO 2 are calculated using a non-parametric Mann-Whitney U test and indicated by (n.s. = not significant; * = p< 0.05; ** = p< 0.01).The segmentation masks are adjusted in Fig 7 B-D, F-H to show the region of interest considered when calculating the results.
60estimation error ϵsO 2 in silico (R=0.76) and in gello (R=0.74/0.72).D JS predicts a plausible training dataset for all three in vivo applications tested in this study, where the predicted value range was 0.4 -0.7 on the mouse data, 0.5 -0.8 on the forearm data, and 0.3 -0.7 on the CO 2 data.With the development of fast and autodifferentiable simulation pipelines,60it should be possible to optimise the simulation parameters for accurate sO 2 estimates by iteratively minimising D JS .When using differentiable implementations of distribution distance measures, it might even be possible to integrate this optimisation into an unsupervised training routine.The in gello experiments with H 2 O as the coupling medium had high ϵsO 2 errors for all sO 2 estimation methods and D JS was consistently high.The predicted best training dataset was WATER 4cm and the worst ILLUM POINT, which is consistent with the in vivo mouse experiments also having H 2 O as the coupling medium.Contrary to D JS prediction, ILLUM POINT has the lowest ϵsO 2 , resulting in no correlation (R=-0.11);after removing outliers, the correlation was on par with the other experiments (R=0.72).In both cases, the estimation error was lower than predicted by D JS ; while the distributions were different, the trained networks still managed to estimate accurate sO 2 values from the training data.Specifically for the ILLUM POINT dataset, judging from the quasibimodal distributions of the network's estimates on in vivo data, appears to have achieved a good agreement with high sO 2 values purely by chance.Such outliers are naturally obscured by summary measures such as D JS , thus expert oversight is recommended.