Discrimination of cancerous from benign pigmented skin lesions based on multispectral autofluorescence lifetime imaging dermoscopy and machine learning

Abstract. Significance Accurate early diagnosis of malignant skin lesions is critical in providing adequate and timely treatment; unfortunately, initial clinical evaluation of similar-looking benign and malignant skin lesions can result in missed diagnosis of malignant lesions and unnecessary biopsy of benign ones. Aim To develop and validate a label-free and objective image-guided strategy for the clinical evaluation of suspicious pigmented skin lesions based on multispectral autofluorescence lifetime imaging (maFLIM) dermoscopy. Approach We tested the hypothesis that maFLIM-derived autofluorescence global features can be used in machine-learning (ML) models to discriminate malignant from benign pigmented skin lesions. Clinical widefield maFLIM dermoscopy imaging of 41 benign and 19 malignant pigmented skin lesions from 30 patients were acquired prior to tissue biopsy sampling. Three different pools of global image-level maFLIM features were extracted: multispectral intensity, time-domain biexponential, and frequency-domain phasor features. The classification potential of each feature pool to discriminate benign versus malignant pigmented skin lesions was evaluated by training quadratic discriminant analysis (QDA) classification models and applying a leave-one-patient-out cross-validation strategy. Results Classification performance estimates obtained after unbiased feature selection were as follows: 68% sensitivity and 80% specificity with the phasor feature pool, 84% sensitivity, and 71% specificity with the biexponential feature pool, and 84% sensitivity and 32% specificity with the intensity feature pool. Ensemble combinations of QDA models trained with phasor and biexponential features yielded sensitivity of 84% and specificity of 90%, outperforming all other models considered. Conclusions Simple classification ML models based on time-resolved (biexponential and phasor) autofluorescence global features extracted from maFLIM dermoscopy images have the potential to provide objective discrimination of malignant from benign pigmented lesions. ML-assisted maFLIM dermoscopy could potentially assist with the clinical evaluation of suspicious lesions and the identification of those patients benefiting the most from biopsy examination.


Introduction
Skin cancer is the most common type of cancer in the United States, with melanoma being the fifth most prevalent among men and women. 1 The 5-year survival rate of patients with earlystage skin melanoma is ∼94%; however, ∼13% of skin melanoma patients are diagnosed with lesions already at intermediate or advance stages, 1 which are associated with 5-year survival rates of ∼61% and ∼27%, respectively. The most common diagnosis strategy for skin cancer is clinical evaluation of suspicious lesions followed by biopsy for histopathological evaluation to confirm diagnosis and tissue staging. One major drawback of this practice is the inability to clinically distinguish between similar lesions; in particular, melanoma is often mistaken for other benign pigmented lesions such as seborrheic keratosis (pSK). In addition, it is known that the accuracy of melanoma diagnosis with unaided eye is only about 60%. 2 Therefore, clinical tools that could provide objective, in situ, and accurate noninvasive discrimination between malignant and benign skin lesions during clinical examination could significantly improve early detection of skin cancer, reduce the risk of adverse events, and lead to improved cost-conscious patient care.
One of the most common tools used by physicians to diagnose skin cancer lesions is the dermoscope 2-4 which helps the unaided eye by magnifying the features on the skin. This allows doctors to examine the morphological features of concerning lesions at a significantly more detailed level. Although dermoscopy is known to improve the diagnostic sensitivity of skin lesions by ∼10% to 30%, its performance largely depends on both the level of experience of the dermatologist and the type of lesions. 2 The highly subjective nature and poor reproducibility of this method have led to the emergence of several proposed computer-aided diagnostic (CAD) systems. 3,[5][6][7] CAD systems are becoming largely popular in both diagnosis and prognosis of various diseases as they allow automated and noninvasive analysis of the tissue conditions. Table 1 summarizes some of the published works that reports the diagnosis and classification of pigmented skin lesions. Most of the works used dermoscopic images that were either collected by the authors or from publicly available datasets (e.g., ISIC archive, ISBI, Atlas, HAM10000, or PH2). Celebi and Zornberg 8 explored the clinically significant colors in dermoscopic images using K-means clustering and employed symbolic regression to classify the lesions. Ramlakhan and Shang 9 designed a melanoma recognition system using smart phone photographs that are classified using k-nearest neighbor (kNN) algorithm. Satheesha et al. 10 examined computerized three-dimensional (3D) dermoscopy features of skin cancer lesions to develop multiclass classifiers using Adaboost, bag of features (BoF), and support vector machine (SVM) techniques. Khristoforova et al. 11 used logistic regression to classify benign and malignant skin lesions using spectral features from Raman and autofluorescence spectroscopy measurements.
Classification of dermoscopy images of benign and malignant skin lesions using different deep learning approaches has also been reported. Harangi 13 used an ensemble of different convolutional neural network (CNN) classifiers, while Romero Lopez et al. 14 used transfer learning with pretrained VGGNet CNN architecture. Majtner et al. 15 combined CNN with SVM classifier using handcrafted RSurf features and local binary patterns to classify melanomas from other benign skin lesions. Lee et al. 16 developed the WonDerm pipeline that segments the skin cancer dermoscopic images using neural network architectures and classifies it using an ensemble approach. Amin et al. 17 extracted features using pretrained AlexNet and VGG16 deep learning architectures, performed feature selection using principal component analysis, and applied traditional machine learning models including SVM, kNN, and discriminant analysis. Jojoa Acosta et al. 18 utilized transfer learning with ResNet-152 architecture to classify benign and malignant skin lesions using dermoscopic images. It has been widely established that autofluorescence responses of intrinsic fluorophores vary significantly between normal and neoplastic tissues. [19][20][21][22][23] Neoplastic progressions in the epithelial tissue are associated with morphological, biochemical, and functional alterations which can cause changes in the autofluorescence responses from the tissue. [22][23][24] The skin has several intrinsic fluorophores, such as nicotinamide adenine dinucleotide (NADH), flavin adenine dinucleotide (FAD), collagen, elastin, keratin, melanin, and porphyrins. 19,[25][26][27] The levels of two metabolic cofactors and endogenous fluorophores in the epidermis, the reduced-form NADH and FAD, can change as skin cancer develops. 19,24 The optical redox ratio, typically defined as the ratio of fluorescence intensity of NADH to FAD, is sensitive to changes in the cellular metabolic rate. Increased cellular metabolic activity, a hallmark of neoplastic cell transformation, is usually attributed to a decrease in the optical redox ratio. In addition, the fluorescence lifetimes of these metabolic cofactors are sensitive to protein binding, thus to cellular metabolic pathways involving NADH and FAD. As a result, carcinogenesis process has been shown to cause changes in both NADH and FAD fluorescence lifetimes. Finally, cancer development also leads to extracellular matrix remodeling occurring within the dermis, which together with concurring epidermis thickening, result in a decrease in connective tissue autofluorescence that can be measured. Therefore, interrogation of NADH, FAD, and collagen autofluorescence could provide optical biomarkers of skin epithelial cancer.
Preferential excitation of these endogenous fluorophores in the tissue by multiple excitation sources could shed light on the biochemical changes in the target lesion area. 28 The broad emission spectral bandwidth of the fluorescence intensity signal has an intrinsic disadvantage in that it is difficult to differentiate between the intensities of overlapping emissions from multiple fluorophores. Time-resolved technique such as multispectral autofluorescence lifetime imaging (maFLIM) overcomes this challenge by quantifying the fluorescence lifetime in addition to the emission spectrum. Alex et al. 29 demonstrated fluorescence lifetime-based imaging of minipig skin and human skin to specifically target the endogenous fluorophores: keratin, NADH, melanin, elastin, and collagen under 725-nm multiphoton excitation. The capability of such optical biopsy techniques to serve as promising tools for dermatological research to facilitate preclinical and clinical translation is also highlighted. Huck et al. 30 demonstrated the effectiveness of the combined modality, multiphoton-based intravital tomography and fluorescence lifetime imaging to monitor the progression of inflammatory skin diseases. The two modalities studied the biochemical changes induced by the redistribution of mitochondria at different stages of inflammatory skin conditions. Several animal and human tissue studies have been published on the autofluorescence properties of skin cancer lesions. [31][32][33][34][35] Pastore et al. 36 conducted experiments with mouse models to study the autofluorescence response from melanoma skin lesions using multiphoton excitation at 740 and 900 nm and emission spectral bands at 447 and 540 nm. A significant difference in the bound and free NADH ratio between cancerous and noncancerous sites was observed, while the fluorescence decay obtained from targeting FAD remained almost the same between the two regions. It was also mentioned that the presence of melanin in the deeper layers of the skin tissue could interfere with the overall fluorescence response from the lesions. Miller et al. 32 studied the autofluorescence emission properties between squamous cell carcinoma (SCC) bearing and normal mice skin under 480-nm excitation, and a decrease in the short lifetime component for SCC in comparison to normal skin was observed for 535-nm emission band. Drakaki et al. 37 studied the autofluorescence responses from mouse, chicken, and pig skins under ultraviolet (UV) excitation, and the structural differences and the variations in tissue constituents were investigated between the different animal species for an emission spectral band between 340 and 950 nm. De Beule et al. 34 investigated the autofluorescence response from ex vivo biopsy skin lesions under 355-and 440-nm excitations, and the average fluorescence lifetime was found to be useful in discriminating basal cell carcinoma (BCC) from normal skin tissues at the emission band between 390 and 600 nm. Galletly et al. 31 imaged unstained human biopsy samples using maFLIM under a 355-nm pulsed laser excitation, and significant differences in the mean fluorescence lifetimes for the emission wavelengths 375 and 455 nm were observed between the autofluorescence responses from BCC skin lesions and healthy skin. Lohmann and Bodeker 38 analyzed the fluorescence intensities at the emission wavelength 470 nm, from human skin with melanoma, nevi, and dysplastic nevi lesions under 365-nm excitation, and a significant difference in fluorescence intensities was observed for melanoma and nevi lesions, while melanoma and dysplastic nevi lesions did not show much difference. Fast et al. 39 investigated the autofluorescence response from human skin at 780-nm frequency doubled excitation and two emission channels at 535 and 720 nm corresponding to red and green channels. Red channel collects fluorescence emission from melanin, while the green channel collects emission from keratin, NAD(P)H, FAD, and elastin.
In this work, we developed and validated a label-free and objective image-guided strategy for the clinical evaluation of suspicious pigmented skin lesions based on maFLIM dermoscopy. In addition, a computationally efficient frequency-domain deconvolution of maFLIM data is explored, and three different pools of global image-level maFLIM features were evaluated for machine-learning (ML)-based objective discrimination between malignant and benign pigmented skin lesions.

Methods
A summary of the complete methodology performed in this study is shown in Fig. 1.

maFLIM Dermoscopy Imaging of Skin Lesions
A total of 30 patients (n patients ¼ 30) from the Dermatology Department of the Amaral Carvalho Cancer Hospital (Jahu, Sao Paulo, Brazil) were recruited for this study, following a human study protocol approved by the Internal Review Board of that institution (CAAE: 71208817.5.00005434).
Only patients presenting at least one pigmented skin lesion undergoing biopsy examination for skin cancer diagnosis were recruited. The pigmented skin lesions considered in this work are solar lentigo, pSK, pigmented superficial BCC, pigmented nodular BCC, and melanoma. maFLIM images were obtained from clinically suspicious lesions using an in-house developed time-domain maFLIM dermoscope previously described. 40 With this maFLIM dermoscope, skin tissue autofluorescence is simultaneously imaged at three emission bands (390 AE 20, 452 AE 22.5, and >496 nm, preferentially targeting collagen, NADH, and FAD autofluorescence emission, respectively) with a temporal resolution of 0.4 ns, field-of-view (FOV) of 8.65 ⋅ 8.65 mm 2 , and lateral resolution of 120 μm. For the rest of the paper, the emission wavelengths at the three spectral channels will be more conveniently referred to as 390, 452, and 500 nm. After signing the corresponding written informed consent form, each patient underwent the following imaging protocol right before the scheduled biopsy examination procedure. First, the lesion was gently cleaned with a gauze soaked in a saline solution. Then, the tip of the maFLIM dermoscope, previously disinfected using a gauze soaked in 70% ethanol, was placed in contact with the lesion, and an maFLIM image was acquired. The imaging site was selected so regions within and outside the visible lesion were present within the FOV of the maFLIM dermoscope. Right after maFLIM imaging, lesion tissue biopsy was performed following standard procedures. Each maFLIM image was labeled based on the histopathological evaluation of the lesion biopsy, which was considered the gold standard in this study. All images were acquired with a laser excitation at 355 nm and average excitation power of 10 mW measured at the sample, 140 × 140 pixels per image, and at a pixel rate of 10 kHz. These image acquisition parameters corresponded to an acquisition time of 1.96 s per image and an excitation energy exposure of 1.96 mJ at the sample, which is significantly lower than the maximum permissible exposure levels for skin based on guidelines from the American National Standards Institute -ANSI. 41 The total number of lesions imaged from the 30 patients was 60 (i.e., n lesions ¼ 60). An instrument response function (IRF) was measured by acquiring the reflection of excitation pulse by placing a mirror at the sample end.

maFLIM Data Preprocessing
Pixel-level preprocessing: The maFLIM data measured at each image pixel ðp; qÞ are composed of three fluorescence intensity temporal decay signals y m;λ ðp; q; tÞ measured at the three targeted emission spectral bands (λ). The preprocessing steps applied to each pixel maFLIM temporal signal is shown in Fig. 2(a). First, offset subtraction was applied to the raw maFLIM signal, y m;λ ðp; q; tÞ, followed by spatial averaging (order 5 × 5) to increase the signal-to-noise ratio (SNR) of the time-dependent signal. The offset was estimated by fitting a straight line on the first and last five time points in each channel. The baseline was then subtracted from the entire time vector to obtain the corrected signal. Since the background fluorescence was significantly lower than the sample fluorescence, the background correction of the signals was not performed. Second, the duration of the temporal decay signals for all emission bands was adjusted to the length of the longest signal among the three emission channels, which is 149 temporal samples (59.6 ns) by applying zero padding to the short signals. Finally, the signals from the three emission channels, y λ ðp; q; tÞ, are concatenated to form yðp; q; tÞ as shown in Eq. (1). Signal concatenation is essential for cluster analysis in image level preprocessing, explained later in this section, as well as for frequency-domain deconvolution explained in Sec. 2.3.2. The concatenated signal at each pixel location can be represented as (1) where yðp; q; tÞ is the preprocessed concatenated maFLIM decay signal; y λ 1 ðp; q; tÞ, y λ 2 ðp; q; tÞ, and y λ 3 ðp; q; tÞ are the preprocessed maFLIM decay signals from each of the three spectral channels, λ 1 ¼ 390 nm, λ 2 ¼ 452 nm, and λ 3 ¼ 500 nm; M is the temporal spacing between the signals from the three channels; ðp; qÞ indicates the pixel locations. The value of M is equal to 149, which is the length of the fluorescence emission decays in each channel. Image-level preprocessing: Pixels presenting either signal saturation or low SNR (<15 dB) were detected and masked. The majority of the acquired maFLIM images contain pixels from within and outside the skin lesion region; thus, cluster analysis was performed to group pixels based on their region of origin. The concatenated signal yðp; q; tÞ is used as the feature vector for cluster analysis at each pixel location to simultaneously include the information from the three emission channels. The steps for cluster analysis are as follows: First, an unsupervised K-means clustering algorithm was applied to generate two cluster masks. Then, each cluster mask is applied to the maFLIM image to define two regions within the FOV of the maFLIM image. It should be noted that since the K-means clustering algorithm involves random initialization of cluster centroids, it is difficult to identify which cluster mask belongs to within or outside the skin lesion region; thus, the identified regions were taken as two arbitrary regions: region-1 and region-2. Figure 2(b) shows an example of the cluster mask and the two separated regions generated from a representative maFLIM image.

Features based on time-domain deconvolution parameter estimation
In the context of time-domain maFLIM data analysis, the fluorescence decay y λ ðp; q; tÞ measured at each emission spectral band (λ) and spatial location ðp; qÞ can be modeled 42 as the convolution of the fluorescence impulse response (FIR) h λ ðp; q; tÞ of the sample and the measured IRF u λ ðtÞ: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 2 ; 1 1 6 ; 4 0 9 y λ ðp; q; tÞ ¼ u λ ðtÞ Ã h λ ðp; q; tÞ: (2) The standard method for time-domain maFLIM data analysis proceeds by first deconvolving the IRF of each spectral band (u λ ðtÞ) from the corresponding measured time-resolved fluorescence signal y λ ðp; q; tÞ to estimate the sample FIR for each image pixel, h λ ðp; q; t), 39 which is usually modeled as a multiexponential decay. The model order (number of exponential components) can be selected by analyzing the model-fitting mean squares error (MSE) as a function of the model order. For the maFLIM data of this study, a model order of two was selected, since the addition of a third component did not reduce the MSE. The variations in error for one, two, and three exponential components during fitting is shown in Fig. S1 in the Supplemental Material. The FIR was modeled as where τ fast;λ and τ slow;λ represent the time-constant (lifetime) of the fast and slow decay components, respectively; while α fast;λ and α slow;λ represent the contribution of the fast and slow decay components, respectively. The average fluorescence lifetime for each spectral band at each pixel location is computed as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 4 ; 1 1 6 ; 1 8 5 The parameters of the biexponential decay model are estimated for each pixel by nonlinear least squares iterative reconvolution. 42 After deconvolution, the biexponential parameters estimated at each pixel can be used as features representing the temporal dynamics of the fluorescence decays at each emission spectral band: α fast;λ ðp; qÞ, α slow;λ ðp; qÞ, τ fast;λ ðp; qÞ, τ slow;λ ðp; qÞ, and τ avg;λ ðp; qÞ. Since the sum of α fast;λ ðp; qÞ and α slow;λ ðp; qÞ is equal to one, only one of them is kept as a feature.
In addition, the following spectral intensity features can be also estimated from the deconvolved FIR, h λ ðp; q; tÞ. Absolute fluorescence intensities I λ ðp; qÞ for each emission spectral bands are simply computed by time integrating the FIR h λ ðp; q; tÞ: The normalized fluorescence intensities I λ;n ðp; qÞ can be also computed from the multispectral absolute fluorescence intensities I λ ðp; qÞ as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 6 ; 1 1 6 ; 6 3 2 I λ;n ðp; qÞ ¼ I λ ðp; qÞ P λ I λ ðp; qÞ : Lastly, the ratio of absolute intensities from the three spectral channels is computed at each pixel location resulting in three additional features: In general, the features can be computed at the pixel or the image level. In this study, imagelevel global features were explored, whereby one set of features, a single feature vector, is estimated to represent the whole image. Each global feature was computed from the corresponding pixel-level maFLIM feature map as follows. Based on the two regions identified per image using the cluster analysis described in Sec. 2.2, the feature median value for each region was computed, and the absolute value of their difference was taken as the global feature: Since Feature Global is the absolute difference between the feature median values from the two clustered regions, it is independent of labeling the regions as either lesion or healthy. This is particularly beneficial as it is difficult to label the clustered regions due to the unavailability of the pixel-level ground truth labels as well as the randomness in the K-means algorithm. These defined global features have the advantage of reducing patient-to-patient variability in the extracted features, which is particularly important as the color and texture of skin vary considerably with ethnicity and age. This feature extraction approach based on time-domain deconvolution of the maFLIM data generates a total of six intensity and 12 biexponential global maFLIM features, as summarized in Table 2.  43 where a modelfree representation of maFLIM data was developed utilizing the frequency-domain properties of the phasor representations. Here, we aim to replace the iterative time-domain deconvolution process by a simple division operation in the frequency domain. 44 The computational overload is further reduced by processing the fluorescence decays of all the three spectral channels together, unlike in the traditional method where the maFLIM signal for each spectral channel must be processed separately. Subsequently, several features can be extracted from the frequency-domain phasor representation of the maFLIM data. 45,46 The presented method proceeds in three steps: (1) performing frequency-domain deconvolution of the instrument response from the concatenated pre-processed fluorescence decays from all the three spectral channels, (2) constructing phasor plots for the maFLIM data, and (3) extracting global features from the phasor plots representing each maFLIM image. A detailed description of this method is presented as follows.
In this approach, the preprocessed and concatenated maFLIM signals at each pixel location, yðp; q; tÞ are normalized to sum one for further processing and feature extraction. Similar to the concatenated, preprocessed signal, yðp; q; tÞ, the concatenated IRF from all the three spectral channels can be mathematically represented as where uðtÞ is the concatenated IRF; u λ 1 ðtÞ, u λ 2 ðtÞ, and u λ 3 ðtÞ are the IRF signals from the three spectral channels; M is the temporal spacing between the signals from the three channels; and ðp; qÞ indicates the pixel positions. The first step of the algorithm is to compute the Fourier transform (FT) of both the preprocessed concatenated signal and the concatenated IRF. The FT of the signals yðp; q; tÞ and uðtÞ can be represented as Yðp; q; ωÞ and UðωÞ, respectively, where ω is the angular frequency: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 2 ; 1 1 6 ; 3 4 2 Yðp; q; ωÞ ¼ FTfyðp; q; tÞg; E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 3 ; 1 1 6 ; 2 9 9 UðωÞ ¼ FTfuðtÞg: If the effective FIR from all the three fluorescence emission channels is denoted as hðp; q; tÞ, the effective fluorescence frequency response Hðp; q; ωÞ is obtained from the FT of hðp; q; tÞ as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 4 ; 1 1 6 ; 2 6 5 Hðp; q; ωÞ ¼ FTfhðp; q; tÞg: Therefore, the convolution in Eq. (1) can be represented as a multiplication in the frequency domain as To uniformly scale all the frequency components of Yðp; q; ωÞ, the normalization with respect to the DC response Yðp; q; 0Þ can be applied as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 6 ; 1 1 6 ; 1 5 5 Yðp; q; ωÞ Yðp; q; 0Þ ¼ Hðp; q; ωÞUðωÞ Hðp; q; 0ÞUð0Þ : A phasor representation for the normalized frequency response Pðp; q; ωÞ at specific values of the frequency ω can be generated by plotting the real Re½Pðp; q; ωÞ versus the imaginary Im½Pðp; q; ωÞ components of Pðp; q; ωÞ. Therefore, each pixel of the maFLIM image is mapped to a point in the corresponding phasor plot generated for a specific frequency. This transformation is shown in Fig. 3, where a representative maFLIM image is mapped to its corresponding phasor plot for an arbitrary frequency. Region-1 and region-2 marked on the maFLIM image represent the regions obtained after clustering. The pixels of each region are mapped into a two-dimensional (2D)-histogram distribution on the phasor plot, as shown in Fig. 3.
From the phasor representations of the maFLIM images at specific frequency components, ω ¼ 2πf, the following features were computed as follows. First, a bivariate Gaussian function was fitted to the phasor distribution of each region (region-1 and region-2) of a given maFLIM image: (Fig. 4). The "distance" between the phasor distributions of the two regions was then estimated as the magnitude of the difference of the distribution means: d ¼ jμ 1 − μ 2 j [ Fig. 4(a)]. The determinant of the covariance matrix jΣj from the fitted Gaussian distribution provides a measure of the "spread" of the distribution. The difference in spread of the phasor distributions of the two regions was thus estimated as: Fig. 4(b)]. The "angle" θ between major axes of the phasor distributions of the two regions was estimated as the acute angle between the eigenvectors of maximum variance of the multivariate Gaussian distributions (Fig. 4(b)). Finally, the "symmetry" of the Gaussian distribution can be quantified as the ratio of the variances along the orthogonal directions, s ¼ Fig. 4(c)].
The difference in symmetry of the phasor distributions of the two regions was thus estimated as: The fluorescence frequency response Hðp; q; ωÞ is bandlimited with a bandwidth of ∼60 MHz. To cover the bandwidth of Hðp; q; ωÞ, only the first nine frequency components of Hðp; q; ωÞ were selected, corresponding to the frequencies 5.6, 11.2, 16.8, 22.4, 28, 33.6, 39.2, 44.8, and 50.4 MHz. These frequency values are the first nine harmonics of the signal Fourier spectrum, which was calculated at a frequency resolution of 5.6 MHz (sampling frequency/#samples ¼ 2.5 GHz∕ð3 × 149Þ). For each of these frequency components, the four phasor features were computed, resulting in a total of 36 phasor features.

Feature Selection
For maFLIM dermoscopy-based automated classification of benign versus malignant skin lesions, a simple quadratic discriminant analysis (QDA) model was explored with the global where n features is the number of features in each feature pool. In addition to these individual feature pools, different combinations of the feature pools were also considered: phasor ∪ biexponential (n features ¼ 48), phasor ∪ intensity (n features ¼ 42), intensity ∪ biexponential (n features ¼ 18), and phasor ∪ biexponential ∪ intensity (n features ¼ 54). Feature selection using sequential forward search (SFS) 47 was performed on each feature pool independently using a leave-one-patient-out cross-validation (LOPO-CV) strategy, whereby the data of one patient are left out at each fold [ Fig. 5(a)]. This assures that the left-out patient data are not used for feature selection and model training. Unlike exhaustive search where every possible combination of features is examined, SFS is computationally simpler and provides an efficient strategy to investigate the importance of the available features. The steps involved in the SFS feature selection process are shown in Fig. 5(b). The feature subset yielding the highest area under the curve (AUC) value in receiver operator characteristics (ROC) analysis was selected (f SFS ) at each iteration of the LOPO-CV. Subsequently, the QDA classifier is retrained using f SFS at each LOPO-CV iteration and then tested using the left-out patient data. A threshold of 0.5 on the generated posterior probability classifies the left-out patient data into either benign or malignant. Upon accomplishing all folds, each patient data had been used as test, and a confusion matrix was constructed to estimate the classification model performance. The number of features selected during SFS (n SFS ) is kept constant in each fold of the LOPO-CV. Therefore, to determine the number of features that can produce the best classifier, the experiments are repeated for different values of n SFS , varying from 1 to 7 (1 to 6 for intensity feature pool). The maximum number of features in the feature set is chosen based on Hua et al., 48 where the optimal number of features for feature sets with some degree of correlation is ffiffiffiffiffiffiffiffi ffi n data p , where n data is the number of data points. In this case, the maximum number of features is chosen as the closest and lower integer value to ffiffiffiffiffiffiffiffiffiffiffiffi ffi n lesions p ¼ ffiffiffiffiffi 60 p ¼ 7.75 ∼ 7. The value of n SFS that produced the best F-score computed from the final confusion matrix was then chosen as the number of features in the final feature set (n selected ). This is because F-score gives a combined estimate of both the sensitivity and specificity. However, in cases when two values of n SFS produce the same F-score, a higher sensitivity is given more preference. This is because it is critical to ensure that the malignant lesions are correctly classified to provide adequate and timely treatment to the patients.
Since the LOPO-CV iterates N patient times, where N patient is the number of patient data, the features f SFS selected in each iteration for a particular value of n SFS depend on the (N patient − 1) patient data that are not left out by that iteration. Thus, there can be some variation in the features that are picked out in each iteration of the LOPO-CV. Therefore, the selection frequency is noted, which is defined as the number of times each feature becomes part of the f SFS during all iterations of the LOPO-CV. This allowed to identify the most frequent (thus most relevant) features (f selected ) from each feature pool. It is to be noted that the number of features in each f selected is denoted as n selected .

Classification of Skin Cancer Lesions Using Selected Features
QDA classifiers trained on the best features, f selected , from the three feature pools as explained in Sec. 2.4 were also combined in an ensemble fashion as shown in Fig. 6(a). Separate classifiers are trained on "k" feature pools with an LOPO-CV loop. If all the feature pools are used, k ¼ 3, otherwise, k ¼ 2. The left-out patient data from the LOPO-CV are tested on each of the individual classifiers, generating a set of posterior probabilities, P pool-1 ; P pool-2 ; : : : ; P pool-k , corresponding to classifiers trained on each feature pool. Here, Pool-k is either phasor, intensity, or biexponential feature pool. Subsequently, a weighted average of the posterior probabilities is computed as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 8 ; 1 1 6 ; 1 9 8 P ¼ w 1 P pool−1 þ w 2 P pool−2 : : : þ w k P pool−k ; E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 9 ; 1 1 6 ; 1 5 4 w 1 þ w 2 þ : : : þ w k ¼ 1; where w 1 ; w 2 ; : : : ; w k are the weights on the posterior probabilities generated from each feature pool, while the sum of weights equals one. A threshold of 0.5 on the weighted average of the posterior probabilities assigns a label for the left-out patient data. Figure 6  3 Results

maFLIM Dermoscopy Clinical Imaging of Skin Lesions
The distribution of patients (n patients ¼ 30) and lesions (n lesions ¼ 60) imaged in this study showing benign and malignant conditions is provided in Table 3. Benign lesions included solar lentigo and pSK, while malignant lesions included pigmented superficial BCC, pigmented nodular BCC, and melanoma. Figure 7(a) shows a handheld maFLIM dermoscope imaging the forearm of a patient. The clinical photograph of a sample melanoma skin lesion is shown in Fig. 7(b), and its corresponding maFLIM feature maps are shown in Fig. 7(c). The scales of the feature maps across the three spectral wavelengths are kept the same for comparison purposes. Most of the feature maps (including α fast;390 , α fast;452 , τ fast;452 , τ fast;500 , τ slow;390 , τ slow;452 , τ slow;500 , τ avg;390 , τ avg;452 τ avg;500 ,   I n;452 , I n;500 , I 452;n ∕I 500;n and I 390;n ∕I 500;n ) clearly show two distinguishable regions: the center and the surrounding regions. The cluster masks for this sample image are also shown in the last row of Fig. 7(c). The maps showing the two clustered regions: region-1 and region-2 are plotted using the normalized integrated intensities (I Integrated ) from the three spectral emission channels. The relative contributions of fast lifetime (α fast ) for the spectral channels 390 and 452 nm exhibit a higher value in the central region compared with the surrounding regions. Fast lifetime (τ fast ) values for the spectral channels 452 and 500 nm are lower in the central region than the surrounding. Slow lifetime (τ slow ) values for all the three spectral channels are higher in the center than the surrounding regions. The average lifetimes (τ avg ) of the pixels from the central region are higher than the surrounding parts. The relative intensity (I n ) values for spectral channel 452 nm is lower in the center compared to the surrounding, while that for the spectral channel 500 nm is higher in the center compared with the surrounding. The ratios of the intensities, I 452;n ∕I 500;n and I 390;n ∕I 500;n , are lower in the central region compared with the surrounding regions.

Feature Selection
As explained in the feature selection process in Sec. 2.4, the multiple lesions from a single patient constitute the left-out data in each iteration; thus, the number of lesions that are tested in each iteration varies depending on the number of lesions that are imaged for the left-out patient. In this way, every patient becomes part of the testing, and a confusion matrix is generated after all LOPO-CV iterations are completed. As mentioned in Sec. 2.4, the number of features n SFS was varied from 1 to 6 for intensity feature pool, and 1 to 7 for phasor and biexponential feature pools. The optimum number of features n selected is chosen based on the highest F-score. Table 4 shows values of accuracy, sensitivity (Sn), specificity (Sp), and F-score Table 4 Performance metrics and confusion matrices obtained during feature selection with phasor, biexponential, and intensity feature pools. obtained when classifying benign and malignant skin lesions trained individually on the three feature pools (phasor, biexponential, and intensity). The table shows the results for n selected number of features in each feature pool. Results obtained during feature selection for all the values of n SFS is given in Tables S1-S3 in the Supplemental Material. QDA models trained on five biexponential features yielded the best performance with 75% accuracy, 84.21% sensitivity, 70.73% specificity, and 68.09% F-score. QDA models trained on six phasor features yielded the next best performance with 76.67% accuracy, 68.42% sensitivity, 80.49% specificity, and 65% F-score. QDA models trained on the intensity feature pool resulted in 84.21% sensitivity, but only 31.71% specificity.
The results from the combined feature pools show poor performance with low sensitivities and specificities. This is because the features selected by the SFS algorithm in the earlier iterations may not be the best when combined with those selected in the later iterations. From these results, it can be inferred that the five biexponential features and the six phasor features have potential in classifying benign and malignant skin lesions. The confusion matrices of the classifiers are also shown in Table 4.

Feature Relevance
To identify the important features in each feature pool, the number of times each feature becomes a part of f SFS during all iterations of the LOPO-CV was recorded. If a feature is selected at least 50% times during all the iterations, it will be considered an important feature and added to the f selected of that feature pool. Table 5 lists the important features from each feature pool and their selection frequencies during LOPO-CV. Thus, we can summarize the selected features from each feature pool as

Classification of Skin Cancer Lesions Using Selected Features
The methodology of classifying skin lesions using the selected features is explained in Sec. 2.5. As shown in Fig. 6, "k" QDA classifiers trained separately on the f selected features from "k" feature pools are combined in an ensemble fashion. Four different combinations of feature pools are used for constructing ensemble classifiers, as shown in Table 6. A weighted average of the posterior probabilities generated from the "k" QDA classifiers is calculated to predict the label of the left-out patient data. Since the weights are optimized within an LOPO-CV loop, there can be some variation in the weights selected during the n patients iterations. Figure 8 shows the histograms of the weights obtained during all the iterations for the ensemble classifiers using the four feature pool combinations. Since the sum of weights of "k" feature pools is one, it is sufficient to show the weights of (k − 1) feature pools. It can be seen from  Fig. 8 Histogram of weights on one of the feature pools, when combined in an ensemble fashion for (a) phasor-intensity, (b) biexponential-intensity, (c) phasor-biexponential, and (d) phasorbiexponential-intensity feature pools. Table 6 that the ensemble combination of QDA classifiers trained on phasor and biexponential features as well as the ensemble combination of all the three feature pools, produced the best performance with 88.33% accuracy, 84.21% sensitivity, 90.24% specificity, and F-score of 82.05%. The next best performance is obtained by the ensemble combination of phasor and intensity feature pools with an accuracy of 86.67%, sensitivity of 78.95, specificity of 90.4%, and an F-score of 78.95%. Ensemble combination of intensity and biexponential features produced an accuracy of 71.67%, sensitivity of 84.21%, specificity of 65.85%, and F-score of 65.31%. Table 6 also shows the confusion matrices for all the classifiers. While analyzing the weights on the feature pools during the ensemble combination, it can be seen in Figs. 8(a) and 8(b) that intensity features have a higher weightage when combined with phasor features. When phasor and biexponential features are combined, both the feature pools have similar weightage as shown in Fig. 8(c). Similarly, when biexponential and intensity feature pools are combined, both the pools have comparable weightage. When all the feature pools are combined, it can be seen from Fig. 8

Discussion
In this study, clinical widefield autofluorescence imaging of benign and malignant pigmented skin lesions was successfully performed in 30 patients using a recently developed maFLIM dermoscope. 40 The resulting maFLIM images from 60 pigmented lesions enabled exploring steady-state (intensity) and time-resolved (biexponential, phasor) autofluorescence global features. Results based on rigorous cross-validation methods demonstrate that simple ML classification models (QDA) based on selected time-resolved autofluorescence global features have the potential to provide discrimination of malignant from benign pigmented skin lesions.
To the best of our knowledge, only one published work has reported the use of machine learning models based on autofluorescence lifetime imaging features for the classification of pigmented skin lesions. 12 In that study, however, only skin melanoma lesions were imaged, and the classification task was restricted to discriminate early-stage from advanced-stage skin melanoma. In contrast, a more comprehensive set of pigmented skin lesions were imaged in this work (two benign and three malignant lesion categories). Moreover, the classification task focused on discriminating malignant from benign pigmented lesions, which might be clinically more relevant for early detection of skin cancer.
In multidimensional imaging data, such as in maFLIM, image features can be extracted at the pixel or the image level. We have recently explored pixel-level maFLIM features for the classification of oral dysplasia and early-stage cancer. 49 Pixel-level features, however, require the labeling of each pixel which is generally impractical. In this work, the maFLIM data were labeled at the lesion level based on the histopathology diagnosis obtained from the lesion biopsy samples; therefore, an image-level global feature extraction strategy was preferred. As shown in Fig. 7, two regions were frequently observed in the maFLIM images, corresponding to pixels either within or outside the lesion extension. In an attempt to reduce interpatient variability, the explored relative features were defined in terms of difference in autofluorescence properties between the two regions identified in each lesion maFLIM image. This strategy of using global (image-level) and relative features can find applications in many other classification tasks based on optical imaging data.
The performance of ML classification models needs to be carefully estimated when trained on limited sample size. To minimize overfitting and avoid overoptimistic performance estimations, a rigorous strategy was adopted for feature selection, model training, and performance estimation. First, the maximum number of features allowed (seven) was limited based on the sample size. 48 Second, a simple nonlinear classification model (QDA) was adopted. Third, crossvalidation was applied at the patient-level (LOPO-CV) to ensure that data from the same patient is not used for both training and validation. Fourth, feature selection was performed together with model training to make sure that the validation data are not used during neither feature selection nor model training. It should be noted that at each fold of the cross-validation strategy, a different classification model (with different selected features and model parameters) is applied to the validation set. Thus, although a single optimal model is not necessarily defined, this approach still enables identifying relevant features and providing unbiased classification performance estimation.
Classification performance was dependent of the feature pool used in the model. The most frequently selected intensity feature was I 452;n which is associated to NADH fluorescence contribution. Although the classification models using intensity features showed good sensitivity (∼84%), their specificity was poor (∼31%). This means that steady state intensity features are not sufficient to minimize false positives while discriminating benign and malignant lesions. Classification models using biexponential features showed similar sensitivity (∼84%) than those with intensity features, but significantly higher specificity (∼70%). These results indicate that time-resolved properties of pigmented skin lesion autofluorescence could represent biomarkers of skin cancer. While examining the selected features from the biexponential feature pool, it can be seen from Table 5 that the most frequently selected features were associated to NADH (τ fast;452 , α fast;452 ) and collagen (τ slow;390 , α fast;390 ) fluorescence temporal dynamics. The fast lifetime component of NADH (τ fast;452 , α fast;452 ) is associated with the free state of the molecule, 19 suggesting that metabolic pathway changes induced by malignant transformations alter the microenvironment of NADH molecules. Unlike NADH or FAD, the collagen is not involved in cellular respiration and is not part of metabolic pathways. Changes in collagen autofluorescence response can occur in benign and malignant lesions as they mostly correspond to skin thickening, extracellular matrix remodeling, and texture changes. 23 Therefore, the selected features corresponding to collagen autofluorescence (τ slow;390 , α fast;390 ) can provide high sensitivity but may not contribute to high specificity leading to several false positives as can be seen in Table 4.
The phasor representation of fluorescence lifetime imaging data analysis is a noniterative, model-free, fast approach to visualize the lifetime components (and their distributions) of the fluorescence emission of a sample. [50][51][52][53] In this work, a modified version of this method was applied to the clinical multispectral maFLIM data, and a set of global image features were extracted from the corresponding phasor representation of the two regions present in each lesion maFLIM image (Fig. 4). The processing times taken for computing phasor and biexponential features from one sample lesion using a computer with an i7 Intel core processor and 48 GB RAM were found to be 10.5 and 251.8 s, respectively. Therefore, phasor-based features are simpler and ∼25 times faster than traditional time resolved biexponential features. Unlike biexponential maFLIM features, the phasor features explored cannot be directly interpreted in terms of the skin autofluorescent constituents. Nevertheless, the classification models using phasor features showed superior specificity (∼80%) than those using biexponential features, suggesting that these two different pools of time-resolved maFLIM features might be complementary. Phasor-based features are computed from the concatenated signal and contain information from all three emission channels. It is challenging to develop solid reasoning for the high specificity obtained from these features; however, since every phasor-based feature contains information from all three channels, each channel's contribution is not quantifiable.
Feature selection starting with combinations of feature pools were also explored, although the resulting models showed lower classification performance overall (Table 4). On the other hand, ensemble classifiers based on models using the most frequently selected features of each pool outperformed any other models (Table 5). In particular, the ensemble classifier combining the models based on the most frequently selected biexponential and phasor features resulted in the best performance overall (∼84% sensitivity and ∼90% specificity). Moreover, the optimum weights identified for these ensemble models indicate that both the phasor and biexponential features contribute equally to the weighted probability [ Fig. 8(c)]. These results further indicate that these two maFLIM feature pools might be complementary, as biexponential features seem to contribute to higher sensitivity, while phasor features to higher specificity.
Some of the studies summarized in Table 1 produced superior classification performances compared with that reported in this work. However, those studies employed complex machine learning and deep learning algorithms, such as Adaboost, 10 BoF, 10 and CNN's for classification, 18 and feature extraction. 17 This work uses a simple QDA machine learning model to analyze the performance with different feature pools. It is plausible that more complex machine learning classification models, such as neural networks, might provide superior performance. Given the limited data available for training and validation, however, a simple QDA model was selected, as it still enables nonlinear decision boundaries while reducing the chances of overfitting. In addition, by choosing a simple classification method, it is possible to analyze other aspects of classification model, such as the feature pools, number of features, and ensemble combinations of the feature pools.
Recent advances in artificial intelligence (AI) are allowing the development of CAD systems for discriminating benign from malignant skin lesions based on digital dermoscopy data. Although these CAD systems have not been translated yet to the clinic, preliminary validation studies demonstrate their potential to discriminate typical benign from malignant lesions. Dermoscopy data, however, provide information limited to the visual appearance of the lesions; thus, AI-assisted digital dermoscopy is less suitable for the discrimination of visually similar benign and malignant skin lesions. maFLIM dermoscopy data, on the other hand, can capture autofluorescence-based biochemical and metabolic biomarkers of skin malignant transformation, which is independent of lesion visual appearance. This property is especially important in distinguishing visually similar benign and malignant lesions, thus minimizing the number of false positives and, in turn, reducing the number of unnecessary biopsies. In particular, the lesion types included in this work-pSK and melanoma, are visually similar benign and malignant lesions, which can be easily misdiagnosed during dermoscopic evaluation. Therefore, maFLIM dermoscopy has the potential to complement digital dermoscopy by providing superior performance in the discrimination of visually similar benign and malignant skin lesions.

Study Limitations
Although this preliminary clinical study demonstrates the potential of maFLIM-derived autofluorescence features to discriminate malignant from benign pigmented skin lesions, a number of limitations are recognized. First, the database of maFLIM images is limited in both the type of benign and malignant skin conditions, and the number of samples per condition. A more comprehensive and larger database will be needed to fully develop accurate enough classification methods for skin lesion discrimination and to rigorously quantify their performance in prospective studies. Second, the lack of histopathology-based assessment of the maFLIM imaging data at the pixel-level prevented to specifically quantify the capabilities of maFLIM dermoscopy as a tool for not only detecting malignant skin lesions but also determining their true extension and margins. Third, the current maFLIM dermoscopy system provides nonspecific excitation and spectral detection of skin autofluorescence component emission. Finally, the current implementation of the ML classification models does not allow for real-time processing of maFLIM data. Ongoing research efforts aiming to overcome these limitations include collecting maFLIM dermoscopy images from a plurality of nonpigmented and pigmented skin lesions from patients of various skin tones, performing accurate pixel-level registration between the lesion maFLIM imaging data and histopathology tissue sections, developing improved maFLIM dermoscope systems with multiwavelength excitation and narrow-band emission detection capabilities, and implementing optimized CAD using field programmable gate arrays and graphics processing units technologies for real-time maFLIM data processing, pixel-level classification, and tissue mapping visualization.

Clinical Perspective
The incidence of skin cancer including melanoma continues to increase yet most providers is forced to rely on their own visual recognition skills and experience to identify concerning lesions. In addition, many patients do not have access to a trained dermatologist which can place them in a potentially precarious situation since early detection of skin cancer leads to better survival rates. The importance of early detection cannot be understated, as it not only saves lives but also reduces the invasiveness of the treatment patients undergo and conserves precious medical resources, leading to quality, cost-conscious care. A noninvasive, label-free, fast, accurate, and objective tool capable of discriminating most common malignant from benign skin lesions would improve the clinical management of patients. Currently, there is no objective device providers can use to independently identify cancerous skin lesions. In the hand of primary care physicians, such a tool will enable the early identification of patients in need of referral to a dermatologist. In addition, the dermatologists could use such tool to identify lesions in need of a biopsy, thus reducing the rate of unnecessary biopsies and adverse events such as pain, infection, and scarring. Such a tool would also assist with monitoring cancer recurrence without the need of regular and frequent biopsies. This work demonstrates that maFLIM dermoscopy aided by ML models could potentially have the capabilities of such tool, thus impacting the clinical management of skin cancer patients for the better.

Conclusion
The results of this study demonstrate the capabilities of maFLIM dermoscopy to clinically image a plurality of autofluorescence biomarkers of malignant skin pigmented lesions. Moreover, some of these autofluorescence biomarkers were identified as promising for malignant lesion identification, particularly those quantifying the time-resolved fluorescence characteristics of skin lesions. In addition, these relevant autofluorescence biomarkers were successfully used as features in ML models trained to discriminate malignant from benign pigmented skin lesions with promising accuracy (∼84% sensitivity and ∼90% specificity). Further developments in maFLIM instrumentation and image analysis methods could result in clinical tools for noninvasive, labelfree, accurate, and objective in situ detection of malignant from benign skin lesions, with the potential to impact the clinical management of skin cancer patients.

Disclosures
The authors have no relevant financial interests in the paper and no other potential conflicts of interest to disclose.