Dimensional reduction based on peak fitting of Raman micro spectroscopy data improves detection of prostate cancer in tissue specimens

Abstract. Significance: Prostate cancer is the most common cancer among men. An accurate diagnosis of its severity at detection plays a major role in improving their survival. Recently, machine learning models using biomarkers identified from Raman micro-spectroscopy discriminated intraductal carcinoma of the prostate (IDC-P) from cancer tissue with a ≥85% detection accuracy and differentiated high-grade prostatic intraepithelial neoplasia (HGPIN) from IDC-P with a ≥97.8% accuracy. Aim: To improve the classification performance of machine learning models identifying different types of prostate cancer tissue using a new dimensional reduction technique. Approach: A radial basis function (RBF) kernel support vector machine (SVM) model was trained on Raman spectra of prostate tissue from a 272-patient cohort (Centre hospitalier de l’Université de Montréal, CHUM) and tested on two independent cohorts of 76 patients [University Health Network (UHN)] and 135 patients (Centre hospitalier universitaire de Québec-Université Laval, CHUQc-UL). Two types of engineered features were used. Individual intensity features, i.e., Raman signal intensity measured at particular wavelengths and novel Raman spectra fitted peak features consisting of peak heights and widths. Results: Combining engineered features improved classification performance for the three aforementioned classification tasks. The improvements for IDC-P/cancer classification for the UHN and CHUQc-UL testing sets in accuracy, sensitivity, specificity, and area under the curve (AUC) are (numbers in parenthesis are associated with the CHUQc-UL testing set): +4% (+8%), +7% (+9%), +2% (6%), +9 (+9) with respect to the current best models. Discrimination between HGPIN and IDC-P was also improved in both testing cohorts: +2.2% (+1.7%), +4.5% (+3.6%), +0% (+0%), +2.3 (+0). While no global improvements were obtained for the normal versus cancer classification task [+0% (−2%), +0% (−3%), +2% (−2%), +4 (+3)], the AUC was improved in both testing sets. Conclusions: Combining individual intensity features and novel Raman fitted peak features, improved the classification performance on two independent and multicenter testing sets in comparison to using only individual intensity features.


Introduction
Intraductal carcinoma of the prostate (IDC-P) is an aggressive variant of prostate cancer (PC) recognized as a distinct entity in 2016 by the World Health Organization classification. 1 Current biomarkers used by pathologists for IDC-P identification, phosphatase and tensin homolog loss, and ETS-related gene overexpression, have low sensitivity limiting their use. However, by comparing PC and IDC-P Raman spectra, high sensitivity (≥85%) biomarkers were recently identified in our previous work and led to the first machine learning model for the diagnosis of IDC-P using Raman micro-spectroscopy (RμS). 2 In this past study, Raman spectra from three institutes were collected independently to distinguish various types of prostate cancer on completely independent testing sets. Employing Raman spectroscopy for the identification of various pathologies is currently well-established. 3 One of the main challenges of using machine learning algorithms on Raman spectra is to create, extract, and select features. Most common techniques consist of using individual intensities and various complex feature selection methods, such as recursive feature elimination, 4 ant colony optimization, 5,6 and L 0 -SVM, or adaptive boosting, 7 to rank them and select the most relevant ones. While some of these methods are much more powerful than others when only a few dozens of individual Raman intensities are considered, all methods provide a very similar classification accuracy when the number of retained features is higher than 50. 8 Linear discriminant analysis accompanied with principal component analysis (PCA) is the most common dimensionality reduction technique used in Raman spectroscopy. [9][10][11][12][13][14] However, our group showed that Raman peak fitting features have better predictive performances than PCA for cancer/benign brain tissue classification. 15 In our previous study, features consisted of individual intensities of Raman bands, and the machine learning algorithm was a linear support vector machine (L1-SVM) for feature selection and an SVM with a radial basis function kernel (RBF-SVM) for classification.
In this new analysis, the same dataset, features, and machine learning architecture were kept, but a new set of features was added. Those features were obtained by fitting a Gaussian function on Raman peaks and extracting their heights and widths. Individual intensities of Raman bands can be any band within a Raman peak, whereas fitted heights capture only the maximum of peaks, but they differ since they are much less prone to stochastic noise. The algorithm used to extract these new features is an improved version of the Raman peak fitting algorithm used in Ref. 15. This algorithm was initially designed to extract height and width features of Raman peaks to improve the interpretability of Raman signal in brain tissue by fitting only Raman peaks that are constantly present in the brain Raman literature.
This present study aims to demonstrate that these two sets of features are complementary and thus improve the classifying results of three PC tissue classification tasks: benign versus cancer, cancer versus IDC-P, and high-grade prostatic intraepithelial neoplasia (HGPIN) versus IDC-P, where the cancer dataset does not contain any IDC-P spectra and the benign dataset does not contain any HGPIN spectra. Major modifications to the algorithm consisted of implementing an algorithm that the finds most common peaks without requiring a previously established list of Raman peaks, and improving the algorithm that identifies inflection points of peaks to obtain more accurate width of peaks especially in signal regions with stochastic noise.

Patient Samples and Imaging Method
The dataset consisted of 483 PC patients from three different institutions: Centre hospitalier de l'Université de Montréal (CHUM), University Health Network (UHN), and Centre hospitalier universitaire de Québec-Université Laval (CHUQc-UL). The characteristics of the dataset are shown in Table 1. Tumor stages defines where cancer is present in prostate tissue; there are three tumors stages: pT2 (organ confined), pT3a (extraprostatic extension or microscopic invasion of bladder neck), and pT3b (seminal vesicle muscle wall invasion). The Gleason score details the arrangement and pattern of cancer cells. There are five patterns, and the Gleason score is equal to the sum of the most and second most common pattern present prostate tissues. Confocal RμS measurements on formalin-fixed paraffin embedded tissue microarrays (TMAs) of PC and IDC-P tissue were acquired as described in our previous work. 2 Briefly, sections from the TMAs were transferred onto aluminum slides with low Raman activity (Miro5011, Anomet, Brampton, Ontario) and dewaxed according to the CHUM standard clinical dewaxing protocol. Raman acquisitions were made using a confocal Raman microscope (Renishaw, Gloucestershire) equipped with a 785-nm line focus laser and a grating of 1200 lines/mm allowing acquisition of Raman shifts between 602 and 1726 cm −1 . For each TMA, five accumulations of 10 s each at a 150-mW laser output power were acquired using a 50× objective with a 0.75 numerical aperture. There were always four spectra taken per TMA core at four different locations, whereas the number of TMA cores per patient varied between institutions. The diameter of TMA cores ranged between 0.6 to 1.2 mm and the probed region corresponded to a rectangular area of 24 μm 22 (8 μm × 3 μm, approximately corresponding to single-cell analysis). IDC-P identification was done by two independent pathologists. All Raman spectra files are available in the Dryad Digital Repository database. 16

Data Preprocessing
The following data preprocessing procedures are applied to obtain each spectrum: (1) summed the five accumulations of 10 s each, (2) remove cosmic rays, (3) remove background signal produced by the aluminum slides and tissue fluorescence using the rolling ball algorithm, 17 (4) apply standard normal variate (SNV) normalization, (5) average the four spectra of each TMA core resulting in between 1 and 3 spectra per patient. During background removal, the rolling ball algorithm distorts, due to border effects, Raman signals by creating peaks that are not present in the raw spectra. To avoid these artifacts, regions between 602 and 669 cm −1 , as well as between 1707 and 1726 cm −1 were removed.
A limitation of the rolling ball algorithm comes from the presence of a finite size structuring element within the algorithm. At the beginning and end of a spectrum, half of this structuring element extends past the range of spectra and causes distortions in the extracted Raman signals due to this border effect. The magnitude of this effect depends on the smoothness of the signal baseline. In this particular dataset, the baseline shows a step non-smooth gradient at the beginning of the signal, which increases the distortion, and a smooth baseline at higher wavelengths. It was determined, empirically, that to remove any residue due to this effect, signal regions between 602 and 669 cm −1 as well as between 1707 and 1726 cm −1 had to be removed, i.e., a portion of the beginning and end of each spectrum.
The next preprocessing step is to fit Gaussian functions over the most common Raman peaks present in the dataset. Since the Gaussian function is strictly positive, the minimum value of SNV normalized spectra was subtracted from the entire dataset to obtain only positive values. The peak fitting algorithm, which is an improvement of the algorithm first used in Ref. 15, considers a peak only when it appears in at least 50% of spectra at the same location or within AE2 cm −1 and if its height is above a threshold height previously determined empirically. For every peak, both inflection points are determined as well as the height of the maximum and its position. The distance between inflection points is used to compute an approximate standard deviation (σ) and to determine the fitting domain for a nonlinear least-squares regression algorithm. 18 This standard deviation, and the maximum and position of the peak determine the three starting points given to the regression algorithm to fit a Gaussian over the wavelength domain of a peak to obtain fitted heights and widths (standard deviation, σ). These values are then SNV normalized and used as features.

Statistical Analysis
A machine learning model with several hyperparameters was trained on the CHUM dataset and tested on two independent hospital cohorts: CHUQc-UL and UHN. Patient clinicopathological characteristics and number of spectra of each classification task are shown in Table 1. The model uses two types of features: individual features; intensity set, and peak features; peak set. To determine the features in each set, two separate feature selection algorithms are used. The selected features are then combined in a single feature vector and passed to the RBF kernel SVM classifier. Before training a model on the CHUM dataset, cross validation (CV) is required to determine the optimal value of the hyperparameters. For each classification task, the CV identifies the set of hyperparameters, through a grid search, that maximizes the area under the curve (AUC) of the receiver-operator-characteristic (ROC) curve of the CHUM dataset for three independent models: one model using only the intensity features, one using only the peak features, and a combined model that used both feature sets; intensity þ peak. Figure 1 shows a schematic workflow of the statistical analysis.

Feature Engineering and Feature Selection
Two different sets of SVN normalized features were used for classification. The first set, intensity, consists of individual intensity values of single Raman bands. To extract the most relevant intensities, an L1 regularization linear SVM with one hyperparameter (regularization parameter C intensity ) is utilized for feature ranking. The linear SVM assigns coefficients to individual intensity features, which are then used to rank them by importance. Features that are close to one another, which are most likely correlated and contain the same information, are removed using the hyperparameter n neighbor . Every feature that is AEn neighbor cm −1 from the feature with the highest coefficient is removed. The feature set is then updated, and the same procedure is applied for the feature with the second-highest coefficient and so forth. For example, if the intensity features with the highest coefficient has a Raman shift of 780 cm −1 and n neighbor ¼ 5, intensity features with a Raman shift between 775 and 785 cm −1 will be removed. This procedure also reduces the dimensionality of the feature set. The second set, peak, consists of Raman peaks fitted with a Gaussian distribution. The height and the width of these Gaussian fitted peaks are used as features. A linear SVM with L1 regularization (regularization parameter C peak ) is used to rank these engineered features by importance.
To further reduce the number of features within each feature set, the k intensity and k peak features with the highest ranked coefficient are used for modeling and the others are discarded. Both k intensity and k peak are hyperparameters determined during CV.

Modeling and Cross Validation
An RBF kernel SVM was used as a classifier. This algorithm has two hyperparameters: C RBF and γ. A grid search was performed using CV to optimize the following hyperparameters: C RBF , C intensity , C peak , γ, k intensity , k peak , and n neighbor . Both k intensity and k peak ranged from 5 to 100, C RBF ranged from 0.001 to 1, γ ranged from 0.001 to 0.01, n neighbor from 0 to 10, and C intensity and C peak from 1 × 10 −6 to 1 × 10 −4 . The CV scheme performed was a five repeat fivefold CV. The combination of hyperparameters generating the model with the highest AUC value was chosen to train the model using the complete training set. The selected model was then applied to the two independent testing sets.

Results
Raman spectra of the CHUM cohort were designated as the training set, due to its larger number of patients compared to the other two cohorts and to obtain a sufficiently large dataset for training purposes, and the CHUQc-UL and UHN cohorts as testing sets. The five highest ranked features of both feature sets along with their associated main vibrational mode and molecule are discussed in the following subsections for each classification task. The Raman shift of each feature and of its associated Raman peak, found in the literature, are both reported. Table 2 presents the classification performances for all three classification tasks for both testing sets. The threshold A double nested five-fold CV is performed to determine the hyperparameters through a randomized grid search. Feature selections on peak and intensity features are done separately using linear kernel L1 SVM. Classification models are RBF kernel SVMs that use selected features. The hyperparameter set that yields the highest AUC on training data is used to build the final model (green), which is evaluated on predicting outcome on the testing set (orange).

Table 2
Accuracy, sensitivity, specificity, and AUC for the three classification tasks: benign versus cancer, cancer versus IDC-P, and HGPIN versus IDC-P. Classification performances are given for four different models: Results previously published 2 (previous), intensity (using only individual intensity features, similar as previously published 2 ), peak (using only Raman Gaussian fitted peak metrics, i.e., heights and widths), combined (using both intensity and peak features). The combined models always have an equal or better classification performance than the intensity and peak models. The relative change in AUC with respect to our previous study is included in parenthesis in the combined columns. corresponding to the point closest-to-(0,1) corner was selected for each ROC to obtain accuracy, sensitivity, and specificity values reported in Table 2. In all cases, classification performances of the combined model were always equal or higher than results obtained using only one feature type. Furthermore, using the intensity model yielded higher classification performances than using the peak model, except for the classification of cancer/IDC-P tissues for the CHUQc-UL testing set. The intensity model results are superior compared to our previous study ('previous' column) for the classification of cancer/IDC-P and HGPIN/IDC-P, whereas the opposite was observed for the benign/cancer classification task. The method developed in Ref. 19 for the comparison of ROCs was used to test if improvements were statistically significant. A comparison between previous, intensity, and peak model versus the combined model was performed for both testing sets. For each comparison, a p-value was obtained. All model comparisons were statistically significant for the benign/cancer classification (p-value ≤0.05) except for the comparison of intensity and combined models for both testing sets. The p-values were 0.3414 and 0.2696 for the UHN and CHUQc-UL testing set, respectively. While they are not significant, if a 0.05 threshold is adopted, the combination of intensity and peak features is still a non-negligible improvement. All model comparisons were statistically significant for the cancer/IDC-P classification task except for the peak and combined model comparison for the CHUQc-UL testing. The previous and combined model comparison for the UNH testing was the only statistically significant improvement for the HGPIN/IDC-P classification.
The average Raman spectra for both benign and cancer prostate tissue and the five highest ranked features of each feature set are shown in Fig. 2. The description of the five best features for each the peak and the intensity models are presented in Table 3 along with their respective molecular assignments. 23 For the peak height feature set, four Raman peaks ranging from 719 to 1032 cm −1 were increased in the average spectrum of benign tissue. DNA, RNA, proteins, and amino acids (i.e., tryptophan, proline, valine, and phenylalanine) were the main biochemical components associated with these peaks. The only feature associated with an increase in cancer tissue was the peak at 1294 cm −1 , which was assigned to lipids. For the intensity feature set, a unique peak (1250 cm −1 ) associated with the β-sheet secondary structure of proteins was more intense in the Raman spectrum of cancer tissue. There were also 4 Raman peaks increased in benign tissue, from 719 to 1003 cm −1 and their molecular assignments are mostly DNA, RNA, proteins, and amino acids (i.e., tyrosine, proline, and phenylalanine).
The description of the five highest ranked features for each the peak and the intensity models are shown in Fig. 3 and Table 4 along with their respective molecular assignments. Within the five highest ranked peak features, three Raman peaks were increased in the average spectrum of IDC-P compared to cancer tissue. These peaks were mostly associated with amino acids: 825 cm −1 (tyrosine), 1170 cm −1 (tyrosine), and 932 cm −1 (proline and valine). Raman peaks  Table 4 Highest ranked features used to classify intraductal carcinoma of the prostate and invasive prostate cancer tissue found using a linear SVM with L1 regularization, and their associated Raman peaks. Tentative molecular assignment of prostate Raman peaks based on the literature. The Raman shift of the features (feature column) and of its associated Raman peak (peak, center column) are reported. 2,[9][10][11][12][20][21][22] Feature set assigned to phenylalanine (1001 cm −1 ), protein, lipids, DNA, and RNA (1666 cm −1 ) were reduced in IDC-P. For the intensity feature set, the 1484 cm −1 Raman peak, assigned to the nucleobases adenine and guanine, was the single decreased peak in the average spectrum of IDC-P. The four other Raman peaks, ranging from 944 to 1266 cm −1 , were identified mostly in IDC-P. The biochemical constituents associated with these peaks were predominately linked to proteins: α-helix and β-sheet secondary structures, proline, valine, and phenylalanine.
Following the classification of IDC-P and HGPIN, the five highest ranked features were assigned to specific molecules ( Fig. 4 and Table 5). Only one feature was found to be a dominant contributor for identifying HGPIN using the peak height feature set. More specifically, the 937 cm −1 Raman peak was identified as α-helix secondary structure of proteins and two amino acids, proline and valine. The other four features, from 1001 to 1238 cm −1 , were all increased in the average spectrum of IDC-P tissue and were assigned to the β-sheet secondary structure of proteins, phenylalanine, tryptophan, and tyrosine. For the classification using the intensity feature set, two Raman peaks were increased in HGPIN tissue: phenylalanine (1032 cm −1 ); α-helix secondary structure of proteins, fatty acid, and the nucleobase thymine (1667 cm −1 ). Raman peaks at 1003 cm −1 (phenylalanine) and at 1250 cm −1 (β-sheet) were mostly found in IDC-P tissue. The carotenoid biochemical component (1152 cm −1 ) was a main contributor to the IDC-P classification in both feature sets.

Discussion
The key difference in this analysis was the combination of the intensity and peak features, which improves the AUC of all three classification tasks. Combining both types of features had more impact on the classification of IDC-P versus cancer than on benign versus cancer but could not be appropriately quantified for HGPIN versus IDC-P since using only one type of feature was enough to obtain a perfect classification score. While the accuracy, sensitivity, and specificity of these results are lower than those of our previous study for the classification of benign and malignant prostate tissue, the AUC increased by þ4% for the UHN testing set, and þ6% for the CHUQc-UL testing set. In comparison, for both testing sets, the AUC increased by 9% for the classification of IDC-P versus cancer. One hypothesis that could explain the higher AUC improvement for the classification of IDC-P versus cancer is the presence of four width features within the five highest ranked features in the IDC-P/cancer classification task and their absence in benign/cancer task (see Tables 3 and 4, respectively). In the results section (Sec. 3) of this analysis, peak centers of individual intensity of Raman bands and heights were reported but also compared to the most important features of our previous study. In this particular case, it is the first time the widths of Raman peaks are identified as being useful for classification purposes in a machine learning setting. Therefore, for now, this observation is only factual and there is no interpretation. Raman spectra datasets on various tissues for various classification tasks are required to learn if the width of peaks are also relevant features for other applications.
Comparing the highest ranked individual intensity features of each classification task with the highest ranked features found in our previous study 2 reveals that both analyses identified similar features. The intensity features of the cancer/benign and IDC-P/cancer tasks are present within the 10 highest ranked features of our previous analysis and they are also assigned to the type of tissue. As for the HGPIN/IDC-P task, there are only two similar features. However, since in both cases the classification performance is near perfect, the highest ranked features are likely interchangeable due to the large spectral difference between the two tissue types. Table 5 Highest ranked features used to classify high-grade prostatic intraepithelial neoplasia and IDC-P found using a linear SVM with L1 regularization, and their associated Raman peaks. Tentative molecular assignment of prostate Raman peaks based on the literature. The Raman shift of the features (feature column) and of its associated Raman peak (peak, center column) are reported. 2,[9][10][11][12][20][21][22] The results obtained in this analysis using only the individual intensity features also yielded improved results compared to those obtained in our previous analysis for two classification tasks: cancer/IDP-C and HGPIN/IDC-P. This is likely the result of a combination of several differences such as using the SVN normalized individual intensities as input for the SVM L1 feature selection algorithm instead of using SNV normalized spectra as in the PLUS study. The addition of the n neighbor hyperparameter might also have contributed to improving the classification results.
A probable limiting factor for the classification performance of benign/cancer could be the presence of different tumor stages and Gleason scores distributions in the training and testing dataset (see Table 1). For example, the training and testing sets contain 139 and 28 spectra with a Gleason score of ≤3 þ 3, respectively, and contain 22 and 56 spectra with a Gleason score of 4 þ 3 for the training and testing set, respectively. These distribution imbalances, both Gleason score wise and tumor stage wise could explain the lower AUC score for the classification of benign versus cancer tissue in comparison to IDC-P versus cancer. A follow-up study is planned to investigate this.
The most important difference with respect to our previous result is the identification of Raman peak widths as important biomarkers for the classification of IDC-P and invasive prostate cancer tissue. These results further reinforce the usefulness of the clinical implementation of Raman microscopy. By exploring an exhaustive list of feature selection algorithms, models more readily transferable to the clinical workplace with a total number of features of ten or less will be studied.

Disclosures
No conflicts of interest, financial, or otherwise, are declared by the authors.