Characterization of Mueller matrix elements for classifying human skin cancer utilizing random forest algorithm

Abstract. Significance: The Mueller matrix decomposition method is widely used for the analysis of biological samples. However, its presumed sequential appearance of the basic optical effects (e.g., dichroism, retardance, and depolarization) limits its accuracy and application. Aim: An approach is proposed for detecting and classifying human melanoma and non-melanoma skin cancer lesions based on the characteristics of the Mueller matrix elements and a random forest (RF) algorithm. Approach: In the proposed technique, 669 data points corresponding to the 16 elements of the Mueller matrices obtained from 32 tissue samples with squamous cell carcinoma (SCC), basal cell carcinoma (BCC), melanoma, and normal features are input into an RF classifier as predictors. Results: The results show that the proposed model yields an average precision of 93%. Furthermore, the classification results show that for biological tissues, the circular polarization properties (i.e., elements m44, m34, m24, and m14 of the Mueller matrix) dominate the linear polarization properties (i.e., elements m13, m31, m22, and m41 of the Mueller matrix) in determining the classification outcome of the trained classifier. Conclusions: Overall, our study provides a simple, accurate, and cost-effective solution for developing a technique for classification and diagnosis of human skin cancer.

No reliable biomarkers exist for melanoma diagnosis. Consequently, current diagnostic methods for skin lesions are subjective and imprecise. Typically, a patient must undergo around 36 biopsies to confirm (or discount) melanoma. However, despite this large number of biopsies, false negative predictions cannot be entirely ruled out. 4 Thus, new skin cancer detection methods with greater accuracy and less invasiveness are urgently required. Among the various optical imaging technologies now available, optical coherence tomography (OCT) 5,6 and polarization-sensitive OCT 7 make possible the real-time comprehensive morphological mapping of skin tissue samples with micrometer resolution by measuring the inherent properties of light (e.g., the scattering, birefringence, and refractive index properties) as it propagates through the sample. 8 However, while OCT has a greater sensitivity for detecting melanoma than other techniques, such as reflectance confocal microscopy, 9 high-frequency ultrasonography, 10 and multispectral imaging, 11 detecting early-stage melanoma using OCT still poses a significant challenge 5 due to the large number of different types of non-melanoma skin cancer. 12
Many studies have shown that the Stokes-Mueller method, based on polarized light, has significant potential for replacing current clinical standards for skin cancer detection. Lu and Chipman 13 proposed a Mueller matrix decomposition method for determining the diattenuation, retardance, and depolarization properties of a sample. Ghosh et al. 14 investigated the efficacy of the Mueller matrix decomposition method in extracting the individual intrinsic polarimetry characteristics of a scattering medium with both linear birefringence (LB) and optical activity. Du et al. 15 used a Mueller matrix imaging technique to construct two-dimensional images of the polarization parameters (i.e., attenuation, depolarization power, and linear retardance) of human skin basal cell carcinoma (BCC) and human papillary thyroid carcinoma tissues. Martin et al. 16 used the Mueller matrix decomposition techniques proposed by Lu and Chipman 13 and Ossikovski 17 to differentiate between healthy and irradiated pig skin samples based on their measured retardance, diattenuation, and depolarization properties. Pham et al. [18][19][20][21][22] employed a Stokes-Mueller method to examine the polarization properties of skin cancer, liver cancer tissues, neuroblastoma, collagen-rich tendons, and cartilage. It was shown that the proposed method yielded nine effective parameters for distinguishing between normal skin tissue and various skin cancer tissues, including BCC, squamous cell carcinoma (SCC), and malignant melanoma.
Machine learning provides a powerful tool for performing the objective and precise diagnosis of cancer through its use of statistics, probabilistic algorithms, and massive computational power. According to recent studies, machine learning techniques can improve the accuracy of cancer detection by 15% to 20%. 23 For example, Codella et al. 24 used a convolutional neural network (CNN) combined with image segmentation algorithms to recognize melanoma in a dataset consisting of 900 training dermoscopic images and 379 test images. The classification accuracy was found to be 76%. By contrast, the average diagnosis accuracy of eight expert dermatologists was just 70.5%. Esteva et al. 25 used a GoogleNet Inception v3 CNN architecture and a transfer learning technique to perform the first-level classification of three class disease partitions (benign, malignant, and non-neoplastic) with an accuracy of 72.1% and the second-level classification of the same partitions with an accuracy of 55.4%. Baldwin et al. 26 proposed an automated Mueller matrix polarization imaging system and a classification and regression tree (CART) statistical analysis approach for classifying three classes of Sinclair swine tissue (normal, benign, and cancerous) and showed that the sensitivity was as high as 90%. Sigurdsson et al. 27 detected five skin tumor lesion types using Raman spectra and a nonlinear neural network. The experimental results showed that the proposed system achieved a classification rate of 80.5% for malignant melanomas and 95.8% for BCC. Legesse et al. 28 used a perceptron algorithm to discriminate healthy and tumorous regions in coherent anti-Stokes Raman scattering (CARS) images of BCC based on an analysis of the texture features. It was shown that the classifier achieved a sensitivity of 88% and a specificity of 91%. Murugan et al. 29 used random forest (RF) and support vector machine (SVM) classifier techniques for skin cancer detection. 
The experimental results showed that the proposed system achieved a classification rate of 72.2% using RF techniques and 87.81% using SVM+RF. Singh et al. 30 detected breast cancer using an RF classifier technique. It was shown that the classifier achieved a sensitivity of 90.56% and a specificity of 86.40%. Motivated by the success of the Mueller matrix methods in Refs. 13-22 and of the machine learning techniques for skin cancer detection in Refs. 24-30, the present study adopts the RF classifier because it mitigates overfitting and is well suited to classifying previously unseen data. 31 Notably, the RF reduces the influence of noisy trees on the final prediction. 32 Moreover, the RF makes it possible to investigate feature importance, 33 which is useful for analyzing the impact of the optical properties of tissue on different types of skin pathology. Accordingly, the present study explores the feasibility of using a machine learning technique to discriminate between normal skin tissue and three classes of skin cancer based on the 16 elements of the Mueller matrix of a biomedical sample, in which all of the optical effects may appear simultaneously.
Skin Cancer Classification Model

Decision Tree Algorithm
Decision tree algorithms implement classification by splitting the dataset using binary questions based on the feature vectors. 34 In particular, the feature vectors (denoted as X) are taken as tree nodes in the classification architecture, while the class labels are denoted as Y. A decision rule, $d(t)$, is then used to map each X to $d(X)$, where $d(X)$ represents the class label of the feature vectors. 35 Depending on whether or not the input features (i.e., attributes) satisfy the binary question, they are divided into two groups (known as branches) of nodes. Thus, by applying multiple questions to the flow, the decision tree classifies the input dataset into multiple different class labels.
One of the best-known decision tree algorithms is the CART algorithm proposed by Breiman et al., 36 which constructs decision trees by choosing, for each feature, the threshold that yields the best value of either the Gini index or the information gain, depending on the tuned parameters. 37 Notably, the algorithm not only accommodates both numerical and categorical variables but also handles outliers in the dataset in a robust manner. 38 As such, it is ideally suited to the classification problem considered in the present study, in which the instances in the dataset [i.e., the Mueller matrix elements describing the optical (depolarization, LB, CB, LD, and CD) properties of human tissue samples] are numerical and have no missing values, but may contain outliers.
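The threshold search described above can be illustrated with a minimal sketch. It assumes a single numerical feature, scores each candidate split by the size-weighted Gini impurity of the two branches, and uses made-up values for one normalized Mueller matrix element; the function names `gini` and `best_threshold` are hypothetical and are not taken from the study.

```python
from collections import Counter


def gini(labels):
    """Gini impurity, 1 - sum(p_i^2), of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'.

    If no split improves on the parent impurity, the threshold is None.
    """
    n = len(values)
    best = (None, gini(labels))  # baseline: impurity of the unsplit parent
    # Splitting at the largest value would leave the right branch empty.
    for thr in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= thr]
        right = [y for x, y in zip(values, labels) if x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (thr, score)
    return best


# Hypothetical (illustrative only) values of one normalized matrix element.
m44 = [0.10, 0.15, 0.20, 0.60, 0.65, 0.70]
tissue = ["normal", "normal", "normal", "BCC", "BCC", "BCC"]

threshold, impurity = best_threshold(m44, tissue)
print(threshold, impurity)
```

In this toy example the two classes are perfectly separable, so the chosen split leaves both branches pure. A full CART implementation would repeat the search over every feature and recurse on each branch.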

Random Forest Algorithm
The RF classification algorithm 31 builds multiple individual sub-decision trees, $\{T_1(X), T_2(X), T_3(X), \ldots, T_n(X)\}$, as building blocks for categorization tasks. 35 Each individual sub-decision tree utilizes a different method to generate the binary questions used for classification purposes, and hence the resulting tree structure and organization are unique. Since each sub-decision tree in the RF architecture performs its own classification procedure, each tree can be regarded as an individual predictor that votes for the prediction of the input data, and the final classification outcome can then be determined via a polling process. Compared to the traditional decision tree classification algorithm described above, the RF classifier provides a more effective bias-variance trade-off by combining small decision trees with random feature subsets, thereby preventing overfitting during the training process. 39
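The polling step can be sketched as a simple majority vote. In the sketch below, the three "trees" are hypothetical stand-ins (single-threshold rules on made-up feature values), not trained sub-decision trees from the study, and `forest_predict` is an illustrative name.

```python
from collections import Counter


def forest_predict(trees, x):
    """Return the majority-vote label and the per-class vote counts for x."""
    votes = Counter(tree(x) for tree in trees)
    label, _ = votes.most_common(1)[0]
    return label, dict(votes)


# Hypothetical stand-ins for trained sub-decision trees T_1(X), T_2(X), T_3(X).
# Each one thresholds a single made-up feature of the input vector.
trees = [
    lambda x: "melanoma" if x["m44"] < 0.3 else "normal",
    lambda x: "melanoma" if x["m34"] > 0.1 else "normal",
    lambda x: "melanoma" if x["m24"] > 0.5 else "normal",
]

sample = {"m44": 0.2, "m34": 0.15, "m24": 0.05}
label, votes = forest_predict(trees, sample)
print(label, votes)
```

Two of the three stand-in trees vote for the same class, which therefore wins the poll. With an even number of trees a tie is possible; this sketch resolves it in favor of the label that was encountered first, which is an arbitrary choice.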

Gini Impurity
The Gini index is a statistical measure for quantifying the heterogeneity of a dataset. 40 As described above, in binary decision trees, decision rules, $d(t)$, are used to split the learning set L, which contains a certain number of feature vectors X. The set L is split into two subsets, namely $L_1$ and $L_2$, such that the data points of each subset conform to a specific rule, i.e., $d(t)$. Consequently, the impurities of $L_1$ and $L_2$ are each less than that of their parent, L. 41 The impurity is measured by the Gini index, which has the following form:
$$G(t) = 1 - \sum_{i} p_i(t)^2, \qquad (3)$$
where G is the Gini index at node t and $p_i(t)$ is the probability of a given dataset L being assigned to class $c_i$. The Gini index varies from 0 to 1, where $G = 0$ represents a complete equality of the data (i.e., all the data in the subset after splitting belong to a specific class label), whereas $G = 1$ indicates a complete inequality of the data (i.e., none of the data in the subset after splitting belong to the same class label).
Figure 1 presents a schematic illustration of the experimental setup used in this study. The illumination light was produced by a frequency-stable He-Ne laser (HNLS008R, SIOS Co.) with a center wavelength of 632.8 nm. The light emitted by the laser passed through a quarter-wave plate (QWP0-63304-4-R10, CVI Co.) and polarizer (GTH5M, Thorlabs Co.) and was then incident on the sample (i.e., biological tissue mounted on a quartz slice). It is noted that quartz slices were used to minimize the depolarization effect when light passed through the sample. Furthermore, blank quartz slices were measured before performing the experiments for calibration purposes. 
The quarter-wave plate was used to produce two circular polarization input states.
It is noted that, because the samples differed in shape, each sample was sliced a different number of times and measured at different positions of interest. Hence, the number of feature vectors belonging to each sample varies. This leads to a difference in the ratio of training to testing feature vectors for each type of skin tissue.

Data preprocessing
One of the most common problems facing machine learning classifiers is that of imbalanced datasets, where the data records of the majority class overwhelm those of the other classes. In such a situation, the training process is unable to learn proper classification rules for the minority classes, and hence the classification accuracy for these classes is severely impaired. 42 As shown in Table 1, the dataset employed in this study suffered from this imbalance problem since the BCC class contained 282 feature vectors, whereas the normal class contained only 42 feature vectors. Accordingly, an oversampling technique 43 was applied to randomly duplicate instances of the minority classes (SCC, melanoma, and normal skin) until each matched the original number of feature vectors belonging to the BCC majority class (see Table 2). (Note that one of the features, Mueller matrix element m11, was used for normalization purposes, and hence was not used as a predictor.) Notably, there were no instances of missing data, and thus handling schemes for missing data were not required.
Cross-validation is usually performed using $k = 10$ folds 44 since a larger value of k reduces the size of each fold and thus reduces the difference in size between the training set and the resampling subset. As a result, the bias, i.e., the difference between the true value and the expected value of the estimator, is decreased. For the present training process, 10-fold cross-validation was implemented with three repetitions. According to Molinaro et al. 45 and Kim, 46 repeating k-fold cross-validation is beneficial in improving the precision score of classification models while maintaining a small bias. Figure 3 shows the training and validation accuracy results for the 10 folds of the dataset. For the training set, the classification accuracy is equal to 100% in virtually every fold. By contrast, for the validation set, the classification accuracy reduces to around 91%. 
This tendency is reasonable since the oversampling process increases the number of duplicate feature vectors. The trained model therefore produces multiple rules for one instance, and the rules become specific to a portion of the training data. This increases the training accuracy, but decreases the validation accuracy.
The performance of the trained RF classifier was evaluated on the test dataset, which comprised 30 BCC, 23 SCC, 3 melanoma, and 6 normal feature vectors and was oversampled to 30 feature vectors per class. The resulting confusion matrix is shown in Table 3. As shown, the optimal classification performance was obtained for the melanoma class, with 30 true positive cases and no false positive or false negative cases. A good classification performance was also obtained for the normal skin tissue, i.e., 30 true positive cases, no false negative cases, and just 2 false positive cases. However, for the BCC and SCC classes, the classification performance was degraded, with 7 false positive outcomes for the BCC class and 9 false negative outcomes for the SCC class. Interestingly, almost all of the misclassified SCC instances (7 of 9) were labeled as BCC. It is noted that when cancerous tumors develop in the tissue, numerous changes in the collagen components occur, including the deposition of collagen fibrils resulting from an increased number of fibroblasts, the production of proteolytic enzymes for cancer invasion, etc. 47,48 These changes in the biological structure allow the classification model to distinguish clearly between normal tissue and cancerous tissue. By contrast, some cases of BCC and SCC share the same clinical features, such as an ulcer with a rolled border, which may confuse the estimator. 49
The receiver operating characteristic (ROC) curve is an evaluation metric for binary classification. 50 It represents the true positive rate (TPR) and false positive rate (FPR) at different thresholds. 
Thus, the area under the curve (AUC) can be used to evaluate the model with unbiased estimation. The closer the AUC score is to 1, the better the model. As shown in Table 4, the AUC score for both melanoma and normal skin tissue was 1. The performance of the RF model in predicting BCC and SCC is lower; however, the scores are still good, at 0.999 and 0.996, respectively. Overall, the mean AUC for all types of skin tissue is 0.999. Table 5 analyzes the performance of the trained classifier for the four different class labels. The performance metrics, i.e., the precision, recall, and F1 score, are defined as follows:

Results and Discussion
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad (4)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad (5)$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad (6)$$
where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. The precision metric evaluates the prediction performance of the model, with a value closer to 1 indicating a better closeness of the predicted outcomes to the true outcomes. As shown, the classifier attains a precision of 1 for the SCC class. In other words, every feature vector that the classifier labels as SCC does in fact belong to the SCC class. Meanwhile, the recall metric evaluates how completely the trained model retrieves the members of each class. In other words, the recall value of 1 for the BCC class indicates that every BCC feature vector is labeled as BCC; equivalently, if the trained model predicts that the current feature vector does not belong to the BCC class, then the input belongs to one of the three other classes (SCC, normal, or melanoma) with a probability of 100%. Finally, the F1 score combines the precision and recall scores as their harmonic mean. Because the F1 score takes both precision and recall into account, it is a more general metric for evaluating models than either of the two alone. Also shown in Table 4, the trained model achieves a good classification performance for the melanoma class (precision = 1; recall = 1). Moreover, the classifier also achieves a good performance for the normal class (precision = 0.94; recall = 1). However, as implied in the confusion matrix in Table 3, the classifier has a poorer performance for the BCC and SCC classes. Overall, the trained classifier successfully discriminates the four classes of skin tissue with a mean accuracy of 0.93.
Figure 4 presents the distributions and magnitudes of the 15 Mueller matrix elements of the SCC, BCC, melanoma, and normal skin tissue samples. 
The vertical and horizontal axes show the magnitude and distribution of the corresponding Mueller matrix elements, respectively. Note that, as described earlier, one of the matrix elements (m11) was used for normalization purposes, and is hence omitted here. It is seen that for each Mueller matrix element, the magnitude is approximately equal for all four types of tissue sample. However, the distribution varies from one sample type to another. For example, for element m22, the distributions of the different sample types are affected by outliers, which result in a significant skew of the distribution. Thus, element m22 makes only a low contribution to the outcome of the classification model, as shown in Fig. 5.
Figure 5 shows the relative importance of the 15 different elements of the Mueller matrix within the classification model. Note that the feature importance represents the reduction in the node Gini impurity weighted by the node probability. 37 For each decision tree, the importance score of feature i on node j, $ni_j$, is calculated as
$$ni_j = w_j G_j - w_{j(L)} G_{j(L)} - w_{j(R)} G_{j(R)}, \qquad (7)$$
where $w_j$ is the proportion of the number of samples reaching node j; $w_{j(L)}$ and $w_{j(R)}$ are the corresponding proportions for the child nodes of the left and right splits of node j, respectively; and $G_j$, $G_{j(L)}$, and $G_{j(R)}$ are the Gini impurities of node j and its left and right child nodes, respectively. Thus, the importance of feature i in a specific tree, $fi$, can be calculated as
$$fi = \frac{\sum_{j=1}^{s} ni_j}{\sum_{k=1}^{N} ni_k}, \qquad (8)$$
where s is the number of nodes j that split on feature i; N is the number of nodes; and $fi$ is the importance of feature i and is normalized to a value between 0 and 1. Finally, the importance score of feature i in a forest of T estimators, $Fi$, is given by the average importance score of feature i over the individual trees, i.e.,
$$Fi = \frac{\sum_{T} fi}{T}. \qquad (9)$$
As shown in Fig. 5, the elements relating to circular polarization (m44, m34, m24, and m14) have higher importance scores than those relating to linear polarization (m13, m31, m22, and m41). In other words, the circular polarization elements in the Mueller matrix exert a greater effect on the classification outcome than the linear polarization elements. This finding is reasonable since skin tissue samples have a high natural scattering effect, which causes a helicity flip of the circularly polarized light while passing through the sample. 51 In general, the results presented above indicate that the proposed technique, based on Stokes-Mueller matrix polarimetry and an RF classification algorithm, provides a simple and accurate tool for skin cancer classification and diagnosis applications.
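The feature-importance calculation described above can be sketched in a few lines. The sketch assumes the usual Gini-based formulation: node importance is the weighted impurity decrease, it is normalized over all split nodes of a tree, and then it is averaged over the trees. All node weights, impurities, and feature assignments below are made-up illustrative numbers, and the function names are hypothetical.

```python
def node_importance(w, g, w_left, g_left, w_right, g_right):
    """Weighted Gini decrease at one node: w*G - w_L*G_L - w_R*G_R."""
    return w * g - w_left * g_left - w_right * g_right


def tree_feature_importance(nodes):
    """Normalize node importances per feature within one tree.

    nodes: list of (feature_name, node_importance) pairs, one per split node.
    """
    total = sum(ni for _, ni in nodes)
    scores = {}
    for feature, ni in nodes:
        scores[feature] = scores.get(feature, 0.0) + ni / total
    return scores


def forest_feature_importance(per_tree):
    """Average the per-tree importances over all trees in the forest."""
    features = set()
    for scores in per_tree:
        features.update(scores)
    n_trees = len(per_tree)
    return {f: sum(s.get(f, 0.0) for s in per_tree) / n_trees for f in features}


# Tree 1 (illustrative): the root splits on m44, one child splits on m22.
root = node_importance(1.0, 0.5, 0.5, 0.0, 0.5, 0.4)     # 0.5 - 0.0 - 0.2
child = node_importance(0.5, 0.4, 0.25, 0.0, 0.25, 0.0)  # 0.2 - 0.0 - 0.0
tree1 = tree_feature_importance([("m44", root), ("m22", child)])

# Tree 2 (illustrative): a single split on m44.
tree2 = tree_feature_importance(
    [("m44", node_importance(1.0, 0.5, 0.5, 0.0, 0.5, 0.0))]
)

forest = forest_feature_importance([tree1, tree2])
print(forest)
```

A feature that never appears in a tree contributes zero for that tree, so features used in many trees and at nodes that receive many samples end up with the largest forest-level scores.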

Conclusion
This study has proposed a Stokes-Mueller polarimetry method based on an RF classifier consisting of 220 sub-decision binary trees for discriminating between four different types of skin tissues, namely BCC, melanoma cancer, SCC, and normal, based on the measured values of the 16 elements in the output Mueller matrix. Based on the experimental results obtained for 32 skin tissue samples, it has been shown that the proposed model achieves an average classification accuracy of 93% for the four skin tissue types. It has additionally been shown that among all of the elements in the Mueller matrix, elements m44, m34, m24, and m14, relating to the left- and right-handed circular polarization states, respectively, have a stronger discriminatory power than those relating to the linear polarization states. Overall, the results show that the proposed framework has promising potential for the development of machine learning approaches for automated cancer tissue screening and diagnosis.

Disclosures
The authors declare no conflicts of interest.