Automated cervical precancerous cells screening system based on Fourier transform infrared spectroscopy features

Abstract. Fourier transform infrared (FTIR) spectroscopy can detect abnormalities in a cervical cell that occur before any morphological change can be observed under the light microscope used in conventional techniques. This paper presents a features extraction method developed for an automated screening system for cervical precancerous cells based on FTIR spectroscopy, intended as a second opinion for pathologists. The automated system consists of features extraction and classification stages. Signal processing techniques are used in the features extraction stage. Then, discriminant analysis and principal component analysis are employed to select dominant features for the classification process. The datasets of the cervical precancerous cells obtained from the feature selection process are classified using a hybrid multilayered perceptron network. The proposed system achieved 92% accuracy.


Introduction
Cervical cancer is a leading cause of mortality and morbidity, which comprises ∼12% of all cancers in women worldwide. 1 Pap smear and liquid-based cytology (LBC) are the main screening tools for early cervical precancerous detection. These tests involve examination of a cervical smear under a light microscope, which is a tedious, laborious, and time-consuming laboratory procedure. Drawbacks of the Pap smear test are not only that it is insensitive, giving rise to a high percentage of false-negative results, from the literature ranging from 15% to 70%, 2 but also that highly skilled personnel are required as the reliability of the test depends upon human judgment. The implementations of both methods are time-consuming and highly dependent on the skill of the cytopathologist, which will lead to subjective perception. 3,4 Recently, Fourier transform infrared (FTIR) spectroscopy technology, which is usually used to measure and detect chemical compounds in many industrial fields, has been used to study the structural changes of cells at the molecular level in various human cancers. These structural changes result from carcinogenesis, which is caused by different modes of vibration in the molecules of the cells and tissues when it is induced by the infrared (IR) light. Major functional groups of the cells and tissues will provide unique vibrational frequencies. These frequencies can be characterized by the changes in the FTIR spectra. Thus, the normal or malignant cells can be recognized based on their FTIR spectral characteristic appearance. 5 Over the past decades, there have been a number of studies conducted to investigate the possibility of the FTIR technique as a screening tool for cervical cancer. 
5-7 Since then, many researchers have investigated and applied FTIR spectroscopy as a diagnostic tool to differentiate between normal and malignant tissues and cells of several human cancers, including lung, 8 esophagus, 9 colon, 10,11 skin, 12 gastric, 13 gliomas, 14,15 and cervical. 16-20 Studies conducted by Sindhuphak et al. 21 and El-Tawil et al. 19 further showed that FTIR could overcome the limitations that exist in both the standard Pap smear and the LBC images. Those studies made the notable discovery that the FTIR technique detects cell abnormalities at the molecular level, which occur before changes in morphology can be observed under a light microscope as used in the Pap smear and the LBC tests. The FTIR technique can possibly detect not only normal and abnormal stages but also inflammatory and precancerous stages (dysplasia). An advantage of the FTIR technique is that it is less time-consuming. The measurement of the spectrum on the FTIR equipment is completed within ∼1 min for one sample. In addition, the cervical scrapings require no fixation or staining. Therefore, this technique is simpler, cheaper, more rapid, and more accurate than the Pap smear and the LBC techniques. 19 Although the limitations of manual cervical cancer screening with the Pap smear and the LBC techniques have been addressed by the FTIR technique, 19,21 the measured spectra still contain noise and need some variables to be adjusted for each spectrum. 19,21-27 This noise usually appears as small spurious curves and short peaks. The noise present in the real peak absorbance and slope of cervical cell FTIR spectra could disturb the features extraction process. As a result, many researchers still rely on a manual features extraction process, in which the height of peak absorbance and the height of slope features of the FTIR spectra are affected by noise.
This manual features extraction process is usually done after the smoothing process for each spectrum using tools in the FTIR spectroscopy software. Problems with the manual features extraction process worsen when a large number of cervical samples need to be screened. Since FTIR spectroscopy is a computer-operated system, an automated classification system could be developed to further improve the screening of cervical cancer, making the screening of a large number of cervical samples feasible. 28 Automated classification systems have been developed to classify cervical cells and produce more rapid and accurate screening. 29 Advances in such automated classification systems may not only reduce time but also reduce human errors. 29,30 Therefore, this study aims at developing an automated cervical precancerous screening system that could solve the aforementioned problems and provide better diagnosis of cervical cancer. The automated system is expected to provide more accurate diagnosis because the FTIR spectra are preprocessed with a signal smoothing technique, and dominant features are automatically selected for classification. A better input signal is obtained, and the optimum features are classified, with the aim of increasing the accuracy of cervical cancer diagnosis. The cervical cells are classified into three classes: normal, low-grade squamous intraepithelial lesion (LSIL), and high-grade squamous intraepithelial lesion (HSIL). This paper is organized as follows. The proposed system is elaborated in Sec. 2. Section 3 discusses the obtained results, where a comparison with other systems is presented. Finally, the conclusion is presented in Sec. 4.

Proposed Automated Screening System for Cervical Precancer
The proposed system consists of four sequential stages: spectrum acquisition, features extraction, feature selection, and classification.

Spectrum Acquisition
The cervical cell samples used in this study were obtained from the Gribbles Pathology Laboratory, Petaling Jaya, Selangor, Malaysia (a private provider of diagnostic laboratory services in performing tests for all major disciplines of pathology). The acquired samples were taken from tissue biopsies of women undergoing routine cervical cancer screening. The samples collected from ThinPrep ® solution (PreservCyt; Cytyc) along with their cytology diagnostic results were classified according to the Bethesda System 2001. In this work, we have obtained 650 normal cases, 160 LSIL cases, and 40 HSIL cases of FTIR spectra from individual cervical cells. The cervical cell FTIR spectra were obtained by placing a small amount, ∼0.005 ml, of liquid ThinPrep samples in a circular KRS5 window (an IR transparent cell). The liquid samples were then dried using a dryer for 2 to 3 min before the samples are induced by IR light.
After preparing the cervical cell in the KRS5 window cells, the cervical cell spectra were collected using Spectrum BX II Fourier Transform Spectrometer (Perkin Elmer type 2000) equipped with a deuterated telluride triglycine sulphate detector in mid IR region between 400 and 4000 cm⁻¹.
The FTIR spectroscopy software was employed to manipulate the original spectrum received from the instrument. The purpose of manipulating a spectrum is to enhance its appearance. 31 In this work, automatic baseline correction, smoothing, and normalization were applied. The spectrum was submitted to the automatic baseline correction process before it was smoothed using the smoothing package within the FTIR software. According to Quintero et al., 32 noise may affect the spectrum more than once during the acquisition process. Thus, smoothing was required after the acquisition process to improve the appearance of the spectrum. Figure 1 shows various spectrum patterns with different prominent peaks. The prominent peaks represent the absorption bands of biochemical compounds. Based on the previous study by Wong et al., 6 seven biochemical compounds detected in the cervical cell FTIR spectra could be used for classification (Fig. 1).

Features Extraction
The biochemical compounds are as follows: glycogen, carbohydrates, NA I, NA II, proteins, amide II, and amide I.
However, most of the acquired signal suffers from noise, which further complicates the feature extraction process. Thus, a smoothing filter is proposed.
The filter coefficients (i.e., b_k) must fulfill three conditions. 33,34
1. The sum of the coefficients must be equal to 1.
2. The filter coefficients must be symmetrical about the central coefficient.
3. The filter coefficients must all be positive.
[Fig. 1 caption fragment: ... Table 1. k and l are examples of two base points of the peaks as tabulated in Table 2.]
Condition (1) ensures the conservation of the peak area and a constant background. Meanwhile, conditions (2) and (3) avoid a phase shift between input and output data and avoid undesired oscillations at both sides of the peak, known as the wing effects. The Savitzky-Golay (SG) filter is currently widely used among chemists for the smoothing and differentiation of spectroscopy spectra. 35 Almost all spectroscopic software packages contain this standard smoothing technique. However, the last condition is not completely obeyed by the SG filters, as the last coefficients (N_p, −N_p) are always negative. These negative coefficients introduce some small oscillations at the peak sides. 33 As a result, the SG smoothing algorithm can cause false-negative signals at the shoulders of each vibrating band. 35 In addition, the SG smoothing algorithm can lead to the loss of weak signals and the reduction of spectral resolution. 36 These problems can be solved by using binomial smoothing filters. 36 In other work, we used the quadratic of half ellipse (QHE) filter as a smoothing filter. 37 The QHE filter also fulfills the conditions; its coefficients are obtained by

$$ b(k) = n \ldots \tag{1} $$

where b(k) is the QHE coefficient filter, n is the order of the coefficient filter, and k is the range of the filter from −n to n. However, when implementing the direct-form filter, error surface could possibly occur. 38 Thus, Williamson conducted research on implementing the cascade-form filter as a transformation of the direct-form filter. 38 The cascade-form filters have been shown to perform better than the direct-form filters. 39,40 In addition, the cascade-form filters allow low-cost systems to be constructed because they require fewer physical modifications. 39,40 Thus, inspired by these improvements of the smoothing filter, this paper uses cascade-form filters based on the QHE and the binomial filters. 36 These cascade-form filters are also motivated by an analysis of the equation and the geometry of the ellipse curve. As shown in Fig. 2, when the QHE filter coefficients are plotted on the x- and y-axes, the curve is similar to that of the binomial coefficients. Because both curves show similar patterns, the cascade of the binomial and QHE filters is expected to produce a smoother signal and to serve as a good filter in this work.
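The filter conditions and the idea of cascading two smoothing stages can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the filter used in this study: the QHE coefficients are not reproduced here, so a second binomial kernel stands in for the QHE stage, and the function names (`binomial_kernel`, `convolve`, `smooth`, `check_conditions`) and the edge handling are hypothetical choices.

```python
from math import comb


def binomial_kernel(n):
    """Return the 2n+1 binomial smoothing coefficients b(k), k = -n..n."""
    total = 2 ** (2 * n)
    return [comb(2 * n, k + n) / total for k in range(-n, n + 1)]


def convolve(a, b):
    """Full discrete convolution of two coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def smooth(signal, kernel):
    """Apply a symmetric kernel; edges are handled by repeating the end values."""
    half = len(kernel) // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [
        sum(c * padded[i + j] for j, c in enumerate(kernel))
        for i in range(len(signal))
    ]


def check_conditions(kernel, tol=1e-12):
    """Check unit sum, symmetry, and positivity of the coefficients."""
    unit_sum = abs(sum(kernel) - 1.0) < tol
    symmetric = all(abs(a - b) < tol for a, b in zip(kernel, reversed(kernel)))
    positive = all(c > 0 for c in kernel)
    return unit_sum, symmetric, positive


first = binomial_kernel(2)         # [1, 4, 6, 4, 1] / 16
second = binomial_kernel(1)        # [1, 2, 1] / 4, stand-in for the QHE stage
cascade = convolve(first, second)  # single kernel equivalent to the two-stage cascade
```

Applying `smooth(signal, first)` and then `smooth(..., second)` is equivalent, away from the edges, to one pass with `cascade`. Because both stages have a unit sum, are symmetric, and contain no negative coefficients, the cascade keeps those properties.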
Based on the previous study, we developed a features extraction algorithm for the automated screening system, where the preprocessing is considered as part of the features extraction process (Fig. 3). In the previous studies, the range of the wavenumber used for distinguishing between normal, LSIL, and HSIL lies in the 950 to 1800 cm⁻¹ region. 19 As plotted in Fig. 1, different types of cervical cells show different spectrum patterns with different prominent peaks in the specific bands.
To extract the aforementioned features of the FTIR spectra, this study employed a peak-corrected area-based features extraction (PCABFE) algorithm, as presented in Fig. 4. 41 The PCABFE extracts three primary features: the height of specific peaks, the height of slope between amide I and amide II, and the corrected area in specific regions. The features are calculated using

$$ H(x) = A_x - A_{1580} \tag{2} $$

$$ CA(x) = A_x - A_b \tag{3} $$

where H(x) is the height of the slope between the amide I and amide II bands. In Eq. (2), A_x is the peak height of whichever of the amide I or amide II bands has the minimum value, and A_1580 is the absorbance value (height) of the band at 1580 cm⁻¹. CA(x) is the corrected area under the amide I peak. In Eq. (3), A_x and A_b are the areas under the peak and under the baseline, respectively. The PCABFE algorithm extracted three significant parameters: peak regions, base points, and peak locations. The values are tabulated in Tables 1 and 2.
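The peak height and corrected area calculations can be sketched in Python as follows. This is a minimal illustration, not the PCABFE implementation: the function names are hypothetical, the area is computed with the trapezoidal rule, the baseline is assumed to be the straight line joining the spectrum at the two base points k and l, and the five-point spectrum is synthetic.

```python
def peak_height(wavenumbers, absorbance, lo, hi):
    """Maximum absorbance and its wavenumber within the region [lo, hi]."""
    in_region = [(a, w) for w, a in zip(wavenumbers, absorbance) if lo <= w <= hi]
    height, location = max(in_region)
    return height, location


def trapezoid_area(xs, ys):
    """Area under a sampled curve by the trapezoidal rule."""
    return sum(
        0.5 * (ys[i] + ys[i + 1]) * abs(xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    )


def corrected_area(wavenumbers, absorbance, k, l):
    """CA = A_x - A_b between base points k and l (both on the sampling grid)."""
    pts = [(w, a) for w, a in zip(wavenumbers, absorbance) if k <= w <= l]
    xs = [w for w, _ in pts]
    ys = [a for _, a in pts]
    area_peak = trapezoid_area(xs, ys)
    # Assumed baseline: straight line joining the spectrum at the two base points.
    area_base = 0.5 * (ys[0] + ys[-1]) * abs(xs[-1] - xs[0])
    return area_peak - area_base


# Synthetic triangular peak for illustration only; not measured data.
wn = [1600, 1625, 1650, 1675, 1700]
ab = [0.1, 0.3, 0.5, 0.3, 0.1]
height, location = peak_height(wn, ab, 1600, 1700)
ca = corrected_area(wn, ab, 1600, 1700)
```

For this synthetic peak, the area under the spectrum is 30 and the area under the baseline is 10, so the corrected area is 20 (absorbance units times cm⁻¹).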
For evaluation of the automatic feature extraction performance, a correlation test was conducted to determine the capability of the proposed PCABFE algorithm as compared to the manual extraction by using the FTIR software.

Feature Selection
The extracted features are reduced by discriminant analysis (DA) and principal component analysis (PCA) techniques. Both techniques are employed to determine the dominant features, where the irrelevant or unrelated features that could deteriorate the generalization performance of the artificial neural network (ANN) are eliminated to avoid a large dimensionality problem. 42 Using the Wilks' lambda method for DA, the optimum features are selected based on the null hypothesis and the p-value. The null hypothesis is the equality of several classes of parameters, while the p-value is the probability of observing the given sample result under the assumption that the null hypothesis is true. 23 The significance level (α) chosen for accepting or rejecting the null hypothesis is 0.05. If the p-value is less than α = 0.05, the null hypothesis is rejected, and the feature is considered to discriminate between the classes. On the other hand, the null hypothesis is not rejected when the p-value is more than α.
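The selection rule can be sketched in a few lines of Python. Under the usual convention, a p-value below α leads to rejecting the null hypothesis of equal classes, so the feature is retained as dominant. The function name and the p-values below are hypothetical and serve only as an illustration; they are not results from this study.

```python
ALPHA = 0.05  # significance level stated in the text


def select_dominant(p_values, alpha=ALPHA):
    """Keep features whose p-value is below alpha (null hypothesis rejected)."""
    return [name for name, p in p_values.items() if p < alpha]


# Hypothetical p-values for illustration only; not results from the study.
example_p = {"PH amide I": 0.001, "CA proteins": 0.030, "PH glycogen": 0.210}
dominant = select_dominant(example_p)
```

In this made-up example, the first two features are kept and the third is discarded because its p-value exceeds 0.05.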
For PCA, a scree plot of all the PCs of input features of the cervical cell dataset is presented. The appropriate numbers of principal components to be used are selected by considering the result from the scree plot and eigenvalues. The features with higher eigenvalue in eigenvectors of the appropriate principal components are selected as the optimum features.
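A scree plot shows the eigenvalues in decreasing order, and the proportion of variance explained by each component follows directly from them. The sketch below assumes a simple eigenvalue-greater-than-one cutoff (the Kaiser criterion) as one possible way to read such a plot. The text does not state which cutoff was applied, and the eigenvalues and function names are hypothetical.

```python
def explained_variance(eigenvalues):
    """Proportion of total variance carried by each principal component."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


def components_to_keep(eigenvalues, threshold=1.0):
    """Number of components whose eigenvalue exceeds the threshold (assumed cutoff)."""
    return sum(1 for ev in eigenvalues if ev > threshold)


# Hypothetical eigenvalues, sorted in decreasing order as on a scree plot.
example_eigenvalues = [8.0, 4.0, 2.0, 1.5, 0.5]
ratios = explained_variance(example_eigenvalues)
n_keep = components_to_keep(example_eigenvalues)
```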

Classification of Cervical Precancerous Fourier Transform Infrared Spectrum Using Neural Network
After the signal acquisition, signal smoothing, features extraction, and features selection process, the dominant features are then fed as input data to the intelligent classification stage.
Fig. 4 Flowchart of the PCABFE algorithm:
1. Find the maximum absorbance value (height of peak) in each biochemical compound region (the regions as tabulated in Table 1 and presented in Fig. 1).
2. Calculate the ratio of each height of the peaks.
3. Choose the lower height of peak between amide I and amide II (i.e., A_x).
4. Calculate the height of the peak at 1580 cm⁻¹ (i.e., A_1580).
5. Calculate the height of slope between amide I and amide II by Eq. (2).
6. Determine two base points (i.e., k and l) for each peak of the biochemical compounds (the base points as tabulated in Table 2 and presented in Fig. 1).
7. Calculate the area under the spectrum for the region bordered by the two base points (i.e., A_x).
8. Calculate the area under the baseline (i.e., A_b).
9. Calculate the corrected area (CA) by Eq. (3): CA = A_x − A_b.

k and l are the x-axis values of the FTIR spectra base points used to calculate the corrected areas of each biochemical component.

One of the aims of this study is to classify the cervical cell FTIR spectra into three classes (normal, LSIL, and HSIL). In this paper, the hybrid multilayered perceptron (HMLP) network is trained with the modified recursive prediction error algorithm for classification. During the training of the HMLP network, this study employs a 10-fold cross-validation method. The details of the 10-fold cross-validation method can be found in the previous study. 43 The data are partitioned into 10 equally sized segments, or folds. Ten iterations of training and testing are performed, with a different fold used for testing in each run and the remaining nine folds used for training. Each training set therefore contains 585 normal, 144 LSIL, and 36 HSIL spectra, and each testing set contains 65 normal, 16 LSIL, and 4 HSIL spectra.
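The fold sizes quoted above follow from splitting each class evenly across the 10 folds. A minimal Python sketch of such a stratified partition is given below. The round-robin assignment and the function names are assumptions for illustration, and a real run would typically shuffle the samples within each class first.

```python
def stratified_folds(class_counts, n_folds=10):
    """Assign the samples of each class to folds in round-robin order."""
    folds = [[] for _ in range(n_folds)]
    for label, count in class_counts.items():
        for i in range(count):
            folds[i % n_folds].append((label, i))
    return folds


def split(folds, test_index):
    """Use one fold for testing and the remaining folds for training."""
    test = folds[test_index]
    train = [s for j, f in enumerate(folds) if j != test_index for s in f]
    return train, test


def count_by_class(samples):
    """Count the samples per class label."""
    counts = {}
    for label, _ in samples:
        counts[label] = counts.get(label, 0) + 1
    return counts


counts = {"normal": 650, "LSIL": 160, "HSIL": 40}
folds = stratified_folds(counts)
train, test = split(folds, 0)
train_counts = count_by_class(train)  # 585 normal, 144 LSIL, 36 HSIL
test_counts = count_by_class(test)    # 65 normal, 16 LSIL, 4 HSIL
```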
The datasets with features selected by DA only and by the DA-PCA techniques are tested to determine the better configuration for the automated system. Confusion matrices are presented for evaluation purposes. In this paper, performance is compared on the basis of accuracy between this study and related published work. The cervical cell spectra from the FTIR spectroscopy were compared with cytology (the gold standard). Therefore, the confusion matrix is presented in this paper to show the actual condition.

Results and Discussions
The developed features extraction, features selection, and intelligent classification system for cervical spectra have been proposed as an automated cervical screening system. In this section, the results and discussions of the proposed method are presented. The features extraction results are explained in Sec. 3.1. Section 3.2 presents the features selection results. Section 3.3 discusses the obtained results from the intelligent classifier of cervical spectra classification in detail. Section 3.4 presents the proposed automated screening system for cervical cancer.

Features Extraction Results
The primary features (i.e., CA amide 1, CA amide 2, CA proteins, CA NA II, CA NA I, PH amide 1, PH amide 2, PH proteins, PH NA II, PH carbohydrate, PH NA I, PH glycogen, and height of slope between amide 1 and amide 2) create a very strong linear relationship with correlation test results more than 0.95 (approaching one), as presented in Table 3.
Generally, all features have a strong linear relationship (correlation test achieved more than 0.8, as presented in Table 4) compared to those extracted manually by using FTIR spectroscopy software. The results show that the PCABFE algorithm and the manual extraction using the FTIR spectroscopy software have constructed a linear curve for all features. Based on the evaluation of the PCABFE performance, a total of 32 features are extracted, as listed in Table 4.

Features Selection Results
Based on the 32 possible features, as shown in Table 4, the DA and PCA techniques are employed to determine the dominant features for the classification process in the next stage. Table 5 tabulates the results attained from DA of 32 features. Based on the results, 11 features show insignificant effect or low impact to the classification process as the p-values distribution obtained are more than 5% (as made bold in Table 5). The features with p-value distribution less than 5% are said to have high impact on the classification process. Thus, based on this argument, 21 of 32 features have been selected as dominant features for the DA process.
Afterward, the 21 features of the cervical cell FTIR spectra are further analyzed using the PCA technique. By considering the results from the scree plot and eigenvalues, the appropriate number of principal components to be used is four.  Table 6 lists the variables that have strong relationship with PC1, PC2, PC3, and PC4.

Intelligent Classification Results
For the intelligent classification results, the datasets using 21 features from the DA and 20 features from the DA-PCA processes are tested, respectively, to determine the most stable system. These dominant features from the DA and the DA-PCA processes are individually fed into HMLP classifier for classification purpose. The HMLP classification results of the DA datasets with 21 dominant features are presented in Table 7.
The HMLP with the DA datasets correctly detected 634 of the 650 normal, 102 of the 160 LSIL, and 24 of the 40 HSIL cervical FTIR spectra. The HMLP classification results of the DA-PCA datasets with 20 dominant features are also presented in Table 7: the HMLP correctly detected 621 of the 650 normal, 106 of the 160 LSIL, and 27 of the 40 HSIL cervical FTIR spectra.
Overall, the results tabulated in Table 7 demonstrate that the HMLP performs well in classifying cervical cell FTIR spectra into normal, LSIL, and HSIL classes. However, the DA-PCA datasets produced higher FP values than the DA datasets. As shown in Table 7, the FP value for the DA-PCA dataset was 29 (26 normal cells incorrectly classified as LSIL and three normal cells incorrectly classified as HSIL). The FP value for the DA datasets was lower, at 16 normal cells incorrectly classified as LSIL, and no normal cells were classified as HSIL. These results occurred because the HSIL cells are at a high stage of abnormality, and their characteristics differ clearly from those of normal cells. Meanwhile, the FN values for the HMLP with DA datasets are higher than those for the HMLP with DA-PCA datasets, as detailed in Table 7.
The FN values are 52 LSIL cells incorrectly classified as normal class for DA-PCA datasets. No HSIL cells are incorrectly classified as normal class. Meanwhile, the FN values of DA datasets obtained 54 LSIL cells incorrectly classified as normal class, as tabulated in Table 7. The HSIL cell is also not classified as normal class. These results occurred because the LSIL cells, in fact, only affect the surface of the cervical tissue. The majority will regress back to normal spontaneously. 44 Over time, a small proportion will continue to develop into true cancer. The HSIL cells cannot be recovered to be normal cells. Based on the FP and FN values for both datasets results in Table 7, the system can significantly differentiate between normal and HSIL cells, and the LSIL and normal cells can also be distinguished. However, small portions of the LSIL are incorrectly classified as normal cells, and part of normal cells are incorrectly classified as LSIL cells. Similarly, these results are expected since most LSIL cells will regress back to normal. 44 Therefore, our system produced consistent results with acceptable accuracy to classify the cervical precancerous cells.
The result of the HMLP with the DA dataset shows relatively better performance in terms of stability. Therefore, based on the 21 selected features (from DA datasets), the HMLP classifier performs well in classifying cervical cell FTIR spectra into normal, LSIL, and HSIL classes, with 92% accuracy. The promising results obtained in this stage are utilized to develop an automated screening system for cervical cancer. The results of the proposed system are elaborated in detail in Sec. 3.4.
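The reported accuracy can be related to the FP and FN counts given above with a short worked calculation. The sketch assumes that accuracy is measured on the normal-versus-abnormal decision, i.e., that only normal spectra flagged as abnormal (FP) and abnormal spectra flagged as normal (FN) count as errors. This is one reading that reproduces the reported figure and is not a definition stated in the text. The function name is hypothetical.

```python
TOTAL = 850  # 650 normal + 160 LSIL + 40 HSIL spectra


def screening_accuracy(false_positives, false_negatives, total=TOTAL):
    """Fraction of spectra placed on the correct side of normal vs. abnormal."""
    return (total - false_positives - false_negatives) / total


acc_da = screening_accuracy(16, 54)      # DA dataset: FP = 16, FN = 54
acc_da_pca = screening_accuracy(29, 52)  # DA-PCA dataset: FP = 29, FN = 52
```

Under this assumption, the DA dataset gives 780/850, or about 91.8%, which rounds to the reported 92%, while the DA-PCA dataset gives 769/850, or about 90.5%.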

Automated Screening System for Cervical Cancer
The proposed screening system contains the automatic features extraction and intelligent screening. Figure 5 shows the interface of the system. A user is only required to input the cervical cell FTIR spectra. The smoothed spectrum, the features of the cervical cell FTIR spectrum, and the case and class of the cervical cell FTIR spectra are then displayed automatically. This procedure could produce faster screening results and decrease the dependency on human experts, thus reducing the workload of pathologists. To date, several researchers have developed cervical cancer screening tools based on spectroscopy approaches. Our proposed system can be compared with another system developed using FTIR spectroscopy. 19 That system used only five features, which are obtained from the ratios of the peak height values (1) glycogen/NA I, (2) NA I/carbohydrates, (3) NA I/amide II, (4) proteins/amide I, and (5) NA I/proteins, to differentiate the different types of cervical cell spectra. We also include a system that uses Raman spectroscopy 27 in our comparison. The comparison results are shown in Table 8, where the three systems A, 19 B, 27 and C (proposed system) are tabulated. Table 8 suggests that our proposed system achieved the best performance in terms of accuracy, with 92%. This is likely because our proposed system used more dominant features (21 features from DA datasets) to differentiate between the three classes of cervical cells (normal, LSIL, and HSIL). Therefore, we suggest that our system differentiates the cervical cells better owing to the combination of the proposed signal smoothing filter, the PCABFE algorithm to extract features from the cervical cell FTIR spectra, DA to select the optimum features (21 features), and the HMLP network for classification.

Conclusions
In this paper, an automated screening system has been presented to determine the case and class of cervical precancerous cells based on the cervical cell FTIR spectrum. The automated screening system employs signal processing techniques and ANN. The digital signal processing techniques introduce a cascade of direct-form smoothing filters and an automated features extraction technique for extracting features from the cervical cell FTIR spectra. Meanwhile, the DA features selection technique and ANN are employed in the classification stage. The effectiveness of the proposed screening system has been demonstrated empirically using 850 cases of cervical cell FTIR spectra, classifying the cervical cells as normal, LSIL, or HSIL with an accuracy of 92% based on the DA datasets. Although the results obtained so far are encouraging, more investigations on both theoretical and practical aspects are needed to further establish the applicability of the proposed screening system for screening the cervical precancerous stage based on cervical cell FTIR spectra.

Fig. 5 Interfacing of the proposed automated screening system for cervical cancer. Panel labels: cervical cell FTIR spectra, screening results, cervical cell FTIR spectra features, and screening button panel.