Near-infrared (NIR) diffuse optical tomography (DOT) is emerging as a potential clinical tool for breast cancer detection due to its ability to quantitatively image the high optical contrast that arises intrinsically from molecular and cellular signals generated through the presence of blood, water, and lipid, as well as cellular density, which are the predominant transformations associated with malignancy. Clinical studies conducted at multiple institutions and in multiple countries have repeatedly shown that there exist 2:1 and higher absorption contrasts in breast cancers that can be tomographically imaged. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Since 1997, our laboratory has developed three complete clinical platforms for NIR optical tomography of the breast, evolving from single-wavelength/2-D to multiwavelength/3-D capabilities. These systems have been used to evaluate the potential of our approach through a series of studies designed to quantify the imaging contrast in the normal and abnormal breast and to provide initial assessments of the operating characteristics of the imaging systems for diagnostic decision-making in the setting of screen-detected breast lesions.10, 11, 12, 13, 14 Specifically, both absorption and scattering properties of breast tissues can be obtained to sensitively distinguish between normal and abnormal breast tissues. Cysts can be clearly differentiated from solid tumors based on these two properties alone. Further, Hb and are two important parameters for enhancing sensitivity, consistent with the finding from Chance.15 However, the imaging parameters available from current NIR tomography do not appear to be able to fully characterize breast tissues, resulting in limited sensitivity and specificity. In 2003, it was shown for the first time that refractive index/phase contrast could be used as a new imaging parameter for NIR tomography, where refractive index and absorption/scattering parameters were reconstructed using two different algorithms.16 Our initial clinical results demonstrate that phase-contrast DOT combined with conventional DOT offers considerably improved sensitivity and specificity compared with conventional DOT alone.13 In addition, our results show that cellular density and size derived from the scattering spectra can characterize the nature of breast lesions more accurately than the scattering property or scattering amplitude/scattering power.
Initial results in 14 breast abnormalities show that malignant tumors can be differentiated from benign lesions with high accuracy,14 meaning that phase contrast and cellular density/size together with the functional parameters can provide a more complete spectrum with much improved sensitivity and specificity for accurate characterization of breast lesions.
Given the relatively large set of imaging parameters now available from our NIR reconstruction approach, it is natural to adapt and develop methods for computer-aided classification of breast lesions. Computer-aided diagnosis has been well studied and widely accepted in the fields of conventional imaging.17, 18, 19, 20 This paper reports our initial effort in applying automatic classification algorithms to the analysis of tissue phase contrast, absorption, and scattering parameters. Our classification results of 35 breast masses using a support vector machine (SVM) classifier demonstrate for the first time that the specificity can be significantly improved from 71% based on the visual examination to 92% when phase contrast, absorption, and scattering parameters are all used.
The rest of this paper is organized as follows. First, the visual examination process for detecting breast cancer is reviewed in Sec. 2. Then, the automated procedure for extracting image features is proposed in Sec. 3. These image features in turn will be used in Sec. 4 to detect breast cancer by an SVM classifier. Last, concluding remarks and future studies are discussed in Sec. 5.
Visual Detection of Breast Cancer
Image Presentation of Optical Parameters
Thirty-five breasts from 33 different patients were imaged using a compact, parallel-detection diffuse optical mammography system.21 The absorption and scattering images were reconstructed using a finite-element-based algorithm,22, 23, 24 while the refractive index images were recovered using a finite-element-based phase–contrast DOT algorithm.16 All the images were reconstructed using a finite element mesh consisting of 700 nodes, thus giving the refractive indices, absorption, and scattering coefficients at 700 locations evenly distributed across the entire sample area. To enable visual examination of the reconstructed optical parameters, the partial differential equations (PDE) toolbox in MATLAB is used to display the obtained parameters. Figures 1 and 2 demonstrate the coronal refractive index, absorption, and scattering images from two representative patients. The first case (Fig. 1) is a -old female with a infiltrating ductal carcinoma, and the second case (Fig. 2) is a -old female with biopsy-confirmed benign microcalcifications.
Using the refractive index, absorption, and scattering images plotted by the MATLAB PDE toolbox, an experienced technician may visually distinguish a malignant tumor from a benign one. For instance, examining the absorption and scattering images shown in Figs. 1b and 1c, the tumor area can be identified at around the coordinate ( , ). Inspecting the corresponding area in the refractive index image shown in Fig. 1a, it is clear that the refractive index in this area is lower than the surrounding area. Thus, this is a cancer case. For the images shown in Fig. 2, the lesion area is identified at the coordinate (10, ) by checking the absorption and scattering images given in Figs. 2b and 2c. Examining the image shown in Fig. 2a, the refractive index in the corresponding area is found higher than the surrounding area. Therefore, this is a benign case. However, in these two cases, visually examining only the absorption and scattering images without checking their associated refractive index images cannot distinguish between the malignant case and the benign one. These two examples indicate that it is possible to use the refractive index image in conjunction with absorption and scattering images to differentiate malignant from benign lesions.
To further validate the feasibility of the visual examination method, the absorption, scattering, and refractive index images of 35 breasts were obtained from 33 patients before their biopsy procedures. Using the aforementioned method to classify the images visualized by the MATLAB PDE toolbox, the statistics of the visual examination results over these 35 breast masses [biopsy confirmed 11 invasive carcinomas (malignant group) and 24 benign lesions (benign group)] are presented in Table 1 .13
Statistics of breast cancer detection by visual examination of refractive index, absorption, and scattering images.
| True positives | True negatives | False positives | False negatives | Sensitivity | Specificity | False positive rate (FPR) | Overall accuracy |
While the results shown in Table 1 are promising, the visual examination process has several drawbacks. First, it is time consuming, because the technician has to manually create MATLAB files to visualize the images. Second, the visual identification of malignant lesions depends on the subjective judgment of human beings: physicians or technicians have to be specially trained to make a reliable classification, and human errors may be inevitable. Third, some images, especially the refractive index images, are relatively noisy, and it is difficult to give a reliable classification using visual examination. For instance, the images shown in Fig. 3 can be classified as either a malignant or a benign case using visual examination because half of the lesion area (the corresponding lighter color areas in the absorption and scattering images) has high refractive index while the other half has low refractive index. The technician or physician has to make a best guess based on his/her experience. Therefore, the accuracy of the classification is questionable. Last, only 700 discrete points of the sample area on a triangular mesh were used to obtain the refractive index, absorption, and scattering parameter values (see Fig. 4 ). These 700 values are then visualized into smooth images, as shown in Figs. 2, 3, 4, using the MATLAB PDE toolbox. This visualization process may introduce imprecision to the visual classification of images.
To address these drawbacks, an automatic procedure is proposed in this paper to first directly analyze the image data to extract the classification attributes and then to detect breast cancer using an SVM classifier.
Automatic Feature Extraction
The first step of our automated classification procedure is to automatically extract classification attributes from the refractive index, absorption, and scattering images. Unlike the manual process, which uses the PDE toolbox in MATLAB to generate color images of the optical parameters, a program written in the C programming language extracts the features of interest directly from the recovered image data, avoiding possible errors introduced by the visualization process. This feature extraction process consists of two phases: image segmentation and parameter extraction.
It is important to identify the lesion area before extracting features for classification. The lesion areas will be identified by analyzing only the distribution of absorption and scattering coefficients because the refractive index data are relatively noisy. Due to their cellular morphology and biochemical compositions, some lesions may show higher absorption and scattering coefficients than normal tissues (e.g., Fig. 1), while other lesions may yield lower absorption and scattering coefficients relative to the surroundings (e.g., Fig. 3). This is another reason that it is unreliable to use only the absorption and scattering coefficient values to distinguish cancers from the benign lesions.
To automatically identify the lesion areas, the image segmentation process must first determine whether the areas with high coefficient values or low coefficient values should be selected. Because our automated classification procedure is designed for early noninvasive detection of breast cancer, it is reasonable to assume that a lesion usually exists in a small area of the entire imaging domain. Therefore, applying statistical analysis on the absorption and scattering coefficients can determine the possible background data range and identify the potential lesion areas. At first, the median absorption and scattering coefficient values of the 700 data samples are calculated. Then, the mean values of the upper quartile (upper 25%) sample data and the lower quartile (lower 25%) sample data are computed. If the difference between the mean of the upper quartile sample data and the median is greater than the difference between the median and the mean of the lower quartile sample data, lesion areas should have high coefficient values; otherwise, the lesion area should have low coefficient values. Although the refractive index images are relatively noisy, the same statistical analysis can be used to identify the background to apply the same segmentation process.
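The quartile comparison described above can be illustrated with a minimal Python sketch. It is not the authors' C program; the function name `lesion_polarity` and the convention of taking a quartile as the lowest or highest `n // 4` sorted samples are assumptions made only for illustration.

```python
from statistics import mean, median


def lesion_polarity(values):
    """Decide whether lesion areas should have high or low coefficient values.

    Compares the distance from the median to the mean of the upper quartile
    with the distance from the median to the mean of the lower quartile.
    Assumption: a quartile is the first or last n // 4 sorted samples.
    """
    ordered = sorted(values)
    quarter = max(1, len(ordered) // 4)
    med = median(ordered)
    lower_mean = mean(ordered[:quarter])
    upper_mean = mean(ordered[-quarter:])
    # A larger upper-side gap means the outlying (lesion) samples are high.
    return "high" if (upper_mean - med) > (med - lower_mean) else "low"


if __name__ == "__main__":
    # Synthetic data: a mostly uniform background with a small bright area.
    background_with_bright_spot = [1.0] * 90 + [5.0] * 10
    print(lesion_polarity(background_with_bright_spot))
```

Because a lesion is assumed to occupy only a small part of the imaging domain, the median stays close to the background value, and the side on which the quartile mean departs further from it indicates where the lesion samples lie.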
After the data range for the background is determined, possible lesion areas are identified through image segmentation. Image segmentation is a process in which regions or features sharing similar characteristics are identified and grouped together. Image segmentation may use statistical classification,25 thresholding,26 edge detection,27 region detection,28 or any combination of these techniques. The output of the segmentation is usually a set of classified elements, such as tissue regions or tissue boundaries. Most segmentation techniques are either region-based or edge-based. Region-based techniques rely on common patterns of values within a cluster of neighboring pixels or sample points. This cluster is referred to as the region, and the goal of the segmentation algorithm is to group regions according to their anatomical or functional roles. Edge-based techniques rely on discontinuities in image values between distinct regions, and the goal of the segmentation algorithm is to accurately demarcate the boundary separating these regions. A good segmentation procedure is the key to the success of the image processing, while weak or erratic segmentation algorithms almost always guarantee eventual failure.
Because our image data are 700 discrete sample points distributed on a triangular mesh, as shown in Fig. 4, traditional edge-based approaches are not suitable for these data. Therefore, a region-based thresholding segmentation method is used to identify the regions of interest (possible lesion areas). If the high coefficient areas are the targeted areas, the segmentation process can start at any point with a data value above a certain threshold, and the region is expanded by including the points directly connecting to any point in the region with a value above the threshold. This process continues until all points with values above the threshold are examined and included in a region. Conversely, if the low coefficient areas are the targeted areas, the segmentation process can start at any point with a data value below a certain threshold, and the region is expanded by including the points directly connecting to any point in the region with a value below the threshold. Because normal tissue absorption, scattering, and refractive index values vary for different patients, it is undesirable to use an absolute threshold for the image segmentation. Instead, a relative threshold is used in our image segmentation procedure. Assuming the maximum and minimum values of sample points to be and , respectively, for any sample point with a value , the sample point belongs to a high-value region of interest if , and conversely, it belongs to a low-value region of interest if . The threshold values for absorption and scattering coefficient images are set to 0.7, while the threshold for refractive index images is 0.6 based on experiments. The reason that a smaller threshold value is used for the refractive index images is because the variations of the refractive index values at different sample points are much smaller than those of absorption and scattering sample data. 
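The region-growing step on the mesh nodes can be sketched as follows. This is an illustrative Python sketch rather than the authors' implementation. In particular, the exact form of the relative threshold test is not recoverable from the text, so the sketch assumes that a node is selected when its value, normalized to the range between the minimum and maximum sample values, is at least the threshold (high-value regions) or at most one minus the threshold (low-value regions). The adjacency list `neighbors` is a hypothetical representation of the mesh connectivity.

```python
from collections import deque


def segment_regions(values, neighbors, threshold=0.7, polarity="high"):
    """Group connected mesh nodes that pass a relative threshold test.

    values    : coefficient value at each mesh node
    neighbors : neighbors[i] lists the nodes directly connected to node i
    threshold : relative threshold (assumed form, see the text above)
    polarity  : "high" or "low", the kind of region being sought
    """
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        return []  # a flat image has no region of interest

    def selected(i):
        relative = (values[i] - v_min) / span
        if polarity == "high":
            return relative >= threshold
        return relative <= 1.0 - threshold

    visited = set()
    regions = []
    for start in range(len(values)):
        if start in visited or not selected(start):
            continue
        # Breadth-first expansion over directly connected selected nodes.
        region = []
        queue = deque([start])
        visited.add(start)
        while queue:
            node = queue.popleft()
            region.append(node)
            for nb in neighbors[node]:
                if nb not in visited and selected(nb):
                    visited.add(nb)
                    queue.append(nb)
        regions.append(sorted(region))
    return regions


if __name__ == "__main__":
    # Six nodes connected in a chain: 0-1-2-3-4-5.
    vals = [0.0, 0.1, 0.9, 1.0, 0.2, 0.8]
    chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
    print(segment_regions(vals, chain, 0.7, "high"))
```

Working on node connectivity rather than on pixel neighborhoods is what lets the same procedure run directly on the 700 discrete sample points without first rendering a smooth image.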
Figures 5 and 6 demonstrate the images after the region-based thresholding segmentation was applied on the images presented in Figs. 1 and 2 respectively. We note that the edges of the areas of interest on the images shown in Figs. 5 and 6 are not as smooth as those given in Figs. 1 and 2. This is due to the fact that the segmentation algorithm is directly applied on the 700 discrete sample points. In addition, the background values of these images are the mean value of the sample points in the region of non-interest.
After the segmentation, the classification attributes will be extracted from the regions of interest. If only the absorption or scattering images are used to classify the lesions, the region with the largest size or having the largest mean value is selected as the lesion area for each image. Once the lesion area is identified, the mean coefficient, size, length, and width of this area and the mean coefficient of the background are extracted as the attributes for image classification. However, the method of determining the lesion area is different when both the absorption and scattering coefficients are considered. Based on our experiments, a location correlation exists between the absorption and scattering coefficients. Therefore, when both absorption and scattering images are available, a region of interest on the absorption image will be selected as the lesion area only if it has the largest overlap area with any of the regions of interest in the scattering image, or its distance to any of the regions of interest in the scattering image is minimal if there are not any overlapped regions of interest between the absorption and scattering images. Hence, a lesion area identified using correlation between the absorption and scattering coefficients may be different from that identified by using absorption or scattering image alone. Again, the mean coefficient of the lesion areas and their size, length, and width and the mean coefficient of the background are extracted for image classification. In addition, the overlap ratio of the regions of interest on the absorption and scattering images is also included as a classification attribute. Because using the absorption and scattering images is sufficient to identify the lesion areas and the refractive index images are relatively noisy, the refractive index images are not used to identify the lesion areas. 
However, once the lesion area is identified using absorption and scattering images, the mean refractive index value at the lesion area and the refractive index value in the surrounding area are included in the classification attributes for cancer detection.
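The lesion selection rule based on the location correlation between absorption and scattering regions can be sketched in Python as follows. This is a simplified illustration under stated assumptions: regions are lists of mesh node indices, `coords` is a hypothetical list of node coordinates, and the distance between two regions is taken to be the distance between their centroids, which the text does not specify.

```python
import math


def centroid(region, coords):
    """Mean position of the nodes in a region."""
    xs = [coords[i][0] for i in region]
    ys = [coords[i][1] for i in region]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def select_lesion(abs_regions, sca_regions, coords):
    """Pick the absorption/scattering region pair that best co-locates.

    Preference is given to the pair sharing the most mesh nodes. If no pair
    overlaps, the pair with the smallest centroid distance is returned
    (assumed distance measure).
    """
    pairs = [(a, s) for a in abs_regions for s in sca_regions]
    if not pairs:
        return None
    best = max(pairs, key=lambda p: len(set(p[0]) & set(p[1])))
    if set(best[0]) & set(best[1]):
        return best
    return min(
        pairs,
        key=lambda p: math.dist(centroid(p[0], coords), centroid(p[1], coords)),
    )


if __name__ == "__main__":
    coords = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
    print(select_lesion([[0, 1], [4]], [[1, 2]], coords))
```

Once a pair is selected, the mean refractive index over the absorption-defined lesion nodes and over the surrounding nodes can be read from the refractive index data at the same node indices.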
With the extracted diagnostic attributes, a support vector machine (SVM)29, 30 classifier is used to detect the breast cancer. SVMs are a new generation of machine-learning systems based on recent advances in statistical learning theory. SVMs deliver the state-of-the-art performance in real-world applications such as image classification, bio-sequence analysis, etc., and are now considered one of the standard tools for machine learning and data mining.
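Given a training data set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, where $\mathbf{x}_i \in \mathbb{R}^n$ is a data sample and $y_i$ is the associated class label, our breast cancer detection is actually a binary classification problem, i.e., $y_i$ is from a label space $\{+1, -1\}$, where $+1$ denotes the cancer and $-1$ the noncancer. Let $\phi$ be a mapping function that projects data samples from the data space to a feature space. The SVM learning algorithm finds a hyperplane $(\mathbf{w}, b)$ in the feature space to solve the following optimization problem:
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\mathbf{w}^{T}\mathbf{w} + C\sum_{i=1}^{l}\xi_i \tag{1}$$
subject to
$$y_i\left(\mathbf{w}^{T}\phi(\mathbf{x}_i) + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l, \tag{2}$$
where $C > 0$ is the penalty parameter of the error term. This optimization problem can be solved in the dual domain using quadratic programming:
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{l}\alpha_i y_i = 0, \tag{3}$$
where $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^{T}\phi(\mathbf{x}_j)$ is the kernel. By solving Eq. 3, the decision function, given an unseen test sample $\mathbf{x}$, is expressed as:
$$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{l}\alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right). \tag{4}$$
Commonly used kernels are the linear kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{T}\mathbf{x}_j$, the polynomial kernel $K(\mathbf{x}_i, \mathbf{x}_j) = (\gamma\,\mathbf{x}_i^{T}\mathbf{x}_j + r)^{d}$, the radial basis function (RBF) kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma\,\|\mathbf{x}_i - \mathbf{x}_j\|^{2})$, and the sigmoid kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\gamma\,\mathbf{x}_i^{T}\mathbf{x}_j + r)$, where $\gamma$, $r$, and $d$ are kernel parameters.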
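To make the kernel decision function concrete, the following Python sketch evaluates the sign of a weighted sum of kernel values between the support vectors and a test sample, using the standard RBF kernel $\exp(-\gamma\|\mathbf{x} - \mathbf{z}\|^2)$. The support vectors, multipliers, bias, and $\gamma$ value in the usage example are made-up illustrative numbers, not values trained on the clinical data, and the function names are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def svm_decision(x, support_vectors, labels, alphas, bias, gamma):
    """Return +1 (cancer) or -1 (noncancer) for a test sample x.

    Evaluates sgn(sum_i alpha_i * y_i * K(x_i, x) + b) over the support
    vectors. A score of exactly zero is mapped to +1 by convention here.
    """
    score = bias + sum(
        alpha * y * rbf_kernel(sv, x, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Toy model with one support vector per class (illustrative values only).
    svs = [(0.0, 0.0), (2.0, 2.0)]
    ys = [+1, -1]
    alphas = [1.0, 1.0]
    print(svm_decision((0.1, 0.0), svs, ys, alphas, bias=0.0, gamma=0.5))
```

Only samples with nonzero multipliers contribute to the sum, which is why a trained SVM needs to store just its support vectors rather than the whole training set.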
In our lesion image classification, the RBF kernel is used because of several advantages. First, the RBF kernel nonlinearly maps samples into a higher dimensional space so that it can handle the cases when the relation between class labels and attributes is nonlinear. Conversely, the linear kernel cannot deal with the nonlinear relationship between class labels and attributes. In fact, the linear kernel can be viewed as a special case of RBF, since one can always achieve the same performance using the RBF kernel with some parameters $(C, \gamma)$ as that using the linear kernel with a penalty parameter $\tilde{C}$.31 Second, although the sigmoid kernel behaves like RBF for certain parameters, this kernel may be invalid (i.e., not the inner product of two vectors) under certain parameters.32 Last, the polynomial kernel has more hyperparameters than the RBF kernel and may be more complex in model selection.
Using an RBF kernel, two parameters, $C$ and $\gamma$, must be determined through the training data because it is impossible to know beforehand which $C$ and $\gamma$ are the best for a particular problem. In our automated classification procedure, a computer program is implemented using the C programming language and the application programming interface (API) provided by Weka data mining tools33 to automatically search for the best parameters. Because a high training accuracy (i.e., classifiers accurately predict training data whose class labels are indeed known) may not necessarily result in a high accuracy in prediction of unknown data due to the overfitting problem with many advanced classification algorithms, a tenfold stratified cross-validation is used to evaluate the accuracy of the SVM classifier.
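The stratified partitioning behind a tenfold cross-validation can be sketched in plain Python as follows. This illustrates the idea only and is not the Weka implementation; the function names and the round-robin dealing scheme are assumptions. Each class is shuffled and dealt across the folds in turn, so every fold keeps roughly the same proportion of cancer and noncancer cases as the full data set.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the members round-robin, continuing where the last class ended
        # so that fold sizes stay balanced.
        for j, index in enumerate(members):
            folds[(offset + j) % k].append(index)
        offset += len(members)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out each fold once."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    # 11 malignant (1) and 24 benign (0) cases, as in the data set above.
    labels = [1] * 11 + [0] * 24
    for fold in stratified_folds(labels):
        print(len(fold), sum(labels[i] for i in fold))
```

A parameter search would repeat this evaluation for each candidate pair of penalty and kernel parameters and keep the pair with the best cross-validated accuracy, which guards against selecting parameters that merely fit the training samples.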
Classification Results Based Solely on Absorption Coefficient
Our first experiment is to evaluate the SVM classifier trained on the attributes extracted from only the absorption coefficient images. Five attributes are extracted from each absorption coefficient image. They are the size of the lesion area in terms of the number of sample points, the mean coefficient of the lesion area, the mean coefficient of the background, and the length and width of the lesion area. Figure 7 shows the absorption attributes obtained by our feature extraction procedure. The confusion matrix of the 10-fold cross-validation results is depicted in Table 2. A confusion matrix is a visualization tool typically used in supervised machine learning. Each column of the confusion matrix represents the instances in a predicted class, while each row represents the instances in an actual class. One benefit of a confusion matrix is that it is easy to see whether the system is confusing two classes (i.e., commonly mislabeling one as another). As shown in Table 2, the first row represents the number of cancer instances, and the second row represents the number of noncancer instances. Likewise, the first column of Table 2 represents the number of instances that are classified as cancer by our SVM classifier, while the second column represents the number of instances classified as noncancer by our SVM classifier. The data in the first row show that there are 11 actual cancer instances, and 6 of them are identified as cancer by the SVM classifier. Therefore, the sensitivity of the classification is 54.5% (6/11). Similarly, the data in the second row show that there are 24 actual noncancer instances, and 17 of them are identified as noncancer by the SVM classifier. Thus, the specificity is 70.8% (17/24).
Although these results show a specificity of 70.8% on the classification using only absorption coefficient images, the low sensitivity (54.5%) indicates that using the absorption coefficient images alone cannot distinguish the malignant from the benign cases.
Confusion matrix of the SVM classifier using attributes extracted from absorption coefficient images.
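The way sensitivity and specificity are read from such a confusion matrix can be written out as a short Python sketch. The counts below are the ones stated in the text for the absorption-only experiment (6 of 11 cancers and 17 of 24 noncancers correctly classified); the function name and the dictionary layout are illustrative choices.

```python
def binary_metrics(confusion):
    """Compute metrics from a 2x2 confusion matrix.

    Rows are actual classes and columns are predicted classes, with cancer
    first: [[TP, FN], [FP, TN]].
    """
    (tp, fn), (fp, tn) = confusion
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),          # detected cancers / all cancers
        "specificity": tn / (tn + fp),          # cleared benign / all benign
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
        "accuracy": (tp + tn) / total,
    }


if __name__ == "__main__":
    absorption_only = [[6, 5], [7, 17]]
    metrics = binary_metrics(absorption_only)
    print(f"sensitivity = {metrics['sensitivity']:.1%}")
    print(f"specificity = {metrics['specificity']:.1%}")
```

The same function applies unchanged to the confusion matrices of the later experiments, which makes the comparison across attribute sets a matter of substituting the four counts.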
Classification Results Based Solely on Scattering Coefficient
Our second experiment is to evaluate the SVM classifier trained on the attributes extracted from the scattering coefficient images. The same five attributes as in the first experiment are extracted from each scattering coefficient image. Figure 8 shows the scattering attributes obtained by our feature extraction procedure. The confusion matrix of the 10-fold cross-validation results is depicted in Table 3. Again, the results shown in Table 3 indicate that using the scattering coefficient images alone cannot distinguish the malignant from the benign cases, since the sensitivity is only 45.5%.
Confusion matrix of the SVM classifier using attributes extracted from scattering coefficient images.
Classification Results Based on Both Absorption and Scattering Coefficients
Our third experiment is to evaluate the SVM classifier trained by the attributes extracted from both the absorption and the scattering images. As discussed earlier, the lesion areas are identified by considering the co-existence of areas of interest in the same location on both the absorption and the scattering coefficient images. Therefore, some data shown in Fig. 9 are different from those shown in Figs. 7 and 8, which were obtained by analyzing only the absorption and the scattering images, respectively. In addition to the 10 attributes extracted from the absorption and the scattering images, respectively (5 attributes for each image), the overlap ratio of the regions of interest on the absorption and scattering images is also included as a classification attribute. The overlap ratio is calculated as twice the number of overlapped points divided by the total number of points in the corresponding regions of interest on the absorption and scattering images. The confusion matrix of the 10-fold cross-validation results is depicted in Table 4 .
Confusion matrix of the SVM classifier using attributes extracted from both absorption and scattering coefficient images.
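The overlap ratio attribute defined above, twice the number of overlapped points divided by the total number of points in the two regions of interest, can be computed as in the following Python sketch. Representing a region as a collection of mesh node indices is an assumption made for illustration.

```python
def overlap_ratio(absorption_region, scattering_region):
    """Overlap ratio of two regions given as collections of node indices.

    ratio = 2 * |A intersect S| / (|A| + |S|), which ranges from 0 for
    disjoint regions to 1 for identical regions.
    """
    a = set(absorption_region)
    s = set(scattering_region)
    total = len(a) + len(s)
    if total == 0:
        return 0.0  # two empty regions: defined as no overlap here
    return 2.0 * len(a & s) / total


if __name__ == "__main__":
    # Two four-node regions sharing two nodes.
    print(overlap_ratio([1, 2, 3, 4], [3, 4, 5, 6]))
```

Normalizing by the combined size of both regions keeps the attribute comparable across lesions of different sizes.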
The results shown in Table 4 indicate that combining the attributes extracted from the absorption images with those obtained from the scattering images improves the classification performance. With the combined attributes, the sensitivity, specificity, and overall accuracy of our classification are 63.6%, 83.3%, and 77.1%, respectively.
As discussed in Ref. 13, it is impossible to distinguish the malignant from the benign cases by just visually examining both absorption and scattering images. However, our automated classification procedure can achieve reasonable classification results using both absorption and scattering coefficient images. In particular, the specificity obtained by our automated classification procedure on two parameters (absorption and scattering images), 83.3%, is much higher than the 70.8% specificity of the visual examination results using three parameters (absorption, scattering, and refractive index images). However, the sensitivity of the automated classification using only absorption and scattering attributes is still low, suggesting that it is necessary to use the refractive index attributes for classification.
Classification Results Based on Absorption and Scattering Coefficients and Refractive Index
Our final experiment is to evaluate the SVM classifier trained on the attributes extracted from the refractive index images combined with the attributes from the corresponding absorption and scattering images. As discussed earlier, the location correlation between the regions of interest on the absorption and scattering images is used to determine the lesion area. In addition to all attributes used in the third experiment, the mean refractive index of the lesion area and the mean refractive index of the area surrounding the lesion area are added to the attribute list. These attributes are listed in Fig. 10. With the SVM classifier trained using the attributes obtained from all three kinds of images, the confusion matrix of the 10-fold cross-validation results is presented in Table 5.
Confusion matrix of the SVM classifier using attributes extracted from absorption, scattering, and refractive index images.
The results in Table 5 show that the sensitivity, specificity, and overall accuracy of the automated classification procedure are 81.8%, 91.7%, and 88.6%, respectively. Compared with the classification results using only the attributes extracted from absorption and scattering images, classification using refractive index attributes in conjunction with the absorption and scattering attributes improves the sensitivity and specificity by about 18 and 8 percentage points, respectively. These results are also better than the visual examination results listed in Table 1. In particular, the automated classification procedure improves the specificity of the classification by more than 20 percentage points compared with the visual examination method presented in Ref. 13.
An automated procedure for detecting breast cancer based on optical tomographic images is developed. This procedure uses a computer program to automatically extract attributes from absorption, scattering, and refractive index images for lesion classification. An SVM classifier is used to distinguish between the malignant and benign lesions based on these automatically extracted attributes. The classification results show that the sensitivity, specificity, and overall accuracy using this automated procedure are 81.8%, 91.7%, and 88.6%, respectively. In contrast, the sensitivity, specificity, and overall accuracy of the classification using attributes extracted from only the absorption and scattering coefficient images are 63.6%, 83.3%, and 77.1%, respectively. These results indicate that combining the refractive index with the absorption and scattering coefficients can achieve significantly improved classification performance over using only absorption and scattering coefficients. Furthermore, these results are also better than the results obtained by visual examination of images, in which the sensitivity, specificity, and overall accuracy are 81.8%, 70.8%, and 74.3% respectively.
It is worth mentioning that it is critical to obtain reliable sample data from the breast masses for accurate image processing and classification. Our experiments show that the automated classification procedure cannot consistently classify the samples obtained from the three large breasts ( in diameter) due to the low signal-to-noise ratio (SNR) of the hardware system for these cases. If these data samples were removed, the automated classification results should have higher sensitivity, specificity, and overall accuracy.
To achieve better classification results, we are currently developing data collection strategies that can improve the SNR of our imaging system so that it can produce reliable coefficient data for large breasts. In addition, our automated procedure used a predetermined threshold for image segmentation. We are currently investigating the entropy-based and iterative selection methods to automatically determine an optimal segmentation threshold for a particular image.
Last, there was possibly cross talk between the refractive index and the absorption/scattering parameters, so the recovered refractive index is only an estimate. However, the cross talk was reduced to a certain extent via a two-step strategy in which the refractive index and absorption/scattering parameters were reconstructed using two different algorithms.16 Importantly, the semiquantitative refractive index estimate, although limited in accuracy, is sufficient for us to classify the cancer and benign groups effectively using the automated classification algorithms described in this paper. A more accurate estimate of the refractive index will likely improve the accuracy of cancer classification, and we are currently developing schemes that can enhance the separation of refractive index from absorption/scattering parameters. In fact, we have recently reported a region-based reconstruction approach that has shown promising results in this regard using phantom studies.34 We plan to evaluate this and upcoming new methods for better refractive index reconstruction on clinical data in the near future.
This work was supported in part by a grant from the National Institutes of Health (NIH), Grant No. R01 CA090533.