## 1.

## Introduction

Recently work in the field of diffuse optical tomography (DOT) has progressed from purely theoretical studies and bench-top experiments to first clinical trials that explore the utility in breast cancer diagnosis,^{1, 2, 3} brain imaging,^{4, 5, 6} and arthritis detection.^{7, 8, 9, 10, 11, 12, 13} While substantial advances have been made in building clinically useful instruments, and developing an image reconstruction algorithms, much less effort has been spend on developing image analysis tools. Other medical imaging fields such as magnet resonance imaging (MRI), computer tomographic imaging (CT), and ultrasound (US) imaging frequently make use of advanced image analysis methods that enhance sensitivity and specificity in many cases. For example computer-aided diagnostics (CAD) systems have been successfully employed in areas such as mammography,^{14} chest CT (Refs. 15, 16), and brain imaging.^{17, 18} In biomedical optics, CAD has only been applied in two studies related to optical coherence tomography (OCT), which explored its utility in esophagial and cervical cancer.^{19, 20} To the best of our knowledge, no studies have been presented where CAD was employed in the analysis of DOT images.

In this paper a CAD system is introduced to enhance the analysis of sagittal laser optical tomographic (SLOT) images obtained from proximal interphalangeal (PIP) joint of patient with rheumatoid arthritis. These SLOT images display spatially varying absorption coefficient
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
and scattering coefficient
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _s$\end{document}
${\mu}_{s}$
across the joint. Previous studies evaluated the potential of using features such as minimal and maximal absorption or scattering coefficients or their ratios in a region of interest, to distinguish between affected and not affected joints.^{11} Using the minimal absorption coefficient as an image feature, sensitivities and specificities of 0.71 could be achieved in identifying affected joints assuming that ultrasound can be considered as an imaging gold standard that provides ground truth.

This study goes beyond previous analyses in several ways. First, it includes the combination of multiple features (e.g., minimum absorption coefficient and ratio of maximum and minimum absorption coefficient) in addition to evaluating classification performances using only a single feature. We also add image features previously not considered, such as the variance of optical properties in the images. Furthermore, instead of using only US as gold standard that provides the ground truth, we evaluate the classification performance using additional ground truth derived from MRI, clinical evaluation, and visual inspection of optical tomographic imaging itself. Finally, a larger data set was used, which includes optical tomographic images of 100 finger joints. These images were obtained with a SLOT system that showed better SNRs and long-term measurement stability than systems used in previous studies.^{11}

To deal with the problem of multiparameter classification we employ in this work a machine learning tool that explores, sorts through, and interprets tomographic image data.^{21} Classifications were performed by an interpretation system based on self-organizing mapping (SOM). The classification technique was originally developed as a physical-mathematical model to mimic the human's visual system.^{22, 23} Indeed, this method has been used in the past in other scientific fields for similar classification problems.^{21, 24, 25} It has shown to produce significant better results than approaches based on discriminant analysis and logistic regression.^{26, 27} To see if this holds true in our case, we also performed a discriminant analysis and compared the results with the SOM machine learning approach.

In the reminder of the paper, we first describe in detail the data used for the analysis. This is followed by a detailed description of the machine-learning-based classification approach applied in this paper. Subsequently, the results obtained with this approach are presented and discussed.

## 2.

## Optical Tomography Data

Data sets were analyzed resulting from tomographic reconstructions of SLOT images to determine best image interpretation results. An example of a SLOT image is shown in Fig. 1. These images were generated by measuring the transmitted light intensities along the central axis of the index, middle, and ring fingers on the left and right hands. The light source was a laser with wavelength λ = 675 nm, which was focused to ≈1-mm spot on 11 different position on the back of each finger. For each position the transmitted light intensities were measured with a Si photo diode. This transmission data became input to a model-based iterative image reconstruction code that used the equation of radiative transfer as light propagation model. For a more detailed description of the experimental setup and the image reconstruction code see Refs. 10, 28, 29.

In total, 100 optical tomographic images of human finger joints were used in this study. A region of interest (ROI) was defined within each image to prepare the images for CAD analysis. Data were eliminated in the first 4 mm on the top and bottom of each image and 7 mm on the left and right. In this way, the chosen ROI did not contain potential image artifacts, which are often encountered near source and detector positions (image boundaries). Within the ROI, different parameters of the absorption coefficients where extracted including the smallest value min(·), maximum value max(·), mutual ratios [e.g., min(·)/max(·)], and statistical variance var(·). All extracted image features were combinatorially combined. Thus, each image was characterized by a *n*-dimensional feature vector
[TeX:]
\documentclass[12pt]{minimal}\begin{document}${\bm x}_n$\end{document}
${\mathbf{x}}_{n}$
consisting of a set of *n* image features. These feature vectors became input to a machine learning tool classifying each image as a finger affected or not affected by rheumatic arthritis (RA).

To perform this analysis, one requires a ground-truth benchmark, that identifies each patient as affected or not affected by RA. In previous works US images were considered to provide such ground truth.^{11} Many researchers consider magnetic resonance images of finger joints as the most accurate indicator for RA. However, no studies have been presented that MRI is indeed the most accurate ground truth. This would require longitudinal studies spanning many years of follow up to establish the predictive and prognostic value of each imaging method. A study like this has not yet been performed. Therefore, we decided to report on the performance of our CAD system for different ground truths, derived from different sources, including MRI, US, clinical evaluation (CE), and optical inspection of SLOT images. For each modality experts scored images and data on a four-point scale: 0 for definitely no synovitis, 1 for probably no synovitis, 2 for possibly synovitis, and 3 for definitely synovitis. Subsequently each finger was labeled by only two different classes: not affected class
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$c_0$\end{document}
${c}_{0}$
= {definitely and probably no synovitis} and affected class
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$c_1$\end{document}
${c}_{1}$
= {possibly and probably synovitis} (Figs. 1 and 2). Feature vectors
[TeX:]
\documentclass[12pt]{minimal}\begin{document}${\bm x}_{\lbrace n, c\rbrace }$\end{document}
${\mathbf{x}}_{\{n,c\}}$
containing various optical parameters were labeled accordingly and the performance of the CAD system was evaluated using various performance measures, including sensitivity and specificity.

## 3.

## Methodology

## 3.1.

### Machine-Learning-Based Classification Method

In the most general case, the goal of any medical classification scheme is to determine ranges of diagnostic parameters for which a person is found to be healthy or afflicted by a certain disease. In this study, we look for features in the optical tomographic images that can be used to determine whether a patient is affected by RA or not affected. In previous works, only single features were considered. For example, we found that patients with min (
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\kern1pt>\kern1pt$\end{document}
$\phantom{\rule{1.0pt}{0ex}}>\phantom{\rule{1.0pt}{0ex}}$
0.272 cm
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$^{-1}$\end{document}
${}^{-1}$
should be considered affected by RA, while patients with min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\kern1pt<\kern1pt$\end{document}
$\phantom{\rule{1.0pt}{0ex}}<\phantom{\rule{1.0pt}{0ex}}$
0.272 cm
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$^{-1}$\end{document}
${}^{-1}$
should be classified as healthy.^{11} However, as mentioned in the introduction, using this criteria, we only achieved a sensitivity and specificity of about 0.71.

If multiple features are used, as in the study at hand, the simple cutoff value [min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)] = 0.272 cm
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$^{-1}$\end{document}
${}^{-1}$
] has to be replaced by a more complex rule. If *n* features are considered, a classification scheme seeks an (*n* − 1)-dimensional hyperplane that separates the *n*-dimensional space into two regions: one region characteristic for patients affected by RA, and another region that characterizes unaffected people.

In the approach presented here, a separation into two regions is achieved in two steps. First, an SOM algorithm is used to cluster the given training data of feature vectors
[TeX:]
\documentclass[12pt]{minimal}\begin{document}${\bm x}_i$\end{document}
${\mathbf{x}}_{i}$
of affected and healthy fingers. The main variable of this algorithm is the cluster size. Therefore, if *k* feature vectors are presented and the cluster size is *q*, the algorithm returns about *k*/*q* clusters that each contain *q* feature vectors (also called members). The members of each cluster are typically located close to each other in the *n*-dimensional feature space [see, for example, the clusters in Fig. 2]. At this point, clusters may contain feature vectors belonging to images of only healthy, only affected, or a mixture of affected and healthy joints. Details of the SOM-based clustering scheme can be found in the appendix (Sec. 6).

In the next step, all members belonging to a given cluster are assigned to either the affected class or the not-affected class, depending on a threshold
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$p_t$\end{document}
${p}_{t}$
that is set the same for all clusters. Therefore, if at least
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$p_t$\end{document}
${p}_{t}$
% of all members of a given cluster are affected, that cluster will be assigned to the affected class. If less than
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$p_t$\end{document}
${p}_{t}$
% of all members of a given cluster are affected, that cluster will be assigned to the nonaffected class. At the end of this process the *n*-dimensional feature space has been divided into clusters that represent features of unaffected and affected patients. If now a new feature vector, derived from an optical tomographic image of a finger of a new patient, is considered, this new feature vector will “fall” into one of these clusters and the patient will be declared affected or unaffected by RA, depending on the status of that cluster.

The following example illustrates how this classification approach works. Figure 2 shows the distribution of a 2-D feature vector, with the variance (
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) as first component (*x* axis) and min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)/max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) as the second component (*y* axis). The red squares identify the feature vectors belonging to images of affected joints, while the blue circles identify the feature vectors belonging to images of not-affected joints, as determined, in this case, by using the US-derived ground truth. One can see that if one would attempt to classify a feature vector (representing an image) as affected or not affected using only threshold values for either var(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) or min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)/max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
), it would lead to a large number of misclassifications. For example, postulating that all fingers with var(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\kern1pt<\kern1pt$\end{document}
$\phantom{\rule{1.0pt}{0ex}}<\phantom{\rule{1.0pt}{0ex}}$
0.2 are affected would lead to three false negatives [the three red squares to the right of the threshold in Fig. 2] and 16 false positives [the 16 blue circles to the left of the threshold in Fig. 2]. Similarly postulating that all fingers with min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)/max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\kern1pt<\kern1pt$\end{document}
$\phantom{\rule{1.0pt}{0ex}}<\phantom{\rule{1.0pt}{0ex}}$
0.2 are affected, this would lead to 13 false negatives [red squares above the threshold in Fig. 2] and 14 false positives [blue circles below the threshold in Fig. 2]. Therefore, classifications on only one feature would be highly flawed.

As stated, the SOM method is used to partition a given data set [TeX:] \documentclass[12pt]{minimal}\begin{document}${\bm x}_n$\end{document} ${\mathbf{x}}_{n}$ into subregions or clusters [Fig. 2]. In this particular example the cluster number was set to 13. Thus, each feature vector belongs to a subregion that includes on average five data points.

In the next step, a given cluster is either assigned to the affected class or not-affected class depending on a threshold [TeX:] \documentclass[12pt]{minimal}\begin{document}$p_t$\end{document} ${p}_{t}$ that is set the same for all clusters. In the given example [see inset of Fig. 2] the cluster is populated with three data points representing affected joints (red squares) and two data points representing not affected joints (blue circles). Therefore, choosing a frequency threshold of, for example, [TeX:] \documentclass[12pt]{minimal}\begin{document}$p_t>$\end{document} ${p}_{t}>$ 50% will result in assigning the cluster to the affected class, since 60% of its members are indeed affected. On the other hand, if [TeX:] \documentclass[12pt]{minimal}\begin{document}$p_t$\break$>$\end{document} ${p}_{t}>$ 80% is chosen, the cluster will be assigned to the unaffected class, since less than 80% of its members are actually affected.

Choosing different cluster size *q* and threshold
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$p_t$\end{document}
${p}_{t}$
will result in different separations of this 2-D feature spaces into regions typical for affected and unaffected joints. These differences in separations will lead to different classification results. In general, classification performance increases as the cluster size gets smaller (which is equivalent to the number of subgroups gets larger) until an optimum is reached when misclassification is at a minimum. Thus, a too small number of clusters leads to “underfitting” of a given data set, whereas a too large number may lead to a data “overfitting.” In both cases, misclassification increases.^{21, 30}

In this work, cross-validation is employed to determine the optimum cluster size of maximum classification performance with respect to a random sampling of the data points. The number of clusters is varied from 1 to 100 and
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$p_t$\end{document}
${p}_{t}$
from 0 to 100. For each combination the classification performance was evaluated by using a leave-*q*-out approach.^{30} Therefore, the given *n*-dimensional data manifold of *L* = 100 realizations was randomly split into *q* disjoint subsets (e.g., *q* = 10). The accuracy was determined when performing SOM classification with *q* −1 of *q* subsets (learning step) and applying it to the remaining 1 of *q* subsets (validation step). By leaving one subset out, the procedure was conducted *q* times and the mean and standard deviation of various performance measures (see Sec.
3.3) were calculated.

The meta-algorithm shown below provides a summary of the classification and interpretation procedure of any given *n*-dimensional data manifold of image features, e.g., drawn from tomographic images. More details of the SOM clustering algorithm are given in the appendix (Sec. 6).

Set dimensionality

*n*feature space [TeX:] \documentclass[12pt]{minimal}\begin{document}$X_n$\end{document} ${X}_{n}$Set cluster number

*l*Set target classes for “ground truth”-benchmarks [TeX:] \documentclass[12pt]{minimal}\begin{document}$\lbrace c_0, c_1\rbrace$\end{document} $\{{c}_{0},{c}_{1}\}$

Set all

*j*“ground truth”-benchmarks (MRI, CE, US, SLOT)Generate an ensemble of

*L*SOM with inital and final learning ratesBEGIN Loop for each SOM(

*l*)*m*= 1 →*L*Partition all [TeX:] \documentclass[12pt]{minimal}\begin{document}${\bm x}_n$\end{document} ${\mathbf{x}}_{n}$ into

*l*subgroups [TeX:] \documentclass[12pt]{minimal}\begin{document}${\bm w}_n$\end{document} ${\mathbf{w}}_{n}$BEGIN Loop for each [TeX:] \documentclass[12pt]{minimal}\begin{document}$p_t =0 \rightarrow 100$\end{document} ${p}_{t}=0\to 100$

Calculate [TeX:] \documentclass[12pt]{minimal}\begin{document}$S_e(\cdot)$\end{document} ${S}_{e}(\xb7)$

Calculate [TeX:] \documentclass[12pt]{minimal}\begin{document}$S_p(\cdot)$\end{document} ${S}_{p}(\xb7)$

Calculate

*J*(·)Calculate

*I*(·)END Loop

END Loop

Determine the maximum argument of

*J*or*I*as best interpreted classification result

The objective in the remainder of this paper is to analyze a given data with respect to (1) dimensionality of the feature space *n*, (2) structure of the SOM (e.g., cluster number *l*, learning rates, etc.), and (3) ground truth benchmark with target classes (e.g.,
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\lbrace \hat{c}_{\rm 0}, \hat{c}_{\rm 1}\rbrace$\end{document}
$\{{\widehat{c}}_{0},{\widehat{c}}_{1}\}$
).

## 3.2.

### Discriminant Analysis

Discriminant analysis (DA) was also applied to quantify classification performances with a more traditional statistical analysis method.^{31} Equivalent to the SOM machine learning approach, the goal of the discriminant analysis is to separate/predict group members (e.g., RA-affected and not affected) from a set of predictors [e.g., min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
), max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
), min/max, and var(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)]. For this purpose, discriminant function scores and statistical significances are estimated to determine the best linear combination of predictors. A discriminant function is predicted for a case from the sum of the series of predictors, which are, in turn, weighted by a coefficient. Thus, each discriminant function is based on a set of coefficients. Performance measures as described below can be used to quantify the quality of classifications/predictions (see Sec.
3.3). In this study, the JMP software package was used to perform the DA. More details on the DA can be found, for example, in Ref. 31.

## 3.3.

### Performance Measures

To quantify the classification performance the following four measures were considered. First, the sensitivity
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_e$\end{document}
${S}_{e}$
and the specificity
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_p$\end{document}
${S}_{p}$
are defined as^{32}

## Eq. 1

[TeX:] \documentclass[12pt]{minimal}\begin{document} \begin{equation} S_e = \frac{T_+}{ (T_+ + F_-) }, \end{equation}\end{document} $${S}_{e}=\frac{{T}_{+}}{({T}_{+}+{F}_{-})},$$## Eq. 2

[TeX:] \documentclass[12pt]{minimal}\begin{document} \begin{equation} S_p = 1 - \frac{F_+}{ (F_+ + T_-) }, \end{equation}\end{document} $${S}_{p}=1-\frac{{F}_{+}}{({F}_{+}+{T}_{-})},$$*t*) as the target class (+) with respect to the ground truth; and [TeX:] \documentclass[12pt]{minimal}\begin{document}$S_p = [0,\hspace*{-1.5pt}1]$\end{document} ${S}_{p}=[0,1]$ is the relative number of all [TeX:] \documentclass[12pt]{minimal}\begin{document}${\bm x}_n$\end{document} ${\mathbf{x}}_{n}$ -vectors that were falsely identified (

*f*) as the target class (+). Using the sensitivity and specificity one can calculate the, third measure, the Youden index

^{33}[TeX:] \documentclass[12pt]{minimal}\begin{document}$J = S_e + S_p - 1$\end{document} $J={S}_{e}+{S}_{p}-1$ .

Furthermore, by varying
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$p_t$\end{document}
${p}_{t}$
from 0 to 100%, ROCs were generated and analyzed as they are frequently used in the characterization of medical classification schemes. If
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$p_t$\end{document}
${p}_{t}$
is set to 0, all images will be qualified as not affected leading to a sensitivity of 0, however, specificity is 1. If
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$p_t$\end{document}
${p}_{t}$
is set to 100%, the specificity will be 0. Intermediate
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$p_t$\end{document}
${p}_{t}$
values lead to intermediate
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_e$\end{document}
${S}_{e}$
and
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_p$\end{document}
${S}_{p}$
values usually maximizing the *J* for a given pair. It should be pointed out that this approach differs from classical ROC analysis, which is typically applied to only on observable parameter,^{11} e.g.,
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\min (\mu _a)$\end{document}
$\mathrm{min}\left({\mu}_{a}\right)$
. By varying the threshold of
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
for which a patient is considered affected
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_e$\end{document}
${S}_{e}$
and
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_p$\end{document}
${S}_{p}$
can be calculated and ROC curves generated. By introducing
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$p_t$\end{document}
${p}_{t}$
as threshold, we effectively extended the ROC analysis to multiple parameter interpretation in the frame work of SOM neural networks.

The fourth and more generalized performance measure^{34} is the mutual information
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$I[C({\bm x},{\bm w}); \hat{C}$\end{document}
$I[C(\mathbf{x},\mathbf{w});\widehat{C}$
]:

## Eq. 3

[TeX:] \documentclass[12pt]{minimal}\begin{document} \begin{eqnarray} H(C|\hat{C}) &=& H(C, \hat{C}) - H(\hat{C})\nonumber \\ I(C; \hat{C}) &=& H(C) - H(C|\hat{C})\nonumber \\ &=& \sum _{c \in C} \sum _{\hat{c} \in \hat{C}} p(c, \hat{c}) \log \left[ \frac{p(c, \hat{c})}{p(c)\;p(\hat{c})} \right]. \end{eqnarray} \end{document} $$\begin{array}{ccc}\hfill H\left(C\right|\widehat{C})& =& H(C,\widehat{C})-H\left(\widehat{C}\right)\hfill \\ \hfill I(C;\widehat{C})& =& H\left(C\right)-H\left(C\right|\widehat{C})\hfill \\ & =& \sum _{c\in C}\sum _{\widehat{c}\in \widehat{C}}p(c,\widehat{c})\mathrm{log}\left[\frac{p(c,\widehat{c})}{p\left(c\right)\phantom{\rule{0.28em}{0ex}}p\left(\widehat{c}\right)}\right].\hfill \end{array}$$Note
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$I[C({\bm x},{\bm w}); \hat{C}]$\end{document}
$I[C(\mathbf{x},\mathbf{w});\widehat{C}]$
expresses the similarity between the amount of data vectors labeled as class
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\hat{C}$\end{document}
$\widehat{C}$
of the ground truth benchmark and the interpreted/predicted data vectors labeled as class
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$C({\bm x},{\bm w})$\end{document}
$C(\mathbf{x},\mathbf{w})$
, which were estimated by the the SOM neural network. Also *I*(·) is 1 when the class labels of all interpreted data vectors match with the labels of the ground truth.

## 4.

## Results and Discussion

We start our analysis by plotting the distributions of the four single parameters max( [TeX:] \documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document} ${\mu}_{a}$ ), min( [TeX:] \documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document} ${\mu}_{a}$ ), min( [TeX:] \documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document} ${\mu}_{a}$ )/max( [TeX:] \documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document} ${\mu}_{a}$ ), and var( [TeX:] \documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document} ${\mu}_{a}$ ) for unaffected and affected finger joints as identified with the four different ground truth, derived from CE, MRI, US, and SLOT (Figs. 4 to 7). The machine intelligent classification was performed entirely independent from the visual inspections of the CE, MRI, US, and SLOT data. Thus, researchers did not have any knowledge of the outcomes from other methods.

The green triangles indicate the mean and standard deviation of the data distribution. Looking at these figures, we can observe several things. First, we notice that the distributions for affected and unaffected fingers for all four parameters are very similar given US and SLOT as ground truth. This indicates that SLOT and US will show similar classification results. The distributions for MRI as ground truth, resemble closer the distribution found for CE.

Furthermore, we observe that the distributions for max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) are very similar for affected and unaffected finger joints across all four ground truths (Fig. 4). This indicates that max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) is a very poor classifier. The plots for min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
), min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)/max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
), and var(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) (Figs. 5 to 7) show much larger differences between the mean values of affected and unaffected groups, and yield ANOVA *p* values
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\kern1pt<\kern1pt$\end{document}
$\phantom{\rule{1.0pt}{0ex}}<\phantom{\rule{1.0pt}{0ex}}$
5%. (All sample were Box-Cox-transformed into normal distributions in order to perform ANOVA testing.) The parameter var(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) is a measure of the variation in
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
in the images. A healthy joint typically shows higher variation in this parameter, as the almost clear synovial fluid has a very small
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
value compared to adjacent bones, cartilage, and other tissues. In a patient with RA, the
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
value of the synovial fluid increases and overall,
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
of all tissues becomes more simlar, hence var(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) decreases. Differences in the ratio of min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)/max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) between healthy and affected finger joints can be explained in a similar fashion. In a healthy joint, this ratio should be small, since min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) is small in the synovial fluid. In a joint affected by RA min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) is increased and closer to max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
), hence the ration becomes larger. However, results of the DA, which are summarized in Table 1, show that good classification into groups of affected and unaffected fingers will be difficult using these single parameters.

## Table 1

Results of the traditional discriminant analysis with respect to different parameter combinations and the MRI and US ground truth.

Data vector | MRI | US | ||||
---|---|---|---|---|---|---|

${\bm x} = \lbrace \; \rbrace$ x={} | $S_e$ Se | $S_p$ Sp | J | $S_e$ Se | $S_e$ Se | J |

max( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.12 | 0.88 | 0.00 | 0.75 | 0.34 | 0.09 |

min( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.21 | 0.92 | 0.13 | 0.80 | 0.47 | 0.27 |

min( [TeX:] $\mu _a$ ${\mu}_{a}$ )/max( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.17 | 0.93 | 0.10 | 0.81 | 0.40 | 0.20 |

var( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.33 | 0.93 | 0.26 | 0.81 | 0.70 | 0.51 |

max( [TeX:] $\mu _a$ ${\mu}_{a}$ ),min( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.22 | 0.93 | 0.15 | 0.82 | 0.50 | 0.32 |

max( [TeX:] $\mu _a$ ${\mu}_{a}$ ),min( [TeX:] $\mu _a$ ${\mu}_{a}$ )/max( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.21 | 0.92 | 0.13 | 0.80 | 0.47 | 0.27 |

max( [TeX:] $\mu _a$ ${\mu}_{a}$ ),var( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.32 | 0.93 | 0.24 | 0.87 | 0.61 | 0.47 |

min( [TeX:] $\mu _a$ ${\mu}_{a}$ ),min( [TeX:] $\mu _a$ ${\mu}_{a}$ )/max( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.19 | 0.92 | 0.12 | 0.85 | 0.49 | 0.34 |

min( [TeX:] $\mu _a$ ${\mu}_{a}$ ),var( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.38 | 0.93 | 0.30 | 0.81 | 0.74 | 0.55 |

min( [TeX:] $\mu _a$ ${\mu}_{a}$ )/max( [TeX:] $\mu _a$ ${\mu}_{a}$ ),var( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.32 | 0.93 | 0.24 | 0.85 | 0.62 | 0.47 |

min( [TeX:] $\mu _a$ ${\mu}_{a}$ ),max( [TeX:] $\mu _a$ ${\mu}_{a}$ ),min( [TeX:] $\mu _a$ ${\mu}_{a}$ )/max( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.21 | 0.92 | 0.13 | 0.83 | 0.50 | 0.33 |

min( [TeX:] $\mu _a$ ${\mu}_{a}$ ),max( [TeX:] $\mu _a$ ${\mu}_{a}$ ),var( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.32 | 0.93 | 0.24 | 0.85 | 0.59 | 0.44 |

min( [TeX:] $\mu _a$ ${\mu}_{a}$ ),min( [TeX:] $\mu _a$ ${\mu}_{a}$ )/max( [TeX:] $\mu _a$ ${\mu}_{a}$ ),var( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.32 | 0.93 | 0.24 | 0.87 | 0.60 | 0.47 |

max( [TeX:] $\mu _a$ ${\mu}_{a}$ ),min( [TeX:] $\mu _a$ ${\mu}_{a}$ )/max( [TeX:] $\mu _a$ ${\mu}_{a}$ ),var( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.42 | 0.97 | 0.39 | 0.87 | 0.64 | 0.51 |

min( [TeX:] $\mu _a$ ${\mu}_{a}$ ),max( [TeX:] $\mu _a$ ${\mu}_{a}$ ),min( [TeX:] $\mu _a$ ${\mu}_{a}$ )/max( [TeX:] $\mu _a$ ${\mu}_{a}$ ),var( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.44 | 0.97 | 0.41 | 0.43 | 0.97 | 0.40 |

Table 1 shows the classification results in terms of sensitivity, specificity and Youden index *J* for single and multi-parameters using MRI and US to produce the ground truth. Results for CE and SLOT are similar but were omitted here for clarity. We see that *J*-values for the single parameters are comparatively low, except for var(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
). max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) yields the lowest
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_e$\end{document}
${S}_{e}$
,
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_p$\end{document}
${S}_{p}$
, and *J* values. Using MRI to determine ground truth, the highest *J* value (*J* = 0.41) is achieved when all four parameters are combined. If US is used to determine the ground truth the highest *J* value (*J*=0.55) is achieved with a combination of only two parameters, var(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) and min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
). Notable is also that in general
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_p$\end{document}
${S}_{p}$
is very high (
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\kern1pt>\kern1pt$\end{document}
$\phantom{\rule{1.0pt}{0ex}}>\phantom{\rule{1.0pt}{0ex}}$
0.9) and
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_e$\end{document}
${S}_{e}$
very low (
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\kern1pt<\kern1pt$\end{document}
$\phantom{\rule{1.0pt}{0ex}}<\phantom{\rule{1.0pt}{0ex}}$
0.44), when MRI is used to determine the ground truth. With US as ground truth, these roles seem to be reversed, therefore
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_e$\end{document}
${S}_{e}$
is in general higher (≈0.8) than
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_p$\end{document}
${S}_{p}$
(≈0.5 to ≈0.6).

The main hypothesis of this study is that a machine learning approach that makes use of SOM methods applied to multiparameter analysis will yield better classification with respect to RA than currently available methods. To demonstrate this the SOM-network was trained with 100 input *n*-dimensional data vectors, with respect to a cross-validation.

Figures 8 and 9 show the estimated classification and prediction performances, *J* and *I*, of the SOM method with respect to different sets of optical parameters for the four different ground truths (derived from CE, MRI, US, and SLOT). Displayed are the changes of *J* and *I* as a function of the frequency thresholds
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$p_t$\end{document}
${p}_{t}$
for 11 different parameter configuration. The error bars in these figures represent the prediction uncertainties (standard deviations), which result from the varying cluster size and the cross-validation methods described in the previous section. To arrive at these particular error bars, the computer-aided algorithm varied the cluster size for each parameter combination. For example, when combining var(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) with min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)/max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
), Fig. 10 shows the classification performances *J* and *I* with respect to US as ground truth and varying cluster sizes. Initially, *J* and *I* improve as the number of clusters is increased. However, once the number of clusters reaches 25 (four feature vectors per cluster) *J* and *I* are almost constant, approximately equal to 0.94 and 0.91, respectively. Therefore, optimal interpretations of the classification results are performed with a SOM neural network architecture of ∼25 subgroups (clusters).

Figures 8 and 9 show that for all ground truths we can find optical parameter combinations and frequency thresholds
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$p_t$\end{document}
${p}_{t}$
for which the Youden index *J* is larger than 0.75. This is substantially higher than the highest value obtained with the DA approach (*J* = 0.55, see Table 1). In general, using the machine-learning/SOM approach the highest *J* values are obtained when combing var(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) with the min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)/max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) ratio. Here
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$p_t$\end{document}
${p}_{t}$
is 13% and *J* is 0.81 when using MRI-derived ground truth. The corresponding values for sensitivity and specificity are
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_e$\end{document}
${S}_{e}$
= 0.96 and
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_p$\end{document}
${S}_{p}$
= 0.85 (see Table 2). With US-derived ground truth, and
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$p_t$\end{document}
${p}_{t}$
= 31%, these values increase to *J* = 0.87 (
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_e$\end{document}
${S}_{e}$
= 0.96 and Sp = 0.91). Figure 11 shows the related ROC curves. Also shown in Fig. 11 are the ROC curves using SLOT and US as ground truth and the curve reported by Scheel
^{11} Scheel's analysis, which yielded *J* = 0.41 (
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_e$\end{document}
${S}_{e}$
= 0.71 and
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_p$\end{document}
${S}_{p}$
= 0.71) relied on a single parameter [min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)] for classification.

## Table 2

Results of the machine-learning-based classification of all two-parameter combinations that have shown best classification performences including sensifity $S_e$ Se , specificity $S_p$ Sp , the resulting Youden index $J = S_e + S_p -1$ J=Se+Sp−1 , the mutual information I, and the area under the curve (AUC).

Ground Truth | Data Vector [TeX:] ${\bm x} = \lbrace \; \rbrace$ $\mathbf{x}=\left\{\phantom{\rule{0.28em}{0ex}}\right\}$ | [TeX:] $S_e$ ${S}_{e}$ | [TeX:] $S_p$ ${S}_{p}$ | J | [TeX:] $p\,_t(J)$ $p{\phantom{\rule{0.16em}{0ex}}}_{t}\left(J\right)$ (%) | I (%) | [TeX:] $p\,_t(I)$ $p{\phantom{\rule{0.16em}{0ex}}}_{t}\left(I\right)$ | AUC |

CE | max( [TeX:] $\mu _a$ ${\mu}_{a}$ ), min( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.99 | 0.71 | 0.70 | 14 | 0.37 | 8 | 0.44 |

MRI | 1.00 | 0.73 | 0.73 | 9 | 0.40 | 9 | 0.44 | |

US | 0.97 | 0.82 | 0.79 | 27 | 0.53 | 24 | 0.75 | |

SLOT | 0.95 | 0.81 | 0.76 | 34 | 0.5 | 23 | 0.73 | |

CE | max( [TeX:] $\mu _a$ ${\mu}_{a}$ ), [TeX:] ${\min (\mu _a)}/{\max (\mu _a)}$ $\mathrm{min}\left({\mu}_{a}\right)/\mathrm{max}\left({\mu}_{a}\right)$ | 0.97 | 0.82 | 0.79 | 13 | 0.46 | 13 | 0.54 |

MRI | 0.99 | 0.81 | 0.80 | 13 | 0.46 | 11 | 0.55 | |

US | 0.95 | 0.89 | 0.84 | 36 | 0.60 | 34 | 0.81 | |

SLOT | 0.92 | 0.92 | 0.84 | 33 | 0.62 | 32 | 0.84 | |

CE | max( [TeX:] $\mu _a$ ${\mu}_{a}$ ), var( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.99 | 0.75 | 0.74 | 12 | 0.41 | 13 | 0.46 |

MRI | 0.99 | 0.73 | 0.72 | 12 | 0.38 | 9 | 0.44 | |

US | 0.97 | 0.80 | 0.77 | 29 | 0.51 | 17 | 0.72 | |

SLOT | 0.96 | 0.84 | 0.80 | 29 | 0.56 | 21 | 0.77 | |

CE | min( [TeX:] $\mu _a$ ${\mu}_{a}$ ), [TeX:] ${\min (\mu _a)}/{\max (\mu _a)}$ $\mathrm{min}\left({\mu}_{a}\right)/\mathrm{max}\left({\mu}_{a}\right)$ | 0.91 | 0.82 | 0.73 | 14 | 0.39 | 20 | 0.53 |

MRI | 0.91 | 0.83 | 0.74 | 13 | 0.42 | 13 | 0.53 | |

US | 0.91 | 0.87 | 0.78 | 37 | 0.54 | 37 | 0.78 | |

SLOT | 0.82 | 0.97 | 0.79 | 35 | 0.58 | 48 | 0.82 | |

CE | min( [TeX:] $\mu _a$ ${\mu}_{a}$ ), var( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.96 | 0.79 | 0.75 | 12 | 0.41 | 15 | 0.58 |

MRI | 0.96 | 0.81 | 0.77 | 12 | 0.42 | 12 | 0.55 | |

US | 0.89 | 0.91 | 0.80 | 36 | 0.55 | 30 | 0.80 | |

SLOT | 0.89 | 0.92 | 0.81 | 32 | 0.57 | 46 | 0.82 | |

CE | [TeX:] ${\min (\mu _a)}/{\max (\mu _a)}$ $\mathrm{min}\left({\mu}_{a}\right)/\mathrm{max}\left({\mu}_{a}\right)$ , var( [TeX:] $\mu _a$ ${\mu}_{a}$ ) | 0.98 | 0.85 | 0.83 | 15 | 0.53 | 10 | 0.65 |

MRI | 0.96 | 0.85 | 0.81 | 13 | 0.49 | 13 | 0.60 | |

US | 0.96 | 0.91 | 0.87 | 31 | 0.67 | 42 | 0.86 | |

SLOT | 0.94 | 0.95 | 0.89 | 38 | 0.71 | 46 | 0.88 |

Looking at Figures 8 and 9, we furthermore find that parameter combinations of only two features [e.g., min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)/max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) and var(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) or max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) and var(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)] lead to higher accuracy measures (*J* and *I*) than three of four feature combinations (shown in gray). The reasons for that behavior are not entirely clear. However, using the DA approach with US as ground truth we found a similar result. Here a two-parameter combination [max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
) plus var(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)] gave the best *J* value.

Furthermore, curves generated with US-derived ground truth look similar to curves generated with SLOT-derived ground truth. In both cases, the largest *J* values are reached in the range of
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$20\%<p_t<70\%$\end{document}
$20\%<{p}_{t}<70\%$
. The associated
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_e$\end{document}
${S}_{e}$
and
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$S_p$\end{document}
${S}_{p}$
values are all larger than 0.85 in this range. For values of
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$p_t>70$\end{document}
${p}_{t}>70$
the Youden index falls off. This similarity, which we already observed when looking at the distribution of the single parameters (Figs. 6 to 8), suggest that US and SLOT are similar in the assessment of RA in finger joints.

## 5.

## Conclusion

Optical tomographic imaging is increasingly applied in clinical studies concerning the detection of various diseases such as breast cancer, arthritis, or brain hemorraghes. While substantial progress has been made with respect to imaging instrumentation and optical tomographic image reconstruction schemes, relatively little effort has been expended on image analysis schemes that extract useful features from tomographic images and help classifying a patient as free or afflicted by a certain disease.

This study presents the first attempt in the field of optical tomography to use advanced CAD methods. In particular we employ an unsupervised interpretation system based on SOMs

to distinguish between finger joints affected and not affected by RA. Different parameters (e.g., smallest and largest absorption and scattering coefficient and respective ratios) drawn from SLOT images became input to the CAD algorithm, and Youden index, specificity and sensitivity, and mutual information were used as classification performance measures. The performance measures were calculated for four different ground truth (generated by MRI, US, CE, and optical inspection of SLOT images) and compared to results of conventional statistical analysis methods, such as discriminant analysis.

Specificity and sensitivity of 0.85 and 0.96, respectively, could be achieved, when combining the ratio of the minimal and maximal absorption coefficient and the variance in an image, assuming MRI provides a good ground truth. If US is chosen to provide the ground truth, we get [TeX:] \documentclass[12pt]{minimal}\begin{document}$S_p$\end{document} ${S}_{p}$ = 0.91 and [TeX:] \documentclass[12pt]{minimal}\begin{document}$S_e$\end{document} ${S}_{e}$ = 0.96. These values are considerably higher than values obtained with single-parameter analysis reported earlier, or best case scenarios obtained with a discriminant analysis approach. The specificity and sensitivity levels that were reached with this proposed image classification approach make sagittal optical tomographic imaging an attractive tool for the evaluation of arthritis in finger joints. Larger clinical trials are now under way to further explore the clinical usefulness of this medical imaging procedure.

## 6.

## Appendix: Self Organizing Mapping

As outlined in the Sec.
3.1, a machine learning algorithm based on SOM, was employed as part of an automated unsupervised interpretation system. In particular, we use the SOM method to cluster data derived from the optical tomographic images. For the given case, each image is represented by a feature vector
[TeX:]
\documentclass[12pt]{minimal}\begin{document}${\bm x}$\end{document}
$\mathbf{x}$
whose components are given either by min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
), max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
), min(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
)/max(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
), var(
[TeX:]
\documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document}
${\mu}_{a}$
), or a subset of these four parameters. Therefore, depending on what combination of features is considered for the clustering,
[TeX:]
\documentclass[12pt]{minimal}\begin{document}${\bm x}$\end{document}
$\mathbf{x}$
either has two, three, or four dimensions. Given that in this studies 100 finger images were available, 100 feature vectors were derived that were separated into *l* clusters, where *l* took on values between 2 to 80.

To understand how the clustering was performed, one needs to understand the basic structure of a SOM network. A SOM is structured in two layers: an input layer and a Kohonen layer [Fig. 12]. The input layer is a one-on-one representation of each given feature vector. Therefore, the number of neurons in the input layer equals the dimensions of the feature vector. If all four features [min( [TeX:] \documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document} ${\mu}_{a}$ ), max( [TeX:] \documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document} ${\mu}_{a}$ ), min( [TeX:] \documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document} ${\mu}_{a}$ )/max( [TeX:] \documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document} ${\mu}_{a}$ ), and var( [TeX:] \documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document} ${\mu}_{a}$ ] are considered at the same time, the input layers has four neurons, one for each feature [Fig. 12]. The Kohonen layer represents a structure with a single 2-D map (lattice) consisting of neurons arranged in rows and columns. For the given example, each neuron in the Kohonen layer represents one cluster. Therefore, if 100 feature vectors need to be distributed into, e.g., 25 clusters, the Kohonen layer will have 25 or 5×5 neurons. Each neuron of the Kohonen layer is fixed and is fully linked with all neurons of the input layer. The links are described by weights [TeX:] \documentclass[12pt]{minimal}\begin{document}${\bm w}_k$\end{document} ${\mathbf{w}}_{k}$ , given by

## Eq. 4

[TeX:] \documentclass[12pt]{minimal}\begin{document} \begin{equation} {\bm w}_k = \lbrace w_{(k, {\rm feature_1})}, w_{(k,{\rm feature_2})}, \ldots, w_{(k,{\rm feature_n})}\rbrace{^{T}}. \end{equation} \end{document} $${\mathbf{w}}_{k}=\{{w}_{(k,{\mathrm{feature}}_{1})},{w}_{(k,{\mathrm{feature}}_{2})},...,{w}_{(k,{\mathrm{feature}}_{\mathrm{n}})}\}{}^{T}.$$*n*is the dimension of the feature vector [TeX:] \documentclass[12pt]{minimal}\begin{document}${\bm x}$\end{document} $\mathbf{x}$ , with

*n*= 2, 3, or 4;

*k*is the index to a specific neuron, representing a specific cluster in the Kohonen layer; with

*k*= 1, 2, …,

*m*, …,

*l*, where

*l*is number of all Kohonen neurons. The weights can take on values between 0 and 1. If

*n*features are considered and

*l*clusters are desired, the number of weights is

*n*×

*l*.

Clustering is now achieved in the following way:

1 First the weights [TeX:] \documentclass[12pt]{minimal}\begin{document}${\bm w}_k$\end{document} ${\mathbf{w}}_{k}$ are initialized, i.e., by assigning random values.

2 All feature vectors [TeX:] \documentclass[12pt]{minimal}\begin{document}${\bm x}_i$\end{document} ${\mathbf{x}}_{i}$ are presented to the network and the Euclidean distance [TeX:] \documentclass[12pt]{minimal}\begin{document}$\Vert {\bm x}_i - {\bm w}_k\Vert$\end{document} $\parallel {\mathbf{x}}_{i}-{\mathbf{w}}_{k}\parallel $ between each [TeX:] \documentclass[12pt]{minimal}\begin{document}${\bm x}_i$\end{document} ${\mathbf{x}}_{i}$ and each [TeX:] \documentclass[12pt]{minimal}\begin{document}${\bm w}_k$\end{document} ${\mathbf{w}}_{k}$ is calculated. Note, all original feature vectors [TeX:] \documentclass[12pt]{minimal}\begin{document}${\bm x}^o_i$\end{document} ${\mathbf{x}}_{i}^{o}$ drawn from an image must be scaled by the standard mean [TeX:] \documentclass[12pt]{minimal}\begin{document}${\bm x}_{\rm SM}$\end{document} ${\mathbf{x}}_{\mathrm{SM}}$ and normalized by the standard deviation [TeX:] \documentclass[12pt]{minimal}\begin{document}${\bm x}_{\rm SD}$\end{document} ${\mathbf{x}}_{\mathrm{SD}}$ to make all features within a feature vector comparable: [TeX:] \documentclass[12pt]{minimal}\begin{document}${\bm x}_i = \frac{{\bm x}^o_i - {\bm x}_{\rm SM}}{{\bm x}_{\rm SD}}$\end{document} ${\mathbf{x}}_{i}=\frac{{\mathbf{x}}_{i}^{o}-{\mathbf{x}}_{\mathrm{SM}}}{{\mathbf{x}}_{\mathrm{SD}}}$ .

3 The index

*j*of the Kohonen neuron, whose weight [TeX:] \documentclass[12pt]{minimal}\begin{document}${\bm w}_k$\end{document} ${\mathbf{w}}_{k}$ is the closest to vector [TeX:] \documentclass[12pt]{minimal}\begin{document}${\bm x}_i$\end{document} ${\mathbf{x}}_{i}$ is determined by## Eq. 5

[TeX:] \documentclass[12pt]{minimal}\begin{document} \begin{eqnarray} j({\bm x}_i) = \arg \min _k \Vert {\bm x}_i - {\bm w}_k\Vert \quad | \quad k = 1, 2,\ldots, m,\ldots, l.\nonumber\hspace*{-6pt}\\ \end{eqnarray}\end{document} $$\begin{array}{c}\hfill j\left({\mathbf{x}}_{i}\right)=\mathrm{arg}\underset{k}{\mathrm{min}}\parallel {\mathbf{x}}_{i}-{\mathbf{w}}_{k}\parallel \phantom{\rule{1em}{0ex}}|\phantom{\rule{1em}{0ex}}k=1,2,...,m,...,l.\end{array}$$4 Given theses “winner” neurons, new weights are calculated for the entire network according to

where η(## Eq. 6

[TeX:] \documentclass[12pt]{minimal}\begin{document} \begin{eqnarray} {\bm w}_k(t+1) = {\bm w}_k(t) + \eta (t)\;\; h_{k,j({\bm x})}(t)\;\; \left[{\bm x}_i(t) - {\bm w}_k(t)\right],\nonumber\\ \end{eqnarray}\end{document} $$\begin{array}{c}\hfill {\mathbf{w}}_{k}(t+1)={\mathbf{w}}_{k}\left(t\right)+\eta \left(t\right)\phantom{\rule{0.28em}{0ex}}\phantom{\rule{0.28em}{0ex}}{h}_{k,j\left(\mathbf{x}\right)}\left(t\right)\phantom{\rule{0.28em}{0ex}}\phantom{\rule{0.28em}{0ex}}\left[{\mathbf{x}}_{i}\left(t\right)-{\mathbf{w}}_{k}\left(t\right)\right],\end{array}$$*t*) is the learning-rate parameter during the calculation step*t*, and [TeX:] \documentclass[12pt]{minimal}\begin{document}$h_{k,j({\bm x})}(t)$\end{document} ${h}_{k,j\left(\mathbf{x}\right)}\left(t\right)$ is the neighborhood function centered around the winning neuron [TeX:] \documentclass[12pt]{minimal}\begin{document}$j({\bm x}_i)$\end{document} $j\left({\mathbf{x}}_{i}\right)$ . The neighborhood function is given byThis Gaussian function depends on the lateral neuron distance## Eq. 7

[TeX:] \documentclass[12pt]{minimal}\begin{document} \begin{eqnarray} h_{k,j({\bm x})} = {\rm exp} \left(- \frac{d^2_{k,j({\bm x})}}{2 \sigma ^2} \right). \end{eqnarray} \end{document} $$\begin{array}{c}\hfill {h}_{k,j\left(\mathbf{x}\right)}=\mathrm{exp}\left(-\frac{{d}_{k,j\left(\mathbf{x}\right)}^{2}}{2{\sigma}^{2}}\right).\end{array}$$*d*and the effective width, which is a variable of the network that determines how many neighboring neurons become modified.5 After the first update of the weights, the next learning cycle starts by again presenting all feature vectors to the SOM network and repeating steps 2 through 4 etc. The learning rate is reduced in each cycle according to [TeX:] \documentclass[12pt]{minimal}\begin{document}$\eta (t+1) = (1-t/t_F) \eta (t)$\end{document} $\eta (t+1)=(1-t/{t}_{F})\eta \left(t\right)$ .

This process is repeated until all the weights converge to stable values, meaning [TeX:] \documentclass[12pt]{minimal}\begin{document}$\Delta {\bm w}_k(t+1) = \Vert {\bm w}_k(t+1) - {\bm w}_k(t)\Vert$\end{document} $\Delta {\mathbf{w}}_{k}(t+1)=\parallel {\mathbf{w}}_{k}(t+1)-{\mathbf{w}}_{k}\left(t\right)\parallel $ is smaller than a preset value, or a preset number of learning cycles (iterations) [TeX:] \documentclass[12pt]{minimal}\begin{document}$t_F$\end{document} ${t}_{F}$ have been completed.

After training, a feature vector presented to the trained network, will excite exactly one neuron that represents one cluster. Hence each feature vector “belongs” to exactly one cluster. For example, if feature vectors are chosen with two components [e.g. min( [TeX:] \documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document} ${\mu}_{a}$ )/max( [TeX:] \documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document} ${\mu}_{a}$ ) and var( [TeX:] \documentclass[12pt]{minimal}\begin{document}$\mu _a$\end{document} ${\mu}_{a}$ )] and 13 neurons populate the Kohonen layer, each of the 100 feature vectors would excite one of these 13 neurons. The 100 data points would have been divided into 13 clusters as shown in Fig. 2, the example discussed in the main text. Similar Kohonen layer neurons correspond to similar feature vectors of the given input space. The structure and network parameters for the SOM algorithm used in this work are as shown in Table 3. Further details about SOM and the general learning process on tomographic image data can be found in Refs. 21, 35.

## Table 3

SOM algorithm parameters.

Represented feature | [TeX:] $\mathbb {R}^n$ ${\mathbb{R}}^{n}$ | [TeX:] $\lbrace \min(\mu _a), \max(\mu _a),\ldots,$ $\{\mathrm{min}\left({\mu}_{a}\right),\mathrm{max}\left({\mu}_{a}\right),...,$ |

space | [TeX:] $\min(\mu _a)/\max(\mu _a), var(\mu _a)\rbrace ^T$ $\mathrm{min}\left({\mu}_{a}\right)/\mathrm{max}\left({\mu}_{a}\right),var\left({\mu}_{a}\right){\}}^{T}$ | |

Number of input neurons | 2,3 or 4 | |

Number of feature/ input vectors | 100 | data drawn from images |

Number of Kohonen neurons | 2,...,80 | (e.g., 4×4 lattice) |

Number of weights | up to 4×80 | |

Number of iterations [TeX:] $t_F$ ${t}_{F}$ | 10,000 | |

Initial learning rate η(t = 0) | 0.5 | |

Final learning rate [TeX:] $\eta (t_F)$ $\eta \left({t}_{F}\right)$ | 0.01 | |

Initial neighbourhood size σ(t = 0) | up to 10 | depends on the structure |

Final neighbourhood | 1 | decreasing every |

size [TeX:] $\sigma (t_F)$ $\sigma \left({t}_{F}\right)$ | 2000 iterations |

## Acknowledgment

The authors thank Ludguier Montejo, Columbia University, for providing some critical input concerning the content and structure of the paper. This work was supported in part by a grant (2R01 AR46255) from the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS), which is part of the National Institutes of Health (NIH).

## References

*Optical and Imaging Techniques for Biomonitoring*,” Proc. SPIE, 3196 194 –204 1998). Google Scholar