## 1. Introduction

Optical coherence tomography (OCT) is a noninvasive, low-coherence interferometry technique that uses backscattered light to generate tomographic images of tissue microstructures.^{1} With advances in imaging technology and optical devices, OCT currently has extensive biomedical applications in various fields, including dermatology.^{1,2} Although OCT offers high-resolution, three-dimensional (3-D) images of tissues with 2- to $15\text{-}\mu \mathrm{m}$ resolution and 1- to 2-mm imaging depth,^{3} OCT images suffer from a grainy texture artifact, speckle, due to the broadband/low-coherence light source used in the configuration of OCT.^{3} Speckle is formed when out-of-phase backscattered signals reach the detector within the coherence time of the laser; it is the summation of multiple backscattered optical wavefields from the sample.^{4} Speckle reduces image quality, e.g., the spatial resolution of borders and edges in the image.^{5} Since speckle is an artifact that also carries submicron structural information, devising an appropriate speckle reduction algorithm is a challenging task.^{6} Speckle reduction methods can be categorized into hardware-based methods^{7} and software-based methods.^{8,9} Compounding techniques are the most common hardware-based methods. In these methods, the imaging is performed several times, and each time one system parameter is altered, e.g., the angle of light illumination (angular compounding), the polarization of the incident light (polarization compounding), or the central wavelength of the light source (spectral compounding).^{4,7,9} Software-based methods are digital filters that rely on a mathematical model of the speckle and are implemented based on the local or overall statistics of the image. Among speckle reduction digital filtering methods, the most common ones, such as averaging and median filtering, are time-efficient; however, they reduce the spatial resolution of the image.
Improved performance is provided by adaptive digital filters, including the Lee,^{10} Kuwahara,^{11} and Wiener^{12} filters. Moreover, wavelet-based filtering methods^{9} and a diffusion-based filter with fuzzy logic thresholding, as transform domain techniques, have demonstrated adequate results.^{13} Some of these techniques are compared in Ref. 8, where the authors concluded that the combination of the enhanced Lee method and the Wiener filter could significantly improve the quality of OCT images of *ex vivo* bovine retina. Other filters were developed based on an A-scan reconstruction procedure^{14} or Bayesian estimation^{15} for OCT image speckle reduction. Several other studies have been carried out on speckle reduction by tuning the internal parameters of a particular despeckling algorithm.^{16} Relying on more advanced filtering concepts, several speckle reduction methods were developed based on total variation (TV),^{17} nonlocal mean (NLM) filtering,^{17} and block-matching and 3-D filtering (BM3D).

A tissue’s cellular specification can be represented by optical and textural features extracted from OCT images. Together with attenuation coefficients,^{18} textural features can provide valuable information about the texture and scattering properties of a biological tissue.^{19,20} We introduce an expandable despeckling framework, named the learnable despeckling framework (LDF). LDF decides which speckle reduction algorithm is most effective for a given image by learning a single quantitative image assessment measure: the figure of merit (FOM). Among the several algorithms developed for speckle reduction in OCT images, we utilized the digital filters mentioned above to evaluate our proposed framework. With each new image given to LDF, it is retrained and its performance improves. Any other algorithm can easily be added to this framework; this is the expandable characteristic of LDF. LDF’s architecture is composed of two main parts: an autoencoder neural network and a filter classifier. In our proposed scheme, the autoencoder first learns to construct a weighted FOM measure based on quality assessment measures, i.e., signal-to-noise ratio (SNR), contrast-to-noise ratio (CNR),^{8,21–23} equivalent number of looks (ENL), mean structural similarity (MSSIM) index, and edge preservation index (EPI), extracted from OCT images filtered by the following categories of filters: (a) sliding window filters, including median, mean, and SNN; (b) adaptive statistical-based filters, including Wiener, homomorphic Lee, and Kuwahara; and (c) edge-preserving patch- or pixel-correlation-based filters, including NLM, TV, and BM3D. Then, the filter classifier identifies the most efficient filter.

## 2. Materials and Methods

In Sec. 2.1, we present the digital filtering methods used in this study. In Sec. 2.2, we describe the statistical features that can be extracted from an OCT image. In Sec. 2.3, we explain the image quality assessment measures that are utilized to evaluate the quality of the filtered images. Following that, in Sec. 2.4, we introduce the architecture of the LDF in detail. The OCT system specifications are given in Sec. 2.5.

### 2.1. Digital Filtering Methods

Digital filters implement mathematical operations, characterized by their transfer functions, on a sampled signal or image to improve its quality. A speckle reduction method is designed and implemented based on the statistical characteristics of speckle. There are three main classes of digital filters: sliding window, adaptive statistical-based, and edge-preserving patch- or pixel-correlation-based filters. We explored 25 filters. Among them, filters #1 to #4 are the median filter with window sizes of 3, 5, 7, and 9 pixels, respectively; filters #5 to #8 are the averaging filter with window sizes of 3, 5, 7, and 9 pixels, respectively; filters #9 to #12 are symmetric nearest neighbor (SNN)^{16} with window sizes of 3, 5, 7, and 9 pixels, respectively; filters #13 and #14 are Kuwahara^{24} with window sizes of 3 and 5 pixels, respectively; filters #15 to #18 are the adaptive Wiener filter with window sizes of 3, 5, 7, and 9 pixels, respectively;^{25} filters #19 to #22 are the Lee filter with window sizes of 3, 5, 7, and 9 pixels, respectively; filter #23 is a pixel-wise NLM filter;^{26} filter #24 is a TV^{16} filter; and filter #25 is BM3D.^{27}

#### 2.1.1. Sliding window filters

This class of filters includes the mean, median, and SNN filters. These filters are time-efficient and can be used in real-time speckle reduction applications, such as video-rate OCT imaging. Although they effectively reduce speckle noise in the OCT image, they smooth edges in the image and create blurriness. The mean filter is a linear convolutional low-pass filter in which a pixel value is replaced by the average of its neighboring pixel values. In median filtering, a pixel value in an $M\times N$ window is replaced by the middle value of the window's pixel values sorted in ascending order.^{28} This nonlinear filter is more robust than the mean filter and preserves edges more effectively. SNN is an edge-preserving sliding window speckle reduction method. In SNN, the opposite pairs of pixels in the support are first compared, and each pair is replaced with the member whose value is closest to the central pixel value.^{16} The central pixel value is then replaced by the average of the processed pixel values in the support.
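As an illustration of the SNN scheme described above, a minimal numpy sketch (the window size, padding mode, and other implementation details are ours, not specified by the paper):

```python
import numpy as np

def snn_filter(img, k=3):
    """Symmetric nearest neighbor sketch: for each opposite pixel pair
    in the k x k support, keep the member closer to the center value,
    then replace the center with the mean of the kept values."""
    r = k // 2
    p = np.pad(img.astype(float), r, mode="edge")
    out = np.empty_like(img, dtype=float)
    H, W = img.shape
    for i in range(H):
        for j in range(W):
            win = p[i:i + k, j:j + k].ravel()
            center = win[win.size // 2]
            kept = []
            for a in range(win.size // 2):     # symmetric pairs around center
                x, y = win[a], win[win.size - 1 - a]
                kept.append(x if abs(x - center) <= abs(y - center) else y)
            out[i, j] = np.mean(kept)
    return out
```

Because each pair contributes the value on the center pixel's side of an edge, averaging the kept values smooths noise without mixing values from across the edge.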

#### 2.1.2. Adaptive statistical-based filters

This class of despeckling filters includes the Kuwahara filter,^{11} adaptive Wiener filter, and Lee filter, and utilizes statistical features, e.g., mean and variance, extracted from the image or part of it. Kuwahara works by dividing the support into four subregions;^{24} the central pixel is replaced with the average of the quadrant with the lowest variance. The Wiener filter tailors itself to the local mean and local variance of the image, i.e., the larger the variance, the less smoothing is applied.^{12} The Lee filter is an adaptive filter that determines each pixel value from the weighted sum of the center pixel value and the local statistics (mean and variance) calculated in a square kernel surrounding the pixel, using a minimum mean square error (MMSE) approach.^{29}
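The Lee filter's MMSE weighting can be sketched as follows; the noise-variance estimate (mean local variance) is a common heuristic we assume here, since the paper does not specify one:

```python
import numpy as np

def lee_filter(img, k=5, noise_var=None):
    """Lee MMSE sketch: each output pixel is the local mean plus a
    variance-dependent fraction of the residual, so high-variance
    regions (edges) are smoothed less."""
    r = k // 2
    p = np.pad(img.astype(float), r, mode="edge")
    win = np.lib.stride_tricks.sliding_window_view(p, (k, k))
    mu = win.mean(axis=(-1, -2))          # local mean per pixel
    var = win.var(axis=(-1, -2))          # local variance per pixel
    if noise_var is None:                 # heuristic noise estimate (assumption)
        noise_var = var.mean()
    w = var / (var + noise_var + 1e-12)   # MMSE gain in [0, 1)
    return mu + w * (img - mu)
```

In flat regions `var` is small, so `w` is near 0 and the output approaches the local mean; near edges `var` is large, `w` approaches 1, and the original pixel value is retained.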

#### 2.1.3. Patch or pixel correlation-based filters

This class of despeckling filters, which includes NLM, TV, and BM3D, is based on high inter- or intracorrelations among nearby pixels or patches of pixels. The NLM algorithm changes the value of the target pixel to a weighted average of all (or selected) pixels in the image, with weights based on their similarity to the target pixel. NLM filters are known to preserve textures.^{26} TV filters are based on an edge-preserving TV regularization process that relies on the fact that signals with excessive detail have high TV, i.e., a large integral of the absolute gradient of the signal.^{30} TV provides a close match to the ground truth image, efficiently suppressing noise while preserving image details. BM3D is a collaborative filtering method designed around the fact that the image has a locally sparse representation in the transform domain.^{27} The procedure begins by grouping similar image patches into 3-D groups. A 3-D linear transformation is then applied to each group and a shrinkage procedure is performed, after which an inverse transformation is applied to the spectrum coefficients. Combining the patches yields an estimate of the ground truth image. Finally, a Wiener filter is used to form the denoised image.^{31}
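The pixel-wise NLM weighting can be sketched as follows; the patch size, search window, and smoothing parameter `h` are illustrative choices, not the paper's settings:

```python
import numpy as np

def nlm_filter(img, patch=3, search=7, h=0.1):
    """Pixel-wise non-local means sketch: each pixel becomes a weighted
    average of pixels in a search window, weighted by the similarity of
    their surrounding patches."""
    pr, sr = patch // 2, search // 2
    pad = pr + sr
    p = np.pad(img.astype(float), pad, mode="reflect")
    H, W = img.shape
    out = np.zeros_like(img, dtype=float)
    for i in range(H):
        for j in range(W):
            ci, cj = i + pad, j + pad
            ref = p[ci - pr:ci + pr + 1, cj - pr:cj + pr + 1]
            wsum = vsum = 0.0
            for di in range(-sr, sr + 1):
                for dj in range(-sr, sr + 1):
                    qi, qj = ci + di, cj + dj
                    cand = p[qi - pr:qi + pr + 1, qj - pr:qj + pr + 1]
                    d2 = np.mean((ref - cand) ** 2)
                    w = np.exp(-d2 / (h * h))   # patch-similarity weight
                    wsum += w
                    vsum += w * p[qi, qj]
            out[i, j] = vsum / wsum
    return out
```

Texture is preserved because pixels whose neighborhoods look like the target patch dominate the average, wherever they sit in the search window.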

### 2.2. Image Textural and Optical Features

To quantify tissue properties, 27 features, including 26 texture features^{32,33} and 1 optical property, are extracted from the OCT images. For each image, 6 first-order statistical features are calculated: mean, variance, skewness, median, kurtosis, and entropy. Twenty features from the gray-level co-occurrence matrix (GLCM),^{34} i.e., homogeneity, contrast, energy, entropy, and correlation, each computed in four directions (0 deg, 45 deg, 90 deg, and 135 deg), are calculated as textures.^{35} The optical property calculated from the OCT image is the attenuation coefficient. We used Vermeer's method^{36} to calculate the attenuation coefficient for each pixel in the OCT intensity image. We utilized the principal component analysis (PCA) algorithm to reduce the dimension of the feature space (from 27 to 5) with the least loss of information. PCA is a projection-based method that reduces computational complexity through the construction of orthogonal principal components.^{37}
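The 27-to-5 PCA reduction can be sketched with an SVD-based projection; the 285 × 27 shape follows the text, while the random data below merely stand in for the real feature matrix:

```python
import numpy as np

def pca_reduce(X, n_components=5):
    """Project centered feature vectors onto the top principal
    components (SVD-based, a standard sketch of the reduction above)."""
    Xc = X - X.mean(axis=0)               # center each feature column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T       # scores in the reduced space

rng = np.random.default_rng(0)
X = rng.normal(size=(285, 27))            # 285 images x 27 features (shapes from the text)
Z = pca_reduce(X, n_components=5)
```

The rows of `Vt` are the orthogonal principal components; keeping the first five retains the directions of largest variance, which is what makes the reduced features nearly lossless for downstream classification.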

### 2.3. Image Quality Assessment Measures

The performance of the filtering methods is assessed using well-established objective assessment measures, including SNR, CNR,^{8,21–23} ENL, MSSIM, and EPI.^{9,38,39} SNR compares the signal of an object in the OCT image to its background noise. CNR is a measure of the contrast of a region of interest (ROI) relative to the background noise. The definitions of SNR and CNR are given in Eqs. (1) and (2), respectively. Both SNR and CNR are calculated over a number of ROIs (10 in this study)
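For concreteness, the ROI-based measures can be computed as in this numpy sketch, following the standard definitions used below; the ROI arrays and their handling are illustrative:

```python
import numpy as np

def snr_db(signal_rois, background_roi):
    """SNR: median of the squared linear intensities over the
    background noise variance, in dB."""
    I = np.concatenate([r.ravel() for r in signal_rois])
    return 10 * np.log10(np.median(I ** 2) / np.var(background_roi))

def cnr(signal_rois, background_roi):
    """CNR: contrast of each ROI against the background, averaged
    over the R ROIs."""
    mb, vb = background_roi.mean(), background_roi.var()
    vals = [(r.mean() - mb) / np.sqrt(r.var() + vb) for r in signal_rois]
    return float(np.mean(vals))

def enl(homog_rois):
    """ENL: mean of mu^2 / sigma^2 over homogeneous ROIs."""
    return float(np.mean([r.mean() ** 2 / r.var() for r in homog_rois]))
```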

## Eq. (1)

$$\mathrm{SNR}=10\text{\hspace{0.17em}}{\mathrm{log}}_{10}\left[\frac{\text{median}({I}_{\mathrm{lin}}^{2})}{{\sigma}_{\mathrm{lin}}^{2}}\right],$$## Eq. (2)

$$\mathrm{CNR}=\frac{1}{R}\left[{\sum}_{r=1}^{R}\frac{({\mu}_{r}-{\mu}_{b})}{\sqrt{{\sigma}_{{r}^{2}}+{\sigma}_{{b}^{2}}}}\right],$$^{21}ENL is a measure of smoothness in a homogeneous ROI and can be calculated as where ${\mu}_{h}^{2}$ and ${\sigma}_{h}^{2}$ are the mean and variance of homogeneous ROIs ($H$). The MSSIM index quantifies image quality referring to its structural similarities and is based on local statistic calculations

## Eq. (4)

$$\mathrm{MSSIM}=\frac{1}{MN}{\sum}_{i=1}^{M}{\sum}_{j=1}^{N}\frac{[2{\mu}_{{I}_{(i,j)}}{\mu}_{{\hat{I}}_{(i,j)}}+{C}_{1}][2{\sigma}_{{I}_{(i,j)}}{\sigma}_{{\hat{I}}_{(i,j)}}+{C}_{2}]}{[{\mu}_{{I}_{(i,j)}}^{2}+{\mu}_{{\hat{I}}_{(i,j)}}^{2}+{C}_{1}][{\sigma}_{{I}_{(i,j)}}^{2}+{\sigma}_{{\hat{I}}_{(i,j)}}^{2}+{C}_{2}]},$$

where $I$ and $\hat{I}$ are the ground truth and filtered images, respectively.^{39} To calculate the MSSIM index, the ground truth image is generated by averaging 170 B-scan images; this can be considered a spatial compounding method,^{40} since the images are taken from slightly misaligned samples due to imperfect optical scanners used in the OCT machine. EPI is a correlation-based method that shows how the edges in the image degrade

## Eq. (5)

$$\mathrm{EPI}=\frac{{\sum}_{i=1}^{M}{\sum}_{j=1}^{N}[\mathrm{\Delta}{I}_{(i,j)}-{\mu}_{\mathrm{\Delta}I}][\mathrm{\Delta}{\hat{I}}_{(i,j)}-{\mu}_{\mathrm{\Delta}\hat{I}}]}{\sqrt{{\sum}_{i=1}^{M}{\sum}_{j=1}^{N}{[\mathrm{\Delta}{I}_{(i,j)}-{\mu}_{\mathrm{\Delta}I}]}^{2}{\sum}_{i=1}^{M}{\sum}_{j=1}^{N}{[\mathrm{\Delta}{\hat{I}}_{(i,j)}-{\mu}_{\mathrm{\Delta}\hat{I}}]}^{2}}},$$

where $\mathrm{\Delta}I$ and $\mathrm{\Delta}\hat{I}$ are edge maps (e.g., Laplacian-filtered versions) of the original and filtered images, respectively.

### 2.4. Learnable Despeckling Framework

The architecture of LDF includes two main parts: (i) an autoencoder neural network and (ii) a filter classifier. In Sec. 2.4.1, we explain the architecture of the autoencoder utilized to learn the FOM. In Sec. 2.4.2, we describe the architecture of the classifier that is trained based on the FOM to predict the most effective despeckling filter.

#### 2.4.1. Autoencoder architecture

An FOM is defined as a single representative measurement to assess the performance of each filter. In this study, we define the FOM based on a set of five OCT quality measures: SNR, CNR, ENL, EPI, and MSSIM. The goal is to find an FOM that best represents the quality of an image. To do this, we utilized an autoencoder artificial neural network (ANN) with three layers for unsupervised learning of the FOM. The structure of the autoencoder is shown in Fig. 1. As shown, layer 1 contains six neurons: SNR, CNR, ENL, EPI, MSSIM, and a bias neuron. Layer 2 includes one neuron to estimate the FOM and a bias neuron. A sigmoid transfer function, the stochastic gradient descent optimizer, and the mean square error (MSE) loss function are used in training the autoencoder. Autoencoders work well (accuracy above 90% and faster convergence) if all the weights are initially the same. Following the work in Ref. 8, for applications similar to ours, initial weights of one are the best choice.^{41} The autoencoder is trained to calculate the FOM by utilizing the quality assessment measures obtained from the filtered images. In this experiment, the final weights of the encoder layer were calculated as [${w}_{1},{w}_{2},{w}_{3},{w}_{4},{w}_{5}$] = [0.1237, 0.2387, 0.0296, 0.1987, 0.4093], where [${w}_{1},{w}_{2},{w}_{3},{w}_{4},{w}_{5}$] corresponds to [SNR, CNR, ENL, EPI, MSSIM]. Based on this experiment, one can conclude that MSSIM has the most significant effect on the FOM, whereas ENL has the least.
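Given the reported encoder weights, the FOM of a filtered image reduces to a sigmoid of a weighted sum of its quality measures. A sketch, where the bias value and the measure values are hypothetical (only the weights come from the text), and the measures are assumed to be normalized:

```python
import numpy as np

def fom(measures, weights, bias=0.0):
    """FOM sketch: sigmoid of a weighted sum of the five normalized
    quality measures, mirroring the single encoder neuron above.
    The bias default is illustrative, not a reported value."""
    z = float(np.dot(weights, measures)) + bias
    return 1.0 / (1.0 + np.exp(-z))

# Final encoder weights reported above, ordered [SNR, CNR, ENL, EPI, MSSIM]
w = np.array([0.1237, 0.2387, 0.0296, 0.1987, 0.4093])
m = np.array([0.8, 0.6, 0.4, 0.9, 0.7])   # hypothetical normalized measures
score = fom(m, w)
```

Since the MSSIM weight is the largest, raising MSSIM moves the FOM more than an equal change in any other measure, which is exactly the conclusion drawn from the trained weights.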

#### 2.4.2. Classifier architecture

We used the FOM measure to classify the despeckling filters. In this study, we utilized another ANN as the classifier. The classifier predicts the most effective filter (the winner filter) for a given input image. The designed ANN classifier includes three layers: the input layer, the hidden layer, and the output layer. The input layer includes five neurons corresponding to the five features extracted from the image (the number of extracted features is initially 27, which is reduced to 5 by the PCA algorithm). The hidden layer includes 10 neurons. Finally, the output layer includes 25 neurons, equal to the number of filters in the experiment. The stochastic gradient descent optimizer and the sum square error (SSE) loss function are used in training the classifier. Figure 2 shows the architecture of the classifier. The value of each neuron in the output layer is a real number between 0 and 1, which represents the probability of the corresponding filter being the winner filter. At the end, the filter with the highest probability is selected as the winner filter. Before training, the features are normalized, once over each image and once over the ensemble of all images.
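The 5-10-25 forward pass and winner selection can be sketched as follows; the weights here are random stand-ins for the trained parameters, and the sigmoid activations are an assumption consistent with outputs lying in (0, 1):

```python
import numpy as np

def winner_filter(features, W1, b1, W2, b2):
    """Forward pass sketch of the 5-10-25 classifier: sigmoid hidden
    layer, sigmoid output scores read as per-filter probabilities,
    argmax picks the winner filter."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sig(W1 @ features + b1)           # 10 hidden neurons
    p = sig(W2 @ h + b2)                  # 25 per-filter scores in (0, 1)
    return int(np.argmax(p)), p

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(10, 5)), rng.normal(size=10)    # untrained stand-ins
W2, b2 = rng.normal(size=(25, 10)), rng.normal(size=25)
idx, probs = winner_filter(rng.normal(size=5), W1, b1, W2, b2)
```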

The flowchart of the LDF (including the classifier's training and testing steps) is given in Fig. 3. We are provided with a set of raw OCT skin images. The images are initially linearized and normalized. The processed images are then passed through two parallel channels, A and B. In channel A, we extract the optical and textural features of the images and convert the set of correlated features to a smaller set of linearly uncorrelated features using the PCA algorithm. In channel B, we apply all the filtering methods to the image to create a set of filtered images. Then, we calculate the FOM of each filtered image using our pretrained autoencoder. Thereafter, the filter that achieves the highest FOM is chosen as the winner filter and hence becomes the class label of the selected image. This process is repeated until every image has a class label. Finally, the classifier is trained and tested using the selected features and the class labels.
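The channel-B labeling step can be sketched as a loop over a filter bank; the placeholder filters and the stand-in `fom_score` below are illustrative, not the paper's 25 filters or trained autoencoder:

```python
import numpy as np

# Every filter in the bank is applied to each image, the (pretrained)
# FOM scores each result, and the arg-max filter becomes the class label.
filter_bank = [
    lambda im: im,                         # placeholder filter
    lambda im: np.clip(im * 0.5, 0, 1),    # placeholder filter
    lambda im: im ** 2,                    # placeholder filter
]

def fom_score(im):
    """Stand-in for the learned FOM of a filtered image."""
    return float(im.mean())

def label_image(img):
    scores = [fom_score(f(img)) for f in filter_bank]
    return int(np.argmax(scores))          # index of the winner filter

labels = [label_image(np.full((4, 4), v)) for v in (0.2, 0.9)]
```

In the actual framework, these labels (together with the PCA-reduced features from channel A) form the supervised training set for the classifier.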

### 2.5. OCT System Specifications

We use a multibeam swept-source OCT (SS-OCT) system (VivoSight, Michelson Diagnostics^{™} Inc., United Kingdom) for this study (Fig. 4). The lateral and axial resolutions are 7.5 and $10\text{\hspace{0.17em}\hspace{0.17em}}\mu \mathrm{m}$, respectively. The scan area of the OCT system is 6 mm (width) $\times 6\text{\hspace{0.17em}\hspace{0.17em}}\mathrm{mm}$ (length) $\times 2\text{\hspace{0.17em}\hspace{0.17em}}\mathrm{mm}$ (depth). A tunable broadband laser source with a central wavelength of $1305\pm 15\text{\hspace{0.17em}\hspace{0.17em}}\mathrm{nm}$ successively sweeps through the optical spectrum; the light is led to four separate interferometers, forming four consecutive confocal gates. The interference spectrum generated by the frequency sweep, over the whole bandwidth in time, is given with respect to frequency. One reflectivity profile (A-scan) is generated per sweep, at a sweep rate of 10 kHz. A B-scan is then generated by combining several adjacent A-scans acquired at different transversal positions of the incident beam.

Due to its multibeam configuration, our VivoSight OCT can be considered approximately a discrete dynamic-focus OCT,^{42} and to a good approximation, the confocal parameter of the lens and the fall-off in laser coherence can be neglected. Therefore, no compensation for these effects was performed.

## 3. Results and Discussion

We design three sets of experiments to demonstrate whether the FOM is a reliable measure and whether it can be used as the class label in training the classifier (see Sec. 3.1). The experiments are performed based on the three main classes of digital filters, i.e., sliding window, adaptive statistical-based, and edge-preserving patch- or pixel-correlation-based filters. Following these experiments, in Sec. 3.2, we use the fivefold cross-validation technique to estimate the accuracy of the trained classifier.

### 3.1. Figure of Merit Learning and Validation

The 27 features are computed from 25 ROIs in each image. For SNR and CNR calculations, 20 ROIs are selected from the tissue region and 10 ROIs from the background region. For the computation of the other three quality assessment measures, the entire filtered image is used. All 25 digital filters described in Sec. 2 were implemented in MATLAB^{®} 2016. We used a Dell desktop computer with an Intel Core i7, 3.10-GHz CPU and 8 GB of RAM to implement the algorithms. The OCT machine is an FDA-approved system for skin imaging; thus, *in vivo* skin images were collected. Images were acquired from the skin of both healthy and diseased individuals. OCT images of healthy skin were taken from various body locations to account for the variety of skin architecture found on the body. Additionally, OCT images of diseased skin were collected, including nonmelanoma skin cancer, psoriasis, and acne. The imaging was performed at the Oakwood Clinic, Dearborn, Michigan. The institutional review board at Wayne State University (Independent Investigational Review Board, Detroit, Michigan) approved the study protocol. For training the classifier, the number of input images was $285\times 25=7125$, where 285 is the number of images and 25 is the number of filters. Based on a fivefold cross-validation method, out of the 7125 OCT images, $228\times 25=5700$ images were used for training the classifier, and the remaining 1425 images were used for testing.^{43} The accuracy of the classifier is obtained as 97%. In the following, three experiments are performed based on the three main classes of digital filters, i.e., sliding window, adaptive statistical-based, and edge-preserving patch- or pixel-correlation-based filters.

The histogram of winner filters for the 285 test sets in the sliding window filter category, as well as their execution times, is shown in Fig. 5. According to the graph, the average filter with window size 5 is most often the winner filter for despeckling. The median filter with window size 5 and SNN with window size 5 are the next two winners.

In Fig. 6, three original OCT images and their despeckled versions using sliding window filters are shown. In the results presented in Fig. 6, the classifier has only learned the sliding window filter subgroup rather than the entire filter pool. The yellow boxes in the figure indicate the winner filters based on the FOM criterion. For the three test images shown here, the average filter with window size 5 was chosen as the winner filter. The red boxes indicate the winner filter chosen by the classifier within the sliding window filter subgroup.

Figure 7 shows the histogram of winner filters in the adaptive statistical filter category for the 285 OCT images, as well as their execution times. According to this graph, for our image set, the Lee filter with window size 5 is chosen for all the images used in this study when the FOM is the quality assessment criterion.

In Fig. 8, original OCT images and their despeckled versions using adaptive statistical filters are shown. Here, the classifier has only learned the adaptive statistical filter subgroup. The yellow boxes in the figure indicate the winner filters chosen based on the FOM criterion, i.e., here, the Lee filter with window size 5. The red boxes indicate the winner filters chosen by the classifier within the adaptive statistical filter subgroup.

Figure 9 shows the histogram of winner filters in the patch or pixel correlation filter category, as well as their execution times. According to this graph, when the FOM is the quality assessment criterion, the BM3D filter is chosen as the winner filter in most cases. NLM and TV are the next winners, respectively.

In Fig. 10, original OCT images and their despeckled versions using patch or pixel correlation filters are shown. In this case, the classifier has only learned the patch or pixel correlation filter subgroup. The yellow boxes in the figure indicate the winner filters selected based on the FOM criterion, i.e., BM3D and NLM. The red box indicates the winner filter chosen by the classifier.

### 3.2. Classifier Training and Performance Evaluation

With the above experiments and the obtained results, we demonstrated that the FOM is a reliable measure and can be used as the class label in training the classifier. This was shown by comparing the winner filter chosen based on the FOM (calculated regardless of image textural features and relying only on quality assessment measures) with that chosen by the classifier (trained on the image features). We evaluated the selection rate of filters in all three categories together for the 285 images (see Fig. 11).

Figure 12 (I_1–I_6) shows the results of LDF on six OCT test images (${a}_{1}$ to ${a}_{6}$) with their corresponding winner filters predicted by the classifier when all 25 filters are considered. Figures 12(b)–12(d) show normalized FOM values of all 25 filters for these OCT images. Comparing the winner filter chosen by the classifier with that chosen by the normalized FOM shows that the classifier can predict the winner filter with high accuracy, without having to know the result of each individual filter and based only on the features extracted from the image. In this experiment, the classifier accuracy was measured as 97% based on a fivefold cross-validation method.^{42}

Regarding execution time, we observed that even though the patch or pixel correlation filters despeckled images most effectively, their execution times are on the order of seconds, whereas the execution times of most sliding window filters are in the millisecond range. The choice between better quality (based on FOM) and shorter execution time can be added to the LDF criteria.

Speckle is an artifact that affects the image quality and resolution of OCT images. Computational reduction of speckle has been a subject of interest since the early days of laser speckle and digital image processing,^{2,3,44,45} and its application to OCT has led to a large number of publications over nearly 20 years. Although several groups, including ours, have studied different speckle reduction methods for OCT images, finding an optimum speckle reduction filter for an image or image set has not been comprehensively explored. We propose a procedure, which we call LDF, to learn the most effective speckle reduction method for a given set of images. We believe that LDF can help in choosing the most appropriate despeckling filter based on tissue morphological, textural, and optical features.

Using a larger and more diverse set of OCT images (both healthy and diseased) for training the classifier, and utilizing a more efficient training algorithm along with a larger number of features and quality assessment measures, are planned as our future work.

## 4. Conclusion

Speckle is an artifact that affects image quality and resolution. Speckle “noise” reduction has been a subject of interest since spatially coherent lasers were invented. The necessity of such algorithms for OCT images has led to a large number of algorithms in the literature. We propose an expandable and learnable framework, which we call LDF, to organize these filters in an intelligent manner. The framework finds the most suitable speckle reduction algorithm for a given image. LDF learns an FOM, a single quantitative image assessment measure constructed from SNR, CNR, ENL, EPI, and MSSIM, using an autoencoder neural network. LDF was then trained to decide which speckle reduction algorithm is most effective for a given image based on the image's textural and optical features. The classification accuracy of LDF is determined as 97%. We tested LDF on 285 images. The quality of the despeckled images using LDF is improved not only in one quality metric but in a combination of all quality metrics. LDF can also be customized based on a single feature that the user requests. Utilizing a larger dataset for training the encoder and classifier, as well as a more efficient training algorithm along with a larger number of uncorrelated features and quality assessment measures, is planned as our future work.

## Disclosures

The authors have no relevant financial interests in the article and no other potential conflicts of interest to disclose.

## Acknowledgments

This project has been funded by Michelson Diagnostics and by Wayne State University Startup fund.