Artificial intelligence deep learning algorithm for discriminating ungradable optical coherence tomography three-dimensional volumetric optic disc scans

Abstract. Spectral-domain optical coherence tomography (SDOCT) is a noncontact and noninvasive imaging technology offering three-dimensional (3-D), objective, and quantitative assessment of the optic nerve head (ONH) in human eyes in vivo. The image quality of SDOCT scans is crucial for an accurate and reliable interpretation of ONH structure and for further detection of diseases. Traditionally, signal strength (SS) is used as an index to include or exclude SDOCT scans for further analysis. However, SS alone is insufficient to assess other image quality issues such as off-centration, out of registration, missing data, motion artifacts, mirror artifacts, or blurriness, whose assessment requires specialized knowledge in SDOCT. We propose a deep learning system (DLS) as an automated tool for filtering out ungradable SDOCT volumes. In total, 5599 SDOCT ONH volumes were collected for training (80%) and primary validation (20%). A further 711 and 298 volumes from two independent datasets were used for external validation. An SDOCT volume was labeled as ungradable when SS was <5 or when any artifacts influenced the measurement circle or >25% of the peripheral area. Artifacts included (1) off-centration, (2) out of registration, (3) missing signal, (4) motion artifacts, (5) mirror artifacts, and (6) blurriness. An SDOCT volume was labeled as gradable when SS was ≥5 and either no artifacts were present or artifacts influenced only <25% of the peripheral area and not the retinal nerve fiber layer calculation circle. We developed and validated a 3-D DLS based on squeeze-and-excitation ResNeXt blocks and experimented with different training strategies. The area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and accuracy were calculated to evaluate the performance. Heatmaps were generated by gradient-weighted class activation map. Our findings show that the presented DLS achieved good performance in both primary and external validations, which could potentially increase the efficiency and accuracy of quality control for SDOCT volumetric scans by filtering out ungradable ones automatically.


Introduction
Optical coherence tomography (OCT) is a noncontact and noninvasive imaging technology offering objective and quantitative assessment of human eye structures, including the cornea, macula, and optic nerve head (ONH) in vivo. The introduction of spectral-domain optical coherence tomography (SDOCT) in recent years has improved scanning speed and axial resolution, enabling high-resolution, three-dimensional (3-D) volumetric imaging that has contributed greatly to its wide clinical application. 1 However, poor scan quality due to patients' poor cooperation, operators' skills, or device-dependent factors (e.g., inaccurate optic disc margin delineation) can affect the metrics generated from the SDOCT. 2,3 Specifically, insufficient image quality potentially leads to inaccurate measurements of retinal nerve fiber layer (RNFL) thickness, which is an important metric for detection of optic neuropathy such as glaucoma, a leading cause of irreversible blindness. 4 Other morphologies from ONH, such as neuroretinal rim and lamina cribrosa, 5 are also used to assess glaucoma, and they likewise require SDOCT volumetric data of sufficient quality. Thus, it is necessary to filter out ungradable scans and rescan patients with subpar images before any clinical assessment.
Conventionally, signal strength (SS) is the main parameter to include or exclude SDOCT scans for further quantitative analysis. 6 For the Cirrus high-definition SDOCT, image quality is indicated by SS ranging from 0 (worst quality) to 10 (best quality), representing the average of signal intensity of SDOCT volumetric scans, and scans with SS of 6 or above are usually defined as sufficient for further analysis. [7][8][9] However, even with acceptable SS, it is still hard to assess other SDOCT image quality issues, such as off-centration, out of registration, signal loss, motion artifacts, mirror artifacts, or blurriness of SDOCT volumetric data. 3 Such image quality assessment indeed requires highly trained operators and interpreters with specialized knowledge in SDOCT, which is a big challenge due to the lack of manpower and insufficient training time in clinics. In addition, it is impractical for human assessors to grade every SDOCT volumetric scan, which could be a time-consuming and tedious process in busy clinics.
Previous studies have proposed traditional computer-aided systems using hand-crafted features for automated image quality control in natural images. 10 However, the hand-crafted features were based on either geometric or structural quality parameters such as signal-to-noise ratio, which do not generalize well to new datasets. Moreover, unlike natural images, the gradability of medical images is not simply related to pixels, signals, noises, or distortion of an image itself. Human assessors' judgment on whether the quality of the entire image is sufficient for disease detection or further analysis is essential for discriminating the gradability of medical images.
Machine learning, under the broad name of artificial intelligence (AI), adopts a class of techniques called deep learning (DL). 11 In terms of image processing, convolutional neural networks (CNNs) have proven useful in image-related tasks. It is more efficient to extract and weigh features automatically rather than in a hand-crafted manner. Currently, CNNs have been used for image quality control in various medical imaging modalities, such as magnetic resonance imaging (MRI), 12 ultrasound imaging, 13 and fundus photography. 14 Generally, using DL for image quality control can be achieved with either unsupervised or supervised methods. Unsupervised anomaly detection is mainly used in highly imbalanced datasets to detect rare cases. It learns features from only one kind of input, then computes the similarity between the future input and the learned one. Generative-based works, such as variational autoencoder-based methods 15 and generative adversarial networks-based methods, 16 commonly rely on more than one neural network. Nongenerative models, such as one-class neural networks, 17 require a pretrained deep autoencoder as well. Hence, a higher computational cost and a larger graphics processing unit (GPU) memory are needed for applying anomaly detection, especially on 3-D image tasks, which would be impractical in our study. The second method is binary classification, a supervised anomaly detection method that trains a CNN model to assign binary labels to input images. With residual connections proposed from ResNet, 18 deeper CNNs can be trained without degradation by reducing gradient vanishing or exploding problems. Other variants such as ResNeXt further improved the performance on classification benchmarks such as ImageNet data. 19 Apart from those, SENet proposed squeeze-and-excitation (SE) blocks, which introduced a channel-wise attention mechanism in a simple plug-in manner that could be applied in any DL models, and it surpassed other architectures in the competition ImageNet 2017. 20 Since the ground-truth label of each SDOCT volumetric scan is from highly trained human assessors, a model trained in a supervised manner would be better in our study. As far as we know, though CNNs have been applied in medical imaging quality control, there is still a lack of DL-based methods for quality control of SDOCT volumetric scans.
In this study, we aim to develop and validate a 3-D deep learning system (DLS) using SDOCT volumetric scans as input for filtering out ungradable volumes. We hypothesize that the 3-D DLS for filtering out ungradable SDOCT volumetric scans without hand-crafted features would perform well in both primary and external validations. The DLS would eventually increase the accuracy and efficiency of SDOCT volumetric data quality control and further contribute to accurate quantitative analysis and detection of diseases.

Data Acquisition and Data Pre-Processing
The dataset for training and validation was collected from the existing database of electronic medical and research records at the Chinese University of Hong Kong (CUHK) Eye Center and the Hong Kong Eye Hospital (HKEH) dated from March 2015 to March 2019. All subjects who had undergone ONH SDOCT imaging with the Cirrus SDOCT (Carl Zeiss Meditec, Dublin, California) were eligible for inclusion. A total of 5599 SDOCT volumetric scans from 1479 eyes were included for the development of the DLS. These data were from normal subjects or patients with any pathologies, and most of the patients had glaucoma. Two nonoverlapping datasets collected from Prince of Wales Hospital (PWH) and Tuen Mun Eye Center (TMEC) in Hong Kong were used as two external validation datasets, including 711 SDOCT scans from 509 eyes and 298 scans from 296 eyes, respectively (Table 1). An SDOCT volume was labeled as ungradable when SS was <5 or when any artifacts influenced the measurement circle or >25% of the peripheral area. Artifacts included (1) off-centration, (2) out of registration, (3) missing signal, (4) motion artifacts, (5) mirror artifacts, and (6) blurriness. An SDOCT volume was labeled as gradable when SS was ≥5 and either no artifacts were present or artifacts influenced only <25% of the peripheral area and not the RNFL calculation circle. The RNFL calculation circle was a circle of 3.46 mm in diameter centered on the optic disc, and it was automatically placed by the Cirrus SDOCT machine (Cirrus User Manual). Before grading began, two highly trained human assessors were tested. A separate set of 200 SDOCT volumetric scans was reviewed by the two assessors, and a kappa value of 0.96 was achieved, which indicated an almost perfect agreement. 21 Cases of disagreement were further discussed with the senior assessor, a trained doctor with more than 5 years of clinical research experience in glaucoma imaging.
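The inter-assessor agreement check described above can be illustrated with a short sketch. It assumes the reported kappa is Cohen's kappa for two raters, which the text does not state explicitly; the function name `cohen_kappa` and the toy labels are hypothetical and are not the study's grading data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items (assumed statistic)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy example with made-up labels: the raters disagree on one of ten scans.
rater_1 = ["gradable"] * 6 + ["ungradable"] * 4
rater_2 = ["gradable"] * 5 + ["ungradable"] * 5
kappa = cohen_kappa(rater_1, rater_2)
```

Kappa corrects raw agreement for the agreement expected by chance, so it is a stricter measure than the simple proportion of matching labels.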
After training and testing, two assessors worked separately to label each SDOCT volumetric scan from all the datasets as ungradable or gradable. Disagreements between the two assessors were resolved by consensus, and the cases without consensus were further reviewed by the senior assessor to make the final decision (examples are shown in Fig. 1).
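The labeling rule applied by the assessors can be written as a small decision function. This is only a sketch of the stated criteria: the class `ScanAssessment`, its field names, and the function `label_scan` are hypothetical, and the handling of exactly 25% peripheral involvement (treated here as gradable) is an assumption because the text specifies only >25% and <25%.

```python
from dataclasses import dataclass


@dataclass
class ScanAssessment:
    signal_strength: int             # Cirrus SS, 0 (worst) to 10 (best)
    artifact_in_rnfl_circle: bool    # any artifact influencing the measurement circle
    peripheral_area_affected: float  # fraction of peripheral area with artifacts, 0 to 1


def label_scan(assessment: ScanAssessment) -> str:
    """Return 'ungradable' or 'gradable' following the criteria in the text."""
    if assessment.signal_strength < 5:
        return "ungradable"
    if assessment.artifact_in_rnfl_circle:
        return "ungradable"
    if assessment.peripheral_area_affected > 0.25:
        return "ungradable"
    # Assumption: exactly 25% peripheral involvement falls through to gradable.
    return "gradable"


example_label = label_scan(ScanAssessment(7, False, 0.10))
```

In the study these judgments were made by human assessors reviewing the volumes; the function only encodes the final decision logic once those judgments are available.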
Data augmentation strategies, including random flipping, random rotating, and random shifting, were used to enhance the training samples and alleviate overfitting. The original SDOCT volumes had a size of 200 × 200 × 1024 along the x, y, and z axes, respectively. To mimic the real SDOCT imaging in the clinics, some data augmentation methods were applied on only one or two axes for the whole volume. For instance, 20% chance random flipping and 15-deg random rotation were applied on only x (200) and y-axes (200), respectively. The color channel was set to 1 since all OCT images were grayscale.
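A minimal sketch of axis-restricted augmentation is given below, assuming NumPy arrays laid out as (x, y, z). It covers only random flipping (with the stated 20% chance) and random shifting; rotation is omitted for brevity. The function name `augment_volume`, the choice of the y axis for shifting, and the `max_shift` value are illustrative assumptions, since the text does not report them.

```python
import numpy as np


def augment_volume(volume, rng, flip_prob=0.2, max_shift=10):
    """Randomly flip along x and shift along y; the z (depth) axis is untouched."""
    out = volume
    if rng.random() < flip_prob:
        out = out[::-1, :, :]  # mirror the volume along the x axis
    # Assumed shift range in voxels; vacated voxels are filled with zeros.
    shift = int(rng.integers(-max_shift, max_shift + 1))
    if shift != 0:
        shifted = np.zeros_like(out)
        if shift > 0:
            shifted[:, shift:, :] = out[:, :-shift, :]
        else:
            shifted[:, :shift, :] = out[:, -shift:, :]
        out = shifted
    return np.ascontiguousarray(out)


# Small stand-in volume; a real scan would be 200 x 200 x 1024.
rng = np.random.default_rng(0)
toy_volume = rng.random((8, 8, 16)).astype(np.float32)
augmented = augment_volume(toy_volume, rng, max_shift=2)
```

Restricting each transform to specific axes keeps the augmented volumes anatomically plausible, for example by never flipping the depth axis.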

Irrelevancy Reduction and Attention Mechanism
Generally, for this specific task, i.e., discriminating the ungradability from an SDOCT scan, there is a large amount of information that could obscure the ungradable features, such as the anatomic changes of ONH, the shadow of vessels, and the noise speckles in the choroid or vitreous. Moreover, the features in ungradable SDOCT volumes do not follow any specific pattern, which may lead the neural networks to misinterpret the appearance of the aforementioned irrelevant information as ungradable features. To address the problem, we trialed two methods, irrelevancy reduction and an attention mechanism, for better model performance.
Irrelevancy reduction omits the parts of irrelevant signals that should not be noticed by the signal receiver, which potentially improves the performance. 22 Intuitively, denoising was used as one of the strategies to reduce the irrelevancies of OCT scans since the noise of SDOCT scans impedes medical analysis, whether visual or programmatic. 23 Thus, in experiment 1, we used a model based on ResNet blocks to compare the performance between the original and the irrelevancy-reduced data. For denoising, we used nonlocal means 24 as the strategy, which was applied both vertically (along x, z facets) and horizontally (along x, y facets) with different sets of parameters. Vertically, the template window size was set to 10, whereas the search window size was set to 5 with a filter strength of 5. Horizontally, the template window size was set to 5, and the search window size was set to 5 with a filter strength of 5.
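To make the roles of the template window, search window, and filter strength concrete, a plain NumPy sketch of 2-D nonlocal means follows. It is not the implementation used in the study, which the text does not name. The function `nonlocal_means_2d` is hypothetical and is parameterized by radii, so a window size of 5 corresponds to a radius of 2; the even template size of 10 reported for the vertical direction has no symmetric equivalent in this sketch and is not reproduced.

```python
import numpy as np


def nonlocal_means_2d(image, template_radius=2, search_radius=2, h=5.0):
    """Replace each pixel by a patch-similarity-weighted mean of nearby pixels."""
    img = np.asarray(image, dtype=np.float64)
    t = template_radius
    pad = template_radius + search_radius
    padded = np.pad(img, pad, mode="reflect")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            ci, cj = i + pad, j + pad  # pixel position in padded coordinates
            reference = padded[ci - t:ci + t + 1, cj - t:cj + t + 1]
            weight_sum, value_sum = 0.0, 0.0
            # Compare the reference patch with every patch in the search window.
            for di in range(-search_radius, search_radius + 1):
                for dj in range(-search_radius, search_radius + 1):
                    ni, nj = ci + di, cj + dj
                    patch = padded[ni - t:ni + t + 1, nj - t:nj + t + 1]
                    dist2 = np.mean((reference - patch) ** 2)
                    weight = np.exp(-dist2 / (h * h))  # h is the filter strength
                    weight_sum += weight
                    value_sum += weight * padded[ni, nj]
            out[i, j] = value_sum / weight_sum
    return out


# Toy noisy slice standing in for one facet of an SDOCT volume.
rng = np.random.default_rng(0)
b_scan = 100.0 + rng.normal(0.0, 10.0, size=(12, 12))
denoised = nonlocal_means_2d(b_scan)
```

A larger filter strength makes dissimilar patches contribute more, which smooths more aggressively; applying the filter slice by slice along two orientations, as described above, would call such a function once per facet.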
In experiment 2, we applied a self-attention mechanism by combining the SE block, which introduces a channel-wise attention mechanism, with the ResNet model. The self-attention mechanism could make the model pay attention to the more important areas and extract features automatically in the original SDOCT volumes. Furthermore, we experimented on the combination of data denoising and the attention mechanism by training the SE-ResNet model with denoised volumes, with the aim of achieving a better performance.
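The channel-wise attention of an SE block can be sketched as a forward pass in NumPy. This is an illustration rather than the trained Keras layer: the function `se_block`, the random weights, and the omission of bias terms are assumptions, while the 32 channels and the reduction ratio of 4 follow the settings reported for the models in this paper.

```python
import numpy as np


def se_block(features, w_reduce, w_expand):
    """Squeeze-and-excitation on a channels-last 3-D feature map.

    features: (depth, height, width, channels)
    w_reduce: (channels, channels // ratio), w_expand: (channels // ratio, channels)
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = features.mean(axis=(0, 1, 2))
    # Excitation: bottleneck with ReLU, then sigmoid gates in (0, 1).
    hidden = np.maximum(squeezed @ w_reduce, 0.0)
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w_expand)))
    # Scale: reweight every channel by its gate (broadcast over the last axis).
    return features * gates, gates


rng = np.random.default_rng(0)
channels, ratio = 32, 4
features = rng.random((4, 4, 4, channels))
w_reduce = rng.normal(0.0, 0.1, size=(channels, channels // ratio))
w_expand = rng.normal(0.0, 0.1, size=(channels // ratio, channels))
scaled, gates = se_block(features, w_reduce, w_expand)
```

Because the gates depend on the pooled content of the whole volume, channels that respond mostly to noise can be down-weighted per input, which is the behavior the experiment relies on.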
In experiment 3, we substituted the ResNet blocks with ResNeXt blocks with the consideration of the performance improvement. Then we fine-tuned the cardinality of transformation layers to reduce the GPU cost.

Development and Validation of the Deep Learning System
The model for the DLS was implemented with Keras and TensorFlow, on a workstation equipped with i9-7900X and Nvidia GeForce GTX 1080Ti. Figure 2(a) shows the building block of the ResNet model. First, a convolution layer with 32 filters, a 7 × 7 × 7 kernel, and a stride of 2 was applied, along with a 3 × 3 × 3 max pooling with the same stride setting. Second, the obtained feature maps went through 18 ResNet blocks. Average pooling with a pool size of 2 and a stride of 2 was performed every three blocks to aggregate the learned features. Channel-wise batch normalization and rectified linear unit activation were performed after all convolution operations. Finally, a global average pooling followed by a fully connected softmax layer was used to produce the binary output as gradable or ungradable. This ResNet-based model was taken as the benchmark model. Next, we further experimented with the SE-ResNet-block 20 [ Fig. 2
In each SE-ResNet or SE-ResNeXt block, the SE reduction ratio was set to 4 and the cardinalities of the transformation layer were set to 8, with 32 filters. The constructed models are depicted in Fig. 2(d).
A total of 1353 ungradable and 4246 gradable SDOCT volumetric scans collected from CUHK Eye Center and HKEH were randomly divided for training (80%) and primary validation (20%). In each set, we kept a similar distribution of ungradable versus gradable scans and assigned the eyes from the same patient to the same set in order to prevent data leakage and biased estimation of the performance. Cross entropy and Adam were used as the loss function and the optimizer. During the training, 3000 volumetric scans were selected with data balancing. Batch size was set to 1 due to the limited GPU memory. The initial learning rate was set to 0.0001 and was then multiplied by 0.75 every two epochs. In addition, to validate the generalizability of the proposed DLS, SDOCT scans from PWH (181 ungradable versus 530 gradable) and TMEC (60 ungradable versus 238 gradable) were utilized for external validation separately.
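Two elements of this training setup lend themselves to a short sketch: the step learning-rate decay and the patient-level split that prevents leakage. The function names, the zero-based epoch indexing, and the toy scan identifiers are assumptions; the sketch also omits the class-distribution matching and data balancing described above.

```python
import random


def learning_rate(epoch, initial=1e-4, factor=0.75, step=2):
    """Learning rate at a zero-based epoch: multiplied by 0.75 every two epochs."""
    return initial * factor ** (epoch // step)


def split_by_patient(scan_to_patient, val_fraction=0.2, seed=0):
    """Split scans so that all scans of one patient land in the same set."""
    patients = sorted(set(scan_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_val = max(1, round(len(patients) * val_fraction))
    val_patients = set(patients[:n_val])
    train = [s for s, p in scan_to_patient.items() if p not in val_patients]
    val = [s for s, p in scan_to_patient.items() if p in val_patients]
    return train, val


# Hypothetical identifiers: 30 scans from 10 patients, 3 scans each.
scans = {f"scan_{i}": f"patient_{i // 3}" for i in range(30)}
train_scans, val_scans = split_by_patient(scans)
schedule = [learning_rate(epoch) for epoch in range(6)]
```

Splitting on patients rather than on scans means the 80/20 ratio holds only approximately at the scan level, which is the usual price for avoiding leakage between sets.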

Evaluation Metrics
In the experiments, the area under the receiver operating characteristic (ROC) curve (AUC) with 95% confidence intervals (CIs), sensitivity, specificity, and accuracy were used to evaluate the diagnostic performance of the DLS discriminating ungradable or gradable scans. Training-validation loss curves were observed (Fig. 3). Heatmaps were generated by gradient-weighted class activation map (Grad-CAM) 25 to evaluate the performance qualitatively.
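These metrics can be computed from labels and model scores as in the sketch below. It assumes that ungradable is the positive class (label 1) and that a threshold of 0.5 is applied to the score; neither is stated in the text, and the CI computation is not shown because the method is not described. All names and numbers are illustrative.

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy for binary labels (1 = ungradable)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


def auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for eight scans.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05]
y_pred = [int(s >= 0.5) for s in scores]
metrics = confusion_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

The AUC is threshold-free, whereas sensitivity, specificity, and accuracy all depend on the chosen operating point, which is why the two kinds of metric are reported together.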

Performance Comparison
We tested the feasibility of irrelevancy reduction and attention mechanism in experiments 1 and 2, respectively. We also explored whether the performance would improve by combining the two approaches. Experiment 3 was further performed by refining the model structures. The experimental results are shown in Table 2.  In experiment 2, the SE block was implemented to introduce the channel-wise attention to the benchmark model, which could help the method suppress the noisy features in favor of the more essential features for discriminating ungradable patterns. As illustrated in Figs. 3(c) and 3(d), with the introduced attention mechanism, the training-validation loss converged well without significant oscillations. As shown in Table 2 and Fig. 4, the SE-ResNet model fed with original volumes performed much better than  Fig. 4. More importantly, the overall diagnostic performance of SE-ResNeXt fed with denoised volumes was the best with sensitivity of 86.2% (95% CI: 80.0% to 92.4%), specificity of 92.6% (95% CI: 86.8% to 96.9%), and accuracy of 91.0% (95% CI: 87.3% to 93.5%) in primary validation and sensitivities of 69.1% (95% CI: 58.0%  In general, the model with denoised scans was better than the one with original scans with a significant improvement on the primary dataset and similar performance on external validations. The introduced SE blocks achieved a comparable result even on original volumes. This indicated that either irrelevancy reduction or attention mechanism could significantly improve the performance, compared to the benchmark model (ResNet) fed with original volumes. Moreover, our proposed method combining both irrelevancy reduction and attention mechanism achieved the highest AUCs in our experiments in both primary and external validations (primary validation: 0.954 versus 0.640 to 0.943, external validation 1: 0.816 versus 0.535 to 0.815, external validation 2: 0.857 versus 0.697 to 0.857). In terms of the other diagnostic metrics, such as sensitivity, specificity, and accuracy, this model also generally outperformed the other models in both primary and external validations.

Qualitative Evaluation
We generated heatmaps (Fig. 5) based on the best performing model (SE-ResNeXt fed with denoised volumes), where the red-orange color represented more discriminative areas for ungradable information. We observed that for the correctly classified ungradable volumes, there was no regular pattern in the area highlighted by the DLS due to the variances from different artifacts. However, in general, we still found that the DLS could detect ungradable features well, especially signal loss, mirror artifacts, or blurriness. Areas in which these artifacts appeared were covered by warmer colors.

Discussion
In this study, we developed and validated a 3-D DLS to discriminate ungradable SDOCT ONH volumetric scans automatically. The proposed method, SE-ResNeXt fed with denoised volumes, achieved the best performance in both primary and external validations. Experimental results show that ungradable SDOCT volumetric scans can be discriminated without any human intervention. It may potentially increase the efficiency of SDOCT image quality control and further help with disease detection, which could be a novel application in clinics.
Our proposed DLS offers a powerful tool for filtering out ungradable scans in clinics. Currently, one of the main challenges for SDOCT image quality control is the irregular ungradable feature patterns due to varying artifacts. Using a traditional index for the quality assessment, such as SS, is insufficient to assess different kinds of artifacts such as off-centration, out of registration, motion artifacts, and mirror artifacts. Scans with acceptable SS could also be ungradable for disease detection, as illustrated by the examples in Figs. 1(c) and 1(d). Nevertheless, manually assessing all the scans would be tedious and impractical in clinics. To address this problem, a DLS with the proposed SE-ResNeXt model was developed and trained for the auto-assessment. From experiment 1, we found that the overall diagnostic performance improved significantly by denoising, which indicated that the irrelevant information in SDOCT scans could strongly affect the model training. Better performance and better generalizability were obtained by reducing the irrelevant information. In experiment 2, we introduced the attention mechanism to further extract the important features out of the whole SDOCT volume automatically. The AUCs of SE-ResNet models were significantly increased, compared to ResNet models. In addition, the results showed that combining the attention mechanism with irrelevancy reduction (nonlocal means denoising) gave a more stable training-validation curve and outperformed either of the two strategies alone. This suggested that channel-wise attention could help the model learn from noisy data with a much more stable loss and better generalizability.
In experiment 3, we replaced all the ResNet blocks with ResNeXt blocks for a better performance and a lower GPU cost. Our final proposed model was built on the SE-ResNeXt structure and trained with denoised full-size SDOCT volumes. According to the activation heatmaps in Fig. 5, the proposed model learned ungradable features similar to what human assessors would observe. Referring to the color bar, the red-orange color represented more discriminative areas for ungradable information, with values of 0.8 to 1, whereas the blue-purple color represented the nondiscriminative areas, with values of 0 to 0.2. In general, we found that the DLS could detect ungradable features such as signal loss, mirror artifacts, and severe motion artifacts well. Meanwhile, some ungradable features, i.e., blurriness or optic disc dislocation, were highlighted on the whole retina. Warmer colors seldom appeared in the correctly classified gradable volumes, and the relatively highlighted regions were mainly distributed in the vitreous and choroid for almost every correctly classified gradable volume. This might be caused by the appearance of more noise speckles in the vitreous and choroid, compared to the retina. The results from all the experiments illustrated that our proposed model trained with the denoised full-size volumes was the best-fitted model and also achieved the best diagnostic performance among all the models in both primary and external validations.
It is hard to apply traditional computer-aided image quality control to a new dataset since the hand-crafted features are usually based on objective features, either geometric or structural quality parameters, while some features are subjective in real-world cases. A previous study on MRI image quality control also showed that deep neural networks achieved an overall better performance compared with the traditional machine learning method. 26 In our work, the proposed DLS achieved good performance on two totally unseen datasets from different clinics, which suggests that the model generalizes well and may be applied to other clinics directly.
At present, multiple DLSs have been developed based on OCT in ophthalmology, such as referable retina diseases detection, 27,28 glaucoma quantification and classification, [29][30][31] and antivascular endothelial growth factor treatment. 32 These studies underscored the promise of DL to lower the cost of disease interpretation from OCT images. Hence, it would be necessary to filter out ungradable images beforehand for a better precision. However, at present, most ungradable images are filtered out manually before ground-truth labeling for abnormalities. Thus, our DLS could potentially be incorporated with other DLSs for further disease detection. Another important future application of our DLS is installation in SDOCT machines so that operators could be informed to repeat image acquisitions immediately if the DLS classifies the acquired image as ungradable. It would largely alleviate the burden of manual image quality control and efficiently provide images with better quality for further analysis.
Several aspects strengthen the training and evaluation of our proposed model. First, for precise labeling, highly trained SDOCT human assessors reviewed both volumetric scans and reports rather than printout reports only. Second, the external validation datasets were collected from different eye clinics, which enlarged the distribution variances of the dataset. Third, we generated activation heatmaps to visualize the discriminative regions behind the model output. However, in our study, only optic disc scans from one type of SDOCT device were used, which might limit the applicability to other devices. In the next version, we shall develop a DLS trained with more kinds of scans, such as macular scans, from various types of devices. In addition, 3-D CNNs consume more GPU memory, which might cause considerable extra cost for clinical usage. In the future, model compression shall be applied to reduce the GPU memory cost.

Conclusions
Image quality control for the SDOCT volumetric scans is vital for accurate disease detection. Since it is time-consuming and requires the expertise of human graders, manual assessment for every volumetric scan would be tedious and even unfeasible, especially in a clinical center without experienced graders. To improve the efficiency and accuracy of image quality control, a computer-aided system based on DL was developed in our study.
The proposed DLS utilized irrelevancy reduction methods and an attention mechanism and achieved the best diagnostic performance, with the highest AUCs and better sensitivity, specificity, and accuracy in both primary and external validations, compared with the other experimented models. Together with the observations from the heatmaps, this suggests that the proposed DLS learns features similar to those used by human assessors. Hence, as an automated filtering system, our proposed DLS could give more accurate and reasonable predictions. It would further advance the research on SDOCT image quality control as well as make SDOCT more feasible and reliable for disease detection.

Disclosures
No conflict of interest exists for any of the authors.