Lateral and axial measurement differences between spectral-domain optical coherence tomography systems

Abstract. We assessed the reproducibility of lateral and axial measurements performed with spectral-domain optical coherence tomography (SDOCT) instruments from a single manufacturer and across several manufacturers. One human retina phantom was imaged on two instruments each from four SDOCT platforms: Zeiss Cirrus, Heidelberg Spectralis, Bioptigen SDOIS, and hand-held Bioptigen Envisu. Built-in software calipers were used to perform manual measurements of a fixed lateral width (LW), central foveal thickness (CFT), and parafoveal thickness (PFT) 1 mm from foveal center. Inter- and intraplatform reproducibilities were assessed with analysis of variance and Tukey-Kramer tests. The range of measurements between platforms was 5171 to 5290 μm for mean LW (p<0.001), 162 to 196 μm for mean CFT (p<0.001), and 267 to 316 μm for mean PFT (p<0.001). All SDOCT platforms had significant differences between each other for all measurements, except LW between Bioptigen SDOIS and Envisu (p=0.27). Intraplatform differences were significantly smaller than interplatform differences for LW (p=0.020), CFT (p=0.045), and PFT (p=0.004). Conversion factors were generated for lateral and axial scaling between SDOCT platforms. Lateral and axial manual measurements have greater variance across different SDOCT platforms than between instruments from the same platform. Conversion factors for measurements from different platforms can produce normalized values for patient care and clinical studies.


Introduction
Optical coherence tomography (OCT) provides high-resolution, cross-sectional tomographic images of the human retina and permits direct evaluation of retinal thickness. 1 Recent technological developments in spectral-domain OCT (SDOCT) have greatly increased imaging capabilities compared to earlier time-domain technology. SDOCT provides estimates of retinal layer thicknesses across the macula to aid in clinical diagnosis and treatment decisions for a variety of ocular diseases. [2][3][4][5][6] Interpretation of data has been complicated by the variety of platforms designed by commercial SDOCT instrument manufacturers, each with different proprietary software technologies. Previous studies have identified OCT-derived retinal thickness measurement variability due to differences in their segmentation algorithms, their reported axial resolutions in tissue, their scan density options, and their ability to correct for subject fixation. [7][8][9][10][11][12][13] Additional anatomic factors vary between individual patients, including axial length, refractive focal length, and macular curvature. 14 These anatomic variations may affect the accuracy of comparing lateral and axial measurements between SDOCT instruments in clinical studies. 14 Other studies have addressed measurement differences inherent to individual instruments with the same time-domain OCT (TDOCT) platform. [15][16][17] These TDOCT studies have used large sample sizes and built-in retinal segmentation software to show retinal thickness measurements with widespread variation between instruments, but differences reported in each study were not consistent. [15][16][17] A model eye eliminates variability caused by anatomic differences between human patients and by potential morphologic changes between imaging sessions due to diurnal fluctuations, vascular changes, head tilt, or subject fixation. 
In a recent study, a customized model eye with a retinal nerve fiber layer phantom has been used to assess thickness differences between SDOCT platforms and individual instruments. 18 However, this study used automated retinal segmentation software from each SDOCT platform, which causes reproducible thickness differences between platforms by using different anatomic definitions to identify retinal layer boundaries. 7-10 Furthermore, previous studies have not addressed SDOCT measurements of lateral width, which are important for novel SDOCT methods of disease analysis, such as drusen diameter and geographic atrophy in age-related macular degeneration. 6 Accurate interpretation of retinal measurements for the treatment of macular diseases and for clinical research requires consistency and reproducibility between different SDOCT platforms and between instruments from the same platform. Significant differences in the quantitative measurements obtained manually from different SDOCT platforms may support the use of a conversion scale to compare data obtained from different systems. The purpose of this study is to determine the variability of lateral and axial retinal measurements among SDOCT instruments from the same commercial platform and across different systems.

Model Eye
A commercially available Rowe model eye (Rowe Technical Designs, Orange County, California) was selected for SDOCT imaging in this study. The manufacturer's technical details describe the solid-state retinal tissue phantom as a 4.8-mm-diameter cylinder made of translucent polymethyl methacrylate. 19 The retinal tissue phantom has ∼300 μm thickness in the axial plane and a central depression of 0.9 mm radius and 180 μm central thickness, designed to simulate the natural foveal pit. 19 A single model eye was used for all imaging. The model eye was removed and realigned on the same horizontal and vertical axis prior to each scan in order to reduce error from image tilt between different instruments. Alignment was confirmed by securing the model eye to a bracket attached to each SDOCT instrument and then centering the flat base of the tissue phantom with the 0-deg horizontal axis on the display screen. This process was repeated for every scan obtained with each instrument. Portable instruments were held and centered by hand with the 0-deg horizontal axis on the display screen.

SDOCT Instruments and Imaging Protocols
Eight separate SDOCT instruments were selected from three manufacturers and four SDOCT system platforms. We used two Spectralis devices (Spectralis™ OCT software version 5.3, Heidelberg Engineering, Carlsbad, California), two Cirrus devices (Cirrus™ HDOCT software version 5.2, Carl Zeiss Meditec, Dublin, California), and four Bioptigen OCT devices: two portable hand-held Envisu devices and two tabletop SDOIS devices (Envisu™ software version 2.0 and SDOIS software version 1.3, Bioptigen Inc., Morrisville, North Carolina).
All systems used superluminescent diode light sources with broad bandwidths centered between 800 and 900 nm, achieving an axial resolution of ∼5 μm per pixel. In order to make fair comparisons between instruments, raster scanning protocols were matched between platforms as closely as permitted by their respective software. The Cirrus platform (840 nm) and both Bioptigen platforms (820 nm) captured 6 × 6-mm raster scans consisting of 128 B-scans with 512 A-scans per B-scan. Due to its software restrictions, the Spectralis platform (870 nm) captured 20 deg × 20 deg raster scans (∼5.9 × 5.9 mm) consisting of 97 B-scans with 512 A-scans per B-scan. To assess reproducibility, 10 raster scans were performed on each instrument. Because the refractive index of the Rowe model eye differs from that of the average human eye, dispersion compensation was optimized during imaging for scans from both Bioptigen platforms. Cirrus and Spectralis software performed automatic optimization of dispersion during scan acquisition.

SDOCT Measurements and Statistical Analysis
Two graders viewed all SDOCT scans and agreed upon the one B-scan with the minimum central thickness that best approximated the foveal center of the retinal tissue phantom. Images were viewed in each platform's standard display screen, and image processing was not allowed (i.e., magnification, brightness, contrast, summation, or Gaussian smoothing). Each grader performed measurements on the central B-scan of 10 raster scans obtained with each SDOCT instrument in masked and independent fashion. We selected anatomic landmarks on the tissue phantom that could be readily identified and measured in the lateral or axial planes of the central B-scan image. The lateral measurement was performed on the lateral width (LW) of the tissue phantom. Axial measurements were performed on the central foveal thickness (CFT), parafoveal thickness (PFT) at 1 mm to the left of center, and PFT at 1 mm to the right of center. These measurements included the largest dimensions of the tissue phantom in the lateral and axial planes in order to capture as much range of error as possible across SDOCT platforms. Figure 1 shows the borders defined for each manual measurement on different SDOCT platforms. Instruments from the same SDOCT platform had the same version of software and built-in screen calipers to take manual measurements. On all platforms, measurement accuracy was limited by pixel resolution and automatically converted to microns or millimeters by built-in software.
Intergrader reproducibility of retinal measurements was assessed with intraclass correlation coefficients (ICC) and 95% confidence intervals (CI). Due to high intergrader agreement, data from both graders were combined to assess intraplatform variability between instruments and interplatform variability between SDOCT systems. Coefficients of variance (COV) were calculated for each instrument and measurement, and instruments were compared with two-tailed t-tests. Intra- and interplatform differences for each measurement were assessed with analysis of variance models and Tukey-Kramer tests. All statistical analysis was performed with SAS statistical modeling software (SAS JMP 10, SAS Institute, Cary, North Carolina), and p values <0.05 were considered statistically significant.
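As an illustration of two of the quantities named above, the following minimal Python sketch computes a coefficient of variance and the F statistic of a one-way analysis of variance. It is not the SAS JMP procedure used in this study. The function names are hypothetical, the COV is expressed as a percentage of the mean (an assumption about the reporting convention), and the sample values are placeholders rather than study data. The p value and the Tukey-Kramer comparisons are omitted because they require distribution tables that the Python standard library does not provide.

```python
from statistics import mean, stdev


def coefficient_of_variance(values):
    """Sample standard deviation expressed as a percentage of the mean.

    Assumption: COV is reported as a percentage; the source does not
    state the convention explicitly.
    """
    return 100.0 * stdev(values) / mean(values)


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of repeated measurements
    per instrument or platform.
    """
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual measurements around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Placeholder measurements (arbitrary units), NOT data from this study.
placeholder_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(placeholder_groups)
cov_first_group = coefficient_of_variance(placeholder_groups[0])
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom indicates that group means differ by more than the within-group scatter would explain.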

Results
Qualitative image differences were observed between SDOCT platforms (Fig. 1). Spectralis instruments suppressed the most reflections, but signal suppression also complicated layer identification and observer measurements. Cirrus scan images appeared to be more saturated, illustrated by broadening of the hyperreflective bands created by laminations within the tissue phantom. Images from the Bioptigen systems (Envisu and SDOIS) had intermediate signal strength and were similar in appearance to each other.

Intergrader Reproducibility
There was excellent agreement between the two independent graders, with similar mean and standard deviation obtained for each measurement (Table 1). There was good agreement for LW measurements (ICC 0.71, CI 0.58 to 0.80). There was excellent agreement for all axial thickness measurements (ICC ≥ 0.95 for central and PFT measurements). These results showed excellent reproducibility of SDOCT image acquisition and measurement with the model eye.

Intraplatform Reproducibility Between Instruments
The differences between instruments from the same manufacturer and differences between SDOCT platforms are shown in Table 2. Serial measurements on each instrument were tightly grouped; however, average measurements between instruments were significantly different for all SDOCT platforms. For LW measurements, Spectralis had the greatest variance between two instruments (17-μm difference in mean width, p = 0.002) and Bioptigen SDOIS had the least (4-μm difference in mean width, p = 0.042). For LW measurements, Spectralis had the greatest single-instrument variance (COV = 1.309) and Bioptigen Envisu had the least (COV = 0.236). For CFT measurements, Cirrus had the greatest variance between instruments (9-μm difference in mean CFT, p < 0.001) and Bioptigen Envisu had the least (3-μm difference in mean CFT, p < 0.001). For PFT measurements, Bioptigen Envisu had the greatest variance between instruments (9-μm difference in mean PFT, p < 0.001), whereas Cirrus and Bioptigen SDOIS had the least (2-μm difference in mean PFT, p = 0.037 and p = 0.016, respectively).

Interplatform Reproducibility Between Systems
Results of comparison between SDOCT platforms are shown in Table 3. All measurements between different SDOCT platforms were significantly different, except for the difference in LW measurements between two SDOCT platforms from the same manufacturer, Bioptigen SDOIS and Envisu (p = 0.272). Mean LW measurement differences ranged between 15 μm (Envisu versus SDOIS, 0.3%) and 106 μm (Cirrus versus Spectralis, 2%) among different SDOCT platforms. Mean axial thickness measurement differences ranged between 5 μm (Cirrus versus Spectralis, 1.1%) and 45 μm (Cirrus versus SDOIS, 17%) among different SDOCT platforms. Differences between instruments from the same platform were significantly smaller than between different platforms for lateral and axial measurements, including LW (p = 0.020), CFT (p = 0.045), and PFT (p = 0.004). Conversion factors were calculated from mean single-platform measurements in order to allow investigators to translate quantitative data from one SDOCT platform to another. Conversion factors are presented for LW scaling in Table 4 and axial thickness scaling in Table 5.
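One way such conversion factors could be derived is sketched below in Python. The sketch assumes that each factor is the simple ratio of two platform means, which is an assumption about the calculation and may not match the method behind Tables 4 and 5. The function name, platform labels, and mean values are hypothetical placeholders and are not values from this study.

```python
def conversion_factors(platform_means):
    """Build a table of pairwise scaling factors from platform means.

    factors[src][dst] multiplies a measurement taken on `src` to
    estimate the equivalent measurement on `dst`.
    Assumption: a factor is the plain ratio of the two platform means.
    """
    return {
        src: {dst: dst_mean / src_mean for dst, dst_mean in platform_means.items()}
        for src, src_mean in platform_means.items()
    }


# Hypothetical placeholder means in microns, NOT values from Tables 2 to 5.
example_means = {"platform_A": 5200.0, "platform_B": 5250.0}
factors = conversion_factors(example_means)

# Rescale a 1000-micron width measured on platform A to platform B units.
converted = 1000.0 * factors["platform_A"]["platform_B"]
```

Under this ratio convention, the factor from A to B is the reciprocal of the factor from B to A, so a value converted in both directions returns to its original magnitude.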

Discussion
This study examined the variability in lateral and axial manual measurements between several commercial SDOCT platforms. Dimensions were measured by hand with each instrument's caliper tool, rather than by the manufacturer's segmentation program. A single model eye was used to test for variability and to serve as a standardized solid-state target for SDOCT imaging. Under consistent imaging conditions, we found statistically significant differences in all lateral and axial manual measurements between instruments from the same manufacturer and different manufacturers, but intraplatform differences between instruments were significantly smaller than interplatform differences. From these results, we generated conversion factors to facilitate the comparison of manual measurements between different SDOCT platforms in future clinical trials and in daily treatment of macular diseases.
Before the appearance of numerous commercial SDOCT systems, several studies looked at errors and variability between instruments with the same platform. [15][16][17] Barkana et al. evaluated several TDOCT instruments and they found substantial differences between devices, few being statistically significant. 16 Interestingly, they found that the differences observed were significantly correlated with signal strength. Our findings differ from Barkana et al. and others, who reported no statistically significant difference between instruments. [15][16][17] However, these reports only evaluated TDOCT instruments and had higher standard deviation of thickness measurements than recent SDOCT studies, in part due to the inferior pixel resolution of TDOCT systems. [7][8][9][10] This study is the first to rigorously compare quantitative manual measurements from several commercial platforms utilizing a commercially available model eye. We decided to evaluate two commercial platforms that are commonly used in human adult imaging, clinical research, and randomized clinical trials. [2][3][4] We chose a commercial hand-held portable platform approved for retinal imaging in pediatric human subjects 5,14,[20][21][22] and in basic animal research. [23][24][25] Furthermore, the largest ongoing randomized trial for age-related macular degeneration (AMD), the NEI-sponsored Age-Related Eye Disease Study 2, exclusively allows the Bioptigen SDOIS platform for its longitudinal, observational ancillary SDOCT study (AREDS2 Ancillary SDOCT Study). 6,26 The baseline dataset and measurements for both control and AMD eyes in this study have been made publicly available. 6 Several studies have concluded that comparing retinal thickness with instruments from different manufacturers is not advised for clinical studies. 7-10
Determining the true variability in these measurements with a cohort of patients would be biased by errors in lateral and axial scaling. For example, Spectralis machines are programmed to offer scan parameters based on degrees of visual angle; however, they provide caliper measurements in millimeter distance. The same visual angle would span a shorter diameter in an eye with shorter axial length, but the distance would be converted to the same millimeter distance as a scan distance on a longer eye. Axial measurement differences may be caused by variability in the default algorithms for automated segmentation line placement, refractive index correction, or dispersion compensation across different SDOCT platforms. Since these calculations are proprietary components of each platform's software, it is difficult for third party investigators to test their separate contributions to measurement variability.
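The lateral scaling error described above can be illustrated with a first-order Python sketch. It assumes that the true retinal distance spanned by a fixed visual angle scales in proportion to axial length, and that the instrument converts angle to millimeters using a single assumed axial length. Both the proportional model and the default value of 24.0 mm are assumptions made for illustration; they are not taken from any manufacturer's documentation, and the function name is hypothetical.

```python
def true_lateral_mm(reported_mm, axial_length_mm, assumed_axial_length_mm=24.0):
    """Rescale a software-reported lateral distance to the eye's own geometry.

    Assumptions (illustrative only):
      * retinal distance per degree is proportional to axial length;
      * the instrument's angle-to-mm conversion uses one fixed axial
        length, here a hypothetical default of 24.0 mm.
    """
    return reported_mm * axial_length_mm / assumed_axial_length_mm


# A nominal 5.9-mm scan width reported for eyes of different axial length.
width_reference_eye = true_lateral_mm(5.9, axial_length_mm=24.0)
width_short_eye = true_lateral_mm(5.9, axial_length_mm=22.0)
width_long_eye = true_lateral_mm(5.9, axial_length_mm=26.0)
```

Under these assumptions, the same reported width corresponds to a smaller true distance in the shorter eye and a larger one in the longer eye, which is the bias that a fixed model eye avoids.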
We have also demonstrated statistically significant variability in manual measurements of a single retinal tissue phantom between two different instruments with the same SDOCT platform. Variability between these instruments may result from inherent variability in the optical path length measured at two different time points, variability in the degree of decalibration between instruments that occurs over time with regular use, or measurement variability caused by speckle noise. We attempted to control for decalibration by selecting same-platform instruments with similar frequency of use in daily clinical care. In SDOCT, speckle noise results from interference between densely packed reflectors, reducing contrast between highly scattering structures in tissue. 27 However, the averaging methods commonly used by commercial SDOCT platforms were not applicable to the motionless imaging protocol of this study, where speckle noise was highly correlated across images and instruments. Figure 1 showed acceptably low image noise, and even state-of-the-art denoising algorithms produce some level of image blur, 27 permitting us to perform measurements on the unprocessed images shown. Based on the small differences between graders (Table 1) and between same-platform instruments (Table 2), we concluded there was negligible effect of speckle noise on measurement variability.
Measurement differences between platforms were statistically significant; however, the clinical significance of this difference is less clear. With the exception of the Bioptigen SDOIS, the SDOCT systems evaluated in this study had low variability from a clinical standpoint, albeit statistically significant. Lateral scaling variability was 0.3 to 2% between platforms, which represents a range of 15 to 106 μm in width difference between images (based on nominal 6-mm scans divided by sampling density of 512 A-scans). Axial measurements performed in this study suggest that variability across all platforms was 1.1 to 17% between platforms, equivalent to a difference of 5 to 45 μm based on the nominal axial resolution of these SDOCT platforms. Excluding the axial measurements from the Bioptigen SDOIS, which were consistently smaller than all other platforms, the mean difference decreased to <3.7% (∼1.3 pixels or 8 μm) across the other three systems. Low variability between Cirrus, Spectralis, and the portable Envisu system suggested that hand motion or instability of a human operator does not introduce additional error while holding the hand-held probe over the target. These differences may not affect disease management with uniform scanning protocols and manual measurements based on the small number of pixels required for the observed differences and the larger errors associated with automated segmentation, sampling density, and fixation variability. 7,10-13 However, clinical studies gathering repeated measurements over time to evaluate disease modification may obtain statistically significant differences that remain within the range of instrument variability.
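The pixel equivalents of the observed differences can be checked with simple arithmetic, sketched here in Python. The lateral sampling pitch follows from the nominal 6-mm scan width and 512 A-scans per B-scan stated in the methods. The axial pitch of 5 μm per pixel is taken from the nominal axial resolution and is only approximate; the constant and function names are hypothetical.

```python
SCAN_WIDTH_UM = 6000.0          # nominal 6-mm scan width
A_SCANS_PER_B_SCAN = 512        # lateral sampling density
AXIAL_PITCH_UM = 5.0            # assumption: nominal ~5 microns per pixel

# Lateral distance covered by one A-scan.
lateral_pitch_um = SCAN_WIDTH_UM / A_SCANS_PER_B_SCAN


def as_pixels(difference_um, pitch_um):
    """Express a distance in microns as a number of pixels."""
    return difference_um / pitch_um


# Smallest and largest interplatform differences reported in the results.
lateral_px_min = as_pixels(15.0, lateral_pitch_um)
lateral_px_max = as_pixels(106.0, lateral_pitch_um)
axial_px_min = as_pixels(5.0, AXIAL_PITCH_UM)
axial_px_max = as_pixels(45.0, AXIAL_PITCH_UM)
```

With these nominal values, one A-scan spans about 11.7 μm, so the lateral differences of 15 to 106 μm correspond to roughly 1.3 to 9 A-scans, and the axial differences of 5 to 45 μm correspond to about 1 to 9 pixels.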
In conclusion, we have shown significantly greater variability across different platforms than between instruments from the same platform, while controlling for the influence of anatomic variations in human imaging and differences created by automated segmentation programs. This report suggests that clinical investigators may need to account for inherent variances in quantitative SDOCT data collected for clinical trials and routine patient follow-up. Standardized conversion factors may improve the accuracy of data collected from different SDOCT platforms. These conversion tools require further validation with larger samples and human imaging studies. We note that optical imaging instruments may perform differently with eyes of different axial length, refraction, and optical scattering. Accurate quantification of such parameters is part of our ongoing research. Robust, precise, and reproducible conversion factors between commercial SDOCT platforms may allow for the use of a greater range of SDOCT systems in clinical studies and can improve the clinical interpretation of statistically significant differences obtained from study results.