Joint space narrowing (JSN) and bone erosions are important outcome measures in rheumatoid arthritis (RA). While MRI and ultrasound are suited to measuring actual disease activity, conventional radiographs are widely used to assess structural damage. In clinical research, radiological outcome is often expressed with the semiquantitative Sharp–van der Heijde score (SHS).1 This scoring system is not well suited to clinical practice or the analysis of large datasets, because it is time consuming and subject to inter- and intrascorer variability.1 To overcome these limitations, computerized methods have been proposed for the measurement of joint space width (JSW) of finger joints.2–7 These systems promise elimination of subjective issues and a continuous measurement scale, resulting in higher sensitivity and improved reproducibility.4,13,14
Automated JSW measurements depend on the quality of the radiological equipment, on variation in positioning and exposure of the joint, and on the measurement software.9 Evaluating the precision of these measurements is notoriously difficult.4 Several studies have reported good discriminatory ability of automated JSW measurement compared to the SHS.13–16 A few studies of automated measurements found variation in JSW in the healthy population, demonstrating that a decrease in JSW of the metacarpophalangeal (MCP) and proximal interphalangeal (PIP) joints with aging is normal. Against this background noise, it is important, as Finckh indicated, to evaluate the diagnostic performance of automated scoring systems between consecutive radiographs.13
In this study, we present a method to validate the accuracy of an automated JSW measurement system. The process includes systematic evaluation of the sources of errors, from image acquisition to automated measurements, in order to define steps for further improvement of these systems. In contrast to earlier studies of automated JSW measurements, we used serial images without significant progression of damage. We assumed that the absence of change provides the best background to detect measurement errors related to image acquisition and/or software. In order to assess the impact of image acquisition, we used automated measurements on images from a multicenter study made with conventional radiography and standardized digital images from a single hospital.
Patients and Methods
Digital copies of hand radiographs of early RA patients from two studies were obtained. The randomized controlled Combinatietherapie bij Reumatoïde Artritis (COBRA) trial in new-onset RA was performed between 1993 and 1997.17 Conventional radiographic assessments, obtained every 6 months, had been digitized at the same pixel size as the DREAM images. The second dataset originates from the Dutch Rheumatoid Arthritis Monitoring (DREAM) remission induction cohort. This observational study in early RA was performed between 2006 and 2009. Radiographic assessments were obtained every 6 months with up-to-date digital equipment.18 All images were anonymized and analyzed in random order. Results of the scoring of the radiographs according to the SHS method17 were retrieved from the original study data.
Using a power analysis (two-tailed, 99% confidence interval), we estimated that at least 60 pairs of consecutive images from each dataset would comprise sufficient variation for this study. Radiographs were selected on the following criteria: availability of two consecutive hand radiographs (interval between 6 and 12 months) with known SHS; from the COBRA trial, pairs with all joint margins delineated correctly with digital methods in an earlier study;4 from the DREAM cohort, radiographs from a single hospital, using a constant acquisition protocol during the study. To enable adequate comparison, images with a pixel size other than 0.1 mm were excluded. In line with previous studies of automated scoring, our analyses were restricted to MCP 2-5 and PIP 2-5.
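As an illustration of the sample-size reasoning above, the following sketch uses the standard normal-approximation formula for a two-tailed paired comparison. The study's actual standard deviation, detectable difference, and power were not recoverable from the text, so the arguments shown are hypothetical placeholders, not the values used in the study.

```python
import math
from statistics import NormalDist


def pairs_needed(sigma, delta, alpha=0.01, power=0.80):
    """Normal-approximation sample size for a two-tailed paired test.

    sigma: assumed SD of paired JSW differences (mm) -- hypothetical here.
    delta: difference to detect (mm) -- hypothetical here.
    alpha=0.01 mirrors the 99% confidence level mentioned in the text.
    """
    z = NormalDist()
    # n = ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2, rounded up
    n = ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * sigma / delta) ** 2
    return math.ceil(n)
```

With illustrative inputs, e.g. `pairs_needed(0.2, 0.1, alpha=0.05, power=0.8)`, the formula reproduces the textbook result of 32 pairs; the study's choice of at least 60 pairs per dataset presumably reflects its stricter 99% confidence level and its own variance estimates.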
Automated Joint Space Width Measurement
The JSW of MCP and PIP joints was measured by a single operator (YH), blinded to the SHS results. The method is described in detail elsewhere.5,16 Briefly, as a first step, the hand outline is extracted. The next step is detection of the finger midlines, followed by localization of the joints using geometric relationships of the fingers. The joint span is located using the sharpest convex corners of the head of the phalanx, and its margins are determined. The JSW is calculated as the average Euclidean distance in mm from all points on the upper margin to the lower margin and vice versa, within the middle 60% of the joint span. During the measurement process, each joint was visually checked. When the detected joint location was correct but the margin was (partly) inaccurate, the margin was adapted by the operator. An obvious misfit of the joint and/or margins was marked as an algorithm malfunction by the operator, and these joints were excluded from further analysis.
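The distance definition above (average bidirectional Euclidean distance between the two margins, restricted to the middle 60% of the joint span) can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the study's software: margin detection is assumed to have already produced point sets in mm, and nearest-neighbor distances stand in for whatever point correspondence the actual system uses.

```python
import numpy as np


def joint_space_width(upper, lower, span_fraction=0.6):
    """Average bidirectional margin-to-margin distance in mm.

    upper, lower: (N, 2) arrays of (x, y) margin points in mm, as would
    be produced by a margin-detection step (not shown; hypothetical input).
    Only the central `span_fraction` of each margin along the joint span
    (x-axis) is used, mirroring the middle-60% rule in the text.
    """
    def middle(points, frac):
        # keep the central fraction of points, ordered along the x-axis
        points = points[np.argsort(points[:, 0])]
        n = len(points)
        k = int(round(n * (1 - frac) / 2))
        return points[k:n - k]

    u = middle(np.asarray(upper, dtype=float), span_fraction)
    l = middle(np.asarray(lower, dtype=float), span_fraction)
    # full distance matrix between upper and lower margin points
    d = np.linalg.norm(u[:, None, :] - l[None, :, :], axis=2)
    # mean nearest-neighbor distance in both directions, then average
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```

For two parallel horizontal margins 2 mm apart, the function returns 2.0 mm, as expected from the definition.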
The success of automated analysis was calculated as the percentage of joints that could be measured successfully, allowing minimal intervention by the operator. To judge the external validity of the JSW in mm, we compared the average results with comparable measurements reported in the literature.
Precision of the measurements is based on the assumption that two consecutive radiographs with the same SHS JSN score should have the same JSW in mm. When there was no change in SHS as assessed by human observers, a difference between the two measurements (ΔJSW) of more than 0.2 mm was regarded as incorrect. For mild progression of JSN, ΔJSW was regarded as correct within a predefined range; for stronger progression, a different range was chosen. The percentage of joints that showed no discrepancy according to these definitions, and the standard deviation of each dataset (precision), were calculated. To detect the causes of discrepant results, the joint margins marked by the software were presented to two rheumatologists (H. K. and H. M.) experienced in scoring radiographs. They independently classified the cause of each discrepancy as repositioning of the hand, inconsistent exposure, or incorrect delineation of the joints by the software.
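The discrepancy rule above can be sketched as a small classifier. Only the 0.2 mm limit for unchanged joints is stated in the text; the expected ΔJSW range for progressive joints was not recoverable from the source, so `progression_range` below is a hypothetical placeholder, not the study's actual bounds.

```python
def is_discrepant(delta_jsw_mm, delta_shs, no_change_limit=0.2,
                  progression_range=(-1.0, -0.2)):
    """Classify a joint pair as discrepant between SHS and automated JSW.

    delta_jsw_mm: measured JSW change between consecutive radiographs (mm);
    JSN progression corresponds to a negative value.
    delta_shs: change in the human-scored SHS JSN for the same joint.
    progression_range: hypothetical acceptance interval for progressive
    joints -- the study's real interval is not given in the text.
    """
    if delta_shs == 0:
        # unchanged on SHS: any |change| above 0.2 mm counts as incorrect
        return abs(delta_jsw_mm) > no_change_limit
    low, high = progression_range
    # progression on SHS: the measured decrease should fall in the range
    return not (low <= delta_jsw_mm <= high)
```

For example, an unchanged joint with a measured shift of 0.3 mm would be flagged, while one with a 0.1 mm shift would not.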
Results
In the COBRA set, SHS scores were incomplete for 2 of 80 pairs (baseline and follow-up), leaving 78 pairs of radiographs for further analysis. A schematic of the exclusions and failures is presented in Fig. 1. In four pairs of radiographs from the DREAM cohort, the pixel size appeared to be more than 0.1 mm; these were excluded, leaving 76 pairs for measurement. The computerized method recognized joints and performed measurements of JSW automatically in 68 of the 78 sets of radiographs (87%) from the COBRA set. In 10 pairs, the software was unable to recognize both hands, due to the digitization (2 with a smudged image), positioning of the joints (6), or failure of the software (2). In the remaining 68 consecutive paired hands, 544 MCP and 544 PIP joints (1088 individual joints) were measured. In the DREAM dataset, the automated method failed in a single MCP joint due to abnormal positioning of the hand, leaving 75 (99%) successfully measured pairs of digital radiographs (600 MCP and 600 PIP joints) for analysis. There was no damage in 1188/1248 (95%) of the MCP and PIP joints in the COBRA study and in 1129/1200 (94%) in the DREAM cohort.
In 1003/1088 (92.2%) individual joints in the COBRA set, the software could automatically measure the JSW. In 51 (4.7%), it failed to define correct margins, and in 26 (2.4%), to determine an accurate joint span; these could be corrected by the operator. Eight joints (1.5%) were excluded from further analysis because the recognition failure could not be corrected. Causes of these failures were a collapsed joint (1), an abnormal joint margin (1), finger overlap (1), finger tissue overlap (2), and contrast issues (3). This left 536 joint pairs for further analysis. Mean and standard deviation of JSW in mm are given per SHS score in Table 1.
Automated joint space width measurements in MCP and PIP joints in the two series of images related to the Sharp–van der Heijde scores (mean±standard deviation per SHS).
|COBRA (1072 joints)||DREAM (1178 joints)|
|SHS JSN||MCP (mm)||PIP (mm)||MCP (mm)||PIP (mm)|
In the DREAM images, the software was successful without human intervention in 1143/1200 (95.3%) individual joints. Obvious visual misfits of the margins in 27 (2.3%) and of the joint span in 19 (1.6%) were corrected by the operator. Eleven (0.9%) joints were not measurable with the system due to contrast issues, leaving 589 consecutive joint pairs for further analysis. Mean and standard deviation are given per SHS score in Table 1.
To estimate the accuracy of the system under scrutiny, the average results of the measurements in mm for all MCP and PIP joints are compared to published results from other populations and systems in Table 2. This table illustrates an overall agreement with measurements in other studies.3,10,19–21 The JSW of MCP joints with an SHS of one or more in our study matches the measurements by van ‘t Klooster in osteoarthritis (OA).3
JSW in MCP and PIP joints reported in various study populations.
|Study and population||Diagnosis||MCP||PIP|
|COBRA with||Early RA||()||()|
|DREAM with||Early RA||()||()|
|Pfeil et al.20||Hand trauma||()||()|
|Goligher et al.18||Early RA||1.63 (1.54, 1.72) mmb||0.99 (0.94, 1.04) mm 1.28 (1.18, 1.38) mm|
|Females ()||2.0 (1.91, 2.09) mmb|
|Van ‘t Klooster et al.3||Very early or advanced hand OA||a|
|Without OA ()||a|
|With OA ()|
|Angwin et al.11||Early RA||c|
|Burghardt et al.19||RA||MCP2 ()|
|RA ()||MCP3 ()|
|Mandl et al.25||RA||0.3 to 2.7 mm ()|
Mean (95% confidence interval)
As a result of the short interval between consecutive images in early RA, the majority of joints had an unchanged SHS (ΔSHS = 0) (Table 3). For these unchanged joints, the mean and standard deviation of ΔJSW for MCP and PIP joints in both COBRA and DREAM are given in Table 3. This agreement at group level conceals individual joints showing a mismatch, some with a substantial ΔJSW. The precision of the software for unchanged MCP and PIP joints is plotted in Fig. 2 and suggests a better performance in the DREAM images compared to the COBRA images. Progression of JSN was scored “1” in 11 joints (Table 3); the corresponding mean automated measurements for COBRA and DREAM are given in Table 3. Six joints in the COBRA group showed stronger progression; their mean automated measurements are also given in Table 3.
Change in JSW in serial images of the same joint with an interval of 6 to 12 months, related to change in Sharp–van der Heijde score (mean±standard deviation per ΔSHS).
|COBRA (536 pairs)||DREAM (589 pairs)|
|ΔSHS JSN||MCP (mm)||PIP (mm)||MCP (mm)||PIP (mm)|
Finally, an in-depth review was performed of the 55 joint pairs (COBRA 36, DREAM 19) that fulfilled our predefined criteria for discrepancy between ΔSHS and ΔJSW. The causes of discrepancy, as assessed by the two rheumatologists, were: exposure 15% (COBRA 14%, DREAM 17%), repositioning 57% (COBRA 62%, DREAM 46%), incorrect delineation by the software 25% (COBRA 22%, DREAM 32%), and no explanation 3% (COBRA 1%, DREAM 5%). Examples of these causes of error are illustrated in Fig. 3.
Discussion
Fully automated measurement of structural joint damage in RA has been a goal of several research groups. Despite promising results, these systems are not used routinely in trials, let alone in population-wide studies of RA or OA. We believe that structured evaluation will be an important step toward acceptance of these methods. In this study, we describe a stepwise validation that may help to further improve automated JSW measurements. The use of images from different sources and the use of consecutive images without change as a gold standard demonstrate sources of error and emphasize the importance of a strict protocol when making serial radiographs in the follow-up of RA.
There are no standard methods to validate automated measurements of JSW in plain hand radiographs. Several studies have compared JSW in mm with the current gold standard of joint damage in RA, the SHS. In general, these have produced adequate results, often indicating a somewhat improved sensitivity to change with computer-assisted methods. However, the number of failed measurements and/or the need for human intervention during the “automated” measurements is seldom reported in detail. Dissimilar results in individual joints may not be due to failing software, but can be caused by misinterpretation in the SHS, or variation in image acquisition, such as repositioning of the hand in consecutive images or variations in exposure.
The first step in validation is the success rate of a system in automatically locating joints and joint margins in a large number of images. In the two datasets from different sources in this study, the first step of automated location of the joints was more successful in the DREAM images from a single hospital than in the multicenter data from the COBRA trial (99% versus 87%). This underlines the importance of a standard protocol for obtaining serial images. We do not believe that the process of digitization of the COBRA images has influenced these results, since the pixel size we used was equal to that of the DREAM images. Also, we selected COBRA images that had been measured successfully in an earlier study22,23 using an operator-mediated system, which may have biased toward success in these images.
In the second step of automated image analysis, individual joints are measured, which again was slightly more successful (95% versus 92%) in the DREAM images. With operator intervention, the overall performance increased to almost 99%, suggesting that this system provides a rapid tool to measure large numbers of digital radiographs. However, joints that could not be measured because of severe damage need special attention, and future software must be able to assign a correct measure to these in order to gain wider acceptance.
The third step addresses the precision of the measurement result in millimeters. The best validation would be a comparison with micro-CT measurements of the 3-D volume of the joint space, but no CT data were available in this study. The average JSW we found was in line with the group-level results of reported measurements in early RA patients, which supports the quality and comparability of the software.
Progression of RA is measured exclusively in serial images over time; scoring progression means quantifying the changes between these images. As a fourth step of validation of the automated system, we used pairs of images with little or no change over time as a gold standard. The accuracy of our measurements was good at group level in both datasets, with a small mean difference and standard error in repeated images that were unchanged in the eyes of human observers. This promises sufficient sensitivity to measure changes below the annual joint space decrease of 0.3 mm. As has been pointed out, we may need even higher sensitivity given the improved outcomes with current treatment strategies in early RA. In recent years, structural damage has become less common, and therefore the performance of automated measurements in joints with minimal damage is of prime importance. However, a potential system to measure decreases in JSW will also need to be tested in images with more advanced damage than we have studied.
In order to improve automated measurements, we must analyze the instances in which the system fails. To this end, we explored in depth the pairs of images that showed a discrepancy between the automated measurement and the SHS. It turned out that technical factors during the process of image acquisition occurred more frequently than failure of the automated analysis. Differences in positioning and exposure settings caused 9% of the joints to be excluded. Moreover, these influences were the dominant explanation for discrepancies between ΔSHS and automated JSW measurements (repositioning 57% and exposure 15%). Repositioning, as defined by Neumann et al., includes all possible movements of the hand (e.g., flexion, abduction, and adduction), but also radiographic technique, equipment, beam geometry, and operator variability. It is considered a cause of error.9 Over- or underexposure, which is only detectable after acquisition, was not mentioned by Neumann. All of these factors may increase or decrease the apparent change in JSW in serial images. This is confirmed by our analysis and stresses that standardizing procedures will potentially improve the smallest detectable difference of JSW measurement systems. The importance of quantifying minimal changes or structural integrity was discussed by Landewé and Pfeil.24,25
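The smallest detectable difference mentioned above can be estimated from the very data this study collects: the spread of ΔJSW in joints that are unchanged on SHS. The sketch below uses the common Bland–Altman-style convention (z·SD of the differences); it is illustrative only and is not the exact statistic reported in the study.

```python
import math


def sdc_from_unchanged(deltas, z=1.96):
    """Smallest detectable change from ΔJSW of SHS-unchanged joints.

    deltas: ΔJSW values (mm) of joints scored as unchanged by human
    observers, assumed to reflect pure measurement noise around zero.
    Returns z times the sample SD of these differences: only changes
    larger than this exceed noise at the chosen confidence level.
    """
    n = len(deltas)
    mean = sum(deltas) / n
    # sample standard deviation of the paired differences
    sd = math.sqrt(sum((d - mean) ** 2 for d in deltas) / (n - 1))
    return z * sd
```

Under this convention, a noise SD of about 0.1 mm would give a smallest detectable change near 0.2 mm, i.e., below the 0.3 mm annual joint space decrease cited earlier; tighter acquisition protocols shrink the SD and hence this threshold.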
The strength of this validation study is that it is based on a large series of images obtained under different conditions. Furthermore, the use of unchanged joints provides a good reference to validate the accuracy of serial measurements and to explore sources of error. This strength is also a limitation, since we included only a small number of images with serious joint damage. Deformed joints with asymmetric JSN or ankylosis are often hard to measure with fully automated systems. Obviously, it will be important to test systems in groups of patients with longer disease duration. In most studies of automated joint measurements, the lack of measurements of the wrists is another limitation. These are difficult to measure with computerized systems due to the variation in wrist anatomy, repositioning, and the overlap of carpal bones in 2-D images. We may need to explore approaches other than the various efforts to copy the SHS measures of wrist damage. This also raises the question “which joints can reliably and consistently provide information when the goal is to assess long-term damage in arthritis?”26 Further research and development of automated assessment of joint damage will help to solve these issues.
In conclusion, this study confirms that an automated computerized JSW measurement method for MCP and PIP joints can produce measurements that are in line with the labor-intensive SHS. Such systems may provide an alternative to human observation and thereby save time and costs. Standardization of image acquisition is an important and possibly underestimated factor that may help to improve the performance of automated systems. These considerations may support further development of a fully functioning program for quantification of disease progression in RA, which should incorporate not only JSW measurements but also erosion measurements for hands and feet. Accurate measurement of JSW will also be of importance in the follow-up of OA, in particular in trials of potential interventions for this condition. The proposed method should also be further validated in normal and RA joints in order to establish reference values. In the future, such automated programs may be relevant not only for clinical studies but also for the assessment of joint damage in clinical practice. This can happen only after careful validation and comparison of these programs.
We thank Maarten Boers for providing the COBRA-radiographs and the associated SHS-scores. This research was financially supported by the Stichting ReumaOnderzoek Twente (Foundation of Research in Rheumatology Twente) and the Center for Translational Molecular Medicine (CTMM). The authors state no conflict of interest and have nothing to disclose.
Olga Schenk finished her bachelor’s and master’s degree in biomedical engineering at University of Twente in 2013. After that, she started her PhD within the MIRA Research Institute for Biomedical Engineering and Technical Medicine, University of Twente. Her interests are in medical imaging, image analysis, and computer vision. Her current project focuses on improving and evaluating automatic measurement methods of disease progression in rheumatoid arthritis.
Yinghe Huo studied computer engineering, specializing in computer vision, at Leiden University and graduated in 2010. In 2011, he started as a PhD candidate at the Image Sciences Institute, University Medical Center Utrecht, focusing on automated measurement of joint space width in early rheumatoid arthritis hand radiographs. Since 2016, he has been working at the same institute, focusing on automatic segmentation of breast cancer and quantification of its response to chemotherapy in contrast-enhanced MRI images.
Koen L. Vincken graduated in informatics (computer science) at the Delft University of Technology in 1989. In 1995, he received his PhD from Utrecht University. He is currently working as an associate professor at the Image Sciences Institute of the Utrecht Medical Center, Utrecht. He developed several applications that are currently being used by other departments, such as radiology, anatomy, orthopedics and rheumatology. His main interests are medical image applications and quantitative image analysis.
Mart A. van de Laar studied medicine at the University of Amsterdam. After training as an internist, he specialized in rheumatology. His expertise lies in the epidemiology of chronic diseases, especially rheumatic diseases, through the use of ICT support; the analysis of the perception of health; and the development of computer-adaptive testing using modern test theory.
Ina H. H. Kuper studied medicine at the University of Groningen. She was trained as a rheumatologist and has been practicing rheumatology since 1997 in several hospitals. Her expertise lies in the epidemiology of chronic rheumatic diseases, in particular, rheumatoid arthritis.
Kees C. H. Slump received his MSc degree in electrical engineering from Delft University of Technology, the Netherlands, in 1979. In 1984, he obtained his PhD in physics from the University of Groningen, the Netherlands. From 1983 to 1989, he was employed at Philips Medical Systems in Best, the Netherlands, as head of a predevelopment group on x-ray image quality and cardiovascular image processing. In 1989, he joined the Department of Electrical Engineering, University of Twente. In June 1999, he was appointed as a full professor in signal processing. His research interest is in image analysis as part of medical imaging.
Floris P. J. G. Lafeber is a professor of experimental rheumatology at University Medical Centre Utrecht. He is a manager of research and head of the research laboratory of the Department of Rheumatology & Clinical Immunology, UMCU. Currently, his main interest is in novel treatments; diagnostic tools for automated analyses of radiographic joint damage and inflammation in OA, HA, and RA, as well as unique animal in vivo models of osteoarthritis, which have all been developed over the past years.
Hein J. Bernelot Moens has been a practicing rheumatologist in large teaching hospitals since 1986 and has been involved in various research projects involving medical imaging of arthritis and its consequences. His research focuses on automated analysis of radiological damage and optical imaging of synovitis.