Cleaning and harmonizing medical image data for reliable AI: Lessons learned from longitudinal oral cancer natural history study data

Zhiyun Xue; Tochi Oguguo; Kelly J. Yu; Tseng-Cheng Chen; Chun-Hung Hua; Chung Jan Kang; Chih-Yen Chien; Ming-Hsui Tsai; Cheng-Ping Wang; Anil K. Chaturvedi; Sameer Antani

doi:10.1117/12.3005875

2 April 2024

Proceedings Volume 12931, Medical Imaging 2024: Imaging Informatics for Healthcare, Research, and Applications; 129310E (2024) https://doi.org/10.1117/12.3005875
Event: SPIE Medical Imaging, 2024, San Diego, California, United States

Abstract

For deep learning, large and sufficiently diverse data are crucial, but data quality is equally important. In real-world applications, however, raw source data commonly contain incorrect, noisy, inconsistent, improperly formatted, and sometimes missing elements, particularly when the datasets are large and sourced from many sites. In this paper, we present our work on preparing image data for the development of AI-driven approaches to studying various aspects of the natural history of oral cancer. We focus on two aspects: 1) cleaning the image data; and 2) extracting the annotation information. Data cleaning includes removing duplicates, identifying missing data, correcting errors, standardizing datasets, and removing sensitive personal information, with the goal of combining data sourced from different study sites. These steps are often collectively referred to as data harmonization. Annotation information extraction includes identifying crucial or valuable text, manually entered by clinical providers, that relates to the image paths/names, and standardizing the label text. Both are important for successful deep learning algorithm development and data analysis. We provide details on the data under consideration, describe the challenges and issues we observed that motivated our work, and present the specific approaches and methods we used to clean and standardize the image data and extract labeling information. Further, we discuss ways to increase the efficiency of the process and the lessons learned. Research ideas on automating the process with ML-driven techniques are also presented and discussed. Our intent in reporting and discussing this work in detail is to provide insights into automating or, at a minimum, increasing the efficiency of these critical yet often under-reported processes.
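As a rough illustration of three of the cleaning steps named in the abstract (removing duplicates, identifying missing data, and standardizing label text), the following minimal Python sketch may help. It is not the authors' method, which the abstract does not specify. The function names, file paths, and the synonym table are all hypothetical, and duplicate detection here is assumed to mean byte-identical files only.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(files):
    """Group file names whose contents are byte-identical.

    `files` maps a (hypothetical) image path to its raw bytes.
    Assumption: only exact copies count; re-encoded or resized
    copies of the same photo are NOT detected by a content hash.
    """
    groups = defaultdict(list)
    for name, data in files.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    # Keep only hashes shared by more than one file.
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


def find_missing(expected, files):
    """Return paths listed in the study records but absent from the data."""
    return sorted(set(expected) - set(files))


def normalize_label(text, synonyms):
    """Lowercase, collapse whitespace, then map through a synonym table.

    `synonyms` is a hypothetical lookup from free-text variants to one
    standard label; unknown labels are returned in normalized form.
    """
    key = " ".join(text.strip().lower().split())
    return synonyms.get(key, key)


if __name__ == "__main__":
    # Toy in-memory data standing in for images from two study sites.
    files = {
        "siteA/img1.jpg": b"abc",
        "siteB/img1_copy.jpg": b"abc",
        "siteA/img2.jpg": b"xyz",
    }
    print(find_exact_duplicates(files))
    print(find_missing(["siteA/img1.jpg", "siteA/img3.jpg"], files))
    print(normalize_label("  Oral   Cavity ", {"oral cavity": "oral_cavity"}))
```

In practice, a content hash is only a first pass: images that were re-saved or cropped at different sites would need a perceptual comparison, and any synonym table would have to be reviewed by clinical experts.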

(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.

Citation

Zhiyun Xue, Tochi Oguguo, Kelly J. Yu, Tseng-Cheng Chen, Chun-Hung Hua, Chung Jan Kang, Chih-Yen Chien, Ming-Hsui Tsai, Cheng-Ping Wang, Anil K. Chaturvedi, and Sameer Antani "Cleaning and harmonizing medical image data for reliable AI: Lessons learned from longitudinal oral cancer natural history study data", Proc. SPIE 12931, Medical Imaging 2024: Imaging Informatics for Healthcare, Research, and Applications, 129310E (2 April 2024); https://doi.org/10.1117/12.3005875
