Paper
1 June 2020 Extraction of distinctive keywords and articles from untranscribed historical newspaper images
Proceedings Volume 11515, International Workshop on Advanced Imaging Technology (IWAIT) 2020; 115151K (2020) https://doi.org/10.1117/12.2566612
Event: International Workshop on Advanced Imaging Technologies 2020 (IWAIT 2020), 2020, Yogyakarta, Indonesia
Abstract
This paper proposes a novel approach for extracting distinctive keywords from historical newspaper images without using character recognition. We convert an image of the text block on an entire newspaper page into a sequence of codes by discretizing the feature vectors, which avoids the errors introduced by optical character recognition (OCR). This conversion makes it possible to analyze untranscribed newspaper images with text-processing methods. We examine the daily occurrence of every tri-gram string and extract the strings that appear densely as distinctive keywords. In addition, we highlight articles that contain distinctive keywords as distinctive articles. The proposed method was evaluated on an archive of Japanese newspaper images published in the 19th century, and the results were promising.
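The tri-gram step described in the abstract can be illustrated with a minimal sketch. It assumes each day's page has already been converted into a list of integer codes. The abstract does not define "dense appearance", so the sketch reads it, as an assumption, as "most occurrences fall within a short run of consecutive days". The function names and the parameters `window`, `min_total`, and `min_density` are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def trigrams(codes):
    """Return every run of three consecutive codes as a tuple."""
    return zip(codes, codes[1:], codes[2:])


def daily_counts(pages_by_day):
    """Map each tri-gram to a list holding its occurrence count per day."""
    n_days = len(pages_by_day)
    counts = defaultdict(lambda: [0] * n_days)
    for day, codes in enumerate(pages_by_day):
        for gram, count in Counter(trigrams(codes)).items():
            counts[gram][day] = count
    return counts


def density(series, window):
    """Largest share of all occurrences inside one run of `window` days.

    Assumed measure of "dense appearance"; not the paper's definition.
    """
    total = sum(series)
    if total == 0:
        return 0.0
    window = min(window, len(series))
    best = max(
        sum(series[start:start + window])
        for start in range(len(series) - window + 1)
    )
    return best / total


def distinctive_trigrams(pages_by_day, window=2, min_total=3, min_density=0.8):
    """Return tri-grams that are frequent enough and concentrated in time."""
    selected = []
    for gram, series in daily_counts(pages_by_day).items():
        if sum(series) >= min_total and density(series, window) >= min_density:
            selected.append(gram)
    return sorted(selected)


# Toy data: one code sequence per day (codes are made-up integers).
pages = [
    [1, 2, 3, 9, 1, 2, 3],
    [1, 2, 3, 4, 5, 6],
    [4, 5, 6, 7],
    [4, 5, 6, 8],
    [7, 7, 7, 4, 5, 6],
    [0, 0, 0],
]
print(distinctive_trigrams(pages))
```

In this toy data the tri-gram `(1, 2, 3)` occurs only on the first two days and is selected, while `(4, 5, 6)` is spread evenly over four days and is not. Articles containing a selected tri-gram could then be flagged as distinctive, which again is only one reading of the abstract.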
© (2020) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Sora Ito and Kengo Terasawa "Extraction of distinctive keywords and articles from untranscribed historical newspaper images", Proc. SPIE 11515, International Workshop on Advanced Imaging Technology (IWAIT) 2020, 115151K (1 June 2020); https://doi.org/10.1117/12.2566612
PROCEEDINGS
6 PAGES

