8 February 2015 Software workflow for the automatic tagging of medieval manuscript images (SWATI)
Digital methods, tools and algorithms are gaining in importance for the analysis of digitized manuscript collections in the arts and humanities. One example is the BMBF-funded research project “eCodicology”, which aims to design, evaluate and optimize algorithms for the automatic identification of macro- and micro-structural layout features of medieval manuscripts. The main goal of this research project is to give humanities scholars better insight into high-dimensional datasets of medieval manuscripts. The main challenges in designing a workflow for the arts and humanities are the heterogeneous nature and size of the humanities data and the need to create a database of automatically extracted, reproducible features for better statistical and visual analysis. This paper presents a concept of a workflow for the automatic tagging of medieval manuscripts. As a starting point, the workflow uses medieval manuscripts digitized within the scope of the project “Virtual Scriptorium St. Matthias”. First, these digitized manuscripts are ingested into a data repository. Second, specific algorithms are adapted or designed for the identification of macro- and micro-structural layout elements such as page size, writing space and number of lines. Finally, a statistical analysis and scientific evaluation of the manuscript groups are performed. The workflow is designed generically so that large amounts of data can be processed automatically with any desired feature-extraction algorithm. As a result, a database of objectified and reproducible features is created, which helps to analyze and visualize hidden relationships among around 170,000 pages. The workflow shows the potential of automatic image analysis by enabling the processing of a single page in less than a minute. Furthermore, accuracy tests of the workflow on a small set of manuscripts, with respect to features such as page size and text areas, show that automatic and manual analysis are comparable.
The use of a computer cluster will allow high-performance processing of large amounts of data. The software framework itself will be integrated as a service into the DARIAH infrastructure to make it adaptable for a wider range of communities.
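The three stages named in the abstract (ingest into a repository, run pluggable feature extractors, analyze the resulting feature database) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the project's actual implementation: pages are modelled as small grayscale grids, the repository and feature database are plain dictionaries, and the names `page_size`, `writing_space`, `run_workflow` and `summarize` are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List

# Assumed toy representation: a page is a grid of grayscale values,
# where low values are ink and 255 is blank parchment.
Page = List[List[int]]
Extractor = Callable[[Page], Dict[str, float]]


def page_size(page: Page) -> Dict[str, float]:
    """Macro-structural feature: page height and width in pixels."""
    return {"page_height": len(page), "page_width": len(page[0])}


def writing_space(page: Page, ink_threshold: int = 128) -> Dict[str, float]:
    """Bounding box of all pixels darker than the (assumed) ink threshold."""
    rows = [y for y, row in enumerate(page) if any(v < ink_threshold for v in row)]
    cols = [x for x in range(len(page[0])) if any(row[x] < ink_threshold for row in page)]
    if not rows:
        return {"text_height": 0, "text_width": 0}
    return {
        "text_height": rows[-1] - rows[0] + 1,
        "text_width": cols[-1] - cols[0] + 1,
    }


def run_workflow(
    repository: Dict[str, Page], extractors: List[Extractor]
) -> Dict[str, Dict[str, float]]:
    """Apply every registered extractor to every page in the repository."""
    features: Dict[str, Dict[str, float]] = {}
    for page_id, page in repository.items():
        record: Dict[str, float] = {}
        for extract in extractors:
            record.update(extract(page))
        features[page_id] = record
    return features


def summarize(features: Dict[str, Dict[str, float]], name: str) -> float:
    """Statistical step: mean of one feature over all processed pages."""
    return mean(record[name] for record in features.values())
```

Because `run_workflow` only expects callables that map a page to a dictionary of named values, a further extractor (for example a line counter) could be added to the list without changing the loop, which mirrors the generic design described above.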
© (2015) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Swati Chandna, Danah Tonne, Thomas Jejkal, Rainer Stotzka, Celia Krause, Philipp Vanscheidt, Hannah Busch, Ajinkya Prabhune, "Software workflow for the automatic tagging of medieval manuscript images (SWATI)", Proc. SPIE 9402, Document Recognition and Retrieval XXII, 940206 (8 February 2015); doi: 10.1117/12.2076124; https://doi.org/10.1117/12.2076124
