Digital methods, tools and algorithms are gaining in importance for the analysis of digitized manuscript collections
in the arts and humanities. One example is the BMBF-funded research project “eCodicology” which
aims to design, evaluate and optimize algorithms for the automatic identification of macro- and micro-structural
layout features of medieval manuscripts. The main goal of this research project is to provide better insights into
high-dimensional datasets of medieval manuscripts for humanities scholars. The heterogeneous nature and size
of the humanities data and the need to create a database of automatically extracted reproducible features for
better statistical and visual analysis are the main challenges in designing a workflow for the arts and humanities.
This paper presents a concept of a workflow for the automatic tagging of medieval manuscripts. As a starting
point, the workflow uses medieval manuscripts digitized within the scope of the project “Virtual Scriptorium St.
Matthias”. Firstly, these digitized manuscripts are ingested into a data repository. Secondly, specific algorithms
are adapted or designed for the identification of macro- and micro-structural layout elements like page size,
writing space, number of lines, etc. Lastly, a statistical analysis and scientific evaluation of the manuscript
groups are performed. The workflow is designed generically to process large amounts of data automatically with
any desired algorithm for feature extraction. As a result, a database of objectified and reproducible features is
created, which helps to analyze and visualize hidden relationships across some 170,000 pages. The workflow shows
the potential of automatic image analysis by enabling the processing of a single page in less than a minute.
Furthermore, accuracy tests of the workflow on a small set of manuscripts with respect to features like page
size and text areas show that automatic and manual analysis yield comparable results. The use of a computer cluster
will allow high-performance processing of large amounts of data. The software framework itself will be
integrated as a service into the DARIAH infrastructure to make it adaptable for a wider range of communities.
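
The three workflow stages described above (ingest, feature extraction with any desired algorithm, statistical analysis) can be illustrated with a minimal sketch. It is not the project's implementation: the names Page, page_size_mm, run_workflow and mean_by_manuscript are hypothetical, the repository is reduced to an in-memory list of page records, and the only feature computed is the page size derived from assumed pixel dimensions and scan resolution.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Iterable, List

# Hypothetical page record; a real repository would hold image files and metadata.
@dataclass(frozen=True)
class Page:
    manuscript_id: str
    page_no: int
    width_px: int
    height_px: int
    dpi: int

# Any callable that maps a page to named numeric features can be plugged in.
FeatureExtractor = Callable[[Page], Dict[str, float]]


def page_size_mm(page: Page) -> Dict[str, float]:
    """Convert pixel dimensions to millimetres (1 inch = 25.4 mm)."""
    mm_per_px = 25.4 / page.dpi
    return {
        "page_width_mm": page.width_px * mm_per_px,
        "page_height_mm": page.height_px * mm_per_px,
    }


def run_workflow(pages: Iterable[Page],
                 extractors: List[FeatureExtractor]) -> List[Dict[str, object]]:
    """Apply every extractor to every page and collect one feature row per page."""
    rows: List[Dict[str, object]] = []
    for page in pages:
        row: Dict[str, object] = {
            "manuscript_id": page.manuscript_id,
            "page_no": page.page_no,
        }
        for extract in extractors:
            row.update(extract(page))
        rows.append(row)
    return rows


def mean_by_manuscript(rows: List[Dict[str, object]],
                       feature: str) -> Dict[str, float]:
    """Group feature rows by manuscript and average one feature per group."""
    groups: Dict[str, List[float]] = {}
    for row in rows:
        groups.setdefault(str(row["manuscript_id"]), []).append(float(row[feature]))
    return {ms: mean(values) for ms, values in groups.items()}


if __name__ == "__main__":
    # Invented sample pages, used only to exercise the sketch.
    sample = [
        Page("ms_a", 1, 3000, 4200, 300),
        Page("ms_a", 2, 3000, 4200, 300),
        Page("ms_b", 1, 2400, 3600, 300),
    ]
    feature_rows = run_workflow(sample, [page_size_mm])
    print(mean_by_manuscript(feature_rows, "page_height_mm"))
```

Keeping extractors as plain callables mirrors the generic design: a further feature, such as writing space or line count, would be one more function in the list passed to run_workflow, and the resulting rows could be written to a database instead of a list.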