UCL Centre for Digital Humanities

[ONLINE] Enhancing OCR with AI for Historical Documents

23 April 2024, 3:30 pm–4:45 pm

Sloane Lab 2024 series

Sloane Lab and HDSM Darmstadt are pleased to welcome Dr Wolfgang Göderle, Associate Researcher at the Max Planck Institute of Geoanthropology (Jena) and David Fleischhacker, Technical University of Graz.

This event is free.

Event Information

Open to

All

Availability

Yes

Cost

Free

Organiser

Marco Humbel, UCLDH Associate Director (ECR)

Enhancing OCR with AI for Historical Documents: A Breakthrough in the Research on Habsburg State-Manuals

State manuals are a common source in modern European history and contain valuable information for addressing questions of social and economic history, the history of knowledge, science, and technology, and administrative history. For Habsburg Central Europe, 155 volumes of the so-called Hof- und Staatsschematismus document the evolution of the social elites of the Habsburg Monarchy from 1702 to 1918. These books gather information on members of the court and administrative elites and make it possible to reconstruct the careers of c. 150,000 bureaucrats in the imperial service.

However, the nested and complex structure of these documents has so far prevented any large-scale extraction of the information contained in the Schematismus. In our work, we developed an extraction pipeline that makes it possible to process the information stored in these documents at scale. In a first step, we identified and trained a suitable CNN (YOLOv9) to perform layout detection. The model identifies the smaller text modules that make up the pages and forwards them to a fine-tuned state-of-the-art OCR engine (currently Kraken OCR). In a final postprocessing step, we perform entity recognition and reassemble the extracted information in the correct order to prepare the data for export into a graph database.
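As a rough illustration of the reassembly step described above, the following Python sketch puts detected text modules back into reading order on a multi-column page. It is not the speakers' implementation: the `Region` class, the `reading_order` function, and the column-grouping rule (a module joins a column when its horizontal centre falls inside that column's extent) are all assumptions made for this example. The placeholder texts stand in for OCR output.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical container for one detected text module and its OCR text."""
    x: float  # left edge of the bounding box, in pixels
    y: float  # top edge of the bounding box, in pixels
    w: float  # box width
    h: float  # box height
    text: str  # text returned by the OCR step for this module


def reading_order(regions):
    """Order regions column by column (left to right), top to bottom in each.

    Assumption for this sketch: columns do not overlap horizontally, so a
    region belongs to a column if its horizontal centre lies within the
    extent of the regions already assigned to that column.
    """
    columns = []  # each entry is [left, right, members]
    for region in sorted(regions, key=lambda r: r.x):
        centre = region.x + region.w / 2
        for column in columns:
            if column[0] <= centre <= column[1]:
                # Widen the column if this region extends further right.
                column[1] = max(column[1], region.x + region.w)
                column[2].append(region)
                break
        else:
            # No existing column matched, so this region starts a new one.
            columns.append([region.x, region.x + region.w, [region]])

    ordered = []
    for _left, _right, members in columns:
        ordered.extend(sorted(members, key=lambda r: r.y))
    return ordered


# Four modules on a two-column page, deliberately given out of order.
page = [
    Region(x=10, y=200, w=100, h=20, text="entry 2"),
    Region(x=155, y=300, w=100, h=20, text="entry 4"),
    Region(x=12, y=50, w=100, h=20, text="entry 1"),
    Region(x=150, y=60, w=100, h=20, text="entry 3"),
]
ordered_texts = [region.text for region in reading_order(page)]
print(ordered_texts)
```

A geometric rule like this only works for clean column layouts. The nested structure of the Schematismus would likely need additional cues, such as the module classes predicted by the layout detector.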

Register for the Zoom event and view the full seminar series programme: https://critical-creative.eventbrite.co.uk


The Sloane Lab Seminar Series is convened by Marco Humbel (Sloane Lab & UCLDH), Nadezhda Povroznik (TU Darmstadt), Julianne Nyhan (TU Darmstadt & UCL) and Andrew Flinn (UCL). Administrative support is provided by Lucy Stagg (UCLDH & UCL IAS).

This joint virtual seminar is co-hosted by University College London, TU Darmstadt, the British Museum and the Natural History Museum.

The symposium is funded by the Towards a National Collection programme (Arts and Humanities Research Council) as an activity of the Sloane Lab Discovery Project.

About the Speakers

Dr Wolfgang Göderle

Dr Wolfgang Göderle is a digital historian with a background in electrical engineering. His research interests lie in Central European social and environmental history and the history of knowledge in the long 19th century. He works as a postdoc at the University of Innsbruck and is an Associate Researcher at the Max Planck Institute of Geoanthropology (Jena).

David Fleischhacker

David Fleischhacker is a computer science master’s student at the Technical University of Graz. His interests lie in machine-learning-driven computer vision, in particular object detection and pattern recognition in images.