Community Knowledge Spaces: Transforming UCL Special Collections Utilising Wikidata
This collaborative project brings together the UCL Department of Information Studies, UCL Library, Culture, Collections and Open Science (LCCOS), UCL School of Management and Wikimedia UK.
The project explores a selection of digitised artefacts from the UCL College Periodicals special collection, namely issues of Pi (1946–1954), to support research, teaching, and public engagement.
Combining computational methods with the Wikidata knowledge base, the project investigates the processing of scanned historical documents through Optical Character Recognition (OCR), article segmentation and extraction, Named-Entity Recognition (NER), catalogue creation, and data reconciliation with Wikidata. As a student-run periodical, Pi offers unique insights into student life at UCL and in London during the 1940s and 1950s. By transforming these digitised issues into structured, open data, the project enhances accessibility and discoverability, fittingly contributing to UCL’s Bicentenary celebrations by opening new pathways for engagement with the university’s history and heritage.
The project culminates in a workshop on 4 November 2025, showcasing the outcomes and sharing lessons learned. The session will include hands-on training in developing and applying similar computational pipelines for processing historical documents. The workshop will be open to students, staff, information professionals, community participants, and Wikidata/Wikipedia contributors interested in digital heritage and open knowledge practices.
Book your place / find out more
Background
UCL Special Collections is one of the largest repositories of its kind in south-east England, responsible for managing over 10 kilometres of material, including more than 500 archive collections and approximately 150,000 rare books. The College Periodicals form part of this vast collection, encompassing published works and ephemera that reflect nearly two centuries of UCL’s history.
Within this project, we focused on Pi, examining the digitised issues published between 1946 and 1954. Among UCL’s digitised periodicals, Pi represents the largest and most complete set, comprising 88 issues and 379 pages of high-resolution scans. Our goal was to enhance access and visibility for these publications by linking and enriching their data through Wikidata.
Wikidata is an open, collaborative knowledge base that serves as the central repository of structured data for Wikimedia’s sister projects, including Wikipedia. As a freely licensed and community-maintained platform, Wikidata enables institutions to integrate their local collections into the global web of open data, supporting discovery, research, and reuse. Through this project, we created and published structured catalogue records for Pi, making the periodicals more searchable, interconnected, and accessible to a wider audience.
Project Plan
Throughout this project, we aimed to develop a method for datafying UCL’s College Periodicals, transforming them into structured, machine-readable data to enhance searchability and research potential. This approach was tested using the Pi sub-collection. The digitised issues of Pi, provided by Library, Culture, Collections and Open Science (LCCOS) prior to the project, served as the foundation for our work.
The project pipeline is illustrated in the diagram below, alongside key insights and lessons learned throughout the process.
Project pipeline for processing the digitised Pi periodicals
After acquiring high-resolution digitised issues of Pi from UCL Special Collections in JPG and PDF formats, we experimented with both pre-OCRed PDF documents and additional Optical Character Recognition (OCR) tools to further enhance text extraction accuracy. Several OCR platforms were tested, including Azure Document Intelligence, Amazon Textract, and Transkribus. However, these tools encountered difficulties in processing the complex, multi-column layouts typical of historical periodicals, often resulting in fragmented or disjointed text blocks. We also looked at past research projects such as NewsEye, LayoutParser, and Newspaper Navigator.
OCR and Article Extraction
We decided to experiment with the emerging practice of using generative AI tools for OCR, layout detection, and article extraction, as well as with some pre-trained models:
- OpenAI API: Generative AI tools can recognise both text and newspaper layouts with a reasonable level of accuracy, which means they can extract individual articles from newspapers with complex layouts. The challenge is that these tools are prone to “hallucinate”, changing the original content. We tested the OpenAI API using the o4-mini-2025-04-16 model with carefully designed prompts to reduce these issues to an acceptable level (a minimal sketch of this approach follows the list below).
- Local LLM from UCL ARC (Advanced Research Computing Centre): Looking for more reliable LLM tools, we worked with the UCL Advanced Research Computing Centre (ARC), which kindly supported our project by providing access to a locally installed LLM, gemma-3-27b-it. The ARC’s local LLM showed encouraging results, with fewer hallucinations, but we did not have the computational resources needed to process all the pages in the repository.
- Pre-trained models, e.g. NewsEye: There are existing research tools for historical newspaper OCR, such as the NewsEye project, which developed text recognition and article separation methods for European digital heritage collections. However, adopting the models trained by these projects as out-of-the-box solutions proved challenging, and this approach was abandoned early in the project.
- Tesseract and other libraries: We also experimented with solutions built on open-source libraries such as Tesseract. Tesseract performed fairly well given the high resolution of the images, but the complexity of the historical newspaper-style page layouts posed challenges for extracting individual articles.
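To give a concrete sense of the LLM-based approach, the sketch below sends a single scanned page to the OpenAI API and asks for article-level extraction. It is a minimal illustration only: the prompt wording, output format, and file name are assumptions rather than the exact prompts or pipeline used in the project (the actual code is linked under Outputs below).

```python
# Minimal sketch: send one scanned page to the OpenAI API and ask for
# article-level extraction. Prompt and file names are illustrative only.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_articles(image_path: str, model: str = "o4-mini-2025-04-16") -> str:
    """Ask the model to transcribe a page and split it into articles."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        "This is a scanned page of a 1940s student newspaper with a "
        "multi-column layout. Transcribe the text exactly as printed and "
        "return a JSON list of articles, each with 'headline' and 'body'. "
        "Do not invent, correct, or reorder any text."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(extract_articles("pi_page_scan.jpg"))  # hypothetical file name
```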
Processing Individual Articles (NER)
We applied Named-Entity Recognition (NER) to the extracted articles to identify and classify key entities within the text. NER is a computational technique that automatically detects and labels references to people, organisations, places, and dates. In this project, we focused primarily on people and places mentioned in Pi articles, as these entities offer valuable insights into UCL’s historical community and broader cultural context. The resulting data provides a foundation for further digital humanities research and opportunities to link UCL’s heritage materials with external datasets through platforms like Wikidata.
For this task, we primarily explored two tools: spaCy and the Newswire models. spaCy is a widely used open-source library that employs machine learning techniques. We used it to process all extracted articles and identify names of people and places. Additionally, we drew inspiration from the Newswire project and tested its published models for topic classification. Ultimately, we produced both NER and topic classification outputs in JSONL format and published the resulting dataset for future research and reuse.
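A minimal sketch of the spaCy part of this step is shown below. It assumes the general-purpose en_core_web_sm model is installed, and the JSONL field names are illustrative rather than the exact schema of the published dataset.

```python
# Minimal sketch of the NER step using spaCy's general-purpose English model.
# Install the model first with: python -m spacy download en_core_web_sm
import json
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_entities(article_text: str) -> dict:
    """Return people (PERSON) and places (GPE/LOC) mentioned in an article."""
    doc = nlp(article_text)
    return {
        "people": sorted({ent.text for ent in doc.ents if ent.label_ == "PERSON"}),
        "places": sorted({ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")}),
    }

if __name__ == "__main__":
    sample = "Jeremy Bentham's legacy was debated at a Union meeting in Bloomsbury."
    record = {"text": sample, **extract_entities(sample)}
    # Append one JSON object per article to a JSONL file (field names illustrative).
    with open("pi_entities.jsonl", "a", encoding="utf-8") as out:
        out.write(json.dumps(record) + "\n")
    print(record)
```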
Publishing Data on Wikidata
We published two complementary datasets to Wikidata as part of this project. The first consisted of the Pi sub-collection catalogue, which connects UCL’s Special Collections records to the wider Web of open cultural data. Each record was enhanced with persistent identifiers and URLs linking directly to the digitised PDF copies held by LCCOS. This made the collection more discoverable and interoperable across Wikidata projects and beyond.
The second dataset focused on people and places mentioned within the extracted Pi articles. Drawing on the JSONL dataset generated through our Named-Entity Recognition (NER) and topic classification pipelines, we transformed the data into structured tables for import into OpenRefine: a tool widely used for data cleaning, reconciliation, and Wikidata publication. We designed a bespoke Wikidata schema aligned with community standards (for example, linking periodical issues via properties such as “instance of” (P31), “publication date” (P577) and “described by source” (P1343)). This enables searching the dataset by type of named entity (“person” (Q215627)) or by the properties that inter-relate them, as depicted in the diagram showing the Pi issues that mention Jeremy Bentham.
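A minimal sketch of how such a search can be expressed is shown below, written in Python against the public Wikidata SPARQL endpoint. It assumes the modelling direction described above, with a person item pointing to the Pi issues that mention them via “described by source” (P1343); the authoritative queries are published on the project page listed under Outputs.

```python
# Hedged sketch: list the Pi issues linked to a person via P1343, with their
# publication dates (P577). The exact property paths in the published schema
# may differ; see the project page for the published queries.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?issue ?issueLabel ?date WHERE {
  ?person rdfs:label "Jeremy Bentham"@en ;   # match the person by English label
          wdt:P1343 ?issue .                 # described by source -> Pi issue
  ?issue wdt:P577 ?date .                    # publication date of the issue
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?date
"""

response = requests.get(
    SPARQL_ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "pi-wikidata-demo/0.1 (example script)"},
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["date"]["value"][:10], row["issueLabel"]["value"])
```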
OpenRefine’s reconciliation feature can automatically match named entities to existing Wikidata items. However, without sufficient contextual information, such as a person’s year of birth or a place’s geographic hierarchy, the software often returns a list of possible, and sometimes incorrect, matches. One of the main challenges is that the first suggestion is typically the most popular or widely cited entity on Wikidata, rather than the correct one. Since there is currently no perfect model for automatically disambiguating individuals on Wikidata, we retrieved the birth dates of potential matches and cross-checked these against the publication dates of Pi issues to ensure greater accuracy. This additional step improved the reliability of our data but also reduced the number of entities that could be processed automatically.
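The following sketch illustrates the birth-date cross-check, under stated assumptions: candidate QIDs come from OpenRefine’s reconciliation suggestions, the birth date is fetched from Wikidata as “date of birth” (P569), and the plausibility rule (the age bounds) is illustrative rather than the exact rule used in the project.

```python
# Hedged sketch of the birth-date cross-check used during reconciliation.
from datetime import date
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def birth_year(qid: str) -> int | None:
    """Return the year of birth (P569) recorded on Wikidata, if any."""
    query = f"SELECT ?dob WHERE {{ wd:{qid} wdt:P569 ?dob . }} LIMIT 1"
    r = requests.get(SPARQL_ENDPOINT, params={"query": query, "format": "json"},
                     headers={"User-Agent": "pi-reconciliation-check/0.1 (example script)"})
    r.raise_for_status()
    bindings = r.json()["results"]["bindings"]
    return int(bindings[0]["dob"]["value"][:4]) if bindings else None

def plausible_match(qid: str, published: date, min_age: int = 16, max_age: int = 100) -> bool:
    """Flag candidates whose age at publication falls outside a plausible range."""
    year = birth_year(qid)
    if year is None:
        return False  # no birth date on Wikidata: set aside for manual review
    age_at_publication = published.year - year
    return min_age <= age_at_publication <= max_age

# Q42 (Douglas Adams, born 1952) used purely as a demo QID: he could not
# appear in a 1950 issue, so the check returns False.
print(plausible_match("Q42", date(1950, 3, 1)))
```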
In addition to automated reconciliation, substantial manual verification and disambiguation were required. These manual checks indicated a reasonable level of accuracy in our published dataset, though many of the individuals identified through NER could not yet be referenced on Wikidata. Overall, we successfully created 88 Wikidata entries corresponding to the digitised Pi issues and linked 1,662 references to 231 distinct individuals. Future work will aim to expand these connections by incorporating additional people, as well as locations, institutions, and thematic topics mentioned across the periodicals.
Outputs
Those interested in the main outputs of the project, including the source code for extracting articles and for their automated processing, can check the following links. Additional information can also be found on the Wikidata project and Wikimedia pages.
- Wikidata Project Page: https://www.wikidata.org/wiki/Wikidata:UCL_periodicals, with SPARQL queries to search for the Wikidata entries we published.
- GitHub Repository for OCR: https://github.com/kstepanyan/Newspaper-OCR-LLM/blob/main/README.md
- GitHub Repository for Semantic Processing: https://github.com/kstepanyan/Newspaper-Semantic-Enrichment/blob/main/README.md
- UCL Data Repository for the Pi Dataset (JSONL): https://doi.org/10.5522/04/29973145.v1
Lessons Learned
While the project’s outcomes remain exploratory in nature, they represent an important step towards developing data-driven methods for processing and enriching historical periodical collections.
- Layout analysis and article segmentation for historical periodicals require further work.
During the OCR stage, we found that the complex, irregular layouts of historical publications such as Pi present significant challenges for many existing tools. Commercial OCR systems are capable of accurate text recognition on high-resolution images but often fail to capture article-level structure and reading order. Conversely, tools developed within digital humanities or archival research projects tend to perform well in specific contexts but are difficult to reuse across collections or to run fully automatically. There is still a need for solutions that can accommodate a wider range of historical document formats.
- High-quality automated processing depends on domain-specific models.
In this project, we employed general-purpose NER tools trained on contemporary datasets and topic classification tools based on historical newspaper wires. As expected, the accuracy of NER was limited when applied to mid-20th-century material. Similarly, the topics covered in student periodicals may not be fully aligned with models trained on US newspaper corpora. Many successful digital history initiatives have demonstrated improved results by fine-tuning models on collection-specific data, a strategy that appears increasingly essential. Looking ahead, there is clear value in developing more robust and transparent pre-trained models with stronger performance on historical and cultural materials, to support future research and digitisation efforts of this kind.
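As an illustration of that fine-tuning strategy, the sketch below updates the named-entity recogniser of a general-purpose spaCy model on a handful of collection-specific annotations. The annotated sentence, entity offsets, and output path are invented for illustration; in practice a much larger sample annotated from the Pi articles would be needed.

```python
# Minimal sketch of fine-tuning spaCy NER on collection-specific examples.
# The annotated sentence below is invented for illustration; real training
# data would come from manually annotated Pi articles.
import random
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")

# (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("The Provost addressed students in the Gustave Tuck Theatre.",
     {"entities": [(38, 58, "FAC")]}),
]

# Update only the NER component, leaving the rest of the pipeline untouched.
with nlp.select_pipes(enable="ner"):
    optimizer = nlp.resume_training()
    for _ in range(20):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer)

nlp.to_disk("ner_pi_finetuned")  # hypothetical output directory
```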
Acknowledgements
We would like to express our gratitude to Dr David Pérez-Suárez of the UCL Advanced Research Computing Centre, and his colleagues Andrew Esterson and Ed Lowther, for providing the opportunity to use the locally administered LLM and for their technical support. We would also like to thank colleagues at LCCOS, including Colin Penman, Amy Howe and Steven Wright, for facilitating access to the special collections and for their support through their subject expertise. Lastly, we would like to thank the colleagues administering UCL’s Knowledge Exchange and Innovation Funding Scheme for funding this early-stage project idea.
Project partners contributing to the published results include colleagues from the following UCL departments and from Wikimedia UK:
- UCL Department of Information Studies: Karen Stepanyan, Jiayu Li, Gabriel Chow;
- UCL LCCOS: Jo Baines;
- UCL School of Management: Maya Cara; and
- Wikimedia UK: Richard Nevel and Stuart Prior.