Partners in time: HPC opens new horizons for Humanities research
24 October 2017
From new discoveries about the hidden history of London to fresh insights into gender bias down the centuries, Research IT Services (RITS) is showing how leading-edge computing can add a powerful extra dimension to the Arts & Humanities research toolbox.
Working closely with the UCL Digital Humanities team, RITS is deploying its infrastructure and expertise to enable researchers beyond those in science and engineering to harness high-performance computing (HPC) and generate fascinating research outcomes.
Demonstrating HPC's ability to marshal, manipulate and mine huge digitised archives, RITS is showing how the right tools can accelerate research, make it easy to ask interesting questions and evaluate the answers effectively - and really let the data speak.
Big Data, Big Questions
In this game-changing digital age, HPC represents one of the most significant developments to date: advanced platforms like UCL's Legion cluster can handle huge volumes of data with incredible speed and are routinely exploited to outstanding effect throughout science & engineering.
"For us, the key question was whether we could adapt our HPC architectures to the needs of Arts & Humanities researchers too," says James Hetherington, Head of the RITS Research Software Development Group. "For instance, how quickly could this infrastructure locate data and run queries when faced with stores of material of unaccustomed types?"
"Scientific datasets tend to be massive, structured and composed of very large files," explains David Beavan of the UCL Centre for Digital Humanities. "Digital datasets in the Humanities space are generally smaller, much more unstructured and composed of very small files."
Over the last two years, RITS has successfully tackled the challenge. Through its involvement in a series of pioneering projects, it has built a firm foundation designed to ensure that Arts & Humanities researchers can make the most of UCL's powerful HPC resources and unlock the potential of 'big data'. By providing expert insight, technical skills and broader support, RITS has shown how digital datasets can be exploited in ways that far exceed the capabilities of regular PCs or conventional data interrogation methods.
Productive Projects, Practical Progress
In the British Library's Good Books
In 2015, at the invitation of digital solutions provider Jisc, RITS joined forces with UCL Digital Humanities on an important pilot project: to copy the British Library's digitised versions of 68,000 books dating from between 1789 and 1914 - around 30 trillion bytes of material - and explore how best to extract useful information using advanced computational queries.
"We proved that we could, with confidence, investigate complex questions, beyond simple word searches, in a large text-based archive", says James Hetherington, who wrote the code needed to run the searches. "This was an important step in showing that architectures designed for scientific research could be harnessed by the Humanities too. We converted the data to a more efficient file format, placed it on our archival storage system and worked closely with historians to identify how search queries should be phrased to meet Humanities researchers' needs."
The British Library project also provided the basis for work on the vast Times Digital Archive (TDA), consisting of every article published in the newspaper between 1785 and 2009. Using Legion, Raquel Alegre and Roma Klapaukh from the Research Software Development Group have helped a range of researchers to tap into this remarkable resource and generate all kinds of historical, social and cultural insights - from investigating the level of attention given to overseas events to sentiment analysis looking at different ways that texts refer to men and women.
"There's no end to the uses that the TDA could be put to," says Raquel Alegre, Senior Research Software Developer at RITS. "A lot of our work has focused on how to work around errors made by Optical Character Readers (OCRs). It's also crucial to write code that enables rapid and sophisticated querying of data across multiple files - for example, to pinpoint the frequency of use of certain words or identify how often certain words are used together. A key strand of our work involves using tools like word clouds to make results accessible and comprehensible."
Shining a Light on Whitechapel
As part of the Survey of London, an investigation of the city's built environment led by UCL's Bartlett School of Architecture, the Whitechapel Initiative is building a unique picture of one of London's most iconic (and, at one time, most notorious) districts. Encouraging experts and amateurs to provide material and memories, the aim is to generate a resource spanning newspapers, photographs, audio files and much more.
"This ground-breaking project is giving us huge scope to test and extend our capabilities, and to add another dimension to the initiative itself," Raquel Alegre comments. "We've already been able to discover hundreds more articles relevant to Whitechapel in the TDA and to address issues such as variant spellings of place names. We found the additional articles because of the extra speed, flexibility and reliability of the tools we used. We want to help deliver an unprecedentedly advanced online search tool enabling users to access, enjoy and learn from data content in fantastic new ways."
Putting the Art into Partnership
These examples not only show how HPC can contribute to Arts & Humanities - they underline the value of bringing very different skillsets together, with RITS expertise complementing Arts & Humanities expertise and the Digital Humanities team further underpinning the bridge between them.
For David Beavan, the advantage of working with RITS is that they're so much more than simply code writers or an IT helpline provider. "They explain everything in terms that non-IT specialists can understand and really immerse themselves in whatever research environment they're asked to work in. This enables them to make suggestions of immense value in driving the research process. It's a creative, iterative approach that's highlighting HPC's huge potential contribution to Arts & Humanities research - here at UCL, across the UK and all over the globe."
"Using the facilities available to us at UCL RITS is key to being able to undertake this type of interdisciplinary research," concludes Melissa Terras, Honorary Professor at the UCL Centre for Digital Humanities*. "As textual databases continue to grow exponentially, we need to find effective ways of looking through them, and analysing, understanding and visualising their contents. These pilot projects have been essential in enabling the large scale digital infrastructure that has been traditionally used by the sciences to be fine-tuned for use to understand culture, heritage, and human behaviour. By working together closely with RITS, we can unlock the growing repositories of digitised historical texts, including books and newspapers, to be able to ask questions at scale that were previously unimaginable".
* Melissa Terras is now Professor of Digital Cultural Heritage at the University of Edinburgh's College of Arts, Humanities, and Social Sciences.