UCL Centre for Digital Humanities


The Hillary Clinton emails: corpus linguistics meets the real world

28 March 2017, 5:30 pm–6:30 pm

UCLDH seminar series

Event Information

G31, ground floor, Foster Court, UCL

The release into the public domain of Hillary Clinton's emails was an exceptional event of interest to many communities, not least linguistics researchers. This talk presents a project currently underway to turn them into an orderly, easily searchable, publicly available corpus, and some results from the first linguistic study to be carried out on this data.

This project is designed to facilitate access to this data by linguists and researchers in other fields. Achieving this means overcoming many technical challenges. The data has been released as unordered PDF files with many redactions, which makes data extraction difficult. Further, the medium of email itself presents challenges such as threading (with duplicate text), boilerplate text, attachments, multiple classes of recipients, and metadata of uncertain value. Our project has made a start on dealing with these issues, and the current talk focuses on a 500-email subcorpus prepared for preliminary research.

We describe the compilation of the corpus and discuss the more interesting aspects of the process, such as how to deal with redactions and omissions, which technical aspects of the messages to preserve, and how to determine and record the relationships between participants. We also present some early results on the patterns of communication we can identify in the data, considering issues such as formality and directness in requests and other interactions. 

This seminar is based on work in progress and aims to scope out opportunities for collaboration. It will be followed by drinks and discussion. Please register to attend.


Rachele De Felice is a Senior Teaching Fellow in the Department of English Language and Literature at UCL. Her current research is in the field of corpus pragmatics. It focuses on the creation of pragmatic profiles of Business English by applying corpus analysis and natural language processing (NLP) techniques to large collections of real-world data.

Gregory Garretson is a Senior Lecturer in the Department of English at Uppsala University. As a linguist, he is involved in a variety of studies of semantics, pragmatics and discourse, and specializes in corpus methods.