UCL Centre for Digital Humanities


What might a corpus of spoken data tell us about language?

15 November 2017, 5:30 pm–7:30 pm

Event Information

UCL Centre for Digital Humanities
Gower Street
United Kingdom

The last two decades have seen corpus resources grow in scale, breadth, representativeness and linguistic variety. For example, one online survey of primarily English-language corpora documents around 100 publicly available corpora. Corpora are sources of three types of evidence: discovery (evidence of unknown phenomena), distribution (relative proportions of phenomena), and interaction (how distributions may correlate or interact).

However, it would be fair to comment that the primary result of this data acquisition has been the collection of large volumes of written text, mostly from sources that were easy to acquire, such as student essays, newspaper archives or the World Wide Web. By contrast, collecting and transcribing spoken data is laborious and costly.

The focus on the written word appears to be at odds with the scientific goal of universal linguistic description. Speech predates writing in almost every respect: the historical development of language, the spread of literacy, and individual child development. According to many theoretical accounts, internal speech is a precursor of writing. An hour of English speech in DCPSE consists of around 8,000 words: more than most authors could write in one day. Speech is output linearly, whereas most writing permits post-editing. It is difficult to see how corpora that exclude speech data can be said to be linguistically representative, and theoretical claims based only on the written word must be empirically suspect.

So what kinds of spoken data should we prioritise in our collections? What can we learn from existing corpora? Which research questions are possible with existing resources, and how might our future efforts be focused?

In this talk I will briefly examine some of the kinds of research questions we might be able to address with a particular state-of-the-art corpus of spoken English, the Diachronic Corpus of Present-day Spoken English (DCPSE). This is a fully parsed corpus of spoken British English collected over several decades and in a number of different formal and informal settings. We will look at the types of research questions we can answer at present and what we might be able to answer were new data collected or new layers of annotation added.

The slides from the talk are available to view below, and the full paper on which the talk is based is available on the speaker's blog.


Sean Wallis is a Principal Research Fellow in the Survey of English Usage, the first Corpus Linguistics research centre established in Europe (and the second in the world), based in the Department of English Language and Literature at UCL. His background is in artificial intelligence (AI). He is currently working on the Teaching English Grammar in Schools research project, developing the platform. He also teaches corpus linguistics, research methodology and statistics on the UCL MA in English Linguistics.