DCPSE

DCPSE is the Diachronic Corpus of Present-day Spoken English. Published by the Survey of English Usage, it contains over 800,000 words of fully-parsed spoken English from 1957-1993.

The Diachronic Corpus of Present Day Spoken English

23 September 2006

Introduction

DCPSE is a parsed corpus available from the Survey of English Usage designed for investigating recent change in the grammar of spoken British English.

It contains more than 400,000 words each from ICE-GB (collected in the early 1990s) and the London-Lund Corpus (late 1950s to late 1970s).

The orthographic transcriptions have been normalised and annotated according to the same criteria. ICE-GB was used as a gold standard for the parsing of DCPSE. The parsing has been corrected by a variety of methods to provide as high a quality of result as possible.

DCPSE contains:

  • Around 885,000 words (87,000 trees) of fully-parsed and annotated spoken British English from the 1950s to the 1990s.
  • Sociolinguistic information on texts, speakers and authors.
  • ICECUP 3.1 software for searching the corpus.
  • Extensive on-line help.

Why is DCPSE still special?

DCPSE is fully grammatically analysed to the same standard as ICE-GB. All sentences in the corpus have been given a detailed parse tree.

DCPSE contains precisely 87,188 parse trees, comprising 885,436 words of English. The entire material is spoken.

This is the biggest single collection of parsed and checked orthographically transcribed spoken English material anywhere. The picture below shows ICECUP 3.1 browsing a text in the corpus.

Browsing a text and tree in DCPSE. ‘Which printer is this?’ asks speaker A.

DCPSE has been fully checked. Linguists checked it at several stages of its completion, using both a traditional ‘post-checking’ strategy and cross-sectional error-based searches. We do not believe that the analysis in the corpus is perfect, but it is not systematically imperfect, which cannot be said of even the best parser output.

DCPSE comes complete with ICECUP. ICECUP allows you to perform a variety of different queries, including using the parse analysis in the corpus to construct Fuzzy Tree Fragments to search the corpus.

  • A sample corpus from DCPSE Release 1 and ICECUP 3.1 is available for download. 

Corpus Design

The Diachronic Corpus of Present Day Spoken English is sampled from ICE-GB and from the London-Lund Corpus. ICE-GB texts are 2,000 words each while the LLC contains 5,000-word texts. To achieve a balanced sample, it was necessary to take proportionately fewer texts from the LLC.

Only about 130,000 words are found in corpus texts with one speaker; the remainder are conversations or multi-speaker presentations.

The texts were sampled into the following categories:

DCPSE text categories and statistics

    Text category                               ICE texts  LLC texts  Target words  Actual words
A.  Face-to-face conversations, formal                 20          8        80,000        90,775
B.  Face-to-face conversations, informal               90         36       360,000       403,844
C.  Telephone conversations (mostly informal)          10          4        40,000        47,242
D.  Broadcast discussions (disparates/equals)          20          8        80,000        89,157
E.  Broadcast interviews (disparates/equals)           10          4        40,000        43,046
F.  Spontaneous commentary                             23          9        91,000        95,381
G.  Parliamentary language                              5          2        20,000        21,083
H.  Legal cross-examination                             2          1         9,000         9,658
I.  Assorted spontaneous (unscripted) speech            5          2        20,000        21,675
J.  Prepared speech (mostly monologue)                 15          6        60,000        63,575
    Total                                             200         80       800,000       885,436

Notes

The actual number of words included exceeds the target figure by about 10% (885,436 words against a target of 800,000). It is quite common in corpus linguistics to include more material rather than less. There is also a slight variation in the number of words per ICE-GB or LLC text for categories F and H, but this variation is minor.
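The target figures can be reproduced from the text counts, assuming the nominal text sizes given under Corpus Design (2,000 words per ICE-GB text, 5,000 per LLC text). The Python sketch below is illustrative only: the counts are copied from the table, and the names `CATEGORIES`, `target_words` and `overshoot_percent` are hypothetical helpers, not part of DCPSE or ICECUP.

```python
# Nominal text sizes stated in the Corpus Design section.
ICE_WORDS_PER_TEXT = 2_000
LLC_WORDS_PER_TEXT = 5_000

# category letter: (ICE texts, LLC texts, actual words), copied from the table.
CATEGORIES = {
    "A": (20, 8, 90_775),
    "B": (90, 36, 403_844),
    "C": (10, 4, 47_242),
    "D": (20, 8, 89_157),
    "E": (10, 4, 43_046),
    "F": (23, 9, 95_381),
    "G": (5, 2, 21_083),
    "H": (2, 1, 9_658),
    "I": (5, 2, 21_675),
    "J": (15, 6, 63_575),
}


def target_words(ice_texts: int, llc_texts: int) -> int:
    """Target size of a category, from the nominal words per text."""
    return ice_texts * ICE_WORDS_PER_TEXT + llc_texts * LLC_WORDS_PER_TEXT


def overshoot_percent(target: int, actual: int) -> float:
    """How far the actual word count exceeds the target, in percent."""
    return 100.0 * (actual - target) / target


total_target = sum(target_words(ice, llc) for ice, llc, _ in CATEGORIES.values())
total_actual = sum(actual for _, _, actual in CATEGORIES.values())

for letter, (ice, llc, actual) in CATEGORIES.items():
    target = target_words(ice, llc)
    print(f"{letter}: target {target:>7,}  actual {actual:>7,}  "
          f"+{overshoot_percent(target, actual):.1f}%")

print(f"Total: target {total_target:,}  actual {total_actual:,}  "
      f"+{overshoot_percent(total_target, total_actual):.1f}%")
```

The totals come out at 800,000 target words and 885,436 actual words, an excess of roughly 10.7%.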

This variation is not a problem provided that experiments are carried out properly, and in general, the more data available, the better. Experimental results should always be considered relatively, i.e., in proportion to the number of words, clauses, or set of circumstances under investigation. Statistical methods used on corpus data, such as ratio statistics, chi-square, etc., assume that samples are likely to be unequal in size.
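As a minimal sketch of what considering results ‘relatively’ can look like, the Python below expresses hit counts as a rate against a baseline and computes a 2 × 2 chi-square by hand for two samples of unequal size. All counts are invented for illustration and are not DCPSE results; the function names are hypothetical, and the choice of baseline (words, clauses, or another set of circumstances) is left to the researcher.

```python
def rate_per_thousand(hits: int, baseline: int) -> float:
    """Frequency of a construction per 1,000 baseline units."""
    return 1000.0 * hits / baseline


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# INVENTED example counts: hits of some construction and the size of the
# baseline (for instance, clauses) in each subcorpus. Not real DCPSE data.
llc_hits, llc_baseline = 120, 4_000
ice_hits, ice_baseline = 90, 2_000

llc_rate = rate_per_thousand(llc_hits, llc_baseline)
ice_rate = rate_per_thousand(ice_hits, ice_baseline)

# Rows: subcorpus; columns: hits versus remaining baseline units.
chi2 = chi_square_2x2(
    llc_hits, llc_baseline - llc_hits,
    ice_hits, ice_baseline - ice_hits,
)

# Critical value of chi-square for 1 degree of freedom at p = 0.05.
CRITICAL_05_DF1 = 3.841

print(f"LLC rate: {llc_rate:.1f} per 1,000")
print(f"ICE-GB rate: {ice_rate:.1f} per 1,000")
print(f"chi-square = {chi2:.2f}, significant at 0.05: {chi2 > CRITICAL_05_DF1}")
```

Because the test works on the full table of hits and non-hits, the two samples do not need to be the same size.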

DCPSE text codes are given the prefix "DI-" (for ICE-GB) or "DL-" (for LLC), followed by the letter code A-J (above) and an index number. For example:

  • DI-B07 is the seventh text in the ICE-GB (1990s) informal face-to-face conversations (B).
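A text code of this shape is easy to take apart programmatically. The Python sketch below assumes codes are written exactly as in the example (prefix, hyphen, category letter, then a run of digits); the pattern and the names `TextCode` and `parse_text_code` are hypothetical and are not taken from ICECUP.

```python
import re
from typing import NamedTuple

# Prefix letter -> source corpus, as described in the text above.
SOURCE_CORPORA = {"I": "ICE-GB", "L": "LLC"}

# Assumed shape: "D", source letter, hyphen, category letter A-J, digits.
TEXT_CODE = re.compile(r"D(?P<source>[IL])-(?P<category>[A-J])(?P<index>\d+)")


class TextCode(NamedTuple):
    source: str    # "ICE-GB" or "LLC"
    category: str  # letter code A-J from the category table
    index: int     # position of the text within its category


def parse_text_code(code: str) -> TextCode:
    """Split a DCPSE text code such as 'DI-B07' into its parts."""
    match = TEXT_CODE.fullmatch(code.strip())
    if match is None:
        raise ValueError(f"not a DCPSE text code: {code!r}")
    return TextCode(
        source=SOURCE_CORPORA[match["source"]],
        category=match["category"],
        index=int(match["index"]),
    )


print(parse_text_code("DI-B07"))
# TextCode(source='ICE-GB', category='B', index=7)
```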

The sociolinguistic variable ‘Source corpus’ stores the source corpus (ICE-GB or LLC) for every text.

The LLC corpus material

The London-Lund Corpus was recorded over several decades, from the earliest tape dated 1953, to the last, S-06-09, recorded in 1987. The time span of LLC texts included in DCPSE ranges from 1958 to 1977. In DCPSE the ‘Date’ variable stores the year of recording.

The LLC was transcribed on paper at the Survey of English Usage and famously typed up on paper cards or ‘slips’, which were archived at the Survey in card index cabinets. Without computers, ‘indexing’ consisted of manually underlining constituents on slips, and ‘retrieval’ consisted of opening card indexes. It was only in the 1980s that the LLC was made accessible via a computer.

Many of the recordings in the LLC were made without the knowledge of all of the participants, a practice which would not be considered ethical today and which was not followed for ICE-GB. DCPSE contains an ‘Awareness’ variable that codes for whether the speaker was aware of the recording or not.

The DCPSE project took these LLC texts and re-annotated them in a way that was as consistent with ICE-GB as possible. This meant importing ICE-GB transcription conventions, phonetic and prosodic information and segmentation, and recovering sociolinguistic information from dusty files — as well as carrying out the part-of-speech tagging and parsing of the text.


How DCPSE compares with other treebanks

The table below is a list of fully-parsed and checked phrase structure treebanks of English that are publicly available.

We exclude corpora which were parsed automatically but not checked. Parsing a corpus properly is a very difficult task, precisely because the grammar of natural language is extremely complex. Automatic algorithms are generally poor at distinguishing between different structures, although the simpler the analysis scheme deployed, the easier the task will tend to be.

It is also probably fair to say that not all corpora may have been checked to the same degree — with some teams being satisfied after one ‘post-correction’ pass, and others (including ourselves) only being content to release corpora after a great deal of cross-checking.

As a minimum, all corpora listed below have been manually completed (so that 100% coverage is obtained) and checked and corrected by teams of linguists trained on the parsing scheme. Schemes vary in the level of grammatical detail, with TOSCA/ICE and SUSANNE at the ‘detailed’ end of the spectrum.

Major hand-checked parsed phrase structure grammar corpora of English (available)

Name                                                     Size (x1,000 words)  Ratio spoken:written  Variety  Analysis
University of Pennsylvania (Penn) Treebank [2]           2,900                <144:~2,756           US       Treebank I, II
American Printing House for the Blind Treebank           200                  0:200                 US       skeleton
Associated Press (AP) Treebank                           1,000                0:1,000               US       skeleton
Canadian Hansard Treebank [1]                            750                  0:750                 Can      skeleton
Nijmegen Parsed Corpus (limited availability)            130                  10:120                Brit     TOSCA/ICE
Polytechnic of Wales Corpus [3]                          61                   61:0                  Brit     POW (SFG)
Leeds-Lancaster Treebank (limited availability)          45                   0:45                  Brit     LOB (skeleton)
Lancaster Parsed Corpus                                  140                  0:140                 Brit     LOB (skeleton)
IBM / Lancaster Spoken English Corpus (SEC)              52                   52:0                  Brit     LOB (skeleton)
CHRISTINE & SUSANNE                                      260                  130:130               Brit     SUSANNE
British Component of ICE (ICE-GB)                        1,000                600:400               Brit     TOSCA/ICE
Diachronic Corpus of Present Day Spoken English (DCPSE)  800                  800:0                 Brit     TOSCA/ICE

Notes

  1. ‘Spoken’ material here is limited to orthographically transcribed spoken data. Legal and political transcriptions of material are paraphrased, hence the Canadian Hansard is strictly a ‘written’ corpus. The grammar of paraphrases is ‘cleaned up’, and therefore highly misleading as a guide to the grammar of speech.
  2. The figures of spoken material for the Penn Treebank are slightly uncertain for the same reason. We have excluded Hansard-type material but included Air Traffic Control and telephone subcorpora transcribed for the purpose of linguistic analysis.
  3. SFG stands for a Hallidayan Systemic-Functional Grammar.
  4. We have excluded constraint grammar corpora such as the ENCG corpora for two reasons. First, because the level of correction applied is unclear (we are not ENCG experts), and second, because comparability between constraint and phrase structure grammars is still a matter of debate.
