DCPSE
DCPSE is the Diachronic Corpus of Present-day Spoken English. Published by the Survey of English Usage, it contains over 800,000 words of fully-parsed spoken English from 1957-1993.

23 September 2006
Introduction
DCPSE is a parsed corpus, available from the Survey of English Usage, designed for investigating recent change in the grammar of spoken British English.
It contains more than 400,000 words each from ICE-GB (collected in the early 1990s) and the London-Lund Corpus (late 1960s-early 1980s).
The orthographic transcriptions have been normalised and annotated according to the same criteria. ICE-GB was used as a gold standard for the parsing of DCPSE. The parsing has been corrected by a variety of methods to provide as high a quality of result as possible.
DCPSE contains:
- Around 885,000 words (87,000 trees) of fully-parsed and annotated spoken British English from the 1950s to 1990s.
- Sociolinguistic information on texts, speakers and authors.
- The ICECUP 3.1 software for searching the corpus.
- Extensive on-line help.
Why is DCPSE still special?
DCPSE is fully grammatically analysed to the same standard as ICE-GB. All sentences in the corpus have been given a detailed parse tree.
DCPSE contains precisely 87,188 parse trees, comprising 885,436 words of English. The entire material is spoken.
This is the biggest single collection of parsed and checked orthographically transcribed spoken English material anywhere. The picture below shows ICECUP 3.1 browsing a text in the corpus.

Browsing a text and tree in DCPSE. ‘Which printer is this?’ asks speaker A.
DCPSE has been fully checked. It was checked by linguists at several stages in its completion, using both a traditional ‘post-checking’ strategy and cross-sectional error-based searches. We do not believe that the analysis in the corpus is perfect, but it is not systematically imperfect, unlike the best parser output.
DCPSE comes complete with ICECUP. ICECUP allows you to perform a variety of different queries, including using the parse analysis in the corpus to construct Fuzzy Tree Fragments to search the corpus.
- A sample corpus from DCPSE Release 1 and ICECUP 3.1 is available for download.
Corpus Design
The Diachronic Corpus of Present Day Spoken English is sampled from ICE-GB and from the London-Lund Corpus. ICE-GB texts are 2,000 words each while the LLC contains 5,000-word texts. To achieve a balanced sample, it was necessary to take proportionately fewer texts from the LLC.
Only ~130,000 words are found in corpus texts with one speaker; the remainder are conversations or multi-speaker presentations.
The texts were sampled into the following categories:
Code | Text category | ICE texts | LLC texts | Target words | Actual words |
---|---|---|---|---|---|
A. | Face-to-face conversations, formal | 20 | 8 | 80,000 | 90,775 |
B. | Face-to-face conversations, informal | 90 | 36 | 360,000 | 403,844 |
C. | Telephone conversations (mostly informal) | 10 | 4 | 40,000 | 47,242 |
D. | Broadcast discussions (disparates/equals) | 20 | 8 | 80,000 | 89,157 |
E. | Broadcast interviews (disparates/equals) | 10 | 4 | 40,000 | 43,046 |
F. | Spontaneous commentary | 23 | 9 | 91,000 | 95,381 |
G. | Parliamentary language | 5 | 2 | 20,000 | 21,083 |
H. | Legal cross-examination | 2 | 1 | 9,000 | 9,658 |
I. | Assorted spontaneous (unscripted) speech | 5 | 2 | 20,000 | 21,675 |
J. | Prepared speech (mostly monologue) | 15 | 6 | 60,000 | 63,575 |
Total | | 200 | 80 | 800,000 | 885,436 |
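The target figures in the table appear to follow directly from the nominal text sizes given above (2,000 words per ICE-GB text, 5,000 words per LLC text). The Python sketch below recomputes them and the overall overshoot of the actual word counts. The variable and function names are illustrative assumptions and are not part of DCPSE or ICECUP.

```python
# Nominal text sizes stated in the corpus design (words per text).
ICE_TEXT_WORDS = 2_000
LLC_TEXT_WORDS = 5_000

# category code -> (ICE texts, LLC texts, actual words), copied from the table.
CATEGORIES = {
    "A": (20, 8, 90_775),
    "B": (90, 36, 403_844),
    "C": (10, 4, 47_242),
    "D": (20, 8, 89_157),
    "E": (10, 4, 43_046),
    "F": (23, 9, 95_381),
    "G": (5, 2, 21_083),
    "H": (2, 1, 9_658),
    "I": (5, 2, 21_675),
    "J": (15, 6, 63_575),
}


def target_words(ice_texts, llc_texts):
    """Target word count for a category, from the nominal text sizes."""
    return ice_texts * ICE_TEXT_WORDS + llc_texts * LLC_TEXT_WORDS


total_target = sum(target_words(ice, llc) for ice, llc, _ in CATEGORIES.values())
total_actual = sum(actual for _, _, actual in CATEGORIES.values())
overshoot = total_actual / total_target - 1  # fraction above the target

print(total_target, total_actual, f"{overshoot:.1%}")
```

With the table's figures, each source corpus contributes a nominal 400,000 words (200 × 2,000 and 80 × 5,000), which is why fewer LLC texts are needed for a balanced sample.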
Notes
The actual number of words included exceeds the target figure by 10% or so. It is quite common in corpus linguistics to include more material rather than less. There is also a slight variation in the number of words per ICE-GB or LLC text for categories F and H, but this variation is minor.
This variation is not a problem provided that experiments are carried out properly and, in general, the more data available, the better. Experimental results should always be considered relatively, i.e., in proportion to the number of words, clauses, or set of circumstances under investigation. Statistical methods used on corpus data, such as ratio statistics, chi-square, etc., assume that samples are likely to be unequal in size.
DCPSE text codes consist of the prefix "DI-" (for ICE-GB) or "DL-" (for LLC), then the letter code A-J (above), followed by an index number. For example
The sociolinguistic variable ‘Source corpus’ stores the source corpus (ICE-GB or LLC) for every text.
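To illustrate what considering results "relatively" can mean in practice, the sketch below normalises raw hit counts to a rate per million words and computes a Pearson chi-square for a 2 × 2 table comparing two subcorpora of unequal size. It is a minimal illustration only: the hit counts and subcorpus sizes are hypothetical, the function names are our own, and nothing here is taken from ICECUP.

```python
def per_million(hits, words):
    """Normalised frequency: occurrences per million words."""
    return hits / words * 1_000_000


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = (a, b, c, d)
    expected = (row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# HYPOTHETICAL figures for one construction in two subcorpora of unequal size.
llc_hits, llc_words = 300, 450_000
ice_hits, ice_words = 200, 420_000

print(per_million(llc_hits, llc_words), per_million(ice_hits, ice_words))

# Rows: subcorpus; columns: words that are hits vs. words that are not.
chi2 = chi_square_2x2(llc_hits, llc_words - llc_hits,
                      ice_hits, ice_words - ice_hits)

# 3.841 is the critical value for 1 degree of freedom at p = 0.05.
print(chi2, chi2 > 3.841)
```

Words are used as the baseline here purely for simplicity. As the notes suggest, clauses or another set of circumstances may be a more appropriate denominator for a given research question.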
The LLC corpus material
The London-Lund Corpus was recorded over several decades, from the earliest tape dated 1953, to the last, S-06-09, recorded in 1987. The time span of LLC texts included in DCPSE ranges from 1958 to 1977. In DCPSE the ‘Date’ variable stores the year of recording.
The LLC was transcribed on paper at the Survey of English Usage and famously typed up onto paper cards or 'slips', which were archived at the Survey in card index cabinets. Without computers, 'indexing' consisted of manually underlining constituents on slips, and 'retrieval' consisted of opening card indexes. It was only in the 1980s that the LLC was made accessible via a computer.
Many of the recordings in the LLC, unlike those in ICE-GB, were made without the knowledge of all of the participants, a practice which today would not be considered ethical. DCPSE contains an ‘Awareness’ variable that codes for whether the speaker was aware of the recording or not.
The DCPSE project took these LLC texts and re-annotated them in a way that was as consistent with ICE-GB as possible. This meant importing ICE-GB transcription conventions, phonetic and prosodic information and segmentation, and recovering sociolinguistic information from dusty files — as well as carrying out the part-of-speech tagging and parsing of the text.
How DCPSE compares with other treebanks
The table below is a list of fully-parsed and checked phrase structure treebanks of English that are publicly available.
We exclude corpora which were parsed automatically but not checked. Parsing a corpus properly is a very difficult task, precisely because the grammar of natural language is extremely complex. Automatic algorithms are generally poor at distinguishing between different structures, although the simpler the analysis scheme deployed, the easier the task will tend to be.
It is also probably fair to say that not all corpora may have been checked to the same degree — with some teams being satisfied after one ‘post-correction’ pass, and others (including ourselves) only being content to release corpora after a great deal of cross-checking.
As a minimum, all corpora listed below have been manually completed (so that 100% coverage is obtained) and checked and corrected by teams of linguists trained on the parsing scheme. Schemes vary in the level of detail of the grammar, with TOSCA/ICE and SUSANNE at the ‘detailed’ end of a spectrum.
Name | Size (x1,000 words) | Ratio spoken:written | Variety | Analysis |
---|---|---|---|---|
University of Pennsylvania (Penn) Treebank [2] | 2,900 | <144:~2,756 | US | Treebank I, II |
American Printing House for the Blind Treebank | 200 | 0:200 | US | skeleton |
Associated Press (AP) Treebank | 1,000 | 0:1,000 | US | skeleton |
Canadian Hansard Treebank [1] | 750 | 0:750 | Can | skeleton |
Nijmegen Parsed Corpus (limited availability) | 130 | 10:120 | Brit | TOSCA/ICE |
Polytechnic of Wales Corpus [3] | 61 | 61:0 | Brit | POW (SFG) |
Leeds-Lancaster Treebank (limited availability) | 45 | 0:45 | Brit | LOB (skeleton) |
Lancaster Parsed Corpus | 140 | 0:140 | Brit | LOB (skeleton) |
IBM / Lancaster Spoken English Corpus (SEC) | 52 | 52:0 | Brit | LOB (skeleton) |
CHRISTINE & SUSANNE | 260 | 130:130 | Brit | SUSANNE |
British Component of ICE (ICE-GB) | 1,000 | 600:400 | Brit | TOSCA/ICE |
Diachronic Corpus of Present Day Spoken English (DCPSE) | 800 | 800:0 | Brit | TOSCA/ICE |
Notes