The Diachronic Corpus of Present-Day Spoken English
DCPSE is a new parsed corpus of spoken English available on CD-ROM.
It contains more than 400,000 words from ICE-GB (collected in the early 1990s) and 400,000 words from the London-Lund Corpus (late 1960s-early 1980s).
The orthographic transcriptions have been normalised and annotated according to the same criteria. ICE-GB was used as a gold standard for the parsing of DCPSE. The parsing has been corrected by a variety of methods to provide as high a quality of result as possible (see the project pages for more information).
DCPSE is an incomparable resource for examining recent change in the grammar of spoken English.
DCPSE Release 1 with ICECUP 3.1 has been released.
See the order form for details.
- At least 800,000 words (87,000 trees) of fully parsed and annotated spoken British English from the 1950s to 1990s.
- Sociolinguistic information on texts, speakers and authors.
- Bundled with the ICECUP 3.1 exploration software designed for parsed corpora. This can easily be updated to the latest ICECUP 3.1.1 version, which is compatible with 64-bit Windows.
- Supplied with extensive on-line help.
There are numerous English corpora available.
WHAT IS SPECIAL ABOUT DCPSE?
DCPSE is fully grammatically analysed to the same standard as ICE-GB. All sentences in the corpus have been given a detailed parse tree.
DCPSE contains precisely 87,188 parse trees, comprising 885,436 words of English. The entire material is spoken.
This is the biggest single collection of parsed and checked orthographically transcribed spoken English material anywhere. The picture below shows ICECUP 3.1 browsing a text in the corpus.
DCPSE has been fully checked. It was checked by linguists at several stages in its completion, using both a traditional post-checking strategy and cross-sectional error-based searches. We do not believe that the analysis in the corpus is perfect, but it is not systematically imperfect, unlike the output of even the best parser.
A sample corpus from DCPSE Release 1 and ICECUP 3.1 is now available for download. We also invite linguists to contribute to the development of cutting-edge corpus linguistics tools by participating in our beta programme.
Corpus text categories
|Face-to-face conversations (154) 494,000 words|Formal (28) 90,000 words|A|
| |Informal (126) 403,000 words|B|
|Telephone conversations (14) 47,000 words| |C|
|Broadcast discussions (28) 89,000 words| |D|
|Broadcast interviews (14) 43,000 words| |E|
|Spontaneous commentary (32) 95,000 words| |F|
|Parliamentary language (7) 21,000 words| |G|
|Legal cross-examination (3) 9,000 words| |H|
|Assorted spontaneous (7) 21,000 words| |I|
|Prepared speech (21) 63,000 words| |J|
Word counts have been rounded down to the nearest thousand. Only ~130,000 words are found in corpus texts with one speaker; the remainder are in conversations or multi-speaker presentations.
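As a rough consistency check on the table above, the short Python sketch below adds up the rounded figures by category letter. It assumes that the number in brackets is the number of texts in each category; the dictionary and function names are illustrative only, and the exact total of 885,436 words is the figure quoted earlier on this page.

```python
# Rounded-down word counts per category letter (A-J), copied from the table.
CATEGORY_WORDS = {
    "A": 90_000,   # Face-to-face conversations, formal
    "B": 403_000,  # Face-to-face conversations, informal
    "C": 47_000,   # Telephone conversations
    "D": 89_000,   # Broadcast discussions
    "E": 43_000,   # Broadcast interviews
    "F": 95_000,   # Spontaneous commentary
    "G": 21_000,   # Parliamentary language
    "H": 9_000,    # Legal cross-examination
    "I": 21_000,   # Assorted spontaneous
    "J": 63_000,   # Prepared speech
}

# Bracketed figures from the table, assumed here to be numbers of texts.
CATEGORY_TEXTS = {
    "A": 28, "B": 126, "C": 14, "D": 28, "E": 14,
    "F": 32, "G": 7, "H": 3, "I": 7, "J": 21,
}

# Exact corpus size quoted elsewhere on the page.
EXACT_TOTAL_WORDS = 885_436


def rounding_shortfall(rounded_counts, exact_total):
    """Return how many words are lost by rounding each category down."""
    return exact_total - sum(rounded_counts.values())


total_texts = sum(CATEGORY_TEXTS.values())
rounded_total = sum(CATEGORY_WORDS.values())
shortfall = rounding_shortfall(CATEGORY_WORDS, EXACT_TOTAL_WORDS)

print(total_texts, rounded_total, shortfall)
```

The rounded figures add up to 881,000 words, which is 4,436 short of the exact total. That is what one would expect when ten separate figures are each rounded down by less than a thousand words.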
Sampling, by year
DCPSE is derived from the London-Lund Corpus (LLC), sampled between 1958 and 1977, and the British Component of the International Corpus of English (ICE-GB), sampled between 1990 and 1992. As a result, the ICE-GB material is concentrated over a much shorter period, so there are many more words per year in that part of the corpus. This uneven sampling distribution is easily addressed during data analysis, but it is something to bear in mind.
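One common way to allow for uneven sampling is to compare rates rather than raw counts, for example hits per thousand words in each year. The Python sketch below illustrates the idea. It is not a description of how ICECUP works, the function name is our own, and the hit counts and per-year word totals are invented for illustration rather than taken from DCPSE.

```python
from typing import Dict, Tuple


def normalised_rates(
    counts: Dict[int, Tuple[int, int]], per: int = 1000
) -> Dict[int, float]:
    """Map each year to the number of hits per `per` words sampled that year.

    `counts` maps a year to a (hits, words) pair.
    """
    rates = {}
    for year, (hits, words) in counts.items():
        if words <= 0:
            raise ValueError(f"no words sampled for {year}")
        rates[year] = per * hits / words
    return rates


# Hypothetical figures, NOT real DCPSE data: (hits, words sampled) per year.
example = {
    1961: (30, 20_000),
    1975: (45, 30_000),
    1991: (300, 200_000),
}

rates = normalised_rates(example)
for year in sorted(rates):
    print(year, rates[year])
```

In this invented example the raw number of hits in 1991 is ten times that in 1961, yet the rate is 1.5 hits per thousand words in every year. The apparent increase comes entirely from the larger amount of material sampled, which is the kind of effect to watch for when comparing the LLC and ICE-GB parts of the corpus.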
This page last modified 8 July, 2016 by Survey Web Administrator.