DCPSE

The Diachronic Corpus of Present-Day Spoken English

funded by
ESRC

DCPSE is a parsed corpus of spoken English available from the Survey of English Usage.

It contains more than 400,000 words from ICE-GB (collected in the early 1990s) and 400,000 words from the London-Lund Corpus (late 1960s-early 1980s).

The orthographic transcriptions have been normalised and annotated according to the same criteria. ICE-GB was used as a gold standard for the parsing of DCPSE. The parsing has been corrected by a variety of methods to provide as high a quality of result as possible (see the project pages for more information).

DCPSE is an incomparable resource for examining recent change in the grammar of spoken English.

DCPSE Release 1 for DRM digital download

Alongside ICE-GB, DCPSE remains a state-of-the-art resource for research into the grammar of spoken and written English.

Our research software platform, ICECUP 3.1, has been continually maintained and developed over this period. It is an unparalleled resource for research, and is used for teaching in universities all over the world. It is designed around an exploratory cyclic methodology for research, meaning that you can work top-down or bottom-up, and you can exploit our grammatical analysis without being compelled to accept it!

As with ICE-GB, we offer a special student price of £25 to keep DCPSE affordable, and we support researchers with their queries.

See the order form for details.

  • At least 800,000 words (87,000 trees) of fully-parsed and annotated spoken British English from the 1950s to 1990s.
  • Sociolinguistic information on texts, speakers and authors.
  • Searchable with ICECUP 3.1.
  • Supplied with extensive on-line help.

Free download - the latest version of ICECUP 3.1.1 with DCPSE sampler

The latest version of our state-of-the-art ICECUP software is available for download from our website.

Watch a 'flash' preview demo of ICECUP
Download the software

There are numerous English corpora available.

WHAT IS SPECIAL ABOUT DCPSE?

DCPSE is fully grammatically analysed to the same standard as ICE-GB. All sentences in the corpus have been given a detailed parse tree.

DCPSE contains precisely 87,188 parse trees, comprising 885,436 words of English. The entire material is spoken.

This is the biggest single collection of parsed and checked orthographically transcribed spoken English material anywhere. The picture below shows ICECUP 3.1 browsing a text in the corpus.

DCPSE has been fully checked. Linguists checked it at several stages in its completion, using both a traditional ‘post-checking’ strategy and cross-sectional error-based searches. We do not believe that the analysis in the corpus is perfect, but it is not systematically imperfect, unlike even the best parser output.

DCPSE comes complete with ICECUP. ICECUP allows you to perform a variety of different queries, including using the parse analysis in the corpus to construct Fuzzy Tree Fragments to search the corpus.

A sample corpus from DCPSE Release 1 and ICECUP 3.1 is now available for download. We also invite linguists to contribute to the development of cutting-edge corpus linguistics tools by participating in our beta programme.

Corpus text categories

Face-to-face conversations (154) 494,000 words
  Formal (28) 90,000 words A
  Informal (126) 403,000 words B
Telephone conversations (14) 47,000 words C
Broadcast discussions (28) 89,000 words D
Broadcast interviews (14) 43,000 words E
Spontaneous commentary (32) 95,000 words F
Parliamentary language (7) 21,000 words G
Legal cross-examination (3) 9,000 words H
Assorted spontaneous (7) 21,000 words I
Prepared speech (21) 63,000 words J

Figures have been rounded down to the nearest thousand words. Only about 130,000 words are found in corpus texts with one speaker; the remainder are conversations or multi-speaker presentations.
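
As a rough consistency check, the rounded category figures can be summed and compared with the exact total of 885,436 words quoted above. The sketch below is illustrative only: it assumes the bracketed numbers in the table are text counts (the page does not label them), and the variable names are our own rather than part of ICECUP.

```python
# Rounded figures copied from the category table above.
# Assumption: the number in brackets is the count of texts per category.
CATEGORIES = {
    "A": ("Formal face-to-face conversations", 28, 90_000),
    "B": ("Informal face-to-face conversations", 126, 403_000),
    "C": ("Telephone conversations", 14, 47_000),
    "D": ("Broadcast discussions", 28, 89_000),
    "E": ("Broadcast interviews", 14, 43_000),
    "F": ("Spontaneous commentary", 32, 95_000),
    "G": ("Parliamentary language", 7, 21_000),
    "H": ("Legal cross-examination", 3, 9_000),
    "I": ("Assorted spontaneous", 7, 21_000),
    "J": ("Prepared speech", 21, 63_000),
}

EXACT_TOTAL_WORDS = 885_436  # exact word count stated on this page

# Sum the text counts and the rounded word counts over categories A-J.
total_texts = sum(texts for _, texts, _ in CATEGORIES.values())
rounded_words = sum(words for _, _, words in CATEGORIES.values())

# Each figure is rounded down, so each category loses fewer than 1,000 words.
shortfall = EXACT_TOTAL_WORDS - rounded_words
max_possible_loss = 1_000 * len(CATEGORIES)

print(f"Texts: {total_texts}")
print(f"Sum of rounded word counts: {rounded_words:,}")
print(f"Shortfall against exact total: {shortfall:,}")
print(f"Consistent with rounding down: {0 <= shortfall < max_possible_loss}")
```

The shortfall stays below 1,000 words per category, which is what rounding each figure down would produce.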

This page last modified 28 November, 2018 by Survey Web Administrator.