DCPSE
The Diachronic Corpus of Present-Day Spoken English

DCPSE is a new parsed corpus of spoken English available on CD-ROM.
It contains more than 400,000 words from ICE-GB (collected in the
early 1990s) and 400,000 words from the London-Lund Corpus (late
1960s-early 1980s).
The orthographic transcriptions have been normalised and annotated
according to the same criteria. ICE-GB was used as a gold standard
for the parsing of DCPSE. The parsing has been corrected by a variety
of methods to provide as high a quality of result as possible (see
the project pages for more information).
DCPSE is an incomparable resource for examining recent change in
the grammar of spoken English.
NEW! DCPSE Release 1 on CD-ROM
DCPSE Release 1 with ICECUP 3.1 has been released.
See the order form for details.
- At least 800,000 words (87,000 trees) of fully-parsed and annotated
spoken British English from the 1950s to 1990s.
- Sociolinguistic information on texts, speakers and authors.
- Searchable with ICECUP 3.1.
- Supplied with Getting Started with ICECUP 3.1 (40pp)
and extensive on-line help.
Order DCPSE Release 1 | Download DCPSE Release 1 Sample and ICECUP 3.1
There are numerous English corpora available.
WHAT IS SPECIAL ABOUT DCPSE?
DCPSE is fully grammatically analysed
to the same standard as ICE-GB.
All sentences in the corpus have been given a detailed parse tree.
DCPSE contains precisely 87,188 parse trees, comprising 885,436
words of English. The entire material is spoken.
This is the biggest single collection of parsed and checked
orthographically transcribed spoken English material anywhere.
[Picture: ICECUP 3.1 browsing a text in the corpus.]

DCPSE has been fully checked. Linguists checked it at several stages
of its completion, using both a traditional post-checking strategy
and cross-sectional error-based searches. We do not believe that the
analysis in the corpus is perfect, but it is not systematically
imperfect, unlike the output of even the best parsers.
DCPSE comes complete with ICECUP.
ICECUP allows you to perform a variety of different queries, including
using the parse analysis in the corpus to construct Fuzzy
Tree Fragments to search the corpus.
A sample corpus from DCPSE Release 1 and ICECUP 3.1 is now available
for download. We also invite linguists
to contribute to the development of cutting-edge corpus linguistics
tools by participating in our beta programme.
Corpus text categories
| Face-to-face conversations (154) | 494,000 words |   |
| - Formal (28)                    |  90,000 words | A |
| - Informal (126)                 | 403,000 words | B |
| Telephone conversations (14)     |  47,000 words | C |
| Broadcast discussions (28)       |  89,000 words | D |
| Broadcast interviews (14)        |  43,000 words | E |
| Spontaneous commentary (32)      |  95,000 words | F |
| Parliamentary language (7)       |  21,000 words | G |
| Legal cross-examination (3)      |   9,000 words | H |
| Assorted spontaneous (7)         |  21,000 words | I |
| Prepared speech (21)             |  63,000 words | J |
Figures have been rounded down to the lower thousand of words.
Only about 130,000 words are found in corpus texts with one speaker;
the remainder are in conversations or multi-speaker presentations.
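
Because each category figure is rounded down, the table cannot be
summed to recover the exact corpus size, but it can be checked for
consistency with the 885,436-word total quoted above. The following
is a minimal sketch of that check in Python. The variable names are
illustrative only and are not part of DCPSE or ICECUP; the figures
are copied from the table and the text.

```python
# Rounded-down word counts for the leaf categories A-J, copied from the table.
# (Face-to-face conversations is omitted: it is the parent of A and B.)
rounded_words = {
    "A": 90_000,   # Formal face-to-face conversations
    "B": 403_000,  # Informal face-to-face conversations
    "C": 47_000,   # Telephone conversations
    "D": 89_000,   # Broadcast discussions
    "E": 43_000,   # Broadcast interviews
    "F": 95_000,   # Spontaneous commentary
    "G": 21_000,   # Parliamentary language
    "H": 9_000,    # Legal cross-examination
    "I": 21_000,   # Assorted spontaneous
    "J": 63_000,   # Prepared speech
}

EXACT_WORDS = 885_436  # total word count stated in the text
EXACT_TREES = 87_188   # total parse-tree count stated in the text

# Rounding down to the thousand loses between 0 and 999 words per category,
# so the true total must lie between these two bounds (inclusive).
lower_bound = sum(rounded_words.values())
upper_bound = lower_bound + 999 * len(rounded_words)

is_consistent = lower_bound <= EXACT_WORDS <= upper_bound
words_per_tree = EXACT_WORDS / EXACT_TREES

print(f"Sum of rounded figures: {lower_bound:,}")
print(f"Stated total within bounds: {is_consistent}")
print(f"Mean words per parse tree: {words_per_tree:.2f}")
```

The check only shows that the rounded figures are compatible with the
stated total; it says nothing about the exact size of any one category.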
This page last modified 23 October, 2009 by Survey Web Administrator.