DCPSE Corpus Design

The Diachronic Corpus of Present-Day Spoken English (DCPSE) is sampled from ICE-GB and from the London-Lund Corpus (LLC). ICE-GB texts are 2,000 words each, whereas LLC texts are 5,000 words each. To achieve a balanced sample, it was therefore necessary to take proportionately fewer texts from the LLC: two LLC texts for every five ICE-GB texts.

The texts were sampled into the following categories:

Text category                                 ICE-GB texts  LLC texts  Target words  Actual words
A. Face-to-face conversations, formal                   20          8        80,000        90,775
B. Face-to-face conversations, informal                 90         36       360,000       403,844
C. Telephone conversations (mostly informal)            10          4        40,000        47,242
D. Broadcast discussions (disparates/equals)            20          8        80,000        89,157
E. Broadcast interviews (disparates/equals)             10          4        40,000        43,046
F. Spontaneous commentary                               23          9        91,000        95,381
G. Parliamentary language                                5          2        20,000        21,083
H. Legal cross-examination                               2          1         9,000         9,658
I. Assorted spontaneous (unscripted) speech              5          2        20,000        21,675
J. Prepared speech (mostly monologue)                   15          6        60,000        63,575
Total                                                  200         80       800,000       885,436
DCPSE text categories and statistics
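
The target figures in the table can be reproduced from the text counts and the nominal text lengths given above (2,000 words per ICE-GB text, 5,000 per LLC text). The following is a minimal sketch of that arithmetic, not part of any DCPSE software; the names `CATEGORIES`, `target_words` and `overall_excess` are hypothetical and were chosen for this illustration only.

```python
# Nominal words per text in each source corpus, as stated in the text.
ICE_WORDS_PER_TEXT = 2_000
LLC_WORDS_PER_TEXT = 5_000

# category letter -> (ICE-GB texts, LLC texts, target words, actual words),
# copied from the table above.
CATEGORIES = {
    "A": (20, 8, 80_000, 90_775),
    "B": (90, 36, 360_000, 403_844),
    "C": (10, 4, 40_000, 47_242),
    "D": (20, 8, 80_000, 89_157),
    "E": (10, 4, 40_000, 43_046),
    "F": (23, 9, 91_000, 95_381),
    "G": (5, 2, 20_000, 21_083),
    "H": (2, 1, 9_000, 9_658),
    "I": (5, 2, 20_000, 21_675),
    "J": (15, 6, 60_000, 63_575),
}


def target_words(ice_texts: int, llc_texts: int) -> int:
    """Nominal word count for a category, given its number of texts."""
    return ice_texts * ICE_WORDS_PER_TEXT + llc_texts * LLC_WORDS_PER_TEXT


def overall_excess() -> float:
    """Fraction by which the actual total exceeds the target total."""
    total_target = sum(row[2] for row in CATEGORIES.values())
    total_actual = sum(row[3] for row in CATEGORIES.values())
    return total_actual / total_target - 1


if __name__ == "__main__":
    for letter, (ice, llc, target, actual) in CATEGORIES.items():
        print(letter, target_words(ice, llc), target, actual)
    print(f"overall excess: {overall_excess():.1%}")
```

Computed this way, each category's target equals the sum of its ICE-GB and LLC contributions, and in most categories the two contributions are equal.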

Only ~130,000 words are found in corpus texts with one speaker; the remainder are conversations or multi-speaker presentations.

Comments

The actual number of words included exceeds the target figure by roughly 10% (885,436 against 800,000 words). It is quite common in corpus linguistics to include more material rather than less. The number of words per ICE-GB or LLC text also varies slightly in categories F and H, but this variation is minor.

This variation is not a problem provided that experiments are carried out properly, and in general, the more data available, the better. Experimental results should always be considered in relative terms, i.e., in proportion to the number of words, clauses, or set of circumstances under investigation. Statistical methods used on corpus data, such as ratio statistics and chi-square, are designed on the assumption that samples are likely to be unequal in size.
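
As a rough illustration of what considering results "in proportion" can mean, the sketch below normalises a raw frequency to a rate per million words and computes a plain Pearson chi-square for a 2 × 2 table (two alternative forms counted in two subcorpora). It is not a DCPSE tool: the function names are hypothetical, no continuity correction is applied, and any counts passed in would be the researcher's own.

```python
def per_million(hits: int, words: int) -> float:
    """Raw frequency expressed as a rate per million words."""
    return hits / words * 1_000_000


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for the table

                 form X   form Y
    subcorpus 1     a        b
    subcorpus 2     c        d

    Expected cells are derived from the row and column totals, so the
    two subcorpora do not need to be the same size.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


if __name__ == "__main__":
    # Invented counts, for illustration only; these are not DCPSE results.
    print(per_million(50, 400_000))
    print(chi_square_2x2(10, 20, 30, 40))
```

With one degree of freedom, the resulting statistic would be compared against the usual critical value (about 3.84 at the 0.05 level).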

DCPSE text codes consist of the prefix "DI-" (for ICE-GB) or "DL-" (for LLC), the category letter A-J (above), and an index number. Thus,

DI-B07 is the seventh text among the informal face-to-face conversations sourced from ICE-GB (1990s).
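
The code scheme can be decoded mechanically. The following is a minimal sketch of such a decoder, assuming the index is written as one or more decimal digits (only the two-digit form "07" is shown here); `parse_text_code`, `TextCode`, `SOURCES` and `CATEGORY_NAMES` are hypothetical names introduced for this illustration.

```python
import re
from typing import NamedTuple

SOURCES = {"I": "ICE-GB", "L": "LLC"}

# Category letters and names, copied from the table of text categories.
CATEGORY_NAMES = {
    "A": "Face-to-face conversations, formal",
    "B": "Face-to-face conversations, informal",
    "C": "Telephone conversations (mostly informal)",
    "D": "Broadcast discussions (disparates/equals)",
    "E": "Broadcast interviews (disparates/equals)",
    "F": "Spontaneous commentary",
    "G": "Parliamentary language",
    "H": "Legal cross-examination",
    "I": "Assorted spontaneous (unscripted) speech",
    "J": "Prepared speech (mostly monologue)",
}

# Assumption: the index number is one or more ASCII digits.
TEXT_CODE = re.compile(r"D([IL])-([A-J])([0-9]+)")


class TextCode(NamedTuple):
    source: str    # "ICE-GB" or "LLC"
    category: str  # full category name
    index: int     # position of the text within its category


def parse_text_code(code: str) -> TextCode:
    """Split a code such as "DI-B07" into source, category and index."""
    match = TEXT_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a DCPSE text code: {code!r}")
    source_letter, category_letter, digits = match.groups()
    return TextCode(
        SOURCES[source_letter],
        CATEGORY_NAMES[category_letter],
        int(digits),
    )


if __name__ == "__main__":
    print(parse_text_code("DI-B07"))
```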

The sociolinguistic variable Source corpus stores the source corpus (ICE-GB or LLC) for every text.

The LLC corpus material

The London-Lund Corpus was recorded over several decades: the earliest tape is dated 1953, and the last, S-06-09, was recorded in 1987. The LLC texts included in DCPSE date from 1958 to 1977. In DCPSE the Date variable stores the year of recording.

The LLC was transcribed on paper at the Survey of English Usage and famously typed up on paper cards, or 'slips', which were archived at the Survey in card index cabinets. Without computers, 'indexing' consisted of manually underlining constituents on slips, and 'retrieval' consisted of opening the card indexes. It was only in the 1980s that the LLC was made accessible via a computer.

Many of the recordings in the LLC were made without the knowledge of all of the participants, a practice that would not be considered ethical today and that was not followed for ICE-GB. DCPSE contains an Awareness variable that records whether or not the speaker was aware of the recording.

The DCPSE project took these LLC texts and re-annotated them in a way that was as consistent with ICE-GB as possible. This meant importing ICE-GB transcription conventions, phonetic and prosodic information and segmentation, and recovering sociolinguistic information from dusty files – as well as carrying out the part-of-speech tagging and parsing of the text.

See also:

ICE-GB corpus design
Comparing DCPSE with other treebanks

This page last modified 12 June, 2013 by Survey Web Administrator.