DCPSE Corpus Design

The Diachronic Corpus of Present Day Spoken English (DCPSE) is sampled from ICE-GB and from the London-Lund Corpus (LLC). ICE-GB texts are 2,000 words each, whereas LLC texts are 5,000 words each. To achieve a balanced sample, it was therefore necessary to take proportionately fewer texts from the LLC: roughly two LLC texts for every five ICE-GB texts.

The texts were sampled into the following categories:

	Text category	ICE-GB texts	LLC texts	Target words	Actual words

A.	Face-to-face conversations, formal	20	8	80,000	90,775
B.	Face-to-face conversations, informal	90	36	360,000	403,844
C.	Telephone conversations (mostly informal)	10	4	40,000	47,242
D.	Broadcast discussions (disparates/equals)	20	8	80,000	89,157
E.	Broadcast interviews (disparates/equals)	10	4	40,000	43,046
F.	Spontaneous commentary	23	9	91,000	95,381
G.	Parliamentary language	5	2	20,000	21,083
H.	Legal cross-examination	2	1	9,000	9,658
I.	Assorted spontaneous (unscripted) speech	5	2	20,000	21,675
J.	Prepared speech (mostly monologue)	15	6	60,000	63,575
	Total	200	80	800,000	885,436

DCPSE text categories and statistics
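
The target column can be reproduced from the text counts and the nominal text lengths given above. The following is a minimal sketch of that arithmetic in Python; the names (CATEGORIES, target_words) are illustrative and not part of DCPSE or any corpus tool.

```python
# Nominal words per text in each source corpus, as stated above.
ICE_WORDS_PER_TEXT = 2_000
LLC_WORDS_PER_TEXT = 5_000

# (ICE-GB texts, LLC texts, published target) per category, copied from the table.
CATEGORIES = {
    "A": (20, 8, 80_000),
    "B": (90, 36, 360_000),
    "C": (10, 4, 40_000),
    "D": (20, 8, 80_000),
    "E": (10, 4, 40_000),
    "F": (23, 9, 91_000),
    "G": (5, 2, 20_000),
    "H": (2, 1, 9_000),
    "I": (5, 2, 20_000),
    "J": (15, 6, 60_000),
}


def target_words(ice_texts, llc_texts):
    """Target word count implied by the number of texts from each source."""
    return ice_texts * ICE_WORDS_PER_TEXT + llc_texts * LLC_WORDS_PER_TEXT


# Compare the computed target with the published one for every category.
for code, (ice, llc, published) in CATEGORIES.items():
    computed = target_words(ice, llc)
    print(code, computed, "matches" if computed == published else "differs")

total_target = sum(target_words(ice, llc) for ice, llc, _ in CATEGORIES.values())
print("Total target:", total_target)
```

Splitting the same calculation by source shows where the two halves are not exactly equal: category F has 46,000 target words from ICE-GB against 45,000 from the LLC, and category H has 4,000 against 5,000.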

Only about 130,000 words are found in corpus texts with a single speaker. The remainder are conversations or multi-speaker presentations.

Comments

The actual number of words included exceeds the target figure by about 11% (885,436 against a target of 800,000). It is quite common in corpus linguistics to include more material rather than less. There is also a slight variation in the number of words per ICE-GB or LLC text for categories F and H, but this variation is minor.

This variation is not a problem provided that experiments are carried out properly, and in general, the more data available, the better. Experimental results should always be considered in relative terms, i.e., in proportion to the number of words, clauses, or other set of circumstances under investigation. Statistical methods used on corpus data, such as ratio statistics and chi-square, are designed to cope with samples of unequal size.
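
As an illustration of comparing results in proportion to sample size, the sketch below normalizes raw hit counts to a rate per million words and computes a Pearson chi-square for a 2 x 2 table. It is a minimal sketch only: the function names and all figures are hypothetical and are not taken from DCPSE, and words are used as the baseline purely for simplicity. In a real experiment the baseline should be whichever unit (words, clauses, or another set of circumstances) the construction could actually occur in.

```python
def per_million(hits, words):
    """Frequency of a construction normalized to a rate per million words."""
    return hits / words * 1_000_000


def chi_square_2x2(hits_a, size_a, hits_b, size_b):
    """Pearson chi-square for hits vs. non-hits in two samples of any size."""
    observed = [
        [hits_a, size_a - hits_a],
        [hits_b, size_b - hits_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected cell value if both samples shared the same rate.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical figures, for illustration only (NOT real DCPSE counts).
llc_hits, llc_words = 300, 400_000
ice_hits, ice_words = 420, 480_000

print("LLC rate per million words:", per_million(llc_hits, llc_words))
print("ICE-GB rate per million words:", per_million(ice_hits, ice_words))

# With one degree of freedom, a value above 3.841 is significant at p < 0.05.
print("chi-square:", chi_square_2x2(llc_hits, llc_words, ice_hits, ice_words))
```

Because expected values are computed from the row and column totals, the two samples do not need to be the same size, which is why the extra material in some categories does not invalidate comparisons.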

DCPSE text codes consist of the prefix "DI-" (for ICE-GB) or "DL-" (for LLC), the category letter A-J (above), and an index number. Thus,

DI-B07 is the seventh text among the informal face-to-face conversations sourced from ICE-GB (1990s).
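
The code format described above can be decoded mechanically. The following is a minimal sketch in Python; parse_text_code is a hypothetical helper, not part of any DCPSE software, and the pattern assumes that the index is a run of digits, as in the single example DI-B07.

```python
import re

# Source prefixes and category letters as described in the text and table above.
SOURCES = {"I": "ICE-GB", "L": "LLC"}
CATEGORY_NAMES = {
    "A": "Face-to-face conversations, formal",
    "B": "Face-to-face conversations, informal",
    "C": "Telephone conversations (mostly informal)",
    "D": "Broadcast discussions (disparates/equals)",
    "E": "Broadcast interviews (disparates/equals)",
    "F": "Spontaneous commentary",
    "G": "Parliamentary language",
    "H": "Legal cross-examination",
    "I": "Assorted spontaneous (unscripted) speech",
    "J": "Prepared speech (mostly monologue)",
}

# Assumed shape: "D", source letter, hyphen, category letter, digits.
CODE_PATTERN = re.compile(r"D([IL])-([A-J])(\d+)")


def parse_text_code(code):
    """Split a DCPSE-style text code into source corpus, category and index."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a recognized text code: {code!r}")
    source, category, index = match.groups()
    return {
        "source": SOURCES[source],
        "category": CATEGORY_NAMES[category],
        "index": int(index),
    }


print(parse_text_code("DI-B07"))
```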

The sociolinguistic variable Source corpus stores the source corpus (ICE-GB or LLC) for every text.

The LLC corpus material

The London-Lund Corpus was recorded over several decades, from the earliest tape, dated 1953, to the last, S-06-09, recorded in 1987. The LLC texts included in DCPSE span the years 1958 to 1977. In DCPSE the Date variable stores the year of recording.

The LLC was transcribed on paper at the Survey of English Usage and then, famously, typed up onto paper cards or 'slips', which were archived in card index cabinets at the Survey. Without computers, 'indexing' consisted of manually underlining constituents on slips, and 'retrieval' consisted of opening card indexes. It was only in the 1980s that the LLC was made accessible via a computer.

Many of the recordings in the LLC were made without the knowledge of all of the participants, a practice which would not be considered ethical today and which was not followed for ICE-GB. DCPSE contains an Awareness variable that records whether or not the speaker was aware of the recording.

The DCPSE project took these LLC texts and re-annotated them in a way that was as consistent with ICE-GB as possible. This meant importing ICE-GB transcription conventions, phonetic and prosodic information and segmentation, and recovering sociolinguistic information from dusty files – as well as carrying out the part-of-speech tagging and parsing of the text.
