ICE-GB Corpus Design

ICE-GB contains 500 texts of approximately 2,000 words each. Many of these texts are composite, that is, they consist of two or more different samples of the same type which have been combined to make up a 2,000-word 'text'. In the category of business letters, for instance, a total of 198 individual letters have been included. We refer to these individual samples as 'subtexts'.

The table below provides a summary of the composition of the ICE-GB corpus.

	Spoken	Written	Total
Number of words	637,562	423,702	1,061,264
Number of 2,000-word texts	300	200	500
Number of individual samples	447	554	1,001
Average number of words per text	2,125	2,118	2,122
Average number of words per sample	1,426	764	1,060
Number of syntactic trees	59,460	23,934	83,394
Average number of trees per text	198	119	166
Average number of trees per sample	133	43	83

ICE-GB Summary statistics
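
The averages in the table follow directly from the totals. The following is a minimal Python sketch, not part of the ICE-GB tools; the dictionary layout and function name are illustrative only. It assumes that the published averages are truncated to whole numbers rather than rounded, since integer division is what reproduces the figures in the table.

```python
# Totals copied from the summary table, as (spoken, written).
TOTALS = {
    "words": (637_562, 423_702),
    "texts": (300, 200),
    "samples": (447, 554),
    "trees": (59_460, 23_934),
}


def averages(numerator, denominator):
    """Return truncated (spoken, written, total) averages of numerator per denominator."""
    num = TOTALS[numerator]
    den = TOTALS[denominator]
    # Append the corpus-wide total as a third column.
    num = (*num, sum(num))
    den = (*den, sum(den))
    # Assumption: the table truncates rather than rounds (floor division).
    return tuple(n // d for n, d in zip(num, den))


print("words per text:  ", averages("words", "texts"))
print("words per sample:", averages("words", "samples"))
print("trees per text:  ", averages("trees", "texts"))
print("trees per sample:", averages("trees", "samples"))
```

Rounding to the nearest whole number instead would give 120 and 167 trees per text in the written and total columns, so truncation is the reading that matches the table.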

With just over one million words, ICE-GB is small in comparison with the British National Corpus (BNC). The BNC contains 100 million words, and samples British English from approximately the same period.

However, ICE-GB was designed primarily as a resource for syntactic studies, not for lexical studies. Unlike the texts in the BNC, every text unit ('sentence') in ICE-GB has been syntactically parsed at function and category level, and each unit is presented in the form of a syntactic tree. The 83,394 trees in the corpus represent an invaluable resource for studies of the syntax of contemporary British English.

Corpus structure

The sampling structure of the corpus is shown below.

Spoken Texts (300)	Dialogues (180)	Private (100)	face-to-face conversations (90), phonecalls (10)
	Dialogues (180)	Public (80)	classroom lessons (20), broadcast discussions (20), broadcast interviews (10), parliamentary debates (10), legal cross-examinations (10), business transactions (10)
	Monologues (100)	Unscripted (70)	spontaneous commentaries (20), unscripted speeches (30), demonstrations (10), legal presentations (10)
	Monologues (100)	Scripted (30)	broadcast talks (20), non-broadcast speeches (10)
	Mixed (20)		broadcast news (20)
Written Texts (200)	Non-printed (50)	Non-professional writing (20)	untimed student essays (10), student examination scripts (10)
	Non-printed (50)	Correspondence (30)	social letters (15), business letters (15)
	Printed (150)	Academic writing (40)	humanities (10), social sciences (10), natural sciences (10), technology (10)
	Printed (150)	Non-academic writing (40)	humanities (10), social sciences (10), natural sciences (10), technology (10)
	Printed (150)	Reportage (20)	press news reports (20)
	Printed (150)	Instructional writing (20)	administrative/regulatory (10), skills/hobbies (10)
	Printed (150)	Persuasive writing (10)	press editorials (10)
	Printed (150)	Creative writing (20)	novels/stories (20)

ICE Corpus Design
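
The sampling structure is a simple hierarchy in which each bracketed figure is the sum of the figures beneath it. As an illustration only, it can be sketched in Python as a nested dictionary; the variable and function names here are hypothetical and do not come from ICECUP. The leaf values are the numbers of texts given in the table.

```python
# Sampling structure copied from the table; leaf values are numbers of texts.
ICE_GB = {
    "Spoken": {
        "Dialogues": {
            "Private": {"face-to-face conversations": 90, "phonecalls": 10},
            "Public": {
                "classroom lessons": 20,
                "broadcast discussions": 20,
                "broadcast interviews": 10,
                "parliamentary debates": 10,
                "legal cross-examinations": 10,
                "business transactions": 10,
            },
        },
        "Monologues": {
            "Unscripted": {
                "spontaneous commentaries": 20,
                "unscripted speeches": 30,
                "demonstrations": 10,
                "legal presentations": 10,
            },
            "Scripted": {"broadcast talks": 20, "non-broadcast speeches": 10},
        },
        "Mixed": {"broadcast news": 20},
    },
    "Written": {
        "Non-printed": {
            "Non-professional writing": {
                "untimed student essays": 10,
                "student examination scripts": 10,
            },
            "Correspondence": {"social letters": 15, "business letters": 15},
        },
        "Printed": {
            "Academic writing": {
                "humanities": 10, "social sciences": 10,
                "natural sciences": 10, "technology": 10,
            },
            "Non-academic writing": {
                "humanities": 10, "social sciences": 10,
                "natural sciences": 10, "technology": 10,
            },
            "Reportage": {"press news reports": 20},
            "Instructional writing": {
                "administrative/regulatory": 10, "skills/hobbies": 10,
            },
            "Persuasive writing": {"press editorials": 10},
            "Creative writing": {"novels/stories": 20},
        },
    },
}


def count_texts(node):
    """Sum the text counts of all leaves at or below this node."""
    if isinstance(node, int):
        return node
    return sum(count_texts(child) for child in node.values())


print(count_texts(ICE_GB))                          # whole corpus
print(count_texts(ICE_GB["Spoken"]["Dialogues"]))   # one branch
```

Summing any branch in this way gives a quick check that the subcategory figures add up to the bracketed total of their parent category.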

This structure is reflected in the TEXT CATEGORY variable in ICECUP. The following diagram shows ICECUP's Corpus Map, with the entire corpus at the top left, followed by spoken, dialogue, private, and direct conversations, down to the first texts (S1A-001, S1A-002, etc.). S1A-002 is expanded further to show its subtexts and speakers.

The texts in ICE-GB date from 1990 to 1993 inclusive. This means that the printed texts were originally published, and the spoken texts originally recorded, during this period. The corpus does not include reprints, second or later editions, or transcripts of repeat broadcasts. For handwritten material, such as letters and essays, these dates refer to the date of composition.

All authors and speakers are British. This means that they were born in Great Britain, that is, England, Scotland, or Wales. In a small number of cases, we have relaxed this criterion to include those who were born elsewhere, but moved to Britain at an early age.

UCL Survey of English Usage
