ICE-GB Corpus Design

ICE-GB contains 500 texts of approximately 2,000 words each. Many of these texts are composite, that is, they consist of two or more different samples of the same type which have been combined to make up a 2,000-word 'text'. In the category of business letters, for instance, a total of 198 individual letters have been included. We refer to these individual samples as 'subtexts'.

The table below provides a summary of the composition of the ICE-GB corpus.

                                        Spoken    Written      Total
Number of words                        637,562    423,702  1,061,264
Number of 2,000-word texts                 300        200        500
Number of individual samples               447        554      1,001
Average number of words per text         2,125      2,118      2,122
Average number of words per sample       1,426        764      1,060
Number of syntactic trees               59,460     23,934     83,394
Average number of trees per text           198        119        166
Average number of trees per sample         133         43         83
ICE-GB Summary statistics
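
The averages in the table follow directly from the totals. As a minimal sketch, the Python below recomputes them from the word, text, sample, and tree counts given above. The dictionary keys and the function name `averages` are illustrative only. Integer division is used because the published averages match the quotients truncated to whole numbers (for example, 23,934 trees over 200 written texts is about 119.7, reported as 119).

```python
# Totals copied from the summary table; the key names are illustrative.
totals = {
    "spoken":  {"words": 637_562, "texts": 300, "samples": 447, "trees": 59_460},
    "written": {"words": 423_702, "texts": 200, "samples": 554, "trees": 23_934},
}

# The "total" column is the sum of the spoken and written columns.
totals["total"] = {
    key: totals["spoken"][key] + totals["written"][key]
    for key in totals["spoken"]
}


def averages(column):
    """Return per-text and per-sample averages, truncated to whole numbers."""
    return {
        "words_per_text": column["words"] // column["texts"],
        "words_per_sample": column["words"] // column["samples"],
        "trees_per_text": column["trees"] // column["texts"],
        "trees_per_sample": column["trees"] // column["samples"],
    }


for name, column in totals.items():
    print(name, column, averages(column))
```

Keeping the totals separate from the derived figures makes it easy to see that written samples are much shorter on average than spoken ones, which is why the written component has more samples but fewer trees per sample.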

With just over one million words, ICE-GB is small in comparison with the British National Corpus (BNC). The BNC contains 100 million words, and samples British English from approximately the same period.

However, ICE-GB was designed primarily as a resource for syntactic studies, not for lexical studies. Unlike the BNC, every text unit ('sentence') in ICE-GB has been syntactically parsed at function and category level, and each unit is presented in the form of a syntactic tree. The 83,394 trees in the corpus represent an invaluable resource for studies of the syntax of contemporary British English.

Corpus structure

The sampling structure of the corpus is shown below.

Spoken Texts (300)

Dialogues (180)
    Private (100)
        face-to-face conversations (90)
        phonecalls (10)
    Public (80)
        classroom lessons (20)
        broadcast discussions (20)
        broadcast interviews (10)
        parliamentary debates (10)
        legal cross-examinations (10)
        business transactions (10)
Monologues (100)
    Unscripted (70)
        spontaneous commentaries (20)
        unscripted speeches (30)
        demonstrations (10)
        legal presentations (10)
    Scripted (30)
        broadcast talks (20)
        non-broadcast speeches (10)
Mixed (20)
    broadcast news (20)

Written Texts (200)

Non-printed (50)
    Non-professional writing (20)
        untimed student essays (10)
        student examination scripts (10)
    Correspondence (30)
        social letters (15)
        business letters (15)
Printed (150)
    Academic writing (40)
        humanities (10)
        social sciences (10)
        natural sciences (10)
        technology (10)
    Non-academic writing (40)
        humanities (10)
        social sciences (10)
        natural sciences (10)
        technology (10)
    Reportage (20)
        press news reports (20)
    Instructional writing (20)
        administrative/regulatory (10)
        skills/hobbies (10)
    Persuasive writing (10)
        press editorials (10)
    Creative writing (20)
        novels/stories (20)
ICE Corpus Design
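
The sampling structure is a simple hierarchy in which each category's text count is the sum of the counts beneath it. As a hedged illustration, the sketch below transcribes the listing into a nested Python dictionary and sums the leaf counts. The names `STRUCTURE` and `count_texts` are hypothetical and are not part of ICECUP or the corpus distribution.

```python
# Sampling structure transcribed from the listing above.
# Each leaf maps a text category to its number of 2,000-word texts.
STRUCTURE = {
    "Spoken": {
        "Dialogues": {
            "Private": {
                "face-to-face conversations": 90,
                "phonecalls": 10,
            },
            "Public": {
                "classroom lessons": 20,
                "broadcast discussions": 20,
                "broadcast interviews": 10,
                "parliamentary debates": 10,
                "legal cross-examinations": 10,
                "business transactions": 10,
            },
        },
        "Monologues": {
            "Unscripted": {
                "spontaneous commentaries": 20,
                "unscripted speeches": 30,
                "demonstrations": 10,
                "legal presentations": 10,
            },
            "Scripted": {
                "broadcast talks": 20,
                "non-broadcast speeches": 10,
            },
        },
        "Mixed": {"broadcast news": 20},
    },
    "Written": {
        "Non-printed": {
            "Non-professional writing": {
                "untimed student essays": 10,
                "student examination scripts": 10,
            },
            "Correspondence": {
                "social letters": 15,
                "business letters": 15,
            },
        },
        "Printed": {
            "Academic writing": {
                "humanities": 10,
                "social sciences": 10,
                "natural sciences": 10,
                "technology": 10,
            },
            "Non-academic writing": {
                "humanities": 10,
                "social sciences": 10,
                "natural sciences": 10,
                "technology": 10,
            },
            "Reportage": {"press news reports": 20},
            "Instructional writing": {
                "administrative/regulatory": 10,
                "skills/hobbies": 10,
            },
            "Persuasive writing": {"press editorials": 10},
            "Creative writing": {"novels/stories": 20},
        },
    },
}


def count_texts(node):
    """Sum the leaf counts beneath a node of the hierarchy."""
    if isinstance(node, int):
        return node
    return sum(count_texts(child) for child in node.values())


# Compare each computed subtotal with the figure stated in the listing.
print("Corpus:", count_texts(STRUCTURE))
print("Spoken:", count_texts(STRUCTURE["Spoken"]))
print("Written:", count_texts(STRUCTURE["Written"]))
```

Summing from the leaves rather than storing subtotals means a transcription slip in any one category shows up as a mismatch against the stated figures (500 texts overall, 300 spoken, 200 written).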

This structure is reflected in the TEXT CATEGORY variable in ICECUP. ICECUP's Corpus Map shows the entire corpus at the top left, then descends through spoken, dialogue, private, and direct conversations to the individual texts (S1A-001, S1A-002, etc.). Opening a text such as S1A-002 reveals its subtexts and speakers.

The texts in ICE-GB date from 1990 to 1993 inclusive. This means that the printed texts were originally published, and the spoken texts originally recorded, during this period. The corpus does not include reprints, second or later editions, or transcripts of repeat broadcasts. For handwritten material, such as letters and essays, these dates refer to the date of composition.

All authors and speakers are British. This means that they were born in Great Britain, that is, England, Scotland, or Wales. In a small number of cases, we have relaxed this criterion to include those who were born elsewhere, but moved to Britain at an early age.

See also:

Comparing ICE-GB with other treebanks

This page last modified 12 June, 2013 by Survey Web Administrator.