ICE-GB

ICE-GB is the British component of the International Corpus of English. Published by the Survey of English Usage, it contains over 1 million words of fully parsed written and spoken English from c. 1990-92.

The British Component of the International Corpus of English

23 September 1998

Introduction

The International Corpus of English (ICE) began in 1988 with the primary aim of providing material for comparative studies of varieties of English throughout the world.

More than twenty centres around the world are preparing corpora of their own national or regional variety of English. These include:

Australia
Bahamas
Canada
East Africa (Kenya, Malawi, Tanzania)
Fiji
Ghana
Gibraltar
Great Britain
Hong Kong
India
Ireland
Jamaica
Malaysia
Malta
Namibia
New Zealand
Nigeria
Pakistan
Philippines
Puerto Rico
Scotland
Singapore
South Africa
Sri Lanka
Trinidad and Tobago
Uganda
USA

ICE-GB was first released in 1998 with ICECUP 3.0. Since then it has been used for research and education in universities, colleges and schools all over the world.

Availability

The latest version of ICE-GB is Release 2, supplied with ICECUP 3.1.1.

Release 2 is aligned with the 300 audio recordings, which are available as an optional extra. See the order form for details.

ICE-GB contains:

  • One million words of spoken and written British English from the 1990s.
  • Tagged, parsed and checked.
  • Bundled with the ICECUP 3.1 exploration software designed for parsed corpora. 
  • Supplied with extensive on-line help.
  • Optional audio recordings for 300 spoken texts.

Why is ICE-GB still special?

More than two decades after its first release, ICE-GB is still a unique resource for linguistics research.

ICE-GB is fully grammatically analysed. Like all the ICE corpora, ICE-GB consists of a million words of spoken and written English and adheres to the common corpus design. 200 written and 300 spoken texts make up the million words. Every text is grammatically annotated, permitting complex and detailed searches across the whole corpus.

ICE-GB contains 83,394 parse trees, including 59,640 in the spoken part of the corpus. This is the biggest collection of parsed spoken material anywhere with the exception of DCPSE (which only contains spoken material). The picture below shows ICECUP 3.1 displaying a single tree from the spoken part of the corpus.

Example tree and text unit in ICE-GB

ICE-GB has been fully checked. It was checked by linguists at several stages in its completion, using both a traditional ‘post-checking’ strategy and cross-sectional error-based searches. We do not believe that the analysis in the corpus is perfect, but it is not systematically imperfect, unlike the output of even the best parsers.

ICE-GB comes complete with ICECUP. ICECUP allows you to perform a variety of different queries, including using the parse analysis in the corpus to construct Fuzzy Tree Fragments to search the corpus.

Release 2 of ICE-GB is now available. This includes, as an optional paid-for extra, the digitised speech recordings of the spoken part of the corpus, aligned with the text. This allows researchers to play back the original source of the text that they can see on their screen.

  • A sample corpus from ICE-GB Release 2 and ICECUP 3.1 is now available for download. 
  • A book about ICE-GB and ICECUP was published in 2002.

Corpus Design

ICE-GB contains 500 texts of approximately 2,000 words each. Many of these texts are composite, that is, they consist of two or more different samples of the same type which have been combined to make up a 2,000-word 'text'. In the category of business letters, for instance, a total of 198 individual letters have been included. We refer to these individual samples as 'subtexts'.

The table below provides a summary of the composition of the ICE-GB corpus.

ICE-GB summary statistics (Release 2)
                                           spoken   written       total
Number of words                           637,562   423,702   1,061,264
Number of 2,000-word texts                    300       200         500
Number of individual samples (subtexts)       446       554       1,000
Average number of words per text            2,125     2,117       2,122
Average number of words per sample          1,429       764       1,061
Number of syntactic trees (parse units)    59,470    23,935      83,405
Average number of trees per text              198       119         166
Average number of trees per sample            133        43          83

With just over one million words, ICE-GB is small in comparison with, say, the British National Corpus (BNC), which contains 100 million words and samples British English from approximately the same period. However, from a scientific research perspective, a key question concerns sampling independence. Despite its smaller size, ICE-GB has a high participant-to-content ratio: it has 1,747 distinct participants, compared to the BNC's approximately 4,000. The BNC's mean sample length (24,293 words) is also much longer than ICE-GB's (1,061 words).
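The participant-to-content comparison can be made concrete with a little arithmetic. The sketch below is illustrative only: it uses the figures quoted above, treats the BNC's "100 million words" and "approximately 4,000" participants as round numbers, and the choice of "participants per million words" as the measure is an assumption made for this example rather than a statistic published with either corpus.

```python
# Figures quoted in the text; the BNC values are the approximate ones given there.
ICE_GB = {"words": 1_061_264, "participants": 1_747, "samples": 1_000}
BNC = {"words": 100_000_000, "participants": 4_000}


def participants_per_million(corpus):
    """Distinct participants per million words of corpus text."""
    return corpus["participants"] / corpus["words"] * 1_000_000


ice_rate = participants_per_million(ICE_GB)
bnc_rate = participants_per_million(BNC)

# Mean sample length is simply total words divided by number of samples.
ice_mean_sample = ICE_GB["words"] / ICE_GB["samples"]

print(f"ICE-GB: {ice_rate:.0f} participants per million words")
print(f"BNC:    {bnc_rate:.0f} participants per million words")
print(f"ICE-GB mean sample length: {ice_mean_sample:.0f} words")
```

On these figures ICE-GB has roughly 1,646 participants per million words against about 40 for the BNC, and the mean sample length works out at the 1,061 words quoted above.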

ICE-GB was designed primarily as a resource for syntactic studies, not for lexical studies. Unlike the BNC, every text unit ('sentence') in ICE-GB has been syntactically parsed at function and category level, and each unit is presented in the form of a syntactic tree. The 83,405 trees in Release 2 represent an invaluable resource for studies of the syntax of contemporary British English.

Corpus structure

The sampling structure of the corpus is shown below.

ICE-GB Corpus Design
Spoken Texts (300)
  Dialogues (180)
    Private (100)
      face-to-face conversations (90)
      phonecalls (10)
    Public (80)
      classroom lessons (20)
      broadcast discussions (20)
      broadcast interviews (10)
      parliamentary debates (10)
      legal cross-examinations (10)
      business transactions (10)
  Monologues (100)
    Unscripted (70)
      spontaneous commentaries (20)
      unscripted speeches (30)
      demonstrations (10)
      legal presentations (10)
    Scripted (30)
      broadcast talks (20)
      non-broadcast speeches (10)
  Mixed (20)
    broadcast news (20)
Written Texts (200)
  Non-printed (50)
    Non-professional writing (20)
      untimed student essays (10)
      student examination scripts (10)
    Correspondence (30)
      social letters (15)
      business letters (15)
  Printed (150)
    Academic writing (40)
      humanities (10)
      social sciences (10)
      natural sciences (10)
      technology (10)
    Non-academic writing (40)
      humanities (10)
      social sciences (10)
      natural sciences (10)
      technology (10)
    Reportage (20)
      press news reports (20)
    Instructional writing (20)
      administrative/regulatory (10)
      skills/hobbies (10)
    Persuasive writing (10)
      press editorials (10)
    Creative writing (20)
      novels/stories (20)
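The sampling design above is a tree in which every category's text count should equal the sum of its subcategories. As a minimal sketch, the hierarchy can be transcribed into a nested Python structure and checked recursively. The names `DESIGN` and `count_texts` are invented for this example and are not part of ICECUP; the counts are copied from the design table.

```python
# Each node is either an int (text count of a leaf category) or a
# (stated_total, children) pair transcribed from the design table.
FOUR_FIELDS = {"humanities": 10, "social sciences": 10,
               "natural sciences": 10, "technology": 10}

DESIGN = (500, {
    "Spoken": (300, {
        "Dialogues": (180, {
            "Private": (100, {"face-to-face conversations": 90,
                              "phonecalls": 10}),
            "Public": (80, {"classroom lessons": 20,
                            "broadcast discussions": 20,
                            "broadcast interviews": 10,
                            "parliamentary debates": 10,
                            "legal cross-examinations": 10,
                            "business transactions": 10}),
        }),
        "Monologues": (100, {
            "Unscripted": (70, {"spontaneous commentaries": 20,
                                "unscripted speeches": 30,
                                "demonstrations": 10,
                                "legal presentations": 10}),
            "Scripted": (30, {"broadcast talks": 20,
                              "non-broadcast speeches": 10}),
        }),
        "Mixed": (20, {"broadcast news": 20}),
    }),
    "Written": (200, {
        "Non-printed": (50, {
            "Non-professional writing": (20, {"untimed student essays": 10,
                                              "student examination scripts": 10}),
            "Correspondence": (30, {"social letters": 15,
                                    "business letters": 15}),
        }),
        "Printed": (150, {
            "Academic writing": (40, dict(FOUR_FIELDS)),
            "Non-academic writing": (40, dict(FOUR_FIELDS)),
            "Reportage": (20, {"press news reports": 20}),
            "Instructional writing": (20, {"administrative/regulatory": 10,
                                           "skills/hobbies": 10}),
            "Persuasive writing": (10, {"press editorials": 10}),
            "Creative writing": (20, {"novels/stories": 20}),
        }),
    }),
})


def count_texts(node):
    """Return the number of texts under a node, verifying stated subtotals."""
    if isinstance(node, int):
        return node
    stated, children = node
    computed = sum(count_texts(child) for child in children.values())
    if computed != stated:
        raise ValueError(f"stated {stated}, but children sum to {computed}")
    return stated


print(count_texts(DESIGN))  # raises ValueError if any subtotal is inconsistent
```

Every stated subtotal in the table agrees with the sum of its children, so the call completes without raising and reports the 500 texts of the full corpus.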

This structure is reflected in the TEXT CATEGORY variable in ICECUP. The following diagram shows ICECUP's Corpus Map, starting with the entire corpus at the top left and descending through spoken, dialogue, private and direct conversations to the individual texts S1A-001, S1A-002, etc. S1A-002 is expanded further to show its subtexts and speakers.

ICE-GB Corpus Map Table

Texts in ICE-GB date from 1990 to 1993 inclusive. This means that the printed texts were originally published, and the spoken texts originally recorded, during this period. The corpus does not include reprints, second or later editions, or transcripts of repeat broadcasts. For handwritten material, such as letters and essays, these dates refer to the date of composition.

All authors and speakers are British. This means that they were born in Great Britain, that is, England, Scotland, or Wales. In a small number of cases, we have relaxed this criterion to include those who were born elsewhere, but moved to Britain at an early age.


How ICE-GB compares with other treebanks

The table below lists fully parsed and checked phrase structure treebanks of English that are publicly available.

We exclude corpora which were parsed automatically but not checked. Parsing a corpus is a very difficult task, precisely because the grammar of natural language is extremely complex. Automatic algorithms are generally poor at distinguishing between different structures, although the simpler the analysis scheme deployed, the easier the task will tend to be.

It is also probably fair to say that not all corpora may have been checked to the same degree — with some teams being satisfied after one ‘post-correction’ pass, and others (including ourselves) only being content to release corpora after a great deal of cross-checking.

As a minimum, all corpora listed below have been manually completed (so that 100% coverage is obtained) and checked and corrected by teams of linguists trained on the parsing scheme. Schemes vary in the level of detail of the grammar, with TOSCA/ICE and SUSANNE at the ‘detailed’ end of a spectrum.

Major hand-checked parsed phrase structure grammar corpora of English (available)
Name                                                     Size (x1,000 words)  Ratio spoken:written  Variety  Analysis
University of Pennsylvania (Penn) Treebank [2]           2,900                <144:~2,756           US       Treebank I, II
American Printing House for the Blind Treebank           200                  0:200                 US       skeleton
Associated Press (AP) Treebank                           1,000                0:1,000               US       skeleton
Canadian Hansard Treebank [1]                            750                  0:750                 Can      skeleton
Nijmegen Parsed Corpus (limited availability)            130                  10:120                Brit     TOSCA/ICE
Polytechnic of Wales Corpus [3]                          61                   61:0                  Brit     POW (SFG)
Leeds-Lancaster Treebank (limited availability)          45                   0:45                  Brit     LOB (skeleton)
Lancaster Parsed Corpus                                  140                  0:140                 Brit     LOB (skeleton)
IBM / Lancaster Spoken English Corpus (SEC)              52                   52:0                  Brit     LOB (skeleton)
CHRISTINE & SUSANNE                                      260                  130:130               Brit     SUSANNE
British Component of ICE (ICE-GB)                        1,000                600:400               Brit     TOSCA/ICE
Diachronic Corpus of Present Day Spoken English (DCPSE)  800                  800:0                 Brit     TOSCA/ICE

Notes

  1. ‘Spoken’ material here is limited to orthographically transcribed spoken data. Legal and political transcriptions of material are paraphrased, hence the Canadian Hansard is strictly a ‘written’ corpus. The grammar of paraphrases is ‘cleaned up’, and therefore highly misleading as a guide to the grammar of speech.
  2. The figures of spoken material for the Penn Treebank are slightly uncertain for the same reason. We have excluded Hansard-type material but included Air Traffic Control and telephone subcorpora transcribed for the purpose of linguistic analysis.
  3. SFG stands for a Hallidayan Systemic-Functional Grammar.
  4. We have excluded constraint grammar corpora such as the ENCG corpora for two reasons. First, because the level of correction applied is unclear (we are not ENCG experts), and second, because comparability between constraint and phrase structure grammars is a matter of debate.
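To see how the treebanks compare on spoken material specifically, the spoken:written column of the table above can be ranked with a few lines of Python. This is only an illustrative sketch: names are shortened, figures are in thousands of words as given in the table, and the Penn Treebank's spoken figure is entered as its stated upper bound ("<144"), so its true position could be lower.

```python
# (name, spoken, written) in thousands of words, transcribed from the table.
# Penn's spoken value is an upper bound ("<144") and its written value is
# approximate ("~2,756") in the source.
TREEBANKS = [
    ("Penn Treebank", 144, 2756),
    ("American Printing House for the Blind Treebank", 0, 200),
    ("Associated Press Treebank", 0, 1000),
    ("Canadian Hansard Treebank", 0, 750),
    ("Nijmegen Parsed Corpus", 10, 120),
    ("Polytechnic of Wales Corpus", 61, 0),
    ("Leeds-Lancaster Treebank", 0, 45),
    ("Lancaster Parsed Corpus", 0, 140),
    ("IBM / Lancaster SEC", 52, 0),
    ("CHRISTINE & SUSANNE", 130, 130),
    ("ICE-GB", 600, 400),
    ("DCPSE", 800, 0),
]

# Keep only treebanks with some spoken material, largest first.
by_spoken = sorted(
    (row for row in TREEBANKS if row[1] > 0),
    key=lambda row: row[1],
    reverse=True,
)

for name, spoken, written in by_spoken:
    print(f"{name:<30} {spoken:>4}k spoken, {written:>5}k written")
```

On the table's figures, DCPSE and ICE-GB head the ranking, followed at some distance by the Penn Treebank and CHRISTINE & SUSANNE.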

ICE-GB Sound Recordings

The British Component of the International Corpus of English contains 300 samples of speech, including dialogues, monologues, scripted material, and unscripted material — a total of over 70 hours of recorded speech. The recordings were made on analogue tapes. They have been computerized and aligned to the orthographic transcriptions. 

The computerized sound files are available for download to allow researchers to listen to the recordings while examining the grammatically annotated transcriptions.

Researchers can access particular text units and hear context, both before and after the desired sound sample.

Free download

The sound recordings for the five spoken texts of the ICE-GB Sample Corpus are available for free download and may be installed in such a way that ICECUP will play the recordings.

Order ICE-GB Release 2 and R2 Sound

Audio playback is a feature of ICECUP 3.1. The ICE-GB R2 Sound Recordings are aligned with ICE-GB Release 2. If you order both the corpus and the audio, you can use ICECUP to hear speakers in the corpus.


The 'Red Book'

Exploring Natural Language:
Working with the British Component of the International Corpus of English

“This book is a must for anyone who wants to explore the immense possibilities of the ICE-GB.” - The Year’s Work in English Studies, 2004, 83.1
Gerald Nelson, Sean Wallis and Bas Aarts, 2002, Amsterdam: John Benjamins. 355 pages hbk/pbk. 
ISBN 90 272 4889 3 (Europe) / 1 58811 271 3 (US)
Edition G29 in the series Varieties of English Around the World (series editor: Edgar Schneider).

You can purchase this book from the John Benjamins website.

ICE-GB is a 1 million-word corpus of contemporary British English. It is fully parsed, and contains over 83,000 syntactic trees. Together with the dedicated retrieval software, ICECUP, ICE-GB is an unprecedented resource for the study of English syntax.

Exploring Natural Language is a comprehensive guide to both corpus and software. It contains a full reference guide for ICE-GB. The chapters on ICECUP provide complete instructions on the use of the many features of the software, including concordancing, lexical and grammatical searches, sociolinguistic queries, random sampling, and searching for syntactic structures using ICECUP's Fuzzy Tree Fragment models. Special attention is given to the principles of experimental design in a parsed corpus.

Six case studies provide step-by-step illustrations of how the corpus and software can be used to explore real linguistic issues, from simple lexical studies to more complex syntactic topics, such as noun phrase structure, verb transitivity, and voice.

Keywords: Corpus Linguistics; International Corpus of English (ICE); ICE-GB; ICECUP; Grammar; Parsing; Fuzzy Tree Fragments (FTFs); Research Methods; Corpus Exploration; Experimental Design
