The International Corpus of English

The International Corpus of English (ICE) project was initiated in 1988 by the late Sidney Greenbaum, then Director of the Survey of English Usage, University College London. In a brief notice in World Englishes, Greenbaum pointed out that grammatical studies had been greatly facilitated by the availability of two computerized corpora of printed English: the Brown Corpus of American English and the LOB (Lancaster-Oslo/Bergen) Corpus of British English. Greenbaum continued:

We should now be thinking of extending the scope for computerized comparative studies in three ways: (1) to sample standard varieties from other countries where English is the first language, for example Canada and Australia; (2) to sample national varieties from countries where English is an official additional language, for example India and Nigeria; and (3) to include spoken and manuscript English as well as printed English. (Greenbaum 1988)

In response, linguists from around the world came forward to discuss Greenbaum's proposal, and ultimately to put it into effect (Greenbaum 1991). The project soon became known as the International Corpus of English (ICE), and was coordinated by Greenbaum until 1996. From 1996 to 2001, ICE was coordinated by Charles Meyer, University of Massachusetts-Boston. It is now coordinated by Gerald Nelson in Hong Kong. The ICE project involves research teams in each of the countries or regions shown below.

Australia
Cameroon
Canada
East Africa (Kenya, Malawi, Tanzania)
Fiji
Great Britain
Hong Kong
India
Ireland
Jamaica
Kenya
Malta
Malaysia
New Zealand
Nigeria
Pakistan
Philippines
Sierra Leone
Singapore
South Africa
Sri Lanka
Trinidad and Tobago
USA

Each ICE team is compiling – or has already compiled – a one-million-word corpus of its own national or regional variety of English. Crucially, each team follows a common corpus design and a common annotation scheme, in order to ensure maximum comparability between the components (Nelson 1996). The long-term aim of ICE is to produce up to twenty corpora of one million words each, all syntactically analysed according to a common parsing scheme and supplied with the retrieval software, ICECUP.

Each ICE corpus samples the English of adults (age 18 or over) who have been educated through the medium of English to at least the end of secondary schooling. Furthermore, each component corpus is grammatically analysed using a common grammatical annotation scheme.


Our ICE corpora

ICE-GB is the British component of ICE. It was compiled and grammatically analysed at the Survey of English Usage, between 1990 and 1998. Version 1 was released on CD-ROM in 1998, with ICECUP 3.0. Version 2, documented here, was released in 2006 with ICECUP 3.1 and audio recordings.

DCPSE, the Diachronic Corpus of Present-Day Spoken English, is a new corpus that samples spoken English across several decades, drawing on ICE-GB and an earlier corpus, the London-Lund Corpus (LLC). The spoken ('London') part of the LLC was collected by Randolph Quirk at the Survey, primarily in the 1960s and 1970s. The samples are of equal size – 400,000 words – and were selected to be balanced by spoken 'genre', so that they contain, for example, similar numbers of words of telephone conversation.

DCPSE is fully parsed using the same grammatical scheme as ICE-GB. Because it samples spoken English across a period from the late 1950s to the early 1990s, this corpus makes it possible, for the first time, to investigate recent grammatical change in spoken English. DCPSE is a "spin-off" from the ICE Project, with a different corpus design but sharing common grammar and markup standards. It is also supplied with the ICECUP software.


This page last modified 18 November, 2021 by Survey Web Administrator.
