Comparing ICE-GB and DCPSE with other treebanks

The table below lists the publicly available phrase structure treebanks of English that have been fully parsed and checked.

We exclude corpora which were parsed automatically but not checked. Parsing a corpus is a very difficult task, precisely because the grammar of natural language is extremely complex. Automatic parsers are generally poor at choosing between competing structural analyses, although the simpler the analysis scheme deployed, the easier the task will tend to be.

It is also probably fair to say that not all corpora have been checked to the same degree: some teams are satisfied after one ‘post-correction’ pass, while others (including ourselves) are content to release corpora only after a great deal of cross-checking.

At a minimum, all corpora listed below have been manually completed (so that 100% coverage is obtained) and then checked and corrected by teams of linguists trained on the parsing scheme. Schemes vary in the level of grammatical detail, with TOSCA/ICE and SUSANNE at the ‘detailed’ end of the spectrum.

Both ICE-GB and DCPSE are in the top rank of currently available parsed corpora.
DCPSE contains the largest available volume of parsed spoken English in the world.
ICE-GB contains the next largest sample of spoken English, plus a significant amount of linguistically varied written English, handwritten and printed.

This is not a complete list of corpora, although we will endeavour to keep it up to date. The publication of large parsed corpora of this kind is a relatively rare occurrence.

Name	Size (1,000 words)	Spoken	:	Written	Variety	Analysis
University of Pennsylvania (Penn) Treebank [2]	2,900	<144	:	~2,756	US	Treebank I, II
American Printing House for the Blind Treebank	200	0	:	200	US	skeleton
Associated Press (AP) Treebank	1,000	0	:	1,000	US	skeleton
Canadian Hansard Treebank [1]	750	0	:	750	Can	skeleton
Nijmegen Parsed Corpus (limited availability)	130	10	:	120	Brit	TOSCA/ICE
Polytechnic of Wales Corpus [3]	61	61	:	0	Brit	POW (SFG)
Leeds-Lancaster Treebank (limited availability)	45	0	:	45	Brit	LOB (skeleton)
Lancaster Parsed Corpus	140	0	:	140	Brit	LOB (skeleton)
IBM / Lancaster Spoken English Corpus (SEC)	52	52	:	0	Brit	LOB (skeleton)
CHRISTINE & SUSANNE	260	130	:	130	Brit	SUSANNE
British Component of ICE (ICE-GB)	1,000	600	:	400	Brit	TOSCA/ICE
Diachronic Corpus of Present Day Spoken English (DCPSE)	800	800	:	0	Brit	TOSCA/ICE

Major hand-checked parsed phrase structure grammar corpora of English (available)
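
The ranking of corpora by volume of spoken material can be checked directly from the figures in the table. The following is a minimal sketch in Python, not part of the original survey: the names `CORPORA` and `rank_by_spoken` are hypothetical, the figures are copied from the table in thousands of words, and the Penn Treebank spoken figure (given as an upper bound, <144) is entered as 144 purely for illustration.

```python
# Figures copied from the table above, in thousands of words: (name, spoken, written).
# Assumption: Penn's spoken figure is an upper bound ("<144") and its written
# figure is approximate ("~2,756"); both are entered here at face value.
CORPORA = [
    ("Penn Treebank", 144, 2756),
    ("American Printing House for the Blind Treebank", 0, 200),
    ("Associated Press (AP) Treebank", 0, 1000),
    ("Canadian Hansard Treebank", 0, 750),
    ("Nijmegen Parsed Corpus", 10, 120),
    ("Polytechnic of Wales Corpus", 61, 0),
    ("Leeds-Lancaster Treebank", 0, 45),
    ("Lancaster Parsed Corpus", 0, 140),
    ("IBM / Lancaster SEC", 52, 0),
    ("CHRISTINE & SUSANNE", 130, 130),
    ("ICE-GB", 600, 400),
    ("DCPSE", 800, 0),
]


def rank_by_spoken(corpora):
    """Return (name, spoken, spoken_share) for corpora with spoken data, largest first."""
    rows = [
        (name, spoken, spoken / (spoken + written))
        for name, spoken, written in corpora
        if spoken > 0
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, spoken, share in rank_by_spoken(CORPORA):
        print(f"{name:<25} {spoken:>4}k words spoken ({share:.0%} of corpus)")
```

Under these assumptions DCPSE and ICE-GB head the ranking, and the result does not depend on the exact Penn figure, since any value below 144 still falls well short of 600.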

Notes

[1] ‘Spoken’ material here is limited to orthographically transcribed spoken data. Legal and political transcriptions are paraphrased, hence the Canadian Hansard is strictly a ‘written’ corpus. The grammar of paraphrases is ‘cleaned up’, and is therefore highly misleading as a guide to the grammar of speech.
[2] The figures for spoken material in the Penn Treebank are slightly uncertain for the same reason. We have excluded Hansard-type material but included the Air Traffic Control and telephone subcorpora, which were transcribed for the purpose of linguistic analysis.
[3] SFG stands for Hallidayan Systemic-Functional Grammar.
We have excluded constraint grammar corpora such as the ENCG corpora for two reasons: first, the level of correction applied is unclear (we are not ENCG experts), and second, the comparability of constraint and phrase structure grammars is still a matter of debate.

This page last modified 14 May, 2020 by Survey Web Administrator.

UCL Survey of English Usage
