Comparing ICE-GB and DCPSE with other treebanks

The table below lists the fully parsed and checked phrase structure treebanks of English that are publicly available.

We exclude corpora which were parsed automatically but not checked. Parsing a corpus is a very difficult task, precisely because the grammar of natural language is extremely complex. Automatic algorithms are generally poor at distinguishing between different structures, although the simpler the analysis scheme deployed, the easier the task will tend to be.

It is also probably fair to say that corpora have not all been checked to the same degree: some teams are satisfied after one ‘post-correction’ pass, while others (including ourselves) are content to release corpora only after a great deal of cross-checking.

As a minimum, all corpora listed below have been manually completed (so that 100% coverage is obtained) and checked and corrected by teams of linguists trained on the parsing scheme. Schemes vary in the level of grammatical detail, with TOSCA/ICE and SUSANNE at the ‘detailed’ end of the spectrum.

  1. Both ICE-GB and DCPSE are in the top rank of currently available parsed corpora.
  2. DCPSE contains the largest available volume of parsed spoken English in the world.
  3. ICE-GB contains the next largest sample of spoken English, plus a significant amount of linguistically varied written English, handwritten and printed.

This is not a complete list of corpora, although we will endeavour to keep it up to date. The publication of large parsed corpora of this kind is a relatively rare occurrence.

Name                                                      Size (1,000 words)  Spoken : written  Variety  Analysis
University of Pennsylvania (Penn) Treebank [2]            2,900               <144 : ~2,756     US       Treebank I, II
American Printing House for the Blind Treebank            200                 0 : 200           US       skeleton
Associated Press (AP) Treebank                            1,000               0 : 1,000         US       skeleton
Canadian Hansard Treebank [1]                             750                 0 : 750           Can      skeleton
Nijmegen Parsed Corpus (limited availability)             130                 10 : 120          Brit     TOSCA/ICE
Polytechnic of Wales Corpus [3]                           61                  61 : 0            Brit     POW (SFG)
Leeds-Lancaster Treebank (limited availability)           45                  0 : 45            Brit     LOB (skeleton)
Lancaster Parsed Corpus                                   140                 0 : 140           Brit     LOB (skeleton)
IBM / Lancaster Spoken English Corpus (SEC)               52                  52 : 0            Brit     LOB (skeleton)
CHRISTINE & SUSANNE                                       260                 130 : 130         Brit     SUSANNE
British Component of ICE (ICE-GB)                         1,000               600 : 400         Brit     TOSCA/ICE
Diachronic Corpus of Present Day Spoken English (DCPSE)   800                 800 : 0           Brit     TOSCA/ICE
Major hand-checked parsed phrase structure grammar corpora of English (available)
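
The ranking by spoken volume stated in points 2 and 3 above can be checked directly against the table. The short Python sketch below transcribes the spoken and written figures (in thousands of words) and sorts the corpora by spoken volume. The variable and function names are illustrative only, and the Penn Treebank entry uses the table's upper bound of 144 for spoken material, so its actual figure may be lower.

```python
# Figures transcribed from the table, in thousands of words.
# Assumption: the Penn Treebank spoken figure is the upper bound ("<144")
# and its written figure is the approximate value ("~2,756").
CORPORA = [
    # (name, spoken, written)
    ("Penn Treebank", 144, 2756),
    ("American Printing House for the Blind Treebank", 0, 200),
    ("Associated Press (AP) Treebank", 0, 1000),
    ("Canadian Hansard Treebank", 0, 750),
    ("Nijmegen Parsed Corpus", 10, 120),
    ("Polytechnic of Wales Corpus", 61, 0),
    ("Leeds-Lancaster Treebank", 0, 45),
    ("Lancaster Parsed Corpus", 0, 140),
    ("IBM / Lancaster Spoken English Corpus (SEC)", 52, 0),
    ("CHRISTINE & SUSANNE", 130, 130),
    ("ICE-GB", 600, 400),
    ("DCPSE", 800, 0),
]


def rank_by_spoken(corpora):
    """Return (name, spoken) pairs for corpora with spoken data, largest first."""
    spoken = [(name, s) for name, s, _ in corpora if s > 0]
    return sorted(spoken, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_spoken(CORPORA)
for name, spoken in ranking:
    print(f"{name}: {spoken},000 words of spoken material")
```

On these figures DCPSE and ICE-GB occupy the first two positions, and even the upper bound for the Penn Treebank falls well short of either.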

Notes

  1. ‘Spoken’ material here is limited to orthographically transcribed spoken data. Legal and political transcripts are paraphrases of what was said, hence the Canadian Hansard is strictly a ‘written’ corpus. The grammar of such paraphrases is ‘cleaned up’, and is therefore highly misleading as a guide to the grammar of speech.
  2. The figures for spoken material in the Penn Treebank are slightly uncertain for the same reason. We have excluded Hansard-type material but included Air Traffic Control and telephone subcorpora transcribed for the purpose of linguistic analysis.
  3. SFG stands for a Hallidayan Systemic-Functional Grammar.
  4. We have excluded constraint grammar corpora such as the ENCG corpora for two reasons. First, because the level of correction applied is unclear (we are not ENCG experts), and second, because comparability between constraint and phrase structure grammars is still a matter of debate.

This page last modified 12 June, 2013 by Survey Web Administrator.