Comparing ICE-GB and DCPSE with other treebanks
The table below is a list of fully-parsed and checked phrase structure treebanks of English that are publicly available.
We exclude corpora which were parsed automatically but not checked. Parsing a corpus is a very difficult task, precisely because the grammar of natural language is extremely complex. Automatic algorithms are generally poor at distinguishing between different structures, although the simpler the analysis scheme deployed, the easier the task will tend to be.
It is also probably fair to say that not all corpora have been checked to the same degree: some teams were satisfied after one ‘post-correction’ pass, while others (including ourselves) were content to release corpora only after a great deal of cross-checking.
At a minimum, all corpora listed below have been manually completed (so that 100% coverage is obtained) and then checked and corrected by teams of linguists trained on the parsing scheme. Schemes vary in the level of grammatical detail, with TOSCA/ICE and SUSANNE at the ‘detailed’ end of the spectrum.
- Both ICE-GB and DCPSE are in the top rank of currently available parsed corpora.
- DCPSE contains the largest available volume of parsed spoken English in the world.
- ICE-GB contains the next largest sample of spoken English, plus a significant amount of linguistically varied written English, handwritten and printed.
This is not a complete list of corpora, although we will endeavour to keep it up to date. The publication of large parsed corpora of this kind is a relatively rare occurrence.
|Major hand-checked parsed phrase structure grammar corpora of English (available)|
- ‘Spoken’ material here is limited to orthographically transcribed spoken data. Legal and political transcripts are paraphrased, hence the Canadian Hansard is strictly a ‘written’ corpus. The grammar of paraphrases is ‘cleaned up’, and is therefore highly misleading as a guide to the grammar of speech.
- The figures for spoken material in the Penn Treebank are slightly uncertain for the same reason. We have excluded Hansard-type material but included the Air Traffic Control and telephone subcorpora, which were transcribed for the purpose of linguistic analysis.
- SFG stands for a Hallidayan Systemic-Functional Grammar.
- We have excluded constraint grammar corpora such as the ENCG corpora for two reasons. First, because the level of correction applied is unclear (we are not ENCG experts), and second, because comparability between constraint and phrase structure grammars is still a matter of debate.
This page last modified 25 April, 2013 by Survey Web Administrator.