
Comparing ICE-GB and DCPSE with other treebanks
The table below lists the fully parsed and manually checked phrase structure
treebanks of English that are publicly available.
We exclude corpora that were parsed automatically but not checked.
Parsing a corpus is a very difficult task, precisely because the
grammar of natural language is extremely complex. Automatic algorithms
are generally poor at distinguishing between different structures,
although the simpler the analysis scheme deployed, the easier the
task will tend to be.
It is also probably fair to say that not all corpora have been
checked to the same degree: some teams are satisfied
after one ‘post-correction’ pass, while others (including
ourselves) are content to release corpora only after a great deal
of cross-checking.
As a minimum, all corpora listed below have been manually completed
(so that 100% coverage is obtained) and checked and corrected
by teams of linguists trained on the parsing scheme. Schemes vary
in the level of grammatical detail, with TOSCA/ICE
and SUSANNE at the ‘detailed’ end of the spectrum.
- Both ICE-GB and DCPSE are in the top rank of currently
available parsed corpora.
- DCPSE contains the largest available volume of parsed spoken
English in the world.
- ICE-GB contains the next largest sample of spoken English,
plus a significant amount of linguistically varied written English,
handwritten and printed.
This is not a complete list of corpora, although we will
endeavour to keep it up to date. The publication of large
parsed corpora of this kind is a relatively rare occurrence.
Notes
- ‘Spoken’ material here is limited to orthographically
transcribed spoken data. Legal and political transcriptions
are paraphrases of what was said, hence the Canadian Hansard is strictly
a ‘written’ corpus. The grammar of paraphrases is
‘cleaned up’, and is therefore highly misleading as a
guide to the grammar of speech.
- The figures for spoken material in the Penn Treebank are slightly
uncertain for the same reason. We have excluded Hansard-type material
but included the Air Traffic Control and telephone subcorpora, which were
transcribed for the purpose of linguistic analysis.
- SFG stands for Hallidayan Systemic-Functional Grammar.
- We have excluded constraint grammar corpora such as the
ENCG corpora for two reasons. First, because the level of correction
applied is unclear (we are not ENCG experts), and second, because
comparability between constraint and phrase structure grammars
is still a matter of debate.
This page last modified
23 October, 2009
by Survey Web Administrator.