||Choosing a corpus
|| I am interested in carrying out research into the difference
between male and female discourse. What kind of corpus
is appropriate for me?
||Linguistic research with a treebank
|| Why should I be interested in using a parsed corpus
for linguistic research?
||What other uses does a parsed corpus have?
||What are the disadvantages of using a parsed corpus?
||Trying out the software
|| Can I try ICE-GB and DCPSE before I part with any money?
||ICE-GB and DCPSE
|| How do I order a corpus, and if so, which should I order?
||ICE-GB and sound
|| How do I order audio material, and can I buy just those
CDs I need?
||I already have ICE-GB R1. Why should I upgrade to Release 2?
||What are the differences between the types of licence offered?
||I represent a commercial company and we are interested
in using your corpora. What uses might be acceptable?
||I am interested in carrying out
research into the difference between male and female discourse.
What kind of corpus is appropriate for me?
If you are interested in male and female
spoken discourse, at the very least you need to obtain
a set of texts of orthographically transcribed
(word for word) spoken conversation. This type of faithful
transcription is difficult and expensive to carry out,
and you should steer clear of paraphrase-transcriptions
such as Hansard or court documents.
Assuming you have access to transcribed spoken data
in (say) English, with male and female participants,
you might wish to look out for the following:
- different sample types, e.g. male/female,
male/male, female/female; 2, 3, 4-way conversations,
equal vs. unequal roles (e.g. friends, student/teacher),
co-present vs. distant (telephone conversations),
etc. You might also wish to compare conversation with
monologue - what we might term a control group.
- different levels of annotation, e.g. plain
text, text with additional annotation (e.g. overlapping,
self-correction), part-of-speech tagged text (noun,
verb etc), through to full syntactic parsing.
- large size and wide representativeness of
Grammatical annotation is extremely useful. A part-of-speech
tagged corpus will distinguish between, say, “work”
as a verb and “work” as a noun. You can search
for verbs in particular positions, or just present tense
With a fully parsed corpus, every sentence is
given a complete tree analysis. You gain a great
deal of control in locating and distinguishing particular
grammatical patterns, for example, to ensure that “work”
is the head of a noun phrase, as in “work that he
believed in”. This control becomes very useful if
you need to work with numbers from the corpus, rather
than just find an example or two.
The question of whether a corpus is “large enough”
depends on what you are researching.
This is a critical question when you are
deciding which corpus to obtain. (See comparing
For example, a typical ‘discourse marker’
word like “OK” appears 438 times in 200,000
words of the dialogues in ICE-GB. However, there are
36,212 words in the same material that we classify as
discourse markers (these are conversation-fillers, including
“OK”, “no”, “yes”,
“I mean”, etc.).
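To compare counts like these across corpora of different sizes, it helps to normalise them. The following is a minimal Python sketch that uses only the figures quoted above; the function name `per_million` and the constant names are purely illustrative.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Return a raw count normalised to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


# Figures quoted in the text above.
DIALOGUE_WORDS = 200_000   # words of dialogue material
OK_COUNT = 438             # occurrences of "OK"
MARKER_COUNT = 36_212      # all words classified as discourse markers

ok_pmw = per_million(OK_COUNT, DIALOGUE_WORDS)
marker_pmw = per_million(MARKER_COUNT, DIALOGUE_WORDS)
ok_share = OK_COUNT / MARKER_COUNT  # proportion of discourse markers that are "OK"

print(f'"OK": {ok_pmw:.0f} per million words')
print(f"discourse markers: {marker_pmw:.0f} per million words")
print(f'"OK" as a share of all discourse markers: {ok_share:.1%}')
```

On these figures “OK” occurs about 2,190 times per million words and makes up roughly 1.2% of the discourse markers, so a study of discourse markers as a class has far more data to work with than a study of “OK” alone.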
Corpus linguists sometimes argue about the question
of how much data is “enough”. If you just
want to find a few cases to illustrate a point, you
may only need a single example. But if you want to engage
in a stricter scientific argument, claiming that a pattern
you find in the corpus is likely to be reproduced in
English more generally (or more precisely, in English
language data sampled identically to the corpus), then
there are some minimum numbers.
Typically you would give up if you found fewer than twenty
cases in the corpus. (On our website we have some pages
which explain scientific
experiments with our parsed corpora, although the
principles are the same for all corpus experiments.)
If you have a lot of examples of a phenomenon, every
detected variation is likely to be statistically significant.
However, if you narrow down your question to one part
of the corpus, then numbers can fall rapidly.
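The effect of falling numbers can be illustrated with a confidence interval on an observed proportion. The sketch below uses the Wilson score interval, a standard statistical formula; it is offered as an illustration rather than as a prescribed procedure, and the proportion and sample sizes are invented.

```python
import math


def wilson_interval(p: float, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for an observed proportion p out of n cases.

    z is the critical value of the standard normal distribution
    (1.959964 corresponds to a 95% interval).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical example: the same observed proportion (0.5) at three sample sizes.
for n in (20, 200, 2000):
    low, high = wilson_interval(0.5, n)
    print(f"n = {n:4d}: 95% interval ({low:.3f}, {high:.3f})")
```

With only twenty cases the interval spans roughly 0.30 to 0.70, which is why a pattern observed in a small subset of the corpus supports much weaker claims than the same pattern observed across the whole.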
We have two corpora which could be valuable to you.
Since your interest is in spoken discourse, we would
suggest that the Diachronic Corpus of Present-Day Spoken
English (DCPSE) is probably the more suitable, as it contains only
spoken data. It contains the largest amount of orthographically
transcribed parsed English in the world (885,436
words), with 754,819 words of dialogues.
||Why should I be interested in using
a parsed corpus for linguistic research?
A parsed corpus, or ‘treebank’,
allows for deeper linguistic research than is possible
with plain text or part-of-speech corpora.
The value of parsed corpora is only really beginning
to be understood. Introspection about grammar is
inevitably extremely partial, as linguists have found
when attempting to parse actual speech and writing.
Once parsed, a corpus will contain three types of evidence:
- frequency (how common
different grammatical structures are in use),
- coverage (discovery
of new, unanticipated, grammatical phenomena), and
- interaction (evidence
about how one grammatical construction, etc., tends
to influence the use of another).
John Sinclair (1987) remarked how
annotating the corpus-derived COBUILD dictionary led
to many lexicologists' preconceptions, built up over
centuries, being overturned. Words which were assumed
to be primarily nouns were found to be more frequently
used as verbs. Transitivity cases were less stereotypical
than had been assumed, and so on.
The same is true in a parsed corpus. Real people don’t
tend to write and talk in the way that grammarians might
assume (or wish). Hence corpora have been used for writing
new grammars, such as Quirk et al.
(1985), Greenbaum (1996), and Huddleston
and Pullum (2002).
From a psycholinguistic point of view, a parsed corpus
is a complete description of a volume of passages of
‘real’ ethnologically representative text.
These texts are ‘real’ not merely in the
sense that they are not created in the laboratory, but
also because the contributors are not aware of the vast
majority of their linguistic choices - they are
just speaking and writing. The corpus is simply a general
database which can then be interrogated by linguists.
Psycholinguists are more likely to be interested in
orthographically transcribed speech than written grammar,
although the differences between the two are
Frequency data is valuable for algorithms because it
provides a more representative picture of the types
of sentences a system for parsing natural language is
likely to have to deal with if it is part of an unattended
When we parse a corpus we inevitably will find structures
that we do not expect, and we have to (manually) make
decisions as to how to correctly parse them. These rules
are clearly new to our parser, and this is therefore
This process of completing the parsing is important.
We apply our scheme to novel sentences and our decisions
are recorded in the corpus. Supporting the parsing of
new sentences extends the coverage of any analytical
scheme. It broadens our expectation of what language
‘should’ do and makes us pay attention to
the actual spoken and written text rather than our expectations
Wallis (2007) discusses the
question of the role of annotation in corpus linguistics
in more detail.
Interaction evidence concerns the fact that language
is a natural process where decisions may be partially
contingent on each other.
It is possible to investigate
- the impact of one type of lexical or grammatical
‘event’ on another,
- the impact of sociolinguistic phenomena on lexical
or grammatical choices, and
- whether grammatical/lexical decisions can be evidence
of particular speakers or types of language (used
Without a parsed corpus it is simply not feasible to
do this. You would have to manually annotate each case
By applying a query to a corpus you are asking the
software to identify and retrieve a finite number of
cases of a particular type of linguistic event. The
richer the annotation scheme, the more precise you can
be in specifying a particular event, and the greater
the range of potential questions you can ask.
ICECUP uses a system called Fuzzy
Tree Fragments (FTFs) to query a corpus. These are
structured grammatical queries which allow a linguist
to specify the precise arrangement of elements within
a fragment of the phrase structure tree. FTFs can be used to:
- frame a research question, e.g. limiting
a pattern to a single phrase.
- enumerate multiple possibilities, e.g. identifying
possible alternative features, heads, etc.
- specify different linguistic choices or ‘alternates’,
e.g. identifying nonfinite clauses vs. relative clauses.
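The idea of matching a partial tree against fully parsed trees can be sketched in a few lines of Python. This is a toy illustration only, not ICECUP's implementation: the `Node` and `Pattern` classes, the matching rule (child patterns must match children in left-to-right order, with other children allowed in between) and the node labels are all simplifying assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """A node in a toy phrase structure tree."""
    function: str
    category: str
    word: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


@dataclass
class Pattern:
    """A toy tree fragment; None means 'match anything'."""
    function: Optional[str] = None
    category: Optional[str] = None
    word: Optional[str] = None
    children: List["Pattern"] = field(default_factory=list)


def matches(pattern: Pattern, node: Node) -> bool:
    """True if the pattern matches the tree rooted at this node."""
    if pattern.function is not None and pattern.function != node.function:
        return False
    if pattern.category is not None and pattern.category != node.category:
        return False
    if pattern.word is not None and pattern.word != node.word:
        return False
    # Child patterns must match distinct children in left-to-right order;
    # unmatched children may intervene (a deliberate simplification).
    remaining = iter(node.children)
    return all(any(matches(cp, child) for child in remaining)
               for cp in pattern.children)


def find_all(pattern: Pattern, root: Node) -> Iterator[Node]:
    """Yield every node in the tree at which the pattern matches."""
    if matches(pattern, root):
        yield root
    for child in root.children:
        yield from find_all(pattern, child)


# Two hand-built trees with illustrative labels: "the work matters", "they work".
noun_tree = Node("PU", "CL", children=[
    Node("SU", "NP", children=[Node("DT", "ART", "the"),
                               Node("NPHD", "N", "work")]),
    Node("VB", "VP", children=[Node("MVB", "V", "matters")]),
])
verb_tree = Node("PU", "CL", children=[
    Node("SU", "NP", children=[Node("NPHD", "PRON", "they")]),
    Node("VB", "VP", children=[Node("MVB", "V", "work")]),
])

# Fragment: any noun phrase whose head is the word "work".
work_as_np_head = Pattern(category="NP",
                          children=[Pattern(function="NPHD", word="work")])

noun_hits = list(find_all(work_as_np_head, noun_tree))
verb_hits = list(find_all(work_as_np_head, verb_tree))
print(len(noun_hits), len(verb_hits))
```

The fragment finds the noun phrase in the first tree and nothing in the second, which is the kind of distinction a plain lexical search for “work” cannot make.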
For more on FTFs see our Fuzzy Tree
||What other uses does a parsed corpus have?
As well as being a source for the academic
study of grammar, parsed corpora have a number of applications.
- They are a source of ‘real examples’
for teaching purposes. A parsed corpus can
be used as a source of English examples, and permits students
to explore contrasts between, say, speech and writing,
or British and US English.
- They can be used as sources of evidence to improve
the accuracy and coverage of natural language processing
(NLP) parsing algorithms (‘automatic
- They can be used as a starting dataset for other
algorithms, such as information extraction.
This is not an exhaustive list. Adding new levels of
annotation to parsed corpora may extend their potential
||What are the disadvantages of using
a parsed corpus?
There are two main issues that you need
to think about before using a parsed corpus. These are
Size and representativeness
As a rule, parsed corpora are smaller than equivalent
are around the 1M-word mark. Part-of-speech tagged corpora
like the British National Corpus or Bank of
English are often 100 times this size (and sometimes
called “megacorpora” to emphasise their
sheer size). Most of the texts in these corpora are
written, because spoken transcription is difficult and
Even megacorpora are small compared to automatically
collected newspaper or web corpora, which can
grow in size all the time. The problem with these corpora
can be summed up in one word: representativeness.
What are we trying to claim if we find an example of
a phrase or word in this type of dataset? Is this really
telling us anything very interesting except that something
that we might have thought was impossible is ‘out
The second standard objection to using a parsed corpus
is that it means committing to a particular grammar,
in the case of ICE-GB
and DCPSE, the TOSCA/ICE
grammar. If you are a linguist with a different
research background then it might seem unreasonable
for you to have to adopt the way of thinking that comes
with a different framework.
There are two comments that are worth making in response
to this problem. These are at the levels of practice
We designed our ICECUP software
around this problem. It is possible for linguists to
identify (and even construct) an appropriate grammatical
query from a tree example in the corpus.
This means that you can carry out a lexical search,
find a suitable example tree, and then turn the analysis
into an FTF. You can then apply that FTF
to the corpus to find other (lexically diverse) examples
of the same phenomenon. You can refine the FTF until
it is sufficiently broad or narrow in its selection.
(In practice, linguists tell us that this is much less
of a problem than it seems at first sight. You just
have to try... Download
a sample corpus!)
Different grammatical analysis schemes address much
the same problem: to account for the structural decisions
of speakers and writers as they form sentences.
Consequently, if several research teams independently
form theories to account for the same structural choices,
then it follows that many of the same phenomena can
be extracted from corpora regardless of the analysis
scheme. This is the case even if grammars are
not notationally equivalent. Arguing about which
structural phenomena should be extractable from
a parsed corpus is, in fact, a way of arguing about
the superiority of one grammar over another.
Note that we are not claiming that our grammar is “correct”,
or superior to others. This is unnecessary. The simple
fact is that every grammar and parsed corpus is imperfect.
All theories and models of the world are approximations,
and syntax is no different. We chose a grammar (Quirk
et al. 1985) that was well-known and documented,
and justified by its linguistic pedigree in the ‘Grand
Tradition’. It is also more detailed than some
‘skeleton’ phrase structure grammars used
in corpora. But this does not mean that it cannot be
Nothing in the foregoing means that it is not possible
to gain significant insight into how speakers and writers
make linguistic choices by working with parsed corpora.
On the contrary, it is hard to see how else
this might be possible.
||Are FTFs dependent on ICECUP?
No. At present, people use FTFs
with ICECUP because ICECUP is
the main way in which we distribute the editor, and
ICECUP is organised around
the FTF concept. This does not mean that they could
not be used in other treebank tools. (By the way, our
main reason for not wishing to distribute the source
code at present is practical: we are continually developing
the approach and we want to keep versions standard and
clear.) But if you have suggestions for collaboration,
do email us.
We have stood on the shoulders of others, most notably,
van Halteren and van den Heuvel (1990),
and their LDB system (built originally
for the Nijmegen Parsed Corpus). This program makes
use of user-definable ‘patterns’ which are
similar in concept to our FTFs, although they can require
more effort and programming from the end user.
Since the publication of ICECUP 3.0, a number of other
query systems have been developed which replicate the
basic idea in FTFs of employing tree models. (The exception
is a logic-based query tool called fsq, Kepser 2003.)
Phrase structure grammar
This is inevitably not a complete list. Unfortunately
one of the problems of tool development of this kind is
that projects struggle to sustain funding and may not
be supported for long periods of time. A review of query
representations and tools is to be found in Wallis
||Do FTFs rely on ICE-GB or the ICE
No. FTFs necessarily reflect the
topology of a particular grammar - the structural
rules that define what kinds of relations are possible
in a grammatical ‘tree’ - because an FTF
is a kind of ‘abstract tree’. We developed
FTFs in the context of a particular grammar. (It is
hard to see how a grammatical query system could be
evaluated by linguists unless it was developed
in this way!)
But that does not mean we cannot then advance towards
a universal system from our current more specialised
one. We would rather move toward such a system from
our current practice-based starting point than begin by
abstractly defining parameters for universal grammatical
queries. We believe that, as far as possible, any system
has to be usable by non-specialists.
Moreover, this does not mean that current FTFs could
not be modified relatively easily to work with other
phrase-structure grammars (this covers most parsed corpora
in the world today).
We have also been looking at how FTFs might work with
constraint or dependency grammars (Wallis
2008). These formalisms raise some interesting questions
because crossing links are allowed and the ‘trees’
are rather different.
||Would FTFs work with other kinds
of structure apart from grammatical ones?
Yes. The principles behind FTFs
would work with other collections of structured objects.
We work with grammar because this is where our primary
expertise is. Within corpus linguistics, FTFs could
be extended to handle:
- word-level annotations, e.g., prosody,
- parallel structures, e.g., parallel parsing
(the same sentence parsed several times, possibly
under different analysis schemes), and
- other structural annotation, e.g., semantic
See Wallis (2008) for more information.
||Can FTFs be used to perform experiments
Yes. If you look at our
FTF experiment pages, you will see that we can perform
research into grammatical variation due to sociolinguistic
variation and into the interaction between two different
grammatical variables. Linguistic experiments on a
given corpus are ex post facto studies, or natural
experiments, because it is not possible to modify
or constrain data collection within the experiment.
See also the section on Methodology
which points to future
work where entire groups of experiments can be automated.
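As an illustration of this kind of ex post facto experiment, the sketch below tests whether a grammatical choice is independent of a sociolinguistic variable using a 2 × 2 chi-square test. The counts are invented and are not taken from ICE-GB or DCPSE, the variable names are hypothetical, and the twenty-case threshold simply echoes the rule of thumb mentioned in the answer on choosing a corpus.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain at least one case")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


CRITICAL_95_DF1 = 3.841  # chi-square critical value: 1 degree of freedom, p = 0.05
MIN_CASES = 20           # rule-of-thumb minimum number of cases

# Invented counts. Rows: speaker gender. Columns: (nonfinite clause, relative clause).
counts = {"female": (10, 20), "male": (20, 10)}
(a, b), (c, d) = counts["female"], counts["male"]

total = a + b + c + d
enough_data = total >= MIN_CASES
statistic = chi_square_2x2(a, b, c, d)
significant = enough_data and statistic > CRITICAL_95_DF1

print(f"cases = {total}, chi-square = {statistic:.3f}, significant = {significant}")
```

A significant result here says only that the two variables are not independent in this sample; the interpretation still has to take the annotation scheme and the sampling of the corpus into account.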
||Isn’t the analysis in the
corpus simply one interpretation out of many, so how can
any experiment we perform be said to be ‘scientific’?
This is an important question that goes
to the heart of the scientific method. Even within a
notionally simple experiment in the physical sciences,
as Putnam points out, questions
regarding what to measure, how to measure it and whether
measurements are reliable, rely on certain theoretical
For example, when predicting orbits around the sun,
astronomers first assume that it can be dealt with
as a “two-body problem” (the two bodies
being the sun and the orbiting body, say, Jupiter).
But all other bodies in the solar system (and outside
it) actually exert a pull on both the sun and Jupiter.
The point is that these other forces are almost negligible,
and a simple application of Kepler's laws will predict
Jupiter’s orbit to a high degree of accuracy.
(Of course one can’t measure Jupiter’s
mass directly, so further assumptions must
be applied in order to apply the equations.)
This means that there is no such thing as an assumption-free
experiment. Instead, we have a working set of assumptions
which parameterise a set of experiments. These
assumptions determine what can and cannot be measured,
how these might be measured or approximated to, other
variables to consider, etc. In our case, our working
set of assumptions - the grammar - is large,
complex and highly theory laden.
But this need not rule out a scientific procedure operating
in good faith. Let us take another analogy: historical
studies. Does a historian select facts to fit a
theory or choose a theory to fit the facts? How does
he or she know that ‘the facts’ are the salient facts? In
a famous series of lectures, Carr
argued that the best historians keep their chosen facts
and theoretical assumptions in constant dialogue.
As any working historian knows, if he stops to reflect
what he is doing as he thinks and writes, the historian
is engaged on a continuous process of moulding his
facts to his interpretation, and his interpretation
to his facts. It is impossible to assign primacy to
one over another [p29].
Anyone who has tried parsing a corpus will recognise
this statement! Scientific experimentation is necessarily
cyclic, involving stages of induction
from facts to theory, and evaluation from theory
Experimental assumptions are employed in a series of
experiments that explore what Lakatos
calls a research programme. Research programmes
can be productive (they produce novel results)
or degenerative (they end up full of patched-up
exceptions). The main difference between grammatical
studies and those in the physical sciences is that linguists
may often not agree on much of the theoretical framework
beyond simple wordclasses, and even these are debated.
In the past, this may have been due, at least in part,
to a lack of parsed corpus data; but this is changing
fast. Now a new methodological question arises:
how do we simultaneously apply and evaluate the parse
analysis? Our observation here is simply that the
same basic approach - i.e., systematic experimentation
- can be applied with a number of important provisos.
Extreme cases, often the lifeblood of traditional linguistics,
play a role: they highlight the borders of what might
be expressed. They indicate the possible existence
of phenomena but cannot explain them, nor how
relevant any introspective judgement may be to the bulk
of natural language. (Moreover, as Abney
points out, acceptability judgements may be probabilistic
and context-dependent rather than absolute and independent.
While, in the experimental approach described here,
individual parse analyses are absolute, they are collectively
Counting classes of element gives us an idea of relative
frequencies in a sample. But neither extrema nor frequency
counts explain contextual variation in language, or
how different components affect one another. To do this,
we need to employ statistical methods.
In summary, just as in other kinds of empirical research
you should (a) avoid circularity and contradiction
(beware of measuring the same linguistic phenomenon
twice) and (b) relate your results to the theoretical
background. In this case, this means taking into account
the annotation of the corpus in your explanations, and
trying to ensure that neither the process of annotation
nor abstraction (forming queries) overly affects
the results. You should always be prepared to play devil’s
advocate to your own theory.
Finally, simpler explanations are generally better
than more complex ones. However, the definition of simplicity
depends not only on the number of terms in an expression
but also how the expression is rooted in the theory.
I might explain a choice, b, by the presence
of another factor, a, but if this factor relies
on a complex and long-winded justification, then my
explanation can hardly be called simple.
||Can I try ICE-GB and DCPSE before
I part with any money?
Yes, in fact we encourage it! Simply
go to our ICE-GB download
page or DCPSE download
page to download a sample corpus, the full ICECUP
software and help files.
||How do I order a corpus, and if
so, which should I order?
Short answer: You must print out
and complete the relevant order form and sign the statement
agreeing to comply with the licence agreement. This
must be sent to us by post with payment by either
sterling cheque (UK bank) or credit card.
There is a separate credit
card form to complete for credit card payments.
Choose DCPSE if...
- you want to work with English language change
- you want to focus on the grammar of spoken English.
Choose ICE-GB if...
- you want to work with different genres of English
language, including speech and writing.
- you want to compare British English with other
global varieties of English.
- You can send the licence form as a fax if you have
no other option, but we would recommend that credit
card details are not sent in this way.
- If neither credit card nor cheque payments are possible,
then we can arrange a bank transfer. However, we try
to avoid this method because it is very slow.
- We will only send out the CD when we have received
a signed order form and evidence of payment.
- EU institutions and companies must provide
a VAT number.
- Students must provide evidence of full time status
in the form of a signed letter from their head of
If you have any questions, email email@example.com.
||How do I order audio material,
and can I buy just those CDs I need?
The audio for ICE-GB Release 2 covers
the 300 spoken texts and totals 75 hours. It is supplied
as a collection of audio files on 11 CDs,
each recorded as 16 kHz mono with no compression.
ICE-GB R2 Sound is ordered separately. You can, if you
wish, order the audio without ordering ICE-GB.
If you want to explore ICE-GB
and play the sound, you need both ICE-GB Release
2 and the sound recordings.
||I already have ICE-GB R1. Why should
I upgrade to Release 2?
ICE-GB Release 1 was a milestone in corpus
linguistics when it was released in 1998. Why should
any current user of ICE-GB consider upgrading to Release
2? Here are three reasons.
- An enhanced corpus. We have reinstated some
missing material and corrected the transcription (and
thus the parse analysis) when we reviewed the recordings.
- More facilities. ICECUP
3.1 contains many more facilities for search and
exploration. These include lexical
wild-card queries, enhanced
FTFs, an integrated lexicon
and grammaticon for ICE-GB
and the ability to compute and extract statistical
- Synchronous audio. If you want to play
the audio aligned with the transcription you will
need to upgrade. This facility means that if you search
for a word in the corpus, you can hear the passage
containing that word.
Finally, ICE-GB Release 2 is compatible with the new
ICECUP IV software (currently
in beta). You can continue to upgrade the software for
||What are the differences between
the types of licence offered?
There are three types of licence. Full
details are on the order form and we ask you to read
them carefully. All licences are granted on the basis
that the corpus is only used for education and research.
The differences are essentially the following.
- Individual licence, including student licence.
Only the individual named licensee is entitled to
access the corpus. Use this licence if you are going
to use the corpus for your own personal research.
- Institutional licence, single copy. Sometimes
called a ‘single seat’ licence, only one
member of the institution (staff or student) at
a time may access the corpus. Use this licence
if you want the corpus to be available for research
in the institution but will limit access.
- Institutional licence, multiple copy. This
is a ‘network’ or ‘multi-seat’
licence, where any number of members of the
institution may access the corpus. Use this licence
if you want the corpus to be available for whole-class
teaching or made available for general use across
For commercial licensing see below.
||I represent a commercial company
and we are interested in using your corpora. What uses
might be acceptable?
Our contributors have given us the right
to distribute the corpus resource on the basis that
the corpus is only used for education and research.
Commercial use of the corpus is therefore limited to
education and research purposes.
This can lead to some fine distinctions. You can test
parsers and other algorithms against a corpus and exploit
those tools commercially, but it would not be permissible
for you to reprocess a corpus and publish data abstracted
from it as part of the database for a commercial tool.
For more information, contact firstname.lastname@example.org
with your proposal.
Abney, S. (1996), Statistical Methods and Linguistics, in Klavans, J. and
Resnik, P. (Eds.) (1996) The Balancing Act, Cambridge, MA.
Carr, E.H. (1964),
What is history?, Harmondsworth: Penguin.
Greenbaum, S. (1996) The Oxford
English Grammar, Oxford: OUP.
Halteren, H., van, & van den Heuvel,
T. (1990). Linguistic Exploitation of Syntactic Databases: the
use of the Nijmegen Linguistic DataBase program (Language and
Computers, vol. 5). Amsterdam: Rodopi.
Huddleston, R. & Pullum, G.K. (eds.)
(2002). The Cambridge Grammar of the English Language. Cambridge:
Cambridge University Press.
Kepser, S. (2003). Finite Structure Query: A Tool for Querying
Syntactically Annotated Corpora, in Copestake, C. & Hajic, J.
(eds.), Proceedings of the 10th Conference of the European Chapter
of the Association for Computational Linguistics, EACL 2003,
Lakatos, I. (1978), Mathematics, science and epistemology (philosophical
papers), Cambridge: CUP.
Putnam, H. (1974), The ‘Corroboration’ of Scientific Theories,
in Hacking, I. (Ed.) (1981), Scientific Revolutions, Oxford
Readings in Philosophy, Oxford: OUP, pp60-79.
Quirk, R., Greenbaum, S., Leech,
G., & Svartvik J. (1985). A Comprehensive Grammar of the
English Language. London: Longman.
Sinclair, J.M. (1987). Grammar in the
Dictionary. In Sinclair, J.M. (ed.) Looking Up: an account of
the COBUILD Project in lexical computing. London: Collins.
Wallis, S.A. (2007). Annotation, Retrieval and Experimentation. In Meurman-Solin,
A. & Nurmi, A.A. (eds.) Annotating Variation and Change.
Helsinki: Varieng, UoH.
Wallis, S.A. (2008),
Searching treebanks and other structured corpora. In Lüdeling,
A. & Kytö, M. (eds.) Corpus Linguistics: An International
Handbook. Handbücher zur Sprache und Kommunikationswissenschaft
series. Berlin: Mouton de Gruyter. 738-759.
This page last modified
12 June, 2013
by Survey Web Administrator.