Frequently Asked Questions


 

Questions


Parsed corpora

  Choosing a corpus
Q. I am interested in carrying out research into the difference between male and female discourse. What kind of corpus is appropriate for me?
  Linguistic research with a treebank
Q. Why should I be interested in using a parsed corpus for linguistic research?
  Other uses
Q. What other uses does a parsed corpus have?
  Drawbacks
Q. What are the disadvantages of using a parsed corpus?

Fuzzy Tree Fragments

  Are they software dependent?
Q. Are FTFs dependent on ICECUP?
  Are they grammar dependent?
Q. Do FTFs rely on ICE-GB or the ICE grammar?
  FTFs and other structures
Q. Would FTFs work with other kinds of structure apart from grammatical ones?
  Experiments
Q. Can FTFs be used to perform experiments in grammar?
  Scientific validity
Q. Isn’t the analysis in the corpus simply one interpretation out of many, so how can any experiment we perform be said to be ‘scientific’?

Ordering corpora

  Trying out the software
Q. Can I try ICE-GB and DCPSE before I part with any money?
  ICE-GB and DCPSE
Q. How do I order a corpus, and which should I order?
  ICE-GB and sound
Q. How do I order audio material, and can I buy just those CDs I need?
  Upgrades
Q. I already have ICE-GB R1. Why should I upgrade to Release 2?
  Licences
Q. What are the differences between the types of licence offered?
  Commercial use
Q. I represent a commercial company and we are interested in using your corpora. What uses might be acceptable?

 

Answers


Parsed corpora

Q. I am interested in carrying out research into the difference between male and female discourse. What kind of corpus is appropriate for me?
A.

If you are interested in male and female spoken discourse, at the very least you need to obtain a set of texts of orthographically transcribed (word for word) spoken conversation. This type of faithful transcription is difficult and expensive to carry out, and you should steer clear of paraphrase-transcriptions such as Hansard or court documents.

Assuming you have access to transcribed spoken data in (say) English, with male and female participants, you might wish to look out for the following:

  1. different sample types, e.g. male/female, male/male, female/female; 2, 3, 4-way conversations, equal vs. unequal roles (e.g. friends, student/teacher), co-present vs. distant (telephone conversations), etc. You might also wish to compare conversation with monologue - what we might term a control group.
  2. different levels of annotation, e.g. plain text, text with additional annotation (e.g. overlapping, self-correction), part-of-speech tagged text (noun, verb etc), through to full syntactic parsing.
  3. large size and wide representativeness of sample.

Grammatical annotation is extremely useful. A part-of-speech tagged corpus will distinguish between, say, work as a verb and work as a noun. You can search for verbs in particular positions, or just for present tense verbs, etc.
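As a minimal illustration of what part-of-speech tags make possible, the Python sketch below counts how often work is used as a verb and as a noun in a tiny hand-made sample. The token list, the tag names (N, V, PRON and so on) and the function name are all hypothetical; they do not reproduce the ICE tagset or any ICECUP facility.

```python
from collections import Counter

# Hypothetical tagged sample: (word, tag) pairs.
# The tags are illustrative only, not the ICE tagset.
TAGGED = [
    ("I", "PRON"), ("work", "V"), ("at", "PREP"), ("home", "N"),
    ("the", "ART"), ("work", "N"), ("was", "V"), ("hard", "ADJ"),
    ("they", "PRON"), ("work", "V"), ("late", "ADV"),
]

def count_by_tag(tokens, word):
    """Count how often `word` occurs with each part-of-speech tag."""
    return Counter(tag for w, tag in tokens if w.lower() == word)

counts = count_by_tag(TAGGED, "work")
print("verb:", counts["V"], "noun:", counts["N"])
```

With plain text, the same search could only return every occurrence of the string work, leaving the verb/noun distinction to be made by hand.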

With a fully parsed corpus, every sentence is given a complete tree analysis. You gain a very great deal of control in locating and distinguishing particular grammatical patterns, for example to ensure that work is the head of a noun phrase, as in work that he believed in. This control becomes very useful if you need to work with numbers from the corpus, rather than just find an example or two.

The question of whether a corpus is “large enough” depends on what it is that you are researching into. Obviously this is a critical question when you are trying to decide on what corpus to gain access to. (See comparing treebanks.)

For example, a typical ‘discourse marker’ word like “OK” appears 438 times in 200,000 words of the dialogues in ICE-GB. However, there are 36,212 words in the same material that we classify as discourse markers (these are conversation-fillers, including “OK”, “no”, “yes”, “I mean”, etc.).
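To compare figures like these across corpora of different sizes, it is common to normalise them, for example to occurrences per million words. The short Python sketch below does this for the two counts quoted above. It treats the 200,000-word figure as exact, which is an assumption, and the function name is ours.

```python
def per_million(count, corpus_words):
    """Normalised frequency: occurrences per million words."""
    return count * 1_000_000 / corpus_words

DIALOGUE_WORDS = 200_000    # size quoted above, treated as exact (assumption)
OK_COUNT = 438              # occurrences of "OK"
MARKER_COUNT = 36_212       # all words classified as discourse markers

ok_rate = per_million(OK_COUNT, DIALOGUE_WORDS)
marker_rate = per_million(MARKER_COUNT, DIALOGUE_WORDS)
ok_share = OK_COUNT / MARKER_COUNT   # proportion of discourse markers that are "OK"

print(ok_rate, marker_rate, round(ok_share * 100, 1))
```

On these figures “OK” accounts for a little over 1% of the discourse markers, which shows how quickly numbers fall when a question is narrowed to a single word.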

Corpus linguists sometimes argue about the question of how much data is “enough”. If you just want to find a few cases to illustrate a point, you may only need a single example. But if you want to engage in a stricter scientific argument, claiming that a pattern you find in the corpus is likely to be reproduced in English more generally (or more precisely, in English language data sampled identically to the corpus), then there are some minimum numbers.

Typically you would give up if you find fewer than twenty cases in the corpus. (On our website we have some pages which explain scientific experiments with our parsed corpora, although the principles are the same for all corpus experiments.) If you have a lot of examples of a phenomenon, every detected variation is likely to be statistically significant. However, if you narrow down your question to one part of the corpus then numbers can fall rapidly.
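One way to see why small counts are a problem is to look at the uncertainty around an observed proportion. The Python sketch below computes a Wilson score interval at roughly the 95% level. The choice of this particular interval, the function name and the example counts (12 out of 20, and 600 out of 1,000) are our assumptions for illustration; they are not figures from the corpus.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Hypothetical counts: the same observed proportion (60%) at two sample sizes.
small = wilson_interval(12, 20)
large = wilson_interval(600, 1000)

print("n=20:   %.3f to %.3f" % small)
print("n=1000: %.3f to %.3f" % large)
```

With twenty cases the interval spans roughly 0.39 to 0.78, so almost any rival proportion would fall inside it; with a thousand cases it narrows to a few percentage points either side of 0.6.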

We have two corpora which could be valuable to you. Since your interest is in spoken discourse we would suggest the Diachronic Corpus of Present Day Spoken English (DCPSE) is probably the more suitable, as this contains only spoken data. This contains the largest amount of orthographically transcribed parsed English in the world (some 885,436 words), with 754,819 words of dialogues.

Q. Why should I be interested in using a parsed corpus for linguistic research?
A.

A parsed corpus, or ‘treebank’, allows for deeper linguistic research than is possible with plain text or part-of-speech corpora.

The value of parsed corpora is only really beginning to be understood. Introspection about grammar is inevitably extremely partial, as linguists have found when attempting to parse actual speech and writing.

Once parsed, a corpus will contain three types of evidence:

  1. frequency (how common different grammatical structures are in use),
  2. coverage (discovery of new, unanticipated, grammatical phenomena), and
  3. interaction (evidence about how one grammatical construction, etc., tends to influence the use of another).

Frequency

John Sinclair (1987) remarked that annotating the corpus-derived COBUILD dictionary overturned many lexicologists' preconceptions, built up over centuries. Words which were assumed to be primarily nouns were found to be used more frequently as verbs. Transitivity cases were less stereotypical than had been assumed, and so on.

The same is true in a parsed corpus. Real people don’t tend to write and talk in the way that grammarians might assume (or wish). Hence corpora have been used for writing new grammars, such as Quirk et al. (1985), Greenbaum (1996), and Huddleston and Pullum (2002).

From a psycholinguistic point of view, a parsed corpus is a complete description of a volume of passages of ‘real’ ethnologically representative text. These texts are ‘real’ not merely in the sense that they are not created in the laboratory, but also because the contributors are not aware of the vast majority of their linguistic choices - they are just speaking and writing. The corpus is simply a general database which can then be interrogated by linguists. Psycholinguists are more likely to be interested in orthographically transcribed speech than written grammar, although the differences between the two are of interest.

Frequency data is also valuable for the developers of parsing algorithms. It gives a more representative picture of the types of sentence that a natural language parser is likely to have to deal with when it runs as part of an unattended system.

Coverage

When we parse a corpus we will inevitably find structures that we do not expect, and we have to make (manual) decisions about how to parse them correctly. These decisions amount to rules that are new to our parser, and they therefore constitute new evidence.

This process of completing the parsing is important. We apply our scheme to novel sentences and our decisions are recorded in the corpus. Supporting the parsing of new sentences extends the coverage of any analytical scheme. It broadens our expectation of what language ‘should’ do and makes us pay attention to the actual spoken and written text rather than our expectations of it.

Wallis (2007) discusses the question of the role of annotation in corpus linguistics in more detail.

Interaction

Interaction evidence concerns the fact that language is a natural process where decisions may be partially contingent on each other.

It is possible to investigate

  • the impact of one type of lexical or grammatical ‘event’ on another,
  • the impact of sociolinguistic phenomena on lexical or grammatical choices, and
  • whether grammatical/lexical decisions can be evidence of particular speakers or types of language (used in stylistics).

Without a parsed corpus it is simply not feasible to do this. You would have to manually annotate each case yourself.

By applying a query to a corpus you are asking the software to identify and retrieve a finite number of cases of a particular type of linguistic event. The richer the annotation scheme, the more precise you can be in specifying a particular event, and the greater the range of potential questions you can ask.

ICECUP uses a system called Fuzzy Tree Fragments (FTFs) to query a corpus. These are structured grammatical queries which allow a linguist to specify the precise arrangement of elements within a fragment of the phrase structure tree. FTFs can be used to

  • frame a research question, e.g. limiting a pattern to a single phrase.
  • enumerate multiple possibilities, e.g. identifying possible alternative features, heads, etc.
  • specify different linguistic choices or ‘alternates’, e.g. identifying nonfinite clauses vs. relative clauses.

For more on FTFs see our Fuzzy Tree Fragment webpages.
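The idea of a query that is itself a partial tree can be sketched in a few lines of Python. The matcher below is only an illustration under stated assumptions: it assumes that each tree node carries a function label and a category label, it treats an unspecified label as a wildcard, and it requires the child elements of the pattern to appear in order but not necessarily adjacent. The class names, the function names and the node labels in the example are hypothetical, and this is not how ICECUP implements FTFs.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    function: str                     # illustrative label, e.g. "SU"
    category: str                     # illustrative label, e.g. "NP"
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None

@dataclass
class Pattern:
    function: Optional[str] = None    # None matches any function
    category: Optional[str] = None    # None matches any category
    children: List["Pattern"] = field(default_factory=list)

def matches(pattern, node):
    """True if `node` satisfies `pattern`. Pattern children must be found
    among the node's children in order, but need not be adjacent."""
    if pattern.function is not None and pattern.function != node.function:
        return False
    if pattern.category is not None and pattern.category != node.category:
        return False
    return _match_children(pattern.children, node.children)

def _match_children(pats, kids):
    if not pats:
        return True
    for i, kid in enumerate(kids):
        if matches(pats[0], kid) and _match_children(pats[1:], kids[i + 1:]):
            return True
    return False

def find_all(pattern, root):
    """Return every node in the tree that matches the pattern."""
    hits = [root] if matches(pattern, root) else []
    for child in root.children:
        hits.extend(find_all(pattern, child))
    return hits

# Hand-built tree for "the work that he believed in was hard" (simplified).
tree = Node("PU", "CL", [
    Node("SU", "NP", [
        Node("DT", "DTP", word="the"),
        Node("NPHD", "N", word="work"),
        Node("NPPO", "CL", [
            Node("SU", "NP", [Node("NPHD", "PRON", word="he")]),
            Node("VB", "VP", [Node("MVB", "V", word="believed")]),
        ]),
    ]),
    Node("VB", "VP", [Node("MVB", "V", word="was")]),
    Node("CS", "AJP", [Node("AJHD", "ADJ", word="hard")]),
])

# Pattern: a noun phrase whose head is a noun, followed by a clause.
query = Pattern(category="NP", children=[
    Pattern(function="NPHD", category="N"),
    Pattern(category="CL"),
])

hits = find_all(query, tree)
print([(h.function, h.category) for h in hits])
```

The query finds the subject noun phrase headed by work but not the inner noun phrase he, because that phrase has a pronoun head and no following clause. Leaving labels unspecified is what makes the fragment ‘fuzzy’: the fewer the constraints, the broader the set of cases retrieved.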

Q. What other uses does a parsed corpus have?
A.

As well as being a source for the academic study of grammar, parsed corpora have a number of applications.

  1. They are a source of ‘real examples’ for teaching purposes. A parsed corpus can be used as a source of English examples, and permits students to explore contrasts between, say, speech and writing, or British and US English.
  2. They can be used as sources of evidence to improve the accuracy and coverage of natural language processing (NLP) parsing algorithms (‘automatic parsers’).
  3. They can be used as a starting dataset for other algorithms, such as information extraction.

This is not an exhaustive list. Adding new levels of annotation to parsed corpora may extend their potential application.

Q. What are the disadvantages of using a parsed corpus?
A.

There are two main issues that you need to think about before using a parsed corpus: size and representativeness, and grammar.

Size and representativeness

As a rule, parsed corpora are smaller than equivalent unparsed corpora.

Typical treebanks are around the 1M-word mark. Part-of-speech tagged corpora like the British National Corpus or Bank of English are often 100 times this size (and sometimes called “megacorpora” to emphasise their sheer size). Most of the texts in these corpora are written, because spoken transcription is difficult and expensive.

Even megacorpora are small compared to automatically collected newspaper or web corpora, which can grow in size all the time. The problem with these corpora can be summed up in one word: representativeness. What are we trying to claim if we find an example of a phrase or word in this type of dataset? Is this really telling us anything very interesting except that something that we might have thought was impossible is ‘out there’?

Grammar

The second standard objection to using a parsed corpus is that it means committing to a particular grammar, in the case of ICE-GB and DCPSE, the TOSCA/ICE grammar. If you are a linguist with a different research background then it might seem unreasonable for you to have to adopt the way of thinking that comes with a different framework.

There are two comments that are worth making in response to this problem. These are at the levels of practice and theory.

Practice

We designed our ICECUP software around this problem. It is possible for linguists to identify (and even construct) an appropriate grammatical query from a tree example in the corpus.

This means that you can carry out a lexical search, find a suitable example tree, and then turn the analysis into an FTF. You can then apply that FTF to the corpus to find other (lexically diverse) examples of the same phenomenon. You can refine the FTF until it is sufficiently broad or narrow in its selection.

(In practice, linguists tell us that this is much less of a problem than it seems at first sight. You just have to try... Download a sample corpus!)

Theory

Different grammatical analysis schemes address much the same problem: to account for the structural decisions of speakers and writers as they form sentences.

Consequently, if several research teams independently form theories to account for the same structural choices, then it follows that many of the same phenomena can be extracted from corpora regardless of the analysis scheme. This is the case even if grammars are not notationally equivalent. Arguing about which structural phenomena should be extractable from a parsed corpus is, in fact, a way of arguing about the superiority of one grammar over another.

Note that we are not claiming that our grammar is “correct”, or superior to others. This is unnecessary. The simple fact is that every grammar and parsed corpus is imperfect. All theories and models of the world are approximations, and syntax is no different. We chose a grammar (Quirk et al. 1985) that was well-known and documented, and justified by its linguistic pedigree in the ‘Grand Tradition’. It is also more detailed than some ‘skeleton’ phrase structure grammars used in corpora. But this does not mean that it cannot be improved.

Nothing in the foregoing means that it is not possible to gain significant insight into how speakers and writers make linguistic choices by working with parsed corpora.

On the contrary, it is hard to see how else this might be possible.


Fuzzy Tree Fragments

Q. Are FTFs dependent on ICECUP?
A.

No. At present, people use FTFs with ICECUP because ICECUP is the main way in which we distribute the editor, and ICECUP is organised around the FTF concept. This does not mean that they could not be used in other treebank tools. (By the way, our main reason for not wishing to distribute the source code at present is practical: we are continually developing the approach and we want to keep versions standard and clear.) But if you have suggestions for collaboration, do email us.

We have stood on the shoulders of others, most notably, van Halteren and van den Heuvel (1990), and their LDB system (built originally for the Nijmegen Parsed Corpus). This program makes use of user-definable ‘patterns’ which are similar in concept to our FTFs, although they can require more effort and programming from the end user.

Since the publication of ICECUP 3.0, a number of other query systems have been developed which replicate the basic idea in FTFs of employing tree models. (The exception is a logic-based query tool called fsq; see Kepser 2003.)

Phrase structure grammar

Dependency grammar

This is inevitably not a complete list. Unfortunately one of the problems of tool development of this kind is that projects struggle to sustain funding and may not be supported for long periods of time. A review of query representations and tools is to be found in Wallis (2008).
Q. Do FTFs rely on ICE-GB or the ICE grammar?
A.

No. FTFs necessarily reflect the topology of a particular grammar - the structural rules that define what kinds of relations are possible in a grammatical ‘tree’ - because an FTF is a kind of ‘abstract tree’. We developed FTFs in the context of a particular grammar. (It is hard to see how a grammatical query system could be evaluated by linguists unless it was developed in this way!)

But that does not mean we cannot then advance towards a universal system from our current more specialised one. We prefer to move toward such a system from our current practice-based starting point, rather than by abstractly defining parameters for universal grammatical queries. We believe that, as far as possible, any system has to be usable by non-specialists.

Moreover, this does not mean that current FTFs could not be modified relatively easily to work with other phrase-structure grammars (this covers most parsed corpora in the world today).

We have also been looking at how FTFs might work with constraint or dependency grammars (Wallis 2008). These formalisms raise some interesting questions because crossing links are allowed and the ‘trees’ are rather different.

Q. Would FTFs work with other kinds of structure apart from grammatical ones?
A.

Yes. The principles behind FTFs would work with other collections of structured objects. We work with grammar because this is where our primary expertise is. Within corpus linguistics, FTFs could be extended to handle:

  1. word-level annotations, e.g., prosody,
  2. parallel structures, e.g., parallel parsing (the same sentence parsed several times, possibly under different analysis schemes), and
  3. other structural annotation, e.g., semantic relations.

See Wallis (2008) for more information.
Q. Can FTFs be used to perform experiments in grammar?
A.

Yes. If you look at our FTF experiment pages, you will see that we can perform research into grammatical variation due to socio-linguistic variation and the interaction between two different grammatical variables. Linguistics experiments on a given corpus are ex post facto studies, or natural experiments, because it is not possible to modify or constrain data collection within the experiment.
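As a rough illustration of an experiment on the interaction between two variables, the Python sketch below computes a Pearson chi-square statistic for a 2 × 2 table of counts and compares it with the standard 5% critical value for one degree of freedom (3.841). The counts are invented for the example, and the choice of a chi-square test is our assumption here rather than a prescribed method.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if 0 in rows or 0 in cols:
        raise ValueError("every row and column total must be non-zero")
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = rows[i] * cols[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are the two values of one variable,
# columns are the two values of another.
counts = [[30, 70],
          [50, 50]]

CRITICAL_05_DF1 = 3.841   # chi-square critical value, 1 degree of freedom, 5% level
stat = chi_square_2x2(counts)
print(round(stat, 2), stat > CRITICAL_05_DF1)
```

In a real study each cell would be the number of cases retrieved by a query, for example one FTF per combination of the two variables.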

See also the section on Methodology which points to future work where entire groups of experiments can be automated.

Q. Isn’t the analysis in the corpus simply one interpretation out of many, so how can any experiment we perform be said to be ‘scientific’?
A.

This is an important question that goes to the heart of the scientific method. Even within a notionally simple experiment in the physical sciences, as Putnam points out, questions regarding what to measure, how to measure it and whether measurements are reliable, rely on certain theoretical assumptions.

For example, when predicting orbits around the sun, astronomers first assume that it can be dealt with as a “two-body problem” (the two bodies being the sun and the orbiting body, say, Jupiter). But all other bodies in the solar system (and outside it) actually exert a pull on both the sun and Jupiter. The point is that these other forces are almost negligible, and a simple application of Kepler's laws will predict Jupiter’s orbit to a high degree of accuracy. (Of course one can’t measure Jupiter’s mass directly, so further assumptions must be applied in order to apply the equations.)

This means that there is no such thing as an assumption-free experiment. Instead, we have a working set of assumptions which parameterise a set of experiments. These assumptions determine what can and cannot be measured, how these might be measured or approximated to, other variables to consider, etc. In our case, our working set of assumptions - the grammar - is large, complex and highly theory laden.

But this need not rule out a scientific procedure operating in good faith. Let us take another analogy: historical studies. Does a historian select facts to fit a theory, or choose a theory to fit the facts? How does the historian know that 'the facts' are the salient facts? In a famous series of lectures, Carr argued that the best historians keep their chosen facts and their theoretical assumptions in constant interaction.

As any working historian knows, if he stops to reflect what he is doing as he thinks and writes, the historian is engaged on a continuous process of moulding his facts to his interpretation, and his interpretation to his facts. It is impossible to assign primacy to one over another [p29].

Anyone who has tried parsing a corpus will recognise this statement! Scientific experimentation is necessarily cyclic, involving stages of induction from facts to theory, and evaluation from theory to facts.

Experimental assumptions are employed in a series of experiments that explore what Lakatos calls a research programme. Research programmes can be productive (they produce novel results) or degenerative (they end up full of patched-up exceptions). The main difference between grammatical studies and those in the physical sciences is that linguists may often not agree on much of the theoretical framework beyond simple wordclasses, and even these are debated.

In the past, this may have been due, at least in part, to a lack of parsed corpus data; but this is changing fast. Now a new methodological question arises: how do we simultaneously apply and evaluate the parse analysis? Our observation here is simply that the same basic approach - i.e., systematic experimentation - can be applied with a number of important provisos.

Extreme cases, often the lifeblood of traditional linguistics, play a role: they highlight the borders of what might be expressed. They indicate the possible existence of phenomena but cannot explain them, nor how relevant any introspective judgement may be to the bulk of natural language. (Moreover, as Abney points out, acceptability judgements may be probabilistic and context-dependent rather than absolute and independent. While, in the experimental approach described here, individual parse analyses are absolute, they are collectively probabilistic.)

Counting classes of element gives us an idea of relative frequencies in a sample. But neither extrema nor frequency counts explain contextual variation in language, or how different components affect one another. To do this, we need to employ statistical methods.

In summary, just as in other kinds of empirical research you should (a) avoid circularity and contradiction (beware of measuring the same linguistic phenomenon twice) and (b) relate your results to the theoretical background. In this case, this means taking into account the annotation of the corpus in your explanations, and trying to ensure that neither the process of annotation nor abstraction (forming queries) overly affects the results. You should always be prepared to play ‘devil’s advocate’ to your own theory.

Finally, simpler explanations are generally better than more complex ones. However, the definition of simplicity depends not only on the number of terms in an expression but also how the expression is rooted in the theory. I might explain a choice, b, by the presence of another factor, a, but if this factor relies on a complex and long-winded justification, then my explanation can hardly be called simple.


Ordering corpora

Q. Can I try ICE-GB and DCPSE before I part with any money?
A. Yes, in fact we encourage it! Simply go to our ICE-GB download page or DCPSE download page to download a sample corpus, the full ICECUP software and help files.
Q. How do I order a corpus, and which should I order?
A.

Short answer: You must print out and complete the relevant order form and sign the statement agreeing to comply with the licence agreement. This must be sent to us by post with payment by either sterling cheque (UK bank) or credit card.

There is a separate credit card form to complete for credit card payments.

Choose DCPSE if...

  • you want to work with English language change over time.
  • you want to focus on the grammar of spoken English.

Choose ICE-GB if...

  • you want to work with different genres of English language, including speech and writing.
  • you want to compare British English with other global varieties of English.

Order ICE-GB Release 2
Order DCPSE
Credit card form

Notes

  • You can send the licence form as a fax if you have no other option, but we would recommend that credit card details are not sent in this way.
  • If neither credit card nor cheque payments are possible, then we can arrange a bank transfer. However we try to avoid this method because it is very slow.
  • We will only send out the CD when we have received a signed order form and evidence of payment.
  • EU institutions and companies must provide a VAT number.
  • Students must provide evidence of full time status in the form of a signed letter from their head of department.

If you have any questions, email ucleseu@ucl.ac.uk.

Q. How do I order audio material, and can I buy just those CDs I need?
A.

The audio for ICE-GB Release 2 covers the 300 spoken texts and totals 75 hours. It is supplied in the form of a collection of audio files on 11 CDs, each recorded as 16 kHz mono with no compression.

ICE-GB R2 Sound is ordered separately. You can, if you wish, order the audio without ordering ICE-GB.

Order ICE-GB R2 Sound

If you want to explore ICE-GB and play the sound you need both ICE-GB Release 2 and the sound recordings.

Q. I already have ICE-GB R1. Why should I upgrade to Release 2?
A.

ICE-GB Release 1 was a milestone in corpus linguistics when it was released in 1998. Why should any current user of ICE-GB consider upgrading to Release 2? Here are three reasons.

  • An enhanced corpus. We have reinstated some missing material and corrected the transcription (and thus the parse analysis) when we reviewed the recordings.
  • More facilities. ICECUP 3.1 contains many more facilities for search and exploration. These include lexical wild-card queries, enhanced FTFs, an integrated lexicon and grammaticon for ICE-GB and the ability to compute and extract statistical tables.
  • Synchronous audio. If you want to play the audio aligned with the transcription you will need to upgrade. This facility means that if you search for a word in the corpus, you can hear the passage containing that word.

Finally, ICE-GB Release 2 is compatible with the new ICECUP IV software (currently in beta). You can continue to upgrade the software for free.

Q. What are the differences between the types of licence offered?
A.

There are three types of licence. Full details are on the order form and we ask you to read them carefully. All licences are granted on the basis that the corpus is only used for education and research.

The differences are essentially the following.

  • Individual licence, including student licence. Only the individual named licensee is entitled to access the corpus. Use this licence if you are going to use it for your own personal research.
  • Institutional licence, single copy. Sometimes called a ‘single seat’ licence, only one member of the institution (staff or student) at a time may access the corpus. Use this licence if you want the corpus to be available for research in the institution but will limit access.
  • Institutional licence, multiple copy. This is a ‘network’ or ‘multi-seat’ licence, where any number of members of the institution may access the corpus. Use this licence if you want the corpus to be available for whole-class teaching or made available for general use across the institution.

For commercial licensing see below.

Q. I represent a commercial company and we are interested in using your corpora. What uses might be acceptable?
A.

Our contributors have given us the right to distribute the corpus resource on the basis that the corpus is only used for education and research.

Commercial use of the corpus is therefore limited to education and research purposes.

This can lead to some fine distinctions. You can test parsers and other algorithms against a corpus and exploit those tools commercially, but it would not be permissible for you to reprocess a corpus and publish data abstracted from it as part of the database for a commercial tool.

For more information, contact ucleseu@ucl.ac.uk with your proposal.


References

Abney, S. (1996), Statistical Methods and Linguistics, in Klavans, J. and Resnik, P. (Eds.) (1996) The Balancing Act, Cambridge, MA.

Carr, E.H. (1964), What is history?, Harmondsworth: Penguin.

Greenbaum, S. (1996) The Oxford English Grammar, Oxford: OUP.

Halteren, H., van, & van den Heuvel, T. (1990). Linguistic Exploitation of Syntactic Databases: the use of the Nijmegen Linguistic DataBase program (Language and Computers, vol. 5). Amsterdam: Rodopi.

Huddleston, R. & Pullum, G.K. (eds.) (2002). The Cambridge Grammar of the English Language. Cambridge: Cambridge University Press.

Kepser, S. (2003). Finite Structure Query: A Tool for Querying Syntactically Annotated Corpora, in Copestake, C. & Hajic, J. (eds.), Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2003, Budapest. 179-186.

Lakatos, I. (1978), Mathematics, science and epistemology (philosophical papers), Cambridge: CUP.

Putnam, H. (1974), The ‘Corroboration’ of Scientific Theories, in Hacking, I. (Ed.) (1981), Scientific Revolutions, Oxford Readings in Philosophy, Oxford: OUP, pp60-79.

Quirk, R., Greenbaum, S., Leech, G., & Svartvik J. (1985). A Comprehensive Grammar of the English Language. London: Longman.

Sinclair, J.M. (1987). Grammar in the Dictionary. In Sinclair, J.M. (ed.) Looking Up: an account of the COBUILD Project in lexical computing. London: Collins.

Wallis, S.A. (2007). Annotation, Retrieval and Experimentation. In Meurman-Solin, A. & Nurmi, A.A. (eds.) Annotating Variation and Change. Helsinki: Varieng, UoH. ePublished.

Wallis, S.A. (2008), Searching treebanks and other structured corpora. In Lüdeling, A. & Kytö, M. (eds.) Corpus Linguistics: An International Handbook. Handbücher zur Sprache und Kommunikationswissenschaft series. Berlin: Mouton de Gruyter. 738-759.


This page last modified 12 June, 2013 by Survey Web Administrator.