Frequently asked questions about FTFs

FTFs are declarative diagrammatic models designed, in our case, for searching for grammatical constructions in parsed corpora. However, they could be employed in a variety of applications.

Are FTFs dependent on ICECUP?

No. At present, people use FTFs with ICECUP because ICECUP is the main way in which we distribute the editor, and ICECUP is organised around the FTF concept. This does not mean that they could not be used in other treebank tools. (By the way, our main reason for not wishing to distribute the source code at present is practical: we are continually developing the approach and we want to keep versions standard and clear.) But if you have suggestions for collaboration, do email us.

We have stood on the shoulders of others, most notably, Hans van Halteren and Theo van den Heuvel, and their LDB system (built originally for the Nijmegen Parsed Corpus). This program makes use of user-definable ‘patterns’ which are similar in concept to our FTFs, although they can require more effort and programming from the end user. Since the publication of ICECUP 3.0, a number of other query systems have been developed which replicate the basic idea of employing tree models.

Do FTFs rely on ICE-GB or the ICE grammar?

No. FTFs necessarily reflect the topology of a particular grammar - the structural rules that define what kinds of relations are possible in a grammatical ‘tree’ - because an FTF is a kind of ‘abstract tree’. We developed FTFs in the context of a particular grammar. (It is hard to see how a grammatical query system could be evaluated by linguists unless it was developed in this way!) But that does not mean we cannot then advance towards a universal system from our current, more specialised one. We would rather move toward such a system from our current practice-based starting point than begin by abstractly defining parameters for universal grammatical queries. We believe that, as far as possible, any system has to be usable by non-specialists.

Moreover, this does not mean that current FTFs could not be modified relatively easily to work with other phrase-structure grammars (this covers most parsed corpora in the world today).

We have also been looking at how FTFs might work with constraint or dependency grammars (Wallis 2008). These formalisms raise some interesting questions because crossing links are allowed and the ‘trees’ are rather different.

Would FTFs work with other kinds of structure apart from grammatical ones?

Yes. The principles behind FTFs would work with other collections of structured objects. We work with grammar because this is where our primary expertise is. Within corpus linguistics, FTFs could be extended to handle:

  1. word-level annotations, e.g., prosody,
  2. parallel structures, e.g., parallel parsing (the same sentence parsed several times, possibly under different analysis schemes), and
  3. other structural annotation, e.g., semantic relations.

Can FTFs be used to perform experiments in grammar?

Yes. If you look at these notes, you will see that we can perform research into grammatical variation arising from socio-linguistic factors, and into the interaction between two different grammatical variables. Linguistics experiments on a given corpus are ex post facto studies, or natural experiments, because it is not possible to modify or constrain data collection within the experiment.

See also the section on Methodology which points to future work where entire groups of experiments can be automated.

Isn’t the analysis in the corpus simply one interpretation out of many, so how can any experiment we perform be said to be ‘scientific’?

This is an important question that goes to the heart of the scientific method. Even within a notionally simple experiment in the physical sciences, as Putnam points out, questions regarding what to measure, how to measure it and whether measurements are reliable, rely on certain theoretical assumptions.

For example, when predicting orbits around the sun, astronomers first assume that the problem can be treated as a “two-body problem” (the two bodies being the sun and the orbiting body, say, Jupiter). But all other bodies in the solar system (and outside it) actually exert a pull on both the sun and Jupiter. The point is that these other forces are almost negligible, and a simple application of Kepler's laws will predict Jupiter’s orbit to a high degree of accuracy. (Of course one can’t measure Jupiter’s mass directly, so further assumptions must be applied in order to apply the equations.)
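To make the layering of assumptions concrete: under the two-body idealisation, Kepler's third law relates the orbital period to the semi-major axis, and even here a further approximation (neglecting the planet's mass) is usually applied:

```latex
% Kepler's third law for the idealised two-body problem:
% T = orbital period, a = semi-major axis,
% G = gravitational constant, M_sun and m = the two masses.
T^2 = \frac{4\pi^2}{G(M_\odot + m)}\, a^3
\;\approx\; \frac{4\pi^2}{G M_\odot}\, a^3
\qquad \text{(since } m \ll M_\odot \text{)}
```

Each simplification - two bodies only, then ignoring the smaller mass - is a theoretical assumption that makes measurement and prediction tractable.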

This means that there is no such thing as an assumption-free experiment. Instead, we have a working set of assumptions which parameterise a set of experiments. These assumptions determine what can and cannot be measured, how these might be measured or approximated, which other variables to consider, and so on. In our case, our working set of assumptions - the grammar - is large, complex and highly theory-laden.

But this need not rule out a scientific procedure operating in good faith. Let us take another analogy: historical studies. Does a historian select his facts to fit a theory or choose a theory to fit the facts? How does s/he know that 'the facts' are the salient facts? In a famous series of lectures, Carr argued that the best historians are engaged in a constant dialogue between their chosen facts and their theoretical assumptions.

As any working historian knows, if he stops to reflect what he is doing as he thinks and writes, the historian is engaged on a continuous process of moulding his facts to his interpretation, and his interpretation to his facts. It is impossible to assign primacy to one over another [p29].

Anyone who has tried parsing a corpus will recognise this statement! Scientific experimentation is necessarily cyclic, involving stages of induction from facts to theory, and evaluation from theory to facts.

Experimental assumptions are employed in a series of experiments that explore what Lakatos calls a research programme. Research programmes can be productive (they produce novel results) or degenerative (they end up full of patched-up exceptions). The main difference between grammatical studies and those in the physical sciences is that linguists may often not agree on much of the theoretical framework beyond simple wordclasses, and even these are debated.

In the past, this may have been due, at least in part, to a lack of parsed corpus data; but this is changing fast. Now a new methodological question arises: how do we simultaneously apply and evaluate the parse analysis? Our observation here is simply that the same basic approach - i.e., systematic experimentation - can be applied with a number of important provisos.

Extreme cases, often the lifeblood of traditional linguistics, play a role: they highlight the borders of what might be expressed. They indicate the possible existence of phenomena but cannot explain them, nor how relevant any introspective judgement may be to the bulk of natural language. (Moreover, as Abney points out, acceptability judgements may be probabilistic and context-dependent rather than absolute and independent. While, in the experimental approach described here, individual parse analyses are absolute, they are collectively probabilistic.)

Counting classes of element gives us an idea of relative frequencies in a sample. But neither extrema nor frequency counts explain contextual variation in language, or how different components affect one another. To do this, we need to employ statistical methods.
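As a minimal sketch of the kind of statistical method meant here, the following computes a Pearson chi-square test of independence for a 2×2 contingency table relating two variables. The variables and counts are invented for illustration; they are not real corpus frequencies.

```python
def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 contingency table.

    table: [[a, b], [c, d]], where rows are values of one variable
    (e.g. a socio-linguistic category) and columns are values of the
    other (e.g. a grammatical choice).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the null hypothesis of independence
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# Invented counts: rows might be two speaker groups, columns a choice
# between two grammatical constructions.
observed = [[30, 70], [10, 90]]
chi2 = chi_square_2x2(observed)
# With 1 degree of freedom, chi2 > 3.84 indicates significance at p < 0.05
print(round(chi2, 2))  # -> 12.5
```

A result above the critical value suggests the two variables interact, i.e. the distribution of one choice differs across values of the other.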

In summary, just as in other kinds of empirical research you should (a) avoid circularity and contradiction (beware of measuring the same linguistic phenomenon twice) and (b) relate your results to the theoretical background. In this case, this means taking into account the annotation of the corpus in your explanations, and trying to ensure that neither the process of annotation nor abstraction (forming queries) overly affects the results. You should always be prepared to play ‘devil’s advocate’ to your own theory.

Finally, simpler explanations are generally better than more complex ones. However, the definition of simplicity depends not only on the number of terms in an expression but also on how the expression is rooted in the theory. I might explain a choice, b, by the presence of another factor, a, but if this factor relies on a complex and long-winded justification, then my explanation can hardly be called simple.

Is it possible to evaluate a grammar empirically using a parsed corpus?

Tentatively, yes. This question is something of a ‘holy grail’ in linguistics.

Crudely, theoretical linguistics concerns itself with models of grammar and their intrinsic (internally-deductive) properties, while computational linguistics attempts to fit these models to data. Both are absolutely necessary stages in the parsing of a corpus.

Once a corpus has been parsed it is the source of three types of evidence.

  1. Frequency evidence is evidence of the frequency of known phenomena. For example, corpus studies have found that for particular lexical items the verb form is more frequent than the noun form, although dictionaries generally assume the predominance of the noun. A parser can be ‘trained’ on a corpus by identifying the frequency of particular rules in its knowledge base. This can be done with an uncorrected corpus, i.e. by applying a parser to a number of sentences and counting the number of times each rule was applied in the final analysis. However, if human linguists correct the corpus, more reliable and informative frequency evidence is likely to be obtained.
  2. Coverage evidence is evidence of the frequency of unknown phenomena, that is, the identification of new rules, lexical items, etc. Corpora have long been used for the ‘discovery’ of undocumented lexical items and the same applies to parser knowledge bases. In this case, manual correction of a corpus is a necessity.
  3. Interaction evidence is correlational evidence between two or more linguistic phenomena. The grammatical experiments on our website are simple examples of studies of grammatical interaction. Our Next Generation Tools project concerned itself with developing a platform for studying competing interactions involving multiple variables. Another example is the collection of ‘transition probabilities’ in parsers and taggers, where the algorithm selects a tag or rule according to neighbouring tags or rules.
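The rule-counting procedure described under (1) can be sketched as follows, assuming a toy representation of parse trees as nested tuples. The trees and category labels here are hypothetical miniatures, not ICE-GB analyses.

```python
from collections import Counter

def rule_frequencies(trees):
    """Count how often each production rule (parent -> children) is
    applied across a collection of parse trees.

    A tree is a nested tuple: (category, child1, child2, ...),
    with lexical leaves as plain strings.
    """
    counts = Counter()

    def walk(node):
        if isinstance(node, str):
            return  # lexical leaf: no rule applied here
        parent, *children = node
        # The right-hand side is the sequence of child categories (or words)
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        counts[(parent, rhs)] += 1
        for c in children:
            walk(c)

    for t in trees:
        walk(t)
    return counts

# Hypothetical miniature 'corpus' of two parsed sentences
trees = [
    ("S", ("NP", "they"), ("VP", ("V", "sang"))),
    ("S", ("NP", "we"), ("VP", ("V", "left"), ("NP", "early"))),
]
freqs = rule_frequencies(trees)
print(freqs[("S", ("NP", "VP"))])  # -> 2
```

Summing such counts over a (preferably corrected) corpus yields the frequency evidence that a probabilistic parser can be trained on.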

One way in which grammatical frameworks in a parsed corpus may be compared is by considering how well they distinguish phenomena. Wallis (2008) referred to this as retrieval of ‘linguistic events’ in the corpus. Using our corpora one can use ICECUP and Fuzzy Tree Fragments to reliably retrieve a wide range of phenomena.

In terms of the scheme above, both (1) and (2) are sources of evidence for this type of argument. A ‘better’ grammar is then one that permits the reliable retrieval of more events in a corpus (because it can distinguish them) than another. These events may be (1) known or (2) previously unknown. Note that a deductive argument might suffice for (1) but (2) requires a corpus.

However, the simple retrieval of events is not a sufficient criterion for the comparative evaluation of a grammar.

  • The argument is circular. Not all events are equally important to every linguist. Consider two grammars, A and B, which each distinguish different phenomena plus a common core set. It is not enough to say that Grammar A is better than Grammar B because A can distinguish more phenomena than B. The linguists advocating for B will argue that the phenomena that B uniquely distinguishes are more important than those uniquely distinguished by A.
  • ‘Improvements’ can be made by redundancy. All that one needs to do is invent a new Grammar A+B that incorporates both A and B, and this framework will distinguish the complete set. The result is pluralism and multiple redundancy, not refinement.
  • It concerns itself with atomic properties of the grammar, rather than patterns and processes. Single events are evaluated rather than the structure itself. This seems deeply unsatisfactory.

New research by Wallis (submitted) concerns the evaluation of patterns of interaction evidence (3) in a parsed corpus. Distinguishing and retrieving events reliably is still important - it is the basis of any experimental process - but the actual evaluation of grammar depends on the retrievability of these patterns of interaction. It must be emphasised that this research is only feasible with a completed parsed corpus like ICE-GB or DCPSE. In all of this research we need to distinguish between the performance of a parsing algorithm and the attested phenomena in a corpus.

This research is ongoing, and requires some new types of experimental design, but it is possible to say with some confidence that the results are extremely promising. It is already possible to show experimentally the benefits of particular refinements to an existing grammar in a parsed corpus.


Abney, S. (1996), Statistical Methods and Linguistics, in Klavans, J. and Resnik, P. (eds.) The Balancing Act: Combining Symbolic and Statistical Approaches to Language, Cambridge, MA: MIT Press.

Carr, E.H. (1964), What is history?, Harmondsworth: Penguin.

Lakatos, I. (1978), Mathematics, science and epistemology (philosophical papers), Cambridge: CUP.

Putnam, H. (1974), The ‘Corroboration’ of Scientific Theories, in Hacking, I. (Ed.) (1981), Scientific Revolutions, Oxford Readings in Philosophy, Oxford: OUP, pp60-79.

Wallis, S.A. (2008), Searching treebanks and other structured corpora. In Lüdeling, A. & Kytö, M. (eds.) Corpus Linguistics: An International Handbook. Handbücher zur Sprache und Kommunikationswissenschaft series. Berlin: Mouton de Gruyter. 738-759.

Wallis, S.A. (submitted), Capturing linguistic interaction in a grammar: A method for empirically evaluating the grammar of a parsed corpus.

FTF home pages by Sean Wallis and Gerry Nelson.
Comments/questions to s.wallis@ucl.ac.uk.

This page last modified 28 January, 2021 by Survey Web Administrator.