XClose

UCL English

Home
Menu

Fuzzy Tree Fragments: Evaluation

A brief page summarising how FTFs have worked out in practice.

Criteria

When we started work on FTFs, we proposed that we should use the following criteria to evaluate grammatical query systems (in order of importance, from most to least). See Wallis and Nelson (2000).

  1. Linguistic adequacy.

    Queries are sufficiently expressive to capture linguistically meaningful research questions.

  2. Transparency.

    Queries are clear and meaningful to end users, to whom any new system will be unfamiliar.

  3. General expressivity.

    One can express a broad range of queries that extends beyond current linguistic concerns.

  4. Computational efficiency.

    Naturally, we all prefer a rapid response to a poor one. In practice, however, it is not the proof algorithm that determines the speed but the disk access time.

Note

ICECUP and FTFs have developed over time. Wallis and Nelson (2000) summarises ICECUP 3.0, its technical infrastructure and the first version of FTFs. 

With the release of ICECUP 3.1 in 2006, to coincide with the publication of DCPSE and Release 2 of ICE-GB, the expressivity of FTFs was extended in a number of ways – in particular, to permit the expression of logic in nodes and lexical wild cards.

It was not found necessary to extend the set of links and edges, except, later, with the ‘inherit if conjoin’ option.


How do FTFs perform against these criteria?

Are FTFs adequate for ‘linguistically meaningful' directed queries?

The release of ICECUP 3.1, which incorporated logic in tree nodes, and sets and wildcards in word slots, addressed many perceived deficiencies of the original system. You could now state that a node could be an adjective or a nominal adjective, or require a word in a slot to be a form of a regular or irregular verb.

Nonetheless, the following types of queries remained difficult to perform:

  • Matching optional conjoins – eventually resolved by an extension in ICECUP 3.1.1.
  • Specifying that two elements in an FTF must have the same function, category or feature without naming it. (Hans van Halteren and Theo van den Heuvel's (1990)  LDB program allows this, but FTFs do not.)
  • Specifying a ‘nearest ancestor’ match – procedural, permitted in ICECUP IV (ICECUP 3 workaround: permute FTFs with unspecified intervening nodes).
  • Counting nodes, words, etc. under a match – procedural, permitted in ICECUP IV.
  • Counting intervening nodes (grammatical distance) – procedural

Of course it may be argued that some of these are not linguistically meaningful and should not be supported. But this is not really a decision for software designers!

Are FTFs sufficiently transparent?

Some users have difficulty with the ‘ancestor’ option for parent-child relations because the FTF is more ‘elastic’ and doesn’t match so intuitively against the tree.

However this kind of evaluation cannot be separated from the structure of the ICE grammar.

Apart from the fact that FTFs in ICECUP are ‘abstract ICE trees’, many users find that they do not know the grammar sufficiently to form ‘definitive’ queries easily. This is what prompted us to design ICECUP around what we call an ‘exploratory cycle’ (see Corpus Query report).

Are FTFs expressive enough beyond current linguistic concerns?

In practice, after more than twenty years of attempting to explore radically new experimental designs, we can honestly report that only rarely did we find ourselves finding ICECUP and FTFs insufficiently powerful to carry out the experiment alone, and resort to programming.

For example, Wallis (2021: 226) describes an experiment investigating grammatical priming (or spreading activation) within tree structures, where we had to write code to extract the distance between two nodes by counting intervening elements.

A more common limitation is that there is no simple method for counting words, clauses or phrases under a tree without programming (or using ICECUP IV – currently deprecated).

For more on these issues, see Wallis (2008).

Are FTFs computationally efficient?

ICECUP 3.1’s implementation of FTFs is highly efficient.

Indexing

ICECUP uses extensive indexing (~30Mb out of 80Mb of the ICE-GB package are indexes).

  • Complete precomputed first-order indexes of unique nodes and lexicon node+word combinations (in ICECUP 3.0 this was only an index of words). Every single exact node pattern is stored as a series of references to text units in the corpus.
  • For nodes, extensive precomputed second-order indexes (indexes of the above), plus a direct index where second order patterns have 100 or more combinations. When ICECUP performs a search like 'CL' (clause) it finds a list of all the possible patterns, plus the pre-computed short-cut of all references to clauses in the corpus.
  • Similar second-order indexes for nodes in the lexicon index. We also store indexes for letters, by position. We have indexes for ‘a*’ (all words beginning with A/a, ‘?a*’ (all words whose second-place letter is A/a), ‘??a*’ (third), etc. We also have indexes counting from the last word, and word length indexes. This accelerates wildcard searches and enables the lexicon to work efficiently.

Where a search can be completed without individually reviewing corpus trees, the search is completed at that point. Otherwise the candidate list must be reviewed, a process that takes place as a ‘background search’. ICECUP 3.1 runs these background searches in parallel. Multiple searches are performed by sequencing candidate trees, reducing seek and retrieval time as well as hard disk impact. Trees are also cached internally, so if a tree a search needs has been recently loaded it can be copied from the cache.

Matching

The algorithm that matches FTFs to corpus trees is also highly efficient.

This algorithm is an AI algorithm termed a ‘theorem prover’ that is optimised for phrase structure trees. It finds all possible ‘proofs’ (matching patterns) against the corpus tree.

A simplified description is below. 

Stage 1 - match nodes

  • All node edges and links implied by the existing FTF relations are set. (If a child has a parent, it cannot be a root; if a second child, it cannot be the first child, etc.)
  • Rank the nodes by an ‘ease of matching’ score. (If a node is likely to be hard to match, prioritise it, because if any node cannot be matched the process stops.)
  • All individual nodes are then matched against the tree, including their edges.

At this point, any FTF node may match more than one node, but we do not know whether how many valid combinations exist. One way we might do this is to enumerate all combinations and then test them. But this can be inefficient. For example, if we know that one matching node is not a valid parent of another, all combinations which contains this assignment must be invalid.

A cleverer way is to try to eliminate invalid patterns as we create combinations. This is what we do.

Stage 2 - identify matching combinations

  • If each FTF node has only one match, we check that pattern quickly in a simplified process. (For many searches there is only one FTF match per corpus tree, and this process may be used 90% of the time or more.)
  • Otherwise, the algorithm traverses the FTF tree from top to bottom, checking whether candidate nodes match the parent or next child (etc.) of adjacent ones.
  • For each combination, all relationships are then checked, prioritising direct and ordered ones (because they are more restrictive), eliminating the combination if a node fails the test.

The description is necessarily simplified. See also Wallis and Nelson (2000). We also have to deal with a number of other constraints or relaxations of rules, including ‘skipping over’ certain nodes like pauses or punctuation, ditto-tagged compounds (where one FTF node matches multiple corpus tree nodes) and conjoin inheritance.

If the algorithm generates a ‘memory error’ because it finds too many possible combinations of FTFs per tree, ICECUP will stop. This is by design: if your search generated hundreds of possible patterns it is likely to be because the FTF is linguistically meaningless!

If you wish to comment on any of these issues, mail us on s.wallis@ucl.ac.uk.


References

Nelson, G., Wallis, S.A. and Aarts, B. (2002). Exploring Natural Language: Working with the British Component of the International Corpus of English. G29, Varieties of English Around the World series. Amsterdam: John Benjamins.

Van Halteren, T., and van den Heuvel, T. (1990). Linguistic Exploitation of Syntactic Databases. Language and Computers vol. 5, Amsterdam: Rodopi/Brill.

Wallis, S.A. and Nelson, G. (2000). Exploiting fuzzy tree fragment queries in the investigation of parsed corpora. Literary and Linguistic Computing 15, 3: 339-361.

Wallis, S.A. (2008). Searching treebanks and other structured corpora. In Lüdeling, A. and Kytö, M. (eds.) Corpus Linguistics: An International Handbook. Handbücher zur Sprache und Kommunikationswissenschaft series. Berlin: Mouton de Gruyter. 738-759.

Wallis, S.A. (2021). Statistics in Corpus Linguistics Research: A new approach. New York: Routledge.