Fuzzy Tree Fragments: Evaluation
A brief page summarising how FTFs have worked out in practice.
Criteria
When we started work on FTFs, we proposed that we should use the following criteria to evaluate grammatical query systems (in order of importance, from most to least). See Wallis and Nelson (2000).
Linguistic adequacy.
Queries are sufficiently expressive to capture linguistically meaningful research questions.
Transparency.
Queries are clear and meaningful to end users, to whom any new system will be unfamiliar.
General expressivity.
One can express a broad range of queries that extends beyond current linguistic concerns.
Computational efficiency.
Naturally, we all prefer a rapid response to a poor one. In practice, however, it is not the proof algorithm that determines the speed but the disk access time.
Note
ICECUP and FTFs have developed over time. Wallis and Nelson (2000) summarises ICECUP 3.0, its technical infrastructure and the first version of FTFs.
With the release of ICECUP 3.1 in 2006, to coincide with the publication of DCPSE and Release 2 of ICE-GB, the expressivity of FTFs was extended in a number of ways – in particular, to permit the expression of logic in nodes and lexical wild cards.
It was not found necessary to extend the set of links and edges, except, later, with the ‘inherit if conjoin’ option.
How do FTFs perform against these criteria?
The release of ICECUP 3.1, which incorporated logic in tree nodes, and sets and wildcards in word slots, addressed many perceived deficiencies of the original system. You could now state that a node could be either an adjective or a nominal adjective, or, using sets and wildcards, require a word in a slot to be a form of a regular or irregular verb.
Nonetheless, the following types of queries remained difficult to perform:
- Matching optional conjoins – eventually resolved by an extension in ICECUP 3.1.1.
- Specifying that two elements in an FTF must have the same function, category or feature without naming it. (Hans van Halteren and Theo van den Heuvel's (1990) LDB program allows this, but FTFs do not.)
Some occasional desiderata are procedural, for example:
- Specifying a ‘nearest ancestor’ or ‘nearest sibling’ match.
- Counting nodes, words, etc. under a match.
- Counting intervening nodes between two nodes (grammatical distance).
The principal issue is that ICECUP III uses logic and is deliberately intended not to make procedural distinctions. Thus it will find all matching cases, not just the nearest.
ICECUP IV (which is experimental and not openly distributed) defines a 'project' which applies FTFs to trees in a more procedural way.
Of course it may be argued that some of these are not linguistically meaningful. The ‘nearest ancestor’ of a node is not a well-defined linguistic relationship, and may be better addressed by finding cases and then reviewing them manually.
Simple FTFs with direct relationships between nodes are easy to understand. But some users have difficulty with the ‘ancestor’ option for parent-child relations because the FTF is more ‘elastic’ and doesn’t match so intuitively against the tree.
Evaluating transparency is not easily separated from the difficulty of learning ICE grammar.
Apart from the fact that FTFs in ICECUP are ‘abstract ICE trees’, many users find that they do not know the grammar sufficiently to form ‘definitive’ queries easily. This is why ICECUP is designed around what we call an ‘exploratory cycle’ (see Corpus Query report).
In practice, after more than twenty years of attempting to explore radically new experimental designs, we can honestly report that only rarely did we find ourselves finding ICECUP and FTFs insufficiently powerful to carry out the research alone, and resort to programming.
There are exceptions. For example, Wallis (2021: 226) describes an experiment investigating grammatical priming (or spreading activation) within tree structures, where he had to write code to extract the distance between two nodes by counting intervening elements.
A more common limitation is that there is no simple method for counting words, clauses or phrases under a tree without programming (or using ICECUP IV – which is currently deprecated).
For more on these issues, see Wallis (2008).
ICECUP 3.1’s implementation of FTFs is highly efficient.
Indexing
ICECUP uses extensive indexing (~30Mb out of 80Mb of the ICE-GB package are indexes).
- Complete precomputed first-order indexes of unique nodes and lexicon node+word combinations (in ICECUP 3.0 this was only an index of words). Every single exact node pattern is stored as a series of references to text units in the corpus.
- For nodes, extensive precomputed second-order indexes (indexes of the above), plus a direct index where second order patterns have 100 or more combinations. When ICECUP performs a search like 'CL' (clause) it finds a list of all the possible patterns, plus the pre-computed short-cut of all references to clauses in the corpus.
- Similar second-order indexes for nodes in the lexicon index. We also store indexes for letters, by position. We have indexes for ‘a*’ (all words beginning with A/a, ‘?a*’ (all words whose second-place letter is A/a), ‘??a*’ (third), etc. We also have indexes counting from the last word, and word length indexes. This accelerates wildcard searches and enables the lexicon to work efficiently.
Where a search can be completed without individually reviewing corpus trees, the search is completed at that point. Otherwise the candidate list must be reviewed, a process that takes place as a ‘background search’. ICECUP 3.1 runs these background searches in parallel. Multiple searches are performed by sequencing candidate trees, reducing seek and retrieval time as well as hard disk impact. Trees are also cached internally, so if a tree a search needs has been recently loaded it can be copied from the cache.
Matching
The algorithm that matches FTFs to corpus trees is also highly efficient.
This algorithm is an AI algorithm termed a ‘theorem prover’ that is optimised for phrase structure trees. It finds all possible ‘proofs’ (matching patterns) against the corpus tree.
A simplified description is below.
Stage 1 - match nodes
- All node edges and links implied by the existing FTF relations are set. (If a child has a parent, it cannot be a root; if a second child, it cannot be the first child, etc.)
- Rank the nodes by an ‘ease of matching’ score. (If a node is likely to be hard to match, prioritise it, because if any node cannot be matched the process stops.)
- All individual nodes are then matched against the tree, including their edges.
At this point, any FTF node may match more than one node, but we do not know how many valid combinations exist. One way we might do this is to enumerate all combinations and then test them. But this is inefficient. For example, if we know that one matching node is not a valid parent of another, all combinations which contains this assignment must be invalid.
A better way is to try to eliminate invalid patterns as we create combinations. This is what we do.
Stage 2 - identify matching combinations
- If each FTF node has only one match, we check that pattern quickly in a simplified process. (For many searches there is only one FTF match per corpus tree, and this process is used over 90% of the time.)
- Otherwise, the algorithm traverses the FTF tree from top to bottom, checking whether candidate nodes match the parent or next child (etc.) of adjacent ones.
- For each combination, all relationships are then checked, prioritising direct and ordered ones (because they are more restrictive), eliminating the combination if a node fails the test.
The description is necessarily simplified. See also Wallis and Nelson (2000). We also have to deal with a number of other constraints or relaxations of rules, including ‘skipping over’ certain nodes like pauses or punctuation, ditto-tagged compounds (where one FTF node matches multiple corpus tree nodes) and conjoin inheritance.
If the algorithm generates a ‘memory error’ because it finds too many possible combinations of FTFs per tree, ICECUP will stop. This is by design: if your search generated hundreds of possible patterns per text unit it is likely to be because the FTF is linguistically meaningless!
If you wish to comment on any of these issues, mail us on s.wallis@ucl.ac.uk.
References
Nelson, G., Wallis, S.A. and Aarts, B. (2002). Exploring Natural Language: Working with the British Component of the International Corpus of English. G29, Varieties of English Around the World series. Amsterdam: John Benjamins.
Van Halteren, T., and van den Heuvel, T. (1990). Linguistic Exploitation of Syntactic Databases. Language and Computers vol. 5, Amsterdam: Rodopi/Brill.
Wallis, S.A. and Nelson, G. (2000). Exploiting fuzzy tree fragment queries in the investigation of parsed corpora. Literary and Linguistic Computing 15, 3: 339-361.
Wallis, S.A. (2008). Searching treebanks and other structured corpora. In Lüdeling, A. and Kytö, M. (eds.) Corpus Linguistics: An International Handbook. Handbücher zur Sprache und Kommunikationswissenschaft series. Berlin: Mouton de Gruyter. 738-759.
Wallis, S.A. (2021). Statistics in Corpus Linguistics Research: A new approach. New York: Routledge.