XClose

UCL English

Home
Menu

Fuzzy Tree Fragments

This is the FTF home page at the Survey of English Usage.

Fuzzy Tree Fragments

Introduction

Fuzzy Tree Fragments (FTFs for short) are a handy way of specifying a grammatical query in a parsed corpus or treebank. The idea is that you represent your query by drawing the kind of structure that you are looking for. FTFs were initially developed under the auspices of the Corpus Query project (the report summarises FTFs quite well from a technical point of view), but we are continuing to work on, and with, the system. Latest developments included as part of the latest ICECUP 3.1 release are summarised here.

At present, FTFs are used in the ICECUP software, which you can download for free from our website, together with 20,000 words from the ICE-GB corpus. Alternatively, if you are interested in recent change in grammar, you may prefer to download the DCPSE sample corpus.

The idea of these pages is to explain FTFs: to help users use and understand FTFs, what they can do, and what (at present) they can’t. Although FTFs are designed to be easy to use – see comments on evaluating FTFs and other query systems – we have found that some users find the more sophisticated options difficult to understand, or cannot see how to use them to form the query that they want. On the principle that, if a user is making a mistake, it’s probably our fault for not explaining the method properly, we have decided to set up this resource!

We expect the site to grow, particularly in response to feedback. So if you think that there is a question that is not properly covered here, email us and we will respond and possibly add to or modify these pages. You may see the answer to your query up on the site (don’t worry, we won’t put your name up in lights!).


What are FTFs?

Fuzzy Tree Fragments are approximate ‘models’, ‘diagrams’ or ‘wild-cards’ for grammatical queries on a parsed (tree-analysed) corpus. Because they are models, they are essentially declarative, that is, there is no right or wrong order for evaluating the elements - like logical statements, elements must be true together.

FTFs are generalised grammatical subtrees representing a model of the grammatical structure sought, with only the essential elements retained - a ‘wild-card’ model for grammar. The idea is fairly intuitive to linguists while retaining a high degree of flexibility. Nodes and text unit elements may be approximately specified, as may links between elements, and ‘edges’ (unary structural properties such as ‘first child’).

FTFs are diagrammatic representations: they make sense as drawings of partial trees rather than as a set of logical predicates. Such diagrams have the property of structural coherence, that is, it is immediately apparent if an FTF is feasible and sufficient (grammatically and structurally). You can’t draw a tree containing two nodes where each one is the parent of the other, but you might write this expression in logic by mistake.

FTFs, links and edges

The Basics

FTFs contain the following elements.

Each link and edge can be set to one of a set of different ‘values’ or ‘statuses’, which are summarised below. The status of a link or edge can be set by clicking with the mouse on the ‘dot’ or ‘cool spot’ in the middle of the element. In order to keep the distinction between them clear, blue dots are used for node edges and links, while green is for words.

In this example, both ‘Parent’ links are set to ‘immediate parent’, the ‘Next (child)’ link is ‘immediately after’ (hence the arrow), and the ‘Next word’ link is <unknown>. All edges are set to <unknown>, i.e., they are unspecified.

  • Nodes’, which are drawn as white ‘boxes’ divided into functioncategory and feature partitions (see the ICE grammar). At least one node must contain a yellow border indicating that it is the ‘focus’ of the FTF. ICECUP employs this focal point to indicate the portion of text ‘covered by’ the FTF, and to organise concordancing displays.
  • Words’, including all lexical items and pauses (strictly, we should call them ‘text unit elements’). These are drawn on the other side of the divider from the tree structure. In the example above the words are unspecified.
  • Links’ joining two elements together. There are two kinds of link between two nodes (called ‘Parent’ and ‘Next’) and one type of link between two words (‘Next word’).
  • Edges’, which are properties of single nodes or words. An edge might specify, for example, that a node is a leaf node, or a word is the first in the sentence.

Links are coded for adjacencyorder and connectedness and depicted so as to exploit this notion of structural coherence. Thus, the ‘Parent’ parent:child relation in an FTF can be either immediately or eventually adjacent (called ‘parent’ and ‘ancestor’ respectively, and coloured black or white).

On the other hand, the ‘Next (child)’ sibling child:child relation may be set to one of a number of options, from ‘immediately following’ (depicted by a black directional arrow), through ‘before or after’ (a white bi-directional arrow), to ‘<unknown>’ (no arrow). Bear in mind that two ‘siblings’ (sisters, if you prefer) in an FTF need not match sibling nodes in a corpus tree (as in ‘Text fragment’ examples). Links and edges are summarised in the next section.

The final benefit of the graphical approach is that it is relatively easy to see the relationship between an FTF and corpus trees. This applies both to matching and to abstraction (creating an FTF from an example in the corpus).

Below we list the links and edges used in ICECUP and make some points about how these work together in common situations. This is explained in further detail, using a series of examples, on the FTF matching page.

See also


Advanced Options

ICECUP 3.1 provides a number of enhancements to Fuzzy Tree Fragments. ICECUP 3.1 is available with DCPSE and ICE-GB R2 corpora.

These enhancements affect the definition of nodes and words in FTFs. No changes have been made to existing links and edges, the topology or matching of FTFs.

ICECUP 3.1.1 provides an additional edge switch for nodes. “Inherit if CJ” allows an FTF node to match coordination patterns as if each coordinated node was a single item rather than a member of a set.


Tree Nodes

In the first version of ICECUP, an FTF node could have an optional single function or category label. It could also have any number of feature labels provided that they were consistent with the category, with only one feature per feature class.

In ICECUP 3.1 an FTF node can contain a logical combination of node patterns. Each pattern can contain:

  • sets of possible functions and categories (optionally, negated)
  • any number of features, positive or negative (plus any feature class can be marked as ‘unspecified’)

The following examples illustrate these extensions.

Levels of FTF node complexity in ICECUP
 functioncategoryfeature 
SimpleSUSU,CLCL(cop)3.0
Unspecified0,0,0CL(!transy)3.1
Sets{OD,SU} {OD,SU},CLCL(intr,cop) 
Negation¬SU{¬OD,¬SU},CL CL(¬cop) 
Logic¬(SU)(SU ∧ CL)(CL(cop) ∨ SU) 

Unspecified

You can search for an unspecified function, category or feature class. Although in a complete corpus, functions and categories should only be unspecified if a tree is an empty ‘extracorpus’ node, unmarked feature classes are quite common. They may be unspecified because the feature is optional, the element is ambiguous or they may be unmarked in error.

Searching for an unspecified feature class is particularly useful when you want to exhaustively list all subtypes of a particular node pattern. This is labelled IV = 0 or DV = 0 in the experiment pages. In ICECUP 3.0 you had to calculate the remaining unspecified elements. Now you can obtain the values easily, e.g. ‘CL(!transy)’ finds all clauses whose transitivity has not been marked.

Sets and negation

Function and category sets allow you to easily define broader groupings than those defined by the grammar.

For example, you may want to embrace all types of direct object ‘{PROD, NOOD, OD}’ within the same query. The easiest way to do this is with a set. A negated set can be used to likewise remove possibilities from a node. Thus ‘{¬OD,¬SU},CL’ is a clause which is anything other than a direct object or subject.

Feature sets can be used to obtain results where different subtypes of features are not of interest, or where the frequency is very low (what is known as ‘collapsing values’). Both intransitive and copular are transitivity features of clauses. The query ‘CL(intr,cop)’ searches for clauses which are either intransitive or copular.

NB. If two or more listed features belong to different feature classes, as in ‘N(com,plu)’, the features are independent, and the interpretation is the intersection (meaning “nouns that are both common and plural”). If they are members of the same class, e.g., ‘N(com,prop)’, then they are treated as members of a disjoint set (“nouns that are common or proper”). To express the idea that a noun may be common or plural (or both), we use logic.

Logic

ICECUP 3.1 adds a whole new set of possibilities by including propositional logic within nodes. It can be as simple as wholesale negation (where you say a particular node in an FTF may not conform to pattern A), or disjunction (where you say that a node could be either pattern B or pattern C, such as ‘N(com) ∨ N(plu)’).

The node logic editor in ICECUP 3.1 lets you edit these expressions. As with Drag and Drop logic, it includes a simplify command which draws out the logical consequences of a particular expression.

Other pattern extensions

In addition, for any pattern you can specify:

  • structural pseudo-features such as ‘ditto’ (“ditto-tagged”), and
  • that any pattern is exactly matched, e.g. ‘=SU,NP’.

Exact matching works by replacing all unstated features with the explicit unmarked feature class (see above) and removing features which fall within the same feature class.


Words and Wildcards

The second major extension to ICECUP 3.1 is the introduction of an extensive wild card system into the ‘word’ slot in an FTF.

Originally you could optionally include a lexical item and these items could be ambiguously matched by case or accent. But this was an exact match!

ICECUP 3.1 lets you specify sets of wild card patterns (including negated patterns). Each wild card consists of a string of characters optionally including the following special characters.

Lexical wild cards in ICECUP
 descriptionexamplesexplanation
*Multiplea* *ing b*ingAny number of characters
?1 charactera??? b?c?u?eAny one character
{ }Setw{0123} t{a-z??}User defined set
^Escapeb^vd be^c^vPredefined set
  ^? ^' ^{ ^- ^& ^^Literal ?, {, etc.

For more information see ICECUP: lexical wild cards.

The set representation lets you list alternatives or define a wild card and delete specific alternatives. Moreover, because these patterns can appear in the word slot of an FTF, any lexical pattern can be constrained by the node which tags it.

Examples:

  • ‘{am are is aren't isn't}+<V(cop)>’ means “copular singular forms of BE” (excluding auxiliary verbs)
  • ‘{*ing ~thing}+<N>’ means “any -ing noun except thing

Whereas ICE-GB and DCPSE are not lemmatised, the wild card system offers a work-around in many cases.


The ‘Inherit if CJ’ Option

The very latest version of ICECUP, version 3.1.1, provides one further extension to FTFs. This is quite an advanced option, and requires some explanation.

Normally each node in an FTF matches a single node in the tree. This is perfectly intuitive. But this 1:1 relationship is not always upheld.

  • Where nodes are ditto-tagged, the same FTF node might match the compound. The reasoning is that “a compound” is really a single grammatical concept realised by more than one word. So the compound a bit is tagged PRON(quant, sing), and a Node search for ‘PRON(quant,sing)’ finds a bit as well as mucha littlea little bit and so on.
  • ICECUP’s search also ‘skips over’ certain elements by default. This behaviour is managed by the Search Options. Although the FTF might specify the following node must match this pattern, unless you change the setting, ICECUP is allowed to skip over pauses or material marked ‘ignored’. This is sensible behaviour, allowing us to deal with real-life transcribed and annotated language data, but it is not a strict 1:1 relationship!

Coordination presents a different problem. In the TOSCA/ICE grammar, a coordinated group includes an additional bracketing node, marked with the feature coordn (coordination), and each conjoined item within the set is given the function CJ (conjoin). We found that researchers would tend to design FTFs for the single, uncoordinated stereotypical pattern and forget about coordinated patterns. 

Before ICECUP 3.1.1, we had to enumerate multiple patterns to get around this problem. For example, in the instance below we would need two FTFs – one for a regular pattern and one for a coordinated pattern – at any point in a structure that could be coordinated. It is very easy to forget to do this, and so cases got missed!

Compare the matching examples below.

Single unconjoined matching example
Multiple conjoined matching example

The difference between the two trees is that the second consists of a series of items of the kind found in the first. But to find both patterns we had to create two FTFs.

The second tree shows how information is shared between the upper and lower node in coordination structures in the TOSCA/ICE grammar. The function, ELE (element), is at the top of the entire bracketed structure in the second. But the feature, appos (appositive), is found in the lower node.

Inherit if conjoin = No
The new Inherit if CJ’ edge (right) is a switch available on all nodes. It solves this problem by allowing the researcher to tell ICECUP to relax the search constraints for that node in the following way. You have to actively relax the constraint – the option is off by default.

ICECUP employs a new rule, illustrated by the scheme below.

Inherit if conjoin rule

With the inherit if CJ edge set to Yes, an FTF node is now allowed to match the function of the coordinated group plus the category and features of any single coordinated item. The category of the coordination node, which is either the same as the conjoin or (if conjoined elements are unequal, tagged 'disparate'), is not considered. During the search, ICECUP evaluates each node in a matching coordinated pattern as if were not coordinated, matching the CJ node and inheriting the function.

Links and edges are also interpreted as if the node was not coordinated. The Parent, Previous and Next child links apply to the upper node (above, before and after the conjoined group). Child and word links apply to the lower (item) node.

For more information, see also the latest version of ICECUP help.


Links and Edges

Fuzzy Tree Fragments are abstract tree diagrams.

  • Links are two-place relations. They relate element to element y. Links allow you to specify how two nodes or words may be related in the corpus trees you are searching for.
  • Edges are one-place relations, i.e., they are properties of x. Edges allow you to specify positional information about nodes (e.g. it is (or is not) at the top of the tree, the first child, etc.) and words (the first or last word in the utterance).

To understand FTFs, you have to appreciate how links and edges work together. The best way to do this is to experiment.


FTF Links

There are three different two-place relations (called ‘links’) in FTFs, which are drawn either as lines joining one node to another, or directional arrows. The general principle underwriting the set of options is that they should be as simple as possible but as complex as necessary.

NB. You can achieve quite subtle queries using a combination of relations. For example, in a phrase structure grammar such as TOSCAICE, ‘crossing links are not allowed’, i.e., nodes are strongly ordered by the word order. You may use word ordering (e.g., ‘Next word = after’) in conjunction with ‘Next child’ relations such as ‘different branches’ and <unknown>, in order to specify that one branch of a tree, while being disconnected from a first, must nevertheless follow it in the sentence.

Parent:child (‘Parent’) links

There are only two ‘Parent’ links, and they are both ordered. This means that a child node in an FTF can never match a node ‘above’ its parent.

DisplayedNameMeaning
Parent = Immediate Parent
parentThe child in the FTF matches a node immediately below the parent.
Parent = Ancestor
ancestorThe child in the FTF matches a node below the parent in the tree.

Why is there no <unknown> option? Having experimented with this, we do not believe that there is a linguistically useful query involving an unordered relationship between an FTF parent and child. Moreover, with such an option it is very easy to form a structurally nonsensical query – precisely what FTFs are meant to avoid.

Child:child (‘Next’) and word:word (‘Next word’) links

Apart from a small difference between the two (there is no ‘different branches’ option for words), the set of options for ‘Next (child)’ and ‘Next word’ are identical. 

*Next child only.
DisplayedNameMeaning
Next child = Next
just afterThe next child in the FTF immediately follows the first in the tree.
Next child = After
afterThe next child in the FTF follows the first, not necessarily immediately.
Next child = Just Before/After
just before or afterThe next child in the FTF immediately precedes or follows the first.
Next child = Before/After
before or afterThe next child in the FTF either precedes or follows the first.
Next child = Different Branches
different branches*The next child (node) in the FTF is on a different branch to the first.
Next child = unknown
<unknown>No restriction is imposed.

When is the ‘different branches’ option useful? Answer: if either of the sibling nodes are not immediately connected to their parent. Note that the first four ‘arrow’ links all relate to the ordering of child nodes in the tree, i.e., they imply that the pair of nodes share the same parent, irrespective of other links.

Different branches is not the same as <unknown>, which makes no restriction on how a node may be matched. 


FTF Edges

General node edges

There are four different unary links, or ‘edges’ in FTF nodes (‘Root’, ‘Leaf’, ‘First (child)’, and ‘Last (child)’), but one rule applies to all of them. They are drawn topographically, as if each were a link to a further node, absent in the FTF, that must be present in the tree. This becomes clearer when you look at the FTF as a whole (see below).

The general pattern is summarised below. When editing, the default for all links is <unknown>. (Note that the ‘Text Fragment’ query composes FTFs with different defaults: see below).

DisplayedNameMeaning
Edge = No
NoThere must be another node beyond this one.
Edge = Yes
YesThere cannot be another node beyond this one.
Edge = unknown
<unknown>No restriction is imposed.

Leaves

Leaf’ edges are a special case. The ‘leaf’ setting implies something else.

  • If a node is a leaf, it must be directly connected to the word that it annotates (in other words, it must be a ‘tag node’).
  • Conversely, if it is not a leaf, or it is not known whether it is a leaf or not, it will at least be eventually connected.

The implied relation is shown by a dotted line linking the node and text unit element in the FTF. There is thus no need for a separate word:node link.

DisplayedNameMeaning
Leaf = No
NoThere must be another node beyond this one. The word is only eventually connected.
Leaf = Yes
YesThere cannot be another node beyond this, and the word is directly connected.
Leaf = unknown
<unknown>No restriction is imposed. 

Word edges

The same general rule for node edges also applies to the ‘word edges’ (‘First word’ and ‘Last word’). These are drawn as triangles because there is no explicit linking structure in the depiction of sentences.

DisplayedNameMeaning
Last (or First) word = No
NoThere must be another word beyond this one.
Last (or First) word = Yes
YesThere cannot be another word beyond this.
Last (or First) word = unknown
<unknown>No restriction is imposed.

‘Inherit if conjoin’ edge

This option has only two values: whether to apply the rule or not.

DisplayedNameMeaning
Inherit if conjoin = No
NoDo not apply the rule.
Inherit if conjoin = Yes
YesApply the ‘inherit if conjoin’ rule.

For more information see The ‘Inherit if CJ’ Option.


Using Links and Edges

So far we have discussed links and edges individually. In order to understand how links and edges work, you should experiment with them together in ICECUP.

  • The ‘cool spot’ in the centre of the link or end of the edge line controls the link. Press down with either the left or right mouse button to rotate through the set of links.
  • Alternatively, you can use a popup menu. Select the node that governs the edge or link and press down with the right button in the grey area outside the node. This then lists the alternatives.

The following notes are meant to help you get started and explain certain aspects of the representation. There is a detailed description here on how FTFs match corpus trees and the impact of different collections of links. Further information is available in the online help manual, which is also part of the complete download package. However, you will need to experiment yourself.

Unspecified is the default for edges

The default status for edges is <unknown>, indicated by the white bars in the FTF below. In other words, the default is not to impose a restriction about the location of a node.

  • If you do a ‘New FTF’ and then press the ‘Insert child after’ key twice you will get an FTF structure like this.
  • You can add the node labels yourself (use the ‘Edit node’ command).

A simple example created with ‘New FTF’ and a few edit operations.

To see how FTFs like this match against the corpus, press here.

(Over-)specifying edges

You can edit edges in the tree, as we have done in the case below. What is the difference between the following FTF and the previous one? Well, this FTF will only match parts of trees in the corpus that:

  1. have nodes before and after the verb phrase (VP) (see the links at the far left of the figure),
  2. where the verb phrase is realised by only an auxiliary operator (OP, AUX) and a main verb (MVB, V), and
  3. these two nodes must be leaf nodes (this should be guaranteed by the grammar in this case).

The FTF will reject cases that matched the FTF above if any of these additional restrictions do not hold.

The same example with node edges marked.

This example also illustrates how the edge colouring system operates. The requirements that there is a parent for the VP node, and that there are nodes before and after the VP node, are marked by the presence of black links ‘to these nodes’. Likewise the absence of preceding or following nodes within the VP is indicated by the absence of black links.

Fuzzy Text Fragments

If you create an FTF using the ‘Text Fragment’ query, the nodes you introduce must ‘tag’ (i.e., be directly connected to) the words, and be ordered by the word sequence (this is indicated by the ‘Next word’ arrow on the right hand side).

The second point is that any nodes must be independent of the grammar, i.e., not connected together by any restricting ‘Next child’ or ‘Parent’ relation. Since all nodes have the root node as an ancestor, the shared parent node in the FTF (on the right) is linked as an ‘ancestor’ and marked as ‘Root’.

A tag search looking for similar structures.

Naturally, this is a weaker search than the first one (it is less specific). Apart from picking out any auxiliary-verb pair, regardless of grammar, it cannot relate to any other parts of the tree. The ‘Text Fragment’ command performs searches that are typical of a tagged, unparsed corpus. Finally, note that the focus (indicated by the yellow border) is spread across the tag nodes.

To see how FTFs like this match against the corpus, press here.

See also