XClose

UCL English

Home
Menu

Fuzzy Tree Fragments: Matching

How FTFs match corpus trees.

Introduction

In order to understand FTFs you need to understand how their components work together. How does a program like ICECUP decide that an FTF matches (part of) a tree in the corpus?

FTFs are declarative, in other words: all aspects of the FTF must be true together, and the order in which elements are evaluated is not important.

All the examples on this page are based on the performance of the latest version of ICECUP, although other programs may use the same principles. For democratic reasons of accessibility, all examples of matching cases are taken from our freely-available sample corpus download (which is itself a sample of ICE-GB).

First we discuss single element FTFs, then we turn to FTFs with two 'children' and a single 'parent' node. Finally we consider what happens when we relax the parent-child relationship, replacing it with an eventual ‘ancestor’ one.


Single Elements

We will start with some simple examples. In this section we will consider examples of FTFs consisting of a single node and a single word.

Single node FTFs

Our first examples are single node FTFs.

  • You can generate a single node FTF in ICECUP using the ‘(inexact) Nodal’ query, typing the expression, e.g., “OD,CL” and hitting the ‘Edit’ button.

The FTF you get should look like the example below.

Example single node FTF

Next, find an example of a matching case.

  • If you then press the key ‘F4’ or the ‘Start!’ button, you will get a complete list of examples of (in this case) direct object clauses.
  • Then double-click on an example to open a tree window.

The resulting tree should look something like the one below (this is S1A-010 #149). The matching case is the node in brown, and the part of the text dominated by the node is shaded. This is because the node also has the ‘focus’ of the FTF.

Example matching tree for the single node FTF

The FTF contains a number of unspecified edges apart from the ‘OD,CL’ designation. But because these are unspecified, they do not limit the position of the matching case. So this query will match direct objects in the last position in the branch, or in other positions, e.g., as in “I think you will agree because... they were dumbfounded” [S1A-094 #52], where “because... they were dumbfounded” is analysed as an adverbial clause.

Notice that the FTF explicitly contains a ‘word’ element, which is unspecified. If we did specify the word, the FTF would only match examples that contained that word (note that in this FTF the position of the word would not be specified within a set of covered words).

Finally, note that the FTF can match more than once within a single corpus tree.

Single word FTFs

In the case of single word FTFs, that is, an FTF that searches for a single text unit element, we must specify aspects of the (unspecified) node.

  • Using the ‘Text Fragment’ command, type the word “work” and press the ‘Edit’ button.

Example single word FTF

The result should be the FTF shown. If you then perform the query, you will get examples like this one (W1B-001 #179).

Matching corpus tree for the single word FTF

The empty node still has the focus but is specified as a leaf. There is no white ‘stub’ between node and word, and there is a black dotted line, meaning that word and node are immediately connected. The node must ‘tag the word’. No other edges have been specified.

Tagged word FTFs

The previous FTF finds examples of “work” as a noun or (more rarely) as a verb, as in “We will have to work very hard.” [W2C-009 #44]. If you want to find examples of “work” as a verb, you can also use the ‘Text Fragment’ command.

  • In the ‘Text Fragment’ window, type the word “work”.
  • Then, without pressing <SPACE>, press the ‘Node’ button.
  • Position the input caret (blinking cursor) between the angled brackets and type “V”.
  • The query should look like this: “work+<V>”.
  • Then press the ‘Edit’ button.

Example tagged single-word FTF

If you have done this successfully, the FTF will look like the one shown here. In fact, the only difference from the previous example is that the node, still in the ‘tag position’, has the category ‘V’, meaning “verb”. An example match is shown below.

Matching corpus tree for the tagged single-word FTF

This FTF query excludes cases of work as a noun. Thanks to the extended index system of ICECUP 3.1, where words and their tag nodes are listed in full, the search is near-instantaneous. This ‘word+tag’ index also powers the Lexicon tool.

In the examples that follow, ICECUP cannot rely on a pre-defined index. It calculates the set of candidate text units to examine and then checks the relationship of FTF elements, one-by-one. It does this in the background, so you can carry on using the software while it does so.


Simple FTFs (parent)

So far we have looked at single node FTFs. Thanks to the detail of the ICE grammar – and, in particular, the large number of features – you can do a lot with a single node query in ICE-GB. However, sooner or later you will want to perform a query that involves more than one node. In the first set of examples, we will look at cases where the parent node in the FTF must match the parent node in the tree. We will then look at some more complex examples using the ‘ancestor’ option.

A tightly bound three-node FTF

We could start with a two-node (parent and child) fragment, but this is rather limited and anyway, it is similar to the previous ‘tagged word’ example. Instead, we turn to a more obviously tree-like example consisting of three nodes.

  • You can create this FTF in more than one way. One of the simplest methods is to press ‘New FTF’ to create an empty node with the focus (the one on the left), and then to press 'Insert child after' twice. This inserts two child nodes side-by-side.
  • Next, move to the first child node and press ‘F2’ or the ‘Edit node’ button.
  • Select “noun phrase head” and “noun” for the function and category respectively.
  • Repeat for the second child node, selecting “clause” for the category. Leave the parent node unspecified.

A simple three-node FTF

The result should look like the FTF below, left. If you need more help, look at the help manual (included in the complete sample download) under “Editing FTFs”.

This FTF specifies that, thanks to the black arrow (‘Next (child) = immediately after’), the two sibling nodes must follow one another immediately in the sequence (the ‘skip over’ search option may affect this). Secondly, the FTF specifies that the upper node in the FTF that acts as a parent for the other two, is indeed, specified to match only the parent in the tree (in the case below, “[She] she’s a joy to listen to”, the subject complement NP).

A tree that is matched by the simple FTF above

The way that FTFs like this match the corpus is relatively easy to anticipate. If nodes are adjacent and in a particular order in the FTF, they will be adjacent and in that order in the tree. However, sometimes we are interested in weakening these restrictions.

An FTF with an ‘eventually after’ node

Suppose we remove the requirement for the clause to immediately follow the NP head node. 

  • Change the link to ‘after’ (the white arrow) by pressing down with the right mouse button over the blue cool spot in the middle of the arrow.

A simple three-node FTF with a Next child = 'after' link

The result should look like the FTF shown here. Performing the search again will find all the previous cases plus some new ones. Thus, in the example shown below (W1A-001 #15, with branches closed for clarity), there are two matches.

A matching corpus text with multiple cases matching the tree
  1. The matching case on the left, or upper position (i.e., under the subject NP node), contains within it a postmodifying prepositional phrase (“of events beginning...”) followed, eventually, by a (postmodifying) clause.
  2. The case on the right, within the first, “events beginning in the late fourth century”, also matches the FTF. Here the clause and head nodes are adjacent.
  • In passing, note that this example illustrates an interesting question regarding sampling that we cannot go into here – can we say that these two cases are entirely independent? See the note below.

An FTF with an unordered relation

What if we search for examples but do not specify the order of nodes under the parent (i.e., use a bi-directional arrow)? You may find that it does not appear to make much difference: the grammatical terms you introduce are invariably in a particular order in the tree (or are extremely rare otherwise). If you substitute a ‘before or after’ arrow for the ‘after’ arrow in the previous example and search the sample corpus, there won’t be any additional cases. This is because in the ICE grammar, NP structure is highly ordered.

Not all structures are so regular, and the ability to specify either order can be useful in certain circumstances. Employing an unordered link can also cause problems. To illustrate this, we will experiment with conjoined NPs.

  • Create a ‘New FTF’ and add 'two child nodes after' as before.
  • This time, label the first node’s function a “conjoin“ and its category a ”noun phrase” and label the second node with “coordinator“ and “conjunction” respectively.
  • Then, set the linking arrow to ‘just before or after’.

A three-node FTF with an unordered sibling relationship

The resulting FTF is given above. Note that when you specify that the link between two nodes is in either order, the two sibling nodes gain additional ‘edge’ options (for the first, ‘Last child’, and the second, ‘First child’). With ordered links (see above), FTFs can dispense with these, because the link guarantees that there must be a node after the first one and vice-versa.

Corpus tree with multiple matches with the FTF above

This FTF will find examples of coordinated NP conjoins regardless of order. In the case on the right, it matches twice because there are two NPs and thus two distinct legal matching arrangements.

  1. The highlighted example (bright purple) is in the same ordering arrangement as the FTF.
  2. The other example is slightly hidden here, due to overlapping, but matches in the other order. It shares the same coordinating node in the middle and the same parent NP node.

If you replace the ‘just before or just after’ (black arrow) with the (white) ‘before or after’ arrow you will get even more combinations, particularly in cases of coordinated triples (“x or y or z”) etc. This kind of FTF may be useful for exploration, but should be avoided for experimentation. You will find that you need to make queries like this more specific in order to perform quantitative methods using their results. See the brief note below on counting combinations and cases.


Loose FTFs (ancestor)

The previous examples have one thing in common: the location of the parent node is specified. But sometimes it is necessary (or advantageous) to specify that a parent in an FTF is not immediately related to a child. One possible reason might be to allow less strictly “tree-like” structures to be built, such as sequences of tags and words. In the following examples we look at this in more detail.

A loose FTF with siblings

Suppose we look for clauses which dominate ordered ‘auxiliary, verb’ sequences:. An appropriate FTF is given below, left.

  • Again, create a ‘New FTF’ and add 'two child nodes after'.
  • Label the first node an “operator“ (function), ”auxiliary” (category) and the second node as “main verb“ and “verb” respectively.
  • Then, set both of the parent links to ‘ancestor’ (click on the cool spot on the link).

A loose FTF with genuine siblings

The FTF obviously matches the following tree (S1A-010 #149) twice. However, there is also a third, not so obvious match.

This tree matches the FTF above three times
  1. The highlighted match is the entire clause “don’t know what you’re doing”, where “don’t know” is the ‘auxiliary, verb’ pair.
  2. The second match is the direct object clause “what you’re doing”, with “[a]re doing” as the ‘auxiliary, verb’ pair.
  3. The third, and less obvious, match is the entire clause “I don’t know what you’re doing”, with the auxiliary and verb being represented by “[a]re doing”, again.

What if you just want to match cases where the nearest clause to the pair is found?

  • Here you have to apply observation and a little grammatical knowledge. Since ICE is a complete, rather than skeleton grammar, there will always be an intermediate (VP) node between the clause and the verbal elements. You can therefore introduce this intermediate node into the FTF and then insist that all parent-child relationships are direct ‘parent’ links.

Another possibility is to try to restrict how the clause node matches in other ways.

For example, if you were just interested in a list of different verb and auxiliary pairs which were within a clause, you could require that the clause matched the root (‘Root: yes’ in which case it would exclude match (2) above). This would also exclude cases where the root node was not a clause, however. (In case you were wondering, FTFs do not offer the option of a ‘nearest ancestor’ link because such a link is by definition procedural and FTFs are declarative.)

In this example, the FTF’s ‘child nodes’ are genuine siblings in the tree, i.e., they share the same parent. This restriction is entailed, not by the ‘Parent’ link, but by the status of the ‘Next child’ link. ‘Immediately after’ means “immediately after in the sequence of siblings in the tree,” and therefore implies that the nodes share the same parent. Note that this property is shared by three other values of ‘Next child’: ‘after’, ‘just before or just after’ and ‘before or after’. (This property is implied by the ‘stem’ of the arrow).

If you want to allow siblings to match tree nodes regardless of their parenthood, you would have to use a different ‘Next child’ option. Thus, if two nodes are connected together with ‘Next child = <unknown>’ (and the ‘ancestor’ parent link is employed) then no restriction is placed on their relative position. However, this situation can be too weak in many circumstances. A more desirable constraint would be to state that the two nodes must be on different branches of the tree.

FTFs with nodes on different branches

The restriction that two nodes must be on different branches can be rephrased simply, as meaning that one node cannot be the parent of the other. The nodes matching each sibling cannot share a path to the node matching their common parent. This option is more general than ‘before or after’, because it does not require a common parent, and is drawn like the white double arrow link without the common stem.

The FTF below left looks for examples of clauses containing a NP acting as a direct object (note that this is directly linked to the clause) and, somewhere within the clause – but not within the direct object – a noun phrase head.

  • Create a three-node FTF in the normal way, i.e., with a ‘New FTF’ command and 'two child nodes after'.
  • Label the nodes as shown using the ‘Edit node’ command (F2).
  • Next, click on the cool spot for the ‘Parent’ link for the noun phrase head node and then set the ‘Next child’ relation to ‘different branches’ either by clicking down on the cool spot for ‘Next child’ several times or invoking the pop-up menu and setting the value.

Moreover, we can insist that the NP head must follow the direct object in the textual sequence by introducing the ‘Next word’ link. (This works because (a) the ICE grammar is a phrase structure grammar, which denies the possibility of crossing links, and (b) the ‘Next word’ link is interpreted to mean that there is a word under the first node that precedes a word under the second.)

  • Finally, rotate the ‘Next word’ link until it reads ‘after’ (white arrow).

Loose three-node FTF with ‘different branches’ relationship

You should get quite a lot of matches. The tree below, right (S2B-002 #36), contains several examples.

Corpus tree matching the FTF above with several matching combinations

There are three distinct matches in this tree.

  1. The first matches the subject clause “What that has meant is...”, where the direct object is realised by “what” and the noun phrase head “that” is in subject position.
  2. The second matches the subject complement clause node “...is that we had to reduce staff <,> from thirty-two to fourteen” and the direct object is realised by “staff”. The isolated noun phrase head element is part of an adverbial prepositional phrase “from thirty-two”.
  3. The third match is identical to the second, save the position of the noun phrase head, which is in the other prepositional phrase, “to fourteen”.

As we discuss below, you should be careful using these ‘loose’ links when you are formalising your experimental design so as to minimise the number of multiple overlapping instances.

We recommend that you experiment with structural variations on this theme using ICECUP. Try each of the following in turn, resetting the link after the experiment.

  • What happens if ‘Next child = different branches’ is set to <unknown>?

    You get many more matching cases, including those where the noun phrase head is the head of the direct object NP. The ‘Next word’ restriction means that there must still be a node prior to the head within the NP (a determiner, for example).

  • What happens without the word order restriction, i.e., ‘Next word = <unknown>’?

    You get additional cases with NP heads prior to the direct object.

  • What happens if we weaken the restriction that the clause is the parent of the direct object?

    You obtain many more cases per tree, and eventually, the “out of memory” error. This is because the number of distinct matching arrangements can increase combinatorially.

    The following (quite mild) example illustrates the principle. The first three highlighted locations, (reading left to right) match the clause element (as the clause can be any distance “above” the direct object “your S” which is hidden to the right). The two rightmost locations match the NP head element. Since all three locations of clause are legitimate for both positions of the NP head, there are six matches in total. Now suppose there are more than one direct object node. This is called “underspecifying” your search in the help manual.

A runaway series of matches in one tree, all matching a version of the FTF above

The problem can be avoided by restricting the location of nodes in various ways (as we did in our example). You should link elements together immediately if at all possible, even if this means introducing new nodes. You should avoid introducing loosely connected nodes which are very generally specified (clauses are common, “empty” unspecified nodes will match anything). The following advice is reproduced from the help manual.

A general solution to the problem of underspecification

  1. Eliminate all unnecessary empty nodes in the FTF, apart from where they preserve tree structure. Restrict nodes by introducing grammatical terms or text unit elements, but only where appropriate.
  2. If you must have an empty node in your FTF, try to connect it directly to another, non-empty element, or to the root of the tree. You can insert an empty node safely if it is intimately bound to another node.
  3. Failing that, specify the edge position of the node.

None of the above necessarily means that you should always avoid the ‘different branches’ or <unknown> options, or stick to using the ‘immediate parent’ link. If you want to express a query consisting of two tightly-bound fragments that are connected together only loosely, the ability to specify that neither is above the other can be very useful. It is just a good idea to be sure that neither of the fragments are over-general.

Text fragments

One situation where ‘Next child’ is routinely set to <unknown> is when you want to specify a text fragment. The idea is that all nodes which might have words associated with them are specified in the tag, or leaf position, and, if the query will match more than one word or tag in a sequence, the set is grouped together by ancestor links under a common node set to ‘Root = yes’. (We considered single word FTFs and tagged-word FTFs in the first section.) The sequence itself is specified by ‘Next word’ relations, which ignore tree structure.

It can be useful to consider the query as a ‘comb’, or ‘hedge’, instead of a ‘tree’. Structure may be added by moving up the tree from the leaves toward the root.

The following is a simple example of a two-element text fragment which finds examples of “this” followed by a verb, as in “This is too salty.” [S1A-010 #86].

  • In the ‘Text Fragment’ window, type the word “this”.
  • Then, press <SPACE> and hit the ‘Node’ button.
  • Position the input caret (blinking cursor) between the angled brackets and type “V”.
  • The query should look like this: “this <V>”.
  • Then press the ‘Edit’ button.

FTF for text fragment ‘this V’

The FTF should look like this. You will see that the upper node is specified as the root of the tree (matching the ‘PU,CL’ element on the right), while the nodes for “this” and the verb are specified as leaves. ‘Parent’ links are set to ‘ancestor’ and the ‘Next child’ link is, as we suggested, <unknown>. We do not wish to restrict the query grammatically in this case (we might subsequently choose to do so, but that is another matter). Finally, the ‘immediately after’ arrow indicates that the verb must immediately follow the word “this” in the text sequence.

Corpus tree matching the text fragment ‘this V’

The FTF matches a series of examples, including the one shown here. Although there is considerable ambiguity introduced on the tree side – empty nodes, ancestor links, unspecified ‘Next child’ relations – the query is not underspecified (see above). For one thing, lexical items tend to be more specific than a simple node specification.

NB. In the complete ICE-GB corpus there are, including capitalisation and spelling variants, over 46,000 distinct lexical items (and over 63,000 word+tag tokens). But there are ‘only’ around 7,500 distinct nodal patterns (distinct combinations of function, category and features).

Moreover, the nodes are bound to specific positions in the tree (root, leaf) and the ‘immediately after’ link is employed. As a result, the leaf nodes are related to each other via the sentence. This dramatically reduces the ambiguity of the FTF.

Setting up an FTF like this from scratch using the FTF editor is quite difficult, and it is easy to make mistakes (typically, forgetting to specify the ‘Root’ or ‘Leaf’ positions, see the help manual). The “Text fragment” query window constructs queries like this very easily. You can then modify the query, for example, by laddering up, but note that if you add elements you will need to set links appropriately. The help manual contains a worked example of this.


Counting Combinations and Cases

ICECUP is designed to allow researchers to explore a parsed corpus. Fuzzy Tree Fragments are possible arrangements of nodes and words/wildcards which allow this to happen. 

But this flexibility that allows an FTF to match a corpus tree structure in a variety of ways has a downside when it comes to counting cases for research purposes.

ICECUP highlights all of these matches so we can see the issues:

  • Cases can be found on independent branches of the same tree. These might be structurally independent or they might be associated, for example if they are conjoined.
  • Cases are embedded within each other. A clause might be embedded within another clause, and a matching FTF can match each clause.
  • Cases may overlap each other, i.e. one matching pattern might share nodes with another.

Apart from the increased difficulty of separating these matches visually – ICECUP lets you distinguish each match by ‘tracking through’ a concordancing display (see the help manual) – this also has a number of important implications for the employment of statistics.

In brief, standard statistical methods assume the independence of samples. But it would be hard indeed to argue that two overlapping matches were independent from one another! Even where two matches are nested within one another at some distance you might argue that they are not independent. Nelson, Wallis and Aarts (2002: 272) discuss this problem of case interdependence through a series of examples. Finally, Wallis (2021) has an entire chapter discussing a mathematical solution to the problem of ‘random-text sampling’, the wider issue in corpus linguistics that any two cases drawn from the same source (sub)text are not truly independent.

What can you do?

First, be careful when counting cases! Examine cases where more than one match is found in the same tree.

In particular, avoid treating overlapping cases as if they were truly independent. Does your object of interest, which might be one or two nodes, but not the whole of the FTF, overlap with another? If so, the cases are not independent! Here it would be wise to count the case as if it were merged into a single one.

Then consider the following steps.

  1. Refine queries. You should rewrite queries to avoid unordered links. In some cases you may need to manually review cases and subtract repeated matches.
  2. Subsample. In high frequency cases you could apply random sampling to ‘thin out’ the sample by exporting data to Excel and taking a random subsample. (But a random sample of the results of a loose query is still loose. It is usual preferable to clarify your query.)
  3. Correct for text-clustering. Where you suspect data is clustered in a smaller number of texts, rather than randomly distributed across the corpus, if you are performing a scientific study, you should consider Wallis’s random-text sampling adjustment or multi-level modelling in analysis.

See also


References

Nelson, G., Wallis, S.A. and Aarts, B. (2002). Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: Benjamins.

Wallis, S.A. (2021). Statistics in Corpus Linguistics Research: A new approach. New York: Routledge. » order