How FTFs can be used to perform natural experiments with a parsed corpus like ICE-GB and DCPSE.
Part 1. The idea of a corpus experiment
Introduction
If you want to find examples of a grammatical construction in a parsed corpus such as ICE-GB or DCPSE you can perform an FTF query, a framework that is described on other pages on this site. Performing a query with ICECUP will produce a (potentially very long) list of results, consisting of a sequence of sentences or ‘text units’.
Within every matching sentence, an FTF will match in at least one distinct arrangement (a ‘hit’ or case). Note that if elements within the FTF are common or the structure is flexible, there can be many hits per sentence.
Here we address the following question:
- How should researchers employ grammatical queries to carry out experiments on a parsed corpus?
This question is perfectly valid. Provided that the corpus is collected systematically, and annotated consistently and in good faith, there is no particular reason why an experimental approach cannot be applied to a corpus, even a parsed one. For some comments on the philosophical implications of this, see here. (Note, for the avoidance of doubt, that the term experiment on these pages means a natural experiment rather than a laboratory experiment; see below.)
The question is also very important. It is one thing to find examples of a particular construction in a corpus, and quite another to make any generalisations about the presence of such constructions in contemporary British English or in English in general. Moreover, examples merely indicate the existence of possible phenomena, they do not explain under which circumstances they appear. The latter requires both an experimental method and a clear theoretical defence.
ICECUP 3.1 supports the construction of some simple tables and the collection of frequency statistics. However in many of the examples described here you will have to perform the process of extracting data from the corpus by hand. (We believe that automating many of the procedures would be highly advantageous, for a number of reasons, and this is the subject of the Next Generation Tools project.)
A note of caution: these pages are written for linguists (including PhD and MA students) wishing to experimentally investigate parsed corpora. The explanations necessarily touch on technical issues in statistics. This is not meant to be an introduction to the chi-square test, or the assumptions and ramifications of the test. If you are unfamiliar with chi-square, you can look at other pages on the web or a good experimentalist workbook.
This page, Part 1, makes some introductory comments about experimental design. Part 2 summarises the general method and common pitfalls you should be aware of if you use FTFs in experimental studies. Part 3 introduces effect sizes. Parts 4 and 5 discuss the more complicated problem of examining how two grammatical aspects of a phenomenon interact. (NB. You should read these pages in order.)
Many of these questions are, in fact, central to the investigation of any corpus, including plain text and tagged corpora. The problems we discuss simply become more acute with a parsed corpus and rapid retrieval software like ICECUP. As we shall see, it also becomes easier to identify a hierarchy of linguistic choices when a detailed grammatical analysis is present. Wallis (2021) is a comprehensive monograph covering these issues in much more detail.
It is easy to collect tables of frequencies. It is less easy to design experiments to make sense of the numbers.
Statistical health warning
Performing an experiment is not a substitute for thinking about the research problem.
You need to (a) get the research design right and (b) relate the results back to a linguistic theory. These pages are meant to help you get the design right.
What is a scientific experiment?
Answer: It is the evaluation of a hypothesis against data.
A hypothesis is a statement that may be believed to be true but is not verified, e.g.,
- smoking is good for you
- dropped objects accelerate toward the ground at 9.8 metres per second squared
- ’s is a clitic rather than a word
- the word whom is used less in speech than writing
- the degree of preference for objective pronoun whom rather than who differs in contemporary spoken and written British English
In each case, the problem for the researcher is to devise an experiment that allows us to decide whether evidence drawn from the real world supports or contradicts the hypothesis.
Compare examples 4 and 5 above. If a statement is very general, there are two problems:
- it is difficult to collect evidence to test the hypothesis and
- any evidence of a relationship might also support other explanations.
The process of experimental design refinement is one where we attempt to ‘pin down’ a general hypothesis (which may be called a ‘research question’) to a series of more specific hypotheses that are both more precise and more easily testable. The art of experimental design is to collect data appropriate to the research hypothesis. We will discuss how to collect data from a corpus below.
In brief, a simple experiment consists of
- a dependent variable, which may or may not depend on
- an independent variable, which varies over the normal course of events.
As a convenient shorthand, we will also refer to the independent variable as the ‘IV’ and the dependent variable as the ‘DV’.
(By the way, it is also possible to have experiments with more than one possibly contributing factor (IV) or more than one kind of outcome (DV) but it is always better to keep it simple. In any case, such ‘experiments’ are carried out as a series of simpler experiments.)
Thus in example 5, the independent variable might be ‘text category’ (e.g., spoken or written), and the dependent variable, the choice of whom out of linguistically plausible cases, i.e., when either who or whom could be used in the text. Moreover, since pronouns in subject position should not be whom, limiting the context to objective pronouns further improves the experimental design.
A statistical test lets us measure the strength of the correlation between the dependent and independent variables.
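As a concrete sketch, the 2 × 2 contingency (chi-square) test can be computed in a few lines of Python. The counts below are invented purely for illustration; they are not real ICE-GB figures.

```python
# Hypothetical who/whom counts in two text categories (invented for
# illustration; not real ICE-GB figures).
observed = {
    ("spoken", "who"): 60, ("spoken", "whom"): 10,
    ("written", "who"): 40, ("written", "whom"): 30,
}

rows = ["spoken", "written"]   # independent variable (IV): text category
cols = ["who", "whom"]         # dependent variable (DV): the choice made
row_tot = {r: sum(observed[(r, c)] for c in cols) for r in rows}
col_tot = {c: sum(observed[(r, c)] for r in rows) for c in cols}
n = sum(observed.values())

# Expected cell frequency: each row's share of the column totals.
expected = {(r, c): row_tot[r] * col_tot[c] / n for r in rows for c in cols}

# Pearson's chi-square statistic.
chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)

# With 1 degree of freedom, the 5% critical value is 3.84.
significant = chi2 > 3.84
```

With these invented counts the statistic is 14.0, comfortably above the 3.84 threshold, so the null hypothesis of no association would be rejected.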
Some people say that we shouldn’t call research with a corpus ‘an experiment.’
- When performing research with a corpus, we usually sample data from a corpus that has already been sampled by other people, the participants were free to engage in a wide range of activities, and data collection is not focused on our research aims.
- By contrast, a conventional ‘laboratory’ experiment involves researchers collecting data themselves, often under controlled conditions in a focused manner. A researcher may manipulate participants, set tasks, or provide cues to elicit particular responses.
For this reason corpus research may be termed a ‘natural experiment.’ But natural experiments are a type of experiment.
- Instead of asking participants to perform a particular task, we observe participants performing those tasks and many others in their everyday lives. We then select from that data appropriately and perform post hoc analysis of this data.
Our key point is that the same principles of scientific rigour should be applied to both types of experiment.
Does a significant result mean that we have proved our hypothesis?
Strictly, no. There is always the possibility that something else is going on. A correlation does not indicate a cause.
Example
Consider the following.
- Imagine that we found that, taken across a population, the height of children (A) and their educational performance (B) correlated.

- But growing taller does not increase one’s thirst for knowledge, nor one’s ability to pass exams!
- The reverse implication may also be true, i.e., that knowledge and education tends to improve diet and well-being.
- Other root causes (e.g. distribution of wealth, C, or different historical period, D), might be said to contribute instrumentally to both height and education simultaneously.

To deal with this ‘root cause’ problem, one experimental technique is to eliminate any such possible cause by obtaining a sample of similar wealth, or from the same time period, and performing the experiment again. However, the result then applies specifically to the population from which the sample was taken. This method is what is meant by experimental ‘reductionism’ (not to be confused with philosophical reductionism).
Note what we say above about ‘laboratory’ experiments and ‘natural’ experiments. In a lab experiment researchers can (to some degree) control experimental conditions in order to limit the effect of alternative hypotheses in advance. But the disadvantages with lab experiments are that they may be
- artificially constrained (e.g. reading from a screen with the lights off with the participant’s head anchored to apparatus), or
- unrepresentative (e.g. data based on a particular collaborative task).
Corpus studies, by contrast, have the potential to overcome some of these problems.
In our case, we must use a linguistic argument to try to establish a connection between any two correlating variables.
What about the converse? Does a non-significant result disprove the hypothesis?
The conventional (Popperian) language used to describe an experiment is couched in double negatives. We refer to a ‘null hypothesis’, which is the opposite of the hypothesis that we are interested in. If a test does not find sufficient variation, we say that we cannot reject the null hypothesis.
This is not the same as saying that the original hypothesis is wrong, rather that the data does not let us give up on the default assumption that nothing is going on.
- More comments on the philosophy of corpus experimentation may be found on our FTF FAQ page.
So, what use is an experiment?
Experiments allow us to advance a position. If different bits of evidence point to the same general conclusion, we may be on the right track. This is termed ‘triangulation’.
Designing an experiment
An experiment consists of at least two variables: a dependent and an independent variable. The experimental hypothesis, which is really a summary of the experimental design, is couched in terms of these variables.
Suppose we return to our example 5. Our dependent variable is the usage of objective pronoun whom versus the usage of who, and our independent variable is the text category. Suppose that we take data from ICE-GB, although we could equally take data from other corpora containing spoken or written samples. (NB. The size of the two samples need not be equal.)
Our experimental hypothesis is now a more specific version than our previous one, i.e.,
- the rate of the choice of objective pronoun whom versus who varies in usage between spoken and written British English sampled in a directly comparable way to ICE-GB categories.
Absolute or relative?
Note that we are not suggesting looking at the absolute frequency of who or whom, e.g., the number of cases per 1,000 words.
Rather, we should examine the relative frequency of whom when the choice arises.
- Suppose someone tells you that train journeys are becoming safer. Between 2010 and 2020, they say, the number of accidents on the railways fell by ten percent.
- But what if the number of journeys halved over the same time period? Should you believe their argument?

- The relative risk of injury has increased, not fallen. The later risk per journey is 90/50 × 100% = 180% of the earlier risk, i.e. an increase of 80%.
- An absolute frequency tells you how frequent a word is in the corpus. But the reason that a word is there in the first place might depend on many factors that are irrelevant to the experimental hypothesis.
- Using relative frequencies focuses in on variation where there is a choice. The bad news is that you may need to check the examples in the corpus to see if there really is a choice in each case. If you use other annotation, such as tagging or parsing, to classify your cases, you may need to double-check this. (So, in case you were wondering, we haven’t given up on the naturalistic study of the corpus just yet.)
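The train-safety illustration above can be checked with a few lines of arithmetic. The journey and accident counts are invented; only the ratios matter.

```python
# Invented figures matching the illustration: the accident count falls
# by ten percent while the number of journeys halves.
accidents_2010, accidents_2020 = 100, 90
journeys_2010, journeys_2020 = 1_000_000, 500_000

risk_2010 = accidents_2010 / journeys_2010   # absolute risk per journey
risk_2020 = accidents_2020 / journeys_2020

# The absolute accident count fell ...
count_fell = accidents_2020 < accidents_2010
# ... but the risk per journey rose to 1.8 times its earlier value.
risk_ratio = risk_2020 / risk_2010
```

The absolute frequency (accidents) and the relative frequency (accidents per journey) move in opposite directions, which is exactly why the choice of baseline matters.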
If you are examining a parsed corpus like ICE-GB and are trying to investigate explicitly represented grammatical phenomena (i.e., the queries can be represented as an FTF or series of FTFs), then constructing a set of choices need not be especially difficult (see the issue of enumeration). If you have a large number of cases, you can rely on the parsing, provided that there is not a systematic bias in the annotation (i.e., either the corpus has been hand-corrected or uncertain choices have been resolved at random). Annotation errors that are systematic represent a bias, those that are random are noise.
In our example in Part 2, the use of relative frequencies means that the expected distribution is calculated by scaling the total “who+whom” column. If we calculated it on the basis of absolute frequencies, the expected distribution would be merely proportional to sample size, and the result would be much more easily skewed by sampling differences between subcorpora.
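To make the contrast concrete, here is a sketch with invented counts showing the two possible bases for an expected whom frequency in a spoken subcorpus: scaling the who+whom totals versus scaling raw subcorpus word counts.

```python
# Invented counts, purely for illustration.
spoken  = {"who": 60, "whom": 10}    # who+whom total: 70
written = {"who": 40, "whom": 30}    # who+whom total: 70
spoken_words, written_words = 120_000, 80_000   # subcorpus sizes in words

total_whom = spoken["whom"] + written["whom"]

# Relative basis: the pooled whom share of who+whom cases, scaled by the
# spoken who+whom total.
grand = sum(spoken.values()) + sum(written.values())
expected_relative = sum(spoken.values()) * total_whom / grand

# Absolute basis: merely proportional to subcorpus size in words, so it
# reflects how wordy the subcorpus is, not how often the choice arises.
expected_absolute = spoken_words * total_whom / (spoken_words + written_words)
```

Here the relative basis expects 20 whom cases in the spoken subcorpus, while the absolute basis expects 24, simply because the spoken subcorpus happens to contain more words.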
The benefits of a choice-based method
A choice-based method addresses one of the common criticisms of corpus linguistics, namely that it consists of atheoretical frequency counting.
Focusing on an individual choice factors out variation in the probability of the choice occurring in the first place.
- If we are interested in who vs. whom, we focus on the objective position where there is a choice. Subjective cases, which are inevitably who unless a production error occurs, are distracting noise, as are other pronouns, other NP heads, or indeed, other words.
- Likewise, in examining modal will vs. shall, we are interested in verb phrases which express futurity or prediction.
Focusing on a genuine choice, termed onomasiology, improves reproducibility, although you should always aim for a representative sample of cases!
From a mathematical point of view there are several benefits.
- The information content of your research design doubles. Instead of sampling an event frequency, f, you sample an event frequency and a meaningful baseline frequency, n, to obtain a meaningful proportion, p = f / n.
- This proportion should be free to vary, i.e. p can be any number from 0 to 100% (i.e. f can range from 0 to n).
- This principle underpins many statistical methods, including contingency tests, confidence intervals and logistic regression.
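One such confidence interval is the Wilson score interval for a proportion p = f / n, which respects the 0–100% bounds mentioned above. This is a generic sketch, not code from ICECUP, and the counts in the example call are invented.

```python
import math

def wilson_interval(f, n, z=1.96):
    """Wilson score interval for a proportion p = f / n (default 95%)."""
    p = f / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# e.g. 10 whom out of 70 who+whom cases (invented counts)
lo, hi = wilson_interval(10, 70)
```

Unlike the simpler ‘Wald’ interval, the Wilson interval never strays outside [0, 1], which matters precisely because p is a bounded proportion rather than a free-floating frequency.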
See also
- That vexed problem of choice and Chapter 3, Wallis (2021).
A methodological continuum
Our methodological position is located somewhere between two polar opposites: exemplar-based introspection and the reporting of absolute frequencies in the absence of a context.
From our point of view, the upper bound of an acceptable method is probably illustrated by the clause experiment (does the mood of a clause affect the transitivity of a clause?). Cases in this experiment (clauses) occur in a variety of distinct linguistic contexts where the factors affecting the correlation of the two variables (mood and transitivity) will also vary from situation to situation. Such an experiment is really too general to form any definite conclusions.
The lower limit of our empirical approach is where we have a single linguistic choice and sufficient data to permit a statistical test. We will see this kind of experiment later. Note that the issue of specificity (of the linguistic context) need not necessarily be defined by only grammatical criteria.
- As we shall see, we can explore a set of experimental hypotheses by working from the top down, or from the bottom up.

This kind of corpus research is based on counting tokens rather than types. That is, our evidence is frequency data from a natural sample, each case counted irrespective of whether it is unique. This experimental design is distinct from studies of dictionary or lexicon data concerned with the frequency of unique types.
The sample chosen for an experiment determines the theoretical ‘population’ you can generalise a significant result to. A significant result in a corpus can be generalised to a comparably selected population of sentences. A significant result in a lexicon study (where corpus frequency data is not taken into account) generalises to a comparably selected population of lexicon entries. If the lexicon is derived from a corpus, then this allows us to predict that, were a lexicon to be compiled using the same process from a comparably sampled corpus, the results would be likely to be repeated. If the lexicon is derived from dictionaries, then the dictionary compilers’ biases must be taken into account.
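The token/type distinction above can be illustrated in a couple of lines (the sample sentence is invented):

```python
from collections import Counter

sample = "the man who the man saw".split()
tokens = len(sample)        # every occurrence counts: 6
types = len(set(sample))    # unique forms only: 4
freq = Counter(sample)      # token frequency per type, e.g. freq["the"] == 2
```

Corpus frequency evidence of the kind discussed here counts tokens; a lexicon study of the same sentence would count each of the four types once.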
Experiments with a parsed corpus
The issue of relating two or more grammatical elements together becomes central when we consider the question of performing experiments on a parsed corpus.
- It is easier to be more precise when establishing the grammatical typology of a lexical item, saying “retrieve this item in this grammatical context”. We can also vary the precision of our definitions by adding or removing features, edges, links, etc.
- It is easier to precisely relate two items, e.g., “the and man are both part of the same noun phrase”.
Against this perspective is the objection that an experiment must necessarily be in the context of a particular set of assumptions, i.e., a specific grammar.
This does not rule out the possibility of scientific experiments, however. Rather it simply means that we must qualify our results – “NP, according to this grammar,” etc. We discuss this issue at greater length here. In a parallel-parsed corpus one could study the interaction between two different sets of analyses (although in practice the workload would be very high unless more of the task could be automated).
In ICECUP, we use Fuzzy Tree Fragments to establish the relationship between two or more elements. Moreover, we can study how grammatical features, elements and constructions interact. This is discussed in some detail on another page.
We turn first to a slightly simpler problem: investigating whether a sociolinguistic variable affects a grammatical one.