How FTFs can be used to perform natural experiments with a parsed corpus like ICE-GB and DCPSE.
Part 2. Predicting a grammatical variable from a sociolinguistic one
Constructing a contingency table
Consider the task of finding out whether a sociolinguistic variable (text category) affects the grammatical ‘choice’ of using whom rather than who.
We will first outline the basic approach. Recall that an experiment tests the hypothesis that the IV affects the DV (the independent variable affects the dependent one).
- We construct a contingency table as follows, filling in the cells we have coloured in green by searching the corpus. Using ICECUP, we can perform a series of FTF queries, one for each grammatical outcome (DV = x, y, ...). We then calculate the overlap, or ‘intersection’, between each of the FTF queries and each value of the sociolinguistic variable (IV = a, b,...). In ICECUP, you can employ drag-and-drop logic to compute the intersection.
We can express a general contingency table algebraically as follows. Naturally, each total is the sum of the preceding row or column, and ‘a and x’ means the intersection of ‘IV = a’ with ‘DV = x’.
dependent variable (grammatical choice):

| independent variable (sociolinguistic context) | DV = x | DV = y | ... | TOTAL |
| --- | --- | --- | --- | --- |
| IV = a | a and x | a and y | ... | a and (x or y or...) |
| IV = b | b and x | b and y | ... | b and (x or y or...) |
| ... | | | | |
| TOTAL | (a or b or ...) and x | (a or b or ...) and y | ... | (a or b or ...) and (x or y or...) |
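The bookkeeping for such a table can be sketched in Python. The counts below are purely illustrative placeholders, standing in for the frequencies that real FTF query intersections would return:

```python
# Contingency table cells as a map from (IV value, DV value) to frequency.
# These numbers are hypothetical, not corpus results.
counts = {
    ("a", "x"): 10, ("a", "y"): 5,
    ("b", "x"): 7,  ("b", "y"): 12,
}
iv_values = ["a", "b"]
dv_values = ["x", "y"]

# Row totals: IV = a with any DV outcome, i.e. 'a and (x or y or...)'.
row_totals = {iv: sum(counts[(iv, dv)] for dv in dv_values) for iv in iv_values}
# Column totals: any IV value with DV = x, i.e. '(a or b or...) and x'.
col_totals = {dv: sum(counts[(iv, dv)] for iv in iv_values) for dv in dv_values}
grand_total = sum(counts.values())
```

Each total is simply the sum of the cells in its row or column, as in the algebraic table above.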
We want to find out whether the independent variable affects the value of the dependent variable, i.e. the choice of the grammatical construction. To do this we can contrast the distribution of each grammatical choice with the distribution that would be expected if it were unaffected by the IV.
- We can set up a simple chi-square goodness of fit test (written χ²) for the outcome DV = x. If this chi-square is significant it would mean that the value of the independent variable appears to affect the choice of outcome x. (Strictly, the null hypothesis – that it is unaffected – is not supported.) The chi-square compares an observed distribution for DV = x with an expected distribution based on the total (DV = <any>).
Important reminder: do not assume that the total (which is proportional to the likelihood of the choice occurring in the first place) is necessarily in proportion to the quantity of material in those sections of the corpus. This is an all-too-common mistake. For more on this, see the discussion on relative frequency in Part 1 or, for the experimental conclusions of this argument, below.
dependent variable (grammatical choice):

| independent variable (sociolinguistic context) | DV = x (observed) | TOTAL (expected) |
| --- | --- | --- |
| IV = a | a and x | a and (x or y or...) |
| IV = b | b and x | b and (x or y or...) |
| ... | | |
- Before performing the test, the expected distribution should be scaled so that it sums to the same as the observed distribution. To calculate the scale factor, divide the column totals by one another: SF = TOTAL(obs) / TOTAL(exp). To scale the expected column, multiply all values by SF.
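The scaling step can be sketched in a couple of lines of Python (the frequencies here are illustrative, not corpus data):

```python
# Scale an expected distribution so that it sums to the observed total.
observed = [30, 10]        # DV = x frequencies per IV value (illustrative)
expected_raw = [120, 80]   # column totals (DV = <any>) per IV value (illustrative)

sf = sum(observed) / sum(expected_raw)    # SF = TOTAL(obs) / TOTAL(exp)
expected = [e * sf for e in expected_raw]  # scaled expected column
```

After scaling, the observed and expected columns sum to the same total, as the chi-square test requires.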
We can perform the chi-square for any other choice: DV = y, etc., and for the entire table (see below). All observed distributions are compared against all expected distributions in a single chi-square.
Example: who/whom in ICE-GB
Suppose our IV is spoken or written, i.e., the simplest subdivision of text category, and we are interested in a grammatical choice: who vs. whom. The table now looks like this.
dependent variable (who vs. whom):

| independent variable (speech vs. writing) | DV = who | DV = whom | TOTAL |
| --- | --- | --- | --- |
| IV = spoken | who in spoken | whom in spoken | (who+whom) in spoken |
| IV = written | who in written | whom in written | (who+whom) in written |
| TOTAL | who in (spoken or written) | whom in (spoken or written) | (who+whom) in (spoken or written) |
This is a simple 2×2 contingency table, i.e., both variables have two possibilities. By performing the necessary searches we obtain the four central figures, then sum the rows and columns. We then apply this data to a chi-square test.
We want objective pronoun cases. Wallis (2021: 39) reports the following data from ICE-GB, obtained by using FTFs and subtracting subjective cases.
dependent variable (who vs. whom):

| independent variable (speech vs. writing) | DV = who | DV = whom | TOTAL |
| --- | --- | --- | --- |
| IV = spoken | 135 | 41 | 176 |
| IV = written | 22 | 41 | 63 |
| TOTAL | 157 | 82 | 239 |
- You can input the data into this 2×2 chi-square spreadsheet. (Note that in this spreadsheet, the DV and IV are the other way around.)
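If you prefer to work outside a spreadsheet, the same 2×2 chi-square can be computed directly. This minimal Python sketch uses the who/whom frequencies in the table above:

```python
# 2x2 homogeneity chi-square for the who/whom data (Wallis 2021: 39).
table = [[135, 41],   # spoken:  who, whom
         [22, 41]]    # written: who, whom

row_totals = [sum(row) for row in table]        # [176, 63]
col_totals = [sum(col) for col in zip(*table)]  # [157, 82]
n = sum(row_totals)                             # 239

chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n  # row total x column proportion
        chi2 += (obs - expected) ** 2 / expected
```

The score comfortably exceeds the 5% critical value χ²(0.05, 1) = 3.841, anticipating the significant result discussed under Question 2 below.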
In order to visualise the distribution we can plot it in the form of a simple bar chart. The vertical axis represents the frequency, i.e. the number of cases, in each category.

The most striking observation about this data is the large 'spike' for who in the spoken data. But there are also more examples of either who or whom in the spoken part of the corpus.
Having collected the data we can now pose some experimental questions.
Question 1. 2×1 goodness of fit for writing (against all data)
Consider the following question.
- Is there evidence of a tendency to use ‘whom’ more in writing?
If we look at the graph and table above, we can see nearly two-thirds of the written cases are whom.
This question asks whether this proportion is in line with the sample overall. The test, termed a goodness of fit χ² test for the written column, takes the total number of cases as given. With a two-way (binomial) dependent variable, this can be considered a single-proportion test (Wallis 2021: 140).
The observed χ² score exceeds the critical value of χ²(0.05, 1) = 3.841, so we may reject the null hypothesis. We might express that null hypothesis as the claim that writing does not differ from the overall tendency to use whom over who in objective position.
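The arithmetic behind this goodness of fit test can be sketched as follows, using the corpus frequencies from the table above:

```python
# 2x1 goodness of fit: does the written column fit the overall who/whom split?
observed = [22, 41]    # written: who, whom
overall = [157, 82]    # column totals for the whole sample

# Scale the expected distribution to the written total (SF = TOTAL(obs) / TOTAL(exp)).
sf = sum(observed) / sum(overall)
expected = [t * sf for t in overall]  # approx. [41.38, 21.62]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

The resulting score is well above χ²(0.05, 1) = 3.841, so the null hypothesis is rejected: writing does differ from the overall tendency.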
Question 2. 2×2 homogeneity test for independence
Next, consider the following question.
- Is the choice of ‘whom’ vs. ‘who’ significantly affected by the text category?
This question asks whether the whom proportion is consistent across each text category. The test, termed a homogeneity χ² test, simply assumes that both written and spoken column data are sampled. This is a two-independent-proportion test (Wallis 2021: 149).
How do these two questions and tests differ?
The best way to think about these two types of question and test is by visualising these proportions with confidence intervals. Wallis (2013a; 2021) and his corp.ling.stats blog cover this in some detail. See also this Statistics Resources page.
In the graph below, we plot the proportion of cases that were whom out of the choice {who, whom} for spoken and written data. (If we wished to plot the graph for who, we would simply turn it upside down.)

Question 1 compares the written data proportion with the average (the red dashed line). This written proportion was observed in our sample.
- We assumed that our observed proportion in the written data was uncertain. This uncertainty arises from the simple fact of random sampling. Were we to repeat the experiment, we would be unlikely to get exactly the same numbers again. We want to know if an observed difference from the average is too large to explain by blind luck.
- The goodness of fit test compared that uncertain proportion with a fixed average.
Question 2 does something different. It combines the sampling uncertainty for rates in both the spoken and written columns. This ‘two-proportion’ test for homogeneity (independence) compares both uncertain observations with each other. It asks, are these rates different from each other?
- The homogeneity χ² test estimates a Normal interval centred on the average and asks whether the two observed points are sufficiently distant from the average to be explained by random luck.
- A different test, called the Newcombe-Wilson difference method, is a little easier to interpret. This method calculates an interval for the difference d = p(whom, written) – p(whom, spoken), by combining interval widths on opposite sides (we take the square root of the sum of the two squared widths). See Wallis (2013b; 2021: 124).
- Either way, it should be obvious from the figure above that the difference between proportions is substantially greater than the interval widths. (A simple shortcut is to spot if there is a gap between intervals: in this case, the two points must be significantly different from each other.)
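The Newcombe-Wilson method can be sketched in Python, computing a Wilson score interval for each proportion and then combining the inner widths by a Pythagorean sum:

```python
from math import sqrt

Z = 1.959964  # two-tailed critical value of the Normal distribution for alpha = 0.05

def wilson(p, n, z=Z):
    """Wilson score interval for a proportion p observed in n cases."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# whom proportions out of {who, whom} in the ICE-GB data
p_spoken, n_spoken = 41 / 176, 176
p_written, n_written = 41 / 63, 63

lo_s, hi_s = wilson(p_spoken, n_spoken)
lo_w, hi_w = wilson(p_written, n_written)

# Newcombe-Wilson interval for d = p(whom, written) - p(whom, spoken):
# combine the interval widths on opposite sides of the two points.
d = p_written - p_spoken
d_lower = d - sqrt((p_written - lo_w) ** 2 + (hi_s - p_spoken) ** 2)
d_upper = d + sqrt((hi_w - p_written) ** 2 + (p_spoken - lo_s) ** 2)
```

Since the lower bound of the difference interval is above zero, the two proportions are significantly different at the 5% level.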
How was this graph calculated? The answer is that we used the Wilson score interval method.
Applying this method to the spoken data yields an interval of (w⁻, w⁺) = (0.1766, 0.3007). In other words, were we to repeat the same experiment 20 times, we would expect to see a proportion outside of this range no more than once.
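The calculation for the spoken data can be reproduced in a few lines of Python:

```python
from math import sqrt

z = 1.959964          # two-tailed critical value for alpha = 0.05
p, n = 41 / 176, 176  # whom proportion out of {who, whom} in the spoken data

# Wilson score interval: shift the centre towards 0.5 and widen for small n.
denom = 1 + z * z / n
centre = (p + z * z / (2 * n)) / denom
half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom

w_lo, w_hi = centre - half, centre + half  # (0.1766, 0.3007) to 4 decimal places
```

Unlike the simpler Normal approximation, the Wilson interval is asymmetric about p and never strays outside the range [0, 1].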
What do significant results mean?
A significant result means that a perceived difference between observed and expected data is large enough for it to be unlikely that it happened by chance. (Often it is best to say that “we observed a significant difference between x and y,” rather than “our test was significant.”)
But all statistical results indicate a correlation; they cannot prove a cause.
A significant result suggests that something systematic is going on in the data which would be reproduced were we to take similar samples (of text) from the population (of all texts from the same period, authorship, genres, etc., and annotated in the same way). It doesn’t tell us what the cause is. This is the corollary of the argument we made before regarding the difficulty of proving or disproving a hypothesis.
As researchers have become more sophisticated in their use of statistics, there has been a tendency to fit data to particular models, and sometimes we see results reported as “explaining” the data. But this is still a correlation, so all that is meant mathematically is “we could explain this pattern in the data with these variables to this degree, but we still don’t know whether this explanation is correct!”
In practice, it is often wiser to begin with simple two-way choices and two variables, concentrating on pivotal changes that have linguistic meaning. The more complicated the model, the more you are taking the method on trust. Statistics is difficult enough to get right as it is.
There is another aspect to significant difference. If the sample size is very large, almost every apparent difference will be significant (i.e., the direction of difference observed would likely exist in the population). The question then becomes: it may be significant, but how big is the effect? We consider effect size measures in Part 3.
Does a significant result prove the experimental hypothesis? Not necessarily...
An artefact of the annotation?
A result could be an artefact of the annotation. There are at least three different kinds of problem here.
- Circularity. Maybe the result flows from the way that the grammar has been adapted to deal with, say, spoken English. You may find that in practice you were measuring the same thing in two different ways. No wonder the result was significant!
- Inaccurate sampling. It could be due to an imprecise FTF definition, incorrect annotation or poor sampling in the first place (for an example of an annotation weakness, see here). Are all your cases really expressing the same linguistic phenomenon? Are there other cases in the corpus that should have been included? Another possibility is that cases are not strictly independent.
- Poor design. Are all possible values of a variable listed? Are some of the outcomes expressing something linguistically quite distinct from the others? You might want to restructure the experiment to deal with two distinct subgroups (see the example below).
To decide you must inspect the cases and relate the design to a theory.
A third factor?
The correlation could be a result of a third factor affecting both the DV and the IV separately. This is more likely if the DV and IV are both grammatical. (This is our old friend the root cause.)

The central point is that to ascertain the reason for a correlation requires more work. You should inspect the corpus and argue your case. Relate your results to those reported in the literature: if they differ, why do they do so? Play ‘devil’s advocate’ and try to argue against your own results. Dig deeper – try looking at subcategories of each FTF, for example – and anticipate objections.
The beauty of a public corpus and query system is that, provided you report your experiment clearly, everyone has the opportunity of reproducing your results (and possibly raising alternative explanations).
The corpus is a focus for discussion, not a substitute for it.
As well as detecting a significant difference and even explaining it, it is also useful to be able to measure how big an effect is: a small variation can be significant without being very interesting, while a large variation can indicate that there is something more fundamental going on.
References
Gould, S.J. (1984), The mismeasure of man, London: Penguin.
Wallis, S.A. (2013a), Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20:3, 178-208. » corp.ling.stats
Wallis, S.A. (2013b), z-squared: The origin and application of χ². Journal of Quantitative Linguistics 20:4. 350-378. » corp.ling.stats
Wallis, S.A. (2021), Statistics in Corpus Linguistics Research – A New Approach, New York, London: Routledge. » order