How FTFs can be used to perform natural experiments with parsed corpora like ICE-GB and DCPSE.
Part 3. Effect sizes and confidence intervals
How big is the effect?
A statistically significant result could be small or large. As we have seen, significance just tells us that a result is unlikely to be due to chance. Moreover, in order to generalise from the significant result, the experimental sample should approximate to a random sample from a specified population of utterances (e.g., in the case of ICE-GB, the population might be 1990s British English, spoken and written).
Even if a result is significant, it does not follow that the effect is large. Below, we discuss four different measures of effect size. These are
- single proportion (sample probability)
- difference in proportion and percentage difference
- Cramér’s ϕ measure of association
- Goodness of fit ϕp (root mean square)
(This section draws extensively on Sean Wallis’s work in statistics and his corp.ling.stats blog. See also Statistics Resources.)
We will use the following data from ICE-GB, obtained by using FTFs to find object pronouns (by subtracting subjective cases). This data may be substituted into this 2 × 2 chi-square spreadsheet.
The dependent variable (whom vs. who) varies across columns; the independent variable (speech vs. writing) varies down rows.

| | DV = who | DV = whom | TOTAL |
| --- | --- | --- | --- |
| IV = spoken | 135 | 41 | 176 |
| IV = written | 22 | 41 | 63 |
| TOTAL | 157 | 82 | 239 |
A single proportion p
One simple measure is the proportion of cases in a category, i.e., the sample probability that a specific outcome is chosen (e.g., who over whom), given a particular value of the independent variable (in the example, text category). Suppose IV = a = spoken and DV = x = whom.
| independent variable | DV = whom | ... | TOTAL |
| --- | --- | --- | --- |
| IV = spoken | o11 = O(whom, spoken) | ... | n1 = TOTAL(spoken) |
| ... | ... | ... | ... |
| TOTAL | TOTAL(whom) | ... | TOTAL |
In the previous example, 41 out of 176 cases (23.30%) found in the spoken category are whom, compared with 41 out of 63 cases (65.08%) in the written category. The probability of selecting whom in the spoken data may be expressed as
- p(whom | {whom, who}, spoken) = O(whom, spoken)/TOTAL(spoken) = o11/n1 = 41/176 = 0.2330.
Probabilities are a normalised way of viewing frequencies (they are between 0 and 1 and, for a fully enumerated set of alternatives, must sum to 1). This probability is the observed probability of the experiment, that is, if the sample were truly representative of the population, this would be the likelihood that a particular dependent value was chosen if a specific value of the independent variable was selected.

Confidence intervals on p
As we noted, the probability quoted above is a probability estimate based on the corpus sample. Were we to reproduce the experiment with a different corpus sample taken from the same population of English utterances, we might not get the same result. Consequently, when we make an estimate like this it is useful to also calculate the probabilistic ‘margin of error’ or, to use the proper term, confidence interval. This can be thought of as the most likely range of probability values.
A confidence interval for the single proportion can be calculated using the Wilson score interval. This obtains an interval that is guaranteed to be consistent with the equivalent goodness of fit χ² test. If we applied a goodness of fit test with a true proportion P outside the interval range for p it would be significantly different from p, and if P was within the interval it would not.
The 95% Wilson score interval for p(whom | {whom, who}, spoken) = 0.2330 with sample size n1 = 176 is (0.1766, 0.3007).
As a shorthand, we might write this as p = 0.2330 ∈ (0.1766, 0.3007).
An interval p ∈ (p⁻, p⁺) is the set of all values between lower bound p⁻ and upper bound p⁺. The '∈' symbol means that p is a member of the set. To specify the Wilson score interval we use the labels (w⁻, w⁺) because there are other methods for estimating an interval.
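To make this concrete, here is a minimal Python sketch of the calculation (the function name wilson and the default z = 1.959964 for a 95% interval are our own choices; the formula itself is the standard Wilson score interval):

```python
import math

def wilson(p, n, z=1.959964):
    """Wilson score interval (w-, w+) for an observed proportion p out of
    n independent cases. z is the two-tailed standard Normal critical
    value; 1.959964 gives a 95% interval."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    width = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - width, centre + width

# p(whom | {whom, who}, spoken) = 41/176, with n1 = 176
print(wilson(41 / 176, 176))  # ~ (0.1766, 0.3007)
```

Note that the Wilson interval is asymmetric about p, and its bounds always fall between 0 and 1.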
Difference between two independent proportions
A commonly-cited effect size measure for 2 × 2 tables is the difference between two independent proportions, d = p2 − p1.
In our who vs whom example, the proportions p1 = p(whom | {whom, who}, spoken) and p2 = p(whom | {whom, who}, written) are independent. They are cited from two different parts of the corpus, or, to put it another way, from two different populations: spoken utterances sampled comparably to those in ICE-GB, and written sentences sampled comparably to those in ICE-GB.
The difference between two scores can be obtained by simple subtraction,
- d = p2 − p1 = 0.6508 − 0.2330 = 0.4178.
A good quality confidence interval can be obtained by a Pythagoras-type method called the Newcombe-Wilson difference interval, or, more generally, ‘MOVER’ (method of variance estimates recovery). This is set out below.
The idea is that the interval width for each bound of d is calculated separately as the hypotenuse of a right-angled triangle where the other two sides are the interval widths of each proportion. See the sketch below. The lower bound hypotenuse, labelled wd⁻, is the lower bound interval width for d.

[Sketch: the lower bound width of d, wd⁻, as the hypotenuse of a right-angled triangle whose other two sides are the inner interval widths of p1 and p2.]
This obtains the 95% interval d = 0.4178 ∈ (0.2772, 0.5378).
In other words, we predict the true difference to be between 0.2772 and 0.5378 based on our data, with an error rate of 1 in 20. Since this interval does not include 0, we can say that the difference is ‘significant’, i.e. significantly different from zero.
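Continuing the sketch, the Newcombe-Wilson interval can be computed by reusing wilson() above (the function name newcombe_wilson is our own):

```python
def newcombe_wilson(p1, n1, p2, n2):
    """Newcombe-Wilson (MOVER) interval for d = p2 - p1. Each bound's
    distance from d is the hypotenuse of the two inner Wilson widths."""
    w1_minus, w1_plus = wilson(p1, n1)
    w2_minus, w2_plus = wilson(p2, n2)
    d = p2 - p1
    lower = d - math.sqrt((p2 - w2_minus) ** 2 + (w1_plus - p1) ** 2)
    upper = d + math.sqrt((w2_plus - p2) ** 2 + (p1 - w1_minus) ** 2)
    return lower, upper

# whom in the spoken (41/176) vs. written (41/63) data
print(newcombe_wilson(41 / 176, 176, 41 / 63, 63))  # ~ (0.2772, 0.5378)
```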
Percentage difference
A popular effect size measure which we often see quoted is percentage difference. This is the difference between two scores, divided by the starting point. We might write it as
- d% = (p2 − p1) / p1.
With our data, this obtains an observed rate of
- d% = (0.6508 − 0.2330) / 0.2330 = 1.7937.
A good quality confidence interval for d% can be calculated using Method 3 in this blog post.
This yields the 95% confidence interval
- d% = 1.7937 ∈ (1.0073, 2.8921).
In plain English: the tendency to use whom rather than who is some 180% higher in writing than in speech in ICE-GB. We can predict that this increase lies between +101% and +289% (that is, between 2.0 and 3.9 times the original rate) for comparably sampled data, at an error level of 5% (1 in 20).
Although it is popular, percentage difference has one major counterintuitive defect.
Positive and negative percentage differences are not the same thing. +100% means doubling, whereas the inverse (halving) is -50%. Suppose you read in a newspaper that crime rates declined by 10% in the first year and rose by 10% in the second. What is the crime rate now compared to the start? What if it went down by 50% and up by the same amount? (The answers: 0.9 × 1.1 = 0.99, i.e. 1% down overall, and 0.5 × 1.5 = 0.75, a 25% fall.)
In addition, if the starting point p1 is zero, we obtain a meaningless result (an infinite increase) that we cannot compare with anything. It is also difficult to make meaningful comparisons between percentage differences computed from widely different starting points.
For these reasons it is usually preferable to cite simple difference.
Cramér’s ϕ measure of association
This is a measure closely related to chi-square that is particularly useful for interaction experiments, i.e. experiments where we are interested in whether one lexico-grammatical variable influences another.
There are two formulae. There is a special signed formula for a 2 × 2 table, which generates a score in the interval [-1, +1], i.e. any number from -1 to +1 inclusive. There is a second formula for an arbitrary multinomial r × c table, which generates an unsigned score in [0, 1].
The unsigned ϕ score is
- ϕ = √{χ² / n(k – 1)},
where n is the total number of observations (the grand TOTAL) in the table and k is the minimum of the total number of rows and columns.
One way of thinking about ϕ is as a normalised homogeneity χ² statistic. In other words, if we wanted to compare two χ² results, we would do better to compare ϕ scores. The maximum possible value of χ² is n(k – 1), and we take the square root of the ratio to convert the score to a 'standard deviation' scale.
The main problem with multinomial evaluations of this kind is that they average many differences. So a difference in r × c scores is not that powerful. There are better methods for comparing results of experiments (see Wallis 2021: Chapter 15).
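As a cross-check on the formula, here is a short sketch that computes homogeneity χ² and unsigned ϕ for an r × c table given as a list of rows (the function name is our own):

```python
def cramers_phi(table):
    """Unsigned Cramer's phi = sqrt(chi-square / n(k - 1)), where k is
    the smaller of the number of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # homogeneity chi-square: sum of (O - E)^2 / E over all cells
    chisq = sum((o - r * c / n) ** 2 / (r * c / n)
                for row, r in zip(table, row_totals)
                for o, c in zip(row, col_totals))
    k = min(len(row_totals), len(col_totals))
    return math.sqrt(chisq / (n * (k - 1)))

print(cramers_phi([[135, 41], [22, 41]]))  # ~ 0.3878 (chi-square ~ 35.94)
```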
A more usable formula is that for 2 × 2 ϕ.
- ϕ = (o11o22 – o12o21) / √{(o11 + o12)(o21 + o22)(o11 + o21)(o12 + o22)},
for a 2 × 2 table of the form [[o11, o12], [o21, o22]].
Here is our data again, this time arranged with whom in the first column so that o11 = O(whom, spoken):
| | DV = whom | DV = who | TOTAL |
| --- | --- | --- | --- |
| IV = spoken | o11 = 41 | o12 = 135 | o11 + o12 = 176 |
| IV = written | o21 = 41 | o22 = 22 | o21 + o22 = 63 |
| TOTAL | o11 + o21 = 82 | o12 + o22 = 157 | n = 239 |
This data gives us
- ϕ = ((41 × 22) – (135 × 41)) / √{176 × 63 × 82 × 157} = -0.3878.
This has an absolute value equal to √(χ² / n). But the sign (minus) also tells us the direction of the change.
This ϕ score has another property: it is the geometric mean of differences in proportions calculated across both dimensions of the 2 × 2 grid.
Suppose we define d(IV) as the difference across the independent variable, and d(DV) for the dependent variable,
- d(IV) = p2 – p1 = o21/(o21 + o22) – o11/(o11 + o12), and similarly
- d(DV) = o12/(o12 + o22) – o11/(o11 + o21),
then ϕ is their geometric mean:
- ϕ² ≡ d(IV) × d(DV).
This insight gives us an efficient way to calculate a confidence interval for 2 × 2 ϕ.
Since squaring discards the sign, we have to work out the correct sign of the score separately. Using this data:
- d(IV) = 41/63 – 41/176 = 0.6508 – 0.2330 = 0.4178 ∈ (0.2772, 0.5378),
- d(DV) = 135/157 – 41/82 = 0.8599 – 0.5000 = 0.3599 ∈ (0.2368, 0.4751).
To recover the sign we apply the sign of ϕ to each bound: here ϕ is negative, so ϕ⁻ = -√{d⁺(IV) × d⁺(DV)} and ϕ⁺ = -√{d⁻(IV) × d⁻(DV)}. For a detailed explanation see the blog post.
- ϕ = -0.3878 ∈ (-0.5055, -0.2562).
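The whole calculation can be sketched as follows, reusing wilson() and newcombe_wilson() from above (cell values follow the whom-first table; variable names are our own):

```python
# Signed 2 x 2 phi for the table [[o11, o12], [o21, o22]] = [[41, 135], [41, 22]]
o11, o12, o21, o22 = 41, 135, 41, 22

phi = (o11 * o22 - o12 * o21) / math.sqrt(
    (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22))  # ~ -0.3878

# Newcombe-Wilson intervals for the two differences in proportions
d_iv = newcombe_wilson(o11 / (o11 + o12), o11 + o12,
                       o21 / (o21 + o22), o21 + o22)  # ~ (0.2772, 0.5378)
d_dv = newcombe_wilson(o11 / (o11 + o21), o11 + o21,
                       o12 / (o12 + o22), o12 + o22)  # ~ (0.2368, 0.4751)

# Attach the sign of phi to the geometric means of the matching bounds
sign = -1 if phi < 0 else 1
phi_lower = sign * math.sqrt(d_iv[1] * d_dv[1])  # ~ -0.5055
phi_upper = sign * math.sqrt(d_iv[0] * d_dv[0])  # ~ -0.2562
print(phi, (phi_lower, phi_upper))
```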
Goodness of fit ϕp
Just as ϕ is a kind of generalisation of simple difference d for homogeneity χ², it is useful to have a similar score for goodness of fit χ². Wallis proposes the root mean square fit, which he labels ϕp.
Let observed proportions pi = oi/n, and similarly expected Pi = ei/n. Both ∑pi and ∑Pi = 1. We can define ϕp as
- ϕp = √{½ ∑(pi – Pi)²},
which has a maximum of 1. A small score means a small difference and therefore a close fit. A confidence interval can be obtained by the method set out in this blog post.
For a 2 × 1 fitness test, a signed score is extremely simple to obtain!
- signed ϕp = p1 – P1,
i.e. the relative swing from a given value or mean of the table.
For our data, comparing p(whom | {whom, who}, spoken) with the mean,
- p1 = p(whom | {whom, who}, spoken) = 0.2330.
This has interval p1 ∈ (0.1766, 0.3007). To obtain an interval for signed ϕp, we simply subtract P1 from these figures because it is not uncertain.
If we compare p1 with the mean of all data, P1 = 82/239 = 0.3431 (as in the goodness of fit test), we have
- ϕp = 0.2330 – 0.3431 ∈ (0.1766 – 0.3431, 0.3007 – 0.3431),
- ϕp = -0.1101 ∈ (-0.1665, -0.0424).
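A final sketch of the 2 × 1 signed calculation, again reusing wilson() and treating P1 as a fixed expected value:

```python
# Signed phi-p: p(whom | spoken) compared with the pooled rate of whom
p1, n1 = 41 / 176, 176   # observed proportion and sample size
P1 = 82 / 239            # expected proportion (pooled mean), ~ 0.3431

w_minus, w_plus = wilson(p1, n1)
phi_p = p1 - P1                          # ~ -0.1101
interval = (w_minus - P1, w_plus - P1)   # ~ (-0.1665, -0.0424)
print(phi_p, interval)
```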
Wallis notes that despite its simplicity, this measure is a kind of mutual fit, and it is not quite the same as goodness of fit χ².
A slightly more complicated measure, ϕe, is a scaled goodness of fit χ² statistic. It requires that the observed data is a strict subset of the expected data. For more information see this blog post.
See also
- Statistics Resources
- Sean Wallis has published an extensive resource on confidence intervals in his blog on corpus linguistics, corp.ling.stats.
References
Newcombe, R.G. (1998). Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine, 17, 873-890.
Wallis, S.A. (2021). Statistics in Corpus Linguistics Research – A New Approach. New York, London: Routledge.
Zou, G.Y. & A. Donner (2008). Construction of confidence limits about effect measures: A general approach. Statistics in Medicine, 27:10, 1693-1702.