How FTFs can be used to perform natural experiments with a parsed corpus like ICE-GB and DCPSE.
Part 4. Common problems of experimental design
The following problems are discussed below:
- have we incorrectly specified the null hypothesis?
- have we listed all the relevant values?
- are we really dealing with the same linguistic choice?
- have we counted the same thing twice?
Additional problems are discussed on the final page, which concerns experiments on the interaction between lexico-grammatical variables.
- have we effectively captured the aspect of the annotation we require?
- can we tell if cases overlap one another by examining the corpus?
- Have we specified the null hypothesis incorrectly?
The null hypothesis is a statement that the observed distribution does not differ from the expected distribution. If the expected distribution is incorrectly specified, the experiment will measure the wrong thing, and your claims will not be justified. This is another way of saying: get the experimental design right.
If you followed the steps in the basic example in Part 2, you should have no problem. However, researchers often assume that the expected distribution is simply based on the quantity of material in each subcorpus. In a lexical or tagged corpus it can be more difficult to calculate an expected distribution or baseline of choice, because doing so might involve listing a large number of possible alternatives.
For example, suppose we compare the frequency of the modal verb may in a series of subcorpora with the quantity of material in each one. Appropriately scaled, this could be quoted as the frequency ‘per million words’, ‘per thousand words’, etc.
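As a concrete illustration, scaling a raw frequency to a per-million-word rate is simple arithmetic. The counts below are invented for the sketch, not taken from ICE-GB:

```python
# Hypothetical counts: raw frequency of 'may' and subcorpus sizes in words
# (figures invented for illustration only).
subcorpora = {
    "formal writing":  {"may": 410, "words": 423_000},
    "spoken dialogue": {"may": 250, "words": 637_000},
}

for name, c in subcorpora.items():
    per_million = c["may"] / c["words"] * 1_000_000
    print(f"{name}: {per_million:.1f} per million words")
```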
However, there is an obvious problem with this: what if we found that may appeared more frequently than expected in, say, formal writing? This does not tell us whether
- there are more uses of modals in formal writing, or
- there are more uses of modal “may” than other modal alternatives (e.g., can) in formal writing.
In summary, the useful null hypothesis is that the outcome does not vary when the choice arises in the first place rather than that it does not vary in proportion to the amount of material in the corpus!
Define the general phenomenon first (e.g. “verb phrases expressing possibility”). How can we search for all verb phrases expressing possibility? This might be difficult! Here we have two options: to work top-down or to work bottom-up.
Top-down
We might not be able to construct a query for all verb phrases expressing possibility. But we might find a broader alternative, such as all tensed verb phrases. This is still a better baseline than the total number of words.
On the next page we illustrate this idea with some examples exploring grammatical interaction. For lexical expressions it is more common to work bottom-up.
Bottom-up
List the outcomes that we are interested in, e.g. {may, can, might, could}, and then define the expected distribution by summing these. This is what we did in our “who vs. whom” experiment. Ultimately, whether you work top-down or bottom-up, and which baseline you choose, is a question of research design.
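The bottom-up baseline can be sketched as follows, with invented counts for a single subcorpus. The contrast is between may as a share of the choice set and may as a share of all words:

```python
# Hypothetical counts for one subcorpus (invented for illustration).
counts = {"may": 120, "can": 300, "might": 80, "could": 200}
total_words = 400_000

modal_baseline = sum(counts.values())       # bottom-up baseline: the choice set
p_choice = counts["may"] / modal_baseline   # P(may | one of the set is chosen)
p_words = counts["may"] / total_words       # per-word rate: conflates choice with opportunity

print(f"may as a share of the modal set: {p_choice:.3f}")
print(f"may per word:                    {p_words:.6f}")
```

Only the first ratio isolates the choice; the second also varies with how often speakers express possibility at all.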
But by focusing on the choice, we can seek to eliminate the noise introduced by natural variation: speakers and writers simply deciding whether to express possibility at particular points in the corpus!
If we investigate may as a member of this set, we can examine within-group choice variation. But if we were to investigate may as a proportion of words, we would inevitably conflate this variation with opportunity variation – variation in the opportunity to express possibility, and hence to use may, in the first place.
Finally, note that you can, of course, work in both directions within the same experiment. In the following example we define who vs. whom and work from the bottom up; then we define a which/what+N group and subdivide it (i.e., we work top-down).
- Have we listed all the relevant values?
When you fail to enumerate all possible outcomes of a particular type, this also affects the conclusions that you can draw.
In the worked example we suggested that there were two possible alternative outcomes, who and whom. The dependent variable (DV) is defined as one or the other of these. Note that in the formal outline we wrote that the row total was an expression like ‘a and (x or y or...)’ rather than just ‘a’ – there might be other values that are not included in the choice of ‘(x or y or...)’.
If we fail to consider all plausible outcomes, our results may be too specific to be theoretically interesting (see this discussion on meaning) and may be more likely to be due to an artefact of the annotation.
Sometimes we may want to add other alternatives while prioritising certain grammatical distinctions over others. To do this, we may group alternatives hierarchically to form an outcome taxonomy.
Where two outcomes are subtly different from one another, but quite distinct from a third (e.g. who vs. whom vs. which/what+N constructions), we might like to consider splitting tables and tests:
- to compare the first two aggregated together vs. the third (when do we use who or whom?), and
- to focus on the difference between the first two (when do we prefer who over whom?).
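This splitting of tables can be sketched numerically. The counts below are invented, and the hand-rolled function computes a plain Pearson chi-square for a 2×2 table (no continuity correction), purely to show how the two tests are built from the same taxonomy:

```python
# Invented counts of relative forms in two text categories (illustration only).
#                who  whom  which/what+N
data = {
    "spoken":  (310,  12,  145),
    "written": (180,  38,  160),
}

def chisq_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

(w1, m1, rel1), (w2, m2, rel2) = data["spoken"], data["written"]

# Test 1: {who + whom} aggregated vs which/what+N -- when do we use who or whom at all?
chi_agg = chisq_2x2(w1 + m1, rel1, w2 + m2, rel2)

# Test 2: who vs whom within the aggregated group -- when do we prefer who over whom?
chi_split = chisq_2x2(w1, m1, w2, m2)

print(f"(who+whom) vs which/what+N: chi-square = {chi_agg:.2f}")
print(f"who vs whom:                chi-square = {chi_split:.2f}")
```

Each test addresses one level of the outcome taxonomy, so a significant result can be attributed to the relevant distinction rather than to the taxonomy as a whole.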
- Are we really dealing with the same linguistic choice?
One advantage of working from the bottom up, rather than from the top down, is that this method encourages you to think about the specifics of each linguistic choice. Our assumption has been that who and whom are interchangeable, i.e., any time a speaker or writer uses one expression, the other could equally be substituted. Likewise, in the illustration above, we effectively assumed that in all cases where who or whom were used, a which/what+N expression could be substituted, and vice versa. But sometimes you need to check sentences in the corpus to be sure.
This is a particular problem if you want to look at a phenomenon that is not represented in a corpus. For example, in ICE-GB, particular sub-types of modals are not labelled. Suppose we wished to contrast deontic (permissive) may with deontic can, and then compare these with other ways of expressing permission. We don’t want deontic may to be confused with other types of modal may, because these other types do not express the same linguistic choice. The solution is to first classify the modals manually (effectively by performing additional annotation). This may be possible through the use of ‘selection lists’ in ICECUP 3.1.
This problem is related to issues arising from the annotation.
- Have we counted the same thing twice?
If you include cases that are not structurally independent from one another you flout one of the requirements of the statistical test. You must not count the same thing twice.
The problem arises because a corpus is not like a regular database, and ‘cases’ (instances) interact with one another.
There are several ways in which this “bunching” problem can arise. We have listed these from the most grammatically explicit to the most incidental.
- Matches fully overlap one another. We discussed this in the context of matching the corpus (see here). The only aspect of this matching that can be said to be independent is that different nodes refer to different parts of the same tree. If you avoid using unordered relationships in FTFs, this will not happen.
- Matches partially overlap one another, i.e., part of one FTF coincides with part of another. There are two distinct types of partial overlap: where the overlapping node(s) in the tree are the same in the FTF, or different. The first type can arise with eventual relationships in FTFs, the second can only occur if two nodes in an FTF can match the same constituent in a tree (see detecting case overlap).
- One match can dominate, or subsume another, for example, a clause within a clause.
- Alternatively a match might be related to another via a simple construction, e.g., co-ordination.
- The construction may be repeated by the same speaker (possibly in the same utterance) or replicated by another speaker (termed ‘priming’).
If you are using ICECUP to perform experiments, you can ‘thin out’ the results of searches by defining a ‘random sample’ and then applying it to each FTF. However, this thins out text units, not matching cases: if several cases co-exist in the same sentence and that sentence is sampled, they will all be included. Moreover, since thinning reduces the total number of cases, the likelihood that an experiment is affected by the limited independence of cases will increase.
We should put this in some kind of numerical context: if you are looking at (say) thirty thousand clauses, it is unlikely that a small degree of bunching will affect the significance of the result. If, on the other hand, you are looking at a relatively infrequent item (say, in double figures) and, when you inspect the cases, you find that many of these are co-ordinated with each other, you may wish to eliminate these co-ordinated items – either by limiting the FTF, or by subtracting the results of a new FTF that counts the number of additional co-ordinated terms for each outcome.
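The subtraction step can be sketched in a few lines. All figures here are invented: `raw_hits` stands in for the counts returned by the main FTF, and `extra_coordinated` for a second, hypothetical FTF counting additional co-ordinated terms per outcome:

```python
# Hypothetical FTF result counts (invented for illustration).
raw_hits          = {"who": 54, "whom": 17}
# Counts a second FTF might return: additional terms bunched via co-ordination.
extra_coordinated = {"who":  9, "whom":  3}

# Subtract the additional co-ordinated terms to approximate independent cases.
adjusted = {k: raw_hits[k] - extra_coordinated[k] for k in raw_hits}
print(adjusted)   # -> {'who': 45, 'whom': 14}
```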
One solution, explored in the Next Generation Tools project, would be for the search software to identify potential problems in the sampling process and discount interacting cases in some way. Another approach involves adjusting statistics to account for clustering, termed random-text sampling (See Chapter 17 in Wallis 2021).
References
Newcombe, R.G. (1998). Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine, 17, 873-890.
Wallis, S.A. (2021). Statistics in Corpus Linguistics Research – A New Approach. New York, London: Routledge.
Zou, G.Y. & A. Donner (2008). Construction of confidence limits about effect measures: A general approach. Statistics in Medicine, 27:10, 1693-1702.