Performing experiments using FTFs

Predicting one grammatical variable with another

We now turn to the study of the interaction between two grammatical variables, i.e., how one aspect of grammatical structure might interact with another.

This is a bit more complicated than studying how socio-linguistic context might affect a grammatical outcome. The central practical problem is that you cannot complete a contingency table by computing intersections with logical ‘and’ in the normal way.

If you are not quite sure that you understand the simpler example on the previous page, please reread the discussion.

The concept of a ‘case’ in a grammatical sample

A corpus consists of independently - if not randomly - sampled “texts”. If we find two examples of a phenomenon in separate texts, we can readily assume that these examples have arisen independently. However, what if the phenomenon appears in the same text, in the same flow of utterances or in the same single utterance? We discussed this previously.

Note that a corpus text is not like a regular database, where records are typically independent from one another. Where records in a regular database are related (e.g., samples collected over time), they should be analysed differently.

In grammar, a case could be a single constituent, like a clause, or a group of associated constituents expressed as a more complex FTF.

If we want to investigate how two aspects of a grammatical phenomenon interact, we should note the following.

  1. The two variables must apply to the same phenomenon. That is, both variables must be based on the same fundamental definition of the case in question. Note that this fundamental definition could be specified as a set of alternative FTFs (as in the example below), but these alternatives should form a meaningful group (e.g., with/without a constituent). We will adopt the convention of specifying a definition of the case in the top left of the contingency tables.
  2. In practice, this means that we specify FTFs for every cell in the contingency table, not just for every column. We cannot use drag and drop logic to calculate the intersections. We must also avoid ambiguity in FTF relations (avoid unordered or eventual relations if at all possible).
  3. We should enumerate all alternatives of the case. Where an FTF cannot be used directly (ICECUP 3.1 allows a search for unspecified features, but ICECUP 3.0 did not), we may infer these values by subtracting from the frequency of a more general case, which represents the total.

Extending the basic approach to grammatical interactions

Suppose that we are interested in investigating aspects of clause structure and we want to find out whether one grammatical variable (say, the mood {exclamative, interrogative, etc.} = IV) affects another (say, the transitivity feature = DV).

  1. We construct a contingency table as before. Instead of performing FTF queries for each grammatical outcome, we must define FTFs for each combination of dependent and independent variable (the inner cells of the table, one FTF per cell). As before, each total is the sum of all preceding rows or columns.
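Case: CL. Columns: dependent variable (transitivity). Rows: independent variable (mood).

CL               | DV = m                | DV = d                | ... | TOTAL
-----------------+-----------------------+-----------------------+-----+-------------------------------------
IV = e (exclam)  | CL(exclam, montr)     | CL(exclam, ditr)      |     | e and (m or d or ...)
IV = i (inter)   | CL(inter, montr)      | CL(inter, ditr)       |     | i and (m or d or ...)
...              |                       |                       |     |
TOTAL            | (e or i or ...) and m | (e or i or ...) and d |     | (e or i or ...) and (m or d or ...)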

A contingency table (DV x IV) for the example

  2. If you want to ensure that mood and transitivity are fully enumerated, you may also allow for them to be unmarked. This makes the experiment more robust. The way to do this with ICECUP 3.0 is to add a new column or row for the unmarked element, compute an FTF for the total (e.g., for the first column, simply ‘CL(montr)’) and then subtract the frequencies of the marked cells from that total. For the first column, the result is the number of monotransitive clauses whose mood is not marked, which you put in the ‘unmarked’ cell. In ICECUP 3.1 you can retrieve these values directly.

    Note that there is an important difference between the mood and transitivity of clauses. All clauses should be classified by transitivity, so if the feature is absent, the clause is incomplete or an error. Mood, on the other hand, is optional (and meaningful): if unmarked, it is assumed to be indicative.

    The grand total is then simply the result of performing a query for ‘CL’. If you can write an explicit FTF here, this FTF defines the case. In the table above, the grand total is the set of all clauses where both transitivity and mood are stated (which is not always the case).
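Case: CL. Columns: dependent variable (transitivity). Rows: independent variable (mood).

CL               | DV = m            | DV = d           | ... | DV = 0 | TOTAL
-----------------+-------------------+------------------+-----+--------+------------
IV = e (exclam)  | CL(exclam, montr) | CL(exclam, ditr) |     |        | CL(exclam)
IV = i (inter)   | CL(inter, montr)  | CL(inter, ditr)  |     |        | CL(inter)
...              |                   |                  |     |        |
IV = 0           |                   |                  |     |        |
TOTAL            | CL(montr)         | CL(ditr)         |     |        | CL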

A fully enumerated contingency table (DV x IV)

The cells in the unmarked column (DV = 0) and the unmarked row (IV = 0) are left blank: each contains the result after subtracting all other values from the corresponding total.
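
The subtraction step can be sketched in Python. This is only an illustration: the frequencies are invented (they are not ICE-GB counts), the function name `add_unmarked` is hypothetical, and the sketch assumes only two marked values for each variable. In practice, every number would come from running the corresponding FTF in ICECUP.

```python
def add_unmarked(cells, row_totals, col_totals, grand_total, unmarked="0"):
    """Fill in the 'unmarked' row and column by subtraction from the totals."""
    rows, cols = list(cells), list(col_totals)
    table = {r: dict(cells[r]) for r in rows}
    # Unmarked column: row total minus the marked cells in that row.
    for r in rows:
        table[r][unmarked] = row_totals[r] - sum(cells[r][c] for c in cols)
    # Unmarked row: column total minus the marked cells in that column.
    table[unmarked] = {
        c: col_totals[c] - sum(cells[r][c] for r in rows) for c in cols
    }
    # Corner cell: what remains of the grand total after all other cells.
    table[unmarked][unmarked] = grand_total - sum(
        n for row in table.values() for n in row.values()
    )
    # A negative frequency means the totals and the cells are inconsistent.
    for r, row in table.items():
        for c, n in row.items():
            if n < 0:
                raise ValueError(f"negative frequency in cell ({r}, {c})")
    return table


# Invented frequencies, for illustration only.
cells = {
    "exclam": {"montr": 20, "ditr": 5},    # CL(exclam, montr), CL(exclam, ditr)
    "inter": {"montr": 300, "ditr": 40},   # CL(inter, montr), CL(inter, ditr)
}
row_totals = {"exclam": 30, "inter": 400}  # CL(exclam), CL(inter)
col_totals = {"montr": 5000, "ditr": 600}  # CL(montr), CL(ditr)
grand_total = 8000                         # CL

table = add_unmarked(cells, row_totals, col_totals, grand_total)
for mood, row in table.items():
    print(mood, row)
```

The check for negative values is a design choice of this sketch: a negative result would indicate that one of the FTFs does not match the intended definition of the case.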

How does including ‘unmarked’ elements increase the robustness of the experiment?

Including an ‘unmarked’ column or row increases the background noise in the experimental design slightly but makes your claims more general. For example, if you want to see whether mood interacts with the monotransitive case (DV = m), it is preferable to say “the probability that the clause is marked as monotransitive is affected by mood” rather than “if the transitivity and mood are stated, the probability that the clause is marked as monotransitive is affected by mood”. Note that predicting the unmarked outcome (DV = 0) may not be very useful (except to detect errors).
  3. We set up a simple chi-square test for each outcome of the dependent variable as before. The test compares the observed distribution for one outcome (DV = m, DV = d, ...) with an expected distribution based on the total (DV = <any>). The expected distribution is scaled as before, so that both distributions sum to the same number of cases.
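Case: CL. Columns: dependent variable (transitivity). Rows: independent variable (mood).

CL       | DV = m            | DV = d           | ... | TOTAL
---------+-------------------+------------------+-----+------------
IV = e   | CL(exclam, montr) | CL(exclam, ditr) |     | CL(exclam)
IV = i   | CL(inter, montr)  | CL(inter, ditr)  |     | CL(inter)
...      |                   |                  |     |
IV = 0   |                   |                  |     |
         | observed          |                  |     | expected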

Observed and expected distributions for DV = m in the first contingency table
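
A minimal Python sketch of this test follows. It assumes invented frequencies (not ICE-GB counts) and a hypothetical helper name, `chi_square_gof`. The critical values are the standard tabulated ones for the 0.05 level.

```python
def chi_square_gof(observed, totals):
    """Compare one DV column (observed) with the TOTAL column (expected).

    The TOTAL column is scaled so that the expected distribution sums to
    the same number of cases as the observed distribution.
    """
    scale = sum(observed) / sum(totals)
    expected = [t * scale for t in totals]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return chi2, df, expected


# Standard chi-square critical values at the 0.05 level, indexed by df.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

# Invented frequencies, for illustration only; rows are IV = e, i, 0.
observed_m = [20, 300, 4680]   # the DV = m column
total_col = [30, 400, 7570]    # the TOTAL column

chi2, df, expected = chi_square_gof(observed_m, total_col)
print("expected:", expected)
print("chi-square = %.3f with %d degrees of freedom" % (chi2, df))
print("significant at 0.05:", chi2 > CRITICAL_05[df])
```

The same function can be called once per outcome of the dependent variable, passing the DV = d column, and so on, in place of `observed_m`.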

You can perform a single chi-square test for the entire table, as before, to see if there is an interaction going on, without specifying where.
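
The whole-table test can be sketched in the same way, here as an ordinary chi-square test of independence on an r x c table. The frequencies are again invented for illustration and the function name `chi_square_table` is hypothetical.

```python
def chi_square_table(table):
    """Chi-square test of independence for an r x c table of frequencies.

    `table` is a list of rows and excludes the TOTAL row and column.
    Each expected cell is (row total x column total) / grand total.
    """
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_sums) - 1) * (len(col_sums) - 1)
    return chi2, df


# Invented frequencies: rows are IV = e, i, 0; columns are DV = m, d, 0.
freq = [
    [40, 20, 20],
    [300, 40, 60],
    [4660, 540, 2320],
]

chi2_all, df_all = chi_square_table(freq)
print("chi-square = %.3f with %d degrees of freedom" % (chi2_all, df_all))
# The standard critical value for df = 4 at the 0.05 level is 9.488.
print("interaction detected at 0.05:", chi2_all > 9.488)
```

A significant result here says only that mood and transitivity interact somewhere in the table; the per-outcome tests are still needed to locate the effect.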

In summary: we define what we mean by a case, either explicitly - “it’s a clause” - or implicitly - “here are x alternative types of a case”, and collect frequency statistics separately for each cell in the table. A variable is completely enumerated for the dataset if the cell frequencies in each row or column always add up to the total for that row or column.
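
The enumeration condition is easy to check mechanically. The sketch below assumes invented frequencies and a hypothetical function name, `is_fully_enumerated`; it simply tests whether the cells in every row and column add up to the stated total.

```python
def is_fully_enumerated(cells, row_totals, col_totals):
    """True if every row and every column of cells sums to its stated total."""
    rows_ok = all(sum(cells[r].values()) == row_totals[r] for r in row_totals)
    cols_ok = all(
        sum(row[c] for row in cells.values()) == col_totals[c]
        for c in col_totals
    )
    return rows_ok and cols_ok


# Invented frequencies, for illustration only.
# Marked cells only, checked against totals from the more general FTFs.
marked = {
    "exclam": {"montr": 20, "ditr": 5},
    "inter": {"montr": 300, "ditr": 40},
}
marked_ok = is_fully_enumerated(
    marked,
    {"exclam": 30, "inter": 400},
    {"montr": 5000, "ditr": 600},
)

# The same table with the unmarked row and column ("0") included.
full = {
    "exclam": {"montr": 20, "ditr": 5, "0": 5},
    "inter": {"montr": 300, "ditr": 40, "0": 60},
    "0": {"montr": 4680, "ditr": 555, "0": 2335},
}
full_ok = is_fully_enumerated(
    full,
    {"exclam": 30, "inter": 400, "0": 7570},
    {"montr": 5000, "ditr": 600, "0": 2400},
)

print("marked cells only:", marked_ok)
print("with unmarked cells:", full_ok)
```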

We will work through some real examples on the following page.

FTF home pages by Sean Wallis and Gerry Nelson.
Comments/questions to s.wallis@ucl.ac.uk.

This page last modified 12 June, 2013 by Survey Web Administrator.