A Quick Guide to the TOSCA/ICE Grammar

Function, Category, and Wordclass Labels

The TOSCA/ICE Grammar is the annotation scheme used in the parsed ICE-GB and DCPSE corpora.

It is a phrase structure grammar based on Quirk et al. (1985). Each node in the tree consists of three elements: a function, a category, and a set of features. By default ICECUP displays these in the following locations.

These pages contain a quick reference guide to all the grammatical labels used in the corpus.

Much more information is available in the online Help file supplied with the corpus. Below is only a very brief outline.

Our ICECUP software is designed from the ground up around the problem that all users have to ‘learn the grammar in order to explore the grammar’. So: don’t be put off by the apparent complexity of the grammar. The software is very forgiving and you will learn by trying things out for yourself.

Outline of the scheme

Individual terms form phrase structure trees. The scheme can be thought of at two levels.

1. Part of speech tagging

Consider the sentence ‘Never can it’ in ICE-GB/DCPSE (S1B-010/DI-B80 #107). This is tagged as follows:

Never ADV(ge) can AUX(modal,pres) it PRON(pers,sing)    [1]

Part of speech tags, or ‘wordclass tags’, classify words into types. Here these are adverb, auxiliary verb, and pronoun. Tags may also include features, given in brackets. Here we have the following: general adverb, modal present tense auxiliary, personal singular pronoun.

The problem with part of speech tagging is that it does not record the structure of the sentence.
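To make the tag notation concrete, here is a minimal Python sketch that splits a tagged string like the one above into words, wordclasses, and features. It is only an illustration: the function names and the regular expression are hypothetical, and it assumes the simple ‘word TAG(feature,feature)’ layout shown in this example rather than any ICECUP export format.

```python
import re

# Assumed layout: an upper-case wordclass, optionally followed by
# a bracketed, comma-separated feature list, e.g. AUX(modal,pres).
TAG_RE = re.compile(r"^([A-Z]+)(?:\(([^)]*)\))?$")

def parse_tag(tag):
    """Split a tag such as 'AUX(modal,pres)' into ('AUX', ['modal', 'pres'])."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognised tag: {tag!r}")
    wordclass, feature_text = match.group(1), match.group(2)
    features = [f.strip() for f in feature_text.split(",")] if feature_text else []
    return wordclass, features

def parse_tagged(text):
    """Read alternating 'word TAG' pairs into (word, wordclass, features) triples."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating word/tag pairs")
    return [(word, *parse_tag(tag)) for word, tag in zip(tokens[::2], tokens[1::2])]

tagged = parse_tagged("Never ADV(ge) can AUX(modal,pres) it PRON(pers,sing)")
for word, wordclass, features in tagged:
    print(word, wordclass, features)
```

Note that the result is a flat list: nothing in it says which words group together, which is the limitation discussed next.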

We rectify this by fully parsing the sentence. The part of speech tags sit at the ‘leaves’ of the phrase structure tree. The tree then looks like this.

2. Phrase structure trees

A parse analysis attempts to provide a structural account of the syntax of a sentence.

Each POS node has a function (shaded in blue here for clarity). In a phrase structure grammar such as ICE, functions define the role a node plays within the structure of the parent branch.

Thus, in this example, the general adverb, ADV(ge), Never is the head (AVHD) of the adverb phrase. The general feature, ge, is said to percolate up (purple arrow) to the adverb phrase, AVP. The role of the adverb phrase is in turn simply described as an adverbial, A, within the host clause (CL) which makes up the parsing unit, PU.

Likewise, the modal auxiliary can is labelled an inverted operator, INVOP (cf. It can never), while the subject (SU) of the clause is an NP consisting of the personal singular pronoun, it.

The present, pres, feature percolates up to the clause from the auxiliary. In other words, the clause is said to be in the present tense because can is a present tense modal auxiliary.
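The analysis described above can be sketched as a small data structure. The Python below is an illustration under stated assumptions, not ICECUP's internal representation: the Node class and helper names are hypothetical, only the features mentioned in the text are recorded, and the NPHD function label on the pronoun is assumed, since the text above names only the SU and NP labels.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    function: str                       # role within the parent, e.g. SU
    category: str                       # phrase or wordclass, e.g. NP, PRON
    features: List[str] = field(default_factory=list)
    word: Optional[str] = None          # set on leaf (POS) nodes only
    children: List["Node"] = field(default_factory=list)

    def label(self):
        feats = f"({','.join(self.features)})" if self.features else ""
        return f"{self.function},{self.category}{feats}"

def leaves(node):
    """Return the leaf nodes (the part of speech tags) left to right."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]

def render(node, depth=0):
    """Return the tree as indented text lines, one node per line."""
    line = "  " * depth + node.label()
    if node.word is not None:
        line += f" '{node.word}'"
    lines = [line]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines

def percolation_sources(parent, feature):
    """Children that carry a feature also present on the parent."""
    if feature not in parent.features:
        return []
    return [c for c in parent.children if feature in c.features]

tree = Node("PU", "CL", ["pres"], children=[
    Node("A", "AVP", ["ge"], children=[
        Node("AVHD", "ADV", ["ge"], word="Never"),
    ]),
    Node("INVOP", "AUX", ["modal", "pres"], word="can"),
    Node("SU", "NP", children=[
        # NPHD is an assumed label; the text gives only SU and NP.
        Node("NPHD", "PRON", ["pers", "sing"], word="it"),
    ]),
])

print("\n".join(render(tree)))
```

Here percolation_sources(tree, "pres") picks out the INVOP node, mirroring the statement that the clause is present tense because of its auxiliary.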

See also:

Extensive information is available in the online Help file supplied with the corpus.

This may be freely downloaded with the complete ICE-GB Sample Corpus. It also contains cross-references to the ICECUP software, so you can find lots of real examples in the sample corpus!

Notes

1. Of course this sentence is ambiguous! It could be tagged so that can is treated as an infinitive verb, (colloq.) “to put in trash”, i.e., don't ever can it.

Our reading is Never can it [be the case that...]. Our interpretation is supported by the context, something that automatic taggers and parsers struggle with.

This is just one illustration of why a parsed and corrected corpus is a better test dataset for linguistic research than a set of sentences pushed through an algorithm.

References

Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985). A Comprehensive Grammar of the English Language. London: Longman.

This page last modified 12 June, 2013 by Survey Web Administrator.