Corpus Queries

Development of an effective grammatical query methodology in the context of a parsed corpus

Ref: R 000 22 2598
Institution: University College London
Department: Department of English (Survey of English Usage)
Investigator: Sean Wallis
Period: 1 March 1998 to 31 January 1999 (leave of absence in November 1998)

Summary of Research Results

We have produced a software platform to explore and research a parsed text corpus. Parsed corpora (sometimes called ‘treebanks’) are collections of text samples that contain grammatical tree analyses, one for each sentence. These corpora have a number of applications, including: improving natural language processing software, teaching grammar, and linguistic research. Many of the components of the platform we have developed will have significant reuse in these areas.

The software, ICECUP III, was developed in parallel with the completion of the parsed million-word British Component of the International Corpus of English (ICE-GB). We produced a number of prototypes of ICECUP implementing two distinct, but complementary, methodologies for exploiting this parsed corpus. These are:

  1. Fuzzy Tree Fragments (FTFs): a formal representational scheme to allow users to specify grammatical queries, and
  2. Cyclic Corpus Exploration: a mode of corpus engagement, allowing a user to evolve their understanding of the grammar in parallel with their queries. This addresses the fact that even experienced linguists cannot expect to have a complete knowledge of the realisation of a detailed grammar in a large corpus.

This software was disseminated to a range of users during the project: experienced linguists and novices alike evaluated early versions and applied them to their own research. In November 1998, ICECUP 3.0 was published on the internet on the ICE-GB web site at the Survey of English Usage (SEU) at University College London (http://www.ucl.ac.uk/english-usage). This internet publication includes a 20,000 word sample corpus. The SEU also published ICECUP with the entire million-word ICE-GB corpus on CD-ROM, available for education and research use at cost. The ICE-GB web site included an on-line questionnaire for users to provide feedback and gain technical support.

Fuzzy Tree Fragments are regular topological queries in the form of grammatical trees. To construct a query, users draw a tree. However, unlike complete grammatical trees in the corpus, FTFs are general and incomplete.

FTFs have been kept as simple as possible. There are two kinds of component in an FTF: tree nodes and text unit elements (words, pauses and punctuation). Tree nodes form a tree, which means that two nodes at the same level in an FTF must have a common ‘parent’ node. Each leaf (‘childless’) node has a text unit element. FTFs employ two broad classes of structural relationship.

  1. Unary relationships are structural properties of a single tree node or text unit element. These include the statement that a node is, or is not, the root, a leaf, or the first or last child in a sequence of child nodes. For text unit elements, they specify whether they are the first or last word in the sentence.
  2. Binary relationships relate the position of one tree node or text unit element to another. The parent of a node in an FTF is either ‘immediately’ or ‘eventually’ connected. The next child in a sequence of children may be ‘just after’ another, ‘just before or just after’, ‘following after’, or ‘before or after’, or the relationship may be unknown. Text unit elements take a parallel set of relations. Finally, one may specify that two tree nodes must be located on different branches.
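The two classes of relationship above can be pictured as fields on a small data structure. The following is only an illustrative sketch, not ICECUP's internal representation: the class names, field names and the node labels ("CL", "NP") are assumptions introduced here, and the ‘different branches’ constraint and the parallel relations between text unit elements are left out for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ParentLink(Enum):
    """Binary relationship between an FTF node and its parent."""
    IMMEDIATE = "immediately"   # the parent is the direct parent
    EVENTUAL = "eventually"     # the parent is some ancestor


class NextLink(Enum):
    """Binary relationship between an FTF node and the next child."""
    JUST_AFTER = "just after"
    JUST_BEFORE_OR_AFTER = "just before or just after"
    AFTER = "following after"
    BEFORE_OR_AFTER = "before or after"
    UNKNOWN = "unknown"


@dataclass
class FTFNode:
    # Hypothetical grammatical label; None leaves the node unspecified.
    label: Optional[str] = None
    # Unary relationships: True, False, or None for "not specified".
    is_root: Optional[bool] = None
    is_leaf: Optional[bool] = None
    is_first_child: Optional[bool] = None
    is_last_child: Optional[bool] = None
    # Binary relationships to the parent and to the next sibling.
    parent_link: ParentLink = ParentLink.IMMEDIATE
    next_link: NextLink = NextLink.UNKNOWN
    # Text unit element attached to a leaf node (word, pause, punctuation).
    word: Optional[str] = None
    children: List["FTFNode"] = field(default_factory=list)


# A clause at the root that 'eventually' contains a noun phrase.
np = FTFNode(label="NP", parent_link=ParentLink.EVENTUAL)
clause = FTFNode(label="CL", is_root=True, children=[np])
```

Leaving a unary flag as `None` is what makes the fragment ‘fuzzy’ in this sketch: the query says nothing about that property, so any corpus node may satisfy it.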

We have found that this definition of FTFs is intuitive, yet sufficiently expressive for highly specialised queries. FTFs were designed for tree-based queries, but the system is equally applicable to text-based ones. Word and wordclass sequences are implemented as FTFs. One can even produce a kind of ‘hybrid’ query, introducing tree structure from the text upwards.

A proof system was devised to exhaustively match FTFs to corpus trees (i.e., to find every different matching combination). This is optimised utilising tree topology. ICECUP also exploits indices, so it can ignore sentences omitting any of an FTF’s components. ‘Simple FTFs’ (a single node or word with no topological restrictions) just extract a single index.
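To make the idea of exhaustive matching with index pre-filtering concrete, here is a minimal sketch under strong simplifying assumptions. It is not the ICECUP proof system: the `Tree` and `Pattern` classes, the labels, and the functions `build_index`, `match_at` and `search` are all hypothetical, only the ‘immediately’ and ‘eventually’ parent links are modelled, and the order of siblings is treated as unknown.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, Iterator, List, Set, Tuple


@dataclass
class Tree:
    label: str
    children: List["Tree"] = field(default_factory=list)


@dataclass
class Pattern:
    label: str
    eventual: bool = False  # True: any descendant of the parent may match
    children: List["Pattern"] = field(default_factory=list)


def descendants(node: Tree) -> Iterator[Tree]:
    for child in node.children:
        yield child
        yield from descendants(child)


def all_nodes(node: Tree) -> Iterator[Tree]:
    yield node
    yield from descendants(node)


def pattern_labels(p: Pattern) -> Set[str]:
    return {p.label}.union(*(pattern_labels(c) for c in p.children))


def build_index(corpus: List[Tree]) -> Dict[str, Set[int]]:
    """Map each label to the ids of the sentences that contain it."""
    index: Dict[str, Set[int]] = {}
    for sid, tree in enumerate(corpus):
        for node in all_nodes(tree):
            index.setdefault(node.label, set()).add(sid)
    return index


def match_at(p: Pattern, node: Tree) -> Iterator[List[Tuple[str, Tree]]]:
    """Yield every way of matching p with its top node bound to `node`."""
    if p.label != node.label:
        return
    per_child = []
    for sub in p.children:
        pool = descendants(node) if sub.eventual else node.children
        found = [m for cand in pool for m in match_at(sub, cand)]
        if not found:
            return  # one unmatched component rules out this binding
        per_child.append(found)
    # Every combination of child matches is a distinct overall match.
    for combo in product(*per_child):
        yield [(p.label, node)] + [pair for part in combo for pair in part]


def search(p: Pattern, corpus: List[Tree], index: Dict[str, Set[int]]):
    # Pre-filter: skip sentences lacking any label the pattern needs.
    candidates = set(range(len(corpus)))
    for label in pattern_labels(p):
        candidates &= index.get(label, set())
    results = {}
    for sid in sorted(candidates):
        hits = [m for n in all_nodes(corpus[sid]) for m in match_at(p, n)]
        if hits:
            results[sid] = hits
    return results


corpus = [
    Tree("CL", [Tree("NP", [Tree("N")]),
                Tree("VP", [Tree("V"), Tree("NP", [Tree("N")])])]),
    Tree("CL", [Tree("VP", [Tree("V")])]),
]
query = Pattern("CL", children=[Pattern("NP", eventual=True)])
results = search(query, corpus, build_index(corpus))
```

In this toy corpus the second sentence contains no "NP" node, so the index rules it out before any tree is traversed, while the first sentence yields one match for each of its two noun phrases.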

The cyclic corpus exploration methodology was based around FTFs, supported by a number of other innovations. The ‘corpus map’ provides a structured sociolinguistic overview of the corpus. To examine part of the corpus, the user performs a query. A ‘query’ may be an FTF, a sociolinguistic query (e.g., speaker gender), a specific subtext or a speaker, or a random sample. The query and results are displayed in a ‘text viewer’ window. Tree analyses of individual sentences may be explored in a further window. If a user finds an interesting grammatical construction in a tree, she may apply a tool called the ‘FTF Creation Wizard’ to isolate the structure and form an FTF. This may be edited before being applied to the corpus. A novice can rapidly create a sophisticated grammatical query from the corpus itself. We thus have a cycle: search → examine → abstract → edit/refine → search.

Queries are combined in ICECUP using propositional logic. The interface allows a user to ‘drag and drop’ two queries together and edit the logical relationships between them (the text view reflects this change immediately).

Sophisticated concordancing is supported by FTFs. Concordancing is a way of summarising the results of corpus queries by aligning sentences horizontally according to a point of interest. Whereas traditional concordancing is applied to words and wordclasses, ICECUP supports ‘key construction in line’ concordances.

We have continued to develop ICECUP since the November release. Version 3.1 will be available later this year, including a number of refinements based on users’ comments. At present, it includes improvements in several areas, including FTFs, search, concordancing, context retrieval, and support for novice users. It also supports the synchronised playback of the original sound recordings.
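As a rough illustration of the propositional combination of queries described above, one can think of each query's result as a set of sentence identifiers, with logical operators acting as set operations. This is an assumption made for the purpose of the example, not a description of how ICECUP stores results; the function names and the sample identifiers are hypothetical.

```python
from typing import Set


def q_and(a: Set[int], b: Set[int]) -> Set[int]:
    """Sentences matched by both queries."""
    return a & b


def q_or(a: Set[int], b: Set[int]) -> Set[int]:
    """Sentences matched by either query."""
    return a | b


def q_not(a: Set[int], universe: Set[int]) -> Set[int]:
    """Sentences in the corpus that the query does not match."""
    return universe - a


# Hypothetical results over a six-sentence corpus.
corpus_ids = set(range(6))
ftf_hits = {0, 2, 3, 5}        # sentences matching some FTF
female_speaker = {1, 2, 5}     # sentences matching a sociolinguistic query

ftf_and_female = q_and(ftf_hits, female_speaker)
ftf_and_not_female = q_and(ftf_hits, q_not(female_speaker, corpus_ids))
```

Editing the logical relationship between two queries then amounts to swapping one operator for another and recomputing the combined set.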

A number of papers have been, or are about to be, published about FTFs and ICECUP, aimed at varied audiences: English language (Aarts, Nelson and Wallis, 1998), teaching (Aarts and Nelson, 1999), corpus linguistics (Wallis, Aarts and Nelson, 1999; Wallis, 2003), CL tools (Wallis and Nelson, 2000a), and scientific knowledge discovery (Wallis and Nelson, 2000b). A book on ICE-GB (Nelson, Wallis and Aarts, 2002) will be published by John Benjamins.

References

Aarts, B., Nelson, G., and Wallis, S.A. (1998) Using fuzzy tree fragments to explore English grammar. English Today, 14(3): 52-56.

Aarts, B., Nelson, G., and Wallis, S.A. (1999) Global resources for a global language: English language pedagogy in the modern age. In Claus Gnutzmann (ed.) English as a Global Language: Native and Nonnative Perspectives. Tübingen: Stauffenburg Verlag. pp273-290.

Nelson, G., and Aarts, B. (1999) Investigating English around the World: The International Corpus of English. In R.S. Wheeler (ed.), The Workings of Language: From Prescriptions to Perspectives, New York: Praeger. pp107-116.

Nelson, G., Wallis, S.A., and Aarts, B. (2002) Exploring Natural Language: Working with the British Component of the International Corpus of English (Varieties of English around the World series), Amsterdam: Benjamins.

Wallis, S.A. (2003) Completing parsed corpora: from correction to evolution. In A. Abeillé (ed.), Treebanks: Building and using syntactically annotated corpora, Kluwer.

Wallis, S.A., Aarts, B., and Nelson, G. (1999) Parsing in reverse - Exploring ICE-GB with Fuzzy Tree Fragments and ICECUP. In J.M. Kirk (ed.), Corpora Galore: papers from the 19th International Conference on English Language Research on Computerised Corpora, ICAME-98, Amsterdam: Rodopi. pp335-344.

Wallis, S.A., and Nelson, G. (2000a) Exploiting fuzzy tree fragments in the investigation of parsed corpora. Literary and Linguistic Computing, 15(3): 339-361.

Wallis, S.A., and Nelson, G. (2000b) Knowledge discovery in grammatically analysed corpora. Data Mining and Knowledge Discovery, Kluwer.

This page last modified 14 May, 2020 by Survey Web Administrator.