The Survey of English Usage
Annual Report 2005

1. News

We are very pleased to announce that the Survey was awarded a major new grant by the Economic and Social Research Council (ESRC) entitled Next generation tools for linguistic research in grammatical treebanks (R000 231286). The project started on 1 January 2006 and will run for two years. For more details, see section 2.1 below.

The Survey has completely redesigned its website at www.ucl.ac.uk/english-usage. One of the new features is the webpage 'A brief history of the Survey of English Usage', which we hope to expand and improve over time. It includes the biographies of the 'Gang of Four' authors of the Comprehensive grammar of the English language, Randolph Quirk, Sidney Greenbaum, Geoffrey Leech and Jan Svartvik.

Rodney Huddleston, one of the world's most eminent grammarians of the English language, was awarded a DLit honoris causa by UCL in September 2005.

Gerry Nelson chaired a panel on the ICE project at the International Association of World Englishes (IAWE) Conference, Purdue University, Indiana, July 20-23, 2005. He attended the launch of ICE Ireland at Queen's University Belfast in November 2005. Together with former colleagues at The University of Hong Kong, he launched a website for Hong Kong secondary schools, called Reach Out for English. It is partly based on the Internet Grammar of English website, and contains interactive grammar and spelling exercises. The website was very favourably reviewed in the South China Morning Post. For details of further activities of members of the Survey, see section 4.

Yordanka Kavalova organised a linguistics seminar group. So far we have had the following speakers: Yordanka Kavalova, Gabriel Ozón, Dr Gerry Nelson, Dr Mariangela Spinillo, Professor Ruslan Mitkov, Dr Eva Eppler and Sean Wallis.

2. Research

2.1 New project: Next Generation Tools

As noted above, the Survey was awarded funding by the ESRC for a new two-year project, starting in January 2006, to develop a new 'next generation' software suite for carrying out experiments on parsed corpora.

The aim of the research is to develop, in conjunction with an existing community of researchers, a software environment for carrying out experimental research in corpus linguistics, with a particular emphasis on parsed text corpora. Corpora are collections of (transcribed) spoken and written passages, annotated in a number of different ways. A parsed corpus is one where the grammatical structure of each sentence is included. As corpora grow in size and complexity (typical parsed corpora contain around one million words), corpus linguistics research becomes increasingly dependent on computer software support.

Previous research has focused on developing an effective, linguistically meaningful, grammatical query system. This transformed possibilities in corpus linguistics. Searches which had previously taken weeks could be performed in seconds. Simple experiments could be constructed around these queries, the results being analysed manually. Where a previous generation of researchers has focused on studies of distributions and differences between samples, or used a corpus as a starting point for philosophical discussions of language, the parsed ICE-GB corpus and its accompanying software can be used to investigate how one aspect of a linguistic choice (e.g. use of shall vs. use of will) can affect another.

While the range of potential research possibilities continues to expand, experiments carried out with current software require careful planning and are difficult, time-consuming and prone to error. There is no easy fit between directed searches and enumerative approaches to research. What's more, the separation of statistical analysis from the corpus means that it is difficult to work out how alternative hypotheses interact.

In this project we aim to extend the existing exploration platform (ICECUP) into one that supports the entire experimental cycle, which involves:

  • defining the structure of an experiment and its variables;
  • extracting a sample from the corpus;
  • carrying out an appropriate statistical analysis on this sample, and
  • evaluating the results with respect to the original corpus from which they are obtained.

Supporting this formal approach - which presumes that the researcher knows precisely how to express what s/he is looking for in advance - are algorithms which define variables from actually existing variation in the corpus. One can define discrete variables by underspecifying a query and asking the software to list the alternatives that exist in the corpus. A similar approach elaborates numeric (integer) variables.
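As a rough illustration of defining a discrete variable from an underspecified query, the following Python sketch fixes some constraints, leaves one 'slot' open, and lists the values that fill that slot in a toy data set. The data layout, the function name `enumerate_slot` and the example cases are assumptions made purely for illustration; they do not describe ICECUP's data model or any part of the planned software.

```python
from collections import Counter

# Toy "corpus": each case is a dict of annotations (hypothetical layout).
CASES = [
    {"function": "SU", "category": "NP", "text_type": "spoken"},
    {"function": "SU", "category": "CL", "text_type": "written"},
    {"function": "OD", "category": "NP", "text_type": "spoken"},
    {"function": "SU", "category": "NP", "text_type": "written"},
]


def enumerate_slot(cases, fixed, slot):
    """Count the values of `slot` among cases matching every `fixed` constraint."""
    counts = Counter()
    for case in cases:
        if all(case.get(key) == value for key, value in fixed.items()):
            counts[case[slot]] += 1
    return counts


# Underspecified query: the function is fixed to subject, the category is left open.
subject_categories = enumerate_slot(CASES, {"function": "SU"}, "category")
print(dict(subject_categories))
```

Each distinct value found for the open slot becomes one value of the new variable, so the variable reflects the variation that actually exists in the data rather than a list fixed in advance.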

Once data has been collected the platform will support a variety of statistical and computational methods for analysis. The structure of chosen variables (discrete, hierarchical, ordinal) specifies the statistical test to be used. We will also provide algorithms which look at more than one predictor variable at a time and search for combinations of variables which most cogently predict the target linguistic event. Researchers will then be able to consider alternative hypotheses. A researcher faced with alternative explanations of the same phenomenon will wish to know just how independent one hypothesis is from another. We will provide analysis visualisation tools to identify how cases justifying particular hypotheses overlap, and identify these cases in the corpus. Evaluation of results is then by reference to wider linguistic interpretations and with respect to original sentences.

The approach is highly general and extensible. We use the medium of grammatical investigations into parsed corpora for reasons of access (both to data and a community of active researchers), and because in this field there is a substantial existing literature for comparison. The addition of further levels of annotation in corpora offers the promise of future work in constructing theories that relate these levels to grammar. In this project the crucial question is to ensure that the approach is valid and usable by linguists.

ICECUP is a general purpose 'exploratory' platform, designed to make it easy for people to find examples worthy of study and to collect simple statistics.

The idea is that the new software will take ICECUP as its starting point, but will be expanded with a suite of tools to allow the user to define variables, extract a formal sample and analyse this sample in a variety of ways. We have demonstrated at various conferences over the years that ICECUP can be used to carry out simple experiments, and guidance for this can be found on our website and in Nelson, Wallis and Aarts (2002). But we also recognise that this can be a difficult and error-prone process, and that we can all find it hard to 'scale up' our experiments. The simple act of considering an alternative hypothesis involving a new variable may require a lengthy process of applying new queries and recalculation.

One new facility in ICECUP 3.1 hints at the power of the basic idea of doing experiments with the software. Drag and drop can be used to add columns of statistics in the corpus map and lexicon. It is possible to drop an FTF into the corpus map, and see its frequency across, say, the text category variable. The software can then calculate whether the distribution of your FTF differs significantly from another (say, that of a more general query). This tells you if this particular FTF is chosen more frequently in certain text types than in others. This is all done automatically, and as a result is very powerful. However, this is as far as the present software goes.
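The kind of comparison described above can be sketched with a chi-square goodness-of-fit test, in which the frequencies of a specific query across text categories are compared with those expected from a more general query. This is one common choice of test and is used here as an assumption; the report does not state which test ICECUP applies. The counts, the function name `goodness_of_fit` and the category labels are invented for illustration.

```python
# Critical values of chi-square at p = 0.05, indexed by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def goodness_of_fit(observed, baseline):
    """Chi-square statistic comparing `observed` with the shape of `baseline`.

    Assumes every baseline count is greater than zero.
    """
    total_observed = sum(observed.values())
    total_baseline = sum(baseline.values())
    chi2 = 0.0
    for category, base_count in baseline.items():
        # Expected frequency if the specific query followed the baseline distribution.
        expected = total_observed * base_count / total_baseline
        chi2 += (observed.get(category, 0) - expected) ** 2 / expected
    degrees_of_freedom = len(baseline) - 1
    return chi2, degrees_of_freedom


# Hypothetical frequencies of a specific query and of a more general one.
ftf_counts = {"spoken": 70, "written": 30}
general_counts = {"spoken": 500, "written": 500}

chi2, df = goodness_of_fit(ftf_counts, general_counts)
significant = chi2 > CRITICAL_05[df]
print(chi2, df, significant)
```

With these invented figures the specific query is over-represented in the spoken category relative to the general query, and the statistic exceeds the critical value for one degree of freedom.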

In the new platform it will be possible to modify sociolinguistic variables or define your own, including lexical and grammatical variables. The new suite will revolve around the notion of an experimental sample, rather than a simple index of results. This makes it much easier to evaluate the interplay of two or more grammatical phenomena. The program will be able to consider the effect of several variables in order to contrast explanations for a particular phenomenon. The result will be a new 'experimental' rather than 'exploratory' platform.

We think that such an experimental platform could also integrate other excellent corpus linguistics methods into a single experimental cycle. For example n-gram methods could be used to define variables, by defining an outline of an FTF and then getting the software to list the different instantiations of a particular 'slot' that it finds, each different case being a 'value' of the variable. Numeric variables can be generated in a similar, corpus-sifting manner, e.g. to count the number of clauses below a node.
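A numeric variable of the kind mentioned above, the number of clauses below a node, can be sketched as a simple recursive count over a tree. The `Node` class, the `count_below` function and the toy tree are assumptions for illustration only, and the labels are used loosely rather than as a faithful ICE-GB analysis.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    function: str
    category: str
    children: List["Node"] = field(default_factory=list)


def count_below(node, category):
    """Count descendants of `node` (excluding `node` itself) with the given category."""
    total = 0
    for child in node.children:
        if child.category == category:
            total += 1
        total += count_below(child, category)
    return total


# Toy tree: a main clause whose object is a clause containing a further clause.
tree = Node("PU", "CL", [
    Node("SU", "NP"),
    Node("VB", "VP"),
    Node("OD", "CL", [
        Node("SU", "NP"),
        Node("VB", "VP"),
        Node("A", "CL"),
    ]),
])

n_clauses = count_below(tree, "CL")
print(n_clauses)
```

Applied to every matching case in a sample, a count like this yields one integer per case, which can then be treated as a numeric variable in an experiment.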

The software will initially work with all ICECUP-compatible corpora. It will be ideal for teaching experimental methods and statistics in corpus linguistics.

The project runs until December 2007. We expect to have the first beta version of the software available in early 2007 for those who are interested.

For more information see the webpage for the project, www.ucl.ac.uk/english-usage/projects/next-gen.

2.2 ICE-GB Release 2

We will shortly be publishing Release 2 of ICE-GB. The material in the corpus will be synchronised with sound recordings for the spoken part of the corpus, which can optionally be purchased separately (see below). At the time of writing we are checking this synchronisation before the release. Like the DCPSE corpus (see section 2.4 below), ICE-GB will be supplied on a CD with the new ICECUP 3.1 software. See also section 2.3.

To cover our costs, for the new corpus and software package there will be a small upgrade price for those who bought ICE-GB Release 1.

We have halved the cost of the student licence.

The sound recordings will be available as a set of CDs containing uncompressed 'wave' files that should ideally be installed on a hard disk. Anyone who has already bought a set of pre-release sound recordings will be sent a finalised set of CDs for free if they choose to upgrade.

2.3 ICECUP 3.1

After many years of development, we are very pleased to announce the publication of ICECUP 3.1 for use with ICE-GB (Release 2) and the DCPSE corpus (see section 2.4 below). ICECUP 3.1 is an 'evolutionary' advance on ICECUP 3.0.

Versions of ICECUP 3.1 have been demonstrated at conferences over the years. Beta versions have been available from our website for over a year, and have been tested by a number of colleagues.

We have delayed this release until we were satisfied with two things: first that the program was very stable, and second, that some of the more advanced features, such as defining logic in tree nodes, were properly supported by the user interface and straightforward to carry out.

New features in ICECUP 3.1 include a lexicon and a grammaticon, providing an overview of distributions of words and tags, and grammatical nodes, respectively. These overview tools contain user-defined tables of statistics allowing users to explore, for example, the distribution of any word or Fuzzy Tree Fragment (FTF) across the corpus map, or how the lexicon for speech and writing (or in DCPSE, for 1960s English and 1990s English) may differ. These tables can also be output, e.g. to Excel.

Fuzzy Tree Fragments have been extended to support logic in nodes and complex 'wild card' nodal patterns.

  1. Each node in an FTF can consist of a single nodal pattern including:
    1. sets of functions and categories, optionally negated. For example, '{SU,NOSU},¬CL' = 'the function is either subject or notional subject and the category is not a clause'.
    2. feature sets organised where possible by feature class, including negation, and where a particular feature is unspecified. For example, '(cxtr,ditr)' = 'carries the feature complex transitive or ditransitive'.
    3. structural features, e.g. 'ignore'.
  2. The node can also consist of a logical combination of these nodal patterns. Logic permits you to express, for example, that the node either has the function of subject or the category of clause, '(SU, ∨ ,CL)'.
  3. A final option integrates exact matching into FTFs. For example, '=SU,CL' indicates a subject clause with no features.
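The logic of nodal patterns described in this list can be sketched in Python as follows. This is a minimal model under stated assumptions: the `NodePattern` class, its field names and the `any_of` helper are hypothetical, features and exact matching are left out, and nothing here reflects how ICECUP implements FTF nodes internally.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class NodePattern:
    functions: FrozenSet[str] = frozenset()   # empty set = function unspecified
    categories: FrozenSet[str] = frozenset()  # empty set = category unspecified
    negate_function: bool = False
    negate_category: bool = False

    def matches(self, function, category):
        function_ok = True
        if self.functions:
            # Membership test, flipped when the set is negated.
            function_ok = (function in self.functions) != self.negate_function
        category_ok = True
        if self.categories:
            category_ok = (category in self.categories) != self.negate_category
        return function_ok and category_ok


def any_of(*patterns):
    """Logical OR over several nodal patterns."""
    return lambda function, category: any(
        p.matches(function, category) for p in patterns
    )


# 'Subject or notional subject, and not a clause'.
subject_not_clause = NodePattern(
    functions=frozenset({"SU", "NOSU"}),
    categories=frozenset({"CL"}),
    negate_category=True,
)

# 'Either the function of subject or the category of clause'.
subject_or_clause = any_of(
    NodePattern(functions=frozenset({"SU"})),
    NodePattern(categories=frozenset({"CL"})),
)

print(subject_not_clause.matches("SU", "NP"))
print(subject_or_clause("OD", "CL"))
```

Treating a single nodal pattern as a conjunction of constraints and the logic panel as a combination of such patterns keeps the two levels separate, which mirrors the distinction the list draws between items 1 and 2.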

The 'word slots' in FTFs similarly support both logic and wild cards, allowing users to write expressions such as '{*ing ~thing}+<N>' (='noun ends in ing but is not thing'). In this way it is possible to search for particular morpheme prefixes and suffixes, although neither ICE-GB nor DCPSE is lemmatised.
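The meaning of an expression such as 'noun ends in ing but is not thing' can be approximated in Python with shell-style wildcards from the standard `fnmatch` module. The sketch does not parse ICECUP's own expression syntax: the pattern is translated by hand into an include list, an exclude list and a required tag. The `word_matches` function, the toy tokens and the decision to ignore case are all assumptions for illustration.

```python
from fnmatch import fnmatchcase


def word_matches(word, tag, include, exclude, required_tag):
    """True if the tag matches, some include pattern fits and no exclude pattern fits."""
    lowered = word.lower()  # assumption: matching ignores case
    return (
        tag == required_tag
        and any(fnmatchcase(lowered, pattern) for pattern in include)
        and not any(fnmatchcase(lowered, pattern) for pattern in exclude)
    )


# Toy (word, tag) pairs; the tags are simplified.
tokens = [
    ("building", "N"),
    ("thing", "N"),
    ("running", "V"),
    ("Ceiling", "N"),
    ("house", "N"),
]

hits = [
    word for word, tag in tokens
    if word_matches(word, tag, include=["*ing"], exclude=["thing"], required_tag="N")
]
print(hits)
```

Because the match operates on word forms only, it finds suffixes and prefixes as strings, which is consistent with the point above that neither corpus is lemmatised.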

These features of ICECUP 3.1 were summarised in our book Exploring Natural Language (Nelson, Wallis and Aarts 2002). However, when we wrote the book, we were conscious that it was still quite difficult to edit complex FTF expressions. As a result, the entire interface for editing FTF nodes has been completely overhauled, with the old static 'Edit Node' window replaced by a floating editor and inspector window with tabbed panels and controls for all the options listed above. Users can even drag and drop node patterns to and from the logic panel.

There are many other improvements to ICECUP in this release, including pre-indexed lexicon (word+tag) queries, parallel searching, mouse-based pan-and-zoom, quick-find controls, viewable context, more pop-up help, manual filtering of results, a new FTF Creation Wizard, and integrated playback of speech recordings. Integrated sound playback has been much anticipated, and will be available with the publication of ICE-GB Release 2, with an option of purchasing integrated sound files. See section 2.2.

ICECUP 3.1 also features a thoroughly revised on-line help manual covering all the new features. A new Getting Started booklet, featuring ICE-GB and DCPSE, will be published with the corpora.

We thank everyone who sent us comments and requests over the years. We've tried to incorporate all your suggestions.

2.4 Creating a parsed and searchable diachronic corpus of present-day spoken English (DCPSE)

We are pleased to announce the publication of a new corpus: the Diachronic Corpus of Present Day Spoken English (DCPSE). In linguistics a distinction is traditionally made between diachronic and synchronic approaches to the study of language. The former considers language through time, whereas the latter takes a 'snapshot' look at language viewed from the present. This dichotomy has recently been questioned by some linguists, who have argued that the distinction is an artificial one. They claim that languages change all the time, even synchronically. As a result of these new attitudes to language development there is an emerging research impetus in linguistics, which concerns itself with recent change.

The aim of this project was the construction of a diachronic corpus of spontaneous spoken English containing directly comparable material from the London-Lund Corpus (LLC) and the British Component of the International Corpus of English (ICE-GB). The resource has been made fully searchable with the International Corpus of English Corpus Utility Program exploration software (ICECUP). The main features are:

  • The corpus contains a total of 800,000 words of spoken English from comparable categories in the LLC and in ICE-GB (400,000 words from each corpus). The design of these corpora is similar, and it will thus be possible to study the linguistic features of analogous categories of spontaneous spoken English over time. As noted, in each case we have selected matching texts, and we cross-checked the structural markup and tagging in the LLC. We integrated the LLC and ICE-GB material. Very long monologue utterances (over 1,000 words) were broken into segments. These could then be read into ICECUP and indexed in an integrated fashion. ICECUP was originally developed to operate on ICE-GB. We modified it to handle the combined data in DCPSE. The corpus has been parsed consistently with ICE-GB, resulting in some 87,000 trees analysed with the full ICE analysis. This was carried out automatically to phrasal level and then corrected by hand.
  • DCPSE is the largest single collection of checked and parsed orthographically transcribed spoken English in the world.
  • We have written documentation for use with the corpus and software. The new corpus will provide linguists interested in recent linguistic changes in English with a new, innovative and searchable database containing spoken English covering a period of 25-30 years. We will disseminate the corpus via the Survey of English Usage website later this year. We believe that this resource offers unprecedented possibilities for new research into changes in English.
  • The corpus is supplied on a CD with the new ICECUP 3.1 software.

For more information, see http://www.ucl.ac.uk/english-usage/projects/dcpse.

3. Staff

Christine Bowles continues as the Survey's part-time Administrator.

Isaac Hallegua continues as Systems Administrator.

Gerry Nelson continues as Deputy Director of the SEU, and as coordinator of the International Corpus of English project.

Sean Wallis has been working as Senior Research Fellow on the ESRC project Creating a parsed and searchable diachronic corpus of present-day spoken English (DCPSE) and on the new version of ICECUP. From 1 January 2006 he has been working on the new ESRC project described above.

Our Research Assistants Yordanka Kostadinova-Kavalova and Gabriel Ozón have been completing the work on the DCPSE project, as has Lesley Kirk. Yordanka and Gabriel are completing their PhD theses and are also both working: Yordanka full-time at Palgrave/Macmillan and Gabriel at Roehampton University.

4. Publications, conference presentations, talks, theses and other studies using Survey material

Please let us know if you would like us to include your publications based on SEU material. We would appreciate it if you could send us offprints of any such publications.

Aarts, Bas (2005) Grammar. In: Keith Brown (ed.) The Encyclopedia of Language and Linguistics. Second edition. Oxford: Elsevier. 113-115 (vol. 5).

Aarts, Bas (2005) Subordination. In: Keith Brown (ed.) The Encyclopedia of Language and Linguistics. Second edition. Oxford: Elsevier. 248-255 (vol. 12).

Aarts, Bas (2005) (ed., with David Denison and Richard Hogg). English Language and Linguistics. Volumes 9.1 and 9.2.

Aarts, Bas (2005) Recent developments in the syntactic annotation of corpora: a demonstration of ICE-GB and DCPSE. Paper presented at the Ninth International Symposium on Social Communication, Centro de Lingüística Aplicada. Santiago de Cuba, Cuba.

Aarts, Bas (2005) Approaches to the English gerund. Plenary lecture at the First International Conference on the Linguistics of Contemporary English. University of Edinburgh.

Aarts, Bas (2005) Grammatical gradience: a constrained account. Paper presented at the Oxford English Dictionary Forum, Oxford.

Aarts, Bas (2005) Categorial gradience in grammar. Paper presented in the English Department, University of Munich.

Arai, Yoichi (2005) A corpus-based analysis of some neg-intensifying expressions. Paper presented at the 26th ICAME Conference, University of Michigan.

Dehé, Nicole and Yordanka Kostadinova-Kavalova (2005). Parenthetical what. Paper presented at the First International Conference on the Linguistics of Contemporary English. University of Edinburgh.

Depraetere, Ilse (2005) Non-deontic root necessity: a contradiction in terms? Paper presented at the First International Conference on the Linguistics of Contemporary English. University of Edinburgh.

Hoffmann, Thomas (2005) English relative clauses and construction grammar. Paper presented at the First International Conference on the Linguistics of Contemporary English. University of Edinburgh.

Hundt, Marianne (2005) The committee has/have decided ... On concord patterns with collective nouns in inner and outer circle varieties of English. Paper presented at the 26th ICAME Conference, University of Michigan.

Kaltenböck, Gunther (2005) It-extraposition in English. International Journal of Corpus Linguistics 10.2, 119-159.

Kaltenböck, Gunther and Mehlmauer-Larcher, Barbara (2005) Computer corpora and the language classroom: on the potential and limitations of computer corpora in language teaching. ReCALL 17.1. 65-84.

Kirk, John, Jeffrey Kallen, Anne Rooney, and Orla Lowry (2005) Pragmatics, prosody, and syntax in spoken language corpora. Paper presented at the 26th ICAME Conference, University of Michigan.

Kostadinova-Kavalova, Yordanka (2005) Parenthetical and-clauses. Paper presented at LingO, University of Oxford. Oxford, UK.

Kostadinova-Kavalova, Yordanka (2005) Parenthetical clauses. Department of English language and literature, UCL.

Kostadinova-Kavalova, Yordanka and Nicole Dehé (2005) Parenthetical what. Paper presented at the First International Conference on the Linguistics of Contemporary English. University of Edinburgh.

Mukherjee, Joybrato (2005) English ditransitive verbs: aspects of theory, description and a usage-based model. Language and Computers Vol. 53. Amsterdam: Rodopi.

Mukherjee, Joybrato (2005) Describing verb-complementational profiles of outer circle varieties of English: the case of Indian English. Paper presented at the 26th ICAME Conference, University of Michigan.

Nelson, Gerald (2005) Description and prescription. In: Keith Brown (ed.) The encyclopedia of language and linguistics. Second edition. Oxford: Elsevier. vol. 3. 460-465.

Nelson, Gerald (2005) Expressing future time in Philippine English. In: Danilo T. Dayag and J. Stephen Quakenbush (eds.) Linguistics and language education in the Philippines and beyond: festschrift in honor of Ma. Lourdes S. Bautista. Manila: De La Salle University Press. 41-59.

Nelson, Gerald (2005) Introducing English grammar. Reach Out for English website, Department of Linguistics, The University of Hong Kong. www.hku.hk/reachout.

Nelson, Gerald (2005) Graduate seminars on the International Corpus of English in the Department of Applied Linguistics, Birkbeck College and in the Department of Linguistics, University of Manchester.

Nelson, Gerald (2005) The Case of the disappearing subordinator. Paper presented in the English Department, UCL.

Paradis, Carita (2005) Ontologies and construals in lexical semantics. Axiomathes 15, 541-573.

Peters, Pam (2005) Australian English grammar: Variation across speech and writing. Paper presented at the 26th ICAME Conference, University of Michigan.

Rosenbach, Anette (2005) Descriptive genitives in English. Paper presented at the First International Conference on the Linguistics of Contemporary English. University of Edinburgh.

Sampson, Geoffrey (2005) The 'language instinct' debate. Revised edition. London: Continuum.

Sampson, Geoffrey and Diana McCarthy (2005) (eds.) Corpus linguistics: readings in a widening discipline. London: Continuum.

Schmid, H.-J. (2005) Englische Morphologie und Wortbildung. Berlin: Erich Schmidt Verlag.

Smith, Nicholas (2005) A corpus-based investigation of recent change in the use of the progressive in British English. PhD thesis, Lancaster University.

Spinillo, Mariangela (2005) Determiners in English: reconceptualising the English determiner class. Paper read at the 27th Annual Meeting of the German Society of Linguistics (DGfs): Evolution and Functions of Nominal Determination. University of Cologne.

Spinillo, Mariangela (2005). On the categorial status of present-day English determiners. Paper presented at the First International Conference on the Linguistics of Contemporary English. University of Edinburgh.

Wichmann, Anne (2005) Please - from courtesy to appeal: the role of intonation in the expression of attitudinal meaning. English Language and Linguistics. 9.2. 229-253.

Bas Aarts
Director

January 2006

This page last modified 21 July, 2014 by Survey Web Administrator.