The Survey of English Usage
Annual Report 2004
Creating a parsed and searchable diachronic corpus of present-day spoken English (ESRC R000239643)
The DCPSE project was graded 'Outstanding'
by the ESRC, which "indicates that a project has fully met its objectives
and has provided an exceptional research contribution well above
average or very high in relation to the level of award". A full
set of referees' comments can be found here.
In linguistics a distinction is traditionally made between diachronic and synchronic approaches to the study of language. The former considers language through time, whereas the latter takes a 'snapshot' look at language viewed from the present. This dichotomy has recently been questioned by some linguists, who have argued that the distinction is an artificial one. They claim that languages change all the time, even synchronically. As a result of these new attitudes to language development, there is an emerging strand of research in linguistics which concerns itself with recent change.
The aim of this project was the construction of a diachronic corpus of spontaneous spoken English containing directly comparable material from the London-Lund Corpus (LLC) and the British Component of the International Corpus of English (ICE-GB). The resource has been made fully searchable with the International Corpus of English Corpus Utility Program (ICECUP) exploration software.
Main results:
- We selected a total of 800,000 words of spoken English from comparable categories in the LLC and in ICE-GB (400,000 words from each corpus). The design of these corpora is similar, and it will thus be possible to study the linguistic features of analogous categories of spontaneous spoken English over time.
- As noted, in each case we have selected matching texts, and we cross-checked the structural markup and tagging in the LLC.
- We integrated the LLC and ICE-GB material. Very long monologue utterances (over 1,000 words) were broken into segments. These could then be read into ICECUP and indexed in an integrated fashion.
- ICECUP was originally developed to operate on ICE-GB. We modified it to handle the combined data in DCPSE.
- We parsed the LLC material. Parsing was carried out automatically to phrasal level, and the trees were then corrected by hand. This involved a number of technical challenges (for more information, see the final report on the link below).
- We manually checked the results of the automatic parsing process.
- We have written documentation for use with the corpus and software.
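The segmentation of very long monologue utterances mentioned in the main results can be illustrated with a minimal sketch. The report does not describe how the segments were actually chosen, so the fixed-size split below is an assumption made purely for illustration; the function name `split_utterance` and the constant `MAX_WORDS` are hypothetical and are not part of ICECUP or the DCPSE tools.

```python
from typing import List

# Threshold taken from the report: utterances over 1,000 words were segmented.
MAX_WORDS = 1000


def split_utterance(words: List[str], max_words: int = MAX_WORDS) -> List[List[str]]:
    """Split a list of word tokens into consecutive segments.

    Utterances of max_words tokens or fewer are returned as a single segment.
    Longer utterances are cut into chunks of at most max_words tokens.
    ASSUMPTION: fixed-size chunks; the project's real segment boundaries
    are not documented in the report.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if len(words) <= max_words:
        return [words]
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]


if __name__ == "__main__":
    monologue = ["word"] * 2500  # stand-in for a 2,500-word monologue
    segments = split_utterance(monologue)
    print([len(segment) for segment in segments])
```

In practice one would probably prefer to cut at linguistically sensible points, such as pauses or sentence boundaries, rather than at a fixed word count.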
The new corpus will provide linguists interested in recent linguistic
changes in English with a new, innovative and searchable database
containing spoken English covering a period of 25-30 years. We will
disseminate the corpus via the Survey of English Usage website later
this year. We believe that this resource offers unprecedented possibilities
for new research into changes in English. For full details, see the final report.
A sample corpus is available for download from this site, packaged with ICECUP.
A pre-release version of ICECUP 3.1 is available, with ICE-GB and
DCPSE sample corpora, for download from our website. See: www.ucl.ac.uk/english-usage/projects/ice-gb/beta
This beta release comes with a set of release notes which explain the additions to the software that were not covered in the handbook (Nelson, Wallis and Aarts, 2002).
ICECUP 3.1 is an evolutionary advance on ICECUP 3.0. A user who has learned how to use ICECUP 3.0 should feel entirely at home with the new software. The software has been extended in a number of important respects. The new ICECUP includes:
- An integrated lexicon derived from the corpus
- An integrated grammaticon of nodes
- Simple drag-and-drop statistics
- Enhanced Fuzzy Tree Fragments, including:
  - logical combinations of lexical wild cards
  - logic within nodes
  - additional structural features
  - improved user interface with 'floating' node properties/inspector window
- An improved FTF wizard
- Manual sentence selection 'query' control
- Improved browsing:
  - word wrapping
  - context options
  - enhanced concordancing
  - integrated speech playback (with optional separately available sound files)
- Improved user interface, including:
  - new tree editor with zooming and panning
  - new quick-find commands
  - faster and parallel searching
We are grateful for the comments and suggestions from beta-reviewers.
In many cases these have led to additional facilities in the software.
If you are interested in reviewing the software, please have a look
at our site.
There are currently a small number of outstanding issues with the software, which will be tackled prior to release later this year, including some compatibility issues with Windows XP. Naturally, we have no intention of releasing any version of the software unless it is extremely stable.
Finally, our programmer, Sean Wallis, is looking to the future in a new proposal that would extend ICECUP to support cycles of formally defined experiments in linguistics. ICECUP 3.1 consists of a series of enhancements to the environment and the expressivity of searches, but the only specifically new tools are a grammaticon and lexicon. ICECUP 3.2 would, in our plan, provide a number of inter-related tools that would allow a researcher to define and carry out any number of experiments in grammar on the corpus.
Sound recordings for the 300 spoken texts of ICE-GB (around 75 hours of speech) are available from us by order, as a set of CDs, in three formats (all 16kHz mono).
|SET 1||12 CDs||one file per text||uncompressed wave files|
|SET 2||11 CDs||one file per sentence/group||uncompressed wave files|
|SET 3||5 CDs||one file per sentence/group||compressed (mp3) files|
These are currently available as standalone datasets at an equivalent cost to the computerised ICE-GB data. We also plan to release an integrated package, with sound files, ICE-GB Version II and ICECUP 3.1, later this year. This will permit the playback of sentences or groups of sentences from the corpus. The cost of an advance purchase of sound recordings will be subtracted from the cost of this 'ICE-GB+sound' package.
Professor K. K. Luke and his students visited UCL and the Survey
in July 2004. Gerry Nelson gave them a talk about the ICE project.
Gerry Nelson visited Hong Kong in August, in connection with the ICE-HK project.
Gerry Nelson edited a special volume of World Englishes (May 2004, Volume 23, Issue 2) on The International Corpus of English. The table of contents is shown below:
|Introduction||G. Nelson|
|How to trace structural nativization: particle verbs in world Englishes||E. W. Schneider|
|Cultural discourse in the Corpus of East African English and beyond: possibilities and problems of lexical and collocational research in a one million-word corpus.||J. Schmied|
|Conceptualization specifics in East African English: quantitative arguments from the ICE-East Africa corpus||C. Haase|
|Emphasizer now in colloquial South African English||C. Jeffery and B. van Rooy|
|Shared morpho-syntactic features in contact varieties of English: article use||A. Sand|
|Negation of lexical have in conversational English||G. Nelson|
|Comparing world Englishes: a research guide||H. Fallon|
For further details, see here.
The English Noun Phrase: an empirical study (AHRB B/RG/AN5308/APN10614)
We are pleased to report that Evelien Keizer’s research on this project will be published by Cambridge University Press in the monograph series Studies in English Language.
The London-Lund Corpus
The sound files of the London-Lund Corpus are now available upon request at the Survey. Please contact Christine Bowles (firstname.lastname@example.org).
Gerry Nelson was appointed Deputy Director of the Survey.
Christine Bowles joined the Survey on 1 November as part-time administrator. She works on Mondays and Thursdays. Should you wish to contact her, her email address is email@example.com
Sean Wallis continues as Principal Senior Research Fellow. He has been working on the ESRC project and on the new version of ICECUP. He is seconded part-time to the Human Resources Department at UCL.
Isaac Hallegua continues as Systems Administrator.
Our principal Research Assistants are Yordanka Kostadinova-Kavalova and Gabriel Ozón. They were joined for shorter periods by Dr Dirk Bury, Dr Amela Camdžic, Leslie Kirk, Dr Ann Law and Kate Scott.
We congratulate Mariangela Spinillo on successfully defending her PhD thesis, entitled 'Reconceptualising the English determiner class'.
Two people have left the Survey. Marie Gibney has retired as administrator after working in the Survey for 21 years, first with Sidney Greenbaum, then with Bas Aarts. She has done a wonderful job running the SEU for so many years. We held a farewell party for her which was also attended by many members of the English Department. Toshihiko Kubota will leave the Survey in April after having spent two years as a Visiting Scholar at the Survey. We wish him luck returning to his teaching position in Japan.
Please let us know if you would like us to include your publications based on SEU material. We would appreciate it if you could send us offprints of any such publications.
Aarts, Bas (2004) Fuzzy grammar: a reader. Oxford: Oxford University Press. (Edited with David Denison, Evelien Keizer and Gergana Popova.)
Aarts, Bas (2004) Fuzzy grammar: the nature of grammatical categories and their representation. (With David Denison, Evelien Keizer and Gergana Popova.) In: Bas Aarts, David Denison, Evelien Keizer and Gergana Popova (eds.) Fuzzy grammar: a reader. Oxford: Oxford University Press.
Aarts, Bas (2004) Modelling linguistic gradience. Studies in Language 28.1. 1-49.
Aarts, Bas (2004) Grammatici certant. Review Article of Rodney Huddleston and Geoffrey Pullum (2002) The Cambridge grammar of the English language. Journal of Linguistics 40.2.
Aarts, Bas (2004) Conceptions of gradience in the history of linguistics. Language Sciences 26.
Aarts, Bas (2004) Messy or orderly: the nature of grammatical categories. Plenary lecture at the fiftieth anniversary meeting of the English Language and Literature Association of Korea, Seoul.
Aarts, Bas (2004) Recent developments in corpus linguistics. Academy of Korean Studies, Seoul and Pusan National University.
Aarts, Bas (2004) English Language and Linguistics. (With David Denison and Richard Hogg.) Cambridge University Press. Volumes 8.1 and 8.2.
Aijmer, Karin and Anne-Marie Simon-Vandenbergen (2004) Modal adverbs of certainty in the ICE-GB corpus. Paper presented at the 25th ICAME conference, Verona.
De Clerck, Bernard (2004) Imperative subjects in English: a corpus-based pragmatic analysis. Paper presented at the 25th ICAME conference, Verona.
Depraetere, Ilse and Ann Verhulst (2004) Must and have to in ICE-GB: a survey of its meanings. Paper presented at the 25th ICAME conference, Verona.
Fallon, Helen (2004) Comparing world Englishes: a research guide. In: Gerald Nelson (2004)(ed.) World Englishes 23.2: Special issue on the International Corpus of English. 309-316.
Gesuato, Sara (2004) To be going, to be doing. Paper presented at the 25th ICAME conference, Verona.
Gilquin, Gaëtanelle (2004) A corpus-based cognitive study of the main English causative verbs: a syntactic, semantic, lexical and stylistic approach. Unpublished PhD Thesis. Louvain-la-Neuve: Centre for English Corpus Linguistics, Université Catholique de Louvain.
Hasselgård, Hilde (2004) The placement of adjuncts in clause-medial position. Paper presented at the 25th ICAME conference, Verona.
Jeffery, Chris and Bertus van Rooy (2004) Emphasizer now in colloquial South African English. In: Gerald Nelson (2004)(ed.) World Englishes 23.2: Special issue on the International Corpus of English. 269-280.
Kaltenböck, Gunther (2004) It-extraposition and non-extraposition in English: a study of syntax in spoken and written English. Vienna: Braumüller.
Keizer, Evelien (2004) Postnominal PP complements and modifiers: a cognitive distinction. English Language and Linguistics 8.2. 323-350.
Kirk, John M., Jeffrey L. Kallen, Orla Lowry and Anne Rooney (2004) Standard Irish English: the four hypotheses. Paper presented at the 25th ICAME conference, Verona.
Kostadinova-Kavalova, Yordanka (2004) Integrated parentheticals and discourse parentheticals. Paper presented at the 25th ICAME conference, Verona.
Kostadinova-Kavalova, Yordanka (2004) Niche-filling: completing a parsed corpus through evolution. Paper presented at the sixth conference of General Linguistics, Santiago de Compostela, Spain. (With Gabriel Ozón and Sean Wallis.)
Leech, Geoffrey (2004) A new Gray's anatomy of English grammar. Review article of Rodney Huddleston and Geoffrey K. Pullum (2002) The Cambridge grammar of the English language. English Language and Linguistics 8.1, 121-147.
Martinez-Insua, A. E. and I. M. Palacios-Martinez (2003) A corpus-based approach to non-concord in present day English there constructions. English Studies 3. 362-383.
Meunier, Fanny (2004) Native corpora, learner corpora and ELT: the winning team? Paper presented at the 25th ICAME conference, Verona.
Meyer, Charles F. and Hongyin Tao (2004) Grammar, pragmatics, introspection and corpus linguistics: a critique of Newmeyer's 'Grammar is grammar and usage is usage'. Paper presented at the 25th ICAME conference, Verona.
Monschau, Jacqueline, Rolf Kreier and Joybrato Mukherjee (2004) Syntax and semantics at tone unit boundaries. Anglia: Zeitschrift für Englische Philologie 121.4. 581-609.
Mukherjee, Joybrato (2004) The state of the art in corpus linguistics: three book-length perspectives. English Language and Linguistics 8.1, 103-119.
Nelson, Gerald (2004)(ed.) World Englishes 23.2: Special issue on the International Corpus of English. Oxford: Blackwell Publishers.
Nelson, Gerald (2004) Negation of lexical have in conversational English. In: Gerald Nelson (2004)(ed.) World Englishes 23.2: Special issue on the International Corpus of English. 299-308.
Ni, Yibin (2003) Noun phrases in media texts: a quantificational approach. In: Jean Aitchison and Diana M. Lewis (eds.) New media language. London: Routledge. 159-168.
Ozón, Gabriel (2004) Ditransitive alternation: a weighty account? A corpus-based study using ICECUP. Paper presented at the 25th ICAME conference, Verona.
Ozón, Gabriel (2004) Niche-filling: completing a parsed corpus through evolution. Paper presented at the sixth conference of General Linguistics, Santiago de Compostela, Spain. (With Yordanka Kostadinova-Kavalova and Sean Wallis.)
Paradis, Carita (2004) On the importance of corpora to lexical semantic theory: adjective-noun combinations in ICE-GB. Paper presented at the 25th ICAME conference, Verona.
Paradis, Carita (2004) Where does metonymy stop? Senses, facets and active zones. Metaphor & Symbol, 19.4, 245-264.
Sand, Andrea (2004) Shared morpho-syntactic features in contact varieties of English: article use. In: Gerald Nelson (2004)(ed.) World Englishes 23.2: Special issue on the International Corpus of English. 281-298.
Schmied, Josef (2004) Cultural discourse in the Corpus of East African English and beyond: possibilities and problems of lexical and collocational research in a one million-word corpus. In: Gerald Nelson (2004)(ed.) World Englishes 23.2: Special issue on the International Corpus of English. 251-260.
Schneider, Edgar (2004) How to trace structural nativization: particle verbs in world Englishes. In: Gerald Nelson (2004)(ed.) World Englishes 23.2: Special issue on the International Corpus of English. 227-249.
Spinillo, Mariangela (2004) Reconceptualising the English determiner class. PhD thesis, English Department, University College London.
Trudgill, Peter (2004) New-dialect formation: the inevitability of colonial Englishes. Edinburgh: Edinburgh University Press.
Wallis, Sean (2004) ICECUP 3.1: a sneak preview. Paper presented at the 25th ICAME conference, Verona.
Wallis, Sean (2004) Niche-filling: completing a parsed corpus through evolution. Paper presented at the sixth conference of General Linguistics, Santiago de Compostela, Spain. (With Gabriel Ozón and Yordanka Kostadinova-Kavalova.)
This page last modified 21 July, 2014 by Survey Web Administrator.