The DCPSE Research Project
Creating a parsed and searchable diachronic corpus of present-day spoken English
Reference: R 000 239 643
Institution: University College London
Department: Department of English (Survey of English Usage)
Principal Investigator: Professor Bas Aarts
Senior Research Fellow: Sean Wallis
Research Assistants: Dr Dirk Bury, Lesley Kirk, Yordanka Kostadinova-Kavalova, Dr Ann Law, Gabriel Ozón
Period: 1 August 2002 to 31 August 2004 (extended by 1 month)
Full project description
Resources and software for download
Final report (PDF)
This project was rated Outstanding by the ESRC.
Comments from ESRC referees on the final report
Evaluation Grade: Good
The research project in question set out to compile a diachronic
comparable corpus of spoken data. Research based on studies of comparative
diachronic language data is currently a very active field in corpus
linguistics and any addition in the form of available corpora is
therefore very useful. What initially sparked off the interest in
the study of language change was the compilation of the written
corpora FLOB and FROWN. These offered an interesting comparison
to the LOB and BROWN corpora, which are 30 years older. The result of the present
project, the DCPSE corpus, offers an exciting addition to this type
of comparison, as it is the first corpus to offer comparison in
spoken language data over (up to) a 30-year axis.
On the whole, the project has achieved its objectives and as such it has been a success. There is now a searchable diachronic spoken corpus with accompanying software.
However, there are two points that make the grading Good rather than Outstanding. One is the failure to solve a problem about time span, already raised by a reviewer of the project proposal. Due to the design of the LLC corpus (where half the data comes from), the older texts were recorded from the 1960s up to as late as 1977. The earliest texts in the ICE part are from 1990; this means a difference of only 13 years. Very few changes in language are noticeable in such a short span. Could more effort have been put into enlarging the gap between the two corpora? It is not clear from the report that this issue could not have been resolved within the project.
The second point concerns availability. This question is clearly larger than that of just this project. However, given that the resource has been produced thanks to public money (in order to promote research in this area), it is not obvious why other researchers and research students should have to pay to use it. I think it is very important to stress that every funded corpus compilation project should end in a resource that is freely available to the wider research community, preferably by downloading from a website (unless there are very specific copyright problems for the language data in question). If resources such as these are made available, they will find their way to university students, to language teachers and to scholars. Only then will enough studies be conducted based on the resource.
Activities and achievements
The objectives of the project were:
a) To create a diachronic corpus of present-day spoken English.
This corpus would be the first to allow comparisons between spoken data over a 30-year time period. The data were to be tagged and parsed to allow for complex grammatical searches.
b) To implement software to accompany the corpus.
Both these objectives have been carried out successfully. However, one of the reviewers of the original project proposal pointed out issues with the selection of texts. In an ideal world the data selections in LLC and ICE should have 30 years between them. However, due to the design of the LLC corpus, some texts are as late as 1977 and a substantial part is from 1976. As the ICE texts are from 1990-1992, it is not clear that language change over the 13-year gap is as informative as the 30-year gap that was first envisaged (in comparison to the FROWN/BROWN and FLOB/LOB corpora).
The project intended to tag and parse the spoken data, which is
a big achievement given the nature of spoken data. The project made
use of new methodologies in the case of partial parsing which are
They have also altered existing software to make it relevant for diachronic research. Despite the growing interest in this type of diachronic language research there are very few tools that can be used for this task. WordSmith Keywords offers one dimension of analysis but few others are generally available. Unfortunately the DCPSE tool is a highly specialised tool (as it needs to be to handle the extensive mark-up used in the project) and will therefore not be very useful with other corpora, but it is still a valuable addition to the number of tools available for diachronic studies.
The results of the project are disseminated through a website and
are available on a CD-ROM. A sample corpus can be downloaded from www.ucl.ac.uk/englishusage/diachronic.
Apart from the sample, the website also offers all project publications
including the end-of-award report, which I find very commendable.
Information about the corpus should also be disseminated on several discussion lists such as Corpora List and Linguist List. To date, I have not been able to find such information on the lists or in the list archives; however, I trust the project to advertise it as soon as they see fit.
The researchers in this project have mentioned possibilities of adding pragmatic information to the linguistic annotation of the corpus. That would add a completely new angle to the corpus. However, I fear it would be very time-consuming and perhaps therefore offer little value for money.
Activities and achievements
The goal of the project was to build a particular research product: an electronically stored and searchable corpus consisting of two large and varied bodies of transcribed spontaneous spoken English, collected in different decades of the later 20th century, as a resource for future research on short-term changes in the spoken English language. This goal has been successfully achieved, and the corpus is (almost) ready to be distributed to the research community. This will be a unique resource, not only for English linguistics, but for linguistic research in general: the use of comparable bodies of textual data for tracking recent change in written English through the 1960s-1990s has already been achieved elsewhere, but in many ways, being able to investigate similar datasets of the spoken language is a more important breakthrough, because of the leading role the spoken language takes in linguistic change. We can expect an exciting body of new research to emerge from the availability of DCPSE. In a research project of this complexity (and in all other corpus-building activities known to me) there are bound to be loose ends to clear up at the end of the project, such as the final stages of editing the corpus annotation. This, in my view, does not detract from the overall success of the project.
The ICECUP software (partially developed under an earlier ESRC grant) is a highlight of the project. This has been developed further as a result of the present project, and is imaginatively structured to provide the researcher with access to data through syntactically-defined as well as lexically-defined queries. The further development of a corpus-defined lexicon and a 'grammaticon' of grammatical constructions gives access to a new range of research possibilities. The user interface is brilliantly handled.
Dissemination has already taken place through presentations at conferences, and publications will be forthcoming; but the most important type of dissemination is distribution of the corpus itself, which is already available for sample investigation through downloading from the web. The whole corpus plus software will be available at cost from UCL, and (to judge from previous dissemination of the ICE-GB corpus from the same source) this will be vigorously 'marketed' through various channels, so that this resource will receive worldwide exposure and its use will be optimised. It will also be lodged with the Oxford Text Archive and the ESRC Data Archive.
The DCPSE as a new research resource will be keenly taken up by the worldwide English linguistics community and corpus linguistics community. These two research communities are closely intermeshed, and are numerously represented in the UK, N. America, Europe, Asia and other parts of the world. An additional audience for future research outcomes will be found in the English language teaching industry, for which up-to-date information about the developing spoken language is increasingly important, as the focus of language learning falls more on speech than on writing.
The DCPSE is a resource waiting to be exploited: I have no doubt that the team at UCL will seize the new opportunities for analysing and explicating short-term changes in the spoken language. There is also much to do in extending this resource further, e.g. by providing access to the sound files and analysing prosodic aspects of the corpus. A particularly revealing line of research will be a comparison between change in the spoken language and in the written language (the latter has already been investigated) over the same period.
Activities and achievements
This is a highly impressive project. The aims and objectives given in the initial application for funding were ambitious, but they have all been met. Based on the sample corpus provided on the CD which accompanied this rapporteur pack, the research team at UCL have produced an important, timely and welcome dataset for an exploration of recent lexical and (morpho)syntactic change in British English. The question of how much change can be witnessed in a 25 to 30 year period is raised and in part answered by Professor Aarts, in the End of Award Report: researchers interested in grammatical change in recent English now have an excellent resource to enable them to explore this question in depth.
It is particularly useful for researchers to have a parsed version of the LLC, especially one parsed in the same way as ICE-GB. We now have a systematic corpus to be used in the exploration of change in recent British English, which is not only interesting in its own right, but will also allow researchers to investigate new data and apply them to more general theories of language change. The Getting Started booklet gave helpful and clear advice (in addition to further references) and is accessible for both new and established researchers in the field of corpus linguistics. Based on the sample of the Corpus provided, I found the corpus to be user-friendly (especially for someone like me who is not a specialist in corpus linguistics), fast and efficient. I was particularly impressed by one of the consequences of the parsing used by the UCL team: that you can search for a range of 'types' of linguistic pattern, as well as particular instances of words or phrases. I was also impressed by the way in which the corpus addresses interesting and relevant issues concerning the classification and analysis of spoken language.
The corpus is to be disseminated via the Survey of English Usage website, and members of the research team have written booklets, papers and books to support users. They have also given workshops to demonstrate how the corpus can be used. The dissemination to date has therefore been more than satisfactory.
Any researcher interested in grammatical change (in English particularly) will welcome this corpus.
In the End of Award report, Professor Aarts notes that the UCL team have plans to seek funding to allow them to prepare and enhance digitised sound recordings for the LLC data. I would encourage and fully support an application for funding for such a project, which I believe would be of great benefit to researchers.
I would like to thank Professor Aarts, Dr Wallis, and other researchers at UCL involved in this project. I look forward to using the full corpus in my own research in the future.
This page last modified 28 May, 2015 by Survey Web Administrator.