Senior Research Fellow, Survey of English Usage
I'm currently working on the Teaching English Grammar in Schools research project, developing the platform. I believe it is important that our research is put to effective use to benefit society as much as possible. We have recently released a mobile app for iPhone, iPad and Android devices called the interactive Grammar of English.
I was the joint PI and lead researcher on the Next Generation Tools research project. This project aims to develop a research platform for corpus linguistics which supports the whole experimental research cycle. You can download the ICECUP IV beta from here.
ICECUP IV is built on top of the ICECUP III architecture. It shares components but extends them in new ways to support experimental research. The idea is that exploratory steps can lead to insights that help a linguist formulate hypotheses. She or he can then investigate these thoroughly by formalising their research question and carrying out experiments using the same platform. Conversely, if experiments yield unexpected results, these can be thoroughly explored.
In 2006 we completed the Diachronic Corpus of Present-Day Spoken English (DCPSE), a parsed corpus of spoken English. The main task was to parse some 400,000 words taken from the spoken part of the London-Lund Corpus (LLC), collected in the 1960s and 1970s, consistently with the British Component of the International Corpus of English (ICE-GB).
ICE-GB was collected in the 1990s, partially parsed automatically, and then manually corrected by a team of linguists. The complexity of the parsing task is high, especially when you consider the level of detail of the analysis scheme and the fact that all the material is spoken. A description of the project is here. My input is primarily to manage the project and support it through software.
ICECUP 3.1 can be downloaded from here with sample corpora from ICE-GB R2 (click here for the DCPSE version). The Survey webpages now include an extended description of ICECUP 3.1. The latest version in beta, ported to Visual C++, is compatible with 64-bit platforms.
ICECUP, or to give it its full (and slightly misleading) title, the International Corpus of English Corpus Utility Program Version 3, is a corpus exploration platform. It is a tool designed to make it easy for researchers to explore a parsed corpus, and was initially distributed with the parsed ICE-GB corpus in 1998 in a 3.0 version.
My background is in artificial intelligence (AI). I was a researcher in the (now sadly defunct) AI Group at the University of Nottingham for six years before joining the Survey of English Usage in 1995.
Broadly, my perspective in AI is that ‘artificial intelligence’ is not a substitute for human reasoning, knowledge and culture, but a potential adjunct of it: ‘intelligent assistance’ if you will. AI includes a range of powerful techniques which can be used to support human endeavour, particularly in the “understanding of complex data”, which is my pat answer to the question ‘what do I carry out research into?’
My central research area is methodological: that is, I am concerned with the way we think, learn about, process, evaluate and communicate our research. This work requires me to develop software tools and platforms to help researchers carry out their research, but these tools are a means to an end, not an end in themselves. To my mind, knowledge does not reside in the algorithm, but in the sense we make of it.
Much of my current research work is bordering on mathematics and statistics, at least insofar as this allows us to maximise the value we can get out of corpus data! I'm increasingly of the view that corpus linguists, ourselves included, are really only just beginning to glimpse the value of the data that we have collected.
From Artificial Intelligence to Corpus Linguistics
Since joining the Survey, I have applied AI algorithms and approaches to corpus linguistics in a number of ways. The following is not an exhaustive list.
- Simulated annealing (a heuristic constraint satisfaction method) has been employed to align ICE-GB corpus texts (Wallis and Nelson 1997) and drive a part-of-speech (POS) tagger.
- Knowledge acquisition principles were employed in the design, development and evaluation of tree editors (Wallis and Nelson 1997) and other browsers in ICECUP.
- Deductive reasoning was used for matching Fuzzy Tree Fragments to corpus trees (Wallis and Nelson 2000) and the axiomatic reduction of logical propositions (Nelson, Wallis and Aarts 2002), both used in ICECUP.
- Machine learning techniques were employed in developing and refining a phrasal parser used in the parsing of DCPSE, in developing a POS tagger, and in knowledge discovery.
- Knowledge discovery techniques, including statistically sound induction, were applied to the 'next generation tools' project.
- I have written on knowledge representation issues, including recently in relation to corpus annotation (Wallis 2007) and corpus query (Wallis 2008).
It should be noted that many of the algorithms described are not simply 'proof-of-concept' systems. ICECUP has been used for corpus linguistics research for almost a decade, and its tree editor has been applied to several hundred thousand trees.
There are two critical requirements for AI technologies embedded in end user tools (Wallis, Cottam and Shadbolt 1994): they must operate reliably (robustness, scalability), and both the specification and results of processing must be meaningful (transparency). Perhaps most of all, embedded AI should be seamless: the 'artificial intelligence' aspects are justified insofar as they solve real problems in the application area. Consequently, on occasion the AI implications of this research may be downplayed.
For example, our use of simulated annealing for aligning ICE-GB tree and text files (Wallis and Nelson 1997) solved a significant problem and permitted the corpus to be reconstructed from over 100,000 separate files.
The critique of longitudinal corpus correction in the same paper paved the way to a more efficient cross-sectional correction approach which was based around the corpus exploration platform, ICECUP, itself embodying a number of AI algorithms (see above).
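As a generic illustration of simulated annealing applied to an alignment problem, the following is a minimal sketch on toy data. It is not the method described in Wallis and Nelson (1997): the function names (`alignment_cost`, `anneal_alignment`), the cost function, the swap move and the cooling schedule are all assumptions chosen for brevity. The sketch tries to find an ordering of shuffled items that matches a reference sequence, occasionally accepting worse orderings so that the search can escape local minima.

```python
import math
import random


def alignment_cost(order, reference, shuffled):
    """Count positions where the proposed ordering disagrees with the reference."""
    return sum(1 for i, j in enumerate(order) if shuffled[j] != reference[i])


def anneal_alignment(reference, shuffled, steps=20000,
                     t_start=2.0, t_end=0.01, seed=0):
    """Search for a permutation of `shuffled` that lines up with `reference`.

    Hypothetical toy example: assumes both sequences have the same
    length (at least 2). Returns the best ordering found and its cost.
    """
    rng = random.Random(seed)
    n = len(reference)
    order = list(range(n))
    cost = alignment_cost(order, reference, shuffled)
    best_order, best_cost = order[:], cost

    for step in range(steps):
        if best_cost == 0:
            break  # perfect alignment found
        # Geometric cooling from t_start towards t_end.
        t = t_start * (t_end / t_start) ** (step / steps)
        # Propose a move: swap two positions in the ordering.
        i, j = rng.sample(range(n), 2)
        order[i], order[j] = order[j], order[i]
        new_cost = alignment_cost(order, reference, shuffled)
        delta = new_cost - cost
        # Always accept improvements; accept a worse state with
        # probability exp(-delta / t), which shrinks as t cools.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cost = new_cost
            if cost < best_cost:
                best_order, best_cost = order[:], cost
        else:
            order[i], order[j] = order[j], order[i]  # reject: undo the swap

    return best_order, best_cost


if __name__ == "__main__":
    reference = "the cat sat on the mat by the door".split()
    shuffled = reference[:]
    random.Random(1).shuffle(shuffled)
    order, cost = anneal_alignment(reference, shuffled)
    print(cost, [shuffled[j] for j in order])
```

Because the best ordering seen so far is tracked separately from the current state, the returned cost is never worse than that of the starting ordering, even though individual steps may move uphill.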
I have written a number of journal articles and book chapters on Corpus Linguistics methodology, including corpus annotation (Wallis 2007), annotation methods (Wallis 2003a), corpus query (Wallis 2008; Wallis and Nelson 2000; Chapters 3-7 in Nelson et al 2002), and experimental methods (Wallis 2003b; Wallis and Nelson 2001; Chapters 8-9 in Nelson et al 2002).
I refer to this perspective as the '3A perspective': annotation, abstraction and analysis. Without regular annotation, reliable abstraction of concepts is impossible. Hence most of the work in Corpus Linguistics to date has been in ensuring a reliable corpus annotation, with rather less research effort in corpus query methods. Even less research on applying analysis methods has taken place.
Similarly, the reliable abstraction of concepts is a necessary precondition for analysis. The ability to capture a concept (e.g. noun phrase post-modification) in its many forms is a precondition to the definition of an experiment to investigate the factors that lead a speaker to prefer one form over another.
Finally, this applied AI research has brought me back full-circle to questions of experimental design and statistics. In part this is because we are now discovering exciting and novel research possibilities using our parsed corpora. We just have to ask the right questions! In part this is also because corpus linguists, like most of us, are not trained in experimental statistics.
So in recent years I have been attempting to humanise and make relevant some key tools found in statistics articles and textbooks. This started with conference papers and pages on our website discussing methodology in broad terms, but a more rigorous and in-depth reading and evaluation can be found on my corp.ling.stats blog.
|1985||(with S.K. Wallis) Galactic Chess. Practical Computing, September 1985.|
|1993||Machine Learning with Knowledge. Proc. MLnet Workshop on Scientific Discovery 1993, MLnet.|
|1994a||(with H.D. Cottam and N.R. Shadbolt) Design principles for Embedded AI: a case study. Proceedings of First European Conference on Cognitive Science in Industry, CRP-CU, Luxembourg.|
|1994b||(with H.D. Cottam and N.R. Shadbolt) Using AI techniques to aid repair estimation. Proceedings of Fourteenth International Avignon Conference AI '94, 1. Nanterre, France.|
|1997||(with G. Nelson) Syntactic parsing as a knowledge acquisition problem. Proceedings of 10th European Knowledge Acquisition Workshop, Catalonia, Spain, Springer Verlag. 285-300. » ePublished|
|1998a||(with B. Aarts and G. Nelson) Using fuzzy tree fragments to explore English grammar. English Today 14, 3: 52-56.|
|1998b||(with G. Nelson and B. Aarts, editors) The British Component of the International Corpus of English (software CD-ROM). London: Survey of English Usage.|
|1999a||(with B. Aarts and G. Nelson) Global resources for a global language: English language pedagogy in the modern age. In: C. Gnutzmann (ed.) Teaching and Learning English as a Global Language: Native and Non-Native Perspectives, Tübingen: Stauffenberg Verlag. 273-290.|
|1999b||(with B. Aarts and G. Nelson) Parsing in reverse – Exploring ICE-GB with Fuzzy Tree Fragments and ICECUP. In: J. Kirk (ed.) Corpora Galore: Analyses and Techniques in Describing English, Amsterdam: Rodopi. 335-344.|
|2000||(with G. Nelson) Exploiting fuzzy tree fragment queries in the investigation of parsed corpora. Literary and Linguistic Computing 15, 3: 339-361.|
|2001||(with G. Nelson) Knowledge discovery in grammatically analysed corpora. Data Mining and Knowledge Discovery, 5: 307-340.|
|2002||(with G. Nelson and B. Aarts) Exploring Natural Language: Working with the British Component of the International Corpus of English. G29, Varieties of English Around the World series. Amsterdam: John Benjamins.|
|2003a||Completing parsed corpora: from correction to evolution. In Abeillé, A. (ed.) Treebanks: building and using syntactically annotated corpora, Boston: Kluwer. 61-71.|
|2003b||Scientific experiments in parsed corpora: an overview. In Granger, S. and Petch-Tyson, S. (eds.) Extending the scope of corpus-based research: new applications, new challenges, Language and Computers 48. Amsterdam: Rodopi. 12-23.|
|2006a||(with G. Nelson and B. Aarts, editors) The British Component of the International Corpus of English Release 2 (software CD-ROM). London: Survey of English Usage.|
|2006b||(with B. Aarts, G. Ozón and Y. Kavalova, editors) The Diachronic Corpus of Present-Day Spoken English (software CD-ROM). London: Survey of English Usage.|
|2006c||(with B. Aarts) Recent developments in the syntactic annotation of corpora. In: Bermúdez, E.M. and Miyares, L.R. (eds.) Linguistics in the twenty-first century. Cambridge: Cambridge Scholars Press. 197-202.|
|2007||Annotation, Retrieval and Experimentation. In Meurman-Solin, A. and Nurmi, A.A. (eds.) Annotating Variation and Change. Helsinki: Varieng, UoH. » ePublished|
|2008||Searching treebanks and other structured corpora. In Lüdeling, A. and Kytö, M. (eds.) Corpus Linguistics: An International Handbook. Handbücher zur Sprache und Kommunikationswissenschaft series. Berlin: Mouton de Gruyter. 738-759.|
|2010||(with J. Close and B. Aarts) Recent changes in the use of the progressive construction in English. In Capelle, B. and Wada, N. (eds.) Distinctions in English Grammar. Tokyo: Kaitakusha. 148-168. » pre-published.|
|2012a||(with B. Aarts and D. Clayton) Bridging the Grammar Gap. English Today 28, 1: 3-8. » ePublished : English Today © 2012 Cambridge University Press.|
|2012b||(with J. Bowie and B. Aarts) That vexed problem of choice. Presented at ICAME 33. London: Survey of English Usage. » ePublished » PowerPoint slides|
|2013a||(with B. Aarts, J. Close and G. Leech, editors) The Verb Phrase in English. Cambridge University Press.|
|2013b||(with B. Aarts and J. Close) Choices over time: methodological issues in current change. Chapter 2 in Aarts, Close, Leech and Wallis 2013. » ePublished|
|2013c||(with B. Aarts and J. Bowie) The perfect in spoken British English. Chapter 13 in Aarts, Close, Leech and Wallis 2013. » ePublished|
|2013d||Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20:3, 178-208. » ePublished|
|2013e||z-squared: The origin and application of χ². Journal of Quantitative Linguistics 20:4, 350-378. » ePublished|
|2013f||(with B. Aarts and J. Bowie) Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Marín-Arrese, J.I., Carretero, M., Arús H., J. and van der Auwera, J. (eds.) English Modality, Berlin: De Gruyter. 57-94.|
|2013g||The London Corpora. Presented at Greek Corpus 20 Workshop, Athens, 28 June 2013 » PowerPoint slides|
|2014a||What might a corpus of parsed spoken data tell us about language? Proceedings of Olinco 2014. Palacký University, Olomouc, Czech Republic. » corp.ling.stats » ePublished|
|2014b||(with B. Aarts) Noun Phrase simplicity in spoken English. Proceedings of Olinco 2014. Palacký University, Olomouc, Czech Republic.|
|2015||(with B. Aarts and J. Bowie) Profiling the English verb phrase over time. In Taavitsainen, I., Kytö, M., Claridge, C. and Smith, J. (eds.) Developments in English: Expanding Electronic Evidence, Cambridge University Press.|
Statistics lecture notes and resources
|2009||Binomial confidence intervals and contingency tests. London: Survey of English Usage. Published as Wallis (2013d). » corp.ling.stats » ePublished|
|2010a||Competition between choices over time. London: Survey of English Usage. » corp.ling.stats » ePublished|
|2010b||z-squared: The origin and application of χ². London: Survey of English Usage. Published as Wallis (2013e). » corp.ling.stats » ePublished » PowerPoint slides|
|2011||Comparing χ² tests. London: Survey of English Usage. » corp.ling.stats » ePublished|
|2012a||Goodness of fit measures for discrete categorical data. London: Survey of English Usage. » corp.ling.stats » ePublished|
|2012b||Measures of association for contingency tables. London: Survey of English Usage. » corp.ling.stats » ePublished|
|2015||Adapting random-instance sampling variance estimates and Binomial models for random-text sampling. London: Survey of English Usage. » corp.ling.stats » ePublished|
2x2 χ² (multiple 2x2 contingency tests and 2x1 goodness of fit calculations)
χ² separability test (tests whether two 2x2, 2x1, 3x2 or 3x1 tables significantly differ)
Wilson score intervals for a small population (use when the population is finite)
Single-sample z test for comparing two competing frequencies for significant difference
Interaction trend analysis (evaluates a series of repeating decisions)
Binomial demonstrator (classroom demonstrator for Binomial distribution)
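To illustrate the kind of calculation these resources perform, here is a minimal sketch of the standard Wilson score interval for an observed proportion. It is not taken from the spreadsheets listed above: the function name `wilson_interval` and its parameters are assumptions, and it uses the ordinary formula for a large population, without the finite-population adjustment that the small-population item refers to.

```python
import math


def wilson_interval(p, n, z=1.959964):
    """Wilson score interval for an observed proportion p out of n cases.

    z is the critical value of the standard Normal distribution
    (the default corresponds to a two-tailed 95% interval).
    Hypothetical helper: standard formula, no finite-population correction.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")

    z2 = z * z
    denom = 1.0 + z2 / n
    # The interval is centred on an adjusted proportion, not on p itself.
    centre = (p + z2 / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # e.g. 5 out of 10 cases choose one form over another
    lower, upper = wilson_interval(0.5, 10)
    print(round(lower, 4), round(upper, 4))
```

Unlike the simple Normal approximation, the bounds stay within [0, 1] and remain usable when p is close to 0 or 1.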
|2011-||(with B. Aarts) The interactive Grammar of English. (Mobile App) London: Survey of English Usage / UCL Business PLC. Editions: 1.0 for iOS, 2011; 1.1 for iOS/Android, 2013.|
|2013||(with S. Mehl and B. Aarts) Academic Writing in English. (Mobile App) London: Survey of English Usage / UCL Business PLC.|
|2014||(with S. Mehl) English Spelling & Punctuation. (iOS Mobile App) London: Survey of English Usage / UCL Business PLC.|
This page last modified 23 September, 2015 by Survey Web Administrator.