Sean Wallis

Principal Research Fellow, Survey of English Usage

Current research

I'm a corpus linguist, AI technologist and statistician based in the linguistic research unit, the Survey of English Usage at UCL. The intersection between these different research areas can be thought of as one of research methodology - the ‘how do we know what we think we know?’ question about language.

I'm currently working on the Teaching English Grammar in Schools research project, developing the Englicious platform to a second release. I believe it is important that our research is put to effective use to benefit society as much as possible. We have released a number of mobile apps for iPhone, iPad and Android devices, including the interactive Grammar of English.

I teach corpus linguistics, research methodology and statistics on the UCL MA in English Linguistics.

Research perspectives

My background is in artificial intelligence (AI). I was a researcher in the (now sadly defunct) AI Group at the University of Nottingham for six years before joining the Survey of English Usage in 1995.

Broadly, my perspective in AI is that ‘artificial intelligence’ is not a substitute for human reasoning, knowledge and culture, but a potential adjunct of it: ‘intelligent assistance’ if you will. AI includes a range of powerful techniques which can be used to support human endeavour, particularly in the “understanding of complex data”, which is my pat answer to the question - what do I carry out research into?

My central research area is methodological - that is, I am concerned with the way we think, learn about, process, evaluate and communicate our research. This work requires me to develop software tools and platforms to help researchers carry out their research, but these tools are a means to an end, not an end in themselves. To my mind, knowledge does not reside in the algorithm, but in the sense we make of it.

Much of my current research work is bordering on mathematics and statistics, at least insofar as this allows us to maximise the value we can get out of corpus data! I'm increasingly of the view that corpus linguists, ourselves included, are really only just beginning to glimpse the value of the data that we have collected.


ICECUP, or to give it its full (and rather misleading) title, the International Corpus of English Corpus Utility Program Version 3, is a corpus exploration platform. It is a research platform designed to allow students and researchers to explore a parsed corpus and obtain empirical linguistic evidence to evaluate theories against.

ICECUP was first distributed with the parsed ICE-GB corpus in 1998 in a 3.0 version. Although I wouldn't describe maintaining software through many iterations of Microsoft Windows as a labour of love, for twenty years I have maintained ICECUP 3.1 and supported an international community of researchers who use our corpora.

Amazingly perhaps, ICECUP is still cutting-edge. We can still ask questions of corpus data using ICECUP and our parsed corpora that simply cannot be asked of most other corpora. A parsed corpus contains empirical information about grammar that is simply not recorded in unparsed linguistic datasets. One of my main research interests is in exploring and uncovering new possibilities for research with richly-structured corpora (from my perspective, parsing is only one level of potential theory-rich linguistic annotation).

ICECUP 3.1 can be downloaded with sample corpora from ICE-GB R2 or DCPSE. The Survey webpages now include an extended description of ICECUP 3.1. The latest 3.1.1 version, ported to Visual C++, is compatible with 64-bit platforms.

From Artificial Intelligence to Corpus Linguistics

Since joining the Survey, I have applied AI algorithms and approaches to corpus linguistics in a number of ways. The following is not an exhaustive list.

  • Simulated annealing (an heuristic constraint satisfaction method) has been employed to align ICE-GB corpus texts (Wallis and Nelson 1997) and drive a part-of-speech (POS) tagger.
  • Knowledge acquisition principles were employed in the design, development and evaluation of tree editors (Wallis and Nelson 1997) and other browsers in ICECUP.
  • Deductive reasoning was used for matching Fuzzy Tree Fragments to corpus trees (Wallis and Nelson 2000) and the axiomatic reduction of logical propositions (Nelson, Wallis and Aarts 2002), both used in ICECUP.
  • Machine learning techniques were employed in developing and refining a phrasal parser used in the parsing of DCPSE, developing a POS-tagger and knowledge discovery.
  • Knowledge discovery techniques, including statistically sound induction, were applied to the 'Next Generation Tools' project.
  • I have written on knowledge representation issues, including corpus annotation (Wallis 2007), and corpus queries (Wallis 2008).

It should be noted that many of the algorithms described are not simply 'proof-of-concept' systems. ICECUP has been used for corpus linguistics research for two decades, its tree editor applied to several hundreds of thousands of trees, etc.

There are two critical requirements for AI technologies embedded in end user tools (Wallis, Cottam and Shadbolt 1994a): they must operate reliably (robustness, scalability), and both the specification and results of processing must be meaningful (transparency). Perhaps most of all, embedded AI should be seamless: the 'artificial intelligence' aspects are justified insofar as they solve real problems in the application area. Consequently, on occasion the AI implications of this research may be downplayed.

For example, our use of simulated annealing for aligning ICE-GB tree and text files (Wallis and Nelson 1997) solved a significant problem and permitted the corpus to be reconstructed from over 100,000 separate files.
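To give a flavour of the technique, here is a minimal sketch of simulated annealing applied to a toy alignment problem. It is not the ICE-GB procedure: the cost function (`mismatch`), the swap move, the geometric cooling schedule and every name and parameter below are assumptions made purely for illustration.

```python
import math
import random


def mismatch(text_tokens, tree_tokens):
    """Toy cost: number of tokens found in one unit but not the other."""
    return len(set(text_tokens) ^ set(tree_tokens))


def total_cost(texts, trees, order):
    """Total mismatch when text unit i is paired with tree unit order[i]."""
    return sum(mismatch(texts[i], trees[order[i]]) for i in range(len(texts)))


def anneal_alignment(texts, trees, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Search for a low-cost pairing of texts and trees (needs >= 2 units).

    Hypothetical helper: swaps two pairings at a time and accepts worse
    solutions with a probability that shrinks as the temperature falls.
    """
    rng = random.Random(seed)
    order = list(range(len(trees)))
    cost = total_cost(texts, trees, order)
    best_order, best_cost = order[:], cost
    for step in range(steps):
        # Geometric cooling from t_start towards t_end.
        temperature = t_start * (t_end / t_start) ** (step / steps)
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        new_cost = total_cost(texts, trees, order)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            cost = new_cost
            if cost < best_cost:
                best_order, best_cost = order[:], cost
        else:
            order[i], order[j] = order[j], order[i]  # reject: undo the swap
    return best_order, best_cost


# Invented example data: four text units and the same units, as tree
# leaves, stored in a different order.
texts = [["the", "cat", "sat"], ["on", "a", "mat"],
         ["it", "purred"], ["then", "slept"]]
trees = [texts[2], texts[0], texts[3], texts[1]]

best_order, best_cost = anneal_alignment(texts, trees)
print(best_order, best_cost)
```

The occasional acceptance of a worse pairing early on is what distinguishes annealing from simple hill-climbing: it lets the search escape local minima before the temperature falls.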

The critique of longitudinal corpus correction in the same paper paved the way to a more efficient cross-sectional correction approach which was based around the corpus exploration platform, ICECUP, itself embodying a number of AI algorithms (see above).

I have written a number of journal articles and book chapters on Corpus Linguistics methodology, including corpus annotation (Wallis 2007), annotation methods (Wallis 2003a), corpus query (Wallis 2008; Wallis and Nelson 2000; Chapters 3-7 in Nelson et al. 2002), and experimental methods (Wallis 2003b; Wallis and Nelson 2001; Chapters 8-9 in Nelson et al. 2002, and many papers since).

I refer to this perspective as the '3A perspective' – annotation, abstraction and analysis.

Essentially, this is to recognise the importance of three distinct interlocking research processes.

  • Annotation takes a set of texts and adds linguistic information to them, enriching them and identifying instances of linguistically meaningful entities and relations. At this point the resulting enriched dataset ('corpus') is usually distributed to the research community.
  • Abstraction is the researcher's exploratory process of establishing a mapping between concepts they wish to research, and representations found in the corpus (text + annotation). It also maps the structured corpus to a regular dataset that can be analysed by conventional statistical methods. The key linking element in abstraction is a corpus query.
  • Analysis is the process of applying statistical and other methods to data that has been abstracted in this way.

Without consistent annotation, reliable abstraction of concepts is impossible. Hence much early work in Corpus Linguistics focused on ensuring that a corpus was reliably annotated, with rather less research effort in corpus query methods. Even less research on applying analysis methods has taken place (this is now my primary focus).

Consider the example of research into the structure of noun phrases. First, we need to represent noun phrases in the corpus. We, the annotators, need to identify heads and other elements, and various kinds of modifier relationships in our annotation. Reliably and completely parsing a corpus is a costly exercise, and so many researchers should be able to use the results. Now let us suppose that you, a researcher, are committed to a different framework. You need to map representations in the corpus to your preferred scheme. This is the role of abstraction. Finally, having obtained data within your chosen framework you can now perform statistical generalisation and other analytic steps. You can test hypotheses expressed in your framework against our grammatically-encoded corpus, even though our framework is not yours.
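The noun phrase example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the function and category labels, the "premodified"/"bare" distinction and the toy trees are all hypothetical, and the Wilson score interval is used here simply as one standard way to put a confidence interval on an observed proportion.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Node:
    """One node of a toy parse tree (the annotation). Labels are illustrative."""
    function: str                      # e.g. "NPHD" for a noun phrase head
    category: str                      # e.g. "NP", "N", "AJP"
    children: list = field(default_factory=list)


def walk(node):
    """Yield a node and all its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def abstract_np(np_node):
    """Abstraction: map the corpus representation to the researcher's own
    (hypothetical) two-way classification of noun phrases."""
    has_premodifier = any(c.function == "NPPR" for c in np_node.children)
    return "premodified" if has_premodifier else "bare"


def query_noun_phrases(trees):
    """Corpus query: find every NP and return one data row per instance."""
    return [abstract_np(n) for t in trees for n in walk(t) if n.category == "NP"]


def wilson_interval(p, n, z=1.959964):
    """Analysis: Wilson score interval for an observed proportion p out of n."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - spread, centre + spread


# Annotation: two invented trees standing in for a parsed corpus.
toy_corpus = [
    Node("PU", "CL", [
        Node("SU", "NP", [Node("NPPR", "AJP"), Node("NPHD", "N")]),
        Node("OD", "NP", [Node("NPHD", "N")]),
    ]),
    Node("SU", "NP", [Node("NPHD", "N")]),
]

data = query_noun_phrases(toy_corpus)            # abstraction
p = data.count("premodified") / len(data)        # analysis
print(data, p, wilson_interval(p, len(data)))
```

Only `abstract_np` encodes the researcher's framework, so a researcher committed to a different scheme would rewrite that one function and leave the annotation and the analysis untouched.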

Abstraction offers a way to unlock the hermeneutic trap, which is where researchers are obliged to merely accept the output of an algorithm or a corpus annotation scheme as a given set of framing assumptions for their research. ICECUP is designed to address this problem above all others, and it gives rise to what has been called in some quarters 'the Survey Methodology' in Corpus Linguistics.

Crucially each process is conceived of as a cycle, so for example, annotation may be revised and re-applied to a corpus, corpus queries may be refined to obtain new data, etc.

Experimental Design and Statistics

My applied AI research has brought me back full-circle to questions of experimental design and statistics. First, this is because we are discovering exciting and novel research possibilities using our parsed corpora. We just have to ask the right questions! Second, being aware of the difficulties we have in conceptualising statistical ideas, I have been attempting to explore alternative approaches to teaching and research.

This started with conference papers and pages on our website discussing methodology in broad terms, but a more rigorous and in-depth reading and evaluation can be found on my corp.ling.stats blog, which contains over 50 posts about statistics, including discussion pieces, original mathematics and published papers.

If you want to sum up my approach, I have become convinced that visualisation and theoretical well-foundedness are crucial. In other words, there is no point counting and measuring phenomena that have no theoretical value, and unless you can visualise the statistical models you are using, you are certain to misunderstand what you have found!

I have just published a new book on statistics, containing a lot of new material, with Routledge.

Selected publications

1985 (with S.K. Wallis) Galactic Chess. Practical Computing, September 1985.
1993 Machine Learning with Knowledge. Proc. MLnet Workshop on Scientific Discovery 1993, MLnet.
1994a (with H.D. Cottam and N.R. Shadbolt) Design principles for Embedded AI: a case study. Proceedings of First European Conference on Cognitive Science in Industry, CRP-CU, Luxembourg.
1994b (with H.D. Cottam and N.R. Shadbolt) Using AI techniques to aid repair estimation. Proceedings of Fourteenth International Avignon Conference AI '94, 1. Nanterre, France.
1997 (with G. Nelson) Syntactic parsing as a knowledge acquisition problem. Proceedings of 10th European Knowledge Acquisition Workshop, Catalonia, Spain, Springer Verlag. 285-300. » ePublished
1998a (with B. Aarts and G. Nelson) Using fuzzy tree fragments to explore English grammar. English Today 14, 3: 52-56.
1998b (with G. Nelson and B. Aarts, editors) The British Component of the International Corpus of English (software CD-ROM). London: Survey of English Usage.
1999a (with B. Aarts and G. Nelson) Global resources for a global language: English language pedagogy in the modern age. In: C. Gnutzmann (ed.) Teaching and Learning English as a Global Language: Native and Non-Native Perspectives, Tübingen: Stauffenberg Verlag. 273-290.
1999b (with B. Aarts and G. Nelson) Parsing in reverse – Exploring ICE-GB with Fuzzy Tree Fragments and ICECUP. In: J. Kirk (ed.) Corpora Galore: Analyses and Techniques in Describing English, Amsterdam: Rodopi. 335-344. » Offprint
2000 (with G. Nelson) Exploiting fuzzy tree fragment queries in the investigation of parsed corpora. Literary and Linguistic Computing 15, 3: 339-361.
2001 (with G. Nelson) Knowledge discovery in grammatically analysed corpora. Data Mining and Knowledge Discovery, 5: 307-340.
2002 (with G. Nelson and B. Aarts) Exploring Natural Language: Working with the British Component of the International Corpus of English. G29, Varieties of English Around the World series. Amsterdam: John Benjamins.
2003a Completing parsed corpora: from correction to evolution. In Abeillé, A. (ed.) Treebanks: building and using syntactically annotated corpora, Boston: Kluwer. 61-71.
2003b Scientific experiments in parsed corpora: an overview. In Granger, S. and Petch-Tyson, S. (eds.) Extending the scope of corpus-based research: new applications, new challenges, Language and Computers 48. Amsterdam: Rodopi. 12-23.
2006a (with G. Nelson and B. Aarts, editors) The British Component of the International Corpus of English Release 2 (software CD-ROM). London: Survey of English Usage.
2006b (with B. Aarts, G. Ozón and Y. Kavalova, editors) The Diachronic Corpus of Present-Day Spoken English (software CD-ROM). London: Survey of English Usage.
2006c (with B. Aarts) Recent developments in the syntactic annotation of corpora. In: Bermúdez, E.M. and Miyares, L.R. (eds.) Linguistics in the twenty-first century. Cambridge: Cambridge Scholars Press. 197-202.
2007 Annotation, Retrieval and Experimentation. In Meurman-Solin, A. and Nurmi, A.A. (eds.) Annotating Variation and Change. Helsinki: Varieng, UoH. » ePublished
2008 Searching treebanks and other structured corpora. In Lüdeling, A. and Kytö, M. (eds.) Corpus Linguistics: An International Handbook. Handbücher zur Sprache und Kommunikationswissenschaft series. Berlin: Mouton de Gruyter. 738-759.
2010 (with J. Close and B. Aarts) Recent changes in the use of the progressive construction in English. In Cappelle, B. and Wada, N. (eds.) Distinctions in English Grammar. Tokyo: Kaitakusha. 148-168. » pre-published.
2012a (with B. Aarts and D. Clayton) Bridging the Grammar Gap. English Today 28, 1: 3-8. » ePublished : English Today © 2012 Cambridge University Press.
2012b (with J. Bowie and B. Aarts) That vexed problem of choice. Presented at ICAME 33. London: Survey of English Usage. » ePublished » PowerPoint slides
2012c Tagging ICE Philippines and other corpora. London: Survey of English Usage. » ePublished
2013a (with B. Aarts, J. Close and G. Leech, editors) The Verb Phrase in English. Cambridge University Press.
2013b (with B. Aarts and J. Close). Choices over time: methodological issues in current change. Chapter 2 in Aarts, Close, Leech and Wallis 2013. » ePublished
2013c (with B. Aarts and J. Bowie). The perfect in spoken British English. Chapter 13 in Aarts, Close, Leech and Wallis 2013. » ePublished
2013d Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20:3, 178-208. » ePublished
2013e z-squared: The origin and application of χ². Journal of Quantitative Linguistics 20:4. 350-378. » ePublished
2013f The London Corpora. Presented at Greek Corpus 20 Workshop, Athens, 28 June 2013 » PowerPoint slides
2014a What might a corpus of parsed spoken data tell us about language? Proceedings of Olinco 2014. Palacký University, Olomouc, Czech Republic. » corp.ling.stats » ePublished
2014b (with B. Aarts) Noun Phrase simplicity in spoken English. Proceedings of Olinco 2014. Palacký University, Olomouc, Czech Republic.
2014c (with B. Aarts and J. Bowie) Profiling the English verb phrase over time: modal patterns. In Taavitsainen, I., Kytö, M., Claridge, C. and Smith, J. (eds.) Developments in English: Expanding Electronic Evidence, Cambridge University Press.
2014d (with B. Aarts and J. Bowie). Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Marín-Arrese, J.I., Carretero, M., Arús, H.J. and van der Auwera, J. (eds.) English Modality, Berlin: De Gruyter, 57-94.
2016a (with J. Holmwood, R. Cohen and T. Hickey, editors) The Alternative White Paper for Higher Education. In Defence of Public Higher Education: Knowledge for a Successful Society. London: HE Convention. » order ePublished
2016b (with J. Bowie) The to-infinitival perfect: A study of decline. In Werner, V., Seoane, E., and Suárez-Gómez, C. (eds.) Re-assessing the Present Perfect, Topics in English Linguistics (TiEL) 91. Berlin: De Gruyter, 43-94.
2016c (with S. Mehl and B. Aarts) Language learning at your fingertips: deploying corpora in mobile teaching apps. In Corrigan, K., Mearns, A. (eds.) Creating and digitizing language corpora. Volume 3: Databases for Public Engagement. Palgrave, Basingstoke. 211-239.
2018 (with B. Aarts and J. Bowie) –Ing clauses in spoken English: structure, usage and recent change. In Seoane, E., C. Acuña-Fariña, & I. Palacios-Martínez (eds.) Subordination in English. Synchronic and Diachronic Perspectives. Topics in English Linguistics (TiEL) 101. Berlin: De Gruyter. 129-154.
2019a Comparing χ² Tables for Separability of Distribution and Effect: Meta-Tests for Comparing Homogeneity and Goodness of Fit Contingency Test Outcomes. Journal of Quantitative Linguistics 26:4, 330-355. » corp.ling.stats » ePublished 2018
2019b (with I. Cushing and B. Aarts) Exploiting Parsed Corpora in Grammar Teaching, Linguistic Issues in Language Technology, 18(2019). » ePublished
2019c Investigating the additive probability of repeated language production decisions. International Journal of Corpus Linguistics 24:4, 490-521. » corp.ling.stats » ePublished » Resources
2020 Grammar and Corpus Methodology. Chapter 4 in Aarts, B., Popova, G. and Bowie, J. (eds.) The Oxford Handbook of English Grammar. Oxford University Press. » order
2021 Statistics in Corpus Linguistics Research – A New Approach, New York, London: Routledge. » order » Resources
2022 (with Seth Mehl). Comparing baselines for corpus analysis: Research into the get-passive in speech and writing. In: O. Schützler & J. Schlüter (eds.) Data and Methods in Corpus Linguistics: Comparative Approaches. Cambridge: Cambridge University Press. 101-126.

Statistics lecture notes and resources

corp.ling.stats blog

2009 Binomial confidence intervals and contingency tests. London: Survey of English Usage. Published as Wallis (2013d). » corp.ling.stats » ePublished
2010a Competition between choices over time. London: Survey of English Usage. » corp.ling.stats » ePublished
2010b z-squared: The origin and application of χ². London: Survey of English Usage. Published as Wallis (2013e). » corp.ling.stats » ePublished » PowerPoint slides
2011 Comparing χ² tests. London: Survey of English Usage. » corp.ling.stats » ePublished
2012a Goodness of fit measures for discrete categorical data. London: Survey of English Usage. » corp.ling.stats » ePublished
2012b Measures of association for contingency tables. London: Survey of English Usage. » corp.ling.stats » ePublished
2015 Adapting random-instance sampling variance estimates and Binomial models for random-text sampling. London: Survey of English Usage. » corp.ling.stats » ePublished
2017 Detecting direction in interaction evidence. London: Survey of English Usage. » corp.ling.stats » ePublished
2018 Plotting the Wilson distribution. London: Survey of English Usage. » corp.ling.stats » ePublished
2020 Further evaluation of Binomial confidence intervals and difference intervals. London: Survey of English Usage. » corp.ling.stats » ePublished
2022 Accurate confidence intervals on Binomial proportions, functions of proportions and other related scores. London: Survey of English Usage. » corp.ling.stats » ePublished » PowerPoint slides


Mobile apps

2011- (with B. Aarts) The interactive Grammar of English. (Mobile App) London: Survey of English Usage / UCL Business PLC. Editions: 1.0 for iOS 2011. 1.1 for iOS/Android, 2013.
2013 (with S. Mehl and B. Aarts) Academic Writing in English. (Mobile App) London: Survey of English Usage / UCL Business PLC.
2014 (with S. Mehl) English Spelling & Punctuation. (iOS Mobile App) London: Survey of English Usage / UCL Business PLC.

This page last modified 26 July, 2022 by Survey Web Administrator.