A collection of further reading, data and spreadsheets for the book Statistics in Corpus Linguistics Research (2021).
Statistics in Corpus Linguistics Research
A New Approach

ISBN 9781138589384
This webpage aggregates resources to be read and used alongside this book.
Citation
Wallis, Sean (2021). Statistics in Corpus Linguistics Research – A New Approach, New York, London: Routledge. » Publisher's website
Contents
Preface
Part 1. Motivations
- 1. What might corpora tell us about language?
- Software: ICECUP, see also Fuzzy Tree Fragments
- Corpora: ICE-GB and DCPSE
Aarts, Bas, Jo Close and Sean Wallis (2013). Choices over time: methodological issues in current change. Chapter 2 in Bas Aarts, Jo Close, Geoffrey Leech and Sean Wallis (eds.) The Verb Phrase in English. Cambridge University Press. » ePublished
Wallis, Sean (2019). Investigating the additive probability of repeated language production decisions. International Journal of Corpus Linguistics 24(4), 490-521. » corp.ling.stats » ePublished » Data and spreadsheets
Part 2. Designing effective experiments with corpora
- 2. The idea of corpus experiments
- Simple analysis of who vs. whom data (Excel spreadsheet)
- 3. That vexed problem of choice
Aarts, Bas, Jo Close and Sean Wallis (2013). Choices over time: methodological issues in current change. Chapter 2 in Bas Aarts, Jo Close, Geoffrey Leech and Sean Wallis (eds.) The Verb Phrase in English. Cambridge University Press. » ePublished
Bowie, Jill, Sean Wallis and Bas Aarts (2013). The perfect in spoken British English. Chapter 13 in Aarts, Close, Leech and Wallis 2013. » ePublished
Bowie, Jill, Sean Wallis and Bas Aarts (2014). Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Juana I. Marín-Arrese, Marta Carretero, Jorge Arús Hita and Johan van der Auwera (eds.) English Modality, Berlin: De Gruyter, 57-94.
Wallis, Sean (2020). Grammar and Corpus Methodology. Chapter 4 in Bas Aarts, Gergana Popova and Jill Bowie (eds.) The Oxford Handbook of English Grammar. Oxford University Press.
- 4. Choice versus meaning
- Semasiological and onomasiological analysis of very data (Excel spreadsheet)
- 5. Balanced samples and imagined populations
Bowie, Jill, Sean Wallis and Bas Aarts (2014). Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Juana I. Marín-Arrese, Marta Carretero, Jorge Arús Hita and Johan van der Auwera (eds.) English Modality, Berlin: De Gruyter, 57-94.
Part 3. Confidence intervals and significance tests
- 6. Introducing inferential statistics
Stahl, S. (2006). The Evolution of the Normal Distribution. Mathematics Magazine, 79(2), 96-113. » ePublished
- Binomial demonstrator (Excel spreadsheet)
- 7. Plotting with confidence
- Magnus Levin's BE thinking data (Excel spreadsheet)
- Plotting Wilson intervals, Wilson c.c.
- Plotting Newcombe-Wilson intervals, percentage difference and floating bar chart
- Same sample z test — see Chapter 9
- 8. From intervals to tests
Brown, Lawrence, Tony Cai and Anirban DasGupta (2001). Interval estimation for a binomial proportion. Statistical Science 16, 101-133.
Newcombe, Robert (1998a). Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17, 857-872.
Newcombe, Robert (1998b). Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17, 873-890.
Wallis, Sean (2013). Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20(3), 178-208. » corp.ling.stats » ePublished
- Mood x transitivity analysis (Excel spreadsheet)
- 2 x 2, 2 x 1 χ² tests (Excel spreadsheet)
- 2 x 1 goodness of fit χ², Yates's χ², Wilson, Wilson c.c.
- 2 x 2 homogeneity (independence) tests including χ², Yates's χ², Newcombe-Wilson, Newcombe-Wilson c.c.
- Wilson finite population calculations (Excel spreadsheet)
- Single proportion example
- Newcombe-Wilson test adjustment following resampling — see Chapter 16
- 9. Comparing frequencies within the same distribution
- Magnus Levin's BE thinking data (Excel spreadsheet)
- Plotting Wilson intervals, Wilson c.c.
- Plotting Newcombe-Wilson intervals, percentage difference and floating bar chart — see Chapter 7
- Same sample z test
- 10. Reciprocating the Wilson interval
- Analysis of sentence length data (Excel spreadsheet)
- 11. Competition between choices over time
Bowie, Jill and Sean Wallis (2016). The to-infinitival perfect: a study of decline. In Valentin Werner, Elena Seoane and Cristina Suárez-Gómez (eds.) Re-assessing the Present Perfect, Topics in English Linguistics (TiEL) 91. Berlin: De Gruyter, 43-94.
Neumerzhitckii, Evgenii (2018). Three-body simulator (website).
Wallis, Sean (2020). Boundaries in nature, corp.ling.stats, London: Survey of English Usage, UCL.
- 12. The replication crisis and the New Statistics
Gelman, Andrew (2016). What has happened down here is the winds have changed. Statistical Modelling, Causal Inference and Social Science. » blog post
Gelman, Andrew and Eric Loken (2013). The garden of forking paths. Columbia University. » ePublished
Leech, Geoffrey (2011). The modals ARE declining: reply to Neil Millar’s ‘Modal verbs in TIME: frequency changes 1923–2006’. International Journal of Corpus Linguistics, 16(4), 547-564.
Millar, Neil (2009). Modal verbs in TIME: frequency changes 1923–2006. International Journal of Corpus Linguistics, 14(2), 191-220.
- 13. Choosing the right test
Gries, Stefan Th. (2015). The most underused statistical method in corpus linguistics: Multi-level (and mixed-effects) models. Corpora, 10(1), 95-125.
Oakes, Michael (1998). Statistics for Corpus Linguistics. Edinburgh: EUP.
Ruxton, Graeme (2006). The unequal variance t-test is an underused alternative to Student’s t-test and the Mann-Whitney U test. Behavioral Ecology, 17, 688–690.
Sheskin, David (2011). Handbook of Parametric and Nonparametric Statistical Procedures (5th ed.). Boca Raton, FL: CRC Press.
Part 4. From effect sizes to meta-tests
- 14. The size of an effect
Wallis, Sean (2012). Goodness of fit measures for discrete categorical data, corp.ling.stats, London: Survey of English Usage, UCL.
Wallis, Sean (2012). Measures of association for contingency tables, corp.ling.stats, London: Survey of English Usage, UCL.
Wallis, Sean (2019). Confidence intervals on pairwise φ statistics, corp.ling.stats, London: Survey of English Usage, UCL.
- 15. Meta-tests for comparing tables of results
- separability tests (Excel spreadsheet)
- homogeneity (gradient, point, multi-point)
2 x 2 χ², Yates's χ², Newcombe-Wilson, Newcombe-Wilson c.c. and φ comparisons
r x 2 χ², Yates's χ²
r x c χ²
- goodness of fit
2 x 1 χ², Wilson, Wilson c.c.
r x 1 χ²
- homogeneity superset/subset (gradient, point, multi-point)
2 x 2 χ², Yates's χ², Wilson, Wilson c.c.
- goodness of fit superset/subset
2 x 1 Wilson, Wilson c.c.
Part 5. Statistical solutions for corpus samples
- 16. Coping with imperfect data
Wallis, Sean and Seth Mehl (forthcoming). Comparing baselines for corpus analysis: Research into the get-passive in speech and writing. In: Ole Schützler and Julia Schlüter (eds.) Data and Methods in Corpus Linguistics: Comparative Approaches. Cambridge University Press.
- Wilson finite population calculations (Excel spreadsheet)
- Single proportion example — see Chapter 8
- Newcombe-Wilson test adjustment following resampling
- 17. Adjusting intervals for random-text samples
- Large sample analysis of p(inter) and p(CL | word) data (Excel spreadsheet)
- Analysis of p(inter) data for each ICE-GB text category (Excel spreadsheet)
- Partitioned and pooled analysis of shall / will data in DCPSE (Excel spreadsheet)
Part 6. Concluding remarks
- 18. Plotting the Wilson distribution
- Wilson distribution (Excel spreadsheet)
- Wilson, Wilson c.c., Logit Wilson
- Clopper-Pearson (up to n = 10)
- 19. In conclusion
Appendices
Appendix A. The interval equality principle
Appendix B. Pseudo-code for computational procedures