UCL English

Statistics in Corpus Linguistics Research – Resources

A collection of further reading materials, data and spreadsheets for the book Statistics in Corpus Linguistics Research (2021).

Statistics in Corpus Linguistics Research
A New Approach

ISBN 9781138589384

This webpage aggregates resources to be read and used alongside this book.

Citation

Wallis, Sean (2021). Statistics in Corpus Linguistics Research – A New Approach, New York, London: Routledge. » Publisher's website

Contents

Preface

Part 1. Motivations

1. What might corpora tell us about language?

Aarts, Bas, Jo Close and Sean Wallis (2013). Choices over time: methodological issues in current change. Chapter 2 in Bas Aarts, Jo Close, Geoffrey Leech and Sean Wallis (eds.) The Verb Phrase in English. Cambridge University Press. » ePublished

Wallis, Sean (2019). Investigating the additive probability of repeated language production decisions. International Journal of Corpus Linguistics 24(4), 490-521. » corp.ling.stats » ePublished » Data and spreadsheets

Part 2. Designing effective experiments with corpora

2. The idea of corpus experiments
3. That vexed problem of choice

Aarts, Bas, Jo Close and Sean Wallis (2013). Choices over time: methodological issues in current change. Chapter 2 in Bas Aarts, Jo Close, Geoffrey Leech and Sean Wallis (eds.) The Verb Phrase in English. Cambridge University Press. » ePublished

Bowie, Jill, Sean Wallis and Bas Aarts (2013). The perfect in spoken British English. Chapter 13 in Aarts, Close, Leech and Wallis 2013. » ePublished

Bowie, Jill, Sean Wallis and Bas Aarts (2014). Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Juana I. Marín-Arrese, Marta Carretero, Jorge Arús Hita and Johan van der Auwera (eds.) English Modality, Berlin: De Gruyter, 57-94.

Wallis, Sean (2020). Grammar and Corpus Methodology. Chapter 4 in Bas Aarts, Gergana Popova and Jill Bowie (eds.) The Oxford Handbook of English Grammar. Oxford University Press.

4. Choice versus meaning
5. Balanced samples and imagined populations

Bowie, Jill, Sean Wallis and Bas Aarts (2014). Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Juana I. Marín-Arrese, Marta Carretero, Jorge Arús Hita and Johan van der Auwera (eds.) English Modality, Berlin: De Gruyter, 57-94.

Part 3. Confidence intervals and significance tests

6. Introducing inferential statistics

Stahl, S. (2006). The Evolution of the Normal Distribution. Mathematics Magazine 79(2), 96-113. » ePublished

7. Plotting with confidence
  • Magnus Levin's BE thinking data (Excel spreadsheet)
    • Plotting Wilson intervals, Wilson c.c.
    • Plotting Newcombe-Wilson intervals, percentage difference and floating bar chart
    • Same sample z test — see Chapter 9
8. From intervals to tests

Brown, Lawrence, Tony Cai and Anirban DasGupta (2001). Interval estimation for a binomial proportion. Statistical Science 16, 101-133.

Newcombe, Robert (1998a). Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17, 857-872.

Newcombe, Robert (1998b). Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17, 873-890.

Wallis, Sean (2013). Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20(3), 178-208. » corp.ling.stats » ePublished

9. Comparing frequencies within the same distribution
  • Magnus Levin's BE thinking data (Excel spreadsheet)
    • Plotting Wilson intervals, Wilson c.c.
    • Plotting Newcombe-Wilson intervals, percentage difference and floating bar chart — see Chapter 7
    • Same sample z test
10. Reciprocating the Wilson interval
11. Competition between choices over time

Bowie, Jill and Sean Wallis (2016). The to-infinitival perfect: a study of decline. In Valentin Werner, Elena Seoane and Cristina Suárez-Gómez (eds.) Re-assessing the Present Perfect. Topics in English Linguistics (TiEL) 91. Berlin: De Gruyter, 43-94.

Neumerzhitckii, Evgenii (2018). Three-body simulator (website).

Wallis, Sean (2020). Boundaries in nature, corp.ling.stats, London: Survey of English Usage, UCL.

12. The replication crisis and the New Statistics

Gelman, Andrew (2016). What has happened down here is the winds have changed. Statistical Modelling, Causal Inference and Social Science. » blog post

Gelman, Andrew and Eric Loken (2013). The garden of forking paths. Columbia University. » ePublished

Leech, Geoffrey (2011). The modals ARE declining: reply to Neil Millar’s ‘Modal verbs in TIME: frequency changes 1923–2006’. International Journal of Corpus Linguistics 16(4), 547-564.

Millar, Neil (2009). Modal verbs in TIME: frequency changes 1923–2006. International Journal of Corpus Linguistics 14(2), 191-220.

13. Choosing the right test

Gries, Stefan Th. (2015). The most underused statistical method in corpus linguistics: Multi-level (and mixed-effects) models. Corpora 10(1), 95-125.

Oakes, Michael (1998). Statistics for Corpus Linguistics. Edinburgh: EUP.

Ruxton, Graeme (2006). The unequal variance t-test is an underused alternative to Student’s t-test and the Mann-Whitney U test. Behavioral Ecology 17, 688–690.

Sheskin, David (2011). Handbook of Parametric and Nonparametric Statistical Procedures (5th ed.). Boca Raton, FL: CRC Press.

Part 4. From effect sizes to meta-tests

14. The size of an effect

Wallis, Sean (2012). Goodness of fit measures for discrete categorical data, corp.ling.stats, London: Survey of English Usage, UCL.

Wallis, Sean (2012). Measures of association for contingency tables, corp.ling.stats, London: Survey of English Usage, UCL.

Wallis, Sean (2019). Confidence intervals on pairwise φ statistics, corp.ling.stats, London: Survey of English Usage, UCL.

15. Meta-tests for comparing tables of results
  • separability tests (Excel spreadsheet)
    • homogeneity (gradient, point, multi-point)
      2 x 2 χ², Yates's χ², Newcombe-Wilson, Newcombe-Wilson c.c. and φ comparisons
      r x 2 χ², Yates's χ²
      r x c χ²
    • goodness of fit
      2 x 1 χ², Wilson, Wilson c.c.
      r x 1 χ²
    • homogeneity superset/subset (gradient, point, multi-point)
      2 x 2 χ², Yates's χ², Wilson, Wilson c.c.
    • goodness of fit superset/subset
      2 x 1 Wilson, Wilson c.c.

Part 5. Statistical solutions for corpus samples

16. Coping with imperfect data

Wallis, Sean and Seth Mehl (forthcoming). Comparing baselines for corpus analysis: Research into the get-passive in speech and writing. In: Ole Schützler and Julia Schlüter (eds.) Data and Methods in Corpus Linguistics: Comparative Approaches. Cambridge University Press.

17. Adjusting intervals for random-text samples

Part 6. Concluding remarks

18. Plotting the Wilson distribution
  • Wilson distribution (Excel spreadsheet)
    • Wilson, Wilson c.c., Logit Wilson
    • Clopper-Pearson (up to n = 10)

19. In conclusion

Appendices

  Appendix A. The interval equality principle

  Appendix B. Pseudo-code for computational procedures

Glossary

References