# Statistics in Corpus Linguistics Research

## A New Approach

ISBN 9781138589384

### Citation

Wallis, Sean (2021). *Statistics in Corpus Linguistics Research – A New Approach*, New York, London: Routledge. » Publisher's website

### Table of contents


#### Preface

#### Part 1. Motivations

##### 1. What might corpora tell us about language?

Further reading

- Software: ICECUP, see also Fuzzy Tree Fragments
- Corpora: ICE-GB and DCPSE
- Aarts, Bas, Jo Close and Sean Wallis (2013). Choices over time: methodological issues in current change. Chapter 2 in Bas Aarts, Jo Close, Geoffrey Leech and Sean Wallis (eds.) *The Verb Phrase in English*. Cambridge University Press. **»** ePublished
- Wallis, Sean (2019). Investigating the additive probability of repeated language production decisions. *International Journal of Corpus Linguistics* **24**(4), 490-521. **» corp.ling.stats** **»** ePublished **»** Data and spreadsheets

#### Part 2. Designing effective experiments with corpora

##### 2. The idea of corpus experiments

Data and analysis

- Simple analysis of *who* vs. *whom* data (Excel spreadsheet)

##### 3. That vexed problem of choice

Further reading

- Aarts, Bas, Jo Close and Sean Wallis (2013). Choices over time: methodological issues in current change. Chapter 2 in Bas Aarts, Jo Close, Geoffrey Leech and Sean Wallis (eds.) *The Verb Phrase in English*. Cambridge University Press. **»** ePublished
- Bowie, Jill, Sean Wallis and Bas Aarts (2013). The perfect in spoken British English. Chapter 13 in Aarts, Close, Leech and Wallis (2013). **»** ePublished
- Bowie, Jill, Sean Wallis and Bas Aarts (2014). Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Juana I. Marín-Arrese, Marta Carretero, Jorge Arús Hita and Johan van der Auwera (eds.) *English Modality*, Berlin: De Gruyter, 57-94.
- Wallis, Sean (2020). Grammar and Corpus Methodology. Chapter 4 in Bas Aarts, Gergana Popova and Jill Bowie (eds.) *The Oxford Handbook of English Grammar*. Oxford University Press.

##### 4. Choice versus meaning

Data and analysis

- Semasiological and onomasiological analysis of *very* data (Excel spreadsheet)

##### 5. Balanced samples and imagined populations

Further reading

- Bowie, Jill, Sean Wallis and Bas Aarts (2014). Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Juana I. Marín-Arrese, Marta Carretero, Jorge Arús Hita and Johan van der Auwera (eds.) *English Modality*, Berlin: De Gruyter, 57-94.

#### Part 3. Confidence intervals and significance tests

##### 6. Introducing inferential statistics

Calculator

- Binomial demonstrator (Excel spreadsheet)

Further reading

- Stahl, S. (2006). The Evolution of the Normal Distribution. *Mathematics Magazine*, **79**(2), 96-113. **»** ePublished

##### 7. Plotting with confidence

Data and analysis

- Magnus Levin's BE *thinking* data (Excel spreadsheet)
  - Plotting Wilson intervals, Wilson c.c.
  - Plotting Newcombe-Wilson intervals, percentage difference and floating bar chart
  - Same sample *z* test (see Chapter 9)

##### 8. From intervals to tests

Calculator

- 2 x 2, 2 x 1 χ² tests (Excel spreadsheet)
  - 2 x 1 goodness of fit χ², Yates's χ², Wilson, Wilson c.c.
  - 2 x 2 homogeneity (independence) tests including χ², Yates's χ², Newcombe-Wilson, Newcombe-Wilson c.c.
- Wilson finite population calculations (Excel spreadsheet)
  - Single proportion example
  - Newcombe-Wilson test adjustment following resampling (see Chapter 16)

Data and analysis

- Mood x transitivity analysis (Excel spreadsheet)

Further reading

- Brown, Lawrence, Tony Cai and Anirban DasGupta (2001). Interval estimation for a binomial proportion. *Statistical Science* **16**, 101-133.
- Newcombe, Robert (1998a). Two-sided confidence intervals for the single proportion: comparison of seven methods. *Statistics in Medicine* **17**, 857-872.
- Newcombe, Robert (1998b). Interval estimation for the difference between independent proportions: comparison of eleven methods. *Statistics in Medicine* **17**, 873-890.
- Wallis, Sean (2013). Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. *Journal of Quantitative Linguistics* **20**(3), 178-208. **» corp.ling.stats** **»** ePublished

##### 9. Comparing frequencies within the same distribution

Calculator

- Single-sample *z* test (Excel spreadsheet)

Data and analysis

- Magnus Levin's BE *thinking* data (Excel spreadsheet)
  - Plotting Wilson intervals, Wilson c.c.
  - Plotting Newcombe-Wilson intervals, percentage difference and floating bar chart (see Chapter 7)
  - Same sample *z* test

##### 10. Reciprocating the Wilson interval

Data and analysis

- Analysis of sentence length data (Excel spreadsheet)

##### 11. Competition between choices over time

Further reading

- Bowie, Jill and Sean Wallis (2016). The *to*-infinitival perfect: a study of decline. In Valentin Werner, Elena Seoane and Cristina Suárez-Gómez (eds.) *Re-assessing the Present Perfect*, Topics in English Linguistics (TiEL) 91. Berlin: De Gruyter, 43-94.
- Neumerzhitckii, Evgenii (2018). Three-body simulator (website).
- Wallis, Sean (2020). Boundaries in nature, corp.ling.stats, London: Survey of English Usage, UCL.

##### 12. The replication crisis and the New Statistics

Further reading

- Gelman, Andrew (2016). What has happened down here is the winds have changed. *Statistical Modelling, Causal Inference and Social Science*. **» blog post**
- Gelman, Andrew and Eric Loken (2013). The garden of forking paths. Columbia University. **»** ePublished
- Leech, Geoffrey (2011). The modals ARE declining: reply to Neil Millar’s ‘Modal verbs in TIME: frequency changes 1923–2006’. *International Journal of Corpus Linguistics*, **16**(4), 547-564.
- Millar, Neil (2009). Modal verbs in TIME: frequency changes 1923–2006. *International Journal of Corpus Linguistics*, **14**(2), 191-220.

##### 13. Choosing the right test

Further reading

- Gries, Stefan Th. (2015). The most underused statistical method in corpus linguistics: Multi-level (and mixed-effects) models. *Corpora*, **10**(1), 95-125.
- Oakes, Michael (1998). *Statistics for Corpus Linguistics*. Edinburgh: EUP.
- Ruxton, Graeme (2006). The unequal variance *t*-test is an underused alternative to Student’s *t*-test and the Mann-Whitney *U* test. *Behavioral Ecology*, **17**, 688–690.
- Sheskin, David (2011). *Handbook of Parametric and Nonparametric Statistical Procedures* (5th ed.). Boca Raton, FL: CRC Press.

#### Part 4. From effect sizes to meta-tests

##### 14. The size of an effect

Further reading

- Wallis, Sean (2012). Goodness of fit measures for discrete categorical data, corp.ling.stats, London: Survey of English Usage, UCL.
- Wallis, Sean (2012). Measures of association for contingency tables, corp.ling.stats, London: Survey of English Usage, UCL.
- Wallis, Sean (2019). Confidence intervals on pairwise φ statistics, corp.ling.stats, London: Survey of English Usage, UCL.

##### 15. Meta-tests for comparing tables of results

Calculator

- Separability tests (Excel spreadsheet)
  - homogeneity (gradient, point, multi-point)
    - 2 x 2 χ², Yates's χ², Newcombe-Wilson, Newcombe-Wilson c.c. and φ comparisons
    - *r* x 2 χ², Yates's χ²
    - *r* x *c* χ²
  - goodness of fit
    - 2 x 1 χ², Wilson, Wilson c.c.
    - *r* x 1 χ²
  - homogeneity superset/subset (gradient, point, multi-point)
    - 2 x 2 χ², Yates's χ², Wilson, Wilson c.c.
  - goodness of fit superset/subset
    - 2 x 1 Wilson, Wilson c.c.

#### Part 5. Statistical solutions for corpus samples

##### 16. Coping with imperfect data

Calculator

- Wilson finite population calculations (Excel spreadsheet)
  - Single proportion example (see Chapter 8)
  - Newcombe-Wilson test adjustment following resampling

Further reading

- Wallis, Sean and Seth Mehl (forthcoming). Comparing baselines for corpus analysis: Research into the *get*-passive in speech and writing. In: Ole Schützler and Julia Schlüter (eds.) *Data and Methods in Corpus Linguistics: Comparative Approaches*. Cambridge University Press.

##### 17. Adjusting intervals for random-text samples

Data and analysis

- Large sample analysis of *p*(inter) and *p*(CL | word) data (Excel spreadsheet)
- Analysis of *p*(inter) data for each ICE-GB text category (Excel spreadsheet)
- Partitioned and pooled analysis of *shall / will* data in DCPSE (Excel spreadsheet)

#### Part 6. Concluding remarks

##### 18. Plotting the Wilson distribution

Calculator

- Wilson distribution (Excel spreadsheet)
  - Wilson, Wilson c.c., Logit Wilson
  - Clopper-Pearson (up to *n* = 10)

##### 19. In conclusion

#### Appendices

##### Appendix A. The interval equality principle

##### Appendix B. Pseudo-code for computational procedures

#### Glossary

#### References

### Blog

### corp.ling.stats

### Publisher’s website

- Statistics in Corpus Linguistics Research (Routledge)

