The DCPSE Research Project

Creating a parsed and searchable diachronic corpus of present-day spoken English


Download data

The complete DCPSE corpus is available on CD-ROM from the Survey of English Usage. DCPSE contains one million words of English sampled across the decades.

You can download the raw lexicon data from the links on the right-hand side.

Each zip archive contains a text file in the form of a "tabbed table". In ICECUP 3.1, a snapshot of the lexicon looks like the following.

[Figure: Lexicon example]

When the lexicon is saved to a text file, one option (Table) writes the frequencies as tab-separated columns, as in the excerpt further down this page.


You can import files like this into Excel or any other spreadsheet. If you import the file into a word processor, you may wish to convert the imported text to a table and reformat it as appropriate.

                                 Normal  Ignored     Both
*+<,AUX>                         54,680    2,370   57,050
 'd                               1,402      113    1,515
  <AUX(do,past,encl)>                 1        2        3
  <AUX(modal,past,encl)>            807       84      891
  <AUX(modal,past,encl,ditto)>        1        0        1
  <AUX(modal,past)>                   1        0        1
  <AUX(semi,pres,encl)>               0        2        2
  <AUX(semi,past,encl)>              42        0       42
  <AUX(perf,past,encl)>             546       24      570
  <AUX(perf,past)>                    1        0        1
  <AUX>                               3        1        4
 'll                              1,424      113    1,537
... etc

Note that leading spaces are used to show the structure: each additional space marks a row nested under the less-indented row above it.
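
As a rough illustration of how such a file might be read programmatically, the sketch below parses a tabbed table in Python. It assumes, based only on the excerpt above, that each row holds a label followed by three tab-separated counts (Normal, Ignored, Both), that counts use commas as thousands separators, and that the number of leading spaces gives the nesting depth. The names Entry and parse_lexicon are hypothetical and are not part of ICECUP.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    depth: int    # number of leading spaces in the label (assumed nesting level)
    label: str    # query, word form, or tag, with indentation stripped
    normal: int
    ignored: int
    both: int


def parse_lexicon(text: str) -> List[Entry]:
    """Parse a tabbed lexicon table into a flat list of entries with depths."""
    entries = []
    for line in text.splitlines():
        cells = line.split("\t")
        if len(cells) < 4:
            continue  # blank line, or a line without three count columns
        label = cells[0]
        try:
            # Assumption: counts are written with comma thousands separators.
            counts = [int(c.replace(",", "")) for c in cells[1:4]]
        except ValueError:
            continue  # header row or other non-numeric line
        depth = len(label) - len(label.lstrip(" "))
        entries.append(Entry(depth, label.strip(), *counts))
    return entries


# A few rows from the excerpt, with tabs restored between the columns.
sample = (
    "\tNormal\tIgnored\tBoth\n"
    "*+<,AUX>\t54,680\t2,370\t57,050\n"
    " 'd\t1,402\t113\t1,515\n"
    "  <AUX(do,past,encl)>\t1\t2\t3\n"
    "  <AUX(modal,past,encl)>\t807\t84\t891\n"
)
entries = parse_lexicon(sample)
for e in entries:
    print(e.depth, e.label, e.normal, e.ignored, e.both)
```

Rows that do not carry three numeric counts, such as a header row or the "... etc" line, are simply skipped.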

Frequencies cited represent the total number of cases for the lexeme across the entire corpus: some 927,545 words (965,673 including ignored material). "Ignored" refers to words in the corpus where the speaker corrected themselves or where the words were corrected by the editor (rare in the case of DCPSE).
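
Because the counts are raw totals, one common way to make them comparable with other corpora is to express them per million words. The sketch below is only an illustration of that arithmetic, using the corpus sizes quoted above and the 'd row from the excerpt; the function name per_million is hypothetical.

```python
NORMAL_WORDS = 927_545  # corpus size excluding ignored material (quoted above)
ALL_WORDS = 965_673     # corpus size including ignored material (quoted above)


def per_million(count: int, corpus_size: int) -> float:
    """Convert a raw count into a frequency per million words."""
    return count / corpus_size * 1_000_000


# 'd row from the excerpt: 1,402 normal cases, 1,515 including ignored cases.
d_normal = per_million(1_402, NORMAL_WORDS)
d_both = per_million(1_515, ALL_WORDS)
print(f"'d: {d_normal:.1f} per million (normal), {d_both:.1f} per million (both)")
```

Note that the "Normal" column is divided by the smaller corpus size and the "Both" column by the larger one, so that each count is compared with the total it was drawn from.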

Frequency data includes counts for compounded items. These are listed with the pseudo-feature 'ditto', as in <AUX(modal,past,encl,ditto)>, so that they may be removed or collapsed.
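
In the excerpt above, 'ditto' appears inside a tag's feature list, as in <AUX(modal,past,encl,ditto)>. The sketch below shows one possible reading of "removed or collapsed": either drop the ditto rows, or add their counts to the matching tag without 'ditto'. This interpretation, the tag pattern, and the names has_ditto and strip_ditto are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tag shape: <CLASS> or <CLASS(feature,feature,...)>
TAG_PATTERN = re.compile(r"<(\w+)(?:\((.*)\))?>")


def has_ditto(tag: str) -> bool:
    """Return True if 'ditto' is one of the tag's features."""
    m = TAG_PATTERN.fullmatch(tag)
    return bool(m and m.group(2) and "ditto" in m.group(2).split(","))


def strip_ditto(tag: str) -> str:
    """Return the tag with any 'ditto' feature removed."""
    m = TAG_PATTERN.fullmatch(tag)
    if not m or not m.group(2):
        return tag  # no feature list, or not a tag we recognise
    features = [f for f in m.group(2).split(",") if f != "ditto"]
    if not features:
        return f"<{m.group(1)}>"
    return f"<{m.group(1)}({','.join(features)})>"


# "Both" counts for three tags of 'd, taken from the excerpt.
rows = {
    "<AUX(modal,past,encl)>": 891,
    "<AUX(modal,past,encl,ditto)>": 1,
    "<AUX(perf,past,encl)>": 570,
}

# Option 1: remove the ditto rows altogether.
removed = {tag: n for tag, n in rows.items() if not has_ditto(tag)}

# Option 2: collapse each ditto row into its non-ditto counterpart.
collapsed = Counter()
for tag, n in rows.items():
    collapsed[strip_ditto(tag)] += n

print(removed)
print(dict(collapsed))
```

Which option is appropriate depends on whether compounded items should count towards the totals in your analysis.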

For more information...

For a thorough description of all these part-of-speech terms, with examples in context, we recommend that you download the sample corpus package. This includes the full ICECUP help file, including the ICE grammar reference, and, of course, the sample corpus and software.

A summary of the ICE grammar is published here.

This page last modified 14 May, 2020 by Survey Web Administrator.