LIDS methodology
LIDS is built entirely from ONS LS metadata available in the public CeLSIUS data dictionary. This design ensures that artificial data extracts contain no information derived from personal data.
Extracting metadata from the CeLSIUS data dictionary
While the data dictionary provides structured code lists (explicit value-label pairs) for many variables, others document this information in unstructured formats or not at all. Some variables present coding information via external links to census appendices or classification tables; others provide only free-text code notes with descriptive information. As the ONS LS incorporates data from both census and administrative sources spanning over fifty years, these variations reflect differences in metadata documentation practices across sources and time periods.
To maximise the coverage of variables available in LIDS, CeLSIUS developed methods to extract structured code lists from these diverse sources.
Graph-based identification of comparable variables
A comparability network was constructed to identify equivalences between variables across the entire data dictionary. Variables were linked based on multiple dimensions of comparability documented in the metadata, including:
- Explicit comparable variable references (the ‘comvartime’ and ‘comvarind’ fields)
- Shared classification schemes
- Shared ‘coded as’ references indicating identical coding
- Similar variable descriptions
These pairwise relationships were used to build a network graph, with connected components representing groups of variables that share equivalent or related coding schemes. This approach allowed variables lacking structured code lists to inherit that information from equivalent variables within their component that do have complete metadata.
Each variable was then assigned an extraction method based on the best available source within its comparability group: direct use of an existing structured code list, extraction from a linked census appendix table, or extraction from free-text code notes.
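The grouping and method-assignment steps can be sketched as follows. This is a minimal illustration rather than the CeLSIUS implementation: the variable names, the links between them, and the method labels are hypothetical, and the preference order (structured code list, then appendix table, then code notes) is an assumption based on the order in which the methods are listed above.

```python
from collections import defaultdict

# Assumed preference order for extraction methods (earlier = better source).
METHOD_PRIORITY = ["structured_code_list", "appendix_table", "code_notes"]


def connected_components(nodes, edges):
    """Group nodes into connected components with an iterative depth-first search."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for node in nodes:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            component.add(current)
            stack.extend(adjacency[current] - seen)
        components.append(component)
    return components


def assign_methods(nodes, edges, own_source):
    """Give every variable the best source found anywhere in its component."""
    assigned = {}
    for component in connected_components(nodes, edges):
        available = {own_source[v] for v in component if own_source.get(v)}
        best = next((m for m in METHOD_PRIORITY if m in available), None)
        for variable in component:
            assigned[variable] = best
    return assigned


# Hypothetical variables and pairwise links, for illustration only.
variables = ["VAR_A", "VAR_B", "VAR_C", "VAR_D", "VAR_E"]
links = [("VAR_A", "VAR_B"), ("VAR_B", "VAR_C")]
own_source = {
    "VAR_A": "code_notes",
    "VAR_C": "structured_code_list",
    "VAR_D": "appendix_table",
}

methods = assign_methods(variables, links, own_source)
```

In this sketch VAR_A and VAR_B inherit the structured code list held by VAR_C, VAR_D keeps its own appendix table, and VAR_E is left with no method (None) because nothing in its component documents a code list.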
Web scraping of reference tables
For variables whose coding information resides in external census or events appendices, the relevant appendix tables were retrieved programmatically. The scraped HTML tables were then parsed and reformatted into consistent value-label structures.
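The parsing step might look something like the sketch below, which uses only the Python standard library. It assumes a simple two-column table with the code in the first column and the label in the second, plus one header row; real appendix tables may be laid out differently. The sample HTML is invented for illustration, and the retrieval step itself is omitted.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_code_list(html, header_rows=1):
    """Turn a two-column HTML table into a {value: label} mapping."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    code_list = {}
    for row in parser.rows[header_rows:]:
        if len(row) >= 2 and row[0]:
            code_list[row[0]] = row[1]
    return code_list


# Invented sample table, for illustration only.
sample_html = """
<table>
  <tr><th>Code</th><th>Label</th></tr>
  <tr><td>1</td><td>Category  one</td></tr>
  <tr><td>2</td><td> Category two </td></tr>
  <tr><td>-9</td><td>Not stated</td></tr>
</table>
"""
code_list = table_to_code_list(sample_html)
```

Codes are kept as strings here so that values such as leading-zero codes would survive unchanged; that choice is an assumption of the sketch.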
Text mining of code notes
For variables where coding information exists only in free-text code notes, natural language processing techniques and regular expressions were applied to extract structured value-label pairs.
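As a rough illustration of the regular-expression part of this step, the sketch below pulls numeric codes and labels out of a free-text note. The pattern, the separators it accepts ("=" and ":"), and the sample note are all assumptions made for the example; the actual notes in the data dictionary are more varied and would need additional rules.

```python
import re

# Matches items such as "1 = Label" or "-9: Label"; a label runs until the
# next semicolon or line break. The lookbehind stops a match from starting
# in the middle of a word or number.
PAIR_PATTERN = re.compile(
    r"(?<![\w-])(?P<value>-?\d+)\s*[=:]\s*(?P<label>[^;\n]+)"
)


def extract_code_list(note):
    """Return a {value: label} mapping found in a free-text code note."""
    pairs = {}
    for match in PAIR_PATTERN.finditer(note):
        label = match.group("label").strip(" .")
        if label:
            pairs[match.group("value")] = label
    return pairs


# Invented code note, for illustration only.
sample_note = (
    "Hypothetical note. Codes are 1 = Category one; 2 = Category two\n"
    "-9: Not stated."
)
extracted = extract_code_list(sample_note)
```

A note containing no recognisable pairs yields an empty mapping, which is one way such a variable could be flagged as lacking a complete code list.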
Complete code lists could not be obtained for all variables. These variables remain available for selection in LIDS to preserve the complete table structures but appear with placeholder values indicating their status.
Generating impossible datasets
Using the variable code lists extracted from the CeLSIUS data dictionary, LIDS generates impossible data extracts through a straightforward process:
- Random value assignment: For each variable selected by a user, LIDS draws values independently and uniformly at random from that variable’s code list. Each observation receives a randomly sampled code with no consideration of values assigned to other variables in the same row or logical constraints that would apply in real data. This independence is what produces the ‘impossible’ combinations that give the tool its name.
- Record linkage identifiers: The ONS LS links individuals across census years and life events using a unique identifier (CORENO). When users select variables from any table, LIDS automatically includes CORENO and generates a set of unique identifier values shared across all selected tables. This allows users to practice linking records across tables in their code, even though the linked records contain randomly assigned values with no substantive relationship to one another.
- Observation sampling: Users specify the number of observations to generate, up to a maximum of 550,000. If the option to include all unique values is selected, LIDS first determines the minimum number of observations required to guarantee that every code for every selected variable appears at least once (equal to the number of codes in the largest code list among the selected variables), and adjusts the observation count upward if necessary. Values are then sampled with replacement to fill the requested number of rows.
- Table structure: The generated data preserve the table structure of the ONS LS. Variables are grouped into their original tables, and the download function produces separate files for each table. This mirrors the format of real ONS LS data extracts, where census data, vital events, and other sources are provided as distinct but linkable datasets.
- Data types: LIDS assigns a data type to each selected variable based on that variable's 'format' field in the data dictionary. Where the format field is empty, the data type defaults to character.
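The generation steps above can be sketched in a few lines of Python. This is an illustrative approximation, not the LIDS source code. The table and variable names are hypothetical, CORENO values are represented as arbitrary unique integers, and the way every code is guaranteed to appear (placing each code once, then topping up with replacement) is an assumption about one possible approach. Data type assignment is left out.

```python
import random

MAX_OBSERVATIONS = 550_000  # upper limit stated in the text


def resolve_row_count(requested, selection, include_all_values):
    """Cap the request, then raise it to the largest code list if needed."""
    n_rows = min(requested, MAX_OBSERVATIONS)
    if include_all_values:
        largest = max(
            (len(codes) for table in selection.values() for codes in table.values()),
            default=0,
        )
        n_rows = max(n_rows, largest)
    return n_rows


def generate_column(codes, n_rows, include_all_values, rng):
    """Draw codes uniformly with replacement, independently of other columns."""
    if include_all_values:
        # Assumed approach: include each code once, top up, then shuffle.
        values = list(codes) + rng.choices(codes, k=n_rows - len(codes))
        rng.shuffle(values)
        return values
    return rng.choices(codes, k=n_rows)


def generate_tables(selection, n_requested, include_all_values=False, seed=None):
    """Build one column dictionary per table, all sharing the same CORENO values."""
    rng = random.Random(seed)
    n_rows = resolve_row_count(n_requested, selection, include_all_values)
    # Unique identifiers shared across every table (format is hypothetical).
    coreno = rng.sample(range(1, 10 * n_rows + 1), n_rows)
    tables = {}
    for table_name, variables in selection.items():
        columns = {"CORENO": list(coreno)}
        for variable, codes in variables.items():
            columns[variable] = generate_column(codes, n_rows, include_all_values, rng)
        tables[table_name] = columns
    return tables


# Hypothetical selection: table name -> variable name -> code list.
selection = {
    "TABLE_ONE": {"VAR_A": ["1", "2", "3"], "VAR_B": ["1", "2"]},
    "TABLE_TWO": {"VAR_C": ["A", "B", "C", "D", "E"]},
}
tables = generate_tables(selection, n_requested=3, include_all_values=True, seed=42)
```

Here three rows are requested, but the largest code list has five codes, so the count is raised to five. Both tables carry the same CORENO values, so they can be joined on that column even though the joined values are unrelated.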