Farr seminar: Probabilistic anonymisation and data analysis for dissemination of large sensitive datasets

27 February 2017

Speaker: Prof Harvey Goldstein, UCL Institute of Child Health (Visiting Professor), University of Bristol (Professor of Social Statistics)

Venue: Room G01, Farr Institute of Health Informatics Research, 222 Euston Road, London, NW1 2DA

Date: Tuesday, 28 February 2017, 13:00-14:00

The proposal is to add random noise to some or all variables in a released pseudonymised dataset. The setting is one in which an external ‘attacker’ also holds the values of identifying variables for individuals of interest and wishes to identify those individuals in order to interrogate their records in the dataset. To prevent identification by an attacker who links on patterns in the values of these variables, enough noise must be generated and added to the identifying variables; the amount required will be discussed. Because the characteristics of the noise are known, it can be ‘removed’ at the analysis stage, which requires the linking agency to disclose the parameters of the noise distribution. For the data analyst this is formally a measurement error model, and there are efficient Bayesian procedures for fitting it and recovering consistent estimates of the true model parameters. The downside is a loss of efficiency, traded for the absence of any form of data degradation such as coarsening or grouping.
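As a rough illustration of why disclosed noise parameters let the analyst undo the distortion, the sketch below simulates a single noisy covariate in a linear regression. It is not the Bayesian procedure the talk describes: it swaps in a simple moment-based (reliability ratio) correction, and every name and value in it (`beta_true`, `noise_sd`, the sample size, the normal distributions) is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200_000        # hypothetical sample size
beta_true = 2.0    # hypothetical true regression slope
noise_sd = 1.0     # noise standard deviation, assumed disclosed with the data

# True covariate and outcome (never seen by the analyst in this scenario).
x = rng.normal(0.0, 1.0, n)
y = beta_true * x + rng.normal(0.0, 1.0, n)

# Released covariate: the true value plus independent random noise.
x_released = x + rng.normal(0.0, noise_sd, n)


def ols_slope(covariate, outcome):
    """Slope of a simple least-squares regression of outcome on covariate."""
    c = np.cov(covariate, outcome)
    return c[0, 1] / c[0, 0]


# Naive analysis on the released data: the slope is attenuated towards zero.
naive = ols_slope(x_released, y)

# Moment-based correction: divide by the estimated reliability ratio,
# var(true x) / var(released x), using the disclosed noise variance.
var_released = np.var(x_released, ddof=1)
reliability = (var_released - noise_sd**2) / var_released
corrected = naive / reliability

print(f"naive slope:     {naive:.3f}")
print(f"corrected slope: {corrected:.3f}")
```

With these assumed settings the naive slope should come out near half the true value, while the corrected slope should be close to `beta_true`, at the cost of a larger standard error. That larger standard error is the loss of efficiency mentioned above.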

Light refreshments and a sandwich lunch will be served from 12:30.