## Statistical Science Seminars

**Usual time**: Thursdays 16:00 - 17:00

**Location**: Room 102, Department of Statistical Science, 1-19 Torrington Place (1st floor).

Some seminars are held in different locations at different times. Click on the abstract for more details.

## 30 September, 2-3pm: Stanislav Volkov (Lund University)

### On random geometric subdivisions

I will present several models of random geometric subdivisions, similar to that of Diaconis and Miclo (Combinatorics, Probability and Computing, 2011), where a triangle is split into 6 smaller triangles by its medians, one of these parts is randomly selected as a new triangle, and the process continues ad infinitum. I will show that in a similar model the limiting shape of an indefinite subdivision of a quadrilateral is a parallelogram. I will also show that the geometric subdivisions of a triangle by angle bisectors converge (but only weakly) to a non-atomic distribution, and that the geometric subdivisions of a triangle by choosing uniform random points on its sides converge to a flat triangle, similarly to the result of the paper mentioned above.
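The median-based subdivision described above can be simulated in a few lines. The following is only a minimal sketch, assuming a uniform choice among the six sub-triangles and an equilateral starting triangle; all function names are hypothetical and not taken from the talk or the cited paper. Because only the shape matters, the triangle is rescaled after every step to avoid losing floating-point precision.

```python
import math
import random

def subdivide_by_medians(tri):
    """Return the six triangles obtained by cutting `tri` along its three medians."""
    a, b, c = tri

    def mid(p, q):
        return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

    g = ((a[0] + b[0] + c[0]) / 3, (a[1] + b[1] + c[1]) / 3)  # centroid
    mab, mbc, mca = mid(a, b), mid(b, c), mid(c, a)
    return [(a, mab, g), (mab, b, g), (b, mbc, g),
            (mbc, c, g), (c, mca, g), (mca, a, g)]

def area(tri):
    """Unsigned area of a triangle given as three (x, y) vertices."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    return abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)) / 2

def smallest_angle(tri):
    """Smallest interior angle in radians; values near 0 indicate a flat triangle."""
    angles = []
    for i in range(3):
        p, q, r = tri[i], tri[(i + 1) % 3], tri[(i + 2) % 3]
        ux, uy = q[0] - p[0], q[1] - p[1]
        vx, vy = r[0] - p[0], r[1] - p[1]
        angles.append(math.atan2(abs(ux * vy - uy * vx), ux * vx + uy * vy))
    return min(angles)

def normalise(tri):
    """Translate the first vertex to the origin and scale the longest side to 1."""
    a, b, c = tri
    longest = max(math.dist(a, b), math.dist(b, c), math.dist(c, a))
    return tuple(((p[0] - a[0]) / longest, (p[1] - a[1]) / longest) for p in tri)

def simulate(steps, seed=0):
    """Follow one random branch of the subdivision; record the smallest angle per step."""
    rng = random.Random(seed)
    tri = ((0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2))  # assumed starting shape
    angles = []
    for _ in range(steps):
        tri = normalise(rng.choice(subdivide_by_medians(tri)))
        angles.append(smallest_angle(tri))
    return angles
```

Plotting the output of `simulate` for a few seeds gives an informal view of how the shape evolves along one branch; it is an illustration only and says nothing about the rates or modes of convergence discussed in the talk.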

## 06 October: Steven Muirhead (University of Oxford)

### Discretisation schemes for level sets of planar Gaussian fields

Gaussian random fields are prevalent throughout mathematics and the sciences, for instance in physics (wave-functions of high energy electrons), astronomy (cosmic microwave background radiation) and probability theory (connections to SLE, random tilings etc). Despite this, the geometry of such fields, for instance the connectivity properties of level sets, is poorly understood. In this talk I will discuss methods of extracting geometric information about level sets of a planar Gaussian random field through discrete observations of the field. In particular, I will present recent work that studies three such discretisation schemes, each tailored to extract geometric information about the level sets to a different level of precision, along with some applications.

## 13 October: Catalina Vallejos (The Alan Turing Institute)

### BASiCS: Bayesian Analysis of Single Cell Sequencing data

Recently, single-cell mRNA sequencing (scRNA-seq) has emerged as a novel tool for quantifying gene expression profiles of individual cells. These assays can provide novel insights into a tissue's function and regulation. However, besides experimental issues, statistical analysis of scRNA-seq data is itself a challenge. In particular, a prominent feature of scRNA-seq experiments is strong measurement error. This is reflected by (i) technical dropouts, where a gene is expressed in a cell but its expression is not captured through sequencing and (ii) poor correlation between expression measurements of technical replicates. Critically, these effects must be taken into account in order to reveal biological findings that are not confounded by technical variation.

In this talk I introduce BASiCS (Bayesian Analysis of Single-Cell Sequencing data) [1,2], an integrative approach to jointly infer biological and technical effects in scRNA-seq datasets. It builds upon a Bayesian hierarchical modelling framework, based on a Poisson formulation. BASiCS uses a vertical integration approach, exploiting a set of "gold-standard" genes in order to quantify technical artifacts. Additionally, it provides a probabilistic decision rule to identify (i) key drivers of heterogeneity within a population of cells and (ii) changes in gene expression patterns between multiple populations (e.g. experimental conditions or cell types). More recently, we extended BASiCS to experimental designs where gold-standard genes are not available using a horizontal integration framework, where technical variation is quantified through the borrowing of information from observations across multiple groups of samples (e.g. sequencing batches that are not confounded with the biological effect of interest). Control experiments validate our method's performance and a case study suggests that novel biological insights can be revealed.

Our method is implemented in R and available at https://github.com/catavallejos/BASiCS.

*[1] Vallejos, Marioni and Richardson (2015) PLoS Computational Biology*

*[2] Vallejos, Richardson and Marioni (2016) Genome Biology*

## 20 October: Bob Sturm (Queen Mary University of London)

### Clever Hans, Clever Algorithms: Are your machine learnings learning what you think?

In machine learning, generalisation is the aim, and overfitting is the bane; but just because one avoids the latter does not guarantee the former. Of particular importance in some applications of machine learning is the “sanity” of the models learnt. In this talk, I discuss one discipline in which model sanity is essential, machine music listening, and how several hundreds of research publications may have unknowingly built, tuned, tested, compared and advertised “horses” instead of solutions. The true cautionary tale of the horse-genius Clever Hans provides the most appropriate illustration, but also ways forward.

*B. L. Sturm, “A simple method to determine if a music information retrieval system is a ‘horse’,” IEEE Trans. Multimedia, vol. 16, no. 6, pp. 1636-1644, 2014.*

## 24 October, 5-6pm: Mircea Petrache (Max-Planck-Institut Leipzig)

### Asymptotic rigidity at zero temperature for large particle systems with power-law interactions

We consider the model of a gas of N particles in d-dimensional Euclidean space, which have inverse-power-law interactions with exponent s<d. The most celebrated case is the case of Coulomb potentials for which s=d-2 (where for d=2, s=0 we consider logarithmic interactions instead). Our particles are confined to a compact set and in the limit of N going to infinity, we study the asymptotic behavior of the energy.

The leading order in N is quadratic in N and described by a mean-field energy on probability measures. I will describe a strategy for controlling the next-order term, which grows like the (1+s/d)-th power of N. This lower-order term is expressed in terms of an energy W on "micro-scale configurations" of the particles.

As the temperature of the gas tends to zero, the gas "crystallizes" on minimizers of W, with a conjectural drop of complexity. I will present the study of asymptotics of minimizers in which by using the energy W we produce a first quantification of this rigidity phenomenon, by proving hyperuniformity and quantitative equidistribution of the configurations.

Possible extensions including related numerical conjectures in several directions will be presented.

*The talk is based on joint papers with S. Rota-Nodari and S. Serfaty.*

## 03 November: Aidan O'Keeffe (University College London)

### Correlated reversible multi-state models using random effects: Application to renal function modelling in systemic lupus erythematosus

In many scenarios, observational data are collected on multiple continuous-time processes of interest for a number of subjects seen at discrete time points. While each process may be modelled separately using a multi-state model, the processes for each subject are likely to be correlated. We explore the use of random effects to account for such correlation, an approach for which there is a limited literature. Specifically, we develop a modelling framework that allows the incorporation of random effects when fitting a series of general multi-state models, including reversible models, and we describe the use of standard statistical software to fit such models. An example concerning the modelling of renal (kidney) function using two processes in a cohort of systemic lupus erythematosus patients is used to demonstrate the methodology.

## 10 November: Chaitanya Joshi (University of Waikato, Hamilton)

### Improving grid based Bayesian Methods

In some cases, computational benefit can be gained by exploring the hyper-parameter space using a deterministic set of grid points instead of a Markov chain. We view this as a numerical integration problem and make three unique contributions. First, we explore the space using low discrepancy point sets instead of a grid. This allows for accurate estimation of marginals of any shape at a much lower computational cost than a grid-based approach and thus makes it possible to extend the computational benefit to a hyper-parameter space with higher dimensionality (10 or more). Second, we propose a new, quick and easy method to estimate the marginal using a least squares polynomial and prove the conditions under which this polynomial will converge to the true marginal. Our results are valid for a wide range of point sets including grids, random points and low discrepancy points. Third, we show that further accuracy and efficiency can be gained by taking into consideration the functional decomposition of the integrand and illustrate how this can be done using anchored f-ANOVA on weighted spaces.
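To make the contrast between a grid and a low discrepancy point set concrete, here is a minimal sketch. The abstract does not say which point set is used, so the Halton sequence below is an assumption chosen for illustration, and the unit-square integrand is a toy stand-in for a real hyper-parameter posterior. All function names are hypothetical.

```python
import itertools

def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer in the given base."""
    result, frac = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * frac
        frac /= base
    return result

def halton(n_points, bases=(2, 3)):
    """First n_points of a Halton sequence in [0, 1)^d, one coprime base per dimension."""
    return [tuple(radical_inverse(i, b) for b in bases)
            for i in range(1, n_points + 1)]

def tensor_grid(points_per_axis, dim=2):
    """Midpoint tensor grid in [0, 1)^dim; its size is points_per_axis ** dim."""
    axis = [(k + 0.5) / points_per_axis for k in range(points_per_axis)]
    return list(itertools.product(axis, repeat=dim))

def estimate_integral(f, points):
    """Equal-weight quadrature: average of f over the given point set."""
    return sum(f(*p) for p in points) / len(points)
```

The design point this illustrates is the cost in higher dimensions: a tensor grid with m points per axis needs m^d evaluations, so 10 points per axis in 10 dimensions already means 10^10 evaluations, whereas a sequence such as `halton` can be truncated at any budget N.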

## 24 November: Richard Chandler (University College London)

### Natural hazards, risk and uncertainty – some hot topics

Society is, and has always been, vulnerable to damage resulting from natural disasters: earthquakes, floods, wind storms, tsunami and so on. In recent decades, the global human and economic cost of such disasters has increased rapidly: reasons for this include an increasing global population, infrastructure construction in regions that were previously undeveloped, and the effects of climate and other environmental change. The 2015 United Nations Global Assessment Report on Disaster Risk Reduction estimates the future cost of natural disasters at 314 billion US dollars per year, in terms of impacts on the built environment alone. There is considerable incentive, therefore, to develop a better understanding of natural hazards and the associated risks, in order to inform strategies for improving societal resilience.

For all natural hazards, our understanding of the relevant processes comes from a combination of data and models. But data are often sparse, incomplete and prone to errors or inhomogeneities; and, as a well-known statistician once remarked, all models are wrong. Our understanding of natural hazards and their consequences is inevitably incomplete, therefore. Hazard scientists, planners and policymakers must acknowledge the underlying uncertainties, and account for them appropriately. However, scientists often lack training in the kinds of statistical techniques that are required to characterise and communicate uncertainty appropriately in problems that are often highly complex; and planners and policymakers often lack training in how to make rational and robust decisions in the presence of uncertainty. At the same time, the last 20 years have seen enormous progress by the statistical and other communities in tackling the relevant issues. This creates tremendously exciting opportunities for collaboration, with potential to reshape the way in which risk from natural disasters is handled: in the UK this has been recognised by the Natural Environment Research Council in their funding of a four-year research programme entitled Probability, Uncertainty and Risk in the Environment (PURE). In this talk I will present examples to illustrate some of these opportunities, partly drawing on my own experience from the PURE programme, with application areas including the construction of seismic hazard maps, earthquake engineering and the assessment of climate change impacts.

## 01 December: Geert Dhaene (Katholieke Universiteit Leuven)

### Profile score adjustments for incidental parameter problems

*Joint work with Koen Jochmans, Sciences Po*

We propose a scheme of iterative adjustments to the profile score to deal with incidental-parameter bias in models for stratified data with few observations on a large number of strata. The first-order adjustment is based on a calculation of the profile-score bias and evaluation of this bias at maximum-likelihood estimates of the incidental parameters. If the bias does not depend on the incidental parameters, the first-order adjusted profile score is fully recentered, solving the incidental-parameter problem. Otherwise, it is approximately recentered, alleviating the incidental-parameter problem. In the latter case, the adjustment can be iterated to give higher-order adjustments, possibly until convergence. The adjustments are generally applicable (e.g., not requiring parameter orthogonality) and lead to estimates that generally improve on maximum likelihood. We examine a range of nonlinear models with covariates. In many of them, we obtain an adjusted profile score that is exactly unbiased. In the others, we obtain approximate bias adjustments that yield much improved estimates, relative to maximum likelihood, even when there are only two observations per stratum.
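The incidental-parameter problem with two observations per stratum can be seen in the classical Neyman-Scott setting, which is used here purely as an illustration and is not taken from the abstract. Each stratum has its own mean and a common variance; the maximum-likelihood variance estimate converges to half the true variance when there are two observations per stratum. The sketch below simulates this and applies the familiar degrees-of-freedom correction as a stand-in for a fully recentred estimator. It is not the authors' iterative adjustment scheme, and all names and parameter values are hypothetical.

```python
import random

def simulate_strata(n_strata, obs_per_stratum=2, sigma=1.0, seed=0):
    """Draw y_ij ~ N(alpha_i, sigma^2) with a separate incidental mean per stratum."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_strata):
        alpha = rng.gauss(0.0, 5.0)  # stratum-specific incidental parameter
        data.append([rng.gauss(alpha, sigma) for _ in range(obs_per_stratum)])
    return data

def mle_variance(data):
    """Maximum-likelihood estimate of sigma^2 after profiling out the stratum means."""
    total, count = 0.0, 0
    for ys in data:
        mean = sum(ys) / len(ys)
        total += sum((y - mean) ** 2 for y in ys)
        count += len(ys)
    return total / count

def adjusted_variance(data):
    """Degrees-of-freedom correction T / (T - 1), assuming equal stratum sizes T."""
    t = len(data[0])
    return mle_variance(data) * t / (t - 1)
```

With a true variance of 1 and many strata, `mle_variance` settles near 0.5 regardless of how many strata are added, while `adjusted_variance` settles near 1, which is the kind of bias the profile-score adjustments are designed to remove in more general models.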

## 07 December, 5-6pm: Ioannis Kasparis (University of Cyprus)

### Regressions with fractional d=0.5 and weakly non stationary processes

*Joint with James Duffy, Oxford University*

Despite major advances in the statistical analysis of fractionally integrated time series models, no limit theory is available for sample averages of fractionally integrated processes with memory parameter d=0.5. We provide limit theory for sample averages of a general class of nonlinear transformations of fractional d=0.5 (I(0.5)) and Mildly Integrated (MI) processes (e.g. Phillips and Magdalinos, 2007). Although I(0.5) processes lie in the nonstationary region, the asymptotic machinery that is routinely used for I(d), d>1/2 processes is not valid in the I(0.5) case. In particular, the usual tightness conditions required for establishing FCLTs fail in the case of I(0.5) processes and a different approach is required. A general method that applies to both I(0.5) and MI processes is proposed. We show that sample averages of transformations of I(0.5) and MI processes converge in distribution to the convolution of the transformation function and some Gaussian density evaluated at a possibly random point. The class of nonlinear transformations under consideration accommodates a wide range of regression models used in empirical work including step type discontinuous functions, functions with integrable poles as well as integrable kernel functions that involve bandwidth terms. Our basic limit theory is utilised for the asymptotic analysis of the LS and the Nadaraya-Watson (NW) kernel regression estimators. We show that the NW estimator has either normal or mixed normal limit distribution, and attains slower convergence rates than those known for stationary processes. On the other hand, the LS estimator attains faster convergence rates than those attained under stationarity, and limit distributions are either normal or nonstandard.
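For readers unfamiliar with the Nadaraya-Watson estimator mentioned above, the following is a minimal sketch of its standard form: a kernel-weighted average of the responses, with weights depending on the distance between the evaluation point and each regressor value scaled by a bandwidth. The Gaussian kernel and the function names are assumptions made for illustration; the abstract does not specify a kernel, and nothing here reproduces the limit theory of the talk.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nadaraya_watson(x0, xs, ys, bandwidth):
    """Kernel-weighted average of ys, with weights K((x0 - x_i) / bandwidth)."""
    weights = [gaussian_kernel((x0 - x) / bandwidth) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no kernel mass at x0; increase the bandwidth")
    return sum(w * y for w, y in zip(weights, ys)) / total
```

The bandwidth plays the role of the "bandwidth terms" referred to in the abstract: it controls how local the average is.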

## 08 December: Alexina Mason (London School of Hygiene and Tropical Medicine)

### Bayesian models for addressing informative missingness in the analysis of longitudinal clinical trial data

Missing data can be a serious problem in longitudinal studies because of the increased chance of drop-out and other non-response across the multiple time points, and can be particularly challenging when there are different causes of the missing values. For instance, the reasons that patients completely drop out of the study (monotone missingness) may be very different from those for failing to attend a particular follow-up appointment (intermittent missingness). Also, for some types of missingness, it is often plausible to assume that data may be ‘missing not at random’ (MNAR), i.e. after conditioning on the observed data, the probability of missing data may depend on the underlying unobserved values. For example, in critical care trials the collection of hourly/daily biomedical data may take place at the local physician’s discretion and lead to intermittent missingness that is related to the severity of the patient’s illness. Faced with MNAR data, missing data guidelines recommend sensitivity analysis to allow for alternative assumptions about the missing data. A useful approach is to use selection models, which specify a marginal distribution for the outcomes and a conditional distribution for the missing value indicators given the outcomes. Selection models are particularly attractive in longitudinal studies, because they can recognise that the missing data mechanism may be distinct across the different types of missingness.
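The selection-model factorisation described above can be written out as follows. The notation is assumed for illustration and is not taken from the talk: y denotes the full outcome vector, split into observed and missing parts, m the vector of missing value indicators, and θ and ψ the parameters of the outcome model and the missingness model respectively.

```latex
f(y, m \mid \theta, \psi) \;=\; f(y \mid \theta)\, f(m \mid y, \psi),
\qquad y = (y_{\mathrm{obs}},\, y_{\mathrm{mis}}).
```

In this notation the data are MNAR when f(m | y_obs, y_mis, ψ) still depends on y_mis after conditioning on y_obs, which is why sensitivity analysis over alternative forms of the second factor is recommended.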

This research proposes flexible Bayesian selection models for assessing the robustness of trial results to alternative realistic assumptions about the different forms of missingness. We illustrate the methods using cardiac index data, collected at baseline and 9 subsequent time-points, from the Vasopressin and Septic Shock Trial (VASST). Monitoring started after baseline for a third of the patients and was discontinued as a result of both death and recovery. We compare the results from alternative assumptions about the longitudinal missing data mechanisms with the published trial results and assess the implications for decision uncertainty. We conclude that this approach to sensitivity analysis provides a flexible framework to assess the implications of the missing data for the trial conclusions.

## 15 December: Omiros Papaspiliopoulos (Universitat Pompeu Fabra, Barcelona)

TBA