Statistical Science Seminars
Usual time: Thursdays 16:00 - 17:00
Location: Room 102, Department of Statistical Science, 1-19 Torrington Place (1st floor).
Some seminars are held in different locations at different times. Click on the abstract for more details.
Mixing time of the exclusion process on hypergraphs
We introduce a process defined on hypergraphs and study its mixing time. This process can be viewed as an extension of the exclusion process to hypergraphs. Using the chameleon process, a tool introduced by Morris and developed further by Oliveira, we prove, for any hypergraph within a certain class, an upper bound on the mixing time of the exclusion process in terms of the mixing time of a simple random walk on the same hypergraph. This is joint work with Stephen Connor.
A Hierarchical Bayesian Model for Inference of Copy Number Variants and Their Association with Gene Expression
Cancer is the result of a dynamic interplay at different molecular levels (DNA, mRNA and protein). Elucidating the associations between two or more of these levels would enable the identification of biological relationships and lead to improvements in cancer diagnosis and treatment. For this purpose, the development of statistical methodologies able to identify these relationships is crucial. In this talk, I present a model for the integration of high-throughput data from different sources. In particular, I focus on combining transcriptomic data (gene expression profiling) with genomic data collected on the same subjects. At the DNA level, I focus on measuring copy number variation (CNV) using comparative genomic hybridization (CGH) arrays. I specify a measurement error model that relates the gene expression levels to latent copy number states. Selection of relevant associations is performed using selection priors that explicitly incorporate dependence information across adjacent copy number states. Copy number states are related to the observed surrogate CGH measurements via a hidden Markov model, which captures their characteristic state persistence. Posterior inference is carried out through Markov chain Monte Carlo techniques. To tackle the computational burden, I develop an algorithm that efficiently explores the space of all possible associations. The contribution of the methodology is twofold: to infer copy number variants and, simultaneously, their association with gene expression. The performance of the method is shown on simulated data, and I also illustrate an application to data from a prostate cancer study.
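The link between latent copy number states and observed CGH measurements rests on standard hidden Markov machinery. As a hedged illustration (a generic forward-algorithm sketch, not the talk's model; the three states and the sticky transition matrix below are invented for the example), the likelihood of an observation sequence under persistent latent states can be computed as:

```python
import numpy as np
from scipy.special import logsumexp

def forward_loglik(log_pi, log_A, log_emis):
    """Log-likelihood of an observation sequence under a hidden Markov model.

    log_pi   : (S,)   initial state log-probabilities
    log_A    : (S, S) transition log-probabilities; large diagonal entries
                      encode the state persistence mentioned in the abstract
    log_emis : (T, S) log-probability of each observation under each state
    """
    alpha = log_pi + log_emis[0]
    for t in range(1, len(log_emis)):
        # alpha[s] = log P(obs[0..t], state_t = s)
        alpha = log_emis[t] + logsumexp(alpha[:, None] + log_A, axis=0)
    return logsumexp(alpha)

# Toy example: three copy-number states ("loss", "neutral", "gain") with
# sticky transitions; emission values are illustrative, not the talk's model.
log_pi = np.log(np.full(3, 1 / 3))
log_A = np.log(np.array([[0.98, 0.01, 0.01],
                         [0.01, 0.98, 0.01],
                         [0.01, 0.01, 0.98]]))
log_emis = np.log(np.full((5, 3), 0.2))  # deliberately uninformative emissions
ll = forward_loglik(log_pi, log_A, log_emis)
```

With uninformative emissions the state path contributes nothing, so the log-likelihood collapses to the product of the emission probabilities, a convenient sanity check.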
Estimating counterfactual means of static and dynamic interventions in critical care
A growing body of work in causal inference focuses on estimating the effects of longitudinal interventions using observational data. Here, standard regression approaches cannot adjust for time-varying confounders, because those confounders can themselves be affected by the treatment. Inverse probability of treatment weighting and parametric g-computation are specialized methods that can consistently estimate the causal effects of longitudinal interventions, provided their underlying models are correctly specified.
An alternative approach is targeted maximum likelihood estimation (TMLE), which combines estimates of the treatment mechanism and the outcome, and is doubly robust, i.e. consistent if at least one of the two components is correctly specified. To minimize residual bias due to mis-specification of these components, TMLE is often coupled with data-adaptive estimation. Despite the flexibility of the TMLE framework for longitudinal settings, its uptake in applied work has been limited.
This presentation aims to demonstrate the feasibility of this approach in an evaluation of a critical care intervention: nutritional support for children admitted to the intensive care unit. In the context of this study, I will define the intervention-specific mean parameter, distinguish between static and dynamic treatment regimes, state identifying assumptions, and provide step-by-step guidance on estimation using parametric and data-adaptive methods.
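For intuition about the intervention-specific mean, here is a minimal sketch of parametric g-computation for a static point-treatment intervention (not the talk's longitudinal TMLE; the simulated data, the linear outcome model and the effect size 2.0 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

# Simulated point-treatment data: confounder W, treatment A, outcome Y
W = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-W)))      # treatment depends on W
Y = 2.0 * A + W + rng.normal(size=n)            # true intervention effect: 2.0

# Parametric g-computation: fit the outcome regression, then average the
# predictions under the static interventions "set A = 1" and "set A = 0"
X = np.column_stack([np.ones(n), A, W])
beta = np.linalg.lstsq(X, Y, rcond=None)[0]
mean_a1 = (np.column_stack([np.ones(n), np.ones(n), W]) @ beta).mean()
mean_a0 = (np.column_stack([np.ones(n), np.zeros(n), W]) @ beta).mean()
ate = mean_a1 - mean_a0
```

The longitudinal version discussed in the talk iterates this idea over time points, which is where the time-varying confounding problem bites.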
Robustness and Efficiency of Covariate-Adjusted Linear Instrumental Variable Methods
Instrumental variables (IVs) provide an approach for consistent inference on causal effects even in the presence of unmeasured confounding. Such methods have, for instance, been used in the context of Mendelian randomisation, as well as in pharmaco-epidemiological settings. In these and other applications, it is common that covariates are available, even if they are deemed insufficient to adjust for all confounding. Because IVs allow inference when there is unobserved confounding, analysts often assume that even observed confounders/covariates need not, or should not, be taken into account. However, this is not generally the case. With a view to the role of covariates, we contrast two-stage least squares estimators, generalized method of moments estimators and variants thereof with methods more common in biostatistics that use G-estimation in so-called structural mean models.
When using covariates, there are structural aspects to be considered, e.g. whether the covariates are prior to or potentially affected by the instruments. But in addition, one has to worry even more about efficiency versus model misspecification when modelling covariates. We discuss this for the IV procedures mentioned above, especially for linear instrumental variable models. Our results motivate adaptive procedures that guarantee efficiency improvements through covariate adjustment, without the need for covariate selection strategies. Besides theoretical findings, simulation results will be shown to provide numerical insight.
(This is joint work with Stijn Vansteelandt)
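As a toy illustration of how an instrument recovers a causal effect under unmeasured confounding (a hedged sketch, not the authors' estimators; the data-generating coefficients 0.8 and 1.5 are invented for the example), two-stage least squares can be implemented directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
u = rng.normal(size=n)                    # unmeasured confounder
z = rng.normal(size=n)                    # instrument (independent of u)
x = 0.8 * z + u + rng.normal(size=n)      # exposure, confounded by u
y = 1.5 * x + u + rng.normal(size=n)      # outcome; true causal effect 1.5

# Naive OLS of y on x is biased because x and y share the confounder u
X = np.column_stack([np.ones(n), x])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Two-stage least squares: regress x on z, then y on the fitted values
Z = np.column_stack([np.ones(n), z])
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
X2 = np.column_stack([np.ones(n), x_hat])
beta_2sls = np.linalg.lstsq(X2, y, rcond=None)[0][1]
```

Adding baseline covariates to both stages is where the efficiency-versus-misspecification trade-offs discussed in the talk arise.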
Quantile Cross-Spectral Measures of Dependence
I introduce quantile cross-spectral analysis of multiple time series, which is designed to detect general dependence structures emerging in quantiles of the joint distribution in the frequency domain. This type of dependence is natural for, for example, economic time series but remains invisible when traditional analysis is employed. New estimators which capture the general dependence structure are defined, a detailed analysis of their asymptotic properties is given, and a discussion of how to conduct inference is provided for a general class of possibly nonlinear processes. In an empirical illustration, one of the most prominent time series in economics is examined, and new light is shed on the dependence of bivariate stock market returns.
(joint work with Jozef Barunik, Charles University, Prague)
Mixture Models for Detecting Subgroups and Differential Item Functioning
Mixture models are a flexible tool to uncover latent groups for which separate models hold. The Rasch model can be used to measure latent traits by modelling the probability of a subject solving an item through the subject's ability and the item's difficulty. A crucial assumption of the Rasch model is measurement invariance: each item measures the latent trait in the same way for all subjects. Measurement invariance is violated if, e.g., an item is of different difficulty for different (groups of) subjects. Mixtures of Rasch models can be used to check whether one Rasch model with a single set of item difficulties holds for all subjects, and thus whether measurement invariance is violated. However, estimation of the item difficulties in a Rasch mixture model is not independent of the specification of the score distribution, which is based on the abilities. The latent groups detected with such a Rasch mixture model are not based solely on the item difficulties but also -- or even only -- on the scores, and thus on subject abilities. If the aim is to detect violations of measurement invariance, only latent groups based on item difficulties are of interest, because different ability groups do not infringe on measurement invariance.
This talk presents a new specification of the Rasch mixture model which ensures that the latent classes uncovered are based solely on the item difficulties, thus increasing the model's suitability as a tool for detecting violations of measurement invariance. Open-source software for estimating various flavors of the Rasch mixture model, including the newly suggested specification, is provided in the form of the R package 'psychomix'.
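To see why latent groups defined by item difficulties matter, one can simulate differential item functioning under the Rasch model (a hedged toy example; the difficulties and two-group structure below are invented, and no mixture model is fitted here):

```python
import numpy as np

rng = np.random.default_rng(1)

def rasch_prob(theta, b):
    # Rasch model: P(subject solves item) = logistic(ability - difficulty)
    return 1.0 / (1.0 + np.exp(-(theta - b)))

n = 2000
theta = rng.normal(size=n)                   # abilities, identical across groups
b_A = np.array([-1.0, 0.0, 0.5, 1.0])        # item difficulties, group A
b_B = np.array([-1.0, 0.0, 2.0, 1.0])        # item 3 is harder for group B (DIF)
group = rng.integers(0, 2, size=n)
b = np.where(group[:, None] == 0, b_A, b_B)
responses = rng.random((n, 4)) < rasch_prob(theta[:, None], b)

# Item 3 pass rates differ across groups even though ability distributions
# are identical: a violation of measurement invariance
rate_A = responses[group == 0, 2].mean()
rate_B = responses[group == 1, 2].mean()
```

A mixture over item-difficulty vectors, as in the talk, aims to recover exactly this kind of group structure without the score distribution contaminating the classes.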
12 November: SPECIAL EVENT - LMS 150th Anniversary. Invited Speaker: Alison Etheridge (University of Oxford)
Joint with the Women in Mathematical Sciences event series, lunch is scheduled for 12 noon - 2pm in room 706, mathematical science building (25 Gordon Street).
Joint with the LMS 150th Anniversary event series, drinks will be served at 6pm in room 102, statistical science building (1-19 Torrington Place).
12 November: Alison Etheridge (University of Oxford)
The pain in the torus: modelling evolution in a spatial continuum
Since the pioneering work of Fisher, Haldane and Wright at the beginning of the 20th Century, mathematics has played a central role in theoretical population genetics. One of the outstanding successes is Kingman's coalescent. This process provides a simple and elegant description of the way in which individuals in a population are related to one another. It is based on the simplest possible model of inheritance and is parametrised in terms of a single number, the population size. However, in using the Kingman coalescent as a model of real populations, one does not substitute the actual census population size, but rather an `effective' population size which somehow captures the evolutionary forces that are omitted from the model. It is astonishing that this works; the effective population size is typically orders of magnitude different from the census population size. In order to understand the apparent universality of the Kingman coalescent, we need models that incorporate things like variable population size, natural selection and spatial and genetic structure. Some of these are well established, but, until recently, a satisfactory approach to populations evolving in a spatial continuum has proved surprisingly elusive. In this talk we describe a framework for modelling spatially distributed populations that was introduced in joint work with Nick Barton (IST Austria). As time permits we'll not only describe the application to genetics, but also some of the intriguing mathematical properties of some of the resulting models.
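The Kingman coalescent's description of ancestry is simple enough to simulate in a few lines (a standard textbook construction, not specific to the talk): while j lineages remain, the waiting time to the next coalescence is exponential with rate j(j-1)/2, so the expected time to the most recent common ancestor of k lineages is 2(1 - 1/k) in coalescent units.

```python
import numpy as np

rng = np.random.default_rng(42)

def kingman_tmrca(k, n_reps=100_000):
    """Simulate the time to the most recent common ancestor of k lineages.

    With j lineages remaining, the waiting time to the next coalescence
    is Exponential with rate j*(j-1)/2 (time in coalescent units).
    """
    total = np.zeros(n_reps)
    for j in range(k, 1, -1):
        total += rng.exponential(scale=2.0 / (j * (j - 1)), size=n_reps)
    return total

# Theory: E[T_MRCA] = 2 * (1 - 1/k)
sim_mean = kingman_tmrca(10).mean()
```

Rescaling real time by the effective population size is what maps this idealised clock onto data, which is why the effective/census discrepancy in the abstract is so striking.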
The two-periodic Aztec diamond
Random domino tilings of the Aztec diamond shape exhibit interesting features and statistical properties related to random matrix theory.
As a statistical mechanical model it can be thought of as a dimer model or as a certain random surface. We consider the Aztec diamond with a two-periodic weighting which exhibits all three possible phases that occur in these types of models, often referred to as solid, liquid and gas. This model is considerably harder to analyze than the uniform model which only has two phases, liquid and solid. This talk presents an overview of the recent results on the two-periodic Aztec diamond including a partial description of the asymptotic behaviour at the liquid-gas boundary. These results are based on projects with Benjamin Young (Oregon), Kurt Johansson (KTH) and Vincent Beffara (Grenoble).
Market Impacts of Energy Storage in a Transmission-Constrained Power System
(joint work with Vilma Virasjoki, Paula Rocha, and Ahti Salo)
Environmental concerns have motivated governments in the European Union and elsewhere to set ambitious targets for generation from renewable energy (RE) technologies and to offer subsidies for their adoption along with priority grid access. However, because RE technologies like solar and wind power are intermittent, their penetration places greater strain on existing conventional power plants that need to ramp up more often. In turn, energy storage technologies, e.g., pumped hydro storage or compressed air storage, are proposed to offset the intermittency of RE technologies and to facilitate their integration into the grid. We assess the economic and environmental consequences of storage via a complementarity model of a stylised Western European power system with market power, representation of the transmission grid, and uncertainty in RE output. Although storage helps to reduce congestion and ramping costs, it may actually increase greenhouse gas emissions from conventional power plants in a perfectly competitive setting. Conversely, strategic use of storage by producers renders it less effective at curbing both congestion and ramping costs, while having no net overall impact on emissions.
The Coulomb gas in two dimensions
This will be an introduction for a general audience to the statistical mechanics of an infinite assembly of positively and negatively charged particles on the lattice $Z^2$.
There are some surprising theorems: for example, at large values of the parameter that represents temperature, empirical charge densities in disjoint regions are almost independent, even though the Coulomb interaction is very long range. This is called Debye screening. At small values of the temperature there is the famous Kosterlitz-Thouless phase where all charges are bound into neutral configurations; the system has the same long distance correlations as a random field known as the massless free field.
Recent progress by Pierluigi Falco has made it possible to understand the transition into this phase in detail, but this subject will never cease to generate interesting new questions, and I will point out some of those along the way.
Multichannel Sampling of Finite Rate of Innovation Signals
Recently, it has been shown that it is possible to sample and perfectly reconstruct some classes of non-bandlimited signals. In these schemes, the prior that the signal is sparse in a basis or in a parametric space is taken into account and perfect reconstruction is achieved based on a set of suitable measurements. Depending on the setup used and reconstruction method involved, these sampling methods go under different names such as compressed sensing (CS), compressive sampling, or sampling signals with finite rate of innovation (FRI). Sparse sampling theories are considered to be in the category of efficient sampling techniques as they allow sub-Nyquist sampling rates while achieving perfect retrieval of the observed signal.
In the first part of the talk, I will present a possible extension of the theory of sampling signals with finite rate of innovation to the case of multichannel acquisition systems, considering both 1D and 2D settings. In the second part of the talk, I will deviate slightly from the title and present some recent work on multidimensional estimation and statistical de-noising of seismic signals, in 2D and 5D.
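The core recovery step behind FRI sampling can be illustrated with the classical annihilating-filter (Prony) idea: a signal that is a sum of K exponentials is annihilated by a length-(K+1) filter whose roots are the modes, so the innovation parameters can be recovered from as few as 2K+1 samples. A hedged toy sketch (noiseless, with invented modes; real FRI schemes also involve sampling kernels and noise handling):

```python
import numpy as np

K = 2
u = np.array([0.9 * np.exp(1j * 0.3), 0.7 * np.exp(-1j * 1.1)])  # unknown modes
a = np.array([1.0, 2.0])                                          # unknown weights
n = np.arange(2 * K + 1)
s = (a[None, :] * u[None, :] ** n[:, None]).sum(axis=1)           # 5 samples

# Annihilation: for n >= K, sum_j h[j] * s[n-j] = 0, so h spans the
# null space of a small Toeplitz matrix built from the samples
T = np.array([[s[i + K - j] for j in range(K + 1)] for i in range(K)])
_, _, Vh = np.linalg.svd(T)
h = Vh[-1].conj()          # null-space vector = annihilating filter

roots = np.roots(h)        # recovered modes u_k, up to ordering
```

Once the modes are known, the weights follow from a linear system, which is why the parametric prior makes sub-Nyquist recovery possible.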
Recognition of Head Gestures in Spontaneous Human Conversations
Smooth and effective communication requires a proper understanding of the other party's attitude and emotion, for which head gestures are basic and important cues. By recognizing head gestures in spontaneous human conversations, we may be able to help blind people, robots, and even sighted people improve their communication skills.
Although head gestures have received considerable attention from researchers in linguistics and psychology, they must be labeled manually for analysis, which limits the scale of such research. Existing efforts on automatic recognition are restricted to only one or two types of acted gestures directed at non-humans (e.g., a virtual agent), and the reported performance is still far from sufficient for real applications.
We approach this problem by focusing on spontaneous human conversations, working with wearable cameras, which capture data most similar to what conversation participants actually perceive. State-of-the-art pattern recognition models have been tested, which not only enables a deeper understanding of the task itself but also reveals some interesting hard problems for further study. I will present our latest progress on this study and open a discussion of some general problems that may interest the statistics community.
Multiple imputation in Cox regression when there are time-varying effects of exposures
Cox regression is the most widely used method to study associations between exposures and times-to-event. It is often of interest to study whether there are time-varying effects of exposures; these can be investigated using the extended Cox model in which the log hazard ratio is modeled as a function of time. This is also a popular way of testing the proportional hazards assumption.
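As a small illustration of the extended Cox specification (a sketch only; the log-time form of the time-varying effect is one common choice, not necessarily the one used in the talk):

```python
import numpy as np

def log_hazard_ratio(t, beta0, beta1):
    # Time-varying log hazard ratio for a binary exposure, here modelled
    # as beta0 + beta1 * log(t); beta1 = 0 recovers proportional hazards
    return beta0 + beta1 * np.log(t)

def hazard(t, x, h0, beta0, beta1):
    # Extended Cox model: h(t | x) = h0(t) * exp(beta(t) * x)
    return h0 * np.exp(log_hazard_ratio(t, beta0, beta1) * x)
```

Testing whether beta1 = 0 in such a model is exactly the popular test of the proportional hazards assumption mentioned above.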
Missing data on explanatory variables are common and multiple imputation (MI) is a popular approach to handling it. The imputation model should accommodate the form of the analysis model and White and Royston (Stat Med 2009) derived an approximate imputation model suitable for missing exposures in Cox regression. Another approach to imputing missing data under the Cox model was described by Bartlett et al. (Stat Meth Med Res 2014), which uses rejection sampling to draw imputed values from the correct distribution. However no MI methods have been devised which handle time-varying effects of exposures.
In this talk I will show how the imputation model of White and Royston can be extended to accommodate time-varying effects of exposures, and also describe a simple extension to the method of Bartlett et al. Using simulations, we have shown that the proposed methods perform well, giving improvements relative to the complete case analysis. The methods also give approximately correct type I errors in the test for proportional hazards. Failure to account for time-varying effects in the imputation results in biased estimates and incorrect tests for proportional hazards.
I will also discuss some further work in which the time-varying effect is modelled using fractional polynomials rather than a pre-specified function. The methods will be illustrated using data from the Rotterdam Breast Cancer study.
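For readers unfamiliar with MI mechanics, here is a deliberately simplified sketch of regression-based imputation with pooling across imputed data sets (this is "improper" MI: a proper implementation would also redraw the imputation-model parameters for each imputation, and the data here are simulated purely for illustration, not drawn from the methods in the talk):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 2000, 20

# Simulated data: exposure x is ~30% missing, outcome y fully observed
x = rng.normal(size=n)
y = x + rng.normal(scale=0.5, size=n)
miss = rng.random(n) < 0.3

# Imputation model: regress x on y among complete cases
obs = ~miss
slope, intercept = np.polyfit(y[obs], x[obs], 1)
resid_sd = np.std(x[obs] - (intercept + slope * y[obs]))

# Draw m completed data sets, re-estimate, and pool (Rubin's rules use
# the mean of the estimates; here the target is simply the mean of x)
estimates = []
for _ in range(m):
    x_imp = x.copy()
    x_imp[miss] = (intercept + slope * y[miss]
                   + rng.normal(scale=resid_sd, size=miss.sum()))
    estimates.append(x_imp.mean())
pooled = float(np.mean(estimates))
```

The point of the talk is that when the analysis model includes time-varying effects, the imputation model on the first step must be compatible with them, otherwise the pooled estimates are biased.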