Statistical Science Seminars

A seminar series covering a broad range of applied and methodological topics in Statistical Science.

Talks take place in hybrid format.  

Usual time: Thursdays 14:00-15:00 (to be followed by Departmental Tea in the common room)

Location: Room 102, Department of Statistical Science, 1-19 Torrington Place, and Zoom (please use the contact info below to join the mailing list and receive the links to the talks).

Contact info: thomas dot bartlett dot 10 at ucl dot ac dot uk

Recent talks

Please subscribe to our YouTube channel to view some recent talks from the series.

Programme for 2021/22

30 September 2021: Martin Huber (Université de Fribourg) - Double machine learning for sample selection models

We consider the evaluation of discretely distributed treatments when outcomes are only observed for a subpopulation due to sample selection or outcome attrition. For identification, we combine a selection-on-observables assumption for treatment assignment with either selection-on-observables or instrumental variable assumptions concerning the outcome attrition/sample selection process. We also consider dynamic confounding, meaning that covariates that jointly affect sample selection and the outcome may (at least partly) be influenced by the treatment. To control in a data-driven way for a potentially high dimensional set of pre- and/or post-treatment covariates, we adapt the double machine learning framework for treatment evaluation to sample selection problems. We make use of (a) Neyman-orthogonal, doubly robust, and efficient score functions, which imply the robustness of treatment effect estimation to moderate regularization biases in the machine learning-based estimation of the outcome, treatment, or sample selection models and (b) sample splitting (or cross-fitting) to prevent overfitting bias. We demonstrate that the proposed estimators are asymptotically normal and root-n consistent under specific regularity conditions concerning the machine learners and investigate their finite sample properties in a simulation study. We also apply our proposed methodology to the Job Corps data for evaluating the effect of training on hourly wages which are only observed conditional on employment. The estimator is available in the causalweight package for the statistical software R.

7 October 2021: Francesco Ravazzolo (Libera Università di Bolzano) - Dynamic Combination and Calibration for Climate Predictions

We propose a density calibration and combination model that dynamically calibrates and combines predictive distributions. The time-varying calibration and combination weights are fitted by an observation-driven model with dynamics inferred by the score of the assumed conditional likelihood of the data generating process. The model is very flexible and can handle different shapes, instability and model uncertainty. We show this analytically and in simulation exercises. An empirical application to short-term wind speed predictions documents the large instability of individual model performance and their calibration properties, favouring our model in terms of predictive accuracy.

14 October 2021: Almut Veraart (Imperial College London) - High-frequency estimation of the Lévy-driven Graph Ornstein-Uhlenbeck process with applications to wind capacity factor measurements

We consider the Graph Ornstein-Uhlenbeck (GrOU) process observed on a non-uniform discrete time grid and introduce discretised maximum likelihood estimators with parameters specific to the whole graph or specific to each component, or node. Under a high-frequency sampling scheme, we study the asymptotic behaviour of those estimators as the mesh size of the observation grid goes to zero. We prove two stable central limit theorems to the same distribution as in the continuously observed case under both finite and infinite jump activity for the Lévy driving noise. When a graph structure is not explicitly available, the stable convergence allows us to consider purpose-specific sparse inference procedures, i.e. pruning, on the edges themselves in parallel to the GrOU inference and preserve its asymptotic properties. We apply the new estimators to wind capacity factor measurements, i.e. the ratio of the wind power produced locally to its rated peak power, across fifty locations in Northern Spain and Portugal. We show the superiority of those estimators compared to the standard least squares estimator through a simulation study extending known univariate results across graph configurations, noise types and amplitudes.
This is joint work with Valentin Courgeau (Imperial College London).

21 October 2021: Chris Oates (University of Newcastle) - Robust Generalised Bayesian Inference for Intractable Likelihoods

Generalised Bayesian inference updates prior beliefs using a loss function, rather than a likelihood, and can therefore be used to confer robustness against possible misspecification of the likelihood. Here we consider generalised Bayesian inference with a Stein discrepancy as a loss function, motivated by applications in which the likelihood contains an intractable normalisation constant. In this context, the Stein discrepancy circumvents evaluation of the normalisation constant and produces generalised posteriors that are either closed form or accessible using standard Markov chain Monte Carlo. On a theoretical level, we show consistency, asymptotic normality, and bias-robustness of the generalised posterior, highlighting how these properties are impacted by the choice of Stein discrepancy. Then, we provide numerical experiments on a range of intractable distributions, including applications to kernel-based exponential family models and non-Gaussian graphical models. 

29 October 2021: William Da Silva (Sorbonne Université Paris) - Multitype growth-fragmentations processes embedded in Brownian excursions

In joint work with Élie Aïdékon, we consider a Brownian excursion from 0 to 1 in the upper half-plane. It possibly makes excursions above the horizontal line of height h>0. We record the size of each of these excursions, defined as the difference between its endpoint and starting point. As h evolves, this particle system exhibits a branching structure that we investigate. We recover one of the growth-fragmentations revealed by Bertoin, Budd, Curien and Kortchemski, which appears in the scaling limit of Markovian explorations of Boltzmann planar maps. I will also present an extension of this model to higher dimensions, based on joint work with Juan Carlos Pardo, in which one considers a Brownian excursion in the half-space.

4 November 2021: Marta Catalano (University of Warwick) - A Wasserstein index of dependence for Bayesian nonparametric modeling

Optimal transport (OT) methods and Wasserstein distances are flourishing in many scientific fields as an effective means for comparing and connecting different random structures. In this talk we describe the first use of an OT distance between Lévy measures with infinite mass to solve a statistical problem. Complex phenomena often yield data from different but related sources, which are ideally suited to Bayesian modeling because of its inherent borrowing of information. In a nonparametric setting, this is regulated by the dependence between random measures: we derive a general Wasserstein index for a principled quantification of the dependence, gaining insight into the models’ deep structure. It also allows for an informed prior elicitation and provides a fair ground for model comparison. Our analysis unravels many key properties of the OT distance between Lévy measures, whose interest goes beyond Bayesian statistics, extending to the theory of partial differential equations and of Lévy processes.

11 November 2021: Pierre Jacob (ESSEC Business School, Paris) - Some methods based on couplings of Markov chain Monte Carlo algorithms 

Markov chain Monte Carlo algorithms are commonly used to approximate a variety of probability distributions, such as posterior distributions arising in Bayesian analysis. I will review the idea of coupling in the context of Markov chains, and how this idea not only leads to theoretical analyses of Markov chains, but also to new Monte Carlo methods. In particular, the talk will describe how coupled Markov chains can be used to obtain 1) unbiased estimators of expectations, with applications to the "cut distribution" and to normalizing constant estimation, 2) non-asymptotic convergence diagnostics for Markov chains, and 3) unbiased estimators of the asymptotic variance of an MCMC ergodic average.

18 November 2021: Matthias Fengler (Universität St.Gallen) - Identifying structural shocks to volatility through a proxy-MGARCH model

We extend the classical MGARCH specification for volatility modeling by developing a structural MGARCH model targeting identification of shocks and volatility spillovers in a speculative return system. Similarly to the proxy-sVAR framework, we work with auxiliary proxy variables constructed from news-related measures to identify the underlying shock system. We achieve full identification with multiple proxies by chaining Givens rotations. In an empirical application, we identify an equity, bond and currency shock. We study the volatility spillovers implied by these labelled structural shocks. Our analysis shows that symmetric spillover regimes are rejected.

25 November 2021: Tom Rainforth (University of Oxford) - Deep Adaptive Design: Model-Based Adaptive Experimental Design in Real Time

Traditional model-based approaches to adaptive experimental design require substantial computational time at each stage of the experiment. This makes them unsuitable for most real-world applications, where decisions must typically be made quickly.  In this talk, I will introduce Deep Adaptive Design (DAD), a general method for amortizing the cost of performing sequential adaptive experiments.  The key idea behind DAD is to not optimize designs directly during the experiment, but instead learn a design policy network upfront and then use this to rapidly run (multiple) adaptive experiments at deployment time. This policy network takes as input the data from previous steps, and outputs the next design using a single forward pass; these design decisions can be made in milliseconds during the live experiment. Remarkably, we find that the DAD approach not only hugely speeds up the adaptive experimentation process, but often actually significantly improves its performance as well.  This is because it naturally allows non-greedy strategies to be learned and avoids errors that result from the inexactness of inference in the traditional framework. 

2 December 2021: Stefano Favaro (Università di Torino) - Learning-augmented count-min sketches via Bayesian nonparametrics

The count-min sketch (CMS) is a time- and memory-efficient randomized data structure that provides estimates of tokens’ frequencies in a data stream, i.e. point queries, based on randomly hashed data. Learning-augmented CMSs aim to improve the CMS by means of learning models that make better use of data properties. We focus on the learning-augmented CMS of Cai, Mitzenmacher and Adams (NeurIPS, 2018), which relies on Bayesian nonparametric (BNP) modeling of a data stream via Dirichlet process (DP) priors; this is referred to as the CMS-DP. We present a novel and rigorous approach to the CMS-DP, and we show that it allows one to consider more general classes of nonparametric priors than the DP prior. We apply our approach to develop a novel learning-augmented CMS under power-law data streams, which relies on BNP modeling of the stream via Pitman-Yor process (PYP) priors; this is referred to as the CMS-PYP. Applications to synthetic data and real data show that the CMS-PYP outperforms both the CMS and CMS-DP in the estimation of low-frequency tokens; this is known to be a critical feature in natural language processing, where it is indeed common to encounter power-law data.
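
For readers unfamiliar with the data structure, the following is a minimal sketch of the plain count-min sketch only; it is not the learning-augmented CMS-DP or CMS-PYP discussed in the talk. The class name, the default width and depth, and the use of salted BLAKE2b hashes as a stand-in for random hash functions are all illustrative assumptions.

```python
import hashlib


class CountMinSketch:
    """Plain count-min sketch: `depth` hashed rows of `width` counters each."""

    def __init__(self, width=64, depth=4):
        # width and depth are illustrative defaults, not values from the talk.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, token):
        # Assumption: one salted BLAKE2b hash per row stands in for an
        # independent random hash function.
        digest = hashlib.blake2b(
            token.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, token, count=1):
        # Each token increments exactly one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, token)] += count

    def query(self, token):
        # Point query: the minimum over rows. Hash collisions can only
        # inflate counters, so this never underestimates the true frequency.
        return min(
            self.table[row][self._index(row, token)]
            for row in range(self.depth)
        )


# Usage: sketch a small stream and run point queries.
stream = ["the"] * 5 + ["of"] * 3 + ["sketch"]
sketch = CountMinSketch()
for token in stream:
    sketch.add(token)

print(sketch.query("the"), sketch.query("of"), sketch.query("sketch"))
```

The point query takes the minimum across rows because collisions only ever add to a counter; the resulting overestimation is most damaging, in relative terms, for low-frequency tokens, which is the regime the abstract highlights.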

9 December 2021: Fabrizia Mealli (Università degli Studi di Firenze) - Selecting Subpopulations for Causal Inference in Regression-Discontinuity Studies: A Bayesian Approach

Extracting causal information from regression-discontinuity (RD) studies, where the treatment assignment rule depends on some type of cutoff formula, may be challenging, especially in the presence of big data. Following Li, Mattei and Mealli (2015), we formally describe RD designs as local randomized experiments within the potential outcome approach. Under this framework, causal inference concerns units belonging to some subpopulation where a local overlap assumption, SUTVA and a local randomization or local unconfoundedness assumption hold. Unfortunately, we do not usually know the subpopulations for which we can draw valid causal inference. We propose to use a model-based finite mixture approach to clustering in a Bayesian framework to classify observations into subpopulations for which we can draw valid causal inference and subpopulations from which we can extract no causal information on the basis of the observed data and the RD assumptions. This approach has important advantages: it explicitly accounts for the uncertainty about subpopulation membership; it does not impose any constraint on the shape of the subpopulation; and it works properly in high-dimensional settings. We illustrate the framework in a high-dimensional RD study concerning the effects of the Bolsa Família program, a social welfare program of the Brazilian government, on leprosy incidence.

This is joint work with Alessandra Mattei and Laura Forastiere.

13 January 2022: Emma Simpson (University College London) - Estimating the limiting shape of bivariate scaled sample clouds for self-consistent inference of extremal dependence properties 

When analysing and modelling bivariate extreme events, tail dependence features are a key consideration. It is common to categorise pairs of variables as either ‘asymptotically dependent’, when the largest values can occur simultaneously in both variables, or ‘asymptotically independent’, when they cannot. A mixture of extremal dependence features could also occur, where both variables can be simultaneously large but also with the possibility of at least one of the variables being large while the other is small. In the extreme value theory literature, various techniques are available to assess or model these aspects of tail dependence, and to quantify levels of extremal dependence. Existing inferential methods would need to be implemented separately for each of the available extremal dependence measures, with no guarantee of obtaining self-consistent information about the tail dependence behaviour from the different approaches.
Recent developments by Nolde and Wadsworth (2021) have established theoretical links between different characterisations of extremal dependence, through studying the limiting shape of an appropriately-scaled sample cloud. In this talk, I will discuss some current work where we aim to exploit these theoretical results for inferential purposes, allowing us to obtain estimates of several bivariate extremal dependence measures, with self-consistent conclusions about the extremal dependence characteristics of the variables.

17 January 2022 (Monday): Elena Stanghellini (Università degli Studi di Perugia) - Parametric mediation with multiple mediators: the case of binary random variables

The talk will focus on causal effects of a treatment on a binary outcome in a system with causally ordered multiple binary mediators. Building on the work of Daniel et al. (2015) and Stanghellini and Doretti (2019), I present the parametric formulation of the total causal effect and its possible modifications to derive causal effects of interest. References to classic path analysis will also be made. Particular attention will be paid to the effect decomposition with one or two mediators. If time permits, extensions to outcome-dependent sampling schemes will also be addressed. Real data examples will be introduced throughout the talk.
This talk is based on joint work with: Paolo Berta, Marco Doretti, Minna Genbäck, Martina Raggi.
Essential references:
Daniel R., De Stavola B., Cousens S.N. and Vansteelandt S. (2015). Causal Mediation Analysis with Multiple Mediators. Biometrics.
Stanghellini E. and Doretti M. (2019). On marginal and conditional parameters in logistic regression models. Biometrika.

20 January 2022: Perla Sousi (University of Cambridge) - The uniform spanning tree in 4 dimensions

A uniform spanning tree of Z^4 can be thought of as the "uniform measure" on trees of Z^4. The past of 0 in the uniform spanning tree is the finite component that is disconnected from infinity when 0 is deleted from the tree. We establish the logarithmic corrections to the probabilities that the past contains a path of length n, that it has volume at least n and that it reaches the boundary of the box of side length n around 0. Dimension 4 is the upper critical dimension for this model in the sense that in higher dimensions it exhibits "mean-field" critical behaviour. An important part of our proof is the study of the Newtonian capacity of a loop-erased random walk in 4 dimensions. This is joint work with Tom Hutchcroft.

27 January 2022: Katie Harron (University College London) - Statistical methods for creating and evaluating electronic birth cohorts

Linkage of administrative data from different sources offers an efficient way to better understand the distribution of health and disease in populations, by generating population-level cohorts whilst avoiding the time and cost associated with primary data collection. However, linkage is not always straightforward, particularly when we don't have access to a unique identifier that can be used to link the same individual across different data sources. In this talk, I will discuss statistical methods for creating and evaluating electronic birth cohorts, and some of the challenges that we should try to address when using these data for research.

3 February 2022: Chris Holmes (University of Oxford) - Bayesian Predictive Inference

The prior distribution on parameters of a sampling distribution (the likelihood) is the usual starting point for Bayesian inference. In this talk we present a different perspective that focuses on the joint predictive distribution on observations as the primary tool for statistical inference. The approach has its roots in the work of de Finetti and Geisser, amongst others. Using predictive distributions we introduce the martingale posterior distribution, which provides Bayesian uncertainty directly on any statistic of interest without the need for the likelihood and prior, and this distribution can be sampled through a computational scheme we name predictive resampling. Using these techniques, we discuss new inference methodologies for multivariate density estimation, regression, and classification.

10 February 2022: Hakim Dehbi (University College London) - Controlled backfill in oncology dose-finding trials

The use of backfill in early phase dose-finding trials is a relatively recent practice. It consists of assigning patients to dose levels below the level the study has currently reached. The main reason for backfilling is to collect additional pharmacokinetic, pharmacodynamic and response data, in order to assess whether a plateau may exist on the dose-efficacy curve. This is a possibility in oncology with molecularly targeted agents or immunotherapy. Recommending for further study a dose level lower than the maximum tolerated dose could be supported in such situations. How to best allocate backfill patients to dose levels is not yet established. In this paper we propose to randomise backfill patients to dose levels below the one the study has currently reached. A refinement of this would be to stop backfilling to lower dose levels when these show insufficient efficacy compared to higher levels, starting at dose level 1 and repeating this process sequentially. At study completion, data from all patients (both backfill patients and dose-finding patients) is used to estimate the dose-response curve. The fit from a change point model is compared to the fit of a monotonic model to identify a potential plateau. Using simulations, we show that this approach can identify the plateau on the dose-response curve when such a plateau exists, allowing the recommendation of a dose level lower than the maximum tolerated dose for future studies. This contribution provides a methodological framework for backfilling, from the perspective of both design and analysis in early phase oncology trials.

17 February 2022: Kathryn Turnbull (Lancaster University) - Latent Space Modelling of Hypergraph Data

The increasing prevalence of relational data describing interactions among a target population has motivated a wide literature on statistical network analysis. In many applications, interactions may involve more than two members of the population and this data is more appropriately represented by a hypergraph. In this talk, we present a model for hypergraph data which extends the well-established latent space approach for graphs and, by drawing a connection to constructs from computational topology, we develop a model whose likelihood is inexpensive to compute. A delayed-acceptance MCMC scheme is proposed to obtain posterior samples and we rely on Bookstein coordinates to remove the identifiability issues associated with the latent representation. We investigate the performance of our approach via simulation and explore the application of our model to real-world data.  

24 February 2022: Darren Wilkinson (University of Durham) - Statistical modelling for geotechnical engineering

A significant part of the UK rail infrastructure is nearing 200 years in age whilst being built on high-plasticity soils that are prone to weathering and deterioration. Deterioration processes have been studied through computer simulation experiments of individual cuttings or embankments, but these are computationally-expensive and time-consuming, and therefore impractical to use directly for understanding the state of a rail network containing thousands of uniquely parameterised slopes. Instead we use surrogate statistical models, which can be used to approximate computationally-burdensome simulators, based on a relatively small number of simulator runs at well-selected input parameters. Parameters include slope geometry (height and angle) and various soil characteristics.

The simulation models produce large amounts of output, but emulation strategies focus on (derived) outputs of direct practical interest. One such output is the simulated time to failure of a given slope. An interesting issue with this output is that the computer experiments are terminated after around 200 years of simulated time, and so the data set used for emulation contains right-censored observations. An MCMC-based Bayesian modelling framework can accommodate such censored data, and can also be adapted to other derived outputs, such as deterioration curves. These emulators can be embedded in a Bayesian hierarchical model of a (section of) rail network for characterisation of network state and evaluation of cost-effectiveness of potential intervention strategies.

This work is supported by ACHILLES (https://www.achilles-grant.org.uk/), an EPSRC programme grant involving Newcastle, Durham, Loughborough, Southampton, Leeds and Bath Universities, and the British Geological Survey. The programme is concerned with improving understanding of the gradual deterioration of long linear assets, and the associated impact of climate change.

3 March 2022: Olatunji Johnson (University of Manchester) - A geostatistical method for analysing data from multiple Loa loa diagnostic tools

The elimination of onchocerciasis through community-based Mass Drug Administration (MDA) of ivermectin (Mectizan) is hampered by co-endemicity of Loa loa, as individuals who are highly co-infected with Loa loa parasites can suffer serious and occasionally fatal neurological reactions from the drug. Testing all individuals participating in MDA has some operational constraints ranging from the cost to limited availability of diagnostic tools. Therefore, there is a need for a way to establish whether an area is safe for MDA using the prevalence of loiasis derived from multiple diagnostic tools. Existing statistical methods only focus on using data from one diagnostic tool and ignore the potential information that could be derived from other datasets. In this talk, I will describe how we address this issue by developing a joint geostatistical model that combines data from multiple Loa loa diagnostic tools. I will present how we developed the model and our method for inference. We applied this framework to Loa loa data from Gabon. We also propose a two-stage strategy to identify areas that are safe for MDA. Lastly, I will discuss how this work contributes to the global effort towards the elimination of onchocerciasis as a public health problem by potentially reducing the time and cost required to establish whether an area is safe for MDA.

10 March 2022: Mike Daniels (University of Florida) - A Bayesian nonparametric approach to causal mediation with multiple mediators

We introduce an approach for causal mediation with multiple mediators. We model the observed data distribution using a new Bayesian nonparametric approach that allows for flexible default specifications for the distribution of the outcome and the mediators conditional on mediator/outcome confounders. We briefly explore the properties of this specification and introduce assumptions that allow for the identification of direct and both joint and individual indirect effects. We use this approach to examine the effect of antibiotics as mediators of the relationship between bacterial community dominance and ventilator-associated pneumonia, and we conduct simulation studies to better understand the frequentist properties of our approach.

Joint work with Samrat Roy (UPenn), Jason Roy (Rutgers), and Brendan Kelly (UPenn)

17 March 2022: Peter Orbanz (University College London) - Limit theorems for distributions invariant under a group of transformations

Consider a large random structure -- a stochastic process on the line, a random graph, a random field on the grid -- and a function that depends only on a small part of the structure. Now use a family of transformations to ‘move’ the domain of the function over the structure, collect each function value, and average. I will present results that show that, under suitable conditions, such transformation averages satisfy a law of large numbers, a central limit theorem, and a Berry-Esseen type bound on the speed of convergence. Several known results for stationary random fields, graphon models of networks, and so forth emerge as special cases. One relevant condition is that the distribution of the random structure remains invariant under the transformations used, which can be read as a probabilistic symmetry property. Loosely speaking: The large-sample theory of i.i.d. averages still holds if the i.i.d. assumption is relaxed to a symmetry assumption.

21 March 2022: Georgia Salanti (Universität Bern) - Network meta-regression to make predictions for heterogeneous treatment effects

Predicting individualized treatment effects is of great importance, so that a treatment might be targeted to individuals who will benefit from it, and be avoided by those who won’t. We have developed a two-stage individualized prediction model for heterogeneous treatment effects, by combining prognostic research and network meta-analysis methods. We extend the idea of risk modelling, which has been used to estimate heterogeneous treatment effects in a single randomised trial, to the context of network meta-analysis. We are also developing methods to evaluate the clinical relevance of the model, by extending the decision curve analysis idea. We apply the methodology to a network of trials that compare placebo and three drugs for patients with relapsing-remitting multiple sclerosis.

31 March 2022: Judith Rousseau (University of Oxford) - The use of cut posterior for semi-parametric inference with application to nonparametric hidden Markov models

We consider the problem of estimation in hidden Markov models with finite state space and nonparametric emission distributions. Efficient estimators for the transition matrix are exhibited, and a semiparametric Bernstein-von Mises result is deduced, extending previous work on mixture models to the HMM setting. Following from this, we propose a modular approach using the cut posterior to jointly estimate the transition matrix and the emission densities. We derive a general theorem on contraction rates for this approach, and we then show how this result may be applied to obtain a contraction rate result for the emission densities in our setting; a key intermediate step is an inversion inequality relating L1 distance between the marginal densities to L1 distance between the emissions. Finally, a contraction result for the smoothing probabilities is shown, which avoids the common approach of sample splitting. Simulations are provided which demonstrate both the theory and the ease of its implementation.

This is joint work with Dan Moss.

7 April 2022: Rebecca Hubbard (University of Pennsylvania) - Considerations for valid analysis of medical product utilization and outcomes from real-world data

Opportunities to use “real-world data,” data generated as a by-product of digital transactions, have exploded over the past decade. In the context of health research, real-world data including electronic health records and medical claims facilitate understanding of treatment utilization and outcomes as they occur in routine clinical practice, and studies using these data sources can potentially proceed rapidly compared to trials and observational studies that rely on primary data collection. However, using data sources that were not collected for research purposes comes at a cost, and naïve use of such data without considering their complexity and imperfect quality can lead to bias and inferential error. Real-world data frequently violate the assumptions of standard statistical methods, but it is not practicable to develop new methods to address every possible complication arising in their analysis. The statistician is faced with a quandary: how to effectively utilize real-world data to advance research without compromising best practices for principled data analysis. In this talk I will use examples from my research on methods for the analysis of data derived from electronic health records (EHR) to illustrate approaches to understanding the data generating mechanism for real-world data. Drawing on this understanding, I will then discuss approaches to identify, use, and develop principled methods for incorporating EHR into research. The overarching goal of this presentation is to raise awareness of challenges associated with the analysis of real-world data and demonstrate how a principled approach can be grounded in an understanding of the scientific context and data generating process.

28 April 2022: Rhian Daniel (Prifysgol Caerdydd / Cardiff University) - Regression by composition

We introduce regression by composition (RBC), a class of flexible parametric regression models that includes many standard regression models as special cases. The RBC form allows, for example, more general maps between conditional outcome distributions than can be expressed using any generalised linear model (GLM) link function, as well as allowing predictors (or groups of predictors) each to have their own link function.

RBCs have the particular advantage of admitting models for binary outcomes with parameters that can be interpreted as switch relative risks. The switch relative risk has many attractive properties such as closure and collapsibility and its use is motivated by a large and growing body of psychological and philosophical work on what causal objects human reasoners expect to be stable between different subgroups. We also discuss an extension of the switch relative risk to survival analysis.

This talk is based on joint work with Daniel Farewell, Mats Stensrud and Anders Huitfeldt.

5 May 2022: Ioannis Kosmidis (University of Warwick) - Reduced-bias M-estimation

Many popular methods for the reduction of estimation bias rely on an approximation of the bias function under the assumption that the model is correct and fully specified. Other bias reduction methods, like the bootstrap, the jackknife and indirect inference, require fewer assumptions to operate but are typically computer-intensive, requiring repeated optimization.
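As a small illustration of the resampling-based alternatives mentioned above (and not of the framework presented in the talk), the following sketch applies the jackknife to the divide-by-n variance estimator. The function names and the simulated data are assumptions made for this example. The estimator is recomputed once per left-out observation, which is the repeated computation that makes such methods expensive for models fitted by numerical optimization.

```python
import numpy as np

def plug_in_variance(x):
    """Divide-by-n variance estimate, which is biased downwards."""
    return np.mean((x - np.mean(x)) ** 2)

def jackknife_bias_corrected(estimator, x):
    """Jackknife bias-corrected version of `estimator` evaluated on sample x."""
    n = len(x)
    full = estimator(x)
    # Recompute the estimator n times, leaving out one observation each time.
    leave_one_out = np.array([estimator(np.delete(x, i)) for i in range(n)])
    # Jackknife estimate of the bias of the full-sample estimator.
    bias_estimate = (n - 1) * (leave_one_out.mean() - full)
    return full - bias_estimate

# Hypothetical data, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=30)

print("plug-in estimate:       ", plug_in_variance(x))
print("jackknife-corrected:    ", jackknife_bias_corrected(plug_in_variance, x))
print("divide-by-(n-1) variance:", np.var(x, ddof=1))
```

For this particular estimator the jackknife correction coincides with the usual divide-by-(n-1) sample variance, which gives a convenient check of the code.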

We present a novel framework for reducing estimation bias that:

i) can deliver estimators with smaller bias than reference estimators even for partially-specified models, as long as estimation is through asymptotically unbiased estimating functions;

ii) always results in closed-form bias-reducing penalties to the objective function if estimation is through the maximization of one, like maximum likelihood and maximum composite likelihood;

iii) relies only on the estimating functions and/or the objective and their derivatives, greatly facilitating implementation for general modelling frameworks through numerical or automatic differentiation techniques and standard numerical optimization routines.

The bias-reducing penalized objectives closely relate to information criteria for model selection based on the Kullback-Leibler divergence, establishing, for the first time, a strong link between reduction of estimation bias and model selection. We also discuss the asymptotic efficiency properties of the new estimator, inference and model selection, and present illustrations in well-used, important modelling settings of varying complexity.

This is joint work with Nicola Lunardon, University of Milano-Bicocca, Milan, Italy

Related preprint:


19 May 2022: Emmanuelle Dankwa (University of Oxford) - Modelling the impacts of alternative control measures on an African swine fever epidemic

Somewhere in the middle of the North Atlantic Ocean, an epidemic has struck; for the first time in its history, the mythical Merry Island is experiencing an outbreak of African swine fever virus (ASFV), the pathogen responsible for African swine fever (ASF), a highly contagious and lethal disease capable of infecting all swine. Currently, an outbreak of ASFV among wild boar (found mainly in forested areas of the island) and domestic pigs (reared commercially or in backyards) is affecting several parts of Europe with serious economic impacts and potential threats to global food security. Motivated by this situation, the French National Research Institute for Agriculture, Food, and the Environment organized the ASF modelling challenge which aimed to expand the development of ASF transmission models to inform, in a timely manner, policy decisions on ASF control. Participating teams developed models to investigate ASFV transmission dynamics among wild boar and domestic pigs in Merry Island. Synthetic incidence data from a ‘secret’ data-generating model were provided at regular intervals and four-week-ahead incidence projections were required under a range of control scenarios. During the challenge, teams were blind to the data-generating model. 

In this talk, I will present a stochastic, spatial and compartment-based model of ASFV transmission produced by our team* during the challenge. The talk will feature:  

a brief overview of features of the Island relevant to transmission dynamics,
a description of the model structure and how the model was calibrated to the synthetic incidence data,
a presentation of our model projections and a comparison with projections from the data-generating model, and finally, 
some suggestions for model refinement.  

*This is joint work with Sébastien Lambert (Univ. of Toulouse), Sarah Hayes (Univ. of Oxford), Robin N. Thompson (Univ. of Warwick) and Christl A. Donnelly (Univ. of Oxford). 

26 May 2022: Purvasha Chakravarti (Imperial College London) - Interpretable Model-Independent Detection of New Physics Signals

A central goal in experimental high energy physics is to detect new physics signals that are not explained by known physics. In this talk I will present our algorithm that we use to search for new signals that appear as deviations from known Standard Model physics in high-dimensional particle physics data. To do this, we determine whether there is any statistically significant difference between the distribution of Standard Model background samples and the distribution of the experimental observations, which are a mixture of the background and a potential new signal. We do this without making any model assumptions on the signal that we are searching for.  We use a classifier and construct three test statistics using the classifier: an estimated likelihood ratio test (LRT) statistic, a test based on the area under the ROC curve (AUC), and a test based on the misclassification error (MCE), to detect the presence of the signal in the experimental data. Additionally, I will present our methods for estimating the signal strength parameter and interpreting the high-dimensional classifier in order to understand the properties of the detected signal. I will also present the results on a data set related to the search for the Higgs boson at the Large Hadron Collider at CERN.
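The AUC-based test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the speaker's algorithm: the classifier is a simple linear score along the difference of training means (a stand-in for whatever classifier is actually used), the null distribution is obtained by permuting labels on held-out scores, and the simulated "background" and "experimental" samples are hypothetical.

```python
import numpy as np

def auc(scores_bg, scores_exp):
    """Area under the ROC curve via the Mann-Whitney rank formula (assumes no ties)."""
    scores = np.concatenate([scores_bg, scores_exp])
    ranks = np.argsort(np.argsort(scores)) + 1  # ranks 1..N
    n_bg, n_exp = len(scores_bg), len(scores_exp)
    rank_sum_exp = ranks[n_bg:].sum()
    return (rank_sum_exp - n_exp * (n_exp + 1) / 2) / (n_bg * n_exp)

def auc_permutation_test(bg, exp, n_perm=500, seed=0):
    """Test H0: background and experimental samples share one distribution."""
    rng = np.random.default_rng(seed)
    # Split each sample into a training half and a held-out half.
    bg_train, bg_test = bg[: len(bg) // 2], bg[len(bg) // 2 :]
    exp_train, exp_test = exp[: len(exp) // 2], exp[len(exp) // 2 :]
    # Stand-in classifier: linear score along the difference of training means.
    w = exp_train.mean(axis=0) - bg_train.mean(axis=0)
    s_bg, s_exp = bg_test @ w, exp_test @ w
    observed = auc(s_bg, s_exp)
    # Under H0 the held-out scores are exchangeable, so permute their labels.
    pooled = np.concatenate([s_bg, s_exp])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if auc(perm[: len(s_bg)], perm[len(s_bg) :]) >= observed:
            count += 1
    p_value = (1 + count) / (1 + n_perm)
    return observed, p_value

# Hypothetical data: experimental sample = 80% background + 20% shifted signal.
rng = np.random.default_rng(1)
background = rng.normal(size=(2000, 5))
is_signal = rng.random(2000) < 0.2
experimental = rng.normal(size=(2000, 5)) + 2.0 * is_signal[:, None]

observed_auc, p_value = auc_permutation_test(background, experimental)
print("held-out AUC:", observed_auc, "permutation p-value:", p_value)
```

Evaluating the AUC on held-out data keeps the permutation null valid even though the classifier was fitted to the same two populations. No model for the signal is assumed at any point.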

31 May 2022 (Tuesday): Marta Soares (University of York) - Sharing of information in evidence synthesis to support health care decision making

The evidence supporting decision-making is often restricted to studies that fully conform to the decision problem PICOS: i.e. on the specific population (P), intervention (I), comparators (C), and outcomes (O) of interest, and using robust study designs (S, e.g., RCTs). There are, however, many cases where extending the evidence base can strengthen decision making. These include cases where the existing ‘direct’ evidence is limited (e.g. disconnected networks, single-arm studies), complex (e.g. surrogate/ multiple outcomes, complex interventions), or sparse (e.g., rare conditions or indications in children). Or it may be motivated by the fact that the extended evidence plausibly retains relevance (e.g. the effectiveness of chemotherapy is judged similar across different types of solid tumors). Under such circumstances, non-standard evidence synthesis methods can facilitate the sharing of information. In this seminar I will i) systematise information-sharing methods in evidence synthesis, ii) present a methodological exploration of how methods compare using a case study, and iii) outline an upcoming project developing a framework for the sharing of information to support decisions over multi-indication oncology drugs.

9 June 2022: F. Marta L. Di Lascio (Libera Università di Bolzano) - Conditional dependence modelling and copula-based clustering with application to district heating data

Efficient energy production and distribution systems are urgently needed to contribute to the development of sustainable cities and the reduction of global climate change. Since modern district heating systems are sustainable energy distribution services that exploit renewable sources and avoid energy waste, an in-depth knowledge of thermal energy demand, which is mainly affected by weather conditions and the energy characteristics of buildings, is essential to enhance the management of heat production. Accordingly, we propose a twofold copula-based contribution that we apply to the high-dimensional district heating data of the Italian city Bozen-Bolzano. On the one hand, using a mixture of copulas, we investigate the complex relationship between thermal energy demand and meteorological variables. We thus derive the copula-based conditional probability function of energy demand given weather conditions to assess the risk of demand exceeding plant capacity or the availability of renewable and local energy sources. On the other hand, we i) explore the usefulness of the Ali-Mikhail-Haq copula in defining a new dissimilarity measure to cluster variables in a hierarchical framework and ii) develop a clustering methodology to find groups of buildings that are homogeneous with respect to their main characteristics, such as energy efficiency and heating surface. Our findings can be useful to support the design, expansion and management of district heating systems.

16 June 2022: Silvia Noirjean (Università degli Studi di Firenze) - Exploiting network information to disentangle spillover effects in a field experiment on teens' museum attendance

A key element in the education of youths is their sensitization to historical and artistic heritage. We analyze a field experiment conducted in Florence (Italy) to assess how appropriate incentives assigned to high-school classes may induce teens to visit museums in their free time. Non-compliance and spillover effects make the impact evaluation of this clustered encouragement design challenging. We propose to blend principal stratification and causal mediation, by defining sub-populations of units according to their compliance behavior and using the information on their friendship networks as mediator. We formally define principal natural direct and indirect effects and principal controlled direct and spillover effects, and use them to disentangle spillovers from other causal channels. We adopt a Bayesian approach for inference.

Related preprint: https://arxiv.org/abs/2011.11023

23 June 2022: Rodrigo Targino (Fundação Getulio Vargas) - Transform MCMC schemes for sampling intractable factor copula models

In financial risk management, modelling dependency within a random vector X is crucial, and a standard approach is the use of a copula model. Say the copula model can be sampled through realizations of Y having copula function C: had the marginals of Y been known, sampling X^(i), the i-th component of X, would directly follow by composing Y^(i) with its cumulative distribution function (c.d.f.) and the inverse c.d.f. of X^(i). In this work, the marginals of Y are not explicit, as in a factor copula model. We design an algorithm which samples X through an empirical approximation of the c.d.f. of the Y marginals. To be able to handle complex distributions for Y or rare-event computations, we allow Markov Chain Monte Carlo (MCMC) samplers. We establish convergence results whose rates depend on the tails of X, Y and the Lyapunov function of the MCMC sampler. We present numerical experiments confirming the convergence rates and also revisit a real data analysis from financial risk management.
Joint work with Cyril Bénézet and Emmanuel Gobet
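
The transform idea in the abstract above can be sketched roughly as follows. This is an illustration under several assumptions that are not taken from the talk: Y follows a one-factor model with a Student-t common factor and Gaussian idiosyncratic noise (so its marginal c.d.f. has no simple closed form), the target marginals of X are Exp(1), and Y is drawn by plain Monte Carlo rather than by an MCMC sampler. All function names and parameter values are hypothetical.

```python
import numpy as np

def sample_factor_y(n, dim, rng, rho=0.6, df=4):
    """Draw n realizations of Y from a one-factor model (assumed form)."""
    common = rng.standard_t(df, size=(n, 1))   # shared heavy-tailed factor
    idio = rng.normal(size=(n, dim))           # component-specific noise
    return np.sqrt(rho) * common + np.sqrt(1 - rho) * idio

def transform_sample_x(n, dim, n_ref, rng):
    """Sample X with Exp(1) marginals and the copula of Y."""
    # Reference sample used to approximate the unknown marginal c.d.f.s of Y.
    y_ref = np.sort(sample_factor_y(n_ref, dim, rng), axis=0)
    y = sample_factor_y(n, dim, rng)
    x = np.empty_like(y)
    for i in range(dim):
        # Empirical c.d.f. of Y^(i), scaled to stay strictly below 1.
        u = np.searchsorted(y_ref[:, i], y[:, i], side="right") / (n_ref + 1)
        # Inverse c.d.f. of Exp(1): F^{-1}(u) = -log(1 - u).
        x[:, i] = -np.log1p(-u)
    return x

rng = np.random.default_rng(0)
x = transform_sample_x(n=20000, dim=3, n_ref=20000, rng=rng)
print("marginal means:", x.mean(axis=0))
print("correlation of X^(1) and X^(2):", np.corrcoef(x[:, 0], x[:, 1])[0, 1])
```

The reference sample plays the role of the empirical approximation of the Y marginals. In this sketch the error it introduces shrinks as n_ref grows.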


Affiliated Seminars