A seminar series covering a broad range of applied and methodological topics in Statistical Science.

#### Talks take place in hybrid format, or remotely: please check each week for details.

**Usual time:** Thursdays 14:00-15:00

**Location:** A lecture theater at UCL (Gower St. London WC1E 6BT, please check each week for exact venue), and/or Zoom (please use the contact info below to join the mailing list, and receive the links to the talks).

**Contact info:** thomas dot bartlett dot 10 at ucl dot ac dot uk

**Recent talks**

Please subscribe to our YouTube channel to view some recent talks from the series.

**Upcoming talks**

Calendar .ics link

- 30 September 2021: Martin Huber (Université de Fribourg) - Double machine learning for sample selection models
We consider the evaluation of discretely distributed treatments when outcomes are only observed for a subpopulation due to sample selection or outcome attrition. For identification, we combine a selection-on-observables assumption for treatment assignment with either selection-on-observables or instrumental variable assumptions concerning the outcome attrition/sample selection process. We also consider dynamic confounding, meaning that covariates that jointly affect sample selection and the outcome may (at least partly) be influenced by the treatment. To control in a data-driven way for a potentially high dimensional set of pre- and/or post-treatment covariates, we adapt the double machine learning framework for treatment evaluation to sample selection problems. We make use of (a) Neyman-orthogonal, doubly robust, and efficient score functions, which imply the robustness of treatment effect estimation to moderate regularization biases in the machine learning-based estimation of the outcome, treatment, or sample selection models and (b) sample splitting (or cross-fitting) to prevent overfitting bias. We demonstrate that the proposed estimators are asymptotically normal and root-n consistent under specific regularity conditions concerning the machine learners and investigate their finite sample properties in a simulation study. We also apply our proposed methodology to the Job Corps data for evaluating the effect of training on hourly wages which are only observed conditional on employment. The estimator is available in the causalweight package for the statistical software R.

- 7 October 2021: Francesco Ravazzolo (Libera Università di Bolzano) - Dynamic Combination and Calibration for Climate Predictions
We propose a density calibration and combination model that dynamically calibrates and combines predictive distributions. The time-varying calibration and combination weights are fitted by an observation driven model with dynamics inferred by the score of the assumed conditional likelihood of the data generating process. The model is very flexible and can handle different shapes, instability and model uncertainty. We show this analytically and in simulation exercises. An empirical application to short-term wind speed predictions documents the large instability of individual model performance and their calibration properties, favouring our model in terms of predictive accuracy.

- 14 October 2021: Almut Veraart (Imperial College London) - High-frequency estimation of the Lévy-driven Graph Ornstein-Uhlenbeck process with applications to wind capacity factor measurements
We consider the Graph Ornstein-Uhlenbeck (GrOU) process observed on a non-uniform discrete time grid and introduce discretised maximum likelihood estimators with parameters specific to the whole graph or specific to each component, or node. Under a high frequency sampling scheme, we study the asymptotic behaviour of those estimators as the mesh size of the observation grid goes to zero. We prove two stable central limit theorems to the same distribution as in the continuously observed case under both finite and infinite jump activity for the Lévy driving noise. When a graph structure is not explicitly available, the stable convergence allows one to consider purpose-specific sparse inference procedures, i.e. pruning, on the edges themselves in parallel to the GrOU inference and preserve its asymptotic properties. We apply the new estimators to wind capacity factor measurements, i.e. the ratio between the wind power produced locally compared to its rated peak power, across fifty locations in Northern Spain and Portugal. We show the superiority of those estimators compared to the standard least squares estimator through a simulation study extending known univariate results across graph configurations, noise types and amplitudes.

This is joint work with Valentin Courgeau (Imperial College London).

- 21 October 2021: Chris Oates (University of Newcastle) - Robust Generalised Bayesian Inference for Intractable Likelihoods
Generalised Bayesian inference updates prior beliefs using a loss function, rather than a likelihood, and can therefore be used to confer robustness against possible misspecification of the likelihood. Here we consider generalised Bayesian inference with a Stein discrepancy as a loss function, motivated by applications in which the likelihood contains an intractable normalisation constant. In this context, the Stein discrepancy circumvents evaluation of the normalisation constant and produces generalised posteriors that are either closed form or accessible using standard Markov chain Monte Carlo. On a theoretical level, we show consistency, asymptotic normality, and bias-robustness of the generalised posterior, highlighting how these properties are impacted by the choice of Stein discrepancy. Then, we provide numerical experiments on a range of intractable distributions, including applications to kernel-based exponential family models and non-Gaussian graphical models.

- 29 October 2021: William Da Silva (Sorbonne Université Paris) - Multitype growth-fragmentations processes embedded in Brownian excursions
In a joint work with Élie Aïdékon, we consider a Brownian excursion from 0 to 1 in the upper half-plane. It possibly makes excursions above the horizontal line of height h>0. We record the size of each of these excursions, defined as the difference between its endpoint and starting point. As h evolves, this particle system exhibits a branching structure that we investigate. We recover one of the growth-fragmentations revealed by Bertoin, Budd, Curien and Kortchemski, which appears in the scaling limit of Markovian explorations of Boltzmann planar maps. I will also present an extension of this model to higher dimensions based on a joint work with Juan Carlos Pardo, when one considers a Brownian excursion in the half-space.

- 4 November 2021: Marta Catalano (University of Warwick) - A Wasserstein index of dependence for Bayesian nonparametric modeling
Optimal transport (OT) methods and Wasserstein distances are flourishing in many scientific fields as an effective means for comparing and connecting different random structures. In this talk we describe the first use of an OT distance between Lévy measures with infinite mass to solve a statistical problem. Complex phenomena often yield data from different but related sources, which are ideally suited to Bayesian modeling because of its inherent borrowing of information. In a nonparametric setting, this is regulated by the dependence between random measures: we derive a general Wasserstein index for a principled quantification of the dependence gaining insight into the models’ deep structure. It also allows for an informed prior elicitation and provides a fair ground for model comparison. Our analysis unravels many key properties of the OT distance between Lévy measures, whose interest goes beyond Bayesian statistics, spanning to the theory of partial differential equations and of Lévy processes.

- 11 November 2021: Pierre Jacob (ESSEC Business School, Paris) - Some methods based on couplings of Markov chain Monte Carlo algorithms
Markov chain Monte Carlo algorithms are commonly used to approximate a variety of probability distributions, such as posterior distributions arising in Bayesian analysis. I will review the idea of coupling in the context of Markov chains, and how this idea not only leads to theoretical analyses of Markov chains, but also to new Monte Carlo methods. In particular, the talk will describe how coupled Markov chains can be used to obtain 1) unbiased estimators of expectations, with applications to the "cut distribution" and to normalizing constant estimation, 2) non-asymptotic convergence diagnostics for Markov chains, and 3) unbiased estimators of the asymptotic variance of an MCMC ergodic average.

- 18 November 2021: Matthias Fengler (Universität St.Gallen) - Identifying structural shocks to volatility through a proxy-MGARCH model
We extend the classical MGARCH specification for volatility modeling by developing a structural MGARCH model targeting identification of shocks and volatility spillovers in a speculative return system. Similarly to the proxy-sVAR framework, we work with auxiliary proxy variables constructed from news-related measures to identify the underlying shock system. We achieve full identification with multiple proxies by chaining Givens rotations. In an empirical application, we identify an equity, bond and currency shock. We study the volatility spillovers implied by these labelled structural shocks. Our analysis shows that symmetric spillover regimes are rejected.

- 25 November 2021: Tom Rainforth (University of Oxford) - Deep Adaptive Design: Model-Based Adaptive Experimental Design in Real Time
Traditional model-based approaches to adaptive experimental design require substantial computational time at each stage of the experiment. This makes them unsuitable for most real-world applications, where decisions must typically be made quickly. In this talk, I will introduce Deep Adaptive Design (DAD), a general method for amortizing the cost of performing sequential adaptive experiments. The key idea behind DAD is to not optimize designs directly during the experiment, but instead learn a design policy network upfront and then use this to rapidly run (multiple) adaptive experiments at deployment time. This policy network takes as input the data from previous steps, and outputs the next design using a single forward pass; these design decisions can be made in milliseconds during the live experiment. Remarkably, we find that the DAD approach not only hugely speeds up the adaptive experimentation process, but often actually significantly improves its performance as well. This is because it naturally allows non-greedy strategies to be learned and avoids errors that result from the inexactness of inference in the traditional framework.

- 2 December 2021: Stefano Favaro (Università di Torino) - Learning-augmented count-min sketches via Bayesian nonparametrics
The count-min sketch (CMS) is a time and memory efficient randomized data structure that provides estimates of tokens’ frequencies in a data stream, i.e. point queries, based on randomly hashed data. Learning-augmented CMSs aim at improving the CMS by means of learning models that make better use of data properties. We focus on the learning-augmented CMS of Cai, Mitzenmacher and Adams (NeurIPS, 2018), which relies on Bayesian nonparametric (BNP) modeling of a data stream via Dirichlet process (DP) priors; this is referred to as the CMS-DP. We present a novel and rigorous approach to the CMS-DP, and we show that it allows one to consider more general classes of nonparametric priors than the DP prior. We apply our approach to develop a novel learning-augmented CMS under power-law data streams, which relies on BNP modeling of the stream via Pitman-Yor process (PYP) priors; this is referred to as the CMS-PYP. Applications to synthetic data and real data show that the CMS-PYP outperforms both the CMS and CMS-DP in the estimation of low-frequency tokens; this is known to be a critical feature in natural language processing, where it is indeed common to encounter power-law data.

- 9 December 2021: Fabrizia Mealli (Università degli Studi di Firenze) - Selecting Subpopulations for Causal Inference in Regression-Discontinuity Studies: A Bayesian Approach
Extracting causal information from regression-discontinuity (RD) studies, where the treatment assignment rule depends on some type of cutoff formula, may be challenging, especially in the presence of big data. Following Li, Mattei and Mealli (2015), we formally describe RD designs as local randomized experiments within the potential outcome approach. Under this framework causal inference concerns units belonging to some subpopulation where a local overlap assumption, SUTVA and a local randomization or local unconfoundedness assumption hold. Unfortunately we do not usually know the subpopulations for which we can draw valid causal inference. We propose to use a model-based finite mixture approach to clustering in a Bayesian framework to classify observations into subpopulations for which we can draw valid causal inference and subpopulations from which we can extract no causal information on the basis of the observed data and the RD assumptions. This approach has important advantages: it explicitly accounts for the uncertainty about subpopulation membership; it does not impose any constraint on the shape of the subpopulation; and it works properly in high-dimensional settings. We illustrate the framework in a high-dimensional RD study concerning the effects of the Bolsa Família program, a social welfare program of the Brazilian government, on leprosy incidence.

This is joint work with Alessandra Mattei and Laura Forastiere.

- 13 January 2022: Emma Simpson (University College London) - Estimating the limiting shape of bivariate scaled sample clouds for self-consistent inference of extremal dependence properties
When analysing and modelling bivariate extreme events, tail dependence features are a key consideration. It is common to categorise pairs of variables as either ‘asymptotically dependent’, when the largest values can occur simultaneously in both variables, or ‘asymptotically independent’, when they cannot. A mixture of extremal dependence features could also occur, where both variables can be simultaneously large but also with the possibility of at least one of the variables being large while the other is small. In the extreme value theory literature, various techniques are available to assess or model these aspects of tail dependence, and to quantify levels of extremal dependence. Existing inferential methods would need to be implemented separately for each of the available extremal dependence measures, with no guarantee of obtaining self-consistent information about the tail dependence behaviour from the different approaches.

Recent developments by Nolde and Wadsworth (2021) have established theoretical links between different characterisations of extremal dependence, through studying the limiting shape of an appropriately-scaled sample cloud. In this talk, I will discuss some current work where we aim to exploit these theoretical results for inferential purposes, allowing us to obtain estimates of several bivariate extremal dependence measures, with self-consistent conclusions about the extremal dependence characteristics of the variables.

- 17 January 2022 (Monday): Elena Stanghellini (Università degli Studi di Perugia) - Parametric mediation with multiple mediators: the case of binary random variables
The talk will focus on causal effects of a treatment on a binary outcome in a system with causally ordered multiple binary mediators. Building on the work of Daniel et al. (2015) and Stanghellini and Doretti (2019), I present the parametric formulation of the total causal effect and its possible modifications to derive causal effects of interest. References to classic path analysis will also be made. Particular attention will be paid to the effect decomposition with one or two mediators. If time permits, extensions to outcome dependent sampling schemes will also be addressed. Real data examples will be introduced throughout the talk.

This talk is based on joint work with: Paolo Berta, Marco Doretti, Minna Genbäck, Martina Raggi.

Essential references:

Daniel R., De Stavola B., Cousens S.N. and Vansteelandt S. (2015). Causal Mediation Analysis with Multiple Mediators. Biometrics.

Stanghellini E. and Doretti M. (2019). On marginal and conditional parameters in logistic regression models. Biometrika.

- 20 January 2022: Perla Sousi (University of Cambridge) - The uniform spanning tree in 4 dimensions
A uniform spanning tree of Z^4 can be thought of as the "uniform measure" on trees of Z^4. The past of 0 in the uniform spanning tree is the finite component that is disconnected from infinity when 0 is deleted from the tree. We establish the logarithmic corrections to the probabilities that the past contains a path of length n, that it has volume at least n and that it reaches the boundary of the box of side length n around 0. Dimension 4 is the upper critical dimension for this model in the sense that in higher dimensions it exhibits "mean-field" critical behaviour. An important part of our proof is the study of the Newtonian capacity of a loop erased random walk in 4 dimensions. This is joint work with Tom Hutchcroft.

- 27 January 2022: Katie Harron (University College London) - Statistical methods for creating and evaluating electronic birth cohorts
Linkage of administrative data from different sources offers an efficient way to better understand the distribution of health and disease in populations, by generating population level cohorts whilst avoiding the time and cost associated with primary data collection. However, linkage is not always straightforward, particularly when we don't have access to a unique identifier that can be used to link the same individual across different data sources. In this talk, I will discuss statistical methods for creating and evaluating electronic birth cohorts, and some of the challenges that we should try to address when using these data for research.

- 3 February 2022: Chris Holmes (University of Oxford)
Title and abstract TBC.

- 10 February 2022: Hakim Debhi (University College London) - Controlled backfill in oncology dose-finding trials
The use of backfill in early phase dose-finding trials is a relatively recent practice. It consists of assigning patients to dose levels below the level where the study is at. The main reason for backfilling is to collect additional pharmacokinetic, pharmacodynamic and response data, in order to assess whether a plateau may exist on the dose-efficacy curve. This is a possibility in oncology with molecularly targeted agents or immunotherapy. Recommending for further study a dose level lower than the maximum tolerated dose could be supported in such situations. How to best allocate backfill patients to dose levels is not yet established. In this paper we propose to randomise backfill patients below the dose level where the study is at. A refinement of this would be to stop backfilling to lower dose levels when these show insufficient efficacy compared to higher levels, starting at dose level 1 and repeating this process sequentially. At study completion, data from all patients (both backfill patients and dose-finding patients) is used to estimate the dose-response curve. The fit from a change point model is compared to the fit of a monotonic model to identify a potential plateau. Using simulations, we show that this approach can identify the plateau on the dose-response curve when such a plateau exists, allowing the recommendation of a dose level lower than the maximum tolerated dose for future studies. This contribution provides a methodological framework for backfilling, from the perspective of both design and analysis in early phase oncology trials.

- 17 February 2022: Kathryn Turnbull (Lancaster University)
Title and abstract TBC.

- 24 February 2022: Darren Wilkinson (University of Durham)
Title and abstract TBC.

- 3 March 2022: Olatunji Johnson (University of Manchester)
Title and abstract TBC.

- 10 March 2022: Mike Daniels (University of Florida)
Title and abstract TBC.

- 17 March 2022: Peter Orbanz (University College London)
Title and abstract TBC.

- 21 March 2022: Georgia Salanti (Universität Bern)
Title and abstract TBC.

- 31 March 2022: Judith Rousseau (University of Oxford)
Title and abstract TBC.

- 7 April 2022: Rebecca Hubbard (University of Pennsylvania) - Considerations for valid analysis of medical product utilization and outcomes from real-world data
Opportunities to use “real-world data,” data generated as a by-product of digital transactions, have exploded over the past decade. In the context of health research, real-world data including electronic health records and medical claims facilitate understanding of treatment utilization and outcomes as they occur in routine clinical practice, and studies using these data sources can potentially proceed rapidly compared to trials and observational studies that rely on primary data collection. However, using data sources that were not collected for research purposes comes at a cost, and naïve use of such data without considering their complexity and imperfect quality can lead to bias and inferential error. Real-world data frequently violate the assumptions of standard statistical methods, but it is not practicable to develop new methods to address every possible complication arising in their analysis. The statistician is faced with a quandary: how to effectively utilize real-world data to advance research without compromising best practices for principled data analysis. In this talk I will use examples from my research on methods for the analysis of electronic health record (EHR)-derived data to illustrate approaches to understanding the data generating mechanism for real-world data. Drawing on this understanding, I will then discuss approaches to identify, use, and develop principled methods for incorporating EHR into research. The overarching goal of this presentation is to raise awareness of challenges associated with the analysis of real-world data and demonstrate how a principled approach can be grounded in an understanding of the scientific context and data generating process.

- 28 April 2022: Rhian Daniel (Prifysgol Caerdydd / Cardiff University)
Title and abstract TBC.

- 5 May 2022: Ioannis Kosmidis (University of Warwick)
Title and abstract TBC.