## Statistical Science Seminars

**Usual time**: Thursdays 1400 - 1500

**Location**: Room 102, Department of Statistical Science, 1-19 Torrington Place (1st floor).

Some seminars are held at different locations and at different times. Please click on the abstract for further details.

## 31 August 2017 (Galton Lecture Theatre, 1-19 Torrington Place): Dr. Bob Durrant (University of Waikato)

### Random Projections for Dimensionality Reduction

Linear dimensionality reduction is a key tool in the statistician's toolbox, used variously to make models simpler and more interpretable, to deal with cases when n<p (e.g. to enable model identifiability), or to reduce compute time or memory requirements for large-scale (high-dimensional, large p) problems. In recent years, *random* projection ('RP'), that is, projecting a dataset onto a k-dimensional subspace ('k-flat') chosen uniformly at random from all such k-flats, has become a workhorse approach in the machine learning and data-mining fields, but it is still relatively unknown in other circles. In this talk I will review an elementary proof of the Johnson-Lindenstrauss lemma which, perhaps rather surprisingly, shows that (with high probability) RP approximately preserves the Euclidean geometry of projected data. This result has provided some theoretical grounds for using RP in a range of applications. I will also give a simple - but novel - extension which shows that, for data satisfying a mild regularity condition, simply sampling the features does nearly as well as RP at geometry preservation, while at the same time bringing a substantial speed-up in execution. Finally, I will briefly discuss some refinements of this final approach and present some preliminary experimental findings combining this with a pre-trained "deep" neural network on ImageNet data.
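As a rough illustration of the geometry-preservation claim in the abstract above, the following sketch (not taken from the talk) compares pairwise Euclidean distances before and after two reductions. It makes several assumptions: a Gaussian random matrix stands in for projection onto a uniformly random k-flat, the data are synthetic Gaussian points whose mass is spread evenly over the coordinates, and all function names and sizes (`n`, `d`, `k`) are hypothetical.

```python
import math
import random


def gaussian_random_projection(points, k, rng):
    """Map d-dimensional points to k dimensions with an i.i.d. N(0, 1/k) matrix.

    Assumption: this Gaussian matrix is a common substitute for an exact
    projection onto a uniformly random k-flat.
    """
    d = len(points[0])
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(row[j] * x[j] for j in range(d)) for row in matrix] for x in points]


def sample_features(points, k, rng):
    """Keep k coordinates chosen uniformly without replacement, rescaled by sqrt(d/k)."""
    d = len(points[0])
    kept = rng.sample(range(d), k)
    scale = math.sqrt(d / k)
    return [[scale * x[j] for j in kept] for x in points]


def distance_ratios(original, reduced):
    """Ratio of reduced to original Euclidean distance for every pair of points."""
    ratios = []
    for i in range(len(original)):
        for j in range(i + 1, len(original)):
            ratios.append(
                math.dist(reduced[i], reduced[j]) / math.dist(original[i], original[j])
            )
    return ratios


rng = random.Random(0)  # fixed seed so the run is repeatable
n, d, k = 20, 500, 100  # hypothetical sizes chosen to keep the run fast
data = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

rp_ratios = distance_ratios(data, gaussian_random_projection(data, k, rng))
fs_ratios = distance_ratios(data, sample_features(data, k, rng))
print(f"random projection: min={min(rp_ratios):.3f} max={max(rp_ratios):.3f}")
print(f"feature sampling:  min={min(fs_ratios):.3f} max={max(fs_ratios):.3f}")
```

Ratios close to 1 indicate that distances are approximately preserved. Feature sampling avoids the matrix multiplication entirely, which is where a speed-up of the kind the abstract mentions would come from.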

## 21 September 2017: Dr. Andrew Titman (Lancaster University)

### Testing the Markov assumption in general multi-state models

Recently there has been interest in the development of estimators of the transition probabilities for right-censored data that are robust to departures from the Markov assumption. The landmark Aalen-Johansen (LMAJ) [1] estimator is robust to non-Markov processes, but this robustness comes at the cost of a loss of efficiency compared to the standard Aalen-Johansen (AJ) estimator, making it important to identify when it is necessary to use LMAJ.

A similar principle to the construction of the LMAJ can be used to build a test of the Markov property. For a given starting state and time, the set of patients who were in that state at that time can be identified and treated as a distinct group to those who were not. If the process is Markov, the transition intensities in the two groups will be equal. The log-rank test statistics from the transition intensities can be combined to produce a test at that time. Moreover, the statistics across time and starting state form a stochastic process allowing the construction of a global supremum test. A wild bootstrap procedure is proposed to approximate the null distribution in finite samples.

The performance of the test is investigated through simulation in a variety of settings and by application to a dataset on patients with liver cirrhosis.

[1] Putter, H., Spitoni, C. (2016). Non-parametric estimation of transition probabilities in non-Markov multi-state models: the landmark Aalen-Johansen estimator. Statistical Methods in Medical Research. DOI:10.1177/0962280216674497

## 05 October 2017: Prof. Aapo Hyvarinen (University College London)

### Nonlinear ICA using temporal structure: a principled framework for unsupervised deep learning

Unsupervised learning, in particular learning general nonlinear representations, is one of the deepest problems in machine learning. Estimating latent quantities in a generative model provides a principled framework, and has been successfully used in the linear case, e.g. with independent component analysis (ICA) and sparse coding. However, extending ICA to the nonlinear case has proven to be extremely difficult: a straightforward extension is unidentifiable, i.e. it is not possible to recover those latent components that actually generated the data. Here, we show that this problem can be solved by using temporal structure. We formulate two generative models in which the data is an arbitrary but invertible nonlinear transformation of time series (components) which are statistically independent of each other. Drawing from the theory of linear ICA, we formulate two distinct classes of temporal structure of the components which enable identification, i.e. recovery of the original independent components. We show that in both cases, the actual learning can be performed by ordinary neural network training where only the input is defined in an unconventional manner, making software implementations trivial. We can rigorously prove that after such training, the units in the last hidden layer will give the original independent components. [With Hiroshi Morioka, published at NIPS2016 and AISTATS2017.]

## 12 October 2017: Prof. Stein-Erik Fleten (Norwegian University of Science and Technology)

### Structural Estimation of Switching Costs for Peaking Power Plants

We use structural estimation to determine the one-time costs associated with shutting down, restarting, and abandoning peak power plants in the United States. The sample period covers 2001-2009. Switching costs are difficult to determine in practice and can vary substantially from plant to plant. The approach combines a nonparametric regression for capturing transitions in the exogenous state variable with a one-step nonlinear optimization for structural estimation. The data are well-suited to test the new method because the state variable is not described by any known stochastic process. From our estimates of switching (and maintenance) costs we can infer the costs which would be avoided if a peaking plant were taken out of service for a year. This so-called avoidable cost plays an important role in electricity capacity markets such as the Reliability Pricing Mechanism in PJM. Our avoidable cost estimates are less than the default Avoidable Cost Rate (ACR) in PJM.

## 19 October 2017: Prof. Bianca De Stavola (University College London)

### Multiple questions for multiple mediators

Investigating the mechanisms that may explain the causal links between an exposure and a temporally distal outcome often involves multiple interdependent mediators. Until recently, dealing with multiple mediators was restricted to settings where mediators relate to exposure and outcome only linearly. Extensions proposed in the causal inference literature to allow for interactions and non-linearities in the presence of multiple mediators initially focussed on natural direct and indirect effects. These however are not all identifiable, with the rest requiring stringent, and often unrealistic, assumptions. More recent developments have focussed on interventional (or randomised) direct and indirect effects to deal with these issues. They can be identified under less restrictive assumptions, with generalizations dealing with time-varying exposures, mediators and confounders also possible. The mediation questions that can be addressed when estimating interventional effects differ from those asked by natural effects in subtle ways. These will be reviewed, with their differences in emphasis, assumptions, and interpretation discussed. An epidemiological investigation of the mechanisms linking maternal pre-pregnancy weight status and offspring eating disorder behaviours will be used to illustrate these points.

References:

Daniel RM, De Stavola BL, Cousens SN, Vansteelandt S. Causal mediation analysis with multiple mediators. Biometrics 2015; 71, 1–14.

VanderWeele TJ, Vansteelandt S, Robins JM. Effect decomposition in the presence of an exposure-induced mediator-outcome confounder. Epidemiology 2014; 25, 300–306.

VanderWeele TJ, Tchetgen-Tchetgen E. Mediation Analysis with Time-Varying Exposures and Mediators. JRSS A (in press).

Vansteelandt S, Daniel RM. Interventional effects with multiple mediators. Epidemiology 2016; (1), 1–8.

## 26 October 2017: Dr. Shaun Seaman (University of Cambridge)

### Relative Efficiency of Joint-Model and Full-Conditional-Specification Multiple Imputation when Conditional Models are Compatible

Fitting a regression model of interest is often complicated by missing data on the variables in that model. Multiple imputation (MI) is commonly used to handle these missing data. Two popular methods of MI are joint model MI and full-conditional-specification (FCS) MI. These are known to yield imputed data with the same asymptotic distribution when the conditional models of FCS are compatible with the joint model. We show that this asymptotic equivalence of imputation distributions does not imply that joint model MI and FCS MI will also yield asymptotically equally efficient inference about the parameters of the model of interest, nor that they will be equally robust to misspecification of the joint model. When the conditional models used by FCS MI are linear, logistic and multinomial regressions, these are compatible with a restricted general location (RGL) joint model. We show that MI using the RGL joint model (RGL MI) can be substantially more asymptotically efficient than FCS MI, but this typically requires very strong associations between variables. When associations are weaker, the efficiency gain is small. Moreover, FCS MI is shown to be potentially much more robust than RGL MI to misspecification of the RGL model when there is substantial missingness in the outcome variable.

This is joint work with Rachael Hughes, University of Bristol.

## 09 November 2017: Dr. Francisco J. Rubio (London School of Hygiene & Tropical Medicine)

### TBC

## 16 November 2017: Dr. Rui Zhu (University of Kent)

### TBC

## 23 November 2017: Dr. Mark Brewer (Biomathematics & Statistics Scotland)

### TBC

## 30 November 2017: Dr. Sara Wade (University of Warwick)

### TBC

## 14 December 2017: Dr. Kylie-Anne Richards (QTR Capital Pty, Sydney)

### Modelling the limit order book using marked Hawkes self-exciting point processes

Increased activity and temporal clustering in the limit order book (LOB) can be characterized by an increase in intensity of events. Understanding and forecasting fluctuations in the intensity is informative to high-frequency financial applications. The Hawkes self-exciting point process can be used to successfully model the dynamics of the intensity function by allowing for irregularly spaced time sequences, a multivariate framework, multiple dependent marks and the ability to capture the impact of marks on intensity. A critical first step to successfully apply these models to the LOB is suitably defining events in terms of the number of limit order book levels and types of orders. Based on extensive data analysis, recommendations are made. Likewise, selection of marks that impact the intensity function is challenging and the literature provides little guidance. Based on a review of the LOB literature, potential marks are identified and screened using a novel mark detection method based on the likelihood score statistic. Comparisons of exponential and power-law decay functions are presented. Fitting marks with a likelihood-based method presents substantial identifiability issues, which are investigated via simulation for a variety of model formulations. Application is made to futures data with various underlying asset classes.
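For readers unfamiliar with self-exciting processes, the sketch below (not taken from the talk) evaluates the conditional intensity of a univariate, unmarked Hawkes process, a baseline rate plus a decaying contribution from each past event, under the two decay shapes the abstract compares. The parameter values, event times, and function names are all hypothetical, and marks and the multivariate case are deliberately left out.

```python
import math


def exponential_kernel(lag, alpha=0.8, beta=1.5):
    """Exponential decay: alpha * exp(-beta * lag). Parameters are illustrative."""
    return alpha * math.exp(-beta * lag)


def power_law_kernel(lag, alpha=0.8, c=1.0, p=2.5):
    """Power-law decay: alpha * (lag + c) ** (-p). Parameters are illustrative."""
    return alpha * (lag + c) ** (-p)


def hawkes_intensity(t, event_times, kernel, baseline=0.5):
    """Baseline rate plus the kernel summed over events strictly before time t."""
    return baseline + sum(kernel(t - s) for s in event_times if s < t)


events = [1.0, 1.2, 1.3, 4.0]  # hypothetical order arrival times
for t in (0.5, 1.4, 3.0, 4.1):
    lam_exp = hawkes_intensity(t, events, exponential_kernel)
    lam_pow = hawkes_intensity(t, events, power_law_kernel)
    print(f"t={t:.1f}  exponential={lam_exp:.3f}  power-law={lam_pow:.3f}")
```

The intensity equals the baseline before any event, jumps after the cluster of arrivals around t = 1, and then relaxes back towards the baseline. How quickly it relaxes is what distinguishes the two decay functions.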