Statistical Science Seminars
Usual time: Thursdays 16:00 - 17:00
Location: Room 102, Department of Statistical Science, 1-19 Torrington Place (1st floor).
Some seminars are held in different locations at different times. Click on the abstract for more details.
Beyond Beta regression: modelling bounded-domain variables in the presence of boundary observations
Beta regression is a useful tool for modelling bounded-domain continuous response variables, such as proportions, rates, fractions and concentration indices. One important limitation of beta regression models is that they do not apply when at least one of the observed responses is on the boundary: in such scenarios the likelihood function is simply 0 regardless of the value of the parameters. The approaches in the literature focus either on transforming the observations by small constants so that the transformed responses fall in the support of the beta distribution, or on using a discrete-continuous mixture of a beta distribution and point masses at either or both of the boundaries. The former approach suffers from the arbitrariness of the additive adjustment. The latter gives a "special" interpretation to the boundary observations relative to the non-boundary ones, and requires the specification of an appropriate regression structure for the hurdle part of the overall model, generally leading to complicated models. In this talk we rethink the problem and present an alternative model class that leverages the flexibility of the beta distribution, naturally accommodates boundary observations and preserves the parsimony of beta regression, which is recovered as a limiting case. Likelihood-based learning and inferential procedures for the new model are presented, and its usefulness is illustrated through applications.
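The boundary problem described above is easy to see numerically. A minimal sketch using SciPy's beta distribution, together with one common (and arbitrary) additive adjustment of the kind the abstract criticises:

```python
import numpy as np
from scipy.stats import beta

# responses in [0, 1]; the last observation lies on the boundary
y = np.array([0.21, 0.47, 0.83, 1.0])

# Beta(a, b) log-likelihood: -inf as soon as one observation hits 0 or 1,
# regardless of the parameter values
a, b = 2.0, 3.0
print(beta.logpdf(y, a, b).sum())  # -inf

# the ad-hoc transformation approach: shrink all observations into (0, 1);
# the choice of adjustment constant is arbitrary
n = len(y)
y_adj = (y * (n - 1) + 0.5) / n
print(beta.logpdf(y_adj, a, b).sum())  # now finite
```

Any such adjustment changes every observation, not just the boundary ones, which is part of the arbitrariness the talk addresses.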
Geometric Representations of Random Hypergraphs
Joint with Edoardo M. Airoldi, Sayan Mukherjee and Robert L. Wolpert
We introduce a novel parametrization of distributions on hypergraphs based on the geometry of points in R^d. The idea is to induce distributions on hypergraphs by placing priors on point configurations via spatial processes. This prior specification is then used to infer conditional independence models or Markov structure for multivariate distributions. This approach supports inference of factorizations that cannot be retrieved by a graph alone, leads to new Metropolis-Hastings Markov chain Monte Carlo algorithms with both local and global moves in graph space, and generally offers greater control on the distribution of graph features than currently possible. We provide a comparative performance evaluation against state-of-the-art methods, and we illustrate the utility of this approach on simulated and real data.
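As a toy illustration of the geometric idea (not the authors' actual construction), a point configuration in R^2 can induce a random hypergraph by declaring a hyperedge whenever a set of points is mutually close; randomness in the points then induces a distribution on hypergraphs:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# a random point configuration in R^2 (stand-in for a spatial-process prior)
n, r = 12, 0.35
pts = rng.uniform(size=(n, 2))

def close(i, j):
    """True if points i and j are within distance r."""
    return np.linalg.norm(pts[i] - pts[j]) < r

# 3-uniform hyperedges: triples of points that are pairwise within distance r
hyperedges = [e for e in combinations(range(n), 3)
              if all(close(i, j) for i, j in combinations(e, 2))]
print(len(hyperedges), "hyperedges")
```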
Bayesian inference on high-dimensional Seemingly Unrelated Regressions
We present a Bayesian Seemingly Unrelated Regressions (SUR) model for associating metabolomics outcomes with genetic variants, allowing for both sparse variable selection and correlation between the outcomes. Previous approaches have either assumed independence between the outcomes (Bottolo et al. 2011, Lewin et al. 2015) or selected predictors jointly for all the outcomes (Bhadra and Mallik 2013, Bottolo et al. 2013).
In order to overcome some of the computational difficulty with the general SUR model, Zellner and Ando (2010) proposed a reparametrisation of the model in which the likelihood factorises completely into a product of conditional distributions, and used a Direct Monte Carlo (DMC) approach to estimate the posterior. We extend their work by allowing for a more general prior distribution, and we show that it is possible to build a Gibbs-DMC sampler without the need for re-sampling. The proposed method is applied to both simulated data, to illustrate the computational gains, and real metabolomics analysis where the dimension of the data precludes the use of the traditional sampler.
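The factorisation exploited by such a reparametrisation can be sketched in a two-outcome example: with a triangular decomposition of the error covariance, the joint likelihood splits into a marginal regression for the first outcome and a conditional regression for the second, with the first equation's residuals entering as an extra covariate. A minimal simulation (illustrative only; this is not the Gibbs-DMC sampler itself, and all coefficient values are made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# two outcomes with correlated errors: the SUR setting
x1, x2 = rng.normal(size=n), rng.normal(size=n)
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
e = rng.multivariate_normal([0.0, 0.0], Sigma, size=n)
y1 = 1.5 * x1 + e[:, 0]
y2 = -2.0 * x2 + e[:, 1]

# p(y1, y2 | X) = p(y1 | x1) * p(y2 | x2, y1): fit equation by equation
b1 = np.linalg.lstsq(x1[:, None], y1, rcond=None)[0][0]
r1 = y1 - x1 * b1                       # residuals of the first equation
X2 = np.column_stack([x2, r1])          # residuals act as an extra covariate
b2, g = np.linalg.lstsq(X2, y2, rcond=None)[0]
print(b1, b2, g)  # close to 1.5, -2.0 and 0.6 (= Sigma12 / Sigma11)
```

The coefficient on the residuals recovers the error correlation structure, which is what makes the complete factorisation of the likelihood possible.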
A Quasi-Bayesian Perspective to NMF: theory and applications
Quasi-Bayesian estimators are increasingly popular in statistics and machine learning, due to their generalization properties and flexibility. In recent work (Alquier & Guedj 2017, Mathematical Methods of Statistics), we have proposed a quasi-Bayesian estimator for non-negative matrix factorization. I will present a quick overview of quasi- and PAC-Bayesian frameworks and discuss our theoretical and algorithmic contributions.
Robust ranking, constrained ranking and rank aggregation via eigenvector and semidefinite programming synchronization
We consider the classic problem of establishing a statistical ranking of a set of n items given a set of inconsistent and incomplete pairwise comparisons between such items. Instantiations of this problem occur in numerous applications in data analysis (e.g., ranking teams in sports data), computer vision, and machine learning. We formulate the above problem of ranking with incomplete noisy information as an instance of the group synchronization problem over the group SO(2) of planar rotations, whose usefulness has been demonstrated in numerous applications in recent years. Its least squares solution can be approximated by either a spectral or a semidefinite programming (SDP) relaxation, followed by a rounding procedure. We perform extensive numerical simulations on both synthetic and real-world data sets (Premier League soccer games, a Halo 2 game tournament and NCAA College Basketball games) showing that our proposed method compares favourably to other algorithms from the recent literature.
We propose a similar synchronization-based algorithm for the rank-aggregation problem, which integrates the pairwise comparisons given by different rating systems on the same set of items into a single globally consistent ranking. We also discuss the problem of semi-supervised ranking, where the ground-truth ranks of a subset of players are available, and propose an SDP-based algorithm which recovers the ranks of the remaining players. Finally, synchronization-based ranking, combined with a spectral technique for the densest subgraph problem, allows one to extract locally consistent partial rankings: that is, to identify the ranks of a small subset of players whose pairwise comparisons are less noisy than the rest of the data, and which other methods are not able to identify.
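A minimal noiseless sketch of the spectral relaxation: each pairwise rank offset is encoded as a planar rotation, and the ranking is read off from the angles of the leading eigenvector. The angle scaling below is one simple illustrative choice, not necessarily the one used in the talk:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10
rank = rng.permutation(n)              # ground-truth ranks 0..n-1

# pairwise comparison offsets C_ij = rank_i - rank_j (noiseless here)
C = rank[:, None] - rank[None, :]

# encode each offset as a planar rotation: H_ij = exp(i * pi * C_ij / n);
# H is Hermitian, and in the noiseless case H = u u* with u_i = exp(i*pi*rank_i/n)
H = np.exp(1j * np.pi * C / n)

# spectral relaxation: leading eigenvector of the Hermitian matrix H
vals, vecs = np.linalg.eigh(H)
lead = vecs[:, -1]

# rounding: fix the global rotation against one reference entry and
# order the items by the resulting angles
angles = np.angle(lead / lead[0])
est = np.argsort(angles)
print(est)  # items listed in order of increasing rank
```

With noisy or incomplete comparisons the same pipeline applies, with missing entries set to zero in H; the abstract's point is that the recovered ordering degrades gracefully with the noise.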
Density functional theory and optimal transport with Coulomb cost
Multi-marginal optimal transport with Coulomb cost arises as a dilute limit of density functional theory, which is a widely used electronic structure model. The number N of marginals corresponds to the number of particles. After a non-rigorous introduction to quantum mechanics and DFT, I will discuss the question whether "Kantorovich minimizers" must be "Monge minimizers" (yes for N=2, open for N>2, no for N=infinity), and derive the surprising phenomenon that the extreme correlations of the minimizers turn into independence in the large N limit.
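For reference, the multi-marginal Kantorovich problem with Coulomb cost reads:

```latex
\min_{\gamma \in \Pi(\rho, \dots, \rho)}
  \int_{\mathbb{R}^{3N}} \sum_{1 \le i < j \le N}
  \frac{1}{|x_i - x_j|} \, d\gamma(x_1, \dots, x_N),
```

where $\Pi(\rho, \dots, \rho)$ denotes the probability measures on $\mathbb{R}^{3N}$ whose N marginals all equal the single-particle density $\rho$, and Monge minimizers are those concentrated on the graph of maps $(x, T_2(x), \dots, T_N(x))$.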
The talk is based on joint works with Gero Friesecke (TUM), Claudia Klueppelberg (TUM), Brendan Pass (Alberta) which appeared in CPAM (2013) and Calc.Var.PDE (2014).
A dynamic network model for single-cell genomic data
Network models have become an important topic in modern statistics. The evolution of network structure over time is an important new area of study, relevant to a range of applications. An important application area of statistical network models is in genomics: network models are a natural way to describe and analyse patterns of interactions (represented by network edges) between genes and their products (represented by network nodes). However, whilst network models are well established in genomics, historically these models have almost always been static network models, ignoring the fact that genomic processes are inherently dynamic.
In this joint work with Ricardo Silva and Ioannis Kosmidis, a model is proposed to infer dynamic genomic network structure based on single-cell measurements of gene expression counts. This model draws on ideas from the Bayesian lasso and from copula models, and is implemented efficiently by combining Gibbs- and slice-sampling techniques. We apply our new model to data from neural development, and in doing so are able to infer changes in network structure which we would expect to see, based on current biological knowledge. For this talk, no knowledge of biology, bioinformatics or biological data analysis will be assumed.
Using Ecological Propensity Score to Adjust for Missing Confounders in Small Area Studies
Small area ecological studies are commonly used in epidemiology to assess the impact of area level risk factors on health outcomes when data are only available in an aggregated form. However, the resulting estimates are often biased due to unmeasured confounders, which typically are not available from the standard administrative registries used for these studies. Extra information on confounders can be provided through external datasets such as surveys or cohorts, where the data are available at the individual level rather than at the area level; however such data typically lack the geographical coverage of administrative registries. We develop a framework of analysis which combines ecological and individual level data from different sources to provide an adjusted, less biased estimate of area level risk factors. Our method (i) summarises all available individual level confounders into an area level scalar variable, which we call the ecological propensity score (EPS), (ii) implements a hierarchical structured approach to predict the values of EPS whenever they are missing, and (iii) includes the estimated and predicted EPS in the ecological regression linking the risk factors to the health outcome. Through a simulation study we show that integrating individual level data into small area analyses via EPS is a promising method to reduce the bias intrinsic in ecological studies due to unmeasured confounders; we also apply the method to a real case study to evaluate the effect of air pollution on coronary heart disease hospital admissions in London.
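A stylised sketch of steps (i)-(iii). All names, the simulated data, and the linear stand-ins for the hierarchical prediction model are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
n_areas, m = 50, 30                      # areas; individuals sampled per area

# individual-level confounders observed only in surveyed areas
surveyed = rng.random(n_areas) < 0.6
area_effect = rng.normal(size=n_areas)   # unmeasured area-level confounding
conf = area_effect[:, None] + rng.normal(size=(n_areas, m))

# (i) summarise confounders into an area-level scalar EPS
# (here: area mean of a single confounder; in general, a score built
# from several individual-level confounders)
eps = np.full(n_areas, np.nan)
eps[surveyed] = conf[surveyed].mean(axis=1)

# (ii) predict missing EPS from an area-level registry covariate
# (plain linear regression as a stand-in for the hierarchical model)
z = area_effect + rng.normal(scale=0.5, size=n_areas)
A = np.column_stack([np.ones(surveyed.sum()), z[surveyed]])
coef = np.linalg.lstsq(A, eps[surveyed], rcond=None)[0]
eps[~surveyed] = coef[0] + coef[1] * z[~surveyed]

# (iii) include EPS in the ecological regression of outcome on risk factor
pollution = rng.normal(size=n_areas)
outcome = 0.8 * pollution + area_effect + rng.normal(scale=0.3, size=n_areas)
X = np.column_stack([np.ones(n_areas), pollution, eps])
beta = np.linalg.lstsq(X, outcome, rcond=None)[0]
print(beta[1])   # adjusted estimate of the pollution effect
```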
Scaling limit of the odometer in the divisible sandpile
Joint with Rajat Subhra Hazra and Wioletta Ruszel
The divisible sandpile model, a continuous version of the abelian sandpile model, was introduced by Levine and Peres to study scaling limits of the rotor aggregation and internal DLA growth models. The dynamics of the sandpile runs as follows: to each site of a graph there is associated a height or mass. If the height exceeds a certain value then the site collapses by distributing the excessive mass uniformly to its neighbours. In recent work, Levine et al. addressed two questions regarding these models: the dichotomy between stabilizing and exploding configurations, and the behaviour of the odometer (a function measuring the amount of mass emitted during stabilization). In this talk we will investigate further the odometer function by showing that, under appropriate rescaling, it converges to the continuum bi-Laplacian field or to an alpha-stable generalised field when the underlying graph is a discrete torus. Moreover, we present some results about stabilization versus explosion for heavy-tailed initial distributions.
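The toppling dynamics and the odometer are easy to simulate. A minimal sketch on a cycle (a one-dimensional discrete torus), with subcritical mean mass so the configuration stabilizes; the threshold height 1 and the initial distribution are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 32
mass = rng.uniform(0.0, 1.6, size=n)   # mean mass 0.8 < 1: stabilizes
total = mass.sum()
odometer = np.zeros(n)

# parallel toppling: every unstable site emits its excess over height 1,
# split equally between its two neighbours on the cycle
for _ in range(50_000):
    excess = np.maximum(mass - 1.0, 0.0)
    if excess.max() < 1e-12:
        break
    mass -= excess
    mass += 0.5 * (np.roll(excess, 1) + np.roll(excess, -1))
    odometer += excess                  # total mass emitted by each site

print(mass.max(), odometer.max())
```

Mass is conserved at every step, and by the abelian property the final stable configuration and the odometer do not depend on the toppling order.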
Analytic representations of large graphs
The theory of graph limits aims at providing tools to analyze and model large networks/graphs. Such analytic models have found applications in various areas of computer science and mathematics. We survey basic results on analytic models of large dense graphs and focus in more detail on the asymptotic structure of dense graphs uniquely determined by finitely many density constraints. The talk will be self-contained and no previous knowledge of the area is needed.
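The basic analytic object in this theory is the graphon, a symmetric measurable function W : [0,1]^2 -> [0,1]; sampling a dense graph from a graphon takes a few lines (a standard construction, with a simple illustrative choice of W):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_from_graphon(W, n, rng):
    """Draw an n-vertex graph: u_i ~ Uniform(0,1) i.i.d., then connect
    i and j independently with probability W(u_i, u_j)."""
    u = rng.uniform(size=n)
    P = W(u[:, None], u[None, :])
    upper = np.triu(rng.uniform(size=(n, n)) < P, k=1).astype(int)
    return upper + upper.T             # symmetric adjacency matrix

W = lambda x, y: x * y                 # a simple graphon
A = sample_from_graphon(W, 300, rng)

# the edge density concentrates around the integral of W (= 1/4 here)
density = A.sum() / (A.shape[0] * (A.shape[0] - 1))
print(round(density, 3))
```

Density constraints of the kind mentioned in the abstract are conditions on such subgraph densities, with the edge density being the simplest example.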