Statistical Science Seminars
Usual time: Thursdays 16:00 - 17:00
Location: Room 102, Department of Statistical Science, 1-19 Torrington Place (1st floor).
Some seminars are held in different locations at different times. Click on the abstract for more details.
Beyond Beta regression: modelling bounded-domain variables in the presence of boundary observations
Beta regression is a useful tool for modelling bounded-domain continuous response variables, such as proportions, rates, fractions and concentration indices. One important limitation of beta regression models is that they do not apply when at least one of the observed responses is on the boundary; in such scenarios the likelihood function is simply 0 regardless of the value of the parameters. The relevant approaches in the literature focus on either the transformation of the observations by small constants so that the transformed responses end up in the support of the beta distribution, or the use of a discrete-continuous mixture of a beta distribution and point masses at either or both of the boundaries. The former approach suffers from the arbitrariness of choosing the additive adjustment. The latter approach gives a "special" interpretation to the boundary observations relative to the non-boundary ones, and requires the specification of an appropriate regression structure for the hurdle part of the overall model, generally leading to complicated models. In this talk we rethink the problem and present an alternative model class that leverages the flexibility of the beta distribution, can naturally accommodate boundary observations and preserves the parsimony of beta regression, which is a limiting case. Likelihood-based learning and inferential procedures for the new model are presented, and its usefulness is illustrated by applications.
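The boundary problem described above can be illustrated with a minimal sketch. It assumes a mean-precision parametrisation (shape parameters mu*phi and (1-mu)*phi) with both shapes greater than 1, and it uses one commonly cited "squeeze" adjustment, (y(n-1) + 0.5)/n, as an example of the small-constant transformation. The function names and the choice of adjustment are illustrative assumptions, not the speaker's method.

```python
import math

def _coef_times_log(coef, y):
    # coef * log(y), using the appropriate limit when y == 0.
    if y > 0:
        return coef * math.log(y)
    if coef == 0:
        return 0.0
    return -math.inf if coef > 0 else math.inf

def beta_logpdf(y, mu, phi):
    # Log-density of Beta(mu * phi, (1 - mu) * phi) at y in [0, 1].
    a, b = mu * phi, (1.0 - mu) * phi
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + _coef_times_log(a - 1.0, y) + _coef_times_log(b - 1.0, 1.0 - y)

def beta_loglik(ys, mu, phi):
    # Log-likelihood of an i.i.d. sample (no covariates, for simplicity).
    return sum(beta_logpdf(y, mu, phi) for y in ys)

def squeeze(ys):
    # Ad hoc adjustment (an assumption here): pull every response
    # strictly inside (0, 1) by a sample-size-dependent constant.
    n = len(ys)
    return [(y * (n - 1) + 0.5) / n for y in ys]

responses = [0.0, 0.2, 0.5, 1.0]               # two boundary observations
raw = beta_loglik(responses, mu=0.5, phi=10.0)
adjusted = beta_loglik(squeeze(responses), mu=0.5, phi=10.0)
print(raw, adjusted)
```

With the boundary values present, the raw log-likelihood is minus infinity (a likelihood of 0), while the adjusted sample gives a finite value that depends on the arbitrary constant chosen in `squeeze`.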
Geometric Representations of Random Hypergraphs
Joint with Edoardo M. Airoldi, Sayan Mukherjee and Robert L. Wolpert
We introduce a novel parametrization of distributions on hypergraphs based on the geometry of points in R^d. The idea is to induce distributions on hypergraphs by placing priors on point configurations via spatial processes. This prior specification is then used to infer conditional independence models or Markov structure for multivariate distributions. This approach supports inference of factorizations that cannot be retrieved by a graph alone, leads to new Metropolis-Hastings Markov chain Monte Carlo algorithms with both local and global moves in graph space, and generally offers greater control on the distribution of graph features than currently possible. We provide a comparative performance evaluation against state-of-the-art methods, and we illustrate the utility of this approach on simulated and real data.
Bayesian inference on high-dimensional Seemingly Unrelated Regressions
We present a Bayesian Seemingly Unrelated Regressions (SUR) model for associating metabolomics outcomes with genetic variants, allowing for both sparse variable selection and correlation between the outcomes. Previous approaches have either assumed independence between the outcomes (Bottolo et al. 2011, Lewin et al. 2015) or selected predictors jointly for all the outcomes (Bhadra and Mallick 2013, Bottolo et al. 2013).
To overcome some of the computational difficulties with the general SUR model, Zellner and Ando (2010) proposed a reparametrisation of the model in which the likelihood factorises completely into a product of conditional distributions, and used a Direct Monte Carlo (DMC) approach to estimate the posterior. We extend their work by allowing for a more general prior distribution, and we show that it is possible to build a Gibbs-DMC sampler without the need for re-sampling. The proposed method is applied both to simulated data, to illustrate the computational gains, and to a real metabolomics analysis where the dimension of the data precludes the use of the traditional sampler.
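As a sketch of the kind of reparametrisation referred to above, the m correlated regression errors can be rewritten recursively so that each outcome is regressed on its own predictors and on the residuals of the preceding outcomes. The notation below (rho, omega, epsilon) is assumed for illustration and is not taken from the abstract.

```latex
\begin{align*}
y_1 &= X_1\beta_1 + \varepsilon_1, \\
y_k &= X_k\beta_k + \sum_{l<k} \rho_{kl}\,(y_l - X_l\beta_l) + \varepsilon_k,
      \qquad k = 2,\dots,m, \\
\varepsilon_k &\sim N(0,\ \omega_k^2 I_n) \quad \text{independently across } k, \\
p(y_1,\dots,y_m \mid \beta,\rho,\omega)
    &= \prod_{k=1}^{m} p(y_k \mid y_1,\dots,y_{k-1},\ \beta,\rho,\omega_k).
\end{align*}
```

Because the transformed errors are independent, each factor is a univariate-error Gaussian regression, which is what makes direct sampling of the conditionals feasible.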
A Quasi-Bayesian Perspective to NMF: theory and applications
Quasi-Bayesian estimators are increasingly popular in statistics and machine learning, due to their generalization properties and flexibility. In recent work (Alquier & Guedj 2017, Mathematical Methods of Statistics), we have proposed a quasi-Bayesian estimator for non-negative matrix factorization. I will present a quick overview of quasi- and PAC-Bayesian frameworks and discuss our theoretical and algorithmic contributions.
Robust ranking, constrained ranking and rank aggregation via eigenvector and semidefinite programming synchronization
We consider the classic problem of establishing a statistical ranking of a set of n items given a set of inconsistent and incomplete pairwise comparisons between such items. Instantiations of this problem occur in numerous applications in data analysis (e.g., ranking teams in sports data), computer vision, and machine learning. We formulate the above problem of ranking with incomplete noisy information as an instance of the group synchronization problem over the group SO(2) of planar rotations, whose usefulness has been demonstrated in numerous applications in recent years. Its least squares solution can be approximated by either a spectral or a semidefinite programming (SDP) relaxation, followed by a rounding procedure. We perform extensive numerical simulations on both synthetic and real-world data sets (Premier League soccer games, a Halo 2 game tournament and NCAA College Basketball games) showing that our proposed method compares favourably to other algorithms from the recent literature.
We propose a similar synchronization-based algorithm for the rank-aggregation problem, which integrates pairwise comparisons given by different rating systems on the same set of items into a globally consistent ranking. We also discuss the problem of semi-supervised ranking when information is available on the ground truth rank of a subset of players, and propose an algorithm based on SDP which recovers the ranks of the remaining players. Finally, synchronization-based ranking, combined with a spectral technique for the densest subgraph problem, allows one to extract locally-consistent partial rankings, in other words, to identify the rank of a small subset of players whose pairwise comparisons are less noisy than the rest of the data, which other methods are not able to identify.
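The spectral relaxation mentioned in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions: rank offsets are mapped linearly onto a half-circle of angles, the top eigenvector of the resulting Hermitian matrix gives the angles up to a global rotation, and the rounding step picks the circular shift with the fewest disagreements with the observed comparisons. The function name `sync_rank`, the input layout and the particular rounding rule are hypothetical choices, not necessarily those used by the speaker.

```python
import numpy as np

def sync_rank(offsets, observed):
    """Spectral synchronization sketch for ranking.

    offsets[i, j]  : measured rank offset r_i - r_j (skew-symmetric)
    observed[i, j] : True where a comparison is available (symmetric,
                     False on the diagonal)
    Returns an integer rank (0 = lowest) for every item.
    """
    n = offsets.shape[0]
    # Map offsets to angles in [-pi, pi] and build a Hermitian matrix.
    angles = np.pi * offsets / (n - 1)
    H = np.where(observed, np.exp(1j * angles), 0)
    # Spectral relaxation: eigenvector of the largest eigenvalue
    # (eigh returns eigenvalues in ascending order).
    _, vecs = np.linalg.eigh(H)
    top = vecs[:, -1]
    # Recovered angles are defined only up to a global rotation,
    # so sorting them yields a circular ordering of the items.
    circular = np.argsort(np.mod(np.angle(top), 2 * np.pi))
    best_ranks, best_upsets = None, np.inf
    for shift in range(n):
        candidate = np.roll(circular, -shift)
        ranks = np.empty(n, dtype=int)
        ranks[candidate] = np.arange(n)
        implied = np.sign(ranks[:, None] - ranks[None, :])
        # Count observed comparisons that the candidate ranking contradicts.
        upsets = np.sum(observed & (implied != np.sign(offsets)))
        if upsets < best_upsets:
            best_ranks, best_upsets = ranks, upsets
    return best_ranks

# Noiseless, complete comparisons on six items (synthetic example data).
true_ranks = np.array([3, 0, 5, 1, 4, 2])
offsets = true_ranks[:, None] - true_ranks[None, :]
observed = ~np.eye(6, dtype=bool)
print(sync_rank(offsets, observed))
```

In this noiseless example the matrix has rank one apart from its zero diagonal, so the top eigenvector carries the true angles exactly; with noisy or missing comparisons the same steps give an approximation rather than a guaranteed recovery.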
Density functional theory and optimal transport with Coulomb cost
Multi-marginal optimal transport with Coulomb cost arises as a dilute limit of density functional theory (DFT), which is a widely used electronic structure model. The number N of marginals corresponds to the number of particles. After a non-rigorous introduction to quantum mechanics and DFT, I will discuss the question of whether "Kantorovich minimizers" must be "Monge minimizers" (yes for N=2, open for N>2, no for N=infinity), and derive the surprising phenomenon that the extreme correlations of the minimizers turn into independence in the large N limit.
The talk is based on joint work with Gero Friesecke (TUM), Claudia Klueppelberg (TUM) and Brendan Pass (Alberta), which appeared in CPAM (2013) and Calc. Var. PDE (2014).
A dynamic network model for single-cell genomic data
Network models have become an important topic in modern statistics. The evolution of network structure over time is an important new area of study, relevant to a range of applications. An important application area of statistical network models is in genomics: network models are a natural way to describe and analyse patterns of interactions (represented by network edges) between genes and their products (represented by network nodes). However, whilst network models are well established in genomics, historically these models have almost always been static network models, ignoring the fact that genomic processes are inherently dynamic.
In this joint work with Ricardo Silva and Ioannis Kosmidis, a model is proposed to infer dynamic genomic network structure based on single-cell measurements of gene expression counts. This model draws on ideas from the Bayesian lasso and from copula models, and is implemented efficiently by combining Gibbs- and slice-sampling techniques. We apply our new model to data from neural development, and in doing so are able to infer changes in network structure which we would expect to see, based on current biological knowledge. For this talk, no knowledge of biology, bioinformatics or biological data analysis will be assumed.