Statistical Science Seminars
Usual time: Thursdays 16:00 - 17:00
Location: Room 102, Department of Statistical Science, 1-19 Torrington Place (1st floor).
Some seminars are held in different locations at different times. Click on the abstract for more details.
Non-parametric information estimator and its application to information theoretic clustering
Shannon’s information measure and entropy are fundamental building blocks of many statistics such as Kullback-Leibler divergence and mutual information. A commonly used class of non-parametric estimators for these quantities is based on k-nearest neighbor distances. In this talk, we introduce a quantile formulation of the non-parametric information estimator and its extension to weighted observations. Based on the estimator, we develop a fixed-point algorithm for information theoretic clustering. The proposed clustering algorithm offers state-of-the-art results for many real-world problems.
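For readers unfamiliar with k-nearest neighbor entropy estimation, the following is a minimal sketch of the classical Kozachenko-Leonenko estimator for one-dimensional data. It is not the quantile formulation presented in the talk; the function names and the default k = 3 are illustrative assumptions.

```python
import math
import random


def digamma_difference(n, k):
    """Return psi(n) - psi(k) for integers n >= k >= 1.

    For integer arguments the digamma function satisfies
    psi(n) - psi(k) = sum_{j=k}^{n-1} 1/j, so no special-function
    library is needed.
    """
    return sum(1.0 / j for j in range(k, n))


def knn_entropy_1d(samples, k=3):
    """Kozachenko-Leonenko estimate of differential entropy (in nats).

    Assumes 1-D samples with no exact duplicates and len(samples) > k.
    The estimate is
        psi(N) - psi(k) + log(2) + (1/N) * sum_i log(r_i),
    where r_i is the distance from sample i to its k-th nearest neighbor
    and 2 * r_i is the length of the 1-D ball of radius r_i.
    """
    n = len(samples)
    xs = sorted(samples)
    log_sum = 0.0
    for i, x in enumerate(xs):
        # In sorted order the k nearest neighbors lie within k positions.
        lo, hi = max(0, i - k), min(n, i + k + 1)
        dists = sorted(abs(x - xs[j]) for j in range(lo, hi) if j != i)
        log_sum += math.log(dists[k - 1])
    return digamma_difference(n, k) + math.log(2.0) + log_sum / n


random.seed(0)
gaussian = [random.gauss(0.0, 1.0) for _ in range(2000)]
estimate = knn_entropy_1d(gaussian)
exact = 0.5 * math.log(2.0 * math.pi * math.e)  # entropy of N(0, 1)
```

On standard normal samples the estimate can be compared with the closed-form entropy 0.5 log(2πe); the two should agree more closely as the sample size grows.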
Algebraic Combinatorial Solution Strategies for Matrix Completion
Matrix completion is the inverse problem of reconstructing missing entries in a potentially noisy low-rank matrix. We give a short introduction to the most common application scenarios (recommender systems, compressed sensing), and give an overview of some existing solution strategies. We explain how matrix completion is algebraic - through the low-rank assumption - and combinatorial - through the pattern of observations. We show how the algebraic combinatorial properties of the true matrix and the observation pattern can be encoded in a bipartite graph whose combinatorics determine the solvability of the problem, and demonstrate how some previously known results, e.g., the sufficient and necessary O(n log n) bound for the random sampling density in order to achieve reconstruction, can be completely explained in terms of the combinatorial properties of the graph. Furthermore, we show how knowledge of the algebraic and combinatorial features of the problem can be used to obtain a noise-consistent estimator for the reconstruction of single missing entries, or the denoising of single observed entries, and a noise-consistent estimator for the reconstruction accuracy which is independent of the reconstruction method.
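The link between the observation pattern and solvability is easiest to see in the rank-1, noise-free special case, sketched below. This is an illustrative simplification, not the estimator from the talk. Rows and columns are the two vertex classes of a bipartite graph, each observed entry is an edge, and a matrix with nonzero entries is uniquely completable exactly when that graph is connected. The function name and the data layout (a dict keyed by index pairs) are assumptions made for this example.

```python
from collections import deque


def complete_rank_one(n_rows, n_cols, observed):
    """Complete a noise-free rank-1 matrix A[i][j] = u[i] * v[j].

    observed: dict mapping (row, col) -> value; all values are assumed
    nonzero. Returns the full matrix as a list of lists, or None when
    the bipartite row/column graph of observed entries is disconnected,
    in which case the completion is not unique.
    """
    row_adj = [[] for _ in range(n_rows)]
    col_adj = [[] for _ in range(n_cols)]
    for i, j in observed:
        row_adj[i].append(j)
        col_adj[j].append(i)

    u = [None] * n_rows
    v = [None] * n_cols
    u[0] = 1.0  # fixes the scale ambiguity between u and v
    queue = deque([("row", 0)])
    # Breadth-first search: each edge determines one unknown factor.
    while queue:
        kind, idx = queue.popleft()
        if kind == "row":
            for j in row_adj[idx]:
                if v[j] is None:
                    v[j] = observed[(idx, j)] / u[idx]
                    queue.append(("col", j))
        else:
            for i in col_adj[idx]:
                if u[i] is None:
                    u[i] = observed[(i, idx)] / v[idx]
                    queue.append(("row", i))

    if any(x is None for x in u) or any(x is None for x in v):
        return None  # some row or column was never reached
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]


# Four of the six entries of the rank-1 matrix [[4, 5], [8, 10], [12, 15]].
pattern = {(0, 0): 4.0, (0, 1): 5.0, (1, 0): 8.0, (2, 1): 15.0}
completed = complete_rank_one(3, 2, pattern)
```

Removing the entry at (0, 1) disconnects row 2 and column 1 from the rest of the graph, and the function then returns None.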
Insurance Claims Run-Off Modeling
We introduce the claims reserving problem. Claims reserving is one of the most important tasks in a general insurance company. It aims at predicting cash flows of the outstanding loss liabilities. These predictions are needed for pricing of insurance products, for accounting of insurance business and for risk management purposes. We discuss state-of-the-art claims reserving methods and issues that should be considered for improving the currently used techniques.
Feature selection via Joint Likelihood
The field of feature selection has many different competing algorithms, selection criteria and measure functions, with little theoretical justification for the choice of one measure over another. In this talk we focus specifically on feature selection algorithms which use information theoretic criteria and provide a solid theoretical justification for the use of such criteria. We begin by considering feature selection as a process which minimises a loss function, specifically the model likelihood. From this choice of loss function we show that the previous 20 years of research in information theoretic feature selection can be re-derived by making different factorisation assumptions on that likelihood. We show how these different assumptions influence the empirical performance, and how our likelihood approach naturally allows the incorporation of priors into feature selection.
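As a concrete example of an information theoretic criterion, the sketch below ranks discrete features by their empirical mutual information with the class label, the simplest such criterion (often called mutual information maximisation). It is an illustration only and does not reproduce the likelihood derivation from the talk; the function names and toy data are assumptions.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with counts.
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


def rank_features(columns, labels):
    """Return feature names sorted by decreasing relevance, plus scores.

    columns: dict mapping a feature name to its list of discrete values.
    """
    scores = {name: mutual_information(col, labels)
              for name, col in columns.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


labels = [0, 0, 1, 1, 0, 0, 1, 1]
features = {
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the labels
    "copy": [0, 0, 1, 1, 0, 0, 1, 1],   # identical to the labels
}
ranking, scores = rank_features(features, labels)
```

Scoring each feature in isolation ignores redundancy between features, which is one reason more elaborate criteria exist.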
Computational inference of the integrin adhesion complex structure from perturbation data
The integrin adhesion complex links the inside of the cell to the extracellular matrix by the combined action of >100 proteins. To disentangle the structure of this complex we analysed knock-down experiments on 10 key proteins. In each knock-down the abundance of a protein is reduced by 50%, and the experiments measure how this affects the recruitment of other proteins. We represent the single molecule-to-molecule interactions using the Bayesian Network framework, allowing for one protein to be represented by multiple nodes. We model the perturbations as ideal interventions on the unknown structure of the adhesion complex. As the posterior for this model is difficult to solve in closed form, we use a simple approximate model selection methodology, which we tested on realistic synthetic data.
Designs for generalised linear models with random block effects
For an experiment measuring independent discrete responses, a generalised linear model, such as the logistic or log-linear, is typically used to analyse the data. In blocked experiments, where observations from the same block are potentially dependent, it may be appropriate to include random effects in the predictor, thus producing a generalised linear mixed model. Selecting optimal designs for such models is complicated by the fact that the Fisher information matrix, on which most optimality criteria are based, is computationally expensive to evaluate. In addition, the dependence of the information matrix on the unknown values of the parameters must be overcome by, for example, use of a pseudo-Bayesian approach. We use a variety of closed-form approximations, some derived from marginal quasi-likelihood, to develop computationally inexpensive surrogates for the information matrix to obtain D-optimal designs. This approach reduces the computational burden substantially, enabling straightforward selection of multi-factor designs. The accuracy of the closed-form approximations is explored for the first time using a novel computational approximation. It is found that correcting for the marginal attenuation of parameters in binary-response models yields much improved designs.
Page last modified on 08 oct 12 11:11