Information Studies


INSTG083/INSTM083 - Foundations of Machine Learning and Data Science

Aims: The module is intended as an introduction to machine learning and data science. The course focuses on the principles underlying probabilistic and statistical approaches, introducing a small number of explicit methodologies as exemplars, and looking at how these should be applied and evaluated by the careful practitioner.

Intended Learning Outcomes: By the end of the course, students will have a basic understanding of supervised learning (regression and classification) and unsupervised learning (clustering and dimensionality reduction). They will be able to: apply methodologies in each of these problem domains; assess the suitability of approaches to a constrained set of tasks; and employ common techniques to evaluate a methodology's performance.

Content: Students will study a selection of machine learning techniques, including theoretical underpinnings and potential applications. In the process, students will gain some mathematical insight into these (and related) approaches to probabilistic & statistical models, their strengths and weaknesses, and how to effectively evaluate their performance on data. A significant proportion of the course involves hands-on experience of using and evaluating these techniques on real-world data. A brief outline of the topics covered is as follows:

  • Probabilistic & statistical foundations, such as: marginal, joint & conditional probabilities; Bayes' theorem; probability densities; maximum likelihood; expectation and variance.
  • Regression methods, such as: least squares regression, regularization, and basis functions.
  • Classification methods, such as: k-nearest neighbours & logistic regression.
  • Clustering and dimensionality reduction, such as: k-means, mixtures of Gaussians, and principal component analysis (PCA).
  • Bayesian graphical models.
  • Evaluation techniques and concepts, such as: loss functions, model selection, and hypothesis testing.
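
To give a flavour of how a classification method and an evaluation technique from the outline above fit together, the following is a minimal sketch in Python of k-nearest neighbours classification scored with a 0-1 loss on held-out points. It is an illustration only, not course material: the toy data, the function names knn_predict and zero_one_loss, and the choice k = 3 are all assumptions made for this example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest
    training points (Euclidean distance).

    `train` is a list of (features, label) pairs; this layout is an
    assumption made for the sketch.
    """
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def zero_one_loss(train, test, k=3):
    """Fraction of held-out points that the classifier labels incorrectly."""
    errors = sum(knn_predict(train, x, k) != y for x, y in test)
    return errors / len(test)


# Hypothetical toy data: two well-separated groups in two dimensions.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((4.0, 4.0), "b"), ((4.2, 3.9), "b"), ((3.8, 4.1), "b"),
]
test = [((1.1, 0.9), "a"), ((4.1, 4.0), "b")]

loss = zero_one_loss(train, test, k=3)
print(f"held-out 0-1 loss: {loss:.2f}")
```

Keeping the test points separate from the training points is the simplest form of the evaluation discipline listed in the final topic; repeating the calculation for several values of k would be a first step towards model selection.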

Delivery: The course will be delivered through lectures, tutorials, seminars and computing laboratory work. Where possible, there will be learning through practical work (e.g. programming), with exposure to real-world data. Potential tasks/data sets for exploring methods include: botanical sample descriptions for species classification; census data for predicting income levels; chemical constituents of food and drink; feature descriptions of images for clinical diagnosis; human activity recognition from smartphone data; and preprocessed text documents for clustering. As currently planned [June 6, 2017], the tutorial and lab work will involve programming in Python; however, this decision has not yet been finalised. A short primer will be available for students unfamiliar with the chosen programming language.

Prerequisites: There are no formal prerequisites for this module in terms of other modules taken at UCL. However, the module assumes mathematical knowledge roughly equivalent to an A-level in mathematics, including: some basic calculus (differentiation and integration); statistics; and linear algebra and related concepts.