UCL Institute of Health Informatics


Machine Learning: Unsupervised Learning

Unsupervised learning methods are essential tools for data scientists and statisticians alike. They are often applied as a pre-processing step for feature selection and dimensionality reduction in statistical learning tasks. Cluster analysis is the most popular example of a stand-alone unsupervised learning method and is often seen as an exploratory, hypothesis-generating approach.

Despite its wide use in many fields, cluster analysis is challenging to design, implement and evaluate. The challenge stems from the exploratory nature of the clustering process and from the multiple ways in which its outputs can be interpreted.

This course will outline the fundamentals of cluster analysis and dimensionality reduction, with the aim of enabling learners to design and implement such methods confidently and independently on a variety of datasets.

Learning Objectives

By the end of this course, participants should be able to:

  • Understand the concept of representations in data
  • Decide when to use dimensionality reduction
  • Select an appropriate dimensionality reduction method
  • Select an appropriate dissimilarity metric
  • Outline the generalised clustering pipeline
  • Describe the k-means algorithm in detail
  • Understand the basic principles of hierarchical, density-based, and Gaussian mixture clustering
  • Evaluate clustering algorithm outputs
  • Outline the challenges and opportunities of applying cluster analysis to the discovery of disease subtypes

Time | Session Title | Lead Tutor

09:00 - 09:30

Registration, coffee and welcome

Maria Pikoula

09:30 - 11:00
  • Group discussion: The relevance of unsupervised machine learning methods
  • Brief overview of supervised learning
  • Introduction to unsupervised learning concepts and methods
  • Representations of data
  • Dimensionality reduction
Maria Pikoula

11:00 - 11:15

Coffee break

11:15 - 12:45
  • The generalised clustering pipeline:
      • Variable selection and pre-processing
      • Definition of distance metric
      • Clustering algorithms
      • Evaluation of output
  • Algorithm walk-through: k-means
Maria Pikoula and Lucy Pembrey

12:45 - 13:45



13:45 - 15:15

  • Clustering for disease phenotyping: real-world examples
  • Tutorial: students implement k-means in R or Python
Maria Pikoula and Nonie Alexander

15:15 - 15:30

Coffee break


15:30 - 17:00

Tutorial: evaluating clustering results

Maria Pikoula and Nonie Alexander


Dr Maria Pikoula

Maria is a data scientist by training and currently works in the field of health informatics and electronic health record mining for research. Her work at the Institute of Health Informatics focuses on disease prediction and disease subtyping via clustering methods. Maria teaches Python for data analysis and machine learning on the MSc programme Data Science for Research in Health and Biomedicine.