Machine Learning: Unsupervised Learning

Unsupervised learning methods are essential tools for data scientists and statisticians alike. They are often applied as a pre-processing step for feature selection and dimensionality reduction in statistical learning tasks. Cluster analysis is the most popular example of a stand-alone unsupervised learning method and is often treated as an exploratory, hypothesis-generating approach.

Despite its wide use in many fields, cluster analysis is challenging to design, implement and evaluate. The challenge stems from the exploratory nature of the clustering process and from the multiple ways in which the analysis outputs can be interpreted.

This course will outline the fundamentals of cluster analysis and dimensionality reduction, with the aim of enabling learners to design and implement such methods confidently and independently on a variety of datasets.

Learning Objectives

By the end of this course, participants should be able to:

Understand the concept of representations in data
Decide when to use dimensionality reduction
Select an appropriate dimensionality reduction method
Select an appropriate dissimilarity metric
Outline the generalised clustering pipeline
Describe in detail the k-means algorithm
Understand the basic principles of hierarchical, density-based, and Gaussian mixture clustering
Evaluate clustering algorithm outputs
Outline the challenges and opportunities of applying cluster analysis for the discovery of disease subtypes
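
As a taste of the k-means and evaluation objectives above, here is a minimal sketch in plain Python. It is not course material: the function names, the toy points, and the choice to pass initial centroids explicitly are all assumptions made for illustration. It assumes squared Euclidean distance as the dissimilarity measure and uses the within-cluster sum of squares as one simple way to score the output.

```python
def squared_euclidean(a, b):
    """Dissimilarity between two points (assumed metric for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, init_centroids, n_iter=100):
    """Alternate assignment and centroid-update steps until labels stop changing."""
    centroids = list(init_centroids)
    k = len(centroids)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


def within_cluster_ss(points, labels, centroids):
    """Within-cluster sum of squares: lower means tighter clusters."""
    return sum(squared_euclidean(p, centroids[lab]) for p, lab in zip(points, labels))


# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
labels, centroids = k_means(points, init_centroids=[points[0], points[3]])
wcss = within_cluster_ss(points, labels, centroids)
```

The initial centroids are passed in explicitly because k-means can settle in a poor local optimum when both starting centroids fall in the same group. Practical implementations usually guard against this with several random restarts or a seeding scheme such as k-means++.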

Time	Session Title	Lead Tutor
9:00 - 9:30	Registration, coffee and welcome	Maria Pikoula
9:30 - 11:00	Group discussion: the relevance of unsupervised machine learning methods; Brief overview of supervised learning; Introduction to unsupervised learning concepts and methods; Representations of data; Dimensionality reduction	Maria Pikoula
11:00 - 11:15	Coffee break
11:15 - 12:45	The generalised clustering pipeline: variable selection and pre-processing, definition of distance metric, clustering algorithms, evaluation of output; Algorithm walk-through: k-means	Maria Pikoula and Lucy Pembrey
12:45 - 13:45	Lunch
13:45 - 15:15	Clustering for disease phenotyping: real-world examples; Tutorial: students implement k-means in R or Python	Maria Pikoula and Nonie Alexander
15:15 - 15:30	Coffee break
15:30 - 17:00	Tutorial: evaluating clustering results	Maria Pikoula and Nonie Alexander

Dr Maria Pikoula: Maria is a data scientist by training and currently works in the field of health informatics and electronic health record mining for research. Her work at the Institute for Health Informatics focuses on disease prediction and disease sub-typing via clustering methods. Maria teaches Python for data analysis and machine learning on the MSc programme Data Science for Research in Health and Biomedicine.
