Optics for Distributed Learning

[Image: split-screen graphic of a neural processing device and fibre optics]

1 October 2020

Developing large-scale distributed learning computing systems using new topologies, partitioning strategies and scheduling strategies over optical networks.

Funder: Microsoft
Amount: £73,000

Research topics: Optics for distributed learning | Reconfigurable topologies | Neural network computational graphs | Distributed learning partitioning algorithm | Gradient reduce strategies

Description

Data centres have historically been built on a server-centric approach, with fixed amounts of processor and directly attached memory resources within the boundary of a mainboard tray. The mismatch between these fixed proportions and a diverse set of workloads can leave resources substantially under-utilized (in some cases below 40%), even though those resources account for 85% of the total data centre cost.
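The mismatch described above can be illustrated with a small sketch. Everything in it is a hypothetical assumption made for illustration: the tray size, the four workloads, the first-fit placement and the idea that a disaggregated pool can be sized independently per resource type. It also ignores the network penalties that disaggregation introduces.

```python
from math import ceil

# Hypothetical tray size and workloads, chosen only for illustration.
TRAY_CPU = 32    # cores per server tray
TRAY_MEM = 128   # GB of directly attached memory per tray

# (cores, GB of memory) demanded by each job
WORKLOADS = [(28, 16), (4, 120), (30, 8), (2, 100)]


def first_fit_trays(workloads, tray_cpu, tray_mem):
    """Place each job on the first tray with enough free cores AND memory."""
    trays = []  # each entry is [free_cpu, free_mem]
    for cpu, mem in workloads:
        for tray in trays:
            if tray[0] >= cpu and tray[1] >= mem:
                tray[0] -= cpu
                tray[1] -= mem
                break
        else:
            # No existing tray fits the job, so provision a new one.
            trays.append([tray_cpu - cpu, tray_mem - mem])
    return len(trays)


def utilization(workloads, cpu_capacity, mem_capacity):
    """Return (CPU utilization, memory utilization) as fractions."""
    used_cpu = sum(cpu for cpu, _ in workloads)
    used_mem = sum(mem for _, mem in workloads)
    return used_cpu / cpu_capacity, used_mem / mem_capacity


# Server-centric: cores and memory can only be bought together, per tray.
n_trays = first_fit_trays(WORKLOADS, TRAY_CPU, TRAY_MEM)
fixed_util = utilization(WORKLOADS, n_trays * TRAY_CPU, n_trays * TRAY_MEM)

# Disaggregated (idealized assumption): each resource is pooled and sized
# on its own, and a job may draw from any unit in the pool.
cpu_units = ceil(sum(cpu for cpu, _ in WORKLOADS) / TRAY_CPU)
mem_units = ceil(sum(mem for _, mem in WORKLOADS) / TRAY_MEM)
pooled_util = utilization(WORKLOADS, cpu_units * TRAY_CPU, mem_units * TRAY_MEM)

print(f"fixed trays: {n_trays}, CPU {fixed_util[0]:.0%}, memory {fixed_util[1]:.0%}")
print(f"pooled units: {cpu_units} CPU + {mem_units} memory, "
      f"CPU {pooled_util[0]:.0%}, memory {pooled_util[1]:.0%}")
```

With these made-up numbers, the fixed trays need three servers and reach roughly 67% CPU and 64% memory utilization, while the idealized pool needs two units of each resource and reaches 100% CPU and about 95% memory utilization. Real gains would be smaller once the network penalties are counted.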

The project aims to explore resource (xPU, memory, storage) disaggregation at rack and cluster level and to identify the scalability limits in terms of the number of end-points, network capacity per CPU/memory and physical distance, together with the associated penalties to processing power, sustained memory bandwidth and so on. In particular, the project will focus on optical network technologies, topologies, strategies and control to support large-scale distributed learning on heterogeneous resources (xPUs), in order to minimize the training time of diverse distributed learning models.

Outputs

View Principal Investigator's Publications

Team

Lead Institution
UCL

UCL Principal Investigator (PI)
Georgios Zervas

UCL Members
Alessandro Ottino

Industry collaborators
Microsoft