Optics for Distributed Learning
1 October 2020
Developing large-scale distributed learning systems using new topologies, partitioning, and scheduling strategies over optical networks.
Funder Microsoft
Amount £73,000
Research topics Optics for distributed learning | Reconfigurable topologies | Neural network computational graphs | Distributed learning partitioning algorithms | Gradient reduction strategies
Description
Data centres have historically been based on a server-centric approach, with fixed amounts of processor and directly attached memory resources confined within the boundary of a mainboard tray. The mismatch between these fixed resource proportions and the diverse set of workloads can leave resources substantially under-utilized (in some cases utilization falls below 40%), even though these resources account for 85% of the total data centre cost.
The project aims to explore resource (xPU, memory, storage) disaggregation at rack and cluster level and to identify the scalability limits in terms of the number of end-points, the network capacity per CPU/memory, and the physical distance, along with the associated penalties to processing power, sustained memory bandwidth, and related metrics. In particular, the project will focus on optical network technologies, topologies, strategies, and control to support large-scale distributed learning with heterogeneous resources (xPUs), in order to minimize the training time of diverse distributed learning models.
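To make the gradient-reduction aspect concrete, the sketch below simulates ring all-reduce, one common strategy for combining gradients across workers in data-parallel training: per-worker traffic stays close to twice the gradient size regardless of the number of workers, so the interconnect topology largely determines the reduction time. This is an illustrative NumPy sketch under assumed names (ring_allreduce), not the project's own implementation.

import numpy as np

def ring_allreduce(grads):
    """Simulate ring all-reduce over n workers' 1-D gradient vectors.

    Reduce-scatter: each gradient is split into n chunks and partial sums are
    passed around the ring until each worker owns one fully reduced chunk.
    All-gather: the reduced chunks are then circulated so every worker
    finishes with the complete summed gradient.
    """
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Reduce-scatter: after n-1 steps, worker w owns the full sum of chunk (w+1) % n.
    for step in range(n - 1):
        for w in range(n):
            c = (w - step) % n                      # chunk worker w passes to its neighbour
            chunks[(w + 1) % n][c] += chunks[w][c]  # neighbour accumulates the partial sum

    # All-gather: circulate the reduced chunks so every worker holds all of them.
    for step in range(n - 1):
        for w in range(n):
            c = (w + 1 - step) % n                        # reduced chunk worker w forwards
            chunks[(w + 1) % n][c] = chunks[w][c].copy()  # neighbour overwrites its copy

    return [np.concatenate(c) for c in chunks]

# Example: four workers each hold a local gradient; all end with the same sum.
rng = np.random.default_rng(0)
local = [rng.standard_normal(12) for _ in range(4)]
reduced = ring_allreduce(local)
assert all(np.allclose(r, sum(local)) for r in reduced)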
Outputs
View Principal Investigator's Publications