OptoCloud: Ultra-fast optically interconnected heterogeneous Data Centers
1 March 2021
Designing heterogeneous, disaggregated and sustainable Data Center and HPC systems, enabled by energy efficient, scalable, single hop, and nanosecond speed optical circuit switched network technologies.
Amount £ 1 120 128
Project Website EPSRC
Research topics Heterogeneous and disaggregated data center and high performance computing architectures | Compute and optical network co-design/optimization/control | Energy efficient and nanosecond speed optical circuit switched interconnects | Ultra-fast optical switching and hardware (FPGA/ASIC) scheduling | Optics for machine learning and machine learning for optics
The majority of human activities, including transport, Internet, banking, public health and entertainment, depend on Data Centers. Cloud traffic is forecast to grow exponentially and accounts for 95% of global traffic. Data centers currently consume 2% of the world’s electricity, a share estimated to reach up to 15% by 2030.
Currently, all data center networks are built as hierarchical electronic packet-switched networks; however, these cannot keep up with demand, creating an ever-increasing gap between data growth and Moore's Law. While computing node power, measured in flop/s, has increased 65 times in the last 18 years, node communication bandwidth has increased only 4.8 times, and the bytes communicated per flop have decreased 8 times. This creates a computation-to-communication wall, forcing data movement to be minimized and constraining applications to operate locally. In addition, these systems suffer from very high median latencies (on the order of 100 microseconds) and 99.9th-percentile tail latencies (100 ms), to the detriment of system and application performance.
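To put the two growth figures above on a common footing, they can be converted into implied compound annual growth rates. The following is a minimal sketch; the function name `annual_growth` is an illustrative assumption, and the calculation assumes smooth compound growth over the 18 years, which the text does not state.

```python
def annual_growth(total_factor: float, years: int) -> float:
    """Return the per-year multiplier implied by a total growth factor,
    assuming steady compound growth (an illustrative simplification)."""
    return total_factor ** (1.0 / years)

# Figures quoted in the text: 65x compute and 4.8x bandwidth over 18 years.
compute_rate = annual_growth(65, 18)
bandwidth_rate = annual_growth(4.8, 18)

print(f"compute:   about {100 * (compute_rate - 1):.0f}% per year")
print(f"bandwidth: about {100 * (bandwidth_rate - 1):.0f}% per year")
```

Under this assumption, node compute grows by roughly a quarter each year while node bandwidth grows by under a tenth.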
Optical interconnects have the potential to offer orders-of-magnitude better network performance and energy efficiency, yet research to date has focused on replicating packet switching principles, which suffer from complexity. New architectures are needed not just to replace electronics with optics but to substantially outperform electronic networks while being insensitive to workloads. Key challenges include sub-microsecond control, sub-nanosecond switching, ultra-low power transceivers and nanosecond-speed topology reconfiguration to support diverse communication primitives, requirements and workloads.
The OptoCloud fellowship aims to design and build an energy-efficient, cost-effective, scalable, single hop, and nanosecond speed optical circuit-switched network. This will interconnect heterogeneous systems made of servers, CPUs, accelerators, neuromorphic processors, memory elements and storage, supporting different parts (rack, end-of-row) and sizes of data centers (from small-to-medium size, ~10-100,000, to ~1,000,000-server farms). Crucially, the network aims to offer zero data loss, without in-network:
b) active switching and routing
c) network header addressing and processing to minimize complexity, and to consume very low power.
Furthermore, the system will inherently support 1-to-1, 1-to-N, N-to-N and N-to-1 connectivity in a synchronous manner without the need for data replication for multicasting or broadcasting, which is currently not possible. This is key to supporting diverse workloads such as storage caching, large-scale database lookups, training of distributed deep neural networks, and parallel computing, which use communication primitives such as allreduce, broadcast and reduce, gather and scatter, and all-to-all, among others.
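The connectivity patterns named above can be pictured as sets of (source, destination) circuits. The following is a minimal sketch for illustration only: the function names and node labels are hypothetical, and it models only which endpoints are connected, not how OptoCloud establishes or schedules the circuits.

```python
from itertools import permutations

def one_to_one(src, dst):
    """1-to-1: a single point-to-point circuit."""
    return {(src, dst)}

def one_to_many(root, nodes):
    """1-to-N: the root reaches every other node (e.g. broadcast, scatter)."""
    return {(root, n) for n in nodes if n != root}

def many_to_one(root, nodes):
    """N-to-1: every other node reaches the root (e.g. gather, reduce)."""
    return {(n, root) for n in nodes if n != root}

def many_to_many(nodes):
    """N-to-N: every ordered pair of distinct nodes (e.g. all-to-all)."""
    return set(permutations(nodes, 2))

# Hypothetical endpoints in a heterogeneous system.
nodes = ["cpu0", "gpu0", "gpu1", "mem0"]
broadcast = one_to_many("cpu0", nodes)
gather = many_to_one("cpu0", nodes)
all_to_all = many_to_many(nodes)
```

In this model a broadcast among N nodes corresponds to N-1 logical connections from one source, and all-to-all to N(N-1) of them; a network with native 1-to-N support would serve the former without the sender replicating data per destination.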
To achieve these aims, OptoCloud will explore the fundamental challenges of sub-nanosecond optical switching, near receiver-less low-power transceivers and nanosecond scheduling able to reconfigure circuits and shape IT and network topologies every tens to hundreds of nanoseconds. It aims to offer orders-of-magnitude improvement in a) switching, b) scheduling and network topology reconfiguration, c) power consumption, d) median and tail latency and, finally, e) throughput with zero data loss.
The PI will work with PDRAs, PhD students, industrial partners (Microsoft, Finisar, Xilinx, Sumitomo Electric) and universities (Columbia and National Technical University of Athens) to form a unique compute and optical network ecosystem. Together they will methodically answer fundamental questions, reflect all necessary requirements in the proposed concepts, and rigorously evaluate the developed technologies using industry-driven use case scenarios.