Colloquium Details

Workload-Aware Networks for Machine Learning

Speaker: Weiyang "Frank" Wang, MIT CSAIL

Location: 60 Fifth Avenue 150

Date: April 7, 2026, 2 p.m.

Hosts: Anirudh Sivaraman, Siddharth Garg

Synopsis:

Today's ML workloads require networks that connect tens to hundreds of thousands of GPUs. Existing GPU clusters rely on network designs offering any-to-any connectivity while remaining agnostic to the data they carry. These traits are carried over from legacy CPU datacenters, limiting scalability and hindering GPU utilization.

I will present workload-aware networking, a systematic approach that exploits structures inherent to machine learning traffic to co-design networks with ML workloads. I start by showing that the network traffic of large language models (LLMs) exhibits a surprising property: it stays within the bottom layer of a switched network. This insight enables rail-only network designs that dramatically reduce cost and complexity. I then discuss TopoOpt, which uses reconfigurable networks to adapt to the repetitive, predictable traffic patterns of ML training, delivering performance improvements over today's network designs. Finally, I show that understanding traffic content in the network unlocks new functionalities. I introduce Checkmate, a system that embeds checkpointing into the network through gradient replication, enabling per-iteration checkpointing with zero GPU overhead. I conclude with future directions for extending these principles to emerging workloads like agentic AI, and for building orchestration frameworks that automate network-workload co-design.

Speaker Bio:

Weiyang "Frank" Wang is a final-year Ph.D. candidate at MIT CSAIL, working with Professor Manya Ghobadi. His research spans computer networking, machine learning (ML) systems, and reconfigurable networks. Frank designs network systems that co-optimize with ML workloads, revealing and exploiting the structure of ML traffic to improve performance and reduce cost and complexity. Frank is the author of Rail-only networks and TopoOpt, which have helped shape how the industry builds large-scale ML training networks. His Rail-only design has been adopted in production systems at Juniper and Broadcom, and serves as a reference for recent network designs at Alibaba, ByteDance, and Meta.

Notes:

In-person attendance is only available to those with active NYU ID cards.
