Colloquium Details
Workload-Aware Networks for Machine Learning
Speaker: Weiyang "Frank" Wang, MIT CSAIL
Location: 60 Fifth Avenue 150
Date: April 7, 2026, 2 p.m.
Hosts: Anirudh Sivaraman, Siddharth Garg
Synopsis:
Today's ML workloads require networks that connect tens to hundreds of thousands of GPUs. Existing GPU clusters rely on network designs offering any-to-any connectivity while remaining agnostic to the data they carry. These traits are carried over from legacy CPU datacenters, limiting scalability and hindering GPU utilization.
I will present workload-aware networking, a systematic approach that exploits structures inherent to machine learning traffic to co-design networks with ML workloads. I start by showing that the network traffic of large language model (LLM) training exhibits a surprising property: it stays within the bottom layer of a switched network. This insight enables rail-only network designs that dramatically reduce cost and complexity. I then discuss TopoOpt, which uses reconfigurable networks to adapt to the repetitive, predictable traffic patterns of ML training, delivering performance improvements over today's network designs. Finally, I show that understanding the content of traffic in the network unlocks new functionalities. I introduce Checkmate, a system that embeds checkpointing into the network through gradient replication, enabling per-iteration checkpointing with zero GPU overhead. I conclude with future directions for extending these principles to emerging workloads such as agentic AI, and for building orchestration frameworks that automate network-workload co-design.
Speaker Bio:
Weiyang "Frank" Wang is a final-year Ph.D. candidate at MIT CSAIL, working with Professor Manya Ghobadi. His research spans computer networking, machine learning (ML) systems, and reconfigurable networks. Frank designs network systems that co-optimize with ML workloads, revealing and exploiting the structure of ML traffic to improve performance and reduce cost and complexity. Frank is the author of Rail-only networks and TopoOpt, which have helped shape how the industry builds large-scale ML training networks. His Rail-only design is supported in production systems by Juniper and Broadcom, and serves as a reference for recent network designs at Alibaba, ByteDance, and Meta.
Notes:
In-person attendance is available only to those with active NYU ID cards.