Context for data centers:
===============
* Large building-sized networks: can be as large as a few football fields placed side by side.
* Data centers power most web services today: Google, Facebook, Microsoft, Amazon, Baidu, Alibaba, ...
* O(100000) servers, O(10000) switches, O(100) such data centers throughout the world.
* Goal is to maximize cost-performance, i.e., the ratio of performance to hardware cost.
* Like supercomputing facilities in some ways, but built out of commodity, off-the-shelf parts.
* Commodity parts maximize cost-performance, which is critical to data centers, unlike supercomputers, which maximize raw performance.
* Commodity parts are used both for servers and switches.
* Commodity switching chips are referred to as merchant silicon.
* Merchant silicon switches tend to be cheap and shallow-buffered (as the DCTCP paper keeps mentioning).

Context for the VL2 paper:
===============
* Prevailing wisdom then (pre-2009): commodity parts were already being used for servers.
* Networking hadn't yet made the switch.
* Still used scale-up networking (Figure 1 of the paper).
* Led to several problems:
  --> Oversubscription up the hierarchy (limited server-to-server capacity)
  --> Poor reliability (1:1 redundancy)
  --> Fragmentation of resources
* VL2 context: Microsoft observed these problems first hand in its own network.
* The problems (oversubscription, reliability, fragmentation, etc.) had gotten to a point where they needed to be fixed soon.
* In other words, things were broken.
* In some ways, a ripe opportunity for research that could immediately impact practice.
* On the other hand, the industrial setting meant that the solution had to be readily deployable.
  --> Would prefer not to invest in new switch software or hardware.
  --> This was one difference from Fat-tree (reference 1) from UCSD, which embodied a more clean-slate approach.

VL2 requirements:
===============
* Uniform high capacity from any server to any server, regardless of where the servers are.
* Performance isolation of one service from another.
* Layer-2 semantics:
  --> Should look like one large layer-2 switch to a service running on top.
  --> Crucial to enable cloud customers to move their services from on-premise deployments to the cloud.
  --> Crucial to allow features like link-layer broadcast to work.
* Agility: reassign service VMs to any server, both to track demand for VMs and to handle faulty servers.
* Better fault tolerance.
* Work with existing, unmodified switch hardware and software; favor modifications to end hosts.
* Aside: this preference for end-host-only architectures has been a central part of Microsoft's cloud philosophy since then.
* Google, on the other hand, has explored architectures that do change the network (e.g., in B4 it built its own switch boxes; with Espresso and Jupiter it wrote its own control platforms).

VL2 measurement study:
===============
* Demonstrates that there is considerable volatility in network traffic matrices (TMs).
* Traffic engineering is hard to do when the TM changes so quickly (TE would need to be rerun every second).
* It's also hard to predict the traffic pattern from one interval of time to the next.
* Instead, their takeaway was to spread load uniformly at all times over all links (a minimal sketch of this idea follows this section).
* "Convert all cases into the average case."
* Failures happen from time to time at scale: move from 1:1 redundancy to 1:n redundancy.
* Good example of measurement-guided system design.
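The "spread load uniformly" takeaway is essentially Valiant Load Balancing: bounce each flow off a pseudo-randomly chosen intermediate switch so that any traffic matrix turns into near-uniform load on the intermediate layer. Below is a minimal Python sketch of that idea only; the switch count, hash function, and made-up flow five-tuples are illustrative, not VL2's actual mechanism (VL2 realizes this with IP anycast plus ECMP hashing in the switches).

    import hashlib
    from collections import Counter

    NUM_INTERMEDIATES = 16   # illustrative count of intermediate switches

    def pick_intermediate(five_tuple):
        # Hash the flow's five-tuple so every packet of a flow bounces off the
        # same intermediate switch (no reordering), while different flows
        # spread pseudo-randomly over all intermediates.
        digest = hashlib.sha256(repr(five_tuple).encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_INTERMEDIATES

    # A deliberately skewed traffic matrix: one source rack hammering three
    # destinations. The per-intermediate load still comes out nearly equal.
    flows = [("10.0.0.1", f"10.0.1.{d}", 6, 10000 + i, 80)
             for i in range(3000) for d in range(3)]
    load = Counter(pick_intermediate(f) for f in flows)
    print(sorted(load.values()))   # roughly equal flow counts per intermediate

The point of the sketch is that uniform spreading needs no knowledge or prediction of the traffic matrix, which is exactly why the measurement study's volatility result pushes toward this design.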
VL2 main ideas:
==============
* Valiant Load Balancing (VLB) to cope with traffic volatility.
* Reusing existing switch equipment.
* Separating application addresses (servers) from location addresses (switches).
* Embracing end systems: they can enforce fine-grained ACLs, e.g., disallowing one server from communicating with another.

Scale up vs. scale out:
==============
* Better redundancy by using a "fat" layer of low-port-count switches (1:n vs. 1:1).
* Can build out of cheaper components because low-port-count switches are readily available.
* In fact, can use this idea recursively to build a large-port-count switch from smaller-port-count switches (see related work in the VL2 paper).

Addressing details:
==============
* Use AAs for servers so that they can be stably retained after a VM migration.
* The paper isn't clear about whether AAs refer to the VMs on a server or the servers themselves. My guess is it's the VMs.
* Packet forwarding: encapsulate the application's packet in an outer IP header addressed to the destination ToR switch's LA (a toy sketch of this path appears at the end of the VL2 notes).
* Encapsulation is performed by the sender's VL2 agent; decapsulation happens at the receiving end.
* Like a NAT, except no state is maintained. Similar approaches have been proposed to handle roaming.
* Address resolution: use a directory service to map AAs to LAs.
* The directory service can be used for access control as well: e.g., when one AA should not talk to another AA.
* Benefit of this split: switch routing tables only store LA routes.

Randomized load balancing:
=============
* VLB to load balance across ECMP groups.
* ECMP to load balance across all intermediate switches within an ECMP group (16 or 256).
* Why is VLB alone not enough?
* Lots of work on better load-balancing algorithms since then (CONGA, Presto, HULA, LetFlow, etc.).

Directory system:
=============
* Optimized for reads.
* Eventually consistent; mappings might be stale for short periods of time.

Evaluation:
=============
* Uniform high capacity for an all-to-all shuffle, the traffic pattern found in MapReduce jobs.
* Performance isolation (although more work has been done on this since): one flow starting up doesn't affect another.
* The directory system can handle sufficient churn (although it will probably become the bottleneck eventually).
* The paper doesn't say where exactly the directory system is located.

After the VL2 paper was published:
============
* Most data centers now use a VL2-style topology (e.g., Facebook's data center fabric and Google's Jupiter).
* They all employ some form of load balancing, typically ECMP.
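To tie the VL2 notes together, here is a toy Python sketch of the agent's send path under the AA/LA split: look up the LA of the ToR hosting the destination AA in the directory, then wrap the unchanged AA packet in an outer header addressed to that LA. The in-memory DIRECTORY dict, the function names, and the dict-based "headers" are illustrative stand-ins, not VL2's wire format or directory protocol, and the second encapsulation to a randomly chosen intermediate switch (the VLB bounce) is omitted.

    DIRECTORY = {              # AA -> LA of the ToR switch currently hosting that AA
        "20.0.0.5": "10.1.2.1",
        "20.0.0.9": "10.1.7.1",
    }

    def encapsulate(src_aa, dst_aa, payload):
        dst_la = DIRECTORY.get(dst_aa)   # directory lookup; also the natural ACL hook
        if dst_la is None:
            raise PermissionError("directory refused to resolve AA (unknown or blocked)")
        inner = {"src": src_aa, "dst": dst_aa, "data": payload}   # AA packet, untouched
        return {"outer_dst": dst_la, "inner": inner}              # outer header -> destination ToR

    def decapsulate(packet):
        return packet["inner"]           # receiving side strips the outer header

    pkt = encapsulate("20.0.0.9", "20.0.0.5", b"hello")
    assert decapsulate(pkt)["dst"] == "20.0.0.5"   # the AA survives end to end, so VMs can migrate

Because the switches only ever see LAs, their routing tables stay small, and because the AA never changes, a VM keeps its address across migration.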
DCTCP: Context
=================
* Published in 2010.
* Context: TCP in widespread use within large private networks (data centers).
* But TCP has some serious problems:
  --> The only way it backs off is in response to packet loss.
  --> If the switch has a large buffer, TCP will keep filling it up.
* Several adverse effects (Section 2.3):
  --> If a large flow co-exists with a small flow, the small flow's latency goes up.
  --> If many small flows converge on the switch at the same time (incast), several of them see drops, leading to timeouts, retransmissions, and increased latency.
  --> Collateral damage: a large flow on one port can reduce the buffer available to another port.

The DCTCP algorithm
=================
* Keep queue sizes much lower than TCP does.
* Send congestion feedback much earlier than TCP would.
* Once the queue exceeds a threshold K, mark packets using the explicit congestion notification (ECN) bit.
  --> ECN dates back to DECbit from the late 1980s.
  --> But it was only standardized in the late 1990s.
  --> It finally saw serious deployment in 2010 with DCTCP.
* At the end host, use the sequence of packet marks to estimate the extent of congestion (a minimal sketch of this update appears at the end of these notes):
  --> using the fraction of marked packets.
  --> Cute trick to extract multi-bit information (extent of congestion) from single-bit ECN marks (presence of congestion).
* End result:
  --> Keeps queues much shorter (Figure 13).
  --> Less oscillation in queue size (Figures 15(b) and 16(b)).
  --> Lower query completion times (Figures 18 and 19).
* The algorithm was quite influential:
  --> Published as an IETF RFC (RFC 8257).
  --> DCTCP or some variant of it is implemented in Windows Server and deployed in Microsoft's data centers, at Google, and at Morgan Stanley.
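A minimal Python sketch of the two halves described above, following the DCTCP paper / RFC 8257: the switch marks packets when its instantaneous queue exceeds K, and the sender keeps an EWMA (alpha) of the fraction of marked packets per window, cutting cwnd in proportion to alpha instead of halving it. The constants and the driver loop are illustrative; g = 1/16 and K around 20 packets at 1 Gbps are the paper's suggested values.

    K = 20          # marking threshold, in packets (illustrative; ~20 at 1 Gbps per the paper)
    G = 1.0 / 16    # EWMA gain g for the alpha estimate

    def switch_should_mark(queue_occupancy_pkts):
        # Switch side: a single-bit "congestion is present" signal.
        return queue_occupancy_pkts > K

    class DctcpSender:
        def __init__(self, cwnd=10.0):
            self.cwnd = cwnd     # congestion window, in packets
            self.alpha = 0.0     # running estimate of the fraction of marked packets

        def on_window_acked(self, acked, marked):
            # Called once per window: F = marked/acked, alpha <- (1 - g)*alpha + g*F,
            # and if anything was marked, cwnd <- cwnd * (1 - alpha/2).
            frac = marked / acked if acked else 0.0
            self.alpha = (1 - G) * self.alpha + G * frac
            if marked:
                self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
            else:
                self.cwnd += 1.0   # standard additive increase when nothing was marked

    s = DctcpSender()
    for marked in [0, 2, 5, 0]:                    # marks observed over successive windows
        s.on_window_acked(acked=10, marked=marked)
        print(round(s.alpha, 3), round(s.cwnd, 2))

The proportional cut is the "multi-bit from single-bit" trick in action: a lightly marked window shrinks cwnd only slightly, while a fully marked window approaches TCP's halving, which is what keeps queues short without starving throughput.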