Context for data centers:
===============
* Large building-sized networks: can be as large as a few football fields placed side by side.
* Data centers power most web services today: Google, Facebook, Microsoft, Amazon, Baidu, Alibaba, ...
* O(100000) servers, O(10000) switches, O(100) such data centers throughout the world.
* Goal is to maximize cost-performance, i.e., the ratio of performance to hardware cost.
* Like supercomputing facilities in some ways, but built out of commodity, off-the-shelf parts.
* Commodity parts maximize cost-performance, which is critical to data centers, unlike supercomputers, which maximize raw performance.
* Commodity parts are used both for servers and switches.
* Commodity switching chips are referred to as merchant silicon.
* Merchant silicon switches tend to be cheap and shallow-buffered (as the DCTCP paper keeps mentioning).

Context for the VL2 paper:
===============
* Prevailing wisdom then (pre-2009): commodity parts were already being used for servers.
* Networking hadn't yet made the switch.
* Still used scale-up networking (Figure 1 of the paper).
* Led to several problems:
  --> Oversubscription up the hierarchy (limited server-to-server capacity)
  --> Poor reliability (1:1 redundancy)
  --> Fragmentation of resources
* VL2 context: Microsoft observed these problems first hand in its own network.
* The problems (oversubscription, reliability, fragmentation, etc.) had gotten to a point where they needed to be fixed soon.
* In other words, things were broken.
* In some ways, a ripe opportunity for research that could immediately impact practice.
* On the other hand, the industrial setting meant that the solution had to be readily deployable.
  --> Would prefer not to invest in new switch software or hardware.
  --> This was one difference from Fat-tree (reference 1) from UCSD, which embodied a more clean-slate approach.

VL2 requirements:
===============
* Uniform high capacity from any server to any server, regardless of where the servers are.
* Performance isolation of one service from another.
* Layer-2 semantics:
  --> Should look like one large layer-2 switch to a service running on top.
  --> Crucial to enable cloud customers to move their services from on-premise deployments to the cloud.
  --> Crucial to allow features like link-layer broadcast to work.
* Agility: reassign service VMs to any server, both to track demand for VMs and to handle faulty servers.
* Better fault tolerance.
* Work with existing, unmodified switch hardware and software; favor modifications to end hosts.
* Aside: this preference for end-host-only architectures has been a central part of Microsoft's cloud philosophy since then.
* Google, on the other hand, has explored architectures that do change the network (e.g., in B4 it built its own switch boxes; with Espresso and Jupiter it wrote its own control platforms).

VL2 measurement study:
===============
* Demonstrates that there is considerable volatility in network traffic matrices (TMs).
* Traffic engineering is hard to do when the TM changes so quickly (TE would need to be rerun every second).
* It's also hard to predict the traffic pattern from one interval of time to the next.
* Instead, their takeaway was to spread load uniformly at all times over all links (a minimal sketch of this idea follows this section).
* "Convert all cases into the average case."
* Failures happen from time to time at scale: move from 1:1 redundancy to 1:n redundancy.
* Good example of measurement-guided system design.
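The "spread load uniformly" takeaway is essentially Valiant Load Balancing: bounce each flow off a pseudo-randomly chosen intermediate switch so that any traffic matrix turns into near-uniform load on the intermediate layer. Below is a minimal Python sketch of that idea only; the switch count, hash function, and made-up flow five-tuples are illustrative, not VL2's actual mechanism (VL2 realizes this with IP anycast plus ECMP hashing in the switches).

    import hashlib
    from collections import Counter

    NUM_INTERMEDIATES = 16   # illustrative count of intermediate switches

    def pick_intermediate(five_tuple):
        # Hash the flow's five-tuple so every packet of a flow bounces off the
        # same intermediate switch (no reordering), while different flows
        # spread pseudo-randomly over all intermediates.
        digest = hashlib.sha256(repr(five_tuple).encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_INTERMEDIATES

    # A deliberately skewed traffic matrix: one source rack hammering three
    # destinations. The per-intermediate load still comes out nearly equal.
    flows = [("10.0.0.1", f"10.0.1.{d}", 6, 10000 + i, 80)
             for i in range(3000) for d in range(3)]
    load = Counter(pick_intermediate(f) for f in flows)
    print(sorted(load.values()))   # roughly equal flow counts per intermediate

The point of the sketch is that uniform spreading needs no knowledge or prediction of the traffic matrix, which is exactly why the measurement study's volatility result pushes toward this design.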
VL2 main ideas:
==============
* Valiant Load Balancing (VLB) to cope with traffic volatility.
* Reusing existing switch equipment.
* Separating application addresses (servers) from location addresses (switches).
* Embracing end systems: they can enforce fine-grained ACLs, e.g., disallowing one server from communicating with another.

Scale up vs. scale out:
==============
* Better redundancy by using a "fat" layer of low-port-count switches (1:n vs. 1:1).
* Can build out of cheaper components because low-port-count switches are readily available.
* In fact, can use this idea recursively to build a large-port-count switch from smaller-port-count switches (see related work in the VL2 paper).

Addressing details:
==============
* Use AAs for servers so that they can be stably retained after a VM migration.
* The paper isn't clear about whether AAs refer to the VMs on a server or the servers themselves. My guess is it's the VMs.
* Packet forwarding: encapsulate the application's packet in an outer IP header addressed to the destination ToR switch's LA (a toy sketch of this path appears at the end of the VL2 notes).
* Encapsulation is performed by the sender's VL2 agent; decapsulation happens at the receiving end.
* Like a NAT, except no state is maintained. Similar approaches have been proposed to handle roaming.
* Address resolution: use a directory service to map AAs to LAs.
* The directory service can be used for access control as well: e.g., when one AA should not talk to another AA.
* Benefit of this split: switch routing tables only store LA routes.

Randomized load balancing:
=============
* VLB to load balance across ECMP groups.
* ECMP to load balance across all intermediate switches within an ECMP group (16 or 256).
* Why is VLB alone not enough?
* Lots of work on better load-balancing algorithms since then (CONGA, Presto, HULA, LetFlow, etc.).

Directory system:
=============
* Optimized for reads.
* Eventually consistent; mappings might be stale for short periods of time.

Evaluation:
=============
* Uniform high capacity for an all-to-all shuffle, the traffic pattern found in MapReduce jobs.
* Performance isolation (although more work has been done on this since): one flow starting up doesn't affect another.
* The directory system can handle sufficient churn (although it will probably become the bottleneck eventually).
* The paper doesn't say where exactly the directory system is located.

After the VL2 paper was published:
============
* Most data centers now use a VL2-style topology (e.g., Facebook's data center fabric and Google's Jupiter).
* They all employ some form of load balancing, typically ECMP.
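To tie the VL2 notes together, here is a toy Python sketch of the agent's send path under the AA/LA split: look up the LA of the ToR hosting the destination AA in the directory, then wrap the unchanged AA packet in an outer header addressed to that LA. The in-memory DIRECTORY dict, the function names, and the dict-based "headers" are illustrative stand-ins, not VL2's wire format or directory protocol, and the second encapsulation to a randomly chosen intermediate switch (the VLB bounce) is omitted.

    DIRECTORY = {              # AA -> LA of the ToR switch currently hosting that AA
        "20.0.0.5": "10.1.2.1",
        "20.0.0.9": "10.1.7.1",
    }

    def encapsulate(src_aa, dst_aa, payload):
        dst_la = DIRECTORY.get(dst_aa)   # directory lookup; also the natural ACL hook
        if dst_la is None:
            raise PermissionError("directory refused to resolve AA (unknown or blocked)")
        inner = {"src": src_aa, "dst": dst_aa, "data": payload}   # AA packet, untouched
        return {"outer_dst": dst_la, "inner": inner}              # outer header -> destination ToR

    def decapsulate(packet):
        return packet["inner"]           # receiving side strips the outer header

    pkt = encapsulate("20.0.0.9", "20.0.0.5", b"hello")
    assert decapsulate(pkt)["dst"] == "20.0.0.5"   # the AA survives end to end, so VMs can migrate

Because the switches only ever see LAs, their routing tables stay small, and because the AA never changes, a VM keeps its address across migration.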
DCTCP: Context
=================
* Published in 2010.
* Context: TCP in widespread use within large private networks (data centers).
* But TCP has some serious problems:
  --> The only way it backs off is in response to packet loss.
  --> If the switch has a large buffer, TCP will keep filling it up.
* Several adverse effects (Section 2.3):
  --> If a large flow co-exists with a small flow, the small flow's latency goes up.
  --> If many small flows converge on the switch at the same time (incast), several of them see drops, leading to timeouts, retransmissions, and increased latency.
  --> Collateral damage: a large flow on one port can reduce the buffer available to another port.

The DCTCP algorithm
=================
* Keep queue sizes much lower than TCP does.
* Send congestion feedback much earlier than TCP would.
* Once the queue exceeds a threshold K, mark packets using the explicit congestion notification (ECN) bit.
  --> ECN dates back to DECbit from the late 1980s.
  --> But it was only standardized in the late 1990s.
  --> It finally saw serious deployment in 2010 with DCTCP.
* At the end host, use the sequence of packet marks to estimate the extent of congestion (a minimal sketch of this update appears at the end of these notes):
  --> using the fraction of marked packets.
  --> Cute trick to extract multi-bit information (extent of congestion) from single-bit ECN marks (presence of congestion).
* End result:
  --> Keeps queues much shorter (Figure 13).
  --> Less oscillation in queue size (Figures 15(b) and 16(b)).
  --> Lower query completion times (Figures 18 and 19).
* The algorithm was quite influential:
  --> Published as an IETF RFC (RFC 8257).
  --> DCTCP or some variant of it is implemented in Windows Server and deployed in Microsoft's data centers, at Google, and at Morgan Stanley.
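A minimal Python sketch of the two halves described above, following the DCTCP paper / RFC 8257: the switch marks packets when its instantaneous queue exceeds K, and the sender keeps an EWMA (alpha) of the fraction of marked packets per window, cutting cwnd in proportion to alpha instead of halving it. The constants and the driver loop are illustrative; g = 1/16 and K around 20 packets at 1 Gbps are the paper's suggested values.

    K = 20          # marking threshold, in packets (illustrative; ~20 at 1 Gbps per the paper)
    G = 1.0 / 16    # EWMA gain g for the alpha estimate

    def switch_should_mark(queue_occupancy_pkts):
        # Switch side: a single-bit "congestion is present" signal.
        return queue_occupancy_pkts > K

    class DctcpSender:
        def __init__(self, cwnd=10.0):
            self.cwnd = cwnd     # congestion window, in packets
            self.alpha = 0.0     # running estimate of the fraction of marked packets

        def on_window_acked(self, acked, marked):
            # Called once per window: F = marked/acked, alpha <- (1 - g)*alpha + g*F,
            # and if anything was marked, cwnd <- cwnd * (1 - alpha/2).
            frac = marked / acked if acked else 0.0
            self.alpha = (1 - G) * self.alpha + G * frac
            if marked:
                self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
            else:
                self.cwnd += 1.0   # standard additive increase when nothing was marked

    s = DctcpSender()
    for marked in [0, 2, 5, 0]:                    # marks observed over successive windows
        s.on_window_acked(acked=10, marked=marked)
        print(round(s.alpha, 3), round(s.cwnd, 2))

The proportional cut is the "multi-bit from single-bit" trick in action: a lightly marked window shrinks cwnd only slightly, while a fully marked window approaches TCP's halving, which is what keeps queues short without starving throughput.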