WFQ:
===============
* Published earlier than XCP.
* But conceptually comes last in the progression of adding more and more intelligence to routers.
* TCP Tahoe assumes routers do nothing special, except that routers will eventually drop packets because they have to.
* XCP assumes routers can compute rich feedback and run fairness and efficiency controllers, but still assumes FIFO routers.
* WFQ assumes routers can actually decide the order in which packets are scheduled, to enforce fairness on a packet-by-packet basis.
* Roughly speaking, WFQ is a more sophisticated version of round robin with
--> 1. better delay guarantees than having fixed quotas for each flow.
--> 2. the ability to handle different packet sizes fairly.

The tradeoff space:
==============
* More intelligence in the routers means more obstacles to deployment (this is changing now).
* At the same time, more intelligence in the routers means we can expect more out of our network.
--> WFQ can provide protection from arbitrary misbehaving sources, not just misbehaving TCP or XCP sources.
--> Fairness is provided on a packet-by-packet basis (also called isolation). XCP/TCP provide it only after convergence.
* What is isolation? Roughly speaking: if you stick to your allocated bandwidth, you will not be affected by anyone else.
--> Figure 1 illustrates isolation very well.
--> Aside: The math is complicated (the authors use the term "forbidding" :)), but really the implications are more interesting.
--> What are the implications?
--> Implication 1: As long as the Telnet source is sending under its fair share, it will get really low delay.
--> Implication 2: If it exceeds its fair share, its delays will skyrocket.
--> A much more formal version of this result, called the Parekh-Gallager theorem, appeared in "A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks: The Single Node Case" and "A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks: The Multiple Node Case".
--> The Parekh-Gallager theorem says (very informally): "if everyone promises to stay under a particular transmission rate, and the network can support the sum of these transmission rates, and the routers run WFQ, then everyone's worst-case per-packet delay can be bounded."
--> In general, WFQ incentivizes good congestion control because a sender's bad behavior can only hurt that sender. This is the exact opposite of FCFS/FIFO.

But, what's the catch with WFQ?
================
* Need to maintain a separate queue for each flow.
* This leads to the dreaded per-flow state (at the very least, to track the head and tail of each flow's queue). We actually need a bit more state for each flow.
* The point is it can get bad if you have many, many flows.
* But, is it really so bad? Depends.
* In the core of an ISP, yes. There may be a few million flows.
* In a datacenter, maybe not. There might be a few thousand flows, depending on the size of the datacenter.
* And the number of active flows is even smaller.

The WFQ algorithm itself (partly based on Section 3 of http://web.mit.edu/6.829/www/2016/papers/fq-notes.pdf)
================
* Based on Nagle's round-robin algorithm: service each flow for a time quantum and then move on to the next.
* But round robin has problems (the toy simulation below makes problem 1 concrete).
--> 1. It can be unfair if one flow uses much larger packets than another.
--> 2. Also, if a packet arrives just after its turn, it has to wait through the time quantum of every other flow before it can be transmitted.
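A quick illustration of problem 1. This is a toy sketch, not from any of the papers: the two flows, packet sizes, and turn count below are invented for illustration. Packet-wise round robin serves one *packet* per flow per turn, so a flow with bigger packets grabs more of the link:

    from collections import deque

    # Hypothetical setup: two always-backlogged flows. Flow A sends
    # 1500-byte packets, flow B sends 100-byte packets.
    flows = {
        "A": deque([1500] * 1000),  # packet sizes in bytes
        "B": deque([100] * 1000),
    }

    sent = {"A": 0, "B": 0}  # bytes transmitted per flow

    for _ in range(200):  # 200 round-robin turns
        for name, q in flows.items():
            if q:  # serve exactly one packet from each backlogged flow
                sent[name] += q.popleft()

    total = sum(sent.values())
    for name, nbytes in sent.items():
        print(f"flow {name}: {nbytes} bytes ({100 * nbytes / total:.0f}% of link)")
    # Prints ~94% for A and ~6% for B: "fair" in packets is unfair in bandwidth.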
* How do we fix this? What's the ideal behavior?
* Idealized model: bit-by-bit round robin, which is unattainable. Go one bit at a time from each flow.
* How do we approximate bit-by-bit round robin?
* Some definitions:
--> Round: one cycle through all queues with data (backlogged queues), where we send one bit per flow in a round.
--> dr/dt = mu / N, where mu is the link rate and N is the number of backlogged flows (i.e., more flows means the round number r grows more slowly with time).
--> Notice that N can keep varying as the number of active/backlogged queues changes.
--> And this will change the rate of change of r.
--> Implication: tracking r accurately is hard. We'll come back to this soon.
* Let's say we have r at any point in time. Then, we have the following:
* Start time (in units of rounds) of a packet in the bit-by-bit model is either:
--> 1. Finish time (in rounds) of the previous packet in the same flow (if the queue was backlogged), or
--> 2. Current round number (if the queue was not backlogged).
--> Combine the two by taking the max of 1 and 2.
* What is the finish time (in rounds) of this packet?
--> Start time (in rounds) + length of the packet in bits. This is because we send one bit of the packet per round.
* Need to track finish times on a per-flow basis (in addition to head and tail pointers).
* The packet-by-packet model makes two approximations relative to the bit-by-bit model (see the first sketch at the end of these notes):
--> Send the packet with the smallest finish time, to "catch up" with the bit-by-bit model.
--> Set r to the finish time of the packet currently in service.
--> Can prove that these don't introduce *too much* extra delay in the worst case relative to bit-by-bit. Again, look at the Parekh-Gallager papers if you're interested.

Historical aside:
=====================
* In 1995, an algorithm called Deficit Round Robin (a variant of round robin, but with an important fix) was developed.
* It didn't have WFQ's delay properties, but wasn't too bad delay-wise either.
* But it was much simpler! It eventually found its way into all major routers.
* This is what is on routers today, even though WFQ is better delay-wise and came earlier.
* Meta point: simplicity is important.
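To make the finish-time bookkeeping above concrete, here is a minimal single-link sketch, assuming equal weights (i.e., plain fair queueing rather than the weighted version) and using the r-approximation from the notes. The class name and the toy packet sizes are invented:

    import heapq

    class WFQScheduler:
        """Minimal fair-queueing sketch (equal weights). Finish times are in
        'rounds'; per the approximation above, r jumps to the finish time of
        each packet as it enters service."""

        def __init__(self):
            self.r = 0.0           # current round number (approximated)
            self.last_finish = {}  # per-flow state: finish time of last packet
            self.heap = []         # (finish_time, seq, flow, size) per queued packet
            self.seq = 0           # tie-breaker so the heap never compares flows

        def enqueue(self, flow, size_bits):
            # Start time = max(finish time of this flow's previous packet,
            # current round number).
            start = max(self.last_finish.get(flow, 0.0), self.r)
            finish = start + size_bits  # one bit per round => +length in rounds
            self.last_finish[flow] = finish
            heapq.heappush(self.heap, (finish, self.seq, flow, size_bits))
            self.seq += 1

        def dequeue(self):
            # Transmit the packet with the smallest finish time next.
            if not self.heap:
                return None
            finish, _, flow, size = heapq.heappop(self.heap)
            self.r = finish  # second approximation: r := finish time in service
            return flow, size

    # Toy usage: one big packet from A, then small packets from B. B's packets
    # finish earlier in the bit-by-bit model, so they are sent first.
    wfq = WFQScheduler()
    wfq.enqueue("A", 12000)    # 1500-byte packet
    for _ in range(3):
        wfq.enqueue("B", 800)  # 100-byte packets
    while (pkt := wfq.dequeue()) is not None:
        print(pkt)             # B, B, B, then A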
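The notes don't spell out DRR's fix, but the standard idea is a per-queue deficit counter: each backlogged queue earns a quantum of bytes per round and sends only while its accumulated credit covers the packet at the head. A sketch under those assumptions (the quantum and flows are invented):

    from collections import deque

    def drr(queues, quantum=500, rounds=6):
        # Each backlogged queue earns `quantum` bytes of credit per round and
        # sends while the credit covers the head packet; leftover credit
        # carries over, so large packets eventually get their turn.
        deficit = {name: 0 for name in queues}
        log = []  # transmission order: (flow, packet_size)
        for _ in range(rounds):
            for name, q in queues.items():
                if not q:
                    deficit[name] = 0  # empty queues don't hoard credit
                    continue
                deficit[name] += quantum
                while q and q[0] <= deficit[name]:
                    pkt = q.popleft()
                    deficit[name] -= pkt
                    log.append((name, pkt))
        return log

    # Toy usage: bytes served per flow even out despite unequal packet sizes
    # (3000 bytes each after six rounds).
    queues = {"A": deque([1500] * 4), "B": deque([100] * 60)}
    for flow, size in drr(queues):
        print(flow, size)

Note how little state this needs compared to WFQ's finish-time heap: one counter per queue and O(1) work per packet, which is much of why it won on simplicity.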