RMT: Context
===============================
* Published in 2013, around the time that SDN was really taking off as an academic topic
* One way to look at it: how do we enable a more flexible version of OpenFlow?
* Another way to look at it: how do we make the data plane programmable as well?
* SDN up until that point was all about control plane programmability.
* But SDN was still quite limited by what the switches provided: OpenFlow was a pragmatic compromise.
* OpenFlow turned out to be inadequate
  --> Led to two responses
  --> The edge/host approach to networking in the hypervisor or virtual switch
  --> The programmable switch approach to networking by building more flexible switches at the hardware level
  --> We'll talk mostly about the second approach here.
  --> If you want to learn more about the first approach, read:
      http://yuba.stanford.edu/~casado/fabric.pdf and https://benpfaff.org/papers/net-virt.pdf

RMT: Main ideas
==============================
* The reconfigurable match-action table model: programmability without giving up performance
* Flexible packet parsing: can add new header fields and have the switch match on them (OpenFlow could not do this).
* Flexible matches on new header fields and new combinations of headers
* Flexible actions (RMT calls this packet editing): modify packet headers in somewhat more arbitrary ways.
* Flexible table sizing: only need to respect the overall constraints on SRAM (exact match) and TCAM (ternary match) memory sizes.

But RMT has some restrictions
==============================
* Can't run everything you can on a CPU
* Formally, it isn't Turing complete.
* What can't it do?
  --> Packet scheduling
  --> Payload manipulation
  --> Programmatic state manipulation
* RMT was the first formal description of a switch architecture providing programmability without giving up performance.
* Later work fixed some of the restrictions of RMT.
* But ultimately the architecture is always going to be restricted relative to a CPU
  --> This is the price for high performance

Who does RMT (or a programmable data plane) benefit?
=============================
* Researchers trying out new ideas: XCP, DCTCP, etc. No need to contort your algorithm to the mechanics of the switch like DCTCP did.
* Network operators who want to customize the network to their needs: traffic engineering, load balancing, etc.
* Switch vendors: it's easier to fix bugs in software without making new hardware chips.
  --> This was already happening to some extent when the RMT paper was published. As the RMT paper says: "In fact, some existing chips, driven at least in part by the need to address multiple market segments, already have some flavors of reconfigurability that can be expressed using ad hoc interfaces to the chip." The Intel FM6000 was one example at the time, although it provided less programmability than RMT.
* Question: which of these three needs is the most pressing at this point?

The RMT paper itself
=============================
* Written in a very layered style.
* Goes from high-level ideas down to very low-level details (e.g., power consumption of digital circuits).
* Paper combines ideas and authors from many different fields:
  --> Networking
  --> Hardware design (RMT pipeline)
  --> Compilers (table dependency graph and parse graph)
  --> Circuit design (new TCAM cells)
  --> Algorithms (cuckoo hashing)
* Good example of interdisciplinary work in service of a particular end goal (switch programmability)
  --> Also makes the paper hard to read in parts, especially towards the end

The RMT hardware architecture
============================
* 640 Gbit/s
* 960 MHz at minimum packet size
* Why provision for minimum packet size?
  --> What's the goal here?
* 1 ingress and 1 egress pipeline (a single physical pipeline is time multiplexed)
* 16 parsers
  --> Why are there more parsers than pipelines?
* 32 stages

RMT parser
============================
* Follows parse graphs
* Dumps bytes into the appropriate packet headers
* The packet header travels through the pipeline.

An RMT stage
=============================
* A certain amount of SRAM
* A certain amount of TCAM
* The ability to extract match keys from the packet header (up to 640 bits of match key)
* The ability to modify packet headers in parallel (up to 224 parallel execution units)
* Crossbar to extract match keys from the packet header vector.
* Crossbar to extract action inputs from the packet header vector.

How is flexibility provided at the hardware level?
============================
* SRAM is a sea of small memory blocks that can be combined for wider or deeper tables.
* TCAM is built similarly.
* A crossbar allows the match key to come from anywhere in the packet header vector.
* A crossbar allows action inputs to come from anywhere in the packet header vector.
* Flexible parsing via a programmable state machine (stored in the parser's TCAM).

What does flexibility cost?
============================
* Section 5.5
  --> 5.5 is a bit unclear because the baseline switching chip's area isn't mentioned anywhere.
* Small memory tiles that can be combined together add some overhead relative to a large fixed memory (8%)
* The crossbar for operands and match keys adds some overhead (the paper says 6 mm^2 but doesn't give the relative number).
* Computation units for actions cost additional area (5.5%)
* Flexibility in the parser adds a further 0.7%
* Similar analysis for power
* Overall takeaway: a small additional amount of power and area
* This is really the main takeaway of the paper: you don't need a chip that's much larger.

Shortcomings of the paper:
===========================
* Some parts of the paper are hard to read.
* Design parameters are arbitrary: no real design-space exploration
  --> In some ways unavoidable for the first version of any hardware design.
  --> But a simulator can help with some of this.
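To make the two match types in an RMT stage concrete, here is a toy Python sketch (not the actual hardware, and with made-up table contents) of exact matching, which SRAM-based hash tables provide, versus ternary matching, where a TCAM compares the key against value/mask pairs in priority order and mask bits of 0 act as wildcards:

```python
# Toy sketch of the two match types an RMT stage supports:
# exact match (SRAM hash table) and ternary match (TCAM).
# Field values, actions, and table contents are illustrative only.

def ternary_match(key, entries):
    """Return the action of the first (highest-priority) entry whose
    value/mask pair matches the key, mimicking a TCAM's priority encoder."""
    for value, mask, action in entries:
        if key & mask == value & mask:
            return action
    return "default"

# Exact match: a hash table, as SRAM-based tables use
# (the RMT paper uses cuckoo hashing for these).
exact_table = {0x0A000001: "forward_port_1"}

# Ternary match: mask bits set to 0 are wildcards,
# e.g. the first entry matches the 10.0.0.0/8 prefix.
tcam_entries = [
    (0x0A000000, 0xFF000000, "forward_port_2"),  # 10.0.0.0/8
    (0x00000000, 0x00000000, "drop"),            # catch-all
]

dst = 0x0A0000FF  # 10.0.0.255: misses the exact table, hits the /8 prefix
action = exact_table.get(dst, ternary_match(dst, tcam_entries))
```

The priority-ordered scan mirrors why TCAM entry order matters: a longer, more specific prefix must sit above the catch-all entry to win.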
Shortcoming of the architecture:
==========================
* Conflates memory with packet-processing resources
* If you need more memory, you are forced to pay for more packet-processing resources, and vice versa.
* Later architectures (e.g., dRMT) fix this problem.

P4
==========================
* Published a bit after the RMT paper, in 2014.
* Goal was a standard language to program the different emerging programmable switching devices (mostly switches then; today this includes NICs).
* Three goals:
  --> Field reconfigurability: change switch behavior in the field without redoing the hardware design
  --> Protocol independence: swap out one protocol for another
  --> Target independence: the same program should run on different targets (in hindsight, this turned out to be the hardest to achieve)
* P4's relationship to RMT was similar to OpenFlow's relationship to Ethane.
  --> In the sense that P4's design was informed by the requirements for programming RMT.
* Designed to co-exist within the SDN-style control/data-plane separation model.
  --> Instead of the controller talking to an OpenFlow switch, it uses a compiler-generated API to talk to a P4 program compiled to a programmable switch.

P4 constructs
=========================
* Headers: breaking down a packet's bits into different fields
* Parsers: a state machine to dictate parsing
* Tables: match-action processing for packets
* Actions: what operations to perform on each packet
* Control program: what order to apply these operations in
* Primitive actions within an action are executed in parallel. Why?
* Goal: compile a P4 program to different targets: RMT, NPUs, FPGAs, etc.

What's happened since 2013?
=========================
* RMT was commercialized as a switch called Tofino by Barefoot Networks. 6.5 Tbit/s (about 10 times faster than RMT). (This is what's required to remain competitive in the marketplace.)
* P4 has evolved into a second version called P4-16, which makes P4 much more high level.
  --> Several workshops centered around P4.
  --> Open-source spec and compilers at p4.org
  --> Industry consortium around it.
* Emerging research area since 2013: how do we make things fast and programmable at once?
* However, it's unclear how it will be adopted in industry.
  --> Clear benefits to switch vendors (e.g., Arista, Juniper, Dell, etc.) who want switch programmability for agility.
  --> Benefits to network operators are a bit more unclear.
  --> Partly dependent on mindset.
  --> If you have already figured out how to do things from the edge of the network, why bother moving things into the network?
      --> Largely the approach Microsoft has taken (e.g., VFP, Azure SmartNIC, etc.)
      --> But Google seems more open to tinkering with network hardware/software (e.g., switch control software for Jupiter/Espresso/B4, switch box design for B4).
  --> Might find uses in niche markets like high-frequency trading
      * In such cases, you really want to program your switch to do as little as possible, for low latency.
      * Counterintuitive: programmability can help you reduce the number of features in your network.
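The parser-as-state-machine idea, shared by the RMT parser (parse graph stored in the parser's TCAM) and P4's parser construct, can be sketched in a few lines of Python. This is a toy model, not P4: the state names, header lengths, and the EtherType-based transition are illustrative assumptions.

```python
# Toy sketch of a programmable parse graph: each state names a header,
# its length in bytes, and a function that inspects the extracted bytes
# to pick the next state (None means stop parsing).
PARSE_GRAPH = {
    "ethernet": ("ethernet", 14,
                 # EtherType 0x0800 (bytes 12-13) means an IPv4 header follows.
                 lambda b: "ipv4" if b[12:14] == b"\x08\x00" else None),
    "ipv4": ("ipv4", 20, lambda b: None),
}

def parse(packet, start="ethernet"):
    """Walk the parse graph, dumping bytes into named headers."""
    headers, offset, state = {}, 0, start
    while state is not None:
        name, length, next_fn = PARSE_GRAPH[state]
        headers[name] = packet[offset:offset + length]
        offset += length
        state = next_fn(headers[name])
    return headers
```

Reprogramming the switch to recognize a new protocol then amounts to adding a state and a transition to `PARSE_GRAPH`, rather than respinning hardware — the flexibility OpenFlow's fixed header set could not offer.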