Distributed Systems Spring 2024 Lecture 2: Logical Clocks, Safety and Liveness

The readings for this week concern the task of ordering events (operations) in a distributed system, and reasoning about ordered events. You might wonder why we do not just use real clocks. One problem is relativity, which tells us that the notion of time depends on location. This of course sounds unhelpful: after all, we manage to schedule our lives around clocks despite relativity. Hold this thought until September 23, when we will look at some reasons why real clocks pose practical challenges for distributed systems. For this class, focus instead on logical clocks.

Why worry about ordering anyway? The reason is that we always think about algorithms as presenting steps that must be performed in order: an algorithm is a sequence of steps that must be performed. Ordering is thus at the heart of this discussion: when we analyze distributed systems we need to understand the order in which they perform operations, and when designing systems we need to worry about how we ensure that operations are performed in order.

# Lamport '78: Time, Clocks, and the Ordering of Events in a Distributed System

In this paper, the first of the logical clock papers, Lamport describes an algorithm (a procedure) for recovering a **total order** on events in the distributed system. A total order here means that for any two distinct events (which, remember, are operations) e and e', either e < e' (e happens and then e' happens) or e' < e. Total orders are appealing, and Lamport's construction makes sure that the total order is sane: i.e., messages are sent before being received, and causality is maintained. Despite this, the total order Lamport derives is not **unique**, which just means that one can ascribe more than one total order to a set of events.

## Questions to consider

* Consider a distributed system where the only events are sending and receiving messages.
Give pseudocode that each process should run when sending or receiving messages in order to maintain a Lamport clock.
* Construct an example, for an application that only allows sends and receives, where two or more total orders can be assigned to a single event sequence.

# Alpern and Schneider '85: Defining Liveness

This next paper looks at the question of what correctness means given a set of ordered events. This seemingly trivial question is actually quite deep: most unit tests and integration tests that you have probably encountered thus far provide a particular set of inputs, and then check outputs once the system is done running. This view makes sense if programs are like mathematical functions, taking inputs and producing outputs. It breaks down for many reasons, including the fact that programs have side-effects. However, side-effects are not the main issue we focus on in this class. Our concern is that many distributed algorithms are designed to run for an unbounded amount of time, and this is true for many of the systems you interact with on a daily basis. For example, from a user's perspective websites like GMail never really "terminate"; that is, there is no notion of when they are done running. What does it mean for GMail to be correct? This paper presents two fundamental types of correctness properties that must hold for a system, and shows that every correctness property can be decomposed into these fundamental types.

## Questions to consider

* Pick a system of your choice, and list its safety and liveness properties.
* Give an example (from a real system) of an absolute liveness property, and discuss what might go wrong if one used uniform liveness instead.

# Chandy, Lamport '85: Distributed Snapshots

Lamport '78 discussed some of the issues with thinking about a single timeline when dealing with distributed systems, while Alpern and Schneider talk about correctness in terms of sequences of events.
For most of the class we will resolve this contradiction by assuming a theoretical global clock and talking about the system from the perspective of an omniscient outside observer. However, this is an entirely theoretical perspective, and leads to the question of what we can do if we want a program to reason about its own state. This paper both defines what program state means in the distributed context, and presents an elegant protocol/algorithm for collecting this state.

## Questions to consider

* What properties hold for the snapshots collected by the Chandy-Lamport algorithm?
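
To make the snapshot protocol more concrete before reading the paper, here is a minimal single-threaded sketch of marker-based snapshotting. It is an illustration under stated assumptions, not the paper's presentation: channels are reliable and FIFO (as the algorithm requires), the application is a toy in which processes transfer money, and the names `Process`, `MARKER`, `run_demo`, and `snapshot_total` are all hypothetical.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical sentinel that separates pre- and post-snapshot messages


class Process:
    def __init__(self, pid, balance, peers, channels):
        self.pid = pid
        self.balance = balance
        self.peers = peers            # ids of all other processes
        self.channels = channels      # shared dict: (src, dst) -> FIFO deque
        self.recorded_balance = None  # local state captured by the snapshot
        self.channel_state = {}       # src -> in-flight amounts captured
        self.open = set()             # incoming channels still being recorded

    def send(self, dst, amount):
        self.balance -= amount
        self.channels[(self.pid, dst)].append(amount)

    def start_snapshot(self):
        # Record local state, then send a marker on every outgoing channel
        # and begin recording every incoming channel.
        self.recorded_balance = self.balance
        self.open = set(self.peers)
        self.channel_state = {src: [] for src in self.peers}
        for dst in self.peers:
            self.channels[(self.pid, dst)].append(MARKER)

    def receive(self, src):
        msg = self.channels[(src, self.pid)].popleft()
        if msg == MARKER:
            if self.recorded_balance is None:
                self.start_snapshot()  # first marker seen: record now
            self.open.discard(src)     # this channel's recorded state is final
        else:
            self.balance += msg
            if src in self.open:       # arrived after recording, before the marker
                self.channel_state[src].append(msg)


def run_demo():
    ids = [0, 1, 2]
    channels = {(s, d): deque() for s in ids for d in ids if s != d}
    procs = {i: Process(i, 100, [j for j in ids if j != i], channels) for i in ids}
    procs[0].send(1, 10)       # in flight when the snapshot starts
    procs[0].start_snapshot()  # process 0 initiates
    procs[1].send(0, 20)       # sent before process 1 has seen any marker
    # Deliver every pending message; each pair is (receiver, sender).
    for dst, src in [(1, 0), (1, 0), (0, 1), (0, 1), (2, 0), (2, 1), (0, 2), (1, 2)]:
        procs[dst].receive(src)
    return procs


def snapshot_total(procs):
    recorded = sum(p.recorded_balance for p in procs.values())
    in_flight = sum(sum(msgs) for p in procs.values() for msgs in p.channel_state.values())
    return recorded + in_flight
```

In this run the system starts with 300 units in total, and `snapshot_total(run_demo())` also comes to 300, even though no process stopped working while the snapshot was taken. Which properties this reflects in general, and why, is what the question above asks you to work out from the paper.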