Distributed Systems Fall 2021
Lecture 2: Logical Clocks, Safety and Liveness
The readings for this week concern themselves with the task of ordering events
(operations) in a distributed system, and reasoning about ordered events. You
might wonder why we do not just use real clocks. One problem is relativity,
which tells us that the notion of time is dependent on location. This of course
sounds unhelpful; after all, we manage to schedule our lives around clocks
despite relativity. Hold that thought until September 23, when we will look at
some reasons why relying on real clocks results in practical challenges for
distributed systems. For this class, focus on logical clocks instead.
Why worry about ordering at all? The reason is that we think of an algorithm as
a sequence of steps that must be performed in order. Ordering is thus at the
heart of this discussion: when we analyze distributed systems we need to
understand the order in which they performed operations, and when we design
systems we need to ensure that operations are performed in the intended order.
# Lamport '78: Time, Clocks, and the Ordering of Events in a Distributed System
In this paper, the first of the logical clock papers, Lamport describes an
algorithm (a procedure) for recovering a **total order** on events in a
distributed system. A total order here means that for any two distinct events e
and e' (remember that events are operations), either e < e' (e happens and then
e' happens) or e' < e.
Total orders are appealing, and Lamport's construction makes sure that the total
order is sane, i.e., messages are sent before they are received, and causality
is maintained. Despite this, the total order Lamport derives is not
**unique**: more than one total order can be ascribed to the same set of
events.
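
The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's notation: the names `Process`, `local_event`, `send`, `receive`, and `total_order` are hypothetical, and the receive rule uses the common `max(...) + 1` formulation. Ties between equal timestamps are broken by an arbitrary ranking of the processes, which is where the non-uniqueness comes from.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """One process carrying a Lamport clock (illustrative names throughout)."""
    pid: int
    clock: int = 0

    def local_event(self) -> int:
        # Tick on every event so events on one process get increasing timestamps.
        self.clock += 1
        return self.clock

    def send(self) -> int:
        # Sending is itself an event; the returned timestamp travels with the message.
        return self.local_event()

    def receive(self, msg_timestamp: int) -> int:
        # Move past the sender's timestamp so the receive is ordered after the send.
        self.clock = max(self.clock, msg_timestamp) + 1
        return self.clock


def total_order(events, pid_rank):
    """Sort (timestamp, pid, label) triples, breaking timestamp ties by pid_rank[pid]."""
    ordered = sorted(events, key=lambda e: (e[0], pid_rank[e[1]]))
    return [label for _, _, label in ordered]


p, q = Process(pid=0), Process(pid=1)
events = [
    (p.local_event(), 0, "a"),  # concurrent with "b"
    (q.local_event(), 1, "b"),
]
t = p.send()
events.append((t, 0, "send"))
events.append((q.receive(t), 1, "recv"))

order_1 = total_order(events, {0: 0, 1: 1})  # process 0 wins ties
order_2 = total_order(events, {0: 1, 1: 0})  # process 1 wins ties
print(order_1)
print(order_2)
```

Both results place `send` before `recv`, but they disagree on the relative order of the concurrent events `a` and `b`, depending only on the chosen tie-break.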
## Questions to consider
* Consider a distributed system where the only events are sending and receiving
messages. Give pseudocode that each process should run when sending or receiving
messages in order to maintain a Lamport clock.
* Construct an example, for an application whose only events are sends and
receives, in which two or more total orders can be assigned to a single event
sequence.
# Alpern and Schneider '85: Defining Liveness
This next paper looks at the question of what correctness means given a set of
ordered events. This seemingly trivial question is actually quite deep: most
unit tests and integration tests that you have probably encountered thus far
provide a particular set of inputs, and then check outputs once the system is
done running. This is a view that makes sense if programs are like mathematical
functions, taking inputs and producing outputs. This view of course breaks down
for many reasons, including the fact that programs have side-effects. However,
side-effects are not the main issue we focus on in this class.
Our concern is that many distributed algorithms are designed to run for an
unbounded amount of time, and this is true for many of the systems you interact
with on a daily basis. For example, from a user's perspective websites like
Gmail never really "terminate": there is no notion of when they are done
running. What does it mean for Gmail to be correct? This paper presents two
fundamental types of correctness properties, safety and liveness, and shows
that every correctness property can be decomposed into these fundamental types.
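
A common informal reading of the two property types is that safety says "nothing bad ever happens" and liveness says "something good eventually happens". The sketch below illustrates the practical difference on finite traces of a hypothetical mutual-exclusion system. It is not the paper's formalism; the trace format and the function names are assumptions made for this example. A safety violation shows up in a finite prefix, whereas a finite trace can only report what is still pending for a liveness property.

```python
def first_safety_violation(trace):
    """Return the index of the first state with two processes in the critical
    section, or None if the finite trace never shows the bad thing."""
    for i, state in enumerate(trace):
        if len(state["in_cs"]) > 1:
            return i
    return None


def pending_requests(trace):
    """Return the requests not yet granted by the end of a finite trace.

    A non-empty result does not refute liveness: some extension of the
    trace could still grant the pending requests.
    """
    pending = set()
    for state in trace:
        pending |= state["requested"]  # requests issued in this step
        pending -= state["in_cs"]      # requests granted by this step
    return pending


# Each state lists who asked for the lock in this step and who currently holds it.
trace = [
    {"requested": {"p"}, "in_cs": set()},
    {"requested": {"q"}, "in_cs": {"p"}},
    {"requested": set(), "in_cs": set()},
]
bad_trace = trace + [{"requested": set(), "in_cs": {"p", "q"}}]

print(first_safety_violation(trace))      # no violation observed
print(pending_requests(trace))            # q is still waiting
print(first_safety_violation(bad_trace))  # index of the offending state
```

Once `bad_trace` reaches the offending state, no continuation can repair it, while the pending request in `trace` may still be granted later.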
## Questions to consider
* Pick a system of your choice and list its safety and liveness properties.
* Give an example (from a real system) of an absolute liveness property, and
discuss what might go wrong if one used uniform liveness instead.
# Chandy and Lamport '85: Distributed Snapshots
Lamport '78 discussed some of the issues with thinking about a single timeline
when dealing with distributed systems, while Alpern and Schneider talk about
correctness in terms of sequences of events. For most of the class we will
resolve this contradiction by assuming a theoretical global clock and talking
about the system from the perspective of an omniscient outside observer.
However, this is an entirely theoretical perspective, and it leads to the
question of what we can do if we want a program to reason about its own state.
This paper both defines what program state means in the distributed context and
presents an elegant protocol/algorithm for collecting this state.
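
To make the marker-based idea concrete, here is a small single-threaded simulation of two processes that transfer money over reliable FIFO channels, as the paper assumes. It is a sketch under stated assumptions rather than a faithful implementation: the `Node` class, its methods, and `snapshot_total` are hypothetical names, process state is just an integer balance, and message delivery is driven by hand.

```python
from collections import deque

MARKER = object()  # sentinel; ordinary messages are integer transfer amounts


class Node:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.inbox = {}               # sender name -> FIFO channel into this node
        self.peers = {}               # receiver name -> Node
        self.recorded_balance = None  # local state captured by the snapshot
        self.channel_state = {}       # sender name -> messages recorded in flight
        self.recording = set()        # incoming channels still being recorded

    def connect(self, other):
        self.peers[other.name] = other
        other.inbox[self.name] = deque()

    def send(self, to, msg):
        if msg is not MARKER:
            self.balance -= msg
        self.peers[to].inbox[self.name].append(msg)

    def start_snapshot(self):
        # Record local state, start recording every incoming channel, and
        # send a marker on every outgoing channel before anything else.
        self.recorded_balance = self.balance
        self.channel_state = {src: [] for src in self.inbox}
        self.recording = set(self.inbox)
        for to in self.peers:
            self.send(to, MARKER)

    def deliver(self, src):
        msg = self.inbox[src].popleft()
        if msg is MARKER:
            if self.recorded_balance is None:
                self.start_snapshot()    # first marker seen: record state now
            self.recording.discard(src)  # this channel's recorded state is final
        else:
            self.balance += msg
            if src in self.recording:
                self.channel_state[src].append(msg)  # was in flight at snapshot time


def snapshot_total(nodes):
    """Sum the recorded balances plus the money recorded as in flight."""
    return sum(
        n.recorded_balance + sum(sum(msgs) for msgs in n.channel_state.values())
        for n in nodes
    )


p, q = Node("p", 100), Node("q", 50)
p.connect(q)
q.connect(p)

p.send("q", 10)      # in flight before the snapshot starts
p.start_snapshot()   # p records its balance and sends a marker to q
q.send("p", 5)       # sent before q has seen the marker
q.deliver("p")       # q receives the 10
q.deliver("p")       # q receives the marker and records its own state
p.deliver("q")       # p receives the 5 and records it as channel state
p.deliver("q")       # p receives q's marker; the snapshot is complete

print(snapshot_total([p, q]))
```

In this run the recorded balances and the recorded in-flight message together account for all 150 units in the system, even though no process ever stopped to take the snapshot.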
## Questions to consider
* What properties hold for the snapshots collected by the Chandy-Lamport
algorithm?