# ​

## ​

The full code and documentation is available on Github. Cover photograph by Samyuktha Sridhar.

# Beaker

Beaker is a distributed, transactional key-value store that is consistent and available. Beaker uses a leader-less variation of Generalized Paxos to consistently execute transactions. Beaker permits a minority of failures and hence it is N / 2 fault tolerant. Beaker assumes that failures are fail-stop. It makes no assumptions about the underlying network except that messages are received in the order they were sent. Most networking protocols, including TCP, satisfy this requirement.

## Introduction

A database is a transactional key-value store. Databases map keys to versioned values, called revisions. Revisions are uniquely identified and totally ordered by their version. A transaction depends on the versions of a set of keys, called its readset, and changes the values of a set of keys, called its writeset. Transactions may be committed if and only if the versions they depend on are greater than or equal to their versions in the database. Revisions are monotonic; if a transaction changes a key for which there exists a newer revision, the modification is discarded. This ensures that transactions cannot undo the effect of other transactions. We say that a transaction A conflicts with B if either reads or writes a key that the other writes.

A distributed database is a collection of beakers. Each beaker maintains its own replica of the database. In order to maintain consistency across beakers, a majority of beakers must agree to commit every transaction. Reaching agreement in a distributed system is often referred to as consensus, and it is a relatively well studied problem in computer science. There are a variety of algorithms that solve this problem, most notably Paxos, that have proven to be correct and performant. Beaker employs a variation of Paxos that has several desirable properties. First, beakers may simultaneously commit non-conflicting transactions. Second, beakers automatically repair replicas that have stale revisions. Third, beakers may safely commit transactions as long as they are connected to at least a majority of their non-faulty peers.

## Consensus

Beakers reach consensus on proposals. A proposal is a collection of non-conflicting transactions. These transactions may conditionally apply changes or unconditionally repair stale revisions. Proposals are uniquely identified and totally ordered by a ballot number. We say that a proposal A conflicts with B if a transaction that is applied by A conflicts with a transaction that is applied by B. We say that a proposal A is older than B if A conflicts with B and A has a lower ballot than B. We say that a proposal A matches B if A applies the same transactions as B. Proposals A and B may be merged by taking the maximum of their ballots, combining the transactions they apply choosing the transactions in the newer proposal in the case of conflicts, and combining their repairs choosing highest revision changes in the case of duplicates.

The leader for a proposal P first prepares P on a majority of beakers. If a beaker has not made a promise to a newer proposal, it responds with a promise not to accept any proposal that conflicts with the proposal it returns that has a lower ballot than P. If a beaker has already accepted proposals older than P, merges them together and returns the result. Otherwise, it returns the proposal with a zero ballot. If the leader does not receive a majority of promises, it retries with a higher ballot. Otherwise, it merges the returned promises into a single proposal P'. If P does not match P', it retries with P'. Otherwise, the leader gets the latest versions of the keys that are read by P from a majority of beakers. The leader discards all transactions in P that cannot be committed given the latest versions, and sets its repairs to the latest revisions of keys that are read - but not written - by P for which the beakers disagree on their version. The leader then requests a majority of beakers to accept P. A beaker accepts a proposal if it has not promised not to. When a beaker accepts a proposal, it discards all older accepted proposals and broadcasts a vote for it. We say that a proposal is accepted if a majority of beakers vote for it. A beaker learns a proposal once a majority of beakers vote for it. If a beaker learns a proposal, it commits its transactions and repairs on its replica of the database.

### Correctness

The proof of correctness relies on the assumption of connectivity, beakers are always connected to all of their non-faulty peers, and the fact of quorum intersection, any majority of beakers will contain at least one beaker in common.

Liveness. An accepted proposal A will eventually be learned. Proof. By quorum intersection, at least one promise will contain A. Therefore, A must be proposed enough beakers learn A such that A is no longer accepted by a majority. By assumption of connectivity, if any beaker learns a proposal then all beakers will eventually learn it.

Linearizability. If a proposal A is accepted, then any conflicting proposal B that is accepted after A will be learned after A. Proof. Because A was accepted before B, the majority that accepted A before B will vote for A before B. Because messages are delivered in order, A will be learned before B.

Commutativity. Let R denote the repairs for an accepted proposal A. Any accepted proposal B that conflicts with A + R but not A commutes with A + R. Proof. Because B conflicts with A + R but not A, B must read a key k that is read by A. By linearizability, B must read the latest version of k because B is accepted. Suppose that B is committed first. Because B reads and does not write k, A + R can still be committed. Suppose that A + R is committed first. Because A + R writes the latest version of k and B reads the latest version, B can still be committed.

Consistency. If a proposal A is accepted, it can be committed. Proof. Suppose there exists a transaction that cannot be committed. Then, the transaction must read a key for which there exists a newer version. This implies that there exists a proposal B that was accepted after but learned before A that changes a key k that is read by A. By linearizability, B cannot conflict with A. Therefore, B must repair k. By commutativity, A may still be committed.

## Reconfiguration

Each beaker is required to be connected to a majority of non-faulty peers in order to guarantee correctness. However, this correctness condition is only valid when the cluster is static. In practical systems, beakers may join or leave the cluster arbitrarily as the cluster grows or shrinks in size. In this section, we describe how fresh beakers are bootstrapped when they join an existing cluster. When a fresh beaker joins a cluster, its database is initially empty. In order to guarantee correctness, its database must be immediately populated with the latest revision of every key-value pair. Otherwise, if N -+ 1 fresh beakers join a cluster of size N it is possible for a quorum to consist entirely of fresh beakers.

A naive solution might be for the fresh beaker to propose a read-only transaction that depends on the initial revision of every key-value pair in the database and conflicts with every other proposal. Then, the fresh beaker would automatically repair itself in the process of committing this transaction. However, this is infeasible in practical systems because databases may contain arbitrarily many key-value pairs. This approach would inevitably saturate the network because for a database of size D such a proposal consumes D * (3 * N / 2 + N * N) in bandwidth. Furthermore, it prevents any proposals from being accepted in the interim.

We can improve this solution by decoupling bootstrapping and consensus. A fresh beaker joins the cluster as a non-voting member; it learns proposals, but does not participate in consensus. The fresh beaker reads the contents of the database from a quorum. It then assembles a repair transaction and commits it on its replica. It then joins the cluster as a voting member. This approach consumes just D * N / 2 in bandwidth and permits concurrent proposals.