Consensus Algorithm: Raft
This blog post takes a deep dive into the well-known consensus algorithm called Raft and explains how it achieves consensus in a distributed system made up of multiple nodes.
Question: Why do we need consensus algorithms?
Imagine there are 3 nodes or machines in your architecture: Node A, Node B, and Node C. Initially, all three nodes store the following key-value pair:
Node A: Key1, Value1
Node B: Key1, Value1
Node C: Key1, Value1
Now, imagine a network partition occurs, isolating Node A from Nodes B and C. Then an update operation arrives at Node A, changing the value from Value1 to Value2. The state of the nodes becomes:
Node A: Key1, Value2
Node B: Key1, Value1
Node C: Key1, Value1
Imagine that the network partition is then repaired. If we now send read requests to this three-node system, any node may serve a given request. A read served by Node A returns Value2, whereas reads served by Node B or Node C return Value1. The system therefore returns inconsistent responses for the same key.
To get consistent responses from an architecture consisting of multiple nodes, the nodes need to reach a consensus so that the data on all of them agrees.
Introduction: Raft Algorithm
Raft is a consensus algorithm designed for managing a replicated log across the nodes of a distributed system, ensuring data consistency and fault tolerance. It solves the problem of achieving consensus in a distributed system where nodes may fail or become unavailable.
Here are some basic components of the Raft algorithm:
Node States: All the nodes start in the follower state. At any point in time, a node can be in one of three states:
Follower: A node that follows a leader and replicates its log.
Candidate: A node that attempts to become the leader during an election.
Leader: The node that receives all writes in the system from clients. It is the primary node responsible for managing the cluster and replicating its log to the followers.
Term: Raft divides time into terms. Each term begins with an election and continues as long as the elected leader stays active and keeps replicating its log to the follower nodes. When the leader becomes inactive, that term ends and a new term begins with a new election.
Log: In the context of Raft, the log is an ordered sequence of entries that represents the history of all changes made to the distributed system's state. Each entry might denote a database command, a configuration change, a system update, or anything else that can be applied to a node or its data.
Entries: Entries are the individual elements of the log, containing commands or data.
Heartbeats: Heartbeats are periodic messages sent by the leader to maintain its authority over the follower nodes and detect failures.
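To make these components concrete, here is a minimal sketch in Go of the state a single Raft node might keep. The type and field names (State, LogEntry, Node, and so on) are illustrative assumptions for this post, not taken from the Raft paper or any particular library.

```go
package main

import "fmt"

// State is one of the three roles a Raft node can be in at any point in time.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// LogEntry is a single element of the log: the term in which it was created
// plus the command to apply to the state machine (e.g. "SET Key1 Value2").
type LogEntry struct {
	Term    int
	Command string
}

// Node holds the per-node state described above (illustrative field names).
type Node struct {
	ID          int
	State       State
	CurrentTerm int        // latest term this node has seen
	VotedFor    int        // candidate this node voted for in the current term (-1 = none)
	Log         []LogEntry // the node's log: an ordered sequence of entries
	CommitIndex int        // index of the highest log entry known to be committed
}

func main() {
	// Every node starts out as a follower with an empty log.
	n := Node{ID: 1, State: Follower, VotedFor: -1, CommitIndex: -1}
	n.Log = append(n.Log, LogEntry{Term: 1, Command: "SET Key1 Value1"})
	fmt.Printf("node %d: state=%v term=%d entries=%d\n", n.ID, n.State, n.CurrentTerm, len(n.Log))
}
```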
Let's deep dive into the main algorithm and its key components.
How does Raft perform consensus?
There are two key components to performing consensus with Raft in a distributed system: leader election and log replication. Let's deep dive and understand each of them.
Leader Election
Leader election in Raft is the process by which a new leader is chosen when the current leader fails or becomes unavailable. All nodes start as followers, and each node has an election timeout associated with it (a randomized duration between 150 and 300 ms). Each follower waits out its election timeout, hoping to receive some communication in the form of a heartbeat from the leader node (if there is one).
Since there is no leader initially, as soon as the election timeout expires for one of the nodes, this follower node turns into a candidate node and sends out vote requests to the other nodes in the system.
Situation 1: If the candidate receives votes from a majority of the nodes, it becomes the leader. The new leader then sends a heartbeat to all the follower nodes at a periodic interval to maintain its authority over them. The leader also dictates what the followers replicate (covered in the next section).
Situation 2: If the candidate does not receive a majority of the votes, a re-election happens using the same process.
Two timeouts control the election process:
Election timeout: The amount of time a follower waits before becoming a candidate, randomized between 150 ms and 300 ms. As discussed earlier, once the election timeout expires, the follower becomes a candidate and starts a new election term. That term continues until some follower stops receiving heartbeats and becomes a candidate, starting the next term.
Heartbeat timeout: The interval at which the leader periodically sends heartbeat messages to the follower nodes.
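Here is a minimal Go sketch of the election flow just described, assuming an in-memory cluster where a vote request is a plain function call standing in for Raft's RequestVote RPC; all names and the simplified voting rule are illustrative assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomElectionTimeout returns a randomized timeout between 150 ms and 300 ms;
// the randomization makes it unlikely that two followers time out at the same moment.
func randomElectionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(151)) * time.Millisecond
}

// requestVote is a stand-in for Raft's RequestVote RPC: a voter grants its vote
// only if it has not already voted for a different candidate in this term.
func requestVote(votedFor map[int]int, voterID, candidateID int) bool {
	if prev, ok := votedFor[voterID]; ok && prev != candidateID {
		return false
	}
	votedFor[voterID] = candidateID
	return true
}

func main() {
	const clusterSize = 5
	term := 1
	candidateID := 1
	votedFor := map[int]int{} // voterID -> candidate voted for in this term

	// A follower waits out its randomized election timeout without hearing a heartbeat...
	fmt.Println("waited out election timeout of", randomElectionTimeout())

	// ...then becomes a candidate, votes for itself, and asks every other node for a vote.
	votes := 1
	votedFor[candidateID] = candidateID
	for voter := 2; voter <= clusterSize; voter++ {
		if requestVote(votedFor, voter, candidateID) {
			votes++
		}
	}

	// Situation 1: majority reached -> the candidate becomes leader for this term.
	// Situation 2: no majority -> another randomized timeout triggers a re-election.
	if votes > clusterSize/2 {
		fmt.Printf("node %d is leader for term %d with %d/%d votes\n", candidateID, term, votes, clusterSize)
	} else {
		fmt.Printf("no majority in term %d; a new election will start\n", term)
	}
}
```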
Log Replication
After a leader is elected, all changes to the system go through the leader. Every time a client sends a write request, it goes to the leader node and is appended as an entry in the leader's log. At this point the entry is uncommitted: it exists only in the leader's log.
To commit the entry, the leader first replicates it to the follower nodes by including it in the heartbeat messages it sends them. The leader then waits for acknowledgments that a majority of the nodes have written the entry. Once it has acknowledgments from a majority, the leader commits the entry on its own node and notifies the followers in a subsequent message so that they can commit the entry as well.
This process is called log replication. In this way, the whole distributed system comes to a consensus on the same state.
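Here is a minimal in-memory sketch of this commit rule in Go. The replicate function stands in for Raft's AppendEntries RPC, and the Reachable flag simulates followers that cannot be contacted; all names are assumptions made for illustration.

```go
package main

import "fmt"

type LogEntry struct {
	Term    int
	Command string
}

type Node struct {
	ID          int
	Log         []LogEntry
	CommitIndex int // highest log index known to be committed (-1 = none)
	Reachable   bool
}

// replicate tries to append the entry on a follower and reports whether the
// follower acknowledged it (a stand-in for Raft's AppendEntries RPC).
func replicate(f *Node, entry LogEntry) bool {
	if !f.Reachable {
		return false
	}
	f.Log = append(f.Log, entry)
	return true
}

func main() {
	leader := &Node{ID: 1, CommitIndex: -1, Reachable: true}
	followers := []*Node{
		{ID: 2, CommitIndex: -1, Reachable: true},
		{ID: 3, CommitIndex: -1, Reachable: true},
		{ID: 4, CommitIndex: -1, Reachable: false}, // e.g. partitioned away
		{ID: 5, CommitIndex: -1, Reachable: false},
	}
	clusterSize := len(followers) + 1

	// 1. A client write arrives: append it as an uncommitted entry on the leader.
	entry := LogEntry{Term: 1, Command: "SET Key1 Value2"}
	leader.Log = append(leader.Log, entry)

	// 2. Replicate to followers and count acknowledgments (the leader counts as one).
	acks := 1
	for _, f := range followers {
		if replicate(f, entry) {
			acks++
		}
	}

	// 3. Commit only if a majority of the whole cluster has written the entry.
	if acks > clusterSize/2 {
		leader.CommitIndex = len(leader.Log) - 1
		fmt.Printf("committed %q with %d/%d acks\n", entry.Command, acks, clusterSize)
	} else {
		fmt.Printf("entry %q stays uncommitted (%d/%d acks)\n", entry.Command, acks, clusterSize)
	}
}
```

In this sketch, two of the four followers are unreachable, yet the leader still gathers three acknowledgments out of five nodes (counting itself), so the entry commits; if one more follower were unreachable, the entry would stay uncommitted.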
Interesting problem: Suppose there are 5 nodes in the system and a network partition splits them into two sub-groups: a majority group of 3 nodes and a minority group of 2 nodes. If the current leader ends up in the minority group, it keeps acting as a leader there, while the majority group eventually runs its own election and elects a new leader, so the system ends up with two leaders.
This is called the split-brain problem: two leaders can accept writes for the same system and can thus cause data inconsistencies.
Raft handles the split-brain problem implicitly: the leader of the minority sub-group cannot replicate its log entries to a majority of the nodes in the system, so it can never commit them. All the log entries on the minority group's leader remain uncommitted.
Later, when the network partition heals, the uncommitted entries of the minority group are discarded, the log from the majority group's leader is replicated to all the follower nodes in the system, and consensus is achieved again.
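The reason the minority leader can never commit boils down to a single check, sketched below with a hypothetical canCommit helper: the majority is always counted against the full cluster size, not against the size of the partition the leader happens to be in.

```go
package main

import "fmt"

// canCommit reports whether a leader that can reach `reachable` other nodes
// (plus itself) has a quorum in a cluster of `clusterSize` nodes.
func canCommit(clusterSize, reachable int) bool {
	return reachable+1 > clusterSize/2
}

func main() {
	const clusterSize = 5

	// Majority-side leader: itself plus the 2 followers it can reach = 3 of 5.
	fmt.Println("majority partition can commit:", canCommit(clusterSize, 2)) // true

	// Minority-side leader: itself plus the 1 follower it can reach = 2 of 5,
	// so its entries stay uncommitted and are discarded once the partition heals.
	fmt.Println("minority partition can commit:", canCommit(clusterSize, 1)) // false
}
```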
I was planning to compare Paxos with Raft, but all I managed to understand is that Paxos is far harder to grasp than Raft. So I have deferred reading about Paxos and will try to cover it in a future edition.
Kudos to the OG authors Diego Ongaro and John Ousterhout for writing such a detailed paper.
That's it, folks, for this edition of the newsletter. Please consider liking and sharing it with your friends, as it motivates me to keep bringing you good content for free. If you think I am doing a decent job, share this article with a short summary with your network. Connect with me on LinkedIn or Twitter for more technical posts in the future!
Resources
In Search of an Understandable Consensus Algorithm
Raft Algorithm by Wikipedia
What is the Raft Consensus algorithm by Yugabyte
Best Animation to understand Raft by the creators of the algorithm
One question here: during log replication, what if a majority of the followers fail to write the entry? Does the leader also revert in this case, or is all of this under a transaction boundary?