What is Consistent Hashing?

Vivek Bansal

Dec 2, 2023

This article deep dives into the consistent hashing algorithm - what it is and how it's helpful in real-world use cases

Read →

11 Comments

Antonio Cota

Jan 1

Very detailed article. Just one question:

> hashing the datastore name (or ID) with K different hash functions to get h1(D0), h2(D0), h3(D0) … hK(D0)

Why should I use many different hash functions for each virtual node? Shouldn't the hash function be always the same for keys, nodes and virtual nodes?

Expand full comment

Reply (1)

Vivek Bansal

Jan 1

You can use different hash functions to remove the biases of a single hash function and increase randomness by using different hash functions.

The more randomness is there, the better the chances of a uniform load distribution on the hash ring.

Expand full comment

Yash Agarwal

Jan 29, 2024

Are you sure the output of a hash function needs to be shorter length than the input?

I don't think that is correct

Expand full comment

Reply (1)

Vivek Bansal

Jan 30, 2024

Hi Yash,

Thanks for highlighting. It was giving a wrong meaning. I changed it to variable-length output value.

Expand full comment

Reply (1)

Antonio Cota

Jan 1

Isn't the output of an hash function of fixed length generally?

Expand full comment

Reply (1)

Vivek Bansal

Jan 1Edited

No, the output of a hash function is not fixed length. It can be of variable length.

Expand full comment

Pawan Kumar

Oct 5, 2024

How much virtual node works here? i want to know if D0 has three VN(D01,D02,D03).. all nodes conatins same copy of data?? I mean they will sync data in async way while all accept write?? or each virtual act like normal node?

Expand full comment

Rikesh Agarwal

Mar 12, 2024

You mentioned the following:

locate the h(key) on the circle ring and start traversing the ring in the clockwise direction. The first bucket to encounter which is equal to the hash of any datastore node is the target datastore node on which this key should reside.

How is this done efficiently? In terms of Time Complexity, this sounds like an O(n) operation, which is not good

Expand full comment

Reply (1)

Rishabh

Mar 24, 2024

You’re right; the naive approach could indeed lead to O(n) time complexity, where n is the number of nodes in the ring. To improve efficiency, techniques like binary search or maintaining a sorted list of nodes can be used, reducing the lookup time to O(log n).

Additionally, consistent hashing algorithms often employ techniques like virtual nodes to distribute data more evenly across the ring, further optimizing performance.(This is explained in the post under virtual copies section)

Expand full comment

Sanju

Dec 27, 2023

I didn't get how is adding virtual copies ensuring N/M load per node

Can u explain?

Expand full comment

Reply (1)

Rishabh

Mar 24, 2024

Here’s how it works:

1. Creating Virtual Copies: For each datastore node, we apply multiple hash functions (let’s say K hash functions) to its name or identifier. This results in K different hash values for each node.

2. Mapping to the Ring: Each of these K hash values is then mapped onto the hash ring. So, instead of a single point on the ring representing each node, there are K points (virtual copies) distributed around the ring for each node.

3. Data Distribution: When a key needs to be mapped to a node, it is hashed using the same K hash functions. Then, starting from the hash value of the key, we traverse the ring clockwise until we encounter the closest virtual copy of a node. This virtual copy is considered the target node for storing or retrieving the data associated with that key.

4. Load Balancing: By creating multiple virtual copies of each node, we increase the likelihood of keys being evenly distributed across the nodes on the ring. This helps in balancing the load more effectively, even if the hash values generated by the hash functions are not perfectly random.

5. Adjusting Replication Factor (K): The number of virtual copies (K) determines the degree of load balancing and replication. A higher value of K increases the number of virtual copies and hence improves load balancing but also increases the storage overhead and lookup time.

Expand full comment

Curious Engineer

What is Consistent Hashing?