6 Comments
Jan 29Liked by Vivek Bansal

Are you sure the output of a hash function needs to be shorter length than the input?

I don't think that is correct

Expand full comment
author

Hi Yash,

Thanks for highlighting. It was giving a wrong meaning. I changed it to variable-length output value.

Expand full comment

You mentioned the following:

locate the h(key) on the circle ring and start traversing the ring in the clockwise direction. The first bucket to encounter which is equal to the hash of any datastore node is the target datastore node on which this key should reside.

How is this done efficiently? In terms of Time Complexity, this sounds like an O(n) operation, which is not good

Expand full comment

You’re right; the naive approach could indeed lead to O(n) time complexity, where n is the number of nodes in the ring. To improve efficiency, techniques like binary search or maintaining a sorted list of nodes can be used, reducing the lookup time to O(log n).

Additionally, consistent hashing algorithms often employ techniques like virtual nodes to distribute data more evenly across the ring, further optimizing performance.(This is explained in the post under virtual copies section)

Expand full comment

I didn't get how is adding virtual copies ensuring N/M load per node

Can u explain?

Expand full comment

Here’s how it works:

1. Creating Virtual Copies: For each datastore node, we apply multiple hash functions (let’s say K hash functions) to its name or identifier. This results in K different hash values for each node.

2. Mapping to the Ring: Each of these K hash values is then mapped onto the hash ring. So, instead of a single point on the ring representing each node, there are K points (virtual copies) distributed around the ring for each node.

3. Data Distribution: When a key needs to be mapped to a node, it is hashed using the same K hash functions. Then, starting from the hash value of the key, we traverse the ring clockwise until we encounter the closest virtual copy of a node. This virtual copy is considered the target node for storing or retrieving the data associated with that key.

4. Load Balancing: By creating multiple virtual copies of each node, we increase the likelihood of keys being evenly distributed across the nodes on the ring. This helps in balancing the load more effectively, even if the hash values generated by the hash functions are not perfectly random.

5. Adjusting Replication Factor (K): The number of virtual copies (K) determines the degree of load balancing and replication. A higher value of K increases the number of virtual copies and hence improves load balancing but also increases the storage overhead and lookup time.

Expand full comment