Consistent Hashing

Consistent hashing is a specific hashing technique used in distributed systems to efficiently distribute data across multiple nodes while minimizing the need for reorganization when the number of nodes changes.

What is Hashing?

Think of hashing as a magic box: you place something inside and out pops a short code. The code is usually much smaller than what you started with, which makes it easier to handle. In computer science, hashing is the process of transforming data of any size into a fixed-size value. A hash function does the work: you give it an input, also known as a ‘message’, and it returns a fixed-size string of bytes, the hash value. The same input always produces the same code, a sort of ‘digest’ of the original data. The code is not guaranteed to be unique, because two different inputs can occasionally produce the same hash (a collision), but a good hash function makes that rare. Hash functions are like the Swiss Army knives of computer science. We use them everywhere, from looking up items in large data sets (hash tables), to cryptography, to checking that data hasn’t been tampered with.
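As a small illustration of the "any size in, fixed size out" idea, here is a minimal sketch using SHA-256 from Python's standard hashlib module. The choice of SHA-256 and the helper name digest are assumptions made for this example, not something the technique requires.

```python
import hashlib

def digest(message: str) -> str:
    """Return the SHA-256 hash of a message as a hex string."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()

short_code = digest("a")            # 1-character input
long_code = digest("a" * 10_000)    # 10,000-character input

# Both digests have the same length: 64 hex characters (32 bytes).
print(len(short_code), len(long_code))

# Hashing the same input again gives the same code.
print(digest("a") == short_code)
```

Whatever the input size, the output length stays the same, which is what makes hash values convenient to compare, store, and place on a ring.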

Now, consistent hashing is like dividing up tasks among friends. You have a bunch of friends (nodes), and you want to share the work (data) among them. With consistent hashing, each task (data) gets assigned to a friend based on a special rule, making sure the workload is evenly spread. Plus, if you add or remove friends, only a little reshuffling is needed, which keeps things efficient.

In consistent hashing, each data item is assigned a hash value, and each node in the system is responsible for a range of hash values. When a node is added or removed, only the data in the affected range needs to move (on average about 1/N of the items for N nodes), and everything else stays where it is, which reduces the overall system overhead.

Example of Consistent Hashing

Imagine you’re running a website and you want to distribute incoming requests to different servers. You could pick a server for each request by hashing its content (like the URL) and taking the result modulo the number of servers. But what if you add or remove servers? The number of servers changes, so almost every request suddenly maps to a different server. With consistent hashing, you instead place both the servers and the requests on a circle. Each server is represented by a point on the circle, and each request is directed to the first server point reached by moving clockwise from the request’s position.

              (Server A)
                 / \
                /   \
               /     \
              /       \
             /         \
            /           \
( Request ) -----[   ]------( Request )
            \           /
             \         /
              \       /
               \     /
                \   /
                 \ /
              (Server B)
               /   \
              /     \
             /       \
            /         \
           /           \
( Request ) -----[   ]------( Request )
           \           /
            \         /
             \       /
              \     /
               \   /
                \ /
             (Server C)

In this diagram, the circle represents the consistent hash ring. Each server (A, B, and C) is represented by a point on the circle. When a request comes in, it is directed to the first server point found moving clockwise from the request’s own position on the circle. For example, a request that lands just before server A goes to A, while one that lands between A and B goes to B. If server B were to leave, the requests that would have gone to B would instead go to the next server clockwise, server C, while requests already served by A and C stay where they are.
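The clockwise lookup described here can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the HashRing class, the ring_position helper, the server names, and the use of SHA-256 truncated to 64 bits are all assumptions made for the example. Which server ends up next to which on the ring depends on the hash values, so the order will not necessarily match the A, B, C order in the diagram.

```python
import bisect
import hashlib

def ring_position(key: str) -> int:
    """Map a string to a point on the ring (a 64-bit integer in this sketch)."""
    raw = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(raw[:8], "big")

class HashRing:
    def __init__(self, servers=()):
        self._points = []   # sorted ring positions of all servers
        self._owner = {}    # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        point = ring_position(server)
        if point in self._owner:   # already present (or an unlikely collision)
            return
        bisect.insort(self._points, point)
        self._owner[point] = server

    def remove(self, server: str) -> None:
        point = ring_position(server)
        self._points.remove(point)
        del self._owner[point]

    def lookup(self, key: str) -> str:
        """Return the first server at or after the key's position, wrapping around."""
        if not self._points:
            raise LookupError("ring is empty")
        i = bisect.bisect_left(self._points, ring_position(key))
        return self._owner[self._points[i % len(self._points)]]

ring = HashRing(["server-a", "server-b", "server-c"])
urls = [f"/page/{i}" for i in range(1000)]

before = {url: ring.lookup(url) for url in urls}
ring.remove("server-b")
after = {url: ring.lookup(url) for url in urls}

moved = [url for url in urls if before[url] != after[url]]
print(f"{len(moved)} of {len(urls)} requests changed server")
print("all of them came from server-b:",
      all(before[url] == "server-b" for url in moved))
```

Only requests that used to land on the removed server are reassigned, and they all go to that server's clockwise neighbour.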

Advantages of consistent hashing:

Load Balancing: Each server looks after a range of hash codes, so the work is spread across all of them rather than piling up on one. When a new server joins or one leaves, only the keys in the affected range need to move, so the rest of the system keeps running without a big slowdown.
Scalability: Consistent hashing makes growing the system easy. When you add a new server, only a small part of the keys need to find new spots, letting the system get bigger without a hitch.
Fault Tolerance: If a server breaks down, only the keys it was looking after need to find new homes, and they go to the next server on the ring. This means the breakdown doesn’t mess up the whole system too much.
Reduced Rebalancing: With simpler schemes, such as taking the hash modulo the number of servers, adding or removing a server moves most of the data. Consistent hashing keeps that to a minimum, which makes everything run more smoothly and efficiently.
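To make the rebalancing difference concrete, the sketch below counts how many keys change server when an eleventh server is added to ten, first with a naive "hash modulo number of servers" scheme and then with a hash ring. The helper names, key names, and server names are hypothetical, and SHA-256 is an arbitrary choice of hash. For a uniform hash, the modulo scheme is expected to move about 10 out of every 11 keys in this setup; the ring moves only the keys that fall in the new server's arc, which varies with where its point happens to land.

```python
import bisect
import hashlib

def h(key: str) -> int:
    """Hash a string to a 64-bit integer (SHA-256 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

def mod_assigner(servers):
    """Naive scheme: server index = hash modulo the number of servers."""
    return lambda key: servers[h(key) % len(servers)]

def ring_assigner(servers):
    """Consistent hashing: first server point at or after the key, wrapping around."""
    points = sorted((h(s), s) for s in servers)
    positions = [p for p, _ in points]

    def assign(key):
        i = bisect.bisect_left(positions, h(key)) % len(points)
        return points[i][1]

    return assign

def moved_fraction(make_assigner, old, new, keys):
    """Fraction of keys whose server differs between the old and new server lists."""
    before, after = make_assigner(old), make_assigner(new)
    return sum(before(k) != after(k) for k in keys) / len(keys)

keys = [f"key-{i}" for i in range(10_000)]
old = [f"server-{i}" for i in range(10)]
new = old + ["server-10"]

mod_moved = moved_fraction(mod_assigner, old, new, keys)
ring_moved = moved_fraction(ring_assigner, old, new, keys)

print(f"hash mod N: {mod_moved:.1%} of keys moved")
print(f"hash ring:  {ring_moved:.1%} of keys moved")
```

With the ring, every key that moves goes to the newly added server; no key is shuffled between the existing servers.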

Overall, consistent hashing provides a scalable and fault-tolerant solution for distributing data in distributed systems, making it a popular choice for building large-scale, resilient applications.

Issues with Consistent Hashing

Consistent hashing has its perks, but it’s not all smooth sailing. Here are some common hiccups you might encounter:

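Uneven Load: Despite its goal of spreading the workload evenly, consistent hashing doesn’t always nail it. Sometimes, certain servers end up with more than their fair share of tasks. This can happen because of where the servers land on the ring: with only a few servers, their points can be unevenly spaced, so some own a much larger slice of the circle than others. The basic scheme also ignores capacity, so a server that is equipped to handle more doesn’t automatically get a bigger share.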

Hotspots: Think of hotspots as traffic jams on the information highway. Sometimes, certain servers get bombarded with requests while others sit idle. This can slow things down and create a lopsided use of resources.

Churn Sensitivity: Imagine a game of musical chairs, but with servers. Every time a server joins or leaves the party, there’s a scramble to reshuffle the data. This can cause some turbulence, leading to temporary slowdowns and wobbles in system stability.

Data Rebalancing Headaches: While consistent hashing tries to keep the data shuffle to a minimum, there’s still some heavy lifting involved when servers come and go. This can hog network bandwidth, CPU power, and storage space, putting a strain on the system.

Complexity and Upkeep: Implementing and keeping up with consistent hashing can feel like wrangling a herd of unruly cattle. Dealing with surprises like server crashes or changes in workload takes careful planning and constant vigilance to keep everything running smoothly.

Routing Roadblocks: While consistent hashing is like a GPS for data, sometimes it takes the scenic route. Depending solely on the closest server in the hash ring might not always be the quickest path, leading to detours or longer wait times.

Addressing these problems often involves a combination of careful system design, tuning parameters, implementing heuristics to handle edge cases, and incorporating additional mechanisms for load balancing and fault tolerance. Despite these challenges, consistent hashing remains a valuable tool for distributing data in distributed systems, offering scalability, fault tolerance, and resilience to failures.
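One widely used example of such an additional load-balancing mechanism, not named in the text above, is virtual nodes: each server is placed on the ring many times under different labels, so its share of the circle is made of many small arcs instead of one large one. The sketch below is an illustration under assumptions: the "name#i" labelling, the count of 200 points per server, and all function names are choices made for this example.

```python
import bisect
import hashlib
from collections import Counter

def h(key: str) -> int:
    """Hash a string to a 64-bit integer (SHA-256 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

def build_ring(servers, vnodes):
    """Place each server on the ring `vnodes` times, under labels like 'server-a#0'."""
    return sorted((h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))

def lookup(ring, positions, key):
    """Return the server owning the first point at or after the key, wrapping around."""
    i = bisect.bisect_left(positions, h(key)) % len(ring)
    return ring[i][1]

def load_per_server(servers, vnodes, keys):
    """Count how many keys each server receives for a given number of virtual nodes."""
    ring = build_ring(servers, vnodes)
    positions = [p for p, _ in ring]
    return Counter(lookup(ring, positions, k) for k in keys)

servers = ["server-a", "server-b", "server-c"]
keys = [f"key-{i}" for i in range(30_000)]

single = load_per_server(servers, 1, keys)
virtual = load_per_server(servers, 200, keys)

print("1 point per server:   ", dict(single))
print("200 points per server:", dict(virtual))
```

With many points per server, each server's share tends to sit much closer to an even split. Giving a larger server more points than a smaller one is a common way to account for differences in capacity.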

Alternatives to Consistent Hashing

Here are some other ways to share the load besides Consistent Hashing:

Round Robin: Think of this like taking turns. Requests are handed out to the servers one after another in a fixed, repeating order, so each new request goes to the next server in line. It’s easy to set up, but it doesn’t take into account how busy each server is.
Least Connections: With this, we send each new request to the server that currently has the fewest active connections. This helps make sure no one server gets overwhelmed with too much work.
IP Hashing: Instead of looking at the request, we hash the client’s IP address to decide which server gets the request. This keeps requests from the same client going to the same server, which helps when a server holds session data or cached results for that client.
Weighted Round Robin: Here, we give each server a score based on how much work it can handle. Servers with higher scores get more requests, letting us fine-tune how we spread out the work.
Least Response Time: This sends requests to the server that’s quickest to respond. It helps keep things moving fast by sending requests to the servers that are ready to handle them.
Dynamic Routing Algorithms: These fancy algorithms keep an eye on things like how busy each server is, how fast they’re responding, and how healthy they are. They adjust where requests go in real-time to make sure everything runs smoothly, especially when things change a lot.
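Several of the simpler strategies in this list fit in a few lines each. The sketch below shows one possible way to express round robin, weighted round robin, least connections, and IP hashing in Python. The server names, weights, connection counts, and IP address are made-up illustration values, and the weighted version uses the most naive approach (repeating each server according to its weight), which real load balancers usually refine.

```python
import hashlib
import itertools

servers = ["s1", "s2", "s3"]

# Round robin: hand out servers in a fixed, repeating order.
round_robin = itertools.cycle(servers)
rr_picks = [next(round_robin) for _ in range(4)]

# Weighted round robin (naive): repeat each server as many times as its weight.
def weighted_cycle(weights):
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)

weighted = weighted_cycle({"s1": 2, "s2": 1})
wrr_picks = [next(weighted) for _ in range(6)]

# Least connections: pick the server with the fewest active connections.
def least_connections(active):
    return min(active, key=active.get)

lc_pick = least_connections({"s1": 5, "s2": 2, "s3": 7})

# IP hashing: the same client IP always maps to the same server
# (as long as the server list does not change).
def ip_hash(client_ip, pool):
    raw = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return pool[int.from_bytes(raw[:8], "big") % len(pool)]

ip_pick_first = ip_hash("203.0.113.7", servers)
ip_pick_again = ip_hash("203.0.113.7", servers)

print(rr_picks)    # s1, s2, s3, then back to s1
print(wrr_picks)   # s1 appears twice as often as s2
print(lc_pick)     # s2 has the fewest active connections
print(ip_pick_first == ip_pick_again)
```

Note that the IP hashing shown here uses the modulo scheme, so changing the number of servers remaps most clients; that is the situation consistent hashing is designed to avoid.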

These alternatives offer different trade-offs in terms of simplicity, load balancing effectiveness, and ability to adapt to changing conditions. The choice of routing algorithm depends on factors such as the specific requirements of the application, the characteristics of the server infrastructure, and the desired balance between simplicity and efficiency.
