How to implement a Circuit Breaker
This article takes a deep dive into how to implement the circuit breaker design pattern.
What is a Circuit Breaker?
Imagine you’re at home and plug too many devices into a single socket. The socket will draw so much current that the wires behind it might overheat and cause a fire. To prevent this situation, a circuit breaker is installed in your home’s electrical panel.
Software systems face a similar problem. The circuit breaker is a design pattern that detects when a system or service is becoming overwhelmed by a large number of requests and then prevents further requests from being sent to it. This helps prevent cascading failures, reduces the load on the struggling system, and gives it time to recover.
For example, imagine that we own an e-commerce system with two services: an order service and an inventory service. The inventory service manages the database that persists the inventory. The order service calls the inventory service to deduct stock so that it can create an order.
Without Circuit Breaker:
Suppose a query in the inventory database is taking too long to execute. Because of this, new incoming requests cannot acquire locks on certain tables, since the long-running query has not released them yet.
This situation would lead to a delay in the execution of all the incoming requests.
Worse: instead of giving the database time to recover, we would keep sending it the same amount of traffic, which could easily take it down.
This would ultimately lead to cascading failures. The inventory service would return a failure response to the order service, which means that the order cannot be created. Thus, our whole e-commerce system would be down.
With Circuit Breaker:
The inventory service would detect that there are delays in getting responses from the database.
Then, the inventory service would stop sending traffic to the database to allow it to recover. This state is known as Open Circuit.
The queries that are not sent to the database can be marked as failed, and a suitable response can be returned to let the client (the order service) know that the downstream system is overwhelmed right now and that it should retry after some time.
Meanwhile, the inventory service would probe the database periodically to see whether it has recovered and whether traffic can resume as before.
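As an illustration of the "suitable response" mentioned above, here is a minimal sketch that assumes the inventory service speaks HTTP. The names ErrCircuitOpen, statusFor, and writeResult are hypothetical and not taken from the original code; the idea is simply to answer with 503 Service Unavailable and a Retry-After header when the circuit is open.

```go
package main

import (
	"errors"
	"net/http"
	"strconv"
)

// ErrCircuitOpen is a hypothetical sentinel error meaning "the request was
// rejected because the circuit is open".
var ErrCircuitOpen = errors.New("circuit breaker is open")

// statusFor maps the outcome of a guarded call to an HTTP status code.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrCircuitOpen):
		return http.StatusServiceUnavailable // downstream is overwhelmed
	default:
		return http.StatusInternalServerError
	}
}

// writeResult writes the response for a guarded call. When the circuit is
// open it also tells the client how many seconds to wait before retrying.
func writeResult(w http.ResponseWriter, err error, retryAfterSeconds int) {
	status := statusFor(err)
	if status == http.StatusServiceUnavailable {
		w.Header().Set("Retry-After", strconv.Itoa(retryAfterSeconds))
	}
	if err != nil {
		http.Error(w, err.Error(), status)
		return
	}
	w.WriteHeader(status)
}
```

A handler would call writeResult with the error returned by the circuit breaker. Services that use another protocol would map the same condition to that protocol's "unavailable" status instead.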
Now that we have discussed how the circuit breaker works, the rest boils down to how a service detects that its downstream is struggling, opens the circuit breaker, and then closes it again after some time.
Let’s dig into the implementation now.
Circuit Breaker States
A circuit breaker can be in only one of the three states:
Closed: This state means that the downstream system can handle the incoming traffic and everything works fine.
Half Open: The circuit enters this state after it has been open for a certain time. The very next request decides whether the circuit moves to “Closed” or back to “Open”.
Open: This state means that the downstream system is currently overwhelmed and traffic should not be sent to it, so that it can recover.
Refer to the following state machine diagram to understand how state movement happens between all the states.
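The transitions described in this article can also be written down as a small pure function. This is only a sketch: the Event type and its values are hypothetical names introduced for illustration, and combinations that the article does not describe leave the state unchanged.

```go
package main

// State is the circuit breaker's current mode.
type State int

const (
	Closed State = iota
	HalfOpen
	Open
)

// Event is a hypothetical name for the things that can move the circuit
// from one state to another.
type Event int

const (
	Success          Event = iota // downstream call succeeded
	Failure                       // downstream call failed, threshold not reached
	ThresholdReached              // consecutive failures reached maxFailures
	TimeoutElapsed                // resetTimeout passed while the circuit was Open
)

// next returns the state that follows s when event e occurs.
// Combinations that are not listed leave the state unchanged.
func next(s State, e Event) State {
	switch {
	case s == Closed && e == ThresholdReached:
		return Open
	case s == HalfOpen && e == Success:
		return Closed
	case s == HalfOpen && (e == Failure || e == ThresholdReached):
		return Open
	case s == Open && e == TimeoutElapsed:
		return HalfOpen
	}
	return s
}
```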
Implementation
The CircuitBreaker Structure
A few points:
The CircuitBreaker structure holds everything needed for the implementation.
Then, I have defined a constructor that accepts maxFailures, the maximum number of consecutive failures, and resetTimeout, the time the circuit should be kept open before moving to the HalfOpen state.
Inside the constructor, I’ve started another goroutine that receives signals through a channel, sleeps for the resetTimeout interval, and then moves the state to HalfOpen. We will see its implementation later.
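A minimal sketch of a structure and constructor matching the points above might look like this. The names maxFailures, resetTimeout, failureCount, openChannel, and openWatcher come from the text; the exact field layout, the buffered channel, and the body of openWatcher are assumptions.

```go
package main

import (
	"sync"
	"time"
)

// State is the circuit breaker's current mode.
type State int

const (
	Closed State = iota
	HalfOpen
	Open
)

// CircuitBreaker holds everything the pattern needs.
type CircuitBreaker struct {
	mu           sync.Mutex    // guards state and failureCount
	state        State         // zero value is Closed
	failureCount int           // consecutive failures seen so far
	maxFailures  int           // threshold that opens the circuit
	resetTimeout time.Duration // how long the circuit stays Open
	openChannel  chan struct{} // signals "the circuit was just opened"
}

// NewCircuitBreaker builds a breaker in the Closed state and starts the
// background goroutine that later moves it from Open to HalfOpen.
func NewCircuitBreaker(maxFailures int, resetTimeout time.Duration) *CircuitBreaker {
	cb := &CircuitBreaker{
		state:        Closed,
		maxFailures:  maxFailures,
		resetTimeout: resetTimeout,
		openChannel:  make(chan struct{}, 1), // assumption: buffered by one
	}
	go cb.openWatcher()
	return cb
}

// openWatcher waits for a signal, sleeps for resetTimeout, and then moves
// the circuit to HalfOpen with a fresh failure count.
func (cb *CircuitBreaker) openWatcher() {
	for range cb.openChannel {
		time.Sleep(cb.resetTimeout)
		cb.mu.Lock()
		cb.state = HalfOpen
		cb.failureCount = 0
		cb.mu.Unlock()
	}
}
```

In this sketch the watcher goroutine lives for as long as the program runs, which is fine for a demo. A longer-lived service would probably want a way to stop it.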
The CircuitBreaker States
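One way to declare the three states in Go is an integer type with named constants. The String method is an addition for readable log output and is an assumption, not something taken from the article.

```go
package main

// State is the circuit breaker's current mode.
type State int

const (
	Closed   State = iota // traffic flows normally
	HalfOpen              // the next request decides Closed or Open
	Open                  // requests fail fast
)

// String returns a readable name so that states print nicely in logs.
func (s State) String() string {
	switch s {
	case Closed:
		return "Closed"
	case HalfOpen:
		return "HalfOpen"
	case Open:
		return "Open"
	default:
		return "Unknown"
	}
}
```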
The Execute and openWatcher functions
A few points:
Check Circuit State:
First, check if the circuit is already in the "Open" state.
If it is, immediately fail any incoming requests to prevent further strain on the system. This is known as the "Fail Fast" approach.
Process Requests:
If the circuit is not "Open," proceed to call the downstream service.
Use the downstream response to determine how to update the circuit breaker’s state.
Locking for Safety:
Before updating the state or the failure count, acquire a mutex lock to prevent race conditions.
Use the defer keyword to ensure the mutex is released just before the function exits.
Circuit States:
At this point, the circuit can only be in one of two states: "HalfOpen" or "Closed."
Handling No Errors:
If there are no errors from the downstream service, set the circuit state to "Closed" and reset the failureCount to 0.
Handling Errors:
If there is an error:
If HalfOpen:
Set the circuit state to "Open" and notify openChannel to put the circuit to sleep for a resetTimeout interval. After this interval, set the circuit state back to "HalfOpen" and reset the failureCount to 0.
If Closed:
Increment the failureCount. If the failureCount reaches or exceeds the maxFailures threshold, change the circuit state to "Open" and send a signal via openChannel to handle the state transition as described above.
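The steps above can be sketched as follows. This is one possible reading of the description, not the original code. ErrCircuitOpen and the State accessor are hypothetical names, and the sketch assumes that a result arriving after another goroutine has already opened the circuit leaves the state untouched.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

type State int

const (
	Closed State = iota
	HalfOpen
	Open
)

// ErrCircuitOpen is a hypothetical error returned when a request fails fast.
var ErrCircuitOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
	mu           sync.Mutex
	state        State
	failureCount int
	maxFailures  int
	resetTimeout time.Duration
	openChannel  chan struct{} // buffered so the sender never blocks
}

func NewCircuitBreaker(maxFailures int, resetTimeout time.Duration) *CircuitBreaker {
	cb := &CircuitBreaker{
		maxFailures:  maxFailures,
		resetTimeout: resetTimeout,
		openChannel:  make(chan struct{}, 1),
	}
	go cb.openWatcher()
	return cb
}

// State returns the current state under the lock.
func (cb *CircuitBreaker) State() State {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	return cb.state
}

// Execute runs fn unless the circuit is Open, then updates the breaker.
func (cb *CircuitBreaker) Execute(fn func() error) error {
	if cb.State() == Open {
		return ErrCircuitOpen // fail fast without touching downstream
	}

	err := fn() // the lock is not held during the downstream call

	cb.mu.Lock()
	defer cb.mu.Unlock()

	if cb.state == Open {
		// Assumption: another goroutine opened the circuit while fn ran,
		// so leave the state alone and let openWatcher reset it.
		return err
	}
	if err == nil {
		cb.state = Closed
		cb.failureCount = 0
		return nil
	}
	if cb.state == Closed {
		cb.failureCount++
		if cb.failureCount < cb.maxFailures {
			return err // still under the threshold, stay Closed
		}
	}
	// Either a HalfOpen request failed or the threshold was reached.
	cb.state = Open
	cb.openChannel <- struct{}{}
	return err
}

// openWatcher waits for an "opened" signal, sleeps for resetTimeout,
// then moves the circuit to HalfOpen with a fresh failure count.
func (cb *CircuitBreaker) openWatcher() {
	for range cb.openChannel {
		time.Sleep(cb.resetTimeout)
		cb.mu.Lock()
		cb.state = HalfOpen
		cb.failureCount = 0
		cb.mu.Unlock()
	}
}
```

The one-slot buffer is enough here because a new signal is sent only when the circuit moves into Open, and it cannot leave Open until openWatcher has consumed the previous signal.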
The callDownstream function
A few points:
wg.Done() decrements the WaitGroup counter, indicating that a goroutine has finished. The calling goroutine uses wg.Wait() and blocks until the counter reaches 0.
The callDownstream function simulates successful and failed requests.
The sleep timer represents the real-time delay between any two requests.
The main function
The main function initializes the circuit breaker with maxFailures set to 3 and resetTimeout set to 100ms.
Then, we use a WaitGroup and loop 20 times to represent 20 downstream calls. Each call is added to the WaitGroup and a goroutine is started for it to simulate requests being processed concurrently.
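The callDownstream function and the loop in main can be sketched together as below. Several things here are assumptions made to keep the example self-contained and deterministic: the execute parameter stands in for the circuit breaker's Execute method, every third request is made to fail, and the delay grows with the request id. The names errDownstream and runRequests are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errDownstream is a hypothetical error standing in for a real failure.
var errDownstream = errors.New("downstream failure")

// callDownstream simulates one request. As an assumption for this sketch,
// every third request fails; the original simulation may differ.
func callDownstream(id int, execute func(func() error) error, wg *sync.WaitGroup, results []error) {
	defer wg.Done() // mark this goroutine as finished when the function returns

	time.Sleep(time.Duration(id) * time.Millisecond) // delay between requests

	err := execute(func() error {
		if id%3 == 0 {
			return errDownstream
		}
		return nil
	})
	results[id] = err // each goroutine writes only to its own slot
	fmt.Printf("request %d: err=%v\n", id, err)
}

// runRequests fires n concurrent requests and waits for all of them,
// mirroring the loop in main. execute stands in for cb.Execute.
func runRequests(n int, execute func(func() error) error) []error {
	var wg sync.WaitGroup
	results := make([]error, n)
	for id := 0; id < n; id++ {
		wg.Add(1) // register the call before starting its goroutine
		go callDownstream(id, execute, &wg, results)
	}
	wg.Wait() // blocks until every callDownstream has called wg.Done
	return results
}
```

With a breaker whose Execute method has the signature func(func() error) error, main would reduce to creating the breaker and calling runRequests(20, cb.Execute).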
Output
As you can see, the output of this execution varies from run to run because all the goroutines run concurrently, so you can’t predict the order in which they execute.
Summary
The circuit breaker is a key design pattern for building resilient, fault-tolerant systems. Many companies that run large-scale systems implement it and deliberately run failover drills to verify that their systems are resilient.
Throughout the code, I’ve tried to structure things so that when requests start succeeding, they don’t have to fight or wait for resources, i.e., resource contention is kept to a minimum. On the other hand, if the circuit is already in the Open state, it can be acceptable to fail a few more requests than strictly necessary in order to avoid lock contention.
This implementation is a classic example of a high-throughput, fail-fast system, which is often suitable for large-scale distributed systems.
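The lock-contention trade-off mentioned above could be pushed one step further by keeping the state in an atomic integer, so that the fail-fast check takes no lock at all. This is an alternative sketch, not the approach used in this article’s code; lockFreeGate and its methods are hypothetical names, and it relies on the atomic.Int32 type available since Go 1.19.

```go
package main

import "sync/atomic"

const (
	stateClosed int32 = iota
	stateHalfOpen
	stateOpen
)

// lockFreeGate is a hypothetical type that tracks only the circuit state.
type lockFreeGate struct {
	state atomic.Int32 // zero value is stateClosed
}

// allow reports whether a request may proceed. It takes no lock, so
// requests never wait on each other just to read the state.
func (g *lockFreeGate) allow() bool {
	return g.state.Load() != stateOpen
}

// trip opens the circuit and reports whether this call was the one that
// changed the state, so that only one caller signals the watcher.
func (g *lockFreeGate) trip() bool {
	return g.state.Swap(stateOpen) != stateOpen
}

// halfOpen moves the circuit from Open to HalfOpen after the timeout.
// It does nothing if the circuit is no longer Open.
func (g *lockFreeGate) halfOpen() {
	g.state.CompareAndSwap(stateOpen, stateHalfOpen)
}

// reset closes the circuit after a successful request.
func (g *lockFreeGate) reset() {
	g.state.Store(stateClosed)
}
```

The failure counter would need the same treatment, for example with another atomic integer, which is left out here to keep the sketch short.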
Do it Yourself
I challenge you to implement this in your language of choice and share the GitHub or online IDE link in the comments. Let’s discuss more in the comments.
Here’s the Go Playground link to the executable code: Circuit Breaker Implementation. Copy it into your IDE and try changing maxFailures, resetTimeout, or the delay between two requests, and see what happens.
Kudos 🎉 to this article by Homayoon for a great write-up.
Hope you liked this edition.
Please consider liking 👍 and sharing with your friends as it motivates me to bring you good content. If you think I am doing a decent job, share this article with a short summary with your network. Connect with me on LinkedIn or Twitter for more technical posts in the future!