This article introduces the Circuit Breaker resilience policy in Polly, with a particular focus on HTTP client requests using the .NET resilience extensions and Polly, although the same philosophy applies to Circuit Breaker tuning more generally. It provides a practical guide to integrating the Circuit Breaker policy and fine-tuning its configuration to enhance service resilience and fault tolerance.
What is the Circuit Breaker policy?
The Circuit Breaker (CB) is a resilience strategy that enables a service to fail fast when experiencing significant issues, thereby preventing further strain on itself and downstream dependencies. It temporarily halts the execution of failing operations, allowing time for recovery before subsequent retry attempts. In essence, the Circuit Breaker is designed to:
- Allow the caller to fail fast and handle failures efficiently without repeated, unnecessary attempts
- Protect the callee by reducing pressure during periods of instability
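To illustrate this fail-fast behavior, here is a minimal sketch (assuming Polly v8; the endpoint URL and the option values are illustrative placeholders, not recommendations) that executes an HTTP call through a circuit breaker and handles the `BrokenCircuitException` Polly throws while the circuit is open:

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.CircuitBreaker;

// Build a pipeline with a circuit breaker (values are illustrative only).
var pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
    {
        FailureRatio = 0.5,
        MinimumThroughput = 10,
        SamplingDuration = TimeSpan.FromSeconds(30),
        BreakDuration = TimeSpan.FromSeconds(5),
        ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
            .Handle<HttpRequestException>()
            .HandleResult(response => (int)response.StatusCode >= 500)
    })
    .Build();

using var httpClient = new HttpClient();

try
{
    // While the circuit is closed, the call flows through to the downstream service.
    var response = await pipeline.ExecuteAsync(
        async token => await httpClient.GetAsync("https://downstream.example.com/api/items", token));
}
catch (BrokenCircuitException)
{
    // While the circuit is open, Polly fails fast here without calling the downstream service,
    // giving the callee time to recover. Handle the failure locally (cache, default value, etc.).
}
```

When the circuit opens and for how long is entirely driven by the configuration options described next.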
Basic concepts
The following table outlines the key configuration options for a Circuit Breaker policy:
| Property | Data Type | Default (.NET extension) | Description |
|---|---|---|---|
| `FailureRatio` | Double (0 < x < 1.0) | 0.1 | The failure ratio that will trigger the circuit to open. A value of 0.1 means the circuit will break if 10% of sampled executions fail. |
| `MinimumThroughput` | Integer (x >= 2) | 100 | The minimum number of executions required within the sampling duration for the failure ratio to be considered. |
| `SamplingDuration` | TimeSpan | 30s | The time window over which execution results are sampled to calculate the failure ratio. |
| `BreakDuration` | TimeSpan | 5s | The period the circuit remains open before transitioning to a half-open state for re-evaluation. |
| `ShouldHandle` | Predicate | See Microsoft.Extensions.Http.Resilience vs Polly | A predicate function that defines which exceptions or results are considered failures by the strategy. |
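For HTTP clients, these options surface through the Microsoft.Extensions.Http.Resilience package. The following is a minimal sketch, assuming recent versions of Microsoft.Extensions.Http.Resilience and Polly v8 (the client name "downstream" and the pipeline name are placeholders), that wires a circuit breaker with the table's default values onto a named HttpClient:

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;
using Polly;

var services = new ServiceCollection();

services.AddHttpClient("downstream")
    .AddResilienceHandler("downstream-circuit-breaker", builder =>
    {
        // Values below mirror the defaults listed in the table above.
        builder.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio = 0.1,                          // open after 10% of sampled requests fail
            MinimumThroughput = 100,                     // require 100 requests in the window before evaluating
            SamplingDuration = TimeSpan.FromSeconds(30), // rolling window used to compute the failure ratio
            BreakDuration = TimeSpan.FromSeconds(5)      // time the circuit stays open before going half-open
            // ShouldHandle: HttpCircuitBreakerStrategyOptions ships with an HTTP-aware default predicate
            // for transient failures; override it if your definition of "failure" differs.
        });
    });
```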
In addition to the above parameters, an important concept to understand is Circuit Breaker coverage (CB coverage). This refers to the effectiveness of the circuit breaker in monitoring and protecting the service, and it is influenced by the interplay between `SamplingDuration`, `MinimumThroughput`, and the actual traffic volume.

CB coverage is defined as the percentage of sampling windows in which the configured `MinimumThroughput` is met. In other words, it represents the proportion of time the circuit breaker is actively able to evaluate failures and provide protection. A higher CB coverage means the service is consistently generating enough traffic for the circuit breaker to function effectively.
Tuning goals
- Activate the CB policy only during actual incidents.
- Ensure sufficient CB coverage relative to service traffic.
- Avoid overly aggressive settings that cause false positives.
- Ensure timely recovery after a circuit opens.
To meet these objectives, the Circuit Breaker policy should be carefully tuned with the following considerations (a configuration sketch follows the list):
- Use a relatively high `FailureRatio`: this helps ensure the circuit only opens during meaningful failures, such as a true outage of a downstream service.
- Lower `MinimumThroughput` and `SamplingDuration` moderately: this increases CB coverage, making the policy effective even under low or fluctuating traffic, while also reducing the delay in detecting failures.
- At the same time, `MinimumThroughput` and `SamplingDuration` should not be too low: extremely low values may lead to statistically unreliable failure detection and increase the risk of false positives.
- Keep `BreakDuration` reasonably short: a shorter break time allows for quicker recovery once the downstream service stabilizes, reducing the chance of unnecessarily blocking healthy requests.
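Taken together, these considerations might translate into starting values like the sketch below (assuming Polly v8's `CircuitBreakerStrategyOptions`; the numbers are illustrative starting points, not universal recommendations, and should be validated with the fine-tuning steps later in this article):

```csharp
using System;
using System.Net.Http;
using Polly.CircuitBreaker;

// Illustrative starting point reflecting the tuning considerations above.
var tunedOptions = new CircuitBreakerStrategyOptions<HttpResponseMessage>
{
    FailureRatio = 0.5,                          // relatively high: only open on meaningful failures
    MinimumThroughput = 50,                      // moderate: enough samples for a statistically useful ratio
    SamplingDuration = TimeSpan.FromSeconds(30), // moderate window: decent coverage, timely detection
    BreakDuration = TimeSpan.FromSeconds(5)      // short: recover quickly once the callee stabilizes
};
```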
Proposed Best Practices
Usage principles before fine-tuning
- Use a circuit breaker when the downstream service exposes multiple endpoints – This setup allows the circuit breaker to work effectively alongside strategies like hedging or fallback, providing resilience without broadly blocking functionality.
- Avoid using a circuit breaker for services with a single critical endpoint – In such cases, opening the circuit could block all traffic, even if some requests might still succeed. However, exceptions can be made when failure tolerance is high or the downstream service experiences significant load and needs protection.
- Isolate circuit breakers by request host, API path, or other dimensions – This ensures that issues with one endpoint or host don’t cascade and affect unrelated traffic, allowing for more granular and targeted resilience behavior. A sketch of this isolation follows below.
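One simple way to achieve this isolation is to give each downstream host (or logical endpoint group) its own named HttpClient with its own resilience handler, so each gets independent circuit state. This is a sketch under that assumption; "catalog" and "payments" are hypothetical client names and URLs:

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;
using Polly;

var services = new ServiceCollection();

// Hypothetical helper: one circuit breaker per named client keeps circuit state isolated per host.
static void AddIsolatedCircuitBreaker(IServiceCollection services, string clientName, Uri baseAddress)
{
    services.AddHttpClient(clientName, client => client.BaseAddress = baseAddress)
        .AddResilienceHandler($"{clientName}-circuit-breaker", builder =>
            builder.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions()));
}

// A failure storm on one host opens only that host's circuit; the other client keeps flowing.
AddIsolatedCircuitBreaker(services, "catalog", new Uri("https://catalog.example.com"));
AddIsolatedCircuitBreaker(services, "payments", new Uri("https://payments.example.com"));
```

If you need finer granularity (for example, per API path), you can register additional named clients along the same lines, or look into the pipeline-selection options available in your version of Microsoft.Extensions.Http.Resilience.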
Settle on the FailureRatio
Set a relatively high `FailureRatio` to capture non-transient issues. Ideally, this threshold should be defined by the downstream (callee) service, as they best understand the behavior and availability expectations of their APIs. They can specify the minimum availability level at which it still makes sense to allow traffic.
If no guidance is available from the downstream service, a good starting point is a `FailureRatio` of 0.5 or higher. This is a conservative default that helps avoid prematurely opening the circuit due to transient or low-severity failures.
Settle on the BreakDuration
We recommend starting with a small `BreakDuration` and increasing it as needed based on observed behavior. A shorter duration enables faster recovery, reducing the risk of unnecessarily blocking healthy requests once the downstream service stabilizes.
Based on our experience with a wide range of services, a default starting point of 5 seconds is typically effective and strikes a good balance between protection and responsiveness.
Fine-tune to get a proper MinimumThroughput and SamplingDuration pair
These two configurations are the key factors in determining whether a Circuit Breaker is configured effectively.
Prerequisites
Before following the fine-tuning steps below, we strongly recommend identifying the distinct traffic patterns of your services across environments and locations, because you will need to fine-tune and produce configuration value pairs for each of those patterns, instead of relying on a single set of configurations for all deployment instances.
Fine-tune steps
1. Given a suggested `SamplingDuration` time window, find the `MinimumThroughput` at which the failure rate stabilizes.
   - a. Set the initial `SamplingDuration` to the minimum value of 30s, which is the recommended default from .NET.
   - b. Analyze your service logs during peak traffic to visualize the relationship between failure rate and throughput (a sketch of this analysis follows this step).
     - Based on your service logs, divide the observed time range into consecutive windows based on the `SamplingDuration` (e.g., 30-second intervals), then count the request traffic and the request success ratio for each time window. Plot these data points on a line chart showing failure rate against throughput over time.
     - When filtering failed requests, apply the Circuit Breaker's `ShouldHandle` predicate to include only relevant failure types (refer to the table at the start of this article), and focus only on the specific scenario/API (since each should use a separate circuit).
   - c. Identify the `MinimumThroughput` using the conditions below:
     - Choose the throughput value at which the failure rate begins to stabilize, indicating statistical reliability.
     - Ensure that the `MinimumThroughput` is greater than `SamplingDuration * 1` (i.e., at least 1 request per second). If it is lower, the traffic volume is too low to justify a circuit breaker policy.
     - If no suitable `MinimumThroughput` is found, increase the `SamplingDuration` (e.g., to 45 seconds) and repeat this analysis. The `MinimumThroughput` should be sufficiently large to reveal a stable failure rate with statistical significance.

   By following these steps, we start with a small `SamplingDuration` time window and gradually increase it to accurately understand the real relationship between throughput and failure rate for your services.

   Pic 1: A sample chart to show the relationship between failure rate and throughput
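As a sketch of step 1.b, assuming you can export request logs as timestamped records with a flag for whether the Circuit Breaker's `ShouldHandle` predicate would treat the request as a failure (the `RequestLog` record and its fields are hypothetical; in practice you may run the same aggregation in your log analytics tool), the following groups logs into `SamplingDuration`-sized windows and computes the throughput and failure rate per window, ready for charting:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical log record: one entry per request, with whether the circuit breaker's
// ShouldHandle predicate would have counted it as a failure.
public sealed record RequestLog(DateTimeOffset Timestamp, bool IsHandledFailure);

public static class WindowStats
{
    // Groups logs into consecutive SamplingDuration-sized windows and returns
    // (window start, throughput, failure rate) tuples for charting.
    public static IReadOnlyList<(DateTimeOffset WindowStart, int Throughput, double FailureRate)>
        Compute(IEnumerable<RequestLog> logs, TimeSpan samplingDuration)
    {
        return logs
            .GroupBy(log => new DateTimeOffset(
                log.Timestamp.UtcTicks - (log.Timestamp.UtcTicks % samplingDuration.Ticks),
                TimeSpan.Zero))
            .OrderBy(group => group.Key)
            .Select(group => (
                WindowStart: group.Key,
                Throughput: group.Count(),
                FailureRate: group.Count(l => l.IsHandledFailure) / (double)group.Count()))
            .ToList();
    }
}
```

Plotting FailureRate against Throughput across these windows gives the chart used in step 1.c to spot where the failure rate stabilizes.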
2. The value pair should provide a proper CB coverage.
   - a. Calculate the CB coverage for your chosen `MinimumThroughput` and `SamplingDuration` pair from the previous step (see the coverage sketch after this step).
     - Using your service logs, divide the time range into windows based on the current `SamplingDuration`.
     - For each window, count the number of requests and check whether it meets or exceeds the `MinimumThroughput`.
     - Calculate the percentage of these time windows that satisfy the `MinimumThroughput` condition.
     - This percentage represents the CB coverage: the proportion of time during which the circuit breaker can actively monitor and protect the service.
   - b. If the result is below 50% (0.5), increase the `SamplingDuration` and repeat the process above until the CB coverage exceeds 50%, ensuring the circuit breaker is effective for most of the traffic.

   Pic 2: A sample pie chart that shows the CB coverage check
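Continuing the earlier sketch (and reusing the hypothetical `WindowStats.Compute` helper), CB coverage is simply the share of windows whose request count reaches the candidate `MinimumThroughput`:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CoverageCheck
{
    // CB coverage = fraction of sampling windows whose request count reaches MinimumThroughput.
    public static double Compute(
        IReadOnlyList<(DateTimeOffset WindowStart, int Throughput, double FailureRate)> windows,
        int minimumThroughput)
    {
        if (windows.Count == 0)
        {
            return 0.0;
        }

        return windows.Count(w => w.Throughput >= minimumThroughput) / (double)windows.Count;
    }
}

// Example usage (hypothetical values):
//   var windows = WindowStats.Compute(logs, TimeSpan.FromSeconds(30));
//   double coverage = CoverageCheck.Compute(windows, minimumThroughput: 50); // aim for > 0.5
```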
Some potential problems while fine-tuning
Too large SamplingDuration
It is recommended to keep the SamplingDuration at no more than 5 minutes.
If, after increasing `SamplingDuration` up to this limit, you still do not observe stabilization of the failure rate as described in step 1.c, this may indicate one of the following:
- The traffic volume is too low to provide statistically significant data for reliable circuit breaker tuning.
- The downstream service is highly unstable, with frequent failures over extended periods, making it unsuitable for a circuit breaker policy that assumes failures are transient and limited to live-site incidents.
Low circuit breaker coverage
If CB coverage remains low even after increasing `SamplingDuration`, consider the following steps:
- Check for traffic imbalance across pods or machines. Circuit Breaker policies in .NET with Polly operate at the individual pod or machine level and assume a reasonably balanced load. If some pods receive disproportionately low traffic, the circuit breaker on those pods may rarely activate, leading to low overall coverage. Address any load-balancing issues to ensure more even traffic distribution before tuning the circuit breaker further.
- Carefully lower the `MinimumThroughput` to increase coverage. This involves a trade-off between improved CB coverage and the risk of introducing noise (false positives). When decreasing `MinimumThroughput`, ensure that:
  - The additional noise introduced does not cause the observed failure rate to exceed (ideally, it should remain well below) the configured `FailureRatio`.
  - The `MinimumThroughput` remains above the threshold of at least 1 request per second (`MinimumThroughput` > `SamplingDuration * 1`), ensuring statistical significance.
Pic 3: Relationship between failure rate and throughput, and the potential buffer for decreasing MinimumThroughput
The sample above is drawn from 30-second sampling windows. If you find it hard to reach a high CB coverage with `MinimumThroughput` >= 50, usually because your service's traffic is not that large, you can consider decreasing `MinimumThroughput` to a value within the "Decrease Interval". This provides higher coverage (because more time slices can reach the new, lower `MinimumThroughput` bar) with acceptable noise (all time-slice data points in this range have failure rates lower than your configured `FailureRatio`). A verification sketch follows below.
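As a sketch of that verification (reusing the hypothetical per-window stats from the earlier sketches; the current and candidate thresholds are values you read off your own chart), the snippet below checks that every window newly admitted by the lower bar keeps its failure rate below the configured `FailureRatio`:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DecreaseIntervalCheck
{
    // Returns true if lowering MinimumThroughput from currentMin to candidateMin is acceptable:
    // every window the lower bar newly admits must have a failure rate below the configured FailureRatio.
    public static bool IsAcceptable(
        IReadOnlyList<(DateTimeOffset WindowStart, int Throughput, double FailureRate)> windows,
        int currentMin,
        int candidateMin,
        double failureRatio)
    {
        var newlyIncluded = windows
            .Where(w => w.Throughput >= candidateMin && w.Throughput < currentMin)
            .ToList();

        return newlyIncluded.All(w => w.FailureRate < failureRatio);
    }
}
```

After this check passes, re-run the coverage calculation with the candidate value to confirm coverage now exceeds 50%, and keep the candidate above the 1-request-per-second floor (`MinimumThroughput` > `SamplingDuration * 1`).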
Enhancement Idea: AI-Driven Adaptive Circuit Breaker via MCP Integration
Great write-up! One idea worth exploring is integrating a lightweight MCP (Modular Cognitive Process) agent alongside Polly to dynamically fine-tune circuit breaker thresholds based on real-time telemetry, traffic patterns, or ML insights.
Instead of static thresholds (e.g., 5 failures in 30s), an MCP can:
• Adjust break duration and failure limits based on time-of-day, service tier, or risk profile.
• Learn from past outages and proactively shift policy behavior.
• Coordinate circuit states across services to prevent cascading failures.
Adding MCP to circuit breaker design turns a reactive safety mechanism into a proactive intelligent decision system.
Yes, dynamic tuning of the threshold would work better than a static configuration, especially in scenarios with poor load balancing, where each pod's or instance's traffic varies a lot and a static threshold doesn't work well. And when it comes to proactive design and dynamic tuning, AI could definitely help 🙂