This article introduces the Circuit Breaker resilience policy in Polly, with a particular focus on HTTP client requests using the .NET resilience extensions and Polly, although the same philosophy applies to Circuit Breaker tuning more generally. It provides a practical guide to integrating the Circuit Breaker policy and fine-tuning its configuration to enhance service resilience and fault tolerance.
What is the Circuit Breaker policy?
The Circuit Breaker (CB) is a resilience strategy that enables a service to fail fast when experiencing significant issues, thereby preventing further strain on itself and downstream dependencies. It temporarily halts the execution of failing operations, allowing time for recovery before subsequent retry attempts. In essence, the Circuit Breaker is designed to:
- Allow the caller to fail fast and handle failures efficiently without repeated, unnecessary attempts
- Protect the callee by reducing pressure during periods of instability
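To illustrate this fail-fast behavior, here is a minimal sketch (assuming Polly v8; the endpoint URL and the option values are illustrative placeholders, not recommendations) that executes an HTTP call through a circuit breaker and handles the `BrokenCircuitException` Polly throws while the circuit is open:

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.CircuitBreaker;

// Build a pipeline with a circuit breaker (values are illustrative only).
var pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
    {
        FailureRatio = 0.5,
        MinimumThroughput = 10,
        SamplingDuration = TimeSpan.FromSeconds(30),
        BreakDuration = TimeSpan.FromSeconds(5),
        ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
            .Handle<HttpRequestException>()
            .HandleResult(response => (int)response.StatusCode >= 500)
    })
    .Build();

using var httpClient = new HttpClient();

try
{
    // While the circuit is closed, the call flows through to the downstream service.
    var response = await pipeline.ExecuteAsync(
        async token => await httpClient.GetAsync("https://downstream.example.com/api/items", token));
}
catch (BrokenCircuitException)
{
    // While the circuit is open, Polly fails fast here without calling the downstream service,
    // giving the callee time to recover. Handle the failure locally (cache, default value, etc.).
}
```

When the circuit opens and for how long is entirely driven by the configuration options described next.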
Basic concepts
The following table outlines the key configuration options for a Circuit Breaker policy:
| Property | Data Type | Default (.NET extension) | Description |
|---|---|---|---|
| `FailureRatio` | Double (0 < x < 1.0) | 0.1 | The failure ratio that will trigger the circuit to open. A value of 0.1 means the circuit will break if 10% of sampled executions fail. |
| `MinimumThroughput` | Integer (x >= 2) | 100 | The minimum number of executions required within the sampling duration for the failure ratio to be considered. |
| `SamplingDuration` | TimeSpan | 30s | The time window over which execution results are sampled to calculate the failure ratio. |
| `BreakDuration` | TimeSpan | 5s | The period the circuit remains open before transitioning to a half-open state for re-evaluation. |
| `ShouldHandle` | Predicate | See Microsoft.Extensions.Http.Resilience vs Polly | A predicate function that defines which exceptions or results are considered failures by the strategy. |
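For HTTP clients, these options surface through the Microsoft.Extensions.Http.Resilience package. The following is a minimal sketch, assuming recent versions of Microsoft.Extensions.Http.Resilience and Polly v8 (the client name "downstream" and the pipeline name are placeholders), that wires a circuit breaker with the table's default values onto a named HttpClient:

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;
using Polly;

var services = new ServiceCollection();

services.AddHttpClient("downstream")
    .AddResilienceHandler("downstream-circuit-breaker", builder =>
    {
        // Values below mirror the defaults listed in the table above.
        builder.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio = 0.1,                          // open after 10% of sampled requests fail
            MinimumThroughput = 100,                     // require 100 requests in the window before evaluating
            SamplingDuration = TimeSpan.FromSeconds(30), // rolling window used to compute the failure ratio
            BreakDuration = TimeSpan.FromSeconds(5)      // time the circuit stays open before going half-open
            // ShouldHandle: HttpCircuitBreakerStrategyOptions ships with an HTTP-aware default predicate
            // for transient failures; override it if your definition of "failure" differs.
        });
    });
```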
In addition to the above parameters, an important concept to understand is Circuit Breaker coverage (CB coverage). This refers to the effectiveness of the circuit breaker in monitoring and protecting the service, and it is influenced by the interplay between `SamplingDuration`, `MinimumThroughput`, and the actual traffic volume.

CB coverage is defined as the percentage of sampling windows in which the configured `MinimumThroughput` is met. In other words, it represents the proportion of time the circuit breaker is actively able to evaluate failures and provide protection. A higher CB coverage means the service is consistently generating enough traffic for the circuit breaker to function effectively.
Tuning goals
- Activate the CB policy only during actual incidents.
- Ensure sufficient CB coverage relative to service traffic.
- Avoid overly aggressive settings that cause false positives.
- Ensure timely recovery after a circuit opens.
To meet these objectives, the Circuit Breaker policy should be carefully tuned with the following considerations (a configuration sketch follows the list):
- Use a relatively high `FailureRatio`: this helps ensure the circuit only opens during meaningful failures, such as a true outage of a downstream service.
- Lower `MinimumThroughput` and `SamplingDuration` moderately: this increases CB coverage, making the policy effective even under low or fluctuating traffic, while also reducing the delay in detecting failures.
- At the same time, `MinimumThroughput` and `SamplingDuration` should not be too low: extremely low values may lead to statistically unreliable failure detection and increase the risk of false positives.
- Keep `BreakDuration` reasonably short: a shorter break time allows for quicker recovery once the downstream service stabilizes, reducing the chance of unnecessarily blocking healthy requests.
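Taken together, these considerations might translate into starting values like the sketch below (assuming Polly v8's `CircuitBreakerStrategyOptions`; the numbers are illustrative starting points, not universal recommendations, and should be validated with the fine-tuning steps later in this article):

```csharp
using System;
using System.Net.Http;
using Polly.CircuitBreaker;

// Illustrative starting point reflecting the tuning considerations above.
var tunedOptions = new CircuitBreakerStrategyOptions<HttpResponseMessage>
{
    FailureRatio = 0.5,                          // relatively high: only open on meaningful failures
    MinimumThroughput = 50,                      // moderate: enough samples for a statistically useful ratio
    SamplingDuration = TimeSpan.FromSeconds(30), // moderate window: decent coverage, timely detection
    BreakDuration = TimeSpan.FromSeconds(5)      // short: recover quickly once the callee stabilizes
};
```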
Proposed Best Practices
Usage principles before fine-tuning
- Use a circuit breaker when the downstream service exposes multiple endpoints – This setup allows the circuit breaker to work effectively alongside strategies like hedging or fallback, providing resilience without broadly blocking functionality.
- Avoid using a circuit breaker for services with a single critical endpoint – In such cases, opening the circuit could block all traffic, even if some requests might still succeed. However, exceptions can be made when failure tolerance is high or the downstream service experiences significant load and needs protection.
- Isolate circuit breakers by request host, API path, or other dimensions – This ensures that issues with one endpoint or host don’t cascade and affect unrelated traffic, allowing for more granular and targeted resilience behavior. A sketch of this isolation follows below.
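One simple way to achieve this isolation is to give each downstream host (or logical endpoint group) its own named HttpClient with its own resilience handler, so each gets independent circuit state. This is a sketch under that assumption; "catalog" and "payments" are hypothetical client names and URLs:

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;
using Polly;

var services = new ServiceCollection();

// Hypothetical helper: one circuit breaker per named client keeps circuit state isolated per host.
static void AddIsolatedCircuitBreaker(IServiceCollection services, string clientName, Uri baseAddress)
{
    services.AddHttpClient(clientName, client => client.BaseAddress = baseAddress)
        .AddResilienceHandler($"{clientName}-circuit-breaker", builder =>
            builder.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions()));
}

// A failure storm on one host opens only that host's circuit; the other client keeps flowing.
AddIsolatedCircuitBreaker(services, "catalog", new Uri("https://catalog.example.com"));
AddIsolatedCircuitBreaker(services, "payments", new Uri("https://payments.example.com"));
```

If you need finer granularity (for example, per API path), you can register additional named clients along the same lines, or look into the pipeline-selection options available in your version of Microsoft.Extensions.Http.Resilience.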
Settle on the FailureRatio
Set a relatively high `FailureRatio` to capture non-transient issues. Ideally, this threshold should be defined by the downstream (callee) service, as they best understand the behavior and availability expectations of their APIs. They can specify the minimum availability level at which it still makes sense to allow traffic.
If no guidance is available from the downstream service, a good starting point is a `FailureRatio` of 0.5 or higher. This is a conservative default that helps avoid prematurely opening the circuit due to transient or low-severity failures.
Settle on the BreakDuration
We recommend starting with a small `BreakDuration` and increasing it as needed based on observed behavior. A shorter duration enables faster recovery, reducing the risk of unnecessarily blocking healthy requests once the downstream service stabilizes.
Based on our experience with a wide range of services, a default starting point of 5 seconds is typically effective and strikes a good balance between protection and responsiveness.
Fine-tune to get a proper MinimumThroughput and SamplingDuration pair
These two configurations are the key factors in determining whether a Circuit Breaker is configured effectively.
Prerequisites
Before following the fine-tuning steps below, we strongly recommend identifying the distinct traffic patterns of your services across environments and locations, because you will need to fine-tune and produce configuration value pairs for each of those patterns, instead of relying on a single set of configurations for all deployment instances.
Fine-tune steps
1. Given a suggested `SamplingDuration` time window, find the `MinimumThroughput` at which the failure rate stabilizes.
   - a. Set the initial `SamplingDuration` to the minimum value of 30s, which is the recommended default from .NET.
   - b. Analyze your service logs during peak traffic to visualize the relationship between failure rate and throughput (a sketch of this analysis follows this step).
     - Based on your service logs, divide the observed time range into consecutive windows based on the `SamplingDuration` (e.g., 30-second intervals), then count the request traffic and the request success ratio for each time window. Plot these data points on a line chart showing failure rate against throughput over time.
     - When filtering failed requests, apply the Circuit Breaker's `ShouldHandle` predicate to include only relevant failure types (refer to the table at the start of this article), and focus only on the specific scenario/API (since each should use a separate circuit).
   - c. Identify the `MinimumThroughput` using the conditions below:
     - Choose the throughput value at which the failure rate begins to stabilize, indicating statistical reliability.
     - Ensure that the `MinimumThroughput` is greater than `SamplingDuration * 1` (i.e., at least 1 request per second). If it is lower, the traffic volume is too low to justify a circuit breaker policy.
     - If no suitable `MinimumThroughput` is found, increase the `SamplingDuration` (e.g., to 45 seconds) and repeat this analysis. The `MinimumThroughput` should be sufficiently large to reveal a stable failure rate with statistical significance.

   By following these steps, we start with a small `SamplingDuration` time window and gradually increase it to accurately understand the real relationship between throughput and failure rate for your services.

   Pic 1: A sample chart to show the relationship between failure rate and throughput
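As a sketch of step 1.b, assuming you can export request logs as timestamped records with a flag for whether the Circuit Breaker's `ShouldHandle` predicate would treat the request as a failure (the `RequestLog` record and its fields are hypothetical; in practice you may run the same aggregation in your log analytics tool), the following groups logs into `SamplingDuration`-sized windows and computes the throughput and failure rate per window, ready for charting:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical log record: one entry per request, with whether the circuit breaker's
// ShouldHandle predicate would have counted it as a failure.
public sealed record RequestLog(DateTimeOffset Timestamp, bool IsHandledFailure);

public static class WindowStats
{
    // Groups logs into consecutive SamplingDuration-sized windows and returns
    // (window start, throughput, failure rate) tuples for charting.
    public static IReadOnlyList<(DateTimeOffset WindowStart, int Throughput, double FailureRate)>
        Compute(IEnumerable<RequestLog> logs, TimeSpan samplingDuration)
    {
        return logs
            .GroupBy(log => new DateTimeOffset(
                log.Timestamp.UtcTicks - (log.Timestamp.UtcTicks % samplingDuration.Ticks),
                TimeSpan.Zero))
            .OrderBy(group => group.Key)
            .Select(group => (
                WindowStart: group.Key,
                Throughput: group.Count(),
                FailureRate: group.Count(l => l.IsHandledFailure) / (double)group.Count()))
            .ToList();
    }
}
```

Plotting FailureRate against Throughput across these windows gives the chart used in step 1.c to spot where the failure rate stabilizes.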
2. The value pair should provide a proper CB coverage.
   - a. Calculate the CB coverage for your chosen `MinimumThroughput` and `SamplingDuration` pair from the previous step (see the coverage sketch after this step).
     - Using your service logs, divide the time range into windows based on the current `SamplingDuration`.
     - For each window, count the number of requests and check whether it meets or exceeds the `MinimumThroughput`.
     - Calculate the percentage of these time windows that satisfy the `MinimumThroughput` condition.
     - This percentage represents the CB coverage: the proportion of time during which the circuit breaker can actively monitor and protect the service.
   - b. If the result is below 50% (0.5), increase the `SamplingDuration` and repeat the process above until the CB coverage exceeds 50%, ensuring the circuit breaker is effective for most of the traffic.

   Pic 2: A sample pie chart that shows the CB coverage check
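Continuing the earlier sketch (and reusing the hypothetical `WindowStats.Compute` helper), CB coverage is simply the share of windows whose request count reaches the candidate `MinimumThroughput`:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CoverageCheck
{
    // CB coverage = fraction of sampling windows whose request count reaches MinimumThroughput.
    public static double Compute(
        IReadOnlyList<(DateTimeOffset WindowStart, int Throughput, double FailureRate)> windows,
        int minimumThroughput)
    {
        if (windows.Count == 0)
        {
            return 0.0;
        }

        return windows.Count(w => w.Throughput >= minimumThroughput) / (double)windows.Count;
    }
}

// Example usage (hypothetical values):
//   var windows = WindowStats.Compute(logs, TimeSpan.FromSeconds(30));
//   double coverage = CoverageCheck.Compute(windows, minimumThroughput: 50); // aim for > 0.5
```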
Some potential problems while fine-tuning
Too large SamplingDuration
It is recommended to keep the SamplingDuration at no more than 5 minutes.
If, after increasing `SamplingDuration` up to this limit, you still do not observe stabilization of the failure rate as described in step 1.c, this may indicate one of the following:
- The traffic volume is too low to provide statistically significant data for reliable circuit breaker tuning.
- The downstream service is highly unstable, with frequent failures over extended periods, making it unsuitable for a circuit breaker policy that assumes failures are transient and limited to live-site incidents.
Low circuit breaker coverage
If CB coverage remains low even after increasing `SamplingDuration`, consider the following steps:
- Check for traffic imbalance across pods or machines. Circuit Breaker policies in .NET with Polly operate at the individual pod or machine level and assume a reasonably balanced load. If some pods receive disproportionately low traffic, the circuit breaker on those pods may rarely activate, leading to low overall coverage. Address any load-balancing issues to ensure more even traffic distribution before tuning the circuit breaker further.
- Carefully lower the `MinimumThroughput` to increase coverage. This involves a trade-off between improved CB coverage and the risk of introducing noise (false positives). When decreasing `MinimumThroughput`, ensure that:
  - The additional noise introduced does not cause the observed failure rate to exceed (ideally, it should remain well below) the configured `FailureRatio`.
  - The `MinimumThroughput` remains above the threshold of at least 1 request per second (`MinimumThroughput` > `SamplingDuration * 1`), ensuring statistical significance.
Pic 3: Relationship between failure rate and throughput, and the potential buffer for decreasing MinimumThroughput
The sample above is drawn from 30-second sampling windows. If you find it hard to reach a high CB coverage with `MinimumThroughput` >= 50, usually because your service's traffic is not that large, you can consider decreasing `MinimumThroughput` to a value within the "Decrease Interval". This provides higher coverage (because more time slices can reach the new, lower `MinimumThroughput` bar) with acceptable noise (all time-slice data points in this range have failure rates lower than your configured `FailureRatio`). A verification sketch follows below.
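As a sketch of that verification (reusing the hypothetical per-window stats from the earlier sketches; the current and candidate thresholds are values you read off your own chart), the snippet below checks that every window newly admitted by the lower bar keeps its failure rate below the configured `FailureRatio`:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DecreaseIntervalCheck
{
    // Returns true if lowering MinimumThroughput from currentMin to candidateMin is acceptable:
    // every window the lower bar newly admits must have a failure rate below the configured FailureRatio.
    public static bool IsAcceptable(
        IReadOnlyList<(DateTimeOffset WindowStart, int Throughput, double FailureRate)> windows,
        int currentMin,
        int candidateMin,
        double failureRatio)
    {
        var newlyIncluded = windows
            .Where(w => w.Throughput >= candidateMin && w.Throughput < currentMin)
            .ToList();

        return newlyIncluded.All(w => w.FailureRate < failureRatio);
    }
}
```

After this check passes, re-run the coverage calculation with the candidate value to confirm coverage now exceeds 50%, and keep the candidate above the 1-request-per-second floor (`MinimumThroughput` > `SamplingDuration * 1`).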
Enhancement Idea: AI-Driven Adaptive Circuit Breaker via MCP Integration
Great write-up! One idea worth exploring is integrating a lightweight MCP (Modular Cognitive Process) agent alongside Polly to dynamically fine-tune circuit breaker thresholds based on real-time telemetry, traffic patterns, or ML insights.
Instead of static thresholds (e.g., 5 failures in 30s), an MCP can:
• Adjust break duration and failure limits based on time-of-day, service tier, or risk profile.
• Learn from past outages and proactively shift policy behavior.
• Coordinate circuit states across services to prevent cascading failures.
Adding MCP to circuit breaker design turns a reactive safety mechanism into a proactive intelligent decision system.
Yes, dynamic tuning of the threshold would work better than a static configuration, especially in scenarios with poor load balancing, where each pod's or instance's traffic varies a lot and a static threshold doesn't work well. And when it comes to proactive design and dynamic tuning, AI could definitely help 🙂