Retry Strategy Best Practices in Software Development

Overview

A practical guide to retry boundaries, strategy selection, idempotency, and production rollout across thread pools, message queues, HTTP, and gRPC.

Abstract

Retry is one of the most common and most frequently abused fault-tolerance mechanisms in distributed systems. Its value lies in using a limited number of repeated attempts to mask transient failures such as short-lived network jitter, temporary service unavailability, rate limiting, connection re-establishment, leader or primary failover, and consumer crashes. But retry is not just "try again a few times after failure." Poorly designed retry behavior can cause duplicate writes, duplicate charges, message storms, thread-pool exhaustion, downstream avalanches, and cascading failures. The Amazon Builders' Library explicitly points out that retries are "selfish": a client that retries spends more of the server's resources to improve its own success rate, so when the root cause is overload, retries make the overload worse and can slow recovery. (Amazon Web Services, Inc.)

The central conclusion of this article is: the default production retry policy should be "bounded attempts + per-attempt timeout + exponential backoff + jitter + idempotency protection + retry budget + dead-letter fallback." Immediate retry should be reserved only for extremely short-lived transient failures, and at most once. High-concurrency distributed systems should not use naked fixed-interval retries. Any operation with side effects, including writes, payments, order creation, coupon issuance, or message publication, must solve idempotency before retry is even discussed.

Keywords: retry strategy, transient failure, exponential backoff, jitter, idempotency, thread pool, message queue, HTTP, gRPC, dead-letter queue, retry storm


1. What Is Retry, Why Retry, and What Happens Without It?

1.1 Definition of Retry

Retry means that after an operation fails, the caller, execution framework, messaging middleware, RPC framework, or scheduler issues the same operation again under specific conditions, attempting to make the failed business flow eventually succeed.

The Microsoft Azure Retry Pattern defines it directly: transparently retry a failed operation when connecting to a service or network resource in order to handle transient faults and improve application stability. In cloud environments, transient faults include brief network interruptions, temporary service unavailability, and timeouts caused by a busy service. (Microsoft Learn)

In more operational terms, retry consists of four elements:

| Element | Meaning |
| --- | --- |
| Retryable condition | Which exceptions, error codes, or status codes are allowed to retry |
| Retry boundary | Maximum attempts and maximum total time |
| Retry interval | Immediate retry, fixed interval, increasing interval, exponential backoff, or backoff with jitter |
| Failure destination | After retries are exhausted, should the system fail, degrade, trip a circuit breaker, enter a dead-letter queue, or require manual handling |

Judgment: any "retry" design without these four elements is not reliability engineering. It is luck-based behavior.

1.2 Why Retry?

The fundamental reason for retry is that many failures in modern software systems are not permanent failures. They are transient failures. The Amazon Builders' Library notes that systems often fail partially or briefly rather than failing as a whole, and a second attempt can often succeed when the fault is short-lived and random. (Amazon Web Services, Inc.)

Typical transient failures include:

| Scenario | Example |
| --- | --- |
| Network jitter | TCP reset, connection timeout, temporary DNS failure |
| Busy service | HTTP 503, full thread pool, full connection pool |
| Rate limiting | HTTP 429, API quota exceeded |
| Distributed switchover | Primary-replica failover, leader election, broker failover |
| Eventual consistency | A newly created resource is temporarily unreadable |
| Consumer failure | Consumer process crash, temporary database outage |

The Azure documentation also states that many transient faults recover on their own, and that an operation will often succeed if the application retries after an appropriate delay. (Microsoft Learn)

1.3 What Happens If There Is No Retry?

A system with no retry at all will often expose brief fluctuations directly to users or upstream systems, turning requests that could have recovered automatically into visible failures. Typical consequences include:

| Consequence of no retry | Explanation |
| --- | --- |
| Lower availability | Network jitter, temporary throttling, and brief 503s become user-visible failures immediately |
| Higher compensation cost | Orders, jobs, and sync flows that could have succeeded automatically need manual or asynchronous repair |
| Less stable call paths | Upstream services see more failures and may trigger alerts, degradation, or manual intervention |
| Higher message-loss risk | If MQ consumption fails and the message is acked or the offset is committed anyway, the message can be skipped |
| Worse user experience | Users need to refresh, click again, or submit repeatedly |

But the opposite also matters: blind retry is more dangerous than no retry. AWS warns explicitly that if the failure is caused by downstream overload, retries add more downstream load and make the situation significantly worse. In a five-layer call stack where every layer retries three times, request load on the bottom-layer database can be amplified by up to 243 times. (Amazon Web Services, Inc.)
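The 243 figure is plain compounding: if each of five layers makes up to three attempts, the worst case at the bottom layer is 3 to the power of 5. A minimal sketch of that arithmetic follows; the class and method names are illustrative and not taken from any library.

```java
final class RetryAmplification {
    /**
     * Worst-case number of calls reaching the lowest layer when every layer
     * in the stack makes up to attemptsPerLayer attempts.
     */
    static long worstCaseCalls(int attemptsPerLayer, int layers) {
        long calls = 1;
        for (int i = 0; i < layers; i++) {
            // multiplyExact throws instead of silently overflowing.
            calls = Math.multiplyExact(calls, (long) attemptsPerLayer);
        }
        return calls;
    }
}
```

Calling `RetryAmplification.worstCaseCalls(3, 5)` returns 243, which is why retrying at a single layer of the stack is usually preferable to retrying at every layer.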

So the real conclusion is not "always retry." It is:

Retry only transient failures; retry only operations that are idempotent or deduplicable; retry only within bounded attempts, bounded time, and bounded budget.


2. Core Principles of Retry Design

2.1 Set Timeouts Before Retries

Retry without timeout is a broken design. AWS notes that while a client waits for a request to finish, it continues to consume resources such as memory, threads, connections, and ephemeral ports. Large numbers of long-waiting requests can exhaust service resources, so clients must set timeouts. (Amazon Web Services, Inc.)

Azure also emphasizes that retry policy must be designed together with timeout. If timeout is too long, threads and connections pile up during failure; if timeout is too short, operations that would have succeeded may fail prematurely. (Microsoft Learn)

The correct model is:

text
per-attempt timeout < acceptable wait time for a single business attempt
total retry duration <= upstream deadline / SLO
maxAttempts * per-attempt timeout + sum of retry intervals <= total budget
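The budget relation above can be checked mechanically before a policy ships. The following is a sketch under the assumption that the worst case is "every attempt runs to its timeout and every backoff is waited in full"; `RetryTimeBudget` and its method names are hypothetical.

```java
import java.time.Duration;
import java.util.List;

final class RetryTimeBudget {
    /** Worst-case wall time: all attempts time out and all backoffs are fully waited. */
    static Duration worstCaseTotal(int maxAttempts, Duration perAttemptTimeout,
                                   List<Duration> backoffs) {
        Duration total = perAttemptTimeout.multipliedBy(maxAttempts);
        // There are at most maxAttempts - 1 waits between attempts.
        int waits = Math.min(backoffs.size(), Math.max(0, maxAttempts - 1));
        for (int i = 0; i < waits; i++) {
            total = total.plus(backoffs.get(i));
        }
        return total;
    }

    /** True when the worst case still fits inside the upstream deadline. */
    static boolean fitsDeadline(int maxAttempts, Duration perAttemptTimeout,
                                List<Duration> backoffs, Duration upstreamDeadline) {
        return worstCaseTotal(maxAttempts, perAttemptTimeout, backoffs)
                .compareTo(upstreamDeadline) <= 0;
    }
}
```

For example, 3 attempts with a 500 ms timeout and backoffs of 100 ms and 200 ms need up to 1.8 s, so the policy fits a 2 s upstream deadline but not a 1.5 s one.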

2.2 Retry Only Transient Failures, Not Deterministic Ones

Azure recommends retrying only when the fault is transient and the operation may succeed after another attempt. HTTP 429 and 5xx responses are common retry candidates, while most 4xx responses such as 400, 401, 403, and 404 are usually not problems that retry can solve. (Microsoft Learn)

My engineering judgment is:

| Error type | Retry? | Reason |
| --- | --- | --- |
| Network timeout, connection reset | Yes | It may be a transient network issue |
| HTTP 429 | Yes | But Retry-After or rate-limit policy must be respected |
| HTTP 500/502/503/504 | Yes | These are typical transient server-side failures |
| HTTP 400 | No | The request is wrong and retrying keeps it wrong |
| HTTP 401/403 | No direct retry | Authentication or authorization must be refreshed or rejected |
| HTTP 404 | No by default | Unless it is explicitly an eventual-consistency read delay |
| Business validation failure | No | Insufficient inventory, insufficient balance, or illegal state is not transient |
| Non-idempotent write timeout | No by default | Unless there is an idempotency key or proof the original operation never took effect |

2.3 Idempotency Is a Prerequisite for Retry

RFC 9110 defines an idempotent method as one where the intended effect of multiple identical requests is the same as a single request. PUT, DELETE, and all safe methods are idempotent. It also explicitly states that if a method is not idempotent, a client should not automatically retry unless it knows the actual request semantics are idempotent or can confirm the original request was never applied. (RFC Editor)

This principle is critical in business systems. The following operations should not be retried automatically without idempotency protection:

| Operation | Risk |
| --- | --- |
| Create order | Duplicate orders |
| Charge payment | Duplicate charges |
| Issue coupon | Duplicate issuance |
| Publish MQ message | Duplicate messages |
| Insert inventory ledger record | Duplicate records |
| Invoke external system | Local failure even though the external side already succeeded |

The correct approach is to introduce one or more of the following for write operations:

text
idempotencyKey / requestId / businessNo / unique constraint / dedup table / state machine
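To make the idempotency-key idea concrete, here is a minimal in-memory sketch. It assumes a single process and uses a `ConcurrentHashMap` as a stand-in for what would normally be a dedup table with a unique index; `IdempotentExecutor` is a hypothetical name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

final class IdempotentExecutor<R> {
    // Stand-in for a persistent dedup table keyed by idempotency key.
    private final Map<String, R> resultsByKey = new ConcurrentHashMap<>();

    /**
     * Runs the operation at most once per key. A retry with the same key
     * returns the stored result instead of repeating the side effect.
     * Note: a null result is not stored, so the operation must return non-null.
     */
    R execute(String idempotencyKey, Supplier<R> operation) {
        return resultsByKey.computeIfAbsent(idempotencyKey, key -> operation.get());
    }
}
```

With this in place, a client that times out and resends the request for the same key observes the original result rather than triggering a second charge. A real implementation also has to decide how long keys are retained and what happens when the first attempt is still in flight on another node.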

AWS also makes it clear that APIs with side effects are unsafe without idempotency, and that good API design should avoid duplicate side effects. (Amazon Web Services, Inc.)

2.4 Retry Must Be Bounded, Never Infinite

The Azure documentation explicitly requires avoiding infinite retries because they usually prevent overloaded resources from recovering and cause throttling and connection refusal to last even longer. Bounded retries should be used instead, often with a circuit breaker. (Microsoft Learn)

Retry boundaries should include at least three layers:

text
maxAttempts: maximum number of attempts
maxBackoff: maximum backoff interval
deadline / totalTimeout: maximum total time

My judgment is: a retry configuration without a total time limit is not acceptable. Configuring only maxAttempts is not enough because each individual request may still block for a long time.
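A sketch of a retry loop that enforces both limits might look like the following. It assumes the operation enforces its own per-attempt timeout, and `BoundedRetry` plus the `isTransient` predicate are illustrative names rather than an existing API.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Predicate;

final class BoundedRetry {
    /**
     * Calls the operation up to maxAttempts times, sleeping a jittered,
     * capped exponential delay between attempts, and gives up early when
     * the next sleep would not fit inside totalTimeout.
     */
    static <T> T call(Callable<T> operation, Predicate<Exception> isTransient,
                      int maxAttempts, Duration baseBackoff, Duration maxBackoff,
                      Duration totalTimeout) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long deadlineNanos = System.nanoTime() + totalTimeout.toNanos();
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception ex) {
                if (!isTransient.test(ex)) {
                    throw ex; // deterministic failure: do not retry
                }
                last = ex;
            }
            if (attempt == maxAttempts - 1) {
                break; // attempts exhausted
            }
            long capMs = Math.min(maxBackoff.toMillis(),
                    baseBackoff.toMillis() << Math.min(attempt, 20));
            long sleepMs = ThreadLocalRandom.current().nextLong(capMs + 1);
            long remainingMs = (deadlineNanos - System.nanoTime()) / 1_000_000;
            if (sleepMs >= remainingMs) {
                break; // total time budget exhausted
            }
            Thread.sleep(sleepMs);
        }
        throw last;
    }
}
```

The loop blocks the calling thread while it sleeps, so it suits background callers; inside a shared worker pool, delayed rescheduling is usually the better fit.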


3. Retry Inside Thread Pools

3.1 A Thread Pool Is Not a Retry Mechanism

The responsibility of Java ExecutorService and ThreadPoolExecutor is to execute tasks, not to guarantee business success. The Oracle documentation explains that ExecutorService.submit returns a Future, and the caller can wait for completion or cancel the task; Future.get() throws ExecutionException if the task failed with an exception. (Oracle Documentation)

That means a thread pool does not automatically retry a business task just because your Runnable or Callable throws. If you submit a task with submit() but never call Future.get(), the exception stays captured inside the Future and is never surfaced; if the task also does not catch and handle its own exceptions, the failure is silently dropped, with no log entry and no business compensation.

3.2 Three Different Retry Scenarios in a Thread Pool

What people call "retry in a thread pool" actually covers three very different cases:

| Type | Trigger | Nature | Recommended approach |
| --- | --- | --- | --- |
| Retry after execution failure | Runnable/Callable throws during execution | Business execution failure | Catch the exception and reschedule according to policy |
| Retry after submission failure | The thread-pool queue is full or the executor is shut down | Resource rejection | Do not retry blindly; apply rate limiting, degradation, or reroute to another queue |
| Timed or delayed retry | Retry after some delay following the first failure | Task scheduling | Use ScheduledExecutorService or MQ |

The Oracle RejectedExecutionHandler documentation explains that the rejection handler is invoked when ThreadPoolExecutor.execute cannot accept a task, either because thread or queue bounds were exceeded, or because the executor has been shut down. (Oracle Documentation)

So RejectedExecutionHandler is not a business retry hook. It is a rejection hook for overload or shutdown. Calling executor.execute(r) in an endless loop inside the rejection handler is a terrible design and can easily cause CPU spinning, blocked caller threads, and cascading collapse.

3.3 The Correct Way to Retry Thread-Pool Tasks

If a task must be retried after failure, prefer delayed rescheduling rather than making a worker thread sleep. The Oracle documentation explains that ScheduledExecutorService can schedule commands after a delay and can also run tasks periodically. (Oracle Documentation)

A recommended model is:

java
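import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public final class RetriableTask implements Runnable {
    /** Total attempts allowed, including the first one. */
    private static final int MAX_ATTEMPTS = 3;
    private static final long BASE_DELAY_MS = 100L;
    private static final long MAX_DELAY_MS = 3000L;

    /** Marker for failures worth retrying (timeouts, connection resets, 503s). */
    public static class TransientException extends RuntimeException {
        public TransientException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private final ScheduledExecutorService scheduler;
    // Assumed to be 1-based: the first execution is attempt 1.
    private final int attempt;

    public RetriableTask(ScheduledExecutorService scheduler, int attempt) {
        this.scheduler = scheduler;
        this.attempt = attempt;
    }

    @Override
    public void run() {
        try {
            // Execute business logic.
            doBusiness();
        } catch (TransientException ex) {
            if (attempt >= MAX_ATTEMPTS) {
                // Retries exhausted: send to the failure handling path.
                sendToDeadLetter(ex);
                return;
            }

            long delayMs = calculateBackoffWithJitter(attempt);
            // Re-schedule instead of blocking the worker thread with sleep().
            scheduler.schedule(
                new RetriableTask(scheduler, attempt + 1),
                delayMs,
                TimeUnit.MILLISECONDS
            );
        } catch (Exception ex) {
            // Non-transient failures should fail fast.
            sendToDeadLetter(ex);
        }
    }

    private void doBusiness() {
        // Business operation; throw TransientException for retryable faults.
    }

    /** Full jitter: a uniform delay in [0, min(max, base * 2^attempt)]. */
    private long calculateBackoffWithJitter(int attempt) {
        int shift = Math.min(attempt, 20); // keep the shift far from overflow
        long exponential = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * (1L << shift));
        return ThreadLocalRandom.current().nextLong(0, exponential + 1);
    }

    private void sendToDeadLetter(Exception ex) {
        // Persist failed task for later diagnosis or compensation.
    }
}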

3.4 Best-Practice Conclusions for Thread-Pool Retry

| Scenario | Recommended strategy | Not recommended |
| --- | --- | --- |
| Short-lived network call failure | Limited retries inside the task, or reschedule after failure | Sleeping for a long time inside a worker thread |
| Submission failure because the queue is full | Rate limit, fail fast, degrade, or smooth bursts through an async queue | Infinite resubmission inside RejectedExecutionHandler |
| Batch job failure | Record attempt count, retry with delay, and eventually persist into a failure table | Infinite in-process loops |
| Async task in a user request chain | Short retry + fail fast + observability | Making the request thread wait synchronously through multiple retry rounds |
| Long-duration retry | Hand off to MQ, a scheduler, or a workflow engine | Keeping state in an in-memory thread-pool queue |

4. Retry in Message Queues

Retry in message queues is more complex than in HTTP or RPC because it involves acknowledgements, offset commit, redelivery, ordering, dead-letter queues, and consumer idempotency.

4.1 Retry in RabbitMQ

RabbitMQ relies on ack/nack/requeue. Its documentation warns that if all consumers keep requeueing messages because of transient conditions, the system can fall into a requeue or redelivery loop that consumes large amounts of network bandwidth and CPU. (RabbitMQ)

So when consumption fails, RabbitMQ should not be handled naively like this:

text
basic.nack(requeue = true)

Otherwise, when the database is down, a downstream service is unavailable, or all consumers fail, messages will be redelivered repeatedly and a consumption storm will appear.

The correct RabbitMQ model is:

text
consumption fails
  -> determine whether the exception is transient
  -> record retry count
  -> perform delayed retry
  -> move to DLX / DLQ after the retry limit is exceeded
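The decision part of that flow is small enough to sketch without any broker client. The code below assumes the consumer can tell transient from permanent failures and that the retry count travels with the message (for example in a header); `ConsumeFailurePolicy` and its `Action` values are hypothetical names, and no RabbitMQ API is called.

```java
final class ConsumeFailurePolicy {
    enum Action { DELAYED_RETRY, DEAD_LETTER }

    private final int maxRetries;

    ConsumeFailurePolicy(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /**
     * @param transientFailure whether the exception was classified as transient
     * @param retryCount       number of retries already performed for this message
     */
    Action decide(boolean transientFailure, int retryCount) {
        if (!transientFailure) {
            return Action.DEAD_LETTER; // poison or permanent failure: never requeue
        }
        if (retryCount >= maxRetries) {
            return Action.DEAD_LETTER; // retry limit exceeded
        }
        return Action.DELAYED_RETRY;   // publish to a delay queue, not requeue=true
    }
}
```

In a real consumer, `DEAD_LETTER` would typically map to rejecting the message with requeue=false on a queue that has a dead-letter exchange configured, while `DELAYED_RETRY` would republish the message with an incremented retry count.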

The RabbitMQ Dead Letter Exchange documentation explains that messages can be dead-lettered, meaning republished to another exchange. One trigger is when a consumer uses basic.reject or basic.nack with requeue=false. (RabbitMQ)

4.2 Retry in Kafka

The core of Kafka retry is not that the broker automatically retries your consumer logic forever. The real core is offset management and delivery semantics. The KafkaConsumer documentation explains that the committed position is the last offset safely stored. After the process fails and restarts, the consumer resumes from that offset. Applications may commit offsets periodically automatically, or manually control when a record is considered consumed. (Apache Kafka)

That leads to a critical engineering fact:

| Offset commit timing | Result |
| --- | --- |
| Commit before processing | Message loss is possible after failure, resulting in at-most-once |
| Commit after processing | Duplicate consumption is possible after failure, resulting in at-least-once |
| Transactional processing plus commit | Can approach exactly-once, but depends on system boundaries |

The Kafka design documentation also explains that Kafka guarantees at-least-once delivery by default. Users can implement at-most-once by disabling producer retries and committing offsets before processing, but that introduces message-loss risk. (Apache Kafka)

For producer retry, the KafkaProducer documentation states that enabling the idempotent producer prevents producer retries from introducing duplicate messages. It also warns that if idempotent producer mode is enabled, applications should avoid doing their own resends because application-level resends cannot be deduplicated by producer idempotency. (Apache Kafka)

4.3 Categories of MQ Retry

| Strategy | Mechanism | Suitable scenarios | Risk |
| --- | --- | --- | --- |
| Immediate redelivery | nack/requeue or no offset commit | Extremely short-lived failures | Can easily create redelivery loops |
| Blocking retry | Sleep inside the consumer thread and then retry | Low-frequency, short-duration errors | Blocks partition or queue consumption |
| Delayed-queue retry | Send failed messages to delayed topics or queues | Downstream is temporarily unavailable | Adds topic or queue complexity |
| Tiered retry | Retry topics such as 1 min, 5 min, 30 min | Recovery time of external systems is uncertain | Operational complexity increases |
| Dead-letter queue | Enter DLQ after retry limit | Poison messages or permanent failures | Requires manual handling or compensation |
| Stop consumption | Pause the consumer or trip a circuit breaker | Whole downstream is unavailable | Delay accumulates, but the downstream is protected |

The Spring Kafka documentation explains that Kafka non-blocking retry and DLT commonly require extra topics and listener configuration. Since Spring Kafka 2.7, @RetryableTopic and RetryTopicConfiguration simplify this infrastructure. (Spring Documentation) Its configuration documentation also says that the recommended and simplest way to enable non-blocking retries is to add @RetryableTopic to the @KafkaListener method, letting the framework automatically configure the retry topics and DLT topics. (Spring Documentation)

4.4 Best-Practice Conclusions for MQ Retry

My judgment is:

text
For short-lived and low-cost failures, 1 or 2 short blocking retries inside the consumer are acceptable.
When a downstream service is unavailable, do not block the consumer thread. Hand the message to a delayed retry queue.
Poison messages must not be retried forever. They must enter a DLQ.
For Kafka partitions that are sensitive to ordering, use non-blocking retry topics cautiously because they may break local ordering.
All message consumption must be idempotent by messageId or businessId.

In MQ scenarios, the most important questions are not "how many retries?" but:

text
failed messages must not be lost
duplicate messages must not create duplicate business effects
poison messages must not block the whole queue
downstream outages must not trigger a consumption storm

5. Retry in HTTP Requests

5.1 The Core of HTTP Retry: Status Codes, Idempotency, and Retry-After

HTTP retry must respect HTTP semantics first. RFC 9110 states explicitly that GET, HEAD, OPTIONS, and TRACE are safe methods, while PUT, DELETE, and all safe methods are idempotent methods. Idempotent methods can be retried automatically after communication failure because the expected effect of repeated requests is the same as a single request. (RFC Editor)

RFC 9110 also says that a client should not automatically retry a request with a non-idempotent method unless it can confirm that the request semantics are actually idempotent or that the original request was never applied. (RFC Editor)

That means:

| HTTP method | Default retry recommendation |
| --- | --- |
| GET | Retry is acceptable, but cache behavior, throttling, and request cost still matter |
| HEAD | Retry is acceptable |
| OPTIONS | Retry is acceptable |
| PUT | Retry is acceptable, but server semantics must really be idempotent |
| DELETE | Retry is acceptable, but deletion behavior must truly be idempotent |
| POST | Do not retry automatically by default unless there is an idempotency key or explicit business guarantee |
| PATCH | Do not retry automatically by default unless there is an idempotency key or explicit business guarantee |

5.2 Which HTTP Status Codes Are Suitable for Retry?

| Status code / exception | Retry? | Explanation |
| --- | --- | --- |
| 408 Request Timeout | Yes | A timeout may be transient |
| 429 Too Many Requests | Yes | Retry must respect throttling and Retry-After |
| 500 Internal Server Error | Yes | Typical transient server-side failure |
| 502 Bad Gateway | Yes | Temporary gateway or upstream failure |
| 503 Service Unavailable | Yes | Appropriate for backoff retry |
| 504 Gateway Timeout | Yes | Upstream timeout |
| 400 Bad Request | No | The request itself is wrong |
| 401 Unauthorized | No direct retry | Credentials should be refreshed first |
| 403 Forbidden | No | Permission issue |
| 404 Not Found | No by default | Unless eventual-consistency delay is known explicitly |
| 409 Conflict | Depends on business semantics | Optimistic-lock conflict may justify retrying the full read-modify-write flow |
| 422 Unprocessable Entity | No | Business-semantic error |
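The method and status rules in this section can be combined into one predicate. This is a sketch, not a complete policy: `HttpRetryClassifier` is a hypothetical name, 409 is deliberately left out because it depends on business semantics, and the `hasIdempotencyKey` flag is an assumption about how the caller signals a protected write.

```java
import java.util.Locale;
import java.util.Set;

final class HttpRetryClassifier {
    // Status codes marked "Yes" in the table above.
    private static final Set<Integer> RETRYABLE_STATUS =
            Set.of(408, 429, 500, 502, 503, 504);

    // Safe methods plus PUT and DELETE are idempotent per RFC 9110.
    private static final Set<String> IDEMPOTENT_METHODS =
            Set.of("GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE");

    /** True only when the status is retryable AND repeating the request is safe. */
    static boolean shouldRetry(String method, int status, boolean hasIdempotencyKey) {
        boolean safeToRepeat =
                IDEMPOTENT_METHODS.contains(method.toUpperCase(Locale.ROOT))
                        || hasIdempotencyKey;
        return safeToRepeat && RETRYABLE_STATUS.contains(status);
    }
}
```

A POST that receives 503 is retried only when it carries an idempotency key; a GET that receives 404 is never retried by this default.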

RFC 6585 defines 429 as the case where the user has sent too many requests in a given amount of time. The response may include Retry-After to indicate how long the client should wait before sending another request. (datatracker.ietf.org) RFC 9110 defines Retry-After as a server hint telling the user agent how long to wait before making a follow-up request. The value may be an HTTP date or a delay in seconds. (RFC Editor)

So the priority order of an HTTP client should be:

text
If the response contains Retry-After: honor Retry-After
Otherwise: use capped exponential backoff with jitter
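A minimal sketch of that priority order, assuming the caller has already computed a local backoff value. `RetryAfter` is an illustrative helper class; it accepts both forms of the header (delay in seconds or HTTP date) and falls back to the local backoff when the header is missing or unparseable.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

final class RetryAfter {
    /** Parses delay-seconds or an HTTP-date; empty if absent or unparseable. */
    static Optional<Duration> parse(String headerValue, Instant now) {
        if (headerValue == null || headerValue.trim().isEmpty()) {
            return Optional.empty();
        }
        String value = headerValue.trim();
        try {
            long seconds = Long.parseLong(value);
            return seconds >= 0 ? Optional.of(Duration.ofSeconds(seconds)) : Optional.empty();
        } catch (NumberFormatException notSeconds) {
            // Not a number: try the HTTP-date form next.
        }
        try {
            Instant when = ZonedDateTime
                    .parse(value, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
            Duration wait = Duration.between(now, when);
            return Optional.of(wait.isNegative() ? Duration.ZERO : wait);
        } catch (DateTimeParseException notDate) {
            return Optional.empty();
        }
    }

    /** Server hint wins; otherwise use the locally computed backoff. */
    static Duration nextDelay(String retryAfterHeader, Duration localBackoff, Instant now) {
        return parse(retryAfterHeader, now).orElse(localBackoff);
    }
}
```

A production client would usually also cap the server-provided delay against its own total deadline, so that a very large Retry-After value turns into a fast failure rather than a long wait.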

For user-interaction request chains:

text
maxAttempts = 2~3
perAttemptTimeout = 200ms~2s depending on the business
backoff = 50ms, 100ms, 200ms + jitter
totalTimeout must stay below the user-experience budget

For background tasks:

text
maxAttempts = 3~6
backoff = capped exponential backoff with jitter
maxBackoff = 10s~60s
after failure, enter a task table, MQ, or DLQ

For writes such as payments, orders, or coupon issuance:

text
idempotencyKey is mandatory
server-side dedup tables or unique indexes are mandatory
the client may retry, but must not bypass idempotency checks
after timeout, query the status first before deciding whether compensation is needed

The worst possible HTTP implementation is:

text
while (true) {
    callHttp();
}

That is not availability engineering. It is incident generation.


6. Retry in gRPC Requests

6.1 gRPC Retry Is Not Just a Loop in an Interceptor

The gRPC documentation explains that built-in gRPC retry saves call history and, when retry conditions are met, replaces the failed call with a new call and replays the history. If the RPC has already received response headers, that RPC is considered committed and is no longer retried. (gRPC)

This point matters. gRPC retry is a per-RPC mechanism inside the protocol stack. It should not be reduced to a coarse business interceptor loop. Business interceptors do not understand whether an RPC is already committed, nor do they understand transparent retry, server pushback, or retry throttling.

6.2 Default gRPC Behavior

The gRPC documentation explains that retry is enabled by default, but there is no default retry policy. Without an explicit retry policy, gRPC cannot safely retry most RPCs and only performs very limited transparent retries, such as low-level race failures where the RPC can be confirmed not to have been processed by server application logic. (gRPC)

In other words:

text
gRPC retry support being enabled != your business RPCs being retried according to your intended policy

If you want business RPCs to retry predictably, you must configure a service config.

6.3 Core Parameters of a gRPC Retry Policy

The retry policy shown in the gRPC documentation includes:

json
{
  "retryPolicy": {
    "maxAttempts": 4,
    "initialBackoff": "0.1s",
    "maxBackoff": "1s",
    "backoffMultiplier": 2,
    "retryableStatusCodes": [
      "UNAVAILABLE"
    ]
  }
}

The gRPC documentation explains that retry can configure maximum attempts, exponential backoff, and retryable status codes, and that the backoff delay receives plus or minus 20 percent jitter to avoid large numbers of clients hammering the server at the same time. (gRPC)

6.4 gRPC Retry Throttling

gRPC supports retry throttling: each client maintains a token count for each server. Failed RPCs reduce the token count, while successful RPCs increase it. When the token count falls below a threshold, retry pauses until the count recovers. (gRPC)

That is exactly the kind of capability production systems need. Without retry throttling, gRPC retry can easily kill an already overloaded server more quickly.
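The token mechanism can be illustrated with a small standalone class. This is a sketch of the idea only, not gRPC's implementation: the names `maxTokens` and `tokenRatio`, the decrement of 1 per failure, and the "more than half of the maximum" threshold are assumptions made for this example.

```java
final class RetryThrottle {
    private final double maxTokens;
    private final double tokenRatio; // tokens regained per successful call (assumed)
    private double tokens;

    RetryThrottle(double maxTokens, double tokenRatio) {
        this.maxTokens = maxTokens;
        this.tokenRatio = tokenRatio;
        this.tokens = maxTokens; // start with a full bucket
    }

    /** A failed call drains one token, never going below zero. */
    synchronized void onFailure() {
        tokens = Math.max(0, tokens - 1);
    }

    /** A successful call refills by tokenRatio, never exceeding maxTokens. */
    synchronized void onSuccess() {
        tokens = Math.min(maxTokens, tokens + tokenRatio);
    }

    /** Retries are allowed only while more than half of the tokens remain (assumed threshold). */
    synchronized boolean retryAllowed() {
        return tokens > maxTokens / 2;
    }
}
```

Because a failure costs more than a success gives back, a server that keeps failing quickly pushes the client below the threshold, and retries stay off until a run of successes refills the bucket.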

| gRPC scenario | Recommended strategy |
| --- | --- |
| Read-only query | Retry UNAVAILABLE with short deadlines and exponential backoff |
| Idempotent write | Retry is acceptable, but requestId or idempotencyKey is mandatory |
| Non-idempotent write | Do not retry automatically by default; query state after timeout |
| Streaming RPC | Retry cautiously, especially bidirectional streams |
| User-facing request chain | Small retry count, short deadline |
| Background synchronization | More attempts and longer backoff are acceptable, but total deadline is still required |
| Server overload | Enable retry throttling and trip a circuit breaker when needed |

gRPC service config also supports timeout, retry policy, hedging policy, and other call-behavior configuration at the service or method level. (gRPC)


7. Which Retry Strategies Exist, and When Should Each Be Used?

7.1 Immediate Retry

Immediate retry means making another attempt immediately after a failure with no delay.

| Suitable scenarios | Unsuitable scenarios |
| --- | --- |
| Extremely short-lived network glitches | Downstream overload |
| One-off packet-collision issues | High-concurrency systems |
| Mild local CAS or optimistic-lock conflicts | External services whose 5xx rate stays high |

Azure recommends using immediate retry only for very short-lived transient failures, and not more than once. If the immediate retry also fails, the client should switch to exponential backoff or fallback. (Microsoft Learn)

My judgment: in production, immediate retry should happen at most once. Beyond that it becomes self-destructive traffic amplification.

7.2 Fixed-Interval Retry

Fixed-interval retry means retrying after the same delay each time, such as every 3 seconds.

| Suitable scenarios | Unsuitable scenarios |
| --- | --- |
| Low-concurrency background tasks | Large client populations |
| Operational scripts | High-QPS RPC |
| Manually triggered jobs | Rate-limited or overload scenarios |

The main problem with fixed intervals is synchronization. If a large batch of clients all fail at the same time and all retry every 3 seconds, they create periodic traffic spikes.

7.3 Increasing-Interval Retry

Increasing-interval retry uses linearly or stepwise increasing delays, for example 1 s, then 3 s, then 5 s, then 10 s.

| Suitable scenario | Explanation |
| --- | --- |
| Background jobs | Gentler than a fixed interval |
| Batch processing | Suitable when failure cost is low and real-time requirements are weak |
| Simple MQ consumer failure | Can work together with retry topics |

This is better than fixed intervals, but in large distributed systems it is still weaker than exponential backoff plus jitter.

7.4 Exponential Backoff

Exponential backoff increases the wait time exponentially after each failure, for example:

text
100ms -> 200ms -> 400ms -> 800ms -> 1600ms

The Spring Batch documentation explains that after transient failure, waiting for some time before another attempt is often helpful. A common approach is exponentially increasing wait time, and Spring Batch provides ExponentialBackOffPolicy for that purpose. (Spring Documentation)

Exponential backoff fits:

| Scenario | Reason |
| --- | --- |
| HTTP 5xx | Gives the downstream time to recover |
| gRPC UNAVAILABLE | Service instances or connections may recover |
| Cloud API throttling | Reduces request frequency |
| Database failover | Waits for the new primary to become available |
| MQ delayed retry | Avoids immediately hammering the downstream again |

7.5 Truncated Exponential Backoff Plus Jitter

This is the strategy I consider the best default for distributed systems.

Google Cloud IAM recommends truncated exponential backoff with introduced jitter for safely retryable requests. Its documentation explains that retrying immediately after a failure can send a burst of requests in a short period and exceed quota, while jitter prevents synchronized retries and reduces thundering-herd behavior. (Google Cloud Documentation)

AWS also emphasizes that if all failed calls wake up and retry at the same moment after backoff, they can create another overload wave. Jitter spreads those retries over time. (Amazon Web Services, Inc.)

The recommended formula is:

text
delay = random(0, min(base * 2^attempt, maxBackoff))

That is a Full Jitter style strategy, and it fits high-concurrency systems well.
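As a standalone illustration, the formula splits naturally into a deterministic ceiling and a random draw below it. The sketch assumes a 0-based attempt index and millisecond units; `FullJitterBackoff` is a hypothetical name.

```java
import java.util.concurrent.ThreadLocalRandom;

final class FullJitterBackoff {
    /** Ceiling for the given 0-based attempt: min(base * 2^attempt, maxBackoff). */
    static long capMillis(long baseMs, long maxBackoffMs, int attempt) {
        long cap = baseMs;
        // Stop doubling once the ceiling is reached, which also avoids
        // overflow for realistic values.
        for (int i = 0; i < attempt && cap < maxBackoffMs; i++) {
            cap *= 2;
        }
        return Math.min(cap, maxBackoffMs);
    }

    /** Full jitter: a uniform random delay in [0, capMillis]. */
    static long delayMillis(long baseMs, long maxBackoffMs, int attempt) {
        long cap = capMillis(baseMs, maxBackoffMs, attempt);
        return ThreadLocalRandom.current().nextLong(cap + 1);
    }
}
```

With a 100 ms base and a 3000 ms maximum, the ceilings for attempts 0 through 5 are 100, 200, 400, 800, 1600, and 3000 ms, and every later attempt stays at 3000 ms. The actual delay is drawn uniformly below the ceiling, which is what desynchronizes clients that failed at the same moment.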

7.6 Server-Directed Retry

Server-directed retry means the client gives priority to the wait time explicitly returned by the server.

The standard HTTP example is Retry-After. RFC 9110 defines it as either an HTTP date or a delay in seconds. (RFC Editor) Azure also recommends that when a response contains a Retry-After header, the client should wait at least that long and let the server hint take precedence over local backoff calculation. (Microsoft Learn)

Suitable scenarios include:

text
HTTP 429
HTTP 503
API gateway throttling
Cloud-service quota limitation
server-side protective throttling

7.7 Retry Budget

Retry budget is not just a per-request maximum attempt count. It limits the total amount of retry traffic a process, service, or dependency can generate over some time window.

Azure recommends implementing a retry budget in addition to per-request retry limits, because many concurrent requests that each retry a few times can still crush the downstream. (Microsoft Learn)

Suitable scenarios include:

text
high-QPS microservices
shared downstream dependencies
third-party API integrations
rate-limited cloud services

My judgment is: a high-QPS service without a retry budget will eventually suffer a retry storm.
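One simple way to express a budget is as a ratio: retries may not exceed a fixed fraction of the requests seen in the current window. The sketch below takes that approach as an assumption (it is one of several possible designs); `RetryBudget`, `retryRatio`, and `minRetries` are illustrative names, and the caller is expected to invoke `reset()` at each window boundary.

```java
import java.util.concurrent.atomic.AtomicLong;

final class RetryBudget {
    private final double retryRatio; // e.g. 0.1 allows retries up to 10% of requests
    private final long minRetries;   // small floor so low-traffic services can still retry
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong retries = new AtomicLong();

    RetryBudget(double retryRatio, long minRetries) {
        this.retryRatio = retryRatio;
        this.minRetries = minRetries;
    }

    /** Call once for every first attempt. */
    void recordRequest() {
        requests.incrementAndGet();
    }

    /** Returns true and consumes budget if one more retry is allowed in this window. */
    boolean tryAcquireRetry() {
        long allowed = minRetries + (long) (requests.get() * retryRatio);
        while (true) {
            long used = retries.get();
            if (used >= allowed) {
                return false; // budget exhausted: fail instead of retrying
            }
            if (retries.compareAndSet(used, used + 1)) {
                return true;
            }
        }
    }

    /** Call at each window boundary to start a fresh budget. */
    void reset() {
        requests.set(0);
        retries.set(0);
    }
}
```

Unlike a per-request attempt limit, this cap holds no matter how many requests fail at once: during a full outage the retry volume stays near 10 percent of normal traffic instead of multiplying it.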

7.8 Circuit Breaking Together with Retry

Circuit breaking is not a retry strategy, but it is the braking system of retry. Azure recommends using a circuit breaker for continuously failing operations. When the number of failures exceeds a threshold within a time window, requests should fail immediately instead of continuing to hit the broken dependency. (Microsoft Learn)

Suitable scenarios include:

text
continuous downstream 5xx
connection-pool exhaustion
database outage
large-scale third-party API failure

The relationship between retry and circuit breaking is:

text
small amount of transient failure: retry
continuous failure: circuit break
recovery probing: half-open probing
recovery succeeds: close the breaker
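The state transitions above can be sketched as a small state machine. This version makes simplifying assumptions: it counts consecutive failures rather than failures within a time window, admits a single probe when half-open, and takes the clock as a parameter so it can be tested. `SimpleCircuitBreaker` is a hypothetical class, not a library API.

```java
import java.util.function.LongSupplier;

final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clockMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clockMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clockMillis = clockMillis;
    }

    synchronized boolean allowRequest() {
        if (state == State.CLOSED) {
            return true;
        }
        if (state == State.OPEN && clockMillis.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
            return true;  // admit exactly one probe request
        }
        return false;     // OPEN, or HALF_OPEN with the probe still in flight
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED; // probe (or normal call) succeeded: close the breaker
    }

    synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clockMillis.getAsLong();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A retry loop would check `allowRequest()` before every attempt, including retries, so that once the breaker opens the remaining attempts fail immediately instead of reaching the broken dependency.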

7.9 Dead-Letter Queue or Failure Table

Dead-letter handling is not retry itself. It is the destination after retry is exhausted. Azure recommends using a dead-letter queue once all retry attempts are consumed so request information is not lost and failure handling can be deferred. (Microsoft Learn)

Suitable scenarios include:

text
MQ consumption failure
asynchronous task failure
order-compensation failure
external-system synchronization failure
batch-processing failure

A dead-letter system should include:

text
failure reason
original message
attempt count
last failure time
business key
traceId
manual replay tool
idempotency protection

8. Scenario Selection Matrix

| Scenario | Recommended retry strategy | Max attempts | Interval strategy | Idempotency requirement | Final destination |
| --- | --- | --- | --- | --- | --- |
| User-facing HTTP query | Short retry | 2~3 | One immediate retry + short exponential backoff + jitter | Recommended | Failure or degraded response |
| User-facing HTTP write | Cautious retry | 0~2 | Exponential backoff + jitter | Mandatory | Query status or compensate |
| gRPC query | Built-in retry policy | 2~4 | initialBackoff + maxBackoff + multiplier + jitter | Recommended | Return status |
| gRPC write | Retry only idempotent writes | 0~3 | Exponential backoff + retry throttling | Mandatory | Query status or compensate |
| Thread-pool task | Reschedule | 3~5 | ScheduledExecutor delay + jitter | Depends on business | Failure table |
| RabbitMQ consumption | Delayed retry + DLQ | 3~10 | Multi-level delay | Consumption must be idempotent | DLQ |
| Kafka consumption | Retry topic + DLT | 3~10 | Non-blocking delayed topics | Consumption must be idempotent | DLT |
| Database optimistic-lock conflict | Short retry | 1~3 | Immediate retry or short backoff | Operation must be replayable | Return conflict |
| Third-party API rate limit | Prefer server hints | Depends on quota | Retry-After / exponential backoff | Depends on API | Delayed task |
| Scheduled batch job | Long backoff | Multiple | Capped exponential backoff | Task must be idempotent | Failure table or manual action |
| Payment charge | Do not retry blindly by default | 0~1 | Query state first | Strong idempotency required | Reconciliation or compensation |

9. Enterprise Retry Standard

A sound enterprise retry standard should include the following.

9.1 Checklist Before Retrying

text
1. Is this error transient?
2. Is this operation idempotent?
3. Is a per-attempt timeout already configured?
4. Is there a total deadline?
5. Will this duplicate retries already happening at another layer?
6. Is there a retry budget?
7. Where does the message or task go after retries are exhausted?
8. Are metrics and logs available?

9.2 Default Policy

text
User-facing request chains:
  maxAttempts = 2~3
  backoff = 50ms / 100ms / 200ms + jitter
  totalTimeout <= user experience budget

Internal RPC:
  maxAttempts = 2~4
  explicit perAttemptTimeout
  capped exponential backoff with jitter
  combined with circuit breaking, rate limiting, and retry budget

MQ consumption:
  1~2 short local retries
  then delayed retry topic or queue
  after limit, move to DLQ or DLT

Background jobs:
  longer backoff is acceptable
  persist attempt count and state
  do not rely on process memory to track retry state
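The user-facing defaults above can be sketched as a small helper. This is a minimal sketch rather than a definitive implementation: `TransientError`, `backoff_delay`, and `call_with_retry` are hypothetical names, full jitter (a random delay between zero and the capped exponential ceiling) is one common jitter variant chosen here as an assumption, and the per-attempt timeout is assumed to be enforced inside `op` itself.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors the caller has classified as retryable."""


def backoff_delay(attempt: int, base: float = 0.05, cap: float = 0.2,
                  rng: Callable[[], float] = random.random) -> float:
    """Delay before the retry that follows failed attempt `attempt` (1-based).

    The ceiling doubles from `base` (50 ms) up to `cap` (200 ms); the actual
    delay is drawn uniformly between 0 and that ceiling (full jitter).
    """
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return rng() * ceiling


def call_with_retry(op: Callable[[], T], max_attempts: int = 3,
                    total_timeout: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic,
                    rng: Callable[[], float] = random.random) -> T:
    """Run `op` with bounded attempts, jittered backoff, and a total deadline."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    deadline = clock() + total_timeout
    attempt = 1
    while True:
        try:
            return op()
        except TransientError:
            if attempt >= max_attempts:
                raise                      # attempts exhausted
            delay = backoff_delay(attempt, rng=rng)
            if clock() + delay > deadline:
                raise                      # waiting would exceed the total deadline
            sleep(delay)
            attempt += 1
```

The helper sleeps on the calling thread, which is acceptable only for illustration. Inside a worker pool the delay would be scheduled instead, so the thread is released while waiting.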

9.3 Observability Metrics

Azure recommends recording retry count, average retry count, and total retry time. Occasional transient failures and retries are expected, but a sustained increase in retry volume usually means a performance or availability problem. (Microsoft Learn)

At minimum, a production system should monitor:

text
retry_attempts_total
retry_success_total
retry_exhausted_total
retry_latency_seconds
retry_budget_exhausted_total
retry_by_exception
retry_by_status_code
dead_letter_total
message_redelivery_total
consumer_retry_lag
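To show where a few of these counters would be incremented, here is a minimal in-memory sketch. It is an assumption-heavy illustration: `run_with_metrics` is a hypothetical helper, a plain `collections.Counter` stands in for a real metrics client, the exception type is encoded into the key name instead of a label, and backoff is omitted to keep the focus on instrumentation.

```python
from collections import Counter
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_metrics(op: Callable[[], T], max_attempts: int, metrics: Counter) -> T:
    """Run `op` with bounded attempts and record retry counters in `metrics`."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = op()
        except Exception as exc:  # a real policy would catch only retryable errors
            # Count every failed attempt by exception type, including the last one.
            metrics[f"retry_by_exception.{type(exc).__name__}"] += 1
            if attempt == max_attempts:
                metrics["retry_exhausted_total"] += 1
                raise
            metrics["retry_attempts_total"] += 1  # another attempt will follow
        else:
            if attempt > 1:
                metrics["retry_success_total"] += 1  # succeeded only after retrying
            return result
    raise ValueError("max_attempts must be at least 1")
```

Comparing `retry_attempts_total` against total request volume over time is what surfaces the sustained increase in retries described above.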

9.4 Most Dangerous Anti-Patterns

| Anti-pattern | Consequence |
| --- | --- |
| Infinite retry | Threads, connections, CPU, and the downstream all get dragged down |
| Retry without timeout | Every attempt may hang forever |
| Retrying at every layer | Retry multiplication explodes |
| Automatic retry of non-idempotent writes | Duplicate charges, orders, or coupon issuance |
| Immediate MQ requeue after failure | Redelivery loops |
| Large-scale fixed-interval retry | Synchronized traffic spikes |
| Sleeping in worker threads for retry | Thread pools get saturated |
| Tracking only per-request counts but not global budget | High concurrency can still crush the downstream |
| Dropping the request after retries are exhausted | Data loss and no compensation path |
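The "retrying at every layer" anti-pattern is easiest to see with arithmetic. As a worked example under assumed numbers, if a gateway, a service, and a client library each make up to 3 attempts, a single user request can reach the deepest dependency up to 3 × 3 × 3 = 27 times in the worst case. The helper name `worst_case_calls` below is hypothetical.

```python
import math
from typing import Sequence


def worst_case_calls(attempts_per_layer: Sequence[int]) -> int:
    """Upper bound on calls reaching the deepest dependency for one request.

    Each layer may repeat everything beneath it, so the per-layer attempt
    limits multiply rather than add.
    """
    return math.prod(attempts_per_layer)


print(worst_case_calls([3, 3, 3]))  # three layers, three attempts each
print(worst_case_calls([3, 1, 1]))  # retry at a single layer only
```

Restricting retry to one layer keeps the bound equal to that layer's own attempt limit, which is why the checklist asks whether another layer is already retrying.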

10. Conclusion

Retry strategy in software development is fundamentally about trading a limited number of extra attempts for greater tolerance of transient failure. It should be treated as part of reliability engineering, not as a few lines of loop logic in exception handling.

The final judgment of this article is:

  1. What is retry?
    Retry means re-executing an operation according to policy after failure in order to handle transient faults, partial failure, and brief unavailability.

  2. Why retry?
    Because networks, services, cloud resources, messaging systems, and distributed components all experience short-lived failures. Rational retry can significantly improve success rate and perceived availability.

  3. What if there is no retry?
    Many recoverable transient failures become visible business failures immediately, though that does not mean blind retry is acceptable.

  4. How should thread pools retry?
    Thread pools do not automatically retry business tasks. Execution failures should be caught and rescheduled through ScheduledExecutorService or another task system. Submission failures should be rate-limited, degraded, or rejected rather than endlessly resubmitted from a rejection handler.

  5. How should MQ retry work?
    MQ retry must handle duplicate consumption, offset or ack semantics, delayed retry, and dead-letter queues. RabbitMQ should not requeue forever, and Kafka consumers must control offset commit timing while ensuring consumer idempotency.

  6. How should HTTP retry work?
    HTTP retry must obey method idempotency, status-code semantics, and Retry-After. Idempotent methods such as GET, PUT, and DELETE are much safer to retry. POST and PATCH should not be retried automatically by default unless the business provides an idempotency key.

  7. How should gRPC retry work?
    gRPC should prefer official retry policy through service config, using maxAttempts, initialBackoff, maxBackoff, backoffMultiplier, retryableStatusCodes, and retry throttling.

  8. Which retry strategy is most recommended?
    For modern distributed systems, the default should be bounded truncated exponential backoff with jitter. Immediate retry should happen at most once. Fixed-interval retry fits only simple low-concurrency tasks. Long-lived MQ failures should move into delayed retry and DLQ workflows. High-QPS RPC must add retry budget and circuit breaking.

The article can be closed with one sentence:

Retry is medicine, not food. Used in small doses, with timeout, idempotency, backoff, jitter, and circuit breaking, it can save a system. Used without limits, without idempotency, without budget, and without observability, it drags the system into collapse.
