Designing Highly Available Distributed Rate Limiting Systems: From Single-Node Token Buckets to Asynchronous Quota Allocation

Overview

A systematic analysis of server-side rate limiting under high concurrency and burst traffic, covering local rate limiting, global rate limiting, token bucket, leaky bucket, Redis hotspots, quota pre-allocation, asynchronous reporting, edge throttling, partitioned limits, failure degradation, and high-availability architecture.

Abstract

Rate limiting is a common traffic governance mechanism used by server-side systems under high concurrency, traffic bursts, resource contention, and abusive access. Its goal is to control the rate of requests, connections, or resource consumption within a given time window, so backend services, databases, caches, third-party dependencies, and shared infrastructure remain within sustainable capacity. Single-node rate limiting uses local counters, local token buckets, or leaky bucket algorithms to make fast decisions inside a single process, node, or gateway instance. It has low latency, low dependency, and failure isolation characteristics, but it produces global quota error when load is unevenly balanced across multiple instances. Distributed rate limiting coordinates rate limit decisions across multiple instances through shared counters, global rate limit services, or quota coordination services. It can reduce the errors of single-node rate limiting under uneven load, but it introduces additional centralized dependencies, remote call paths, and server-side hotspots in the rate limiting system. Because rate limiting itself targets extremely high-QPS scenarios, precise per-request global counting transfers business request pressure linearly to the rate limit backend, causing Redis, cache clusters, or database backends to face single hot keys, single-shard writes, remote call amplification, and availability risks. Mainstream implementations in public official documentation show that large-scale distributed rate limiting usually combines local coarse-grained limiting, global fine-grained limiting, quota pre-allocation, periodic reporting, asynchronous rebalancing, edge rate limiting, partitioned limiting, and failure degradation. This design sacrifices strict precision in exchange for high availability, hotspot resistance, and system-wide protection.

Keywords: distributed rate limiting; global rate limiting; token bucket; leaky bucket; quota allocation; high availability; hotspot governance; server-side protection

1. Introduction

In microservices, API gateways, service meshes, edge networks, and cloud-native architectures, a single backend resource is usually accessed by multiple clients, service instances, or entry nodes. When request rate exceeds backend processing capacity, excessive requests cause increased queueing latency, thread pool exhaustion, connection pool exhaustion, amplified database pressure, cache avalanche, cascading downstream failures, or resource consumption by malicious traffic. Official documentation usually defines rate limiting around controlling the number of requests within a period of time, preventing overload, defending against abuse, ensuring fair use, and enforcing quota entitlements. For example, Envoy Gateway describes rate limiting as a technique that controls inbound request counts during a defined time period and can be used for business quotas, system stability, overload prevention, and denial-of-service defense. NGINX documentation uses request rate limiting to prevent DDoS or prevent upstream servers from being overwhelmed by too many concurrent requests. AWS API Gateway documentation uses throttling and quotas to help APIs avoid being overwhelmed by too many requests. Google Cloud Armor uses rate-based rules to protect applications from large request volumes and prevent individual clients from exhausting application resources [1][2][3][4].

Rate limiting is not limited to public APIs. Typical scenarios include public API gateway throttling, login and verification-code anti-brute-force controls, crawler and malicious client suppression, tenant-level API quotas, internal service-to-service call protection, database and cache front protection, message consumption rate control, third-party API quota protection, usage control by user identity or API key, and traffic protection for a route, service, instance, or backend cluster inside a service mesh.

2. Rate Limit Objects and Algorithm Basics

Rate limit objects can be requests, connections, bytes, tokens, business cost, or abstract quota units. Rate limit dimensions can include IP, user, tenant, API key, route, method, service name, region, device fingerprint, request header, cookie, business tag, or a combination of dimensions. Common algorithms include fixed window, sliding window, leaky bucket, and token bucket. Different algorithms correspond to different time-window precision, burst absorption capability, state storage cost, and computational complexity.

Token buckets are common in cloud provider API gateways. AWS API Gateway documentation states that it uses the token bucket algorithm for throttling, where a token represents one request and the system decides based on request submission rate and burst limits. NGINX request rate limiting documentation states that its method is based on the leaky bucket algorithm. Apache APISIX limit-count plugin documentation states that it uses a fixed window algorithm to limit request counts in a given time interval, rejecting requests that exceed the quota [3][4][5].

3. Single-Node Rate Limiting: Capability Boundaries and Error Sources

Single-node rate limiting means the rate limit state is stored inside one process, proxy, gateway instance, or service node, and decisions do not depend on external shared state. Envoy documentation calls local rate limiting non-distributed rate limiting, and explains that Envoy supports L4 connection-level local rate limiting and HTTP request-level local rate limiting. Envoy's HTTP local rate limit filter uses a token bucket and returns 429 by default when no token is available. NGINX also supports counting connections or request rates by key inside the same NGINX instance based on a shared memory zone [2][6][7].

Single-node rate limiting solves several problems. First, it can reject excessive requests quickly on the local node, preventing requests from entering business thread pools, connection pools, or backend dependencies. Second, it does not require a remote rate limit service, so decision latency is low. Third, when a rate limit service, Redis, or database is unavailable, it does not enlarge the failure domain. Fourth, it can absorb instantaneous bursts and act as a coarse-grained protection layer before later global rate limiting.

The problem with single-node rate limiting is that its state is visible only inside the local instance. In multi-replica deployments, if each instance independently maintains its local rate limit bucket, global behavior depends on load balancing results. Suppose the global target quota is L, the number of instances is N, and each instance's local quota is set to L/N. When traffic is evenly distributed, the global allowed volume is close to L. When traffic is uneven, hot instances may exhaust their local quota early and reject requests, while cold instances still have remaining quota, causing the global allowed volume to be lower than the target. If every instance is configured with the full quota L, the theoretical global allowed upper bound expands to N x L. Envoy Gateway documentation describes the same phenomenon: local rate limiting is executed independently by each Envoy instance. If two Envoy replicas each limit to 50 requests per minute, the total may allow 100 requests per minute. With three replicas, the theoretical maximum increases with the replica count [8].

Therefore, single-node rate limiting errors come from two directions: replica count can amplify the total quota, and uneven load balancing can cause some instances to reject requests while other instances have idle quota. This error becomes more obvious with long-lived connections, HTTP/2, gRPC, client connection reuse, uneven L4 load balancing, regional traffic skew, and concentrated access by hot tenants.

4. Distributed Rate Limiting: Global Consistency and New Failure Points

The core goal of distributed rate limiting is to share rate limit state across multiple gateways, proxies, or service instances, so multiple entry nodes jointly follow the same global quota. Envoy documentation explains that global rate limiting is appropriate in specific scenarios, especially when many downstream hosts forward to a small number of upstream hosts and average request latency is low. Envoy provides two types of global rate limiting implementations: per-connection or per-HTTP-request checks, and quota-based implementations that use periodic load reporting and fairly share global limits across multiple Envoy instances [9].

A typical per-request global rate limiting implementation is a proxy calling an external Rate Limit Service on the request path. Envoy's HTTP rate limit filter calls the rate limit service when route or virtual host matching rate limit configuration exists. Each matching configuration produces a descriptor sent to the rate limit service. If any descriptor is over limit, the proxy returns 429. Istio official task documentation also explains that Envoy supports both global and local rate limiting. Global rate limiting uses a global gRPC rate limiting service to provide limits for the whole mesh, while local rate limiting limits request rate by service instance [9][10].

Distributed rate limiting solves the global quota error of single-node limiting under uneven load. A global rate limit service can maintain unified counters by tenant, API, route, IP, service, or combined descriptor, so different entry instances no longer rely only on local state. However, distributed rate limiting also introduces new failure points. Envoy's HTTP rate limit filter has a failure_mode_deny configuration. When the rate limit service does not respond, if this configuration is true, Envoy does not allow traffic to pass when communication with the rate limit service fails; if a call error occurs and the configuration is true, an error is returned. Kong Gateway documentation explains that when a Kong node using the Redis strategy disconnects from Redis, it falls back to local rate limiting. At that point, counters cannot be synchronized across nodes, and users may issue requests beyond the configured limit. Apache APISIX limit-count provides an allow_degradation configuration, which allows APISIX to continue processing requests without the plugin when the plugin or its dependency is unavailable [11][12][5].

This leads to an objective constraint: while distributed rate limiting improves global quota consistency, it must handle trade-offs among external rate limit services, shared storage, network calls, timeouts, failure degradation, and data consistency. If fail-closed is selected, rate limit service failure may block normal requests. If fail-open or local fallback is selected, global rate limit precision decreases.

5. Capacity Constraints of Precise Synchronous Global Counting

Rate limiting usually happens in extremely high-QPS scenarios. If every business request synchronously accesses a global counting backend, request volume to the rate limit backend is on the same order as the protected business request volume. Envoy documentation clearly states that per-connection or per-HTTP-request global rate limiting calls the rate limit service for every new connection or request. Envoy's reference implementation is a Go/gRPC service that reads rate limit configuration, generates cache keys, accesses Redis cache, and returns decisions [9][13].

When global counters are implemented with Redis, cache clusters, or databases, precise synchronous counting creates the following constraints. First, every allow or reject decision depends on remote state reads and writes, making the rate limiting system itself a synchronous dependency on the request path. Second, all requests for the same global rate limit key must operate on the same logical counter. Hot tenants, hot APIs, or global total keys become hotspots. Third, although Redis Cluster can distribute keys across nodes through slots, official documentation states that a single key is not split across multiple nodes. Each master node serves a subset of 16,384 hash slots in stable state, and a single hash slot is served by a single node. Redis documentation also provides hot key tracking to identify keys that consume a high proportion of CPU time or network bytes during tracking [14][15].

Therefore, implementing "per-request, strongly consistent, globally precise" rate limiting with Redis or any high-performance database does not eliminate hotspot writes and centralized dependency. It merely converts business request pressure into rate limit storage pressure. For extremely high QPS, a small number of hot keys, or global total quota scenarios, the bottleneck of precise synchronous counting concentrates on the rate limit service, network path, Redis shard, database partition, or a single hot key.

6. Availability-First Asynchronous Quota Model

Large-scale distributed rate limiting usually does not take "every request must be remotely and precisely counted" as its only goal. Its primary goal is protecting system availability. Multiple official documents describe the tension between rate limit precision and distributed architecture. AWS API Gateway documentation states that throttles and quotas are applied on a best-effort basis and should be treated as targets rather than guaranteed request ceilings. Usage plan throttling and quotas are also not hard limits, and clients may exceed quotas in some cases. Azure API Management documentation states that because of the distributed nature of the throttling architecture, rate limiting is never completely accurate, and the difference between configured allowed request counts and actual request counts varies with request volume, rate, backend latency, and other factors. Google Cloud Armor documentation states that enforced rate limits are approximate and may not strictly match configured thresholds. It recommends using rate limiting for abuse mitigation or maintaining application and service availability, rather than strict quotas or licensing requirements. Cloudflare documentation states that rate limiting rules are not intended to allow an exact number of requests to reach the origin, counter updates may have delays of up to several seconds, and counters are maintained per data center by default. Fastly documentation states that its rate counters are designed to quickly count high traffic, not for exact counting, and distinguishes anti-abuse rate limiting from resource rate limiting [3][16][17][18][19][20].

The basic idea of an asynchronous quota model is to move "real-time quota acquisition and calculation" out of the synchronous request path. Envoy Rate Limit Quota Service protocol provides a typical model: the data plane groups requests into quota buckets and periodically reports request load to RLQS; RLQS aggregates usage reports from multiple data plane instances and distributes Rate Limit Assignments to each instance; when quota changes, RLQS proactively pushes new assignments; the Envoy filter makes request decisions based on the locally received quota assignment. This mode changes per-request remote counting into periodic usage reporting and local quota consumption, reducing synchronous pressure on the global rate limit service under extremely high QPS [21].

In this model, precision loss comes from reporting intervals, quota distribution latency, load changes among instances, quota TTL, and using the last assignment or predefined fallback values during failures. Availability benefits come from localizing the request path, reducing remote call frequency, smoothing rate limit service load, and increasing space for failure degradation.

7. Implementation Patterns in Mainstream Vendors and Open Source Components

Mainstream implementations in public official documentation can be grouped into the following categories.

The first category is local rate limiting. NGINX uses shared memory zones to count connections or request rates by key within a single instance and limits requests based on the leaky bucket model. Envoy local rate limit filter uses a local token bucket and returns 429 when no token is available. This mode is suitable for node-level protection, burst absorption, and low-dependency scenarios [2][6][7].

The second category is external global rate limit services. Envoy executes global rate limiting through a gRPC Rate Limit Service, and the reference implementation uses Go/gRPC with a Redis backend. Istio integrates Envoy's global rate limit HTTP filter through EnvoyFilter and uses Envoy's Go + Redis reference implementation as an example. Envoy Gateway's global rate limit task documentation also requires Redis as the rate limit backend [9][10][13][22].

The third category combines local and global rate limiting. Envoy documentation explains that local rate limiting can be combined with global rate limiting to reduce load on the global rate limit service; local token buckets can absorb large bursts that might overwhelm the global service. Envoy Gateway documentation also explains that local limits execute first on each Envoy instance, then global limits execute through a shared rate limit service. A request must pass both checks to be allowed [8][9].

The fourth category is multi-strategy counting in gateway plugins. Apache APISIX limit-count supports local, redis, and redis-cluster strategies. Local stores counters in local memory, while redis and redis-cluster store counters in Redis or Redis Cluster and allow multiple APISIX nodes to share quota. Kong Gateway's advanced rate limiting plugin supports a Redis strategy and falls back to local rate limiting when Redis is disconnected [5][12].

The fifth category is managed cloud provider rate limiting. AWS API Gateway uses token buckets and supports account-level, stage-level, method-level, and usage-plan-level throttling and quotas, but defines them as best-effort targets rather than strict upper bounds. Azure API Management provides rate-limit and rate-limit-by-key policies and clearly states that rate limiting is not completely accurate under distributed throttling architecture. Google Cloud Armor provides throttle and rate-based ban, and supports aggregating requests by keys such as ALL, IP, HTTP_HEADER, XFF_IP, Cookie, HTTP_PATH, SNI, REGION_CODE, JA3, and JA4. Its documentation also states that cross-region and internal routing affect accuracy. Cloudflare WAF rate limiting rules maintain counters by characteristic combinations, but do not support globally exact counters by default. Fastly Edge Rate Limiting provides ratecounter and penaltybox, where ratecounter is used to quickly count high traffic rather than count precisely [3][16][17][18][19][20].

These implementations show that large-scale rate limiting is not a single "Redis precise counter." It is composed of local rate limiting, global rate limiting, shared storage, asynchronous reporting, edge throttling, approximate counting, failure degradation, and multi-level protection.

8. Hotspot and Large-Connection Governance Under Massive Traffic

When massive-traffic applications adopt distributed rate limiting, the rate limit service faces two types of hotspots. The first type is counting hotspots, where many requests concentrate on the same rate limit key, Redis slot, database partition, or global quota object. The second type is connection hotspots, where many proxies, gateways, or clients establish long-lived connections to a small number of rate limit service instances, causing connection counts, HTTP/2 streams, gRPC subchannels, connection pools, and server thread models to become imbalanced.

Standard techniques for counting hotspots include:

Local coarse limiting first. Before entering the global rate limit service, each proxy instance first uses a local token bucket or leaky bucket to absorb bursts. Envoy documentation clearly states that a local token bucket can absorb large bursts that might overwhelm the global rate limit service, forming a two-stage model of local coarse-grained and global fine-grained limiting [9].
Asynchronous quota pre-allocation. Split global quota into local quota assignments distributed to data plane instances. Instances consume quota locally and periodically report actual usage, and the server rebalances. Envoy RLQS adopts this model of periodic reporting by the data plane, aggregation by the server, and assignment distribution [21].
Splitting keys by rate limit dimension. Use tenant, API, route, region, caller, user, request header, or business tag as the rate limit key, so different business entities fall into different counting objects. Cloudflare, Google Cloud Armor, APISIX, and Envoy RLS all support aggregation by different keys, characteristics, or descriptors [5][13][18][19].
Virtual sharding of hot keys. For naturally global hot keys, split a single logical key into multiple virtual buckets. Local instances or entry nodes write to different buckets by hash, then asynchronously aggregate approximate global usage. This design sacrifices precision and real-time accuracy in exchange for write spreading. Redis Cluster's documented fact that a single key is not split across nodes means hot keys need to be split at the application layer, not rely only on Redis Cluster to automatically eliminate single-key hotspots [14][15].
Regional, cell, and cluster-level rate limiting. Pre-split global quota by region, availability zone, cell, or cluster, so rate limit decisions are made locally. Google Cloud Armor documentation states that its rate limit threshold is enforced independently in each associated backend or region, and Cloudflare also states that counters are maintained per data center by default. These implementations reflect edge or regional partitioning of rate limiting [18][19].

Standard techniques for connection hotspots include:

Horizontally scaling rate limit services and enabling client-side load balancing. gRPC official documentation explains that load balancing distributes requests across multiple servers to avoid overloading one server. The default pick_first strategy does not truly load balance, while round_robin connects to all resolved addresses and rotates backends on each RPC. For high-frequency gRPC services such as RLS, client-side round_robin or xDS/Envoy load balancing can prevent many RPCs from being fixed on one service instance [23].
Using L7 load balancing policies. Envoy supports pluggable load balancing policies, including weighted round robin, client-side weighted round robin, weighted least request, ring hash, Maglev, and random. For rate limit services, least request, weighted round robin, or client-side weighted round robin based on load reporting can reduce request imbalance among instances [24].
Using service discovery and EndpointSlice to dynamically update backend sets. Kubernetes Service provides a stable access entry for a group of Pods, EndpointSlice records the current Pod set supporting the Service, and service proxies route traffic to backends. This mechanism can expose multi-replica RLS, but long-lived protocols still need to combine gRPC client load balancing or L7 proxies to avoid a small number of long connections sticking to a small number of backends [25].
Connection pool, timeout, and circuit breaking isolation. A global rate limit service must configure short timeouts, connection pool capacity, error statistics, and failure strategies. Envoy rate limit filter has default timeout configuration and supports fields such as failure_mode_deny, status_on_error, and runtime fractional enforcement. This mechanism allows systems to execute fail-open, fail-closed, or proportional degradation when the rate limit service is slow, wrong, or unreachable [11].
Edge rate limiting and pre-origin traffic shaping. Edge-side rate limiting capabilities from Cloudflare, Fastly, Google Cloud Armor, and similar platforms can perform coarse filtering before requests reach origins or internal service meshes. Edge rate limiting documentation usually states that counting is not globally and strictly precise, but it reduces the raw request volume that origin systems and internal rate limit services need to process [18][19][20].

9. Reference Architecture for Distributed Rate Limiting Systems

A distributed rate limiting system for extremely high QPS can use the following layered architecture.

The first layer is edge rate limiting, deployed at CDN, WAF, cloud load balancer, or public API gateway. It performs coarse-grained rate limiting by IP, geography, device fingerprint, path, header, cookie, or API key. This layer reduces the impact of malicious traffic, crawlers, brute force attempts, and abnormal bursts on internal systems.

The second layer is local instance rate limiting, deployed inside Envoy, NGINX, APISIX, Kong, application sidecars, or service processes. This layer uses local token buckets, leaky buckets, or fixed-window counters to make fast decisions, protecting local thread pools, connection pools, and downstream dependencies while reducing pressure on global rate limit services.

The third layer is global quota coordination, maintaining logical quotas by tenant, service, route, API key, business resource, or combined descriptor. For medium and low QPS rules or rules requiring stronger consistency, per-request RLS plus Redis or Redis Cluster can be used. For extremely high QPS, hot keys, or obviously uneven load, a quota lease, periodic reporting, and quota assignment model should be used, so the data plane makes decisions locally and global coordination frequency is reduced to seconds or longer.

The fourth layer is asynchronous aggregation and auditing. It receives logs, metrics, rate limit hit statistics, and quota consumption reports for compensation calculation, reporting, alerting, tenant usage auditing, rule tuning, and long-term quota management. This layer does not perform strongly dependent decisions on the synchronous request path.

The fifth layer is failure degradation. It defines behavior when the rate limit service is unavailable, Redis is unavailable, quota assignments expire, the configuration center is unavailable, or cross-region network partitions occur. Common strategies include fail-open, fail-closed, continuing to use the last quota, local default quota, proportional rejection, enabling only edge and local rate limiting, and disabling non-core high-cost APIs.

The core constraint of this architecture is that the synchronous path keeps only necessary local decisions; global coordination is completed through asynchronous reporting, quota pre-allocation, and periodic rebalancing. Precision is determined by the business scenario. Anti-abuse and service protection scenarios should prioritize availability, while billing, licensing, and strict quota scenarios require additional auditing, compensation, and low-rate strongly consistent checks.

10. Conclusion

The purpose of rate limiting is to control the rate of requests, connections, or resource consumption, protecting backend services and shared resources from overload, abuse, and bursts. Single-node rate limiting solves the problem of fast local protection, but it has global quota errors when load balancing is uneven across multiple instances. Distributed rate limiting solves the problem of shared quotas across multiple instances, but introduces external rate limit services, shared storage, and network calls as new failure points. Because rate limiting itself targets extremely high-QPS scenarios, precise synchronous per-request global counting transfers business request pressure linearly to the rate limit backend and creates hot keys and single-shard bottlenecks in Redis, databases, or cache clusters. Mainstream official implementations show that large-scale rate limiting systems generally accept a certain degree of approximation and combine local rate limiting, global rate limiting, asynchronous quotas, periodic reporting, edge traffic shaping, partitioned execution, hotspot splitting, and failure degradation to protect server-side systems. For massive-traffic applications, the design focus of distributed rate limiting is not pursuing absolutely precise counting, but maintaining system availability, reducing hotspots, protecting core dependencies, and preventing the rate limiting system itself from becoming a new bottleneck under explicit error boundaries.

References

[1] Envoy Gateway Rate Limiting official documentation. [2] NGINX Limiting Access to Proxied HTTP Resources official documentation. [3] AWS API Gateway Throttle Requests official documentation. [4] Google Cloud Armor Rate Limiting Overview official documentation. [5] Apache APISIX limit-count plugin official documentation. [6] Envoy Local Rate Limiting official documentation. [7] Envoy HTTP Local Rate Limit Filter official documentation. [8] Envoy Gateway Local and Global Rate Limiting official documentation. [9] Envoy Global Rate Limiting official documentation. [10] Istio Enabling Rate Limits Using Envoy official documentation. [11] Envoy HTTP Rate Limit Filter / Proto official documentation. [12] Kong Rate Limiting Advanced plugin official documentation. [13] Envoy Ratelimit Go/gRPC reference implementation official repository. [14] Redis Cluster Specification official documentation. [15] Redis HOTKEYS official documentation. [16] AWS API Gateway Usage Plans official documentation. [17] Azure API Management rate-limit policy official documentation. [18] Google Cloud Armor Rate Limiting Overview / Envoy Integration official documentation. [19] Cloudflare WAF Rate Limiting Rules / Request Rate Calculation official documentation. [20] Fastly Rate Limiting official documentation. [21] Envoy Rate Limit Quota Service / Rate Limit Quota Filter official documentation. [22] Envoy Gateway Global Rate Limit official documentation. [23] gRPC Custom Load Balancing Policies official documentation. [24] Envoy Supported Load Balancers official documentation. [25] Kubernetes Service / EndpointSlice official documentation.

Chinese Reference

Read the original Chinese article

Designing Highly Available Distributed Rate Limiting Systems: From Single-Node Token Buckets to Asynchronous Quota Allocation ​

Overview ​

Abstract ​

1. Introduction ​

2. Rate Limit Objects and Algorithm Basics ​

3. Single-Node Rate Limiting: Capability Boundaries and Error Sources ​

4. Distributed Rate Limiting: Global Consistency and New Failure Points ​

5. Capacity Constraints of Precise Synchronous Global Counting ​

6. Availability-First Asynchronous Quota Model ​

7. Implementation Patterns in Mainstream Vendors and Open Source Components ​

8. Hotspot and Large-Connection Governance Under Massive Traffic ​

9. Reference Architecture for Distributed Rate Limiting Systems ​

10. Conclusion ​

References ​

Chinese Reference ​

Join the discussion