Skip to content

Site Reliability Engineering for Middleware Platforms

Overview

A systematic study of how middleware and microservice teams should define SLI, SLO, and SLA, and how observability and service governance should form a closed reliability loop.

Abstract

Site Reliability Engineering, or SRE, is an engineering discipline for managing reliability. Its core idea is not "add more dashboards" or "put more people on call," but to turn reliability into something measurable, negotiable, governable, and reviewable through SLI, SLO, and SLA. In Google's SRE model, an SLI is a quantitative measure of some aspect of service level, an SLO is the target value or target range built on top of an SLI, and an SLA is a formal commitment made to service consumers, often with consequences when the commitment is missed. Google SRE documentation explicitly describes SLI as a quantitatively defined measure of some aspect of service level, and SLO as a target value or range of values for that measure. Google Cloud also emphasizes that an SLA is usually a commitment made to users and may trigger compensation, service credits, or other penalties when not achieved. (Google SRE)

This article argues that the hard part of middleware SRE is not copying availability metrics from application services. It is that middleware is usually a shared dependency and a combined system of control plane and data plane. Therefore, middleware reliability indicators must cover request success on the data plane, configuration propagation latency on the control plane, rule consistency, persistence durability, client-perceived availability, fault isolation, and recovery capability at the same time. For microservices, SLI, SLO, and SLA should be designed around user journeys, interface contracts, call paths, and business correctness. Observability is not just a downstream support system for governance. It is the factual source of governance. Service governance is not merely an executor downstream from observability either. It feeds topology, labels, change events, and governance context back into the observability system. The two should form a closed loop of "observe, judge, govern, validate, and review."

Keywords: SRE, SLI, SLO, SLA, middleware, microservices, service governance, observability, error budget, site reliability engineering


1. Introduction

In traditional operations models, reliability is often described with vague phrases such as "stable," "mostly fine," "fewer incidents recently," or "machine load looks acceptable." The biggest problem with such language is simple: it does not support engineering decisions. When development teams want to ship faster, operations teams want to reduce risk, and business teams want fewer incidents, the absence of a shared reliability language leaves people with only debate, intuition, and blame after the fact.

SRE tries to solve that problem. The Google SRE model proposes the "four golden signals" of latency, traffic, errors, and saturation as the most fundamental dimensions for monitoring user-facing systems. Google SRE explicitly recommends that if you can monitor only four kinds of metrics, focus on latency, traffic, errors, and saturation first. (Google SRE) But in modern enterprise architecture, especially in middleware and microservice ecosystems, those four dimensions alone are not enough. The reasons are straightforward:

First, middleware is not a single business endpoint. It is shared infrastructure depended on by many systems. Registry centers, configuration centers, rate-limiting systems, gateways, message queues, caches, database proxies, and tracing platforms all have amplified blast radius when they fail.

Second, a microservice is not an isolated process. It is a dynamic system composed of service calls, asynchronous messaging, cache access, database access, configuration changes, and traffic-control rules such as rate limiting and circuit breaking. A service process being alive does not mean users are having a healthy experience.

Third, if observability is only able to "see the problem" but cannot drive rate limiting, circuit breaking, degradation, routing, scaling, or rollback, it degenerates into an evidence system for after the incident instead of a control system for before and during the incident.

Therefore, this article argues that the core of middleware site reliability engineering is to use SLI, SLO, and SLA as reliability contracts, observability as the factual foundation, and service governance as the execution mechanism, in order to build a continuous closed-loop reliability system.


2. What SLI, SLO, and SLA Mean and Why We Need Them

2.1 SLI: Service Level Indicator

SLI stands for Service Level Indicator. It answers one question: how is the system actually performing right now?

An SLI must be quantitative, not vague. For example:

Reliability DimensionPoor ExpressionProper SLI Expression
AvailabilityThe service is pretty stableSuccessful requests over the last 30 days / total valid requests
LatencyThe API feels slowp95 latency, p99 latency
CorrectnessThe data is mostly fineCorrect results / total requests
FreshnessConfig sync is fastp99 latency from config release to client effective time
DurabilityMessages are not lostSuccessfully persisted messages / acknowledged messages

A qualified SLI should satisfy four conditions:

  1. Measurable: it must be computable from metrics, logs, traces, probes, or business validation.
  2. User-relevant: it should reflect caller, business, or end-user experience rather than only machine health.
  3. Boundary-defined: numerator, denominator, time window, and filters must be explicit.
  4. Attributable: when the metric degrades, it should be possible to break the issue down by service, API, caller, region, version, dependency, or governance rule.

My judgment is that many teams fail at monitoring not because they have too few metrics, but because they defined the wrong SLI. If a team watches only CPU, memory, process liveness, and open port state, but does not watch user request success, config propagation latency, message backlog latency, or rule-hit correctness, the monitoring system has very limited value for reliability governance.

2.2 SLO: Service Level Objective

SLO stands for Service Level Objective. It answers another question: what level of reliability should the system achieve?

An SLO usually contains three parts:

text
SLO = SLI + target threshold + time window

For example:

text
Within the last 30 days, 99.9% of valid order-creation requests should return successfully within 300 ms.

Here:

PartExample
SLISuccess rate and latency of valid order-creation requests
Target thresholdsuccess rate >= 99.9%, p95 latency <= 300 ms
Time windowlast 30 days

The major value of SLO is that it introduces the error budget. If the SLO is 99.9%, then the allowed unavailable portion is 0.1%. That 0.1% is the error budget. The Google SRE Workbook points out that once you have an SLO, you can derive an error budget and define policies for how teams should act when that budget is exhausted. (Google SRE)

At its core, the error budget is an engineering management tool:

text
Error budget = 1 - SLO

For example:

SLO30-Day Error Budget
99%about 7.2 hours of unavailability
99.9%about 43.2 minutes of unavailability
99.99%about 4.32 minutes of unavailability

The purpose of SLO is not to pursue "never fail." It is to define how much failure the system can tolerate. If error budget is still abundant, teams can ship faster and experiment more aggressively. If the budget is burning quickly, teams should reduce change frequency and prioritize reliability work.

2.3 SLA: Service Level Agreement

SLA stands for Service Level Agreement. It answers a different question: what is the provider committing to the consumer, and what happens if the commitment is not met?

The distinction between SLA and SLO is critical:

DimensionSLOSLA
NatureInternal reliability targetFormal external or cross-team commitment
PurposeGuide engineering governanceClarify accountability and breach consequences
PenaltyUsually no direct penaltyUsually compensation, downgrade, credits, or assessment
Setting principleShould be stricter than SLAShould not exceed real system capability
AudienceEngineering, SRE, platform teamsCustomers, business teams, upstream and downstream consumers

One especially important principle is: SLO should be stricter than SLA.

For example, if the external SLA promise is 99.9%, the internal SLO may be 99.95%. That leaves a safety buffer. If the internal SLO and external SLA are identical, then any slight SLO miss can immediately become an SLA breach, which is operationally dangerous.

2.4 Why Design SLI, SLO, and SLA

The reason to design SLI, SLO, and SLA is not to generate reports. It is to establish order in reliability governance.

First, turn reliability from intuition into quantified judgment.
Without SLI, no one can say whether a service is actually good or bad. A live process is not the same as a reliable service, and a high request success rate is not the same as business correctness. Reliability must be defined precisely.

Second, turn team disputes into target negotiation.
Developers want to release, business wants stability, and platform teams want lower operational risk. Without SLO, there is no common decision standard. With SLO, the discussion becomes: "Does the current error budget allow continued release?"

Third, turn alerts from noise into decision signals.
The Google SRE Workbook emphasizes that alerts based on SLI and error budget should notify teams only about events that truly matter, especially those that are consuming significant error budget. (Google SRE) That is much more mature than alerting every time CPU exceeds 80%.

Fourth, turn incident review into system improvement.
After an incident, the question should not only be "who caused it?" It should be: which SLI degraded, which SLO was violated, why did error budget burn, were governance actions timely, and did observability provide enough data?

Fifth, turn platform capability into a service contract.
A middleware platform should not merely say "we provide a registry center, a configuration center, and a rate-limiting system." It should say: "we guarantee config release reaches clients within p99 3 seconds, service discovery availability is 99.95%, and rule-hit correctness is 99.99%."


3. How Middleware Should Define SLI, SLO, and SLA

Middleware SLI, SLO, and SLA cannot simply copy business API indicators. Middleware usually has these characteristics:

  1. Shared dependency: many businesses rely on it, so one failure affects many systems.
  2. Control plane and data plane separation: config publishing is control plane, request forwarding is data plane.
  3. A mix of strong consistency and eventual consistency: common in registry centers, config centers, and rule distribution.
  4. Deep client SDK participation: reliability depends not only on the server but also on client caching, retry, and degradation behavior.
  5. Amplified failure impact: a middleware wobble can trigger business-wide avalanches.

Therefore, middleware metrics should be designed by capability domain, not just by machine resource.

3.1 Core Categories of Middleware SLI

3.1.1 Availability SLI

This applies to registry centers, configuration centers, gateways, rate-limiting services, message queues, cache proxies, and similar systems.

text
Availability SLI = successful requests / total valid requests

But "successful" must be defined in middleware semantics:

Middleware TypeDefinition of Success
Registry centerservice registration succeeds, heartbeat renewal succeeds, service discovery returns valid instances
Configuration centerconfig read succeeds, config publish succeeds, client watch succeeds
API gatewayrequest is routed correctly and does not fail because of gateway-side errors
Rate-limiting systemrule lookup succeeds, rate-limit judgment succeeds, quota deduction succeeds
MQsend acknowledgment succeeds, message persistence succeeds, consumption acknowledgment succeeds
Cache proxyget, set, and delete succeed, excluding invalid client requests

Here, caller-originated errors must be excluded. HTTP 400, auth failure, or illegal arguments should not all count as middleware failure. Otherwise the SLI is polluted by consumer misuse.

3.1.2 Latency SLI

Middleware latency cannot be represented by averages alone. It must use percentiles, especially p95, p99, and p999.

Middleware TypeExample Latency SLI
Registry centerp99 service-discovery latency
Configuration centerp99 latency from config publish to client effective time
Gatewaygateway internal p99 latency, excluding backend business latency
MQp99 produce ack, p99 end-to-end delivery
Cache proxyp99 command execution latency
Rate-limiting systemp99 latency of a single rate-limit judgment

One key point is that middleware latency must separate self time from downstream time. For example, a slow gateway request may come from a slow backend rather than from the gateway itself. Therefore, gateway SLI should at least break down into:

text
gateway_total_latency
gateway_internal_latency
upstream_service_latency

Otherwise governance decisions will be misguided.

3.1.3 Correctness SLI

This is the most frequently ignored and also one of the most important middleware indicators.

Middleware CapabilityCorrectness SLI
Registry centerwhether discovery results include healthy instances and remove unhealthy ones
Configuration centerwhether client config version matches the server-intended version
Rate-limiting systemwhether rule matching and quota deduction are correct
Routing systemwhether gray traffic enters the intended version
Circuit-breaking systemwhether breaker state follows error-rate and window rules
MQwhether messages are duplicated, reordered, or lost, and whether promised semantics are respected
Cachewhether data is expired, penetrated, or polluted

My view is that middleware SRE is incomplete if it defines only availability and latency but not correctness. The scariest middleware failures are often not "it is down," but "it is still alive and returning the wrong result."

3.1.4 Consistency and Propagation SLI

Control-plane middleware must define propagation metrics.

For a configuration center:

text
config propagation latency = time client observes target config version - time server marks publish successful

For a registry center:

text
instance-removal propagation latency = time client no longer discovers an unhealthy instance - time server judges the instance unhealthy

For a governance rule center:

text
rule effective latency = time traffic actually follows the new rule - time rule publish succeeds

These indicators are often closer to real middleware value than plain API availability. A configuration center API can be 100% available while config takes 5 minutes to become effective, which is still a serious business problem.

3.1.5 Durability and Recovery SLI

This category applies to MQ, registry/config storage, rule centers, and metadata centers.

IndicatorMeaning
RPOmaximum data loss tolerated after a failure
RTOhow quickly service must recover after a failure
Data loss ratelost data / data that should have been persisted
Replica sync delayreplication lag between follower and leader
Snapshot recovery success ratesuccessful snapshot recoveries / total recoveries

If a middleware system stores core configuration, governance rules, or message data but has no RPO or RTO indicators, its reliability model is incomplete.

3.1.6 Saturation SLI

Saturation is not ordinary resource monitoring. It should reflect whether the system is approaching its processing limit.

MiddlewareSaturation Indicators
Gatewayconnection count, active requests, thread-pool queue, upstream pending requests
MQtopic backlog, consumer lag, broker disk waterline
Registry centerwatch count, push-queue backlog, heartbeat handling queue
Configuration centerlong-connection count, change push queue, client watch count
Rate-limiting systemhot-rule QPS, quota-store conflicts, remote-judgment latency
Cachememory waterline, eviction rate, hot-key traffic

The value of saturation metrics is to predict failure instead of waiting for it to happen and only then alerting.

3.2 Example Middleware SLOs

MiddlewareSLIExample SLO
Registry centerservice-discovery success rate>= 99.95% over the last 30 days
Registry centerunhealthy-instance removal propagation latencyp99 <= 5 seconds
Configuration centerconfig-read success rate>= 99.99% over the last 30 days
Configuration centerconfig effective latency after publishp99 <= 3 seconds
API gatewaygateway-side 5xx rate<= 0.05% over the last 30 days
API gatewaygateway internal processing latencyp99 <= 20 ms
Rate-limiting systemjudgment success rate>= 99.99% over the last 30 days
Rate-limiting systemrule-hit correctness>= 99.999% over the last 30 days
MQproduce ack success rate>= 99.95% over the last 30 days
MQend-to-end delivery latencyp99 <= 1 second
Rule centerrule publish success rate>= 99.99% over the last 30 days
Rule centerrule effective latencyp99 <= 5 seconds

One design principle matters here: middleware SLO must be tiered.

For example:

TierTypical ScopeAvailability Target
P0 core pathpayment, order creation, core gateway, core registry and config99.99%
P1 important pathprimary business systems, core management platforms99.95%
P2 normal pathinternal back-office systems, lower-frequency jobs99.9%
P3 non-core pathexperimental services, low-priority tools99%

Giving every middleware system the same 99.99% target without tiering is fake reliability engineering. Reliability has a cost, and the highest target should be reserved for truly critical business paths.

3.3 Example Middleware SLAs

Middleware SLA can target two kinds of consumers.

The first is internal business teams. For example, a platform team may commit:

text
The registry center provides 99.95% monthly availability for P0 business namespaces.
The configuration center guarantees p99 publish-to-effective latency within 3 seconds for P0 configuration.
The rate-limiting system provides 99.99% judgment success for core traffic paths.
If SLA is missed, the platform team must provide root cause, repair plan, and recurrence-prevention measures in the incident review.

The second is external customers. For example, a cloud or SaaS platform may commit:

text
The API gateway provides at least 99.9% monthly availability.
If availability falls below the SLA, service compensation is provided according to contract.

My suggestion is that internal middleware should also have SLA, but it does not need to start as a financial compensation contract. It can begin as an engineering accountability agreement covering incident severity, response time, escalation path, review deadline, remediation deadline, downgrade plans, and business onboarding requirements.


4. How Microservices Should Define SLI, SLO, and SLA

Microservice SLI, SLO, and SLA differ from middleware metrics. They are closer to business semantics and should be defined around whether the user request was completed correctly, not merely whether the service process is alive.

Spring Boot documentation defines observability as the ability to observe the internal state of a running system from the outside, and notes that it includes the three pillars of logs, metrics, and traces. For Spring Boot applications, metrics and traces are commonly based on Micrometer Observation. (Spring Boot) That means microservice reliability metrics should not stop at the machine layer. They must reach application framework, API, call path, and business-tag dimensions.

4.1 Core Categories of Microservice SLI

4.1.1 Request Availability

text
request availability = successful requests / total valid requests

But "successful request" must be defined in business semantics:

ScenarioIs It a Success?
HTTP 200 and business code = 0success
HTTP 200 but business code = inventory deduction failednot necessarily success, depends on business definition
HTTP 400 invalid argumentsusually not service failure
HTTP 401 unauthorizedusually not service failure
HTTP 500service failure
timeoutservice failure
circuit-break fallback resultsuccess depends on whether it still meets user expectation

This is crucial. Systems that look only at HTTP status code will seriously underestimate business failure rate.

4.1.2 Request Latency

Microservice latency SLI should at least include:

text
p50 / p95 / p99 / p999

And should be broken down by dimensions such as:

DimensionExample
serviceorder-service
endpointPOST /orders
callercheckout-service
regioncn-hz / sg
zoneaz1 / az2
versionv1.2.3
statussuccess / error
dependencymysql / redis / mq

An overall p99 alone is not meaningful. A service may look fine globally while one major customer, one region, or one version is performing badly. That is still a reliability issue.

4.1.3 Business Correctness

Business correctness is often more important than technical availability.

Business ScenarioCorrectness SLI Example
Order servicesuccessful order creation / total order requests
Payment servicecorrect payment status transitions / total payment requests
Inventory servicesuccessful and accurate stock deduction / total deduction requests
Coupon systemcorrectly issued coupons / total eligible requests
Search servicerequests returning valid results / total search requests

I strongly recommend that core business services define correctness indicators explicitly. A service returning HTTP 200 while generating wrong orders, wrong balances, or wrong route decisions is not healthy.

4.1.4 Dependency SLI

Many microservice incidents are actually dependency incidents. Therefore microservice SLI should also include:

text
dependency call success rate
dependency call p99 latency
dependency timeout rate
dependency retry rate
dependency circuit-break count
dependency degradation hit rate

Going one step further, teams can define:

text
the contribution of critical dependency availability to this service's own SLO

That helps answer an important question: when the service SLO is violated, is the root cause inside the service, in a dependency, in incoming traffic, or in governance rules?

4.1.5 Asynchronous Path SLI

Microservices rely heavily on MQ, events, and eventual consistency, so SLO cannot cover synchronous HTTP alone.

Asynchronous SLI includes:

IndicatorExample
Message production success ratesuccess rate of publishing order-created events
Message consumption success rateinventory service consumption success for order events
Consumption latencyp99 from production to consumption completion <= 5 seconds
Backlogconsumer lag
Dead-letter ratedead letters / total messages
Eventual consistency latencyp99 from order creation to inventory deduction completion

If the system depends heavily on MQ but its SLO covers only HTTP APIs, that SLO is incomplete.

4.2 Example Microservice SLOs

ServiceSLIExample SLO
User loginlogin success rate>= 99.9% over the last 30 days
User loginlogin latencyp95 <= 300 ms, p99 <= 800 ms
Order creationorder-creation success rate>= 99.95% over the last 30 days
Order creationbusiness correctness>= 99.99% over the last 30 days
Payment servicepayment request success rate>= 99.99% over the last 30 days
Payment serviceeventual consistency latency of payment statep99 <= 10 seconds
Search servicequery success rate>= 99.9% over the last 30 days
Recommendation servicenon-empty recommendation rate>= 99.5% over the last 30 days
Message consumptionconsumption success rate>= 99.95% over the last 30 days
Message consumptionconsumption latencyp99 <= 5 seconds

Microservice SLO should also be tiered by interface criticality:

Interface TierExampleSLO Strategy
Core transaction APIorder, payment, inventory deductionhigh availability, high correctness, low latency
Core read APIhomepage, product detail, searchhigh availability, medium-high latency requirement
Internal management APIadmin config, operations platformmedium availability
Offline job APIreporting, data syncfocus on completion rate and timeliness
Experimental APIA/B testing, exploratory recommendationmore error budget allowed

4.3 Example Microservice SLA

Microservice SLA can be split into external SLA and internal SLA.

External SLA example:

text
Open APIs provide at least 99.9% availability in a calendar month.
If availability falls below 99.9%, service compensation is provided according to contract.
Planned maintenance windows are excluded from SLA.
Client-originated invalid requests are excluded from SLA.

Internal SLA example:

text
The order service commits to the payment service:
POST /orders/create under P0 tier has monthly availability >= 99.95% and p99 latency <= 800 ms.
When error-budget burn exceeds threshold, the order team must freeze non-emergency release and provide a repair plan within 24 hours.

I believe internal microservice SLA is highly valuable because it turns "services blaming each other" into "dependencies governed by contract."


5. What These Indicators Are Actually For

5.1 Guiding Alerting

Alerts without SLO are usually noise.
Alerts with SLO become reliability signals.

Traditional alerts:

text
CPU > 80%
thread-pool queue > 1000
API error count > 100

SLO-based alerts:

text
The order service is burning error budget too quickly in the last hour.
If this continues for 6 hours, it will consume the 30-day error budget.

SLO alerts are closer to business risk. Grafana documentation also points out that alerts can be based on error-budget burn rate instead of reacting immediately to every minor deviation, and that SLO helps teams align around shared reliability goals. (Grafana Labs)

5.2 Guiding Release

When error budget is healthy:

text
normal release is allowed
gray experiments are allowed
architecture adjustments are allowed

When error budget is burning fast:

text
freeze non-essential releases
pause high-risk changes
prioritize stability fixes
extend gray observation windows
reduce release frequency

That is far more actionable than "things feel unstable lately, maybe do not release."

5.3 Guiding Capacity Planning

SLO helps answer whether capacity is sufficient:

text
If traffic grows 30%, will p99 latency still meet SLO?
Will MQ backlog push consumption latency beyond SLO?
As registry watch count grows, can push latency still meet SLO?

Capacity planning should not look only at CPU waterline. It should ask whether the SLO can still be achieved.

5.4 Guiding Architecture Design

If a service has a 99.99% SLO, it may need:

text
multi-replica deployment
cross-AZ disaster tolerance
rate limiting and circuit breaking
asynchronous traffic shaving
degradation plan
idempotency mechanisms
data backup
failure drills

If another service has only a 99% SLO, spending the same amount of reliability cost is unnecessary. Reliability design must be tiered, otherwise it becomes waste.

5.5 Guiding Incident Review

Incident review should revolve around SLI and SLO:

text
Which SLI degraded?
Which SLO was violated?
How much error budget was consumed?
Which callers, endpoints, regions, or versions were affected?
Were governance actions timely?
Did observability provide enough evidence for localization?
How can future incidents of the same type consume less error budget?

That is far more useful than writing generic lessons such as "strengthen monitoring, strengthen testing, strengthen on-call awareness."

5.6 Guiding Platform Productization

For a middleware platform, SLI, SLO, and SLA can directly become product capabilities:

text
SLO dashboards
error-budget dashboards
tenant reliability tiering
governance rule recommendations
capacity risk prediction
release admission checks
blast-radius analysis
service dependency scoring

In other words, SRE indicators are not just fields in a monitoring system. They are a product foundation for platform engineering.


6. How to Rely on Observability for Better Service Governance

The OpenTelemetry documentation states that the purpose of OpenTelemetry is to collect, process, and export signals, including measurable system state and events propagated across distributed system components. OpenTelemetry is also defined as a vendor-neutral open-source observability framework for generating, collecting, and exporting telemetry such as logs, metrics, and traces. (OpenTelemetry) The OpenTelemetry Collector provides a vendor-neutral implementation for receiving, processing, and exporting telemetry, reducing the burden of maintaining multiple agents or collectors. (OpenTelemetry)

With those capabilities in mind, service governance should no longer rely on manual observation. It should rely on observability data to form a closed loop.

6.1 Foundational Architecture of an Observability System

A reasonable observability system can be designed like this:

text
business services / middleware / gateways / service mesh

metrics + logs + traces + events

OpenTelemetry SDK / agent / collector

Prometheus / Tempo / Elasticsearch / Loki / ClickHouse

SLO calculation / error budget / topology analysis / root-cause analysis

service governance platform

rate limiting / circuit breaking / degradation / routing / gray release / scaling / rollback

Each kind of data has a different role:

Data TypeRole
Metricscompute SLI, SLO, error budget, and alerts
Tracesanalyze call paths, bottlenecks, and dependencies
Logslocate error detail, business context, and audit evidence
Eventsrecord release, config change, governance rule change, and failure drill events
Topologybuild service dependency graphs and blast-radius analysis
Profilesanalyze CPU, memory, lock contention, and other performance issues

6.2 Closed-Loop Governance Model

I recommend the following loop:

text
observe → judge → decide → execute → validate → review

Step 1: Observe

Collect data such as:

text
service request success rate
endpoint p95 and p99 latency
error-code distribution
caller distribution
dependency latency
instance load
version information
gray traffic ratio
rate-limit and circuit-break hits
config change events
release events

Step 2: Judge

Judge whether the service is violating SLO:

text
Is the current error rate above budget?
Is error-budget burn too fast?
Does the issue affect only one caller?
Does it affect only one version?
Does it affect only one zone or region?
Is it correlated with a recent release or config change?

Step 3: Decide

Choose a governance action according to the judgment:

Observed SignalGovernance Action
One version has rising error ratepause gray rollout, shift traffic back, auto rollback
One caller has abnormal trafficcaller-specific rate limiting, isolation, quota adjustment
Downstream dependency is timing outcircuit break, degrade, shorten timeout, reduce retry
p99 latency is risingscale out, isolate hot spots, warm cache
MQ consumption backlog is growingscale consumers, limit producers, prioritize consumption
Config release is abnormalrollback config, pause release, client-side degrade
Service discovery is abnormalenable local cache, protect last-known-good instances

Step 4: Execute

Execute governance rules such as:

text
dynamic routing
gray release
circuit breaking
rate limiting
degradation
isolation
retry-budget control
timeout control
load-balancer weight adjustment
scaling
config rollback

Step 5: Validate

After execution, observe again:

text
Is error-budget burn slowing down?
Has p99 latency recovered?
Has error rate gone down?
Is the blast radius shrinking?
Did the action introduce new side effects?

Step 6: Review

Review should focus not only on the incident cause, but also on:

text
Was the SLI defined correctly?
Was the SLO reasonable?
Was alerting timely?
Were governance actions effective?
Can the action be automated?
Should rate limiting, circuit breaking, or degradation strategies be adjusted?
Is architecture change required?

6.3 Governance Strategies Driven by Observability Data

6.3.1 Rate-Limiting Governance

Rate limiting should not rely only on static QPS thresholds. It should adapt to SLO state.

For example:

text
If error-budget burn is too high:
    reduce quota for low-priority callers
    reserve quota for critical callers
    enable queueing or rejection for non-core endpoints

That is more reasonable than applying the same cut to every flow.

6.3.2 Circuit-Breaking Governance

Circuit breaking should combine:

text
error rate
timeout rate
p99 latency
concurrency
downstream health
error-budget burn rate

It cannot simply say "trip after 50 failures." Otherwise low-traffic and high-traffic services get very different and often irrational results.

6.3.3 Gray-Release Governance

Gray release must rely on observability data.

Its decision conditions should include:

text
Is the new version's error rate worse than baseline?
Is the new version's p99 worse?
Are core endpoint SLOs affected?
Does the impact appear only for certain callers?
Are there new abnormal log patterns?
Did the call path introduce new slow dependencies?

If the gray system is not connected to SLO and error budget and expands traffic only by elapsed time, that is dangerous.

6.3.4 Degradation Governance

Degradation strategies should be tiered:

LevelStrategy
L1disable non-core functions
L2use cached result
L3return default result
L4reject low-priority requests
L5protect only core flows such as order and payment

Whether degradation should trigger should be determined jointly by SLO state, dependency health, and error-budget burn rate.

6.3.5 Routing Governance

Routing decisions can depend on:

text
region
data center
version
caller
user segment
error rate
latency
instance health
cost
compliance requirement

For example, in a cross-region architecture, routing between mainland China and Singapore regions cannot be decided by latency alone. It must also consider data compliance, user residency, cross-border constraints, and fault-isolation boundaries.


7. Dependency and Feedback Relationship Between Service Governance and Observability

Service governance and observability are not a simple upstream-downstream chain. They depend on each other and continuously drive each other.

Istio documentation points out that a service mesh can provide zero-trust security, observability, and advanced traffic management without application code changes, and that Istio generates detailed telemetry for service-to-service communication in the mesh to help operators troubleshoot, maintain, and optimize systems. (Istio) Envoy also exposes observability capabilities across statistics, access logging, and tracing, and distributed tracing helps engineers understand call flows, parallelism, and latency sources in large service-oriented architectures. (Envoy Proxy)

This means a modern governance platform should naturally be deeply integrated with observability.

7.1 Observability Depends on Service Governance

Observability data becomes valuable only when service governance provides context.

7.1.1 Depend on Service Registration Metadata

Without service metadata, metrics show only IPs and ports:

text
10.1.2.3:8080
10.1.2.4:8080

With governance metadata, you can understand:

text
service=order-service
env=prod
region=cn-hz
zone=az1
version=v1.2.3
owner=trade-team
priority=P0

Without those tags, observability sees only "machine anomaly," not "business impact."

7.1.2 Depend on Routing and Call Relationships

Governance systems know:

text
checkout-service → order-service → inventory-service → payment-service

Only with those relationships can observability build a topology and analyze blast radius.

7.1.3 Depend on Governance Rule Events

When failures happen, teams must know:

text
Was a new version just released?
Was a rate-limit rule just changed?
Was a route weight just adjusted?
Was a timeout setting just modified?
Was a circuit breaker just enabled?
Was configuration just changed?

If observability is not connected to those governance events, root-cause analysis becomes much harder.

7.2 Service Governance Depends on Observability

Governance actions cannot be executed blindly. They must depend on observability data.

7.2.1 Rate Limiting Depends on Traffic Observation

Without caller QPS, endpoint QPS, error rate, latency, and priority, fine-grained rate limiting is impossible.

7.2.2 Circuit Breaking Depends on Error Observation

Without error rate, timeout rate, and dependency latency, there is no basis for deciding whether to trip the breaker.

7.2.3 Gray Release Depends on Version Observation

Without version-level metrics and traces, there is no way to judge whether the new version is healthy.

7.2.4 Scaling Depends on Saturation Observation

Without CPU, memory, thread pool, connection pool, queue depth, backlog, or consumer lag, there is no basis for deciding whether to scale.

7.2.5 Degradation Depends on Business Impact Observation

Without business SLI, there is no way to know whether degradation truly protected the core path.

7.3 How the Two Drive Each Other

The two systems should form the following feedback relationship:

text
Service governance provides:
    service catalog
    call relationships
    routing rules
    version information
    traffic policies
    rate-limit rules
    circuit-break rules
    change events

Observability provides:
    SLI
    SLO compliance state
    error budget
    latency distribution
    error distribution
    saturation
    trace bottlenecks
    blast radius
    root-cause clues

Together they drive:
    automatic rate limiting
    automatic circuit breaking
    automatic degradation
    automatic rollback
    intelligent scaling
    change admission
    incident review
    architecture optimization

In one sentence: service governance decides how the system acts, while observability decides whether that action has evidence behind it.


8. Practical Rollout Path for Middleware SRE

8.1 Step One: Establish Service Tiering

Start with business criticality:

TierExampleReliability Requirement
P0payment, order creation, core registry cluster, core config clusterextremely high
P1search, recommendation, core gateway, MQ primary pathhigh
P2admin console, common job servicesmedium
P3experimental systems, low-frequency internal toolslow

8.2 Step Two: Build an SLI Catalog

Every service should define:

text
availability SLI
latency SLI
error-rate SLI
saturation SLI
correctness SLI
dependency SLI
change events

Middleware should additionally define:

text
control-plane SLI
data-plane SLI
config propagation SLI
rule consistency SLI
client-perceived SLI
RPO / RTO

8.3 Step Three: Build SLO Templates

For example:

text
P0 HTTP API:
    availability >= 99.95%
    p99 latency <= 800 ms
    error budget window = 30 days

P0 configuration center:
    read availability >= 99.99%
    publish success >= 99.99%
    propagation p99 <= 3 s

P0 registry center:
    discovery availability >= 99.95%
    heartbeat success >= 99.99%
    unhealthy-instance removal p99 <= 5 s

8.4 Step Four: Establish Error-Budget Policy

For example:

Error-Budget StatePolicy
remaining > 50%normal release
remaining 20% to 50%high-risk release requires approval
remaining 5% to 20%only low-risk changes are allowed
remaining < 5%freeze non-emergency releases
exhaustedenter a stability-repair cycle

8.5 Step Five: Connect Release and Config Changes

Every change should become an observable event:

text
application release
config release
route adjustment
rate-limit rule adjustment
circuit-break rule adjustment
scaling action
database change
middleware upgrade

Without change events, root-cause analysis becomes extremely difficult.

8.6 Step Six: Automate Governance, but Add Guardrails

Automation must not run without guardrails. It should include:

text
maximum rate-limit ratio
maximum rollback scope
minimum gray observation time
critical-caller protection list
manual confirmation threshold
automation cooldown period
governance action audit log

I do not recommend pursuing fully autonomous fault handling from day one. A more realistic path is:

text
first achieve automatic detection
then automatic recommendation
then semi-automatic execution
finally limited closed-loop automation in controlled scope

9. Conclusion

This article discussed how SLI, SLO, and SLA should be defined in middleware and microservice systems, and what engineering value they bring to site reliability engineering. The core conclusions are:

First, SLI, SLO, and SLA are the reliability language of SRE. SLI measures reality, SLO defines the target, and SLA forms the commitment. Without them, reliability management remains trapped in subjective experience.

Second, middleware SRE cannot simply copy microservice SRE. Middleware must additionally focus on control plane, data plane, consistency, propagation latency, rule correctness, client perception, RPO/RTO, and failure amplification.

Third, microservice SRE must define indicators around user journeys and business correctness. HTTP 200 does not guarantee business success, and process liveness does not guarantee service reliability.

Fourth, the greatest value of SLO is the error budget. Error budget puts release velocity and system stability into the same engineering language, so teams can make decisions from data instead of intuition.

Fifth, observability is the factual foundation of service governance. Without metrics, logs, traces, events, and topology, rate limiting, circuit breaking, degradation, gray release, and rollback become blind operations.

Sixth, service governance in turn drives the evolution of observability. Governance systems provide service catalog, call relationships, versions, routes, rules, and change events, allowing observability data to carry real business context.

Ultimately, a mature middleware SRE system should not be only "monitoring + alerting + on-call." It should be:

text
reliability target definition

observability data collection

error-budget calculation

governance policy execution

effect validation

incident review and system improvement

That is what site reliability engineering should look like for modern microservice and middleware platforms.

Chinese Reference

GitHub Discussions

Join the discussion

Comments are synchronized with GitHub Discussions in stellhub/stell-web.

Powered by VitePress and GitHub Discussions.