Tracing Research for Large-Scale Enterprises

Overview

A research-oriented walkthrough of cross-language tracing design choices, interoperability concerns, and rollout considerations for large enterprises.

1. Conclusion

For a large enterprise with cross-language microservices, cross-team ownership, and long-term platform governance needs, the recommended tracing foundation is:

text

OpenTelemetry Collector + Grafana Tempo Distributed + Grafana

The reasoning is straightforward:

OpenTelemetry standardizes instrumentation and export
OpenTelemetry Collector provides the governance pipeline
Tempo provides scalable and cost-efficient trace storage
Grafana provides a unified entry point for traces, logs, and metrics

Among the mainstream options, this combination is usually the strongest default for platform-scale adoption.

2. Platform Goals

An enterprise tracing platform should do more than display a single request path. It should support:

cross-language instrumentation
low-intrusion onboarding through SDKs, agents, sidecars, and gateways
vendor-neutral collection
high-ingestion throughput
low-cost long-term storage
multi-tenant isolation
controllable sampling policy
correlation across metrics, logs, and traces
cloud-native deployment across clusters and regions

The platform target is therefore:

text

standardized collection + governable telemetry pipeline + scalable storage + unified analysis entry

3. Why OpenTelemetry Should Be the Collection Standard

OpenTelemetry is the right primary standard because it separates instrumentation from backend choice. That matters in a large company because platform evolution should not require codewide tracing rewrites.

The most important architectural consequence is that services should instrument once and remain backend-neutral. This allows teams to:

move between tracing backends
use the same SDK conventions across languages
centralize sampling and enrichment policy
align tracing with the broader observability stack

4. Why the Collector Matters

The collector is not just a forwarder. It is the enterprise telemetry control point.

In a serious platform, the collector layer should own:

resource enrichment
batching
rate limiting
PII filtering
tenant identification
routing
tail sampling
backpressure behavior

Without this layer, application services either push tracing complexity into their own code or send raw telemetry directly to the backend with little governance.

5. Why Tempo Is a Strong Backend Default

Tempo is compelling mainly because trace data becomes very large very quickly. A practical enterprise backend must be:

horizontally scalable
object-storage friendly
compatible with common protocols
cost-aware for long retention
easy to correlate with logs and metrics

Tempo fits these requirements well, especially when the organization already values the Grafana ecosystem.

6. Comparison of Mainstream Options

6.1 OpenTelemetry + Tempo + Grafana

Best fit for:

large-scale platformization
cross-language adoption
low-cost storage
strong governance layering

Strengths:

backend-neutral collection standard
scalable storage model
strong logs/metrics/traces integration

Tradeoffs:

requires platform engineering discipline
not as "single product" oriented as some APM-style systems

6.2 OpenTelemetry + Jaeger

Still a good option for teams that already run Jaeger successfully or want a simpler intermediate step. It is mature and understandable, but it is usually less attractive than Tempo when storage economics and platform-scale evolution matter more.

6.3 SkyWalking

Strong when the organization wants a more opinionated, out-of-the-box observability platform, especially around Java-heavy APM scenarios. The tradeoff is weaker vendor neutrality and less flexibility in tracing-standard governance compared with an OpenTelemetry-first design.

6.4 Jaeger Native SDK + Jaeger

This is no longer the best direction for a new enterprise platform because backend-coupled instrumentation becomes a long-term liability.

6.5 Zipkin

Useful historically and still workable in lighter scenarios, but not an ideal long-term foundation for a large enterprise tracing platform.

7. Recommended Architecture

7.1 Overall Shape

A practical deployment path looks like this:

text

Business Services
  -> Node-level OpenTelemetry Collector
  -> Gateway/OpenTelemetry Collector cluster
  -> Tempo Distributed
  -> Object Storage
  -> Grafana

7.2 Recommended Data Flow

The intended flow is:

services emit spans through OpenTelemetry SDKs or agents
local collectors batch, enrich, and normalize
gateway collectors apply tenant routing, filtering, and sampling
Tempo persists trace data into object storage
Grafana becomes the query and correlation entry point

7.3 Why Services Should Not Send Traces Directly to Tempo

Direct-to-backend ingestion weakens platform control. It makes it harder to:

standardize sampling
enforce data masking
inject tenant identity
control ingestion pressure
route to multiple backends
audit onboarding quality

For large organizations, the collector layer is not optional plumbing. It is part of the platform contract.

8. Recommended Platform Capabilities

An enterprise tracing platform should explicitly design:

ingestion patterns for SDKs, agents, sidecars, and gateways
collector-level enrichment and filtering
low-cardinality label conventions
multi-tenant routing and isolation
security and compliance boundaries
trace-to-log and trace-to-metric jump links

The platform should also define what counts as acceptable span cardinality, attribute naming, retention class, and sensitive-field policy.

9. Sampling Strategy

A practical strategy usually mixes:

baseline head sampling for ordinary traffic
tail sampling for errors and slow traces
explicit preservation of critical business paths
differentiated policy for premium or high-risk tenants

Tail sampling is especially important at scale because it lets the platform keep the traces that are diagnostically valuable instead of only the traces that happened to be selected early.

10. Rollout Plan

A realistic enterprise rollout can be phased:

PoC validation
standardized collector-based onboarding
production-grade governance with sampling and multi-tenancy
deeper diagnosis with cross-signal correlation and platform self-service

This phased model is more credible than trying to force all tracing maturity at once.

11. Recommended Final Stack

The recommended final stack is:

instrumentation standard: OpenTelemetry
telemetry gateway: OpenTelemetry Collector
tracing backend: Grafana Tempo Distributed
visualization and unified entry: Grafana
storage foundation: object storage such as S3, OSS, COS, MinIO, or Ceph
compatibility protocols: OTLP first, with Jaeger and Zipkin compatibility as needed

12. Final Recommendation

The best tracing platform for a large enterprise is not the one with the prettiest UI or the most opinionated story. It is the one that keeps collection standard-neutral, gives the platform a governable control point, scales economically, and aligns tracing with the rest of observability.

That is why the recommended default is OpenTelemetry plus Collector plus Tempo plus Grafana. It is the most balanced answer for enterprises that need long-term interoperability, cost control, and platform-level governance rather than a short-lived tool choice.

Chinese Reference

Read the original Chinese article

Tracing Research for Large-Scale Enterprises ​

Overview ​

1. Conclusion ​

2. Platform Goals ​

3. Why OpenTelemetry Should Be the Collection Standard ​

4. Why the Collector Matters ​

5. Why Tempo Is a Strong Backend Default ​

6. Comparison of Mainstream Options ​

6.1 OpenTelemetry + Tempo + Grafana ​

6.2 OpenTelemetry + Jaeger ​

6.3 SkyWalking ​

6.4 Jaeger Native SDK + Jaeger ​

6.5 Zipkin ​

7. Recommended Architecture ​

7.1 Overall Shape ​

7.2 Recommended Data Flow ​

7.3 Why Services Should Not Send Traces Directly to Tempo ​

8. Recommended Platform Capabilities ​

9. Sampling Strategy ​

10. Rollout Plan ​

11. Recommended Final Stack ​

12. Final Recommendation ​

Chinese Reference ​

Join the discussion

Tracing Research for Large-Scale Enterprises

Overview

1. Conclusion

2. Platform Goals

3. Why OpenTelemetry Should Be the Collection Standard

4. Why the Collector Matters

5. Why Tempo Is a Strong Backend Default

6. Comparison of Mainstream Options

6.1 OpenTelemetry + Tempo + Grafana

6.2 OpenTelemetry + Jaeger

6.3 SkyWalking

6.4 Jaeger Native SDK + Jaeger

6.5 Zipkin

7. Recommended Architecture

7.1 Overall Shape

7.2 Recommended Data Flow

7.3 Why Services Should Not Send Traces Directly to Tempo

8. Recommended Platform Capabilities

9. Sampling Strategy

10. Rollout Plan

11. Recommended Final Stack

12. Final Recommendation

Chinese Reference