OpenTelemetry

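OpenTelemetry (OTel) is the CNCF observability standard. It unifies three signals (traces, metrics, and logs) under one set of APIs, SDKs, and a wire protocol (OTLP), which greatly reduces the fragmentation of observability tooling.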

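The three signals

Observability
    │
    ├── Traces    The full call chain of one request across multiple services
    ├── Metrics   Aggregatable numbers such as QPS, latency, and error rate
    └── Logs      Timestamped, structured text events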

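In OTel the three signals are correlated through a shared TraceId: log records carry the TraceId and SpanId of the active span, and metrics can point at sample traces through exemplars. This closes the troubleshooting loop: metric alert → locate the trace → inspect the logs.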
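
The TraceId can only correlate signals if it travels with the request. OTel's default propagator carries it between services in the W3C Trace Context `traceparent` header. The following is a minimal sketch of parsing and re-emitting that header; the class `TraceParent` and its methods are hypothetical illustrations, not part of the OTel API, and only version `00` of the header is handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an OTel class): parses a version-00 W3C traceparent header. */
final class TraceParent {
    // Layout: "00" - 32 hex chars (trace id) - 16 hex chars (parent span id) - 2 hex chars (flags)
    private static final Pattern FORMAT =
        Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String traceId;
    final String spanId;
    final boolean sampled;

    private TraceParent(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a version-00 traceparent: " + header);
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        // All-zero ids are defined as invalid by the Trace Context spec.
        if (traceId.equals("0".repeat(32)) || spanId.equals("0".repeat(16))) {
            throw new IllegalArgumentException("all-zero id in traceparent: " + header);
        }
        // Bit 0 of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 1;
        return new TraceParent(traceId, spanId, sampled);
    }

    /** Header for an outgoing call: same trace id, the caller's span id as the new parent. */
    String withSpanId(String newSpanId) {
        return String.format("00-%s-%s-%s", traceId, newSpanId, sampled ? "01" : "00");
    }
}
```

Every service that receives the header keeps the same trace id and only swaps in its own span id, which is why a single TraceId can later be used to look up the spans and log lines of all services involved.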

Architecture overview

Application process
┌────────────────────────────────────────────────┐
│  OTel SDK / Java Agent (auto-instrumentation)  │
│                                                │
│  Traces  → OTLP Exporter ────────────────────► │
│  Metrics → OTLP Exporter ────────────────────► │──► OTel Collector
│  Logs    → OTLP Exporter ────────────────────► │
└────────────────────────────────────────────────┘

OTel Collector
┌──────────────────────────────────────────────────────┐
│  Receiver  (OTLP gRPC/HTTP)                          │
│  Processor (batching, sampling, attribute filters)   │
│  Exporter                                            │
│    ├── Jaeger / Zipkin (Traces)                      │
│    ├── Prometheus / Thanos (Metrics)                 │
│    └── Loki / Elasticsearch (Logs)                   │
└──────────────────────────────────────────────────────┘

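Java Agent auto-instrumentation (zero code changes)

The OTel Java Agent uses bytecode instrumentation to inject spans automatically for dozens of frameworks, including Spring MVC, JDBC, Redis, Kafka, and gRPC: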

# Download the agent
curl -L https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar \
     -o opentelemetry-javaagent.jar
 
# Attach it when starting the application (no code changes)
java \
  -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=order-service \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4318 \
  -Dotel.exporter.otlp.protocol=http/protobuf \
  -Dotel.logs.exporter=otlp \
  -Dotel.metrics.exporter=otlp \
  -Dotel.traces.exporter=otlp \
  -Dotel.resource.attributes=env=prod,version=1.2.0 \
  -jar app.jar

# Equivalent environment variables for Docker / K8s
OTEL_SERVICE_NAME: order-service
OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4318
OTEL_EXPORTER_OTLP_PROTOCOL: http/protobuf
OTEL_LOGS_EXPORTER: otlp
OTEL_METRICS_EXPORTER: otlp
OTEL_TRACES_EXPORTER: otlp
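
The two forms above are equivalent because the agent derives each environment variable name from the system property name: upper-case it and replace `.` and `-` with `_`. When both are set, the system property takes precedence. A minimal sketch of that rule follows; `OtelConfigNames` and its methods are hypothetical names for illustration and are not the agent's internal API.

```java
import java.util.Locale;

/** Hypothetical helper (not part of the agent): mirrors the property-to-env-var naming rule. */
final class OtelConfigNames {

    /** otel.exporter.otlp.endpoint -> OTEL_EXPORTER_OTLP_ENDPOINT */
    static String toEnvVar(String systemProperty) {
        return systemProperty
            .toUpperCase(Locale.ROOT)
            .replace('.', '_')
            .replace('-', '_');
    }

    /** Looks up a setting: system property first, then the environment, then the default. */
    static String resolve(String systemProperty, String defaultValue) {
        String value = System.getProperty(systemProperty);
        if (value == null) {
            value = System.getenv(toEnvVar(systemProperty));
        }
        return value != null ? value : defaultValue;
    }
}
```

For example, `OtelConfigNames.toEnvVar("otel.service.name")` yields `OTEL_SERVICE_NAME`, the key used in the Docker / K8s form.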

For the list of frameworks supported by auto-instrumentation, see the official opentelemetry-java-instrumentation documentation.


Spring Boot 3.x integration (Micrometer bridge)

Spring Boot 3.x integrates through Micrometer Tracing plus the OTel bridge, so no Java Agent is required:

// build.gradle
implementation 'org.springframework.boot:spring-boot-starter-actuator'  // needed for the management.* auto-configuration, if not already present
implementation 'io.micrometer:micrometer-tracing-bridge-otel'
implementation 'io.opentelemetry:opentelemetry-exporter-otlp'
implementation 'io.micrometer:micrometer-registry-otlp'  // metrics also go over OTLP

# application.yml
management:
  tracing:
    sampling:
      probability: 1.0
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces
    metrics:
      export:
        url: http://otel-collector:4318/v1/metrics
        step: 30s
 
spring:
  application:
    name: order-service

For Micrometer Tracing, see Distributed Tracing; for metrics, see Metrics Collection.
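
`probability: 1.0` records every trace, which suits development; Spring Boot's default is 0.1. Ratio-based head sampling is normally a pure function of the trace id, so every service reaches the same decision for the same trace without coordination. The sketch below illustrates the idea; `RatioSampler` is a hypothetical class, similar in spirit to the SDK's trace-id-ratio sampler but not a copy of its algorithm.

```java
/** Hypothetical sketch of ratio-based head sampling keyed on the trace id. */
final class RatioSampler {
    private final boolean alwaysOn;
    private final long upperBound;

    RatioSampler(double probability) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be within [0, 1]");
        }
        this.alwaysOn = probability == 1.0;
        // Threshold within the non-negative 63-bit range.
        this.upperBound = (long) (probability * Long.MAX_VALUE);
    }

    /** @param traceIdHex 32 lower-case hex characters */
    boolean shouldSample(String traceIdHex) {
        if (traceIdHex.length() != 32) {
            throw new IllegalArgumentException("trace id must be 32 hex characters");
        }
        if (alwaysOn) {
            return true;
        }
        // Treat the low 64 bits of the trace id as the random value and clear the sign bit.
        long low = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        long value = low & Long.MAX_VALUE;
        return value < upperBound;
    }
}
```

Because the decision depends only on the trace id, lowering the probability in production thins out whole traces rather than dropping individual spans from the middle of a call chain.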


Manual instrumentation (custom spans / attributes)

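import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.springframework.stereotype.Service;
 
@Service
public class PaymentService {
 
    private final Tracer tracer = GlobalOpenTelemetry.getTracer("payment-service");
 
    public PayResult pay(PayRequest req) {
        // Start a child span (its parent is whatever span is current on this thread)
        Span span = tracer.spanBuilder("payment.process")
            .setSpanKind(SpanKind.INTERNAL)
            .startSpan();
 
        // makeCurrent() puts the span into the thread's context; closing the Scope restores the previous one
        try (Scope scope = span.makeCurrent()) {
            span.setAttribute("payment.method", req.getMethod());
            span.setAttribute("payment.amount", req.getAmount());
 
            PayResult result = doCharge(req);
 
            span.setAttribute("payment.txnId", result.getTxnId());
            return result;
 
        } catch (PaymentException e) {
            // Mark the span as failed and attach the exception as a span event
            span.setStatus(StatusCode.ERROR, e.getMessage());
            span.recordException(e);
            throw e;
        } finally {
            // Always end the span, otherwise it is never exported
            span.end();
        }
    }
}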
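
The `try (Scope scope = span.makeCurrent())` idiom works because the current span lives in thread-local context, and closing the scope restores whatever was current before. The following stdlib-only sketch mimics that mechanism to show why scopes must be closed in reverse order; `CurrentSpanDemo` and its nested `Scope` interface are hypothetical stand-ins, not the OTel classes, and spans are reduced to plain names.

```java
/** Hypothetical stand-in for the thread-local "current span" mechanism. */
final class CurrentSpanDemo {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    /** Like OTel's Scope: closing it restores the previously current span. */
    interface Scope extends AutoCloseable {
        @Override
        void close();
    }

    static Scope makeCurrent(String spanName) {
        String previous = CURRENT.get();
        CURRENT.set(spanName);
        return () -> CURRENT.set(previous);
    }

    static String current() {
        return CURRENT.get();
    }

    public static void main(String[] args) {
        try (Scope outer = makeCurrent("payment.process")) {
            try (Scope inner = makeCurrent("payment.charge")) {
                System.out.println(current()); // prints "payment.charge"
            }
            System.out.println(current());     // prints "payment.process"
        }
        System.out.println(current());         // prints "null"
    }
}
```

A scope that is never closed leaves a stale span current on a pooled thread, so later unrelated work would be parented to it; try-with-resources prevents that.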

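Simplifying with annotations (Spring AOP + Micrometer)

import io.micrometer.tracing.annotation.NewSpan;
import io.micrometer.tracing.annotation.SpanTag;

@NewSpan("inventory.check")
public boolean checkInventory(@SpanTag("productId") Long productId) {
    return inventoryRepo.hasStock(productId);
}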

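Deploying the OTel Collector

The Collector sits between applications and storage backends. It decouples the two and supports fan-out to multiple backends, sampling, and data transformation: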

# otel-collector-config.yml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
 
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s   # required: the processor fails validation without it
    limit_mib: 512
  # Tail sampling (decides on the complete trace, so it is more precise than head sampling)
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-policy
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample-rest
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
 
exporters:
  # Traces → Jaeger over OTLP (the dedicated jaeger exporter is no longer shipped in
  # recent Collector releases; Jaeger ingests OTLP natively)
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  # Traces → Zipkin (several backends can receive the same data)
  zipkin:
    endpoint: http://zipkin:9411/api/v2/spans
  # Metrics → Prometheus (pull model)
  prometheus:
    endpoint: 0.0.0.0:8889
  # Logs → Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  # All signals → OTLP (an upstream Collector or a SaaS backend)
  otlp:
    endpoint: https://api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/jaeger, zipkin]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
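
The three `tail_sampling` policies combine as a logical OR: a trace is kept if it contains an error, if it is slower than the latency threshold, or if it falls into the probabilistic 10%. The sketch below shows that decision logic in isolation. `TailSamplingSketch` and `TraceSummary` are hypothetical types, the hash-based bucketing is an assumption standing in for the Collector's own hashing, and the real processor additionally buffers spans for `decision_wait` before deciding.

```java
/** Hypothetical sketch of the keep/drop decision made by the three policies above. */
final class TailSamplingSketch {

    /** What the sampler knows once all spans of a trace have arrived. */
    static final class TraceSummary {
        final String traceId;
        final boolean hasError;
        final long durationMs;

        TraceSummary(String traceId, boolean hasError, long durationMs) {
            this.traceId = traceId;
            this.hasError = hasError;
            this.durationMs = durationMs;
        }
    }

    private final long latencyThresholdMs;
    private final int samplingPercentage;

    TailSamplingSketch(long latencyThresholdMs, int samplingPercentage) {
        this.latencyThresholdMs = latencyThresholdMs;
        this.samplingPercentage = samplingPercentage;
    }

    boolean keep(TraceSummary trace) {
        if (trace.hasError) {
            return true; // error-policy
        }
        if (trace.durationMs > latencyThresholdMs) {
            return true; // slow-policy
        }
        // sample-rest: map the trace id to a stable bucket in [0, 100)
        int bucket = Math.floorMod(trace.traceId.hashCode(), 100);
        return bucket < samplingPercentage;
    }
}
```

With the configuration above this would be constructed as `new TailSamplingSketch(1000, 10)`: every failed or slow trace survives, and roughly one in ten of the remaining traces is kept as a baseline.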
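
# docker-compose.yml (minimal observability stack)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.99.0
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8889:8889"   # Prometheus metrics endpoint
    volumes:
      # The contrib image loads its config from /etc/otelcol-contrib/config.yaml
      - ./otel-collector-config.yml:/etc/otelcol-contrib/config.yaml
 
  jaeger:
    image: jaegertracing/all-in-one:1.56
    environment:
      COLLECTOR_OTLP_ENABLED: "true"   # accept OTLP on 4317/4318 (harmless if already the default)
    ports:
      - "16686:16686"   # Jaeger UI
      # OTLP gRPC (4317) is reached over the compose network, so no host port is published
 
  prometheus:
    image: prom/prometheus:v2.51.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
 
  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"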

The OTLP protocol

OTLP (OpenTelemetry Protocol) is the transport protocol defined by OTel. Its payloads are defined with Protocol Buffers:

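Transport       Port   When to use
gRPC            4317   Low latency and high throughput; service-to-service traffic
HTTP/protobuf   4318   Firewall-friendly; general purpose
HTTP/JSON       4318   Debugging and simple integrations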
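
Over HTTP, each signal has its own path under the base endpoint: `/v1/traces`, `/v1/metrics`, and `/v1/logs`. That is why the generic `OTEL_EXPORTER_OTLP_ENDPOINT` setting takes only a base URL while the Spring Boot properties take the full per-signal URL. A small sketch of the mapping follows; `OtlpHttpEndpoints` is a hypothetical helper for illustration, not an SDK class.

```java
import java.net.URI;

/** Hypothetical helper: derives per-signal OTLP/HTTP URLs from a base endpoint. */
final class OtlpHttpEndpoints {

    enum Signal {
        TRACES("v1/traces"),
        METRICS("v1/metrics"),
        LOGS("v1/logs");

        final String path;

        Signal(String path) {
            this.path = path;
        }
    }

    /** http://otel-collector:4318 + TRACES -> http://otel-collector:4318/v1/traces */
    static URI forSignal(String baseEndpoint, Signal signal) {
        String base = baseEndpoint.endsWith("/") ? baseEndpoint : baseEndpoint + "/";
        return URI.create(base + signal.path);
    }

    /** Content-Type header for the two HTTP encodings listed in the table. */
    static String contentType(boolean json) {
        return json ? "application/json" : "application/x-protobuf";
    }
}
```

gRPC on port 4317 does not use these paths; the signal is selected by the gRPC service being called.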

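SaaS observability platforms

If you would rather not self-host the Collector and its backends, you can export directly to a commercial SaaS platform: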

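Platform             Signals                   Notes
Grafana Cloud        Traces / Metrics / Logs   Free tier; Tempo + Mimir + Loki
Datadog              All signals               Broadest feature set; relatively expensive
Honeycomb            Traces                    Strong high-cardinality analysis
New Relic            All signals               100 GB/month free
Alibaba Cloud ARMS   All signals               Low latency within China; integrates with the Alibaba Cloud ecosystem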

Only the exporter's endpoint and authentication header need to change; application code stays the same.
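
With the SDK or Java Agent, the authentication header is usually supplied through `OTEL_EXPORTER_OTLP_HEADERS` as comma-separated `key=value` pairs, for example `x-honeycomb-team=<api key>`. The sketch below parses that format; `OtlpHeaders` is a hypothetical helper, and unlike the real SDK it does not URL-decode values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses "k1=v1,k2=v2" as used by OTEL_EXPORTER_OTLP_HEADERS. */
final class OtlpHeaders {

    static Map<String, String> parse(String raw) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (raw == null || raw.trim().isEmpty()) {
            return headers;
        }
        for (String pair : raw.split(",")) {
            // Split on the first '=' only, so values may themselves contain '='.
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("expected key=value but got: " + pair);
            }
            headers.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
        return headers;
    }
}
```

Switching vendors then amounts to changing two environment variables, the endpoint and the headers, in the deployment manifest.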


Relationship to the Spring Boot observability stack

Spring Boot application
  │
  ├── Micrometer Tracing ──► OTel Bridge ──► OTel SDK ──► OTLP ──► Collector
  ├── Micrometer Metrics ──► OTLP Registry ──────────────────────► Collector
  └── SLF4J + Logback ──► logstash-logback-encoder ──► file ──► Promtail ──► Loki
                       or  OTel Log Appender ──────────────────────────────► Collector

Related links

Architecture

Middleware in the same directory

Java in practice