Prometheus

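Prometheus is an open-source monitoring and alerting system hosted by the CNCF, known for its multi-dimensional time series data model and its PromQL query language. It uses a pull model: the server periodically scrapes the /metrics endpoint that each target exposes over HTTP.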

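Best suited for: infrastructure and application metrics monitoring and alert rule configuration. Grafana is typically used as the visualization layer.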


Data model

<metric_name>{<label_name>=<label_value>, ...} <value> [<timestamp>]

# Example
http_requests_total{method="GET", status="200", handler="/api/users"} 1027
jvm_memory_used_bytes{area="heap", id="G1 Eden Space"} 1.234e+08
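
To make the line format above concrete, here is a minimal sketch that renders one sample in that text form. The class name `ExpositionLine` and its methods are hypothetical and exist only for illustration; they are not part of any Prometheus client library, and in practice a library such as Micrometer produces this output for you.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration; not part of any Prometheus client library.
class ExpositionLine {

    // Renders <metric_name>{<label_name>="<label_value>",...} <value>
    static String format(String name, Map<String, String> labels, double value) {
        StringJoiner labelPart = new StringJoiner(",", "{", "}");
        labelPart.setEmptyValue(""); // emit no braces when there are no labels
        for (Map.Entry<String, String> label : labels.entrySet()) {
            labelPart.add(label.getKey() + "=\"" + escape(label.getValue()) + "\"");
        }
        return name + labelPart + " " + value;
    }

    // Label values escape backslash, double quote, and line feed.
    static String escape(String raw) {
        return raw.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("method", "GET");
        labels.put("status", "200");
        System.out.println(format("http_requests_total", labels, 1027));
    }
}
```

The optional timestamp is left out here, which is the usual case: Prometheus assigns the scrape time when none is given.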

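The four metric types

Type | Description | Typical use
Counter | Monotonically increasing count; resets to zero when the process restarts | Total requests, total errors
Gauge | Instantaneous value that can go up or down | Memory usage, online users
Histogram | Bucketed distribution (exposes _bucket, _sum, and _count series) | Request latency distribution, response size
Summary | Quantiles computed on the client (exposes _sum and _count plus series carrying a quantile label) | P99 latency (exact for a single instance, not aggregatable across instances)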
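
Because a Counter restarts from zero, anything that reads it has to treat a drop in value as a reset rather than as a negative change. The sketch below illustrates that idea on a plain array of samples. `CounterIncrease` is a hypothetical name, and this is a simplification: the real PromQL `rate()` and `increase()` functions also extrapolate to the window boundaries, which is omitted here.

```java
// Hypothetical illustration of reset-aware counter arithmetic.
class CounterIncrease {

    // Total increase across consecutive samples of one counter series.
    static double increase(double[] samples) {
        double total = 0;
        for (int i = 1; i < samples.length; i++) {
            double delta = samples[i] - samples[i - 1];
            // A drop means the process restarted and the counter began again
            // from zero, so the whole new value counts as increase.
            total += delta >= 0 ? delta : samples[i];
        }
        return total;
    }

    // Average per-second rate, assuming the samples span windowSeconds.
    static double ratePerSecond(double[] samples, double windowSeconds) {
        return increase(samples) / windowSeconds;
    }

    public static void main(String[] args) {
        // The counter climbs to 15, the process restarts, then it climbs to 8.
        double[] samples = {10, 15, 3, 8};
        System.out.println(increase(samples));
    }
}
```

This is also why a Gauge should not be fed to `rate()`: a legitimate decrease would be misread as a reset.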

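Architecture

Exporters / App /metrics
        │  (HTTP pull)
        ▼
Prometheus Server
   ├── Retrieval (periodic scraping)
   ├── TSDB (local time series database)
   ├── PromQL engine
   └── Rule evaluation (recording and alerting rules)
        │  (pushes firing alerts)
        ▼
Alertmanager (separate service: grouping, routing, silencing)
        │
        ▼
   Email / PagerDuty / Webhook / DingTalk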

Quick start (Docker Compose)

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "spring-app"
    metrics_path: /actuator/prometheus
    static_configs:
      # Assumes a service named "app" is reachable on the same Compose network
      - targets: ["app:8080"]

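Spring Boot integration

<!-- pom.xml: spring-boot-starter-actuator is also required; the version is managed by the Spring Boot BOM -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: prometheus,health,info
  metrics:
    tags:
      application: ${spring.application.name}
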
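// Custom business metrics
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

@Component
public class OrderMetrics {
    private final Counter orderCounter;
    private final Timer orderTimer;

    public OrderMetrics(MeterRegistry registry) {
        // Micrometer's Prometheus naming convention turns dots into underscores,
        // so this counter is exposed as orders_created_total.
        this.orderCounter = Counter.builder("orders.created.total")
            .description("Total orders created")
            .tag("channel", "web")
            .register(registry);
        // Timers are exported in seconds, for example
        // orders_process_duration_seconds_count and orders_process_duration_seconds_sum.
        this.orderTimer = Timer.builder("orders.process.duration")
            .description("Order processing time")
            .register(registry);
    }

    // Increments the counter once per created order.
    public void recordOrder() {
        orderCounter.increment();
    }

    // Runs the task and records how long it took.
    public void recordProcessTime(Runnable task) {
        orderTimer.record(task);
    }
}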

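Common PromQL queries

# Per-second HTTP request rate (QPS), averaged over the last 5 minutes
rate(http_requests_total[5m])

# Share of requests that returned a 5xx status
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# P99 request latency (from a Histogram)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# JVM heap usage ratio per memory pool
# (a pool without a defined max reports -1, so filter those out if needed)
jvm_memory_used_bytes{area="heap"}
  / jvm_memory_max_bytes{area="heap"}

# Whether each instance is currently up (1) or down (0), as of the last scrape
up{job="spring-app"}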
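
The P99 query above is an estimate, not an exact value: `histogram_quantile` finds the bucket that contains the requested rank and interpolates linearly inside it. The following is a simplified sketch of that calculation for classic histograms. `HistogramQuantile` is a hypothetical name, the inputs are assumed to be sorted by upper bound with a final +Inf bucket, and edge cases handled by the real function (such as invalid `q` values) are left out.

```java
// Hypothetical, simplified model of histogram_quantile for classic histograms.
class HistogramQuantile {

    // upperBounds: ascending "le" values, the last one being +Infinity.
    // cumulativeCounts: observations with value <= the matching bound.
    // Assumes 0 < q <= 1.
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        if (n < 2 || cumulativeCounts[n - 1] == 0) {
            return Double.NaN; // not enough buckets or no observations
        }
        double rank = q * cumulativeCounts[n - 1];

        // Find the first bucket whose cumulative count reaches the rank.
        int b = 0;
        while (b < n - 1 && cumulativeCounts[b] < rank) {
            b++;
        }
        if (b == n - 1) {
            // The rank falls into the +Inf bucket: report the highest finite bound.
            return upperBounds[n - 2];
        }

        double lower = b == 0 ? 0 : upperBounds[b - 1];
        double below = b == 0 ? 0 : cumulativeCounts[b - 1];
        double inBucket = cumulativeCounts[b] - below;

        // Linear interpolation: assumes observations are spread evenly in the bucket.
        return lower + (upperBounds[b] - lower) * (rank - below) / inBucket;
    }

    public static void main(String[] args) {
        double[] le = {0.1, 0.5, 1.0, Double.POSITIVE_INFINITY};
        double[] counts = {50, 90, 100, 100};
        System.out.println(quantile(0.99, le, counts));
    }
}
```

The accuracy of the result therefore depends on bucket boundaries: a P99 that lands in a wide bucket can be far from the true value, which is a reason to choose boundaries near the latencies you care about.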

Alerting rules (evaluated by Prometheus, routed by Alertmanager)

# alerts/rules.yml (load it through rule_files in prometheus.yml)
groups:
  - name: app
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% (current: {{ $value | humanizePercentage }})"
 
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
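
The `for` field in the rules above delays firing: an alert whose expression becomes true first enters a pending state and only fires once the expression has stayed true for the whole duration. A rough model of that behaviour is sketched below. `AlertForClause` and its members are hypothetical names; Prometheus implements this internally, and details such as resolved-alert retention are ignored here.

```java
// Hypothetical model of the `for` clause; Prometheus implements this internally.
class AlertForClause {
    enum State { INACTIVE, PENDING, FIRING }

    private final long forMillis;
    private long activeSince = -1; // -1 means the condition is currently false

    AlertForClause(long forMillis) {
        this.forMillis = forMillis;
    }

    // Called once per rule evaluation with the outcome of `expr`.
    State evaluate(boolean conditionHolds, long nowMillis) {
        if (!conditionHolds) {
            activeSince = -1; // any false evaluation restarts the clock
            return State.INACTIVE;
        }
        if (activeSince < 0) {
            activeSince = nowMillis;
        }
        return nowMillis - activeSince >= forMillis ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        AlertForClause highErrorRate = new AlertForClause(120_000L); // for: 2m
        System.out.println(highErrorRate.evaluate(true, 0L));
        System.out.println(highErrorRate.evaluate(true, 60_000L));
        System.out.println(highErrorRate.evaluate(true, 120_000L));
    }
}
```

A short spike that clears before the duration elapses never leaves the pending state, which is how `for` filters out transient noise.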

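Storage and high availability

Option | Description
Local TSDB | 15-day retention by default; suited to a single node
Remote Write | Streams samples to Thanos / Cortex / VictoriaMetrics
Thanos | Federates multiple Prometheus servers, adds long-term storage and a global query view
VictoriaMetrics | Single binary, high compression ratio, PromQL-compatible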
