Prometheus


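An open-source time-series monitoring and alerting system. It collects metrics with a pull model, is commonly paired with Grafana for visualization, and is a core component of the cloud-native observability stack.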


Core Concepts

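Concept        Description
Metric         A named measurement; the basic unit of time-series data
Label          A key-value pair that distinguishes the dimensions of a metric
Target         An endpoint that is scraped (HTTP /metrics)
Scrape         The act of pulling metrics from a target
Alert Rule     An alert condition defined with PromQL
Alertmanager   Handles routing, deduplication, silencing, and notification of alerts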
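
To make the Metric, Label, and Target concepts concrete: a target answers each scrape with plain text, one sample per line, in the form name{label="value",...} value. The following is a minimal sketch of rendering one such line. The class name ExpositionLine and its methods are hypothetical, and the sketch skips parts of the real format such as HELP/TYPE comments and timestamps.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: renders one sample in the Prometheus text exposition style.
class ExpositionLine {

    // Label values escape backslash, double quote, and newline.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    static String format(String name, Map<String, String> labels, double value) {
        if (labels.isEmpty()) {
            return name + " " + value;
        }
        StringJoiner joined = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : labels.entrySet()) {
            joined.add(e.getKey() + "=\"" + escape(e.getValue()) + "\"");
        }
        return name + joined + " " + value;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("method", "GET");
        labels.put("path", "/orders");
        // prints: http_requests_total{method="GET",path="/orders"} 1027.0
        System.out.println(format("http_requests_total", labels, 1027));
    }
}
```

Each distinct combination of label values is stored as its own time series, which is why high-cardinality labels (user IDs, raw URLs) are usually avoided.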

Metric Types

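Type        Description                                                         Example
Counter     Cumulative value that only increases (resets to 0 on restart)       Total requests, total errors
Gauge       Instantaneous value that can go up or down                          Memory in use, online users
Histogram   Observations counted in buckets; quantiles are computed at query time   Request latency distribution
Summary     Quantiles computed on the client side                               P50/P99 response time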
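
Because a Histogram only stores cumulative bucket counts, a quantile taken from it is an estimate. The sketch below shows the linear interpolation that, as far as I understand it, histogram_quantile() applies to classic buckets. HistogramQuantile and its parameter names are hypothetical, and the code assumes 0 < q <= 1, ascending upper bounds ending in +Inf, and at least two buckets.

```java
// Hypothetical sketch of quantile estimation over cumulative histogram buckets.
class HistogramQuantile {

    // upperBounds: ascending "le" values, the last one being +Inf.
    // cumulativeCounts: number of observations <= the matching upper bound.
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        double total = cumulativeCounts[n - 1];
        if (total == 0) {
            return Double.NaN; // no observations in the window
        }
        double rank = q * total;
        int b = 0;
        while (b < n - 1 && cumulativeCounts[b] < rank) {
            b++; // find the first bucket that contains the target rank
        }
        if (b == n - 1) {
            // The rank falls into the +Inf bucket: report the highest finite bound.
            return upperBounds[n - 2];
        }
        double start = (b == 0) ? 0 : upperBounds[b - 1];
        double below = (b == 0) ? 0 : cumulativeCounts[b - 1];
        double inBucket = cumulativeCounts[b] - below;
        // Assume observations are spread evenly inside the bucket.
        return start + (upperBounds[b] - start) * ((rank - below) / inBucket);
    }

    public static void main(String[] args) {
        double[] bounds = {0.05, 0.1, 0.5, Double.POSITIVE_INFINITY};
        double[] counts = {60, 90, 99, 100};
        System.out.println(quantile(0.75, bounds, counts)); // roughly 0.075 seconds
    }
}
```

The estimate can never be more precise than the bucket layout, so it helps to place bucket boundaries near the latencies that matter (for example, SLO thresholds).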

prometheus.yml Configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s
 
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
 
rule_files:
  - 'rules/*.yml'
 
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
 
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['app:8080']
 
  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
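
The two relabel rules in the kubernetes-pods job can be read as a small filter-and-rewrite step applied to each discovered pod. The following Java sketch mimics that behavior under simplifying assumptions: RelabelSketch is a hypothetical class, labels are a plain map, and only the keep and replace actions from the config above are modeled (with the default replacement of $1).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of the keep + replace relabel rules for the kubernetes-pods job.
class RelabelSketch {
    static final String SCRAPE = "__meta_kubernetes_pod_annotation_prometheus_io_scrape";
    static final String PATH = "__meta_kubernetes_pod_annotation_prometheus_io_path";

    // Returns the relabeled target, or empty if the keep rule drops it.
    static Optional<Map<String, String>> relabel(Map<String, String> discovered) {
        Map<String, String> labels = new HashMap<>(discovered);

        // action: keep, regex 'true'. The regex must match the whole value,
        // and a missing label is treated as the empty string.
        if (!Pattern.matches("true", labels.getOrDefault(SCRAPE, ""))) {
            return Optional.empty();
        }

        // action: replace, regex (.+). Only overrides the path when the
        // annotation is present and non-empty.
        Matcher m = Pattern.compile("(.+)").matcher(labels.getOrDefault(PATH, ""));
        if (m.matches()) {
            labels.put("__metrics_path__", m.group(1));
        }
        return Optional.of(labels);
    }

    public static void main(String[] args) {
        Map<String, String> pod = new HashMap<>();
        pod.put("__metrics_path__", "/metrics"); // default set before relabeling
        pod.put(SCRAPE, "true");
        pod.put(PATH, "/actuator/prometheus");
        System.out.println(relabel(pod).get().get("__metrics_path__")); // prints /actuator/prometheus
    }
}
```

In practice this means a pod is scraped only if it carries the annotation prometheus.io/scrape: "true", and prometheus.io/path optionally overrides the default /metrics path.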

Common PromQL Queries

# Per-second request rate (QPS) of each series, averaged over the last 5 minutes
rate(http_requests_total[5m])

# QPS grouped by path
sum(rate(http_requests_total[5m])) by (path)

# Error rate (share of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# P99 latency (Histogram)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# JVM heap usage ratio (evaluated per series with matching labels)
jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"}

# Whether the instance is up (1 = up, 0 = down)
up{job="spring-boot"}
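
Most of the queries above wrap a Counter in rate(). As a rough illustration of what that function computes, the sketch below derives a per-second rate from raw counter samples and treats any drop in value as a counter reset. SimpleRate is a hypothetical class, and it deliberately omits the extrapolation to the window boundaries that Prometheus performs, so its results will differ slightly from the real function.

```java
// Hypothetical, simplified model of rate() over the samples inside one range window.
class SimpleRate {

    // timestampsSeconds and values are parallel arrays, ordered by time.
    static double rate(double[] timestampsSeconds, double[] values) {
        if (values.length < 2) {
            return Double.NaN; // at least two samples are needed
        }
        double increase = 0;
        for (int i = 1; i < values.length; i++) {
            double delta = values[i] - values[i - 1];
            // A drop means the counter restarted from zero (e.g. process restart),
            // so the whole new value counts as increase.
            increase += (delta >= 0) ? delta : values[i];
        }
        double elapsed = timestampsSeconds[timestampsSeconds.length - 1] - timestampsSeconds[0];
        return increase / elapsed;
    }

    public static void main(String[] args) {
        double[] ts = {0, 15, 30, 45, 60};
        double[] counter = {100, 130, 160, 10, 40}; // reset between t=30 and t=45
        // Increase is 30 + 30 + 10 + 30 = 100 over 60 s, about 1.67 requests/s.
        System.out.println(rate(ts, counter));
    }
}
```

This reset handling is the reason rate() should be applied before sum(): summing raw counters first would hide individual resets.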

Alerting Rules

# rules/app.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5%"
          description: "Current error rate: {{ $value | humanizePercentage }}"
 
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
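
The for field in the rules above delays notification: the expression has to stay true across consecutive evaluations for that long before the alert fires. The sketch below models this as a small state machine. AlertForClause and its State enum are hypothetical names, and details such as resolved-alert retention are left out.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of how the "for" clause moves an alert between states.
class AlertForClause {
    enum State { INACTIVE, PENDING, FIRING }

    private final Duration holdFor;
    private Instant activeSince; // null while the condition is false
    private State state = State.INACTIVE;

    AlertForClause(Duration holdFor) {
        this.holdFor = holdFor;
    }

    // Called once per rule evaluation (every evaluation_interval).
    State evaluate(boolean conditionTrue, Instant now) {
        if (!conditionTrue) {
            // Any false evaluation resets the timer.
            activeSince = null;
            state = State.INACTIVE;
        } else {
            if (activeSince == null) {
                activeSince = now;
            }
            boolean heldLongEnough = Duration.between(activeSince, now).compareTo(holdFor) >= 0;
            state = heldLongEnough ? State.FIRING : State.PENDING;
        }
        return state;
    }

    public static void main(String[] args) {
        AlertForClause rule = new AlertForClause(Duration.ofMinutes(2));
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        System.out.println(rule.evaluate(true, t0));                  // PENDING
        System.out.println(rule.evaluate(true, t0.plusSeconds(120))); // FIRING
        System.out.println(rule.evaluate(false, t0.plusSeconds(135))); // INACTIVE
    }
}
```

A short for value reacts faster but is more likely to page on brief spikes; the 2m and 1m values above trade a small delay for fewer false alarms.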

Spring Boot Integration (Micrometer)

<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    tags:
      application: ${spring.application.name}
    distribution:
      percentiles-histogram:
        http.server.requests: true
      slo:
        http.server.requests: 50ms, 100ms, 500ms

Custom metrics:

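import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

// Order, OrderRequest and doCreate(...) are application code not shown here.
@Service
public class OrderService {

    private final Counter orderCounter;
    private final Timer orderTimer;

    public OrderService(MeterRegistry registry) {
        // Counts created orders, tagged type="online".
        this.orderCounter = Counter.builder("order.created")
            .tag("type", "online")
            .register(registry);
        // Times order creation and publishes histogram buckets for histogram_quantile().
        this.orderTimer = Timer.builder("order.processing.time")
            .publishPercentileHistogram()
            .register(registry);
    }

    public Order create(OrderRequest req) {
        // record(...) measures the lambda's duration and returns its result.
        return orderTimer.record(() -> {
            Order order = doCreate(req);
            orderCounter.increment();
            return order;
        });
    }
}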
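
When writing PromQL against these custom meters, note that the names seen in Prometheus differ from the dotted Micrometer names: the counter order.created is exported as order_created_total, and the timer order.processing.time as order_processing_time_seconds (with _count, _sum, and _bucket series). The sketch below approximates that mapping for these two meter kinds only. PrometheusNames and its kind parameter are hypothetical and this is not Micrometer's actual implementation.

```java
// Hypothetical approximation of how dotted meter names appear in Prometheus.
class PrometheusNames {

    // kind is "counter" or "timer" (illustrative values, not a Micrometer API).
    static String exportedName(String micrometerName, String kind) {
        String snake = micrometerName.replace('.', '_').replace('-', '_');
        if (kind.equals("counter") && !snake.endsWith("_total")) {
            return snake + "_total"; // counters get a _total suffix
        }
        if (kind.equals("timer") && !snake.endsWith("_seconds")) {
            return snake + "_seconds"; // timers are reported in seconds
        }
        return snake;
    }

    public static void main(String[] args) {
        System.out.println(exportedName("order.created", "counter"));       // order_created_total
        System.out.println(exportedName("order.processing.time", "timer")); // order_processing_time_seconds
    }
}
```

With these names, the order creation rate would be queried as rate(order_created_total[5m]), and the P99 processing time from order_processing_time_seconds_bucket.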

Docker Compose Deployment

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
 
  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
 
  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin  # for local testing only; change before exposing Grafana
    volumes:
      - grafana_data:/var/lib/grafana
 
volumes:
  prometheus_data:
  grafana_data:
