Prometheus
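Prometheus is an open-source time-series monitoring and alerting system. It collects metrics in pull mode, is commonly paired with Grafana for visualization, and is a core component of the cloud-native observability stack.
Core concepts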
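| Concept | Description |
|---|---|
| Metric | A metric: the basic unit of time-series data |
| Label | A key-value pair that distinguishes the dimensions of a metric |
| Target | An endpoint to be scraped (HTTP /metrics) |
| Scrape | The act of pulling metrics from a target |
| AlertRule | An alert condition defined with PromQL |
| Alertmanager | Handles routing, deduplication, silencing, and notification of alerts |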
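
To make the Metric, Label, and Target rows concrete, here is a minimal sketch of the kind of line a target returns from its /metrics endpoint in the text exposition format. The class `ExpositionSketch` and its methods are hypothetical helpers written for illustration; real applications would normally rely on a client library instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper: renders one sample in the Prometheus text exposition format. */
final class ExpositionSketch {

    /** Builds `name{k1="v1",k2="v2"} value`, or `name value` when there are no labels. */
    static String renderSample(String name, Map<String, String> labels, long value) {
        String labelPart = labels.entrySet().stream()
                .map(e -> e.getKey() + "=\"" + escape(e.getValue()) + "\"")
                .collect(Collectors.joining(","));
        return labelPart.isEmpty()
                ? name + " " + value
                : name + "{" + labelPart + "} " + value;
    }

    /** Escapes backslash, double quote, and newline inside a label value. */
    static String escape(String v) {
        return v.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("method", "GET");
        labels.put("status", "200");
        // A target would return lines like these when Prometheus scrapes it.
        System.out.println("# TYPE http_requests_total counter");
        System.out.println(renderSample("http_requests_total", labels, 1027));
        // prints: http_requests_total{method="GET",status="200"} 1027
    }
}
```

Each distinct combination of metric name and label values is stored as its own time series, which is why labels work well for dimensions such as method or status.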
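Metric types
| Type | Description | Examples |
|---|---|---|
| Counter | Cumulative value that only increases (it restarts from zero when the process restarts) | Total requests, total errors |
| Gauge | Instantaneous value that can go up or down | Memory usage, number of online users |
| Histogram | Observations counted into buckets; quantiles are estimated at query time with histogram_quantile | Request latency distribution |
| Summary | Quantiles computed on the client side; they cannot be aggregated across instances | P50/P99 response time |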
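
The Histogram row is easier to follow with a small model. The sketch below is a simplified, single-threaded illustration, not the client library's implementation: the class `HistogramSketch` is hypothetical, it keeps cumulative `le` bucket counts, and it estimates a quantile by linear interpolation inside a bucket, which is roughly the idea behind `histogram_quantile`. In Prometheus itself the function is normally applied to `rate()` of the bucket series rather than to raw counts.

```java
import java.util.Arrays;

/** Simplified, single-threaded model of a Prometheus-style histogram (illustration only). */
final class HistogramSketch {
    private final double[] upperBounds; // finite `le` bounds, ascending; +Inf is implicit
    private final long[] bucketCounts;  // per-bucket counts; the last slot is the +Inf bucket
    private long count;

    HistogramSketch(double... upperBounds) {
        if (upperBounds.length == 0) throw new IllegalArgumentException("need at least one bound");
        this.upperBounds = upperBounds.clone();
        Arrays.sort(this.upperBounds);
        this.bucketCounts = new long[upperBounds.length + 1];
    }

    void observe(double value) {
        int i = 0;
        while (i < upperBounds.length && value > upperBounds[i]) i++; // `le` is inclusive
        bucketCounts[i]++;
        count++;
    }

    /** Cumulative counts, as exposed by the `_bucket{le="..."}` series; last entry is le="+Inf". */
    long[] cumulativeCounts() {
        long[] out = new long[bucketCounts.length];
        long running = 0;
        for (int i = 0; i < bucketCounts.length; i++) {
            running += bucketCounts[i];
            out[i] = running;
        }
        return out;
    }

    /** Estimates quantile q (0 < q <= 1) by linear interpolation within the matching bucket. */
    double quantile(double q) {
        if (q <= 0 || q > 1) throw new IllegalArgumentException("q must be in (0, 1]");
        if (count == 0) return Double.NaN;
        long[] cum = cumulativeCounts();
        double rank = q * count;
        int b = 0;
        while (b < cum.length - 1 && cum[b] < rank) b++;
        if (b == upperBounds.length) {
            // The rank falls into the +Inf bucket: report the highest finite bound.
            return upperBounds[upperBounds.length - 1];
        }
        double lower = (b == 0) ? 0.0 : upperBounds[b - 1]; // first bucket assumed to start at 0
        long below = (b == 0) ? 0 : cum[b - 1];
        return lower + (upperBounds[b] - lower) * (rank - below) / (cum[b] - below);
    }

    public static void main(String[] args) {
        HistogramSketch h = new HistogramSketch(1.0, 2.0, 4.0);
        for (double v : new double[] {0.5, 1.5, 1.5, 3, 3, 3, 3, 10}) h.observe(v);
        System.out.println(Arrays.toString(h.cumulativeCounts())); // [1, 3, 7, 8]
        System.out.println(h.quantile(0.5));                        // 2.5
    }
}
```

Because the estimate is interpolated, its accuracy depends on the bucket boundaries; values beyond the last finite bound can only be reported as that bound.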
prometheus.yml configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
rule_files:
  - 'rules/*.yml'
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['app:8080']
  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # Use the prometheus.io/path annotation, when present, as the metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

PromQL common queries
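
Most queries in this section wrap a Counter in `rate()`. As rough intuition for what that returns, the sketch below computes the average per-second increase across the samples in one window and treats any drop in value as a counter reset. It is a deliberate simplification: the class `RateSketch` is hypothetical, and the real function also extrapolates toward the window boundaries, which this sketch ignores.

```java
/** Simplified model of what rate() computes over one range window (illustration only). */
final class RateSketch {

    /**
     * @param timestampsSec sample timestamps in seconds, strictly ascending
     * @param values        counter values at those timestamps
     * @return average per-second increase, or NaN when fewer than two samples are available
     */
    static double simpleRate(double[] timestampsSec, double[] values) {
        if (timestampsSec.length != values.length) {
            throw new IllegalArgumentException("timestamps and values must have the same length");
        }
        if (values.length < 2) return Double.NaN;
        double increase = 0;
        for (int i = 1; i < values.length; i++) {
            double delta = values[i] - values[i - 1];
            // A drop means the counter restarted from zero (for example after a process
            // restart), so the new value itself is counted as the increase.
            increase += (delta >= 0) ? delta : values[i];
        }
        double elapsedSec = timestampsSec[timestampsSec.length - 1] - timestampsSec[0];
        return increase / elapsedSec;
    }

    public static void main(String[] args) {
        double[] ts = {0, 15, 30, 45, 60};
        double[] counter = {100, 130, 160, 30, 60}; // reset between t=30 and t=45
        System.out.println(simpleRate(ts, counter)); // 2.0
    }
}
```

This reset handling is why `rate()` should be applied to Counters only; a Gauge that legitimately decreases would be misread as a reset.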
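# Per-second request rate (QPS) over the past 5 minutes, per series
rate(http_requests_total[5m])
# QPS grouped by path
sum(rate(http_requests_total[5m])) by (path)
# Error rate (share of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
# P99 latency (Histogram)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# JVM heap usage ratio, summed across memory pools per instance
sum by (instance) (jvm_memory_used_bytes{area="heap"})
  / sum by (instance) (jvm_memory_max_bytes{area="heap"})
# Whether the instance is up (1 = up, 0 = down)
up{job="spring-boot"}

Alerting rules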
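
The rules in this section rely on the `for` clause, which holds an alert in a pending state until its expression has been true for the whole duration. The sketch below models that behavior under simplifying assumptions: the class `AlertForSketch` is hypothetical, evaluations happen exactly every `evaluation_interval`, and a single false evaluation resets the alert.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of how a rule with a `for` duration changes state (illustration only). */
final class AlertForSketch {
    enum State { INACTIVE, PENDING, FIRING }

    /**
     * @param conditionTrue whether the rule expression returned a result at each evaluation
     * @param intervalSec   evaluation_interval in seconds
     * @param forSec        the rule's `for` duration in seconds
     * @return the alert state after each evaluation
     */
    static List<State> evaluate(boolean[] conditionTrue, long intervalSec, long forSec) {
        List<State> states = new ArrayList<>();
        long activeSinceSec = -1; // when the condition first became true; -1 while inactive
        for (int i = 0; i < conditionTrue.length; i++) {
            long nowSec = i * intervalSec;
            if (!conditionTrue[i]) {
                activeSinceSec = -1; // any false evaluation resets the pending timer
                states.add(State.INACTIVE);
                continue;
            }
            if (activeSinceSec < 0) activeSinceSec = nowSec;
            states.add(nowSec - activeSinceSec >= forSec ? State.FIRING : State.PENDING);
        }
        return states;
    }

    public static void main(String[] args) {
        // evaluation_interval: 15s, for: 2m, condition true at every evaluation
        boolean[] alwaysTrue = {true, true, true, true, true, true, true, true, true};
        System.out.println(evaluate(alwaysTrue, 15, 120));
    }
}
```

With a 15s evaluation interval and `for: 2m`, the modeled alert stays pending for the first eight evaluations and fires on the ninth, two minutes after the condition first held. Only firing alerts are sent to Alertmanager, so `for` trades detection delay for fewer notifications from short spikes.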
# rules/app.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5%"
          description: "Current error rate: {{ $value | humanizePercentage }}"
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} is down"

Spring Boot integration (Micrometer)
<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    tags:
      application: ${spring.application.name}
    distribution:
      percentiles-histogram:
        http.server.requests: true
      slo:
        http.server.requests: 50ms, 100ms, 500ms

Custom metrics:
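import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
public class OrderService {
    private final Counter orderCounter;
    private final Timer orderTimer;

    public OrderService(MeterRegistry registry) {
        // Exported to Prometheus as order_created_total{type="online"}
        this.orderCounter = Counter.builder("order.created")
                .tag("type", "online")
                .register(registry);
        // Exported as order_processing_time_seconds_* including histogram buckets
        this.orderTimer = Timer.builder("order.processing.time")
                .publishPercentileHistogram()
                .register(registry);
    }

    public Order create(OrderRequest req) {
        // Time the whole creation; the counter only advances when doCreate does not throw
        return orderTimer.record(() -> {
            Order order = doCreate(req); // business logic, not shown here
            orderCounter.increment();
            return order;
        });
    }
}

Docker Compose deployment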
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"
    environment:
      # Default credentials; change them outside local testing
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana_data:/var/lib/grafana
volumes:
  prometheus_data:
  grafana_data:

Related documents
- Kubernetes: K8s service discovery and Pod annotations
- Docker: containerized deployment
- SkyWalking: distributed tracing
- OpenTelemetry: unified observability standard