SkyWalking


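Apache SkyWalking is an open-source APM (application performance monitoring) platform. It provides distributed tracing, service topology maps, performance metrics, and log aggregation. Its Java Agent instruments an application without any changes to the application code.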


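Core concepts

Concept      Description
Trace        The complete call chain of one request as it passes through multiple services
Segment      The set of spans that one trace produces inside a single service (more precisely, within one thread of one process)
Span         The smallest unit of tracing; records a single operation
Service      A logical service name, e.g. order-service
Endpoint     A service entry point, e.g. GET /orders
OAP Server   The backend that analyzes and stores the collected data
UI           The web interface for visualization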
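
How Trace, Segment, and Span relate can be pictured as plain data. The following is a minimal sketch (Java 16+ for records) with made-up types and sample values; it is not SkyWalking's internal model, only an illustration of "one trace ID, one segment per service, several spans per segment".

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative data model only; these are not SkyWalking's own classes.
record Span(int spanId, int parentSpanId, String operationName) {}

// A segment groups the spans one service produced for one trace.
record Segment(String traceId, String service, List<Span> spans) {}

class TraceModelDemo {
    // A trace is reassembled by collecting every segment that shares its trace ID.
    static List<Segment> segmentsOf(String traceId, List<Segment> all) {
        return all.stream()
                .filter(s -> s.traceId().equals(traceId))
                .collect(Collectors.toList());
    }

    // Total number of spans across the given segments.
    static int spanCount(List<Segment> segments) {
        return segments.stream().mapToInt(s -> s.spans().size()).sum();
    }

    // Hypothetical sample data: trace t-1 crosses two services, t-2 stays in one.
    static List<Segment> sample() {
        return List.of(
            new Segment("t-1", "order-service", List.of(
                new Span(0, -1, "GET /orders"),
                new Span(1, 0, "call payment-service"))),
            new Segment("t-1", "payment-service", List.of(
                new Span(0, -1, "POST /pay"))),
            new Segment("t-2", "order-service", List.of(
                new Span(0, -1, "GET /orders"))));
    }

    public static void main(String[] args) {
        List<Segment> trace = segmentsOf("t-1", sample());
        System.out.println(trace.size() + " segments, " + spanCount(trace) + " spans");
    }
}
```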

Java Agent setup (no code changes)

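# Download the SkyWalking Java Agent
curl -L https://archive.apache.org/dist/skywalking/java-agent/9.3.0/apache-skywalking-java-agent-9.3.0.tgz \
     -o skywalking-agent.tgz
# Unpack into /opt so the agent ends up at /opt/skywalking-agent
tar -xzf skywalking-agent.tgz -C /opt
# Attach the agent when starting the application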
java \
  -javaagent:/opt/skywalking-agent/skywalking-agent.jar \
  -Dskywalking.agent.service_name=order-service \
  -Dskywalking.collector.backend_service=oap:11800 \
  -jar app.jar
# Equivalent environment variables for Docker / K8s
SW_AGENT_NAME: order-service
SW_AGENT_COLLECTOR_BACKEND_SERVICES: oap:11800
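
The same setting can therefore arrive either as a -D system property or as an environment variable. The sketch below illustrates that lookup order, assuming a system property takes precedence over the environment variable (which matches the override order described in the agent's configuration documentation). The class, method, and fallback value are hypothetical; this is not the agent's own code.

```java
import java.util.Map;
import java.util.Properties;

class AgentSettingSketch {
    // Returns the first non-blank value among: system property, environment variable, fallback.
    static String resolve(String sysPropKey, String envKey, String fallback,
                          Properties sysProps, Map<String, String> env) {
        String fromProp = sysProps.getProperty(sysPropKey);
        if (fromProp != null && !fromProp.isBlank()) {
            return fromProp;
        }
        String fromEnv = env.get(envKey);
        if (fromEnv != null && !fromEnv.isBlank()) {
            return fromEnv;
        }
        return fallback;
    }

    public static void main(String[] args) {
        // "unknown-service" is a placeholder fallback chosen for this example.
        String name = resolve("skywalking.agent.service_name", "SW_AGENT_NAME",
                "unknown-service", System.getProperties(), System.getenv());
        System.out.println("service name = " + name);
    }
}
```

Run with `-Dskywalking.agent.service_name=order-service` or with `SW_AGENT_NAME=order-service` exported to see either source being picked up.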

Docker Compose deployment

services:
  oap:
    image: apache/skywalking-oap-server:9.7.0
    ports:
      - "11800:11800"   # gRPC (agents report here)
      - "12800:12800"   # HTTP REST API
    environment:
      SW_STORAGE: elasticsearch
      SW_STORAGE_ES_CLUSTER_NODES: elasticsearch:9200
    depends_on:
      - elasticsearch
 
  ui:
    image: apache/skywalking-ui:9.7.0
    ports:
      - "8080:8080"
    environment:
      SW_OAP_ADDRESS: http://oap:12800
 
  elasticsearch:
    image: elasticsearch:8.13.0
    environment:
      discovery.type: single-node
      xpack.security.enabled: "false"
      ES_JAVA_OPTS: "-Xms1g -Xmx1g"
    volumes:
      - es_data:/usr/share/elasticsearch/data
 
volumes:
  es_data:

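Storage backend options:

Storage          Suitable for
H2 (default)     Local development and quick evaluation
Elasticsearch    Recommended for production; strong query performance
BanyanDB         SkyWalking's native storage; low resource usage
MySQL / TiDB     Small to medium production deployments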

Kubernetes deployment (Helm)

helm repo add skywalking https://apache.jfrog.io/artifactory/skywalking-helm
helm install skywalking skywalking/skywalking \
  --namespace monitoring --create-namespace \
  --set oap.replicas=2 \
  --set ui.enabled=true \
  --set elasticsearch.enabled=true

Injecting the agent into a Kubernetes Pod (init container approach):

initContainers:
  - name: skywalking-agent
    image: apache/skywalking-java-agent:9.3.0-java17
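    # Copy the agent into the shared volume so the app container finds it at /opt/skywalking/agent
    command: ['sh', '-c', 'cp -r /skywalking/agent /opt/skywalking/']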
    volumeMounts:
      - name: sw-agent
        mountPath: /opt/skywalking
 
containers:
  - name: app
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-javaagent:/opt/skywalking/agent/skywalking-agent.jar"
      - name: SW_AGENT_NAME
        value: "order-service"
      - name: SW_AGENT_COLLECTOR_BACKEND_SERVICES
        value: "skywalking-oap:11800"
    volumeMounts:
      - name: sw-agent
        mountPath: /opt/skywalking
 
volumes:
  - name: sw-agent
    emptyDir: {}

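Spring Boot integration (application toolkit)

The toolkit libraries complement the Java Agent rather than replace it: annotations such as @Trace only take effect when the agent is attached to the JVM.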

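<!-- pom.xml -->
<!-- apm-toolkit-trace provides @Trace, @Tag and ActiveSpan, used in the custom span example -->
<dependency>
    <groupId>org.apache.skywalking</groupId>
    <artifactId>apm-toolkit-trace</artifactId>
    <version>9.3.0</version>
</dependency>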
<dependency>
    <groupId>org.apache.skywalking</groupId>
    <artifactId>apm-toolkit-micrometer-1.10</artifactId>
    <version>9.3.0</version>
</dependency>
<dependency>
    <groupId>org.apache.skywalking</groupId>
    <artifactId>apm-toolkit-logback-1.x</artifactId>
    <version>9.3.0</version>
</dependency>

Custom spans:

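import org.apache.skywalking.apm.toolkit.trace.ActiveSpan;
import org.apache.skywalking.apm.toolkit.trace.Tag;
import org.apache.skywalking.apm.toolkit.trace.Trace;
import org.springframework.stereotype.Service;
 
// PayResult, PayRequest and doCharge are application code not shown here.
@Service
public class PaymentService {
 
    // Wraps each call in a local span named "payment.process".
    @Trace(operationName = "payment.process")
    // Tags the span with the first method argument (arg[0] is `method`).
    @Tag(key = "payment.method", value = "arg[0]")
    public PayResult pay(String method, PayRequest req) {
        // Adds a tag to the currently active span at runtime.
        ActiveSpan.tag("payment.amount", String.valueOf(req.getAmount()));
        return doCharge(req);
    }
}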
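
To make the meaning of `value = "arg[0]"` concrete, here is a deliberately simplified sketch of how such an expression could be resolved against a method's arguments. It handles only the plain `arg[N]` form; the agent's real expression support is richer, and the class and method names here are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagExpressionSketch {
    // Matches exactly "arg[N]" where N is a non-negative integer.
    private static final Pattern ARG = Pattern.compile("arg\\[(\\d+)\\]");

    // Resolves "arg[N]" to the string form of args[N]; returns null for
    // unsupported expressions or an index outside the argument list.
    static String resolve(String expression, Object[] args) {
        Matcher m = ARG.matcher(expression);
        if (!m.matches()) {
            return null;
        }
        int index = Integer.parseInt(m.group(1));
        if (index >= args.length) {
            return null;
        }
        return String.valueOf(args[index]);
    }

    public static void main(String[] args) {
        // Hypothetical call: pay("alipay", req) with req printed as a placeholder.
        Object[] callArgs = {"alipay", "PayRequest{...}"};
        System.out.println("payment.method = " + resolve("arg[0]", callArgs));
    }
}
```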

Logback integration (correlating logs with the trace ID)

<!-- logback-spring.xml -->
<appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
        <layout class="org.apache.skywalking.apm.toolkit.log.logback.v1.x.TraceIdPatternLogbackLayout">
            <pattern>%d{yyyy-MM-dd HH:mm:ss} [%tid] %-5level %logger - %msg%n</pattern>
        </layout>
    </encoder>
</appender>

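With this layout, the %tid placeholder is filled with the current trace ID on every log line, so a log entry can be matched to its trace in the SkyWalking UI by that ID. A console appender only writes the ID locally; for the log lines themselves to show up in the UI, they must also be reported to the OAP, for example through the toolkit's gRPC log appender.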
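
When logs are collected elsewhere, the trace ID can be pulled back out of each line to look up the trace. The sketch below assumes that %tid renders as `TID:<id>` (and as `TID: N/A` when no trace is active) inside the square brackets of the pattern above; the sample log line and class name are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TraceIdExtractor {
    // Assumption: the layout prints "[TID:<id>]"; optional whitespace after the colon is tolerated.
    private static final Pattern TID = Pattern.compile("\\[TID:\\s*([^\\]\\s]+)\\]");

    // Returns the trace ID found in the line, or empty if there is none or it is "N/A".
    static Optional<String> traceId(String logLine) {
        Matcher m = TID.matcher(logLine);
        if (m.find() && !"N/A".equals(m.group(1))) {
            return Optional.of(m.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical log line in the format produced by the pattern above.
        String line = "2024-05-01 12:00:00 [TID:abc123.45.678] INFO  com.example.OrderService - order created";
        System.out.println(traceId(line).orElse("no trace id"));
    }
}
```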


Alarm rule configuration

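# alarm-settings.yml
# Note: metrics-name/op/threshold/count is the older rule syntax. Newer OAP releases
# (the 9.7.0 image used above among them) express the condition as an MQE
# `expression`, e.g. sum(service_resp_time > 1000) >= 3, and configure webhooks
# under `hooks`. Check the alarm documentation for your OAP version.
rules:
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000        # response time > 1000 ms (1 s)
    period: 10             # evaluate the last 10 minutes
    count: 3               # fire when the threshold is crossed in 3 of them
    silence-period: 5      # after firing, stay silent for 5 minutes
    message: "Response time of service {name} exceeds 1s"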
 
  service_error_rate_rule:
    metrics-name: service_sla    # success rate in units of 0.01% (10000 = 100%)
    op: "<"
    threshold: 9500        # success rate < 95%
    period: 10
    count: 3
    message: "Success rate of service {name} is below 95%"
 
webhooks:
  - https://your-alert-webhook/notify
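
How period, count, and threshold interact is easiest to see in code. The following is a minimal sketch of the rule semantics as described in the comments above (per-minute metric values, a window of `period` minutes, an alarm once at least `count` values match); it ignores silence-period and is not the OAP's implementation. All names and sample values are hypothetical.

```java
class AlarmRuleSketch {
    // Applies the rule's comparison operator to one metric value.
    static boolean matches(long value, String op, long threshold) {
        switch (op) {
            case ">":  return value > threshold;
            case ">=": return value >= threshold;
            case "<":  return value < threshold;
            case "<=": return value <= threshold;
            default:   throw new IllegalArgumentException("unsupported op: " + op);
        }
    }

    // True when at least `count` of the last `period` per-minute values match the condition.
    static boolean shouldAlarm(long[] perMinuteValues, String op, long threshold,
                               int period, int count) {
        int from = Math.max(0, perMinuteValues.length - period);
        int hits = 0;
        for (int i = from; i < perMinuteValues.length; i++) {
            if (matches(perMinuteValues[i], op, threshold)) {
                hits++;
            }
        }
        return hits >= count;
    }

    public static void main(String[] args) {
        // Response times in ms: three minutes exceed 1000, so the first rule would fire.
        long[] respTime = {200, 1500, 300, 1200, 900, 1100};
        System.out.println("resp time alarm: " + shouldAlarm(respTime, ">", 1000, 10, 3));

        // Success rate in units of 0.01%: only one minute is below 9500 (95%).
        long[] sla = {9900, 9400, 9800};
        System.out.println("sla alarm: " + shouldAlarm(sla, "<", 9500, 10, 3));
    }
}
```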
