# Building an Observability System

Observability is the "eyes" of a distributed system, resting on three pillars: distributed tracing, log aggregation, and metrics with alerting. With a complete observability stack you can locate problems quickly, analyze performance bottlenecks, and keep the system stable. This article walks through how to build one end to end.
## 1. The Three Pillars of Observability

```mermaid
graph TB
    T[Tracing<br/>SkyWalking]
    L[Logging<br/>ELK/Loki]
    M[Metrics<br/>Prometheus]
```

All three pillars share the same pipeline shape:

```mermaid
graph TB
    App[Application] --> Agent[Probe/Agent]
    Agent --> Collect[Data collection]
    Collect --> Store[Storage]
    Store --> Viz[Visualization]
    Viz --> Alert[Alerting]
```
## 2. Distributed Tracing

### 2.1 Deploying SkyWalking

```yaml
# docker-compose.yml
version: '3.8'
services:
  # Elasticsearch storage
  elasticsearch:
    image: elasticsearch:7.17.0
    environment:
      - discovery.type=single-node
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    volumes:
      - es_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"

  # SkyWalking OAP
  oap:
    image: apache/skywalking-oap-server:9.0.0
    environment:
      # since SkyWalking 8.8 the ES6/ES7 selectors are merged into "elasticsearch"
      SW_STORAGE: elasticsearch
      SW_STORAGE_ES_CLUSTER_NODES: elasticsearch:9200
    depends_on:
      - elasticsearch
    ports:
      - "11800:11800"   # gRPC, agent reporting
      - "12800:12800"   # HTTP, UI queries
  # SkyWalking UI
  ui:
    image: apache/skywalking-ui:9.0.0
    environment:
      SW_OAP_ADDRESS: http://oap:12800
    depends_on:
      - oap
    ports:
      - "8080:8080"

volumes:
  es_data:
```
### 2.2 Instrumenting a Java Application

```bash
# 1. Download the SkyWalking agent
wget https://downloads.apache.org/skywalking/9.0.0/apache-skywalking-apm-9.0.0.tar.gz
tar -xzf apache-skywalking-apm-9.0.0.tar.gz

# 2. Configure the agent
vim apache-skywalking-apm/agent/config/agent.config
# Key settings:
#   collector.backend_service=${SW_AGENT_COLLECTOR_BACKEND_SERVICES:127.0.0.1:11800}
#   agent.service_name=${SW_AGENT_NAME:MyApp}

# 3. Start the application with the Java agent attached
java -javaagent:/path/to/skywalking/agent/skywalking-agent.jar \
  -Dskywalking.agent.service_name=myapp \
  -Dskywalking.collector.backend_service=127.0.0.1:11800 \
  -jar myapp.jar
```
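In containerized deployments the agent is usually baked into the application image instead of installed on the host. A minimal sketch, assuming the base image tag, file paths, and `oap:11800` address (they are illustrative, not prescribed by SkyWalking):

```dockerfile
# Sketch: bundle the SkyWalking agent into the application image
# (base image, paths, and backend address are assumptions)
FROM eclipse-temurin:17-jre
COPY skywalking-agent /skywalking/agent
COPY myapp.jar /app/myapp.jar
# The agent.config placeholders shown above resolve these env vars
ENV SW_AGENT_NAME=myapp \
    SW_AGENT_COLLECTOR_BACKEND_SERVICES=oap:11800
ENTRYPOINT ["java", "-javaagent:/skywalking/agent/skywalking-agent.jar", "-jar", "/app/myapp.jar"]
```

Because the agent reads `SW_*` environment variables, the same image can point at different OAP backends per environment without rebuilding.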
### 2.3 Spring Boot Integration

```yaml
# application.yml
# Note: the SkyWalking Java agent itself reads configuration from
# agent.config, -Dskywalking.* system properties, or SW_* environment
# variables; the keys below mirror those settings for reference.
skywalking:
  agent:
    service_name: order-service
    namespace: production
  collector:
    backend_service: 127.0.0.1:11800
  sampling:
    rate: 100  # sampling rate 100%

logging:
  pattern:
    console: "%d{yyyy-MM-dd HH:mm:ss} [%tid] %-5level %logger{36} - %msg%n"
```
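For `%tid` to actually resolve to the SkyWalking trace ID, the logback toolkit dependency and its layout are needed; a sketch following the SkyWalking toolkit docs (the version number is an assumption — align it with your agent version):

```xml
<!-- pom.xml -->
<dependency>
    <groupId>org.apache.skywalking</groupId>
    <artifactId>apm-toolkit-logback-1.x</artifactId>
    <version>9.0.0</version>
</dependency>

<!-- logback-spring.xml: the toolkit layout substitutes %tid -->
<appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
        <layout class="org.apache.skywalking.apm.toolkit.log.logback.v1.x.TraceIdPatternLogbackLayout">
            <pattern>%d{yyyy-MM-dd HH:mm:ss} [%tid] %-5level %logger{36} - %msg%n</pattern>
        </layout>
    </encoder>
</appender>
```

With this in place every log line carries the trace ID, which is what makes the per-trace Kibana queries in section 3.4 possible.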
```java
/**
 * Custom tracing with the OpenTracing API.
 */
@Component
public class CustomTraceService {

    @Autowired
    private Tracer tracer;

    @Autowired
    private RestTemplate restTemplate;

    /**
     * Create a span manually.
     */
    public void processOrder(Order order) {
        Span span = tracer.buildSpan("processOrder").start();
        try (Scope scope = tracer.activateSpan(span)) {
            // Business logic
            span.setTag("orderId", order.getId());
            span.setTag("userId", order.getUserId());
            // Call a downstream service
            callPaymentService(order);
        } catch (Exception e) {
            span.setTag(Tags.ERROR.getKey(), true);
            span.log(ExceptionUtils.getMessage(e));
            throw e;
        } finally {
            span.finish();
        }
    }

    /**
     * Cross-service call (propagates the trace context).
     */
    public void callPaymentService(Order order) {
        Span span = tracer.buildSpan("callPaymentService").start();
        try (Scope scope = tracer.activateSpan(span)) {
            // Inject the trace context into the HTTP request headers
            HttpHeaders headers = new HttpHeaders();
            tracer.inject(span.context(),
                    Format.Builtin.HTTP_HEADERS,
                    new HttpHeadersCarrier(headers));
            // Make the HTTP call
            restTemplate.exchange(
                    "http://payment-service/api/pay",
                    HttpMethod.POST,
                    new HttpEntity<>(order, headers),
                    PaymentResult.class
            );
        } finally {
            span.finish();
        }
    }

    /**
     * Carrier that writes trace headers into Spring's HttpHeaders.
     */
    static class HttpHeadersCarrier implements TextMap {
        private final HttpHeaders headers;

        HttpHeadersCarrier(HttpHeaders headers) {
            this.headers = headers;
        }

        @Override
        public Iterator<Map.Entry<String, String>> iterator() {
            return headers.entrySet().stream()
                    .map(e -> (Map.Entry<String, String>)
                            new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue().get(0)))
                    .iterator();
        }

        @Override
        public void put(String key, String value) {
            headers.set(key, value);
        }
    }
}
```
### 2.4 Trace Analysis

A typical trace view:

```text
Trace ID: abc123...
┌─────────────────────────────────────────────────────┐
│ Gateway (100ms)                                     │
│ └─ Order Service (80ms)                             │
│    ├─ Database: select (20ms)                       │
│    ├─ Redis: get (5ms)                              │
│    └─ Payment Service (50ms)                        │
│       ├─ Database: insert (30ms)                    │
│       └─ MQ: send (15ms)                            │
└─────────────────────────────────────────────────────┘
```

Bottleneck analysis:

```text
├── Slow query: Database: insert (30ms)
├── Slow call: Payment Service (50ms)
└── Suggested optimizations:
    ├── Add a database index
    └── Make the Payment Service call asynchronous
```
## 3. Log Aggregation

### 3.1 ELK Architecture

```mermaid
graph TB
    subgraph App[Application layer]
        A1[Spring Boot app]
    end
    subgraph Collect[Collection layer]
        F[Filebeat]
    end
    subgraph Process[Processing layer]
        K[Kafka]
        LS[Logstash]
    end
    subgraph Store[Storage layer]
        ES[Elasticsearch cluster]
    end
    subgraph View[Presentation layer]
        KB[Kibana]
    end
    A1 -->|JSON logs| F
    F --> K
    K --> LS
    LS --> ES
    ES --> KB
```

Layer-by-layer responsibilities:

```text
Application layer
├── Spring Boot applications
└── Log output (JSON format)

Collection layer
├── Filebeat (lightweight)
└── Ships logs → Kafka

Processing layer
├── Logstash (optional)
├── Parsing, filtering, enrichment
└── Kafka → Logstash → Elasticsearch

Storage layer
└── Elasticsearch cluster

Presentation layer
└── Kibana
    ├── Log search
    ├── Dashboards
    └── Alerting
```
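The processing layer can be a very short Logstash pipeline. A sketch, assuming the Kafka topic `logs-order-service` and daily index naming used in the Filebeat configuration below:

```conf
# logstash.conf (sketch; topic, hosts, and index names are assumptions)
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"
    topics => ["logs-order-service"]
    codec => "json"
  }
}
filter {
  # Parse the application timestamp so Kibana time filters line up
  date {
    match => ["@timestamp", "ISO8601"]
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-order-service-%{+yyyy.MM.dd}"
  }
}
```

Since the application already emits JSON, the `json` codec does most of the parsing; heavier enrichment (grok, geoip, field renaming) would go in the `filter` block.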
### 3.2 Spring Boot Logging Configuration

```xml
<!-- pom.xml -->
<dependency>
    <groupId>net.logstash.logback</groupId>
    <artifactId>logstash-logback-encoder</artifactId>
    <version>7.2</version>
</dependency>

<!-- logback-spring.xml -->
<configuration>
    <appender name="JSON_CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder class="net.logstash.logback.encoder.LogstashEncoder">
            <!-- Custom fields -->
            <provider class="net.logstash.logback.composite.loggingevent.ArgumentsJsonProvider"/>
            <customFields>{"app":"order-service","env":"production"}</customFields>
        </encoder>
    </appender>

    <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>logs/app.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>logs/app.%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <maxFileSize>100MB</maxFileSize>
            <maxHistory>30</maxHistory>
        </rollingPolicy>
        <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
    </appender>

    <root level="INFO">
        <appender-ref ref="JSON_CONSOLE"/>
        <appender-ref ref="FILE"/>
    </root>
</configuration>
```
### 3.3 Filebeat Configuration

```yaml
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    json.keys_under_root: true
    json.add_error_key: true
    fields:
      app: order-service
      env: production

output.kafka:
  enabled: true
  hosts: ["kafka1:9092", "kafka2:9092"]
  topic: "logs-order-service"
  partition.hash:
    reachable_only: true
  compression: gzip
  max_message_bytes: 1000000

# Alternatively, ship directly to Elasticsearch
# (Filebeat allows only one output to be enabled at a time)
#output.elasticsearch:
#  hosts: ["elasticsearch:9200"]
#  indices:
#    - index: "logs-order-service-%{+yyyy.MM.dd}"
```
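Daily indices like `logs-order-service-*` grow without bound, so retention is best handled by an Elasticsearch ILM policy rather than manual deletion. A sketch (the policy name and the rollover/retention thresholds are assumptions to tune for your volume):

```json
PUT _ilm/policy/logs-retention
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "1d" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

Attach the policy via an index template so every new daily index picks it up automatically.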
### 3.4 Kibana Queries

```text
# Kibana query examples (KQL)

# 1. By trace ID
trace.id: "abc123..."

# 2. By log level and application
log.level: "ERROR" AND app: "order-service"

# 3. By time range
@timestamp >= now-1h AND @timestamp <= now

# 4. By user ID
userId: "12345"

# 5. Slow requests
duration_ms > 1000

# 6. Count errors via the Elasticsearch API
GET /logs-*/_count
{
  "query": {
    "bool": {
      "must": [
        { "term": { "log.level": "ERROR" } },
        { "term": { "app": "order-service" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
```
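Beyond a raw count, a bucketed view shows when errors spiked. A sketch using a `date_histogram` aggregation over the same `logs-*` indices (the 24h window and 1h interval are arbitrary choices):

```json
GET /logs-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "log.level": "ERROR" } },
        { "term": { "app": "order-service" } },
        { "range": { "@timestamp": { "gte": "now-24h" } } }
      ]
    }
  },
  "aggs": {
    "errors_over_time": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" }
    }
  }
}
```

`"size": 0` suppresses the document hits so the response carries only the hourly buckets.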
## 4. Metrics and Alerting

### 4.1 Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'order-service'
    static_configs:
      - targets: ['order-service:8080']
    metrics_path: '/actuator/prometheus'
  - job_name: 'gateway'
    static_configs:
      - targets: ['gateway:8080']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```
### 4.2 Spring Boot Metrics

```xml
<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```

```yaml
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  endpoint:
    health:
      show-details: always
  metrics:
    export:
      prometheus:
        enabled: true
    tags:
      app: order-service
      env: production
```
### 4.3 Custom Metrics

```java
/**
 * Custom application metrics with Micrometer.
 */
@Component
public class CustomMetrics {

    private final MeterRegistry meterRegistry;

    // Counters
    private final Counter orderCreatedCounter;
    private final Counter orderFailedCounter;
    // Timer
    private final Timer processOrderTimer;
    // Distribution summary
    private final DistributionSummary orderAmountSummary;

    @Autowired
    private OrderMapper orderMapper;

    public CustomMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        // Counters
        this.orderCreatedCounter = Counter.builder("order.created.total")
                .description("Total orders created")
                .tag("app", "order-service")
                .register(meterRegistry);
        this.orderFailedCounter = Counter.builder("order.failed.total")
                .description("Total orders failed")
                .tag("app", "order-service")
                .register(meterRegistry);
        // Timer
        this.processOrderTimer = Timer.builder("order.process.duration")
                .description("Order processing time")
                .tag("app", "order-service")
                .register(meterRegistry);
        // Distribution summary
        this.orderAmountSummary = DistributionSummary.builder("order.amount")
                .description("Order amount distribution")
                .tag("app", "order-service")
                .baseUnit("CNY")
                .register(meterRegistry);
        // Gauge, sampled lazily from this bean on each scrape
        Gauge.builder("order.active.count", this, CustomMetrics::getActiveOrders)
                .description("Number of active orders")
                .tag("app", "order-service")
                .register(meterRegistry);
    }

    /**
     * Record an order creation. Note that a counter's tags are fixed when
     * it is built; per-status series would be obtained from the registry
     * instead, with consistent tag keys across all series of the name.
     */
    public void recordOrderCreated() {
        orderCreatedCounter.increment();
    }

    /**
     * Time order processing.
     */
    public <T> T recordProcessOrder(Supplier<T> supplier) {
        return processOrderTimer.record(supplier);
    }

    /**
     * Record an order amount.
     */
    public void recordOrderAmount(BigDecimal amount) {
        orderAmountSummary.record(amount.doubleValue());
    }

    /**
     * Current number of active orders.
     */
    private double getActiveOrders() {
        // Read from the database or a cache
        return orderMapper.countActive();
    }
}
```
### 4.4 Alerting Rules

```yaml
# alerts.yml
groups:
  - name: order-service-alerts
    rules:
      # 1. Service down
      - alert: ServiceDown
        expr: up{job="order-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service down: {{ $labels.instance }}"
          description: "{{ $labels.job }} has been down for more than 1 minute"

      # 2. High error rate
      - alert: HighErrorRate
        expr: |
          rate(http_server_requests_seconds_count{status=~"5.."}[5m])
            / rate(http_server_requests_seconds_count[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate"
          description: "Error rate above 5% (current: {{ $value | humanizePercentage }})"

      # 3. High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_server_requests_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency"
          description: "P95 latency above 1 second (current: {{ $value }}s)"

      # 4. JVM memory
      - alert: HighMemoryUsage
        expr: |
          jvm_memory_used_bytes{area="heap"}
            / jvm_memory_max_bytes{area="heap"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High JVM memory usage"
          description: "Heap usage above 85% (current: {{ $value | humanizePercentage }})"

      # 5. Order failures (rate() is per-second)
      - alert: OrderFailed
        expr: rate(order_failed_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High order failure rate"
          description: "Order failure rate above 10/s (current: {{ $value }})"
```
### 4.5 Alert Notifications

```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
    - match:
        severity: warning
      receiver: 'warning'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    webhook_configs:
      - url: 'http://dingtalk-webhook/webhook'
        send_resolved: true
  - name: 'warning'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
```
## 5. Grafana Visualization

### 5.1 Dashboards

```json
{
  "dashboard": {
    "title": "Order Service Dashboard",
    "panels": [
      {
        "title": "QPS",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_server_requests_seconds_count[1m])",
            "legendFormat": "{{ method }} {{ uri }}"
          }
        ]
      },
      {
        "title": "Latency (P95/P99)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m]))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(http_server_requests_seconds_bucket[5m]))",
            "legendFormat": "P99"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_server_requests_seconds_count{status=~\"5..\"}[5m]) / rate(http_server_requests_seconds_count[5m])",
            "legendFormat": "Error Rate"
          }
        ]
      },
      {
        "title": "JVM Memory",
        "type": "graph",
        "targets": [
          {
            "expr": "jvm_memory_used_bytes{area=\"heap\"}",
            "legendFormat": "Used"
          },
          {
            "expr": "jvm_memory_max_bytes{area=\"heap\"}",
            "legendFormat": "Max"
          }
        ]
      },
      {
        "title": "Order Metrics",
        "type": "stat",
        "targets": [
          {
            "expr": "rate(order_created_total[1m])",
            "legendFormat": "Order creation rate"
          },
          {
            "expr": "order_active_count",
            "legendFormat": "Active orders"
          }
        ]
      }
    ]
  }
}
```
## 6. Summary

### 6.1 Key Points

- Tracing: SkyWalking attaches as a non-invasive Java agent and makes root-cause analysis fast
- Log aggregation: ELK/Loki centralize logs; the JSON format keeps them easy to parse and analyze
- Metrics and alerting: Prometheus + Grafana for real-time monitoring plus alert notifications
- Metric design: the RED method (Rate, Errors, Duration)
- Alert severity levels: Critical, Warning, Info
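The RED method maps directly onto PromQL over the Spring Boot metrics used throughout this article; a sketch of the three queries:

```promql
# Rate: requests per second
sum(rate(http_server_requests_seconds_count[5m]))

# Errors: share of 5xx responses
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count[5m]))

# Duration: P95 latency
histogram_quantile(0.95,
  sum by (le) (rate(http_server_requests_seconds_bucket[5m])))
```

These three expressions are usually the first panels on a service dashboard and the basis of the first alert rules.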
### 6.2 Implementation Roadmap

```mermaid
graph TB
    subgraph Phase1[Phase 1 - Foundation, 1-2 weeks]
        P1A[Deploy SkyWalking]
        P1B[Deploy Prometheus + Grafana]
        P1C[Attach agents to applications]
    end
    subgraph Phase2[Phase 2 - Log aggregation, 1-2 weeks]
        P2A[Deploy ELK/Loki]
        P2B[Standardize on JSON logs]
        P2C[Configure collection rules]
    end
    subgraph Phase3[Phase 3 - Alerting, 1-2 weeks]
        P3A[Define core metrics]
        P3B[Configure alert rules]
        P3C[Configure notification channels]
    end
    subgraph Phase4[Phase 4 - Continuous improvement]
        P4A[Tune alert thresholds]
        P4B[Refine dashboards]
        P4C[Run regular drills]
    end
    Phase1 --> Phase2
    Phase2 --> Phase3
    Phase3 --> Phase4
```
Observability is not a one-off project but an ongoing effort: as the business grows, the metrics, alert rules, and dashboards all need continuous tuning.

References:
- SkyWalking official documentation
- Prometheus official documentation
- Elastic (ELK) official documentation
- Google SRE Book