Microservice Monitoring System
Monitoring Architecture
Core Components
┌─────────────────────────────────────────────┐
│            Monitoring Dashboard             │
│             (Grafana Dashboard)             │
└──────────────────────┬──────────────────────┘
                       │
       ┌───────────────┼───────────────┐
       │               │               │
       ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│   Metrics   │ │   Logging   │ │   Tracing   │
│ Prometheus  │ │     ELK     │ │ SkyWalking  │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
       │               │               │
       └───────────────┼───────────────┘
                       │
                       ▼
              ┌─────────────────┐
              │  Alert Center   │
              │  AlertManager   │
              └─────────────────┘
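Monitoring Layers
Infrastructure monitoring:
- CPU, memory, and disk usage
- Network traffic and connection counts
- Container resource usage
Application performance monitoring:
- QPS, response time, and error rate
- JVM metrics and GC activity
- Thread pool and connection pool status
Business metrics monitoring:
- Order volume and payment success rate
- User activity and conversion rate
- Business process metrics
User experience monitoring:
- Page load time
- API availability
- Client-side error rate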
Prometheus Monitoring
1. Quick Start
Add the dependencies:
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
Configure Actuator:
management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    health:
      show-details: always
  metrics:
    export:
      # Spring Boot 2.x property; Spring Boot 3.x uses
      # management.prometheus.metrics.export.enabled instead
      prometheus:
        enabled: true
    tags:
      application: ${spring.application.name}
Query the metrics endpoint:
curl http://localhost:8080/actuator/prometheus
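The endpoint answers in the Prometheus text exposition format, one sample per line in the form name{label="value",...} value. The following is a minimal sketch of reading such a line, assuming no timestamp suffix, a finite numeric value, and no commas, quotes, or braces inside label values. The class name ExpositionLine is hypothetical and is not part of Micrometer or Prometheus.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses one sample line of the Prometheus text format. */
final class ExpositionLine {
    final String name;
    final Map<String, String> labels;
    final double value;

    private ExpositionLine(String name, Map<String, String> labels, double value) {
        this.name = name;
        this.labels = labels;
        this.value = value;
    }

    /** Assumes a simple line: no timestamp, no special characters in label values. */
    static ExpositionLine parse(String line) {
        String trimmed = line.trim();
        // The sample value is the last whitespace-separated token.
        int lastSpace = trimmed.lastIndexOf(' ');
        double value = Double.parseDouble(trimmed.substring(lastSpace + 1));
        String series = trimmed.substring(0, lastSpace).trim();

        Map<String, String> labels = new LinkedHashMap<>();
        int brace = series.indexOf('{');
        if (brace < 0) {
            return new ExpositionLine(series, labels, value); // metric without labels
        }
        String name = series.substring(0, brace);
        String body = series.substring(brace + 1, series.lastIndexOf('}'));
        for (String pair : body.split(",")) {
            if (pair.isBlank()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = pair.substring(0, eq).trim();
            String quoted = pair.substring(eq + 1).trim();
            labels.put(key, quoted.substring(1, quoted.length() - 1)); // strip the quotes
        }
        return new ExpositionLine(name, labels, value);
    }
}
```

Lines starting with # (HELP and TYPE metadata) are not samples and would need to be skipped before calling parse.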
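2. Custom Metrics
Counter:
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class OrderMetrics {
    private final Counter orderCounter;
    private final Counter orderFailCounter;

    public OrderMetrics(MeterRegistry meterRegistry) {
        // Queried later in this document as order_created_total
        this.orderCounter = Counter.builder("order.created.total")
                .description("Total number of orders")
                .tag("service", "order-service")
                .register(meterRegistry);
        // Queried later in this document as order_failed_total
        this.orderFailCounter = Counter.builder("order.failed.total")
                .description("Number of failed orders")
                .tag("service", "order-service")
                .register(meterRegistry);
    }

    public void recordOrder() {
        orderCounter.increment();
    }

    public void recordOrderFail() {
        orderFailCounter.increment();
    }
}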
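Gauge:
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.concurrent.atomic.AtomicInteger;
import org.springframework.stereotype.Component;

@Component
public class UserMetrics {
    // The gauge reads this value on every scrape
    private final AtomicInteger onlineUserCount = new AtomicInteger();

    public UserMetrics(MeterRegistry meterRegistry) {
        Gauge.builder("user.online.count", onlineUserCount, AtomicInteger::get)
                .description("Number of online users")
                .register(meterRegistry);
    }

    public void updateOnlineUser(int count) {
        onlineUserCount.set(count);
    }
}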
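Timer:
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.function.Supplier;
import org.springframework.stereotype.Component;

@Component
public class ApiMetrics {
    private final Timer apiTimer;

    public ApiMetrics(MeterRegistry meterRegistry) {
        this.apiTimer = Timer.builder("api.response.time")
                .description("API response time")
                .tag("service", "user-service")
                .register(meterRegistry);
    }

    // Runs the supplier and records how long it took
    public <T> T record(Supplier<T> supplier) {
        return apiTimer.record(supplier);
    }
}

// Usage
@Service
public class UserService {
    @Autowired
    private ApiMetrics apiMetrics;
    @Autowired
    private UserRepository userRepository;

    public User getUser(Long id) {
        // Assumes a Spring Data repository, where findById returns Optional<User>
        return apiMetrics.record(() -> userRepository.findById(id).orElse(null));
    }
}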
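To make the timer's role concrete, here is a minimal sketch of the bookkeeping that wrapping a call in record involves: count the invocation, add the elapsed time to a running total, and track the maximum. It uses only the JDK. The class SimpleTimer is hypothetical and is not how Micrometer implements Timer, which additionally handles units, tags, histograms, and a decaying maximum.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical stand-in that illustrates what timing a call records. */
final class SimpleTimer {
    private final LongAdder count = new LongAdder();
    private final LongAdder totalNanos = new LongAdder();
    private final AtomicLong maxNanos = new AtomicLong();

    /** Runs the supplier and records its duration, even if it throws. */
    <T> T record(Supplier<T> supplier) {
        long start = System.nanoTime();
        try {
            return supplier.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            count.increment();
            totalNanos.add(elapsed);
            maxNanos.accumulateAndGet(elapsed, Math::max);
        }
    }

    long count() {
        return count.sum();
    }

    double meanMillis() {
        long n = count.sum();
        return n == 0 ? 0.0 : totalNanos.sum() / 1_000_000.0 / n;
    }

    double maxMillis() {
        return maxNanos.get() / 1_000_000.0;
    }
}
```

Recording in a finally block means failed calls are timed too, which matters when slow failures (timeouts, for example) dominate latency.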
DistributionSummary:
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class PayloadMetrics {
    private final DistributionSummary payloadSummary;

    public PayloadMetrics(MeterRegistry meterRegistry) {
        this.payloadSummary = DistributionSummary.builder("payload.size")
                .description("Request body size distribution")
                .baseUnit("bytes")
                .register(meterRegistry);
    }

    public void record(int size) {
        payloadSummary.record(size);
    }
}
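When a distribution is exported as cumulative histogram buckets, a percentile such as P95 is not stored directly. It is estimated from the bucket counts. The sketch below shows the usual linear-interpolation estimate over cumulative buckets, assuming at least two buckets, the last bucket having an upper bound of +Infinity, a lower bound of 0 for the first bucket, and 0 < q < 1. The class name HistogramQuantile is hypothetical. This mirrors the idea behind PromQL's histogram_quantile, not its exact implementation.

```java
/** Hypothetical helper: estimates a quantile from cumulative histogram buckets. */
final class HistogramQuantile {

    /**
     * @param q                quantile in (0, 1), for example 0.95
     * @param upperBounds      ascending bucket upper bounds; the last one is +Infinity
     * @param cumulativeCounts number of observations less than or equal to each bound
     */
    static double estimate(double q, double[] upperBounds, double[] cumulativeCounts) {
        int last = upperBounds.length - 1;
        double total = cumulativeCounts[last];
        if (total == 0) {
            return Double.NaN; // no observations
        }
        double rank = q * total;

        // Find the first bucket whose cumulative count reaches the rank.
        int i = 0;
        while (i < last && cumulativeCounts[i] < rank) {
            i++;
        }
        if (i == last) {
            // The quantile falls into the +Infinity bucket: report the highest finite bound.
            return upperBounds[last - 1];
        }

        double lowerBound = i == 0 ? 0.0 : upperBounds[i - 1];
        double countBelow = i == 0 ? 0.0 : cumulativeCounts[i - 1];
        double countInBucket = cumulativeCounts[i] - countBelow;
        // Assume observations are spread evenly inside the bucket.
        return lowerBound + (upperBounds[i] - lowerBound) * (rank - countBelow) / countInBucket;
    }
}
```

The estimate can only be as precise as the bucket layout: every value inside a bucket is assumed to be evenly distributed, so wide buckets around the threshold of interest give coarse results.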
3. Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Load the alert rules defined in the next section
rule_files:
  - alert_rules.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['user-service:8080', 'order-service:8081']
  - job_name: 'nacos'
    metrics_path: '/nacos/actuator/prometheus'
    static_configs:
      - targets: ['nacos:8848']
  - job_name: 'sentinel'
    static_configs:
      - targets: ['sentinel:8080']
4. Alert Rules
# alert_rules.yml
groups:
  - name: service_alert
    rules:
      - alert: ServiceDown
        expr: up{job="spring-boot"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "Service {{ $labels.instance }} has been down for more than 1 minute"
      # Aggregate both sides by instance; otherwise the 5xx series only divide by themselves
      - alert: HighErrorRate
        expr: sum by (instance) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
          / sum by (instance) (rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Service {{ $labels.instance }} has a high error rate"
          description: "Error rate is above 5%"
      # Requires histogram buckets, for example
      # management.metrics.distribution.percentiles-histogram.http.server.requests=true
      - alert: HighResponseTime
        expr: histogram_quantile(0.95,
          sum by (le, instance) (rate(http_server_requests_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Service {{ $labels.instance }} has a high response time"
          description: "P95 response time is above 1 second"
      - alert: HighMemoryUsage
        expr: jvm_memory_used_bytes{area="heap"}
          / jvm_memory_max_bytes{area="heap"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "JVM heap usage is high"
          description: "Heap usage is above 85%"
      - alert: HighCpuUsage
        expr: process_cpu_usage > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage is high"
          description: "CPU usage is above 80%"
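Each rule above combines a condition (expr) with a hold duration (for). An alert does not fire the first time the condition is true: it is pending until the condition has stayed true for the whole duration, and it resets as soon as one evaluation is false. A minimal sketch of that state machine follows, assuming one evaluation per call and a single series per rule. The class name AlertRuleState is hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical model of how a rule's "for" clause delays firing. */
final class AlertRuleState {
    enum State { INACTIVE, PENDING, FIRING }

    private final Duration holdFor;
    private Instant activeSince; // null while the condition is false

    AlertRuleState(Duration holdFor) {
        this.holdFor = holdFor;
    }

    /** Called once per evaluation interval with the current result of the expression. */
    State evaluate(boolean conditionTrue, Instant now) {
        if (!conditionTrue) {
            activeSince = null; // a single false evaluation resets the timer
            return State.INACTIVE;
        }
        if (activeSince == null) {
            activeSince = now;
        }
        boolean heldLongEnough = Duration.between(activeSince, now).compareTo(holdFor) >= 0;
        return heldLongEnough ? State.FIRING : State.PENDING;
    }
}
```

With a 15s evaluation interval and for: 1m, a service that is down fires after roughly five consecutive failed evaluations, which is why short blips do not page anyone.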
Grafana Dashboards
1. JVM Dashboard
{
  "dashboard": {
    "title": "JVM Monitoring",
    "panels": [
      {
        "title": "JVM Heap Memory",
        "type": "graph",
        "targets": [
          {
            "expr": "jvm_memory_used_bytes{area=\"heap\"}",
            "legendFormat": "{{id}}"
          }
        ]
      },
      {
        "title": "JVM GC Count",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(jvm_gc_pause_seconds_count[5m])",
            "legendFormat": "{{action}}"
          }
        ]
      },
      {
        "title": "JVM Threads",
        "type": "graph",
        "targets": [
          {
            "expr": "jvm_threads_live_threads",
            "legendFormat": "Live threads"
          },
          {
            "expr": "jvm_threads_daemon_threads",
            "legendFormat": "Daemon threads"
          }
        ]
      }
    ]
  }
}
2. Business Dashboard
{
  "dashboard": {
    "title": "Business Monitoring",
    "panels": [
      {
        "title": "Order Volume",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(increase(order_created_total[1h]))",
            "legendFormat": "Orders per hour"
          }
        ]
      },
      {
        "title": "Order Success Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "(sum(order_created_total) - sum(order_failed_total)) / sum(order_created_total) * 100",
            "legendFormat": "Success rate"
          }
        ]
      },
      {
        "title": "QPS Trend",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_server_requests_seconds_count[1m])",
            "legendFormat": "{{instance}}"
          }
        ]
      }
    ]
  }
}
3. Microservice Dashboard
{
  "dashboard": {
    "title": "Microservice Monitoring",
    "panels": [
      {
        "title": "Service Health",
        "type": "table",
        "targets": [
          {
            "expr": "up{job=\"spring-boot\"}",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Service Calls",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_client_requests_seconds_count[5m])",
            "legendFormat": "{{service}} -> {{method}}"
          }
        ]
      },
      {
        "title": "Sentinel Rate Limiting",
        "type": "graph",
        "targets": [
          {
            "expr": "sentinel_block_qps",
            "legendFormat": "{{resource}}"
          }
        ]
      }
    ]
  }
}
ELK Logging System
1. Logback Configuration
<!-- logback-spring.xml -->
<configuration>
    <!-- Spring properties are only visible to Logback when exposed through springProperty -->
    <springProperty scope="context" name="appName" source="spring.application.name"/>

    <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
        <encoder class="net.logstash.logback.encoder.LogstashEncoder">
            <customFields>{"service":"${appName}"}</customFields>
            <includeMdc>true</includeMdc>
        </encoder>
    </appender>

    <!-- neverBlock drops events when the queue is full instead of blocking the caller -->
    <appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
        <queueSize>8192</queueSize>
        <neverBlock>true</neverBlock>
        <appender-ref ref="JSON"/>
    </appender>

    <root level="INFO">
        <appender-ref ref="ASYNC"/>
    </root>
</configuration>
2. Filebeat Configuration
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    fields:
      service: user-service
      environment: prod
    # Lines that do not start with a date are appended to the previous event (stack traces)
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  indices:
    - index: "logs-%{+yyyy.MM.dd}"

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
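3. Kibana Visualization
Log search:
service: "user-service" AND level: "ERROR"
Log statistics (Elasticsearch 7.2 and later replace the deprecated interval parameter with fixed_interval or calendar_interval; size 0 returns only the aggregations):
GET /logs-*/_search
{
  "size": 0,
  "aggs": {
    "logs_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1h"
      }
    },
    "log_levels": {
      "terms": {
        "field": "level"
      }
    }
  }
}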
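The query above asks for two groupings: log counts per one-hour bucket (date_histogram) and log counts per level (terms). As a rough in-memory analogy, the same two groupings can be sketched in plain Java. The classes LogEntry and LogAggregation are hypothetical, and hourly buckets are assumed to be aligned to UTC.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical minimal log record: only the two fields the query uses. */
final class LogEntry {
    final Instant timestamp;
    final String level;

    LogEntry(Instant timestamp, String level) {
        this.timestamp = timestamp;
        this.level = level;
    }
}

/** Hypothetical in-memory counterpart of the two aggregations. */
final class LogAggregation {

    /** Like date_histogram with a 1h interval: count entries per hour bucket. */
    static Map<Instant, Long> countPerHour(List<LogEntry> entries) {
        return entries.stream().collect(Collectors.groupingBy(
                e -> e.timestamp.truncatedTo(ChronoUnit.HOURS),
                TreeMap::new,
                Collectors.counting()));
    }

    /** Like a terms aggregation on "level": count entries per distinct level. */
    static Map<String, Long> countPerLevel(List<LogEntry> entries) {
        return entries.stream().collect(Collectors.groupingBy(
                e -> e.level,
                TreeMap::new,
                Collectors.counting()));
    }
}
```

Unlike this sketch, Elasticsearch computes the buckets on the data nodes and, for terms, returns only the most frequent values by default.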
Alert Management
1. AlertManager Configuration
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
  - name: 'critical-alerts'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    webhook_configs:
      - url: 'http://dingtalk-webhook'
  - name: 'warning-alerts'
    email_configs:
      - to: 'dev-team@example.com'
        send_resolved: true
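The route section above is a small decision tree: an alert goes to the first child route whose match labels all equal the alert's labels, and falls back to the top-level receiver otherwise. A minimal sketch of that lookup, using the receiver names from the configuration, is shown below. The class AlertRouter is hypothetical and deliberately ignores nested routes, the continue flag, regex matchers, grouping, and timing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of first-match routing with a default receiver. */
final class AlertRouter {

    private static final class Route {
        final Map<String, String> match;
        final String receiver;

        Route(Map<String, String> match, String receiver) {
            this.match = match;
            this.receiver = receiver;
        }
    }

    private final String defaultReceiver;
    private final List<Route> routes = new ArrayList<>();

    AlertRouter(String defaultReceiver) {
        this.defaultReceiver = defaultReceiver;
    }

    AlertRouter addRoute(Map<String, String> match, String receiver) {
        routes.add(new Route(match, receiver));
        return this;
    }

    /** Returns the receiver of the first route whose labels are all present on the alert. */
    String receiverFor(Map<String, String> alertLabels) {
        for (Route route : routes) {
            if (alertLabels.entrySet().containsAll(route.match.entrySet())) {
                return route.receiver;
            }
        }
        return defaultReceiver;
    }
}
```

Because the first match wins, more specific routes belong before more general ones.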
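2. DingTalk Alerts
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.time.LocalDateTime;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;

@Component
public class DingTalkAlertHandler {
    @Value("${dingtalk.webhook.url}")
    private String webhookUrl;
    @Value("${dingtalk.secret}")
    private String secret;
    private final RestTemplate restTemplate = new RestTemplate();

    public void sendAlert(String title, String content) {
        long timestamp = System.currentTimeMillis();
        String sign = generateSign(timestamp);
        Map<String, Object> message = new HashMap<>();
        message.put("msgtype", "markdown");
        message.put("markdown", Map.of(
                "title", title,
                "text", generateMarkdown(title, content)
        ));
        // Assumes webhookUrl already ends with ?access_token=..., so parameters are appended with '&'
        String url = webhookUrl + "&timestamp=" + timestamp + "&sign=" + sign;
        // Pass a URI so RestTemplate does not encode the already-encoded signature a second time
        restTemplate.postForObject(URI.create(url), message, String.class);
    }

    // Signature: URL-encoded Base64 of HMAC-SHA256 over "timestamp\nsecret", keyed with the secret
    private String generateSign(long timestamp) {
        try {
            String stringToSign = timestamp + "\n" + secret;
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] digest = mac.doFinal(stringToSign.getBytes(StandardCharsets.UTF_8));
            return URLEncoder.encode(Base64.getEncoder().encodeToString(digest), StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Failed to sign DingTalk request", e);
        }
    }

    private String generateMarkdown(String title, String content) {
        return "#### " + title + "\n\n" +
                "> " + content + "\n\n" +
                "###### Alert time: " + LocalDateTime.now();
    }
}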
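3. Alert Noise Reduction
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.time.LocalDateTime;
import java.util.concurrent.TimeUnit;
import org.springframework.stereotype.Component;

@Component
public class AlertDeduplicator {
    // Remembers each alert key for 5 minutes after it was first sent
    private final Cache<String, LocalDateTime> alertCache =
            CacheBuilder.newBuilder()
                    .expireAfterWrite(5, TimeUnit.MINUTES)
                    .build();

    public boolean shouldSend(String alertKey) {
        // putIfAbsent returns null only if the key was not cached yet, meaning the alert
        // has not been sent within the last 5 minutes. Cache.get(key, loader) cannot be
        // used for this check because it loads the value and succeeds on every call.
        return alertCache.asMap().putIfAbsent(alertKey, LocalDateTime.now()) == null;
    }
}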
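Time-based suppression is awkward to unit-test when it depends on the wall clock. One option, sketched here with only the JDK, is to pass the current time in as a parameter so a test can move it forward. The class WindowedDeduplicator is hypothetical. It keeps the same rule as above (at most one notification per key per window) but never evicts old keys, so a production version would need cleanup.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical JDK-only variant: one notification per key per time window. */
final class WindowedDeduplicator {
    private final Duration window;
    private final ConcurrentMap<String, Instant> lastSent = new ConcurrentHashMap<>();

    WindowedDeduplicator(Duration window) {
        this.window = window;
    }

    /** Returns true if the alert should be sent at the given time, and remembers that it was. */
    boolean shouldSend(String alertKey, Instant now) {
        boolean[] send = {false};
        // compute runs atomically per key, so two threads cannot both decide to send.
        lastSent.compute(alertKey, (key, previous) -> {
            if (previous == null || !now.isBefore(previous.plus(window))) {
                send[0] = true;
                return now; // start a new suppression window
            }
            return previous; // still inside the window: keep the original send time
        });
        return send[0];
    }
}
```

A suitable alertKey is usually the alert name plus the labels that identify the affected instance, so that distinct problems are not suppressed together.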
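Monitoring Best Practices
1. Metric Design
RED method (for request-driven services):
- Rate: request rate
- Errors: error rate
- Duration: response time
USE method (for resources such as CPU, memory, and disks):
- Utilization: resource utilization
- Saturation: resource saturation
- Errors: hardware errors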
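As a small worked example of the RED method, the three values can be derived from a window of request records. The sketch below assumes each record carries an HTTP status and a duration, treats status 500 and above as an error, and reports mean duration rather than a percentile. The classes RedSummary and Request are hypothetical.

```java
import java.util.List;

/** Hypothetical RED summary over the requests observed in one time window. */
final class RedSummary {

    static final class Request {
        final int status;
        final long durationMillis;

        Request(int status, long durationMillis) {
            this.status = status;
            this.durationMillis = durationMillis;
        }
    }

    final double requestsPerSecond; // Rate
    final double errorRatio;        // Errors, as a fraction of all requests
    final double meanDurationMillis; // Duration

    private RedSummary(double requestsPerSecond, double errorRatio, double meanDurationMillis) {
        this.requestsPerSecond = requestsPerSecond;
        this.errorRatio = errorRatio;
        this.meanDurationMillis = meanDurationMillis;
    }

    static RedSummary of(List<Request> requests, double windowSeconds) {
        long errors = requests.stream().filter(r -> r.status >= 500).count();
        double mean = requests.stream().mapToLong(r -> r.durationMillis).average().orElse(0.0);
        double ratio = requests.isEmpty() ? 0.0 : (double) errors / requests.size();
        return new RedSummary(requests.size() / windowSeconds, ratio, mean);
    }
}
```

In the Prometheus setup described earlier, the same three values come from http_server_requests_seconds_count (rate and errors) and the corresponding histogram buckets (duration).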
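2. Logging Conventions
// Good: says what happened and includes the identifiers needed to trace it
log.info("Order created, orderId={}, userId={}", orderId, userId);
log.error("Order creation failed, orderId={}, reason={}", orderId, e.getMessage(), e);
// Bad: no context
log.info("order created");
log.error("error", e);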
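3. Alert Severity Levels
| Level | Description | Response time | Notification channels |
|---|---|---|---|
| P0 | Core functionality unavailable | 5 minutes | Phone + SMS + IM |
| P1 | Important functionality affected | 15 minutes | SMS + IM |
| P2 | Some functionality abnormal | 1 hour | IM |
| P3 | Minor issue | 4 hours | Email |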
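4. On-Call Rotation
- Rotation schedule: rotate weekly
- Escalation: alerts not handled within the response time are escalated automatically
- Postmortems: review every incident afterwards and follow up on improvements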
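The severity levels and the escalation rule can be combined into a simple check. The sketch below takes the response times from the severity table and assumes that "escalated" means raising an unacknowledged, overdue alert by one level. The text does not define what escalation does, so that interpretation, along with the names AlertLevel and Escalation, is hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

/** Severity levels with the response times from the table above. */
enum AlertLevel {
    P0(Duration.ofMinutes(5)),
    P1(Duration.ofMinutes(15)),
    P2(Duration.ofHours(1)),
    P3(Duration.ofHours(4));

    final Duration responseTime;

    AlertLevel(Duration responseTime) {
        this.responseTime = responseTime;
    }

    /** One step more urgent; P0 is already the top level and stays P0. */
    AlertLevel escalated() {
        return this == P0 ? P0 : values()[ordinal() - 1];
    }
}

/** Hypothetical escalation rule: overdue and unacknowledged alerts move up one level. */
final class Escalation {
    static AlertLevel levelAt(AlertLevel level, Instant raisedAt, Instant now, boolean acknowledged) {
        boolean overdue = Duration.between(raisedAt, now).compareTo(level.responseTime) > 0;
        return (!acknowledged && overdue) ? level.escalated() : level;
    }
}
```

A real on-call tool would also change who is notified at each step, for example by paging a secondary responder.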
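Summary
A monitoring system is key to keeping microservices running reliably. It needs to cover metrics, log aggregation, distributed tracing, and alert management.
Prometheus with Grafana is a mainstream choice for metrics, and ELK is a common choice for logs. Combined with SkyWalking for tracing, they form a complete observability stack.
Sensible alert severity levels and an on-call rotation make sure problems are detected and handled in time.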