Microservice Monitoring System

Monitoring Architecture

Core Components

┌─────────────────────────────────────────────────────┐
│                Monitoring Dashboard                 │
│              (Grafana Dashboard)                    │
└───────────────────┬─────────────────────────────────┘
                    │
        ┌───────────┼───────────┐
        │           │           │
        ▼           ▼           ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│   Metrics   │ │   Logging   │ │   Tracing   │
│ Prometheus  │ │     ELK     │ │ SkyWalking  │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
       │               │               │
       └───────────────┼───────────────┘
                       │
                       ▼
              ┌─────────────────┐
              │  Alert Center   │
              │  AlertManager   │
              └─────────────────┘

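Monitoring Layers

Infrastructure monitoring

Application performance monitoring

Business metrics monitoring

User experience monitoring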

Prometheus Monitoring

1. Quick Start

Add Dependencies

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

Configure Actuator

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    health:
      show-details: always
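  metrics:
    export:
      prometheus:
        enabled: true   # Spring Boot 2.x key; Boot 3.x uses management.prometheus.metrics.export.enabled
    tags:
      application: ${spring.application.name}
    distribution:
      percentiles-histogram:
        http.server.requests: true   # publishes http_server_requests_seconds_bucket, needed by histogram_quantile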

Access the Metrics

curl http://localhost:8080/actuator/prometheus

2. Custom Metrics

Counter

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class OrderMetrics {

    // Counters only ever increase; use them for event totals.
    private final Counter orderCounter;
    private final Counter orderFailCounter;

    public OrderMetrics(MeterRegistry meterRegistry) {
        this.orderCounter = Counter.builder("order.created.total")
            .description("Total number of orders")
            .tag("service", "order-service")
            .register(meterRegistry);

        this.orderFailCounter = Counter.builder("order.failed.total")
            .description("Number of failed orders")
            .tag("service", "order-service")
            .register(meterRegistry);
    }

    public void recordOrder() {
        orderCounter.increment();
    }

    public void recordOrderFail() {
        orderFailCounter.increment();
    }
}

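Gauge

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.concurrent.atomic.AtomicInteger;
import org.springframework.stereotype.Component;

@Component
public class UserMetrics {

    // A gauge samples the current value on every scrape, so it can go up or down.
    private final AtomicInteger onlineUserCount = new AtomicInteger();

    public UserMetrics(MeterRegistry meterRegistry) {
        Gauge.builder("user.online.count", onlineUserCount, AtomicInteger::get)
            .description("Number of online users")
            .register(meterRegistry);
    }

    public void updateOnlineUser(int count) {
        onlineUserCount.set(count);
    }
}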

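Timer

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.function.Supplier;
import org.springframework.stereotype.Component;

@Component
public class ApiMetrics {

    private final Timer apiTimer;

    public ApiMetrics(MeterRegistry meterRegistry) {
        this.apiTimer = Timer.builder("api.response.time")
            .description("API response time")
            .tag("service", "user-service")
            .register(meterRegistry);
    }

    // Runs the supplier, records how long it took, and returns its result.
    public <T> T record(Supplier<T> supplier) {
        return apiTimer.record(supplier);
    }
}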

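// Usage (UserRepository is assumed to be a Spring Data repository)
@Service
public class UserService {

    private final ApiMetrics apiMetrics;
    private final UserRepository userRepository;

    public UserService(ApiMetrics apiMetrics, UserRepository userRepository) {
        this.apiMetrics = apiMetrics;
        this.userRepository = userRepository;
    }

    public User getUser(Long id) {
        // findById returns Optional<User> in Spring Data; unwrap it inside the timed call.
        return apiMetrics.record(() -> userRepository.findById(id).orElse(null));
    }
}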

DistributionSummary

import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class PayloadMetrics {

    private final DistributionSummary payloadSummary;

    public PayloadMetrics(MeterRegistry meterRegistry) {
        this.payloadSummary = DistributionSummary.builder("payload.size")
            .description("Request body size distribution")
            .baseUnit("bytes")
            .register(meterRegistry);
    }

    public void record(int size) {
        payloadSummary.record(size);
    }
}
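
The meters above are registered with dotted names, while Prometheus queries use underscore names such as order_created_total. The sketch below approximates that mapping in plain Java. PrometheusNames and MeterType are hypothetical names used only for illustration, not part of Micrometer, and the real naming convention handles more cases (camelCase conversion and base units, for example).

```java
/** Hypothetical helper: approximates how dotted meter names appear in Prometheus. */
final class PrometheusNames {

    enum MeterType { COUNTER, GAUGE, TIMER }

    static String toPrometheusName(String meterName, MeterType type) {
        // Dots and any other character outside [a-zA-Z0-9_:] become underscores.
        String name = meterName.replaceAll("[^a-zA-Z0-9_:]", "_");
        switch (type) {
            case COUNTER:
                // Counters are exposed with a _total suffix, added only if missing.
                return name.endsWith("_total") ? name : name + "_total";
            case TIMER:
                // Timers are exposed in seconds.
                return name.endsWith("_seconds") ? name : name + "_seconds";
            default:
                return name;
        }
    }
}
```

Under these assumptions, "order.created.total" registered as a counter maps to order_created_total, and the "api.response.time" timer maps to api_response_time_seconds.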

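3. Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Load the alert rules and point Prometheus at AlertManager.
# alertmanager:9093 is an assumed hostname on the default port.
rule_files:
  - alert_rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']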

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['user-service:8080', 'order-service:8081']
  
  - job_name: 'nacos'
    static_configs:
      - targets: ['nacos:8848']
    metrics_path: '/nacos/actuator/prometheus'
  
  - job_name: 'sentinel'
    static_configs:
      - targets: ['sentinel:8080']

4. Alert Rules

# alert_rules.yml
groups:
  - name: service_alert
    rules:
      - alert: ServiceDown
        expr: up{job="spring-boot"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "Service {{ $labels.instance }} has been down for more than 1 minute"
      
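      - alert: HighErrorRate
        # Aggregate per instance first; dividing the raw series would match each
        # 5xx series against itself and always yield 1.
        expr: sum by (instance) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
              / sum by (instance) (rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is above 5%"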
      
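      - alert: HighResponseTime
        # histogram_quantile needs the le label; sum the buckets per instance.
        expr: histogram_quantile(0.95,
              sum by (le, instance) (rate(http_server_requests_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow responses on {{ $labels.instance }}"
          description: "P95 response time is above 1 second"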
      
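      - alert: HighMemoryUsage
        # Sum the heap pools per instance; individual pools may report max = -1.
        expr: sum by (instance) (jvm_memory_used_bytes{area="heap"})
              / sum by (instance) (jvm_memory_max_bytes{area="heap"}) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High JVM heap usage on {{ $labels.instance }}"
          description: "Heap usage is above 85%"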
      
      - alert: HighCpuUsage
        expr: process_cpu_usage > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage is above 80%"
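
The HighResponseTime rule relies on histogram_quantile, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The following is a simplified sketch of that idea in plain Java, not Prometheus's actual implementation. HistogramQuantile is a hypothetical name, and the method assumes at least two buckets with the last one being +Infinity and q in (0, 1].

```java
/** Sketch of the linear interpolation behind histogram_quantile (simplified). */
final class HistogramQuantile {

    /**
     * @param q           quantile in (0, 1], e.g. 0.95
     * @param upperBounds ascending bucket upper bounds ("le"); the last one is +Infinity
     * @param cumulative  cumulative observation counts, one per bucket
     */
    static double quantile(double q, double[] upperBounds, double[] cumulative) {
        int last = upperBounds.length - 1;
        double total = cumulative[last];
        if (total == 0) {
            return Double.NaN; // no observations in the window
        }
        double rank = q * total;
        // Find the first bucket whose cumulative count reaches the rank.
        int b = 0;
        while (b < last && cumulative[b] < rank) {
            b++;
        }
        if (b == last) {
            // The rank falls into the +Inf bucket: report the highest finite bound.
            return upperBounds[last - 1];
        }
        double lower = (b == 0) ? 0.0 : upperBounds[b - 1];
        double countBelow = (b == 0) ? 0.0 : cumulative[b - 1];
        double inBucket = cumulative[b] - countBelow;
        // Assume observations are spread evenly inside the bucket.
        return lower + (upperBounds[b] - lower) * ((rank - countBelow) / inBucket);
    }
}
```

With bounds {0.1, 0.5, 1.0, +Inf} and cumulative counts {50, 90, 98, 100}, the P95 estimate lands inside the 0.5 to 1.0 bucket at roughly 0.81 seconds, so the 1-second threshold would not be crossed. The estimate is only as precise as the bucket boundaries allow.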

Grafana Dashboards

1. JVM Dashboard Panels

{
  "dashboard": {
    "title": "JVM Monitoring",
    "panels": [
      {
        "title": "JVM Heap Memory",
        "type": "graph",
        "targets": [
          {
            "expr": "jvm_memory_used_bytes{area=\"heap\"}",
            "legendFormat": "{{id}}"
          }
        ]
      },
      {
        "title": "JVM GC Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(jvm_gc_pause_seconds_count[5m])",
            "legendFormat": "{{action}}"
          }
        ]
      },
      {
        "title": "JVM Threads",
        "type": "graph",
        "targets": [
          {
            "expr": "jvm_threads_live_threads",
            "legendFormat": "Live threads"
          },
          {
            "expr": "jvm_threads_daemon_threads",
            "legendFormat": "Daemon threads"
          }
        ]
      }
    ]
  }
}

2. Business Dashboard Panels

{
  "dashboard": {
    "title": "Business Monitoring",
    "panels": [
      {
        "title": "Order Volume",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(increase(order_created_total[1h]))",
            "legendFormat": "Orders per hour"
          }
        ]
      },
      {
        "title": "Order Success Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "(sum(order_created_total) - sum(order_failed_total)) / sum(order_created_total) * 100",
            "legendFormat": "Success rate"
          }
        ]
      },
      {
        "title": "QPS Trend",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_server_requests_seconds_count[1m])",
            "legendFormat": "{{instance}}"
          }
        ]
      }
    ]
  }
}
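
The order success rate panel is a ratio: successful orders divided by all created orders, times 100. A minimal sketch of that arithmetic, assuming the created counter includes failed orders (OrderSuccessRate is a hypothetical name):

```java
/** Hypothetical helper mirroring the success-rate panel's arithmetic. */
final class OrderSuccessRate {

    /** Returns the success percentage, or NaN when no orders were created. */
    static double percent(long created, long failed) {
        if (created == 0) {
            return Double.NaN;
        }
        // Parentheses matter: subtract first, then divide by the total.
        return (created - failed) * 100.0 / created;
    }
}
```

For 200 created and 10 failed orders this gives 95.0. Without the parentheses, division and multiplication bind tighter than subtraction, and the same inputs would evaluate to 200 - 10 / 200 * 100 = 195, which is not a percentage at all.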

3. Microservice Dashboard Panels

{
  "dashboard": {
    "title": "Microservice Monitoring",
    "panels": [
      {
        "title": "Service Health",
        "type": "table",
        "targets": [
          {
            "expr": "up{job=\"spring-boot\"}",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Service Call Chain",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_client_requests_seconds_count[5m])",
            "legendFormat": "{{service}} -> {{method}}"
          }
        ]
      },
      {
        "title": "Sentinel Rate Limiting",
        "type": "graph",
        "targets": [
          {
            "expr": "sentinel_block_qps",
            "legendFormat": "{{resource}}"
          }
        ]
      }
    ]
  }
}

ELK Logging

1. Logback Configuration

<!-- logback-spring.xml -->
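<configuration>
    <!-- Expose the Spring property to Logback; Logback does not resolve ${spring.application.name} on its own -->
    <springProperty scope="context" name="appName" source="spring.application.name"/>

    <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
        <encoder class="net.logstash.logback.encoder.LogstashEncoder">
            <customFields>{"service":"${appName}"}</customFields>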
            <includeMdc>true</includeMdc>
        </encoder>
    </appender>
    
    <appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
        <queueSize>8192</queueSize>
        <neverBlock>true</neverBlock>
        <appender-ref ref="JSON"/>
    </appender>
    
    <root level="INFO">
        <appender-ref ref="ASYNC"/>
    </root>
</configuration>

2. Filebeat Configuration

# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    fields:
      service: user-service
      environment: prod
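    # LogstashEncoder writes one JSON object per line, so decode JSON here.
    # A timestamp-based multiline pattern would never match lines starting with "{".
    json.keys_under_root: true
    json.add_error_key: true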

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  indices:
    - index: "logs-%{+yyyy.MM.dd}"

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~

3. Kibana Visualization

Log Search

service: "user-service" AND level: "ERROR"

Log Statistics

GET /logs-*/_search
{
  "aggs": {
    "logs_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "1h"
      }
    },
    "log_levels": {
      "terms": {
        "field": "level"
      }
    }
  }
}

Alert Management

1. AlertManager Configuration

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
  
  - name: 'critical-alerts'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    webhook_configs:
      - url: 'http://dingtalk-webhook'
  
  - name: 'warning-alerts'
    email_configs:
      - to: 'dev-team@example.com'
        send_resolved: true
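
The route section above behaves like a first-match lookup on the severity label, with the top-level receiver as the fallback. A rough sketch of that decision in plain Java follows. AlertRouter is a hypothetical name, and real AlertManager routing supports nesting, regex matchers, and continue, none of which are modelled here.

```java
import java.util.Map;

/** Hypothetical sketch of first-match routing on the severity label. */
final class AlertRouter {

    static String receiverFor(Map<String, String> labels) {
        String severity = labels.get("severity");
        if ("critical".equals(severity)) {
            return "critical-alerts";
        }
        if ("warning".equals(severity)) {
            return "warning-alerts";
        }
        // No child route matched: fall back to the top-level receiver.
        return "default";
    }
}
```

An alert labelled severity=critical goes to critical-alerts (email plus webhook), while an alert with no severity label, or an unknown one, ends up at the default receiver.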

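2. DingTalk Alerts

import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.time.LocalDateTime;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;

@Component
public class DingTalkAlertHandler {

    @Value("${dingtalk.webhook.url}")
    private String webhookUrl;

    @Value("${dingtalk.secret}")
    private String secret;

    private final RestTemplate restTemplate = new RestTemplate();

    public void sendAlert(String title, String content) {
        long timestamp = System.currentTimeMillis();
        String sign = generateSign(timestamp);

        Map<String, Object> message = new HashMap<>();
        message.put("msgtype", "markdown");
        message.put("markdown", Map.of(
            "title", title,
            "text", generateMarkdown(title, content)
        ));

        // The webhook URL is expected to already carry ?access_token=..., hence the '&'.
        // Passing a URI keeps RestTemplate from re-encoding the already-encoded sign.
        URI uri = URI.create(webhookUrl + "&timestamp=" + timestamp + "&sign=" + sign);
        restTemplate.postForObject(uri, message, String.class);
    }

    // sign = URL-encoded Base64 of HMAC-SHA256(timestamp + "\n" + secret), keyed by the secret
    private String generateSign(long timestamp) {
        try {
            String stringToSign = timestamp + "\n" + secret;
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] digest = mac.doFinal(stringToSign.getBytes(StandardCharsets.UTF_8));
            return URLEncoder.encode(Base64.getEncoder().encodeToString(digest), StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Failed to sign DingTalk request", e);
        }
    }

    private String generateMarkdown(String title, String content) {
        return "#### " + title + "\n\n" +
               "> " + content + "\n\n" +
               "###### Alert time: " + LocalDateTime.now();
    }
}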

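3. Alert Noise Reduction

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.time.LocalDateTime;
import java.util.concurrent.TimeUnit;
import org.springframework.stereotype.Component;

@Component
public class AlertDeduplicator {

    // Entries expire 5 minutes after being written, which reopens the window.
    private final Cache<String, LocalDateTime> alertCache =
        CacheBuilder.newBuilder()
            .expireAfterWrite(5, TimeUnit.MINUTES)
            .build();

    public boolean shouldSend(String alertKey) {
        // Cache.get(key, loader) always returns a value, so it cannot tell a new key
        // from an existing one. putIfAbsent is atomic and returns null only for the
        // first caller within the window.
        return alertCache.asMap().putIfAbsent(alertKey, LocalDateTime.now()) == null;
    }
}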
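
The suppression window is easier to reason about, and to unit test, when the clock is injectable. Below is a dependency-free sketch of the same idea. WindowedDeduplicator is a hypothetical class and not a drop-in replacement: it never evicts old keys, so a long-running service would need a cleanup step.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical dependency-free deduplicator with an injectable clock. */
final class WindowedDeduplicator {

    private final ConcurrentMap<String, Instant> lastSent = new ConcurrentHashMap<>();
    private final Duration window;
    private final Clock clock;

    WindowedDeduplicator(Duration window, Clock clock) {
        this.window = window;
        this.clock = clock;
    }

    /** Returns true if no alert with this key was sent within the window. */
    boolean shouldSend(String alertKey) {
        Instant now = clock.instant();
        boolean[] send = {false};
        // compute() runs atomically per key, so concurrent callers cannot both win.
        lastSent.compute(alertKey, (key, previous) -> {
            if (previous == null || !now.isBefore(previous.plus(window))) {
                send[0] = true;
                return now;      // start a new suppression window
            }
            return previous;     // still inside the window: keep the old timestamp
        });
        return send[0];
    }
}
```

In production code the clock would typically be Clock.systemUTC(); a test can pass a controllable clock and advance it past the window to check that the alert is sent again.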

Monitoring Best Practices

1. Metric Design

RED Method (Rate, Errors, Duration)
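
RED looks at a service from the caller's side: how many requests arrive (Rate), how many of them fail (Errors), and how long they take (Duration). A minimal in-memory sketch of those three signals for a single endpoint is shown below. RedStats is a hypothetical class for illustration; in practice a Timer or the built-in HTTP server metrics already provide these values.

```java
/** Hypothetical accumulator for the three RED signals of one endpoint. */
final class RedStats {

    private long requests;
    private long errors;
    private long totalDurationMillis;

    synchronized void record(long durationMillis, boolean success) {
        requests++;
        if (!success) {
            errors++;
        }
        totalDurationMillis += durationMillis;
    }

    /** Rate: requests per second over the observed window. */
    synchronized double ratePerSecond(double windowSeconds) {
        return requests / windowSeconds;
    }

    /** Errors: fraction of requests that failed. */
    synchronized double errorRatio() {
        return requests == 0 ? 0.0 : (double) errors / requests;
    }

    /** Duration: mean latency in milliseconds. */
    synchronized double meanDurationMillis() {
        return requests == 0 ? 0.0 : (double) totalDurationMillis / requests;
    }
}
```

A mean hides tail latency, so real dashboards usually track percentiles for the Duration signal rather than an average.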

USE Method (Utilization, Saturation, Errors)

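2. Logging Conventions

// Good logging: states what happened and carries the identifiers needed to search for it
log.info("Order created, orderId={}, userId={}", orderId, userId);
log.error("Order creation failed, orderId={}, reason={}", orderId, e.getMessage(), e);

// Poor logging: no context, cannot be correlated with a request
log.info("order created");
log.error("error", e);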

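3. Alert Severity Levels

Level  Description                       Response time  Notification
P0     Core functionality unavailable    5 minutes      Phone + SMS + IM
P1     Important functionality impaired  15 minutes     SMS + IM
P2     Some functionality abnormal       1 hour         IM
P3     Minor issue                       4 hours        Email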
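
The severity table can be encoded directly so that notification code and escalation checks share one definition. A small sketch, with the response times and channels taken from the table; the AlertLevel name, the lowercase channel identifiers, and the isOverdue helper are illustrative assumptions.

```java
import java.time.Duration;
import java.util.List;

/** The severity table as an enum; channel names are illustrative identifiers. */
enum AlertLevel {
    P0(Duration.ofMinutes(5), List.of("phone", "sms", "im")),
    P1(Duration.ofMinutes(15), List.of("sms", "im")),
    P2(Duration.ofHours(1), List.of("im")),
    P3(Duration.ofHours(4), List.of("email"));

    private final Duration responseTime;
    private final List<String> channels;

    AlertLevel(Duration responseTime, List<String> channels) {
        this.responseTime = responseTime;
        this.channels = channels;
    }

    Duration responseTime() {
        return responseTime;
    }

    List<String> channels() {
        return channels;
    }

    /** True when the alert has waited longer than its target response time. */
    boolean isOverdue(Duration waited) {
        return waited.compareTo(responseTime) > 0;
    }
}
```

For example, AlertLevel.P0.isOverdue(Duration.ofMinutes(6)) is true, while the same wait is still within target for P1.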

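4. On-Call Rotation

Summary

A microservice monitoring system is key to keeping a system running reliably. It needs to cover metrics, log aggregation, distributed tracing, and alert management.

Prometheus with Grafana is a mainstream monitoring stack and ELK is a widely used logging stack. Combined with SkyWalking for tracing, they form a complete observability setup.

Define sensible alert severity levels and an on-call rotation so that problems are detected and handled promptly.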