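Spring Boot Monitoring with Prometheus + Grafana

Introduction

Prometheus is a popular open-source monitoring system, and Grafana is a powerful visualization platform. This article walks through a complete monitoring setup for Spring Boot built on both: exposing metrics, scraping them, building dashboards, alerting, and deploying the stack with Docker Compose.

Prometheus Basics

1. Add dependencies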

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

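2. Spring Boot configuration

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    export:
      prometheus:
        enabled: true   # Spring Boot 2.x key; Boot 3.x uses management.prometheus.metrics.export.enabled
    distribution:
      percentiles-histogram:
        http.server.requests: true   # publishes http_server_requests_seconds_bucket, which histogram_quantile() needs
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active:default}   # fall back to "default" when no profile is active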
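
Once the endpoint is exposed, it helps to confirm that the expected metrics and tags actually appear in its output. The following is a minimal Python sketch of a parser for the Prometheus text format; the SAMPLE text is a hypothetical excerpt rather than real application output, and parse_metrics is an illustrative helper, not part of any library.

```python
import re

# One sample line looks like: name{label="value",...} 1.5   (labels are optional)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: [(labels, value), ...]}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, # HELP and # TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        metrics.setdefault(name, []).append((labels, float(value)))
    return metrics


# Hypothetical excerpt of what /actuator/prometheus might return.
SAMPLE = """\
# HELP jvm_threads_live_threads The current number of live threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads{application="demo",environment="dev"} 24.0
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{application="demo",method="GET",status="200",uri="/api/users"} 42.0
"""

parsed = parse_metrics(SAMPLE)
print(sorted(parsed))
```

To check a running application instead, the text could be fetched with urllib.request.urlopen("http://localhost:8080/actuator/prometheus").read().decode() and passed to the same function.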

3. Prometheus configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
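      # If Prometheus runs in a container, localhost is that container, not the host:
      # use host.docker.internal:8080 (Docker Desktop) or the Compose service name (app:8080).
      - targets: ['localhost:8080']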
        labels:
          application: 'demo'
          environment: 'dev'
  
  - job_name: 'spring-boot-prod'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['prod-server:8080']
        labels:
          application: 'demo'
          environment: 'prod'

4. Start Prometheus

docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

Open: http://localhost:9090

Grafana Setup

1. Start Grafana

docker run -d \
  --name grafana \
  -p 3000:3000 \
  -v $(pwd)/grafana-data:/var/lib/grafana \
  grafana/grafana

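Open: http://localhost:3000. Default credentials: admin / admin

2. Add a data source

Log in to Grafana
Configuration → Data Sources (Connections → Data sources in newer Grafana versions)
Add data source → Prometheus
URL: http://prometheus:9090 (this hostname resolves only when both containers share a Docker network, as in the Docker Compose setup; with the standalone docker run commands, use an address the Grafana container can reach, such as http://host.docker.internal:9090 on Docker Desktop)
Save & Test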

3. Import dashboards

Recommended dashboard IDs:

JVM Micrometer - ID: 4701
Spring Boot 2.1 - ID: 10280
Micrometer/SpringBoot - ID: 11378

Import steps:

Create → Import
Enter the dashboard ID
Select the Prometheus data source
Import

Custom Dashboards

1. JVM monitoring

{
  "dashboard": {
    "title": "JVM Monitoring",
    "panels": [
      {
        "title": "JVM Memory Used",
        "type": "graph",
        "targets": [
          {
            "expr": "jvm_memory_used_bytes{area=\"heap\"}",
            "legendFormat": "{{id}}"
          }
        ]
      },
      {
        "title": "JVM GC Count",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(jvm_gc_pause_seconds_count[1m])",
            "legendFormat": "{{action}}"
          }
        ]
      },
      {
        "title": "JVM Thread Count",
        "type": "graph",
        "targets": [
          {
            "expr": "jvm_threads_live_threads",
            "legendFormat": "Live Threads"
          },
          {
            "expr": "jvm_threads_daemon_threads",
            "legendFormat": "Daemon Threads"
          }
        ]
      }
    ]
  }
}

2. HTTP request monitoring

{
  "dashboard": {
    "title": "HTTP Requests",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_server_requests_seconds_count[1m])",
            "legendFormat": "{{method}} {{uri}}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(http_server_requests_seconds_bucket[5m])))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket[5m])))",
            "legendFormat": "P99"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_server_requests_seconds_count{status=~\"5..\"}[1m])",
            "legendFormat": "5xx Errors"
          }
        ]
      }
    ]
  }
}
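
The response-time panel relies on histogram_quantile(), which estimates a quantile from cumulative bucket counters rather than from individual requests. With Micrometer, the _bucket series exist only when percentile histograms are enabled for http.server.requests. The Python sketch below illustrates the linear interpolation involved; the bucket bounds and counts are hypothetical, and the function is a simplified model of the PromQL behaviour, not Prometheus code.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets [(upper_bound, count), ...].

    The last bucket is expected to have upper_bound == math.inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for index, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[index - 1][0]
            if count == prev_count:
                return prev_bound  # empty bucket, nothing to interpolate
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for le="0.1", "0.5", "1.0", "+Inf" (seconds).
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
p99 = histogram_quantile(0.99, buckets)
print(p95, p99)
```

In this example P99 lands in the +Inf bucket, so the estimate is capped at the largest finite bound. This is one reason bucket boundaries should cover the latencies that matter for alerting.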

3. Business metrics monitoring

{
  "dashboard": {
    "title": "Business Metrics",
    "panels": [
      {
        "title": "Order Count",
        "type": "graph",
        "targets": [
          {
            "expr": "increase(business_order_total[1m])",
            "legendFormat": "Orders/min"
          }
        ]
      },
      {
        "title": "Order Amount",
        "type": "graph",
        "targets": [
          {
            "expr": "increase(business_order_amount_sum[1m])",
            "legendFormat": "Amount/min"
          }
        ]
      },
      {
        "title": "User Registered",
        "type": "graph",
        "targets": [
          {
            "expr": "increase(business_user_registered_total[1m])",
            "legendFormat": "Users/min"
          }
        ]
      }
        ]
      }
    ]
  }
}
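
When reading these panels, keep in mind that rate() yields a per-second average, whereas a per-minute figure corresponds to increase() over a one-minute window (or rate() multiplied by 60). The Python sketch below shows the underlying arithmetic, including how a counter reset is handled. The sample values are hypothetical, and unlike Prometheus the sketch does not extrapolate to the window boundaries.

```python
def increase(samples):
    """Total counter increase across (timestamp_seconds, value) samples, tolerating resets."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero (for example after a redeploy),
        # so the whole current value counts as new increase.
        total += curr - prev if curr >= prev else curr
    return total


def rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return None  # a rate needs at least two samples in the window
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


# Hypothetical business_order_total samples scraped every 15s; the app restarts before t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
orders_per_second = rate(samples)
orders_per_minute = orders_per_second * 60
print(orders_per_second, orders_per_minute)
```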

Alerting

1. Prometheus alerting rules

# alerting_rules.yml
groups:
  - name: spring-boot-alerts
    interval: 30s
    rules:
      # Service down
      - alert: ServiceDown
        expr: up{job="spring-boot"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
      
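      # High memory usage (heap only; a single pool's max can be -1, so sum per instance)
      - alert: HighMemoryUsage
        expr: sum by (instance) (jvm_memory_used_bytes{area="heap"}) / sum by (instance) (jvm_memory_max_bytes{area="heap"}) > 0.9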
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% for more than 5 minutes."
      
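      # High error rate (aggregate both sides; an unaggregated division matches each 5xx series with itself and always yields 1)
      - alert: HighErrorRate
        expr: sum by (instance) (rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum by (instance) (rate(http_server_requests_seconds_count[5m])) > 0.05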
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is above 5% for more than 5 minutes."
      
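      # Slow responses (requires histogram buckets to be published for http.server.requests)
      - alert: SlowResponse
        expr: histogram_quantile(0.95, sum by (le, instance) (rate(http_server_requests_seconds_bucket[5m]))) > 1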
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow response on {{ $labels.instance }}"
          description: "P95 response time is above 1s for more than 5 minutes."
      
      # Frequent GC
      - alert: FrequentGC
        expr: rate(jvm_gc_pause_seconds_count[5m]) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Frequent GC on {{ $labels.instance }}"
          description: "GC is happening more than once every 2 seconds."
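
A point worth checking in ratio alerts such as HighErrorRate: PromQL binary operators pair series that have identical label sets, so dividing the 5xx series by all request series without aggregation pairs every 5xx series with itself. The Python sketch below models that behaviour with hypothetical per-second rates keyed by (instance, uri, status), and contrasts it with summing per instance before dividing.

```python
from collections import defaultdict

# Hypothetical per-second request rates, one entry per label set.
rates = {
    ("app-1:8080", "/api/users", "200"): 9.5,
    ("app-1:8080", "/api/users", "500"): 0.3,
    ("app-1:8080", "/api/orders", "200"): 10.0,
    ("app-1:8080", "/api/orders", "503"): 0.2,
}


def per_series_ratio(rates):
    """Divide each 5xx series by the series with the same labels, as an unaggregated division does."""
    return {labels: value / rates[labels]
            for labels, value in rates.items() if labels[2].startswith("5")}


def per_instance_ratio(rates):
    """Sum by instance on both sides first, then divide."""
    errors, total = defaultdict(float), defaultdict(float)
    for (instance, _uri, status), value in rates.items():
        total[instance] += value
        if status.startswith("5"):
            errors[instance] += value
    return {instance: errors[instance] / total[instance] for instance in total}


print(per_series_ratio(rates))    # every ratio is 1.0, so a 5% threshold is always exceeded
print(per_instance_ratio(rates))  # about 0.025, i.e. 2.5% of requests failed
```

The aggregated form is what a threshold such as 0.05 is meant to compare against.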

2. Register the alerting rules

# prometheus.yml
rule_files:
  - "alerting_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

3. Alertmanager configuration

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'email-critical'
    - match:
        severity: warning
      receiver: 'email-warning'

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
  
  - name: 'email-critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
  
  - name: 'email-warning'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true

4. Start Alertmanager

docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v $(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager

Docker Compose Deployment

1. Full configuration

# docker-compose.yml
version: '3.8'

services:
  app:
    build: .
    ports:
      - "8080:8080"
    environment:
      - SPRING_PROFILES_ACTIVE=prod
    # No depends_on here: prometheus already depends on app, and a mutual
    # dependency is rejected by Compose as circular.

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerting_rules.yml:/etc/prometheus/alerting_rules.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    depends_on:
      - app

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
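    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      # Assumed host folder holding the dashboard JSON files read by the file provider
      - ./grafana/dashboards:/var/lib/grafana/dashboards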
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-data:/alertmanager

volumes:
  prometheus-data:
  grafana-data:
  alertmanager-data:

2. Grafana provisioning

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

# grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    options:
      path: /var/lib/grafana/dashboards

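Best Practices

1. Metric naming

# ✅ Recommended - snake_case, a domain prefix, and a suffix that states the unit or type
business_order_total
business_order_amount_sum
http_server_requests_seconds_count

# ❌ Not recommended - camelCase, no unit or type suffix
orderCount
orderAmount
requestTime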

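2. Label management

# ✅ Recommended - bounded cardinality
tags:
  - method: GET
  - status: 200
  - uri: /api/users

# ❌ Not recommended - high cardinality (every distinct value creates a new time series)
tags:
  - userId: 12345
  - orderId: 67890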
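
The cost of a high-cardinality label can be estimated before it reaches production: the number of time series for one metric is bounded by the product of the distinct values of each label. A small Python sketch, using hypothetical label values:

```python
from math import prod


def series_count(label_values):
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical label sets for http_server_requests_seconds_count.
bounded = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status": ["200", "400", "404", "500"],
    "uri": ["/api/users", "/api/orders"],
}
# Adding a per-user label multiplies the bound by the number of users.
unbounded = dict(bounded, userId=[str(n) for n in range(10_000)])

print(series_count(bounded))    # 32
print(series_count(unbounded))  # 320000
```

Identifiers such as user or order IDs are usually better kept in logs or traces, where each distinct value does not create its own time series.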

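3. Scrape interval

# Development
scrape_interval: 15s
evaluation_interval: 15s

# Production (a [1m] rate() window then holds only about two samples; widen it, e.g. to [2m])
scrape_interval: 30s
evaluation_interval: 30s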

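4. Data retention

# In most Prometheus versions, retention is set with a command-line flag rather than in prometheus.yml.
# Keep 15 days of data:
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=15d

# With Docker Compose, append the flag to the prometheus service's command list:
#   - '--storage.tsdb.retention.time=15d'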

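Summary

Key points of monitoring with Prometheus + Grafana:

✅ Prometheus configuration - scrape configuration, alerting rules
✅ Grafana dashboards - JVM, HTTP, business metrics
✅ Alerting - Alertmanager, notification channels
✅ Docker deployment - the complete monitoring stack
✅ Best practices - metric naming, label management

Prometheus + Grafana is a widely adopted monitoring stack for Spring Boot applications.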