Introduction
Prometheus is a popular open-source monitoring system, and Grafana is a powerful visualization platform. This article walks through a complete monitoring setup that integrates Spring Boot with Prometheus and Grafana.
Prometheus Basics
1. Add Dependencies
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
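2. Spring Boot Configuration
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    export:
      prometheus:
        enabled: true   # Spring Boot 2.x key; Boot 3.x uses management.prometheus.metrics.export.enabled
    distribution:
      percentiles-histogram:
        http.server.requests: true   # publishes the _bucket series that histogram_quantile() queries need
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active:default}   # falls back to "default" when no profile is active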
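With this configuration in place, GET /actuator/prometheus returns the metrics in the Prometheus text exposition format. The following is a minimal Python sketch of reading that format, which can be handy for a quick smoke test. The sample payload and the helper name parse_samples are illustrative assumptions, and the parser deliberately ignores edge cases such as escaped quotes inside label values.

```python
import re

# Matches `name{labels} value` or `name value`; a trailing timestamp is ignored.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical excerpt of what the endpoint might return.
EXAMPLE = '''
# HELP jvm_threads_live_threads The current number of live threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads{application="demo",environment="dev"} 23.0
jvm_memory_used_bytes{application="demo",area="heap",id="G1 Eden Space"} 1.048576E7
'''

for name, labels, value in parse_samples(EXAMPLE):
    print(name, labels, value)
```

If the application and environment labels show up on every sample, the common tags from the configuration are being applied.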
3. Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      # If Prometheus itself runs in Docker, localhost is the Prometheus container;
      # use an address that is reachable from that container instead.
      - targets: ['localhost:8080']
        labels:
          application: 'demo'
          environment: 'dev'
  - job_name: 'spring-boot-prod'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['prod-server:8080']
        labels:
          application: 'demo'
          environment: 'prod'
4. Start Prometheus
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus
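Once Prometheus is running, its HTTP API (GET /api/v1/query) can confirm whether the scrape targets are up. A minimal sketch, assuming Prometheus is reachable at http://localhost:9090 as published by the command above; the function names are illustrative.

```python
import json
import urllib.parse
import urllib.request


def down_targets(response):
    """Return 'job/instance' for every series of `up` whose value is 0."""
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % response.get("error"))
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the API returns the value as a string
        if float(value) == 0:
            metric = series["metric"]
            down.append("%s/%s" % (metric.get("job"), metric.get("instance")))
    return down


def query_up(base_url="http://localhost:9090"):
    """Run the instant query `up` against the Prometheus HTTP API."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": "up"})
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


# Usage (requires a running Prometheus):
#   print(down_targets(query_up()))
```

An empty list means every configured target was scraped successfully on the last attempt.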
Grafana Configuration
1. Start Grafana
docker run -d \
  --name grafana \
  -p 3000:3000 \
  -v "$(pwd)/grafana-data:/var/lib/grafana" \
  grafana/grafana
Open http://localhost:3000 and log in with the default account admin / admin.
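2. Add a Data Source
- Log in to Grafana
- Configuration → Data Sources (Connections → Data sources in newer Grafana releases)
- Add data source → Prometheus
- URL: http://prometheus:9090 (this hostname resolves only when both containers share a Docker network, as in the Docker Compose deployment; otherwise use an address reachable from the Grafana container)
- Save & Test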
3. Import a Dashboard
Recommended dashboard IDs:
- JVM Micrometer - ID: 4701
- Spring Boot 2.1 - ID: 10280
- Micrometer/SpringBoot - ID: 11378
Import steps:
- Create → Import
- Enter the dashboard ID
- Select the Prometheus data source
- Import
Custom Dashboards
1. JVM Monitoring
{
  "dashboard": {
    "title": "JVM Monitoring",
    "panels": [
      {
        "title": "JVM Memory Used",
        "type": "graph",
        "targets": [
          {
            "expr": "jvm_memory_used_bytes{area=\"heap\"}",
            "legendFormat": "{{id}}"
          }
        ]
      },
      {
        "title": "JVM GC Count",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(jvm_gc_pause_seconds_count[1m])",
            "legendFormat": "{{action}}"
          }
        ]
      },
      {
        "title": "JVM Thread Count",
        "type": "graph",
        "targets": [
          {
            "expr": "jvm_threads_live_threads",
            "legendFormat": "Live Threads"
          },
          {
            "expr": "jvm_threads_daemon_threads",
            "legendFormat": "Daemon Threads"
          }
        ]
      }
    ]
  }
}
2. HTTP Request Monitoring
{
  "dashboard": {
    "title": "HTTP Requests",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_server_requests_seconds_count[1m])",
            "legendFormat": "{{method}} {{uri}}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(http_server_requests_seconds_bucket[5m])))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket[5m])))",
            "legendFormat": "P99"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_server_requests_seconds_count{status=~\"5..\"}[1m]))",
            "legendFormat": "5xx Errors"
          }
        ]
      }
    ]
  }
}
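The P95 and P99 panels rely on histogram_quantile(), which estimates a quantile from cumulative bucket counters by linear interpolation inside the bucket that contains the target rank. The Python sketch below illustrates that idea under simplifying assumptions: the bucket data is made up, the first bucket is assumed to start at 0, and Prometheus's handling of other edge cases is omitted.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The bucket with the largest bound is expected to be the +Inf bucket,
    as in a Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets in seconds: (le, cumulative count).
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, BUCKETS))
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.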
3. Business Metrics Monitoring
{
  "dashboard": {
    "title": "Business Metrics",
    "panels": [
      {
        "title": "Order Count",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(business_order_total[1m]) * 60",
            "legendFormat": "Orders/min"
          }
        ]
      },
      {
        "title": "Order Amount",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(business_order_amount_sum[1m]) * 60",
            "legendFormat": "Amount/min"
          }
        ]
      },
      {
        "title": "User Registered",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(business_user_registered_total[1m]) * 60",
            "legendFormat": "Users/min"
          }
        ]
      }
    ]
  }
}
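PromQL's rate() returns a per-second value, so a per-minute figure such as Orders/min is that value multiplied by 60. The Python sketch below shows the underlying arithmetic on raw counter samples, including the counter-reset case. It is a simplification: Prometheus additionally extrapolates to the edges of the range window, and the sample values are invented.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset (for example, an app restart).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so `cur` is the increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical business_order_total samples scraped every 15 s; the app restarts before t=45.
SAMPLES = [(0, 100), (15, 115), (30, 130), (45, 10), (60, 25)]
print(per_second_rate(SAMPLES) * 60)  # orders per minute
```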
Alerting Configuration
1. Prometheus Alerting Rules
# alerting_rules.yml
groups:
  - name: spring-boot-alerts
    interval: 30s
    rules:
      # Service down (covers both the spring-boot and spring-boot-prod jobs)
      - alert: ServiceDown
        expr: up{job=~"spring-boot.*"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
      # High heap usage (limited to heap pools, since some pools report a max of -1)
      - alert: HighMemoryUsage
        expr: sum by (instance) (jvm_memory_used_bytes{area="heap"}) / sum by (instance) (jvm_memory_max_bytes{area="heap"}) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Heap usage is above 90% for more than 5 minutes."
      # High error rate (aggregate first so 5xx requests are divided by all requests)
      - alert: HighErrorRate
        expr: sum by (instance) (rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum by (instance) (rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is above 5% for more than 5 minutes."
      # Slow responses
      - alert: SlowResponse
        expr: histogram_quantile(0.95, sum by (le, instance) (rate(http_server_requests_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow response on {{ $labels.instance }}"
          description: "P95 response time is above 1s for more than 5 minutes."
      # Frequent GC
      - alert: FrequentGC
        expr: rate(jvm_gc_pause_seconds_count[5m]) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Frequent GC on {{ $labels.instance }}"
          description: "GC is happening more than once every 2 seconds."
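An error-rate alert is meant to compare 5xx traffic with all traffic on an instance. In PromQL, dividing two vectors matches series with identical label sets, so the per-series rates have to be summed per instance before dividing. The Python sketch below walks through that arithmetic on invented per-series rates; the function name and the data are illustrative.

```python
from collections import defaultdict


def error_ratio_by_instance(rates):
    """rates maps (instance, uri, status) -> requests per second."""
    errors = defaultdict(float)
    total = defaultdict(float)
    for (instance, _uri, status), rate in rates.items():
        total[instance] += rate
        if status.startswith("5"):  # rough equivalent of status=~"5.."
            errors[instance] += rate
    return {inst: errors[inst] / total[inst] for inst in total if total[inst] > 0}


# Hypothetical per-series request rates, as rate(...[5m]) would return them.
RATES = {
    ("app:8080", "/api/users", "200"): 9.0,
    ("app:8080", "/api/users", "500"): 0.5,
    ("app:8080", "/api/orders", "503"): 0.5,
}
print(error_ratio_by_instance(RATES))
```

With these numbers the instance has a 10% error ratio and would exceed the 0.05 threshold, even though no single 5xx series carries more than 5% of the traffic.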
2. Load the Alerting Rules
# prometheus.yml
rule_files:
  - "alerting_rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
3. Alertmanager Configuration
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  # add smtp_auth_username / smtp_auth_password if the SMTP server requires authentication
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email-notifications'
  routes:
    # newer Alertmanager releases deprecate match in favor of matchers
    - match:
        severity: critical
      receiver: 'email-critical'
    - match:
        severity: warning
      receiver: 'email-warning'
receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
  - name: 'email-critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
  - name: 'email-warning'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
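The route block forms a tree: an alert is checked against the child routes in order, the first match decides the receiver, and anything unmatched falls back to the top-level receiver. The Python sketch below models that selection for the configuration above. It is a simplification that ignores options such as continue and regex matching.

```python
def pick_receiver(route, labels):
    """Walk the routing tree depth-first; the first matching child route wins."""
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(key) == value for key, value in wanted.items()):
            return pick_receiver(child, labels)
    return route["receiver"]


# Mirror of the route section in alertmanager.yml.
ROUTE = {
    "receiver": "email-notifications",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "email-critical"},
        {"match": {"severity": "warning"}, "receiver": "email-warning"},
    ],
}

print(pick_receiver(ROUTE, {"alertname": "ServiceDown", "severity": "critical"}))
```

An alert without a severity label, or with an unexpected value such as info, ends up at email-notifications.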
4. Start Alertmanager
docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v "$(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml" \
  prom/alertmanager
Docker Compose Deployment
1. Complete Configuration
# docker-compose.yml
version: '3.8'
services:
  app:
    build: .
    ports:
      - "8080:8080"
    environment:
      - SPRING_PROFILES_ACTIVE=prod
    # no depends_on here: prometheus already depends on app, and a circular dependency makes Compose fail
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerting_rules.yml:/etc/prometheus/alerting_rules.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    # inside this network the scrape target in prometheus.yml must be app:8080, not localhost:8080
    depends_on:
      - app
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123   # change this for anything beyond local testing
    depends_on:
      - prometheus
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-data:/alertmanager
volumes:
  prometheus-data:
  grafana-data:
  alertmanager-data:
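In this short form, depends_on only controls the order in which containers start; it does not wait until a service is ready to answer requests. A small polling script can fill that gap after docker compose up -d. The Python sketch below assumes the ports published above and uses the health endpoints each component exposes; the function names, attempt count, and delay are illustrative choices.

```python
import time
import urllib.error
import urllib.request

# Health endpoints of the services published by the Compose file.
ENDPOINTS = {
    "app": "http://localhost:8080/actuator/health",
    "prometheus": "http://localhost:9090/-/ready",
    "grafana": "http://localhost:3000/api/health",
    "alertmanager": "http://localhost:9093/-/ready",
}


def http_ok(url):
    """True when the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(endpoints, check=http_ok, attempts=30, delay=2.0):
    """Poll every endpoint until all pass; return the names still failing."""
    pending = dict(endpoints)
    for attempt in range(attempts):
        pending = {name: url for name, url in pending.items() if not check(url)}
        if not pending:
            break
        if attempt < attempts - 1:
            time.sleep(delay)
    return sorted(pending)


# Usage (requires the stack to be running):
#   print(wait_until_ready(ENDPOINTS) or "all services ready")
```

Passing the check function as a parameter keeps the polling logic testable without a running stack.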
2. Grafana Provisioning
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

# grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    options:
      # the dashboard JSON files must exist at this path inside the Grafana container
      path: /var/lib/grafana/dashboards
Best Practices
1. Metric Naming
# ✅ Recommended
business_order_total
business_order_amount_sum
http_server_requests_seconds_count
# ❌ Not recommended
orderCount
orderAmount
requestTime
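The recommended names are lowercase snake_case with a unit or type suffix, while the discouraged ones are camelCase. A naming check like the following Python sketch could run in a test suite to catch violations early. The pattern is a convention chosen for this sketch and is stricter than what Prometheus itself accepts.

```python
import re

# Prometheus accepts [a-zA-Z_:][a-zA-Z0-9_:]*; this stricter pattern also
# enforces lowercase snake_case, which is a convention rather than a rule.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_conventional(name):
    """True when the metric name is lowercase snake_case."""
    return SNAKE_CASE.fullmatch(name) is not None


for name in ["business_order_total", "orderCount"]:
    print(name, is_conventional(name))
```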
2. Label Management
# ✅ Recommended - bounded cardinality
tags:
  - method: GET
  - status: 200
  - uri: /api/users
# ❌ Not recommended - high cardinality
tags:
  - userId: 12345
  - orderId: 67890
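Cardinality matters because every distinct combination of label values becomes its own time series. The Python sketch below computes the upper bound for one metric as the product of the number of distinct values per label; the label values and the user count are invented for illustration.

```python
from math import prod


def series_count(label_values):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical label values for http_server_requests_seconds_count.
BOUNDED = {
    "method": ["GET", "POST"],
    "status": ["200", "404", "500"],
    "uri": ["/api/users", "/api/orders"],
}
print(series_count(BOUNDED))

# Adding a userId label with 10,000 distinct users multiplies the bound.
UNBOUNDED = dict(BOUNDED, userId=[str(i) for i in range(10_000)])
print(series_count(UNBOUNDED))
```

Identifiers such as userId or orderId are usually better placed in logs or traces than in metric labels.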
3. Scrape Interval
# Development
scrape_interval: 15s
evaluation_interval: 15s
# Production
scrape_interval: 30s
evaluation_interval: 30s
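The scrape interval also constrains the range windows used in queries: rate() needs at least two samples inside the window, so a [1m] window over a 30s scrape interval is the bare minimum and leaves no room for a missed scrape. The Python sketch below applies a commonly used rule of thumb of four scrape intervals per window. That factor is an assumption of this sketch, not something the settings above enforce.

```python
def samples_in_window(window_s, scrape_interval_s):
    """Minimum number of scrapes that fall inside a range window."""
    return int(window_s // scrape_interval_s)


def min_rate_window(scrape_interval_s, factor=4):
    """Rule-of-thumb range window (seconds) for rate() at a given scrape interval."""
    return factor * scrape_interval_s


for interval in (15, 30):
    print(interval, samples_in_window(60, interval), min_rate_window(interval))
```

Under this rule, dashboards that use [1m] windows with a 15s interval would move to [2m] when the interval is raised to 30s.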
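4. Data Retention
# Retention is set with a command-line flag (in Prometheus 2.x it cannot be set in prometheus.yml).
# Keep 15 days of data, which is also the default:
command:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--storage.tsdb.path=/prometheus'
  - '--storage.tsdb.retention.time=15d'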
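Retention translates directly into disk usage. The Prometheus documentation gives a rough formula: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average. The Python sketch below applies it; the series count is an invented example value.

```python
def estimated_disk_bytes(retention_days, active_series, scrape_interval_s, bytes_per_sample=2.0):
    """Rough TSDB disk estimate: retention seconds * samples per second * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical setup: 1,000 active series scraped every 15 s, kept for 15 days.
print(estimated_disk_bytes(15, 1000, 15) / 1e6, "MB")
```

Doubling the scrape interval from 15s to 30s halves the estimate, which is one reason production setups often scrape less frequently.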
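Summary
Key points of monitoring with Prometheus + Grafana:
- ✅ Prometheus configuration - scrape configuration and alerting rules
- ✅ Grafana dashboards - JVM, HTTP, and business metrics
- ✅ Alerting - Alertmanager and notification channels
- ✅ Docker deployment - the complete monitoring stack
- ✅ Best practices - metric naming and label management
Prometheus + Grafana is a widely used monitoring setup for Spring Boot applications.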