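Operational monitoring is what keeps a RocketMQ deployment running reliably. This article walks through the key RocketMQ metrics, alert configuration, and hands-on troubleshooting techniques.
1. Monitoring architecture
1.1 Overall architecture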
graph TB
    subgraph Cluster["RocketMQ cluster"]
        NS[NameServer]
        B1[Broker 1]
        B2[Broker 2]
    end
    subgraph Collect["Metrics collection"]
        EXP[RocketMQ Exporter]
        PROM[Prometheus]
    end
    subgraph Display["Dashboards and alerting"]
        GRAF[Grafana]
        ALERT[Alertmanager]
    end
    subgraph Notify["Notifications"]
        DING[DingTalk]
        EMAIL[Email]
        SMS[SMS]
    end
    NS --> EXP
    B1 --> EXP
    B2 --> EXP
    EXP --> PROM
    PROM --> GRAF
    PROM --> ALERT
    ALERT --> DING
    ALERT --> EMAIL
    ALERT --> SMS
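1.2 Monitoring layers
| Layer | What to monitor | Tool |
|---|---|---|
| Infrastructure | CPU, memory, disk, network | Node Exporter |
| JVM | GC, heap memory, threads | JMX Exporter |
| RocketMQ | TPS, latency, backlog | RocketMQ Exporter |
| Application | Message volume, consumer progress | Custom metrics |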
2. Key metrics
2.1 Broker metrics
| Metric | Description | Alert threshold |
|---|---|---|
| rocketmq_broker_tps | Write TPS | - |
| rocketmq_broker_qps | Query QPS | - |
| rocketmq_broker_put_latency | Write latency | > 100 ms |
| rocketmq_broker_dispatch_behind | Dispatch lag | > 1000 ms |
| rocketmq_brokeruntime_commitlog_disk_ratio | CommitLog disk usage | > 80% |
2.2 Consumer metrics
| Metric | Description | Alert threshold |
|---|---|---|
| rocketmq_consumer_tps | Consume TPS | - |
| rocketmq_consumer_latency | Consume latency | > 100 ms |
| rocketmq_group_diff | Consumer backlog (messages) | > 10000 |
| rocketmq_consumer_failed_count | Failed consumptions | > 0 |
2.3 Producer metrics
| Metric | Description | Alert threshold |
|---|---|---|
| rocketmq_producer_tps | Produce TPS | - |
| rocketmq_producer_latency | Produce latency | > 100 ms |
| rocketmq_producer_failed_count | Failed sends | > 0 |
2.4 Topic/Queue metrics
| Metric | Description | Alert threshold |
|---|---|---|
| rocketmq_topic_message_accumulate | Topic backlog (messages) | > 10000 |
| rocketmq_queue_message_accumulate | Queue backlog (messages) | > 1000 |
| rocketmq_brokeruntime_broker_membership | Broker member count | < expected count |
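For an ad hoc check outside Prometheus, the thresholds above can be applied directly to a scrape of the exporter output. The following is a minimal sketch, not a replacement for alert rules: `check_thresholds` is a hypothetical helper name, only two of the thresholds are wired in, and it assumes Prometheus text format without trailing timestamps.

```shell
# check_thresholds: read Prometheus text-format samples on stdin and print
# every sample whose value exceeds the limit configured for its metric name.
# The two limits below are copied from the tables above; extend as needed.
check_thresholds() {
  awk '
    BEGIN {
      limit["rocketmq_group_diff"] = 10000
      limit["rocketmq_broker_put_latency"] = 100
    }
    /^#/ || NF < 2 { next }          # skip HELP/TYPE comments and blank lines
    {
      name = $1
      sub(/\{.*/, "", name)          # strip the {label="..."} part
      value = $NF                    # assumes the value is the last field
      if (name in limit && value + 0 > limit[name])
        printf "ALERT %s value=%s limit=%s\n", $1, value, limit[name]
    }
  '
}
```

It reads from stdin, so it can be fed a saved scrape or the output of a live request to the exporter.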
3. Prometheus configuration
3.1 RocketMQ Exporter
# rocketmq-exporter setup
# https://github.com/apache/rocketmq-exporter
# Launch command (quote the NameServer list: an unquoted ';' would end the shell command)
java -jar rocketmq-exporter-0.0.2-SNAPSHOT-exec.jar \
  --rocketmq.config.namesrvAddr="ns1:9876;ns2:9876" \
  --rocketmq.config.webTelemetryPath=/metrics \
  --server.port=5557
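Once the exporter is up, a quick sanity check is to list which metric families it actually exposes, since names can differ between exporter versions. A minimal sketch, assuming the endpoint serves standard Prometheus text format; `list_metric_families` is a hypothetical helper name:

```shell
# list_metric_families: read Prometheus text-format output on stdin and print
# each rocketmq_* metric name with the number of samples (label sets) it has.
list_metric_families() {
  awk '
    /^rocketmq_/ {
      name = $1
      sub(/\{.*/, "", name)          # strip the {label="..."} part
      count[name]++
    }
    END { for (n in count) printf "%s %d\n", n, count[n] }
  ' | sort
}
```

With the launch command above, it could be used as `curl -s http://localhost:5557/metrics | list_metric_families`, where `localhost` is an assumption about where the exporter runs.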
3.2 Prometheus configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
# Load the alert rules defined in section 3.3
rule_files:
  - 'alerting_rules.yml'
scrape_configs:
  - job_name: 'rocketmq-broker'
    static_configs:
      - targets:
          - 'broker-1:5557'
          - 'broker-2:5557'
    metrics_path: '/metrics'
  - job_name: 'rocketmq-namesrv'
    static_configs:
      - targets:
          - 'namesrv-1:5558'
          - 'namesrv-2:5558'
  - job_name: 'node'
    static_configs:
      - targets:
          - 'broker-1:9100'
          - 'broker-2:9100'
3.3 Alert rules
# alerting_rules.yml
groups:
  - name: rocketmq
    rules:
      # Broker alerts
      - alert: RocketMQBrokerDown
        expr: up{job="rocketmq-broker"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Broker down: {{ $labels.instance }}"
      - alert: RocketMQBrokerTPSDrop
        # rocketmq_broker_tps is already a per-second gauge, so average it rather than applying rate()
        expr: avg_over_time(rocketmq_broker_tps[5m]) < 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Broker TPS too low: {{ $value }}"
      - alert: RocketMQCommitLogDiskHigh
        # The Broker reports this ratio as a fraction (0-1); use "> 80" if your exporter emits a percentage
        expr: rocketmq_brokeruntime_commitlog_disk_ratio > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CommitLog disk usage is high: {{ $value | humanizePercentage }}"
      # Consumer alerts
      - alert: RocketMQConsumerLag
        expr: rocketmq_group_diff > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag: {{ $labels.group }} - {{ $value }}"
      - alert: RocketMQConsumerLagCritical
        expr: rocketmq_group_diff > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Severe consumer lag: {{ $labels.group }} - {{ $value }}"
      - alert: RocketMQConsumerFailed
        expr: rate(rocketmq_consumer_failed_count[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumption failures: {{ $labels.group }}"
      # Producer alerts
      - alert: RocketMQProducerFailed
        expr: rate(rocketmq_producer_failed_count[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Send failures"
      # Performance alerts
      - alert: RocketMQWriteLatency
        expr: rocketmq_broker_put_latency > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Write latency too high: {{ $value }}ms"
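The two consumer-lag rules form a tiered alert: above 10000 messages is a warning, above 100000 is critical. The same tiering can be sketched as a small function, for example to reuse in a cron check or to reason about which tier a given backlog falls into. `lag_severity` is a hypothetical name; the thresholds are the ones from the rules above.

```shell
# lag_severity: map a consumer lag (message count) to an alert tier,
# using the same thresholds as RocketMQConsumerLag / RocketMQConsumerLagCritical.
lag_severity() {
  lag=$1
  if [ "$lag" -gt 100000 ]; then
    echo critical
  elif [ "$lag" -gt 10000 ]; then
    echo warning
  else
    echo ok
  fi
}
```

Note that in Prometheus both rules fire once lag exceeds 100000; an Alertmanager inhibit rule is one way to suppress the warning while the critical alert is active.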
4. Grafana dashboards
4.1 Core panels
{
  "dashboard": {
    "title": "RocketMQ Monitoring Overview",
    "panels": [
      {
        "title": "Produce/consume TPS",
        "targets": [
          {
            "expr": "sum(rocketmq_broker_tps)",
            "legendFormat": "Put TPS"
          },
          {
            "expr": "sum(rocketmq_consumer_tps)",
            "legendFormat": "Consume TPS"
          }
        ]
      },
      {
        "title": "Consumer backlog",
        "targets": [
          {
            "expr": "sum(rocketmq_group_diff) by (group)",
            "legendFormat": "{{ group }}"
          }
        ]
      },
      {
        "title": "Write latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(rocketmq_broker_put_latency_bucket[5m]))",
            "legendFormat": "P99 Latency"
          }
        ]
      },
      {
        "title": "Disk usage",
        "targets": [
          {
            "expr": "rocketmq_brokeruntime_commitlog_disk_ratio",
            "legendFormat": "{{ instance }}"
          }
        ]
      }
    ]
  }
}
4.2 Recommended panels
| Panel | Metrics | Chart type |
|---|---|---|
| Produce/consume TPS | broker_tps, consumer_tps | Time series |
| Consumer backlog | group_diff | Time series |
| Write latency | put_latency | Heatmap |
| Disk usage | commitlog_disk_ratio | Stat |
| Broker status | broker_membership | Stat |
| Failure counts | failed_count | Time series |
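5. Troubleshooting
5.1 Broker failure
Symptom: clients cannot connect to the Broker.
Diagnostic steps:
# 1. Check that the process is running
ps -ef | grep BrokerStartup
# 2. Check the Broker log
tail -f /var/log/rocketmq/broker.log
# 3. Check the listening port
netstat -tlnp | grep 10911
# 4. Check connectivity to the NameServer (telnet takes host and port as separate arguments)
telnet ns1 9876
# 5. Check disk space
df -h /data/rocketmq/store
# 6. Check the GC log
tail -f /var/log/rocketmq/gc.log
Common causes:
- Insufficient disk space
- Lost connection to the NameServer
- OutOfMemoryError (OOM)
- Network problems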
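A first pass over the Broker log can be automated by searching for signatures of the causes listed above. This is only a rough sketch: `diagnose_broker_log` is a hypothetical helper, and the patterns (the JVM's `java.lang.OutOfMemoryError`, the OS error `No space left on device`, and two connection-failure phrases) are assumptions to tune against your own logs.

```shell
# diagnose_broker_log: scan log text on stdin and print one keyword per
# suspected cause (oom, disk_full, network), or "unknown" if nothing matches.
diagnose_broker_log() {
  awk '
    /java\.lang\.OutOfMemoryError/            { oom++ }
    /No space left on device/                 { disk++ }
    /Connection refused|connect to .* failed/ { net++ }
    END {
      if (oom)  print "oom"
      if (disk) print "disk_full"
      if (net)  print "network"
      if (!oom && !disk && !net) print "unknown"
    }
  '
}
```

With the log path used in the steps above, it could be run as `tail -n 1000 /var/log/rocketmq/broker.log | diagnose_broker_log`.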
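5.2 Consumer backlog
Symptom: consumer lag keeps growing.
Diagnostic steps:
# 1. Check the consumer group's progress
mqadmin consumerProgress -n ns1:9876 -g my-group
# 2. Check the consumer instances
mqadmin consumerStatus -n ns1:9876 -g my-group
# 3. Inspect a backlogged message (queryMsgByOffset also needs the broker name and queue ID)
mqadmin queryMsgByOffset -n ns1:9876 -t my-topic -b <brokerName> -i <queueId> -o 1000
# 4. Check the consumer application log
tail -f /var/log/consumer/app.log
# 5. Check the consumer threads (push-consumer worker threads are named ConsumeMessageThread_*)
jstack <pid> | grep -A 10 "ConsumeMessageThread"
Solutions:
// 1. Add consumer instances (instances beyond the queue count stay idle)
// 2. Add consumer threads; raise the minimum too, because the pool's unbounded
//    queue means threads beyond the minimum are rarely created
consumer.setConsumeThreadMin(64);
consumer.setConsumeThreadMax(64);
// 3. Optimize the handler logic
// e.g. asynchronous or batch processing
// 4. Add queues
// The queue count of an existing Topic can be raised with mqadmin updateTopic (-r/-w)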
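Whether a fix is sufficient comes down to simple arithmetic: the backlog only shrinks when the consume rate exceeds the produce rate, and the drain time is the backlog divided by that difference. A minimal sketch, with `drain_seconds` as a hypothetical helper taking whole-number inputs:

```shell
# drain_seconds LAG CONSUME_TPS PRODUCE_TPS
# Print the estimated seconds (rounded up) until the backlog is cleared,
# or "never" if consumers are not outpacing producers.
drain_seconds() {
  awk -v lag="$1" -v c="$2" -v p="$3" 'BEGIN {
    net = c - p                              # messages cleared per second
    if (net <= 0) { print "never"; exit }
    printf "%d\n", (lag + net - 1) / net     # ceiling division for integers
  }'
}
```

For example, a backlog of 100000 messages with 1500 TPS consumed against 1000 TPS produced gives `drain_seconds 100000 1500 1000`, which prints `200`.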
5.3 Message loss
Symptom: messages never reach the consumer.
Diagnostic steps:
# 1. Check the Producer log
tail -f /var/log/producer/app.log | grep -i "error"
# 2. Check the Broker log
tail -f /var/log/rocketmq/broker.log | grep -i "error"
# 3. Look up the message by ID
mqadmin queryMsgById -n ns1:9876 -i msgId
# 4. Check the Broker's runtime status
mqadmin brokerStatus -n ns1:9876 -b broker-1:10911
Solutions:
// Producer settings: retry failed sends
producer.setRetryTimesWhenSendFailed(3);
producer.setRetryTimesWhenSendAsyncFailed(3);
// Use transactional messages (a separate producer instance)
TransactionMQProducer txProducer = new TransactionMQProducer(group);
txProducer.setTransactionListener(listener);
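5.4 Performance degradation
Symptom: TPS drops and latency rises.
Diagnostic steps:
# 1. Check CPU usage (pgrep avoids matching the grep process itself)
top -p "$(pgrep -d, -f BrokerStartup)"
# 2. Check I/O wait
iostat -x 1 5
# 3. Check GC activity
jstat -gcutil <pid> 1000 10
# 4. Check network traffic
iftop -P -n -i eth0
# 5. Check disk I/O per process
iotop -o -P
Solutions:
# Broker tuning (ASYNC_FLUSH trades durability for throughput)
flushDiskType=ASYNC_FLUSH
flushCommitLogThoroughInterval=200
# JVM tuning
-XX:+UseG1GC
-XX:MaxGCPauseMillis=20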
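Before changing GC settings, it helps to condense the `jstat -gcutil` samples from step 3 into two numbers: peak old-generation occupancy and the number of full GCs during the sampling window. A minimal sketch, assuming the output has a header row containing the `O` and `FGC` columns; `gc_summary` is a hypothetical helper name:

```shell
# gc_summary: read `jstat -gcutil` output on stdin and print the peak
# old-generation usage (%) and how many full GCs occurred between the
# first and last sample. Columns are located by header name, not position.
gc_summary() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    {
      old = $(col["O"]) + 0
      if (old > max_old) max_old = old
      fgc = $(col["FGC"]) + 0
      if (first == "") first = fgc
      last = fgc
    }
    END { printf "max_old=%.2f full_gcs=%d\n", max_old, last - first }
  '
}
```

It can be chained directly onto the command from step 3, as in `jstat -gcutil <pid> 1000 10 | gc_summary`. A full-GC count that keeps climbing while old-generation usage stays high points at heap pressure rather than disk or network.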
6. Operations tools
6.1 Official tools
# Topic management (-r/-w set the read/write queue counts; -p is the permission flag and is left at its default)
mqadmin updateTopic -n ns1:9876 -t my-topic -c DefaultCluster -r 8 -w 8
# Consumer group management
mqadmin consumerProgress -n ns1:9876 -g my-group
mqadmin resetOffsetByTime -n ns1:9876 -t my-topic -g my-group -s 1609459200000
# Message lookup
mqadmin queryMsgById -n ns1:9876 -i msgId
mqadmin queryMsgByOffset -n ns1:9876 -t my-topic -b <brokerName> -i <queueId> -o 1000
# Cluster status
mqadmin clusterList -n ns1:9876
mqadmin brokerStatus -n ns1:9876 -b broker-1:10911
# NameServer
mqadmin getNamesrvConfig -n ns1:9876
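The `-s` value passed to `resetOffsetByTime` above is a Unix timestamp in milliseconds. A small helper can produce it from a readable date; this sketch assumes GNU `date` (the `-d` option) and interprets the input as UTC, and `to_epoch_ms` is a hypothetical name.

```shell
# to_epoch_ms: convert a "YYYY-MM-DD HH:MM:SS" string (interpreted as UTC)
# into milliseconds since the Unix epoch. Requires GNU date.
to_epoch_ms() {
  echo "$(( $(date -u -d "$1" +%s) * 1000 ))"
}
```

The timestamp used above, 1609459200000, corresponds to `to_epoch_ms '2021-01-01 00:00:00'`.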
6.2 Third-party tools
| Tool | Description | Link |
|---|---|---|
| RocketMQ Dashboard | Web management UI | GitHub |
| RocketMQ Console | Monitoring and management platform | GitHub |
| RocketMQ Exporter | Prometheus exporter | GitHub |
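6.3 Custom scripts
#!/bin/bash
# RocketMQ health-check script (exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL)
NAMESRV="ns1:9876;ns2:9876"
# Check NameServer connectivity
for ns in $(echo "$NAMESRV" | tr ';' ' '); do
    if ! nc -z "${ns%:*}" "${ns#*:}" &>/dev/null; then
        echo "CRITICAL: cannot connect to NameServer $ns"
        exit 2
    fi
done
# Check the Broker count by counting clusterList rows that contain an IPv4 host:port address
broker_count=$(mqadmin clusterList -n "$NAMESRV" | \
    grep -cE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+')
if [ "$broker_count" -lt 2 ]; then
    echo "CRITICAL: not enough Brokers ($broker_count)"
    exit 2
fi
# Check disk usage; brokerStatus reports commitLogDiskRatio as a fraction (0-1)
disk_ratio=$(mqadmin brokerStatus -n "$NAMESRV" -b broker-1:10911 | \
    grep "commitLogDiskRatio" | head -n 1 | awk -F: '{print $2}' | tr -d ' ')
disk_pct=$(awk -v r="${disk_ratio:-0}" 'BEGIN { printf "%d", r * 100 }')
if [ "$disk_pct" -gt 80 ]; then
    echo "WARNING: CommitLog disk usage is high: ${disk_pct}%"
    exit 1
fi
echo "OK: RocketMQ cluster is healthy"
exit 0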
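The script above stops at the first failed check. If every check should run and the overall result should be the most severe one, the exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL) can be combined at the end. A minimal sketch, with `worst_status` as a hypothetical helper:

```shell
# worst_status: given the exit codes of several checks as arguments,
# print the most severe one (the numerically largest); prints 0 if none given.
worst_status() {
  worst=0
  for code in "$@"; do
    if [ "$code" -gt "$worst" ]; then
      worst=$code
    fi
  done
  echo "$worst"
}
```

Each check would then record its own code instead of exiting, and the script would finish with `exit "$(worst_status "$ns_code" "$broker_code" "$disk_code")"`, where the three variable names are placeholders.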
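7. Best practices
7.1 Monitoring configuration
| Environment | Scrape interval | Retention | Alert thresholds |
|---|---|---|---|
| Development | 60s | 7 days | Loose |
| Testing | 30s | 14 days | Moderate |
| Production | 15s | 30 days | Strict |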
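The scrape interval and retention together determine how many samples Prometheus keeps per time series, which is what drives storage growth between environments. A quick way to compare the rows of the table, with `samples_per_series` as a hypothetical helper:

```shell
# samples_per_series INTERVAL_SECONDS RETENTION_DAYS
# Print how many samples a single time series accumulates over the retention period.
samples_per_series() {
  echo $(( $2 * 86400 / $1 ))
}

samples_per_series 60 7    # development: 10080
samples_per_series 30 14   # testing:     40320
samples_per_series 15 30   # production:  172800
```

Production therefore stores roughly 17 times as many samples per series as development, before accounting for the larger number of series a production cluster usually exposes.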
7.2 Log management
<!-- logback.xml configuration -->
<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/var/log/rocketmq/broker.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
        <fileNamePattern>/var/log/rocketmq/broker.log.%d{yyyy-MM-dd}.%i</fileNamePattern>
        <maxFileSize>100MB</maxFileSize>
        <maxHistory>30</maxHistory>
    </rollingPolicy>
    <encoder>
        <pattern>%d{yyyy-MM-dd HH:mm:ss} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
</appender>
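7.3 Backup strategy
#!/bin/bash
# Metadata backup script
BACKUP_DIR="/backup/rocketmq"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p "$BACKUP_DIR"
# Back up the Topic list (topicList only reads; updateTopic would create or modify a Topic)
mqadmin topicList -n ns1:9876 > \
    "$BACKUP_DIR/topics_$DATE.txt"
# Back up consumer group progress
mqadmin consumerProgress -n ns1:9876 > \
    "$BACKUP_DIR/consumer-groups_$DATE.txt"
# Back up the Broker configuration
cp /opt/rocketmq/conf/broker.conf "$BACKUP_DIR/broker_$DATE.conf"
# Keep 30 days of backups (both .txt and .conf files)
find "$BACKUP_DIR" \( -name "*.txt" -o -name "*.conf" \) -mtime +30 -delete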
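Summary
The main points of RocketMQ operational monitoring:
- Monitoring architecture: Exporter + Prometheus + Grafana
- Key metrics: Broker, Consumer, Producer, Topic/Queue
- Alert configuration: tiered alerts with sensible thresholds
- Troubleshooting: log analysis and the mqadmin tooling
- Best practices: monitoring configuration, log management, backup strategy
In practice:
- Build a complete monitoring stack
- Set sensible alert thresholds
- Know how to diagnose the common failures
- Back up metadata regularly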