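RocketMQ Operations Monitoring and Alerting Guide

Operations monitoring is key to keeping a RocketMQ messaging system running reliably. This guide covers RocketMQ monitoring metrics, alert configuration, and practical troubleshooting techniques.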

1. Monitoring Architecture

1.1 Overall Architecture

graph TB
    subgraph RocketMQ Cluster
        NS[NameServer]
        B1[Broker 1]
        B2[Broker 2]
    end
    
    subgraph Metrics Collection
        EXP[RocketMQ Exporter]
        PROM[Prometheus]
    end
    
    subgraph Visualization and Alerting
        GRAF[Grafana]
        ALERT[AlertManager]
    end
    
    subgraph Notification
        DING[DingTalk]
        EMAIL[Email]
        SMS[SMS]
    end
    
    NS --> EXP
    B1 --> EXP
    B2 --> EXP
    EXP --> PROM
    PROM --> GRAF
    PROM --> ALERT
    ALERT --> DING
    ALERT --> EMAIL
    ALERT --> SMS

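1.2 Monitoring Layers

| Layer | What is monitored | Tool |
|---|---|---|
| Infrastructure | CPU, memory, disk, network | Node Exporter |
| JVM | GC, heap memory, threads | JMX Exporter |
| RocketMQ | TPS, latency, backlog | RocketMQ Exporter |
| Business | Message volume, consumption progress | Custom metrics |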

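2. Key Metrics

2.1 Broker Metrics

| Metric | Description | Alert threshold |
|---|---|---|
| rocketmq_broker_tps | Write TPS | - |
| rocketmq_broker_qps | Query QPS | - |
| rocketmq_broker_put_latency | Write latency | > 100ms |
| rocketmq_broker_dispatch_behind | Dispatch lag | > 1000ms |
| rocketmq_brokeruntime_commitlog_disk_ratio | CommitLog disk usage | > 80% |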

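2.2 Consumer Metrics

| Metric | Description | Alert threshold |
|---|---|---|
| rocketmq_consumer_tps | Consume TPS | - |
| rocketmq_consumer_latency | Consume latency | > 100ms |
| rocketmq_group_diff | Consumer backlog (messages) | > 10000 |
| rocketmq_consumer_failed_count | Failed consumptions | > 0 |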

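2.3 Producer Metrics

| Metric | Description | Alert threshold |
|---|---|---|
| rocketmq_producer_tps | Produce TPS | - |
| rocketmq_producer_latency | Produce latency | > 100ms |
| rocketmq_producer_failed_count | Failed sends | > 0 |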

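2.4 Topic/Queue Metrics

| Metric | Description | Alert threshold |
|---|---|---|
| rocketmq_topic_message_accumulate | Topic backlog (messages) | > 10000 |
| rocketmq_queue_message_accumulate | Queue backlog (messages) | > 1000 |
| rocketmq_brokeruntime_broker_membership | Number of Broker members | < expected count |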

3. Prometheus Configuration

3.1 RocketMQ Exporter

# rocketmq-exporter configuration
# https://github.com/apache/rocketmq-exporter

# Start command (quote the NameServer list: an unquoted ";" would end the shell command)
java -jar rocketmq-exporter-0.0.2-SNAPSHOT-exec.jar \
  --rocketmq.config.namesrvAddr="ns1:9876;ns2:9876" \
  --rocketmq.config.webTelemetryPath=/metrics \
  --server.port=5557
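
Once the exporter is running, it helps to confirm that it actually exposes the metrics used later in this guide. The sketch below is a small, hypothetical helper (`metric_samples` is not part of the exporter) that filters Prometheus text-format output for one exact metric name. It assumes the exporter listens on localhost:5557 as in the start command above.

```shell
#!/bin/bash
# Print the sample lines for one metric from Prometheus text-format output on stdin.
# Comment lines (# HELP / # TYPE) are skipped, and the name must match exactly,
# so "rocketmq_broker_tps" does not also match "rocketmq_broker_tps_total".
metric_samples() {
    local name="$1"
    awk -v n="$name" '
        /^#/ { next }
        {
            # The metric name ends at the first "{" (labels) or space (no labels).
            metric = $0
            sub(/[{ ].*/, "", metric)
            if (metric == n) print
        }'
}

# Hypothetical usage against a local exporter started with --server.port=5557:
#   curl -s http://localhost:5557/metrics | metric_samples rocketmq_group_diff
```

If the command prints nothing for a metric named in the tables above, that metric is probably not exported by the exporter version in use, and any alert rule built on it will never fire.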

3.2 Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'rocketmq-broker'
    static_configs:
      - targets: 
          - 'broker-1:5557'
          - 'broker-2:5557'
    metrics_path: '/metrics'
  
  - job_name: 'rocketmq-namesrv'
    static_configs:
      - targets:
          - 'namesrv-1:5558'
          - 'namesrv-2:5558'
  
  - job_name: 'node'
    static_configs:
      - targets:
          - 'broker-1:9100'
          - 'broker-2:9100'

3.3 Alerting Rules

# alerting_rules.yml
groups:
  - name: rocketmq
    rules:
      # Broker alerts
      - alert: RocketMQBrokerDown
        expr: up{job="rocketmq-broker"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Broker down: {{ $labels.instance }}"
      
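      - alert: RocketMQBrokerTPSDrop
        # rocketmq_broker_tps is already a per-second gauge, so compare it directly instead of wrapping it in rate()
        expr: rocketmq_broker_tps < 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Broker TPS too low: {{ $value }}"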
      
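      - alert: RocketMQCommitLogDiskHigh
        # The Broker reports commitLogDiskRatio as a fraction between 0 and 1, so 80% is 0.8
        expr: rocketmq_brokeruntime_commitlog_disk_ratio > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CommitLog disk usage too high: {{ $value | humanizePercentage }}"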
      
      # Consumer alerts
      - alert: RocketMQConsumerLag
        expr: rocketmq_group_diff > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag: {{ $labels.group }} - {{ $value }}"
      
      - alert: RocketMQConsumerLagCritical
        expr: rocketmq_group_diff > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Severe consumer lag: {{ $labels.group }} - {{ $value }}"
      
      - alert: RocketMQConsumerFailed
        expr: rate(rocketmq_consumer_failed_count[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumption failures: {{ $labels.group }}"
      
      # Producer alerts
      - alert: RocketMQProducerFailed
        expr: rate(rocketmq_producer_failed_count[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Send failures"
      
      # Performance alerts
      - alert: RocketMQWriteLatency
        expr: rocketmq_broker_put_latency > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Write latency too high: {{ $value }}ms"
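
The two consumer lag rules above form a tiered alert: warning above 10000 messages, critical above 100000. The sketch below reproduces the same tiering as a shell function, which can be handy for ad hoc checks or cron-based scripts. `lag_severity` is a hypothetical helper, and it assumes the lag is given as a whole number.

```shell
#!/bin/bash
# Map a consumer lag value to a severity, mirroring the thresholds of the
# RocketMQConsumerLag (warning) and RocketMQConsumerLagCritical (critical) rules.
# Optional 2nd and 3rd arguments override the default thresholds.
lag_severity() {
    local lag="$1" warn="${2:-10000}" crit="${3:-100000}"
    if [ "$lag" -gt "$crit" ]; then
        echo "critical"
    elif [ "$lag" -gt "$warn" ]; then
        echo "warning"
    else
        echo "ok"
    fi
}

# Example: lag_severity 25000   prints "warning"
```

The comparisons are strict, as in the rules, so a lag of exactly 10000 is still reported as ok.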

4. Grafana Dashboards

4.1 Core Panels

{
  "dashboard": {
    "title": "RocketMQ Monitoring Dashboard",
    "panels": [
      {
        "title": "Produce/Consume TPS",
        "targets": [
          {
            "expr": "sum(rocketmq_broker_tps)",
            "legendFormat": "Put TPS"
          },
          {
            "expr": "sum(rocketmq_consumer_tps)",
            "legendFormat": "Consume TPS"
          }
        ]
      },
      {
        "title": "Consumer Backlog",
        "targets": [
          {
            "expr": "sum(rocketmq_group_diff) by (group)",
            "legendFormat": "{{ group }}"
          }
        ]
      },
      {
        "title": "写入延迟",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(rocketmq_broker_put_latency_bucket[5m]))",
            "legendFormat": "P99 Latency"
          }
        ]
      },
      {
        "title": "磁盘使用率",
        "targets": [
          {
            "expr": "rocketmq_brokeruntime_commitlog_disk_ratio",
            "legendFormat": "{{ instance }}"
          }
        ]
      }
    ]
  }
}

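4.2 Recommended Panels

| Panel | Metrics | Chart type |
|---|---|---|
| Produce/Consume TPS | broker_tps, consumer_tps | Time series |
| Consumer backlog | group_diff | Time series |
| Write latency | put_latency | Heatmap |
| Disk usage | commitlog_disk_ratio | Status panel |
| Broker status | broker_membership | Status panel |
| Failure counts | failed_count | Time series |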

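5. Troubleshooting

5.1 Broker Failures

Symptom: the Broker cannot be reached

Troubleshooting steps

# 1. Check the process state (the [B] keeps grep from matching itself)
ps -ef | grep '[B]rokerStartup'

# 2. Check the logs
tail -f /var/log/rocketmq/broker.log

# 3. Check the listening port
netstat -tlnp | grep 10911

# 4. Check connectivity to the NameServer (telnet takes host and port as separate arguments)
telnet ns1 9876

# 5. Check disk space
df -h /data/rocketmq/store

# 6. Check the GC log
tail -f /var/log/rocketmq/gc.log

Common causes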

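5.2 Consumer Backlog

Symptom: consumer lag keeps growing

Troubleshooting steps

# 1. Check consumer group progress
mqadmin consumerProgress -n ns1:9876 -g my-group

# 2. Check consumer status
mqadmin consumerStatus -n ns1:9876 -g my-group

# 3. Inspect a backlogged message (queryMsgByOffset also needs the broker name and queue ID;
#    broker-a and queue 0 are placeholders)
mqadmin queryMsgByOffset -n ns1:9876 -t my-topic -b broker-a -i 0 -o 1000

# 4. Check the consumer application log
tail -f /var/log/consumer/app.log

# 5. Check the consumer threads
jstack <pid> | grep -A 10 "ConsumeMessageThread"

Solutions

// 1. Add consumer instances (in clustering mode, instances beyond the queue count stay idle)
// 2. Add consumer threads
// The consume thread pool uses an unbounded queue, so raise the minimum as well
consumer.setConsumeThreadMin(64);
consumer.setConsumeThreadMax(64);

// 3. Optimize the processing logic
// Asynchronous processing, batch processing

// 4. Increase the number of queues
// Queue counts can be raised in place with mqadmin updateTopic (-r / -w)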
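
Before choosing among the solutions above, it is worth estimating how long the backlog would take to drain at the current rates. A backlog only shrinks when the consume rate exceeds the produce rate. The sketch below is a hypothetical helper (`drain_seconds` is an illustrative name) that does this arithmetic with whole-number rates in messages per second.

```shell
#!/bin/bash
# Estimate how many seconds it takes to clear a backlog.
# Arguments: backlog (messages), consume rate (msg/s), produce rate (msg/s).
# Prints "never" when consumers do not outpace producers.
drain_seconds() {
    local backlog="$1" consume_tps="$2" produce_tps="$3"
    local net=$(( consume_tps - produce_tps ))
    if [ "$net" -le 0 ]; then
        echo "never"
        return 0
    fi
    # Integer division rounded up, so a partial second counts as a full one.
    echo $(( (backlog + net - 1) / net ))
}

# Example: 100000 queued messages, consuming 1500/s while 1000/s keep arriving
#   drain_seconds 100000 1500 1000
```

If the result is "never" or far longer than the business can tolerate, adding consumer threads or instances is justified. Otherwise the backlog may clear on its own.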

5.3 Message Loss

Symptom: messages never reach the consumer

Troubleshooting steps

# 1. Check the Producer log
tail -f /var/log/producer/app.log | grep -i "error"

# 2. Check the Broker log
tail -f /var/log/rocketmq/broker.log | grep -i "error"

# 3. Look up the message by ID
mqadmin queryMsgById -n ns1:9876 -i msgId

# 4. Check Broker and replica status
mqadmin brokerStatus -n ns1:9876 -b broker-1:10911

Solutions

// Producer configuration: retry failed sends
producer.setRetryTimesWhenSendFailed(3);
producer.setRetryTimesWhenSendAsyncFailed(3);

// Use transactional messages (a separate producer instance)
TransactionMQProducer txProducer = new TransactionMQProducer(group);
txProducer.setTransactionListener(listener);

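5.4 Performance Degradation

Symptom: TPS drops and latency rises

Troubleshooting steps

# 1. Check CPU usage (pgrep avoids matching the grep process itself; -d, joins multiple PIDs for top)
top -p "$(pgrep -d, -f BrokerStartup)"

# 2. Check IO wait
iostat -x 1 5

# 3. Check GC activity
jstat -gcutil <pid> 1000 10

# 4. Check network traffic
iftop -P -n -i eth0

# 5. Check disk IO
iotop -o -P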
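
The `jstat -gcutil` output in step 3 is easier to act on when filtered. The sketch below is a hypothetical helper (`old_gen_alerts` is an illustrative name) that reads that output from stdin, locates the `O` (old generation usage) and `FGC` (full GC count) columns by their header names, and prints only the samples where old generation usage exceeds a threshold. It assumes the first input line is the jstat header row.

```shell
#!/bin/bash
# Read `jstat -gcutil` output on stdin and report samples whose old generation
# usage (column "O", in percent) is above the threshold (default 80).
old_gen_alerts() {
    local threshold="${1:-80}"
    awk -v t="$threshold" '
        NR == 1 {
            # Remember the position of every column by its header name.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("O" in col) && ($(col["O"]) + 0 > t + 0) {
            printf "old gen %s%% (full GCs: %s)\n", $(col["O"]), $(col["FGC"])
        }'
}

# Hypothetical usage:
#   jstat -gcutil <pid> 1000 10 | old_gen_alerts 80
```

Looking columns up by header rather than by position keeps the filter working across JDK versions that add or reorder columns.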

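Solutions

# Broker tuning (ASYNC_FLUSH trades durability for throughput: messages not yet flushed can be lost on a crash)
flushDiskType=ASYNC_FLUSH
flushCommitLogThoroughInterval=200

# JVM tuning
-XX:+UseG1GC
-XX:MaxGCPauseMillis=20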

6. Operations Tools

6.1 Official Tools

# Topic management (-r/-w set the read/write queue counts; -p is the permission: 2=W, 4=R, 6=RW)
mqadmin updateTopic -n ns1:9876 -t my-topic -c DefaultCluster -r 8 -w 8 -p 6

# Consumer group management
mqadmin consumerProgress -n ns1:9876 -g my-group
mqadmin resetOffsetByTime -n ns1:9876 -t my-topic -g my-group -s 1609459200000

# Message lookup (broker-a and queue 0 are placeholders)
mqadmin queryMsgById -n ns1:9876 -i msgId
mqadmin queryMsgByOffset -n ns1:9876 -t my-topic -b broker-a -i 0 -o 1000

# Cluster status
mqadmin clusterList -n ns1:9876
mqadmin brokerStatus -n ns1:9876 -b broker-1:10911

# NameServer
mqadmin getNamesrvConfig -n ns1:9876
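
The `-s` value passed to `resetOffsetByTime` above is a Unix timestamp in milliseconds (1609459200000 is 2021-01-01 00:00:00 UTC). The sketch below shows one way to compute such a value from a readable date. `to_epoch_ms` is a hypothetical helper, and it assumes GNU `date` (as found on most Linux systems); BSD and macOS `date` use different options.

```shell
#!/bin/bash
# Convert a UTC date string to epoch milliseconds.
# Assumes GNU date: -u interprets the input as UTC, -d parses the given string.
to_epoch_ms() {
    local secs
    secs=$(date -u -d "$1" +%s) || return 1
    # Whole seconds to milliseconds: append three zeros.
    echo "${secs}000"
}

# Hypothetical usage:
#   mqadmin resetOffsetByTime -n ns1:9876 -t my-topic -g my-group \
#       -s "$(to_epoch_ms '2021-01-01 00:00:00')"
```

Working in UTC avoids resetting to the wrong point in time when the operator's shell and the Broker run in different time zones.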

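6.2 Third-Party Tools

| Tool | Description | Link |
|---|---|---|
| RocketMQ Dashboard | Web management UI | GitHub |
| RocketMQ Console | Monitoring and management platform | GitHub |
| RocketMQ Exporter | Prometheus exporter | GitHub |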

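6.3 Custom Scripts

#!/bin/bash
# RocketMQ health check script

NAMESRV="ns1:9876;ns2:9876"

# Check NameServer connectivity
for ns in $(echo "$NAMESRV" | tr ';' ' '); do
    if ! nc -z "${ns%:*}" "${ns#*:}" &>/dev/null; then
        echo "CRITICAL: cannot connect to NameServer $ns"
        exit 2
    fi
done

# Check Broker count: count clusterList rows that contain a broker address (ip:port);
# adjust the pattern if your mqadmin version prints a different layout
broker_count=$(mqadmin clusterList -n "$NAMESRV" | grep -cE '[0-9]+(\.[0-9]+){3}:[0-9]+')
if [ "$broker_count" -lt 2 ]; then
    echo "CRITICAL: not enough Brokers"
    exit 2
fi

# Check disk usage: commitLogDiskRatio is reported as a fraction (e.g. 0.45),
# so convert it to a whole percentage before comparing
disk_pct=$(mqadmin brokerStatus -n "$NAMESRV" -b broker-1:10911 | \
    awk -F: '/commitLogDiskRatio/ {printf "%d", $2 * 100; exit}')

if [ "${disk_pct:-0}" -gt 80 ]; then
    echo "WARNING: CommitLog disk usage is too high: ${disk_pct}%"
    exit 1
fi

echo "OK: RocketMQ cluster is healthy"
exit 0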

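7. Best Practices

7.1 Monitoring Configuration

| Environment | Scrape interval | Retention | Alert thresholds |
|---|---|---|---|
| Development | 60s | 7 days | Loose |
| Testing | 30s | 14 days | Moderate |
| Production | 15s | 30 days | Strict |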
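
Scrape interval and retention together determine how much disk Prometheus needs. The Prometheus documentation gives the rough formula: disk space = retention time in seconds × ingested samples per second × bytes per sample, with about 1 to 2 bytes per sample on average. The sketch below applies that formula. `prom_disk_bytes` is a hypothetical helper, and the series count is an assumption you need to measure for your own cluster.

```shell
#!/bin/bash
# Rough Prometheus disk estimate in bytes.
# Arguments: active series, scrape interval (s), retention (days),
#            bytes per sample (optional, default 2).
prom_disk_bytes() {
    local series="$1" interval_s="$2" retention_days="$3" bytes_per_sample="${4:-2}"
    # samples = series * retention_seconds / interval
    echo $(( series * retention_days * 86400 / interval_s * bytes_per_sample ))
}

# Example: 10000 series with the production settings from the table (15s, 30 days)
#   prom_disk_bytes 10000 15 30
```

With 10000 series, the production row works out to roughly 3.5 GB and the development row (60s, 7 days) to roughly 0.2 GB, which shows how strongly the scrape interval drives storage cost.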

7.2 Log Management

<!-- logback.xml configuration -->

<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/var/log/rocketmq/broker.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
        <fileNamePattern>/var/log/rocketmq/broker.log.%d{yyyy-MM-dd}.%i</fileNamePattern>
        <maxFileSize>100MB</maxFileSize>
        <maxHistory>30</maxHistory>
    </rollingPolicy>
    <encoder>
        <pattern>%d{yyyy-MM-dd HH:mm:ss} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
</appender>

7.3 Backup Strategy

#!/bin/bash
# Metadata backup script

BACKUP_DIR="/backup/rocketmq"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p "$BACKUP_DIR"

# Back up the topic list (topicList only reads; updateTopic would create or modify a topic)
mqadmin topicList -n ns1:9876 -c > \
  "$BACKUP_DIR/topics_$DATE.txt"

# Back up consumer group progress
mqadmin consumerProgress -n ns1:9876 > \
  "$BACKUP_DIR/consumer-groups_$DATE.txt"

# Back up the Broker configuration
cp /opt/rocketmq/conf/broker.conf "$BACKUP_DIR/broker_$DATE.conf"

# Keep 30 days of backups
find "$BACKUP_DIR" \( -name "*.txt" -o -name "*.conf" \) -mtime +30 -delete

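Summary

Key points of RocketMQ operations monitoring:

  1. Monitoring architecture: Exporter + Prometheus + Grafana
  2. Key metrics: Broker, Consumer, Producer, Topic/Queue
  3. Alert configuration: tiered alerts with sensible thresholds
  4. Troubleshooting: log analysis and tooling
  5. Best practices: monitoring configuration, log management, backup strategy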
