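Redis Sentinel in Practice

Redis Sentinel is the core high-availability solution for Redis master/replica deployments. A group of Sentinel processes monitors the master, detects failures, and performs automatic failover. This article covers deploying, configuring, and troubleshooting a Sentinel deployment.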

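1. Sentinel Architecture

1.1 Recommended Topology

Recommended for production:
- 3 Sentinel nodes (tolerates 1 Sentinel failure)
- or 5 Sentinel nodes (tolerates 2 Sentinel failures)
- Each Sentinel on a separate machine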

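                      ┌──────────────────┐
                      │      Master      │
                      │ 192.168.1.1:6379 │
                      └─────────┬────────┘
                                │
                     ┌──────────┴──────────┐
                     ▼                     ▼
           ┌──────────────────┐  ┌──────────────────┐
           │     Replica      │  │     Replica      │
           │ 192.168.1.2:6379 │  │ 192.168.1.3:6379 │
           └──────────────────┘  └──────────────────┘
                                ▲
                                │  monitored by the Sentinels
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│    Sentinel S1    │ │    Sentinel S2    │ │    Sentinel S3    │
│ 192.168.2.1:26379 │ │ 192.168.2.2:26379 │ │ 192.168.2.3:26379 │
└───────────────────┘ └───────────────────┘ └───────────────────┘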
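
The sizing guidance above follows from majority arithmetic: a failover must be authorized by a majority of all Sentinels, so a deployment of N Sentinels keeps working as long as a majority is still alive. A minimal sketch of that arithmetic, where the function name `sentinel_sizing` is illustrative and not part of any Redis tooling:

```python
def sentinel_sizing(total_sentinels: int) -> dict:
    """Return the majority size and tolerated Sentinel failures for a deployment."""
    if total_sentinels < 1:
        raise ValueError("total_sentinels must be at least 1")
    # A majority is more than half of all configured Sentinels.
    majority = total_sentinels // 2 + 1
    return {
        "sentinels": total_sentinels,
        "majority": majority,
        # Failures beyond this number leave no majority to authorize a failover.
        "tolerated_failures": total_sentinels - majority,
    }


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(sentinel_sizing(n))
```

Note that 4 Sentinels tolerate no more failures than 3, which is one reason odd counts are usually preferred.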

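1.2 Node Roles

Role	Count	Port	Description
Master	1	6379	Master node (handles writes)
Replica	2+	6379	Replica nodes (can serve reads)
Sentinel	3 or 5	26379	Sentinel nodes (monitoring and failover)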

2. Deployment and Configuration

2.1 Redis Node Configuration

# redis.conf (Master)
port 6379
bind 0.0.0.0
daemonize yes
pidfile /var/run/redis/redis-server.pid
logfile /var/log/redis/redis-server.log

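# Authentication (masterauth is also set on the master so that it can
# rejoin as a replica after a failover)
requirepass your_password
masterauth your_password

# Persistence
save 900 1
save 300 10
appendonly yes
appendfsync everysec

# Announced address (optional; only needed behind NAT or in containers).
# "sentinel announce-ip" is a sentinel.conf directive and is not valid in redis.conf.
replica-announce-ip 192.168.1.1
replica-announce-port 6379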

# redis.conf (Replica)
port 6379
bind 0.0.0.0

# Authentication
requirepass your_password
masterauth your_password

# Replication
replicaof 192.168.1.1 6379

# Read-only replica
replica-read-only yes

# Persistence
appendonly yes

2.2 Sentinel Configuration

# sentinel.conf (S1/S2/S3)
port 26379
bind 0.0.0.0
daemonize yes
pidfile /var/run/redis/redis-sentinel.pid
logfile /var/log/redis/sentinel.log

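# Monitored master: name, address, and quorum (the number of Sentinels
# that must agree the master is down)
sentinel monitor mymaster 192.168.1.1 6379 2

# Password used to authenticate with the master and its replicas
sentinel auth-pass mymaster your_password

# Time without a valid reply before the master is marked subjectively down (ms)
sentinel down-after-milliseconds mymaster 5000

# Failover timeout (ms)
sentinel failover-timeout mymaster 60000

# Number of replicas reconfigured to sync from the new master at the same time
sentinel parallel-syncs mymaster 1

# Notification script (must exist and be executable, or Sentinel refuses to start)
sentinel notification-script mymaster /var/redis/notify.sh

# Client reconfiguration script (called when a failover changes the master address)
sentinel client-reconfig-script mymaster /var/redis/reconfig.sh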
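
Since S1, S2, and S3 share the same per-master settings, one way to keep them consistent is to render those directives from a single set of parameters. The sketch below is only an illustration of that idea: `SentinelSettings` and `render_sentinel_conf` are hypothetical helper names, and the output contains only directives already shown above.

```python
from dataclasses import dataclass


@dataclass
class SentinelSettings:
    """Parameters for one monitored master (hypothetical helper type)."""
    master_name: str
    master_host: str
    master_port: int
    quorum: int
    down_after_ms: int = 5000
    failover_timeout_ms: int = 60000
    parallel_syncs: int = 1


def render_sentinel_conf(settings: SentinelSettings, total_sentinels: int) -> str:
    """Render the per-master sentinel.conf directives as text."""
    # A quorum larger than the number of Sentinels can never be reached.
    if not 1 <= settings.quorum <= total_sentinels:
        raise ValueError("quorum must be between 1 and the number of Sentinels")
    name = settings.master_name
    lines = [
        f"sentinel monitor {name} {settings.master_host} {settings.master_port} {settings.quorum}",
        f"sentinel down-after-milliseconds {name} {settings.down_after_ms}",
        f"sentinel failover-timeout {name} {settings.failover_timeout_ms}",
        f"sentinel parallel-syncs {name} {settings.parallel_syncs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = SentinelSettings("mymaster", "192.168.1.1", 6379, quorum=2)
    print(render_sentinel_conf(example, total_sentinels=3))
```

Validating the quorum at render time catches one of the common failover problems (a quorum set too high) before the file is deployed.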

2.3 Startup Script

#!/bin/bash
# start_redis.sh

REDIS_DIR="/opt/redis"
REDIS_BIN="$REDIS_DIR/redis-server"
SENTINEL_BIN="$REDIS_DIR/redis-sentinel"

# Start Redis
"$REDIS_BIN" /etc/redis/redis.conf

# Start Sentinel
"$SENTINEL_BIN" /etc/redis/sentinel.conf

# Check status (the Redis node requires the password set by requirepass)
sleep 5
redis-cli -p 6379 -a your_password INFO replication
redis-cli -p 26379 INFO sentinel

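3. Failover

3.1 Failure Detection Flow

Master fails
    ↓
Sentinel A gets no valid reply within down-after-milliseconds
    ↓
Sentinel A marks the master subjectively down (SDOWN)
    ↓
Sentinels exchange their views of the master
    ↓
Quorum reached: the master is marked objectively down (ODOWN)
    ↓
Sentinels elect a leader (requires a majority of all Sentinels)
    ↓
The leader Sentinel carries out the failover
    ↓
A replica is selected and promoted to master
    ↓
The remaining replicas are reconfigured to replicate from the new master
    ↓
Sentinels update their configuration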
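
The decision points in this flow can be sketched as a small pure function. This is a simplified model rather than Sentinel's actual implementation: it assumes the leader needs votes from at least max(majority, quorum) reachable Sentinels, and the function name and return labels are illustrative.

```python
def failover_decision(total_sentinels: int, sdown_reports: int,
                      reachable_voters: int, quorum: int) -> str:
    """Classify how far the failure-detection flow can progress.

    total_sentinels:  number of Sentinels configured for the master
    sdown_reports:    Sentinels that currently see the master as down
    reachable_voters: Sentinels that can take part in the leader election
    quorum:           quorum from the "sentinel monitor" directive
    """
    majority = total_sentinels // 2 + 1
    if sdown_reports == 0:
        return "ok"        # nobody thinks the master is down
    if sdown_reports < quorum:
        return "sdown"     # only subjectively down, quorum not reached
    if reachable_voters < max(majority, quorum):
        return "odown"     # objectively down, but no leader can be elected
    return "failover"      # a leader can be elected and start the failover


if __name__ == "__main__":
    print(failover_decision(3, 1, 3, 2))
    print(failover_decision(3, 2, 2, 2))
    print(failover_decision(5, 2, 2, 2))
```

The third call illustrates why quorum alone is not enough: with 5 Sentinels and quorum 2, two Sentinels can declare ODOWN but cannot form the majority of 3 needed to elect a leader.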

3.2 Manual Failover

# Trigger a failover manually
redis-cli -p 26379 SENTINEL FAILOVER mymaster

# Check the result
redis-cli -p 26379 SENTINEL MASTER mymaster

# Example output (abridged, shown here as key=value)
# name=mymaster
# ip=192.168.1.2  # new master
# port=6379
# flags=master
# num-slaves=2
# num-other-sentinels=2
# quorum=2

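3.3 Failover Log

# sentinel.log (abridged, in the order the events are normally logged)
+sdown master mymaster 192.168.1.1 6379
+odown master mymaster 192.168.1.1 6379 #quorum 2/2
+new-epoch 1
+try-failover master mymaster 192.168.1.1 6379
+vote-for-leader abc123 1
+elected-leader master mymaster 192.168.1.1 6379
+selected-slave slave 192.168.1.2:6379 192.168.1.2 6379 @ mymaster 192.168.1.1 6379
+failover-end master mymaster 192.168.1.1 6379
+switch-master mymaster 192.168.1.1 6379 192.168.1.2 6379
# If the failover times out, +failover-end-for-timeout is logged instead of +failover-end.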
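
The `+switch-master` event is the line most tooling cares about, because it carries both the old and the new master address. A minimal sketch of extracting it from a log line follows; the `SwitchMaster` type and `parse_switch_master` function are illustrative names, and the regular expression assumes the five-field layout shown in the log above.

```python
import re
from typing import NamedTuple, Optional


class SwitchMaster(NamedTuple):
    master_name: str
    old_host: str
    old_port: int
    new_host: str
    new_port: int


# +switch-master <master name> <old ip> <old port> <new ip> <new port>
_SWITCH_RE = re.compile(r"\+switch-master (\S+) (\S+) (\d+) (\S+) (\d+)")


def parse_switch_master(line: str) -> Optional[SwitchMaster]:
    """Return the parsed event, or None if the line is not a +switch-master event."""
    match = _SWITCH_RE.search(line)
    if match is None:
        return None
    name, old_host, old_port, new_host, new_port = match.groups()
    return SwitchMaster(name, old_host, int(old_port), new_host, int(new_port))


if __name__ == "__main__":
    event = parse_switch_master("+switch-master mymaster 192.168.1.1 6379 192.168.1.2 6379")
    print(event)
```

Using `search` rather than `match` lets the function accept lines that carry a timestamp or process prefix before the event name.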

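4. Split-Brain

4.1 What Is Split-Brain

A network partition splits the deployment in two:

Partition A                Partition B
┌─────────────┐            ┌─────────────┐
│   Master    │            │   Replica   │
│ 192.168.1.1 │            │ 192.168.1.2 │
└──────┬──────┘            └──────┬──────┘
       │                          │
┌──────┴──────┐            ┌──────┴──────┐
│  Sentinel   │            │  Sentinel   │
│     S1      │            │   S2, S3    │
└─────────────┘            └─────────────┘

The problem:
- The Sentinels in partition B (S2 and S3) cannot reach the master and conclude that it has failed
- One of them is elected leader and promotes the replica 192.168.1.2 to master
- The old master in partition A keeps accepting writes, so there are now two masters (split-brain); writes accepted only by the old master are lost once it is demoted to a replica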

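4.2 Solutions

Option 1: configure min-replicas-to-write

# Master configuration (redis.conf)
# Before Redis 5 these were named min-slaves-to-write / min-slaves-max-lag;
# the old names remain as aliases.
min-replicas-to-write 1
min-replicas-max-lag 10

# Meaning:
# the master accepts writes only while at least 1 replica is connected
# with a replication lag of at most 10 seconds;
# otherwise the master rejects writes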
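
The write-acceptance rule described above can be modeled in a few lines. This is a sketch of the rule as stated, not Redis source code; the function and parameter names are illustrative, and lag is measured in seconds since each replica last acknowledged the master.

```python
from typing import Iterable


def master_accepts_writes(replica_lags_seconds: Iterable[float],
                          min_replicas_to_write: int,
                          min_replicas_max_lag: float) -> bool:
    """Return True if the master would accept writes under the given settings."""
    if min_replicas_to_write == 0:
        return True  # the check is disabled
    # Count replicas whose last acknowledgement is recent enough.
    healthy = sum(1 for lag in replica_lags_seconds if lag <= min_replicas_max_lag)
    return healthy >= min_replicas_to_write


if __name__ == "__main__":
    # Isolated master: no replicas reachable, so writes are refused.
    print(master_accepts_writes([], 1, 10))
    # One replica is 2 s behind, the other 15 s: one healthy replica is enough.
    print(master_accepts_writes([2, 15], 1, 10))
```

In the partition scenario from section 4.1, the old master sees no healthy replica after the lag window passes and stops accepting writes, which bounds how much data can diverge.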

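Option 2: set the Sentinel quorum

# Quorum setting
sentinel monitor mymaster 192.168.1.1 6379 2
# At least 2 Sentinels must agree that the master is down before it is marked ODOWN.
# The failover itself also needs authorization from a majority of all Sentinels,
# so a minority partition cannot promote a new master.

# 3 Sentinels: quorum = 2
# 5 Sentinels: quorum = 3

Option 3: tune the failure-detection timeouts

# Shorter detection time (isolation is noticed sooner, at the cost of more false positives)
sentinel down-after-milliseconds mymaster 3000

# Longer failover timeout
sentinel failover-timeout mymaster 120000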

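4.3 Recovering from Split-Brain

# 1. Check the current state
redis-cli -p 26379 SENTINEL MASTER mymaster

# 2. Identify the old master
redis-cli -h 192.168.1.1 -a your_password INFO replication

# 3. Demote the old master to a replica of the new master
#    (REPLICAOF replaces SLAVEOF since Redis 5; SLAVEOF still works)
redis-cli -h 192.168.1.1 -a your_password REPLICAOF 192.168.1.2 6379

# 4. Wait for the resync to finish (master_link_status:up)
redis-cli -h 192.168.1.1 -a your_password INFO replication

# 5. Sentinel updates its configuration automatically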

5. Monitoring and Alerting

5.1 Monitoring Metrics

# Sentinel status
redis-cli -p 26379 INFO sentinel

# Example output
# sentinel
# sentinel_masters:1
# sentinel_tilt:0
# sentinel_running_scripts:0
# sentinel_scripts_queue_length:0
# master0:name=mymaster,status=ok,address=192.168.1.1:6379,slaves=2,sentinels=3
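
For monitoring, the `master0:` line is the useful part of this output. A minimal parsing sketch, assuming the comma-separated `key=value` layout shown above (the function name is illustrative):

```python
def parse_sentinel_master_line(line: str) -> dict:
    """Parse one masterN line from INFO sentinel into a dictionary."""
    # "master0:name=...,status=..." splits into the index and the field list.
    index, _, payload = line.strip().partition(":")
    fields = dict(item.split("=", 1) for item in payload.split(","))
    # The address is "host:port"; split on the last colon.
    host, _, port = fields["address"].rpartition(":")
    return {
        "index": index,
        "name": fields["name"],
        "status": fields["status"],
        "host": host,
        "port": int(port),
        "slaves": int(fields["slaves"]),
        "sentinels": int(fields["sentinels"]),
    }


if __name__ == "__main__":
    sample = "master0:name=mymaster,status=ok,address=192.168.1.1:6379,slaves=2,sentinels=3"
    info = parse_sentinel_master_line(sample)
    print(info)
    # A simple health check: status is ok and all 3 Sentinels see each other.
    print(info["status"] == "ok" and info["sentinels"] == 3)
```

A `sentinels` value lower than the number deployed is worth alerting on, since it means some Sentinels have not discovered each other.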

5.2 Alert Script

#!/bin/bash
# notify.sh
# Sentinel calls a notification script with two arguments:
#   $1 = event type (for example +sdown, +odown, +switch-master)
#   $2 = event description

event_type=$1
event_description=$2

LOG_FILE="/var/log/redis/sentinel-alerts.log"
WEBHOOK="https://hooks.slack.com/services/XXX/YYY/ZZZ"

# Log every event
echo "$(date '+%Y-%m-%d %H:%M:%S') - $event_type $event_description" >> "$LOG_FILE"

# Send a Slack alert for down and failover events
case "$event_type" in
    +sdown|+odown|+switch-master)
        curl -s -X POST -H 'Content-type: application/json' \
            --data "{\"text\": \"Redis Alert: $event_type $event_description\"}" \
            "$WEBHOOK"
        ;;
esac
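
Sentinel passes a notification script two arguments, the event type and an event description. Building the JSON body by string interpolation in shell breaks if the description ever contains a quote, so the payload step can be sketched in Python with proper escaping. The set of alerting events and the function name `build_alert` are assumptions chosen for illustration; sending the payload to the webhook is left out.

```python
import json
from typing import Optional

# Event types that should raise an alert (an illustrative choice).
ALERT_EVENTS = {"+sdown", "+odown", "+switch-master"}


def build_alert(event_type: str, description: str) -> Optional[str]:
    """Return a JSON payload for alert-worthy events, or None for the rest."""
    if event_type not in ALERT_EVENTS:
        return None
    # json.dumps escapes quotes and backslashes in the description.
    return json.dumps({"text": f"Redis Alert: {event_type} {description}"})


if __name__ == "__main__":
    print(build_alert("+odown", "master mymaster 192.168.1.1 6379 #quorum 2/2"))
    print(build_alert("+tilt", "tilt mode entered"))
```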

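5.3 Prometheus Monitoring

# prometheus.yml
# Note: Sentinel speaks the Redis protocol, not HTTP, so Prometheus cannot
# scrape port 26379 directly. These targets need a Redis exporter in front
# of them (for example via relabel_configs that point at the exporter).
scrape_configs:
  - job_name: 'redis-sentinel'
    static_configs:
      - targets:
        - '192.168.2.1:26379'
        - '192.168.2.2:26379'
        - '192.168.2.3:26379'

# Alerting rule (Prometheus rule syntax)
# The metric name and label depend on the exporter in use; verify them
# before relying on this expression.
- alert: RedisSentinelMasterDown
  expr: redis_sentinel_master_status{status="odown"} == 1
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Redis Master is down"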

6. Client Configuration

6.1 Java (Jedis)

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPoolConfig;
import redis.clients.jedis.JedisSentinelPool;

import java.util.HashSet;
import java.util.Set;
public class RedisSentinelExample {
    public static void main(String[] args) {
        // Sentinel addresses
        Set<String> sentinels = new HashSet<>();
        sentinels.add("192.168.2.1:26379");
        sentinels.add("192.168.2.2:26379");
        sentinels.add("192.168.2.3:26379");
        
        // Connection pool settings
        JedisPoolConfig poolConfig = new JedisPoolConfig();
        poolConfig.setMaxTotal(10);
        poolConfig.setMaxIdle(5);
        poolConfig.setMinIdle(1);
        
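        // Create the pool; it asks the Sentinels for the current master address
        JedisSentinelPool pool = new JedisSentinelPool(
            "mymaster",           // master name
            sentinels,            // Sentinel addresses
            poolConfig,           // pool settings
            "your_password"       // password
        );
        
        // Borrow a connection (always points at the current master)
        try (Jedis jedis = pool.getResource()) {
            jedis.set("key", "value");
            System.out.println(jedis.get("key"));
        } finally {
            // Release the pool's resources on exit
            pool.close();
        }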
    }
}

6.2 Python (redis-py)

from redis.sentinel import Sentinel

# Sentinel addresses. Extra keyword arguments such as password are used for
# the Redis connections created by master_for / slave_for.
sentinel = Sentinel(
    [('192.168.2.1', 26379),
     ('192.168.2.2', 26379),
     ('192.168.2.3', 26379)],
    socket_timeout=0.1,
    password='your_password'
)

# Connection to the current master
master = sentinel.master_for(
    'mymaster',
    password='your_password',
    socket_timeout=0.1
)

# Connection to a replica (read-only)
slave = sentinel.slave_for(
    'mymaster',
    password='your_password',
    socket_timeout=0.1
)

# Write (master)
master.set('key', 'value')

# Read (replica); replication is asynchronous, so a read issued right after
# the write may not see the new value yet
value = slave.get('key')
print(value)
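
During a failover, clients typically see connection errors for a few seconds until the new master is promoted. A common way to ride this out is a retry wrapper with exponential backoff. The sketch below is generic and does not import redis-py: by default it retries on Python's built-in `ConnectionError` and `TimeoutError`, and with redis-py one would pass `retry_on=(redis.exceptions.ConnectionError, redis.exceptions.TimeoutError)` instead, since those are separate exception classes. The function name and default values are assumptions.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_during_failover(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.2,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying with exponential backoff on connection errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_write() -> str:
        # Simulates two failed attempts while the failover is in progress.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("master unreachable")
        return "OK"

    print(retry_during_failover(flaky_write, base_delay=0.01))
```

With the connection from the example above, a write would be wrapped as `retry_during_failover(lambda: master.set('key', 'value'), retry_on=...)`. The total retry window should comfortably exceed `down-after-milliseconds` plus the time the promotion takes.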

6.3 Go (go-redis)

package main

import (
    "context"
    "fmt"

    "github.com/go-redis/redis/v8"
)

func main() {
    ctx := context.Background()
    
    // Sentinel-aware client
    rdb := redis.NewFailoverClient(&redis.FailoverOptions{
        MasterName:    "mymaster",
        SentinelAddrs: []string{
            "192.168.2.1:26379",
            "192.168.2.2:26379",
            "192.168.2.3:26379",
        },
        Password: "your_password",
    })
    
    // Write
    err := rdb.Set(ctx, "key", "value", 0).Err()
    if err != nil {
        panic(err)
    }
    
    // Read
    val, err := rdb.Get(ctx, "key").Result()
    if err != nil {
        panic(err)
    }
    fmt.Println("key:", val)
}

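7. Best Practices

7.1 Deployment Recommendations

Recommended setup:
- 3 or 5 Sentinels (an odd number)
- Run Sentinels on hosts separate from the Redis nodes
- Spread the nodes across data centers
- Communicate over the internal network

7.2 Configuration Tuning

# Production settings (sentinel.conf)
sentinel monitor mymaster 192.168.1.1 6379 2

# Subjective-down time (3-5 seconds)
sentinel down-after-milliseconds mymaster 5000

# Failover timeout (60-120 seconds)
sentinel failover-timeout mymaster 60000

# Parallel syncs (1-2)
sentinel parallel-syncs mymaster 1

# Minimum replica count (these two lines go in redis.conf on the Redis nodes, not in sentinel.conf)
min-replicas-to-write 1
min-replicas-max-lag 10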

7.3 Failover Test

#!/bin/bash
# test_failover.sh

SENTINEL_HOST="192.168.2.1"
MASTER_NAME="mymaster"

echo "=== Failover test ==="

# 1. Get the current master
echo "Current master:"
redis-cli -h "$SENTINEL_HOST" -p 26379 \
    SENTINEL GET-MASTER-ADDR-BY-NAME "$MASTER_NAME"

# 2. Trigger a manual failover
echo "Triggering failover..."
redis-cli -h "$SENTINEL_HOST" -p 26379 \
    SENTINEL FAILOVER "$MASTER_NAME"

# 3. Wait for it to complete
sleep 10

# 4. Check the new master
echo "New master:"
redis-cli -h "$SENTINEL_HOST" -p 26379 \
    SENTINEL GET-MASTER-ADDR-BY-NAME "$MASTER_NAME"

# 5. Verify data (the reply has the IP on the first line and the port on the second)
echo "Verifying data..."
NEW_MASTER=$(redis-cli -h "$SENTINEL_HOST" -p 26379 \
    SENTINEL GET-MASTER-ADDR-BY-NAME "$MASTER_NAME" | head -1)
redis-cli -h "$NEW_MASTER" -a your_password GET test_key
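
The script above prints the master address before and after the failover but leaves the comparison to the reader. A minimal sketch of automating that check follows. It assumes the raw two-line reply that redis-cli produces when its output is captured (IP on the first line, port on the second); the function names are illustrative.

```python
from typing import Tuple


def parse_master_addr(cli_output: str) -> Tuple[str, int]:
    """Parse the captured reply of SENTINEL GET-MASTER-ADDR-BY-NAME."""
    lines = [line.strip() for line in cli_output.splitlines() if line.strip()]
    if len(lines) != 2:
        raise ValueError(f"expected an IP line and a port line, got {lines!r}")
    return lines[0], int(lines[1])


def master_changed(before_output: str, after_output: str) -> bool:
    """Return True if the master address differs between the two replies."""
    return parse_master_addr(before_output) != parse_master_addr(after_output)


if __name__ == "__main__":
    before = "192.168.1.1\n6379\n"
    after = "192.168.1.2\n6379\n"
    print(parse_master_addr(after))
    print(master_changed(before, after))
```

A test harness could capture the two replies with `subprocess.run(..., capture_output=True, text=True)` and fail the run when `master_changed` returns False.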

8. FAQ

Q1: How many Sentinels should I run?

Recommended: 3 or 5
- An odd number, so that votes cannot tie
- 3 Sentinels tolerate 1 failure
- 5 Sentinels tolerate 2 failures
- quorum = floor(N/2) + 1

Q2: Why does failover fail?

# Check the log
tail -f /var/log/redis/sentinel.log

# Common causes:
# 1. quorum set too high
# 2. Network connectivity problems
# 3. No usable replica to promote
# 4. Wrong password

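Q3: Why is the Sentinel configuration not updated?

# Force Sentinel to rewrite its configuration to disk
redis-cli -p 26379 SENTINEL FLUSHCONFIG

# Check the config file permissions (the user running Sentinel must be able to write the file)
ls -la /etc/redis/sentinel.conf
chmod 640 /etc/redis/sentinel.conf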

Q4: Why can't clients connect?

# Check Sentinel status
redis-cli -p 26379 SENTINEL MASTER mymaster

# Check passwords
# Make sure masterauth and requirepass match

# Check network connectivity
telnet 192.168.2.1 26379

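Summary

Key points for Redis Sentinel:

Item	Recommended	Notes
Sentinel count	3 or 5	Odd number, avoids tied votes
quorum	2 or 3	Sentinels needed to confirm a failure
down-after-milliseconds	5000 ms	Subjective-down time
failover-timeout	60 s	Failover timeout

Best practices:

- Deploy 3 or 5 Sentinels
- Run Sentinels separately from the Redis nodes
- Set sensible timeouts
- Configure alert notifications
- Test failover regularly
- Configure min-replicas-to-write to limit the impact of split-brain

With Sentinel in place, you have the basis for a highly available Redis architecture.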
