## Introduction
After moving to the cloud, depending on a single provider exposes an enterprise to several risks:
- A provider outage takes the business down with it
- Vendor lock-in weakens bargaining power
- Compliance rules may require data residency
- Cost-optimization options are limited
A multi-cloud architecture deploys the application across multiple cloud providers to achieve high availability, risk diversification, and cost optimization.
## 1. Multi-Cloud Architecture Patterns
### Pattern Overview

```mermaid
graph TB
    subgraph ActiveActive["Active-Active"]
        A1[Cloud A cluster] <--> LB1[Global load balancer]
        A2[Cloud B cluster] <--> LB1
        LB1 --> Users[Users]
        A1 <-->|Data sync| A2
    end
    subgraph ActivePassive["Active-Passive"]
        P1[Primary cluster] --> P2[Standby cluster]
        P1 --> LB2[Load balancer]
        P2 -.->|Hot standby| LB2
        LB2 --> Users2[Users]
    end
```
### Pattern Comparison
| Pattern | Pros | Cons | Best for |
|---|---|---|---|
| Active-active | High resource utilization, low latency | Complex data consistency | Global services |
| Active-passive | Simple architecture, low cost | Idle standby resources | Disaster recovery |
| Per-region deployment | Compliance, low latency | Complex operations | Multinational services |
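The operational difference between the two main patterns can be sketched in a few lines of Python (illustrative only; the cluster names and health flags are hypothetical):

```python
def active_passive(primary_healthy: bool) -> list[str]:
    """All traffic goes to the primary; the standby only serves after failover."""
    return ["primary"] if primary_healthy else ["standby"]

def active_active(healthy: dict[str, bool]) -> list[str]:
    """Every healthy cluster serves traffic at the same time."""
    return [name for name, ok in healthy.items() if ok]

# Active-passive: the standby sits idle in normal operation (the "idle
# resources" drawback) and only takes traffic after the primary fails.
assert active_passive(True) == ["primary"]
assert active_passive(False) == ["standby"]

# Active-active: both clusters serve, so losing one degrades capacity
# instead of triggering a full switchover.
assert active_active({"cloud-a": True, "cloud-b": True}) == ["cloud-a", "cloud-b"]
assert active_active({"cloud-a": False, "cloud-b": True}) == ["cloud-b"]
```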
## 2. DNS Traffic Scheduling
### 1. Route 53 Configuration (AWS)

```bash
# Create a health check
aws route53 create-health-check \
  --caller-reference "app-health-check" \
  --health-check-config \
Type=HTTPS,\
FullyQualifiedDomainName=app.example.com,\
Port=443,\
ResourcePath=/health,\
RequestInterval=30,\
FailureThreshold=3

# Create weighted routing records (70/30 split across the two clouds)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "aws-cn",
        "Weight": 70,
        "TTL": 60,
        "HealthCheckId": "health-check-id",
        "ResourceRecords": [{"Value": "1.2.3.4"}]
      }
    }, {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "aliyun-cn",
        "Weight": 30,
        "TTL": 60,
        "HealthCheckId": "health-check-id-2",
        "ResourceRecords": [{"Value": "5.6.7.8"}]
      }
    }]
  }'
```
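A quick way to sanity-check the 70/30 split is to simulate weighted resolution. The sketch below is not the Route 53 algorithm itself, just generic weight-proportional selection:

```python
import random

# Mirrors the two weighted record sets above
RECORDS = [
    {"value": "1.2.3.4", "weight": 70},  # AWS
    {"value": "5.6.7.8", "weight": 30},  # Alibaba Cloud
]

def resolve(records, rng):
    """Return one record, chosen with probability proportional to its weight."""
    roll = rng.uniform(0, sum(r["weight"] for r in records))
    for record in records:
        roll -= record["weight"]
        if roll <= 0:
            return record["value"]
    return records[-1]["value"]

rng = random.Random(42)
answers = [resolve(RECORDS, rng) for _ in range(10_000)]
aws_share = answers.count("1.2.3.4") / len(answers)
assert 0.65 < aws_share < 0.75  # roughly 70% of answers point at AWS
```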
### 2. Alibaba Cloud DNS Configuration

```yaml
# DNS record configuration
dns:
  - domain: app.example.com
    records:
      - type: A
        line: default
        value: 1.2.3.4   # AWS
        weight: 70
        status: enable
      - type: A
        line: default
        value: 5.6.7.8   # Alibaba Cloud
        weight: 30
        status: enable
# Health check
health_check:
  interval: 60s
  timeout: 10s
  threshold: 3
  notify: true
```
### 3. Global Load Balancer

```nginx
# nginx upstream configuration
upstream backend {
    least_conn;
    # AWS cluster
    server aws-app.example.com weight=7 max_fails=3 fail_timeout=30s;
    # Alibaba Cloud cluster
    server aliyun-app.example.com weight=3 max_fails=3 fail_timeout=30s;
    # Active health checks (requires the nginx_upstream_check_module,
    # bundled with Tengine; stock nginx only does passive max_fails checks)
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "HEAD /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}

server {
    listen 80;
    server_name app.example.com;
    location / {
        proxy_pass http://backend;
        proxy_connect_timeout 10s;
        proxy_read_timeout 30s;
    }
}
```
## 3. Data Synchronization
### 1. MySQL Primary/Replica Replication

```ini
# AWS MySQL (primary): my.cnf
[mysqld]
server-id = 1
log-bin = mysql-bin
binlog-format = ROW
gtid-mode = ON
enforce-gtid-consistency = ON
```

```sql
-- On the primary: create the replication user
CREATE USER 'repl'@'%' IDENTIFIED BY 'password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
```

```ini
# Alibaba Cloud MySQL (replica): my.cnf
# GTID must be enabled here too for MASTER_AUTO_POSITION to work
[mysqld]
server-id = 2
read-only = ON
relay-log = relay-bin
gtid-mode = ON
enforce-gtid-consistency = ON
```

```sql
-- On the replica: configure GTID-based replication
CHANGE MASTER TO
  MASTER_HOST='aws-mysql.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='password',
  MASTER_AUTO_POSITION=1;
START SLAVE;

-- Check replication status
SHOW SLAVE STATUS\G
```
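Replica lag is what ultimately bounds the achievable RPO. A minimal monitoring sketch (the 300-second budget matches the 5-minute RPO target in the disaster-recovery section; the function name is hypothetical):

```python
def rpo_at_risk(seconds_behind_master, rpo_seconds=300):
    """Compare replica lag (Seconds_Behind_Master from SHOW SLAVE STATUS)
    against the RPO budget. MySQL reports NULL (None here) when the
    replication threads are stopped, which is also an at-risk state."""
    if seconds_behind_master is None:
        return True
    return seconds_behind_master > rpo_seconds

assert rpo_at_risk(12) is False    # healthy replica
assert rpo_at_risk(600) is True    # lag already exceeds the 5-minute RPO
assert rpo_at_risk(None) is True   # replication stopped entirely
```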
### 2. Cross-Cloud Redis Sync

```yaml
# Sync with RedisShake
# redis-shake.conf
source:
  address: "aws-redis:6379"
  password: "aws-password"
  type: "redis"
target:
  address: "aliyun-redis:6379"
  password: "aliyun-password"
  type: "redis"
advanced:
  rdb_input_channel: 10
  rewrite: true
```
### 3. Object Storage Sync

```ini
# Sync S3 and OSS with rclone
# .rclone.conf
[aws-s3]
type = s3
provider = AWS
access_key_id = AKIAIOSFODNN7EXAMPLE
secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
region = ap-northeast-1

[aliyun-oss]
type = s3
provider = Alibaba
access_key_id = LTAI5t...
secret_access_key = ...
endpoint = oss-ap-southeast-1.aliyuncs.com
```

```bash
# Sync command
rclone sync aws-s3:my-bucket aliyun-oss:my-bucket \
  --transfers=16 \
  --checkers=32 \
  --bwlimit=10M \
  --progress
```
## 4. Application-Layer Active-Active
### 1. Configuration Center

```yaml
# Apollo configuration center, active in both clouds
app:
  id: user-service
apollo:
  meta: http://aws-apollo:8080,http://aliyun-apollo:8080
  cluster: default
# Active-active settings
loadBalance:
  enable: true
  strategy: weight
# Local fallback cache
backup:
  enable: true
  path: /opt/data/apollo/backup
```
### 2. Message Queue

```java
// RocketMQ active-active configuration
@Configuration
public class RocketMQConfig {

    @Bean
    public DefaultMQProducer producer() throws MQClientException {
        DefaultMQProducer producer = new DefaultMQProducer("user-service");
        // Register NameServers from both clouds
        producer.setNamesrvAddr(
            "aws-nameserver:9876;aliyun-nameserver:9876"
        );
        // Retry configuration
        producer.setRetryTimesWhenSendFailed(3);
        producer.setRetryTimesWhenSendAsyncFailed(3);
        producer.start();
        return producer;
    }

    @Bean
    public DefaultLitePullConsumer consumer() throws MQClientException {
        DefaultLitePullConsumer consumer = new DefaultLitePullConsumer("user-group");
        consumer.setNamesrvAddr(
            "aws-nameserver:9876;aliyun-nameserver:9876"
        );
        consumer.subscribe("user-topic", "*");
        consumer.start();
        return consumer;
    }
}
```
### 3. Distributed Transactions

```java
// Seata active-active configuration
@Configuration
public class SeataConfig {

    @Bean
    public GlobalTransactionScanner globalTransactionScanner() {
        Configuration configuration = ConfigurationFactory.getInstance();
        // vgroupMapping maps the transaction group to a TC cluster name;
        // the cluster's grouplist then enumerates TC nodes in both clouds
        configuration.putConfig("service.vgroupMapping.user-tx-group", "multi-cloud");
        configuration.putConfig("service.multi-cloud.grouplist", "aws-tc:8091;aliyun-tc:8091");
        return new GlobalTransactionScanner("user-service", "user-tx-group");
    }
}
```
## 5. Disaster Recovery
### 1. RTO/RPO Targets
| Metric | Definition | Target |
|---|---|---|
| RTO | Recovery Time Objective | < 30 minutes |
| RPO | Recovery Point Objective | < 5 minutes |
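A failover drill only counts as a pass when both objectives are met simultaneously; a small checker built on the targets above (the function and variable names are hypothetical):

```python
RTO_TARGET_MIN = 30  # Recovery Time Objective, minutes
RPO_TARGET_MIN = 5   # Recovery Point Objective, minutes

def drill_passes(measured_rto_min, measured_rpo_min):
    """Both objectives must hold: recovering fast with stale data fails
    RPO, and restoring fresh data too slowly fails RTO."""
    return measured_rto_min < RTO_TARGET_MIN and measured_rpo_min < RPO_TARGET_MIN

assert drill_passes(12, 2) is True    # within both budgets
assert drill_passes(45, 2) is False   # recovered too slowly
assert drill_passes(12, 8) is False   # lost too much data
```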
### 2. Failover Flow

```mermaid
sequenceDiagram
    participant M as Monitoring
    participant D as DNS
    participant A as AWS cluster
    participant L as Alibaba Cloud cluster
    A->>M: Health check fails
    M->>D: Trigger DNS failover
    D->>D: Update DNS records
    D->>L: Traffic shifts to Alibaba Cloud
    L->>L: Promote replica to primary
    L->>M: Failover-complete alert
```
### 3. Failover Script

```bash
#!/bin/bash
# failover.sh - disaster-recovery switchover script
# Expects HOSTED_ZONE_ID and ALERT_WEBHOOK to be set in the environment
set -e

CLOUD_PROVIDER=$1   # aws or aliyun
ACTION=$2           # switch or rollback

if [ "$ACTION" == "switch" ]; then
    echo "Starting failover..."
    # 1. Update DNS
    if [ "$CLOUD_PROVIDER" == "aliyun" ]; then
        # Switch to Alibaba Cloud
        aws route53 change-resource-record-sets \
            --hosted-zone-id "$HOSTED_ZONE_ID" \
            --change-batch file://dns-switch-aliyun.json
        # 2. Promote the replica database
        mysql -h aliyun-mysql -e "STOP SLAVE; RESET SLAVE ALL;"
        # 3. Update the configuration center
        curl -X POST http://apollo/admin/config/switch \
            -d '{"activeCloud":"aliyun"}'
    else
        # Switch to AWS
        # ... analogous steps
        :
    fi
    # 4. Send an alert
    curl -X POST "$ALERT_WEBHOOK" \
        -d "{\"text\":\"Switched to $CLOUD_PROVIDER\"}"
    echo "Failover complete"
elif [ "$ACTION" == "rollback" ]; then
    echo "Starting rollback..."
    # Rollback logic
fi
```
## 6. Cost Optimization
### 1. Multi-Cloud Cost Comparison
| Resource | AWS | Alibaba Cloud | Strategy |
|---|---|---|---|
| EC2/ECS | $0.10/hour | $0.08/hour | Prefer Alibaba Cloud |
| RDS | $0.20/hour | $0.15/hour | Prefer Alibaba Cloud |
| Traffic | $0.09/GB | $0.07/GB | Serve from the nearest cloud |
| Storage | $0.023/GB | $0.02/GB | Hot data on AWS, cold data on Alibaba Cloud |
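The table's hourly rates translate into monthly savings in a straightforward way. A rough estimate, assuming a 730-hour month and a hypothetical fleet of 10 compute nodes and 2 database nodes:

```python
HOURS_PER_MONTH = 730  # average month; an assumption for this estimate

# $/hour rates from the comparison table above
RATES = {
    "aws":    {"compute": 0.10, "rds": 0.20},
    "aliyun": {"compute": 0.08, "rds": 0.15},
}

def monthly_cost(cloud, compute_nodes, db_nodes):
    """Rough monthly bill for a fleet of compute and database nodes."""
    r = RATES[cloud]
    return HOURS_PER_MONTH * (compute_nodes * r["compute"] + db_nodes * r["rds"])

aws = monthly_cost("aws", compute_nodes=10, db_nodes=2)
aliyun = monthly_cost("aliyun", compute_nodes=10, db_nodes=2)
assert abs(aws - 1022.0) < 1e-6     # 730 * (10*0.10 + 2*0.20)
assert abs(aliyun - 803.0) < 1e-6   # 730 * (10*0.08 + 2*0.15)
assert round((aws - aliyun) / aws, 2) == 0.21  # ~21% cheaper on Alibaba Cloud
```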
### 2. Autoscaling Policy

```yaml
# HPA configuration (applied per cluster in each cloud)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  # Scaling behavior
  behavior:
    scaleUp:
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
```
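The scaling decision itself follows the standard HPA formula, desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), clamped to the bounds in the manifest; a sketch:

```python
import math

def desired_replicas(current, current_util, target_util=70,
                     min_replicas=3, max_replicas=20):
    """Core HPA formula, using the target and bounds from the manifest above."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

assert desired_replicas(5, 140) == 10   # CPU at double the target: double the pods
assert desired_replicas(5, 35) == 3     # scale-down clamped at minReplicas
assert desired_replicas(10, 700) == 20  # spike clamped at maxReplicas
```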
## 7. Monitoring and Alerting
### 1. Unified Monitoring Platform

```yaml
# Global Prometheus aggregation instance
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: global-prometheus
spec:
  version: v2.40.0
  replicas: 2
  # Accept metrics pushed by the per-cloud Prometheus instances
  # (each of them sets remote_write to
  #  http://global-prometheus:9090/api/v1/write)
  enableRemoteWriteReceiver: true
  # Global view
  externalLabels:
    cluster: global
    environment: production
```
### 2. Alerting Rules

```yaml
# Multi-cloud alerting rules
groups:
  - name: multi-cloud-alerts
    rules:
      - alert: CloudProviderDown
        expr: up{job="aws-app"} == 0 or up{job="aliyun-app"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Cloud provider outage"
          description: "{{ $labels.job }} has been down for 5 minutes"
      - alert: CrossCloudLatencyHigh
        # PromQL has no unit literals; assumes the metric is in seconds
        expr: cross_cloud_latency_seconds > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cross-cloud latency too high"
          description: "Cross-cloud latency is {{ $value }}s"
      - alert: DataSyncLag
        expr: data_sync_lag_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Data sync lagging"
          description: "Sync lag is {{ $value }} seconds"
```
## 8. Summary
### Benefits of Multi-Cloud
- ✅ High availability: no single point of failure
- ✅ Risk diversification: reduced exposure to any one provider
- ✅ Cost optimization: exploit each provider's strengths
- ✅ Compliance: meets data-residency requirements
### Implementation Advice
1. Start with active-passive
   - Build disaster-recovery capability first
   - Move toward active-active incrementally
2. Standardize the technology stack
   - Containerize on Kubernetes
   - Keep middleware versions consistent
3. Automate operations
   - Infrastructure as Code (Terraform)
   - Cross-cloud CI/CD deployment
4. Drill regularly
   - Failover drills
   - Data-recovery tests

References:
- AWS multi-region architecture
- Alibaba Cloud hybrid-cloud solutions
- "Cloud-Native Architecture Design"
- CNCF multi-cloud white paper