## Introduction
After moving to the cloud, depending on a single provider exposes an enterprise to several risks:
- A provider outage takes the business down with it
- Vendor lock-in weakens bargaining power
- Compliance rules may require data residency
- Cost-optimization options are limited
A multi-cloud architecture deploys the application across multiple cloud providers to achieve high availability, risk diversification, and cost optimization.
## 1. Multi-Cloud Architecture Patterns
### Pattern Overview

```mermaid
graph TB
    subgraph ActiveActive["Active-Active"]
        A1[Cloud A cluster] <--> LB1[Global load balancer]
        A2[Cloud B cluster] <--> LB1
        LB1 --> Users[Users]
        A1 <-->|Data sync| A2
    end
    subgraph ActivePassive["Active-Passive"]
        P1[Primary cluster] --> P2[Standby cluster]
        P1 --> LB2[Load balancer]
        P2 -.->|Hot standby| LB2
        LB2 --> Users2[Users]
    end
```
### Pattern Comparison
| Pattern | Pros | Cons | Best for |
|---|---|---|---|
| Active-active | High resource utilization, low latency | Complex data consistency | Global services |
| Active-passive | Simple architecture, low cost | Idle standby resources | Disaster recovery |
| Per-region deployment | Compliance, low latency | Complex operations | Multinational services |
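The operational difference between the two main patterns can be sketched in a few lines of Python (illustrative only; the cluster names and health flags are hypothetical):

```python
def active_passive(primary_healthy: bool) -> list[str]:
    """All traffic goes to the primary; the standby only serves after failover."""
    return ["primary"] if primary_healthy else ["standby"]

def active_active(healthy: dict[str, bool]) -> list[str]:
    """Every healthy cluster serves traffic at the same time."""
    return [name for name, ok in healthy.items() if ok]

# Active-passive: the standby sits idle in normal operation (the "idle
# resources" drawback) and only takes traffic after the primary fails.
assert active_passive(True) == ["primary"]
assert active_passive(False) == ["standby"]

# Active-active: both clusters serve, so losing one degrades capacity
# instead of triggering a full switchover.
assert active_active({"cloud-a": True, "cloud-b": True}) == ["cloud-a", "cloud-b"]
assert active_active({"cloud-a": False, "cloud-b": True}) == ["cloud-b"]
```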
## 2. DNS Traffic Scheduling
### 1. Route 53 Configuration (AWS)

```bash
# Create a health check
aws route53 create-health-check \
  --caller-reference "app-health-check" \
  --health-check-config \
Type=HTTPS,\
FullyQualifiedDomainName=app.example.com,\
Port=443,\
ResourcePath=/health,\
RequestInterval=30,\
FailureThreshold=3

# Create weighted routing records (70/30 split across the two clouds)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "aws-cn",
        "Weight": 70,
        "TTL": 60,
        "HealthCheckId": "health-check-id",
        "ResourceRecords": [{"Value": "1.2.3.4"}]
      }
    }, {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "aliyun-cn",
        "Weight": 30,
        "TTL": 60,
        "HealthCheckId": "health-check-id-2",
        "ResourceRecords": [{"Value": "5.6.7.8"}]
      }
    }]
  }'
```
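A quick way to sanity-check the 70/30 split is to simulate weighted resolution. The sketch below is not the Route 53 algorithm itself, just generic weight-proportional selection:

```python
import random

# Mirrors the two weighted record sets above
RECORDS = [
    {"value": "1.2.3.4", "weight": 70},  # AWS
    {"value": "5.6.7.8", "weight": 30},  # Alibaba Cloud
]

def resolve(records, rng):
    """Return one record, chosen with probability proportional to its weight."""
    roll = rng.uniform(0, sum(r["weight"] for r in records))
    for record in records:
        roll -= record["weight"]
        if roll <= 0:
            return record["value"]
    return records[-1]["value"]

rng = random.Random(42)
answers = [resolve(RECORDS, rng) for _ in range(10_000)]
aws_share = answers.count("1.2.3.4") / len(answers)
assert 0.65 < aws_share < 0.75  # roughly 70% of answers point at AWS
```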
### 2. Alibaba Cloud DNS Configuration

```yaml
# DNS record configuration
dns:
  - domain: app.example.com
    records:
      - type: A
        line: default
        value: 1.2.3.4   # AWS
        weight: 70
        status: enable
      - type: A
        line: default
        value: 5.6.7.8   # Alibaba Cloud
        weight: 30
        status: enable
# Health check
health_check:
  interval: 60s
  timeout: 10s
  threshold: 3
  notify: true
```
### 3. Global Load Balancer

```nginx
# nginx upstream configuration
upstream backend {
    least_conn;
    # AWS cluster
    server aws-app.example.com weight=7 max_fails=3 fail_timeout=30s;
    # Alibaba Cloud cluster
    server aliyun-app.example.com weight=3 max_fails=3 fail_timeout=30s;
    # Active health checks (requires the nginx_upstream_check_module,
    # bundled with Tengine; stock nginx only does passive max_fails checks)
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "HEAD /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}

server {
    listen 80;
    server_name app.example.com;
    location / {
        proxy_pass http://backend;
        proxy_connect_timeout 10s;
        proxy_read_timeout 30s;
    }
}
```
## 3. Data Synchronization
### 1. MySQL Primary/Replica Replication

```ini
# AWS MySQL (primary): my.cnf
[mysqld]
server-id = 1
log-bin = mysql-bin
binlog-format = ROW
gtid-mode = ON
enforce-gtid-consistency = ON
```

```sql
-- On the primary: create the replication user
CREATE USER 'repl'@'%' IDENTIFIED BY 'password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
```

```ini
# Alibaba Cloud MySQL (replica): my.cnf
# GTID must be enabled here too for MASTER_AUTO_POSITION to work
[mysqld]
server-id = 2
read-only = ON
relay-log = relay-bin
gtid-mode = ON
enforce-gtid-consistency = ON
```

```sql
-- On the replica: configure GTID-based replication
CHANGE MASTER TO
  MASTER_HOST='aws-mysql.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='password',
  MASTER_AUTO_POSITION=1;
START SLAVE;

-- Check replication status
SHOW SLAVE STATUS\G
```
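Replica lag is what ultimately bounds the achievable RPO. A minimal monitoring sketch (the 300-second budget matches the 5-minute RPO target in the disaster-recovery section; the function name is hypothetical):

```python
def rpo_at_risk(seconds_behind_master, rpo_seconds=300):
    """Compare replica lag (Seconds_Behind_Master from SHOW SLAVE STATUS)
    against the RPO budget. MySQL reports NULL (None here) when the
    replication threads are stopped, which is also an at-risk state."""
    if seconds_behind_master is None:
        return True
    return seconds_behind_master > rpo_seconds

assert rpo_at_risk(12) is False    # healthy replica
assert rpo_at_risk(600) is True    # lag already exceeds the 5-minute RPO
assert rpo_at_risk(None) is True   # replication stopped entirely
```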
### 2. Cross-Cloud Redis Sync

```yaml
# Sync with RedisShake
# redis-shake.conf
source:
  address: "aws-redis:6379"
  password: "aws-password"
  type: "redis"
target:
  address: "aliyun-redis:6379"
  password: "aliyun-password"
  type: "redis"
advanced:
  rdb_input_channel: 10
  rewrite: true
```
### 3. Object Storage Sync

```ini
# Sync S3 and OSS with rclone
# .rclone.conf
[aws-s3]
type = s3
provider = AWS
access_key_id = AKIAIOSFODNN7EXAMPLE
secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
region = ap-northeast-1

[aliyun-oss]
type = s3
provider = Alibaba
access_key_id = LTAI5t...
secret_access_key = ...
endpoint = oss-ap-southeast-1.aliyuncs.com
```

```bash
# Sync command
rclone sync aws-s3:my-bucket aliyun-oss:my-bucket \
  --transfers=16 \
  --checkers=32 \
  --bwlimit=10M \
  --progress
```
## 4. Application-Layer Active-Active
### 1. Configuration Center

```yaml
# Apollo configuration center, active in both clouds
app:
  id: user-service
apollo:
  meta: http://aws-apollo:8080,http://aliyun-apollo:8080
  cluster: default
# Active-active settings
loadBalance:
  enable: true
  strategy: weight
# Local fallback cache
backup:
  enable: true
  path: /opt/data/apollo/backup
```
### 2. Message Queue

```java
// RocketMQ active-active configuration
@Configuration
public class RocketMQConfig {

    @Bean
    public DefaultMQProducer producer() throws MQClientException {
        DefaultMQProducer producer = new DefaultMQProducer("user-service");
        // Register NameServers from both clouds
        producer.setNamesrvAddr(
            "aws-nameserver:9876;aliyun-nameserver:9876"
        );
        // Retry configuration
        producer.setRetryTimesWhenSendFailed(3);
        producer.setRetryTimesWhenSendAsyncFailed(3);
        producer.start();
        return producer;
    }

    @Bean
    public DefaultLitePullConsumer consumer() throws MQClientException {
        DefaultLitePullConsumer consumer = new DefaultLitePullConsumer("user-group");
        consumer.setNamesrvAddr(
            "aws-nameserver:9876;aliyun-nameserver:9876"
        );
        consumer.subscribe("user-topic", "*");
        consumer.start();
        return consumer;
    }
}
```
### 3. Distributed Transactions

```java
// Seata active-active configuration
@Configuration
public class SeataConfig {

    @Bean
    public GlobalTransactionScanner globalTransactionScanner() {
        Configuration configuration = ConfigurationFactory.getInstance();
        // vgroupMapping maps the transaction group to a TC cluster name;
        // the cluster's grouplist then enumerates TC nodes in both clouds
        configuration.putConfig("service.vgroupMapping.user-tx-group", "multi-cloud");
        configuration.putConfig("service.multi-cloud.grouplist", "aws-tc:8091;aliyun-tc:8091");
        return new GlobalTransactionScanner("user-service", "user-tx-group");
    }
}
```
## 5. Disaster Recovery
### 1. RTO/RPO Targets
| Metric | Definition | Target |
|---|---|---|
| RTO | Recovery Time Objective | < 30 minutes |
| RPO | Recovery Point Objective | < 5 minutes |
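A failover drill only counts as a pass when both objectives are met simultaneously; a small checker built on the targets above (the function and variable names are hypothetical):

```python
RTO_TARGET_MIN = 30  # Recovery Time Objective, minutes
RPO_TARGET_MIN = 5   # Recovery Point Objective, minutes

def drill_passes(measured_rto_min, measured_rpo_min):
    """Both objectives must hold: recovering fast with stale data fails
    RPO, and restoring fresh data too slowly fails RTO."""
    return measured_rto_min < RTO_TARGET_MIN and measured_rpo_min < RPO_TARGET_MIN

assert drill_passes(12, 2) is True    # within both budgets
assert drill_passes(45, 2) is False   # recovered too slowly
assert drill_passes(12, 8) is False   # lost too much data
```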
### 2. Failover Flow

```mermaid
sequenceDiagram
    participant M as Monitoring
    participant D as DNS
    participant A as AWS cluster
    participant L as Alibaba Cloud cluster
    A->>M: Health check fails
    M->>D: Trigger DNS failover
    D->>D: Update DNS records
    D->>L: Traffic shifts to Alibaba Cloud
    L->>L: Promote replica to primary
    L->>M: Failover-complete alert
```
### 3. Failover Script

```bash
#!/bin/bash
# failover.sh - disaster-recovery switchover script
# Expects HOSTED_ZONE_ID and ALERT_WEBHOOK to be set in the environment
set -e

CLOUD_PROVIDER=$1   # aws or aliyun
ACTION=$2           # switch or rollback

if [ "$ACTION" == "switch" ]; then
    echo "Starting failover..."
    # 1. Update DNS
    if [ "$CLOUD_PROVIDER" == "aliyun" ]; then
        # Switch to Alibaba Cloud
        aws route53 change-resource-record-sets \
            --hosted-zone-id "$HOSTED_ZONE_ID" \
            --change-batch file://dns-switch-aliyun.json
        # 2. Promote the replica database
        mysql -h aliyun-mysql -e "STOP SLAVE; RESET SLAVE ALL;"
        # 3. Update the configuration center
        curl -X POST http://apollo/admin/config/switch \
            -d '{"activeCloud":"aliyun"}'
    else
        # Switch to AWS
        # ... analogous steps
        :
    fi
    # 4. Send an alert
    curl -X POST "$ALERT_WEBHOOK" \
        -d "{\"text\":\"Switched to $CLOUD_PROVIDER\"}"
    echo "Failover complete"
elif [ "$ACTION" == "rollback" ]; then
    echo "Starting rollback..."
    # Rollback logic
fi
```
## 6. Cost Optimization
### 1. Multi-Cloud Cost Comparison
| Resource | AWS | Alibaba Cloud | Strategy |
|---|---|---|---|
| EC2/ECS | $0.10/hour | $0.08/hour | Prefer Alibaba Cloud |
| RDS | $0.20/hour | $0.15/hour | Prefer Alibaba Cloud |
| Traffic | $0.09/GB | $0.07/GB | Serve from the nearest cloud |
| Storage | $0.023/GB | $0.02/GB | Hot data on AWS, cold data on Alibaba Cloud |
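The table's hourly rates translate into monthly savings in a straightforward way. A rough estimate, assuming a 730-hour month and a hypothetical fleet of 10 compute nodes and 2 database nodes:

```python
HOURS_PER_MONTH = 730  # average month; an assumption for this estimate

# $/hour rates from the comparison table above
RATES = {
    "aws":    {"compute": 0.10, "rds": 0.20},
    "aliyun": {"compute": 0.08, "rds": 0.15},
}

def monthly_cost(cloud, compute_nodes, db_nodes):
    """Rough monthly bill for a fleet of compute and database nodes."""
    r = RATES[cloud]
    return HOURS_PER_MONTH * (compute_nodes * r["compute"] + db_nodes * r["rds"])

aws = monthly_cost("aws", compute_nodes=10, db_nodes=2)
aliyun = monthly_cost("aliyun", compute_nodes=10, db_nodes=2)
assert abs(aws - 1022.0) < 1e-6     # 730 * (10*0.10 + 2*0.20)
assert abs(aliyun - 803.0) < 1e-6   # 730 * (10*0.08 + 2*0.15)
assert round((aws - aliyun) / aws, 2) == 0.21  # ~21% cheaper on Alibaba Cloud
```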
### 2. Autoscaling Policy

```yaml
# HPA configuration (applied per cluster in each cloud)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  # Scaling behavior
  behavior:
    scaleUp:
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
```
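The scaling decision itself follows the standard HPA formula, desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), clamped to the bounds in the manifest; a sketch:

```python
import math

def desired_replicas(current, current_util, target_util=70,
                     min_replicas=3, max_replicas=20):
    """Core HPA formula, using the target and bounds from the manifest above."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

assert desired_replicas(5, 140) == 10   # CPU at double the target: double the pods
assert desired_replicas(5, 35) == 3     # scale-down clamped at minReplicas
assert desired_replicas(10, 700) == 20  # spike clamped at maxReplicas
```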
## 7. Monitoring and Alerting
### 1. Unified Monitoring Platform

```yaml
# Global Prometheus aggregation instance
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: global-prometheus
spec:
  version: v2.40.0
  replicas: 2
  # Accept metrics pushed by the per-cloud Prometheus instances
  # (each of them sets remote_write to
  #  http://global-prometheus:9090/api/v1/write)
  enableRemoteWriteReceiver: true
  # Global view
  externalLabels:
    cluster: global
    environment: production
```
### 2. Alerting Rules

```yaml
# Multi-cloud alerting rules
groups:
  - name: multi-cloud-alerts
    rules:
      - alert: CloudProviderDown
        expr: up{job="aws-app"} == 0 or up{job="aliyun-app"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Cloud provider outage"
          description: "{{ $labels.job }} has been down for 5 minutes"
      - alert: CrossCloudLatencyHigh
        # PromQL has no unit literals; assumes the metric is in seconds
        expr: cross_cloud_latency_seconds > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cross-cloud latency too high"
          description: "Cross-cloud latency is {{ $value }}s"
      - alert: DataSyncLag
        expr: data_sync_lag_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Data sync lagging"
          description: "Sync lag is {{ $value }} seconds"
```
## 8. Summary
### Benefits of Multi-Cloud
- ✅ High availability: no single point of failure
- ✅ Risk diversification: reduced exposure to any one provider
- ✅ Cost optimization: exploit each provider's strengths
- ✅ Compliance: meets data-residency requirements
### Implementation Advice
1. Start with active-passive
   - Build disaster-recovery capability first
   - Move toward active-active incrementally
2. Standardize the technology stack
   - Containerize on Kubernetes
   - Keep middleware versions consistent
3. Automate operations
   - Infrastructure as Code (Terraform)
   - Cross-cloud CI/CD deployment
4. Drill regularly
   - Failover drills
   - Data-recovery tests

References:
- AWS multi-region architecture
- Alibaba Cloud hybrid-cloud solutions
- "Cloud-Native Architecture Design"
- CNCF multi-cloud white paper