Introduction
When managing enterprise-grade AWS infrastructure across multiple regions with dozens of clients connecting via VPNs, Direct Connect, and Transit Gateway attachments, visibility becomes critical. You need to answer questions like:
- Who are the top talkers on each client's connection?
- Is traffic being rejected anywhere?
- What's the latency between client sites and AWS?
- How much bandwidth is each client consuming?
This post walks through how I built automated, per-client CloudWatch dashboards using Terraform that query VPC Flow Logs to provide real-time network observability. The approach scales to hundreds of clients without manual dashboard creation.
Architecture Overview
flowchart TB
A["VPC Flow Logs"] --> B["CloudWatch Logs"]
B --> C["Insights Queries"]
C --> D["CloudWatch Dashboards"]
E["Terraform State"] --> F["Dynamic Dashboard Generation"]
F --> D
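The Insights layer is where the questions from the introduction get answered. As a concrete example, here is a sketch of a saved "top talkers" query, assuming the /aws/tgw/flow-logs log group created in the next section (the query name is arbitrary):

# Saved Logs Insights query: top talkers by bytes transferred.
# VPC flow log fields such as srcAddr, dstAddr, and bytes are
# auto-discovered by CloudWatch Logs Insights.
resource "aws_cloudwatch_query_definition" "top_talkers" {
  name            = "network/top-talkers"
  log_group_names = ["/aws/tgw/flow-logs"]

  query_string = <<-EOT
    stats sum(bytes) as totalBytes by srcAddr, dstAddr
    | sort totalBytes desc
    | limit 20
  EOT
}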
1. VPC Flow Logs Setup
First, ensure VPC Flow Logs are enabled on all relevant VPCs and Transit Gateway attachments:
resource "aws_flow_log" "vpc_tgw" {
iam_role_arn = aws_iam_role.flow_logs.arn
log_destination = aws_cloudwatch_log_group.tgw.arn
max_aggregation_interval = 60 # TGW flow logs require 60-second aggregation; traffic_type does not apply to Transit Gateway flow logs
transit_gateway_id = aws_ec2_transit_gateway.main.id
tags = {
Name = "tgw-flow-logs"
}
}
resource "aws_cloudwatch_log_group" "tgw" {
name = "/aws/tgw/flow-logs"
retention_in_days = 30
tags = {
Name = "tgw-flow-logs-group"
}
}
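The flow log resource above assumes an IAM role, aws_iam_role.flow_logs, that the VPC Flow Logs service can assume to write into CloudWatch Logs. A minimal sketch of that role:

# IAM role the VPC Flow Logs service assumes to publish into CloudWatch Logs
resource "aws_iam_role" "flow_logs" {
  name = "flow-logs-role"

  # Allow the VPC Flow Logs service to assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "vpc-flow-logs.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "flow_logs" {
  name = "flow-logs-write"
  role = aws_iam_role.flow_logs.id

  # Permissions to create log streams and publish log events
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ]
      Resource = "*"
    }]
  })
}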
2. Multi-Region Network Architecture
flowchart TB
subgraph Primary["Primary Region - us-east-1"]
VPC1["Production VPC"]
TGW1["Transit Gateway"]
NLB1["Network Load Balancer"]
EKS1["EKS Cluster"]
RDS1["RDS Primary"]
end
subgraph Secondary["Secondary Region - us-west-2"]
VPC2["DR VPC"]
TGW2["Transit Gateway"]
NLB2["Network Load Balancer"]
EKS2["EKS Cluster"]
RDS2["RDS Read Replica"]
end
subgraph Global["Global Services"]
R53["Route 53"]
CF["CloudFront"]
GAcc["Global Accelerator"]
end
R53 --> NLB1
R53 --> NLB2
CF --> NLB1
GAcc --> NLB1
GAcc --> NLB2
TGW1 <--> TGW2
VPC1 --> TGW1
VPC2 --> TGW2
NLB1 --> EKS1
NLB2 --> EKS2
EKS1 --> RDS1
EKS2 --> RDS2
RDS1 -.-> RDS2
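All of the Terraform below uses two provider aliases, aws.primary and aws.secondary, one per region. A minimal sketch of that provider configuration:

# Region-scoped provider aliases used by the resources in this post
provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}

provider "aws" {
  alias  = "secondary"
  region = "us-west-2"
}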
3. Transit Gateway Peering
Cross-region Transit Gateway peering enables private connectivity between regions without traversing the public internet:
# Primary region Transit Gateway
resource "aws_ec2_transit_gateway" "primary" {
provider = aws.primary
description = "Primary region TGW"
default_route_table_association = "disable"
default_route_table_propagation = "disable"
tags = {
Name = "tgw-primary-us-east-1"
}
}
# Secondary region Transit Gateway
resource "aws_ec2_transit_gateway" "secondary" {
provider = aws.secondary
description = "Secondary region TGW"
default_route_table_association = "disable"
default_route_table_propagation = "disable"
tags = {
Name = "tgw-secondary-us-west-2"
}
}
# Peering attachment (initiated from primary)
resource "aws_ec2_transit_gateway_peering_attachment" "cross_region" {
provider = aws.primary
transit_gateway_id = aws_ec2_transit_gateway.primary.id
peer_transit_gateway_id = aws_ec2_transit_gateway.secondary.id
peer_region = "us-west-2"
tags = {
Name = "tgw-peering-east-west"
}
}
# Accept peering in secondary region
resource "aws_ec2_transit_gateway_peering_attachment_accepter" "secondary" {
provider = aws.secondary
transit_gateway_attachment_id = aws_ec2_transit_gateway_peering_attachment.cross_region.id
tags = {
Name = "tgw-peering-accept"
}
}
4. Route Tables and Propagation
# Primary region route table
resource "aws_ec2_transit_gateway_route_table" "primary" {
provider = aws.primary
transit_gateway_id = aws_ec2_transit_gateway.primary.id
tags = {
Name = "tgw-rt-primary"
}
}
# Route to secondary region via peering
resource "aws_ec2_transit_gateway_route" "to_secondary" {
provider = aws.primary
destination_cidr_block = "10.1.0.0/16" # Secondary VPC CIDR
transit_gateway_attachment_id = aws_ec2_transit_gateway_peering_attachment.cross_region.id
transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.primary.id
}
# VPC attachment to TGW
resource "aws_ec2_transit_gateway_vpc_attachment" "primary_vpc" {
provider = aws.primary
subnet_ids = module.vpc_primary.private_subnet_ids
transit_gateway_id = aws_ec2_transit_gateway.primary.id
vpc_id = module.vpc_primary.vpc_id
transit_gateway_default_route_table_association = false
transit_gateway_default_route_table_propagation = false
tags = {
Name = "tgw-attach-primary-vpc"
}
}
# Associate VPC attachment with route table
resource "aws_ec2_transit_gateway_route_table_association" "primary" {
provider = aws.primary
transit_gateway_attachment_id = aws_ec2_transit_gateway_vpc_attachment.primary_vpc.id
transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.primary.id
}
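Routing must be configured symmetrically; the snippet above only covers the primary side. Below is a sketch of the secondary-region mirror, assuming the primary VPC uses 10.0.0.0/16. Remember that with default association disabled, the peering attachment itself also needs a route table association on each side.

# Secondary region route table (mirror of the primary one above)
resource "aws_ec2_transit_gateway_route_table" "secondary" {
  provider           = aws.secondary
  transit_gateway_id = aws_ec2_transit_gateway.secondary.id
  tags = {
    Name = "tgw-rt-secondary"
  }
}

# Return route to the primary region via the peering attachment
resource "aws_ec2_transit_gateway_route" "to_primary" {
  provider                       = aws.secondary
  destination_cidr_block         = "10.0.0.0/16" # Primary VPC CIDR (assumed)
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_peering_attachment_accepter.secondary.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.secondary.id
}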
5. Network Load Balancer Configuration
NLBs provide ultra-low latency Layer 4 load balancing with static IPs:
resource "aws_lb" "primary" {
provider = aws.primary
name = "nlb-primary"
internal = false
load_balancer_type = "network"
enable_cross_zone_load_balancing = true
enable_deletion_protection = true
dynamic "subnet_mapping" {
for_each = module.vpc_primary.public_subnet_ids
content {
subnet_id = subnet_mapping.value
allocation_id = aws_eip.nlb[subnet_mapping.key].id
}
}
tags = {
Name = "nlb-primary"
Environment = "production"
}
}
# Elastic IPs for static addressing
resource "aws_eip" "nlb" {
provider = aws.primary
count = length(module.vpc_primary.public_subnet_ids)
domain = "vpc"
tags = {
Name = "eip-nlb-${count.index}"
}
}
# Target group for EKS nodes
resource "aws_lb_target_group" "eks" {
provider = aws.primary
name = "tg-eks-primary"
port = 443
protocol = "TCP"
vpc_id = module.vpc_primary.vpc_id
target_type = "ip"
health_check {
enabled = true
healthy_threshold = 2
unhealthy_threshold = 2
interval = 10
port = "traffic-port"
protocol = "TCP"
}
tags = {
Name = "tg-eks-primary"
}
}
# Listener
resource "aws_lb_listener" "https" {
provider = aws.primary
load_balancer_arn = aws_lb.primary.arn
port = 443
protocol = "TCP"
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.eks.arn
}
}
6. Route 53 Health Checks and Failover
flowchart LR
subgraph DNS["Route 53"]
HC1["Health Check Primary"]
HC2["Health Check Secondary"]
RR["Failover Record Set"]
end
subgraph Routing["Traffic Flow"]
User["User Request"]
Primary["Primary NLB"]
Secondary["Secondary NLB"]
end
User --> RR
RR --> HC1
RR --> HC2
HC1 -->|Healthy| Primary
HC2 -->|Standby| Secondary
HC1 -->|Unhealthy| Secondary
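The records below reference a hosted zone data source that is not shown; a minimal sketch, with the zone name assumed:

# Hosted zone lookup used by the failover records below
data "aws_route53_zone" "main" {
  name         = "example.com" # assumed zone name
  private_zone = false
}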
# Health check for primary region
resource "aws_route53_health_check" "primary" {
fqdn = aws_lb.primary.dns_name
port = 443
type = "TCP"
request_interval = 10
failure_threshold = 2
tags = {
Name = "hc-primary-nlb"
}
}
# Health check for secondary region
resource "aws_route53_health_check" "secondary" {
fqdn = aws_lb.secondary.dns_name
port = 443
type = "TCP"
request_interval = 10
failure_threshold = 2
tags = {
Name = "hc-secondary-nlb"
}
}
# Primary failover record
resource "aws_route53_record" "primary" {
zone_id = data.aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
failover_routing_policy {
type = "PRIMARY"
}
set_identifier = "primary"
health_check_id = aws_route53_health_check.primary.id
alias {
name = aws_lb.primary.dns_name
zone_id = aws_lb.primary.zone_id
evaluate_target_health = true
}
}
# Secondary failover record
resource "aws_route53_record" "secondary" {
zone_id = data.aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
failover_routing_policy {
type = "SECONDARY"
}
set_identifier = "secondary"
health_check_id = aws_route53_health_check.secondary.id
alias {
name = aws_lb.secondary.dns_name
zone_id = aws_lb.secondary.zone_id
evaluate_target_health = true
}
}
7. Global Accelerator for Performance
AWS Global Accelerator provides static anycast IPs and routes traffic through the AWS backbone:
resource "aws_globalaccelerator_accelerator" "main" {
name = "global-accelerator-prod"
ip_address_type = "IPV4"
enabled = true
attributes {
flow_logs_enabled = true
flow_logs_s3_bucket = aws_s3_bucket.accelerator_logs.id
flow_logs_s3_prefix = "flow-logs/"
}
tags = {
Name = "ga-prod"
}
}
resource "aws_globalaccelerator_listener" "https" {
accelerator_arn = aws_globalaccelerator_accelerator.main.id
protocol = "TCP"
port_range {
from_port = 443
to_port = 443
}
}
resource "aws_globalaccelerator_endpoint_group" "primary" {
listener_arn = aws_globalaccelerator_listener.https.id
endpoint_group_region = "us-east-1"
health_check_interval_seconds = 10
health_check_port = 443
health_check_protocol = "TCP" # path-based health checks apply only to HTTP/HTTPS
threshold_count = 2
traffic_dial_percentage = 100
endpoint_configuration {
endpoint_id = aws_lb.primary.arn
weight = 100
client_ip_preservation_enabled = true
}
}
resource "aws_globalaccelerator_endpoint_group" "secondary" {
listener_arn = aws_globalaccelerator_listener.https.id
endpoint_group_region = "us-west-2"
health_check_interval_seconds = 10
health_check_port = 443
health_check_protocol = "TCP" # path-based health checks apply only to HTTP/HTTPS
threshold_count = 2
traffic_dial_percentage = 0 # Standby
endpoint_configuration {
endpoint_id = aws_lb.secondary.arn
weight = 100
client_ip_preservation_enabled = true
}
}
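The accelerator ships its flow logs to aws_s3_bucket.accelerator_logs, which is not shown. A minimal sketch (the bucket name is a placeholder, and a bucket policy permitting log delivery is also required but omitted here):

# Destination bucket for Global Accelerator flow logs
resource "aws_s3_bucket" "accelerator_logs" {
  bucket = "ga-flow-logs-prod-example" # placeholder; bucket names are globally unique
  tags = {
    Name = "ga-flow-logs"
  }
}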
8. RDS Cross-Region Replication
# Primary RDS instance
resource "aws_db_instance" "primary" {
provider = aws.primary
identifier = "db-primary"
engine = "postgres"
engine_version = "15.4"
instance_class = "db.r6g.xlarge"
allocated_storage = 100
max_allocated_storage = 500
storage_type = "gp3"
storage_encrypted = true
kms_key_id = aws_kms_key.rds_primary.arn
db_name = "appdb"
username = "admin"
password = random_password.db_password.result
multi_az = true
db_subnet_group_name = aws_db_subnet_group.primary.name
vpc_security_group_ids = [aws_security_group.rds_primary.id]
backup_retention_period = 7
backup_window = "03:00-04:00"
maintenance_window = "Mon:04:00-Mon:05:00"
performance_insights_enabled = true
monitoring_interval = 60
monitoring_role_arn = aws_iam_role.rds_monitoring.arn # assumed enhanced-monitoring role; required when monitoring_interval > 0
enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
deletion_protection = true
skip_final_snapshot = false
tags = {
Name = "rds-primary"
}
}
# Cross-region read replica
resource "aws_db_instance" "replica" {
provider = aws.secondary
identifier = "db-replica"
replicate_source_db = aws_db_instance.primary.arn
instance_class = "db.r6g.xlarge"
storage_encrypted = true
kms_key_id = aws_kms_key.rds_secondary.arn
vpc_security_group_ids = [aws_security_group.rds_secondary.id]
db_subnet_group_name = aws_db_subnet_group.secondary.name
performance_insights_enabled = true
monitoring_interval = 60
monitoring_role_arn = aws_iam_role.rds_monitoring.arn # IAM is global; the same assumed role works in both regions
tags = {
Name = "rds-replica"
}
}
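The primary instance above references random_password.db_password. A minimal sketch of that resource; note the generated value is stored in Terraform state, so treat the state as sensitive (or use RDS-managed master passwords instead):

# Generated master password referenced by aws_db_instance.primary
resource "random_password" "db_password" {
  length  = 32
  special = false
}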
9. Disaster Recovery Runbook
flowchart TD
subgraph Detection["Failure Detection"]
A["Health Check Fails"] --> B{"Automated or Manual?"}
B -->|Automated| C["Route 53 Failover"]
B -->|Manual| D["Operator Decision"]
end
subgraph Failover["Failover Process"]
C --> E["Traffic Routes to Secondary"]
D --> F["Promote RDS Replica"]
F --> G["Update DNS TTL"]
G --> E
end
subgraph Recovery["Recovery"]
E --> H["Monitor Secondary"]
H --> I{"Primary Recovered?"}
I -->|Yes| J["Plan Failback"]
I -->|No| K["Continue on Secondary"]
J --> L["Sync Data"]
L --> M["Failback to Primary"]
end
Failover Steps
- Verify primary region failure via CloudWatch alarms
- Route 53 automatically fails over DNS if health checks fail
- For database failover, promote the read replica:
# Promote RDS replica to standalone
aws rds promote-read-replica \
--db-instance-identifier db-replica \
--region us-west-2
# Update application connection strings
kubectl set env deployment/api \
DATABASE_HOST=db-replica.xxxx.us-west-2.rds.amazonaws.com
Failback Steps
- Ensure primary region is stable
- Create a new replica from the secondary (now acting as primary); see the sketch after this list
- Wait for replication lag to reach zero
- Perform controlled failback during maintenance window
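A sketch of the replica-creation step above: standing up a fresh cross-region replica in us-east-1 that replicates from the promoted instance, so data can resynchronize before failback (identifiers are assumptions; in practice this is often done imperatively during the incident and reconciled in Terraform afterwards):

# Temporary replica in the original primary region, sourced from the
# promoted instance in us-west-2, used to resync before failback
resource "aws_db_instance" "failback_replica" {
  provider               = aws.primary
  identifier             = "db-failback-replica"
  replicate_source_db    = aws_db_instance.replica.arn # the promoted instance
  instance_class         = "db.r6g.xlarge"
  storage_encrypted      = true
  kms_key_id             = aws_kms_key.rds_primary.arn
  db_subnet_group_name   = aws_db_subnet_group.primary.name
  vpc_security_group_ids = [aws_security_group.rds_primary.id]
}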
10. Monitoring and Alerting
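The alarms in this section publish to SNS topics that are referenced but not defined above; a minimal sketch (topic names are assumptions, and subscriptions such as email or paging endpoints are omitted):

# Alert topics, one per region, referenced by the alarms below
resource "aws_sns_topic" "alerts" {
  name = "network-alerts-primary"
}

resource "aws_sns_topic" "alerts_secondary" {
  provider = aws.secondary
  name     = "network-alerts-secondary"
}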
# CloudWatch alarm for cross-region latency
resource "aws_cloudwatch_metric_alarm" "cross_region_latency" {
alarm_name = "cross-region-latency-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "Latency"
namespace = "AWS/GlobalAccelerator"
period = 60
statistic = "Average"
threshold = 100
alarm_description = "Cross-region latency exceeds 100ms"
dimensions = {
Accelerator = aws_globalaccelerator_accelerator.main.id
}
alarm_actions = [aws_sns_topic.alerts.arn]
ok_actions = [aws_sns_topic.alerts.arn]
}
# Alarm for RDS replication lag
resource "aws_cloudwatch_metric_alarm" "rds_replication_lag" {
provider = aws.secondary
alarm_name = "rds-replication-lag-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "ReplicaLag"
namespace = "AWS/RDS"
period = 60
statistic = "Average"
threshold = 60
alarm_description = "RDS replication lag exceeds 60 seconds"
dimensions = {
DBInstanceIdentifier = aws_db_instance.replica.id
}
alarm_actions = [aws_sns_topic.alerts_secondary.arn]
}
# Dashboard for multi-region overview
resource "aws_cloudwatch_dashboard" "multi_region" {
dashboard_name = "multi-region-overview"
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
x = 0
y = 0
width = 12
height = 6
properties = {
title = "NLB Healthy Hosts"
region = "us-east-1"
metrics = [
["AWS/NetworkELB", "HealthyHostCount", "TargetGroup", aws_lb_target_group.eks.arn_suffix, "LoadBalancer", aws_lb.primary.arn_suffix]
]
}
},
{
type = "metric"
x = 12
y = 0
width = 12
height = 6
properties = {
title = "RDS Replication Lag"
region = "us-west-2"
metrics = [
["AWS/RDS", "ReplicaLag", "DBInstanceIdentifier", "db-replica"]
]
}
}
]
})
}
Best Practices
| Area | Recommendation |
|---|---|
| Network | Use Transit Gateway peering instead of VPC peering for scalability |
| DNS | Set low TTLs (60s) on failover records for faster propagation |
| Database | Enable Multi-AZ in both regions for local HA |
| Monitoring | Create cross-region dashboards in a central account |
| Testing | Run quarterly DR drills to validate runbooks |
| Cost | Use reserved capacity for predictable baseline, on-demand for DR |
Conclusion
Multi-region architecture on AWS requires careful orchestration of networking, DNS, databases, and monitoring. The key components covered here provide:
- Private cross-region connectivity via Transit Gateway peering
- Automatic DNS failover with Route 53 health checks
- Global traffic optimization with Global Accelerator
- Database resilience with cross-region read replicas
- Comprehensive monitoring and alerting
The Terraform modules shown can be adapted to your specific requirements. Start with a single secondary region and expand as your resilience requirements grow.