
Multi-Region AWS Infrastructure for Resilience: A Terraform Deep Dive

Learn how to architect highly available, multi-region AWS infrastructure using Terraform, Transit Gateway, Network Load Balancers, and intelligent routing strategies for enterprise-grade applications.

Milan Dangol

Sr DevOps & DevSecOps Engineer

Jul 6, 2025
9 min read

Introduction

When managing enterprise-grade AWS infrastructure across multiple regions with dozens of clients connecting via VPNs, Direct Connect, and Transit Gateway attachments, visibility becomes critical. You need to answer questions like:

  • Who are the top talkers on each client's connection?
  • Is traffic being rejected anywhere?
  • What's the latency between client sites and AWS?
  • How much bandwidth is each client consuming?

This post walks through how I architect resilient, multi-region AWS infrastructure with Terraform: Transit Gateway peering for private cross-region connectivity, Network Load Balancers and Global Accelerator for traffic management, Route 53 for DNS failover, and cross-region RDS replication for data resilience, all underpinned by VPC Flow Log observability. The approach scales to hundreds of clients without manual per-environment configuration.
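The questions above map directly onto CloudWatch Logs Insights queries against the flow logs. A minimal "top talkers" sketch, assuming the default flow-log record format (which exposes `srcAddr`, `dstAddr`, and `bytes` as queryable fields):

```
stats sum(bytes) as totalBytes by srcAddr, dstAddr
| sort totalBytes desc
| limit 10
```

Run against the log group over a chosen time range, this returns the ten source/destination pairs moving the most data, which is usually the fastest answer to "who is saturating this connection?"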

Architecture Overview

flowchart TB
    A["VPC Flow Logs"] --> B["CloudWatch Logs"]
    B --> C["Insights Queries"]
    C --> D["CloudWatch Dashboards"]
    E["Terraform State"] --> F["Dynamic Dashboard Generation"]
    F --> D

1. VPC Flow Logs Setup

First, ensure VPC Flow Logs are enabled on all relevant VPCs and Transit Gateway attachments:

resource "aws_flow_log" "vpc_tgw" {
  iam_role_arn    = aws_iam_role.flow_logs.arn
  log_destination = aws_cloudwatch_log_group.tgw.arn
  traffic_type    = "ALL"

  transit_gateway_id = aws_ec2_transit_gateway.main.id

  # Transit Gateway flow logs only support 60-second aggregation
  max_aggregation_interval = 60

  tags = {
    Name = "tgw-flow-logs"
  }
}

resource "aws_cloudwatch_log_group" "tgw" {
  name              = "/aws/tgw/flow-logs"
  retention_in_days = 30

  tags = {
    Name = "tgw-flow-logs-group"
  }
}
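The `aws_iam_role.flow_logs` referenced above isn't shown. A minimal sketch of a role the Flow Logs service can assume to write into the log group (names are illustrative):

```hcl
resource "aws_iam_role" "flow_logs" {
  name = "flow-logs-role"

  # Allow the VPC Flow Logs service to assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "vpc-flow-logs.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "flow_logs" {
  name = "flow-logs-write"
  role = aws_iam_role.flow_logs.id

  # Scope write access to the flow-log group and its streams
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ]
      Resource = "${aws_cloudwatch_log_group.tgw.arn}:*"
    }]
  })
}
```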

2. Multi-Region Network Architecture

flowchart TB
    subgraph Primary["Primary Region - us-east-1"]
        VPC1["Production VPC"]
        TGW1["Transit Gateway"]
        NLB1["Network Load Balancer"]
        EKS1["EKS Cluster"]
        RDS1["RDS Primary"]
    end
    subgraph Secondary["Secondary Region - us-west-2"]
        VPC2["DR VPC"]
        TGW2["Transit Gateway"]
        NLB2["Network Load Balancer"]
        EKS2["EKS Cluster"]
        RDS2["RDS Read Replica"]
    end
    subgraph Global["Global Services"]
        R53["Route 53"]
        CF["CloudFront"]
        GAcc["Global Accelerator"]
    end
    R53 --> NLB1
    R53 --> NLB2
    CF --> NLB1
    GAcc --> NLB1
    GAcc --> NLB2
    TGW1 <--> TGW2
    VPC1 --> TGW1
    VPC2 --> TGW2
    NLB1 --> EKS1
    NLB2 --> EKS2
    EKS1 --> RDS1
    EKS2 --> RDS2
    RDS1 -.-> RDS2

3. Transit Gateway Peering

Cross-region Transit Gateway peering enables private connectivity between regions without traversing the public internet:
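The `aws.primary` and `aws.secondary` provider aliases used throughout the examples below are assumed to be configured like this, matching the regions in this post:

```hcl
provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}

provider "aws" {
  alias  = "secondary"
  region = "us-west-2"
}
```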

# Primary region Transit Gateway
resource "aws_ec2_transit_gateway" "primary" {
  provider = aws.primary
  
  description                     = "Primary region TGW"
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"
  
  tags = {
    Name = "tgw-primary-us-east-1"
  }
}

# Secondary region Transit Gateway
resource "aws_ec2_transit_gateway" "secondary" {
  provider = aws.secondary
  
  description                     = "Secondary region TGW"
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"
  
  tags = {
    Name = "tgw-secondary-us-west-2"
  }
}

# Peering attachment (initiated from primary)
resource "aws_ec2_transit_gateway_peering_attachment" "cross_region" {
  provider = aws.primary
  
  transit_gateway_id      = aws_ec2_transit_gateway.primary.id
  peer_transit_gateway_id = aws_ec2_transit_gateway.secondary.id
  peer_region             = "us-west-2"
  
  tags = {
    Name = "tgw-peering-east-west"
  }
}

# Accept peering in secondary region
resource "aws_ec2_transit_gateway_peering_attachment_accepter" "secondary" {
  provider = aws.secondary
  
  transit_gateway_attachment_id = aws_ec2_transit_gateway_peering_attachment.cross_region.id
  
  tags = {
    Name = "tgw-peering-accept"
  }
}

4. Route Tables and Propagation

# Primary region route table
resource "aws_ec2_transit_gateway_route_table" "primary" {
  provider           = aws.primary
  transit_gateway_id = aws_ec2_transit_gateway.primary.id
  
  tags = {
    Name = "tgw-rt-primary"
  }
}

# Route to secondary region via peering
resource "aws_ec2_transit_gateway_route" "to_secondary" {
  provider                       = aws.primary
  destination_cidr_block         = "10.1.0.0/16"  # Secondary VPC CIDR
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_peering_attachment.cross_region.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.primary.id
}

# VPC attachment to TGW
resource "aws_ec2_transit_gateway_vpc_attachment" "primary_vpc" {
  provider           = aws.primary
  subnet_ids         = module.vpc_primary.private_subnet_ids
  transit_gateway_id = aws_ec2_transit_gateway.primary.id
  vpc_id             = module.vpc_primary.vpc_id
  
  transit_gateway_default_route_table_association = false
  transit_gateway_default_route_table_propagation = false
  
  tags = {
    Name = "tgw-attach-primary-vpc"
  }
}

# Associate VPC attachment with route table
resource "aws_ec2_transit_gateway_route_table_association" "primary" {
  provider                       = aws.primary
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.primary_vpc.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.primary.id
}
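Static routes over a peering attachment must exist in both directions; the snippet above only covers primary-to-secondary. A mirrored sketch for the secondary region, assuming `10.0.0.0/16` is the primary VPC CIDR:

```hcl
# Secondary region route table
resource "aws_ec2_transit_gateway_route_table" "secondary" {
  provider           = aws.secondary
  transit_gateway_id = aws_ec2_transit_gateway.secondary.id

  tags = {
    Name = "tgw-rt-secondary"
  }
}

# Return route to the primary VPC via the same peering attachment
resource "aws_ec2_transit_gateway_route" "to_primary" {
  provider                       = aws.secondary
  destination_cidr_block         = "10.0.0.0/16" # assumed primary VPC CIDR
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_peering_attachment.cross_region.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.secondary.id
}
```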

5. Network Load Balancer Configuration

NLBs provide ultra-low latency Layer 4 load balancing with static IPs:

resource "aws_lb" "primary" {
  provider           = aws.primary
  name               = "nlb-primary"
  internal           = false
  load_balancer_type = "network"
  
  enable_cross_zone_load_balancing = true
  enable_deletion_protection       = true
  
  dynamic "subnet_mapping" {
    for_each = module.vpc_primary.public_subnet_ids
    content {
      subnet_id     = subnet_mapping.value
      allocation_id = aws_eip.nlb[subnet_mapping.key].id
    }
  }
  
  tags = {
    Name        = "nlb-primary"
    Environment = "production"
  }
}

# Elastic IPs for static addressing
resource "aws_eip" "nlb" {
  provider = aws.primary
  count    = length(module.vpc_primary.public_subnet_ids)
  domain   = "vpc"
  
  tags = {
    Name = "eip-nlb-${count.index}"
  }
}

# Target group for EKS nodes
resource "aws_lb_target_group" "eks" {
  provider    = aws.primary
  name        = "tg-eks-primary"
  port        = 443
  protocol    = "TCP"
  vpc_id      = module.vpc_primary.vpc_id
  target_type = "ip"
  
  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 2
    interval            = 10
    port                = "traffic-port"
    protocol            = "TCP"
  }
  
  tags = {
    Name = "tg-eks-primary"
  }
}

# Listener
resource "aws_lb_listener" "https" {
  provider          = aws.primary
  load_balancer_arn = aws_lb.primary.arn
  port              = 443
  protocol          = "TCP"
  
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.eks.arn
  }
}

6. Route 53 Health Checks and Failover

flowchart LR
    subgraph DNS["Route 53"]
        HC1["Health Check Primary"]
        HC2["Health Check Secondary"]
        RR["Failover Record Set"]
    end
    subgraph Routing["Traffic Flow"]
        User["User Request"]
        Primary["Primary NLB"]
        Secondary["Secondary NLB"]
    end
    User --> RR
    RR --> HC1
    RR --> HC2
    HC1 -->|Healthy| Primary
    HC2 -->|Standby| Secondary
    HC1 -->|Unhealthy| Secondary

# Health check for primary region
resource "aws_route53_health_check" "primary" {
  fqdn              = aws_lb.primary.dns_name
  port              = 443
  type              = "TCP"
  request_interval  = 10
  failure_threshold = 2
  
  tags = {
    Name = "hc-primary-nlb"
  }
}

# Health check for secondary region
resource "aws_route53_health_check" "secondary" {
  fqdn              = aws_lb.secondary.dns_name
  port              = 443
  type              = "TCP"
  request_interval  = 10
  failure_threshold = 2
  
  tags = {
    Name = "hc-secondary-nlb"
  }
}

# Primary failover record
resource "aws_route53_record" "primary" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
  
  failover_routing_policy {
    type = "PRIMARY"
  }
  
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id
  
  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
}

# Secondary failover record
resource "aws_route53_record" "secondary" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
  
  failover_routing_policy {
    type = "SECONDARY"
  }
  
  set_identifier  = "secondary"
  health_check_id = aws_route53_health_check.secondary.id
  
  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }
}
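The `data.aws_route53_zone.main` lookup used by both records isn't shown. A minimal sketch, assuming the records live in a public `example.com` zone:

```hcl
# Look up the existing hosted zone by name
data "aws_route53_zone" "main" {
  name         = "example.com"
  private_zone = false
}
```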

7. Global Accelerator for Performance

AWS Global Accelerator provides static anycast IPs and routes traffic through the AWS backbone:

resource "aws_globalaccelerator_accelerator" "main" {
  name            = "global-accelerator-prod"
  ip_address_type = "IPV4"
  enabled         = true
  
  attributes {
    flow_logs_enabled   = true
    flow_logs_s3_bucket = aws_s3_bucket.accelerator_logs.id
    flow_logs_s3_prefix = "flow-logs/"
  }
  
  tags = {
    Name = "ga-prod"
  }
}

resource "aws_globalaccelerator_listener" "https" {
  accelerator_arn = aws_globalaccelerator_accelerator.main.id
  protocol        = "TCP"
  
  port_range {
    from_port = 443
    to_port   = 443
  }
}

resource "aws_globalaccelerator_endpoint_group" "primary" {
  listener_arn                  = aws_globalaccelerator_listener.https.id
  endpoint_group_region         = "us-east-1"
  health_check_interval_seconds = 10
  health_check_port             = 443
  health_check_protocol         = "TCP" # path-based checks apply only to HTTP/HTTPS
  threshold_count               = 2
  traffic_dial_percentage       = 100
  
  endpoint_configuration {
    endpoint_id                    = aws_lb.primary.arn
    weight                         = 100
    client_ip_preservation_enabled = true
  }
}

resource "aws_globalaccelerator_endpoint_group" "secondary" {
  listener_arn                  = aws_globalaccelerator_listener.https.id
  endpoint_group_region         = "us-west-2"
  health_check_interval_seconds = 10
  health_check_port             = 443
  health_check_protocol         = "TCP" # path-based checks apply only to HTTP/HTTPS
  threshold_count               = 2
  traffic_dial_percentage       = 0  # Standby
  
  endpoint_configuration {
    endpoint_id                    = aws_lb.secondary.arn
    weight                         = 100
    client_ip_preservation_enabled = true
  }
}
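The `aws_s3_bucket.accelerator_logs` referenced in the accelerator attributes isn't shown. A minimal sketch (the bucket name is illustrative and must be globally unique); note that Global Accelerator also needs permission to deliver logs into the bucket:

```hcl
resource "aws_s3_bucket" "accelerator_logs" {
  bucket = "ga-flow-logs-example" # illustrative; must be globally unique

  tags = {
    Name = "ga-flow-logs"
  }
}
```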

8. RDS Cross-Region Replication

# Primary RDS instance
resource "aws_db_instance" "primary" {
  provider = aws.primary
  
  identifier     = "db-primary"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = "db.r6g.xlarge"
  
  allocated_storage     = 100
  max_allocated_storage = 500
  storage_type          = "gp3"
  storage_encrypted     = true
  kms_key_id            = aws_kms_key.rds_primary.arn
  
  db_name  = "appdb"
  username = "admin"
  password = random_password.db_password.result
  
  multi_az               = true
  db_subnet_group_name   = aws_db_subnet_group.primary.name
  vpc_security_group_ids = [aws_security_group.rds_primary.id]
  
  backup_retention_period = 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"
  
  performance_insights_enabled    = true
  monitoring_interval             = 60
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
  
  deletion_protection = true
  skip_final_snapshot = false
  
  tags = {
    Name = "rds-primary"
  }
}

# Cross-region read replica
resource "aws_db_instance" "replica" {
  provider = aws.secondary
  
  identifier          = "db-replica"
  replicate_source_db = aws_db_instance.primary.arn
  instance_class      = "db.r6g.xlarge"
  
  storage_encrypted = true
  kms_key_id        = aws_kms_key.rds_secondary.arn
  
  vpc_security_group_ids = [aws_security_group.rds_secondary.id]
  db_subnet_group_name   = aws_db_subnet_group.secondary.name
  
  performance_insights_enabled = true
  monitoring_interval          = 60
  
  tags = {
    Name = "rds-replica"
  }
}
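The `random_password.db_password` used for the primary instance isn't shown. A minimal sketch using the HashiCorp `random` provider; be aware the generated value is stored in Terraform state, so in production you may prefer RDS-managed master passwords or Secrets Manager:

```hcl
# Generates the master password; value is persisted in Terraform state
resource "random_password" "db_password" {
  length  = 24
  special = false
}
```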

9. Disaster Recovery Runbook

flowchart TD
    subgraph Detection["Failure Detection"]
        A["Health Check Fails"] --> B{"Automated or Manual?"}
        B -->|Automated| C["Route 53 Failover"]
        B -->|Manual| D["Operator Decision"]
    end
    subgraph Failover["Failover Process"]
        C --> E["Traffic Routes to Secondary"]
        D --> F["Promote RDS Replica"]
        F --> G["Update DNS TTL"]
        G --> E
    end
    subgraph Recovery["Recovery"]
        E --> H["Monitor Secondary"]
        H --> I{"Primary Recovered?"}
        I -->|Yes| J["Plan Failback"]
        I -->|No| K["Continue on Secondary"]
        J --> L["Sync Data"]
        L --> M["Failback to Primary"]
    end

Failover Steps

  1. Verify primary region failure via CloudWatch alarms
  2. Route 53 automatically fails over DNS if health checks fail
  3. For database failover, promote the read replica:
# Promote RDS replica to standalone
aws rds promote-read-replica \
  --db-instance-identifier db-replica \
  --region us-west-2

# Update application connection strings
kubectl set env deployment/api \
  DATABASE_HOST=db-replica.xxxx.us-west-2.rds.amazonaws.com

Failback Steps

  1. Ensure primary region is stable
  2. Create new replica from secondary (now primary)
  3. Wait for replication lag to reach zero
  4. Perform controlled failback during maintenance window

10. Monitoring and Alerting

# CloudWatch alarm for cross-region latency
resource "aws_cloudwatch_metric_alarm" "cross_region_latency" {
  alarm_name          = "cross-region-latency-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "Latency"
  namespace           = "AWS/GlobalAccelerator"
  period              = 60
  statistic           = "Average"
  threshold           = 100
  alarm_description   = "Cross-region latency exceeds 100ms"
  
  dimensions = {
    Accelerator = aws_globalaccelerator_accelerator.main.id
  }
  
  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}

# Alarm for RDS replication lag
resource "aws_cloudwatch_metric_alarm" "rds_replication_lag" {
  provider = aws.secondary
  
  alarm_name          = "rds-replication-lag-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "ReplicaLag"
  namespace           = "AWS/RDS"
  period              = 60
  statistic           = "Average"
  threshold           = 60
  alarm_description   = "RDS replication lag exceeds 60 seconds"
  
  dimensions = {
    DBInstanceIdentifier = aws_db_instance.replica.id
  }
  
  alarm_actions = [aws_sns_topic.alerts_secondary.arn]
}
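The SNS topics referenced by the alarm actions (`aws_sns_topic.alerts` and `aws_sns_topic.alerts_secondary`) aren't shown. A minimal sketch with an email subscription (the topic names and endpoint are illustrative):

```hcl
resource "aws_sns_topic" "alerts" {
  provider = aws.primary
  name     = "infra-alerts"
}

resource "aws_sns_topic" "alerts_secondary" {
  provider = aws.secondary
  name     = "infra-alerts"
}

# Email subscriptions require a one-time confirmation by the recipient
resource "aws_sns_topic_subscription" "oncall" {
  provider  = aws.primary
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "oncall@example.com" # illustrative
}
```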

# Dashboard for multi-region overview
resource "aws_cloudwatch_dashboard" "multi_region" {
  dashboard_name = "multi-region-overview"
  
  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        x      = 0
        y      = 0
        width  = 12
        height = 6
        properties = {
          title  = "NLB Healthy Hosts"
          region = "us-east-1"
          metrics = [
            ["AWS/NetworkELB", "HealthyHostCount", "TargetGroup", aws_lb_target_group.eks.arn_suffix, "LoadBalancer", aws_lb.primary.arn_suffix]
          ]
        }
      },
      {
        type   = "metric"
        x      = 12
        y      = 0
        width  = 12
        height = 6
        properties = {
          title  = "RDS Replication Lag"
          region = "us-west-2"
          metrics = [
            ["AWS/RDS", "ReplicaLag", "DBInstanceIdentifier", "db-replica"]
          ]
        }
      }
    ]
  })
}

Best Practices

  • Network: Use Transit Gateway peering instead of VPC peering for scalability
  • DNS: Set low TTLs (60s) on failover records for faster propagation
  • Database: Enable Multi-AZ in both regions for local HA
  • Monitoring: Create cross-region dashboards in a central account
  • Testing: Run quarterly DR drills to validate runbooks
  • Cost: Use reserved capacity for the predictable baseline and on-demand for DR

Conclusion

Multi-region architecture on AWS requires careful orchestration of networking, DNS, databases, and monitoring. The key components covered here provide:

  • Private cross-region connectivity via Transit Gateway peering
  • Automatic DNS failover with Route 53 health checks
  • Global traffic optimization with Global Accelerator
  • Database resilience with cross-region read replicas
  • Comprehensive monitoring and alerting

The Terraform modules shown can be adapted to your specific requirements. Start with a single secondary region and expand as your resilience requirements grow.

Tags

#aws #terraform #multi-region #transit-gateway #high-availability #disaster-recovery
