Introduction
When managing enterprise-grade AWS infrastructure across multiple regions with dozens of clients connecting via VPNs, Direct Connect, and Transit Gateway attachments, visibility becomes critical. You need to answer questions like:
- Who are the top talkers on each client's connection?
- Is traffic being rejected anywhere?
- What's the latency between client sites and AWS?
- How much bandwidth is each client consuming?
This post walks through how I built automated, per-client CloudWatch dashboards using Terraform that query VPC Flow Logs to provide real-time network observability. The approach scales to hundreds of clients without manual dashboard creation.
Architecture Overview
flowchart TB
A["VPC Flow Logs"] --> B["CloudWatch Logs"]
B --> C["Insights Queries"]
C --> D["CloudWatch Dashboards"]
E["Terraform State"] --> F["Dynamic Dashboard Generation"]
F --> D
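The Insights layer is where the questions from the introduction get answered. As a concrete example, here is a sketch of a saved "top talkers" query, assuming the /aws/tgw/flow-logs log group created in the next section (the query name is arbitrary):

# Saved Logs Insights query: top talkers by bytes transferred.
# VPC flow log fields such as srcAddr, dstAddr, and bytes are
# auto-discovered by CloudWatch Logs Insights.
resource "aws_cloudwatch_query_definition" "top_talkers" {
  name            = "network/top-talkers"
  log_group_names = ["/aws/tgw/flow-logs"]

  query_string = <<-EOT
    stats sum(bytes) as totalBytes by srcAddr, dstAddr
    | sort totalBytes desc
    | limit 20
  EOT
}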
1. VPC Flow Logs Setup
First, ensure VPC Flow Logs are enabled on all relevant VPCs and Transit Gateway attachments:
resource "aws_flow_log" "vpc_tgw" {
iam_role_arn = aws_iam_role.flow_logs.arn
log_destination = aws_cloudwatch_log_group.tgw.arn
max_aggregation_interval = 60 # TGW flow logs require 60-second aggregation; traffic_type does not apply to Transit Gateway flow logs
transit_gateway_id = aws_ec2_transit_gateway.main.id
tags = {
Name = "tgw-flow-logs"
}
}
resource "aws_cloudwatch_log_group" "tgw" {
name = "/aws/tgw/flow-logs"
retention_in_days = 30
tags = {
Name = "tgw-flow-logs-group"
}
}
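The flow log resource above assumes an IAM role, aws_iam_role.flow_logs, that the VPC Flow Logs service can assume to write into CloudWatch Logs. A minimal sketch of that role:

# IAM role the VPC Flow Logs service assumes to publish into CloudWatch Logs
resource "aws_iam_role" "flow_logs" {
  name = "flow-logs-role"

  # Allow the VPC Flow Logs service to assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "vpc-flow-logs.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "flow_logs" {
  name = "flow-logs-write"
  role = aws_iam_role.flow_logs.id

  # Permissions to create log streams and publish log events
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ]
      Resource = "*"
    }]
  })
}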
2. Multi-Region Network Architecture
flowchart TB
subgraph Primary["Primary Region - us-east-1"]
VPC1["Production VPC"]
TGW1["Transit Gateway"]
NLB1["Network Load Balancer"]
EKS1["EKS Cluster"]
RDS1["RDS Primary"]
end
subgraph Secondary["Secondary Region - us-west-2"]
VPC2["DR VPC"]
TGW2["Transit Gateway"]
NLB2["Network Load Balancer"]
EKS2["EKS Cluster"]
RDS2["RDS Read Replica"]
end
subgraph Global["Global Services"]
R53["Route 53"]
CF["CloudFront"]
GAcc["Global Accelerator"]
end
R53 --> NLB1
R53 --> NLB2
CF --> NLB1
GAcc --> NLB1
GAcc --> NLB2
TGW1 <--> TGW2
VPC1 --> TGW1
VPC2 --> TGW2
NLB1 --> EKS1
NLB2 --> EKS2
EKS1 --> RDS1
EKS2 --> RDS2
RDS1 -.-> RDS2
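All of the Terraform below uses two provider aliases, aws.primary and aws.secondary, one per region. A minimal sketch of that provider configuration:

# Region-scoped provider aliases used by the resources in this post
provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}

provider "aws" {
  alias  = "secondary"
  region = "us-west-2"
}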
3. Transit Gateway Peering
Cross-region Transit Gateway peering enables private connectivity between regions without traversing the public internet:
# Primary region Transit Gateway
resource "aws_ec2_transit_gateway" "primary" {
provider = aws.primary
description = "Primary region TGW"
default_route_table_association = "disable"
default_route_table_propagation = "disable"
tags = {
Name = "tgw-primary-us-east-1"
}
}
# Secondary region Transit Gateway
resource "aws_ec2_transit_gateway" "secondary" {
provider = aws.secondary
description = "Secondary region TGW"
default_route_table_association = "disable"
default_route_table_propagation = "disable"
tags = {
Name = "tgw-secondary-us-west-2"
}
}
# Peering attachment (initiated from primary)
resource "aws_ec2_transit_gateway_peering_attachment" "cross_region" {
provider = aws.primary
transit_gateway_id = aws_ec2_transit_gateway.primary.id
peer_transit_gateway_id = aws_ec2_transit_gateway.secondary.id
peer_region = "us-west-2"
tags = {
Name = "tgw-peering-east-west"
}
}
# Accept peering in secondary region
resource "aws_ec2_transit_gateway_peering_attachment_accepter" "secondary" {
provider = aws.secondary
transit_gateway_attachment_id = aws_ec2_transit_gateway_peering_attachment.cross_region.id
tags = {
Name = "tgw-peering-accept"
}
}
4. Route Tables and Propagation
# Primary region route table
resource "aws_ec2_transit_gateway_route_table" "primary" {
provider = aws.primary
transit_gateway_id = aws_ec2_transit_gateway.primary.id
tags = {
Name = "tgw-rt-primary"
}
}
# Route to secondary region via peering
resource "aws_ec2_transit_gateway_route" "to_secondary" {
provider = aws.primary
destination_cidr_block = "10.1.0.0/16" # Secondary VPC CIDR
transit_gateway_attachment_id = aws_ec2_transit_gateway_peering_attachment.cross_region.id
transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.primary.id
}
# VPC attachment to TGW
resource "aws_ec2_transit_gateway_vpc_attachment" "primary_vpc" {
provider = aws.primary
subnet_ids = module.vpc_primary.private_subnet_ids
transit_gateway_id = aws_ec2_transit_gateway.primary.id
vpc_id = module.vpc_primary.vpc_id
transit_gateway_default_route_table_association = false
transit_gateway_default_route_table_propagation = false
tags = {
Name = "tgw-attach-primary-vpc"
}
}
# Associate VPC attachment with route table
resource "aws_ec2_transit_gateway_route_table_association" "primary" {
provider = aws.primary
transit_gateway_attachment_id = aws_ec2_transit_gateway_vpc_attachment.primary_vpc.id
transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.primary.id
}
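Routing must be configured symmetrically; the snippet above only covers the primary side. Below is a sketch of the secondary-region mirror, assuming the primary VPC uses 10.0.0.0/16. Remember that with default association disabled, the peering attachment itself also needs a route table association on each side.

# Secondary region route table (mirror of the primary one above)
resource "aws_ec2_transit_gateway_route_table" "secondary" {
  provider           = aws.secondary
  transit_gateway_id = aws_ec2_transit_gateway.secondary.id
  tags = {
    Name = "tgw-rt-secondary"
  }
}

# Return route to the primary region via the peering attachment
resource "aws_ec2_transit_gateway_route" "to_primary" {
  provider                       = aws.secondary
  destination_cidr_block         = "10.0.0.0/16" # Primary VPC CIDR (assumed)
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_peering_attachment_accepter.secondary.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.secondary.id
}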
5. Network Load Balancer Configuration
NLBs provide ultra-low latency Layer 4 load balancing with static IPs:
resource "aws_lb" "primary" {
provider = aws.primary
name = "nlb-primary"
internal = false
load_balancer_type = "network"
enable_cross_zone_load_balancing = true
enable_deletion_protection = true
dynamic "subnet_mapping" {
for_each = module.vpc_primary.public_subnet_ids
content {
subnet_id = subnet_mapping.value
allocation_id = aws_eip.nlb[subnet_mapping.key].id
}
}
tags = {
Name = "nlb-primary"
Environment = "production"
}
}
# Elastic IPs for static addressing
resource "aws_eip" "nlb" {
provider = aws.primary
count = length(module.vpc_primary.public_subnet_ids)
domain = "vpc"
tags = {
Name = "eip-nlb-${count.index}"
}
}
# Target group for EKS nodes
resource "aws_lb_target_group" "eks" {
provider = aws.primary
name = "tg-eks-primary"
port = 443
protocol = "TCP"
vpc_id = module.vpc_primary.vpc_id
target_type = "ip"
health_check {
enabled = true
healthy_threshold = 2
unhealthy_threshold = 2
interval = 10
port = "traffic-port"
protocol = "TCP"
}
tags = {
Name = "tg-eks-primary"
}
}
# Listener
resource "aws_lb_listener" "https" {
provider = aws.primary
load_balancer_arn = aws_lb.primary.arn
port = 443
protocol = "TCP"
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.eks.arn
}
}
6. Route 53 Health Checks and Failover
flowchart LR
subgraph DNS["Route 53"]
HC1["Health Check Primary"]
HC2["Health Check Secondary"]
RR["Failover Record Set"]
end
subgraph Routing["Traffic Flow"]
User["User Request"]
Primary["Primary NLB"]
Secondary["Secondary NLB"]
end
User --> RR
RR --> HC1
RR --> HC2
HC1 -->|Healthy| Primary
HC2 -->|Standby| Secondary
HC1 -->|Unhealthy| Secondary
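The records below reference a hosted zone data source that is not shown; a minimal sketch, with the zone name assumed:

# Hosted zone lookup used by the failover records below
data "aws_route53_zone" "main" {
  name         = "example.com" # assumed zone name
  private_zone = false
}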
# Health check for primary region
resource "aws_route53_health_check" "primary" {
fqdn = aws_lb.primary.dns_name
port = 443
type = "TCP"
request_interval = 10
failure_threshold = 2
tags = {
Name = "hc-primary-nlb"
}
}
# Health check for secondary region
resource "aws_route53_health_check" "secondary" {
fqdn = aws_lb.secondary.dns_name
port = 443
type = "TCP"
request_interval = 10
failure_threshold = 2
tags = {
Name = "hc-secondary-nlb"
}
}
# Primary failover record
resource "aws_route53_record" "primary" {
zone_id = data.aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
failover_routing_policy {
type = "PRIMARY"
}
set_identifier = "primary"
health_check_id = aws_route53_health_check.primary.id
alias {
name = aws_lb.primary.dns_name
zone_id = aws_lb.primary.zone_id
evaluate_target_health = true
}
}
# Secondary failover record
resource "aws_route53_record" "secondary" {
zone_id = data.aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
failover_routing_policy {
type = "SECONDARY"
}
set_identifier = "secondary"
health_check_id = aws_route53_health_check.secondary.id
alias {
name = aws_lb.secondary.dns_name
zone_id = aws_lb.secondary.zone_id
evaluate_target_health = true
}
}
7. Global Accelerator for Performance
AWS Global Accelerator provides static anycast IPs and routes traffic through the AWS backbone:
resource "aws_globalaccelerator_accelerator" "main" {
name = "global-accelerator-prod"
ip_address_type = "IPV4"
enabled = true
attributes {
flow_logs_enabled = true
flow_logs_s3_bucket = aws_s3_bucket.accelerator_logs.id
flow_logs_s3_prefix = "flow-logs/"
}
tags = {
Name = "ga-prod"
}
}
resource "aws_globalaccelerator_listener" "https" {
accelerator_arn = aws_globalaccelerator_accelerator.main.id
protocol = "TCP"
port_range {
from_port = 443
to_port = 443
}
}
resource "aws_globalaccelerator_endpoint_group" "primary" {
listener_arn = aws_globalaccelerator_listener.https.id
endpoint_group_region = "us-east-1"
health_check_interval_seconds = 10
health_check_port = 443
health_check_protocol = "TCP" # path-based health checks apply only to HTTP/HTTPS
threshold_count = 2
traffic_dial_percentage = 100
endpoint_configuration {
endpoint_id = aws_lb.primary.arn
weight = 100
client_ip_preservation_enabled = true
}
}
resource "aws_globalaccelerator_endpoint_group" "secondary" {
listener_arn = aws_globalaccelerator_listener.https.id
endpoint_group_region = "us-west-2"
health_check_interval_seconds = 10
health_check_port = 443
health_check_protocol = "TCP" # path-based health checks apply only to HTTP/HTTPS
threshold_count = 2
traffic_dial_percentage = 0 # Standby
endpoint_configuration {
endpoint_id = aws_lb.secondary.arn
weight = 100
client_ip_preservation_enabled = true
}
}
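The accelerator ships its flow logs to aws_s3_bucket.accelerator_logs, which is not shown. A minimal sketch (the bucket name is a placeholder, and a bucket policy permitting log delivery is also required but omitted here):

# Destination bucket for Global Accelerator flow logs
resource "aws_s3_bucket" "accelerator_logs" {
  bucket = "ga-flow-logs-prod-example" # placeholder; bucket names are globally unique
  tags = {
    Name = "ga-flow-logs"
  }
}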
8. RDS Cross-Region Replication
# Primary RDS instance
resource "aws_db_instance" "primary" {
provider = aws.primary
identifier = "db-primary"
engine = "postgres"
engine_version = "15.4"
instance_class = "db.r6g.xlarge"
allocated_storage = 100
max_allocated_storage = 500
storage_type = "gp3"
storage_encrypted = true
kms_key_id = aws_kms_key.rds_primary.arn
db_name = "appdb"
username = "admin"
password = random_password.db_password.result
multi_az = true
db_subnet_group_name = aws_db_subnet_group.primary.name
vpc_security_group_ids = [aws_security_group.rds_primary.id]
backup_retention_period = 7
backup_window = "03:00-04:00"
maintenance_window = "Mon:04:00-Mon:05:00"
performance_insights_enabled = true
monitoring_interval = 60
monitoring_role_arn = aws_iam_role.rds_monitoring.arn # assumed enhanced-monitoring role; required when monitoring_interval > 0
enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
deletion_protection = true
skip_final_snapshot = false
tags = {
Name = "rds-primary"
}
}
# Cross-region read replica
resource "aws_db_instance" "replica" {
provider = aws.secondary
identifier = "db-replica"
replicate_source_db = aws_db_instance.primary.arn
instance_class = "db.r6g.xlarge"
storage_encrypted = true
kms_key_id = aws_kms_key.rds_secondary.arn
vpc_security_group_ids = [aws_security_group.rds_secondary.id]
db_subnet_group_name = aws_db_subnet_group.secondary.name
performance_insights_enabled = true
monitoring_interval = 60
monitoring_role_arn = aws_iam_role.rds_monitoring.arn # IAM is global; the same assumed role works in both regions
tags = {
Name = "rds-replica"
}
}
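The primary instance above references random_password.db_password. A minimal sketch of that resource; note the generated value is stored in Terraform state, so treat the state as sensitive (or use RDS-managed master passwords instead):

# Generated master password referenced by aws_db_instance.primary
resource "random_password" "db_password" {
  length  = 32
  special = false
}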
9. Disaster Recovery Runbook
flowchart TD
subgraph Detection["Failure Detection"]
A["Health Check Fails"] --> B{"Automated or Manual?"}
B -->|Automated| C["Route 53 Failover"]
B -->|Manual| D["Operator Decision"]
end
subgraph Failover["Failover Process"]
C --> E["Traffic Routes to Secondary"]
D --> F["Promote RDS Replica"]
F --> G["Update DNS TTL"]
G --> E
end
subgraph Recovery["Recovery"]
E --> H["Monitor Secondary"]
H --> I{"Primary Recovered?"}
I -->|Yes| J["Plan Failback"]
I -->|No| K["Continue on Secondary"]
J --> L["Sync Data"]
L --> M["Failback to Primary"]
end
Failover Steps
- Verify primary region failure via CloudWatch alarms
- Route 53 automatically fails over DNS if health checks fail
- For database failover, promote the read replica:
# Promote RDS replica to standalone
aws rds promote-read-replica \
--db-instance-identifier db-replica \
--region us-west-2
# Update application connection strings
kubectl set env deployment/api \
DATABASE_HOST=db-replica.xxxx.us-west-2.rds.amazonaws.com
Failback Steps
- Ensure primary region is stable
- Create a new replica from the secondary (now acting as primary); see the sketch after this list
- Wait for replication lag to reach zero
- Perform controlled failback during maintenance window
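A sketch of the replica-creation step above: standing up a fresh cross-region replica in us-east-1 that replicates from the promoted instance, so data can resynchronize before failback (identifiers are assumptions; in practice this is often done imperatively during the incident and reconciled in Terraform afterwards):

# Temporary replica in the original primary region, sourced from the
# promoted instance in us-west-2, used to resync before failback
resource "aws_db_instance" "failback_replica" {
  provider               = aws.primary
  identifier             = "db-failback-replica"
  replicate_source_db    = aws_db_instance.replica.arn # the promoted instance
  instance_class         = "db.r6g.xlarge"
  storage_encrypted      = true
  kms_key_id             = aws_kms_key.rds_primary.arn
  db_subnet_group_name   = aws_db_subnet_group.primary.name
  vpc_security_group_ids = [aws_security_group.rds_primary.id]
}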
10. Monitoring and Alerting
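The alarms in this section publish to SNS topics that are referenced but not defined above; a minimal sketch (topic names are assumptions, and subscriptions such as email or paging endpoints are omitted):

# Alert topics, one per region, referenced by the alarms below
resource "aws_sns_topic" "alerts" {
  name = "network-alerts-primary"
}

resource "aws_sns_topic" "alerts_secondary" {
  provider = aws.secondary
  name     = "network-alerts-secondary"
}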
# CloudWatch alarm for cross-region latency
resource "aws_cloudwatch_metric_alarm" "cross_region_latency" {
alarm_name = "cross-region-latency-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "Latency"
namespace = "AWS/GlobalAccelerator"
period = 60
statistic = "Average"
threshold = 100
alarm_description = "Cross-region latency exceeds 100ms"
dimensions = {
Accelerator = aws_globalaccelerator_accelerator.main.id
}
alarm_actions = [aws_sns_topic.alerts.arn]
ok_actions = [aws_sns_topic.alerts.arn]
}
# Alarm for RDS replication lag
resource "aws_cloudwatch_metric_alarm" "rds_replication_lag" {
provider = aws.secondary
alarm_name = "rds-replication-lag-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "ReplicaLag"
namespace = "AWS/RDS"
period = 60
statistic = "Average"
threshold = 60
alarm_description = "RDS replication lag exceeds 60 seconds"
dimensions = {
DBInstanceIdentifier = aws_db_instance.replica.id
}
alarm_actions = [aws_sns_topic.alerts_secondary.arn]
}
# Dashboard for multi-region overview
resource "aws_cloudwatch_dashboard" "multi_region" {
dashboard_name = "multi-region-overview"
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
x = 0
y = 0
width = 12
height = 6
properties = {
title = "NLB Healthy Hosts"
region = "us-east-1"
metrics = [
["AWS/NetworkELB", "HealthyHostCount", "TargetGroup", aws_lb_target_group.eks.arn_suffix, "LoadBalancer", aws_lb.primary.arn_suffix]
]
}
},
{
type = "metric"
x = 12
y = 0
width = 12
height = 6
properties = {
title = "RDS Replication Lag"
region = "us-west-2"
metrics = [
["AWS/RDS", "ReplicaLag", "DBInstanceIdentifier", "db-replica"]
]
}
}
]
})
}
Best Practices
| Area | Recommendation |
|---|---|
| Network | Use Transit Gateway peering instead of VPC peering for scalability |
| DNS | Set low TTLs (60s) on failover records for faster propagation |
| Database | Enable Multi-AZ in both regions for local HA |
| Monitoring | Create cross-region dashboards in a central account |
| Testing | Run quarterly DR drills to validate runbooks |
| Cost | Use reserved capacity for predictable baseline, on-demand for DR |
Conclusion
Multi-region architecture on AWS requires careful orchestration of networking, DNS, databases, and monitoring. The key components covered here provide:
- Private cross-region connectivity via Transit Gateway peering
- Automatic DNS failover with Route 53 health checks
- Global traffic optimization with Global Accelerator
- Database resilience with cross-region read replicas
- Comprehensive monitoring and alerting
The Terraform modules shown can be adapted to your specific requirements. Start with a single secondary region and expand as your resilience requirements grow.