Introduction
EKS cluster upgrades are one of those things that sound simple until you're responsible for keeping production running during the process.
A version upgrade touches the control plane, node groups, and add-ons, and it can break workloads that still rely on deprecated APIs.
I developed a blue-green node group strategy that allows us to:
- Upgrade EKS clusters with zero downtime
- Test workloads on new nodes before cutting over
- Roll back instantly if issues are detected
- Automate the entire process with Terraform
Architecture Overview
flowchart TB
subgraph ControlPlane["EKS Control Plane"]
API[Kubernetes API Server]
ETCD[(etcd)]
SCHED[Scheduler]
CCM[Cloud Controller]
end
subgraph BlueNodeGroup["Blue Node Group - v1.28"]
B1[Node 1]
B2[Node 2]
B3[Node 3]
subgraph BluePods["Workloads"]
BP1[Pod A]
BP2[Pod B]
BP3[Pod C]
end
end
subgraph GreenNodeGroup["Green Node Group - v1.29"]
G1[Node 1]
G2[Node 2]
G3[Node 3]
subgraph GreenPods["Migrated Workloads"]
GP1[Pod A']
GP2[Pod B']
GP3[Pod C']
end
end
subgraph Services["Service Layer"]
ALB[Application Load Balancer]
SVC[Kubernetes Services]
end
ALB --> SVC
SVC --> BluePods
SVC --> GreenPods
API --> BlueNodeGroup
API --> GreenNodeGroup
style ControlPlane fill:#1a1a2e,stroke:#00d9ff,stroke-width:2px,color:#fff
style BlueNodeGroup fill:#264653,stroke:#3a86ff,stroke-width:2px,color:#fff
style GreenNodeGroup fill:#264653,stroke:#2a9d8f,stroke-width:2px,color:#fff
style Services fill:#0d1b2a,stroke:#f77f00,stroke-width:2px,color:#fff
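Both node groups register with the same control plane, so Services keep routing traffic while pods move between them. Because the node group module (shown later) labels every node with its color, each side can be inspected separately; a quick check, assuming the label values blue and green:

# List nodes and their kubelet versions per node group
kubectl get nodes -l node-group=blue -o wide
kubectl get nodes -l node-group=green -o wide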
Upgrade Strategy Flow
flowchart TD
subgraph Phase1["Phase 1: Preparation"]
A[Audit current cluster state] --> B[Check API deprecations]
B --> C[Update PodDisruptionBudgets]
C --> D[Verify add-on compatibility]
end
subgraph Phase2["Phase 2: Control Plane Upgrade"]
D --> E[Upgrade EKS control plane]
E --> F[Wait for control plane ready]
F --> G[Verify API server health]
end
subgraph Phase3["Phase 3: Green Node Group"]
G --> H[Create green node group]
H --> I[Wait for nodes ready]
I --> J[Taint blue nodes]
end
subgraph Phase4["Phase 4: Workload Migration"]
J --> K[Cordon blue nodes]
K --> L[Drain blue nodes]
L --> M[Verify pods on green]
end
subgraph Phase5["Phase 5: Validation"]
M --> N{Health checks pass?}
N -->|Yes| O[Delete blue node group]
N -->|No| P[Rollback: Uncordon blue]
P --> Q[Delete green node group]
end
subgraph Phase6["Phase 6: Cleanup"]
O --> R[Update add-ons]
R --> S[Remove old taints]
S --> T[Upgrade complete]
end
style Phase1 fill:#3a86ff,stroke:#fff,stroke-width:2px,color:#fff
style Phase2 fill:#8338ec,stroke:#fff,stroke-width:2px,color:#fff
style Phase3 fill:#ff006e,stroke:#fff,stroke-width:2px,color:#fff
style Phase4 fill:#fb5607,stroke:#fff,stroke-width:2px,color:#fff
style Phase5 fill:#ffbe0b,stroke:#fff,stroke-width:2px,color:#000
style Phase6 fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
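Phases 2 and 3 are the only points where the control plane and data plane versions diverge, so it is worth confirming the control plane is healthy before creating the green node group. A minimal check, assuming the cluster is named prod-cluster:

# Confirm the control plane finished upgrading and the API server responds
aws eks describe-cluster --name prod-cluster --query 'cluster.{version: version, status: status}'
kubectl get --raw /readyz
kubectl version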
Terraform Module Structure
eks-upgrade/
├── modules/
│ ├── eks-cluster/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ ├── node-group/
│ │ ├── main.tf
│ │ ├── launch-template.tf
│ │ └── variables.tf
│ └── addons/
│ ├── main.tf
│ └── versions.tf
├── environments/
│ ├── dev/
│ ├── staging/
│ └── prod/
└── scripts/
├── pre-upgrade-checks.sh
├── drain-nodes.sh
    └── rollback.sh
Node Group Module
# modules/node-group/main.tf
resource "aws_eks_node_group" "this" {
cluster_name = var.cluster_name
  node_group_name = "${var.cluster_name}-${var.color}-${replace(var.kubernetes_version, ".", "-")}" # node group names cannot contain dots
node_role_arn = var.node_role_arn
subnet_ids = var.subnet_ids
scaling_config {
desired_size = var.desired_size
max_size = var.max_size
min_size = var.min_size
}
update_config {
max_unavailable_percentage = 25
}
launch_template {
id = aws_launch_template.this.id
version = aws_launch_template.this.latest_version
}
labels = merge(var.labels, {
"node-group" = var.color
"kubernetes-version" = var.kubernetes_version
})
dynamic "taint" {
for_each = var.taints
content {
key = taint.value.key
value = taint.value.value
effect = taint.value.effect
}
}
lifecycle {
create_before_destroy = true
ignore_changes = [scaling_config[0].desired_size]
}
tags = merge(var.tags, {
"Name" = "${var.cluster_name}-${var.color}"
"kubernetes.io/cluster/${var.cluster_name}" = "owned"
})
}
resource "aws_launch_template" "this" {
name_prefix = "${var.cluster_name}-${var.color}-"
instance_type = var.instance_type
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_size = var.disk_size
volume_type = "gp3"
encrypted = true
kms_key_id = var.kms_key_arn
delete_on_termination = true
}
}
metadata_options {
http_endpoint = "enabled"
http_tokens = "required" # IMDSv2
http_put_response_hop_limit = 1
}
monitoring {
enabled = true
}
tag_specifications {
resource_type = "instance"
tags = merge(var.tags, {
"Name" = "${var.cluster_name}-${var.color}-node"
})
}
user_data = base64encode(templatefile("${path.module}/userdata.tpl", {
cluster_name = var.cluster_name
cluster_endpoint = var.cluster_endpoint
cluster_ca = var.cluster_ca
kubelet_extra_args = var.kubelet_extra_args
}))
}
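The launch template renders a userdata.tpl that is not shown above. A minimal sketch of what it could look like, assuming the template also pins a custom EKS-optimized Amazon Linux 2 AMI (if the managed node group uses its default AMI, EKS injects its own bootstrap and custom user data must be supplied as MIME multi-part instead):

#!/bin/bash
# modules/node-group/userdata.tpl -- hypothetical sketch for an AL2 EKS-optimized AMI
set -o xtrace
/etc/eks/bootstrap.sh "${cluster_name}" \
  --apiserver-endpoint "${cluster_endpoint}" \
  --b64-cluster-ca "${cluster_ca}" \
  --kubelet-extra-args "${kubelet_extra_args}"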
Blue-Green Upgrade Orchestration
# environments/prod/main.tf
locals {
# Toggle this to switch between blue and green
active_color = "blue"
# Version configuration
current_version = "1.28"
target_version = "1.29"
# Determine which node group is active
blue_enabled = local.active_color == "blue"
green_enabled = local.active_color == "green" || var.upgrade_in_progress
}
# EKS Cluster
module "eks" {
source = "../../modules/eks-cluster"
cluster_name = "prod-cluster"
cluster_version = var.upgrade_in_progress ? local.target_version : local.current_version
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnet_ids
cluster_endpoint_private_access = true
cluster_endpoint_public_access = false
# Enable control plane logging
cluster_enabled_log_types = [
"api",
"audit",
"authenticator",
"controllerManager",
"scheduler"
]
}
# Blue Node Group (current production)
module "blue_node_group" {
source = "../../modules/node-group"
count = local.blue_enabled ? 1 : 0
cluster_name = module.eks.cluster_name
cluster_endpoint = module.eks.cluster_endpoint
cluster_ca = module.eks.cluster_ca
kubernetes_version = local.current_version
color = "blue"
node_role_arn = module.eks.node_role_arn
subnet_ids = module.vpc.private_subnet_ids
instance_type = "m6i.xlarge"
desired_size = 3
min_size = 2
max_size = 10
disk_size = 100
# Taint blue nodes during upgrade to prevent new scheduling
taints = var.upgrade_in_progress ? [
{
key = "upgrade"
value = "in-progress"
effect = "NO_SCHEDULE"
}
] : []
labels = {
environment = "production"
workload = "general"
}
}
# Green Node Group (upgrade target)
module "green_node_group" {
source = "../../modules/node-group"
count = local.green_enabled ? 1 : 0
cluster_name = module.eks.cluster_name
cluster_endpoint = module.eks.cluster_endpoint
cluster_ca = module.eks.cluster_ca
kubernetes_version = local.target_version
color = "green"
node_role_arn = module.eks.node_role_arn
subnet_ids = module.vpc.private_subnet_ids
instance_type = "m6i.xlarge"
desired_size = 3
min_size = 2
max_size = 10
disk_size = 100
labels = {
environment = "production"
workload = "general"
}
}
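With the two module counts wired to active_color and upgrade_in_progress, the upgrade itself is a sequence of applies. A sketch of the order, assuming upgrade_in_progress is a declared variable and the locals are edited between steps:

# 1. Upgrade the control plane, create the green group, taint blue
terraform apply -var="upgrade_in_progress=true"

# 2. Move workloads off blue (cordon + drain, see scripts/drain-nodes.sh)
./scripts/drain-nodes.sh prod-cluster blue 600

# 3. Health checks pass: set active_color = "green" in locals, then apply
#    again to destroy the blue node group
terraform apply -var="upgrade_in_progress=true"

# 4. Bump current_version to "1.29" in locals (the cluster cannot be
#    downgraded), then clear the flag
terraform apply -var="upgrade_in_progress=false"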
PodDisruptionBudget Configuration
Critical for safe node draining:
# pdb-critical-workloads.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-gateway-pdb
namespace: production
spec:
minAvailable: 2
selector:
matchLabels:
app: api-gateway
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: payment-service-pdb
namespace: production
spec:
maxUnavailable: 1
selector:
matchLabels:
      app: payment-service
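Before draining, confirm the budgets actually leave room for evictions; a PDB that already reports zero allowed disruptions will block kubectl drain until its pods become healthy:

# ALLOWED DISRUPTIONS should be at least 1 for every budget in the namespace
kubectl get pdb -n production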
Node Drain Script
#!/bin/bash
# scripts/drain-nodes.sh
set -euo pipefail
CLUSTER_NAME="${1:?Cluster name required}"
NODE_GROUP_COLOR="${2:?Node group color required}"
DRAIN_TIMEOUT="${3:-300}"
echo "Starting drain process for ${NODE_GROUP_COLOR} nodes in ${CLUSTER_NAME}"
# Get nodes in the target node group
NODES=$(kubectl get nodes -l "node-group=${NODE_GROUP_COLOR}" -o jsonpath='{.items[*].metadata.name}')
if [[ -z "$NODES" ]]; then
echo "No nodes found with label node-group=${NODE_GROUP_COLOR}"
exit 1
fi
# Cordon all nodes first (prevent new scheduling)
echo "Cordoning nodes..."
for NODE in $NODES; do
echo " Cordoning ${NODE}"
kubectl cordon "$NODE"
done
# Drain nodes one by one
echo "Draining nodes..."
for NODE in $NODES; do
echo " Draining ${NODE}"
kubectl drain "$NODE" \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--grace-period=60 \
--timeout="${DRAIN_TIMEOUT}s" \
|| {
echo "ERROR: Failed to drain ${NODE}"
echo "Rolling back - uncordoning all nodes"
for ROLLBACK_NODE in $NODES; do
kubectl uncordon "$ROLLBACK_NODE" || true
done
exit 1
}
echo " Successfully drained ${NODE}"
# Wait between drains to allow services to stabilize
sleep 30
done
echo "All nodes drained successfully"Upgrade Workflow Sequence
Upgrade Workflow Sequence
sequenceDiagram
participant Eng as Engineer
participant TF as Terraform
participant EKS as EKS Control Plane
participant Blue as Blue Nodes
participant Green as Green Nodes
participant K8s as Kubernetes API
Eng->>TF: Set upgrade_in_progress = true
TF->>EKS: Upgrade control plane to v1.29
EKS-->>TF: Control plane ready
TF->>Green: Create green node group (v1.29)
Green-->>TF: Nodes registered
TF->>Blue: Add NO_SCHEDULE taint
Blue-->>TF: Taint applied
Eng->>K8s: kubectl cordon blue nodes
K8s-->>Blue: Nodes cordoned
Eng->>K8s: kubectl drain blue nodes
K8s->>Blue: Evict pods
Blue->>Green: Pods rescheduled
Green-->>K8s: Pods running
Eng->>Eng: Run health checks
alt Health checks pass
Eng->>TF: Set active_color = green
TF->>Blue: Delete blue node group
Blue-->>TF: Deleted
Eng->>TF: Set upgrade_in_progress = false
else Health checks fail
Eng->>K8s: kubectl uncordon blue nodes
K8s-->>Blue: Nodes uncordoned
Eng->>TF: Delete green node group
Green-->>TF: Deleted
Eng->>TF: Rollback control plane
end
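The "Run health checks" step is deliberately manual here. A minimal sketch of what it can cover, assuming workloads run in the production namespace and nodes carry the node-group label:

# Anything not Running/Succeeded after the migration?
kubectl get pods -n production --field-selector=status.phase!=Running,status.phase!=Succeeded

# Any pods left behind on blue nodes?
for NODE in $(kubectl get nodes -l node-group=blue -o name); do
  kubectl get pods -A --field-selector "spec.nodeName=${NODE#node/}" --no-headers
done

# Services should still have ready endpoints
kubectl get endpoints -n production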
Add-on Compatibility Matrix
# modules/addons/versions.tf
locals {
addon_versions = {
"1.28" = {
vpc_cni = "v1.15.4-eksbuild.1"
coredns = "v1.10.1-eksbuild.6"
kube_proxy = "v1.28.4-eksbuild.1"
ebs_csi_driver = "v1.25.0-eksbuild.1"
aws_load_balancer = "v2.6.2"
}
"1.29" = {
vpc_cni = "v1.16.0-eksbuild.1"
coredns = "v1.11.1-eksbuild.4"
kube_proxy = "v1.29.0-eksbuild.1"
ebs_csi_driver = "v1.26.0-eksbuild.1"
aws_load_balancer = "v2.7.0"
}
"1.30" = {
vpc_cni = "v1.17.1-eksbuild.1"
coredns = "v1.11.1-eksbuild.8"
kube_proxy = "v1.30.0-eksbuild.3"
ebs_csi_driver = "v1.28.0-eksbuild.1"
aws_load_balancer = "v2.7.1"
}
}
}
resource "aws_eks_addon" "vpc_cni" {
cluster_name = var.cluster_name
addon_name = "vpc-cni"
addon_version = local.addon_versions[var.cluster_version].vpc_cni
resolve_conflicts_on_update = "OVERWRITE"
configuration_values = jsonencode({
enableNetworkPolicy = "true"
env = {
ENABLE_PREFIX_DELEGATION = "true"
WARM_PREFIX_TARGET = "1"
}
})
}
resource "aws_eks_addon" "coredns" {
cluster_name = var.cluster_name
addon_name = "coredns"
addon_version = local.addon_versions[var.cluster_version].coredns
resolve_conflicts_on_update = "OVERWRITE"
}
resource "aws_eks_addon" "kube_proxy" {
cluster_name = var.cluster_name
addon_name = "kube-proxy"
addon_version = local.addon_versions[var.cluster_version].kube_proxy
resolve_conflicts_on_update = "OVERWRITE"
}
resource "aws_eks_addon" "ebs_csi" {
cluster_name = var.cluster_name
addon_name = "aws-ebs-csi-driver"
addon_version = local.addon_versions[var.cluster_version].ebs_csi_driver
service_account_role_arn = var.ebs_csi_role_arn
resolve_conflicts_on_update = "OVERWRITE"
}
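Pinned add-on versions go stale quickly, so it is worth confirming them against what EKS actually publishes for the target Kubernetes version, for example:

# List compatible vpc-cni versions for a 1.29 control plane
aws eks describe-addon-versions \
  --kubernetes-version 1.29 \
  --addon-name vpc-cni \
  --query 'addons[].addonVersions[].addonVersion' \
  --output table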
Pre-Upgrade Validation Script
#!/bin/bash
# scripts/pre-upgrade-checks.sh
set -euo pipefail
TARGET_VERSION="${1:?Target Kubernetes version required}"
CLUSTER_NAME="${2:?Cluster name required}"
echo "Running pre-upgrade checks for ${CLUSTER_NAME} -> ${TARGET_VERSION}"
# Check for deprecated APIs
echo "Checking for deprecated APIs..."
DEPRECATED=$(kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis || true)
if [[ -n "$DEPRECATED" ]]; then
echo "WARNING: Deprecated APIs in use:"
echo "$DEPRECATED"
echo ""
echo "Run: kubectl api-resources --verbs=list -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found -A"
fi
# Check PodDisruptionBudgets
echo "Checking PodDisruptionBudgets..."
PDBS=$(kubectl get pdb -A -o json | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | "\(.metadata.namespace)/\(.metadata.name)"')
if [[ -n "$PDBS" ]]; then
echo "WARNING: PDBs with 0 disruptions allowed:"
echo "$PDBS"
echo "These may block node draining"
fi
# Check node capacity
echo "Checking node capacity..."
TOTAL_PODS=$(kubectl get pods -A --no-headers | wc -l)
TOTAL_CAPACITY=$(kubectl get nodes -o json | jq '[.items[].status.allocatable.pods | tonumber] | add')
echo "Current pods: ${TOTAL_PODS}, Total capacity: ${TOTAL_CAPACITY}"
if (( TOTAL_PODS * 2 > TOTAL_CAPACITY )); then
echo "WARNING: May not have enough capacity for blue-green migration"
echo "Consider scaling up before upgrade"
fi
# Check for stuck pods
echo "Checking for stuck pods..."
STUCK_PODS=$(kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded --no-headers 2>/dev/null || true)
if [[ -n "$STUCK_PODS" ]]; then
echo "WARNING: Pods not in Running/Succeeded state:"
echo "$STUCK_PODS"
fi
# Verify addon compatibility
echo "Checking addon versions..."
kubectl get pods -n kube-system -o json | jq -r '.items[] | "\(.metadata.name): \(.spec.containers[0].image)"'
echo ""
echo "Pre-upgrade checks complete"Rollback Procedure
Rollback Procedure
flowchart TD
subgraph Detection["Issue Detection"]
A[Health check fails] --> B{Severity?}
B -->|Critical| C[Immediate rollback]
B -->|Minor| D[Investigate first]
D --> E{Fixable quickly?}
E -->|Yes| F[Apply fix]
E -->|No| C
end
subgraph Rollback["Rollback Process"]
C --> G[Uncordon blue nodes]
G --> H[Remove taints from blue]
H --> I[Cordon green nodes]
I --> J[Drain green nodes]
J --> K[Pods migrate to blue]
end
subgraph Cleanup["Cleanup"]
K --> L[Verify workloads healthy]
L --> M[Delete green node group]
M --> N{Rollback control plane?}
N -->|Yes| O[Create support ticket]
N -->|No| P[Keep control plane]
O --> Q[AWS rolls back]
P --> R[Document incident]
Q --> R
end
style Detection fill:#e63946,stroke:#fff,stroke-width:2px,color:#fff
style Rollback fill:#f77f00,stroke:#fff,stroke-width:2px,color:#fff
style Cleanup fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
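scripts/rollback.sh appears in the module tree but is not shown; under the layout above it is largely the drain flow in reverse. A hedged sketch, assuming active_color is still "blue" and only upgrade_in_progress was flipped:

#!/bin/bash
# scripts/rollback.sh (hypothetical sketch)
set -euo pipefail

CLUSTER_NAME="${1:?Cluster name required}"

# Make blue schedulable again
for NODE in $(kubectl get nodes -l node-group=blue -o name); do
  kubectl uncordon "${NODE#node/}"
done

# Move workloads back to blue by draining green
./scripts/drain-nodes.sh "$CLUSTER_NAME" green 600

# Bump current_version in locals to the version the control plane is already
# running (EKS cannot be downgraded), then drop the flag: this removes the
# blue taint and destroys the green node group
terraform apply -var="upgrade_in_progress=false"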
Monitoring During Upgrade
# CloudWatch alarms for upgrade monitoring
resource "aws_cloudwatch_metric_alarm" "node_not_ready" {
alarm_name = "${var.cluster_name}-node-not-ready"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "cluster_failed_node_count"
namespace = "ContainerInsights"
period = 60
statistic = "Maximum"
threshold = 0
alarm_description = "EKS node not ready during upgrade"
dimensions = {
ClusterName = var.cluster_name
}
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "pod_restart" {
alarm_name = "${var.cluster_name}-pod-restarts"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "pod_number_of_container_restarts"
namespace = "ContainerInsights"
period = 60
statistic = "Sum"
threshold = 10
alarm_description = "High pod restart rate during upgrade"
dimensions = {
ClusterName = var.cluster_name
Namespace = "production"
}
alarm_actions = [var.sns_topic_arn]
}
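Alarms cover the automated side; during the change window it also helps to keep a live view of nodes and events open:

# Nodes with their node-group and kubernetes-version labels as columns
kubectl get nodes -L node-group,kubernetes-version

# Most recent cluster events (evictions, scheduling failures, image pulls)
kubectl get events -A --sort-by=.lastTimestamp | tail -n 30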
Best Practices
| Practice | Why |
|---|---|
| Always test in staging first | Catch issues before production |
| Use PodDisruptionBudgets | Prevent service disruption during drain |
| Upgrade one minor version at a time | AWS doesn't support skipping versions |
| Keep add-ons compatible | Mismatched versions break networking, DNS, and storage |
| Have rollback plan ready | Things will go wrong eventually |
| Monitor throughout | Catch issues early |
| Upgrade during low traffic | Minimize blast radius |
Troubleshooting
"Pods stuck in Pending after drain"
# Check for resource constraints
kubectl describe pod <pod-name> -n <namespace>
# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Check PDB status
kubectl get pdb -A
"Node drain timeout"
- Increase grace period for slow-terminating pods
- Check if PDB is blocking eviction
- Force delete stuck pods if safe (see the example after this list)
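Force deletion bypasses graceful termination, so reserve it for pods whose node has already been drained and whose workload tolerates an abrupt stop:

# Last resort for a pod stuck in Terminating
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force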
"Control plane upgrade stuck"
- Check EKS console for detailed status
- Review CloudWatch logs for control plane
- Contact AWS support if stuck > 1 hour
Conclusion
Zero-downtime EKS upgrades are achievable with proper planning and automation. The blue-green node group strategy provides:
- Safety: Workloads keep running on stable nodes during upgrade
- Rollback capability: Instantly revert if issues arise
- Validation time: Test thoroughly before committing
The key is treating upgrades as a first-class operational concern, not an afterthought. With Terraform automation and proper PDBs, what used to be a nerve-wracking maintenance window becomes a routine, low-risk operation.