
Zero-Downtime EKS Upgrades in Production

Implementing a blue-green node group strategy for EKS cluster upgrades with automated rollback, PodDisruptionBudgets, and Terraform orchestration to achieve zero customer impact during Kubernetes version upgrades.

Milan Dangol

Sr DevOps & DevSecOps Engineer

Jun 12, 2025
11 min read

Introduction

EKS cluster upgrades are one of those things that sound simple until you're responsible for keeping production running during the process.
A version upgrade touches the control plane, node groups, and add-ons, and can break workloads that still rely on deprecated APIs.

I developed a blue-green node group strategy that allows us to:

  • Upgrade EKS clusters with zero downtime
  • Test workloads on new nodes before cutting over
  • Rollback instantly if issues are detected
  • Automate the entire process with Terraform

Architecture Overview

flowchart TB
    subgraph ControlPlane["EKS Control Plane"]
        API[Kubernetes API Server]
        ETCD[(etcd)]
        SCHED[Scheduler]
        CCM[Cloud Controller]
    end

    subgraph BlueNodeGroup["Blue Node Group - v1.28"]
        B1[Node 1]
        B2[Node 2]
        B3[Node 3]
        subgraph BluePods["Workloads"]
            BP1[Pod A]
            BP2[Pod B]
            BP3[Pod C]
        end
    end

    subgraph GreenNodeGroup["Green Node Group - v1.29"]
        G1[Node 1]
        G2[Node 2]
        G3[Node 3]
        subgraph GreenPods["Migrated Workloads"]
            GP1[Pod A']
            GP2[Pod B']
            GP3[Pod C']
        end
    end

    subgraph Services["Service Layer"]
        ALB[Application Load Balancer]
        SVC[Kubernetes Services]
    end

    ALB --> SVC
    SVC --> BluePods
    SVC --> GreenPods
    API --> BlueNodeGroup
    API --> GreenNodeGroup

    style ControlPlane fill:#1a1a2e,stroke:#00d9ff,stroke-width:2px,color:#fff
    style BlueNodeGroup fill:#264653,stroke:#3a86ff,stroke-width:2px,color:#fff
    style GreenNodeGroup fill:#264653,stroke:#2a9d8f,stroke-width:2px,color:#fff
    style Services fill:#0d1b2a,stroke:#f77f00,stroke-width:2px,color:#fff

Upgrade Strategy Flow

flowchart TD
    subgraph Phase1["Phase 1: Preparation"]
        A[Audit current cluster state] --> B[Check API deprecations]
        B --> C[Update PodDisruptionBudgets]
        C --> D[Verify add-on compatibility]
    end

    subgraph Phase2["Phase 2: Control Plane Upgrade"]
        D --> E[Upgrade EKS control plane]
        E --> F[Wait for control plane ready]
        F --> G[Verify API server health]
    end

    subgraph Phase3["Phase 3: Green Node Group"]
        G --> H[Create green node group]
        H --> I[Wait for nodes ready]
        I --> J[Taint blue nodes]
    end

    subgraph Phase4["Phase 4: Workload Migration"]
        J --> K[Cordon blue nodes]
        K --> L[Drain blue nodes]
        L --> M[Verify pods on green]
    end

    subgraph Phase5["Phase 5: Validation"]
        M --> N{Health checks pass?}
        N -->|Yes| O[Delete blue node group]
        N -->|No| P[Rollback: Uncordon blue]
        P --> Q[Delete green node group]
    end

    subgraph Phase6["Phase 6: Cleanup"]
        O --> R[Update add-ons]
        R --> S[Remove old taints]
        S --> T[Upgrade complete]
    end

    style Phase1 fill:#3a86ff,stroke:#fff,stroke-width:2px,color:#fff
    style Phase2 fill:#8338ec,stroke:#fff,stroke-width:2px,color:#fff
    style Phase3 fill:#ff006e,stroke:#fff,stroke-width:2px,color:#fff
    style Phase4 fill:#fb5607,stroke:#fff,stroke-width:2px,color:#fff
    style Phase5 fill:#ffbe0b,stroke:#fff,stroke-width:2px,color:#000
    style Phase6 fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
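
Phase 2 ends with verifying API server health before any node work starts. A minimal sketch of that verification, assuming the prod-cluster name used in the Terraform examples later in this post:

#!/bin/bash
# Sketch of the Phase 2 control-plane verification (assumes cluster "prod-cluster")
set -euo pipefail

# EKS should report the target version and an ACTIVE status
aws eks describe-cluster --name prod-cluster \
  --query 'cluster.{version:version,status:status}' --output table

# The API server should answer and report the expected server version
kubectl version
kubectl get --raw '/readyz?verbose' | tail -n 5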

Terraform Module Structure

eks-upgrade/
├── modules/
│   ├── eks-cluster/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── node-group/
│   │   ├── main.tf
│   │   ├── launch-template.tf
│   │   └── variables.tf
│   └── addons/
│       ├── main.tf
│       └── versions.tf
├── environments/
│   ├── dev/
│   ├── staging/
│   └── prod/
└── scripts/
    ├── pre-upgrade-checks.sh
    ├── drain-nodes.sh
    └── rollback.sh

Node Group Module

# modules/node-group/main.tf

resource "aws_eks_node_group" "this" {
  cluster_name    = var.cluster_name
  node_group_name = "${var.cluster_name}-${var.color}-${var.kubernetes_version}"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.subnet_ids

  scaling_config {
    desired_size = var.desired_size
    max_size     = var.max_size
    min_size     = var.min_size
  }

  update_config {
    max_unavailable_percentage = 25
  }

  launch_template {
    id      = aws_launch_template.this.id
    version = aws_launch_template.this.latest_version
  }

  labels = merge(var.labels, {
    "node-group" = var.color
    "kubernetes-version" = var.kubernetes_version
  })

  dynamic "taint" {
    for_each = var.taints
    content {
      key    = taint.value.key
      value  = taint.value.value
      effect = taint.value.effect
    }
  }

  lifecycle {
    create_before_destroy = true
    ignore_changes        = [scaling_config[0].desired_size]
  }

  tags = merge(var.tags, {
    "Name"                = "${var.cluster_name}-${var.color}"
    "kubernetes.io/cluster/${var.cluster_name}" = "owned"
  })
}

resource "aws_launch_template" "this" {
  name_prefix   = "${var.cluster_name}-${var.color}-"
  instance_type = var.instance_type

  block_device_mappings {
    device_name = "/dev/xvda"

    ebs {
      volume_size           = var.disk_size
      volume_type           = "gp3"
      encrypted             = true
      kms_key_id            = var.kms_key_arn
      delete_on_termination = true
    }
  }

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"  # IMDSv2
    http_put_response_hop_limit = 1
  }

  monitoring {
    enabled = true
  }

  tag_specifications {
    resource_type = "instance"
    tags = merge(var.tags, {
      "Name" = "${var.cluster_name}-${var.color}-node"
    })
  }

  user_data = base64encode(templatefile("${path.module}/userdata.tpl", {
    cluster_name     = var.cluster_name
    cluster_endpoint = var.cluster_endpoint
    cluster_ca       = var.cluster_ca
    kubelet_extra_args = var.kubelet_extra_args
  }))
}
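
The launch template renders a userdata.tpl that isn't shown in this post. A minimal sketch of what that template might contain, assuming the Amazon Linux 2 EKS-optimized AMI and its standard /etc/eks/bootstrap.sh:

#!/bin/bash
# modules/node-group/userdata.tpl (illustrative sketch, rendered by templatefile() above)
# Assumes the Amazon Linux 2 EKS-optimized AMI, which ships /etc/eks/bootstrap.sh
set -o xtrace

/etc/eks/bootstrap.sh "${cluster_name}" \
  --apiserver-endpoint "${cluster_endpoint}" \
  --b64-cluster-ca "${cluster_ca}" \
  --kubelet-extra-args "${kubelet_extra_args}"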

Blue-Green Upgrade Orchestration

# environments/prod/main.tf

locals {
  # Toggle this to switch between blue and green
  active_color = "blue"
  
  # Version configuration
  current_version = "1.28"
  target_version  = "1.29"
  
  # Determine which node group is active
  blue_enabled  = local.active_color == "blue"
  green_enabled = local.active_color == "green" || var.upgrade_in_progress
}

# EKS Cluster
module "eks" {
  source = "../../modules/eks-cluster"

  cluster_name    = "prod-cluster"
  cluster_version = var.upgrade_in_progress ? local.target_version : local.current_version
  
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = false

  # Enable control plane logging
  cluster_enabled_log_types = [
    "api",
    "audit", 
    "authenticator",
    "controllerManager",
    "scheduler"
  ]
}

# Blue Node Group (current production)
module "blue_node_group" {
  source = "../../modules/node-group"
  count  = local.blue_enabled ? 1 : 0

  cluster_name       = module.eks.cluster_name
  cluster_endpoint   = module.eks.cluster_endpoint
  cluster_ca         = module.eks.cluster_ca
  kubernetes_version = local.current_version
  
  color         = "blue"
  node_role_arn = module.eks.node_role_arn
  subnet_ids    = module.vpc.private_subnet_ids

  instance_type = "m6i.xlarge"
  desired_size  = 3
  min_size      = 2
  max_size      = 10
  disk_size     = 100

  # Taint blue nodes during upgrade to prevent new scheduling
  taints = var.upgrade_in_progress ? [
    {
      key    = "upgrade"
      value  = "in-progress"
      effect = "NO_SCHEDULE"
    }
  ] : []

  labels = {
    environment = "production"
    workload    = "general"
  }
}

# Green Node Group (upgrade target)
module "green_node_group" {
  source = "../../modules/node-group"
  count  = local.green_enabled ? 1 : 0

  cluster_name       = module.eks.cluster_name
  cluster_endpoint   = module.eks.cluster_endpoint
  cluster_ca         = module.eks.cluster_ca
  kubernetes_version = local.target_version
  
  color         = "green"
  node_role_arn = module.eks.node_role_arn
  subnet_ids    = module.vpc.private_subnet_ids

  instance_type = "m6i.xlarge"
  desired_size  = 3
  min_size      = 2
  max_size      = 10
  disk_size     = 100

  labels = {
    environment = "production"
    workload    = "general"
  }
}
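
Driving the upgrade is then a matter of toggling those inputs in the right order. A sketch of the command sequence under this layout, where upgrade_in_progress is a Terraform variable and active_color is flipped by editing the locals block:

#!/bin/bash
# Illustrative upgrade sequence for environments/prod (not a turnkey script)
set -euo pipefail
cd environments/prod

# 1. Upgrade the control plane and create the green node group alongside blue
terraform plan -var="upgrade_in_progress=true" -out=upgrade.tfplan
terraform apply upgrade.tfplan

# 2. Cordon and drain blue (scripts/drain-nodes.sh below), then run health checks

# 3. After validation: edit locals to set active_color = "green", then apply again
#    to remove the blue node group and clear the in-progress flag
terraform apply -var="upgrade_in_progress=false"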

PodDisruptionBudget Configuration

Critical for safe node draining:

# pdb-critical-workloads.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-gateway-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-gateway
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: payment-service
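
Before any drain, it's worth confirming these budgets actually leave eviction headroom; a quick spot check (the production namespace matches the manifests above):

# Show current disruption headroom; anything stuck at 0 will block a drain
kubectl get pdb -n production

# Or just the numbers, via jq
kubectl get pdb -n production -o json \
  | jq -r '.items[] | "\(.metadata.name): \(.status.disruptionsAllowed) disruptions allowed"'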

Node Drain Script

#!/bin/bash
# scripts/drain-nodes.sh

set -euo pipefail

CLUSTER_NAME="${1:?Cluster name required}"
NODE_GROUP_COLOR="${2:?Node group color required}"
DRAIN_TIMEOUT="${3:-300}"

echo "Starting drain process for ${NODE_GROUP_COLOR} nodes in ${CLUSTER_NAME}"

# Get nodes in the target node group
NODES=$(kubectl get nodes -l "node-group=${NODE_GROUP_COLOR}" -o jsonpath='{.items[*].metadata.name}')

if [[ -z "$NODES" ]]; then
    echo "No nodes found with label node-group=${NODE_GROUP_COLOR}"
    exit 1
fi

# Cordon all nodes first (prevent new scheduling)
echo "Cordoning nodes..."
for NODE in $NODES; do
    echo "  Cordoning ${NODE}"
    kubectl cordon "$NODE"
done

# Drain nodes one by one
echo "Draining nodes..."
for NODE in $NODES; do
    echo "  Draining ${NODE}"
    
    kubectl drain "$NODE" \
        --ignore-daemonsets \
        --delete-emptydir-data \
        --force \
        --grace-period=60 \
        --timeout="${DRAIN_TIMEOUT}s" \
        || {
            echo "ERROR: Failed to drain ${NODE}"
            echo "Rolling back - uncordoning all nodes"
            for ROLLBACK_NODE in $NODES; do
                kubectl uncordon "$ROLLBACK_NODE" || true
            done
            exit 1
        }
    
    echo "  Successfully drained ${NODE}"
    
    # Wait between drains to allow services to stabilize
    sleep 30
done

echo "All nodes drained successfully"

Upgrade Workflow Sequence

sequenceDiagram
    participant Eng as Engineer
    participant TF as Terraform
    participant EKS as EKS Control Plane
    participant Blue as Blue Nodes
    participant Green as Green Nodes
    participant K8s as Kubernetes API

    Eng->>TF: Set upgrade_in_progress = true
    TF->>EKS: Upgrade control plane to v1.29
    EKS-->>TF: Control plane ready
    TF->>Green: Create green node group (v1.29)
    Green-->>TF: Nodes registered
    TF->>Blue: Add NO_SCHEDULE taint
    Blue-->>TF: Taint applied
    Eng->>K8s: kubectl cordon blue nodes
    K8s-->>Blue: Nodes cordoned
    Eng->>K8s: kubectl drain blue nodes
    K8s->>Blue: Evict pods
    Blue->>Green: Pods rescheduled
    Green-->>K8s: Pods running
    Eng->>Eng: Run health checks

    alt Health checks pass
        Eng->>TF: Set active_color = green
        TF->>Blue: Delete blue node group
        Blue-->>TF: Deleted
        Eng->>TF: Set upgrade_in_progress = false
    else Health checks fail
        Eng->>K8s: kubectl uncordon blue nodes
        K8s-->>Blue: Nodes uncordoned
        Eng->>TF: Delete green node group
        Green-->>TF: Deleted
        Eng->>TF: Rollback control plane
    end
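
The "Run health checks" step above is where clusters differ the most. A minimal sketch of the kind of checks I run; the deployment names come from the PDB examples earlier, and the load balancer URL is a placeholder:

#!/bin/bash
# Illustrative post-migration health checks - adapt names and endpoints to your cluster
set -euo pipefail

# Green nodes registered and Ready
kubectl get nodes -l node-group=green

# Key deployments fully rolled out (names taken from the PDB examples above)
kubectl rollout status deployment/api-gateway     -n production --timeout=120s
kubectl rollout status deployment/payment-service -n production --timeout=120s

# Nothing stuck outside Running/Succeeded
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# End-to-end check through the load balancer (placeholder URL)
curl -fsS https://api.example.com/healthz > /dev/null && echo "ALB path OK"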

Add-on Compatibility Matrix

# modules/addons/versions.tf

locals {
  addon_versions = {
    "1.28" = {
      vpc_cni            = "v1.15.4-eksbuild.1"
      coredns            = "v1.10.1-eksbuild.6"
      kube_proxy         = "v1.28.4-eksbuild.1"
      ebs_csi_driver     = "v1.25.0-eksbuild.1"
      aws_load_balancer  = "v2.6.2"
    }
    "1.29" = {
      vpc_cni            = "v1.16.0-eksbuild.1"
      coredns            = "v1.11.1-eksbuild.4"
      kube_proxy         = "v1.29.0-eksbuild.1"
      ebs_csi_driver     = "v1.26.0-eksbuild.1"
      aws_load_balancer  = "v2.7.0"
    }
    "1.30" = {
      vpc_cni            = "v1.17.1-eksbuild.1"
      coredns            = "v1.11.1-eksbuild.8"
      kube_proxy         = "v1.30.0-eksbuild.3"
      ebs_csi_driver     = "v1.28.0-eksbuild.1"
      aws_load_balancer  = "v2.7.1"
    }
  }
}

resource "aws_eks_addon" "vpc_cni" {
  cluster_name             = var.cluster_name
  addon_name               = "vpc-cni"
  addon_version            = local.addon_versions[var.cluster_version].vpc_cni
  resolve_conflicts_on_update = "OVERWRITE"
  
  configuration_values = jsonencode({
    enableNetworkPolicy = "true"
    env = {
      ENABLE_PREFIX_DELEGATION = "true"
      WARM_PREFIX_TARGET       = "1"
    }
  })
}

resource "aws_eks_addon" "coredns" {
  cluster_name             = var.cluster_name
  addon_name               = "coredns"
  addon_version            = local.addon_versions[var.cluster_version].coredns
  resolve_conflicts_on_update = "OVERWRITE"
}

resource "aws_eks_addon" "kube_proxy" {
  cluster_name             = var.cluster_name
  addon_name               = "kube-proxy"
  addon_version            = local.addon_versions[var.cluster_version].kube_proxy
  resolve_conflicts_on_update = "OVERWRITE"
}

resource "aws_eks_addon" "ebs_csi" {
  cluster_name             = var.cluster_name
  addon_name               = "aws-ebs-csi-driver"
  addon_version            = local.addon_versions[var.cluster_version].ebs_csi_driver
  service_account_role_arn = var.ebs_csi_role_arn
  resolve_conflicts_on_update = "OVERWRITE"
}
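
The pinned versions in that map go stale; rather than guessing, compatible versions can be pulled from the EKS API. For example:

# List add-on versions compatible with a given cluster version (vpc-cni shown here)
aws eks describe-addon-versions \
  --addon-name vpc-cni \
  --kubernetes-version 1.29 \
  --query 'addons[].addonVersions[].{version:addonVersion,default:compatibilities[0].defaultVersion}' \
  --output table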

Pre-Upgrade Validation Script

#!/bin/bash
# scripts/pre-upgrade-checks.sh

set -euo pipefail

TARGET_VERSION="${1:?Target Kubernetes version required}"
CLUSTER_NAME="${2:?Cluster name required}"

echo "Running pre-upgrade checks for ${CLUSTER_NAME} -> ${TARGET_VERSION}"

# Check for deprecated APIs
echo "Checking for deprecated APIs..."
DEPRECATED=$(kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis || true)
if [[ -n "$DEPRECATED" ]]; then
    echo "WARNING: Deprecated APIs in use:"
    echo "$DEPRECATED"
    echo ""
    echo "Run: kubectl api-resources --verbs=list -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found -A"
fi

# Check PodDisruptionBudgets
echo "Checking PodDisruptionBudgets..."
PDBS=$(kubectl get pdb -A -o json | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | "\(.metadata.namespace)/\(.metadata.name)"')
if [[ -n "$PDBS" ]]; then
    echo "WARNING: PDBs with 0 disruptions allowed:"
    echo "$PDBS"
    echo "These may block node draining"
fi

# Check node capacity
echo "Checking node capacity..."
TOTAL_PODS=$(kubectl get pods -A --no-headers | wc -l)
TOTAL_CAPACITY=$(kubectl get nodes -o json | jq '[.items[].status.allocatable.pods | tonumber] | add')
echo "Current pods: ${TOTAL_PODS}, Total capacity: ${TOTAL_CAPACITY}"

if (( TOTAL_PODS * 2 > TOTAL_CAPACITY )); then
    echo "WARNING: May not have enough capacity for blue-green migration"
    echo "Consider scaling up before upgrade"
fi

# Check for stuck pods
echo "Checking for stuck pods..."
STUCK_PODS=$(kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded --no-headers 2>/dev/null || true)
if [[ -n "$STUCK_PODS" ]]; then
    echo "WARNING: Pods not in Running/Succeeded state:"
    echo "$STUCK_PODS"
fi

# Verify addon compatibility
echo "Checking addon versions..."
kubectl get pods -n kube-system -o json | jq -r '.items[] | "\(.metadata.name): \(.spec.containers[0].image)"'

echo ""
echo "Pre-upgrade checks complete"

Rollback Procedure

flowchart TD
    subgraph Detection["Issue Detection"]
        A[Health check fails] --> B{Severity?}
        B -->|Critical| C[Immediate rollback]
        B -->|Minor| D[Investigate first]
        D --> E{Fixable quickly?}
        E -->|Yes| F[Apply fix]
        E -->|No| C
    end

    subgraph Rollback["Rollback Process"]
        C --> G[Uncordon blue nodes]
        G --> H[Remove taints from blue]
        H --> I[Cordon green nodes]
        I --> J[Drain green nodes]
        J --> K[Pods migrate to blue]
    end

    subgraph Cleanup["Cleanup"]
        K --> L[Verify workloads healthy]
        L --> M[Delete green node group]
        M --> N{Rollback control plane?}
        N -->|Yes| O[Create support ticket]
        N -->|No| P[Keep control plane]
        O --> Q[AWS rolls back]
        P --> R[Document incident]
        Q --> R
    end

    style Detection fill:#e63946,stroke:#fff,stroke-width:2px,color:#fff
    style Rollback fill:#f77f00,stroke:#fff,stroke-width:2px,color:#fff
    style Cleanup fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
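
The scripts/rollback.sh listed in the module tree isn't reproduced in this post; a minimal sketch of the node-level portion, assuming the node-group labels and the upgrade=in-progress taint used earlier:

#!/bin/bash
# scripts/rollback.sh (illustrative sketch) - reverse the workload migration
set -euo pipefail

# 1. Make blue schedulable again
for NODE in $(kubectl get nodes -l node-group=blue -o jsonpath='{.items[*].metadata.name}'); do
    kubectl uncordon "$NODE"
    # Remove the upgrade taint set via Terraform (trailing '-' deletes a taint)
    kubectl taint nodes "$NODE" upgrade=in-progress:NoSchedule- || true
done

# 2. Push workloads back off the green nodes
for NODE in $(kubectl get nodes -l node-group=green -o jsonpath='{.items[*].metadata.name}'); do
    kubectl cordon "$NODE"
    kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=300s
done

echo "Rollback drain complete - verify workloads, then remove the green node group via Terraform"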

Monitoring During Upgrade

# CloudWatch alarms for upgrade monitoring

resource "aws_cloudwatch_metric_alarm" "node_not_ready" {
  alarm_name          = "${var.cluster_name}-node-not-ready"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "cluster_failed_node_count"
  namespace           = "ContainerInsights"
  period              = 60
  statistic           = "Maximum"
  threshold           = 0
  alarm_description   = "EKS node not ready during upgrade"
  
  dimensions = {
    ClusterName = var.cluster_name
  }

  alarm_actions = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "pod_restart" {
  alarm_name          = "${var.cluster_name}-pod-restarts"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "pod_number_of_container_restarts"
  namespace           = "ContainerInsights"
  period              = 60
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "High pod restart rate during upgrade"
  
  dimensions = {
    ClusterName = var.cluster_name
    Namespace   = "production"
  }

  alarm_actions = [var.sns_topic_arn]
}
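
The alarms cover the paging path; during the migration itself I also keep a live view open. A simple sketch:

# Live view during the migration - run each in its own terminal

# Node readiness across blue and green, with the node-group label as a column
watch -n 10 'kubectl get nodes -L node-group'

# Anything not Running/Succeeded cluster-wide
watch -n 10 'kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded'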

Best Practices

Practice and why it matters:

  • Always test in staging first: catch issues before production
  • Use PodDisruptionBudgets: prevent service disruption during node drains
  • Upgrade one minor version at a time: EKS doesn't support skipping minor versions
  • Keep add-ons compatible: mismatched versions cause issues
  • Have a rollback plan ready: things will go wrong eventually
  • Monitor throughout the upgrade: catch issues early
  • Upgrade during low-traffic windows: minimize the blast radius

Troubleshooting

"Pods stuck in Pending after drain"

# Check for resource constraints
kubectl describe pod <pod-name> -n <namespace>

# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Check PDB status
kubectl get pdb -A

"Node drain timeout"

  • Increase the grace period for slow-terminating pods
  • Check whether a PDB is blocking eviction
  • Force delete stuck pods if it's safe to do so (see the snippet below)
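
If eviction is genuinely wedged and the pod is safe to kill (stateless, or its state is replicated elsewhere), a force delete is the last resort:

# Last resort: bypass graceful eviction for a stuck pod
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force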

"Control plane upgrade stuck"

  • Check EKS console for detailed status
  • Review CloudWatch logs for control plane
  • Contact AWS support if stuck > 1 hour

Conclusion

Zero-downtime EKS upgrades are achievable with proper planning and automation. The blue-green node group strategy provides:

  • Safety: Workloads keep running on stable nodes during upgrade
  • Rollback capability: Instantly revert if issues arise
  • Validation time: Test thoroughly before committing

The key is treating upgrades as a first-class operational concern, not an afterthought. With Terraform automation and proper PDBs, what used to be a nerve-wracking maintenance window becomes a routine, low-risk operation.

Tags

#eks #kubernetes #upgrades #blue-green #terraform #high-availability
