Introduction
EKS compute costs can spiral out of control quickly. I inherited a cluster with static node groups that were either over-provisioned (wasting money) or under-provisioned (causing scheduling failures). Karpenter changed everything - it provisions exactly the right nodes, at the right time, using the cheapest instance types.
This post covers how I built a multi-tenant EKS platform that:
- Reduced compute costs by 60% using Spot instances intelligently
- Provisions nodes in under 60 seconds (vs 5+ minutes with Cluster Autoscaler)
- Consolidates underutilized nodes automatically
- Allocates costs per team with accurate showback
Architecture Overview
flowchart TB
subgraph EKSCluster["EKS Cluster"]
subgraph ControlPlane["Control Plane"]
API[Kubernetes API]
KARPENTER[Karpenter Controller]
end
subgraph NodePools["Node Pools"]
subgraph OnDemand["On-Demand Pool"]
OD1[m6i.xlarge]
OD2[m6i.2xlarge]
end
subgraph Spot["Spot Pool"]
SP1[c6i.xlarge]
SP2[c6i.2xlarge]
SP3[m6i.xlarge]
SP4[r6i.xlarge]
end
subgraph GPU["GPU Pool"]
GPU1[g5.xlarge]
GPU2[g5.2xlarge]
end
end
subgraph Workloads["Workloads by Team"]
subgraph TeamA["Team A - Production"]
A1[Critical API]
A2[Payment Service]
end
subgraph TeamB["Team B - Analytics"]
B1[Spark Jobs]
B2[Data Pipeline]
end
subgraph TeamC["Team C - ML"]
C1[Training Jobs]
C2[Inference]
end
end
end
subgraph CostManagement["Cost Management"]
KUBECOST[Kubecost]
CUR[AWS Cost & Usage Report]
DASHBOARD[Cost Dashboard]
end
KARPENTER --> NodePools
TeamA --> OnDemand
TeamB --> Spot
TeamC --> GPU
NodePools --> KUBECOST
KUBECOST --> DASHBOARD
CUR --> DASHBOARD
style ControlPlane fill:#1a1a2e,stroke:#00d9ff,stroke-width:2px,color:#fff
style OnDemand fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
style Spot fill:#f77f00,stroke:#fff,stroke-width:2px,color:#fff
style GPU fill:#9b5de5,stroke:#fff,stroke-width:2px,color:#fff
style CostManagement fill:#264653,stroke:#e63946,stroke-width:2px,color:#fff
Karpenter Installation
# karpenter/main.tf
resource "helm_release" "karpenter" {
name = "karpenter"
repository = "oci://public.ecr.aws/karpenter"
chart = "karpenter"
version = "0.34.0" # NodePool disruption budgets used below require Karpenter v0.34+
namespace = "karpenter"
create_namespace = true
values = [
yamlencode({
settings = {
clusterName = var.cluster_name
clusterEndpoint = var.cluster_endpoint
interruptionQueue = aws_sqs_queue.karpenter.name
}
serviceAccount = {
annotations = {
"eks.amazonaws.com/role-arn" = aws_iam_role.karpenter.arn
}
}
controller = {
resources = {
requests = {
cpu = "500m"
memory = "512Mi"
}
limits = {
cpu = "1"
memory = "1Gi"
}
}
}
# Run two controller replicas for high availability
replicas = 2
})
]
depends_on = [
aws_iam_role_policy_attachment.karpenter,
]
}
# SQS queue for Spot interruption handling
resource "aws_sqs_queue" "karpenter" {
name = "karpenter-${var.cluster_name}"
message_retention_seconds = 300
sqs_managed_sse_enabled = true
}
# EventBridge rules for Spot interruptions
resource "aws_cloudwatch_event_rule" "spot_interruption" {
name = "karpenter-spot-interruption"
description = "Spot instance interruption notice"
event_pattern = jsonencode({
source = ["aws.ec2"]
detail-type = ["EC2 Spot Instance Interruption Warning"]
})
}
resource "aws_cloudwatch_event_target" "spot_interruption" {
rule = aws_cloudwatch_event_rule.spot_interruption.name
target_id = "karpenter"
arn = aws_sqs_queue.karpenter.arn
}
NodePool Configuration
flowchart TD
subgraph NodePoolStrategy["Node Pool Strategy"]
direction TB
subgraph Critical["Critical Workloads"]
CRIT_REQ["Requirements:<br/>- High availability<br/>- Predictable performance"]
CRIT_POOL["On-Demand NodePool<br/>- m6i, c6i families<br/>- No Spot"]
end
subgraph General["General Workloads"]
GEN_REQ["Requirements:<br/>- Cost efficient<br/>- Interruption tolerant"]
GEN_POOL["Spot NodePool<br/>- Diverse instance types<br/>- 70% cost savings"]
end
subgraph Batch["Batch/Analytics"]
BATCH_REQ["Requirements:<br/>- Checkpointing<br/>- Flexible scheduling"]
BATCH_POOL["Spot NodePool<br/>- Large instances<br/>- Consolidation enabled"]
end
subgraph MLWorkloads["ML Workloads"]
ML_REQ["Requirements:<br/>- GPU instances<br/>- Training/Inference"]
ML_POOL["GPU NodePool<br/>- g5 instances<br/>- Spot for training"]
end
end
style Critical fill:#e63946,stroke:#fff,stroke-width:2px,color:#fff
style General fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
style Batch fill:#f77f00,stroke:#fff,stroke-width:2px,color:#fff
style MLWorkloads fill:#9b5de5,stroke:#fff,stroke-width:2px,color:#fff
On-Demand NodePool for Critical Workloads
# karpenter/nodepools/critical.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: critical
spec:
template:
metadata:
labels:
workload-type: critical
billing-team: platform
spec:
requirements:
- key: kubernetes.io/arch
operator: In
values: ["amd64"]
- key: kubernetes.io/os
operator: In
values: ["linux"]
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"] # No Spot for critical
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["m", "c"]
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values: ["5"]
- key: karpenter.k8s.aws/instance-size
operator: In
values: ["xlarge", "2xlarge", "4xlarge"]
nodeClassRef:
name: default
taints:
- key: workload-type
value: critical
effect: NoSchedule
limits:
cpu: 1000
memory: 2000Gi
disruption:
consolidationPolicy: WhenEmpty
consolidateAfter: 30s
budgets:
- nodes: "10%"
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
name: default
spec:
amiFamily: AL2
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: "true"
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: "true"
instanceProfile: KarpenterNodeInstanceProfile
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 100Gi
volumeType: gp3
encrypted: true
deleteOnTermination: true
tags:
Environment: production
ManagedBy: karpenter
Spot NodePool for General Workloads
# karpenter/nodepools/spot-general.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: spot-general
spec:
template:
metadata:
labels:
workload-type: general
capacity-type: spot
spec:
requirements:
- key: kubernetes.io/arch
operator: In
values: ["amd64"]
- key: karpenter.sh/capacity-type
operator: In
values: ["spot"]
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["c", "m", "r"]
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values: ["5"]
- key: karpenter.k8s.aws/instance-size
operator: In
values: ["large", "xlarge", "2xlarge"]
# Diversify across instance types for Spot availability
- key: karpenter.k8s.aws/instance-family
operator: In
values: ["c6i", "c6a", "c7i", "m6i", "m6a", "m7i", "r6i", "r6a"]
nodeClassRef:
name: default
limits:
cpu: 2000
memory: 4000Gi
disruption:
# consolidateAfter cannot be combined with WhenUnderutilized in v1beta1
consolidationPolicy: WhenUnderutilized
budgets:
- nodes: "20%"
# Weight for scheduling preference (higher = preferred)
weight: 100
GPU NodePool for ML Workloads
# karpenter/nodepools/gpu.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: gpu
spec:
template:
metadata:
labels:
workload-type: gpu
nvidia.com/gpu: "true"
spec:
requirements:
- key: kubernetes.io/arch
operator: In
values: ["amd64"]
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["g"]
- key: karpenter.k8s.aws/instance-family
operator: In
values: ["g5", "g4dn"]
- key: karpenter.k8s.aws/instance-size
operator: In
values: ["xlarge", "2xlarge", "4xlarge"]
nodeClassRef:
name: gpu
taints:
- key: nvidia.com/gpu
value: "true"
effect: NoSchedule
limits:
cpu: 500
memory: 1000Gi
nvidia.com/gpu: 50
disruption:
consolidationPolicy: WhenEmpty
consolidateAfter: 5m
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
name: gpu
spec:
amiFamily: AL2
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: "true"
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: "true"
instanceProfile: KarpenterNodeInstanceProfile
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 200Gi
volumeType: gp3
encrypted: true
# NVIDIA driver installation
userData: |
#!/bin/bash
set -e
# Install NVIDIA drivers
amazon-linux-extras install -y epel
yum install -y nvidia-driver-latest-dkms
# Install nvidia-container-toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo | \
tee /etc/yum.repos.d/nvidia-container-toolkit.repo
yum install -y nvidia-container-toolkit
# Configure containerd
nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd
Node Provisioning Flow
sequenceDiagram
participant Scheduler as Kubernetes Scheduler
participant Karpenter as Karpenter Controller
participant EC2 as AWS EC2
participant Node as New Node
Scheduler->>Scheduler: Pod pending (no capacity)
Scheduler->>Karpenter: Unschedulable pod event
Karpenter->>Karpenter: Evaluate NodePools
Karpenter->>Karpenter: Calculate optimal instance type
Note over Karpenter: Consider: CPU, memory, GPU,<br/>architecture, Spot availability
Karpenter->>EC2: CreateFleet (Spot or On-Demand)
EC2-->>Karpenter: Instance launched
Karpenter->>Node: Bootstrap node
Node->>Node: Join cluster
Node-->>Scheduler: Node ready
Scheduler->>Node: Schedule pending pods
Note over Karpenter: ~60 seconds total
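To make the "calculate optimal instance type" step concrete: Karpenter bin-packs the resource requests of pending pods against the instance types each NodePool allows and launches the cheapest option that fits. A minimal sketch of a pod that would trigger this (the name, sizes, and image are illustrative):
# capacity-probe.yaml (hypothetical) - these requests would typically be
# packed onto a single xlarge-class node from the untainted spot-general pool
apiVersion: v1
kind: Pod
metadata:
  name: capacity-probe
  namespace: team-b
spec:
  priorityClassName: default
  containers:
    - name: worker
      image: busybox:1.36
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "2"
          memory: 6Gi
        limits:
          cpu: "2"
          memory: 6Gi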
Multi-Tenant Resource Quotas
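Quotas hang off per-team namespaces, and labeling each namespace with the team name is also what lets Kubecost roll costs up per team later (the label key matches the allocation config further down). A minimal sketch, assuming team-a namespaces are created this way:
# namespaces/team-a.yaml (assumed layout)
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    team: team-a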
# quotas/team-a.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-a-quota
namespace: team-a
spec:
hard:
requests.cpu: "100"
requests.memory: 200Gi
limits.cpu: "200"
limits.memory: 400Gi
persistentvolumeclaims: "50"
services.loadbalancers: "5"
---
apiVersion: v1
kind: LimitRange
metadata:
name: team-a-limits
namespace: team-a
spec:
limits:
- type: Container
default:
cpu: "500m"
memory: 512Mi
defaultRequest:
cpu: "100m"
memory: 128Mi
max:
cpu: "8"
memory: 32Gi
- type: PersistentVolumeClaim
max:
storage: 100Gi
Priority Classes for Workload Scheduling
# priority-classes.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: critical
value: 1000000
globalDefault: false
description: "Critical production workloads - highest priority, never preempts others"
preemptionPolicy: Never
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high
value: 100000
globalDefault: false
description: "High priority workloads"
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: default
value: 10000
globalDefault: true
description: "Default priority for general workloads"
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: batch
value: 1000
globalDefault: false
description: "Batch jobs - can be preempted"
preemptionPolicy: PreemptLowerPriority
Node Consolidation
flowchart TD
subgraph Before["Before Consolidation"]
N1["Node 1<br/>CPU: 20%<br/>Memory: 30%"]
N2["Node 2<br/>CPU: 25%<br/>Memory: 20%"]
N3["Node 3<br/>CPU: 15%<br/>Memory: 25%"]
P1[Pod A] --> N1
P2[Pod B] --> N1
P3[Pod C] --> N2
P4[Pod D] --> N3
end
subgraph After["After Consolidation"]
N1A["Node 1<br/>CPU: 60%<br/>Memory: 75%"]
P1A[Pod A] --> N1A
P2A[Pod B] --> N1A
P3A[Pod C] --> N1A
P4A[Pod D] --> N1A
end
Before -->|"Karpenter<br/>Consolidation"| After
SAVINGS["Cost Savings:<br/>2 nodes removed<br/>~66% reduction"]
style Before fill:#e63946,stroke:#fff,stroke-width:2px,color:#fff
style After fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
style SAVINGS fill:#ffbe0b,stroke:#fff,stroke-width:2px,color:#000
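Consolidation is aggressive by design, so workloads that genuinely cannot be moved need an explicit opt-out. Karpenter skips voluntary disruption of nodes running pods that carry the karpenter.sh/do-not-disrupt annotation; a minimal sketch (the workload name and image are hypothetical):
# do-not-disrupt-example.yaml (hypothetical workload)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-builder
  namespace: team-b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: report-builder
  template:
    metadata:
      labels:
        app: report-builder
      annotations:
        # Karpenter will not voluntarily consolidate a node
        # while a pod with this annotation is running on it
        karpenter.sh/do-not-disrupt: "true"
    spec:
      containers:
        - name: builder
          image: report-builder:v1   # hypothetical image
          resources:
            requests:
              cpu: "1"
              memory: 2Gi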
Spot Instance Strategy
# spot-strategy.tf
# Diversify across instance types and AZs
locals {
spot_instance_types = [
# Compute optimized
"c6i.large", "c6i.xlarge", "c6i.2xlarge",
"c6a.large", "c6a.xlarge", "c6a.2xlarge",
"c7i.large", "c7i.xlarge", "c7i.2xlarge",
# General purpose
"m6i.large", "m6i.xlarge", "m6i.2xlarge",
"m6a.large", "m6a.xlarge", "m6a.2xlarge",
"m7i.large", "m7i.xlarge", "m7i.2xlarge",
# Memory optimized (for some workloads)
"r6i.large", "r6i.xlarge",
"r6a.large", "r6a.xlarge",
]
}
# Monitor Spot pricing and availability
resource "aws_cloudwatch_metric_alarm" "spot_interruptions" {
alarm_name = "high-spot-interruptions"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "SpotInterruptionRate"
namespace = "Karpenter"
period = 300
statistic = "Sum"
threshold = 5
alarm_description = "High rate of Spot interruptions"
alarm_actions = [var.sns_topic_arn]
}
Cost Allocation with Kubecost
# kubecost/main.tf
resource "helm_release" "kubecost" {
name = "kubecost"
repository = "https://kubecost.github.io/cost-analyzer/"
chart = "cost-analyzer"
namespace = "kubecost"
version = "1.106.0"
create_namespace = true
values = [
yamlencode({
global = {
prometheus = {
enabled = false # Use existing Prometheus
fqdn = "http://prometheus-server.monitoring:80"
}
}
kubecostModel = {
etlCloudAsset = true
}
# AWS integration for accurate pricing
kubecostProductConfigs = {
cloudIntegrationJSON = jsonencode({
aws = [{
athenaBucketName = var.athena_bucket
athenaRegion = var.region
athenaDatabase = "athenacurcfn_cost_report"
athenaTable = "cost_report"
athenaWorkgroup = "primary"
masterPayerARN = var.master_payer_arn
}]
})
}
# Cost allocation
kubecostDeployment = {
labels = {
"app.kubernetes.io/component" = "cost-analyzer"
}
}
})
]
}
Cost Allocation Dashboard
# kubecost/allocation-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: allocation-config
namespace: kubecost
data:
allocation.yaml: |
# Team-based allocation
teams:
- name: team-a
namespaces:
- team-a
- team-a-staging
labels:
team: team-a
- name: team-b
namespaces:
- team-b
- analytics
labels:
team: team-b
- name: platform
namespaces:
- kube-system
- monitoring
- karpenter
labels:
team: platform
# Shared cost distribution
sharedCosts:
- name: cluster-overhead
type: weighted
filter:
namespaces:
- kube-system
- monitoring
- name: networking
type: proportional
filter:
labels:
cost-type: networking
Cost Savings Breakdown
flowchart LR
subgraph Before["Before Karpenter"]
B_NODES["Static Node Groups<br/>Always running<br/>Over-provisioned"]
B_COST["Monthly Cost<br/>$50,000"]
end
subgraph After["After Karpenter"]
subgraph Savings["Savings Sources"]
S1["Spot Instances<br/>-40%"]
S2["Right-sizing<br/>-15%"]
S3["Consolidation<br/>-10%"]
S4["Scale to Zero<br/>-5%"]
end
A_COST["Monthly Cost<br/>$20,000"]
end
Before --> After
TOTAL["Total Savings: 60%<br/>$30,000/month"]
style Before fill:#e63946,stroke:#fff,stroke-width:2px,color:#fff
style After fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
style TOTAL fill:#ffbe0b,stroke:#fff,stroke-width:2px,color:#000
Workload Examples
Critical Workload (On-Demand)
apiVersion: apps/v1
kind: Deployment
metadata:
name: payment-service
namespace: team-a
spec:
replicas: 3
selector:
matchLabels:
app: payment-service
template:
metadata:
labels:
app: payment-service
workload-type: critical
spec:
priorityClassName: critical
nodeSelector:
workload-type: critical
tolerations:
- key: workload-type
value: critical
effect: NoSchedule
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: payment-service
containers:
- name: payment-service
image: payment-service:v1.2.3
resources:
requests:
cpu: "500m"
memory: 512Mi
limits:
cpu: "2"
memory: 2Gi
Batch Workload (Spot)
apiVersion: batch/v1
kind: Job
metadata:
name: data-processing
namespace: team-b
spec:
parallelism: 10
completions: 100
backoffLimit: 3
template:
metadata:
labels:
app: data-processing
workload-type: batch
spec:
priorityClassName: batch
nodeSelector:
capacity-type: spot
tolerations:
- key: karpenter.sh/disruption
operator: Exists
restartPolicy: OnFailure
containers:
- name: processor
image: data-processor:v2.0
resources:
requests:
cpu: "2"
memory: 4Gi
limits:
cpu: "4"
memory: 8Gi
Monitoring and Alerts
# monitoring/karpenter-alerts.tf
resource "aws_cloudwatch_metric_alarm" "karpenter_pending_pods" {
alarm_name = "karpenter-pending-pods"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "pending_pods"
namespace = "Karpenter"
period = 60
statistic = "Maximum"
threshold = 10
alarm_description = "Too many pending pods - Karpenter may be failing"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "node_launch_failures" {
alarm_name = "karpenter-launch-failures"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "nodeclaims_launch_failed"
namespace = "Karpenter"
period = 300
statistic = "Sum"
threshold = 3
alarm_description = "Karpenter failing to launch nodes"
alarm_actions = [var.sns_topic_arn]
}
Best Practices
| Practice | Why |
|---|---|
| Diversify Spot instance types | Higher availability, fewer interruptions |
| Use consolidation wisely | Balance cost vs stability |
| Set appropriate limits | Prevent runaway scaling |
| Tag everything | Accurate cost allocation |
| Use PriorityClasses | Protect critical workloads |
| Monitor Spot interruptions | React to capacity issues |
Troubleshooting
"Pods stuck pending"
# Check Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -f
# Check NodePool status
kubectl get nodepools -o yaml
# Check if limits are reached
kubectl get nodepools -o jsonpath='{.items[*].status}'
"Nodes not consolidating"
- Check PodDisruptionBudgets - an overly strict PDB blocks node drains (see the sketch below)
- Verify the consolidation policy is set on the NodePool
- Check for pods carrying the karpenter.sh/do-not-disrupt annotation
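On the first point: a PDB that never allows an eviction blocks node drains indefinitely, while one that permits a single eviction keeps consolidation moving. A minimal sketch for the payment-service example above (3 replicas, so minAvailable: 2 still lets one pod move at a time):
# pdb/payment-service.yaml (assumed path)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service
  namespace: team-a
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payment-service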
"Spot interruptions causing issues"
- Increase instance type diversity
- Add a fallback to on-demand (see the sketch below)
- Implement proper pod disruption handling
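For the on-demand fallback, allowing both capacity types in a single NodePool is usually enough: Karpenter prefers Spot when capacity is available and falls back to On-Demand when it is not. A minimal sketch (the pool name is illustrative; only the relevant requirements are shown):
# karpenter/nodepools/spot-fallback.yaml (illustrative)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-with-fallback
spec:
  template:
    spec:
      requirements:
        # Spot first, On-Demand when Spot capacity is unavailable
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        name: default
  limits:
    cpu: 1000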
Conclusion
Karpenter transforms EKS cost management from a guessing game into a precise, automated process. The combination of:
- Intelligent provisioning - right instance at the right time
- Spot instances - 70% savings on interruptible workloads
- Automatic consolidation - no more wasted capacity
- Per-team cost allocation - accountability and showback
delivers significant cost savings while actually improving cluster responsiveness. The key is matching workload requirements to the right NodePool and letting Karpenter handle the rest.