Introduction
Cloud costs can quickly spiral out of control without proper governance. I inherited a multi-account AWS environment spending nearly $2M annually (about $150K/month) with no cost visibility, no accountability, and monthly billing surprises. I built a FinOps framework that brought costs under control and changed the culture around cloud spending.
The transformation delivered:
- 30% cost reduction ($540K annual savings)
- Predictable monthly spend with 95% forecast accuracy
- Per-team chargeback creating cost accountability
- Automated anomaly detection catching issues in hours, not weeks
Architecture Overview
flowchart TB
subgraph Sources["Cost Data Sources"]
CUR[Cost & Usage Report]
ORG[AWS Organizations]
TAGS[Resource Tags]
end
subgraph Processing["Data Processing"]
ATHENA[Athena Queries]
GLUE[Glue ETL Jobs]
LAMBDA[Lambda Processors]
end
subgraph Analysis["Cost Analysis"]
CID[Cost Intelligence Dashboard]
ANOMALY[Cost Anomaly Detection]
BUDGETS[AWS Budgets]
FORECAST[Cost Forecasting]
end
subgraph Reporting["Reporting & Actions"]
QS[QuickSight Dashboards]
SNS[SNS Notifications]
SLACK[Slack Integration]
TICKETS[Automated Tickets]
end
subgraph Governance["Governance"]
POLICIES[Cost Policies]
QUOTAS[Service Quotas]
TAGGING[Tagging Standards]
end
Sources --> Processing
Processing --> Analysis
Analysis --> Reporting
Governance --> Sources
style Sources fill:#1a1a2e,stroke:#00d9ff,stroke-width:2px,color:#fff
style Processing fill:#264653,stroke:#2a9d8f,stroke-width:2px,color:#fff
style Analysis fill:#f77f00,stroke:#fff,stroke-width:2px,color:#fff
style Reporting fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
style Governance fill:#9b5de5,stroke:#fff,stroke-width:2px,color:#fff
Cost & Usage Report Setup
# cost-reporting/cur.tf
resource "aws_cur_report_definition" "enterprise" {
report_name = "enterprise-cost-usage-report"
time_unit = "HOURLY"
format = "Parquet"
compression = "Parquet"
additional_schema_elements = ["RESOURCES", "SPLIT_COST_ALLOCATION_DATA"]
s3_bucket = aws_s3_bucket.cur.id
s3_region = "us-east-1"
s3_prefix = "cur"
additional_artifacts = ["ATHENA"]
report_versioning = "OVERWRITE_REPORT"
refresh_closed_reports = true
}
resource "aws_s3_bucket" "cur" {
bucket = "company-cost-usage-reports"
tags = {
Purpose = "Cost & Usage Reports"
Compliance = "required"
}
}
resource "aws_s3_bucket_policy" "cur" {
bucket = aws_s3_bucket.cur.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "AllowCURDelivery"
Effect = "Allow"
Principal = {
Service = "billingreports.amazonaws.com"
}
Action = [
"s3:GetBucketAcl",
"s3:GetBucketPolicy"
]
Resource = aws_s3_bucket.cur.arn
},
{
Sid = "AllowCURWrite"
Effect = "Allow"
Principal = {
Service = "billingreports.amazonaws.com"
}
Action = "s3:PutObject"
Resource = "${aws_s3_bucket.cur.arn}/*"
}
]
})
}
# Athena setup for CUR queries
resource "aws_athena_workgroup" "cur" {
name = "cur-analysis"
configuration {
enforce_workgroup_configuration = true
publish_cloudwatch_metrics_enabled = true
result_configuration {
output_location = "s3://${aws_s3_bucket.athena_results.bucket}/output/"
encryption_configuration {
encryption_option = "SSE_S3"
}
}
engine_version {
selected_engine_version = "Athena engine version 3"
}
}
tags = {
Team = "finops"
}
}
resource "aws_glue_catalog_database" "cur" {
name = "cur_database"
description = "Cost and Usage Report database"
}
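Before moving on, it's worth confirming the pipeline works end to end. Enabling the ATHENA artifact makes AWS deliver a CloudFormation template alongside the report that creates the Glue crawler and table; once that's deployed and the first report lands (initial delivery can take up to 24 hours), a quick boto3 sketch can verify the data is queryable. This assumes the database, table, and workgroup names from the Terraform above and the standard year/month partitions; the dates are placeholders.
# scripts/validate_cur.py
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

QUERY = """
SELECT line_item_product_code AS service,
       SUM(line_item_unblended_cost) AS cost
FROM cur_database.cost_and_usage_report
WHERE year = '2024' AND month = '6'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
"""

def run_query(sql):
    # the workgroup enforces its own output location, so none is passed here
    qid = athena.start_query_execution(
        QueryString=sql,
        WorkGroup="cur-analysis",
    )["QueryExecutionId"]
    state = "QUEUED"
    while state in ("QUEUED", "RUNNING"):
        time.sleep(2)
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state != "SUCCEEDED":
        raise RuntimeError(f"query {qid} finished in state {state}")
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]

if __name__ == "__main__":
    for row in run_query(QUERY):
        print([col.get("VarCharValue") for col in row["Data"]])
Cost Intelligence Dashboard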
# quicksight/cid.tf
# Deploy Cost Intelligence Dashboard using CloudFormation
resource "aws_cloudformation_stack" "cid" {
name = "cost-intelligence-dashboard"
template_url = "https://aws-well-architected-labs.s3.amazonaws.com/Cost/Labs/400_Cost_Intelligence_Dashboard/cid-cfn.yaml"
parameters = {
QuickSightUserName = var.quicksight_admin_user
CURBucket = aws_s3_bucket.cur.id
CURDatabaseName = aws_glue_catalog_database.cur.name
CURTableName = "cost_and_usage_report"
OptimizationDataCollectionAccountID = var.management_account_id
}
capabilities = ["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"]
tags = {
Dashboard = "CID"
Team = "finops"
}
}
# QuickSight data source
resource "aws_quicksight_data_source" "athena_cur" {
data_source_id = "athena-cur"
name = "Athena CUR Data Source"
type = "ATHENA"
parameters {
athena {
work_group = aws_athena_workgroup.cur.id
}
}
ssl_properties {
disable_ssl = false
}
aws_account_id = data.aws_caller_identity.current.account_id
permission {
principal = aws_quicksight_group.finops.arn
actions = [
"quicksight:DescribeDataSource",
"quicksight:DescribeDataSourcePermissions",
"quicksight:PassDataSource",
"quicksight:UpdateDataSource",
"quicksight:UpdateDataSourcePermissions"
]
}
}
# Custom analysis for executive dashboard
resource "aws_quicksight_analysis" "executive_summary" {
analysis_id = "executive-cost-summary"
name = "Executive Cost Summary"
source_entity {
source_template {
arn = aws_quicksight_template.executive.arn
data_set_references {
data_set_arn = aws_quicksight_data_set.monthly_costs.arn
data_set_placeholder = "monthlycosts"
}
}
}
aws_account_id = data.aws_caller_identity.current.account_id
permissions {
principal = aws_quicksight_group.executives.arn
actions = [
"quicksight:RestoreAnalysis",
"quicksight:UpdateAnalysisPermissions",
"quicksight:DeleteAnalysis",
"quicksight:DescribeAnalysisPermissions",
"quicksight:QueryAnalysis",
"quicksight:DescribeAnalysis",
"quicksight:UpdateAnalysis"
]
}
}
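The CID stack can take several minutes to create its QuickSight assets. A quick deployment check, sketched with boto3; the stack name matches the Terraform above, while dashboard IDs vary by CID version.
# scripts/check_cid.py
import boto3

cfn = boto3.client("cloudformation")
qs = boto3.client("quicksight")
account_id = boto3.client("sts").get_caller_identity()["Account"]

stack = cfn.describe_stacks(StackName="cost-intelligence-dashboard")["Stacks"][0]
print("stack status:", stack["StackStatus"])

for dashboard in qs.list_dashboards(AwsAccountId=account_id)["DashboardSummaryList"]:
    print(dashboard["DashboardId"], "-", dashboard["Name"])
Cost Anomaly Detection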
flowchart TD
subgraph Detection["Anomaly Detection Flow"]
COLLECT[Collect hourly cost data]
ML[ML Model analyzes patterns]
DETECT[Detect anomalies]
COLLECT --> ML
ML --> DETECT
end
subgraph Evaluation["Anomaly Evaluation"]
THRESHOLD{Cost increase > threshold?}
CONTEXT[Evaluate context]
CLASSIFY[Classify severity]
DETECT --> THRESHOLD
THRESHOLD -->|Yes| CONTEXT
CONTEXT --> CLASSIFY
end
subgraph Response["Response Actions"]
ALERT_LOW[Low: Email notification]
ALERT_MED[Medium: Slack + Email]
ALERT_HIGH[High: PagerDuty + Slack]
TICKET[Create Jira ticket]
CLASSIFY --> ALERT_LOW
CLASSIFY --> ALERT_MED
CLASSIFY --> ALERT_HIGH
ALERT_HIGH --> TICKET
end
style Detection fill:#1a1a2e,stroke:#00d9ff,stroke-width:2px,color:#fff
style Evaluation fill:#f77f00,stroke:#fff,stroke-width:2px,color:#fff
style Response fill:#e63946,stroke:#fff,stroke-width:2px,color:#fff
# cost-anomaly/main.tf
resource "aws_ce_anomaly_monitor" "service_monitor" {
name = "service-cost-monitor"
monitor_type = "DIMENSIONAL"
monitor_dimension = "SERVICE"
tags = {
Team = "finops"
}
}
resource "aws_ce_anomaly_monitor" "account_monitor" {
name = "account-cost-monitor"
monitor_type = "DIMENSIONAL"
monitor_dimension = "LINKED_ACCOUNT"
}
# High impact anomalies
resource "aws_ce_anomaly_subscription" "high_impact" {
name = "high-impact-anomalies"
frequency = "IMMEDIATE"
monitor_arn_list = [
aws_ce_anomaly_monitor.service_monitor.arn,
aws_ce_anomaly_monitor.account_monitor.arn,
]
subscriber {
type = "SNS"
address = aws_sns_topic.cost_alerts.arn
}
threshold_expression {
  # a single condition goes directly here; Cost Explorer rejects
  # and/or expressions with fewer than two operands
  dimension {
    key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
    values        = ["500"]
    match_options = ["GREATER_THAN_OR_EQUAL"]
  }
}
tags = {
Severity = "high"
}
}
# Daily summary of all anomalies
resource "aws_ce_anomaly_subscription" "daily_summary" {
name = "daily-anomaly-summary"
frequency = "DAILY"
monitor_arn_list = [
aws_ce_anomaly_monitor.service_monitor.arn,
aws_ce_anomaly_monitor.account_monitor.arn,
]
subscriber {
type = "EMAIL"
address = "finops-team@company.com"
}
threshold_expression {
  dimension {
    key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
    values        = ["100"]
    match_options = ["GREATER_THAN_OR_EQUAL"]
  }
}
}
# Lambda to process anomalies and create tickets
resource "aws_lambda_function" "anomaly_processor" {
filename = "anomaly_processor.zip"
function_name = "cost-anomaly-processor"
role = aws_iam_role.anomaly_processor.arn
handler = "index.handler"
runtime = "python3.11"
timeout = 60
environment {
variables = {
SLACK_WEBHOOK = var.slack_webhook_url
JIRA_API_URL = var.jira_api_url
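# sensitive value; prefer Secrets Manager or SSM Parameter Store over env vars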
JIRA_API_TOKEN = var.jira_api_token
SEVERITY_THRESHOLD = "500"
}
}
}
resource "aws_sns_topic_subscription" "anomaly_to_lambda" {
topic_arn = aws_sns_topic.cost_alerts.arn
protocol = "lambda"
endpoint = aws_lambda_function.anomaly_processor.arn
}Budget Management
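The Terraform above packages `anomaly_processor.zip` without showing its contents. A minimal sketch of what `index.handler` could look like follows; the anomaly payload fields, the Slack message, and the Jira request are illustrative (verify them against a real SNS event and your Jira deployment), and the FINOPS project key is hypothetical.
# lambda/anomaly_processor/index.py
import json
import os
import urllib.request

def handler(event, context):
    for record in event["Records"]:
        # Cost Anomaly Detection delivers a JSON document via SNS;
        # field names here should be checked against a sample event
        anomaly = json.loads(record["Sns"]["Message"])
        impact = float(anomaly.get("impact", {}).get("totalImpact", 0))
        notify_slack(f"Cost anomaly detected: ~${impact:,.2f} impact")
        if impact >= float(os.environ.get("SEVERITY_THRESHOLD", "500")):
            create_jira_ticket(anomaly, impact)

def notify_slack(text):
    req = urllib.request.Request(
        os.environ["SLACK_WEBHOOK"],
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

def create_jira_ticket(anomaly, impact):
    # Bearer auth suits Jira Server/DC personal access tokens;
    # Jira Cloud normally uses Basic auth with an API token
    body = {
        "fields": {
            "project": {"key": "FINOPS"},  # hypothetical project key
            "issuetype": {"name": "Task"},
            "summary": f"Cost anomaly: ${impact:,.2f}",
            "description": json.dumps(anomaly, indent=2),
        }
    }
    req = urllib.request.Request(
        f"{os.environ['JIRA_API_URL']}/rest/api/2/issue",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['JIRA_API_TOKEN']}",
        },
    )
    urllib.request.urlopen(req)
Budget Management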
# budgets/hierarchical.tf
locals {
teams = {
platform = {
monthly_budget = 15000
contacts = ["platform-leads@company.com"]
services = ["EC2", "EKS", "RDS"]
}
data = {
monthly_budget = 25000
contacts = ["data-leads@company.com"]
services = ["EMR", "Glue", "Athena", "S3"]
}
ml = {
monthly_budget = 30000
contacts = ["ml-leads@company.com"]
services = ["SageMaker", "Bedrock", "EC2"]
}
}
}
# Team-level budgets
resource "aws_budgets_budget" "team_budgets" {
for_each = local.teams
name = "${each.key}-monthly-budget"
budget_type = "COST"
limit_amount = each.value.monthly_budget
limit_unit = "USD"
time_unit = "MONTHLY"
cost_filter {
  name = "TagKeyValue"
  # format() sidesteps HCL's $${} escape sequence: the API expects the
  # literal string "user:Team$<value>"
  values = [format("user:Team$%s", each.key)]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = each.value.contacts
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_email_addresses = concat(
each.value.contacts,
["cfo@company.com"]
)
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 120
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = concat(
each.value.contacts,
["cfo@company.com", "cto@company.com"]
)
subscriber_sns_topic_arns = [aws_sns_topic.budget_breach.arn]
}
}
# Organization-level budget
resource "aws_budgets_budget" "organizational" {
name = "organizational-monthly-budget"
budget_type = "COST"
limit_amount = "150000"
limit_unit = "USD"
time_unit = "MONTHLY"
notification {
comparison_operator = "GREATER_THAN"
threshold = 90
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_email_addresses = [
"cfo@company.com",
"cto@company.com"
]
}
}
# Service-specific budgets for high-cost services
resource "aws_budgets_budget" "ec2_compute" {
name = "ec2-compute-budget"
budget_type = "COST"
limit_amount = "50000"
limit_unit = "USD"
time_unit = "MONTHLY"
cost_filter {
name = "Service"
values = ["Amazon Elastic Compute Cloud - Compute"]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 85
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts.arn]
}
}
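With the budgets in place, actual-versus-limit burn per team can be pulled straight from the Budgets API, e.g. for a weekly digest. A sketch assuming the `<team>-monthly-budget` naming from the Terraform above:
# scripts/budget_report.py
import boto3

budgets = boto3.client("budgets")
account_id = boto3.client("sts").get_caller_identity()["Account"]

for team in ("platform", "data", "ml"):
    budget = budgets.describe_budget(
        AccountId=account_id,
        BudgetName=f"{team}-monthly-budget",
    )["Budget"]
    limit = float(budget["BudgetLimit"]["Amount"])
    actual = float(budget["CalculatedSpend"]["ActualSpend"]["Amount"])
    print(f"{team}: ${actual:,.0f} of ${limit:,.0f} ({actual / limit:.0%})")
Cost Allocation Tags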
# tagging/cost-allocation.tf
# Activate cost allocation tags
resource "aws_ce_cost_allocation_tag" "team" {
tag_key = "Team"
status = "Active"
}
resource "aws_ce_cost_allocation_tag" "environment" {
tag_key = "Environment"
status = "Active"
}
resource "aws_ce_cost_allocation_tag" "project" {
tag_key = "Project"
status = "Active"
}
resource "aws_ce_cost_allocation_tag" "cost_center" {
tag_key = "CostCenter"
status = "Active"
}
# Tag policy for Organizations
resource "aws_organizations_policy" "tagging_policy" {
name = "RequiredTagsPolicy"
description = "Enforce required cost allocation tags"
type = "TAG_POLICY"
content = jsonencode({
tags = {
Team = {
tag_key = {
"@@assign" = "Team"
}
enforced_for = {
"@@assign" = [
"ec2:instance",
"ec2:volume",
"rds:db",
"s3:bucket",
"dynamodb:table",
"lambda:function"
]
}
}
Environment = {
tag_key = {
"@@assign" = "Environment"
}
tag_value = {
"@@assign" = ["production", "staging", "development", "sandbox"]
}
enforced_for = {
"@@assign" = [
"ec2:*",
"rds:*",
"s3:*"
]
}
}
CostCenter = {
tag_key = {
"@@assign" = "CostCenter"
}
enforced_for = {
"@@assign" = ["*"]
}
}
}
})
}
resource "aws_organizations_policy_attachment" "tagging_workloads" {
policy_id = aws_organizations_policy.tagging_policy.id
target_id = aws_organizations_organizational_unit.workloads.id
}
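Tag policies block non-compliant tagging operations, but they won't report resources that are already untagged. A small detective-control sketch that flags running EC2 instances missing any required tag; it covers one service, and the Resource Groups Tagging API can generalize it across services:
# scripts/tag_audit.py
import boto3

REQUIRED_TAGS = {"Team", "Environment", "CostCenter"}
ec2 = boto3.client("ec2")

paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"] for t in instance.get("Tags", [])}
            missing = REQUIRED_TAGS - tags
            if missing:
                print(instance["InstanceId"], "missing:", ", ".join(sorted(missing)))
Cost Optimization Automation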
# lambda/cost-optimization-recommendations.py
import json
import os
from datetime import datetime, timedelta, timezone

import boto3

ce_client = boto3.client('ce')
ec2_client = boto3.client('ec2')
rds_client = boto3.client('rds')

def lambda_handler(event, context):
    """Generate cost optimization recommendations"""
    recommendations = []
    # Find idle EC2 instances
    recommendations.extend(find_idle_ec2_instances())
    # Find unattached EBS volumes
    recommendations.extend(find_unattached_volumes())
    # Find old snapshots
    recommendations.extend(find_old_snapshots())
    # Find underutilized RDS instances
    recommendations.extend(find_underutilized_rds())
    # Calculate total potential savings
    total_savings = sum(r['monthly_savings'] for r in recommendations)
    # Send report
    send_recommendations_report(recommendations, total_savings)
    return {
        'statusCode': 200,
        'body': json.dumps({
            'recommendations_count': len(recommendations),
            'potential_monthly_savings': total_savings
        })
    }

def find_idle_ec2_instances():
    """Find EC2 instances with low CPU utilization"""
    cloudwatch = boto3.client('cloudwatch')
    recommendations = []
    instances = ec2_client.describe_instances(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']
            # Get CPU utilization for last 7 days
            metrics = cloudwatch.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                StartTime=datetime.now(timezone.utc) - timedelta(days=7),
                EndTime=datetime.now(timezone.utc),
                Period=3600,
                Statistics=['Average']
            )
            if metrics['Datapoints']:
                avg_cpu = sum(d['Average'] for d in metrics['Datapoints']) / len(metrics['Datapoints'])
                if avg_cpu < 5:  # Less than 5% average CPU
                    # Calculate cost
                    instance_type = instance['InstanceType']
                    monthly_cost = get_instance_cost(instance_type)
                    recommendations.append({
                        'type': 'idle_ec2',
                        'resource_id': instance_id,
                        'instance_type': instance_type,
                        'avg_cpu': round(avg_cpu, 2),
                        'monthly_savings': monthly_cost,
                        'recommendation': 'Stop or terminate idle instance',
                        'priority': 'high'
                    })
    return recommendations

def find_unattached_volumes():
    """Find EBS volumes not attached to any instance"""
    recommendations = []
    volumes = ec2_client.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )
    for volume in volumes['Volumes']:
        volume_id = volume['VolumeId']
        size_gb = volume['Size']
        volume_type = volume['VolumeType']
        # Calculate monthly cost (rough estimate)
        cost_per_gb = {'gp3': 0.08, 'gp2': 0.10, 'io1': 0.125, 'io2': 0.125}
        monthly_cost = size_gb * cost_per_gb.get(volume_type, 0.10)
        recommendations.append({
            'type': 'unattached_volume',
            'resource_id': volume_id,
            'size_gb': size_gb,
            'volume_type': volume_type,
            'monthly_savings': monthly_cost,
            'recommendation': 'Delete unused volume or create snapshot',
            'priority': 'medium'
        })
    return recommendations

def find_old_snapshots(days=90):
    """Find snapshots older than `days` (simplified cost estimate)"""
    recommendations = []
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    snapshots = ec2_client.describe_snapshots(OwnerIds=['self'])
    for snapshot in snapshots['Snapshots']:
        if snapshot['StartTime'] < cutoff:
            recommendations.append({
                'type': 'old_snapshot',
                'resource_id': snapshot['SnapshotId'],
                'monthly_savings': snapshot['VolumeSize'] * 0.05,  # ~$0.05/GB-month
                'recommendation': 'Delete or archive old snapshot',
                'priority': 'low'
            })
    return recommendations

def find_underutilized_rds():
    """Find RDS instances with low average CPU over the last 7 days"""
    cloudwatch = boto3.client('cloudwatch')
    recommendations = []
    for db in rds_client.describe_db_instances()['DBInstances']:
        metrics = cloudwatch.get_metric_statistics(
            Namespace='AWS/RDS',
            MetricName='CPUUtilization',
            Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': db['DBInstanceIdentifier']}],
            StartTime=datetime.now(timezone.utc) - timedelta(days=7),
            EndTime=datetime.now(timezone.utc),
            Period=3600,
            Statistics=['Average']
        )
        datapoints = metrics['Datapoints']
        if datapoints and sum(d['Average'] for d in datapoints) / len(datapoints) < 10:
            recommendations.append({
                'type': 'underutilized_rds',
                'resource_id': db['DBInstanceIdentifier'],
                'monthly_savings': 0,  # requires a per-class pricing lookup
                'recommendation': 'Downsize the instance class',
                'priority': 'medium'
            })
    return recommendations

def get_instance_cost(instance_type):
    """Rough on-demand monthly cost; use the Pricing API for accuracy"""
    hourly = {'t3.medium': 0.0416, 'm5.large': 0.096, 'm5.xlarge': 0.192,
              'c5.2xlarge': 0.34, 'r5.xlarge': 0.252}
    return hourly.get(instance_type, 0.10) * 730

def send_recommendations_report(recommendations, total_savings):
    """Publish the recommendations summary to SNS"""
    sns = boto3.client('sns')
    message = f"""
Cost Optimization Recommendations
Total Potential Monthly Savings: ${total_savings:,.2f}
Recommendations: {len(recommendations)}
- High Priority: {len([r for r in recommendations if r['priority'] == 'high'])}
- Medium Priority: {len([r for r in recommendations if r['priority'] == 'medium'])}
- Low Priority: {len([r for r in recommendations if r['priority'] == 'low'])}
View detailed report: https://quicksight.aws.amazon.com/cost-optimization
"""
    sns.publish(
        TopicArn=os.environ['SNS_TOPIC_ARN'],
        Subject='Weekly Cost Optimization Recommendations',
        Message=message
    )
Chargeback Dashboard
-- athena/queries/team-chargeback.sql
-- Monthly cost per team
CREATE OR REPLACE VIEW team_monthly_costs AS
SELECT
bill_payer_account_id,
line_item_usage_account_id as account_id,
resource_tags_user_team as team,
DATE_TRUNC('month', line_item_usage_start_date) as month,
line_item_product_code as service,
SUM(line_item_unblended_cost) as total_cost,
SUM(CASE WHEN line_item_line_item_type = 'Usage' THEN line_item_unblended_cost ELSE 0 END) as usage_cost,
SUM(CASE WHEN line_item_line_item_type = 'SavingsPlanCoveredUsage' THEN line_item_unblended_cost ELSE 0 END) as savings_plan_cost
FROM
cur_database.cost_and_usage_report
WHERE
line_item_line_item_type IN ('Usage', 'SavingsPlanCoveredUsage', 'DiscountedUsage')
AND resource_tags_user_team IS NOT NULL
GROUP BY
1, 2, 3, 4, 5;
-- Top 10 cost drivers per team
CREATE OR REPLACE VIEW team_top_costs AS
WITH ranked_costs AS (
SELECT
team,
month,
service,
total_cost,
ROW_NUMBER() OVER (PARTITION BY team, month ORDER BY total_cost DESC) as rank
FROM team_monthly_costs
)
SELECT *
FROM ranked_costs
WHERE rank <= 10;
-- Month-over-month cost change
CREATE OR REPLACE VIEW team_cost_trends AS
SELECT
curr.team,
curr.month as current_month,
curr.total_cost as current_cost,
prev.total_cost as previous_cost,
curr.total_cost - prev.total_cost as cost_change,
ROUND(((curr.total_cost - prev.total_cost) / NULLIF(prev.total_cost, 0)) * 100, 2) as percent_change
FROM team_monthly_costs curr
LEFT JOIN team_monthly_costs prev
ON curr.team = prev.team
AND curr.month = DATE_ADD('month', 1, prev.month)
WHERE curr.month = DATE_TRUNC('month', CURRENT_DATE);Cost Governance Policies
flowchart TD
subgraph Preventive["Preventive Controls"]
SCP[Service Control Policies]
QUOTA[Service Quotas]
BUDGET_ACTION[Budget Actions]
end
subgraph Detective["Detective Controls"]
ANOMALY[Anomaly Detection]
TAGGING[Tag Compliance]
UNUSED[Unused Resource Detection]
end
subgraph Corrective["Corrective Actions"]
AUTO_STOP[Auto-stop resources]
ALERT[Alert owners]
TICKET[Create remediation ticket]
end
Preventive --> Detective
Detective --> Corrective
style Preventive fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
style Detective fill:#f77f00,stroke:#fff,stroke-width:2px,color:#fff
style Corrective fill:#e63946,stroke:#fff,stroke-width:2px,color:#fff
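The corrective "auto-stop" control in the diagram can be a scheduled Lambda that stops instances explicitly opted in. A conservative sketch; the `AutoStop` tag convention is an assumption, and opt-in is safer than opt-out for anything stateful:
# scripts/auto_stop.py
import boto3

ec2 = boto3.client("ec2")

response = ec2.describe_instances(
    Filters=[
        {"Name": "instance-state-name", "Values": ["running"]},
        {"Name": "tag:AutoStop", "Values": ["true"]},
    ]
)
instance_ids = [
    instance["InstanceId"]
    for reservation in response["Reservations"]
    for instance in reservation["Instances"]
]
if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
    print("Stopped:", ", ".join(instance_ids))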
Results: 30% Cost Reduction
flowchart LR
subgraph Before["Before FinOps (Monthly)"]
B_COMPUTE["Compute: $80K<br/>Over-provisioned"]
B_STORAGE["Storage: $35K<br/>Unused volumes"]
B_DATA["Data Transfer: $15K<br/>Unoptimized"]
B_OTHER["Other: $20K"]
B_TOTAL["Total: $150K/month"]
end
subgraph After["After FinOps (Monthly)"]
A_COMPUTE["Compute: $52K<br/>Right-sized + Spot"]
A_STORAGE["Storage: $22K<br/>Cleaned up"]
A_DATA["Data Transfer: $10K<br/>Optimized"]
A_OTHER["Other: $21K"]
A_TOTAL["Total: $105K/month"]
end
Before ==> After
subgraph Savings["Annual Savings"]
COMPUTE_SAVE["Compute: $336K"]
STORAGE_SAVE["Storage: $156K"]
DATA_SAVE["Data: $60K"]
TOTAL_SAVE["Total: $540K/year<br/>30% reduction"]
end
After ==> Savings
style Before fill:#e63946,stroke:#fff,stroke-width:2px,color:#fff
style After fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
style Savings fill:#ffbe0b,stroke:#fff,stroke-width:2px,color:#000
Best Practices
| Practice | Implementation | Impact |
|---|---|---|
| Tag everything | Enforce tag policies | 100% cost visibility |
| Right-size resources | Weekly recommendations | 20-30% savings |
| Use Savings Plans | Automated purchase | 40-70% discount |
| Delete unused resources | Automated cleanup | 10-15% savings |
| Monitor anomalies | ML-based detection | Catch issues early |
| Implement chargeback | Per-team dashboards | Accountability |
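For the "Use Savings Plans" row, automation should still start from AWS's own recommendation data rather than a fixed commitment. A sketch pulling Compute Savings Plans recommendations from Cost Explorer; review the output before any purchase:
# scripts/sp_recommendations.py
import boto3

ce = boto3.client("ce")

response = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
    LookbackPeriodInDays="THIRTY_DAYS",
)
recommendation = response["SavingsPlansPurchaseRecommendation"]
for detail in recommendation.get("SavingsPlansPurchaseRecommendationDetails", []):
    print(
        f"commit ${detail['HourlyCommitmentToPurchase']}/hr, "
        f"est. savings ${detail['EstimatedMonthlySavingsAmount']}/month"
    )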
Troubleshooting
"CUR data not appearing in Athena"
# Check CUR delivery
aws cur describe-report-definitions
# Verify S3 bucket
aws s3 ls s3://company-cost-usage-reports/cur/
# Check Glue crawler
aws glue get-crawler --name cur-crawler
"Budget notifications not working"
- Verify SNS topic subscriptions confirmed
- Check budget threshold configuration
- Ensure cost allocation tags are active
"QuickSight dashboard errors"
- Refresh SPICE datasets
- Check Athena query permissions
- Verify data source connections
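For the SPICE refresh step, an ingestion can also be triggered via the API. A sketch; the dataset ID is whatever the CID deployment created (the value shown is hypothetical):
# scripts/refresh_spice.py
import uuid

import boto3

qs = boto3.client("quicksight")
account_id = boto3.client("sts").get_caller_identity()["Account"]

qs.create_ingestion(
    AwsAccountId=account_id,
    DataSetId="cid-summary-dataset",  # hypothetical; list_data_sets shows real IDs
    IngestionId=str(uuid.uuid4()),
)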
Conclusion
Building a FinOps framework transforms cloud cost management from reactive firefighting to proactive optimization. The combination of:
- Cost & Usage Reports for detailed cost data
- Cost Intelligence Dashboard for executive visibility
- Anomaly Detection for early issue identification
- Budgets & Alerts for proactive governance
- Chargeback mechanisms for team accountability
delivered a 30% cost reduction ($540K annually) while creating a culture of cost awareness. The key is making cost data visible, actionable, and tied to team ownership.