Introduction
Cloud costs can quickly spiral out of control without proper governance. I inherited a multi-account AWS environment spending nearly $2M annually (about $150K/month) with no cost visibility, no accountability, and monthly billing surprises. I built a FinOps framework that brought costs under control and changed the culture around cloud spending.
The transformation delivered:
- 30% cost reduction ($540K annual savings)
- Predictable monthly spend with 95% forecast accuracy
- Per-team chargeback creating cost accountability
- Automated anomaly detection catching issues in hours, not weeks
Architecture Overview
flowchart TB
subgraph Sources["Cost Data Sources"]
CUR[Cost & Usage Report]
ORG[AWS Organizations]
TAGS[Resource Tags]
end
subgraph Processing["Data Processing"]
ATHENA[Athena Queries]
GLUE[Glue ETL Jobs]
LAMBDA[Lambda Processors]
end
subgraph Analysis["Cost Analysis"]
CID[Cost Intelligence Dashboard]
ANOMALY[Cost Anomaly Detection]
BUDGETS[AWS Budgets]
FORECAST[Cost Forecasting]
end
subgraph Reporting["Reporting & Actions"]
QS[QuickSight Dashboards]
SNS[SNS Notifications]
SLACK[Slack Integration]
TICKETS[Automated Tickets]
end
subgraph Governance["Governance"]
POLICIES[Cost Policies]
QUOTAS[Service Quotas]
TAGGING[Tagging Standards]
end
Sources --> Processing
Processing --> Analysis
Analysis --> Reporting
Governance --> Sources
style Sources fill:#1a1a2e,stroke:#00d9ff,stroke-width:2px,color:#fff
style Processing fill:#264653,stroke:#2a9d8f,stroke-width:2px,color:#fff
style Analysis fill:#f77f00,stroke:#fff,stroke-width:2px,color:#fff
style Reporting fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
style Governance fill:#9b5de5,stroke:#fff,stroke-width:2px,color:#fff
Cost & Usage Report Setup
# cost-reporting/cur.tf
resource "aws_cur_report_definition" "enterprise" {
report_name = "enterprise-cost-usage-report"
time_unit = "HOURLY"
format = "Parquet"
compression = "Parquet"
additional_schema_elements = ["RESOURCES", "SPLIT_COST_ALLOCATION_DATA"]
s3_bucket = aws_s3_bucket.cur.id
s3_region = "us-east-1"
s3_prefix = "cur"
additional_artifacts = ["ATHENA"]
report_versioning = "OVERWRITE_REPORT"
refresh_closed_reports = true
}
resource "aws_s3_bucket" "cur" {
bucket = "company-cost-usage-reports"
tags = {
Purpose = "Cost & Usage Reports"
Compliance = "required"
}
}
resource "aws_s3_bucket_policy" "cur" {
bucket = aws_s3_bucket.cur.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "AllowCURDelivery"
Effect = "Allow"
Principal = {
Service = "billingreports.amazonaws.com"
}
Action = [
"s3:GetBucketAcl",
"s3:GetBucketPolicy"
]
Resource = aws_s3_bucket.cur.arn
},
{
Sid = "AllowCURWrite"
Effect = "Allow"
Principal = {
Service = "billingreports.amazonaws.com"
}
Action = "s3:PutObject"
Resource = "${aws_s3_bucket.cur.arn}/*"
}
]
})
}
# Athena setup for CUR queries
resource "aws_athena_workgroup" "cur" {
name = "cur-analysis"
configuration {
enforce_workgroup_configuration = true
publish_cloudwatch_metrics_enabled = true
result_configuration {
output_location = "s3://${aws_s3_bucket.athena_results.bucket}/output/"
encryption_configuration {
encryption_option = "SSE_S3"
}
}
engine_version {
selected_engine_version = "Athena engine version 3"
}
}
tags = {
Team = "finops"
}
}
resource "aws_glue_catalog_database" "cur" {
name = "cur_database"
description = "Cost and Usage Report database"
}
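Before moving on, it's worth confirming the pipeline works end to end. Enabling the ATHENA artifact makes AWS deliver a CloudFormation template alongside the report that creates the Glue crawler and table; once that's deployed and the first report lands (initial delivery can take up to 24 hours), a quick boto3 sketch can verify the data is queryable. This assumes the database, table, and workgroup names from the Terraform above and the standard year/month partitions; the dates are placeholders.
# scripts/validate_cur.py
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

QUERY = """
SELECT line_item_product_code AS service,
       SUM(line_item_unblended_cost) AS cost
FROM cur_database.cost_and_usage_report
WHERE year = '2024' AND month = '6'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
"""

def run_query(sql):
    # the workgroup enforces its own output location, so none is passed here
    qid = athena.start_query_execution(
        QueryString=sql,
        WorkGroup="cur-analysis",
    )["QueryExecutionId"]
    state = "QUEUED"
    while state in ("QUEUED", "RUNNING"):
        time.sleep(2)
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state != "SUCCEEDED":
        raise RuntimeError(f"query {qid} finished in state {state}")
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]

if __name__ == "__main__":
    for row in run_query(QUERY):
        print([col.get("VarCharValue") for col in row["Data"]])
Cost Intelligence Dashboard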
# quicksight/cid.tf
# Deploy Cost Intelligence Dashboard using CloudFormation
resource "aws_cloudformation_stack" "cid" {
name = "cost-intelligence-dashboard"
template_url = "https://aws-well-architected-labs.s3.amazonaws.com/Cost/Labs/400_Cost_Intelligence_Dashboard/cid-cfn.yaml"
parameters = {
QuickSightUserName = var.quicksight_admin_user
CURBucket = aws_s3_bucket.cur.id
CURDatabaseName = aws_glue_catalog_database.cur.name
CURTableName = "cost_and_usage_report"
OptimizationDataCollectionAccountID = var.management_account_id
}
capabilities = ["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"]
tags = {
Dashboard = "CID"
Team = "finops"
}
}
# QuickSight data source
resource "aws_quicksight_data_source" "athena_cur" {
data_source_id = "athena-cur"
name = "Athena CUR Data Source"
type = "ATHENA"
parameters {
athena {
work_group = aws_athena_workgroup.cur.id
}
}
ssl_properties {
disable_ssl = false
}
aws_account_id = data.aws_caller_identity.current.account_id
permission {
principal = aws_quicksight_group.finops.arn
actions = [
"quicksight:DescribeDataSource",
"quicksight:DescribeDataSourcePermissions",
"quicksight:PassDataSource",
"quicksight:UpdateDataSource",
"quicksight:UpdateDataSourcePermissions"
]
}
}
# Custom analysis for executive dashboard
resource "aws_quicksight_analysis" "executive_summary" {
analysis_id = "executive-cost-summary"
name = "Executive Cost Summary"
source_entity {
source_template {
arn = aws_quicksight_template.executive.arn
data_set_references {
data_set_arn = aws_quicksight_data_set.monthly_costs.arn
data_set_placeholder = "monthlycosts"
}
}
}
aws_account_id = data.aws_caller_identity.current.account_id
permissions {
principal = aws_quicksight_group.executives.arn
actions = [
"quicksight:RestoreAnalysis",
"quicksight:UpdateAnalysisPermissions",
"quicksight:DeleteAnalysis",
"quicksight:DescribeAnalysisPermissions",
"quicksight:QueryAnalysis",
"quicksight:DescribeAnalysis",
"quicksight:UpdateAnalysis"
]
}
}
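The CID stack can take several minutes to create its QuickSight assets. A quick deployment check, sketched with boto3; the stack name matches the Terraform above, while dashboard IDs vary by CID version.
# scripts/check_cid.py
import boto3

cfn = boto3.client("cloudformation")
qs = boto3.client("quicksight")
account_id = boto3.client("sts").get_caller_identity()["Account"]

stack = cfn.describe_stacks(StackName="cost-intelligence-dashboard")["Stacks"][0]
print("stack status:", stack["StackStatus"])

for dashboard in qs.list_dashboards(AwsAccountId=account_id)["DashboardSummaryList"]:
    print(dashboard["DashboardId"], "-", dashboard["Name"])
Cost Anomaly Detection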
flowchart TD
subgraph Detection["Anomaly Detection Flow"]
COLLECT[Collect hourly cost data]
ML[ML Model analyzes patterns]
DETECT[Detect anomalies]
COLLECT --> ML
ML --> DETECT
end
subgraph Evaluation["Anomaly Evaluation"]
THRESHOLD{Cost increase > threshold?}
CONTEXT[Evaluate context]
CLASSIFY[Classify severity]
DETECT --> THRESHOLD
THRESHOLD -->|Yes| CONTEXT
CONTEXT --> CLASSIFY
end
subgraph Response["Response Actions"]
ALERT_LOW[Low: Email notification]
ALERT_MED[Medium: Slack + Email]
ALERT_HIGH[High: PagerDuty + Slack]
TICKET[Create Jira ticket]
CLASSIFY --> ALERT_LOW
CLASSIFY --> ALERT_MED
CLASSIFY --> ALERT_HIGH
ALERT_HIGH --> TICKET
end
style Detection fill:#1a1a2e,stroke:#00d9ff,stroke-width:2px,color:#fff
style Evaluation fill:#f77f00,stroke:#fff,stroke-width:2px,color:#fff
style Response fill:#e63946,stroke:#fff,stroke-width:2px,color:#fff
# cost-anomaly/main.tf
resource "aws_ce_anomaly_monitor" "service_monitor" {
name = "service-cost-monitor"
monitor_type = "DIMENSIONAL"
monitor_dimension = "SERVICE"
tags = {
Team = "finops"
}
}
resource "aws_ce_anomaly_monitor" "account_monitor" {
name = "account-cost-monitor"
monitor_type = "DIMENSIONAL"
monitor_dimension = "LINKED_ACCOUNT"
}
# High impact anomalies
resource "aws_ce_anomaly_subscription" "high_impact" {
name = "high-impact-anomalies"
frequency = "IMMEDIATE"
monitor_arn_list = [
aws_ce_anomaly_monitor.service_monitor.arn,
aws_ce_anomaly_monitor.account_monitor.arn,
]
subscriber {
type = "SNS"
address = aws_sns_topic.cost_alerts.arn
}
threshold_expression {
  # a single condition goes directly here; Cost Explorer rejects
  # and/or expressions with fewer than two operands
  dimension {
    key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
    values        = ["500"]
    match_options = ["GREATER_THAN_OR_EQUAL"]
  }
}
tags = {
Severity = "high"
}
}
# Daily summary of all anomalies
resource "aws_ce_anomaly_subscription" "daily_summary" {
name = "daily-anomaly-summary"
frequency = "DAILY"
monitor_arn_list = [
aws_ce_anomaly_monitor.service_monitor.arn,
aws_ce_anomaly_monitor.account_monitor.arn,
]
subscriber {
type = "EMAIL"
address = "finops-team@company.com"
}
threshold_expression {
  dimension {
    key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
    values        = ["100"]
    match_options = ["GREATER_THAN_OR_EQUAL"]
  }
}
}
# Lambda to process anomalies and create tickets
resource "aws_lambda_function" "anomaly_processor" {
filename = "anomaly_processor.zip"
function_name = "cost-anomaly-processor"
role = aws_iam_role.anomaly_processor.arn
handler = "index.handler"
runtime = "python3.11"
timeout = 60
environment {
variables = {
SLACK_WEBHOOK = var.slack_webhook_url
JIRA_API_URL = var.jira_api_url
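# sensitive value; prefer Secrets Manager or SSM Parameter Store over env vars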
JIRA_API_TOKEN = var.jira_api_token
SEVERITY_THRESHOLD = "500"
}
}
}
resource "aws_sns_topic_subscription" "anomaly_to_lambda" {
topic_arn = aws_sns_topic.cost_alerts.arn
protocol = "lambda"
endpoint = aws_lambda_function.anomaly_processor.arn
}Budget Management
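The Terraform above packages `anomaly_processor.zip` without showing its contents. A minimal sketch of what `index.handler` could look like follows; the anomaly payload fields, the Slack message, and the Jira request are illustrative (verify them against a real SNS event and your Jira deployment), and the FINOPS project key is hypothetical.
# lambda/anomaly_processor/index.py
import json
import os
import urllib.request

def handler(event, context):
    for record in event["Records"]:
        # Cost Anomaly Detection delivers a JSON document via SNS;
        # field names here should be checked against a sample event
        anomaly = json.loads(record["Sns"]["Message"])
        impact = float(anomaly.get("impact", {}).get("totalImpact", 0))
        notify_slack(f"Cost anomaly detected: ~${impact:,.2f} impact")
        if impact >= float(os.environ.get("SEVERITY_THRESHOLD", "500")):
            create_jira_ticket(anomaly, impact)

def notify_slack(text):
    req = urllib.request.Request(
        os.environ["SLACK_WEBHOOK"],
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

def create_jira_ticket(anomaly, impact):
    # Bearer auth suits Jira Server/DC personal access tokens;
    # Jira Cloud normally uses Basic auth with an API token
    body = {
        "fields": {
            "project": {"key": "FINOPS"},  # hypothetical project key
            "issuetype": {"name": "Task"},
            "summary": f"Cost anomaly: ${impact:,.2f}",
            "description": json.dumps(anomaly, indent=2),
        }
    }
    req = urllib.request.Request(
        f"{os.environ['JIRA_API_URL']}/rest/api/2/issue",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['JIRA_API_TOKEN']}",
        },
    )
    urllib.request.urlopen(req)
Budget Management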
# budgets/hierarchical.tf
locals {
teams = {
platform = {
monthly_budget = 15000
contacts = ["platform-leads@company.com"]
services = ["EC2", "EKS", "RDS"]
}
data = {
monthly_budget = 25000
contacts = ["data-leads@company.com"]
services = ["EMR", "Glue", "Athena", "S3"]
}
ml = {
monthly_budget = 30000
contacts = ["ml-leads@company.com"]
services = ["SageMaker", "Bedrock", "EC2"]
}
}
}
# Team-level budgets
resource "aws_budgets_budget" "team_budgets" {
for_each = local.teams
name = "${each.key}-monthly-budget"
budget_type = "COST"
limit_amount = each.value.monthly_budget
limit_unit = "USD"
time_unit = "MONTHLY"
cost_filter {
  name = "TagKeyValue"
  # format() sidesteps HCL's $${} escape sequence: the API expects the
  # literal string "user:Team$<value>"
  values = [format("user:Team$%s", each.key)]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = each.value.contacts
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_email_addresses = concat(
each.value.contacts,
["cfo@company.com"]
)
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 120
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = concat(
each.value.contacts,
["cfo@company.com", "cto@company.com"]
)
subscriber_sns_topic_arns = [aws_sns_topic.budget_breach.arn]
}
}
# Organization-level budget
resource "aws_budgets_budget" "organizational" {
name = "organizational-monthly-budget"
budget_type = "COST"
limit_amount = "150000"
limit_unit = "USD"
time_unit = "MONTHLY"
notification {
comparison_operator = "GREATER_THAN"
threshold = 90
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_email_addresses = [
"cfo@company.com",
"cto@company.com"
]
}
}
# Service-specific budgets for high-cost services
resource "aws_budgets_budget" "ec2_compute" {
name = "ec2-compute-budget"
budget_type = "COST"
limit_amount = "50000"
limit_unit = "USD"
time_unit = "MONTHLY"
cost_filter {
name = "Service"
values = ["Amazon Elastic Compute Cloud - Compute"]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 85
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts.arn]
}
}
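With the budgets in place, actual-versus-limit burn per team can be pulled straight from the Budgets API, e.g. for a weekly digest. A sketch assuming the `<team>-monthly-budget` naming from the Terraform above:
# scripts/budget_report.py
import boto3

budgets = boto3.client("budgets")
account_id = boto3.client("sts").get_caller_identity()["Account"]

for team in ("platform", "data", "ml"):
    budget = budgets.describe_budget(
        AccountId=account_id,
        BudgetName=f"{team}-monthly-budget",
    )["Budget"]
    limit = float(budget["BudgetLimit"]["Amount"])
    actual = float(budget["CalculatedSpend"]["ActualSpend"]["Amount"])
    print(f"{team}: ${actual:,.0f} of ${limit:,.0f} ({actual / limit:.0%})")
Cost Allocation Tags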
# tagging/cost-allocation.tf
# Activate cost allocation tags
resource "aws_ce_cost_allocation_tag" "team" {
tag_key = "Team"
status = "Active"
}
resource "aws_ce_cost_allocation_tag" "environment" {
tag_key = "Environment"
status = "Active"
}
resource "aws_ce_cost_allocation_tag" "project" {
tag_key = "Project"
status = "Active"
}
resource "aws_ce_cost_allocation_tag" "cost_center" {
tag_key = "CostCenter"
status = "Active"
}
# Tag policy for Organizations
resource "aws_organizations_policy" "tagging_policy" {
name = "RequiredTagsPolicy"
description = "Enforce required cost allocation tags"
type = "TAG_POLICY"
content = jsonencode({
tags = {
Team = {
tag_key = {
"@@assign" = "Team"
}
enforced_for = {
"@@assign" = [
"ec2:instance",
"ec2:volume",
"rds:db",
"s3:bucket",
"dynamodb:table",
"lambda:function"
]
}
}
Environment = {
tag_key = {
"@@assign" = "Environment"
}
tag_value = {
"@@assign" = ["production", "staging", "development", "sandbox"]
}
enforced_for = {
"@@assign" = [
"ec2:*",
"rds:*",
"s3:*"
]
}
}
CostCenter = {
tag_key = {
"@@assign" = "CostCenter"
}
enforced_for = {
"@@assign" = ["*"]
}
}
}
})
}
resource "aws_organizations_policy_attachment" "tagging_workloads" {
policy_id = aws_organizations_policy.tagging_policy.id
target_id = aws_organizations_organizational_unit.workloads.id
}
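Tag policies block non-compliant tagging operations, but they won't report resources that are already untagged. A small detective-control sketch that flags running EC2 instances missing any required tag; it covers one service, and the Resource Groups Tagging API can generalize it across services:
# scripts/tag_audit.py
import boto3

REQUIRED_TAGS = {"Team", "Environment", "CostCenter"}
ec2 = boto3.client("ec2")

paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"] for t in instance.get("Tags", [])}
            missing = REQUIRED_TAGS - tags
            if missing:
                print(instance["InstanceId"], "missing:", ", ".join(sorted(missing)))
Cost Optimization Automation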
# lambda/cost-optimization-recommendations.py
import json
import os
from datetime import datetime, timedelta, timezone

import boto3

ce_client = boto3.client('ce')
ec2_client = boto3.client('ec2')
rds_client = boto3.client('rds')

def lambda_handler(event, context):
    """Generate cost optimization recommendations"""
    recommendations = []
    # Find idle EC2 instances
    recommendations.extend(find_idle_ec2_instances())
    # Find unattached EBS volumes
    recommendations.extend(find_unattached_volumes())
    # Find old snapshots
    recommendations.extend(find_old_snapshots())
    # Find underutilized RDS instances
    recommendations.extend(find_underutilized_rds())
    # Calculate total potential savings
    total_savings = sum(r['monthly_savings'] for r in recommendations)
    # Send report
    send_recommendations_report(recommendations, total_savings)
    return {
        'statusCode': 200,
        'body': json.dumps({
            'recommendations_count': len(recommendations),
            'potential_monthly_savings': total_savings
        })
    }

def find_idle_ec2_instances():
    """Find EC2 instances with low CPU utilization"""
    cloudwatch = boto3.client('cloudwatch')
    recommendations = []
    instances = ec2_client.describe_instances(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']
            # Get CPU utilization for last 7 days
            metrics = cloudwatch.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                StartTime=datetime.now(timezone.utc) - timedelta(days=7),
                EndTime=datetime.now(timezone.utc),
                Period=3600,
                Statistics=['Average']
            )
            if metrics['Datapoints']:
                avg_cpu = sum(d['Average'] for d in metrics['Datapoints']) / len(metrics['Datapoints'])
                if avg_cpu < 5:  # Less than 5% average CPU
                    # Calculate cost
                    instance_type = instance['InstanceType']
                    monthly_cost = get_instance_cost(instance_type)
                    recommendations.append({
                        'type': 'idle_ec2',
                        'resource_id': instance_id,
                        'instance_type': instance_type,
                        'avg_cpu': round(avg_cpu, 2),
                        'monthly_savings': monthly_cost,
                        'recommendation': 'Stop or terminate idle instance',
                        'priority': 'high'
                    })
    return recommendations

def find_unattached_volumes():
    """Find EBS volumes not attached to any instance"""
    recommendations = []
    volumes = ec2_client.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )
    for volume in volumes['Volumes']:
        volume_id = volume['VolumeId']
        size_gb = volume['Size']
        volume_type = volume['VolumeType']
        # Calculate monthly cost (rough estimate)
        cost_per_gb = {'gp3': 0.08, 'gp2': 0.10, 'io1': 0.125, 'io2': 0.125}
        monthly_cost = size_gb * cost_per_gb.get(volume_type, 0.10)
        recommendations.append({
            'type': 'unattached_volume',
            'resource_id': volume_id,
            'size_gb': size_gb,
            'volume_type': volume_type,
            'monthly_savings': monthly_cost,
            'recommendation': 'Delete unused volume or create snapshot',
            'priority': 'medium'
        })
    return recommendations

def find_old_snapshots(days=90):
    """Find snapshots older than `days` (simplified cost estimate)"""
    recommendations = []
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    snapshots = ec2_client.describe_snapshots(OwnerIds=['self'])
    for snapshot in snapshots['Snapshots']:
        if snapshot['StartTime'] < cutoff:
            recommendations.append({
                'type': 'old_snapshot',
                'resource_id': snapshot['SnapshotId'],
                'monthly_savings': snapshot['VolumeSize'] * 0.05,  # ~$0.05/GB-month
                'recommendation': 'Delete or archive old snapshot',
                'priority': 'low'
            })
    return recommendations

def find_underutilized_rds():
    """Find RDS instances with low average CPU over the last 7 days"""
    cloudwatch = boto3.client('cloudwatch')
    recommendations = []
    for db in rds_client.describe_db_instances()['DBInstances']:
        metrics = cloudwatch.get_metric_statistics(
            Namespace='AWS/RDS',
            MetricName='CPUUtilization',
            Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': db['DBInstanceIdentifier']}],
            StartTime=datetime.now(timezone.utc) - timedelta(days=7),
            EndTime=datetime.now(timezone.utc),
            Period=3600,
            Statistics=['Average']
        )
        datapoints = metrics['Datapoints']
        if datapoints and sum(d['Average'] for d in datapoints) / len(datapoints) < 10:
            recommendations.append({
                'type': 'underutilized_rds',
                'resource_id': db['DBInstanceIdentifier'],
                'monthly_savings': 0,  # requires a per-class pricing lookup
                'recommendation': 'Downsize the instance class',
                'priority': 'medium'
            })
    return recommendations

def get_instance_cost(instance_type):
    """Rough on-demand monthly cost; use the Pricing API for accuracy"""
    hourly = {'t3.medium': 0.0416, 'm5.large': 0.096, 'm5.xlarge': 0.192,
              'c5.2xlarge': 0.34, 'r5.xlarge': 0.252}
    return hourly.get(instance_type, 0.10) * 730

def send_recommendations_report(recommendations, total_savings):
    """Publish the recommendations summary to SNS"""
    sns = boto3.client('sns')
    message = f"""
Cost Optimization Recommendations
Total Potential Monthly Savings: ${total_savings:,.2f}
Recommendations: {len(recommendations)}
- High Priority: {len([r for r in recommendations if r['priority'] == 'high'])}
- Medium Priority: {len([r for r in recommendations if r['priority'] == 'medium'])}
- Low Priority: {len([r for r in recommendations if r['priority'] == 'low'])}
View detailed report: https://quicksight.aws.amazon.com/cost-optimization
"""
    sns.publish(
        TopicArn=os.environ['SNS_TOPIC_ARN'],
        Subject='Weekly Cost Optimization Recommendations',
        Message=message
    )
Chargeback Dashboard
-- athena/queries/team-chargeback.sql
-- Monthly cost per team
CREATE OR REPLACE VIEW team_monthly_costs AS
SELECT
bill_payer_account_id,
line_item_usage_account_id as account_id,
resource_tags_user_team as team,
DATE_TRUNC('month', line_item_usage_start_date) as month,
line_item_product_code as service,
SUM(line_item_unblended_cost) as total_cost,
SUM(CASE WHEN line_item_line_item_type = 'Usage' THEN line_item_unblended_cost ELSE 0 END) as usage_cost,
SUM(CASE WHEN line_item_line_item_type = 'SavingsPlanCoveredUsage' THEN line_item_unblended_cost ELSE 0 END) as savings_plan_cost
FROM
cur_database.cost_and_usage_report
WHERE
line_item_line_item_type IN ('Usage', 'SavingsPlanCoveredUsage', 'DiscountedUsage')
AND resource_tags_user_team IS NOT NULL
GROUP BY
1, 2, 3, 4, 5;
-- Top 10 cost drivers per team
CREATE OR REPLACE VIEW team_top_costs AS
WITH ranked_costs AS (
SELECT
team,
month,
service,
total_cost,
ROW_NUMBER() OVER (PARTITION BY team, month ORDER BY total_cost DESC) as rank
FROM team_monthly_costs
)
SELECT *
FROM ranked_costs
WHERE rank <= 10;
-- Month-over-month cost change
CREATE OR REPLACE VIEW team_cost_trends AS
SELECT
curr.team,
curr.month as current_month,
curr.total_cost as current_cost,
prev.total_cost as previous_cost,
curr.total_cost - prev.total_cost as cost_change,
ROUND(((curr.total_cost - prev.total_cost) / NULLIF(prev.total_cost, 0)) * 100, 2) as percent_change
FROM team_monthly_costs curr
LEFT JOIN team_monthly_costs prev
ON curr.team = prev.team
AND curr.month = DATE_ADD('month', 1, prev.month)
WHERE curr.month = DATE_TRUNC('month', CURRENT_DATE);Cost Governance Policies
flowchart TD
subgraph Preventive["Preventive Controls"]
SCP[Service Control Policies]
QUOTA[Service Quotas]
BUDGET_ACTION[Budget Actions]
end
subgraph Detective["Detective Controls"]
ANOMALY[Anomaly Detection]
TAGGING[Tag Compliance]
UNUSED[Unused Resource Detection]
end
subgraph Corrective["Corrective Actions"]
AUTO_STOP[Auto-stop resources]
ALERT[Alert owners]
TICKET[Create remediation ticket]
end
Preventive --> Detective
Detective --> Corrective
style Preventive fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
style Detective fill:#f77f00,stroke:#fff,stroke-width:2px,color:#fff
style Corrective fill:#e63946,stroke:#fff,stroke-width:2px,color:#fff
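The corrective "auto-stop" control in the diagram can be a scheduled Lambda that stops instances explicitly opted in. A conservative sketch; the `AutoStop` tag convention is an assumption, and opt-in is safer than opt-out for anything stateful:
# scripts/auto_stop.py
import boto3

ec2 = boto3.client("ec2")

response = ec2.describe_instances(
    Filters=[
        {"Name": "instance-state-name", "Values": ["running"]},
        {"Name": "tag:AutoStop", "Values": ["true"]},
    ]
)
instance_ids = [
    instance["InstanceId"]
    for reservation in response["Reservations"]
    for instance in reservation["Instances"]
]
if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
    print("Stopped:", ", ".join(instance_ids))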
Results: 30% Cost Reduction
flowchart LR
subgraph Before["Before FinOps (Monthly)"]
B_COMPUTE["Compute: $80K<br/>Over-provisioned"]
B_STORAGE["Storage: $35K<br/>Unused volumes"]
B_DATA["Data Transfer: $15K<br/>Unoptimized"]
B_OTHER["Other: $20K"]
B_TOTAL["Total: $150K/month"]
end
subgraph After["After FinOps (Monthly)"]
A_COMPUTE["Compute: $52K<br/>Right-sized + Spot"]
A_STORAGE["Storage: $22K<br/>Cleaned up"]
A_DATA["Data Transfer: $10K<br/>Optimized"]
A_OTHER["Other: $21K"]
A_TOTAL["Total: $105K/month"]
end
Before ==> After
subgraph Savings["Annual Savings"]
COMPUTE_SAVE["Compute: $336K"]
STORAGE_SAVE["Storage: $156K"]
DATA_SAVE["Data: $60K"]
TOTAL_SAVE["Total: $540K/year<br/>30% reduction"]
end
After ==> Savings
style Before fill:#e63946,stroke:#fff,stroke-width:2px,color:#fff
style After fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
style Savings fill:#ffbe0b,stroke:#fff,stroke-width:2px,color:#000
Best Practices
| Practice | Implementation | Impact |
|---|---|---|
| Tag everything | Enforce tag policies | 100% cost visibility |
| Right-size resources | Weekly recommendations | 20-30% savings |
| Use Savings Plans | Automated purchase | 40-70% discount |
| Delete unused resources | Automated cleanup | 10-15% savings |
| Monitor anomalies | ML-based detection | Catch issues early |
| Implement chargeback | Per-team dashboards | Accountability |
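For the "Use Savings Plans" row, automation should still start from AWS's own recommendation data rather than a fixed commitment. A sketch pulling Compute Savings Plans recommendations from Cost Explorer; review the output before any purchase:
# scripts/sp_recommendations.py
import boto3

ce = boto3.client("ce")

response = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
    LookbackPeriodInDays="THIRTY_DAYS",
)
recommendation = response["SavingsPlansPurchaseRecommendation"]
for detail in recommendation.get("SavingsPlansPurchaseRecommendationDetails", []):
    print(
        f"commit ${detail['HourlyCommitmentToPurchase']}/hr, "
        f"est. savings ${detail['EstimatedMonthlySavingsAmount']}/month"
    )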
Troubleshooting
"CUR data not appearing in Athena"
# Check CUR delivery
aws cur describe-report-definitions
# Verify S3 bucket
aws s3 ls s3://company-cost-usage-reports/cur/
# Check Glue crawler
aws glue get-crawler --name cur-crawler
"Budget notifications not working"
- Verify SNS topic subscriptions confirmed
- Check budget threshold configuration
- Ensure cost allocation tags are active
"QuickSight dashboard errors"
- Refresh SPICE datasets
- Check Athena query permissions
- Verify data source connections
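For the SPICE refresh step, an ingestion can also be triggered via the API. A sketch; the dataset ID is whatever the CID deployment created (the value shown is hypothetical):
# scripts/refresh_spice.py
import uuid

import boto3

qs = boto3.client("quicksight")
account_id = boto3.client("sts").get_caller_identity()["Account"]

qs.create_ingestion(
    AwsAccountId=account_id,
    DataSetId="cid-summary-dataset",  # hypothetical; list_data_sets shows real IDs
    IngestionId=str(uuid.uuid4()),
)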
Conclusion
Building a FinOps framework transforms cloud cost management from reactive firefighting to proactive optimization. The combination of:
- Cost & Usage Reports for detailed cost data
- Cost Intelligence Dashboard for executive visibility
- Anomaly Detection for early issue identification
- Budgets & Alerts for proactive governance
- Chargeback mechanisms for team accountability
delivered a 30% cost reduction ($540K annually) while creating a culture of cost awareness. The key is making cost data visible, actionable, and tied to team ownership.