Introduction
Managing cloud infrastructure at enterprise scale isn't just about deploying resources - it's about building a platform that enables teams to move fast while maintaining security, compliance, and cost control. I built a cloud operating model that transformed how 50+ engineering teams interact with AWS.
The challenge was creating a system that:
- Reduces operational overhead by 40% through automation
- Accelerates time-to-market with self-service infrastructure
- Enforces governance without blocking innovation
- Provides visibility across all accounts and workloads
Architecture Overview
flowchart TB
subgraph Management["Management Account"]
ORG[AWS Organizations]
CT[Control Tower]
SSO[IAM Identity Center]
end
subgraph OUs["Organizational Units"]
subgraph Security["Security OU"]
LOG[Log Archive]
AUDIT[Security Audit]
end
subgraph Infrastructure["Infrastructure OU"]
NETWORK[Network Hub]
SHARED[Shared Services]
end
subgraph Workloads["Workloads OU"]
PROD[Production Accounts]
STAGING[Staging Accounts]
DEV[Development Accounts]
end
end
subgraph Platform["Platform Services"]
SC[Service Catalog]
CFN[CloudFormation StackSets]
TRAIL[CloudTrail Lake]
CONFIG[AWS Config]
end
subgraph Governance["Governance & Compliance"]
SCP[Service Control Policies]
GUARD[Control Tower Guardrails]
BUDGET[Budget Controls]
end
ORG --> OUs
CT --> Platform
CT --> Governance
SSO --> OUs
Platform --> Workloads
Governance --> Workloads
style Management fill:#1a1a2e,stroke:#00d9ff,stroke-width:2px,color:#fff
style Security fill:#e63946,stroke:#fff,stroke-width:2px,color:#fff
style Infrastructure fill:#264653,stroke:#2a9d8f,stroke-width:2px,color:#fff
style Workloads fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
style Platform fill:#f77f00,stroke:#fff,stroke-width:2px,color:#fff
style Governance fill:#9b5de5,stroke:#fff,stroke-width:2px,color:#fff
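The diagram places Production, Staging, and Development accounts in the Workloads OU, while the account resources in the next section cover only the Security OU. As a minimal sketch of how the workload accounts might be declared in the same style: the account names, email addresses, and `CostCenter` values here are placeholders I picked for illustration, not values from a real organization.

```hcl
# Sketch only: names, emails, and CostCenter values are placeholders.
resource "aws_organizations_account" "development" {
  name      = "development"
  email     = "aws-development@company.com" # placeholder address
  parent_id = aws_organizations_organizational_unit.workloads.id

  tags = {
    Purpose    = "Development workloads"
    Managed    = "terraform"
    CostCenter = "engineering" # assumed value
  }
}

resource "aws_organizations_account" "production" {
  name      = "production"
  email     = "aws-production@company.com" # placeholder address
  parent_id = aws_organizations_organizational_unit.workloads.id

  # Guard against an accidental destroy of a live account
  lifecycle {
    prevent_destroy = true
  }

  tags = {
    Purpose    = "Production workloads"
    Managed    = "terraform"
    CostCenter = "engineering" # assumed value
  }
}
```

Each account needs its own unique email address, so a plus-addressing or distribution-list scheme is usually worth settling on before the account count grows.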
AWS Organizations Multi-Account Strategy
# organizations/main.tf
resource "aws_organizations_organization" "main" {
aws_service_access_principals = [
"cloudtrail.amazonaws.com",
"config.amazonaws.com",
"sso.amazonaws.com",
"controltower.amazonaws.com",
"stacksets.cloudformation.amazonaws.com",
]
enabled_policy_types = [
"SERVICE_CONTROL_POLICY",
"TAG_POLICY",
"BACKUP_POLICY",
]
feature_set = "ALL"
}
# Organizational Units
resource "aws_organizations_organizational_unit" "security" {
name = "Security"
parent_id = aws_organizations_organization.main.roots[0].id
}
resource "aws_organizations_organizational_unit" "infrastructure" {
name = "Infrastructure"
parent_id = aws_organizations_organization.main.roots[0].id
}
resource "aws_organizations_organizational_unit" "workloads" {
name = "Workloads"
parent_id = aws_organizations_organization.main.roots[0].id
}
# Security Accounts
resource "aws_organizations_account" "log_archive" {
name = "log-archive"
email = "aws-log-archive@company.com"
parent_id = aws_organizations_organizational_unit.security.id
tags = {
Purpose = "Centralized logging"
Managed = "terraform"
CostCenter = "security"
}
}
resource "aws_organizations_account" "security_audit" {
name = "security-audit"
email = "aws-security-audit@company.com"
parent_id = aws_organizations_organizational_unit.security.id
tags = {
Purpose = "Security monitoring"
Managed = "terraform"
CostCenter = "security"
}
}
Service Control Policies
flowchart TD
subgraph SCPHierarchy["SCP Inheritance Model"]
ROOT["Root<br/>Base restrictions"]
subgraph OULevel["OU Level"]
SEC_OU["Security OU<br/>+ Audit controls"]
INFRA_OU["Infrastructure OU<br/>+ Network controls"]
WORK_OU["Workloads OU<br/>+ Resource limits"]
end
subgraph AccountLevel["Account Level"]
PROD_ACC["Production<br/>+ Change control"]
DEV_ACC["Development<br/>+ Cost controls"]
end
end
ROOT --> OULevel
WORK_OU --> PROD_ACC
WORK_OU --> DEV_ACC
style ROOT fill:#1a1a2e,stroke:#00d9ff,stroke-width:2px,color:#fff
style OULevel fill:#264653,stroke:#2a9d8f,stroke-width:2px,color:#fff
style AccountLevel fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
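To make the inheritance model concrete: an SCP attached to the root is inherited by every OU and member account beneath it, while an attachment on a single account affects only that account. SCPs do not apply to the management account. The sketch below attaches the `protect_security_services` policy (defined in this section, but not covered by the attachments shown for it) at the root, then adds an account-level policy. The `deny_leave_org` policy and both attachment names are hypothetical additions for illustration.

```hcl
# Root level: inherited by every OU and member account
resource "aws_organizations_policy_attachment" "root_protect_security" {
  policy_id = aws_organizations_policy.protect_security_services.id
  target_id = aws_organizations_organization.main.roots[0].id
}

# Hypothetical extra policy used to illustrate an account-level attachment
resource "aws_organizations_policy" "deny_leave_org" {
  name        = "DenyLeaveOrganization"
  description = "Prevent member accounts from leaving the organization"
  # type defaults to SERVICE_CONTROL_POLICY

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "DenyLeaveOrganization"
        Effect   = "Deny"
        Action   = "organizations:LeaveOrganization"
        Resource = "*"
      }
    ]
  })
}

# Account level: applies to the log archive account only
resource "aws_organizations_policy_attachment" "log_archive_deny_leave" {
  policy_id = aws_organizations_policy.deny_leave_org.id
  target_id = aws_organizations_account.log_archive.id
}
```

Because an explicit deny at any level wins, it is usually safer to keep root-level SCPs small and push environment-specific restrictions down to OUs or accounts.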
# organizations/scp.tf
# Base SCP - Deny dangerous operations
resource "aws_organizations_policy" "deny_root_user" {
name = "DenyRootUser"
description = "Deny all actions by root user"
content = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "DenyRootUser"
Effect = "Deny"
Action = "*"
Resource = "*"
Condition = {
StringLike = {
"aws:PrincipalArn" = "arn:aws:iam::*:root"
}
}
}
]
})
}
# Prevent disabling security services
resource "aws_organizations_policy" "protect_security_services" {
name = "ProtectSecurityServices"
description = "Prevent disabling CloudTrail, Config, GuardDuty"
content = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "ProtectCloudTrail"
Effect = "Deny"
Action = [
"cloudtrail:StopLogging",
"cloudtrail:DeleteTrail",
"cloudtrail:PutEventSelectors"
]
Resource = "*"
},
{
Sid = "ProtectConfig"
Effect = "Deny"
Action = [
"config:DeleteConfigRule",
"config:DeleteConfigurationRecorder",
"config:DeleteDeliveryChannel",
"config:StopConfigurationRecorder"
]
Resource = "*"
},
{
Sid = "ProtectGuardDuty"
Effect = "Deny"
Action = [
"guardduty:DeleteDetector",
"guardduty:DeleteMembers",
"guardduty:DisassociateFromMasterAccount",
"guardduty:StopMonitoringMembers"
]
Resource = "*"
}
]
})
}
# Region restrictions
resource "aws_organizations_policy" "region_restriction" {
name = "RegionRestriction"
description = "Restrict operations to approved regions"
content = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "DenyUnapprovedRegions"
Effect = "Deny"
NotAction = [
"iam:*",
"organizations:*",
"route53:*",
"budgets:*",
"cloudfront:*",
"support:*",
"sts:*"
]
Resource = "*"
Condition = {
StringNotEquals = {
"aws:RequestedRegion" = [
"us-east-1",
"us-west-2",
"eu-west-1"
]
}
}
}
]
})
}
# Attach SCPs
resource "aws_organizations_policy_attachment" "root_deny_root" {
policy_id = aws_organizations_policy.deny_root_user.id
target_id = aws_organizations_organization.main.roots[0].id
}
resource "aws_organizations_policy_attachment" "workloads_region" {
policy_id = aws_organizations_policy.region_restriction.id
target_id = aws_organizations_organizational_unit.workloads.id
}
Control Tower Setup
# control-tower/main.tf
resource "aws_controltower_landing_zone" "main" {
manifest_json = jsonencode({
governedRegions = [
"us-east-1",
"us-west-2",
"eu-west-1"
]
organizationStructure = {
security = {
name = "Security"
}
sandbox = {
name = "Sandbox"
}
}
centralizedLogging = {
accountId = aws_organizations_account.log_archive.id
configurations = {
loggingBucket = {
retentionDays = 365
}
accessLoggingBucket = {
retentionDays = 3650
}
}
}
securityRoles = {
accountId = aws_organizations_account.security_audit.id
}
accessManagement = {
enabled = true
}
})
version = "3.3"
}
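# Strongly recommended guardrails (these AWS-GR controls are detective: they flag noncompliant resources through AWS Config rather than block the action)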
resource "aws_controltower_control" "disallow_public_read_acl" {
control_identifier = "arn:aws:controltower:us-east-1::control/AWS-GR_S3_BUCKET_PUBLIC_READ_PROHIBITED"
target_identifier = aws_organizations_organizational_unit.workloads.arn
}
resource "aws_controltower_control" "disallow_public_write_acl" {
control_identifier = "arn:aws:controltower:us-east-1::control/AWS-GR_S3_BUCKET_PUBLIC_WRITE_PROHIBITED"
target_identifier = aws_organizations_organizational_unit.workloads.arn
}
resource "aws_controltower_control" "enable_encryption_at_rest" {
control_identifier = "arn:aws:controltower:us-east-1::control/AWS-GR_ENCRYPTED_VOLUMES"
target_identifier = aws_organizations_organizational_unit.workloads.arn
}
# Detective guardrails (alerts only)
resource "aws_controltower_control" "detect_public_ip_on_eni" {
control_identifier = "arn:aws:controltower:us-east-1::control/AWS-GR_DETECT_PUBLIC_IP_ON_ENI"
target_identifier = aws_organizations_organizational_unit.workloads.arn
}
IAM Identity Center (SSO)
flowchart LR
subgraph Users["Users & Groups"]
DEV_GROUP[Developers Group]
OPS_GROUP[Operations Group]
ADMIN_GROUP[Administrators Group]
end
subgraph PermissionSets["Permission Sets"]
DEV_PS[Developer Access]
OPS_PS[Operations Access]
ADMIN_PS[Administrator Access]
RO_PS[ReadOnly Access]
end
subgraph Accounts["AWS Accounts"]
PROD[Production]
STAGING[Staging]
DEV[Development]
end
DEV_GROUP --> DEV_PS
OPS_GROUP --> OPS_PS
ADMIN_GROUP --> ADMIN_PS
DEV_PS --> DEV
DEV_PS --> STAGING
OPS_PS --> PROD
OPS_PS --> STAGING
ADMIN_PS --> PROD
ADMIN_PS --> STAGING
ADMIN_PS --> DEV
RO_PS --> PROD
style Users fill:#1a1a2e,stroke:#00d9ff,stroke-width:2px,color:#fff
style PermissionSets fill:#f77f00,stroke:#fff,stroke-width:2px,color:#fff
style Accounts fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
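The Terraform for this section references `local.sso_instance_arn` and `aws_identitystore_group.developers` without defining them. Here is a minimal sketch of one way they could be wired up, plus the Developers-to-Staging assignment that the diagram shows. The `staging_account_id` variable, the group's display name, and the `developers_staging` resource name are assumptions. If groups are synced from an external IdP, a data source lookup would replace the group resource.

```hcl
# Look up the organization's IAM Identity Center instance
data "aws_ssoadmin_instances" "this" {}

locals {
  sso_instance_arn  = tolist(data.aws_ssoadmin_instances.this.arns)[0]
  identity_store_id = tolist(data.aws_ssoadmin_instances.this.identity_store_ids)[0]
}

# Assumes groups are managed in the built-in identity store
resource "aws_identitystore_group" "developers" {
  identity_store_id = local.identity_store_id
  display_name      = "Developers" # assumed display name
  description       = "Engineering team members"
}

# Hypothetical input: ID of the staging account
variable "staging_account_id" {
  type        = string
  description = "12-digit ID of the staging account"
}

# Developers group -> Developer Access permission set -> Staging account
resource "aws_ssoadmin_account_assignment" "developers_staging" {
  instance_arn       = local.sso_instance_arn
  permission_set_arn = aws_ssoadmin_permission_set.developer.arn
  principal_id       = aws_identitystore_group.developers.group_id
  principal_type     = "GROUP"
  target_id          = var.staging_account_id
  target_type        = "AWS_ACCOUNT"
}
```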
# iam-identity-center/permission-sets.tf
resource "aws_ssoadmin_permission_set" "developer" {
name = "DeveloperAccess"
description = "Developer access to non-production environments"
instance_arn = local.sso_instance_arn
session_duration = "PT8H"
tags = {
Environment = "all"
Team = "engineering"
}
}
resource "aws_ssoadmin_managed_policy_attachment" "developer_power_user" {
instance_arn = local.sso_instance_arn
permission_set_arn = aws_ssoadmin_permission_set.developer.arn
managed_policy_arn = "arn:aws:iam::aws:policy/PowerUserAccess"
}
# Custom inline policy for developers
resource "aws_ssoadmin_permission_set_inline_policy" "developer_restrictions" {
inline_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "DenyExpensiveInstances"
Effect = "Deny"
Action = [
"ec2:RunInstances",
"ec2:StartInstances"
]
Resource = "arn:aws:ec2:*:*:instance/*"
Condition = {
StringNotLike = {
"ec2:InstanceType" = [
"t3.*",
"t3a.*",
"m6i.large",
"m6i.xlarge"
]
}
}
}
]
})
instance_arn = local.sso_instance_arn
permission_set_arn = aws_ssoadmin_permission_set.developer.arn
}
# Operations permission set
resource "aws_ssoadmin_permission_set" "operations" {
name = "OperationsAccess"
description = "Operations team access for production"
instance_arn = local.sso_instance_arn
session_duration = "PT12H"
}
resource "aws_ssoadmin_managed_policy_attachment" "ops_admin" {
instance_arn = local.sso_instance_arn
permission_set_arn = aws_ssoadmin_permission_set.operations.arn
managed_policy_arn = "arn:aws:iam::aws:policy/AdministratorAccess"
}
# Account assignments
resource "aws_ssoadmin_account_assignment" "developers_dev" {
instance_arn = local.sso_instance_arn
permission_set_arn = aws_ssoadmin_permission_set.developer.arn
principal_id = aws_identitystore_group.developers.group_id
principal_type = "GROUP"
target_id = aws_organizations_account.development.id
target_type = "AWS_ACCOUNT"
}
Service Catalog Portfolio
# service-catalog/portfolio.tf
resource "aws_servicecatalog_portfolio" "infrastructure" {
name = "Infrastructure Templates"
description = "Self-service infrastructure products"
provider_name = "Platform Engineering"
tags = {
Team = "platform"
}
}
# VPC Product
resource "aws_servicecatalog_product" "vpc" {
name = "Standard VPC"
owner = "Platform Engineering"
type = "CLOUD_FORMATION_TEMPLATE"
provisioning_artifact_parameters {
name = "v1.0"
description = "Standard 3-tier VPC with NAT Gateways"
type = "CLOUD_FORMATION_TEMPLATE"
template_url = "https://s3.amazonaws.com/templates/vpc-standard.yaml"
}
tags = {
Product = "vpc"
}
}
resource "aws_servicecatalog_product_portfolio_association" "vpc" {
portfolio_id = aws_servicecatalog_portfolio.infrastructure.id
product_id = aws_servicecatalog_product.vpc.id
}
# EKS Cluster Product
resource "aws_servicecatalog_product" "eks" {
name = "EKS Cluster"
owner = "Platform Engineering"
type = "CLOUD_FORMATION_TEMPLATE"
provisioning_artifact_parameters {
name = "v1.29"
description = "EKS 1.29 with managed node groups"
type = "CLOUD_FORMATION_TEMPLATE"
template_url = "https://s3.amazonaws.com/templates/eks-cluster.yaml"
}
tags = {
Product = "eks"
}
}
# RDS Database Product
resource "aws_servicecatalog_product" "rds" {
name = "RDS PostgreSQL"
owner = "Platform Engineering"
type = "CLOUD_FORMATION_TEMPLATE"
provisioning_artifact_parameters {
name = "v1.0"
description = "PostgreSQL with Multi-AZ and encryption"
type = "CLOUD_FORMATION_TEMPLATE"
template_url = "https://s3.amazonaws.com/templates/rds-postgres.yaml"
}
tags = {
Product = "rds"
}
}
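# Grant access to portfolio
# Identity Center roles are named AWSReservedSSO_<PermissionSetName>_<suffix>, so match them with a wildcard pattern.
# If Identity Center is homed outside us-east-1, the role path also includes the region.
resource "aws_servicecatalog_principal_portfolio_association" "developers" {
portfolio_id = aws_servicecatalog_portfolio.infrastructure.id
principal_arn = "arn:aws:iam:::role/aws-reserved/sso.amazonaws.com/AWSReservedSSO_DeveloperAccess_*"
principal_type = "IAM_PATTERN"
}
CloudFormation StackSets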
# stacksets/baseline.tf
# Deploy security baseline to all accounts
resource "aws_cloudformation_stack_set" "security_baseline" {
name = "security-baseline"
description = "Security baseline for all accounts"
permission_model = "SERVICE_MANAGED"
capabilities = ["CAPABILITY_NAMED_IAM"]
template_body = file("${path.module}/templates/security-baseline.yaml")
parameters = {
EnableGuardDuty = "true"
EnableSecurityHub = "true"
EnableConfigRules = "true"
LogRetentionDays = "365"
}
auto_deployment {
enabled = true
retain_stacks_on_account_removal = false
}
operation_preferences {
failure_tolerance_count = 0
max_concurrent_count = 5
region_concurrency_type = "PARALLEL"
}
}
resource "aws_cloudformation_stack_set_instance" "security_baseline" {
stack_set_name = aws_cloudformation_stack_set.security_baseline.name
deployment_targets {
organizational_unit_ids = [
aws_organizations_organizational_unit.workloads.id
]
}
region = "us-east-1"
}
# Cost optimization baseline
resource "aws_cloudformation_stack_set" "cost_optimization" {
name = "cost-optimization"
description = "Cost optimization resources"
permission_model = "SERVICE_MANAGED"
template_body = file("${path.module}/templates/cost-optimization.yaml")
parameters = {
BudgetAmount = "10000"
BudgetThreshold = "80"
NotificationEmail = "finance@company.com"
}
auto_deployment {
enabled = true
}
}
CloudTrail Lake for Compliance
# cloudtrail/lake.tf
resource "aws_cloudtrail_event_data_store" "organizational" {
name = "organizational-events"
advanced_event_selector {
name = "Log all management events"
field_selector {
field = "eventCategory"
equals = ["Management"]
}
}
advanced_event_selector {
name = "Log S3 data events"
field_selector {
field = "eventCategory"
equals = ["Data"]
}
field_selector {
field = "resources.type"
equals = ["AWS::S3::Object"]
}
}
retention_period = 365
organization_enabled = true
multi_region_enabled = true
termination_protection_enabled = true
tags = {
Compliance = "required"
DataStore = "organizational"
}
}
# Common compliance queries
# The AWS provider does not appear to offer a resource for saved CloudTrail Lake queries,
# so the SQL lives in locals and can be run with `aws cloudtrail start-query`.
locals {
# Lake queries address the store by ID, which is the last segment of its ARN
event_data_store_id = element(split("/", aws_cloudtrail_event_data_store.organizational.arn), 1)
root_user_activity_query = <<-SQL
SELECT
eventTime,
eventName,
userIdentity.arn,
sourceIPAddress,
requestParameters
FROM ${local.event_data_store_id}
WHERE userIdentity.type = 'Root'
AND eventTime > '${formatdate("YYYY-MM-DD hh:mm:ss", timeadd(timestamp(), "-24h"))}'
ORDER BY eventTime DESC
SQL
# timeadd() has no day unit, so 7 days is written as 168h
unauthorized_api_calls_query = <<-SQL
SELECT
eventTime,
eventName,
userIdentity.arn,
errorCode,
errorMessage
FROM ${local.event_data_store_id}
WHERE errorCode IN ('AccessDenied', 'UnauthorizedOperation')
AND eventTime > '${formatdate("YYYY-MM-DD hh:mm:ss", timeadd(timestamp(), "-168h"))}'
ORDER BY eventTime DESC
SQL
}
Account Vending Machine
sequenceDiagram
participant User as Team Lead
participant Portal as Service Catalog
participant Lambda as Account Factory
participant Org as AWS Organizations
participant CT as Control Tower
participant SSO as IAM Identity Center
User->>Portal: Request new account
Portal->>Lambda: Trigger account creation
Lambda->>Org: Create account in OU
Org-->>Lambda: Account created
Lambda->>CT: Apply guardrails
CT-->>Lambda: Guardrails active
Lambda->>SSO: Configure access
SSO-->>Lambda: Access configured
Lambda->>Org: Apply SCPs
Lambda->>Org: Tag account
Lambda-->>Portal: Account ready
Portal-->>User: Account credentials & details
Note over User,SSO: Account provisioned in ~5 minutes
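The flow above is driven by a custom Lambda function. As a simpler point of comparison, here is a sketch that swaps the Lambda for Control Tower's built-in Account Factory, provisioned as a Service Catalog product from Terraform. This is not the implementation in the diagram. The variable names and the `team_account` resource name are hypothetical, and the provisioning artifact name depends on the landing zone, so it is passed in rather than hard-coded.

```hcl
# Hypothetical inputs for a single account request
variable "account_name" {
  type = string
}

variable "account_email" {
  type = string
}

variable "owner_email" {
  type = string
}

variable "account_factory_artifact_name" {
  type        = string
  description = "Active provisioning artifact of the Account Factory product"
}

resource "aws_servicecatalog_provisioned_product" "team_account" {
  name                       = "account-${var.account_name}"
  product_name               = "AWS Control Tower Account Factory"
  provisioning_artifact_name = var.account_factory_artifact_name

  provisioning_parameters {
    key   = "AccountName"
    value = var.account_name
  }
  provisioning_parameters {
    key   = "AccountEmail"
    value = var.account_email
  }
  provisioning_parameters {
    key   = "ManagedOrganizationalUnit"
    value = "Workloads" # OU must already be registered with Control Tower
  }
  provisioning_parameters {
    key   = "SSOUserEmail"
    value = var.owner_email
  }
  provisioning_parameters {
    key   = "SSOUserFirstName"
    value = "Team" # placeholder
  }
  provisioning_parameters {
    key   = "SSOUserLastName"
    value = "Lead" # placeholder
  }
}
```

With this approach Control Tower enrolls the account and applies the OU's guardrails itself, which trades the flexibility of the custom Lambda for less code to maintain.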
Cost Governance
# budgets/main.tf
# Account-level budget
resource "aws_budgets_budget" "account_monthly" {
name = "monthly-account-budget"
budget_type = "COST"
limit_amount = "5000"
limit_unit = "USD"
time_unit = "MONTHLY"
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = ["finance@company.com", "platform@company.com"]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = ["cto@company.com"]
}
cost_filter {
name = "LinkedAccount"
values = [data.aws_caller_identity.current.account_id]
}
}
# Service-specific budgets
resource "aws_budgets_budget" "ec2_monthly" {
name = "ec2-monthly-budget"
budget_type = "COST"
limit_amount = "2000"
limit_unit = "USD"
time_unit = "MONTHLY"
cost_filter {
name = "Service"
values = ["Amazon Elastic Compute Cloud - Compute"]
}
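notification {
comparison_operator = "GREATER_THAN"
threshold = 90
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
# Each notification needs at least one subscriber
subscriber_email_addresses = ["platform@company.com"]
}
}
Best Practices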
| Practice | Why | Impact |
|---|---|---|
| Multi-account strategy | Isolation, blast radius control | Security++ |
| Automated account provisioning | Self-service, consistency | 75% faster |
| Service Control Policies | Preventive controls | Compliance |
| Service Catalog | Standardized templates | 60% fewer errors |
| CloudTrail Lake | Centralized audit logs | Investigation |
| Cost allocation tags | Chargeback per team | Accountability |
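The organization resource enables `TAG_POLICY`, so the cost allocation row can be backed by a tag policy. The following is a minimal sketch, assuming `CostCenter` is the chargeback key (it appears in the account tags earlier). The allowed values other than `security` are assumptions, and the policy and attachment names are hypothetical.

```hcl
resource "aws_organizations_policy" "cost_center_tag" {
  name        = "CostCenterTag"
  description = "Standardize the CostCenter tag used for chargeback"
  type        = "TAG_POLICY"

  content = jsonencode({
    tags = {
      CostCenter = {
        # Enforce the exact capitalization of the tag key
        tag_key = {
          "@@assign" = "CostCenter"
        }
        # Allowed values; only "security" comes from the article
        tag_value = {
          "@@assign" = ["security", "engineering", "platform"]
        }
      }
    }
  })
}

resource "aws_organizations_policy_attachment" "root_cost_center_tag" {
  policy_id = aws_organizations_policy.cost_center_tag.id
  target_id = aws_organizations_organization.main.roots[0].id
}
```

Without an `enforced_for` block the policy only reports noncompliant tags. That makes it a low-risk starting point before any enforcement is turned on.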
Business Impact
flowchart TB
subgraph Before["Before Cloud Operating Model"]
B1["Manual account creation<br/>2-3 weeks"]
B2["Ad-hoc security configs<br/>Inconsistent"]
B3["No cost visibility<br/>Budget overruns"]
B4["Ticket-driven provisioning<br/>Bottleneck"]
end
subgraph After["After Implementation"]
A1["Automated account vending<br/>5 minutes"]
A2["Consistent security baseline<br/>100% compliance"]
A3["Per-team cost allocation<br/>Predictable spend"]
A4["Self-service infrastructure<br/>Empowered teams"]
end
Before ==> After
subgraph Results["Business Outcomes"]
R1["40% operational cost reduction"]
R2["85% faster time-to-market"]
R3["Zero security incidents"]
R4["99.5% compliance score"]
end
After ==> Results
style Before fill:#e63946,stroke:#fff,stroke-width:2px,color:#fff
style After fill:#2a9d8f,stroke:#fff,stroke-width:2px,color:#fff
style Results fill:#ffbe0b,stroke:#fff,stroke-width:2px,color:#000
Troubleshooting
"Account creation fails"
# Check Control Tower status
aws controltower list-landing-zones
# Verify OU structure
aws organizations list-organizational-units-for-parent \
--parent-id r-xxxx
# Check SCP conflicts
aws organizations list-policies-for-target \
--target-id ou-xxxx \
--filter SERVICE_CONTROL_POLICY
"SSO access not working"
- Verify permission set assignments
- Check if user is in correct IdP group
- Ensure account is in expected OU
- Validate session duration settings
"Service Catalog provisioning errors"
- Check IAM roles for Service Catalog
- Verify CloudFormation template syntax
- Ensure sufficient service quotas
- Review launch constraints
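For the last two checks it helps to know what a launch constraint looks like in Terraform. This sketch attaches one to the Standard VPC product defined earlier. The role name is hypothetical, and the permissions policy the role would need (CloudFormation plus whatever the template creates) is omitted.

```hcl
# Hypothetical launch role that Service Catalog assumes to run the template
resource "aws_iam_role" "servicecatalog_launch" {
  name = "servicecatalog-launch-role" # assumed name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect    = "Allow"
        Principal = { Service = "servicecatalog.amazonaws.com" }
        Action    = "sts:AssumeRole"
      }
    ]
  })
}

resource "aws_servicecatalog_constraint" "vpc_launch" {
  description  = "Launch the Standard VPC product with a dedicated role"
  portfolio_id = aws_servicecatalog_portfolio.infrastructure.id
  product_id   = aws_servicecatalog_product.vpc.id
  type         = "LAUNCH"

  parameters = jsonencode({
    RoleArn = aws_iam_role.servicecatalog_launch.arn
  })

  # The product must be in the portfolio before a constraint can reference both
  depends_on = [aws_servicecatalog_product_portfolio_association.vpc]
}
```

When provisioning fails with an access error, the launch role's trust policy and permissions are usually the first things worth comparing against the resources in the template.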
Conclusion
Building a cloud operating model isn't just about deploying AWS services - it's about creating a platform that enables innovation while maintaining governance. Four services work together here:
- AWS Organizations for multi-account structure
- Control Tower for automated governance
- IAM Identity Center for centralized access
- Service Catalog for self-service infrastructure
Together they create a foundation that scales with your organization, reduces operational overhead by 40%, and accelerates time-to-market while maintaining security and compliance.
The key is treating your cloud platform as a product, with internal teams as your customers, and building the automation and guardrails that let them move fast safely.