Networking, IAM & drift

Production platforms need more than a single public subnet: EKS-ready networking, tiered security groups, least-privilege IAM, and a disciplined answer when someone edits the console. This guide extends your VPC module, wires roles for Kubernetes, and shows how to detect and fix drift without panic applies.

Prerequisites: Environments, variables & secrets and remote state from the prior guides.

After reading, you should be able to:

Tag subnets for public/internal load balancers and run nodes in private subnets with NAT.
Model security groups as ALB → app → database tiers.
Attach minimal IAM roles for EKS control plane, nodes, and IRSA workloads.
Use import, refresh, and plan-driven workflows to reconcile drift.

Figure: VPC with public and private subnets, security group tiers, IAM roles, and drift detection via terraform plan. Network layout and IAM are code; drift is when reality diverges from that code. Plan is your diff; import is your adoption path.

Step 1 — EKS-ready subnet layout

Extend the VPC module with two AZs, public subnets (load balancers + NAT) and private subnets (worker nodes):

resource "aws_subnet" "public" {
  for_each = toset(["a", "b"])
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(var.vpc_cidr, 8, each.key == "a" ? 1 : 2)
  availability_zone       = data.aws_availability_zones.available.names[each.key == "a" ? 0 : 1]
  map_public_ip_on_launch = true
  tags = merge(var.tags, {
    Name = "${var.name_prefix}-public-${each.key}"
    "kubernetes.io/role/elb" = "1"
  })
}

resource "aws_subnet" "private" {
  for_each = toset(["a", "b"])
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, each.key == "a" ? 11 : 12)
  availability_zone = data.aws_availability_zones.available.names[each.key == "a" ? 0 : 1]
  tags = merge(var.tags, {
    Name = "${var.name_prefix}-private-${each.key}"
    "kubernetes.io/role/internal-elb" = "1"
  })
}

resource "aws_eip" "nat" {
  count  = var.enable_nat ? 1 : 0
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  count         = var.enable_nat ? 1 : 0
  allocation_id = aws_eip.nat[0].id
  # Single NAT gateway in AZ "a"; reference the subnet by key, not by position
  subnet_id     = aws_subnet.public["a"].id
}

resource "aws_route_table" "private" {
  count  = var.enable_nat ? 1 : 0
  vpc_id = aws_vpc.main.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[0].id
  }
}
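
A route table only affects subnets it is associated with, and the listing above stops before that step. A minimal sketch of the association, assuming the aws_subnet.private and aws_route_table.private resources shown above (the association's resource name is an illustrative choice):

```hcl
# Associate each private subnet with the NAT-backed route table.
# The filter yields an empty map when var.enable_nat is false, so no
# associations are planned when the route table itself does not exist.
resource "aws_route_table_association" "private" {
  for_each = { for k, s in aws_subnet.private : k => s if var.enable_nat }

  subnet_id      = each.value.id
  route_table_id = aws_route_table.private[0].id
}
```

Without an explicit association, a subnet falls back to the VPC's main route table, so nodes in private subnets would have no default route through the NAT gateway.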

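Some setups also need the tag kubernetes.io/cluster/CLUSTER_NAME=shared on subnets, with CLUSTER_NAME replaced by your cluster name. Whether it is required depends on your EKS and load balancer controller versions, so check the current EKS docs for your cluster version.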
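
If the cluster tag is needed, one way to keep it consistent is to build it once and merge it into both subnet resources. This is a sketch only: var.cluster_name and local.eks_cluster_tags are hypothetical names that do not appear elsewhere in this guide.

```hcl
# Hypothetical input: the EKS cluster name these subnets will serve.
variable "cluster_name" {
  type = string
}

locals {
  # "shared" marks the subnet as usable by this cluster without being owned by it.
  eks_cluster_tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }
}

# In aws_subnet.public and aws_subnet.private, extend the existing merge, e.g.:
#   tags = merge(var.tags, local.eks_cluster_tags, {
#     Name                     = "${var.name_prefix}-public-${each.key}"
#     "kubernetes.io/role/elb" = "1"
#   })
```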

Step 2 — Security groups (three tiers)

SG	Inbound	Outbound
sg-alb	443 from `0.0.0.0/0` (or corporate CIDR)	to `sg-app` on app port
sg-app	app port from `sg-alb`	to `sg-db` on 5432
sg-db	5432 from `sg-app` only	minimal (often none)

resource "aws_security_group" "app" {
  name_prefix = "${var.name_prefix}-app-"
  vpc_id      = aws_vpc.main.id

  ingress {
    description     = "From ALB"
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.db.id]
  }
}

Avoid 0.0.0.0/0 on app or database tiers. For EKS, node security groups often allow control plane ↔ node traffic per AWS documentation—do not copy generic “allow all” examples into prod.
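
The app group above references both aws_security_group.alb and aws_security_group.db. If those groups also referenced the app group through inline rules, Terraform would report a dependency cycle. A common way around this is to declare the ALB and database rules as standalone resources. The sketch below assumes a recent AWS provider (v5 or later) that offers the aws_vpc_security_group_ingress_rule and aws_vpc_security_group_egress_rule resources; the rule names are illustrative.

```hcl
# ALB tier: group has no inline rules; rules are separate resources.
resource "aws_security_group" "alb" {
  name_prefix = "${var.name_prefix}-alb-"
  vpc_id      = aws_vpc.main.id
}

resource "aws_vpc_security_group_ingress_rule" "alb_https" {
  security_group_id = aws_security_group.alb.id
  description       = "HTTPS from the internet (narrow to a corporate CIDR if possible)"
  cidr_ipv4         = "0.0.0.0/0"
  from_port         = 443
  to_port           = 443
  ip_protocol       = "tcp"
}

resource "aws_vpc_security_group_egress_rule" "alb_to_app" {
  security_group_id            = aws_security_group.alb.id
  description                  = "To app tier"
  referenced_security_group_id = aws_security_group.app.id
  from_port                    = 8080
  to_port                      = 8080
  ip_protocol                  = "tcp"
}

# Database tier: ingress from the app group only, no egress rules.
resource "aws_security_group" "db" {
  name_prefix = "${var.name_prefix}-db-"
  vpc_id      = aws_vpc.main.id
}

resource "aws_vpc_security_group_ingress_rule" "db_from_app" {
  security_group_id            = aws_security_group.db.id
  description                  = "PostgreSQL from app tier only"
  referenced_security_group_id = aws_security_group.app.id
  from_port                    = 5432
  to_port                      = 5432
  ip_protocol                  = "tcp"
}
```

Avoid mixing inline rules and standalone rule resources on the same security group; here the app group keeps its inline rules while the ALB and database groups use standalone rules only.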

Step 3 — IAM: cluster, nodes, workloads

resource "aws_iam_role" "eks_cluster" {
  name = "${var.name_prefix}-eks-cluster"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = { Service = "eks.amazonaws.com" }
      Action = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "eks_cluster" {
  role       = aws_iam_role.eks_cluster.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
}

resource "aws_iam_role" "eks_nodes" {
  name = "${var.name_prefix}-eks-nodes"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "eks_nodes" {
  for_each = toset([
    "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
    "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
  ])
  role       = aws_iam_role.eks_nodes.name
  policy_arn = each.value
}

# Pod assumes a role via OIDC — grant only what the app needs (e.g. S3 read)
resource "aws_iam_role" "app_s3_read" {
  name = "${var.name_prefix}-app-s3-read"
  assume_role_policy = data.aws_iam_policy_document.irsa.json
}

resource "aws_iam_role_policy" "app_s3_read" {
  role = aws_iam_role.app_s3_read.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject"]
      Resource = "arn:aws:s3:::my-bucket/*"
    }]
  })
}

Prefer IRSA over mounting instance-wide node credentials for application AWS API calls.
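
The app_s3_read role above references data.aws_iam_policy_document.irsa, which is not defined in this guide. A minimal sketch of such a trust policy follows. It assumes an OIDC provider resource named aws_iam_openid_connect_provider.eks already exists for the cluster, and var.app_namespace and var.app_service_account are hypothetical inputs naming the Kubernetes service account allowed to assume the role.

```hcl
# Hypothetical inputs identifying the one service account that may use the role.
variable "app_namespace" {
  type = string
}

variable "app_service_account" {
  type = string
}

locals {
  # Condition keys use the issuer URL without the scheme.
  oidc_issuer = replace(aws_iam_openid_connect_provider.eks.url, "https://", "")
}

data "aws_iam_policy_document" "irsa" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.eks.arn]
    }

    # Restrict the role to a single namespace/service account pair.
    condition {
      test     = "StringEquals"
      variable = "${local.oidc_issuer}:sub"
      values   = ["system:serviceaccount:${var.app_namespace}:${var.app_service_account}"]
    }

    # Only accept tokens issued for STS.
    condition {
      test     = "StringEquals"
      variable = "${local.oidc_issuer}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}
```

Pinning the sub claim to one service account is what keeps the role from being assumable by every pod in the cluster.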

Least privilege checklist: no AdministratorAccess on node roles; scope policies to resource ARNs; use separate roles per microservice; audit with IAM Access Analyzer.

Step 4 — What is drift?

Drift means Terraform state and configuration no longer match real infrastructure, usually because someone changed a setting in the AWS console, a support ticket added a rule, or a failed apply left partial resources.

Signal	Meaning
`plan` shows changes you did not code	Reality ≠ code (drift) or provider read changed
`apply -refresh-only` updates state from API	State catches up to reality without changing infrastructure
`apply` would change prod	Either adopt drift into code or revert reality

Step 5 — Detect drift in CI and locally

terraform plan -var-file=envs/prod.tfvars -detailed-exitcode
# 0 = no changes, 1 = error, 2 = changes present (drift or pending work)

Scheduled workflow (nightly) on prod state:

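- name: drift detection
  id: drift
  run: |
    set +e  # do not abort on exit code 2; record it instead
    terraform plan -var-file=envs/prod.tfvars -detailed-exitcode -no-color
    echo "plan_exit=$?" >> "$GITHUB_OUTPUT"  # assumes GitHub Actions
  continue-on-error: true
# alert via Slack/PagerDuty when steps.drift.outputs.plan_exit is 2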

Pair with AWS Config or CloudTrail alerts for resources tagged as Terraform-managed but modified outside pipeline.

Step 6 — Fix strategies (decision tree)

Console change was wrong: revert it in the console, or run terraform apply to restore the coded values.
Console change was right (even if meant as temporary): encode it in Terraform, open a PR, and apply through CI.
Resource exists but is not in state: terraform import it, then align the HCL.
Resource was removed in the console: plan shows a create; apply to recreate it, or remove it from code if the deletion was intentional.

Step 7 — Import existing resources

Example: security group created manually during an incident.

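# Write a matching resource block first, then import.
# Pass the same -var-file so required variables resolve during import:
terraform import -var-file=envs/prod.tfvars aws_security_group.app sg-0abc123def456789

terraform plan -var-file=envs/prod.tfvars
# Goal: "No changes." Your HCL must match the imported attributes.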

For complex resources, declare an import block like the one below and run terraform plan -generate-config-out=generated.tf (Terraform 1.5+). Treat the generated file as a starting point, then hand-edit it and move the resources into modules.

import {
  to = aws_security_group.app
  id = "sg-0abc123def456789"
}

Import blocks in code (Terraform 1.5+) make adoption reviewable in PRs like any other change.

Step 8 — Refresh without apply

terraform apply -refresh-only -var-file=envs/prod.tfvars
# updates state from AWS; does not change infrastructure

Use when you need state accurate before a refactor—not a substitute for fixing bad code. After refresh, run plan again.

Step 9 — moved blocks (rename without destroy)

moved {
  from = aws_security_group.application
  to   = aws_security_group.app
}

Prevents destroy/recreate when refactoring addresses—always verify plan shows no unexpected -/+ on production databases or NAT gateways.
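
The same mechanism covers moving a resource into a module, which is a common refactor once imported resources have been cleaned up. A sketch, assuming a hypothetical child module called "network" that now declares the security group:

```hcl
# Tell Terraform the root-level security group now lives inside module.network,
# so the existing object is re-addressed in state instead of destroyed and recreated.
# "network" is an illustrative module name, not one defined in this guide.
moved {
  from = aws_security_group.app
  to   = module.network.aws_security_group.app
}
```

After adding the block, the plan should report the resource as moved, with no create or destroy actions for it.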

Step 10 — Incident: emergency console fix

Runbook: During an outage it is acceptable to open a security group rule in the console. Within 24 hours: (1) capture what changed, (2) update Terraform to match or revert, (3) import if needed, (4) apply via CI with review. Undocumented drift becomes the next outage.

Step 11 — Connect to Kubernetes

With network and IAM in place, the Kubernetes networking guide covers Services and Ingress inside the cluster. Terraform provisions the substrate; CI deploys workloads onto it.

Step 12 — Troubleshooting

Symptom	Action
EKS nodes cannot pull images	Check the 0.0.0.0/0 route to NAT on the private route table and its subnet associations, or add VPC endpoints for ECR and S3
LoadBalancer stuck pending	Missing `kubernetes.io/role/elb` on public subnets
`import` succeeds, plan wants replace	HCL attributes differ—align tags, name, vpc_id
Drift every plan on same attribute	Provider bug or computed field—use `lifecycle { ignore_changes = [...] }` sparingly
Accidental destroy in plan	Stop—fix `moved` or `state mv`; never apply blind
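
For the recurring-drift row above, ignore_changes is most defensible when scoped to the single attribute an external system keeps rewriting. A sketch, assuming a hypothetical security group named "batch" and a hypothetical tag key "LastAudited" written by an outside tool:

```hcl
resource "aws_security_group" "batch" {
  name_prefix = "${var.name_prefix}-batch-"
  vpc_id      = aws_vpc.main.id
  tags        = var.tags

  lifecycle {
    # Ignore only the one tag key managed outside Terraform;
    # every other attribute, including other tags, still shows up in plan.
    ignore_changes = [tags["LastAudited"]]
  }
}
```

Ignoring the whole tags map, or using ignore_changes = all, would hide real drift along with the noise.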

Step 13 — Anti-patterns

0.0.0.0/0 on SSH or database security groups.
Shared mega-role for all pods and nodes.
Ignoring nightly drift alerts until apply fails Friday night.
Setting lifecycle { prevent_destroy = false } (or removing the guard) on NAT gateways without a change process.
Importing prod resources without a PR review of generated config.

Interview phrase: “We run EKS in private subnets with tagged public subnets for ALBs, tier security groups by reference not CIDR sprawl, IRSA for app AWS access, nightly terraform plan for drift with exit code 2 alerts, and any console break-glass change gets imported or codified in the next PR.”

The one line to remember

Network and IAM are contracts—code them explicitly, detect when reality diverges, and never let console fixes become permanent secrets.

Infrastructure track — complete

You now have the full path: IaC explained → first project → state & modules → environments → this guide.