Infrastructure as Code with AI Assistants

You inherit a 2,000-line Terraform module with no docs, the previous platform engineer has left, and you have two days to add a multi-region read replica without taking down production. Reading the AWS provider docs tab-by-tab while reconstructing the author’s intent is exactly the slow, error-prone work AI assistants are good at compressing—if you ground them in real provider schemas instead of letting them guess from stale training data.

This guide shows the workflows that hold up in production: conversational design, MCP-grounded generation, security review, and drift remediation across Terraform, CloudFormation, Pulumi, and CDK—using Cursor, Claude Code, and Codex.

  • A repeatable pattern for framing infrastructure requests as business constraints, not resource lists
  • MCP setup for the HashiCorp Terraform, Pulumi, and AWS servers in all three tools, so suggestions are grounded in live registry/account data
  • Three copy-paste prompts you can use today: a security-framework review, a drift-detection sweep, and a cost-governance generator
  • A “When This Breaks” checklist for the failure modes that actually bite (state locks, drift, provider-pin breakage, MCP auth)

Frame Infrastructure as Constraints, Not Resources

The single biggest lever on output quality is the opening prompt. Engineers who get generic configs ask for generic things (“create an EKS cluster with 3 nodes”). Engineers who get production-ready configs describe the business and let the model derive the topology.

The “three riskiest assumptions” clause is what separates a usable answer from a wall of HCL. It forces the model to surface where it’s guessing (Is Atlas in the same region? Is the SLA per-region or global?) so you correct course before any code exists. This is identical across Cursor, Claude Code, and Codex—the discipline is in the prompt, not the tool.
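One way to make this framing repeatable is to keep the constraints in a small structure and render the prompt from it. The sketch below is only illustrative: the `WorkloadConstraints` fields and the `buildDesignPrompt` helper are hypothetical names, not part of Cursor, Claude Code, or Codex, and the sample values are made up.

```typescript
// Hypothetical shape: field names are illustrative, not from any tool's API.
interface WorkloadConstraints {
  business: string;
  availabilityTarget: string;
  monthlyBudgetUsd: number;
  compliance: string[];
  existingDependencies: string[];
}

// Render business constraints into an opening prompt that ends with the
// "three riskiest assumptions" clause described above.
function buildDesignPrompt(c: WorkloadConstraints): string {
  const lines = [
    `Context: ${c.business}`,
    `Availability target: ${c.availabilityTarget}`,
    `Monthly budget: $${c.monthlyBudgetUsd}`,
    `Compliance: ${c.compliance.join(", ") || "none stated"}`,
    `Existing dependencies: ${c.existingDependencies.join(", ") || "none stated"}`,
    "",
    "Propose a topology before writing any code.",
    "List the three riskiest assumptions you are making and what changes if each is wrong.",
  ];
  return lines.join("\n");
}

// Example values are invented for illustration.
const designPrompt = buildDesignPrompt({
  business: "B2B order-tracking API, traffic peaks during EU business hours",
  availabilityTarget: "99.9% (scope per-region or global: unconfirmed)",
  monthlyBudgetUsd: 4000,
  compliance: ["SOC 2"],
  existingDependencies: ["Atlas (region unconfirmed)"],
});
console.log(designPrompt);
```

Leaving unknowns visible in the input ("unconfirmed") gives the model something concrete to flag instead of silently guessing.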

Ground the AI in Real Provider Data with MCP

Models hallucinate resource arguments and lag behind provider releases. MCP servers fix this by giving the assistant live access to the Terraform Registry, the Pulumi Registry, and your AWS account. Setup is the same conceptually in every tool—register the server, then prompt normally—but the registration command differs.

This is HashiCorp’s official server (terraform-mcp-server), providing Registry lookups, provider/module discovery, and HCP Terraform workspace management.

Add to .cursor/mcp.json in your project (or the global config via Settings, MCP):

{
  "mcpServers": {
    "terraform": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "hashicorp/terraform-mcp-server"]
    }
  }
}

Then in Agent mode: “List the current aws_db_instance arguments for Postgres and flag any that are deprecated in the latest provider.”

AWS Labs publishes several servers. Use the right one and avoid the deprecated ones:

  • awslabs.cfn-mcp-server — CloudFormation and direct resource management via the Cloud Control API. Current and maintained.
  • awslabs.aws-iac-mcp-server — CloudFormation template validation, compliance checks, and deployment troubleshooting. This is the consolidated successor to the now-deprecated CDK server.

These ship as Python packages, so they run with uvx, not npx:

{
  "mcpServers": {
    "aws-cfn": {
      "command": "uvx",
      "args": ["awslabs.cfn-mcp-server@latest"]
    },
    "aws-iac": {
      "command": "uvx",
      "args": ["awslabs.aws-iac-mcp-server@latest"]
    }
  }
}

Credentials come from your standard AWS profile chain; set AWS_PROFILE in the server env block if you use named profiles.
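If several AWS servers need the same named profile, it can help to generate the config rather than hand-edit it. This is a minimal sketch, assuming the usual `env` key on an MCP server entry; the `awsServer` helper and the profile name `platform-readonly` are hypothetical.

```typescript
interface McpServer {
  command: string;
  args: string[];
  env?: Record<string, string>;
}

// Build one uvx-launched AWS Labs server entry with explicit credentials context.
function awsServer(pkg: string, profile: string, region: string): McpServer {
  return {
    command: "uvx",
    args: [`${pkg}@latest`],
    env: { AWS_PROFILE: profile, AWS_REGION: region },
  };
}

// "platform-readonly" is a made-up profile name; use one that exists in your AWS config.
const mcpConfig: { mcpServers: Record<string, McpServer> } = {
  mcpServers: {
    "aws-cfn": awsServer("awslabs.cfn-mcp-server", "platform-readonly", "us-east-1"),
    "aws-iac": awsServer("awslabs.aws-iac-mcp-server", "platform-readonly", "us-east-1"),
  },
};

// Paste the printed JSON into .cursor/mcp.json (or the equivalent config for your tool).
console.log(JSON.stringify(mcpConfig, null, 2));
```

A read-only profile is a reasonable default here: the assistant can inspect resources without being able to change them.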

For programming-language infrastructure, the Pulumi server (@pulumi/mcp-server) runs pulumi preview/up, retrieves stack outputs, and reads the Pulumi Registry.

{
  "mcpServers": {
    "pulumi": {
      "command": "npx",
      "args": ["@pulumi/mcp-server@latest", "stdio"]
    }
  }
}

Generate Production Terraform, Then Read It Critically

With the Terraform MCP server connected, generation grounds itself in current schemas. Ask for the latest provider versions explicitly—then verify the result rather than trusting it.

A grounded request produces a root config like this:

terraform {
  # use_lockfile (S3-native state locking) needs Terraform 1.10 or later
  required_version = ">= 1.10"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 6.0"
    }
  }

  backend "s3" {
    bucket       = "acme-tf-state-prod"
    key          = "platform/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true
  }
}

Two things to verify before you trust any AI-generated Terraform, because models routinely get them wrong:

  • Provider pin matches reality. The AWS provider is on the 6.x line (6.49.0 at the time of writing). If the model emits ~> 5.0 while claiming it’s “the latest,” that’s a stale-training tell—correct it to ~> 6.0 and re-run terraform init -upgrade.
  • State locking is current. Terraform 1.10 and later can lock S3 state natively with use_lockfile = true, so a separate dynamodb_table lock is no longer required for new projects. If you see a DynamoDB lock table generated for a new project, ask why.
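Both checks are mechanical enough to script as a pre-review gate. The following is a rough sketch, not a real HCL parser: it uses regular expressions, assumes `source` appears before `version` inside the provider block, and the `reviewGeneratedTerraform` name and rule labels are hypothetical.

```typescript
interface Finding {
  rule: "provider-pin" | "legacy-lock" | "missing-lockfile";
  detail: string;
}

// Flag the two stale-training tells described above in generated Terraform text.
function reviewGeneratedTerraform(hcl: string, expectedMajor: number): Finding[] {
  const findings: Finding[] = [];

  // Find the version constraint that follows source = "hashicorp/aws" in the same block.
  const pin = /source\s*=\s*"hashicorp\/aws"[^}]*?version\s*=\s*"([^"]+)"/.exec(hcl);
  const constraint = pin?.[1] ?? "";
  const major = /\d+/.exec(constraint)?.[0];
  if (major === undefined) {
    findings.push({ rule: "provider-pin", detail: "no hashicorp/aws version constraint found" });
  } else if (Number(major) !== expectedMajor) {
    findings.push({
      rule: "provider-pin",
      detail: `pinned to "${constraint}", expected major ${expectedMajor}`,
    });
  }

  // A DynamoDB lock table in new code deserves a question.
  if (/^\s*dynamodb_table\s*=/m.test(hcl)) {
    findings.push({ rule: "legacy-lock", detail: "dynamodb_table lock found in backend config" });
  }

  // An S3 backend without use_lockfile = true has no native locking configured.
  const usesS3Backend = /backend\s+"s3"/.test(hcl);
  if (usesS3Backend && !/^\s*use_lockfile\s*=\s*true\b/m.test(hcl)) {
    findings.push({ rule: "missing-lockfile", detail: "S3 backend without use_lockfile = true" });
  }

  return findings;
}
```

An empty result only means these two patterns were not found; it does not replace `terraform validate` or a human read of the plan.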

When the model returns that table, expect it to over-flag (e.g., insisting an internal ALB needs WAF). Treat it as a senior reviewer’s first pass: accept the encryption-at-rest and least-privilege findings, push back on the ones that don’t fit your threat model, and ask it to justify any “Critical” you disagree with.

Drift—someone clicking in the console, a hotfix that never made it back to code—is where IaC quietly rots. With the AWS MCP server connected, the assistant can read actual resource state and compare it to what your code declares.

A representative slice of what comes back:

URGENT db-cluster-prod / SecurityGroupIngress
expected: 10.0.0.0/16 on 5432
current: 0.0.0.0/0 on 5432 <- opened manually 3 days ago
risk: Postgres exposed to the internet
fix: revert in console now, then re-apply stack to lock it

Before you act on it, verify the two things AI drift reports get wrong: that the “expected” value matches your current code (not an old plan), and that the manual change wasn’t a deliberate, undocumented break-glass fix. Confirm with whoever owns the stack, then let the assistant generate the corrective change set.
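The comparison behind a report like this is simple set arithmetic, which makes it easy to sanity-check the assistant's output yourself. This is a minimal sketch under assumptions: the `IngressRule` shape and severity labels are hypothetical, and the expected and actual rules would come from your code and from the live account respectively.

```typescript
interface IngressRule {
  cidr: string;
  port: number;
}

interface DriftReport {
  severity: "none" | "review" | "urgent";
  unexpected: IngressRule[]; // present in the account, absent from code
  missing: IngressRule[]; // declared in code, absent from the account
}

const ruleKey = (r: IngressRule): string => `${r.cidr}:${r.port}`;

// Compare declared ingress rules with what is actually deployed.
function classifyIngressDrift(expected: IngressRule[], actual: IngressRule[]): DriftReport {
  const expectedKeys = new Set(expected.map(ruleKey));
  const actualKeys = new Set(actual.map(ruleKey));
  const unexpected = actual.filter((r) => !expectedKeys.has(ruleKey(r)));
  const missing = expected.filter((r) => !actualKeys.has(ruleKey(r)));

  // Anything newly open to the whole internet is treated as urgent.
  const openToWorld = unexpected.some((r) => r.cidr === "0.0.0.0/0" || r.cidr === "::/0");
  const severity: DriftReport["severity"] = openToWorld
    ? "urgent"
    : unexpected.length + missing.length > 0
      ? "review"
      : "none";

  return { severity, unexpected, missing };
}

// The db-cluster-prod example from the report above.
const report = classifyIngressDrift(
  [{ cidr: "10.0.0.0/16", port: 5432 }],
  [{ cidr: "0.0.0.0/0", port: 5432 }],
);
console.log(report.severity, report.unexpected, report.missing);
```

The function only says what differs. Whether a difference is a mistake or a deliberate break-glass change is still a question for the stack's owner.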

Cost optimization is multi-dimensional—performance, reliability, and spend traded against each other—which is exactly the kind of reasoning to delegate, then audit.

For a batch workload, give the model the shape of the job and let it pick the cost-optimal topology:

Workload: 100GB/day batch, tolerates 4h delay, needs 16 vCPU / 64GB while running,
only 08:00-18:00 EST. Target under $500/month. Recommend compute (Spot vs. on-demand
vs. Fargate), storage tiering, and the scheduler. Show the trade-off you made to hit budget.

A good answer reaches for Spot with an on-demand fallback, a scheduled scale-to-zero outside business hours, and S3 Intelligent-Tiering—then names the trade-off (Spot interruptions add latency variance within the 4h SLA budget). If it silently picks always-on on-demand, push back: “Why not Spot, given the 4-hour tolerance?”
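It helps to know the hourly ceiling the budget implies before reading the model's recommendation. The arithmetic below is a sketch with stated assumptions: a 30-day month, the job running every day in the 10-hour window, and the entire $500 going to compute (storage and data transfer would lower the ceiling). No real instance prices are used.

```typescript
// Highest blended hourly compute rate that still fits the monthly budget.
function maxHourlyRate(monthlyBudgetUsd: number, hoursPerDay: number, daysPerMonth: number): number {
  return monthlyBudgetUsd / (hoursPerDay * daysPerMonth);
}

// 08:00-18:00 is a 10-hour window; 30 days per month is an assumption.
const scheduledCeiling = maxHourlyRate(500, 10, 30);
// Same budget if the capacity were left running around the clock.
const alwaysOnCeiling = maxHourlyRate(500, 24, 30);

console.log(`scheduled: $${scheduledCeiling.toFixed(2)}/h`); // scheduled: $1.67/h
console.log(`always-on: $${alwaysOnCeiling.toFixed(2)}/h`); // always-on: $0.69/h
```

Scale-to-zero alone raises the affordable hourly rate by a factor of 2.4 (24 / 10), which is why an always-on answer should prompt a follow-up question.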

For teams that prefer real programming languages over HCL or YAML, Pulumi and CDK let the assistant generate typed, testable infrastructure.

Ask: “Using Pulumi TypeScript, create a ComponentResource for a microservice with a configurable replica count, CPU/memory, and an ingress path. Default to 1 replica in non-prod and 3 in prod.” A grounded result looks like:

import * as pulumi from '@pulumi/pulumi';

interface ServiceArgs {
  environment: string;
  replicas: number;
  cpu: string;
  memory: string;
}

class MicroserviceStack extends pulumi.ComponentResource {
  constructor(name: string, args: ServiceArgs, opts?: pulumi.ComponentResourceOptions) {
    super('acme:infra:MicroserviceStack', name, {}, opts);
    // ...service, ingress, and monitoring resources, created with { parent: this }
    this.registerOutputs();
  }
}

// Assumption: the stack name doubles as the environment name (e.g. "production").
const env = pulumi.getStack();

for (const svc of ['user-service', 'order-service', 'payment-service']) {
  new MicroserviceStack(svc, {
    environment: env,
    replicas: env === 'production' ? 3 : 1,
    cpu: '500m',
    memory: '1Gi',
  });
}

Then: “Generate a @pulumi/policy (CrossGuard) policy pack asserting that every service in production has at least 2 replicas and a CPU limit set.”
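Before reviewing what the assistant generates, it can be useful to pin down the rule itself in plain code. The sketch below deliberately does not use the @pulumi/policy API; it is a dependency-free stand-in with a hypothetical `ServiceSpec` shape, meant only to make the expected behavior explicit.

```typescript
// Hypothetical, simplified view of a service's settings.
interface ServiceSpec {
  name: string;
  environment: string;
  replicas: number;
  cpuLimit?: string;
}

// Return one message per violated rule; non-production services are not checked.
function productionViolations(services: ServiceSpec[]): string[] {
  const violations: string[] = [];
  for (const svc of services) {
    if (svc.environment !== "production") continue;
    if (svc.replicas < 2) {
      violations.push(`${svc.name}: production requires at least 2 replicas, found ${svc.replicas}`);
    }
    if (!svc.cpuLimit) {
      violations.push(`${svc.name}: production requires a CPU limit`);
    }
  }
  return violations;
}

const sampleViolations = productionViolations([
  { name: "user-service", environment: "production", replicas: 3, cpuLimit: "500m" },
  { name: "order-service", environment: "production", replicas: 1 },
  { name: "payment-service", environment: "staging", replicas: 1 },
]);
console.log(sampleViolations);
```

Cases like these make a handy acceptance test: whatever policy code the assistant produces should reject order-service and accept the other two.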

Whichever you choose, pin the engine version against reality (Aurora Postgres 16.x is current and has the longest support window) and don’t ship the first synth—run the synthesized template through awslabs.aws-iac-mcp-server for a validation pass.

When This Breaks

Real IaC failures cluster into a few patterns. Recognize them fast:

  1. State lock contention. terraform apply hangs on “Acquiring state lock.” Usually a CI run and a human applied at the same time, or a previous run crashed mid-apply. Prompt: “A terraform apply is blocked on a state lock. Check whether the lock is stale (look at the lock holder and timestamp), tell me whether terraform force-unlock <ID> is safe, and recommend a CI mutex so this stops happening.” Never force-unlock blindly: you can corrupt state if an apply is genuinely still running.

  2. Drift after a manual console edit. Your plan shows changes you didn’t make. Someone fixed something by hand. Run the drift prompt above, then decide per-resource: import the change or revert it. Don’t blanket-revert—you may erase a real fix.

  3. Provider-pin breakage. terraform init fails or a plan explodes after the model wrote ~> 6.0 and your lockfile pins 5.x (or vice versa). Once the constraint is the one you actually want, re-resolve the lockfile with terraform init -upgrade (terraform providers lock can then record hashes for any additional platforms), and never let the AI bump a major provider version without reading that provider’s upgrade guide first.

  4. MCP server returns nothing useful. The AWS server can’t see your account, or the Terraform server returns stale data. Almost always auth or region: confirm AWS_PROFILE/AWS_REGION are set in the server’s env, that the profile has read permissions, and that you registered the right package (not a deprecated one). Re-run claude mcp list / codex mcp list to confirm the server is actually connected.

  5. Hallucinated resources or arguments. Without MCP grounding, models invent plausible-but-fake arguments. If terraform validate rejects a field, don’t argue with the model—reconnect the Terraform MCP server and ask it to confirm the argument against the live provider schema.
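For the state-lock case in item 1, the "is it stale?" question mostly comes down to the lock's age and holder, both of which Terraform prints in its lock error. A minimal sketch, assuming you have copied those fields into a small object; the `LockInfo` shape, the `assessLock` helper, the sample values, and the 60-minute threshold are all hypothetical.

```typescript
// Fields mirror what the lock error reports (ID, Who, Operation, Created).
interface LockInfo {
  id: string;
  who: string;
  operation: string;
  created: string; // ISO 8601 timestamp
}

// Age-based heuristic only: an old lock is a candidate for force-unlock, not proof.
function assessLock(lock: LockInfo, now: Date, maxAgeMinutes: number): { ageMinutes: number; likelyStale: boolean } {
  const createdMs = Date.parse(lock.created);
  if (Number.isNaN(createdMs)) {
    throw new Error(`unparseable lock timestamp: ${lock.created}`);
  }
  const ageMinutes = Math.floor((now.getTime() - createdMs) / 60_000);
  return { ageMinutes, likelyStale: ageMinutes > maxAgeMinutes };
}

// Sample values are invented for illustration.
const lockCheck = assessLock(
  { id: "example-lock-id", who: "ci@runner-12", operation: "OperationTypeApply", created: "2025-01-01T10:00:00Z" },
  new Date("2025-01-01T13:00:00Z"),
  60,
);
console.log(lockCheck); // { ageMinutes: 180, likelyStale: true }
```

Even when `likelyStale` is true, confirm that the holder's CI job or terminal session has actually ended before running terraform force-unlock.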