Docker and Kubernetes Containerization

Your image is 1.2GB, Trivy is flagging 40 CVEs in the base layer, and the pod just got OOMKilled with exit code 137 in staging. You could spend the afternoon reading Dockerfile best-practice blog posts and kubectl describe output, or you could put an AI agent in the loop with the actual files and let it generate, scan, and explain the fix while you review.

This article shows how to drive Docker and Kubernetes work with Cursor, Claude Code, and Codex: generating multi-stage Dockerfiles, hardening manifests, debugging crash loops, and connecting the real Docker and Kubernetes MCP servers so the agent can read live cluster state instead of guessing.

What You’ll Walk Away With

A repeatable prompt that turns a fat single-stage Dockerfile into a hardened, distroless multi-stage build under 80MB
A debugging workflow for exit 137 OOMKills that correlates limits, kubectl describe, and recent code changes
The correct, non-hallucinated setup for the Docker MCP gateway and the kubernetes-mcp-server (with the flags that actually exist)
A clear rule for when an Agent Skill beats a persistent MCP server for container work
Three copy-paste prompts you can run today against your own repo

The Workflow: Generating a Hardened Dockerfile

The highest-leverage move is to hand the agent your current Dockerfile plus your real constraints (runtime, port, build tool) and ask for a multi-stage rewrite. The setup differs per tool, but the prompt is nearly identical.

In Cursor, attach the file as context with @Dockerfile and run this in Agent mode (do not type an @agent prefix — selecting Agent mode is enough; @ is reserved for context references like @Dockerfile or @package.json):

@Dockerfile @package.json Rewrite this as a multi-stage build for our Node 20
service. Builder stage installs dev deps and runs the build; final stage is
gcr.io/distroless/nodejs20-debian12, runs as UID 65534, copies only dist/ and
production node_modules. Add a HEALTHCHECK hitting /health on 3000. Keep the
final image under 80MB and explain each layer you cut.

A persistent rule keeps every future Dockerfile consistent. Add .cursor/rules/containers.md:

---
description: Container build standards
globs: ["**/Dockerfile", "**/*.dockerfile"]
---
- Always use multi-stage builds; never ship build tooling in the final image.
- Final stage: distroless or alpine, non-root USER, no shell unless required.
- Pin base images by tag, never `latest`. Add a HEALTHCHECK.

From the repo root, Claude Code reads the file off disk — no attachment step:

claude "Rewrite ./Dockerfile as a multi-stage build for our Node 20 service.
Builder stage installs dev deps and runs the build; final stage is
gcr.io/distroless/nodejs20-debian12, runs as UID 65534, copies only dist/ and
production node_modules. Add a HEALTHCHECK hitting /health on port 3000, keep it under 80MB,
and tell me which layers you removed and why."

Codify the standard in CLAUDE.md at the repo root so it applies to every session:

## Containers
- Multi-stage builds only; distroless/alpine final stage, non-root USER.
- Pin base images by tag. Add HEALTHCHECK. Never COPY secrets into a layer.

Codex (running GPT-5.6 Sol) works the same from the terminal. Keep it in the workspace-write sandbox so it can edit the Dockerfile without write access to anything outside the repository, such as your registry credentials:

codex --sandbox workspace-write -c approval_policy=on-request \
  "Rewrite ./Dockerfile as a multi-stage build for our Node 20 service.
   Final stage gcr.io/distroless/nodejs20-debian12, UID 65534, dist/ +
   prod node_modules only, HEALTHCHECK hitting /health on port 3000, under 80MB.
   List the layers you cut and why."

The sandbox and approval policy are separate controls: workspace-write permits routine edits inside the repository, while on-request lets Codex ask before an action that needs to cross that boundary.

The result should look roughly like this. The point is not the Dockerfile itself; it is that every line is justified and you reviewed it:

FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build && npm prune --omit=dev

FROM gcr.io/distroless/nodejs20-debian12 AS production
WORKDIR /app
COPY --from=build /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
USER 65534:65534
EXPOSE 3000
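# distroless has no shell; the absolute node path avoids relying on PATH. dist/healthcheck.js must exist in your build output.
HEALTHCHECK --interval=30s --timeout=5s --retries=3 CMD ["/nodejs/bin/node", "dist/healthcheck.js"]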
CMD ["dist/server.js"]

Generation is only half the loop. What separates a reviewer from a copy-paster is the critique pass: make the agent attack its own output before you trust it.

Copy-paste prompt to critique an AI-generated Dockerfile before merging:

Review the Dockerfile you just produced as a hostile security reviewer.
For each finding give severity and a one-line fix:
- Any layer that COPYs secrets, .env, or .git into the image?
- Is the final stage actually non-root, and does the app still write only to
  volumes (not the root filesystem)?
- Does `npm ci` run before COPY . . so the dependency layer caches?
- Would `docker scout cves` or Trivy flag the base image? Suggest a smaller one.
Then output only the corrected Dockerfile.
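
A few of the checks in that critique prompt are mechanical enough to run yourself before (or after) the agent does. The following is a minimal sketch, not a replacement for Trivy or a real linter: `lint_dockerfile` and `SECRET_NAMES` are hypothetical names, the parser ignores line continuations, and it cannot see what a bare `COPY . .` pulls in without reading your .dockerignore.

```python
SECRET_NAMES = {".env", ".git", ".npmrc", "id_rsa"}  # illustrative, not exhaustive


def lint_dockerfile(text: str) -> list[str]:
    """Return findings for three checks from the critique prompt."""
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]

    # Group instructions into stages; each FROM starts a new stage.
    stages: list[list[list[str]]] = []
    for ln in lines:
        parts = ln.split()
        if parts[0].upper() == "FROM":
            stages.append([])
        if stages:
            stages[-1].append(parts)
    if not stages:
        return ["no FROM instruction found"]

    findings = []
    for stage in stages:
        # Check 1: secret-like paths named explicitly in COPY/ADD sources.
        for parts in stage:
            if parts[0].upper() in ("COPY", "ADD"):
                sources = [p for p in parts[1:-1] if not p.startswith("--")]
                for src in sources:
                    if src.rstrip("/").split("/")[-1] in SECRET_NAMES:
                        findings.append(f"secret-like path copied into a layer: {src}")

        # Check 2: COPY . . before npm ci defeats dependency-layer caching.
        copy_all = [i for i, p in enumerate(stage)
                    if p[0].upper() == "COPY" and p[1:] == [".", "."]]
        npm_ci = [i for i, p in enumerate(stage)
                  if p[0].upper() == "RUN" and "npm ci" in " ".join(p)]
        if copy_all and npm_ci and copy_all[0] < npm_ci[0]:
            findings.append("COPY . . runs before npm ci: dependency layer will not cache")

    # Check 3: the last USER in the final stage decides who the app runs as.
    users = [p[1] for p in stages[-1] if p[0].upper() == "USER" and len(p) > 1]
    if not users or users[-1].split(":")[0] in ("root", "0"):
        findings.append("final stage runs as root")
    return findings
```

Run it on the agent's output and on your old Dockerfile; an empty list means only that these three checks passed, so the hostile-reviewer prompt is still worth running.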

Hardening Kubernetes Manifests

The same generate-then-critique loop applies to manifests, but the failure mode is different: agents love to emit a 200-line Deployment with every field set, and you cannot review what you cannot read. Keep the prompt scoped to one resource and one concern at a time.

A focused security-context prompt that works across all three tools (the request is identical — only the invocation differs):

Copy-paste prompt to harden an existing Deployment’s pod security:

Here is our Deployment manifest. Add only the security hardening, nothing else:
- securityContext: runAsNonRoot, runAsUser 65534, readOnlyRootFilesystem true,
  drop ALL capabilities, allowPrivilegeEscalation false
- an emptyDir mount for /tmp since the root FS is now read-only
- resource requests AND limits (start CPU 250m/500m, memory 256Mi/512Mi)
- readiness and liveness probes on the existing /health endpoint
Return a unified diff against the manifest I gave you, not a full rewrite.

Asking for a diff rather than a full rewrite is the key trick: you see only the lines that changed instead of re-reviewing a wall of YAML the agent reproduced from memory (and may have subtly altered).
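
To confirm the diff actually covers every item in the prompt, you can check the resulting container spec mechanically. This is a rough sketch under stated assumptions: `missing_hardening` is a hypothetical helper, it takes one container entry already parsed into a dict (for example from `kubectl get deploy -o json`), and it looks only at container-level fields, so settings such as runAsUser that you place at the pod level are not counted.

```python
def missing_hardening(container: dict) -> list[str]:
    """List the hardening items from the prompt that a container spec lacks."""
    sc = container.get("securityContext") or {}
    res = container.get("resources") or {}
    checks = {
        "runAsNonRoot": sc.get("runAsNonRoot") is True,
        "readOnlyRootFilesystem": sc.get("readOnlyRootFilesystem") is True,
        "allowPrivilegeEscalation=false": sc.get("allowPrivilegeEscalation") is False,
        "capabilities.drop=ALL": "ALL" in (sc.get("capabilities") or {}).get("drop", []),
        "resources.requests": bool(res.get("requests")),
        "resources.limits": bool(res.get("limits")),
        "readinessProbe": "readinessProbe" in container,
        "livenessProbe": "livenessProbe" in container,
    }
    return [name for name, ok in checks.items() if not ok]


# Example container using the starting values from the prompt.
hardened = {
    "name": "api",
    "securityContext": {
        "runAsNonRoot": True,
        "runAsUser": 65534,
        "readOnlyRootFilesystem": True,
        "allowPrivilegeEscalation": False,
        "capabilities": {"drop": ["ALL"]},
    },
    "resources": {
        "requests": {"cpu": "250m", "memory": "256Mi"},
        "limits": {"cpu": "500m", "memory": "512Mi"},
    },
    "readinessProbe": {"httpGet": {"path": "/health", "port": 3000}},
    "livenessProbe": {"httpGet": {"path": "/health", "port": 3000}},
}
print(missing_hardening(hardened))
```

An empty list means every item in the prompt is present at the container level; it says nothing about whether the values suit your workload.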

Debugging an OOMKilled Pod

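This is where AI assistance pays for itself. Exit code 137 is 128 + 9: the container’s main process received SIGKILL, and when kubectl describe shows the reason OOMKilled, the sender was the kernel’s OOM killer. Working out why takes correlating the limit, actual usage, and what changed. Feed the agent the evidence rather than asking it to speculate.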
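
The arithmetic behind exit code 137 is worth seeing once. Container runtimes follow the shell convention of reporting 128 plus the signal number when a process dies from a signal. A small sketch, assuming a POSIX system (`describe_exit_code` is a hypothetical helper name):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode a container exit code using the 128 + signal-number convention."""
    if code > 128:
        try:
            sig = signal.Signals(code - 128)
        except ValueError:
            return f"exit status {code} (no matching signal)"
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited with status {code}"


print(describe_exit_code(137))  # killed by SIGKILL (signal 9)
```

SIGKILL alone does not prove an OOM kill, since a forced `kubectl delete` or a node shutdown can also send it. The OOMKilled reason in the pod's last state is what ties the signal to memory.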

Collect the evidence into one place.

kubectl describe pod -l app=api > /tmp/pod.txt
kubectl top pod -l app=api >> /tmp/pod.txt
git log --oneline -10 >> /tmp/pod.txt

Hand the bundle to the agent and ask for ranked hypotheses. In Claude Code: claude "Read /tmp/pod.txt ..."; in Cursor, copy the file into the workspace and attach it with @pod.txt; in Codex, pass the path in the prompt. The ask is identical across tools.
Apply the smallest fix and verify. The fix is usually a limit bump or a rollback of a memory leak introduced in the last deploy. Re-run kubectl top after the rollout and confirm the working set sits under the new limit with headroom.

Copy-paste prompt for an exit-137 OOMKill investigation:

Read /tmp/pod.txt (kubectl describe + kubectl top + recent git log). Our pods
are OOMKilled with exit 137. Give me a ranked list of root causes with the
specific evidence for each:
1. Is the memory LIMIT lower than the actual working set in `kubectl top`?
2. Does any recent commit add an in-memory cache, large buffer, or unbounded
   concurrency that would grow the heap?
3. Is the JVM/Node heap configured larger than the container limit (classic
   cgroup-unaware runtime)?
For the top hypothesis, give the exact manifest change or env var to set, and
how to confirm it worked after redeploy.
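
Hypotheses 1 and 3 in that prompt come down to comparing Kubernetes memory quantities, which you can sanity-check by hand. The sketch below is illustrative only: `to_bytes`, `headroom`, and `suggest_node_heap_mib` are hypothetical helpers, the parser covers only the common suffixes, and the 75% heap fraction is a rule-of-thumb assumption rather than anything the source prescribes.

```python
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "k": 10**3, "M": 10**6, "G": 10**9}


def to_bytes(quantity: str) -> int:
    """Convert a Kubernetes memory quantity such as '512Mi' to bytes."""
    # Try two-character suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(quantity)  # plain byte count


def headroom(limit: str, working_set: str) -> float:
    """Fraction of the limit still free; zero or negative means at or over it."""
    return 1 - to_bytes(working_set) / to_bytes(limit)


def suggest_node_heap_mib(limit: str, fraction: float = 0.75) -> int:
    """Heap size in MiB for --max-old-space-size (fraction is an assumption)."""
    return int(to_bytes(limit) * fraction / 2**20)


print(f"{headroom('512Mi', '480Mi'):.1%} free")
print(f"NODE_OPTIONS=--max-old-space-size={suggest_node_heap_mib('512Mi')}")
```

With a 512Mi limit and a 480Mi working set the pod has roughly 6% of headroom, which is the kind of evidence to put in front of the agent before it proposes a limit bump.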

Wiring Up the Docker and Kubernetes MCP Servers

Prompts that paste files are fine, but the real upgrade is letting the agent query live state through MCP. Two servers matter here, and both are commonly hallucinated in AI-generated guides — here is the setup that actually works.

Docker MCP (the gateway model)

There is no mcp/docker-toolkit image and no localhost:8080 HTTP endpoint to point a client at. The Docker MCP Toolkit is a Docker Desktop feature powered by the docker mcp CLI plugin (the MCP Gateway). You enable the MCP Toolkit in Docker Desktop, enable the servers you want from the catalog, then connect a client over stdio to the gateway process:

Add the gateway to .cursor/mcp.json:

{
  "mcpServers": {
    "docker": {
      "command": "docker",
      "args": ["mcp", "gateway", "run"]
    }
  }
}

In Claude Code, register the same gateway from the terminal:

claude mcp add docker -- docker mcp gateway run

Add it to ~/.codex/config.toml:

[mcp_servers.docker]
command = "docker"
args = ["mcp", "gateway", "run"]

Enable servers from the catalog first with docker mcp server enable <name>; the gateway exposes their tools to every connected client. See the Docker MCP Toolkit docs and docker/mcp-gateway.

Kubernetes MCP (real flags only)

The kubernetes-mcp-server package (from containers/kubernetes-mcp-server) is real, but its flags are frequently invented. There is no --audit-log, --rbac-mode, --namespace-filter, or --context flag. Flags that do exist include --kubeconfig, --read-only, --toolsets, --port, --disable-multi-cluster, and --config; run the server with --help to see the full list for your version. RBAC is enforced by the Kubernetes API server against whatever identity (user or ServiceAccount) the kubeconfig authenticates as, not by a CLI switch.

Start read-only, which is the only sane default for a tool an LLM drives:

.cursor/mcp.json:

{
  "mcpServers": {
    "kubernetes": {
      "command": "npx",
      "args": ["-y", "kubernetes-mcp-server@latest", "--read-only"],
      "env": { "KUBECONFIG": "/path/to/restricted-kubeconfig" }
    }
  }
}

Claude Code, from the terminal:

# Safe default: read-only, scoped to a restricted kubeconfig
claude mcp add kubernetes \
  --env KUBECONFIG=/path/to/restricted-kubeconfig \
  -- npx -y kubernetes-mcp-server@latest --read-only

# Scope which tools are exposed instead of all of them
claude mcp add kubernetes -- npx -y kubernetes-mcp-server@latest \
  --read-only --toolsets core,helm

~/.codex/config.toml:

[mcp_servers.kubernetes]
command = "npx"
args = ["-y", "kubernetes-mcp-server@latest", "--read-only"]
env = { KUBECONFIG = "/path/to/restricted-kubeconfig" }
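
Because invented flags are the usual failure here, it can help to check a generated config before a client loads it. This is a minimal sketch for the JSON form used by .cursor/mcp.json: `check_k8s_mcp_entry` is a hypothetical helper, and `KNOWN_FLAGS` contains only the flags named in this article, so a newer server release may accept flags the check reports as unknown.

```python
import json

# Flags this article names as real; treat the set as an assumption to update.
KNOWN_FLAGS = {
    "--kubeconfig", "--read-only", "--toolsets",
    "--port", "--disable-multi-cluster", "--config",
}


def check_k8s_mcp_entry(config_text: str, name: str = "kubernetes") -> list[str]:
    """Flag unknown options and unsafe defaults in one mcpServers entry."""
    server = json.loads(config_text)["mcpServers"][name]
    # Keep only long options; '-y' and the package name belong to npx.
    flags = [a.split("=")[0] for a in server.get("args", []) if a.startswith("--")]

    problems = [f"unknown flag: {f}" for f in flags if f not in KNOWN_FLAGS]
    if "--read-only" not in flags:
        problems.append("--read-only is missing")
    if "KUBECONFIG" not in server.get("env", {}) and "--kubeconfig" not in flags:
        problems.append("no explicit kubeconfig: the default context would be used")
    return problems


example = """
{
  "mcpServers": {
    "kubernetes": {
      "command": "npx",
      "args": ["-y", "kubernetes-mcp-server@latest", "--read-only"],
      "env": { "KUBECONFIG": "/path/to/restricted-kubeconfig" }
    }
  }
}
"""
print(check_k8s_mcp_entry(example))
```

The example config above produces an empty list. An agent-written entry that carries something like --rbac-mode, or omits --read-only, would be reported before it ever reaches the cluster.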

With the server connected, cluster questions become conversational — and grounded in real state:

Audit the production cluster: list pods that are not Running, any with no
resource limits set, and any running as root. Group by namespace and flag the
three highest-risk findings.
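
To sanity-check the agent's answer, the same three questions can be approximated offline from `kubectl get pods -A -o json`. A sketch under assumptions: `audit_pods` is a hypothetical helper, it treats any phase other than Running as a finding (so completed Jobs show up too), and it infers "may run as root" only from the securityContext fields, not from the image's own USER.

```python
def audit_pods(pod_list: dict) -> dict[str, list[str]]:
    """Group pods by the three risks named in the audit prompt."""
    findings: dict[str, list[str]] = {
        "not_running": [], "no_limits": [], "may_run_as_root": [],
    }
    for pod in pod_list.get("items", []):
        meta, spec = pod["metadata"], pod["spec"]
        pod_name = f"{meta.get('namespace', 'default')}/{meta['name']}"

        if pod.get("status", {}).get("phase") != "Running":
            findings["not_running"].append(pod_name)

        pod_sc = spec.get("securityContext") or {}
        for c in spec.get("containers", []):
            target = f"{pod_name}:{c['name']}"
            if not (c.get("resources") or {}).get("limits"):
                findings["no_limits"].append(target)
            # Container-level settings override pod-level ones.
            sc = {**pod_sc, **(c.get("securityContext") or {})}
            if sc.get("runAsNonRoot") is not True and sc.get("runAsUser", 0) == 0:
                findings["may_run_as_root"].append(target)
    return findings
```

Load the kubectl output with `json.load` and pass it in. If the agent's grouped findings disagree with this list, that is a prompt to ask which tool calls it actually made.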

MCP server or Agent Skill?

Not every task needs a persistent connection. Agent Skills — installed with a universal CLI, npx skills add <owner/repo> (from vercel-labs/skills) and working across Claude Code, Cursor, and Codex — are the lighter-weight option for single-purpose, stateless augmentation: a Dockerfile-linting skill, a Helm-values generator, a deployment-checklist skill. Reach for a skill when you want repeatable knowledge or a one-shot transform; reach for an MCP server when the agent needs to read or act on live state (your running cluster, the Docker daemon). A Dockerfile-hardening skill plus the kubernetes-mcp-server for live reads is a common, complementary pairing.

When This Breaks

The agent emits a removed API. PodSecurityPolicy, extensions/v1beta1, and autoscaling/v2beta2 still show up from stale training data. Pin it: tell the agent your cluster version and have it verify each apiVersion against kubectl api-resources before it writes a manifest.
docker mcp gateway run exits immediately. The MCP Toolkit feature has to be enabled in Docker Desktop first, and you must run docker mcp server enable <name> for at least one catalog server. A gateway with nothing enabled has no tools to serve.
The Kubernetes MCP server can see the cluster but every write fails. That is --read-only and a scoped ServiceAccount doing their job. If you genuinely need a mutation, drop --read-only deliberately for that session — do not hand it admin.
CI scan step uses a retired action. The old github/codeql-action/upload-sarif@v2 was retired in 2025; use @v3 (or @v4). Have the agent grep your workflows for pinned action versions and bump only the SARIF upload step:
```
- name: Upload Trivy results
  uses: github/codeql-action/upload-sarif@v3
  if: always()
  with:
    sarif_file: 'trivy-results.sarif'
```
A “secure” devcontainer silently disables permission prompts. If you template a Claude Code devcontainer, the VS Code setting is claudeCode.allowDangerouslySkipPermissions (with claudeCode.initialPermissionMode for the default mode) — not a claude-code.dangerouslySkipPermissions key, which does nothing. And think twice before enabling bypass-permissions in a container you called “secure”: it contradicts the framing. Prefer the default prompting mode and an isolated, network-restricted devcontainer.

What’s Next

CI/CD Pipelines — wire these images into a build-scan-sign-deploy pipeline
Infrastructure as Code — generate and review the Terraform/Helm around these workloads
Monitoring and Observability — close the loop with metrics that catch the next OOMKill before staging does
Incident Response — the live-debugging playbook when one of these pods pages you at 3am