Kubernetes Orchestration Patterns

Your rollout just took the API down. The readiness probe went green before the database connection pool finished warming up, so Kubernetes shifted live traffic onto pods that immediately returned 500s. The dashboard is red, your PM is asking for an ETA, and the manifest that caused it was generated by an AI tool that didn’t know your pool takes eight seconds to fill.

That is the real risk with AI-generated Kubernetes config: the YAML applies cleanly and still ships an outage. The fix is not to stop using AI — it is to drive it with prompts that encode your actual constraints (probe timing, resource headroom, disruption budgets) and to review the output against the failure modes that bite in production.

What You’ll Walk Away With

A copy-paste prompt that generates a zero-downtime Deployment with probes tuned to a real warm-up window — not the AI’s default initialDelaySeconds: 0.
A debugging prompt that turns kubectl describe + kubectl logs output into a ranked root-cause list for CrashLoopBackOff, OOMKilled, and ImagePullBackOff.
A prompt that converts loose manifests into a parameterized Helm chart with values validation — using maintained chart dependencies, not the deprecated Bitnami catalog.
A “When This Breaks” playbook for the five failures AI-generated manifests cause most: probe races, OOMKills, PDBs that block node drains, image pull failures, and stuck ArgoCD syncs.

The Workflow

The pattern is the same across all three tools (Cursor, Claude Code, and Codex): describe the workload and its operational constraints, let the tool generate, then diff the result against what would actually fail. What differs is how you invoke each tool, and that difference matters for Kubernetes work, where you often iterate on a directory of manifests.

In Cursor, open the agent chat (Cmd/Ctrl+I), reference the directory with @k8s/, and let the agent write files directly. Cursor’s checkpoints let you accept the Deployment, then roll back just the HPA if it generated a bad metric target. Keep @k8s/values.yaml open so inline edits stay scoped to one file.

The strength here is visual review: Cursor shows a diff for every manifest it touches, so before you apply you can catch a maxUnavailable: 0 paired with maxSurge: 0, a combination the API server rejects because the rollout could never replace a pod.

In Claude Code, run claude in the repo root to open the REPL, then paste the prompt. Because Claude Code reads the working tree, it can see your existing kustomization.yaml and overlays and keep new manifests consistent with them.

For scripted, non-interactive generation (e.g. regenerating manifests in CI), use print mode: claude -p "Regenerate the staging HPA from k8s/base/hpa.yaml with maxReplicas 30" --output-format text. Don’t treat the bare claude "query" form as a batch command — it opens an interactive REPL seeded with that prompt.

Codex is available in the ChatGPT desktop app, CLI, IDE, and Cloud. In the terminal, run codex to start the TUI, or pass a prompt positionally. For multi-file refactors that you don’t want touching your working tree, run it in a git worktree so the changes land on an isolated branch.

For reviewable local work, run codex --sandbox workspace-write -c approval_policy=on-request "Generate a zero-downtime Deployment for the orders service". The sandbox confines writes to the workspace, while Codex can request approval for broader access. For trusted unattended CI, use codex exec --sandbox workspace-write -c approval_policy=never "<prompt>" only in an isolated checkout with a timeout and mandatory diff/validation review. Codex Cloud can open the resulting manifest changes as a GitHub PR for review.

All three tools speak the Model Context Protocol, so the same server works in each; only the place you register it differs. A Kubernetes MCP server lets the tool query live cluster state (what’s actually running, recent events) instead of guessing:

# Claude Code
claude mcp add k8s -- npx -y kubernetes-mcp-server

# Cursor: add the server to .cursor/mcp.json using the JSON below
# Codex: register the same command and args in ~/.codex/config.toml under an [mcp_servers.k8s] table

{
  "mcpServers": {
    "k8s": {
      "command": "npx",
      "args": ["-y", "kubernetes-mcp-server"]
    }
  }
}

With the cluster connected, “why is the orders pod restarting?” gets answered against real kubectl get events output instead of a generic checklist.

Step 1 — Generate a Deployment that won’t race its own probes

The single most common AI manifest defect is a readiness probe that passes before the app is actually ready. Encode the warm-up window in the prompt and the generated startupProbe will gate the others correctly.

Here is the shape you should expect back — note how startupProbe protects the slow boot, and maxUnavailable: 0 guarantees no capacity dips during the roll:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
  namespace: production
  labels:
    app: orders-api
    tier: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: orders-api
              topologyKey: kubernetes.io/hostname
      containers:
      - name: orders-api
        image: myregistry/orders-api:v1.4.0
        ports:
        - name: http
          containerPort: 8080
        - name: metrics
          containerPort: 9090
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        startupProbe:        # gates liveness/readiness until the pool warms
          httpGet:
            path: /health/startup
            port: http
          periodSeconds: 2
          failureThreshold: 30   # 30 x 2s = up to 60s to start before the container is restarted
        readinessProbe:
          httpGet:
            path: /health/ready
            port: http
          periodSeconds: 5
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health/live
            port: http
          periodSeconds: 10
          failureThreshold: 3
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop: ["ALL"]

What to check before applying: if the AI emits initialDelaySeconds on the readiness probe instead of a startupProbe, reject it — initialDelaySeconds is a fixed guess, while startupProbe adapts to a slow boot and stops liveness from killing the pod mid-warm-up.

Step 2 — Stand up the surrounding networking

A Deployment alone isn’t reachable. Generate the Service, Ingress, and a default-deny NetworkPolicy together so the workload is exposed and locked down. The manifests below are production-shaped — TLS via cert-manager, rate limiting on the Ingress, and egress restricted to Postgres, Redis, and DNS:

apiVersion: v1
kind: Service
metadata:
  name: orders-api
  namespace: production
spec:
  type: ClusterIP
  ports:
  - name: http
    port: 80
    targetPort: http
  selector:
    app: orders-api
---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: orders-api
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "100"   # requests per second per client IP
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
  - hosts: [api.example.com]
    secretName: api-tls-secret
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: orders-api
            port:
              number: 80
---
# networkpolicy.yaml — default-deny ingress, scoped egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: orders-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: orders-api
  policyTypes: [Ingress, Egress]
  ingress:
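  - from:
    - namespaceSelector:         # ingress-nginx normally runs in its own namespace (assumed name below)
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
      podSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx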
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: production   # set automatically on every namespace
    ports:
    - protocol: TCP
      port: 5432   # PostgreSQL
    - protocol: TCP
      port: 6379   # Redis
  - to:                        # DNS must be allowed explicitly
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP            # DNS retries over TCP for large responses
      port: 53
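
The policy above isolates only the pods labeled app: orders-api; every other pod in the namespace stays unrestricted. If the goal is a namespace-wide default-deny, that is normally a separate policy with an empty podSelector. A minimal sketch, assuming the production namespace used above (the policy name default-deny-all is illustrative):

```yaml
# default-deny.yaml: selects every pod in the namespace and allows nothing.
# Per-workload policies such as the orders-api one then add allow rules on top.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all        # hypothetical name
  namespace: production
spec:
  podSelector: {}               # empty selector = all pods in this namespace
  policyTypes: [Ingress, Egress]
```

NetworkPolicies are additive, so this does not override the orders-api rules. Every other workload in the namespace will need its own allow rules, including DNS egress, once this is applied.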

Step 3 — Parameterize into a Helm chart

When the same manifests need to ship to dev, staging, and prod, ask the tool to extract them into a chart with values validation. The important correction in 2026: do not let the AI default to Bitnami chart dependencies. As of August 28, 2025 the public Bitnami catalog was cut down to a “latest”-only community subset and older image tags moved to bitnamilegacy, so the postgresql/redis chart pins that AI tools learned from training data now resolve to removed or broken images. Pin maintained alternatives instead.

The Chart.yaml and validation helper should come back looking like this — note the dependency repos are CloudNativePG and the prometheus-community charts, both actively maintained:

apiVersion: v2
name: orders-api
description: Helm chart for the orders-api service
type: application
version: 1.0.0
appVersion: "1.4.0"
dependencies:
  - name: cloudnative-pg
    version: "0.x.x"               # CloudNativePG operator — maintained
    repository: https://cloudnative-pg.github.io/charts
    condition: postgresql.enabled
  - name: kube-prometheus-stack
    version: "70.x.x"
    repository: https://prometheus-community.github.io/helm-charts
    condition: monitoring.enabled

# templates/_helpers.tpl (validation excerpt)
{{- define "orders-api.validateValues" -}}
{{- if not .Values.image.repository }}
  {{- fail "image.repository is required" }}
{{- end }}
{{- if and .Values.ingress.enabled (not .Values.ingress.hosts) }}
  {{- fail "ingress.hosts must be set when ingress is enabled" }}
{{- end }}
{{- end }}
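
A values file that satisfies the checks above might look like the following sketch. The keys image.repository, ingress.enabled, ingress.hosts, postgresql.enabled, and monitoring.enabled come from the chart excerpt; the file name, the registry path, and image.tag are assumptions for illustration.

```yaml
# values-production.yaml (illustrative file name)
image:
  repository: ghcr.io/myorg/orders-api   # required by validateValues; registry path is assumed
  tag: "1.4.0"                           # assumed key, not shown in the chart excerpt
ingress:
  enabled: true
  hosts:                                 # must be non-empty whenever ingress.enabled is true
    - api.example.com
postgresql:
  enabled: true                          # switches the cloudnative-pg dependency on
monitoring:
  enabled: false                         # switches kube-prometheus-stack off
```

A define block does nothing until a template calls it, so check that the generated chart includes a line such as {{ include "orders-api.validateValues" . }} in one of its templates. Rendering with helm template orders-api . -f values-production.yaml should then fail fast if a required value is missing.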

Step 4 — Wire up GitOps, Istio, and autoscaling

For continuous delivery, generate an ArgoCD Application that points at the chart’s overlay and self-heals. For traffic shaping, generate Istio config — and make sure the apiVersion is networking.istio.io/v1 (promoted to stable in Istio 1.22 and recommended for new config), not the older v1beta1 that AI tools still emit by default.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-api-production
  namespace: argocd
  finalizers: [resources-finalizer.argocd.argoproj.io]
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/k8s-configs
    targetRevision: main
    path: applications/orders-api/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
---
# virtualservice.yaml — note apiVersion v1 (Istio 1.22+), not v1beta1
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: orders-api
  namespace: production
spec:
  hosts: [orders-api]
  http:
  - route:                    # 90/10 canary split
    - destination:
        host: orders-api
        subset: v1
      weight: 90
    - destination:
        host: orders-api
        subset: v2
      weight: 10
    retries:
      attempts: 3
      perTryTimeout: 10s
      retryOn: 5xx
    timeout: 30s
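
The v1 and v2 subsets in the VirtualService only resolve if a DestinationRule defines them, and AI tools often omit it. A minimal sketch, assuming the pods carry a version label (the Deployment in Step 1 does not set one, so that label would have to be added):

```yaml
# destinationrule.yaml: defines the subsets the VirtualService routes to
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: orders-api
  namespace: production
spec:
  host: orders-api
  subsets:
  - name: v1
    labels:
      version: v1    # assumed pod label
  - name: v2
    labels:
      version: v2    # assumed pod label
```

Without a matching subset, Istio has no endpoints for that route and requests to it typically fail with 503s.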

For autoscaling, generate an HPA and — critically — a PodDisruptionBudget. The PDB has a sharp edge AI tools get wrong: you may set minAvailable OR maxUnavailable, never both. A manifest with both is rejected by kubectl apply. Pick one (maxUnavailable is usually better because it tracks replica count):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # avoid flapping
---
# pdb.yaml — exactly one of minAvailable / maxUnavailable
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-api
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: orders-api

Step 5 — CI/CD with current action versions

Generate a GitHub Actions workflow to build, push, and bump the image tag. AI tools tend to emit stale action majors from their training data — pin the current ones (actions/checkout@v6, docker/build-push-action@v7, peter-evans/create-pull-request@v8):

# .github/workflows/deploy.yml (excerpt)
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: docker/setup-buildx-action@v4
      - uses: docker/login-action@v4
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/metadata-action@v5
        id: meta
        with:
          images: ghcr.io/${{ github.repository }}
          tags: type=raw,value=${{ github.sha }}   # push the exact SHA tag the bump job pins
      - uses: docker/build-push-action@v7
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
  bump-prod:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
        with:
          repository: myorg/k8s-configs
          token: ${{ secrets.CONFIG_REPO_TOKEN }}
      # The name left of "=" must match the image in the manifest (myregistry/orders-api)
      - run: kustomize edit set image myregistry/orders-api=ghcr.io/${{ github.repository }}:${{ github.sha }}
        working-directory: applications/orders-api/overlays/production
      - uses: peter-evans/create-pull-request@v8
        with:
          token: ${{ secrets.CONFIG_REPO_TOKEN }}
          commit-message: "Deploy ${{ github.sha }} to production"
          title: "Deploy ${{ github.sha }} to production"
          branch: deploy-${{ github.sha }}

Copy-Paste Prompts

Diagnose a CrashLoopBackOff from real output:

Here is the output of kubectl describe pod orders-api-7d8f -n production and kubectl logs orders-api-7d8f -n production --previous:
<paste describe + logs here>
Act as an SRE. Rank the three most likely root causes from most to least probable, citing the exact line of evidence for each. For the top cause, give me the precise kubectl command to confirm it and the one-line manifest change to fix it. Common patterns to weigh: liveness probe failing during slow boot, OOMKill (check Last State: Terminated, Reason: OOMKilled), missing ConfigMap/Secret, and image entrypoint crash.

Tighten resource limits from observed usage:

The orders-api Deployment requests 250m CPU / 256Mi memory with limits 500m / 512Mi. Here is 7 days of kubectl top pod data and the P95/P99 from Prometheus:
<paste usage data>
Recommend new requests and limits. Keep memory limit at least 25% above observed P99 (memory is non-compressible — an OOMKill is worse than CPU throttling). Set the CPU request near the P50 and the limit near P95. Explain the headroom choice in one sentence per resource. Output the patched resources: block only.

Audit a manifest for production readiness:

Review this Kubernetes manifest against production best practices and list only the problems, ranked by severity. Check specifically for: missing resource requests/limits, missing or misconfigured probes (readiness passing before the app is ready), runAsNonRoot not set, a PDB that sets both minAvailable and maxUnavailable, a default-deny NetworkPolicy with no DNS egress, and latest image tags. For each issue give the exact fix. Manifest:
<paste manifest>

When This Breaks

CrashLoopBackOff right after a rollout. Usually the liveness probe is killing the pod before it finishes booting. Check kubectl logs <pod> --previous for the last words before the kill. Fix: move the boot guard into a startupProbe (see Step 1) so liveness checks only begin once the startup probe has succeeded.
OOMKilled under load. kubectl describe pod shows Last State: Terminated, Reason: OOMKilled. The memory limit is too tight for peak usage. Memory is non-compressible — the kernel kills the container rather than throttling it. Fix: raise the memory limit above observed P99 (use the “tighten resource limits” prompt with real kubectl top data), not by guessing.
A node drain hangs forever. kubectl drain stalls because the PodDisruptionBudget won’t allow another eviction. Common cause: minAvailable equals the replica count (e.g. minAvailable: 3 with 3 replicas leaves zero disruption budget), or the manifest illegally set both minAvailable and maxUnavailable. Fix: set exactly one field, and leave real headroom (maxUnavailable: 1).
ImagePullBackOff. kubectl describe pod shows the pull error. Either the tag doesn’t exist (the CI bump pushed a different SHA than the manifest references), the registry needs imagePullSecrets, or — increasingly common in 2026 — the chart points at a removed Bitnami image. Fix: confirm the tag with crane ls / your registry UI, attach the pull secret, and replace Bitnami deps with maintained charts.
ArgoCD stuck OutOfSync or Progressing forever. OutOfSync is often a field the cluster mutates (like replicas, managed by the HPA) fighting Git. Add an ignoreDifferences entry for /spec/replicas so ArgoCD stops trying to revert the HPA. If it’s Progressing, the underlying Deployment never goes Ready, so go back to the CrashLoopBackOff and OOMKilled entries above.
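
For the ArgoCD case, the ignoreDifferences entry can be sketched as a fragment to merge into the Application from Step 4. This assumes the Deployment is named orders-api as in the earlier manifests; adjust group, kind, and name for other resources.

```yaml
# Fragment of the ArgoCD Application spec (merge into the existing manifest)
spec:
  ignoreDifferences:
  - group: apps
    kind: Deployment
    name: orders-api
    jsonPointers:
    - /spec/replicas                  # let the HPA own the replica count
  syncPolicy:
    syncOptions:
    - CreateNamespace=true
    - RespectIgnoreDifferences=true   # keep syncs from resetting the ignored field
```

An alternative is to drop replicas from the Deployment in Git entirely, so there is nothing for ArgoCD to reconcile against the HPA.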

What’s Next

Docker Containerization: Build the lean, non-root images these manifests deploy, before they ever reach the cluster.

CI/CD Pipelines: Go deeper on the build-test-deploy automation that feeds your GitOps repo.

MCP Server Ecosystem: Connect a Kubernetes MCP server so your AI tool reasons over live cluster state, not guesses.

Debugging Patterns: Apply the same describe-plus-logs prompting pattern to incidents beyond Kubernetes.