Kubernetes Orchestration Patterns
Your rollout just took the API down. The readiness probe went green before the database connection pool finished warming up, so Kubernetes shifted live traffic onto pods that immediately returned 500s. The dashboard is red, your PM is asking for an ETA, and the manifest that caused it was generated by an AI tool that didn’t know your pool takes eight seconds to fill.
That is the real risk with AI-generated Kubernetes config: the YAML applies cleanly and still ships an outage. The fix is not to stop using AI — it is to drive it with prompts that encode your actual constraints (probe timing, resource headroom, disruption budgets) and to review the output against the failure modes that bite in production.
What You’ll Walk Away With
- A copy-paste prompt that generates a zero-downtime Deployment with probes tuned to a real warm-up window, not the AI’s default initialDelaySeconds: 0.
- A debugging prompt that turns kubectl describe + kubectl logs output into a ranked root-cause list for CrashLoopBackOff, OOMKilled, and ImagePullBackOff.
- A prompt that converts loose manifests into a parameterized Helm chart with values validation, using maintained chart dependencies, not the deprecated Bitnami catalog.
- A “When This Breaks” playbook for the five failures AI-generated manifests cause most: probe races, OOMKills, PDBs that block node drains, image pull failures, and stuck ArgoCD syncs.
The Workflow
The pattern is the same across all three tools: describe the workload and its operational constraints, let the tool generate, then diff the result against what would actually fail. What differs is how you invoke each tool, and that difference matters for Kubernetes work, where you often iterate on a directory of manifests.
In Cursor, open the agent chat (Cmd/Ctrl+I), reference the directory with @k8s/, and let the agent write files directly. Cursor’s checkpoints let you accept the Deployment, then roll back just the HPA if it generated a bad metric target. Keep @k8s/values.yaml open so inline edits stay scoped to one file.
The strength here is visual review: Cursor shows a diff for every manifest it touches, so before you apply you can catch a maxUnavailable: 0 that would stall a single-replica Deployment's rollout whenever the cluster has no room to schedule the surge pod.
In Claude Code, run claude in the repo root to open the REPL, then paste the prompt. Because Claude Code reads the working tree, it can see your existing kustomization.yaml and overlays and keep new manifests consistent with them.
For scripted, non-interactive generation (e.g. regenerating manifests in CI), use print mode: claude -p "Regenerate the staging HPA from k8s/base/hpa.yaml with maxReplicas 30" --output-format text. Don’t treat the bare claude "query" form as a batch command — it opens an interactive REPL seeded with that prompt.
Codex spans App, CLI, IDE, and Cloud. In the terminal, run codex to start the TUI, or pass a prompt positionally: codex "Generate a zero-downtime Deployment for the orders service". For multi-file refactors that you don’t want touching your working tree, run it in a git worktree so the changes land on an isolated branch.
Set the approval gate explicitly with --ask-for-approval on-request (values: untrusted, on-failure, on-request, never) and the sandbox with --sandbox workspace-write. For a hands-off run that still can’t touch the network, combine --full-auto with --sandbox workspace-write. Codex Cloud can open the resulting manifest changes as a GitHub PR for review.
The MCP setup is identical across all three tools — they all speak the Model Context Protocol. A Kubernetes MCP server lets the tool query live cluster state (what’s actually running, recent events) instead of guessing:
# Claude Code
claude mcp add k8s -- npx -y kubernetes-mcp-server

# Cursor / Codex: add the same server to .mcp.json
# (Cursor reads .cursor/mcp.json; Codex reads ~/.codex/config.toml or .mcp.json)
{
  "mcpServers": {
    "k8s": {
      "command": "npx",
      "args": ["-y", "kubernetes-mcp-server"]
    }
  }
}

With the cluster connected, “why is the orders pod restarting?” gets answered against real kubectl get events output instead of a generic checklist.
Step 1 — Generate a Deployment that won’t race its own probes
The single most common AI manifest defect is a readiness probe that passes before the app is actually ready. Encode the warm-up window in the prompt and the generated startupProbe will gate the others correctly.
Here is the shape you should expect back — note how startupProbe protects the slow boot, and maxUnavailable: 0 guarantees no capacity dips during the roll:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
  namespace: production
  labels:
    app: orders-api
    tier: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: orders-api
                topologyKey: kubernetes.io/hostname
      containers:
        - name: orders-api
          image: myregistry/orders-api:v1.4.0
          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9090
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          startupProbe:  # gates liveness/readiness until the pool warms
            httpGet:
              path: /health/startup
              port: http
            periodSeconds: 2
            failureThreshold: 30  # 30 x 2s = up to 60s to start before the container is restarted
          readinessProbe:
            httpGet:
              path: /health/ready
              port: http
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: http
            periodSeconds: 10
            failureThreshold: 3
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]

What to check before applying: if the AI emits initialDelaySeconds on the readiness probe instead of a startupProbe, reject it. initialDelaySeconds is a fixed guess, while startupProbe adapts to a slow boot and stops liveness from killing the pod mid-warm-up.
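For contrast, the pattern to reject tends to look like the sketch below. It is an illustrative fragment, not output captured from any tool: the 10-second delays are assumed values, and the container name, image, and probe paths are reused from the manifest above.

```yaml
# Anti-pattern sketch (pod template fragment): fixed delays instead of a startupProbe.
# The 10-second values are illustrative assumptions.
containers:
  - name: orders-api
    image: myregistry/orders-api:v1.4.0
    ports:
      - name: http
        containerPort: 8080
    readinessProbe:
      httpGet:
        path: /health/ready
        port: http
      initialDelaySeconds: 10  # fixed guess: too short on a slow boot, wasted time on a fast one
      periodSeconds: 5
    livenessProbe:
      httpGet:
        path: /health/live
        port: http
      initialDelaySeconds: 10  # liveness starts counting failures even if warm-up is still running
      periodSeconds: 10
    # No startupProbe: nothing holds liveness and readiness back during warm-up.
```

If a generated manifest has this shape, asking the tool to replace both delays with a single startupProbe sized to the measured warm-up is usually a smaller change than tuning the delays by hand.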
Step 2 — Stand up the surrounding networking
A Deployment alone isn’t reachable. Generate the Service, Ingress, and a default-deny NetworkPolicy together so the workload is exposed and locked down. The manifests below are production-shaped, with TLS via cert-manager, rate limiting on the Ingress, and egress restricted to Postgres, Redis, and DNS:
apiVersion: v1
kind: Service
metadata:
  name: orders-api
  namespace: production
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 80
      targetPort: http
  selector:
    app: orders-api
---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: orders-api
  namespace: production
  annotations:
    # ingress-nginx rate limiting uses limit-rps (requests per second per client IP)
    nginx.ingress.kubernetes.io/limit-rps: "100"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts: [api.example.com]
      secretName: api-tls-secret
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: orders-api
                port:
                  number: 80
---
# networkpolicy.yaml: default-deny ingress, scoped egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: orders-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: orders-api
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: nginx-ingress
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: production
      ports:
        - protocol: TCP
          port: 5432  # PostgreSQL
        - protocol: TCP
          port: 6379  # Redis
    - to:  # DNS must be allowed explicitly
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53  # DNS falls back to TCP for large responses

Step 3 — Parameterize into a Helm chart
When the same manifests need to ship to dev, staging, and prod, ask the tool to extract them into a chart with values validation. The important correction in 2026: do not let the AI default to Bitnami chart dependencies. As of August 28, 2025 the public Bitnami catalog was cut down to a “latest”-only community subset and older image tags moved to bitnamilegacy, so the postgresql/redis chart pins that AI tools learned from training data now resolve to removed or broken images. Pin maintained alternatives instead.
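A per-environment values file for a chart like this might look like the sketch below. The keys image.repository, ingress.enabled, ingress.hosts, postgresql.enabled, and monitoring.enabled are the ones this guide's chart conditions and validation helper reference; the file name, the replicaCount and image.tag keys, and all concrete values are assumptions for illustration.

```yaml
# values-production.yaml: hypothetical per-environment overrides
replicaCount: 3                      # assumed key
image:
  repository: myregistry/orders-api  # checked by the validation helper
  tag: "v1.4.0"                      # assumed key
ingress:
  enabled: true
  hosts:
    - api.example.com                # must be non-empty when ingress.enabled is true
postgresql:
  enabled: true                      # toggles the cloudnative-pg dependency
monitoring:
  enabled: false                     # toggles the kube-prometheus-stack dependency
```

Keeping one such file per environment lets the templates stay identical across dev, staging, and prod while only the values differ.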
The Chart.yaml and validation helper should come back looking like this — note the dependency repos are CloudNativePG and the prometheus-community charts, both actively maintained:
apiVersion: v2
name: orders-api
description: Helm chart for the orders-api service
type: application
version: 1.0.0
appVersion: "1.4.0"
dependencies:
  - name: cloudnative-pg
    version: "0.x.x"  # CloudNativePG operator (maintained)
    repository: https://cloudnative-pg.github.io/charts
    condition: postgresql.enabled
  - name: kube-prometheus-stack
    version: "70.x.x"
    repository: https://prometheus-community.github.io/helm-charts
    condition: monitoring.enabled

# templates/_helpers.tpl (validation excerpt)
{{- define "orders-api.validateValues" -}}
{{- if not .Values.image.repository }}
  {{- fail "image.repository is required" }}
{{- end }}
{{- if and .Values.ingress.enabled (not .Values.ingress.hosts) }}
  {{- fail "ingress.hosts must be set when ingress is enabled" }}
{{- end }}
{{- end }}

Step 4 — Wire up GitOps, Istio, and autoscaling
For continuous delivery, generate an ArgoCD Application that points at the chart’s overlay and self-heals. For traffic shaping, generate Istio config, and make sure the apiVersion is networking.istio.io/v1 (promoted to stable in Istio 1.22 and recommended for new config), not the older v1beta1 that AI tools still emit by default.
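Subset-based canary routing in Istio needs a DestinationRule alongside the VirtualService; without one, the subsets named in a route have nothing to resolve to. A minimal sketch follows, assuming pods carry a version label (an assumption: the Deployment in Step 1 does not set one) and reusing the orders-api host and production namespace from this guide.

```yaml
# destinationrule.yaml: hypothetical companion to a subset-routing VirtualService
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: orders-api
  namespace: production
spec:
  host: orders-api
  subsets:
    - name: v1
      labels:
        version: v1  # assumed pod label on the stable release
    - name: v2
      labels:
        version: v2  # assumed pod label on the canary release
```

When reviewing generated Istio config, it is worth checking that every subset referenced by a route has a matching entry here and that the labels exist on real pods.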
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-api-production
  namespace: argocd
  finalizers: [resources-finalizer.argocd.argoproj.io]
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/k8s-configs
    targetRevision: main
    path: applications/orders-api/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
---
# virtualservice.yaml: note apiVersion v1 (Istio 1.22+), not v1beta1
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: orders-api
  namespace: production
spec:
  hosts: [orders-api]
  http:
    - route:  # 90/10 canary split
        - destination:
            host: orders-api
            subset: v1
          weight: 90
        - destination:
            host: orders-api
            subset: v2
          weight: 10
      retries:
        attempts: 3
        perTryTimeout: 10s
        retryOn: 5xx
      timeout: 30s

For autoscaling, generate an HPA and, critically, a PodDisruptionBudget. The PDB has a sharp edge AI tools get wrong: you may set minAvailable OR maxUnavailable, never both. A manifest with both is rejected by the API server when you run kubectl apply. Pick one (maxUnavailable is usually better because the budget stays open as the HPA changes the replica count):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # avoid flapping
---
# pdb.yaml: exactly one of minAvailable / maxUnavailable
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-api
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: orders-api

Step 5 — CI/CD with current action versions
Generate a GitHub Actions workflow to build, push, and bump the image tag. AI tools tend to emit stale action majors from their training data, so pin the current ones (actions/checkout@v6, docker/build-push-action@v7, peter-evans/create-pull-request@v8):
# .github/workflows/deploy.yml (excerpt)
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: docker/setup-buildx-action@v4
      - uses: docker/login-action@v4
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/metadata-action@v5
        id: meta
        with:
          images: ghcr.io/${{ github.repository }}
          # Tag with the full commit SHA so the tag that bump-prod writes actually exists;
          # the default tags follow the branch name, not the SHA.
          tags: type=sha,format=long,prefix=
      - uses: docker/build-push-action@v7
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
  bump-prod:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
        with:
          repository: myorg/k8s-configs
          token: ${{ secrets.CONFIG_REPO_TOKEN }}
      - run: kustomize edit set image orders-api=ghcr.io/${{ github.repository }}:${{ github.sha }}
        working-directory: applications/orders-api/overlays/production
      - uses: peter-evans/create-pull-request@v8
        with:
          token: ${{ secrets.CONFIG_REPO_TOKEN }}
          commit-message: "Deploy ${{ github.sha }} to production"
          title: "Deploy ${{ github.sha }} to production"
          branch: deploy-${{ github.sha }}

Copy-Paste Prompts
When This Breaks

- CrashLoopBackOff right after a rollout. Usually the liveness probe is killing the pod before it finishes booting. Check kubectl logs <pod> --previous for the last words before the kill. Fix: move the boot guard into a startupProbe (see Step 1) so liveness only begins after the startup probe succeeds.
- OOMKilled under load. kubectl describe pod shows Last State: Terminated, Reason: OOMKilled. The memory limit is too tight for peak usage. Memory is non-compressible: the kernel kills the container rather than throttling it. Fix: raise the memory limit above observed P99 (use the “tighten resource limits” prompt with real kubectl top data), not by guessing.
- A node drain hangs forever. kubectl drain stalls because the PodDisruptionBudget won’t allow another eviction. Common cause: minAvailable equals the replica count (e.g. minAvailable: 3 with 3 replicas leaves zero disruption budget), or the manifest illegally set both minAvailable and maxUnavailable. Fix: set exactly one field, and leave real headroom (maxUnavailable: 1).
- ImagePullBackOff. kubectl describe pod shows the pull error. Either the tag doesn’t exist (the CI bump pushed a different SHA than the manifest references), the registry needs imagePullSecrets, or (increasingly common in 2026) the chart points at a removed Bitnami image. Fix: confirm the tag with crane ls or your registry UI, attach the pull secret, and replace Bitnami deps with maintained charts.
- ArgoCD stuck OutOfSync or Progressing forever. Often a field the cluster mutates (like replicas, managed by the HPA) is fighting Git. Add an ignoreDifferences entry for /spec/replicas so ArgoCD stops trying to revert the HPA. If it’s Progressing, the underlying Deployment never goes Ready, so drop back to the first two failure modes (CrashLoopBackOff or OOMKilled).
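The ignoreDifferences fix for the ArgoCD case could be sketched as follows. It assumes the orders-api-production Application and the orders-api Deployment used elsewhere in this guide; the RespectIgnoreDifferences sync option is included on the assumption that your Argo CD version supports it and that automated sync is enabled.

```yaml
# Fragment of the Argo CD Application (source and destination omitted for brevity)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-api-production
  namespace: argocd
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      name: orders-api
      namespace: production
      jsonPointers:
        - /spec/replicas                 # replica count is owned by the HPA, not Git
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - RespectIgnoreDifferences=true    # assumption: keeps automated syncs from resetting the ignored field
```

It can also help to leave replicas out of the Deployment manifest in Git once an HPA manages it, so there is no desired value for ArgoCD to compare against.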