Kubernetes Orchestration Patterns

Your rollout just took the API down. The readiness probe went green before the database connection pool finished warming up, so Kubernetes shifted live traffic onto pods that immediately returned 500s. The dashboard is red, your PM is asking for an ETA, and the manifest that caused it was generated by an AI tool that didn’t know your pool takes eight seconds to fill.

That is the real risk with AI-generated Kubernetes config: the YAML applies cleanly and still ships an outage. The fix is not to stop using AI — it is to drive it with prompts that encode your actual constraints (probe timing, resource headroom, disruption budgets) and to review the output against the failure modes that bite in production.

  • A copy-paste prompt that generates a zero-downtime Deployment with probes tuned to a real warm-up window — not the AI’s default initialDelaySeconds: 0.
  • A debugging prompt that turns kubectl describe + kubectl logs output into a ranked root-cause list for CrashLoopBackOff, OOMKilled, and ImagePullBackOff.
  • A prompt that converts loose manifests into a parameterized Helm chart with values validation — using maintained chart dependencies, not the deprecated Bitnami catalog.
  • A “When This Breaks” playbook for the five failures AI-generated manifests cause most: probe races, OOMKills, PDBs that block node drains, image pull failures, and stuck ArgoCD syncs.

The pattern is the same across all three tools (Cursor, Claude Code, and Codex): describe the workload and its operational constraints, let the tool generate, then diff the result against what would actually fail. What differs is how you invoke each tool, and that difference matters for Kubernetes work, where you often iterate on a directory of manifests.

In Cursor, open the agent chat (Cmd/Ctrl+I), reference the directory with @k8s/, and let the agent write files directly. Cursor’s checkpoints let you accept the Deployment, then roll back just the HPA if it generated a bad metric target. Keep @k8s/values.yaml open so inline edits stay scoped to one file.

The strength here is visual review: Cursor shows a diff for every manifest it touches, so you can catch a PodDisruptionBudget whose maxUnavailable: 0 would block every node drain for a single-replica Deployment before you apply.

The MCP setup is nearly identical across all three tools, since they all speak the Model Context Protocol; only the config location differs. A Kubernetes MCP server lets the tool query live cluster state (what’s actually running, recent events) instead of guessing:

# Claude Code
claude mcp add k8s -- npx -y kubernetes-mcp-server

# Cursor / Codex: add the same server to .mcp.json
# (Cursor reads .cursor/mcp.json; Codex reads ~/.codex/config.toml or .mcp.json)
{
  "mcpServers": {
    "k8s": {
      "command": "npx",
      "args": ["-y", "kubernetes-mcp-server"]
    }
  }
}

With the cluster connected, “why is the orders pod restarting?” gets answered against real kubectl get events output instead of a generic checklist.

Step 1 — Generate a Deployment that won’t race its own probes

The single most common AI manifest defect is a readiness probe that passes before the app is actually ready. Encode the warm-up window in the prompt and the generated startupProbe will gate the others correctly.

Here is the shape you should expect back — note how startupProbe protects the slow boot, and maxUnavailable: 0 guarantees no capacity dips during the roll:

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
  namespace: production
  labels:
    app: orders-api
    tier: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: orders-api
                topologyKey: kubernetes.io/hostname
      containers:
        - name: orders-api
          image: myregistry/orders-api:v1.4.0
          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9090
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          startupProbe:            # gates liveness/readiness until the pool warms
            httpGet:
              path: /health/startup
              port: http
            periodSeconds: 2
            failureThreshold: 30   # up to 60s (30 x 2s) to start before the container is restarted
          readinessProbe:
            httpGet:
              path: /health/ready
              port: http
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: http
            periodSeconds: 10
            failureThreshold: 3
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]

What to check before applying: if the AI emits initialDelaySeconds on the readiness probe instead of a startupProbe, reject it — initialDelaySeconds is a fixed guess, while startupProbe adapts to a slow boot and stops liveness from killing the pod mid-warm-up.

Step 2 — Stand up the surrounding networking

A Deployment alone isn’t reachable. Generate the Service, Ingress, and a default-deny NetworkPolicy together so the workload is exposed and locked down. The manifests below are production-shaped — TLS via cert-manager, rate limiting on the Ingress, and egress restricted to Postgres, Redis, and DNS:

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: orders-api
  namespace: production
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 80
      targetPort: http
  selector:
    app: orders-api
---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: orders-api
  namespace: production
  annotations:
    # ingress-nginx has no "rate-limit" annotation; limit-rps caps
    # requests per second per client IP
    nginx.ingress.kubernetes.io/limit-rps: "100"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts: [api.example.com]
      secretName: api-tls-secret
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: orders-api
                port:
                  number: 80
---
# networkpolicy.yaml: default-deny ingress, scoped egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: orders-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: orders-api
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        # A bare podSelector only matches pods in this namespace; add a
        # namespaceSelector if the ingress controller runs elsewhere.
        - podSelector:
            matchLabels:
              app: nginx-ingress
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              # label Kubernetes sets on every namespace automatically
              kubernetes.io/metadata.name: production
      ports:
        - protocol: TCP
          port: 5432   # PostgreSQL
        - protocol: TCP
          port: 6379   # Redis
    - to:              # DNS must be allowed explicitly
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP   # DNS falls back to TCP for large responses
          port: 53
When the same manifests need to ship to dev, staging, and prod, ask the tool to extract them into a chart with values validation. The important correction in 2026: do not let the AI default to Bitnami chart dependencies. As of August 28, 2025 the public Bitnami catalog was cut down to a “latest”-only community subset and older image tags moved to bitnamilegacy, so the postgresql/redis chart pins that AI tools learned from training data now resolve to removed or broken images. Pin maintained alternatives instead.

The Chart.yaml and validation helper should come back looking like this — note the dependency repos are CloudNativePG and the prometheus-community charts, both actively maintained:

# Chart.yaml
apiVersion: v2
name: orders-api
description: Helm chart for the orders-api service
type: application
version: 1.0.0
appVersion: "1.4.0"
dependencies:
  - name: cloudnative-pg
    version: "0.x.x"   # CloudNativePG operator (maintained)
    repository: https://cloudnative-pg.github.io/charts
    condition: postgresql.enabled
  - name: kube-prometheus-stack
    version: "70.x.x"
    repository: https://prometheus-community.github.io/helm-charts
    condition: monitoring.enabled

# templates/_helpers.tpl (validation excerpt)
{{- define "orders-api.validateValues" -}}
{{- if not .Values.image.repository }}
{{- fail "image.repository is required" }}
{{- end }}
{{- if and .Values.ingress.enabled (not .Values.ingress.hosts) }}
{{- fail "ingress.hosts must be set when ingress is enabled" }}
{{- end }}
{{- end }}
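
To make the validation concrete, here is a minimal values.yaml sketch that would pass both checks. Only the key names (image.repository, ingress.enabled, ingress.hosts, postgresql.enabled, monitoring.enabled) come from the chart above; the values themselves are illustrative assumptions.

```yaml
# values.yaml (illustrative defaults; values are assumptions)
image:
  repository: ghcr.io/myorg/orders-api   # assumed registry path
ingress:
  enabled: true
  hosts:
    - api.example.com
postgresql:
  enabled: true      # toggles the cloudnative-pg dependency
monitoring:
  enabled: false     # toggles the kube-prometheus-stack dependency
```

Keep in mind that a define block does nothing until a template includes it, for example with {{ include "orders-api.validateValues" . }} at the top of the Deployment template. With that in place, clearing image.repository or emptying ingress.hosts while ingress.enabled is true should make helm template fail with the matching message.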

Step 4 — Wire up GitOps, Istio, and autoscaling

For continuous delivery, generate an ArgoCD Application that points at the chart’s overlay and self-heals. For traffic shaping, generate Istio config — and make sure the apiVersion is networking.istio.io/v1 (promoted to stable in Istio 1.22 and recommended for new config), not the older v1beta1 that AI tools still emit by default.

# argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-api-production
  namespace: argocd
  finalizers: [resources-finalizer.argocd.argoproj.io]
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/k8s-configs
    targetRevision: main
    path: applications/orders-api/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
---
# virtualservice.yaml: note apiVersion v1 (Istio 1.22+), not v1beta1
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: orders-api
  namespace: production
spec:
  hosts: [orders-api]
  http:
    - route:   # 90/10 canary split
        - destination:
            host: orders-api
            subset: v1
          weight: 90
        - destination:
            host: orders-api
            subset: v2
          weight: 10
      retries:
        attempts: 3
        perTryTimeout: 10s
        retryOn: 5xx
      timeout: 30s
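
The v1 and v2 subsets in the VirtualService only resolve if a DestinationRule defines them. A minimal sketch follows; it assumes the stable and canary pods carry version: v1 and version: v2 labels, which the Step 1 Deployment does not set, so treat the label names as hypothetical.

```yaml
# destinationrule.yaml (sketch; the version labels are an assumption)
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: orders-api
  namespace: production
spec:
  host: orders-api
  subsets:
    - name: v1
      labels:
        version: v1   # assumes stable pods are labeled version=v1
    - name: v2
      labels:
        version: v2   # assumes canary pods are labeled version=v2
```

Without matching subsets, requests routed to an undefined subset typically fail with 503s, so this is worth checking in the diff.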

For autoscaling, generate an HPA and, critically, a PodDisruptionBudget. The PDB has a sharp edge AI tools get wrong: you may set minAvailable or maxUnavailable, never both. The API server rejects a manifest that sets both, so kubectl apply fails. Pick one (maxUnavailable is usually the better choice because an absolute value like 1 stays valid as the HPA changes the replica count):

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # avoid flapping
---
# pdb.yaml: exactly one of minAvailable / maxUnavailable
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-api
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: orders-api

Step 5 — CI/CD with current action versions

Generate a GitHub Actions workflow to build, push, and bump the image tag. AI tools tend to emit stale action majors from their training data — pin the current ones (actions/checkout@v6, docker/build-push-action@v7, peter-evans/create-pull-request@v8):

# .github/workflows/deploy.yml (excerpt)
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write   # lets GITHUB_TOKEN push to ghcr.io
    steps:
      - uses: actions/checkout@v6
      - uses: docker/setup-buildx-action@v4
      - uses: docker/login-action@v4
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/metadata-action@v5
        id: meta
        with:
          images: ghcr.io/${{ github.repository }}
          # Publish a tag equal to the commit SHA so the bump job below
          # references an image that actually exists.
          tags: |
            type=raw,value=${{ github.sha }}
      - uses: docker/build-push-action@v7
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
  bump-prod:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
        with:
          repository: myorg/k8s-configs
          token: ${{ secrets.CONFIG_REPO_TOKEN }}
      # The name left of "=" must match the image name used in the overlay's manifests.
      - run: kustomize edit set image orders-api=ghcr.io/${{ github.repository }}:${{ github.sha }}
        working-directory: applications/orders-api/overlays/production
      - uses: peter-evans/create-pull-request@v8
        with:
          token: ${{ secrets.CONFIG_REPO_TOKEN }}
          commit-message: "Deploy ${{ github.sha }} to production"
          title: "Deploy ${{ github.sha }} to production"
          branch: deploy-${{ github.sha }}

When This Breaks

  1. CrashLoopBackOff right after a rollout. Usually the liveness probe is killing the pod before it finishes booting. Check kubectl logs <pod> --previous for the last output before the kill. Fix: move the boot guard into a startupProbe (see Step 1) so liveness checks only begin once the startup probe has succeeded.

  2. OOMKilled under load. kubectl describe pod shows Last State: Terminated, Reason: OOMKilled. The memory limit is too tight for peak usage. Memory is not compressible, so the kernel kills the container rather than throttling it. Fix: set the memory limit above the observed P99, using the “tighten resource limits” prompt with real kubectl top data rather than guessing.

  3. A node drain hangs forever. kubectl drain stalls because the PodDisruptionBudget won’t allow another eviction. Common cause: minAvailable equals the replica count (e.g. minAvailable: 3 with 3 replicas leaves zero disruption budget), or the manifest illegally set both minAvailable and maxUnavailable. Fix: set exactly one field, and leave real headroom (maxUnavailable: 1).

  4. ImagePullBackOff. kubectl describe pod shows the pull error. Either the tag doesn’t exist (the CI bump pushed a different SHA than the manifest references), the registry needs imagePullSecrets, or — increasingly common in 2026 — the chart points at a removed Bitnami image. Fix: confirm the tag with crane ls / your registry UI, attach the pull secret, and replace Bitnami deps with maintained charts.

  5. ArgoCD stuck OutOfSync or Progressing forever. Often a field the cluster mutates (like replicas, managed by the HPA) fighting Git. Add an ignoreDifferences entry for /spec/replicas so ArgoCD stops trying to revert the HPA. If it’s Progressing, the underlying Deployment never goes Ready — drop back to failure mode 1 or 2.
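
For failure mode 5, the ignoreDifferences entry might look like the excerpt below, added to the Application from Step 4. This is a sketch rather than a drop-in: the resource name orders-api is taken from the earlier manifests, and the RespectIgnoreDifferences sync option is included on the assumption that automated syncs should also leave the ignored field alone.

```yaml
# argocd-application.yaml (spec excerpt)
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      name: orders-api
      jsonPointers:
        - /spec/replicas          # owned by the HPA, not by Git
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - RespectIgnoreDifferences=true   # keep syncs from resetting the ignored field
```

An alternative is to drop spec.replicas from the Deployment manifest entirely so Git never states a value for the HPA to fight.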