Distributed Systems Development with AI

You change one field on the Order service’s API and three other services start returning 500s in staging. The trace is incomplete because two services never propagated the trace context, the saga that processes payments silently skipped its compensation step, and your on-call dashboard shows green while customers can’t check out. Distributed systems fail in the gaps between services—and that’s exactly where AI assistants are most useful and most dangerous: they generate plausible cross-service scaffolding fast, but pasting “generate the entire production system” gets you code you can’t verify.

This guide shows how to use Cursor, Claude Code, and Codex to do the parts AI is genuinely good at—drafting service skeletons, propagating trace context, writing the boring compensation logic—while keeping the verification loop tight enough that you’d ship the result.

  • A workflow for coordinating a single feature across multiple service repos in each of the three tools
  • Copy-paste prompts for designing service boundaries, writing a saga step with its compensation and a failing test, and instrumenting OpenTelemetry incrementally
  • The real, verified MCP servers for monitoring (Sentry, Grafana, Dynatrace) and infrastructure (Docker, Kubernetes, AWS)—with the exact install commands
  • A “When This Breaks” section covering the failure modes that actually bite: broken trace context, missing saga compensation, and MCP auth failures

Half the value of AI on distributed systems is letting it query live infrastructure instead of guessing. But the ecosystem is full of look-alike npm packages: sentry-mcp is a low-traffic stub, not Sentry’s server. Use these verified servers. The server definitions are the same across Cursor, Claude Code, and Codex; only the config file differs (.mcp.json for Claude Code, .cursor/mcp.json for Cursor, ~/.codex/config.toml for Codex). The commands below use the Claude Code CLI; for Cursor or Codex, put the same command and arguments (everything after the --, or the URL for a hosted server) into that tool’s config file.

  1. Sentry (errors, traces, releases) — use the official hosted server with OAuth, no token to manage:

    claude mcp add --transport http sentry https://mcp.sentry.dev/mcp

    For a self-hosted Sentry, the official npm package is @sentry/mcp-server:

    claude mcp add sentry -- npx -y @sentry/mcp-server@latest --access-token=YOUR_TOKEN
  2. Grafana (dashboards, Loki/Prometheus queries, incidents) — the official server is grafana/mcp-grafana, a Go binary distributed via Docker (there is no mcp-grafana npm package):

    claude mcp add grafana -- docker run --rm -i \
    -e GRAFANA_URL=http://localhost:3000 \
    -e GRAFANA_SERVICE_ACCOUNT_TOKEN=YOUR_TOKEN \
    grafana/mcp-grafana -t stdio
  3. Dynatrace (APM, AI anomaly detection) — the official package is published by the Dynatrace OSS org and needs Node 22.10+:

    claude mcp add dynatrace --env DT_ENVIRONMENT=https://YOUR.apps.dynatrace.com \
    -- npx -y @dynatrace-oss/dynatrace-mcp-server@latest

For infrastructure, the verified servers are:

  1. Docker: the official MCP ships with Docker Desktop’s MCP Toolkit; you run the gateway rather than an npm package:

    claude mcp add docker -- docker mcp gateway run

    In Cursor, add a command-type server in Settings → MCP pointing at the same docker mcp gateway run.

  2. Kubernetes: kubernetes-mcp-server is a real package that uses your current kubeconfig context:

    claude mcp add k8s -- npx -y kubernetes-mcp-server@latest
  3. AWS — AWS Labs publishes purpose-specific servers (not one monolithic image). Pick the one you need and rely on the standard AWS credential chain rather than inlining keys:

    claude mcp add aws-api -- uvx awslabs.aws-api-mcp-server@latest

    Browse the full catalog at awslabs.github.io/mcp. For Google Cloud, deploy a custom MCP server on Cloud Run—see cloud.google.com/run/docs.
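As a rough illustration of how one server definition maps onto the different config files, here is a minimal Python sketch that renders the Kubernetes server from the list above in both shapes. The top-level keys (`mcpServers` for the JSON files, `mcp_servers` for the Codex TOML file) are assumptions to check against each tool's current documentation; the helper names are hypothetical.

```python
import json

# One server definition, taken from the Kubernetes command in the list above.
SERVER_NAME = "k8s"
COMMAND = "npx"
ARGS = ["-y", "kubernetes-mcp-server@latest"]


def to_mcp_json(name, command, args):
    """Render the JSON shape assumed for .mcp.json and .cursor/mcp.json."""
    definition = {"mcpServers": {name: {"command": command, "args": args}}}
    return json.dumps(definition, indent=2)


def to_codex_toml(name, command, args):
    """Render the TOML table assumed for ~/.codex/config.toml."""
    # For plain ASCII values, json.dumps output is also a valid TOML string/array.
    return (
        f"[mcp_servers.{name}]\n"
        f"command = {json.dumps(command)}\n"
        f"args = {json.dumps(args)}\n"
    )


if __name__ == "__main__":
    print(to_mcp_json(SERVER_NAME, COMMAND, ARGS))
    print(to_codex_toml(SERVER_NAME, COMMAND, ARGS))
```

Running the script prints both fragments so they can be pasted into the matching file; secrets such as tokens are deliberately left out of the sketch.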

AI drafts bounded-context proposals quickly, but the boundaries are a business decision—treat the output as a first draft to argue with, not a verdict. Start narrow: ask for boundaries plus the reasoning, so you can spot where the model conflated a technical layer with a domain.

When you’ve agreed on boundaries, design one service at a time. Resist “generate all services”—you can’t review a dump of seven services, and the contracts between them are where bugs hide.
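Since the contracts between services are where bugs hide, one cheap guard is a consumer-side expectation check that runs before a provider change merges. The sketch below is a simplified stand-in for a real contract-testing tool; the service names and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsumerExpectation:
    """Fields a consumer reads from a provider's response (hypothetical)."""
    consumer: str
    required_fields: frozenset


def find_broken_consumers(provider_fields, expectations):
    """Return {consumer: sorted missing fields} for each consumer a change breaks."""
    broken = {}
    for expectation in expectations:
        missing = expectation.required_fields - set(provider_fields)
        if missing:
            broken[expectation.consumer] = sorted(missing)
    return broken


# Hypothetical Order API response after renaming `total` to `total_cents`.
order_fields = {"order_id", "customer_id", "total_cents", "status"}

expectations = [
    ConsumerExpectation("payment", frozenset({"order_id", "total"})),
    ConsumerExpectation("notification", frozenset({"order_id", "customer_id", "status"})),
]

if __name__ == "__main__":
    print(find_broken_consumers(order_fields, expectations))
```

With one service designed at a time, a check like this gives you a concrete list of who a field change breaks, which is easier to review than a seven-service diff.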

The saga pattern is where AI-generated distributed code most often looks right and is wrong. The failure mode is always the same: the happy path is fine, but a compensation step is non-idempotent, or a timeout budget is missing. The fix is to build one step at a time, each with its compensation and a failing test first, then watch it go green.

In Cursor, open the Order service repo and switch to Agent mode. Ask for one saga step plus a failing test, run the test in Cursor’s terminal, and only accept the diff once it’s green. Use a checkpoint before each step so you can roll back a bad compensation without losing the prior steps. Cursor’s inline diff view makes it easy to spot when the model “fixed” the test by weakening the assertion instead of the code.

Tie every step to an observable check: after the model claims a step works, run the one test that proves the compensation fires. If you can’t articulate the test, you can’t trust the code.
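To make "one step, its compensation, and a test that proves the compensation fires" concrete, here is a minimal in-memory sketch. The `Saga` and `SagaStep` classes and the step names are hypothetical, and the idempotency guard is a plain set; a real service would need a durable idempotency key instead.

```python
class SagaStep:
    """One forward action paired with the compensation that undoes it."""

    def __init__(self, name, action, compensation):
        self.name = name
        self.action = action
        self.compensation = compensation


class Saga:
    def __init__(self, steps):
        self.steps = steps
        self.completed = []       # names of steps whose action succeeded
        self.compensated = set()  # names already compensated (idempotency guard)

    def compensate(self):
        """Undo completed steps in reverse order; safe to call more than once."""
        for step in reversed(self.steps):
            if step.name in self.completed and step.name not in self.compensated:
                step.compensation()
                self.compensated.add(step.name)

    def run(self):
        """Run steps in order; on the first failure, compensate and return False."""
        for step in self.steps:
            try:
                step.action()
            except Exception:
                self.compensate()
                return False
            self.completed.append(step.name)
        return True


def run_failing_saga():
    """Inject a failure after payment and record which side effects happened."""
    calls = []

    def shipment_fails():
        raise RuntimeError("shipping unavailable")

    saga = Saga([
        SagaStep("charge_payment", lambda: calls.append("charge"),
                 lambda: calls.append("refund")),
        SagaStep("create_shipment", shipment_fails,
                 lambda: calls.append("cancel_shipment")),
    ])
    ok = saga.run()
    saga.compensate()  # simulate a retry: the refund must not run twice
    return ok, calls
```

The check to write first is that `run_failing_saga()` reports failure, that exactly one refund was issued, and that the step that never completed was not compensated.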

Inter-Service Communication and Trace Context

Service mesh and gateway configs are high-leverage for AI—but again, incrementally. Start with the smallest config that you can verify with a single command (curl, istioctl analyze), then layer on canary weights and circuit breakers.

For event-driven flows, the recurring production bug is a broken trace: a service consumes a Kafka message but never extracts and re-injects the trace context, so the trace dead-ends. When you ask AI to wire up consumers, make context propagation an explicit, tested requirement—not an afterthought.
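The extract-and-re-inject requirement can be sketched with nothing but the standard library. The code below hand-rolls the W3C `traceparent` header purely for illustration; it is a stand-in for OpenTelemetry's propagators, not a replacement for them. The header list mimics Kafka-style `(key, bytes)` pairs, and `handle_message` is a hypothetical consumer.

```python
import secrets

TRACEPARENT = "traceparent"  # W3C Trace Context header name


def new_traceparent():
    """Start a trace: version 00, random trace id and span id, sampled flag 01."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def inject(traceparent, headers):
    """Producer side: return message headers with the trace context attached."""
    kept = [(key, value) for key, value in headers if key != TRACEPARENT]
    return kept + [(TRACEPARENT, traceparent.encode("ascii"))]


def extract(headers):
    """Consumer side: return the incoming traceparent, or None if it is missing."""
    for key, value in headers:
        if key == TRACEPARENT:
            return value.decode("ascii")
    return None


def child_of(traceparent):
    """Keep the trace id, mint a new span id for the consumer's own span."""
    version, trace_id, _parent_span_id, flags = traceparent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def handle_message(headers):
    """Hypothetical consumer: extract, continue the trace, re-inject downstream."""
    incoming = extract(headers)
    current = child_of(incoming) if incoming else new_traceparent()
    return inject(current, headers)
```

The test to demand from the AI is the one this sketch makes possible: publish a message with a known trace id, run the consumer, and assert that the outgoing headers carry the same trace id with a different span id.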

Modern observability has moved beyond dashboards to AI-driven anomaly detection and topology-aware root-cause analysis—but you still earn it one service at a time. The “instrument 8 services and 3 databases in one prompt” approach produces config you can’t validate. Instrument one service end to end, confirm a span shows up in Jaeger, then template it.

With the Grafana and Sentry MCP servers connected, you can close the loop without leaving your editor: ask the AI to pull the actual error rate or the slowest trace for a service and reason about it, instead of you screenshotting a dashboard.

A feature like “loyalty points” touches Customer, Order, Payment, and Notification. The coordination problem—not the per-service code—is what makes this hard, and the three tools take genuinely different approaches.

In Cursor, open all four service repos in a single multi-root workspace so the agent can see every contract at once. Design the OpenAPI/event contracts first, then use a background agent to implement each service in dependency order while you review diffs per repo. Cursor’s per-file checkpoints let you revert one service’s changes without unwinding the others. Best when you want to watch and steer each service’s diff visually.

Distributed systems fail in ways a single-service mindset misses. Here are the failure modes that actually surface with AI-assisted work and how to recover.

  1. Trace context dead-ends at an async boundary. A request shows up in Jaeger for two hops then vanishes. The consumer didn’t extract the trace context from message headers. Search the consumer for context extraction; if it’s missing, ask the AI to add header-based propagation and a test that asserts a known traceId survives the hop (the same requirement described for Kafka consumers above). Don’t trust “I added tracing”; verify the traceId end to end.

  2. A saga leaves orphaned state. Payment succeeded, inventory was never released after a downstream failure. The compensation is missing or non-idempotent. Reproduce by injecting a failure at the step after the one you suspect, and assert compensation fires exactly once. Rebuild that step with the failing-test-first prompt; never accept compensation logic without a test that triggers it.

  3. MCP server auth fails or returns nothing. The tool connects but every query errors or returns empty. Usually the cause is a missing or expired token, a wrong env var (GRAFANA_SERVICE_ACCOUNT_TOKEN, DT_ENVIRONMENT), or a Sentry OAuth flow that was never completed. Run claude mcp list to confirm the server is connected, re-check the env vars against the install commands above, and for the hosted Sentry server re-run the OAuth flow. If npm view <pkg> shows an unexpected maintainer or repository, or the package’s npm page shows a suspiciously low download count, you installed a look-alike; reinstall the official scoped package.

  4. The AI generated a “distributed monolith.” Services that must deploy together, or two services writing the same table. This is a design failure the model won’t flag on its own. Ask it to audit: “List every place two services share a database, a write path, or must deploy in lockstep.” Resolve those before splitting further—shared write paths defeat the point of microservices.

  5. Canary auto-rollback never triggers. The deploy went bad but stayed at 100%. The rollback threshold references a metric that isn’t being emitted, or the metric name is wrong. Confirm the golden-signal metrics exist in Prometheus/Grafana (use the Grafana MCP to query them) before relying on automated rollback, and test the rollback path in staging with a deliberately failing build.
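The audit in failure mode 4 can also be run mechanically once you have an inventory of which service writes which table. A minimal sketch, assuming a hand-built or script-collected mapping; the service and table names are hypothetical.

```python
from collections import defaultdict


def shared_write_paths(writes_by_service):
    """Return {table: sorted services} for tables written by more than one service."""
    writers = defaultdict(set)
    for service, tables in writes_by_service.items():
        for table in tables:
            writers[table].add(service)
    return {
        table: sorted(services)
        for table, services in sorted(writers.items())
        if len(services) > 1
    }


# Hypothetical inventory, e.g. collected from migrations or ORM models.
writes = {
    "order": {"orders", "order_items"},
    "payment": {"payments", "orders"},
    "notification": {"notifications"},
}

if __name__ == "__main__":
    print(shared_write_paths(writes))
```

Any table that appears in the result is a shared write path to resolve before splitting services further.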