
Distributed Systems with AI Assistance

Your checkout service is returning intermittent 500 errors. The logs show the payment service responding correctly, but the order service is timing out waiting for an event that the inventory service should have published. Three services, three repositories, three different teams — and the AI tool you are using can only see the one repo you have open. Debugging distributed systems with AI requires a fundamentally different approach than single-service development.

This chapter covers:

  • Contract-first development workflows where AI generates and validates service interfaces
  • Cross-service debugging strategies that work within single-tool context limits
  • Patterns for coordinating schema migrations across service boundaries
  • Techniques for AI-assisted distributed tracing analysis
  • Prompts for generating consistent error handling across services

Microservices split your system across repositories, languages, and teams. AI tools see one repository at a time. The solution: encode cross-service contracts and conventions in each repository so the AI always has the integration context it needs.

Define your service contracts before writing implementation code. AI tools excel at generating implementations from well-defined interfaces.

Store contract definitions in your service repository and reference them explicitly:

.cursor/rules
This service (order-service) communicates with:
- payment-service: REST API, OpenAPI spec at /contracts/payment-api.yaml
- inventory-service: Events via RabbitMQ, schemas at /contracts/inventory-events.json
- notification-service: Events via RabbitMQ, schemas at /contracts/notification-events.json
When implementing any integration:
1. Always read the relevant contract file first
2. Generate client code from the contract, do not hand-write it
3. Include retry logic with exponential backoff for all HTTP calls
4. Include dead-letter queue handling for all event consumers
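Rule 3 above can be made concrete with a small helper. This is a minimal sketch, not the real order-service code: the helper is generic over the request function so the same policy can wrap any HTTP client, and the names (`retryWithBackoff`, `backoffDelayMs`, the 200 ms base delay) are illustrative choices, not values from the contract.

```typescript
const BASE_DELAY_MS = 200;

// Delay doubles each attempt: 200ms, 400ms, 800ms, ...
function backoffDelayMs(attempt: number, baseMs: number = BASE_DELAY_MS): number {
  return baseMs * 2 ** attempt;
}

// Runs `request` up to `maxAttempts` times, sleeping between failures
// with exponential backoff. Rethrows the last error if all attempts fail.
async function retryWithBackoff<T>(
  request: () => Promise<T>,
  maxAttempts = 4,
  baseMs = BASE_DELAY_MS,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await request();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt, baseMs)));
      }
    }
  }
  throw lastError;
}
```

In practice you would also cap the delay, add jitter, and retry only on transient failures (5xx, network errors), not on 4xx responses.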

Use Cursor's @-file syntax, for example @contracts/payment-api.yaml, to bring contract context into your conversation.

When a distributed issue spans services, work from the trace backward. Start at the failing response, find its trace or correlation ID, and follow that ID upstream through each service's logs until you reach the first hop where behavior diverges from the contract.
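Working backward from a trace assumes you can pull one request's entries out of each service's logs. A minimal sketch of that correlation step, assuming structured logs with `ts`, `service`, and `traceId` fields (your field names will differ):

```typescript
// One structured log line; field names are assumptions about your format.
interface LogEntry {
  ts: number;       // epoch milliseconds
  service: string;
  traceId: string;
  message: string;
}

// Returns the entries for one trace, ordered oldest-first, so you can
// read the failure story backward from the last entry.
function traceTimeline(logs: LogEntry[], traceId: string): LogEntry[] {
  return logs
    .filter((entry) => entry.traceId === traceId)
    .sort((a, b) => a.ts - b.ts);
}
```

Pasting such a merged timeline into the AI tool gives it the cross-service view it cannot get from a single repository.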

AI tools can verify that services honor their contracts even when you only have access to one side. A prompt like the following checks the order-service client against the payment contract:

Compare our order-service HTTP client for the payment service
against the payment-api.yaml contract:
1. Are we handling all documented error codes?
2. Are we sending all required headers?
3. Are we respecting rate limits and timeouts from the spec?
4. Are there any fields we're ignoring in responses that we should handle?
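The first check in that prompt can also run mechanically. This is a simplified sketch: the record of documented codes stands in for a parsed payment-api.yaml, and `unhandledCodes` is a hypothetical helper, not part of any real tooling.

```typescript
// documented: operation -> status codes the spec declares,
//   e.g. { "POST /payments": ["200", "402", "429", "503"] }
// handledByClient: operation -> status codes the client actually branches on.
// Returns the codes the spec documents but the client ignores.
function unhandledCodes(
  documented: Record<string, string[]>,
  handledByClient: Record<string, string[]>,
): Record<string, string[]> {
  const gaps: Record<string, string[]> = {};
  for (const [operation, codes] of Object.entries(documented)) {
    const handled = new Set(handledByClient[operation] ?? []);
    const missing = codes.filter((code) => !handled.has(code));
    if (missing.length > 0) gaps[operation] = missing;
  }
  return gaps;
}
```

A check like this makes a good CI gate: an empty result means the client at least acknowledges every documented error code.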

Changing a data format that crosses service boundaries requires coordination. The safe sequence is the expand-and-contract pattern: run old and new schemas side by side, migrate everyone, then retire the old one.

  1. Define the new schema version

    Add the new schema alongside the old one. Do not replace it yet.

  2. Update producers to publish both versions

    The producing service sends events in both old and new formats during the transition period.

  3. Update consumers to accept both versions

    Each consuming service handles both schema versions gracefully.

  4. Verify all consumers are updated

    Monitor that no service is still consuming the old format.

  5. Remove the old schema

    Only after all consumers have migrated, stop producing the old format and remove it.
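Step 3 above can be sketched as a consumer that normalizes both schema versions into one internal shape. The event fields here are invented for illustration: assume v2 renames `qty` to `quantity` and adds a `warehouseId`.

```typescript
// Hypothetical inventory event, old and new schema versions.
interface StockEventV1 { version: 1; sku: string; qty: number; }
interface StockEventV2 { version: 2; sku: string; quantity: number; warehouseId: string; }
type StockEvent = StockEventV1 | StockEventV2;

interface NormalizedStock { sku: string; quantity: number; warehouseId: string | null; }

// Normalize either version at the edge so the rest of the consumer
// never branches on schema version again.
function normalizeStockEvent(event: StockEvent): NormalizedStock {
  switch (event.version) {
    case 1:
      return { sku: event.sku, quantity: event.qty, warehouseId: null };
    case 2:
      return { sku: event.sku, quantity: event.quantity, warehouseId: event.warehouseId };
  }
}
```

Keeping the version branch in one normalization function also makes step 5 cheap: when v1 is retired, you delete one case, not scattered conditionals.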

“The AI does not understand our service boundaries.” You need per-service CLAUDE.md or .cursor/rules files that explicitly document what this service communicates with and how. Without this, the AI treats every service as a standalone application.

“Contract changes break downstream services in production.” You skipped the dual-publish phase. Always run both schema versions simultaneously during migration, and tag consumption metrics with the schema version so you can verify all consumers have migrated before removing the old format.
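The dual-publish phase itself is small. A sketch, assuming the same hypothetical v1/v2 inventory schemas used earlier; the routing keys and the injected `publish` function are stand-ins for your RabbitMQ client, not a real API.

```typescript
// Hypothetical domain event the producer emits.
interface OrderPlaced { orderId: string; sku: string; qty: number; warehouseId: string; }

// During migration, publish the same logical event in both formats,
// each tagged with its schema version.
function dualPublish(
  event: OrderPlaced,
  publish: (routingKey: string, payload: object) => void,
): void {
  // Old format: v1 consumers keep working untouched.
  publish("inventory.order-placed.v1", { version: 1, sku: event.sku, qty: event.qty });
  // New format: migrated consumers switch over at their own pace.
  publish("inventory.order-placed.v2", {
    version: 2,
    sku: event.sku,
    quantity: event.qty,
    warehouseId: event.warehouseId,
  });
}
```

Injecting `publish` keeps the dual-write logic trivially testable without a broker.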

“The AI generates client code that does not match the contract.” Regenerate clients from contracts, do not hand-write them. Add contract validation to your CI pipeline: npm run validate-contracts should fail the build if implementations drift from specs.

“Distributed debugging takes forever even with AI.” Invest in observability infrastructure first. AI tools become dramatically more effective when they can analyze structured traces and correlated logs rather than piecing together separate log files.