Vendor risk management — multi-vendor + abstraction + DR plan

Scorecard question: How do you manage vendor risk (Anthropic / OpenAI / Cursor / Google)?
Max‑score answer (3 pts): Multi-vendor + abstraction (router, gateway, OpenRouter / Bedrock / Vertex) + DR plan for rate-limit or vendor outage.

Why this matters in 2026

In Q1 2026, telemetry across LLM gateway vendors showed that roughly 60% of production LLM errors were rate-limit driven — not model errors, not bad prompts, not your timeouts. The provider said “you’ve hit the wall, come back in 60 seconds”, and meanwhile your deploy pipeline, incident bot, support copilot, and agentic background jobs all went dark together. The blast radius of a single-vendor dependency is no longer “the chat feature is slow”; it’s “the entire org’s AI surface is down”.

That number is the cleanest signal yet that AI vendor reliability has crossed from “procurement question” to “SRE question”. As with any SRE-owned dependency, you assume it will fail, you fail over, you communicate. Each major LLM provider logged multi-hour incidents in the trailing twelve months, and no “SLA in the contract” promise prevented any of them.

If you scored 0 or 1, you probably cannot answer today: (1) what happens to production AI if Anthropic returns 529 for 90 minutes? (2) what fraction of monthly LLM spend is locked into fine-tunes that wouldn’t survive a switch? (3) when did you last execute a failover, with measured latency and quality deltas? Three points means those questions have ready answers, written down, recently tested.

What “max score” actually looks like

A router or gateway in front of every production LLM call. No app code holds an ANTHROPIC_API_KEY directly. Every request goes through one logical endpoint that handles retries, fallbacks, rate-limit budgets, and per-tenant policy — self-hosted LiteLLM, managed (Portkey, TrueFoundry, Maxim, Bifrost, Kong AI), or cloud-native (Bedrock, Vertex, Azure AI, Cloudflare AI Gateway, OpenRouter).
At least two production-grade providers per capability tier. Frontier: Anthropic (Claude 4.x) plus OpenAI (GPT-5.x) — or Anthropic direct plus the same model via Bedrock. Embeddings: OpenAI plus a backup (Voyage, Cohere, self-hosted BGE). Cheap classification: frontier plus a local fallback (Llama 3.x via Groq, Together, your own GPU).
An allowlist, not an open buffet. The gateway only routes to a named set of (provider, model, region, version) tuples that legal, security, and engineering signed off on. Anything else returns 403. New models enter via a change ticket.
A DR runbook with named triggers, owners, and steps. Example: “If primary returns >5% 429/529 for >5 minutes, the gateway flips to secondary. On-call announces in #incidents, opens a vendor ticket. If secondary degrades, fall back to local model with reduced feature set.” Rehearsed quarterly.
Cost and quality baselines per provider. For each (provider, model): average request cost, p50/p95 latency, golden eval score. When a price changes 30% overnight (twice in 2025), you re-shape traffic in hours.
No fine-tuning lock-in without exit cost noted. If you fine-tuned on OpenAI or Vertex, the migration is priced: data volume, re-tune time, expected quality regression. Anything that lives only as a fine-tuned weight is logged as risk.
Auth, audit, and PII handling owned at the gateway. Key rotation, per-team budgets, PII scrubbing, prompt/response logging — one place, not every microservice. When a key leaks, you rotate once and twelve services keep working.

Concretely: Anthropic’s status page goes yellow, the gateway sees 429 climbing at 14:02, routing fires at 14:03, traffic shifts to Bedrock-hosted Claude and OpenAI-hosted GPT-5, on-call posts at 14:04 “failover path, user impact: zero”. By 16:00 Anthropic is back, secondary drains, channel closes. Customers never noticed.

Current landscape (web‑search‑verified)

Why single-vendor breaks now

Three forces stacked in 2025–2026:

Rate limits are no longer rare. RPM/TPM ceilings tightened as demand grew. Roughly 60% of LLM production errors in early 2026 were rate-limit driven: HTTP 429 from OpenAI, Google, and Anthropic, plus Anthropic's 529 overloaded response, which signals provider-side load and not your own quota. A prompt that works locally fails at noon because everyone in your timezone runs at noon.
Multi-hour outages hit every provider. Anthropic, OpenAI, Google, and Azure each had at least one multi-hour degradation in the trailing twelve months. Single-provider uptime sits around 99.5–99.9%; teams on more than one commonly measure 99.99%.
Pricing changes are large and frequent. Prices drop 30–80% within a year of release, new tiers appear quarterly. A single-vendor contract from early 2025 is meaningfully overpriced by mid‑2026.

“We standardised on Anthropic” or “We standardised on OpenAI” is not, in 2026, a strategy — it is a deferred outage and a deferred bill.
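The uptime figures above follow from simple arithmetic. The sketch below assumes provider outages are independent and failover is instant. Both assumptions are optimistic, since providers share cloud regions and demand spikes, so treat the result as an upper bound. The function names are illustrative.

```python
from math import prod


def combined_availability(availabilities: list[float]) -> float:
    """Availability of N providers behind instant failover, assuming independent outages."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of unavailability in a 365-day year."""
    return (1.0 - availability) * 365 * 24


single = 0.995
pair = combined_availability([0.995, 0.995])

print(f"single provider: {downtime_hours_per_year(single):.1f} h/year")
print(f"two providers:   {downtime_hours_per_year(pair):.2f} h/year ({pair:.4%})")
```

One provider at 99.5% is down about 43.8 hours a year; two such providers in an ideal failover pair come to about 13 minutes. Real deployments land between those numbers, which is consistent with the roughly 99.99% that multi-provider teams report.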

Router/gateway options

Pick the category that matches your team’s posture, not the one with the most GitHub stars.

Hosted / SaaS gateways. Portkey, TrueFoundry, Maxim, Helicone, Bifrost, Eden AI. Point code at their endpoint, they fan out, you get a dashboard, audit logs, budgets. Lowest operational overhead. Trade-off: they see every prompt — a fresh data-processing relationship with security.
OpenRouter. One API aggregating hundreds of models across dozens of providers, with automatic failover, prompt caching, BYOK, thin margins. Excellent for “any model, anywhere, behind one key”. Trade-off: itself a single dependency — treat as one provider in your DR plan, not the DR plan.
Cloud-native AI gateways. AWS Bedrock, GCP Vertex AI (Gemini + Anthropic + Model Garden), Azure AI Foundry (OpenAI + Mistral + Meta + DeepSeek), Cloudflare AI Gateway. Best when you already have an enterprise relationship — VPC peering, single bill, single DPA, regional residency. Trade-off: model availability lags direct (sometimes weeks) and cloud quirks leak into your code.
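Self-hosted LiteLLM / Bifrost. OSS proxies on your own infra, often paired with an OpenTelemetry-based tracing library such as OpenLLMetry (which is instrumentation, not a proxy). Most flexibility, lowest marginal cost, full audit. Trade-off: you own the supply chain (see the March 2026 LiteLLM incident under Common pitfalls), uptime, and patch cadence. Suitable for teams with platform engineers; risky without.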

For most teams above 20 engineers, the honest answer is a hybrid: cloud-native gateway for the bulk of traffic, plus one alternative path for failover.

Multi-vendor allowlist patterns

A working allowlist is a small, opinionated table:

Tier 1 — frontier reasoning. Two providers, two regions. anthropic/claude-fable-5@us-east-1 (or claude-opus-4.x when budget matters more than peak intelligence) and bedrock/anthropic.claude-opus-4.x@eu-central-1. One logical capability; the gateway picks one. See model comparison for the current tier ladder.
Tier 2 — workhorse coding / general. anthropic/claude-sonnet-4.x and openai/gpt-5.x-mini. Routed by cost, latency, and rate-limit headroom.
Tier 3 — cheap / batch / classification. Local-first, frontier as fallback. Llama 3.x on Groq or your GPU, with OpenAI’s cheapest tier as safety net.
Embeddings. Two providers minimum, with a re-embed plan on schedule, not in a panic.
Vision / audio / specialty. Pin to whichever vendor leads, but wrap the call so swapping is a one-line config change.

Engineers pick capability slots, not model strings; the gateway resolves to a tuple.
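Slot resolution can be sketched as a lookup over the allowlist table. This is an illustration under assumptions: the slot names, the `Target` fields, and every model and version string below are placeholders, not real identifiers, and `NotAllowlisted` is a hypothetical error that a gateway would translate into a 403.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    provider: str
    model: str
    region: str
    version: str


class NotAllowlisted(Exception):
    """Hypothetical error; the gateway would map it to HTTP 403."""


# Placeholder allowlist. Order within a slot is primary first, then secondaries.
ALLOWLIST: dict[str, list[Target]] = {
    "frontier-reasoning": [
        Target("anthropic", "frontier-model-a", "us-east-1", "2026-01"),
        Target("bedrock", "frontier-model-a", "eu-central-1", "2026-01"),
    ],
    "workhorse": [
        Target("anthropic", "workhorse-model-a", "us-east-1", "2026-01"),
        Target("openai", "workhorse-model-b", "us-east-1", "2026-01"),
    ],
}


def resolve(slot: str, unhealthy: frozenset[str] = frozenset()) -> Target:
    """Return the first allowlisted target for a slot whose provider is healthy."""
    targets = ALLOWLIST.get(slot)
    if targets is None:
        raise NotAllowlisted(f"capability slot {slot!r} is not on the allowlist")
    for target in targets:
        if target.provider not in unhealthy:
            return target
    raise RuntimeError(f"no healthy target for slot {slot!r}")


print(resolve("frontier-reasoning").provider)
print(resolve("frontier-reasoning", frozenset({"anthropic"})).provider)
```

Application code only ever passes the slot name. Promoting a new model means editing the table through a change ticket, and no call site has to change.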

DR plan (what triggers failover, how to test)

The runbook is one page.

Triggers. Rolling 5‑minute error rate >5% on primary, OR p95 latency >3× baseline, OR provider status page goes “incident”. Each fires automatically; the human is notified, not asked.
Failover destinations. Per capability tier, a named secondary, with documented cost and quality deltas (e.g., “Tier 1 secondary: +20% per‑1k cost, –4 points on eval, p95 +180ms”).
Degradation modes. If both fail: coding assistant degrades to local completion; agentic workflows pause and queue; chat shows “backup, slower replies”; safety-critical features fail closed.
Communications. Who posts to status, who pings #incidents, who emails customers, who calls the vendor. Names, not roles.
Quarterly fire drill. Scheduled exercise in a low-traffic window: intentionally fail primary, let failover run for 30 minutes on real traffic. Measure routing latency, eval delta, cost delta, alert firing, runbook adherence. The first drill is always embarrassing — that’s the point.

If you haven’t executed a failover in production in the last six months, you don’t have a DR plan; you have a wish.
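The three triggers from the runbook can be encoded as a pure function that the gateway evaluates on every window. A minimal sketch follows; the type and field names are hypothetical, and `min_requests` is an added assumption (not in the runbook) to avoid flipping on a near-empty window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Metrics for the primary over the rolling 5-minute window."""
    requests: int
    errors: int                  # responses counted as failures (429, 529, 5xx)
    p95_latency_ms: float
    status_page_incident: bool


@dataclass(frozen=True)
class Thresholds:
    max_error_rate: float = 0.05     # error rate above 5%
    latency_multiplier: float = 3.0  # p95 above 3x baseline
    min_requests: int = 20           # assumption: skip rate checks on tiny samples


def failover_reasons(
    stats: WindowStats,
    baseline_p95_ms: float,
    t: Thresholds = Thresholds(),
) -> list[str]:
    """Return every trigger that fired; an empty list means stay on primary."""
    reasons: list[str] = []
    if stats.requests >= t.min_requests:
        error_rate = stats.errors / stats.requests
        if error_rate > t.max_error_rate:
            reasons.append(f"error rate {error_rate:.1%} over 5-minute window")
        if stats.p95_latency_ms > t.latency_multiplier * baseline_p95_ms:
            reasons.append(f"p95 {stats.p95_latency_ms:.0f}ms above latency threshold")
    if stats.status_page_incident:
        reasons.append("provider status page reports an incident")
    return reasons


window = WindowStats(requests=1000, errors=80, p95_latency_ms=900.0,
                     status_page_incident=False)
print(failover_reasons(window, baseline_p95_ms=400.0))
```

Returning the reasons, not just a boolean, gives the on-call notification its text for free: the human is told why the gateway flipped, after it has already flipped.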

Trade-offs

Model parity gaps. Tool-calling formats, structured outputs, vision limits, long-context behaviour, and safety filters differ between vendors and shift across versions. The gateway normalises the wire, not the behaviour. Plan for eval regressions on failover.
Latency tax. Every hop adds 30–150ms. Invisible for chat; compounds for an agent loop firing 40 LLM calls per task. Co-locate the gateway with your most-used endpoints.
Audit complexity. A hosted gateway plus two model providers means three DPAs and three subprocessor lists, each reviewed for every region you serve. The win is centralised audit logs in your SIEM (see Q3 · Team billing); the cost is more contracts. Worth it.
Provider tail features. Some features are only native (Anthropic caching headers, OpenAI structured outputs, Google grounded search). Using them creates lock-in. Document where you accept it.
Self-hosted supply chain. Your gateway holds every model key. Pin versions, vendor your image, monitor CVEs, restrict egress, rotate keys on schedule.
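The latency tax above is easy to put numbers on. The sketch below assumes the agent's calls run sequentially, so each extra hop adds to wall-clock time; calls issued in parallel would overlap and cost less. The function name is illustrative.

```python
def added_latency_seconds(calls_per_task: int, hop_ms: float) -> float:
    """Extra wall-clock seconds per task when every call pays one extra gateway hop."""
    return calls_per_task * hop_ms / 1000.0


low = added_latency_seconds(40, 30)    # best case from the 30-150ms range
high = added_latency_seconds(40, 150)  # worst case from the 30-150ms range
print(f"40-call agent task: +{low:.1f}s to +{high:.1f}s")
```

For the 40-call agent loop that is between 1.2 and 6 extra seconds per task, which is why gateway placement matters more for agents than for chat.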

Step‑by‑step: building vendor resilience

Inventory every LLM call in production. Grep code, background jobs, edge workers, and platform scripts for SDK import substrings: anthropic (matches @anthropic-ai/sdk and the Python anthropic), openai, @google/genai (Python code uses from google import genai or the older google.generativeai, so search for those too), cohere (matches the npm cohere-ai and the Python cohere), voyageai, mistral (matches @mistralai/mistralai and mistralai). For each call site, record capability tier, monthly volume, monthly cost, current provider, and the blast radius if the provider returns 5xx for 10 minutes.
Pick your gateway shape. Hosted, cloud-native, OpenRouter, or self-hosted, based on existing cloud contracts, platform-engineer headcount, and compliance scope. Most teams above 20 engineers land on cloud-native plus one alternative. Document in your Q2 · Tooling policy artifact.
Stand up the gateway in shadow mode. Deploy it, point one low‑risk workload, mirror traffic, compare cost/latency/outputs on a sample. Tune until parity is acceptable. Do not migrate prod on day one.
Define the allowlist with security and finance. One hour, one room, one table: capability tier × provider × model × region × monthly budget × DPA reference. Anything not in the table is blocked at the gateway. The table doubles as a governance artifact for SOC 2 / ISO 27001 evidence.
Migrate tier by tier, lowest risk first. Internal tools, then customer-facing non-critical, then critical. Each migration: swap to the gateway endpoint behind a feature flag, watch error budgets for a week, then remove the direct SDK import.
Wire in at least one secondary per capability tier. Each tier names a primary and a secondary with the trigger thresholds above. Run secondary in shadow for two weeks before promoting to live failover.
Write the DR runbook and rehearse it. One page: triggers, owners (names), comms (channels), steps. First fire drill within a month of go-live, with the actual on-call. Re-run quarterly.
Centralise keys, audit, and budgets at the gateway. Rotate every direct key into the gateway; app servers get gateway-scoped keys. Per-team budgets alert at 70% and hard-stop at 100%. Pipe logs into your SIEM.
Track and budget for the parity gap. Run your golden eval weekly against every allowlisted tuple. When a vendor falls behind two consecutive quarters, you have data to renegotiate or demote.
Review quarterly with security, finance, and engineering. 30-minute standing meeting: allowlist updates, DPA renewals, budget vs actual, drill findings. This is the forum where the next vendor change is decided.
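The inventory in the first step can start as a small script instead of a manual grep. The sketch below is a heuristic under assumptions: it only looks at lines that begin with import or from, or that contain require(, so dynamic imports and raw HTTP calls are missed. The Python-side Google patterns and the file-suffix list are additions for illustration.

```python
import re
from collections import Counter
from pathlib import Path

# Vendor -> pattern searched on import-like lines (case-sensitive substrings).
SDK_PATTERNS: dict[str, re.Pattern[str]] = {
    "anthropic": re.compile(r"anthropic"),
    "openai": re.compile(r"openai"),
    "google": re.compile(r"@google/genai|google\.generativeai|from google import genai"),
    "cohere": re.compile(r"cohere"),
    "voyage": re.compile(r"voyageai"),
    "mistral": re.compile(r"mistral"),
}
IMPORT_LINE = re.compile(r"^\s*(?:import|from)\b|\brequire\(")
SOURCE_SUFFIXES = {".py", ".ts", ".tsx", ".js", ".mjs"}


def vendors_in_source(text: str) -> set[str]:
    """Return the vendors whose SDK appears on an import-like line of one file."""
    found: set[str] = set()
    for line in text.splitlines():
        if IMPORT_LINE.search(line):
            found.update(v for v, p in SDK_PATTERNS.items() if p.search(line))
    return found


def scan(root: Path) -> Counter[str]:
    """Count, per vendor, how many source files under root import its SDK."""
    counts: Counter[str] = Counter()
    for path in root.rglob("*"):
        if "node_modules" in path.parts or path.suffix not in SOURCE_SUFFIXES:
            continue
        if path.is_file():
            counts.update(vendors_in_source(path.read_text(errors="ignore")))
    return counts
```

Calling `scan(Path("."))` from the repository root returns a per-vendor file count. The count is only the starting list: capability tier, volume, cost, and blast radius still have to be filled in per call site by someone who knows the service.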

Common pitfalls

A single API key in .env, used in fifteen services. When that key is rate-limited or revoked, fifteen services degrade together. Even with one provider, route every call through one internal endpoint.
“We’ll add failover later, the contract has an SLA.” SLAs pay credits for downtime that already happened — they don’t prevent it. A failover path is what you build instead.
Treating the gateway as a single point of failure. Deploy in at least two regions behind a health-checked LB, with its own “gateway down” runbook.
Fine-tuning on a single vendor without exit strategy. Fine-tunes are sticky. Moving means re-tuning plus a quarter of quality regression. Acceptable only if you wrote down the exit cost when you committed.
No drill = no plan. Teams paste the runbook in Notion and never test it. The first failover in production almost always exposes a missing alert, a stale on-call, or routing pointed at the wrong secondary.
Routing for cost only, ignoring quality. Sending everything to the cheapest model is tempting; months later your support copilot hallucinates because the cheap model regressed. Pair cost routing with a quality budget.
Ignoring the gateway’s own supply chain. Self-hosted gateways hold every model key. Pin versions, vendor the image, restrict egress, audit dependencies. The March 2026 LiteLLM incident was a warning shot.
PII at the gateway with no scrubbing. Deploy in a region matching your data residency, or scrub before forwarding. Document in Q21 · Compliance policy.
No budget owner per route. If no one owns each budget, surprise bills land on the CTO. Assign every allowlisted entry to a named engineer.
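Budget ownership is easiest to enforce when the owner is a required field on the budget itself. Below is a minimal sketch combining that with the 70% alert and 100% hard stop from the step-by-step section; `RouteBudget`, `BudgetAction`, and `check_budget` are hypothetical names, not any gateway's API.

```python
from dataclasses import dataclass
from enum import Enum


class BudgetAction(Enum):
    ALLOW = "allow"
    ALERT = "alert"   # notify the owner, keep serving
    BLOCK = "block"   # hard stop at the limit


@dataclass(frozen=True)
class RouteBudget:
    route: str
    owner: str               # a named engineer, not a team alias
    monthly_limit_usd: float

    def __post_init__(self) -> None:
        if not self.owner.strip():
            raise ValueError(f"route {self.route!r} has no budget owner")
        if self.monthly_limit_usd <= 0:
            raise ValueError(f"route {self.route!r} needs a positive limit")


def check_budget(budget: RouteBudget, spent_usd: float,
                 alert_at: float = 0.70) -> BudgetAction:
    """Map month-to-date spend to an action: allow, alert at 70%, block at 100%."""
    used = spent_usd / budget.monthly_limit_usd
    if used >= 1.0:
        return BudgetAction.BLOCK
    if used >= alert_at:
        return BudgetAction.ALERT
    return BudgetAction.ALLOW


budget = RouteBudget("support-copilot", "A. Engineer", 1000.0)
print(check_budget(budget, 650.0), check_budget(budget, 750.0), check_budget(budget, 1000.0))
```

Because the constructor rejects an empty owner, an allowlist entry without a named person cannot be loaded at all. The pitfall then surfaces at config review, not on the invoice.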

How to verify you’re there

The gateway dashboard shows every production LLM call from the last 30 days, attributed to route, provider, region, model, cost, latency, and tenant. Zero traffic bypasses the gateway.
The allowlist artifact is a table with provider × model × region × DPA × monthly budget × named owner.
The DR runbook is one page with triggers, actions, owners, and comms — and the last fire-drill report is dated within the trailing quarter, with gaps closed.
The routing policy names a primary and at least one secondary per tier. Failover thresholds are encoded in the policy, not in a human’s head.
A random engineer can name, in under 30 seconds, which provider their feature falls back to if Anthropic returns 529 for 60 minutes.
The golden eval ran last week against every allowlisted tuple. Scores are stored over time and visible to engineering leadership.
The last vendor incident appears in post-incident notes with measured user impact (ideally zero) and a closed follow-up.
Finance produces a single AI spend report by provider × tier × team, reconciling to the gateway within ±2%.
A new model release enters production via a change ticket — not via a developer changing a string in a PR.
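The ±2% reconciliation check between finance and the gateway is mechanical enough to automate. A minimal sketch, assuming both sides can export spend keyed by (provider, tier, team); the function names and the sample figures are illustrative only.

```python
SpendKey = tuple[str, str, str]  # (provider, tier, team)


def reconciles(finance_usd: float, gateway_usd: float, tolerance: float = 0.02) -> bool:
    """True when the finance figure is within the tolerance of the gateway figure."""
    if gateway_usd <= 0:
        return finance_usd == gateway_usd
    return abs(finance_usd - gateway_usd) / gateway_usd <= tolerance


def mismatches(finance: dict[SpendKey, float],
               gateway: dict[SpendKey, float]) -> list[SpendKey]:
    """Return every key whose two figures differ by more than the tolerance.

    A key present on only one side counts as a mismatch.
    """
    keys = sorted(set(finance) | set(gateway))
    return [k for k in keys
            if not reconciles(finance.get(k, 0.0), gateway.get(k, 0.0))]


# Made-up sample figures for illustration.
finance = {("anthropic", "tier1", "support"): 1010.0,
           ("openai", "tier2", "platform"): 500.0}
gateway = {("anthropic", "tier1", "support"): 1000.0,
           ("openai", "tier2", "platform"): 450.0}
print(mismatches(finance, gateway))
```

A row that appears in the finance report but not in gateway logs is the interesting case: it usually means traffic is bypassing the gateway, which also fails the first check in this list.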