
LLM Gateway Configuration

Claude Code can be configured to work with custom LLM gateways, enterprise endpoints, and alternative providers. This guide covers advanced configurations for routing LLM requests through custom infrastructure.

Enterprise Gateway

Route all LLM requests through corporate infrastructure for security and compliance

Cost Management

Track usage, implement quotas, and optimize model selection based on task requirements

Multi-Provider

Switch between different LLM providers (OpenAI, Anthropic, etc.) seamlessly

Edge Deployment

Use local or edge-deployed models for sensitive data or offline scenarios

  1. Set custom endpoint

    Terminal window
    # Point to your LLM gateway
    export CLAUDE_API_ENDPOINT=https://llm-gateway.company.com/v1
    export CLAUDE_API_KEY=your-gateway-api-key
    # For compatibility with OpenAI-style gateways
    export OPENAI_API_BASE=https://llm-gateway.company.com/v1
    export OPENAI_API_KEY=your-gateway-api-key
  2. Configure authentication

    Terminal window
    # Bearer token authentication
    export CLAUDE_AUTH_TYPE=bearer
    export CLAUDE_AUTH_TOKEN=your-bearer-token
    # Custom headers for enterprise gateways
    export CLAUDE_CUSTOM_HEADERS='{"X-Department": "Engineering", "X-Project": "ProjectName"}'
  3. Test the connection

    Terminal window
    claude "Test gateway connection:
    - Verify endpoint is reachable
    - Check authentication
    - Test model availability
    - Validate response format"
LiteLLM Proxy

Terminal window
# Install LiteLLM proxy
pip install litellm[proxy]
# Start proxy server
litellm --model claude-3-opus --port 8000
# Configure Claude Code
export CLAUDE_API_ENDPOINT=http://localhost:8000
export CLAUDE_API_KEY=dummy-key-for-local
# With multiple models
cat > litellm_config.yaml << EOF
model_list:
  - model_name: claude-3-opus
    litellm_params:
      model: claude-3-opus
      api_key: $ANTHROPIC_API_KEY
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: $OPENAI_API_KEY
EOF
litellm --config litellm_config.yaml --port 8000
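
Once the proxy is running, any OpenAI-compatible client can talk to it. A brief sketch using the openai Python package, assuming the proxy is listening on localhost:8000 as configured above:

# Query the LiteLLM proxy through its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000", api_key="dummy-key-for-local")

response = client.chat.completions.create(
    model="claude-3-opus",  # routed via the model_list in litellm_config.yaml
    messages=[{"role": "user", "content": "Summarize what an LLM gateway does."}],
)
print(response.choices[0].message.content)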

Multi-Instance Configuration

Terminal window
# HAProxy configuration for LLM load balancing
cat > haproxy.cfg << EOF
global
    daemon

defaults
    mode http
    timeout connect 5000ms
    timeout client 300000ms  # 5 minutes for LLM responses
    timeout server 300000ms

frontend llm_frontend
    bind *:8080
    default_backend llm_servers

backend llm_servers
    balance leastconn  # Best for varying response times
    option httpchk GET /health
    server llm1 llm1.internal:8000 check weight 100
    server llm2 llm2.internal:8000 check weight 100
    server llm3 llm3.internal:8000 check weight 50  # Less powerful instance
EOF
# Configure Claude Code to use load balancer
export CLAUDE_API_ENDPOINT=http://localhost:8080/v1
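
The httpchk probe above expects each backend to answer GET /health. A minimal sketch of such an endpoint on the LLM servers; check_model_loaded is a hypothetical helper you would replace with whatever readiness signal your serving stack provides:

# health.py - readiness endpoint each backend exposes for HAProxy's httpchk
from flask import Flask, jsonify

app = Flask(__name__)

def check_model_loaded():
    # Hypothetical readiness probe: replace with a real check against
    # your model server (e.g. a lightweight dummy completion).
    return True

@app.route("/health")
def health():
    if check_model_loaded():
        return jsonify({"status": "ok"}), 200       # stays in rotation
    return jsonify({"status": "unavailable"}), 503  # marked down by HAProxy

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
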
  1. Model selection based on task

    # Custom gateway router example
    from flask import Flask, request, jsonify
    import requests

    app = Flask(__name__)

    @app.route('/v1/completions', methods=['POST'])
    def route_request():
        data = request.json
        prompt_length = len(data.get('prompt', ''))
        # Route based on prompt complexity
        if prompt_length < 500:
            # Use faster, cheaper model for simple tasks
            model = 'claude-instant'
            endpoint = 'http://instant-llm:8000'
        elif 'code' in data.get('prompt', '').lower():
            # Use specialized code model
            model = 'claude-3-opus'
            endpoint = 'http://opus-llm:8000'
        else:
            # Default model
            model = 'claude-3-sonnet'
            endpoint = 'http://sonnet-llm:8000'
        # Forward request (copy headers into a plain dict for requests)
        response = requests.post(
            f"{endpoint}/v1/completions",
            json={**data, 'model': model},
            headers=dict(request.headers)
        )
        return response.json()
  2. Implement circuit breaker

    Terminal window
    claude "Create a circuit breaker gateway that:
    - Implements retry logic with exponential backoff
    - Falls back to secondary endpoints
    - Monitors success rates
    - Provides health status endpoint"

PII Detection

# Gateway middleware for PII filtering
@app.before_request
def check_pii():
    if request.method == 'POST':
        data = request.get_json()
        prompt = data.get('prompt', '')
        # Check for PII patterns
        if detect_pii(prompt):
            return jsonify({
                'error': 'PII detected in prompt',
                'type': 'security_violation'
            }), 400
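
The middleware assumes a detect_pii helper. A very rough regex-based sketch follows; a production gateway would call a proper PII/DLP service instead:

# Naive PII detector: regexes for a few common patterns (emails, US SSNs,
# credit-card-like numbers). Illustrative only, not a substitute for a DLP tool.
import re

PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),     # email address
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN format
    re.compile(r"\b(?:\d[ -]*?){13,16}\b"),      # card-like number
]

def detect_pii(text: str) -> bool:
    return any(pattern.search(text) for pattern in PII_PATTERNS)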

Audit Logging

# Comprehensive request logging
@app.after_request
def log_request(response):
    log_entry = {
        'timestamp': datetime.utcnow().isoformat(),
        'user': request.headers.get('X-User-ID'),
        'department': request.headers.get('X-Department'),
        'model': request.json.get('model'),
        'prompt_tokens': count_tokens(request.json.get('prompt')),
        'response_tokens': count_tokens(response.get_json().get('text')),
        'latency': response.headers.get('X-Process-Time'),
        'status': response.status_code
    }
    audit_logger.info(json.dumps(log_entry))
    return response
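
The logger assumes count_tokens and audit_logger already exist in the gateway module. A rough sketch of both is below; tiktoken's cl100k_base encoding is only an approximation for non-OpenAI models:

# Supporting pieces for the audit logger above
import json                      # used by log_request above
import logging
from datetime import datetime    # used by log_request above

import tiktoken

audit_logger = logging.getLogger("llm_audit")
audit_logger.addHandler(logging.FileHandler("llm_audit.log"))
audit_logger.setLevel(logging.INFO)

_ENCODER = tiktoken.get_encoding("cl100k_base")

def count_tokens(text):
    # Approximate count; sufficient for audit and budgeting purposes
    if not text:
        return 0
    return len(_ENCODER.encode(text))
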
Local Models with Ollama

Terminal window
# Install and run Ollama
curl -fsSL https://ollama.ai/install.sh | sh
ollama serve
# Pull models
ollama pull llama2
ollama pull codellama
# Configure Claude Code
export CLAUDE_API_ENDPOINT=http://localhost:11434/api
export CLAUDE_MODEL_NAME=codellama
# Test local model
claude "Test local Ollama model:
- Generate code snippet
- Explain functionality
- Check response quality"

Offline-First Configuration

Terminal window
# Configure fallback chain (bash arrays are not exported to child processes,
# so source this file from your wrapper script)
export CLAUDE_ENDPOINTS=(
"http://localhost:8000/v1" # Local model first
"https://edge-server.local/v1" # Edge server
"https://llm-gateway.company.com/v1" # Corporate gateway
"https://api.anthropic.com/v1" # Direct API fallback
)
# Implement fallback logic
claude "Create a wrapper script that:
- Tries each endpoint in order
- Falls back on connection failure
- Caches successful endpoints
- Monitors latency and adjusts order
- Provides offline capabilities"
Multi-Provider Setup

  1. Configure provider routing

    gateway-config.yaml
    providers:
      anthropic:
        endpoint: https://api.anthropic.com
        api_key: ${ANTHROPIC_API_KEY}
        models: [claude-3-opus, claude-3-sonnet]
      openai:
        endpoint: https://api.openai.com
        api_key: ${OPENAI_API_KEY}
        models: [gpt-4, gpt-3.5-turbo]
      cohere:
        endpoint: https://api.cohere.ai
        api_key: ${COHERE_API_KEY}
        models: [command-r-plus]

    routing_rules:
      - pattern: "code.*"
        preferred_model: claude-3-opus
        fallback: [gpt-4, command-r-plus]
      - pattern: "chat.*"
        preferred_model: gpt-3.5-turbo
        fallback: [claude-3-sonnet]
  2. Implement smart routing

    Terminal window
    claude "Create a smart router that:
    - Routes by task complexity
    - Implements cost optimization
    - Tracks token usage per model
    - Provides fallback options"

Token Counting

# Pre-flight token estimation
import tiktoken

def estimate_cost(prompt, model):
    try:
        encoder = tiktoken.encoding_for_model(model)
    except KeyError:
        # tiktoken only knows OpenAI models; use an approximate encoding otherwise
        encoder = tiktoken.get_encoding('cl100k_base')
    prompt_tokens = len(encoder.encode(prompt))
    # Estimate response tokens
    estimated_response = prompt_tokens * 0.75
    # Calculate cost (USD per 1K tokens)
    costs = {
        'claude-3-opus': {'input': 0.015, 'output': 0.075},
        'claude-3-sonnet': {'input': 0.003, 'output': 0.015},
        'gpt-4': {'input': 0.03, 'output': 0.06}
    }
    model_cost = costs.get(model, {'input': 0.001, 'output': 0.002})
    total_cost = (prompt_tokens * model_cost['input'] +
                  estimated_response * model_cost['output']) / 1000
    return {
        'prompt_tokens': prompt_tokens,
        'estimated_response_tokens': estimated_response,
        'estimated_cost': total_cost
    }

Budget Controls

# Implement spending limits
from collections import defaultdict

class BudgetExceeded(Exception):
    pass

class BudgetGateway:
    def __init__(self, daily_limit=100, user_limits=None):
        self.daily_limit = daily_limit
        self.user_limits = user_limits or {}
        self.usage = defaultdict(float)

    def check_budget(self, user, estimated_cost):
        # Check daily limit
        if self.usage['total'] + estimated_cost > self.daily_limit:
            raise BudgetExceeded("Daily limit reached")
        # Check user limit
        user_limit = self.user_limits.get(user, float('inf'))
        if self.usage[user] + estimated_cost > user_limit:
            raise BudgetExceeded(f"User {user} limit reached")
        return True

    def record_usage(self, user, actual_cost):
        self.usage['total'] += actual_cost
        self.usage[user] += actual_cost
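
Combined with estimate_cost above, the gateway can gate each request before forwarding it. A brief usage sketch, with the user name and limits as placeholders:

# Wire the budget check into the request path (illustrative)
budget = BudgetGateway(daily_limit=100, user_limits={'alice': 10.0})

estimate = estimate_cost("Refactor this module for readability", 'claude-3-opus')
try:
    budget.check_budget('alice', estimate['estimated_cost'])
    # ... forward the request to the upstream provider here ...
    budget.record_usage('alice', estimate['estimated_cost'])
except BudgetExceeded as err:
    print(f"Request blocked: {err}")
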
Prometheus Metrics

# Prometheus metrics for gateway monitoring
from prometheus_client import Counter, Histogram, Gauge

# Define metrics
request_count = Counter('llm_requests_total', 'Total LLM requests',
                        ['model', 'status', 'user'])
request_duration = Histogram('llm_request_duration_seconds',
                             'LLM request duration', ['model'])
active_requests = Gauge('llm_active_requests', 'Active LLM requests')
token_usage = Counter('llm_tokens_total', 'Total tokens used',
                      ['model', 'type'])

# Instrument gateway
@app.route('/v1/completions', methods=['POST'])
def handle_completion():
    with request_duration.labels(model=request.json['model']).time():
        active_requests.inc()
        try:
            response = process_request(request.json)
            request_count.labels(
                model=request.json['model'],
                status='success',
                user=request.headers.get('X-User-ID')
            ).inc()
            return response
        except Exception as e:
            request_count.labels(
                model=request.json['model'],
                status='error',
                user=request.headers.get('X-User-ID')
            ).inc()
            raise
        finally:
            active_requests.dec()

Comprehensive Health Monitoring

Terminal window
claude "Create health check system that monitors:
- Endpoint availability
- Response times
- Error rates
- Token usage trends"
Connection Debugging

Terminal window
# Debug gateway connection
claude "Debug gateway connection:
- Test network connectivity
- Verify authentication
- Check SSL certificates
- Analyze request/response headers"
Performance Testing

Terminal window
# Benchmark gateway performance
claude "Create performance test that:
- Measures latency for different models
- Tests concurrent request handling
- Monitors resource usage
- Generates performance report"


Remember: LLM gateways provide powerful capabilities for enterprise deployments. Start simple and add complexity as needed, always keeping security and reliability as top priorities.