Claude Code can be configured to work with custom LLM gateways, enterprise endpoints, and alternative providers. This guide covers advanced configurations for routing LLM requests through custom infrastructure.
Enterprise Gateway
Route all LLM requests through corporate infrastructure for security and compliance
Cost Management
Track usage, implement quotas, and optimize model selection based on task requirements
Multi-Provider
Switch between different LLM providers (OpenAI, Anthropic, etc.) seamlessly
Edge Deployment
Use local or edge-deployed models for sensitive data or offline scenarios
Set custom endpoint
# Point to your LLM gateway
export CLAUDE_API_ENDPOINT=https://llm-gateway.company.com/v1
export CLAUDE_API_KEY=your-gateway-api-key

# For compatibility with OpenAI-style gateways
export OPENAI_API_BASE=https://llm-gateway.company.com/v1
export OPENAI_API_KEY=your-gateway-api-key
Configure authentication
# Bearer token authentication
export CLAUDE_AUTH_TYPE=bearer
export CLAUDE_AUTH_TOKEN=your-bearer-token

# Custom headers for enterprise gateways
export CLAUDE_CUSTOM_HEADERS='{"X-Department": "Engineering", "X-Project": "ProjectName"}'
Test the connection
claude "Test gateway connection:- Verify endpoint is reachable- Check authentication- Test model availability- Validate response format"
# Install LiteLLM proxy
pip install litellm[proxy]

# Start proxy server
litellm --model claude-3-opus --port 8000

# Configure Claude Code
export CLAUDE_API_ENDPOINT=http://localhost:8000
export CLAUDE_API_KEY=dummy-key-for-local

# With multiple models
cat > litellm_config.yaml << EOF
model_list:
  - model_name: claude-3-opus
    litellm_params:
      model: claude-3-opus
      api_key: $ANTHROPIC_API_KEY
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: $OPENAI_API_KEY
EOF
litellm --config litellm_config.yaml --port 8000
# Run LocalAI with Docker
docker run -p 8080:8080 \
  -v $PWD/models:/models \
  localai/localai:latest

# Configure for local models
export CLAUDE_API_ENDPOINT=http://localhost:8080/v1
export CLAUDE_MODEL_NAME=llama-2-7b-chat

# Download and configure models
claude "Help me set up LocalAI:
- Download appropriate models
- Configure model aliases
- Set up model parameters
- Test model responses"
# Configure MLflow AI Gateway
cat > config.yaml << EOF
routes:
  - name: claude-completions
    route_type: llm/v1/completions
    model:
      provider: anthropic
      name: claude-3-opus
      config:
        anthropic_api_key: $ANTHROPIC_API_KEY
  - name: embeddings
    route_type: llm/v1/embeddings
    model:
      provider: openai
      name: text-embedding-ada-002
      config:
        openai_api_key: $OPENAI_API_KEY
EOF

# Start gateway
mlflow gateway start --config-path config.yaml --port 5000

# Configure Claude Code
export CLAUDE_API_ENDPOINT=http://localhost:5000/gateway/claude-completions/invocations
Multi-Instance Configuration
# HAProxy configuration for LLM load balancing
cat > haproxy.cfg << EOF
global
    daemon

defaults
    mode http
    timeout connect 5000ms
    timeout client 300000ms   # 5 minutes for LLM responses
    timeout server 300000ms

frontend llm_frontend
    bind *:8080
    default_backend llm_servers

backend llm_servers
    balance leastconn   # Best for varying response times
    option httpchk GET /health
    server llm1 llm1.internal:8000 check weight 100
    server llm2 llm2.internal:8000 check weight 100
    server llm3 llm3.internal:8000 check weight 50   # Less powerful instance
EOF

# Configure Claude Code to use load balancer
export CLAUDE_API_ENDPOINT=http://localhost:8080/v1
Model selection based on task
# Custom gateway router example
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

@app.route('/v1/completions', methods=['POST'])
def route_request():
    data = request.json
    prompt_length = len(data.get('prompt', ''))

    # Route based on prompt complexity
    if prompt_length < 500:
        # Use faster, cheaper model for simple tasks
        model = 'claude-instant'
        endpoint = 'http://instant-llm:8000'
    elif 'code' in data.get('prompt', '').lower():
        # Use specialized code model
        model = 'claude-3-opus'
        endpoint = 'http://opus-llm:8000'
    else:
        # Default model
        model = 'claude-3-sonnet'
        endpoint = 'http://sonnet-llm:8000'

    # Forward request, dropping hop-specific headers the upstream sets itself
    headers = {k: v for k, v in request.headers
               if k.lower() not in ('host', 'content-length')}
    response = requests.post(
        f"{endpoint}/v1/completions",
        json={**data, 'model': model},
        headers=headers
    )

    return jsonify(response.json()), response.status_code
Implement circuit breaker
claude "Create a circuit breaker gateway that:- Implements retry logic with exponential backoff- Falls back to secondary endpoints- Monitors success rates- Provides health status endpoint"
PII Detection
# Gateway middleware for PII filtering
@app.before_request
def check_pii():
    if request.method == 'POST':
        data = request.get_json(silent=True) or {}
        prompt = data.get('prompt', '')

        # Check for PII patterns
        if detect_pii(prompt):
            return jsonify({
                'error': 'PII detected in prompt',
                'type': 'security_violation'
            }), 400
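The detect_pii helper is assumed to exist elsewhere in the gateway. A minimal, illustrative version using a few regular expressions might look like the following; a production gateway would typically call a dedicated PII/DLP service instead.

# Hypothetical detect_pii helper: a few illustrative regex checks only
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # US SSN format
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),         # credit-card-like digit runs
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),    # email address
]

def detect_pii(text: str) -> bool:
    return any(pattern.search(text) for pattern in PII_PATTERNS)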
Audit Logging
# Comprehensive request logging
# (count_tokens and audit_logger are assumed to be defined elsewhere in the gateway)
import json
from datetime import datetime

@app.after_request
def log_request(response):
    log_entry = {
        'timestamp': datetime.utcnow().isoformat(),
        'user': request.headers.get('X-User-ID'),
        'department': request.headers.get('X-Department'),
        'model': request.json.get('model'),
        'prompt_tokens': count_tokens(request.json.get('prompt')),
        'response_tokens': count_tokens(response.json.get('text')),
        'latency': response.headers.get('X-Process-Time'),
        'status': response.status_code
    }
    audit_logger.info(json.dumps(log_entry))
    return response
# Install and run Ollama
curl -fsSL https://ollama.ai/install.sh | sh
ollama serve

# Pull models
ollama pull llama2
ollama pull codellama

# Configure Claude Code
export CLAUDE_API_ENDPOINT=http://localhost:11434/api
export CLAUDE_MODEL_NAME=codellama

# Test local model
claude "Test local Ollama model:
- Generate code snippet
- Explain functionality
- Check response quality"
# Run vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --port 8000 \
  --max-model-len 4096

# Configure Claude Code
export CLAUDE_API_ENDPOINT=http://localhost:8000/v1
export CLAUDE_MODEL_NAME=meta-llama/Llama-2-7b-chat-hf

# Performance tuning
export VLLM_NUM_GPUS=2
export VLLM_TENSOR_PARALLEL_SIZE=2
# Text Generation Inference by HuggingFace
docker run --gpus all --shm-size 1g \
  -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --max-input-length 2048 \
  --max-total-tokens 4096

# Configure endpoint
export CLAUDE_API_ENDPOINT=http://localhost:8080
export CLAUDE_API_FORMAT=tgi   # Special format handling
Offline-First Configuration
# Configure fallback chain
export CLAUDE_ENDPOINTS=(
  "http://localhost:8000/v1"              # Local model first
  "https://edge-server.local/v1"          # Edge server
  "https://llm-gateway.company.com/v1"    # Corporate gateway
  "https://api.anthropic.com/v1"          # Direct API fallback
)

# Implement fallback logic
claude "Create a wrapper script that:
- Tries each endpoint in order
- Falls back on connection failure
- Caches successful endpoints
- Monitors latency and adjusts order
- Provides offline capabilities"
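As a starting point, the following Python sketch probes each endpoint in priority order and reports the first reachable one. The endpoint list and the /models probe path are assumptions to adapt to your own gateway.

# Illustrative endpoint-fallback helper (probe path and URLs are assumptions)
import requests

ENDPOINTS = [
    "http://localhost:8000/v1",               # local model first
    "https://edge-server.local/v1",           # edge server
    "https://llm-gateway.company.com/v1",     # corporate gateway
    "https://api.anthropic.com/v1",           # direct API fallback
]

def first_reachable(endpoints, timeout=2):
    """Return the first endpoint that answers, preserving priority order."""
    for url in endpoints:
        try:
            requests.get(f"{url}/models", timeout=timeout)
            return url
        except requests.RequestException:
            continue
    return None

if __name__ == "__main__":
    endpoint = first_reachable(ENDPOINTS)
    if endpoint:
        print(f"export CLAUDE_API_ENDPOINT={endpoint}")   # eval this in your shell
    else:
        print("# No endpoint reachable; staying offline")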
Configure provider routing
providers:
  anthropic:
    endpoint: https://api.anthropic.com
    api_key: ${ANTHROPIC_API_KEY}
    models: [claude-3-opus, claude-3-sonnet]
  openai:
    endpoint: https://api.openai.com
    api_key: ${OPENAI_API_KEY}
    models: [gpt-4, gpt-3.5-turbo]
  cohere:
    endpoint: https://api.cohere.ai
    api_key: ${COHERE_API_KEY}
    models: [command-r-plus]

routing_rules:
  - pattern: "code.*"
    preferred_model: claude-3-opus
    fallback: [gpt-4, command-r-plus]
  - pattern: "chat.*"
    preferred_model: gpt-3.5-turbo
    fallback: [claude-3-sonnet]
Implement smart routing
claude "Create a smart router that:- Routes by task complexity- Implements cost optimization- Tracks token usage per model- Provides fallback options"
Token Counting
# Pre-flight token estimation
# Note: tiktoken only ships encoders for OpenAI models, so counts for other
# providers are approximations
from tiktoken import encoding_for_model, get_encoding

def estimate_cost(prompt, model):
    try:
        encoder = encoding_for_model(model)
    except KeyError:
        encoder = get_encoding("cl100k_base")   # generic fallback encoding
    prompt_tokens = len(encoder.encode(prompt))

    # Estimate response tokens
    estimated_response = prompt_tokens * 0.75

    # Calculate cost (example USD prices per 1K tokens; verify current rates)
    costs = {
        'claude-3-opus': {'input': 0.015, 'output': 0.075},
        'claude-3-sonnet': {'input': 0.003, 'output': 0.015},
        'gpt-4': {'input': 0.03, 'output': 0.06}
    }

    model_cost = costs.get(model, {'input': 0.001, 'output': 0.002})
    total_cost = (prompt_tokens * model_cost['input'] +
                  estimated_response * model_cost['output']) / 1000

    return {
        'prompt_tokens': prompt_tokens,
        'estimated_response_tokens': estimated_response,
        'estimated_cost': total_cost
    }
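Example usage of the estimator; the printed values are illustrative and depend on the encoder and the example prices above.

# Hypothetical pre-flight check before sending a request
estimate = estimate_cost("Refactor the billing module to use async I/O", "claude-3-opus")
print(f"~{estimate['prompt_tokens']} prompt tokens, "
      f"estimated cost ${estimate['estimated_cost']:.4f}")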
Budget Controls
# Implement spending limits
from collections import defaultdict

class BudgetExceeded(Exception):
    """Raised when a request would push spending past a configured limit."""

class BudgetGateway:
    def __init__(self, daily_limit=100, user_limits=None):
        self.daily_limit = daily_limit
        self.user_limits = user_limits or {}
        self.usage = defaultdict(float)

    def check_budget(self, user, estimated_cost):
        # Check daily limit
        if self.usage['total'] + estimated_cost > self.daily_limit:
            raise BudgetExceeded("Daily limit reached")

        # Check user limit
        user_limit = self.user_limits.get(user, float('inf'))
        if self.usage[user] + estimated_cost > user_limit:
            raise BudgetExceeded(f"User {user} limit reached")

        return True

    def record_usage(self, user, actual_cost):
        self.usage['total'] += actual_cost
        self.usage[user] += actual_cost
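A possible flow combining the token estimator and the budget gateway; the limits are in USD and purely illustrative.

# Hypothetical request flow: estimate, check the budget, then record actual spend
gateway = BudgetGateway(daily_limit=50, user_limits={"alice": 10})
estimate = estimate_cost("Summarize this incident report", "claude-3-sonnet")
if gateway.check_budget("alice", estimate['estimated_cost']):
    # ... forward the request, then record what it actually cost
    gateway.record_usage("alice", estimate['estimated_cost'])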
# Prometheus metrics for gateway monitoring
from prometheus_client import Counter, Histogram, Gauge

# Define metrics
request_count = Counter('llm_requests_total', 'Total LLM requests',
                        ['model', 'status', 'user'])
request_duration = Histogram('llm_request_duration_seconds', 'LLM request duration',
                             ['model'])
active_requests = Gauge('llm_active_requests', 'Active LLM requests')
token_usage = Counter('llm_tokens_total', 'Total tokens used', ['model', 'type'])

# Instrument gateway (process_request is the gateway's own forwarding logic)
@app.route('/v1/completions', methods=['POST'])
def handle_completion():
    with request_duration.labels(model=request.json['model']).time():
        active_requests.inc()
        try:
            response = process_request(request.json)
            request_count.labels(
                model=request.json['model'],
                status='success',
                user=request.headers.get('X-User-ID')
            ).inc()
            return response
        except Exception:
            request_count.labels(
                model=request.json['model'],
                status='error',
                user=request.headers.get('X-User-ID')
            ).inc()
            raise
        finally:
            active_requests.dec()
Comprehensive Health Monitoring
claude "Create health check system that monitors:- Endpoint availability- Response times- Error rates- Token usage trends"
# Debug gateway connection
claude "Debug gateway connection:
- Test network connectivity
- Verify authentication
- Check SSL certificates
- Analyze request/response headers"

# Benchmark gateway performance
claude "Create a performance test that:
- Measures latency for different models
- Tests concurrent request handling
- Monitors resource usage
- Generates a performance report"
Explore related advanced topics:
Remember: LLM gateways provide powerful capabilities for enterprise deployments. Start simple and add complexity as needed, always keeping security and reliability as top priorities.