Transform system failures from disasters into learning opportunities. Master chaos engineering with Cursor IDE and Claude Code to build resilient systems that gracefully handle the unexpected. Learn to design intelligent failure scenarios, automate recovery testing, and create systems that improve under stress.
In distributed systems, failures are inevitable. AI-powered chaos engineering transforms these failures from problems into competitive advantages:
Intelligent Experiment Design
Generate realistic failure scenarios based on your system architecture, historical incidents, and risk analysis
Automated Safety Controls
Built-in blast radius control, automatic rollback mechanisms, and real-time monitoring prevent chaos from becoming disaster
Learning from Failures
Extract actionable insights from every experiment to improve system resilience and incident response procedures
Proactive Resilience
Identify and fix weaknesses before they cause production outages, reducing MTTR and improving reliability
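The automated safety controls above reduce to one rule: if any health signal crosses a threshold, abort and roll back. A minimal sketch of such a guard in TypeScript, with illustrative thresholds and metric names (not tied to any particular chaos tool):

```typescript
// Minimal rollback guard: abort the experiment as soon as any
// safety threshold is breached. All thresholds are example values.
interface HealthSnapshot {
  errorRate: number;      // fraction of failed requests, 0..1
  p99LatencyMs: number;   // 99th-percentile response time in ms
  availability: number;   // fraction of successful probes, 0..1
}

function shouldRollback(h: HealthSnapshot): boolean {
  return (
    h.errorRate > 0.05 ||     // error rate above 5%
    h.p99LatencyMs > 5000 ||  // p99 latency above 5 seconds
    h.availability < 0.95     // availability below 95%
  );
}
```

In practice a controller would poll metrics every few seconds during an experiment and trigger the rollback procedure the first time this guard returns true.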
Prompt: “Design chaos engineering experiments to test our microservices architecture’s resilience to common failure scenarios.”
// Agent Mode: Chaos experiment design
Agent: "Create comprehensive chaos engineering experiments:

1. Analyze our system architecture and dependencies
2. Design failure scenarios for different components
3. Create progressive experiment plans with safety controls
4. Set up monitoring and alerting for experiments
5. Define success criteria and rollback procedures
6. Generate experiment runbooks and documentation

System components:
- Web frontend (React/Next.js)
- API gateway (Kong/Envoy)
- Microservices (Node.js, Python, Go)
- Databases (PostgreSQL, Redis)
- Message queues (RabbitMQ, Kafka)
- Infrastructure (Kubernetes, AWS)"
// AI generates chaos experiment framework
class ChaosExperimentDesigner {
  designInfrastructureTests() {
    return [
      {
        name: 'Pod Failure Simulation',
        description: 'Test service recovery when pods are killed',
        target: 'kubernetes-pods',
        scenario: {
          action: 'kill-random-pods',
          percentage: 25, // Kill 25% of pods
          duration: '5m',
          rollback: 'automatic'
        },
        monitoring: {
          metrics: ['response_time', 'error_rate', 'availability'],
          alerts: ['pod_restart_rate', 'service_degradation'],
          dashboards: ['service_health', 'infrastructure_overview']
        },
        successCriteria: {
          maxResponseTime: '2000ms',
          maxErrorRate: '1%',
          minAvailability: '99%',
          recoveryTime: '<60s'
        }
      },
      {
        name: 'Network Latency Injection',
        description: 'Test system behavior under network stress',
        target: 'service-mesh',
        scenario: {
          action: 'inject-latency',
          latency: '500ms',
          jitter: '100ms',
          affectedServices: ['user-service', 'order-service'],
          duration: '10m'
        },
        monitoring: {
          metrics: ['request_latency', 'timeout_rate', 'retry_count'],
          businessMetrics: ['conversion_rate', 'user_satisfaction']
        },
        successCriteria: {
          circuitBreakerTriggered: true,
          gracefulDegradation: true,
          userImpact: 'minimal'
        }
      },
      {
        name: 'Database Connection Pool Exhaustion',
        description: 'Test application behavior when DB connections are exhausted',
        target: 'database-connections',
        scenario: {
          action: 'exhaust-connection-pool',
          targetService: 'order-service',
          duration: '3m',
          gradual: true
        },
        monitoring: {
          metrics: ['db_connection_count', 'query_response_time', 'queue_depth'],
          applicationMetrics: ['order_processing_rate', 'error_responses']
        },
        successCriteria: {
          connectionPoolRecovery: '<30s',
          backpressureHandling: true,
          dataConsistency: 'maintained'
        }
      }
    ];
  }

  createSafetyControls() {
    return {
      blastRadiusControl: {
        maxAffectedServices: '30%',
        maxAffectedUsers: '5%',
        environmentIsolation: true
      },
      automaticRollback: {
        triggers: [
          'error_rate > 5%',
          'response_time > 5000ms',
          'availability < 95%',
          'business_metric_degradation > 20%'
        ],
        rollbackTime: '<30s'
      },
      businessHourProtection: {
        allowedWindows: ['weekends', 'maintenance_windows'],
        requireApproval: 'outside_allowed_windows',
        stakeholderNotification: true
      }
    };
  }
}
# Design comprehensive chaos engineering experiments
claude "Create chaos engineering test suite for our distributed system:

System Architecture:
- Frontend: React SPA hosted on CDN
- API Gateway: Kong with rate limiting
- Services: User, Order, Payment, Inventory (microservices)
- Databases: PostgreSQL (primary), Redis (cache)
- Message Bus: RabbitMQ for async processing
- Infrastructure: Kubernetes on AWS

Experiment Categories:
1. Infrastructure failures (pod kills, node failures)
2. Network issues (latency, partitions, packet loss)
3. Resource exhaustion (CPU, memory, disk)
4. Dependency failures (database, external APIs)
5. Application-level failures (memory leaks, deadlocks)

Safety Requirements:
- Blast radius: <30% of system
- User impact: <5% of traffic
- Auto-rollback: <30 seconds
- Business hours: protected
- Monitoring: comprehensive

Generate experiment definitions, monitoring setup, and runbooks."
# Claude creates complete chaos engineering framework
Prompt: “Create application-level chaos experiments to test resilience patterns like circuit breakers, retries, and graceful degradation.”
# AI-generated progressive chaos experiments
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: application-resilience-tests
spec:
  schedule: "0 2 * * 6"  # Weekly Saturday 2 AM
  type: Workflow
  workflow:
    entry: progressive-chaos
    templates:
      # Stage 1: Database latency injection
      - name: database-latency
        templateType: NetworkChaos
        networkChaos:
          action: delay
          mode: all
          delay:
            latency: "200ms"
            jitter: "50ms"
          selector:
            labelSelectors:
              app: "postgresql"
          duration: "10m"

      # Stage 2: Service dependency failure
      - name: service-failure
        templateType: PodChaos
        podChaos:
          action: pod-kill
          mode: fixed-percent
          value: "50%"
          selector:
            labelSelectors:
              app: "payment-service"
          duration: "5m"

      # Stage 3: Memory pressure
      - name: memory-stress
        templateType: StressChaos
        stressChaos:
          stressors:
            memory:
              workers: 4
              size: "256MB"
          selector:
            labelSelectors:
              app: "order-service"
          duration: "8m"

      # Stage 4: Network partition
      - name: network-partition
        templateType: NetworkChaos
        networkChaos:
          action: partition
          direction: both
          selector:
            labelSelectors:
              tier: "backend"
          duration: "3m"
// Agent Mode: Application-level chaos injection
Agent: "Create application-level chaos experiments:

1. Inject database latency to test timeout handling
2. Simulate service dependencies failing
3. Create memory pressure scenarios
4. Test circuit breaker and retry mechanisms
5. Validate graceful degradation patterns
6. Monitor business impact during experiments"
// AI generates application chaos testing framework
class ApplicationChaosTests {
  async testResiliencePatterns() {
    const experiments = [
      {
        name: 'Database Timeout Testing',
        description: 'Test application behavior when database queries time out',
        implementation: async () => {
          // Inject latency into database calls
          const originalQuery = db.query;
          db.query = async (sql, params) => {
            // Simulate slow queries 30% of the time
            if (Math.random() < 0.3) {
              await new Promise(resolve => setTimeout(resolve, 5000));
            }
            return originalQuery.call(db, sql, params);
          };
        },
        validation: {
          circuitBreakerTriggered: true,
          timeoutHandled: true,
          userExperienceImpact: 'minimal'
        }
      },
      {
        name: 'External API Failure',
        description: 'Test fallback mechanisms when external APIs fail',
        implementation: async () => {
          // Mock external API to return errors
          nock('https://api.external-service.com')
            .get('/data')
            .reply(500, { error: 'Service Unavailable' })
            .persist();
        },
        validation: {
          fallbackActivated: true,
          cacheUsed: true,
          partialFunctionality: 'maintained'
        }
      },
      {
        name: 'Memory Pressure Test',
        description: 'Test application behavior under memory constraints',
        implementation: async () => {
          // Gradually consume memory
          const memoryConsumer = [];
          const interval = setInterval(() => {
            memoryConsumer.push(new Array(1000000).fill('data'));

            // Stop before system becomes unstable
            if (process.memoryUsage().heapUsed > 500 * 1024 * 1024) {
              clearInterval(interval);
            }
          }, 1000);
        },
        validation: {
          gracefulDegradation: true,
          memoryManagement: 'effective',
          serviceAvailability: '>95%'
        }
      }
    ];

    // Execute experiments with monitoring
    for (const experiment of experiments) {
      await this.runExperimentWithMonitoring(experiment);
    }
  }

  async runExperimentWithMonitoring(experiment) {
    const metrics = {
      start: Date.now(),
      baseline: await this.captureBaseline(),
      monitoring: true
    };

    try {
      // Execute chaos experiment
      await experiment.implementation();

      // Validate system behavior
      const results = await this.validateExperiment(experiment.validation);

      return {
        experiment: experiment.name,
        status: 'completed',
        results,
        metrics: await this.captureMetrics(metrics.start)
      };
    } catch (error) {
      return {
        experiment: experiment.name,
        status: 'failed',
        error: error.message,
        needsAttention: true
      };
    }
  }
}
// AI-powered dependency failure testing
class DependencyChaosTester {
  async testResilience(service: Service) {
    // Map all dependencies
    const dependencies = await this.ai.mapDependencies({
      service,
      depth: 3,
      includeTransitive: true,
      criticality: 'analyze'
    });

    // Generate failure scenarios
    const scenarios = await this.ai.generateFailureScenarios({
      dependencies,
      patterns: [
        'single_point_failure',
        'cascading_failure',
        'slow_degradation',
        'intermittent_failure',
        'partial_availability'
      ],

      // AI prioritizes based on impact
      prioritization: {
        businessImpact: 0.4,
        userExperience: 0.3,
        dataIntegrity: 0.2,
        recoveryComplexity: 0.1
      }
    });

    // Execute tests
    for (const scenario of scenarios) {
      const result = await this.executeScenario({
        scenario,
        monitoring: {
          // Business metrics
          business: ['conversion_rate', 'revenue_impact', 'user_satisfaction'],

          // Technical metrics
          technical: ['latency_p99', 'error_rate', 'throughput'],

          // Resilience metrics
          resilience: ['recovery_time', 'degradation_level', 'blast_radius']
        },
        validation: async (metrics) => {
          return this.ai.validateResilience({
            metrics,
            slo: await this.getSLOs(),
            acceptableDegradation: 0.2
          });
        }
      });

      // Learn from each test
      await this.ai.updateResilienceModel({
        scenario,
        result,
        systemResponse: await this.analyzeSystemResponse(result)
      });
    }
  }

  async analyzeSystemResponse(result: TestResult) {
    return this.ai.analyze({
      circuitBreakers: {
        triggered: result.circuitBreakerActivations,
        effectiveness: await this.measureCircuitBreakerEffectiveness(result),
        tuning: await this.ai.suggestCircuitBreakerSettings(result)
      },
      retries: {
        patterns: result.retryPatterns,
        success: result.retrySuccessRate,
        optimization: await this.ai.optimizeRetryStrategy(result)
      },
      fallbacks: {
        used: result.fallbackActivations,
        quality: await this.measureFallbackQuality(result),
        improvements: await this.ai.suggestFallbackImprovements(result)
      },
      caching: {
        hitRate: result.cachePerformance,
        staleness: result.cacheDataAge,
        strategy: await this.ai.optimizeCachingStrategy(result)
      }
    });
  }
}
// Test and optimize circuit breakers
class CircuitBreakerChaos {
  async testCircuitBreakers(service: Service) {
    const circuitBreakers = await this.identifyCircuitBreakers(service);
    const reports = [];

    for (const cb of circuitBreakers) {
      // Test opening conditions
      const openingTest = await this.testOpening({
        circuitBreaker: cb,
        scenarios: await this.ai.generateOpeningScenarios({
          errorRates: [0.1, 0.3, 0.5, 0.7, 0.9],
          latencies: ['100ms', '500ms', '1s', '5s', 'timeout'],
          patterns: ['sudden_spike', 'gradual_increase', 'intermittent']
        }),
        validate: async (behavior) => {
          return this.ai.assessOpeningBehavior({
            behavior,
            expectedThreshold: cb.config.errorThreshold,
            acceptableDeviation: 0.1
          });
        }
      });

      // Test half-open state
      const halfOpenTest = await this.testHalfOpen({
        circuitBreaker: cb,
        recovery: await this.ai.simulateRecovery({
          patterns: ['immediate', 'gradual', 'unstable', 'false_recovery'],
          duration: cb.config.halfOpenDuration
        }),
        validate: async (behavior) => {
          return this.ai.assessHalfOpenBehavior({
            behavior,
            stabilityRequired: true,
            prematureCloseRisk: 'low'
          });
        }
      });

      // Optimize settings
      const optimization = await this.ai.optimizeCircuitBreaker({
        current: cb.config,
        testResults: { openingTest, halfOpenTest },
        constraints: {
          maxLatency: '2s',
          minAvailability: 0.95,
          recoveryTime: 'under 1m'
        }
      });

      // Collect a report per circuit breaker rather than returning early
      reports.push({
        circuitBreaker: cb.name,
        currentSettings: cb.config,
        suggestedSettings: optimization.settings,
        expectedImprovement: optimization.metrics
      });
    }

    return reports;
  }
}
Start Small, Scale Gradually
Begin with low-risk experiments in test environments. Use AI to guide your progression to production systems safely.
Automate Safety Controls
Implement automatic rollback, blast radius controls, and monitoring. Never run experiments without safety nets.
Learn from Every Experiment
Extract insights from both successful and failed experiments. Build organizational resilience knowledge.
Make It a Team Sport
Include engineering, operations, and business stakeholders. Chaos engineering builds team resilience, not just system resilience.
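"Start small, scale gradually" can be made mechanical with a staged progression ladder: each experiment run widens the blast radius only after the previous stage passed cleanly, and any failure drops back to the safest stage. The stages below are hypothetical examples, not a standard:

```typescript
// Illustrative progression ladder for chaos experiments.
// A failed stage resets progress to the first (safest) stage.
interface ChaosStage {
  environment: "test" | "staging" | "production";
  blastRadiusPct: number; // share of instances the experiment may touch
}

const LADDER: ChaosStage[] = [
  { environment: "test", blastRadiusPct: 25 },
  { environment: "staging", blastRadiusPct: 10 },
  { environment: "staging", blastRadiusPct: 25 },
  { environment: "production", blastRadiusPct: 1 },
  { environment: "production", blastRadiusPct: 5 },
];

function nextStage(current: number, lastStagePassed: boolean): ChaosStage {
  // On failure, fall back to stage 0 and re-earn confidence;
  // on success, advance but never past the final stage.
  const next = lastStagePassed ? Math.min(current + 1, LADDER.length - 1) : 0;
  return LADDER[next];
}
```

The key property is asymmetry: promotion is slow and one step at a time, while demotion after a failed experiment is immediate and total.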
Prompt: “Integrate chaos engineering into our CI/CD pipeline with automated experiment selection and execution.”
name: Continuous Chaos Engineering

on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours
  workflow_dispatch:
    inputs:
      experiment_type:
        description: 'Type of chaos experiment'
        required: true
        default: 'infrastructure'
        type: choice
        options:
          - infrastructure
          - application
          - dependency
          - security

jobs:
  chaos-experiment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Select Chaos Experiment
        id: select-experiment
        run: |
          # AI selects appropriate experiment based on:
          # - Recent changes
          # - System health
          # - Historical incident patterns
          # - Risk assessment
          EXPERIMENT=$(./scripts/select-chaos-experiment.sh \
            --recent-changes ${{ github.sha }} \
            --system-health current \
            --risk-tolerance low)
          echo "experiment=$EXPERIMENT" >> $GITHUB_OUTPUT

      - name: Pre-flight Safety Checks
        run: |
          # Verify system is healthy before chaos
          ./scripts/pre-flight-checks.sh \
            --slo-status green \
            --active-incidents none \
            --business-hours-check

      - name: Execute Chaos Experiment
        run: |
          ./scripts/run-chaos-experiment.sh \
            --experiment ${{ steps.select-experiment.outputs.experiment }} \
            --environment staging \
            --monitoring enhanced \
            --auto-rollback true \
            --blast-radius minimal

      - name: Analyze Results and Learn
        if: always()
        run: |
          ./scripts/analyze-chaos-results.sh \
            --generate-insights \
            --update-resilience-model \
            --create-action-items
Start with System Mapping - Document your architecture, dependencies, and critical paths
Establish Steady State - Define what “normal” looks like with key metrics and baselines
Design Safe Experiments - Begin with low-risk scenarios in test environments
Build Team Capabilities - Train teams on incident response through game days
Automate and Scale - Integrate chaos testing into CI/CD pipelines
Learn and Improve - Extract insights and strengthen system resilience
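The "Establish Steady State" step can be sketched as a baseline comparison: record key metrics during normal operation, then flag any metric that drifts beyond a tolerance during an experiment. Metric names and the default 20% tolerance here are illustrative assumptions:

```typescript
// Minimal steady-state check: compare live metrics against a recorded
// baseline and report which ones deviate beyond the tolerance.
type Metrics = Record<string, number>;

function steadyStateViolations(
  baseline: Metrics,
  current: Metrics,
  tolerance = 0.2 // allow 20% relative deviation by default
): string[] {
  const violations: string[] = [];
  for (const [name, expected] of Object.entries(baseline)) {
    const actual = current[name];
    if (actual === undefined) {
      violations.push(`${name}: missing`);
      continue;
    }
    const deviation = Math.abs(actual - expected) / Math.max(expected, 1e-9);
    if (deviation > tolerance) {
      violations.push(`${name}: deviated ${(deviation * 100).toFixed(1)}%`);
    }
  }
  return violations;
}
```

An empty result means the system still looks "normal"; any entries are a signal to halt the experiment and investigate.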
Agent: "Design chaos engineering experiments for our e-commerce platform:
- Create infrastructure failure scenarios (pod kills, network issues)
- Design application-level chaos tests (database timeouts, API failures)
- Set up comprehensive monitoring and safety controls
- Generate game day scenarios for team training
- Include automated rollback and recovery procedures"
claude "Set up chaos engineering program:

Todo:
- [ ] Map system architecture and dependencies
- [ ] Create failure scenario library
- [ ] Set up Chaos Mesh or Litmus in Kubernetes
- [ ] Design progressive experiment stages
- [ ] Implement safety controls and monitoring
- [ ] Create game day runbooks and scenarios
- [ ] Integrate with CI/CD pipeline
- [ ] Set up automated analysis and learning

Focus on building team confidence and system resilience."