Chaos Engineering

Transform system failures from disasters into learning opportunities. Master chaos engineering with Cursor IDE and Claude Code to build resilient systems that gracefully handle the unexpected. Learn to design intelligent failure scenarios, automate recovery testing, and create systems that improve under stress.

In distributed systems, failures are inevitable. AI-powered chaos engineering transforms these failures from problems into competitive advantages:

Intelligent Experiment Design

Generate realistic failure scenarios based on your system architecture, historical incidents, and risk analysis

Automated Safety Controls

Built-in blast radius control, automatic rollback mechanisms, and real-time monitoring prevent chaos from becoming disaster

Learning from Failures

Extract actionable insights from every experiment to improve system resilience and incident response procedures

Proactive Resilience

Identify and fix weaknesses before they cause production outages, reducing MTTR and improving reliability

Quick Start: Design Your First Chaos Experiment

Workflow 1: Infrastructure Resilience Testing

Prompt: “Design chaos engineering experiments to test our microservices architecture’s resilience to common failure scenarios.”

// Agent Mode: Chaos experiment design
Agent: "Create comprehensive chaos engineering experiments:
1. Analyze our system architecture and dependencies
2. Design failure scenarios for different components
3. Create progressive experiment plans with safety controls
4. Set up monitoring and alerting for experiments
5. Define success criteria and rollback procedures
6. Generate experiment runbooks and documentation
System components:
- Web frontend (React/Next.js)
- API gateway (Kong/Envoy)
- Microservices (Node.js, Python, Go)
- Databases (PostgreSQL, Redis)
- Message queues (RabbitMQ, Kafka)
- Infrastructure (Kubernetes, AWS)"
// AI generates chaos experiment framework
class ChaosExperimentDesigner {
  designInfrastructureTests() {
    return [
      {
        name: 'Pod Failure Simulation',
        description: 'Test service recovery when pods are killed',
        target: 'kubernetes-pods',
        scenario: {
          action: 'kill-random-pods',
          percentage: 25, // Kill 25% of pods
          duration: '5m',
          rollback: 'automatic'
        },
        monitoring: {
          metrics: ['response_time', 'error_rate', 'availability'],
          alerts: ['pod_restart_rate', 'service_degradation'],
          dashboards: ['service_health', 'infrastructure_overview']
        },
        successCriteria: {
          maxResponseTime: '2000ms',
          maxErrorRate: '1%',
          minAvailability: '99%',
          recoveryTime: '<60s'
        }
      },
      {
        name: 'Network Latency Injection',
        description: 'Test system behavior under network stress',
        target: 'service-mesh',
        scenario: {
          action: 'inject-latency',
          latency: '500ms',
          jitter: '100ms',
          affectedServices: ['user-service', 'order-service'],
          duration: '10m'
        },
        monitoring: {
          metrics: ['request_latency', 'timeout_rate', 'retry_count'],
          businessMetrics: ['conversion_rate', 'user_satisfaction']
        },
        successCriteria: {
          circuitBreakerTriggered: true,
          gracefulDegradation: true,
          userImpact: 'minimal'
        }
      },
      {
        name: 'Database Connection Pool Exhaustion',
        description: 'Test application behavior when DB connections are exhausted',
        target: 'database-connections',
        scenario: {
          action: 'exhaust-connection-pool',
          targetService: 'order-service',
          duration: '3m',
          gradual: true
        },
        monitoring: {
          metrics: ['db_connection_count', 'query_response_time', 'queue_depth'],
          applicationMetrics: ['order_processing_rate', 'error_responses']
        },
        successCriteria: {
          connectionPoolRecovery: '<30s',
          backpressureHandling: true,
          dataConsistency: 'maintained'
        }
      }
    ];
  }

  createSafetyControls() {
    return {
      blastRadiusControl: {
        maxAffectedServices: '30%',
        maxAffectedUsers: '5%',
        environmentIsolation: true
      },
      automaticRollback: {
        triggers: [
          'error_rate > 5%',
          'response_time > 5000ms',
          'availability < 95%',
          'business_metric_degradation > 20%'
        ],
        rollbackTime: '<30s'
      },
      businessHourProtection: {
        allowedWindows: ['weekends', 'maintenance_windows'],
        requireApproval: 'outside_allowed_windows',
        stakeholderNotification: true
      }
    };
  }
}
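
The experiment definitions and safety controls above are plain data; how they run depends on your tooling. A minimal usage sketch, assuming a hypothetical runExperiment helper (not part of Cursor, Claude Code, or any specific chaos framework):

// Hypothetical wiring: feed each generated experiment, plus the shared
// safety controls, into whatever actually executes the scenario.
async function runInfrastructureSuite() {
  const designer = new ChaosExperimentDesigner();
  const safetyControls = designer.createSafetyControls();

  for (const experiment of designer.designInfrastructureTests()) {
    // runExperiment is a placeholder for your executor
    // (Chaos Mesh, Litmus, a custom script) -- adapt to your tooling.
    await runExperiment({ ...experiment, safetyControls, environment: 'staging' });
  }
}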

Workflow 2: Application-Level Chaos Testing

Prompt: “Create application-level chaos experiments to test resilience patterns like circuit breakers, retries, and graceful degradation.”

# AI-generated progressive chaos experiments
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: application-resilience-tests
spec:
  schedule: "0 2 * * 6" # Weekly Saturday 2 AM
  type: Workflow
  workflow:
    entry: progressive-chaos
    templates:
      # Entry point: run the four stages in sequence
      - name: progressive-chaos
        templateType: Serial
        children:
          - database-latency
          - service-failure
          - memory-stress
          - network-partition
      # Stage 1: Database latency injection
      - name: database-latency
        templateType: NetworkChaos
        networkChaos:
          action: delay
          mode: all
          delay:
            latency: "200ms"
            jitter: "50ms"
          selector:
            labelSelectors:
              app: "postgresql"
          duration: "10m"
      # Stage 2: Service dependency failure
      - name: service-failure
        templateType: PodChaos
        podChaos:
          action: pod-kill
          mode: fixed-percent
          value: "50" # 50% of matching pods
          selector:
            labelSelectors:
              app: "payment-service"
          duration: "5m"
      # Stage 3: Memory pressure
      - name: memory-stress
        templateType: StressChaos
        stressChaos:
          stressors:
            memory:
              workers: 4
              size: "256MB"
          selector:
            labelSelectors:
              app: "order-service"
          duration: "8m"
      # Stage 4: Network partition
      - name: network-partition
        templateType: NetworkChaos
        networkChaos:
          action: partition
          direction: both
          selector:
            labelSelectors:
              tier: "backend"
          duration: "3m"
// AI-powered dependency failure testing
class DependencyChaosTester {
  async testResilience(service: Service) {
    // Map all dependencies
    const dependencies = await this.ai.mapDependencies({
      service,
      depth: 3,
      includeTransitive: true,
      criticality: 'analyze'
    });

    // Generate failure scenarios
    const scenarios = await this.ai.generateFailureScenarios({
      dependencies,
      patterns: [
        'single_point_failure',
        'cascading_failure',
        'slow_degradation',
        'intermittent_failure',
        'partial_availability'
      ],
      // AI prioritizes based on impact
      prioritization: {
        businessImpact: 0.4,
        userExperience: 0.3,
        dataIntegrity: 0.2,
        recoveryComplexity: 0.1
      }
    });

    // Execute tests
    for (const scenario of scenarios) {
      const result = await this.executeScenario({
        scenario,
        monitoring: {
          // Business metrics
          business: ['conversion_rate', 'revenue_impact', 'user_satisfaction'],
          // Technical metrics
          technical: ['latency_p99', 'error_rate', 'throughput'],
          // Resilience metrics
          resilience: ['recovery_time', 'degradation_level', 'blast_radius']
        },
        validation: async (metrics) => {
          return this.ai.validateResilience({
            metrics,
            slo: await this.getSLOs(),
            acceptableDegradation: 0.2
          });
        }
      });

      // Learn from each test
      await this.ai.updateResilienceModel({
        scenario,
        result,
        systemResponse: await this.analyzeSystemResponse(result)
      });
    }
  }

  async analyzeSystemResponse(result: TestResult) {
    return this.ai.analyze({
      circuitBreakers: {
        triggered: result.circuitBreakerActivations,
        effectiveness: await this.measureCircuitBreakerEffectiveness(result),
        tuning: await this.ai.suggestCircuitBreakerSettings(result)
      },
      retries: {
        patterns: result.retryPatterns,
        success: result.retrySuccessRate,
        optimization: await this.ai.optimizeRetryStrategy(result)
      },
      fallbacks: {
        used: result.fallbackActivations,
        quality: await this.measureFallbackQuality(result),
        improvements: await this.ai.suggestFallbackImprovements(result)
      },
      caching: {
        hitRate: result.cachePerformance,
        staleness: result.cacheDataAge,
        strategy: await this.ai.optimizeCachingStrategy(result)
      }
    });
  }
}
// Test and optimize circuit breakers
class CircuitBreakerChaos {
  async testCircuitBreakers(service: Service) {
    const circuitBreakers = await this.identifyCircuitBreakers(service);
    const results = [];

    for (const cb of circuitBreakers) {
      // Test opening conditions
      const openingTest = await this.testOpening({
        circuitBreaker: cb,
        scenarios: await this.ai.generateOpeningScenarios({
          errorRates: [0.1, 0.3, 0.5, 0.7, 0.9],
          latencies: ['100ms', '500ms', '1s', '5s', 'timeout'],
          patterns: ['sudden_spike', 'gradual_increase', 'intermittent']
        }),
        validate: async (behavior) => {
          return this.ai.assessOpeningBehavior({
            behavior,
            expectedThreshold: cb.config.errorThreshold,
            acceptableDeviation: 0.1
          });
        }
      });

      // Test half-open state
      const halfOpenTest = await this.testHalfOpen({
        circuitBreaker: cb,
        recovery: await this.ai.simulateRecovery({
          patterns: ['immediate', 'gradual', 'unstable', 'false_recovery'],
          duration: cb.config.halfOpenDuration
        }),
        validate: async (behavior) => {
          return this.ai.assessHalfOpenBehavior({
            behavior,
            stabilityRequired: true,
            prematureCloseRisk: 'low'
          });
        }
      });

      // Optimize settings
      const optimization = await this.ai.optimizeCircuitBreaker({
        current: cb.config,
        testResults: { openingTest, halfOpenTest },
        constraints: {
          maxLatency: '2s',
          minAvailability: 0.95,
          recoveryTime: 'under 1m'
        }
      });

      // Collect results for every circuit breaker instead of returning early
      results.push({
        circuitBreaker: cb.name,
        currentSettings: cb.config,
        suggestedSettings: optimization.settings,
        expectedImprovement: optimization.metrics
      });
    }

    return results;
  }
}

Start Small, Scale Gradually

Begin with low-risk experiments in test environments. Use AI to guide your progression to production systems safely.
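
One way to make that progression explicit is to write it down as configuration that widens scope one environment at a time. A minimal sketch; the stage names, limits, and approval gates below are assumptions, not values from any particular tool:

// Hypothetical progression plan: each stage widens the blast radius only
// after the previous stage has passed its exit criteria.
const chaosRolloutStages = [
  { environment: 'dev',        maxAffectedPods: '10%', approval: 'none' },
  { environment: 'staging',    maxAffectedPods: '25%', approval: 'team-lead' },
  { environment: 'production', maxAffectedPods: '5%',  approval: 'change-board' }
];

// Promote to the next stage only when the current one meets its success criteria.
function nextStage(currentIndex: number, successCriteriaMet: boolean) {
  return successCriteriaMet && currentIndex < chaosRolloutStages.length - 1
    ? chaosRolloutStages[currentIndex + 1]
    : chaosRolloutStages[currentIndex];
}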

Automate Safety Controls

Implement automatic rollback, blast radius controls, and monitoring. Never run experiments without safety nets.
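
A rollback guard can be as simple as a polling loop that aborts the experiment as soon as any threshold is breached. The sketch below assumes hypothetical isExperimentRunning, fetchMetrics, abortExperiment, and notifyOnCall helpers; wire them to your own monitoring and chaos tooling:

// Minimal safety-net loop: checks live metrics every few seconds and
// aborts the experiment as soon as any guardrail is crossed.
async function guardExperiment(experimentId: string) {
  const thresholds = { errorRate: 0.05, p99LatencyMs: 5000, availability: 0.95 };

  while (await isExperimentRunning(experimentId)) {   // assumed helper
    const m = await fetchMetrics(experimentId);       // assumed helper
    const breached =
      m.errorRate > thresholds.errorRate ||
      m.p99LatencyMs > thresholds.p99LatencyMs ||
      m.availability < thresholds.availability;

    if (breached) {
      await abortExperiment(experimentId);            // assumed helper
      await notifyOnCall(`Chaos experiment ${experimentId} rolled back`); // assumed helper
      break;
    }
    await new Promise((resolve) => setTimeout(resolve, 5000)); // poll every 5s
  }
}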

Learn from Every Experiment

Extract insights from both successful and failed experiments. Build organizational resilience knowledge.
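
One lightweight way to capture that knowledge is to record every run, whatever its outcome, in a shared findings log. The record shape and example entry below are purely illustrative, not a standard format:

// Hypothetical experiment record appended to a team-wide findings log.
interface ExperimentFinding {
  experiment: string;
  outcome: 'passed' | 'failed' | 'aborted';
  observations: string[]; // what the system actually did
  actionItems: string[];  // fixes or follow-up experiments
}

const exampleFinding: ExperimentFinding = {
  experiment: 'pod-failure-simulation',
  outcome: 'failed',
  observations: ['order-service recovery exceeded the 60s target'],
  actionItems: ['Tune readiness probes and re-run the experiment']
};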

Make It a Team Sport

Include engineering, operations, and business stakeholders. Chaos engineering builds team resilience, not just system resilience.

Prompt: “Integrate chaos engineering into our CI/CD pipeline with automated experiment selection and execution.”

# .github/workflows/continuous-chaos.yml
name: Continuous Chaos Engineering

on:
  schedule:
    - cron: '0 */6 * * *' # Every 6 hours
  workflow_dispatch:
    inputs:
      experiment_type:
        description: 'Type of chaos experiment'
        required: true
        default: 'infrastructure'
        type: choice
        options:
          - infrastructure
          - application
          - dependency
          - security

jobs:
  chaos-experiment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Select Chaos Experiment
        id: select-experiment
        run: |
          # AI selects appropriate experiment based on:
          # - Recent changes
          # - System health
          # - Historical incident patterns
          # - Risk assessment
          EXPERIMENT=$(./scripts/select-chaos-experiment.sh \
            --recent-changes ${{ github.sha }} \
            --system-health current \
            --risk-tolerance low)
          echo "experiment=$EXPERIMENT" >> $GITHUB_OUTPUT

      - name: Pre-flight Safety Checks
        run: |
          # Verify system is healthy before chaos
          ./scripts/pre-flight-checks.sh \
            --slo-status green \
            --active-incidents none \
            --business-hours-check

      - name: Execute Chaos Experiment
        run: |
          ./scripts/run-chaos-experiment.sh \
            --experiment ${{ steps.select-experiment.outputs.experiment }} \
            --environment staging \
            --monitoring enhanced \
            --auto-rollback true \
            --blast-radius minimal

      - name: Analyze Results and Learn
        if: always()
        run: |
          ./scripts/analyze-chaos-results.sh \
            --generate-insights \
            --update-resilience-model \
            --create-action-items
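
The workflow delegates experiment choice to ./scripts/select-chaos-experiment.sh, which is not shown here. As a rough sketch of the kind of logic such a script could implement (the names, scoring weights, and risk tiers are assumptions):

// Hypothetical scoring approach: favor experiments that target recently
// changed, historically fragile services, within the allowed risk tier.
interface ExperimentCandidate {
  name: string;
  targetServices: string[];
  risk: 'low' | 'medium' | 'high';
}

function selectExperiment(
  candidates: ExperimentCandidate[],
  recentlyChangedServices: string[],
  incidentProneServices: string[],
  riskTolerance: 'low' | 'medium' | 'high'
): ExperimentCandidate | undefined {
  const allowedRisk = { low: ['low'], medium: ['low', 'medium'], high: ['low', 'medium', 'high'] };

  // Higher score = more relevant experiment for the current state of the system.
  const score = (c: ExperimentCandidate) =>
    c.targetServices.filter((s) => recentlyChangedServices.includes(s)).length * 2 +
    c.targetServices.filter((s) => incidentProneServices.includes(s)).length;

  return candidates
    .filter((c) => allowedRisk[riskTolerance].includes(c.risk))
    .sort((a, b) => score(b) - score(a))[0];
}
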
  1. Start with System Mapping - Document your architecture, dependencies, and critical paths

  2. Establish Steady State - Define what “normal” looks like with key metrics and baselines (see the baseline sketch after this list)

  3. Design Safe Experiments - Begin with low-risk scenarios in test environments

  4. Build Team Capabilities - Train teams on incident response through game days

  5. Automate and Scale - Integrate chaos testing into CI/CD pipelines

  6. Learn and Improve - Extract insights and strengthen system resilience
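
Step 2 is easiest to act on when “normal” is written down as concrete, per-service baselines. A minimal sketch with illustrative numbers, not recommendations; the service names and tolerances are assumptions:

// Hypothetical steady-state definition: the metrics and tolerance bands an
// experiment must stay within for behavior to still count as "normal".
const steadyState = {
  'checkout-api': {
    p99LatencyMs:  { baseline: 450,   tolerance: 1.5 },  // alert beyond 1.5x baseline
    errorRate:     { baseline: 0.002, tolerance: 2.0 },
    throughputRps: { baseline: 120,   minFraction: 0.8 } // allow at most a 20% dip
  },
  'order-service': {
    p99LatencyMs:  { baseline: 300,   tolerance: 1.5 },
    errorRate:     { baseline: 0.001, tolerance: 2.0 },
    throughputRps: { baseline: 80,    minFraction: 0.8 }
  }
};
// An experiment preserves steady state only if every metric stays within its band.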

Agent: "Design chaos engineering experiments for our e-commerce platform:
- Create infrastructure failure scenarios (pod kills, network issues)
- Design application-level chaos tests (database timeouts, API failures)
- Set up comprehensive monitoring and safety controls
- Generate game day scenarios for team training
- Include automated rollback and recovery procedures"
Terminal window
claude "Set up chaos engineering program:
Todo:
- [ ] Map system architecture and dependencies
- [ ] Create failure scenario library
- [ ] Set up Chaos Mesh or Litmus in Kubernetes
- [ ] Design progressive experiment stages
- [ ] Implement safety controls and monitoring
- [ ] Create game day runbooks and scenarios
- [ ] Integrate with CI/CD pipeline
- [ ] Set up automated analysis and learning
Focus on building team confidence and system resilience."