Load, Stress, and Benchmark Testing

Your API handles 500 requests per second in staging and everyone celebrates. Then Black Friday hits, traffic spikes to 3,000 rps, and the database connection pool is exhausted within minutes. The response time graph looks like a hockey stick and your CEO is watching the uptime dashboard. Performance testing is not optional for production systems, and AI tools make it much faster to build a comprehensive performance test suite.

What You’ll Walk Away With

k6 and Artillery load test generation from AI prompts
Stress testing patterns that find your system’s breaking point safely
Continuous benchmarking in CI that catches performance regressions
AI-assisted analysis of performance bottlenecks from test results
Realistic traffic pattern simulation for your specific use case

Load Test Generation

Generate a k6 load test for our checkout API:

Scenario: Simulate a flash sale with ramping traffic
- Ramp from 0 to 100 virtual users over 2 minutes
- Hold at 100 VUs for 5 minutes (steady state)
- Spike to 500 VUs for 1 minute (flash sale moment)
- Return to 100 VUs for 2 minutes (recovery)
- Ramp down to 0 over 1 minute

API calls per virtual user iteration:
1. POST /api/auth/login (use test credentials from env)
2. GET /api/products?category=sale (browse sale items)
3. POST /api/cart/items (add random product)
4. POST /api/checkout (complete purchase with test payment)

Thresholds:
- p95 response time < 500ms during steady state
- p99 response time < 2000ms during spike
- Error rate < 1% at all times
- Checkout success rate > 99%

Save to /tests/performance/checkout-load.k6.js

claude "Create a comprehensive k6 performance test suite:

1. /tests/performance/checkout-load.k6.js - Checkout flow load test
   - Ramping traffic pattern: 0 -> 100 -> 500 -> 100 -> 0 VUs
   - Realistic user journey (login, browse, cart, checkout)
   - SLA thresholds for response time and error rate

2. /tests/performance/api-stress.k6.js - API endpoint stress test
   - Test each critical endpoint individually
   - Find the breaking point (ramp until errors > 5%)
   - Report max throughput per endpoint

3. /tests/performance/helpers/auth.js - Shared auth helper
   - Login and cache tokens
   - Token refresh handling

4. package.json scripts:
   - test:perf:load - Run load tests
   - test:perf:stress - Run stress tests
   - test:perf:smoke - Quick 30-second smoke test

Include realistic test data generation for each scenario."

Create a performance testing suite for this project:
1. Analyze the API routes to identify critical endpoints
2. Generate k6 load tests for the top 5 most important flows
3. Create stress tests that find breaking points
4. Add performance smoke tests for CI integration
5. Create a PR with the test suite and documentation

Include realistic traffic patterns based on typical SaaS usage.
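
The ramping pattern in the prompts above (0 -> 100 -> 500 -> 100 -> 0 VUs) maps fairly directly onto k6's stages and thresholds options. What follows is a minimal sketch of that options object, not necessarily what an AI tool would generate. The 10-second transitions into and out of the spike are an assumption (the prompt does not say how quickly the spike arrives), and the thresholds apply to the whole run; separate steady-state and spike thresholds would need tagged scenarios, which this sketch leaves out. The totalSeconds helper is hypothetical and not part of k6.

```javascript
// Sketch of a k6 options object for the flash-sale scenario.
// In a real k6 script this would be exposed as: export const options = flashSaleOptions;
const flashSaleOptions = {
  stages: [
    { duration: "2m", target: 100 },  // ramp 0 -> 100 VUs
    { duration: "5m", target: 100 },  // steady state
    { duration: "10s", target: 500 }, // assumed ramp into the spike
    { duration: "1m", target: 500 },  // flash sale moment
    { duration: "10s", target: 100 }, // assumed ramp back down
    { duration: "2m", target: 100 },  // recovery
    { duration: "1m", target: 0 },    // ramp down to zero
  ],
  thresholds: {
    // Applied across the whole run, not per phase.
    http_req_duration: ["p(95)<500", "p(99)<2000"],
    http_req_failed: ["rate<0.01"], // error rate below 1%
    checks: ["rate>0.99"],          // assumes checkout success is recorded via check()
  },
};

// Hypothetical helper (not part of k6): total planned run time in seconds,
// useful for sizing a CI job timeout.
function totalSeconds(stages) {
  const unit = { s: 1, m: 60, h: 3600 };
  return stages.reduce((sum, { duration }) => {
    const match = /^(\d+)([smh])$/.exec(duration);
    if (!match) throw new Error(`Unsupported duration: ${duration}`);
    return sum + Number(match[1]) * unit[match[2]];
  }, 0);
}

console.log(`Planned run time: ${totalSeconds(flashSaleOptions.stages)}s`);
```

Keeping the traffic shape in one object makes it easy to review the generated script against the prompt line by line.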

Stress Testing: Finding the Breaking Point

Copy-paste prompt for stress test generation:

Generate a k6 stress test that finds our API's breaking point:

Strategy: Ramp virtual users linearly from 10 to 1000 over 15 minutes.
For each VU, execute this loop:
1. GET /api/products (list endpoint)
2. GET /api/products/{randomId} (detail endpoint)
3. POST /api/orders (write endpoint with validation)

Track these metrics at each load level:
- Requests per second achieved
- p50, p95, p99 response times
- Error rate (HTTP 5xx responses)
- TCP connection errors
- Iterations completed vs. failed

Define the breaking point as the first load level where ANY of these is true:
- p95 response time exceeds 5 seconds, OR
- Error rate exceeds 5%, OR
- Iterations start failing

Output the results in a format I can paste into a Markdown table for our capacity planning doc.
Include checks that fail the test if we cannot handle at least 200 concurrent users.
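
The breaking-point rule in this prompt can also be applied to per-level results after the run. The sketch below shows one possible shape, under stated assumptions: the stressOptions object uses k6's ramping-vus executor for the 10 to 1000 VU ramp, while the per-level record shape ({ vus, p95Ms, errorRate, failedIterations }) and both function names are hypothetical, since k6 does not emit that structure by itself. The sample rows reuse the checkout figures from the analysis prompt later in this guide, and levels are assumed to be sorted by ascending VUs.

```javascript
// k6 scenario for the linear ramp; in a k6 script: export const options = stressOptions;
const stressOptions = {
  scenarios: {
    breaking_point: {
      executor: "ramping-vus",
      startVUs: 10,
      stages: [{ duration: "15m", target: 1000 }], // VUs ramp linearly within a stage
    },
  },
};

// Return the first load level that violates any breaking-point condition.
function findBreakingPoint(levels) {
  for (const level of levels) {
    const reasons = [];
    if (level.p95Ms > 5000) reasons.push("p95 > 5s");
    if (level.errorRate > 0.05) reasons.push("error rate > 5%");
    if ((level.failedIterations ?? 0) > 0) reasons.push("failed iterations");
    if (reasons.length > 0) return { vus: level.vus, reasons };
  }
  return null; // no breaking point within the tested range
}

// Render the levels as a Markdown table for a capacity planning doc.
function toMarkdownTable(levels) {
  const header = "| VUs | p95 (ms) | Error rate |\n| --- | --- | --- |";
  const rows = levels.map(
    (l) => `| ${l.vus} | ${l.p95Ms} | ${(l.errorRate * 100).toFixed(1)}% |`
  );
  return [header, ...rows].join("\n");
}

// Sample rows taken from the checkout results quoted in the analysis prompt.
const sampleLevels = [
  { vus: 50, p95Ms: 120, errorRate: 0 },
  { vus: 100, p95Ms: 180, errorRate: 0 },
  { vus: 200, p95Ms: 850, errorRate: 0.005 },
  { vus: 300, p95Ms: 3200, errorRate: 0.12 },
];

const breakingPoint = findBreakingPoint(sampleLevels);
console.log(toMarkdownTable(sampleLevels));
console.log(breakingPoint);
```

A capacity check such as "at least 200 concurrent users" then reduces to asserting that the breaking point is either null or above 200 VUs.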

Continuous Benchmarking in CI

Catching Performance Regressions Automatically

Load and stress runs are long and CPU-hungry, so where you run them matters as much as the script. Each tool has a natural home for k6 jobs:

In Cursor, author and debug the k6 scripts locally in Agent mode against a staging URL, then commit the workflow file. Cursor is where you iterate on thresholds and scenarios; long stress runs would block the editor, so limit in-editor runs to the 30-second smoke test.

In Claude Code, run the smoke benchmark headlessly as a PR gate by invoking claude -p "run k6 run tests/performance/smoke.k6.js, compare p95 against perf-baseline.json, and fail if it regressed more than 20%" inside the GitHub Actions job. Pair it with a PostToolUse hook so the comparison comment is posted automatically.

Copy-paste prompt for CI performance benchmarks:

Create a performance regression detection pipeline:

1. A smoke-level k6 test that runs in under 60 seconds:
   - 10 VUs for 30 seconds
   - Test the 3 most critical endpoints
   - Strict thresholds: p95 < 200ms, error rate = 0%

2. A GitHub Actions workflow step that:
   - Runs the smoke test on every PR
   - Compares results against the baseline from main branch
   - Fails the PR if p95 regressed by more than 20%
   - Posts a comment with performance comparison table

3. A baseline update mechanism:
   - After merging to main, save the performance results as the new baseline
   - Store in a JSON file committed to the repo

Save the test to /tests/performance/smoke.k6.js
Save the workflow addition to a snippet I can add to .github/workflows/ci.yml
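
The comparison step in item 2 could look roughly like the sketch below. It is plain Node.js-style JavaScript and independent of k6. The input shape (endpoint name mapped to { p95 } in milliseconds) is a hypothetical format, not something k6 writes by default; you would map it from whatever summary export your workflow uses. The function names and the numbers in the usage lines are illustrative only.

```javascript
// Compare current p95 values against a stored baseline.
// Hypothetical shape for both inputs: { "<endpoint>": { p95: <milliseconds> } }
function compareToBaseline(baseline, current, maxRegression = 0.2) {
  const rows = Object.keys(current).map((endpoint) => {
    const base = baseline[endpoint]?.p95;
    const now = current[endpoint].p95;
    // No baseline yet (for example a new endpoint): report it, but do not fail.
    const change = base > 0 ? (now - base) / base : null;
    return {
      endpoint,
      base: base ?? null,
      now,
      change,
      regressed: change !== null && change > maxRegression,
    };
  });
  return { rows, failed: rows.some((r) => r.regressed) };
}

// Render the comparison as a Markdown table suitable for a PR comment.
function toCommentTable({ rows }) {
  const lines = rows.map((r) => {
    const delta = r.change === null ? "n/a" : `${(r.change * 100).toFixed(1)}%`;
    const status = r.regressed ? "FAIL" : "ok";
    return `| ${r.endpoint} | ${r.base ?? "n/a"} | ${r.now} | ${delta} | ${status} |`;
  });
  return [
    "| Endpoint | Baseline p95 (ms) | Current p95 (ms) | Change | Status |",
    "| --- | --- | --- | --- | --- |",
    ...lines,
  ].join("\n");
}

// Illustrative numbers only.
const baseline = {
  "GET /api/products": { p95: 100 },
  "POST /api/checkout": { p95: 150 },
};
const current = {
  "GET /api/products": { p95: 110 },
  "POST /api/checkout": { p95: 190 },
};

const result = compareToBaseline(baseline, current);
console.log(toCommentTable(result));
// In CI you would fail the job when result.failed is true,
// for example by setting a non-zero exit code.
```

Treating a missing baseline as "report but do not fail" is a design choice; a stricter pipeline might require every tested endpoint to have a baseline entry.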

Analyzing Performance Results with AI

After running load tests, AI tools can help interpret the results.

Copy-paste prompt for performance analysis:

Analyze these k6 load test results and identify bottlenecks:

Endpoint: POST /api/checkout
- At 50 VUs: p95=120ms, p99=250ms, error_rate=0%
- At 100 VUs: p95=180ms, p99=450ms, error_rate=0%
- At 200 VUs: p95=850ms, p99=2100ms, error_rate=0.5%
- At 300 VUs: p95=3200ms, p99=8500ms, error_rate=12%
- At 400 VUs: p95=timeout, error_rate=45%

Our stack: Node.js Express, PostgreSQL (connection pool max: 20),
Redis for sessions, Stripe API for payments.

Based on these results:
1. What is the likely bottleneck? (connection pool, CPU, external API, memory)
2. Why does performance degrade non-linearly between 200 and 300 VUs?
3. What specific code or infrastructure changes would increase capacity?
4. What is our realistic production capacity with current architecture?
5. Suggest a monitoring dashboard that would show this bottleneck in real-time
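
Before handing results like these to an AI tool, it can help to quantify where latency stops scaling with load. The sketch below uses only the p95 figures from the prompt above and compares each step's latency growth with its growth in VUs. The function name and the "latency grows faster than load" rule are illustrative choices, not a standard metric.

```javascript
// p95 figures copied from the checkout results in the prompt
// (the 400 VU row is omitted because its p95 was a timeout).
const checkoutResults = [
  { vus: 50, p95Ms: 120 },
  { vus: 100, p95Ms: 180 },
  { vus: 200, p95Ms: 850 },
  { vus: 300, p95Ms: 3200 },
];

// For each consecutive pair of load levels, compare how much the load grew
// with how much p95 latency grew.
function scalingSteps(results) {
  const steps = [];
  for (let i = 1; i < results.length; i++) {
    const prev = results[i - 1];
    const next = results[i];
    const loadRatio = next.vus / prev.vus;
    const latencyRatio = next.p95Ms / prev.p95Ms;
    steps.push({
      from: prev.vus,
      to: next.vus,
      loadRatio,
      latencyRatio,
      superlinear: latencyRatio > loadRatio,
    });
  }
  return steps;
}

const steps = scalingSteps(checkoutResults);
const knee = steps.find((s) => s.superlinear); // first step where latency outpaces load

for (const s of steps) {
  console.log(
    `${s.from} -> ${s.to} VUs: load x${s.loadRatio.toFixed(2)}, p95 x${s.latencyRatio.toFixed(2)}`
  );
}
console.log("First superlinear step:", knee);
```

With these figures the first superlinear step is 100 -> 200 VUs (load doubles while p95 grows roughly 4.7x), which is worth including in the prompt alongside the 200 -> 300 VU question.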

Database Performance Testing

Copy-paste prompt for database performance tests:

Generate database performance benchmarks for our critical queries:

1. User search (full-text search on name + email):
   - Benchmark with 100K, 500K, 1M, and 5M rows
   - Test with and without the GIN index
   - Measure query time and rows scanned

2. Order listing with joins (orders + items + products):
   - Benchmark the paginated query (page 1 vs. page 100 vs. page 1000)
   - Test with 1M orders and 5M order items
   - Measure with and without covering indexes

3. Dashboard aggregation (total revenue by day for last 90 days):
   - Benchmark with 1M and 10M orders
   - Test with raw query vs. materialized view
   - Measure time and memory usage

Create a seed script that generates the test data using faker.
Save to /tests/performance/database-benchmarks.ts
Output results as a comparison table.
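
A small timing harness is often enough to produce the comparison table this prompt asks for. The sketch below is plain JavaScript; it would carry over to the .ts file named in the prompt once type annotations are added. No real database driver is assumed: runQuery is a placeholder for any async function that executes one query, and the usage at the bottom substitutes timers for queries so the example runs on its own. All function names are hypothetical.

```javascript
// Time an async operation several times and return the samples in milliseconds.
async function timeRuns(runQuery, iterations = 5) {
  const samples = [];
  for (let i = 0; i < iterations; i++) {
    const start = process.hrtime.bigint();
    await runQuery();
    const elapsedNs = process.hrtime.bigint() - start;
    samples.push(Number(elapsedNs) / 1e6);
  }
  return samples;
}

function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Build a Markdown comparison table from { label, samples } entries.
function toComparisonTable(results) {
  const rows = results.map(({ label, samples }) => {
    const cells = [median(samples), Math.min(...samples), Math.max(...samples)]
      .map((v) => v.toFixed(2))
      .join(" | ");
    return `| ${label} | ${cells} |`;
  });
  return [
    "| Variant | Median (ms) | Min (ms) | Max (ms) |",
    "| --- | --- | --- | --- |",
    ...rows,
  ].join("\n");
}

// Usage with stand-in workloads; replace the callbacks with real queries,
// for example the same search with and without the index.
async function main() {
  const fakeQuery = (ms) => () => new Promise((resolve) => setTimeout(resolve, ms));
  const results = [
    { label: "placeholder variant A", samples: await timeRuns(fakeQuery(5), 3) },
    { label: "placeholder variant B", samples: await timeRuns(fakeQuery(1), 3) },
  ];
  console.log(toComparisonTable(results));
}

main();
```

Wall-clock timing from the client includes network and driver overhead, so database-side measurements are a useful cross-check when the two disagree.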

When This Breaks

“Load tests pass locally but the production system is slower.” Your local environment does not match production. Run performance tests against a staging environment that mirrors production infrastructure (same database size, same network latency, same connection limits). Never use local databases for load testing.

“Tests give inconsistent results between runs.” Performance tests are inherently noisy. Run each scenario three times and use the median. Establish acceptable variance bands (plus or minus 15%). Fail only when the median exceeds the threshold, not individual runs.
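
The median-of-three rule can be expressed in a few lines. This is a sketch with hypothetical function and field names; the p95 values in the usage line are made up to show one noisy run being flagged without failing the scenario.

```javascript
// Decide pass/fail from several runs of the same scenario.
// runsP95Ms: p95 latency in ms from each run; thresholdMs: the limit to enforce.
function evaluateRuns(runsP95Ms, thresholdMs, band = 0.15) {
  const sorted = [...runsP95Ms].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Runs outside the +/-15% band around the median are flagged as noisy,
  // but they do not fail the scenario by themselves.
  const noisyRuns = runsP95Ms.filter(
    (value) => Math.abs(value - median) / median > band
  );
  // Only the median is compared with the threshold.
  return { median, noisyRuns, passed: median <= thresholdMs };
}

// Illustrative values: the third run is an outlier.
const verdict = evaluateRuns([180, 195, 260], 200);
console.log(verdict);
```

If noisy runs show up frequently, that usually points at the test environment (shared runners, cold caches) more than at the code under test.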

“We cannot run load tests in CI because they take too long.” Use a tiered approach: smoke tests (30 seconds) on every PR, load tests (10 minutes) nightly, full stress tests weekly. The smoke test catches the obvious regressions; the longer tests catch the subtle ones.

“The AI generated load tests that do not match real traffic patterns.” Give the AI your actual traffic data. Export a sample from your analytics: “Our traffic peaks at 2pm EST, 60% of requests are GET /api/products, and the average user session makes 12 API calls over 8 minutes.”

What’s Next

Monitoring and Observability: Set up production monitoring to validate performance test predictions.

API Testing: Functional API testing to complement performance testing.

CI/CD Pipelines: Integrate performance gates into your deployment pipeline.