Synthetic Data Generation
Generate realistic, referentially-intact datasets that respect schema constraints and business rules without exposing sensitive information
Transform test data from your biggest testing bottleneck into a competitive advantage. Learn how to use Cursor IDE and Claude Code to generate realistic, privacy-compliant test data that adapts to your schema changes and scales with your testing needs.
Test data challenges kill productivity and create compliance risks. AI-driven approaches solve these challenges systematically:
Synthetic Data Generation
Generate realistic, referentially-intact datasets that respect schema constraints and business rules without exposing sensitive information
Privacy-First Design
Automatic PII detection, GDPR/HIPAA compliance, and intelligent masking that maintains data utility while protecting privacy
Automated Lifecycle
Version-controlled test data with automated provisioning, refresh scheduling, and environment-specific subsetting
Quality Validation
Continuous data quality monitoring with automated anomaly detection and coverage analysis for comprehensive testing
Prompt: “Generate realistic test data for our e-commerce database that maintains referential integrity and follows business rules.”
```
// Agent Mode: Test data generation
Agent: "Analyze our database schema and generate comprehensive test data:

1. Generate 10,000 users with realistic demographics
2. Create 5,000 products across various categories
3. Generate 50,000 orders with realistic patterns
4. Add 25,000 product reviews with natural distributions
5. Ensure all foreign keys and constraints are satisfied
6. Include edge cases and boundary conditions
7. Maintain realistic business relationships"
```
```javascript
// AI generates comprehensive test data factory
class TestDataFactory {
  generateUsers(count = 1000) {
    return Array.from({ length: count }, (_, i) => ({
      id: `user_${i + 1}`,
      email: `user${i + 1}@${this.randomDomain()}`,
      name: this.generateRealisticName(),
      age: this.normalDistribution(35, 10, 18, 80),
      registeredAt: this.randomDateInRange(
        new Date('2020-01-01'),
        new Date()
      ),
      preferences: {
        newsletter: Math.random() > 0.3,
        theme: Math.random() > 0.6 ? 'dark' : 'light',
        language: this.weightedChoice({
          'en': 0.7, 'es': 0.15, 'fr': 0.10, 'de': 0.05
        })
      }
    }));
  }

  generateProducts(count = 500) {
    const categories = ['electronics', 'clothing', 'home', 'books', 'sports'];

    return Array.from({ length: count }, (_, i) => ({
      id: `product_${i + 1}`,
      name: this.generateProductName(),
      category: categories[Math.floor(Math.random() * categories.length)],
      price: this.logNormalDistribution(3.5, 1.2, 0.99, 9999.99),
      description: this.generateProductDescription(),
      inStock: Math.random() > 0.1, // 90% in stock
      tags: this.generateTags(),
      createdAt: this.randomDateInRange(
        new Date('2019-01-01'),
        new Date()
      )
    }));
  }

  generateOrders(users, products, count = 5000) {
    return Array.from({ length: count }, (_, i) => {
      const user = users[Math.floor(Math.random() * users.length)];
      const orderItems = this.generateOrderItems(products);
      const total = orderItems.reduce((sum, item) =>
        sum + (item.price * item.quantity), 0
      );

      return {
        id: `order_${i + 1}`,
        userId: user.id,
        items: orderItems,
        total: parseFloat(total.toFixed(2)),
        status: this.weightedChoice({
          'completed': 0.85, 'processing': 0.10, 'cancelled': 0.05
        }),
        createdAt: this.businessHoursWeighted(),
        shippingAddress: this.generateAddress()
      };
    });
  }

  // Utility methods for realistic data generation
  normalDistribution(mean, stdDev, min, max) {
    let value;
    do {
      value = this.boxMullerTransform() * stdDev + mean;
    } while (value < min || value > max);
    return Math.round(value);
  }

  businessHoursWeighted() {
    // More orders during business hours and weekdays
    const now = new Date();
    const dayWeight = now.getDay() === 0 || now.getDay() === 6 ? 0.3 : 1.0;
    const hourWeight = now.getHours() >= 9 && now.getHours() <= 17 ? 1.5 : 0.7;

    return this.randomDateWeighted(dayWeight * hourWeight);
  }
}

// Generate complete dataset
const factory = new TestDataFactory();
const users = factory.generateUsers(10000);
const products = factory.generateProducts(5000);
const orders = factory.generateOrders(users, products, 50000);
```
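The factory above calls several helpers it never defines (`weightedChoice`, `boxMullerTransform`, `normalDistribution`). A minimal, self-contained sketch of those three, written here as standalone functions rather than the class methods the example assumes, could look like:

```javascript
// Pick a key from a { value: probability } map whose weights sum to ~1.
function weightedChoice(weights) {
  const r = Math.random();
  let cumulative = 0;
  for (const [value, weight] of Object.entries(weights)) {
    cumulative += weight;
    if (r < cumulative) return value;
  }
  // Fall back to the last key if floating-point error left a tiny gap
  return Object.keys(weights).pop();
}

// Standard normal sample from two uniform draws (Box-Muller transform)
function boxMullerTransform() {
  let u = 0, v = 0;
  while (u === 0) u = Math.random(); // avoid log(0)
  while (v === 0) v = Math.random();
  return Math.sqrt(-2.0 * Math.log(u)) * Math.cos(2.0 * Math.PI * v);
}

// Rejection-sample a rounded normal value clamped to [min, max]
function normalDistribution(mean, stdDev, min, max) {
  let value;
  do {
    value = boxMullerTransform() * stdDev + mean;
  } while (value < min || value > max);
  return Math.round(value);
}
```

The rejection loop in `normalDistribution` is what keeps generated ages inside realistic bounds while preserving the bell-curve shape around the mean.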
```bash
# Generate comprehensive test data suite
claude "Create test data generation system for our application:

Database schema:
- users (id, email, name, age, preferences, created_at)
- products (id, name, category, price, description, in_stock)
- orders (id, user_id, total, status, created_at, items[])
- reviews (id, user_id, product_id, rating, comment, created_at)

Requirements:
1. Generate realistic data that follows business rules
2. Maintain referential integrity across all tables
3. Include edge cases and boundary conditions
4. Create data factories for different scenarios
5. Support multiple environments (dev, staging, test)
6. Include data validation and quality checks

Generate:
- 10K users with realistic demographics
- 5K products across various categories
- 50K orders with realistic purchase patterns
- 25K reviews with natural rating distributions

Ensure GDPR compliance and data privacy."

# Claude generates complete test data system with:
# - Schema-aware data factories
# - Realistic business relationships
# - Privacy-compliant synthetic data
# - Automated quality validation
```
Prompt: “Generate industry-specific test data that follows regulatory requirements and domain expertise.”
```
// Agent Mode: HIPAA-compliant healthcare test data
Agent: "Generate healthcare test data that complies with HIPAA:

1. Create synthetic patient records
2. Generate realistic medical conditions based on age/demographics
3. Include appropriate medications and lab results
4. Ensure all PII is properly de-identified
5. Create patient journey scenarios
6. Follow clinical guidelines for data relationships"
```
```javascript
// AI generates compliant healthcare test data
class HealthcareTestData {
  generatePatients(count = 1000) {
    return Array.from({ length: count }, (_, i) => {
      const conditions = this.generateConditions();

      return {
        // De-identified data only
        patientId: `PAT_${String(i + 1).padStart(6, '0')}`,
        demographics: {
          ageRange: this.getAgeRange(), // Instead of exact age
          gender: Math.random() > 0.5 ? 'M' : 'F',
          zipCode: this.getGeneralizedZip(), // 3-digit zip only
          ethnicity: this.getEthnicityDistribution()
        },

        conditions,
        medications: this.generateMedications(conditions),
        labResults: this.generateLabResults(),

        // Date shifting for privacy
        admissionDate: this.shiftDate(this.randomDate(), i),

        // Clinical scenarios
        scenario: this.selectClinicalScenario()
      };
    });
  }

  generateConditions() {
    // Based on ICD-10 prevalence data
    const commonConditions = [
      { code: 'I10', name: 'Hypertension', prevalence: 0.45 },
      { code: 'E78.5', name: 'Hyperlipidemia', prevalence: 0.28 },
      { code: 'E11', name: 'Type 2 Diabetes', prevalence: 0.11 },
      { code: 'J44', name: 'COPD', prevalence: 0.06 }
    ];

    return commonConditions.filter(condition =>
      Math.random() < condition.prevalence
    );
  }

  generateMedications(conditions) {
    // Medications derived from the patient's generated conditions
    const medicationMap = {
      'I10': ['Lisinopril', 'Amlodipine'],
      'E78.5': ['Atorvastatin', 'Simvastatin'],
      'E11': ['Metformin', 'Insulin'],
      'J44': ['Albuterol', 'Prednisone']
    };

    return conditions.flatMap(condition =>
      medicationMap[condition.code] || []
    );
  }

  shiftDate(originalDate, patientIndex) {
    // Consistent date shifting per patient for privacy
    const shift = (patientIndex * 17) % 365; // Deterministic shift
    const shiftedDate = new Date(originalDate);
    shiftedDate.setDate(shiftedDate.getDate() + shift);
    return shiftedDate;
  }
}
```
```
// Agent Mode: Financial test data with compliance
Agent: "Generate financial test data that meets regulatory requirements:

1. Create transaction data with realistic patterns
2. Include fraud detection scenarios
3. Generate portfolio data with market movements
4. Ensure PCI DSS compliance for payment data
5. Add AML/KYC testing scenarios
6. Include stress testing data"
```
```javascript
// AI generates compliant financial test data
class FinancialTestData {
  generateTransactions(count = 10000) {
    return Array.from({ length: count }, (_, i) => {
      const baseAmount = this.logNormalDistribution(2.5, 1.5);
      const isFraud = Math.random() < 0.001; // 0.1% fraud rate

      return {
        transactionId: `TXN_${Date.now()}_${i}`,
        // Tokenized card number (PCI compliant)
        cardToken: this.generateCardToken(),
        amount: isFraud ? this.generateFraudAmount() : baseAmount,
        merchantCategory: this.selectMerchantCategory(),
        location: this.generateLocation(),
        timestamp: this.generateRealisticTimestamp(),

        // Risk indicators
        riskScore: isFraud ? Math.random() * 0.3 + 0.7 : Math.random() * 0.3,
        velocityFlag: this.checkVelocity(i),
        locationFlag: this.checkLocationAnomaly(),

        // Compliance markers
        amlFlag: Math.random() < 0.0001, // Very rare
        kycStatus: 'verified',
        regulatoryReporting: this.needsReporting(baseAmount)
      };
    });
  }

  generatePortfolioData() {
    const assets = ['AAPL', 'GOOGL', 'MSFT', 'TSLA', 'BTC-USD'];

    return assets.map(symbol => ({
      symbol,
      currentPrice: this.generatePrice(symbol),
      historicalPrices: this.generatePriceHistory(symbol),
      volatility: this.calculateVolatility(symbol),
      beta: this.calculateBeta(symbol),
      marketEvents: this.generateMarketEvents()
    }));
  }

  generateCardToken() {
    // Generate PCI-compliant tokenized card number
    return `TOK_${Math.random().toString(36).slice(2, 18).toUpperCase()}`;
  }

  generateFraudAmount() {
    // Fraud transactions often have specific patterns
    const patterns = [
      () => Math.round(Math.random() * 100) + 0.01, // Small amounts
      () => Math.round(Math.random() * 500) * 2,    // Round amounts
      () => 9999.99                                 // Just under limits
    ];

    return patterns[Math.floor(Math.random() * patterns.length)]();
  }
}
```
```typescript
// AI-powered PII detection and masking
class PrivacyProtector {
  async protectSensitiveData(data: any[], options: PrivacyOptions = {}) {
    const detected = await this.detectPII(data);

    return this.ai.maskData({
      data,
      detected,

      strategies: {
        names: options.preserveFormat
          ? 'format_preserving_encryption'
          : 'synthetic_replacement',
        emails: 'consistent_pseudonymization',
        phones: 'partial_masking',
        ssn: 'tokenization',
        addresses: 'generalization',

        // Custom identifiers
        custom: await this.ai.detectCustomPII({
          data,
          context: options.businessContext,
          sensitivity: options.sensitivityLevel
        })
      },

      // Maintain referential integrity
      consistency: {
        crossTable: true,
        crossDatabase: options.globalConsistency,
        temporalShift: options.dateShifting
      }
    });
  }

  async detectPII(data: any[]) {
    // AI detects various PII types
    return this.ai.scanForPII({
      data,

      detectors: {
        standard: ['names', 'emails', 'phones', 'ssn', 'addresses'],
        contextual: ['account_numbers', 'employee_ids', 'medical_records'],
        behavioral: ['access_patterns', 'location_traces', 'communication_graphs'],

        // AI learns custom patterns
        learned: await this.ai.learnPIIPatterns({
          samples: data.slice(0, 1000),
          feedback: this.historicalFeedback
        })
      },

      confidence: {
        threshold: 0.8,
        review: 'flag_uncertain',
        sampling: 'stratified'
      }
    });
  }

  async generatePrivacyReport(data: any[], masked: any[]) {
    return {
      summary: {
        recordsProcessed: data.length,
        piiDetected: await this.countPII(data),
        piiMasked: await this.countPII(masked),
        dataUtility: await this.measureUtility(data, masked)
      },

      compliance: {
        gdpr: await this.checkGDPRCompliance(masked),
        ccpa: await this.checkCCPACompliance(masked),
        hipaa: await this.checkHIPAACompliance(masked),
        industrySpecific: await this.checkIndustryCompliance(masked)
      },

      risks: await this.ai.assessReidentificationRisk({
        masked,
        attackModels: ['linkage', 'inference', 'auxiliary_data'],
        publicData: 'consider'
      })
    };
  }
}

// Usage example
const protector = new PrivacyProtector();

const maskedData = await protector.protectSensitiveData(productionData, {
  preserveFormat: true,
  businessContext: 'customer_analytics',
  sensitivityLevel: 'high',
  globalConsistency: true,
  dateShifting: { method: 'consistent', range: [-365, 365] }
});

const report = await protector.generatePrivacyReport(productionData, maskedData);
```
Prompt: “Set up automated test data lifecycle management with version control and environment provisioning.”
```
// Agent Mode: Test data lifecycle automation
Agent: "Create comprehensive test data lifecycle management:

1. Set up data versioning and change tracking
2. Automate environment-specific data provisioning
3. Create data refresh and cleanup schedules
4. Implement data quality monitoring
5. Set up cross-environment data synchronization
6. Create rollback and recovery procedures"
```
```javascript
// AI generates data lifecycle management system
class TestDataLifecycle {
  async setupEnvironment(environmentConfig) {
    const { environment, dataSizes, features } = environmentConfig;

    return {
      // Environment-specific data sizing
      dataSets: await this.generateEnvironmentData({
        environment,
        sizes: {
          development: { scale: 0.01, focus: 'edge_cases' },
          testing: { scale: 0.1, focus: 'comprehensive' },
          staging: { scale: 0.5, focus: 'production_like' },
          performance: { scale: 1.0, focus: 'load_testing' }
        }[environment]
      }),

      // Feature flag data
      featureData: await this.generateFeatureData(features),

      // Environment provisioning
      provisioning: {
        database: await this.provisionDatabase(environment),
        storage: await this.provisionStorage(environment),
        services: await this.provisionServices(environment)
      },

      // Automated refresh schedule
      refreshSchedule: this.createRefreshSchedule(environment)
    };
  }

  async manageDataVersions() {
    return {
      versioning: {
        strategy: 'semantic_versioning', // v1.2.3
        triggers: {
          major: 'schema_breaking_changes',
          minor: 'new_test_scenarios',
          patch: 'data_quality_fixes'
        },
        storage: 'git_lfs_with_metadata'
      },

      changelog: await this.generateChangeLog(),
      migration: await this.createMigrationScripts(),
      rollback: await this.setupRollbackProcedures()
    };
  }

  async monitorDataQuality() {
    return {
      metrics: {
        completeness: await this.checkDataCompleteness(),
        consistency: await this.checkDataConsistency(),
        accuracy: await this.validateBusinessRules(),
        timeliness: await this.checkDataFreshness()
      },

      alerts: {
        dataQualityDegradation: 'immediate',
        schemaViolations: 'immediate',
        performanceIssues: 'warning',
        storageCapacity: 'daily'
      },

      automation: {
        qualityGates: await this.setupQualityGates(),
        autoRemediation: await this.setupAutoRemediation(),
        reporting: await this.setupReporting()
      }
    };
  }
}
```
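The versioning triggers above (major for schema-breaking changes, minor for new scenarios, patch for quality fixes) reduce to a small bump rule. A standalone sketch, using the trigger names from the config, with a hypothetical function name:

```javascript
// Map a change type to a semantic-version bump, following the
// trigger categories from the versioning config.
function bumpDataVersion(version, changeType) {
  let [major, minor, patch] = version.split('.').map(Number);
  switch (changeType) {
    case 'schema_breaking_changes': major += 1; minor = 0; patch = 0; break;
    case 'new_test_scenarios':      minor += 1; patch = 0;            break;
    case 'data_quality_fixes':      patch += 1;                       break;
    default: throw new Error(`Unknown change type: ${changeType}`);
  }
  return `${major}.${minor}.${patch}`;
}
```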
```bash
# Set up comprehensive test data lifecycle management
claude "Create automated test data lifecycle system:

Todo:
- [ ] Set up data versioning with semantic versioning
- [ ] Create environment-specific data provisioning
- [ ] Implement automated data refresh schedules
- [ ] Set up data quality monitoring and alerting
- [ ] Create data lineage and change tracking
- [ ] Implement rollback and recovery procedures
- [ ] Set up cross-environment synchronization
- [ ] Create compliance and audit trails

Environments:
- Development (1% data, edge cases focus)
- Testing (10% data, comprehensive coverage)
- Staging (50% data, production-like)
- Performance (100% data, load testing)

Tools: Git LFS, database migration tools, monitoring"

# Claude sets up complete lifecycle management
```
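The environment percentages in the prompt translate directly into subset sizing. A trivial sketch, with scale factors copied from the list above and a hypothetical helper name:

```javascript
// Scale factors from the environment list: dev 1%, testing 10%,
// staging 50%, performance 100%.
const ENV_SCALES = {
  development: 0.01,
  testing: 0.1,
  staging: 0.5,
  performance: 1.0
};

// Compute how many rows a given environment's subset should contain.
function subsetSize(totalRows, environment) {
  const scale = ENV_SCALES[environment];
  if (scale === undefined) throw new Error(`Unknown environment: ${environment}`);
  return Math.ceil(totalRows * scale);
}
```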
## Best Practices for Test Data Management
<CardGrid>
  <Card title="Schema-First Approach" icon="database">
    Always start with your database schema. AI can generate realistic data that respects all constraints and relationships automatically.
  </Card>
  <Card title="Privacy by Design" icon="shield">
    Build privacy protection into your data pipeline from day one. It's exponentially harder to retrofit privacy compliance later.
  </Card>
  <Card title="Version Control Everything" icon="git-branch">
    Treat test data like code. Version it, track changes, and maintain lineage for debugging and compliance.
  </Card>
  <Card title="Automate Quality Gates" icon="check-circle">
    Set up automated validation for data quality, privacy compliance, and business rule adherence.
  </Card>
</CardGrid>
Prompt: “Integrate test data management into our CI/CD pipeline with automated quality gates.”
```yaml
name: Test Data Pipeline

on:
  push:
    paths: ['schema/**', 'migrations/**']
  schedule:
    - cron: '0 2 * * 1' # Weekly refresh

jobs:
  data-pipeline:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [development, testing, staging]

    steps:
      - uses: actions/checkout@v3

      - name: Generate Test Data
        run: |
          # AI generates environment-specific data
          npm run generate-test-data -- \
            --environment ${{ matrix.environment }} \
            --schema ./schema/latest.sql \
            --size auto \
            --privacy-compliance gdpr,ccpa

      - name: Validate Data Quality
        run: |
          npm run validate-data-quality -- \
            --check referential-integrity \
            --check business-rules \
            --check privacy-compliance \
            --generate-report

      - name: Deploy to Environment
        run: |
          npm run deploy-test-data -- \
            --target ${{ matrix.environment }} \
            --backup-existing \
            --verify-deployment
```
## Common Pitfalls
<Aside type="caution">
**Watch Out For These Test Data Management Mistakes**:

1. **Using production data without proper masking** - This is a compliance nightmare waiting to happen
2. **Generating unrealistic data** - Pretty data that doesn't reflect real-world messiness will miss bugs
3. **Ignoring referential integrity** - Broken relationships lead to false test failures
4. **Not refreshing test data** - Stale data leads to tests that pass but production that fails
5. **Over-subsetting** - Too small datasets miss edge cases and performance issues
</Aside>
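The referential-integrity pitfall is cheap to guard against. A minimal sketch of a check that flags orders pointing at nonexistent users, using the field names from the schema on this page and a hypothetical helper name:

```javascript
// Return orders whose userId has no matching user: these would cause
// false failures (or silent gaps) in downstream tests.
function findOrphanedOrders(users, orders) {
  const userIds = new Set(users.map(user => user.id));
  return orders.filter(order => !userIds.has(order.userId));
}
```

Running a check like this as a quality gate after every data refresh catches broken relationships before they reach a test suite.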
## Integration Examples
### CI/CD Pipeline Integration
```yaml
# .gitlab-ci.yml with AI test data management
stages:
  - prepare-data
  - test
  - cleanup

prepare-test-data:
  stage: prepare-data
  script:
    - |
      # AI generates test data based on branch changes
      ai-data-gen analyze-changes \
        --branch $CI_COMMIT_BRANCH \
        --generate-for affected-features

      # Create optimized subset
      ai-data-gen create-subset \
        --source production-replica \
        --size smart-sizing \
        --coverage branch-specific

      # Mask sensitive data
      ai-data-gen mask \
        --compliance gdpr,pci \
        --preserve-format true
  artifacts:
    paths:
      - test-data/
    expire_in: 1 day

test-with-data:
  stage: test
  script:
    - |
      # Load test data
      ai-data-gen load \
        --environment $CI_ENVIRONMENT_NAME \
        --parallel true

      # Run tests with AI monitoring
      ai-test-runner execute \
        --monitor-data-usage \
        --optimize-on-fly

      # Generate data quality report
      ai-data-gen report \
        --include coverage,quality,privacy
```