Million+ LOC Strategies

Navigating million-line codebases feels like exploring a vast city without a map. Every change ripples through countless dependencies, and understanding the full impact requires superhuman memory. This guide shows how AI coding assistants transform this complexity into manageable workflows.

When your codebase crosses the million-line threshold, traditional development approaches break down. You’re dealing with:

  • Cognitive Overload: No single developer can hold the entire system architecture in their head
  • Hidden Coupling: Dependencies buried deep across module boundaries
  • Legacy Archaeology: Code written by developers who left years ago
  • Performance Bottlenecks: IDEs and tools that choke on the sheer volume
  • Context Fragmentation: Different teams with different conventions and patterns

The solution isn’t working harder—it’s leveraging AI assistants that can process and understand code at machine scale while you focus on architectural decisions.

Unlimited Working Memory

While humans struggle with 7±2 items in working memory, AI models can analyze hundreds of files simultaneously, tracking dependencies you’d never spot manually.

Semantic Understanding

AI doesn’t just grep for strings—it understands code intent, finding conceptually related functions across different naming conventions and implementations.

Pattern Detection at Scale

Identifies anti-patterns, duplicated logic, and optimization opportunities that would take months of manual code review to discover.

Fearless Refactoring

Make sweeping changes across thousands of files with confidence, as AI tracks all impacts and suggests necessary adjustments.

Before diving into strategies, you need the right tools. These MCP servers transform how AI assistants understand and navigate massive codebases:

When dealing with millions of lines, traditional text search fails. You need semantic understanding.

Installation for Claude Code:

claude mcp add code-context -e OPENAI_API_KEY=your-api-key -e MILVUS_TOKEN=your-zilliz-key -- npx @zilliz/code-context-mcp@latest

Installation for Cursor: Add to ~/.cursor/mcp.json:

{
  "mcpServers": {
    "code-context": {
      "command": "npx",
      "args": ["-y", "@zilliz/code-context-mcp@latest"],
      "env": {
        "EMBEDDING_PROVIDER": "OpenAI",
        "OPENAI_API_KEY": "your-api-key",
        "MILVUS_TOKEN": "your-zilliz-key"
      }
    }
  }
}

This server uses vector embeddings to understand code semantically. Ask “find all authentication flows” and it understands the concept across different implementations—whether it’s OAuth, JWT, or session-based.
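Under the hood, this kind of search embeds both your query and code chunks, then ranks by vector similarity rather than exact text. Here is a minimal, illustrative sketch of the idea in Node.js, assuming the official openai package and an OPENAI_API_KEY in the environment (the MCP server handles chunking, indexing, and vector storage for you):

import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Cosine similarity between two embedding vectors
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function embed(texts) {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: texts
  });
  return res.data.map((d) => d.embedding);
}

// `chunks` would come from walking the repo and splitting files into snippets
async function searchCode(query, chunks) {
  const [queryVec, ...chunkVecs] = await embed([query, ...chunks.map((c) => c.text)]);
  return chunks
    .map((chunk, i) => ({ ...chunk, score: cosine(queryVec, chunkVecs[i]) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 5); // top matches, regardless of naming conventions
}

This is why "find all authentication flows" surfaces OAuth, JWT, and session code together: they sit near each other in embedding space even when they share no keywords.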

High-Performance Pattern Search: Ripgrep MCP

For blazing-fast regex and pattern matching across massive codebases:

Installation for Claude Code:

claude mcp add-json "ripgrep" '{"command":"npx","args":["-y","mcp-ripgrep@latest"]}'

Sample Usage:

"Use ripgrep to find all TODO comments with high priority across the codebase"
"Search for all SQL queries that might be vulnerable to injection"
"Find all API endpoints that don't have rate limiting"

For sensitive enterprise codebases that can’t use cloud services:

Luoto Local Code Search uses ChromaDB for on-premise vector search:

# Configure environment
PROJECTS_ROOT=~/enterprise/code
FOLDERS_TO_INDEX=core-services,payment-engine,user-platform
# Add to Claude Code
claude mcp add-json "workspace-code-search" '{"url":"http://localhost:8978/sse"}'

This keeps your code local while providing semantic search capabilities—crucial for financial services, healthcare, or defense contractors.

You’ve inherited a 3-million line monolith. Where do you even start? Here’s a battle-tested approach:

Start with broad architectural understanding:

"Analyze this codebase and create a mental model of the system architecture.
Focus on:
1. Core business domains
2. Service boundaries
3. Data flow patterns
4. External dependencies
Present as a high-level overview suitable for a new senior engineer."

Then drill into specific areas:

"Using code-context search, find all payment processing flows.
I need to understand:
- Entry points for payment requests
- State management during processing
- Integration with external payment providers
- Error handling and retry logic"

Understanding how modules interconnect is crucial for safe refactoring:

"Create a dependency graph for the UserService module:
1. What services does it depend on?
2. What services depend on it?
3. Are there any circular dependencies?
4. Which dependencies look problematic or tightly coupled?"

Follow up with specific investigations:

"The UserService depends on 47 other services.
Help me identify which dependencies are truly necessary
vs. which could be refactored to use events or interfaces."
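To sanity-check the AI's answer about cycles, a small script over a module-to-imports map (for example, exported by a tool like madge or your build system) is enough. A minimal sketch:

// Depth-first search for cycles in a dependency map (illustrative sketch)
function findCycles(graph) {
  const visiting = new Set();
  const visited = new Set();
  const cycles = [];

  function visit(node, path) {
    if (visiting.has(node)) {
      cycles.push([...path.slice(path.indexOf(node)), node]); // found a cycle
      return;
    }
    if (visited.has(node)) return;
    visiting.add(node);
    for (const dep of graph[node] || []) visit(dep, [...path, node]);
    visiting.delete(node);
    visited.add(node);
  }

  for (const node of Object.keys(graph)) visit(node, []);
  return cycles;
}

// Hypothetical example: UserService and OrderService import each other
const graph = {
  UserService: ['OrderService', 'EmailService'],
  OrderService: ['UserService'],
  EmailService: []
};
console.log(findCycles(graph)); // [[ 'UserService', 'OrderService', 'UserService' ]]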

The biggest mistake developers make with large codebases? Trying to load everything at once. Your AI assistant doesn’t need to see all 3 million lines—it needs the right context at the right time.

Think of context like zooming on a map. Start with continent view, then country, then city, then street:

  1. Domain Level (10,000 ft view)

    "What are the main bounded contexts in this system?"
    "How do the payment, user, and inventory domains interact?"
  2. Service Level (1,000 ft view)

    "Within the payment domain, explain the service architecture"
    "What are the main APIs exposed by payment services?"
  3. Component Level (100 ft view)

    "Show me how PaymentProcessor handles credit card transactions"
    "What's the retry strategy for failed payments?"
  4. Implementation Level (ground level)

    "In PaymentProcessor.processCard(), why is there a 30-second timeout?"
    "Should we refactor this synchronized block?"

Pattern 1: Hierarchical CLAUDE.md Files

/CLAUDE.md                        # System-wide conventions
/services/CLAUDE.md               # Service layer patterns
/services/payment/CLAUDE.md       # Payment-specific rules
/services/payment/core/CLAUDE.md  # Core payment logic rules

Each level inherits from its parent, creating focused context:

# In /services/payment/CLAUDE.md
This service handles all payment processing.
Key principles:
- All amounts in cents to avoid floating point
- Idempotency keys required for all transactions
- PCI compliance: never log full card numbers
Common patterns in this service:
- Repository pattern for data access
- Command pattern for payment operations
- Event sourcing for transaction history
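Those conventions map directly onto code. A small, hypothetical illustration of what "amounts in cents" and "idempotency keys required" look like at a call site (all names invented for the sketch):

import { randomUUID } from 'node:crypto';

// Amounts are integer cents; never floats
function toCents(dollars) {
  return Math.round(dollars * 100);
}

// Every payment operation carries an idempotency key so retries are safe
function createChargeCommand({ customerId, amountCents, idempotencyKey }) {
  if (!Number.isInteger(amountCents)) {
    throw new Error('Amounts must be integer cents');
  }
  return {
    type: 'CHARGE',
    customerId,
    amountCents,
    idempotencyKey: idempotencyKey ?? randomUUID()
    // PCI: card data is referenced by token elsewhere, never logged here
  };
}

const cmd = createChargeCommand({ customerId: 'c-42', amountCents: toCents(19.99) });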

Pattern 2: Context Switching Commands

# Clear context between unrelated tasks
/clear
# Work on payment system
/add services/payment
"Analyze the payment processing flow"
# Switch to user system
/clear
/add services/users
"Review the authentication implementation"

The 80/20 Rule for Context
80% of your questions need only 20% of the codebase. Don’t pollute context with rarely-used utilities:

# Bad: Loading everything
"Load all code and find performance issues"
# Good: Targeted loading
"In the order processing pipeline (services/orders/pipeline/),
identify bottlenecks in Order.process() through Order.ship()"

Progressive Context Expansion
Start narrow and expand only when needed:

# Step 1: Understand the problem
@OrderService.java "Why are orders taking 5+ seconds to process?"
# Step 2: Expand to related services
@orders/ @inventory/ "Is the delay from inventory checks?"
# Step 3: Include infrastructure
@orders/ @database/migrations/ @config/
"Could this be a database indexing issue?"

Refactoring a million-line codebase is like renovating a hospital while surgery is ongoing. You can’t shut everything down for a rewrite. Here’s how to make systematic improvements safely:

Case Study: Migrating from Callbacks to Async/Await

Your Node.js codebase has 50,000 callback functions. Manual migration would take years. Here’s the AI-assisted approach:

Phase 1: Discovery and Planning

"Using ripgrep, find all callback patterns in the codebase.
Categorize them by:
1. Simple callbacks (single async operation)
2. Callback chains (multiple sequential operations)
3. Parallel callbacks (multiple concurrent operations)
4. Error-first callbacks with complex error handling
5. Callbacks with shared state or closures
Generate a migration complexity report."

Sample response:

Found 47,832 callback patterns:
- Simple callbacks: 31,245 (65%) - Low risk
- Callback chains: 9,823 (21%) - Medium risk
- Parallel callbacks: 4,234 (9%) - High risk
- Complex error handling: 1,830 (4%) - High risk
- Shared state callbacks: 700 (1%) - Very high risk
Recommended migration order: Start with simple callbacks
in utility functions, then move to service layer...

Phase 2: Creating Migration Patterns

"For each callback pattern type, create a safe migration template.
Include:
1. The transformation pattern
2. Edge cases to watch for
3. Testing strategy
4. Rollback plan"

The AI generates reusable patterns:

// Before
function loadUser(id, callback) {
  db.query('SELECT * FROM users WHERE id = ?', [id], (err, result) => {
    if (err) return callback(err);
    callback(null, result[0]);
  });
}

// After (with backward compatibility)
async function loadUser(id, callback) {
  // Support both callback and promise style
  if (callback) {
    try {
      const result = await db.query('SELECT * FROM users WHERE id = ?', [id]);
      callback(null, result[0]);
    } catch (err) {
      callback(err);
    }
  } else {
    const result = await db.query('SELECT * FROM users WHERE id = ?', [id]);
    return result[0];
  }
}
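Both calling styles keep working during the migration window, which is what makes the rollout safe. For example:

// Legacy call sites keep working unchanged
loadUser(42, (err, user) => {
  if (err) return console.error(err);
  console.log(user.name);
});

// Migrated call sites simply await the promise
const user = await loadUser(42);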

Phase 3: Automated Migration

"Using the migration patterns, transform all simple callbacks
in the utils/ directory. For each file:
1. Apply the transformation
2. Preserve backward compatibility
3. Add deprecation comments
4. Update or create tests
5. Track migration status"

For massive refactoring efforts, coordinate multiple AI instances:

  1. Partition the Codebase

    "Analyze module dependencies and suggest how to partition
    the codebase for parallel refactoring by 4 developers.
    Minimize inter-team conflicts."
  2. Create Feature Branches

    git checkout -b refactor/team1-user-services
    git checkout -b refactor/team2-payment-services
    git checkout -b refactor/team3-inventory-services
    git checkout -b refactor/team4-shared-utils
  3. Synchronize Progress

    "Review the changes in all refactor/* branches.
    Identify potential conflicts or breaking changes
    between teams' work."
  4. Integration Testing

    "Generate integration tests that verify the refactored
    modules work correctly together. Focus on boundary
    interactions between team territories."

In million-line codebases, performance problems hide in unexpected places. An innocent-looking function called in a tight loop can bring down your entire system. Here’s how to systematically hunt and fix performance issues:

Step 1: Algorithmic Complexity Analysis

"Using code analysis, find all potential O(n²) or worse algorithms.
For each one, determine:
1. How often it's called
2. Typical input size
3. Whether it's on a critical path
4. Suggested optimization approach"

Sample findings:

Found 47 potential quadratic algorithms:
CRITICAL - UserMatcher.findDuplicates()
- Called on every user registration
- Processes 2.3M users
- Current time: ~18 seconds
- Fix: Use hash-based approach, reducing to O(n)
HIGH - ReportGenerator.crossTabulate()
- Called in nightly batch jobs
- Processes 100K x 100K matrix
- Current time: ~4 hours
- Fix: Use sparse matrix representation
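The hash-based fix mentioned for UserMatcher.findDuplicates() is the standard trade of a nested scan for a single pass over a map keyed on whatever defines a duplicate. A sketch, assuming duplicates are defined by email (field names are illustrative):

// O(n²): every user compared against every other user
function findDuplicatesQuadratic(users) {
  const dupes = [];
  for (let i = 0; i < users.length; i++) {
    for (let j = i + 1; j < users.length; j++) {
      if (users[i].email === users[j].email) dupes.push([users[i], users[j]]);
    }
  }
  return dupes;
}

// O(n): one pass, grouping by the field that defines a duplicate
function findDuplicatesLinear(users) {
  const byEmail = new Map();
  for (const user of users) {
    const key = user.email.toLowerCase();
    if (!byEmail.has(key)) byEmail.set(key, []);
    byEmail.get(key).push(user);
  }
  return [...byEmail.values()].filter((group) => group.length > 1);
}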

The #1 performance killer in large applications? Database queries. Here’s a systematic approach:

"Analyze the codebase for database performance anti-patterns:
1. N+1 queries in ORM usage
2. Missing indexes on foreign keys
3. Queries without pagination
4. Unnecessary eager loading
5. Queries inside loops
For each issue found, provide:
- Location in code
- Estimated performance impact
- Specific fix with code example"

The AI identifies:

// Problem: N+1 queries in OrderService
async function getOrdersWithItems(userId) {
  const orders = await Order.findAll({ where: { userId } });
  // This creates N+1 queries!
  for (const order of orders) {
    order.items = await OrderItem.findAll({
      where: { orderId: order.id }
    });
  }
  return orders;
}

Suggested fix:

// Solution: Use eager loading
async function getOrdersWithItems(userId) {
  const orders = await Order.findAll({
    where: { userId },
    include: [{
      model: OrderItem,
      as: 'items'
    }]
  });
  return orders;
}

// Or use raw SQL for complex cases
async function getOrdersWithItemsOptimized(userId) {
  const query = `
    SELECT o.*,
           JSON_AGG(oi.*) as items
    FROM orders o
    LEFT JOIN order_items oi ON oi.order_id = o.id
    WHERE o.user_id = $1
    GROUP BY o.id
  `;
  return await db.query(query, [userId]);
}

Memory leaks in large applications are insidious. They build up slowly until your servers start crashing:

"Search for common memory leak patterns:
1. Event listeners without cleanup
2. Closures holding large objects
3. Circular references
4. Growing caches without limits
5. Timers that never clear
Focus on long-running services and background workers."

The AI finds issues like:

// Memory leak in WebSocketManager
class WebSocketManager {
  constructor() {
    this.connections = new Map();
  }
  addConnection(userId, socket) {
    // LEAK: Old connections never removed!
    this.connections.set(userId, socket);
    socket.on('message', (data) => {
      // LEAK: Closure holds reference to entire manager
      this.handleMessage(userId, data);
    });
  }
}

// Fix with proper cleanup
class WebSocketManager {
  constructor() {
    this.connections = new Map();
  }
  addConnection(userId, socket) {
    // Remove any existing connection
    this.removeConnection(userId);
    const messageHandler = (data) => {
      this.handleMessage(userId, data);
    };
    socket.on('message', messageHandler);
    socket.on('close', () => this.removeConnection(userId));
    this.connections.set(userId, {
      socket,
      handlers: { message: messageHandler }
    });
  }
  removeConnection(userId) {
    const conn = this.connections.get(userId);
    if (conn) {
      conn.socket.removeAllListeners();
      this.connections.delete(userId);
    }
  }
}

Every large codebase has its archaeological layers—code written in different eras, by different teams, with different philosophies. Some of it predates version control. Here’s how to excavate and modernize safely:

You’ve discovered a 15,000-line stored procedure that runs the entire billing system. No one understands it, but it processes $10M daily. Here’s the approach:

The Archaeological Method

"Analyze this stored procedure (billing_mega_proc.sql) and:
1. Create a high-level flow diagram
2. Extract the core business rules
3. Identify the main data transformations
4. Find any hidden side effects
5. Suggest how to decompose it into services
Treat this like reverse-engineering an alien artifact.
Be extremely careful about subtle business logic."

The AI responds with structured analysis:

Analysis of billing_mega_proc.sql:
CORE BUSINESS RULES DISCOVERED:
1. Tiered pricing with 47 special cases
2. Legacy customer discounts (some hardcoded by ID!)
3. Complex tax calculation with regional variations
4. Audit trail generation (writes to 6 different tables)
HIDDEN BEHAVIORS:
- Sends emails directly from the database (!)
- Updates inventory as side effect
- Creates accounting entries
- Has undocumented "debug mode" when user_id = 1
DECOMPOSITION STRATEGY:
1. Extract pricing engine (pure functions)
2. Separate tax calculator service
3. Move email sending to message queue
4. Create proper audit service

Rather than rewriting everything at once, gradually strangle the legacy system:

"Create a modern API gateway that wraps the legacy billing system.
Requirements:
- RESTful endpoints for all billing operations
- Keep calling the legacy stored procedure internally
- Add proper error handling and logging
- Return consistent JSON responses
- Add OpenAPI documentation"

This gives you:

// Modern API wrapping legacy system
@Post('/billing/calculate')
async calculateBilling(@Body() request: BillingRequest) {
  try {
    // Still calls the monster stored proc
    const result = await this.db.callProcedure(
      'billing_mega_proc',
      request.customerId,
      request.items,
      request.date
    );
    // Transform cryptic response to modern format
    return {
      success: true,
      invoice: this.transformLegacyInvoice(result),
      timestamp: new Date().toISOString()
    };
  } catch (error) {
    // Legacy proc returns errors as magic numbers
    const errorMessage = this.decodeLegacyError(error);
    throw new HttpException(errorMessage, 400);
  }
}
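The decodeLegacyError helper above is usually nothing more than a lookup table that turns the procedure’s numeric codes into readable messages. A hypothetical sketch (the codes shown are invented):

// Hypothetical mapping of legacy error codes to readable messages
const LEGACY_BILLING_ERRORS = {
  '-1': 'Customer not found',
  '-7': 'Invalid billing period',
  '-42': 'Pricing tier mismatch'
};

function decodeLegacyError(error) {
  const code = String(error.code ?? error);
  return LEGACY_BILLING_ERRORS[code] ?? `Unknown legacy billing error (code ${code})`;
}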

When documentation doesn’t exist, tests become your documentation:

"This legacy module has no tests or documentation.
Create a comprehensive test suite that:
1. Documents current behavior through test names
2. Captures all edge cases you can find
3. Tests error conditions and their magic numbers
4. Verifies side effects
The tests should serve as living documentation."

The AI generates characterization tests:

describe('Legacy OrderProcessor - Current Behavior', () => {
  describe('Happy Path', () => {
    it('processes standard order with single item', async () => {
      // Documents current behavior
      const result = await processOrder({
        customerId: 123,
        items: [{ sku: 'WIDGET-1', quantity: 1 }]
      });
      expect(result.status).toBe(1); // Magic number: 1 = success
      expect(result.orderId).toMatch(/^ORD-\d{8}$/); // Format: ORD-YYYYMMDD
    });
  });
  describe('Edge Cases - Discovered Through Testing', () => {
    it('applies secret 50% discount for customer ID 777', async () => {
      // Undocumented behavior found through exploration!
      const result = await processOrder({
        customerId: 777, // Magic customer
        items: [{ sku: 'ANY-ITEM', quantity: 1, price: 100 }]
      });
      expect(result.totalPrice).toBe(50);
    });
    it('fails with error code -99 when inventory is negative', async () => {
      // More magic numbers documented
      const result = await processOrder({
        customerId: 123,
        items: [{ sku: 'OUT-OF-STOCK', quantity: 1 }]
      });
      expect(result.status).toBe(-99); // -99 = inventory error
    });
  });
});

In million-line codebases, different teams own different territories. The challenge? Making changes that span multiple domains without stepping on each other’s toes.

When Team A needs to understand Team B’s code:

The API Contract Generator

"Analyze the payment service (owned by Team FinTech) and generate:
1. OpenAPI specification for all endpoints
2. Event schemas for all published events
3. Database schemas for shared tables
4. Example requests/responses
5. Common error scenarios and handling
Format this for the mobile team who need to integrate."

This creates clear boundaries:

# Generated payment-service-api.yaml
openapi: 3.0.0
info:
  title: Payment Service API
  version: 2.3.0
  contact:
    team: FinTech
    slack: "#team-fintech"
paths:
  /payments/process:
    post:
      summary: Process a payment
      description: |
        Handles credit card, PayPal, and ACH payments.
        Idempotent using X-Idempotency-Key header.
      x-rate-limit: 100 requests per minute
      x-sla: 99.9% uptime, <500ms p99 latency

Before making breaking changes:

"I need to refactor the UserService.authenticate() method.
Analyze:
1. All services that call this method
2. The parameters they pass
3. How they handle the response
4. What errors they expect
5. Any indirect dependencies
Create a migration plan that won't break other teams."

The AI provides a detailed impact report:

Direct consumers of UserService.authenticate():
1. APIGateway - 1,247 calls
- Uses: username/password auth
- Expects: {userId, token, expiresAt}
- Error handling: Catches AuthError only
2. MobileBackend - 892 calls
- Uses: biometric token auth
- Expects: {userId, token, refreshToken}
- Error handling: Has retry logic
3. AdminPortal - 234 calls
- Uses: SSO token auth
- Expects: {userId, permissions[], token}
- Error handling: Falls back to legacy auth
MIGRATION PLAN:
1. Add new method authenticateV2() with new signature
2. Update authenticate() to call V2 internally
3. Notify teams with 2-week migration window
4. Monitor usage, help teams migrate
5. Deprecate old method after 30 days
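Step 2 of that plan, keeping the old signature while delegating to the new one, is simple to express. A sketch with the session handling stubbed out (this.sessionStore is a hypothetical dependency; names follow the plan above):

class UserService {
  // New contract: credentials object in, richer response shape out
  async authenticateV2(credentials) {
    const session = await this.sessionStore.create(credentials); // hypothetical store
    return {
      userId: session.userId,
      token: session.token,
      refreshToken: session.refreshToken,
      permissions: session.permissions,
      expiresAt: session.expiresAt
    };
  }

  // Old signature kept for the migration window; delegates to V2 internally.
  // Deprecated: remove once APIGateway, MobileBackend, and AdminPortal migrate.
  async authenticate(username, password) {
    const result = await this.authenticateV2({ type: 'password', username, password });
    // Existing callers rely only on these three fields
    return { userId: result.userId, token: result.token, expiresAt: result.expiresAt };
  }
}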

Keep documentation in sync with code:

"Generate a C4 model diagram showing:
- Context: How payment system fits in the ecosystem
- Container: Services within payment domain
- Component: Internal structure of payment-processor
- Code: Key classes and their relationships
Use PlantUML syntax so it's version controlled."

The AI generates:

@startuml Payment System - Container Diagram
!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Container.puml

Person(customer, "Customer", "Makes purchases")
System_Ext(bank, "Banking System", "Processes transactions")

System_Boundary(payment, "Payment Domain") {
  Container(api, "Payment API", "Node.js", "REST API for payment ops")
  Container(processor, "Payment Processor", "Java", "Core payment logic")
  Container(fraud, "Fraud Detection", "Python", "ML-based fraud checks")
  ContainerDb(db, "Payment DB", "PostgreSQL", "Transaction records")
  Container(events, "Event Bus", "Kafka", "Payment events")
}

Rel(customer, api, "Makes payment", "HTTPS")
Rel(api, processor, "Process payment", "gRPC")
Rel(processor, fraud, "Check fraud", "HTTP")
Rel(processor, bank, "Charge card", "ISO 8583")
Rel(processor, db, "Store transaction", "SQL")
Rel(processor, events, "Publish events", "Kafka")
@enduml

Let’s tackle some actual challenges you’ll face in million-line codebases:

Your notification logic is scattered across 47 different services. Time to extract it:

The Extraction Process

"Phase 1 - Discovery:
Using semantic search, find ALL notification-related code:
- Email sending
- SMS dispatch
- Push notifications
- In-app alerts
- Webhook deliveries
Include: scheduling, templating, retry logic, and preferences."

Followed by:

"Phase 2 - Untangling:
For each notification code fragment found:
1. What data does it need?
2. What services does it call?
3. What triggers it?
4. How does it handle failures?
5. What would break if we moved it?
Create a dependency graph."

Then systematically extract:

"Phase 3 - Service Design:
Design a notification service that:
1. Handles all current notification types
2. Provides a unified API
3. Maintains backward compatibility
4. Scales independently
5. Includes migration strategy
Generate the API specification and migration plan."
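One plausible shape for the resulting service is a single notify() entry point that routes to channel adapters. The sketch below is illustrative, not a finished design; channel names and payload fields are assumptions:

class NotificationService {
  constructor(channels) {
    // channels: { email, sms, push, inApp, webhook } adapters, each exposing send()
    this.channels = channels;
  }

  async notify({ userId, channel, template, data, idempotencyKey }) {
    const adapter = this.channels[channel];
    if (!adapter) throw new Error(`Unknown notification channel: ${channel}`);
    // Preferences, templating, retries, and dead-lettering would hook in here
    return adapter.send({ userId, template, data, idempotencyKey });
  }
}

// Callers across the 47 services collapse to one call, e.g.:
// await notifications.notify({
//   userId: 'u-123',
//   channel: 'email',
//   template: 'order-shipped',
//   data: { orderId: 'ORD-20240101' }
// });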

Scenario: Zero-Downtime Database Migration

Your user table has 500 million rows and needs sharding:

"Analyze our user table access patterns:
1. What queries hit this table?
2. What's the read/write ratio?
3. How are JOINs performed?
4. What's the distribution of queries by user_id?
5. Are there any global queries?
Recommend sharding strategy based on actual usage."

AI discovers:

Analysis Results:
- 89% of queries filter by user_id
- 7% are analytics queries (full table scans)
- 4% are admin queries (by email, phone)
Recommended Strategy:
- Shard by user_id (hash-based, 64 shards)
- Replicate lookup tables for email/phone
- Move analytics to read replica
- Use routing service for shard location
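The routing piece of that recommendation can start as a stable hash of user_id modulo the shard count. A minimal sketch with the suggested 64 shards (pool names are hypothetical):

import { createHash } from 'node:crypto';

const SHARD_COUNT = 64;

// Stable hash routing: the same user_id always lands on the same shard
function shardFor(userId) {
  const digest = createHash('md5').update(String(userId)).digest();
  return digest.readUInt32BE(0) % SHARD_COUNT;
}

// One connection pool per shard, e.g. users_00 ... users_63
function connectionFor(userId, pools) {
  return pools[shardFor(userId)];
}

console.log(shardFor(1234567)); // deterministic shard index in [0, 63]

Plain modulo ties you to the shard count, so a production rollout would also plan for resharding, for example with consistent hashing or a shard-map lookup table.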

Production is slow. You have 2 hours to find and fix the issue:

"URGENT: Production performance degraded 70% in last hour.
Systematically investigate:
1. Recent deployments
2. Database slow queries
3. External service latencies
4. Resource exhaustion
5. Possible memory leaks
Focus on changes in the last 24 hours."

The AI provides a structured investigation:

Investigation Results:
1. RECENT DEPLOYMENTS (last 24h):
- 14:30: Deployed order-service v2.3.1
- 14:35: Performance degradation started
- Correlation: HIGH
2. CODE CHANGES in v2.3.1:
Found in OrderProcessor.java:
- Added new fraud check calling external API
- No timeout configured!
- External API responding in 30-60 seconds
3. IMMEDIATE FIX:
Add timeout to fraud check:
// In OrderProcessor.java line 234
FraudCheckResult result = fraudClient
    .checkFraud(order)
    .timeout(Duration.ofMillis(500))        // Add this
    .onErrorReturn(FraudCheckResult.ALLOW); // Fail open
4. LONG-TERM FIX:
- Make fraud check async
- Add circuit breaker
- Cache fraud check results
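The circuit-breaker idea generalizes beyond this incident. A sketch of timeout-plus-breaker with a fail-open fallback, shown in Node.js for brevity (the incident code itself is Java):

// Minimal circuit breaker: timeout, failure counting, cool-down, fail-open fallback
class CircuitBreaker {
  constructor({ timeoutMs = 500, failureThreshold = 5, coolDownMs = 30000, fallback }) {
    Object.assign(this, { timeoutMs, failureThreshold, coolDownMs, fallback });
    this.failures = 0;
    this.openUntil = 0;
  }

  async call(fn) {
    if (Date.now() < this.openUntil) return this.fallback(); // breaker open: fail fast
    try {
      const result = await Promise.race([
        fn(),
        new Promise((_, reject) => setTimeout(() => reject(new Error('timeout')), this.timeoutMs))
      ]);
      this.failures = 0;
      return result;
    } catch (err) {
      if (++this.failures >= this.failureThreshold) {
        this.openUntil = Date.now() + this.coolDownMs; // trip the breaker
      }
      return this.fallback(); // fail open, like onErrorReturn(ALLOW)
    }
  }
}

// Usage sketch:
// const breaker = new CircuitBreaker({ fallback: () => ({ allow: true }) });
// const verdict = await breaker.call(() => fraudClient.checkFraud(order));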
Continuous Improvement Workflows

Weekly Codebase Health Checks

"Run weekly codebase health analysis:
1. New code smells introduced
2. Test coverage changes
3. Performance regression risks
4. Security vulnerabilities
5. Technical debt accumulation
Compare with last week and highlight trends."

Proactive Dependency Management

"Analyze all dependencies for:
1. Security vulnerabilities (CVEs)
2. Deprecated versions
3. Breaking changes in new versions
4. License compliance issues
5. Unmaintained packages
Create prioritized update plan with risk assessment."

Key Takeaways for Million-Line Success

Right Tool, Right Job

Use semantic search (Zilliz) for understanding, ripgrep for patterns, local indexing for sensitive code. Don’t try to load millions of lines into context.

Incremental Everything

Never attempt big-bang refactoring. Use feature flags, dual writes, and gradual rollouts. Let AI help you plan safe, incremental changes.

Test as Documentation

In legacy systems, comprehensive tests become your documentation. Use AI to generate characterization tests that capture current behavior.

Human + AI Partnership

AI handles the mechanical work—finding patterns, generating boilerplate, tracking dependencies. Humans provide domain knowledge and architectural vision.

Working with million-line codebases doesn’t have to be overwhelming. With the right AI tools and workflows, you can navigate, understand, and safely modify even the most complex systems. The key is thinking systematically and letting AI handle the scale while you focus on the strategy.