Advanced Context Management for Code Generation
Richard Wang
August 22, 2025
Modern AI-powered development environments face a fundamental challenge: how do you provide an LLM with the right context from massive codebases without overwhelming the model or wasting bandwidth? At Clad Labs, we've implemented a sophisticated system that leverages Merkle trees, abstract syntax trees, and intelligent vector storage to solve this problem at scale.
The Context Window Problem
Large Language Models have made remarkable progress, with context windows ranging from 200K to 2M tokens. For small projects, you could theoretically dump the entire codebase into the prompt. Tools like Claude Code and Gemini CLI work reasonably well with this approach for modest-sized repositories.
However, this strategy falls apart quickly. Enterprise codebases can easily exceed tens of millions of lines of code. Even with a 2M token context window, you can't fit everything. More importantly, you shouldn't - most of the codebase is irrelevant to the current task. The challenge isn't just about fitting data into a context window; it's about intelligently selecting which pieces of code are actually relevant.
Efficient Change Detection with Merkle Trees
The first problem we need to solve is incremental synchronization. When you're actively developing, files change constantly. Naive approaches would re-index the entire codebase on every change, which is prohibitively expensive for both bandwidth and compute.
We use Merkle trees to solve this elegantly. A Merkle tree is a cryptographic data structure where every leaf node stores the hash of a data block, and every parent node stores the hash of its children. This creates a hierarchical fingerprint of your entire codebase.
Here's how it works in practice:
Tree Construction:
- Every file gets a cryptographic hash based on its contents (the leaf nodes)
- Every directory gets a hash based on the hashes of its children
- This continues recursively up to the root of your project (a minimal sketch of this construction follows below)
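Here is a minimal sketch of that construction in TypeScript, using Node's built-in crypto, fs, and path modules. The node shape and hashing details are illustrative only (and it omits the file-name obfuscation described later), not the exact production implementation:

```typescript
import { createHash } from "crypto";
import { readFileSync, readdirSync, statSync } from "fs";
import { join } from "path";

interface MerkleNode {
  path: string;
  hash: string;
  children?: MerkleNode[];
}

function hashBytes(data: Buffer | string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Leaf node: hash the file's contents.
function hashFile(filePath: string): MerkleNode {
  return { path: filePath, hash: hashBytes(readFileSync(filePath)) };
}

// Internal node: hash the concatenation of the children's hashes.
function hashDirectory(dirPath: string): MerkleNode {
  const children = readdirSync(dirPath)
    .sort() // deterministic ordering so the same tree always yields the same hash
    .map((name) => {
      const fullPath = join(dirPath, name);
      return statSync(fullPath).isDirectory()
        ? hashDirectory(fullPath)
        : hashFile(fullPath);
    });
  const combined = children.map((c) => c.hash).join("");
  return { path: dirPath, hash: hashBytes(combined), children };
}

const root = hashDirectory("./my-project"); // root.hash is the codebase fingerprint
```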
The Handshake Protocol: When the IDE starts, a handshake occurs between client and server:
- The client sends its root hash to the server
- If root hashes match: Nothing has changed. Zero data transferred. Done.
- If they don't match: The client and server perform tree traversals to locate differences
The beauty of this approach is its efficiency. Finding where the client and server disagree takes O(log n) bandwidth, where n is the number of files in the codebase: you only compare hashes at each level of the tree until you find the divergent branches.
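Building on the MerkleNode shape sketched above, that traversal might look like the following; fetchServerNode is a hypothetical RPC standing in for the real client-server protocol:

```typescript
// Hypothetical RPC returning the server's node (hash plus child hashes) for a path,
// or null if the server has no entry for that path.
declare function fetchServerNode(path: string): Promise<MerkleNode | null>;

// Collect every file path under a local subtree.
function allFiles(node: MerkleNode): string[] {
  if (!node.children) return [node.path];
  return node.children.flatMap(allFiles);
}

// Walk both trees top-down, descending only where the hashes disagree.
async function findChangedFiles(client: MerkleNode): Promise<string[]> {
  const server = await fetchServerNode(client.path);
  if (server && server.hash === client.hash) return []; // identical subtree: prune
  if (!server || !client.children) return allFiles(client); // new subtree or changed file

  const serverChildren = new Map<string, string>(
    (server.children ?? []).map((c) => [c.path, c.hash] as [string, string])
  );
  const changed: string[] = [];
  for (const child of client.children) {
    if (serverChildren.get(child.path) !== child.hash) {
      changed.push(...(await findChangedFiles(child)));
    }
  }
  return changed;
}
```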
Implementation Details:
We implement Merkle trees at the client level using Rust hooks integrated with the TypeScript frontend. The system automatically obfuscates file names for privacy before constructing the tree. After initial indexing, a second Merkle tree is constructed server-side for comparison.
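One simple way to obfuscate names, shown here purely as an illustrative sketch rather than the actual mechanism, is a keyed hash of each path, so the server only ever sees opaque identifiers it cannot reverse:

```typescript
import { createHmac, randomBytes } from "crypto";

// A per-client secret key; it never leaves the machine, so the server
// cannot map the identifiers it stores back to real file paths.
const pathKey = randomBytes(32);

function obfuscatePath(realPath: string): string {
  return createHmac("sha256", pathKey).update(realPath).digest("hex");
}
```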
Only the files that have actually changed get sent to the server for re-chunking, re-embedding, and updating in our vector database. We also run an index sync cron job every 10 minutes to catch any changes that might have been missed.
Privacy and Security Considerations:
The system respects both .gitignore and .cursorignore files, ensuring that sensitive files are never indexed. Additionally, we have pattern matching to detect and ignore API keys, passwords, environment variables, and other sensitive data. Your secrets stay local.
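As an illustration of this kind of client-side filter (the patterns and ignore handling here are simplified examples, not the production rule set):

```typescript
// Illustrative patterns only; a production filter would be far more exhaustive.
const SECRET_PATTERNS: RegExp[] = [
  /AKIA[0-9A-Z]{16}/,                              // AWS access key IDs
  /-----BEGIN (RSA |EC )?PRIVATE KEY-----/,        // private key material
  /(api[_-]?key|secret|password)\s*[:=]\s*['"][^'"]+['"]/i, // hard-coded credentials
];

function containsSecrets(content: string): boolean {
  return SECRET_PATTERNS.some((p) => p.test(content));
}

// A file is skipped if it matches .gitignore/.cursorignore rules or looks sensitive.
function shouldIndex(
  path: string,
  content: string,
  isIgnored: (p: string) => boolean
): boolean {
  if (isIgnored(path)) return false;              // .gitignore / .cursorignore
  if (/\.env(\..*)?$/.test(path)) return false;   // environment files stay local
  return !containsSecrets(content);
}
```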
Semantic Code Chunking with Abstract Syntax Trees
Once we've identified which files have changed, the next challenge is how to break them down into meaningful units. This is where naive approaches typically fail.
Why Traditional Chunking Fails for Code:
Text chunking strategies used for documents (fixed token length, paragraph-based, sentence boundaries) are catastrophically bad for source code. Code isn't just text - it has structure, syntax, and semantic meaning. Splitting a class definition in the middle, or separating a function from its documentation, destroys the very relationships that make code comprehensible.
Abstract Syntax Trees to the Rescue:
An Abstract Syntax Tree (AST) is a tree representation of the syntactic structure of source code. Each node in the tree represents a construct in the language - classes, methods, functions, loops, conditionals, etc.
We use Tree-sitter, a parser generator tool, to create language-aware ASTs. Tree-sitter provides several critical advantages:
- Language Agnostic: It supports dozens of programming languages with a consistent interface
- Incremental Parsing: It can efficiently update the AST when code changes, rather than reparsing everything
- Error Tolerant: It produces useful ASTs even from incomplete or syntactically invalid code (crucial during active development)
Semantic Chunking Strategy:
Using the AST, we chunk code along natural semantic boundaries:
- Complete function definitions (including docstrings and type annotations)
- Class definitions with their methods
- Module-level documentation
- Import statements (which provide crucial context about dependencies)
This approach ensures that each chunk is self-contained and semantically meaningful. When the LLM receives a chunk, it gets complete, contextual information rather than arbitrary fragments.
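A rough sketch of AST-driven chunking with the Node Tree-sitter bindings is shown below; the grammar package, the chosen node types, and the Chunk shape are illustrative choices, not the exact production pipeline:

```typescript
import Parser from "tree-sitter";
import { typescript } from "tree-sitter-typescript"; // grammar for .ts sources

interface Chunk {
  kind: string;      // e.g. "function_declaration", "class_declaration"
  text: string;      // the complete source text of the construct
  startLine: number;
  endLine: number;
}

// Node types treated as natural chunk boundaries (per-language in practice).
const CHUNK_TYPES = new Set([
  "function_declaration",
  "class_declaration",
  "import_statement",
]);

function chunkSource(source: string): Chunk[] {
  const parser = new Parser();
  parser.setLanguage(typescript);
  const tree = parser.parse(source);

  const chunks: Chunk[] = [];
  const visit = (node: Parser.SyntaxNode) => {
    if (CHUNK_TYPES.has(node.type)) {
      chunks.push({
        kind: node.type,
        text: node.text,
        startLine: node.startPosition.row,
        endLine: node.endPosition.row,
      });
      return; // keep the construct whole; don't split it further
    }
    node.children.forEach(visit);
  };
  visit(tree.rootNode);
  return chunks;
}
```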
Vector Embeddings and Similarity Search
After semantic chunking, we need to convert code into a form that enables fast similarity search. This is where embeddings come in.
Creating Embeddings:
We generate embeddings using specialized models optimized for code (for example, OpenAI's text-embedding-3-large or custom fine-tuned models). These embeddings are high-dimensional vectors (typically 1536 or 3072 dimensions) that capture semantic meaning.
The key insight is that semantically similar code will have embeddings that are close together in vector space. A function that processes user authentication will have an embedding near other authentication-related code, even if the exact tokens differ.
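Concretely, "close together in vector space" is usually measured with cosine similarity between the embedding vectors:

```typescript
// Cosine similarity between two embedding vectors: values near 1.0 mean the
// vectors point in the same direction (semantically similar), values near 0
// mean unrelated. Assumes both vectors have the same dimensionality.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// e.g. cosineSimilarity(embed("verify user password"), embed("check login credentials"))
// should score higher than either does against embed("render a chart").
```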
Query-Time Retrieval:
When you ask the AI a question or request code generation:
- Your query gets embedded using the same model
- We perform a similarity search in the vector database
- The most relevant code chunks are retrieved
- These chunks, along with your query, form the context for the LLM (a sketch of this step follows below)
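Here is a hedged sketch of that retrieval step; embedQuery and vectorStore are hypothetical stand-ins for the embedding service and the vector database client, not Turbopuffer's actual API:

```typescript
// Hypothetical interfaces standing in for the real embedding model and vector store.
declare function embedQuery(text: string): Promise<number[]>;
declare const vectorStore: {
  query(args: { vector: number[]; topK: number }): Promise<
    { id: string; score: number; metadata: { path: string; text: string } }[]
  >;
};

async function buildContext(userPrompt: string, topK = 20): Promise<string> {
  const queryVector = await embedQuery(userPrompt);                    // 1. embed the query
  const hits = await vectorStore.query({ vector: queryVector, topK }); // 2. similarity search
  const snippets = hits.map(
    (h) => `// ${h.metadata.path}\n${h.metadata.text}`                 // 3. most relevant chunks
  );
  return [...snippets, userPrompt].join("\n\n");                       // 4. chunks + query → LLM context
}
```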
Scale and Performance with Turbopuffer
Managing billions of vectors efficiently requires specialized infrastructure. We use Turbopuffer, a vector database designed specifically for object storage backends like S3.
Why Turbopuffer:
Traditional vector indexes such as HNSW (Hierarchical Navigable Small World) graphs and DiskANN work well when the index lives in RAM or on local SSDs, but they have significant drawbacks for large-scale storage on object stores:
- High write amplification (updates require rewriting large portions of the index)
- Many round trips to storage for queries
- Expensive to store entirely in RAM
SPFresh: Centroid-Based ANN Index:
Turbopuffer uses SPFresh, a centroid-based approximate nearest neighbor (ANN) index. Here's how it works:
- Clustering: Vectors are grouped into clusters, each represented by a centroid
- Fast Centroid Search: A separate, fast index locates the nearest centroids to a query
- Candidate Retrieval: Only vectors in the nearest clusters are examined in detail
This architecture minimizes round trips to object storage - you can fetch all candidates in a cluster with a single request. Write amplification is also much lower since updates typically only affect a single cluster.
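A toy sketch of centroid-based candidate retrieval, reusing the cosineSimilarity helper from earlier (the clustering itself, e.g. k-means, is assumed to have already been computed):

```typescript
interface Cluster {
  centroid: number[];
  members: { id: string; vector: number[] }[]; // in practice fetched from object storage
}

// Find the nProbe nearest centroids, then score only the vectors in those clusters.
function annSearch(
  query: number[],
  clusters: Cluster[],
  nProbe: number,
  topK: number
): { id: string; score: number }[] {
  const nearestClusters = [...clusters]
    .sort((a, b) =>
      cosineSimilarity(query, b.centroid) - cosineSimilarity(query, a.centroid)
    )
    .slice(0, nProbe); // roughly one storage round trip per probed cluster

  const candidates = nearestClusters.flatMap((c) => c.members);
  return candidates
    .map((m) => ({ id: m.id, score: cosineSimilarity(query, m.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

Raising nProbe trades extra storage reads for better recall; the centroid search itself stays cheap because there are far fewer centroids than vectors.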
Hot/Cold Data Tiering:
Turbopuffer implements intelligent caching:
- Cold Storage: The full dataset lives on low-cost object storage (S3)
- Hot Cache: Actively used data is automatically promoted to NVMe SSDs and RAM
When you're actively working on a codebase, its index data moves to the hot tier automatically. This ensures sub-100ms query latencies for active projects while keeping costs minimal for idle codebases.
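Conceptually, the promotion logic behaves like a read-through cache with eviction. The sketch below is a deliberately simplified in-process analogy reusing the Cluster type from the previous sketch; the real system manages NVMe and RAM tiers, not a Map:

```typescript
// Cold reads come from object storage; recently touched clusters are kept in a
// bounded in-memory hot tier with least-recently-used eviction.
declare function fetchFromObjectStorage(clusterId: string): Promise<Cluster>;

class HotTier {
  private cache = new Map<string, Cluster>();
  constructor(private maxEntries: number) {}

  async get(clusterId: string): Promise<Cluster> {
    const hit = this.cache.get(clusterId);
    if (hit) {
      this.cache.delete(clusterId);  // refresh recency (Map preserves insertion order)
      this.cache.set(clusterId, hit);
      return hit;                    // hot path: no object-storage round trip
    }
    const cluster = await fetchFromObjectStorage(clusterId); // cold path: S3
    this.cache.set(clusterId, cluster);                      // promote to hot tier
    if (this.cache.size > this.maxEntries) {
      const oldest = this.cache.keys().next().value as string;
      this.cache.delete(oldest);     // evict the least recently used cluster
    }
    return cluster;
  }
}
```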
Scale Metrics:
The performance characteristics are impressive:
- 95% cost reduction compared to traditional vector databases
- Over 100 billion vectors stored
- Write peaks of 10GB/s during bulk ingestion
- Support for 10+ million isolated namespaces (one per user/project)
Putting It All Together
Here's the complete flow when you're coding (an end-to-end sketch follows the steps below):
1. Change Detection (Continuous):
- File system monitors detect changes
- Merkle tree nodes are updated for changed files
- Background sync identifies deltas every 10 minutes
2. Incremental Indexing:
- Only changed files are sent to the server
- AST parsing extracts semantic chunks
- Chunks are embedded using the embedding model
- Vectors are upserted into Turbopuffer
3. Query-Time Retrieval:
- Your prompt is embedded
- Similarity search finds relevant code chunks
- Additional context (imports, related classes) is added
- The combined context is sent to the LLM
4. Response Generation:
- The LLM generates code with full contextual awareness
- Results are streamed back to your editor
- The cycle continues as you make changes
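Tying the earlier sketches together, the loop could be orchestrated roughly as follows; embedChunks, upsertVectors, and callLLM are hypothetical stand-ins for the embedding service, vector store, and model API:

```typescript
import { readFileSync } from "fs";

// Hypothetical stand-ins; the other helpers come from the earlier sketches.
declare function embedChunks(chunks: Chunk[]): Promise<{ id: string; vector: number[] }[]>;
declare function upsertVectors(vectors: { id: string; vector: number[] }[]): Promise<void>;
declare function callLLM(context: string): AsyncIterable<string>;

async function syncAndAnswer(rootNode: MerkleNode, userPrompt: string) {
  // 1-2. Incremental indexing: only the changed files are re-chunked and re-embedded.
  const changedFiles = await findChangedFiles(rootNode);
  for (const file of changedFiles) {
    const chunks = chunkSource(readFileSync(file, "utf8"));
    await upsertVectors(await embedChunks(chunks));
  }

  // 3-4. Retrieval and generation: relevant chunks plus the prompt go to the LLM.
  const context = await buildContext(userPrompt);
  for await (const token of callLLM(context)) {
    process.stdout.write(token); // stream results back to the editor
  }
}
```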
Performance Implications
This architecture enables several critical capabilities:
Scalability: Change detection and sync bandwidth scale logarithmically with codebase size rather than linearly. A 10x increase in codebase size requires only a small increase in sync time and storage costs.
Real-Time Responsiveness: Hot data caching means that code completions and queries feel instantaneous, even when working with multi-gigabyte codebases.
Privacy Preservation: Client-side filtering and file-name obfuscation before the Merkle tree is built ensure that sensitive data never leaves your machine.
Cost Efficiency: By using object storage for cold data and only caching hot data, storage costs remain low even for millions of users.
Future Directions
This architecture opens up exciting possibilities:
Cross-Project Learning: Anonymized patterns from millions of codebases can improve embeddings and retrieval without compromising privacy.
Contextual Ranking: Beyond simple similarity, we can rank results based on recency, user behavior, and code execution patterns.
Hybrid Search: Combining vector similarity with traditional code search (regex, AST queries) can catch edge cases where embeddings alone miss important context.
The challenge of providing relevant context to LLMs is fundamental to building useful AI coding assistants. By combining Merkle trees for efficient synchronization, AST-based semantic chunking, specialized vector databases, and intelligent caching, we can deliver experiences that feel magical while remaining practical at scale.
The future of development tools isn't about bigger context windows - it's about smarter context selection. And that requires treating code not as text, but as the structured, semantic artifact it truly is.