Document Processing and Chunking Strategies
Why chunking matters
Embedding an entire 50-page document as one vector works poorly: the vector is too generic to match specific questions well, and most embedding models also cap how much text they accept per input. Split documents into smaller chunks (paragraphs or sections) before embedding.
Fixed-size vs semantic chunking
Fixed-size chunking (e.g., every 500 tokens) is simple but can split a sentence or idea mid-thought. Semantic chunking splits along natural boundaries (paragraphs, sections, headers), which preserves meaning and usually makes it the better default.
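The two strategies above can be sketched in a few lines. This is a minimal illustration, not a production splitter: it assumes whitespace-separated words are an acceptable stand-in for model tokens and that paragraphs are separated by blank lines. The function names `fixed_size_chunks` and `paragraph_chunks` are hypothetical.

```python
import re


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into chunks of `size` words (words approximate tokens here)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def paragraph_chunks(text: str) -> list[str]:
    """Split text on blank lines, producing one chunk per paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


doc = (
    "Chunking splits a document before embedding. Each chunk gets its own vector.\n\n"
    "Overlap repeats a few tokens across neighbours. It protects ideas at boundaries."
)

fixed = fixed_size_chunks(doc, size=8)
paras = paragraph_chunks(doc)

for chunk in fixed:
    print("fixed:", repr(chunk))
for chunk in paras:
    print("para: ", repr(chunk))
```

With this sample, the first fixed-size chunk stops after "Each chunk", in the middle of a sentence, and the second one straddles both paragraphs. The paragraph-based chunks each keep one complete idea.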
Chunk overlap for context continuity
Add a small overlap (e.g., 50-100 tokens) between consecutive chunks so that an idea spanning a chunk boundary appears intact in at least one chunk instead of being cut in both. The cost is some duplicated text in the index, and the gain is often a noticeable improvement in retrieval quality.
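One way to implement overlap is a sliding window whose step is the chunk size minus the overlap. The sketch below assumes the text has already been tokenized into a list. The name `chunk_with_overlap` and the tiny sizes are illustrative only, and real values would be closer to the 500 and 50-100 token figures mentioned above.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` tokens, repeating `overlap` tokens between neighbours."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")

    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end; avoid a redundant tail chunk
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, size=4, overlap=1)

for chunk in chunks:
    print(chunk)
```

Here the last token of each chunk is also the first token of the next one, so content sitting at a boundary appears in both neighbours.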
Key Takeaways
- Documents must be split into chunks before embedding — one giant vector is too generic.
- Semantic chunking along natural boundaries usually beats fixed-size splitting.
- Add chunk overlap to preserve ideas that span chunk boundaries.
- Chunking strategy directly impacts retrieval quality.
Chunk a document two ways
Take a multi-paragraph document and split it once with fixed-size chunking and once along paragraph boundaries. Compare which preserves meaning better.
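A rough way to put a number on the comparison is to count how many chunks stop mid-sentence. The sketch below is only a starting point for the exercise: it assumes sentences end in ".", "!" or "?", uses words in place of tokens, and the helper names `ends_mid_sentence` and `broken_fraction` are hypothetical. Swap in your own document for `text`.

```python
def ends_mid_sentence(chunk: str) -> bool:
    """True if the chunk does not end with sentence-final punctuation."""
    return not chunk.rstrip().endswith((".", "!", "?"))


def broken_fraction(chunks: list[str]) -> float:
    """Fraction of chunks that stop mid-sentence (0.0 for an empty list)."""
    if not chunks:
        return 0.0
    return sum(ends_mid_sentence(c) for c in chunks) / len(chunks)


text = (
    "Cats sleep most of the day. They hunt at dusk.\n\n"
    "Dogs need daily walks. They enjoy company."
)

# Fixed-size split: every 5 words, ignoring sentence and paragraph boundaries.
words = text.split()
fixed = [" ".join(words[i:i + 5]) for i in range(0, len(words), 5)]

# Paragraph split: one chunk per blank-line-separated paragraph.
paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

print("fixed-size broken fraction:", broken_fraction(fixed))
print("paragraph broken fraction: ", broken_fraction(paragraphs))
```

The punctuation check is a crude proxy for "preserves meaning", so read the chunks yourself as well. A chunk can end on a full stop and still separate a claim from the sentence that explains it.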