
Document Processing and Chunking Strategies

Why chunking matters

You can't usefully embed an entire 50-page document as one vector. That single vector has to represent every topic in the document at once, so it is too generic to match a specific question well. Documents must be split into smaller chunks (paragraphs or sections) before embedding.

Fixed-size vs semantic chunking

Fixed-size chunking (e.g., every 500 tokens) is simple, but it can cut a sentence or idea in half. Semantic chunking splits along natural boundaries (paragraphs, sections, headers), which preserves meaning and is usually the better default.
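
The two approaches can be sketched in a few lines of Python. This is a minimal illustration, not a production splitter: the function names fixed_size_chunks and paragraph_chunks are made up for this example, whitespace-separated words stand in for tokens, and a blank line is assumed to mark a paragraph boundary.

```python
import re


def fixed_size_chunks(text: str, size: int = 500) -> list[str]:
    """Split text into chunks of at most `size` whitespace-separated words.

    Assumption: words stand in for tokens. A real pipeline would count
    tokens with the embedding model's own tokenizer.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def paragraph_chunks(text: str) -> list[str]:
    """Split text on blank lines so each chunk is one whole paragraph."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


doc = (
    "Embeddings map text to vectors. Similar texts land close together.\n\n"
    "Chunking splits a document before embedding. Each chunk gets its own vector."
)

fixed = fixed_size_chunks(doc, size=8)  # tiny size so the effect is visible
paras = paragraph_chunks(doc)

for chunk in fixed:
    print("FIXED:", chunk)
for chunk in paras:
    print("PARA: ", chunk)
```

With this sample text, the first fixed-size chunk stops in the middle of a sentence and the second one mixes the end of one paragraph with the start of the next, while the paragraph splitter keeps each paragraph intact.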

Chunk overlap for context continuity

Add a small overlap (e.g., 50-100 tokens) between consecutive chunks so that an idea spanning a chunk boundary is not lost in both chunks. This small addition often improves retrieval quality meaningfully.
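
One way to implement overlap is a sliding window whose step is the chunk size minus the overlap. The sketch below is illustrative only: chunks_with_overlap is a hypothetical helper, and it works on a list of words as a stand-in for tokens.

```python
def chunks_with_overlap(
    words: list[str], size: int = 500, overlap: int = 50
) -> list[list[str]]:
    """Slide a window of `size` items, repeating the last `overlap` items
    of each chunk at the start of the next one.

    Assumption: `words` stands in for a token list produced by a real tokenizer.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")

    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        # Stop once the window has reached the end, so the final chunk
        # is not followed by a fragment made only of repeated items.
        if start + size >= len(words):
            break
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = chunks_with_overlap(words, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
# ['w0', 'w1', 'w2', 'w3']
# ['w3', 'w4', 'w5', 'w6']
# ['w6', 'w7', 'w8', 'w9']
```

Each chunk begins with the last item of the previous one, so text near a boundary appears in both neighbours. A larger overlap gives more continuity at the cost of more chunks to embed and store.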

Key Takeaways

Documents must be split into chunks before embedding — one giant vector is too generic.
Semantic chunking along natural boundaries usually beats fixed-size splitting.
Add chunk overlap to preserve ideas that span chunk boundaries.
Chunking strategy directly impacts retrieval quality.

Chunk a document two ways

Take a multi-paragraph document and split it once with fixed-size chunking and once along paragraph boundaries. Compare which preserves meaning better.
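
For a rough first comparison, you could count how many chunks end at a sentence boundary. This is only a crude heuristic invented for this exercise (the helper names and the punctuation rule are assumptions), and it does not replace reading the chunks yourself.

```python
def ends_cleanly(chunk: str) -> bool:
    """Return True if the chunk ends with sentence-final punctuation.

    Assumption: '.', '!' and '?' mark the end of a sentence. This ignores
    abbreviations, quotes, and other edge cases.
    """
    return chunk.rstrip().endswith((".", "!", "?"))


def clean_boundary_ratio(chunks: list[str]) -> float:
    """Fraction of chunks that end at a sentence boundary (0.0 if empty)."""
    if not chunks:
        return 0.0
    return sum(ends_cleanly(c) for c in chunks) / len(chunks)


# Example chunk lists, written out by hand for illustration.
fixed_chunks = [
    "Embeddings map text to vectors. Similar texts land",
    "close together. Chunking splits a document before embedding.",
    "Each chunk gets its own vector.",
]
paragraph_chunks_example = [
    "Embeddings map text to vectors. Similar texts land close together.",
    "Chunking splits a document before embedding. Each chunk gets its own vector.",
]

print("fixed-size:", clean_boundary_ratio(fixed_chunks))
print("paragraph: ", clean_boundary_ratio(paragraph_chunks_example))
```

A lower ratio for the fixed-size chunks suggests more sentences were cut mid-thought. Note that the metric misses chunks that end cleanly but mix two unrelated paragraphs, as the second fixed-size chunk here does.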
