Production RAG — Caching, Cost, and Scale
Caching embeddings and retrievals
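Cache each chunk's embedding indefinitely: it stays valid until the chunk's text or the embedding model changes, so key the cache on a hash of the text plus the model name. Also cache retrieval results for identical or near-identical queries to avoid redundant vector database calls, and clear those entries whenever the index is updated.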
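As a rough illustration of both caches, here is a minimal in-memory sketch. The names `EmbeddingCache`, `RetrievalCache`, `normalize_query`, and `fake_embed` are hypothetical, and `fake_embed` only stands in for a real embedding call. Treating "near-identical" as "same text after lowercasing and collapsing whitespace" is an assumption; a production system might use a persistent store and a semantic similarity check instead.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


class EmbeddingCache:
    """Caches embeddings keyed by (model name, SHA-256 of the chunk text)."""

    def __init__(self, embed_fn: Callable[[str], List[float]], model_name: str) -> None:
        self._embed_fn = embed_fn
        self._model_name = model_name
        self._store: Dict[Tuple[str, str], List[float]] = {}
        self.misses = 0  # number of times the underlying embed_fn was called

    def get(self, text: str) -> List[float]:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = (self._model_name, digest)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def normalize_query(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())


class RetrievalCache:
    """Caches retrieval results keyed by the normalized query string."""

    def __init__(self, search_fn: Callable[[str], List[str]]) -> None:
        self._search_fn = search_fn
        self._store: Dict[str, List[str]] = {}
        self.misses = 0  # number of times the underlying search_fn was called

    def search(self, query: str) -> List[str]:
        key = normalize_query(query)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._search_fn(query)
        return self._store[key]

    def clear(self) -> None:
        """Call after re-indexing so stale results are not served."""
        self._store.clear()


def fake_embed(text: str) -> List[float]:
    # Hypothetical stand-in for a real embedding API call.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]
```

Because the embedding key includes the model name, switching models produces new keys rather than returning vectors from the old model.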
Managing cost at scale
Track three costs separately: embedding (one-time per chunk), retrieval (vector DB queries), and generation (LLM tokens per answer). Generation is usually the largest of the three, so reducing context size (top-K, chunk size) usually saves the most.
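The per-query arithmetic can be sketched as follows. The `Prices` fields and every number in `EXAMPLE_PRICES` are made-up placeholders, not real vendor rates, and the function name `per_query_cost` is hypothetical. The point is only to show how top-K and chunk size feed into the generation line item.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Prices:
    # All values are placeholders; substitute your providers' actual rates.
    retrieval_per_query: float
    llm_input_per_1k_tokens: float
    llm_output_per_1k_tokens: float


def per_query_cost(
    prices: Prices,
    top_k: int,
    chunk_tokens: int,
    question_tokens: int,
    answer_tokens: int,
) -> Dict[str, float]:
    """Break one query's cost into retrieval and generation components."""
    # Context sent to the LLM grows linearly with top_k and chunk size.
    context_tokens = top_k * chunk_tokens
    input_tokens = question_tokens + context_tokens
    generation = (
        input_tokens / 1000 * prices.llm_input_per_1k_tokens
        + answer_tokens / 1000 * prices.llm_output_per_1k_tokens
    )
    return {"retrieval": prices.retrieval_per_query, "generation": generation}


# Illustrative placeholder rates only.
EXAMPLE_PRICES = Prices(
    retrieval_per_query=0.00001,
    llm_input_per_1k_tokens=0.003,
    llm_output_per_1k_tokens=0.015,
)

wide = per_query_cost(EXAMPLE_PRICES, top_k=10, chunk_tokens=400,
                      question_tokens=50, answer_tokens=300)
narrow = per_query_cost(EXAMPLE_PRICES, top_k=3, chunk_tokens=400,
                        question_tokens=50, answer_tokens=300)
```

Comparing `wide` and `narrow` shows the generation cost falling as top-K drops while the retrieval cost stays flat. Embedding cost is left out here because it is paid once per chunk at indexing time rather than per query.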
Keeping the knowledge base fresh
Build an automated re-indexing pipeline that detects source document changes and re-embeds only the affected chunks. Re-processing the entire knowledge base on every small update wastes money and time.
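One possible way to find the affected chunks is to compare content hashes between the stored index and the freshly chunked document. The sketch below assumes the index keeps a set of chunk hashes per document ID. The names `sha`, `chunk_hashes`, and `plan_reindex` are hypothetical, and this is only one of several workable designs.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def sha(text: str) -> str:
    """Content hash used to identify a chunk."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_hashes(chunks: List[str]) -> Set[str]:
    return {sha(chunk) for chunk in chunks}


def plan_reindex(
    old_index: Dict[str, Set[str]],
    doc_id: str,
    new_chunks: List[str],
) -> Tuple[List[str], Set[str]]:
    """Return (chunks that need embedding, hashes of stale chunks to delete)."""
    old = old_index.get(doc_id, set())
    # Map hash -> text; duplicate chunks within a document collapse to one entry.
    new_by_hash = {sha(chunk): chunk for chunk in new_chunks}
    to_embed = [text for h, text in new_by_hash.items() if h not in old]
    to_delete = old - set(new_by_hash)
    return to_embed, to_delete


# Example: only the second chunk of "doc1" was edited.
stored = {"doc1": chunk_hashes(["intro", "pricing v1", "contact"])}
to_embed, to_delete = plan_reindex(stored, "doc1", ["intro", "pricing v2", "contact"])
```

In this example `to_embed` contains only the edited chunk and `to_delete` contains only the hash of the version it replaced. Note that an insertion near the top of a document can shift fixed-size chunk boundaries and change every later hash, so the choice of chunking strategy affects how much gets re-embedded.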
Key Takeaways
- Cache embeddings permanently and retrievals for repeated queries.
- Generation is usually the largest cost — optimize context size to control it.
- Build automated re-indexing that only updates changed document chunks.
- Production RAG requires ongoing cost and freshness management, not a one-time build.
Design a re-indexing strategy
Write out a plan for how your RAG system would detect a changed source document and re-embed only the affected chunks, without reprocessing everything.