Evaluating RAG System Quality
Two things to measure separately
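Retrieval quality (did the system find the right chunks?) and generation quality (did the model produce a correct, well-grounded answer from those chunks?) are separate failure points. Measure and debug them independently: a wrong answer built on the right chunks points to generation, while a wrong answer built on the wrong chunks points to retrieval.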
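One way to keep the two measurements apart is sketched below. It assumes each test question has been hand-labeled with the IDs of the chunks that should be retrieved. The names `recall_at_k`, `answer_contains`, and `diagnose` are illustrative, not part of any library, and the substring check is a deliberately crude stand-in for a real answer grader.

```python
from __future__ import annotations

from typing import Sequence


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the labeled relevant chunks found in the top-k retrieved chunks."""
    if not relevant_ids:
        return 0.0  # no labels to compare against; treated as 0 in this sketch
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def answer_contains(answer: str, expected: str) -> bool:
    """Crude generation check: the expected text appears in the answer."""
    return expected.strip().lower() in answer.lower()


def diagnose(
    retrieved_ids: Sequence[str],
    relevant_ids: set[str],
    answer: str,
    expected: str,
    k: int = 5,
) -> str:
    """Label a test case as 'retrieval', 'generation', or 'ok'."""
    # Retrieval is checked first: if no relevant chunk was found, even a
    # correct-looking answer was not grounded in the knowledge base.
    if recall_at_k(retrieved_ids, relevant_ids, k) == 0:
        return "retrieval"
    if not answer_contains(answer, expected):
        return "generation"
    return "ok"
```

Checking retrieval first is a design choice in this sketch: it attributes a failure to the earliest stage that went wrong, which is usually where debugging should start.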
Building a small evaluation set
Create 15-20 question/expected-answer pairs covering your knowledge base's key topics. Run your RAG pipeline against this set regularly, and always after you change chunking, embeddings, or prompts, so regressions show up early.
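A minimal harness for such a set might look like the sketch below. It assumes your pipeline can be called as a function that takes a question string and returns an answer string. `EvalCase` and `run_eval` are hypothetical names, and the pass criterion (expected text appears in the answer) is a simplification you would likely replace with a stricter grader.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    question: str
    expected: str  # text a correct answer should contain


def run_eval(
    cases: List[EvalCase], pipeline: Callable[[str], str]
) -> Tuple[float, List[str]]:
    """Run every case through the pipeline.

    Returns the pass rate and the list of questions that failed.
    """
    failures: List[str] = []
    for case in cases:
        answer = pipeline(case.question)
        # Simplified check: case-insensitive substring match.
        if case.expected.lower() not in answer.lower():
            failures.append(case.question)
    pass_rate = (len(cases) - len(failures)) / len(cases) if cases else 0.0
    return pass_rate, failures
```

Recording the pass rate after each run gives you a baseline, so a drop after a chunking, embedding, or prompt change is easy to spot, and the failure list tells you which questions to inspect.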
Checking for hallucination
Specifically test cases where the answer isn't in your knowledge base at all. A well-built RAG system should say "I don't have information on this" rather than confidently making something up.
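The check for unanswerable questions can be sketched as follows. The refusal phrases in `REFUSAL_MARKERS` are assumptions and should match whatever wording your prompt instructs the model to use. `is_abstention` and `find_hallucinations` are illustrative names, and the pipeline is again assumed to be a function from question to answer.

```python
from typing import Callable, Iterable, List

# Assumed refusal phrasings; align these with your own system prompt.
REFUSAL_MARKERS = (
    "i don't have information",
    "i do not have information",
)


def is_abstention(answer: str) -> bool:
    """True if the answer contains one of the expected refusal phrases."""
    # Normalize curly apostrophes so "don’t" matches "don't".
    text = answer.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def find_hallucinations(
    unanswerable: Iterable[str], pipeline: Callable[[str], str]
) -> List[str]:
    """Return the unanswerable questions the pipeline answered anyway."""
    return [q for q in unanswerable if not is_abstention(pipeline(q))]
```

An empty result means the pipeline declined every out-of-scope question. Phrase matching is brittle, so any question returned here is worth reading by hand before you treat it as a confirmed hallucination.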
Key Takeaways
- Measure retrieval quality and generation quality as separate failure points.
- Build a 15-20 case evaluation set covering key knowledge base topics.
- Re-run the eval set whenever you change chunking, embeddings, or prompts.
- Explicitly test that the system admits ignorance rather than hallucinating.
Build a mini RAG evaluation set
Write 8 question/expected-answer pairs including 2 questions with no answer in your knowledge base, and run them through your pipeline.