
Evaluating RAG System Quality

Two things to measure separately

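A RAG pipeline can fail in two places: retrieval (did the system find the right chunks?) and generation (did the model produce a correct, well-grounded answer from those chunks?). Measure and debug the two independently, because a wrong answer built on the right chunks calls for a different fix than a wrong answer built on the wrong chunks.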
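
One way to keep the two measurements apart is to label each test case by the stage that failed. The sketch below is a minimal illustration, not a standard metric: the function names, the chunk ids, and the keyword-match check for answer correctness are all assumptions made for the example, and a real setup would likely use a stricter answer check.

```python
def retrieval_hit(retrieved_ids, relevant_ids, k=5):
    """Return True if any relevant chunk id is among the top-k retrieved ids."""
    relevant = set(relevant_ids)
    return any(chunk_id in relevant for chunk_id in retrieved_ids[:k])


def answer_matches(answer, expected_keywords):
    """Crude generation check (assumption): every expected keyword appears."""
    text = answer.lower()
    return all(keyword.lower() in text for keyword in expected_keywords)


def diagnose(retrieved_ids, relevant_ids, answer, expected_keywords, k=5):
    """Label one eval case so that a failure points at the stage to debug."""
    if not retrieval_hit(retrieved_ids, relevant_ids, k):
        # The right chunks never reached the model, so look at chunking,
        # embeddings, or search settings before touching the prompt.
        return "retrieval_failure"
    if not answer_matches(answer, expected_keywords):
        # The right chunks were available but the answer is still wrong.
        return "generation_failure"
    return "ok"
```

Checking retrieval first is a deliberate choice: if the relevant chunk was never retrieved, a correct-looking answer is not evidence that the pipeline works, so the case is still reported as a retrieval failure.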

Building a small evaluation set

Create 15-20 question/expected-answer pairs covering your knowledge base's key topics. Run your RAG pipeline against these regularly to catch regressions when you change chunking, embeddings, or prompts.
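
A regression run over such a set can be sketched in a few lines. Everything named here is hypothetical: `pipeline` stands for whatever callable wraps your RAG system (question in, answer string out), the two cases are made-up placeholders, and the keyword match is only a rough stand-in for judging an answer against the expected one.

```python
def run_eval(pipeline, eval_set):
    """Run every case through the pipeline; return (pass_rate, failed_questions)."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    failed = []
    for case in eval_set:
        answer = pipeline(case["question"]).lower()
        if not all(kw.lower() in answer for kw in case["expected_keywords"]):
            failed.append(case["question"])
    passed = len(eval_set) - len(failed)
    return passed / len(eval_set), failed


# Placeholder cases; replace with 15-20 pairs drawn from your own knowledge base.
EVAL_SET = [
    {"question": "How long do refunds take?", "expected_keywords": ["14 days"]},
    {"question": "Which plan includes SSO?", "expected_keywords": ["enterprise"]},
]


def fake_pipeline(question):
    """Stand-in for a real RAG pipeline so the sketch runs on its own."""
    canned = {"How long do refunds take?": "Refunds are issued within 14 days."}
    return canned.get(question, "I'm not sure.")


pass_rate, failed = run_eval(fake_pipeline, EVAL_SET)
print(f"pass rate: {pass_rate:.0%}, failed: {failed}")
```

Recording the pass rate and the list of failed questions before and after a change to chunking, embeddings, or prompts makes a regression visible as a specific question that used to pass.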

Checking for hallucination

Include test cases where the answer isn't in your knowledge base at all. A well-built RAG system should say "I don't have information on this" rather than confidently making something up.
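
A simple automated check for this behavior looks for a refusal phrase in the answer to each out-of-scope question. The sketch below assumes the system prompt instructs the model to use one of a few known phrasings; the marker list, the function names, and the `pipeline` callable are all hypothetical and should be adapted to your own setup.

```python
# Phrases treated as an admission of ignorance (assumption: adjust these to
# match the wording your own prompt asks the model to use).
REFUSAL_MARKERS = (
    "i don't have information",
    "i do not have information",
    "i don't know",
)


def admits_ignorance(answer, markers=REFUSAL_MARKERS):
    """Return True if the answer contains one of the refusal phrases."""
    # Normalize case and curly apostrophes so "don’t" matches "don't".
    text = answer.lower().replace("\u2019", "'")
    return any(marker in text for marker in markers)


def find_hallucinations(pipeline, unanswerable_questions):
    """Return the out-of-scope questions the pipeline answered anyway."""
    return [
        question
        for question in unanswerable_questions
        if not admits_ignorance(pipeline(question))
    ]
```

Phrase matching is brittle: a model that refuses in different words will be flagged as hallucinating, so any question this check returns is worth reading by hand before treating it as a failure.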

Key Takeaways

Measure retrieval quality and generation quality as separate failure points.
Build a 15-20 case evaluation set covering key knowledge base topics.
Re-run the eval set whenever you change chunking, embeddings, or prompts.
Explicitly test that the system admits ignorance rather than hallucinating.

Build a mini RAG evaluation set

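Write 8 question/expected-answer pairs, including 2 questions whose answers are not in your knowledge base, and run them through your pipeline. For those 2 questions, the expected answer is an admission that the information is missing.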
