Evaluating RAG: the metrics that actually matter
Retrieval quality, faithfulness, and answer relevance — what to measure when 'it feels better' isn't good enough for production.

Most RAG systems ship on vibes. Someone asks three questions, the answers look good, and it goes to prod. Then a customer asks the fourth question and the wheels come off. Here's how to measure what's actually happening.
Separate retrieval from generation
A RAG failure is either a retrieval problem (you fetched the wrong context) or a generation problem (you had the right context and the model still got it wrong). Measure them separately or you'll fix the wrong half.
Retrieval metrics
- Context recall — did we retrieve the chunks that contain the answer?
- Context precision — how much of what we retrieved was actually relevant?
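As a minimal sketch of the two retrieval metrics above, assuming each golden-set question is labeled with the IDs of the chunks that contain its answer (the function names and chunk IDs here are illustrative, not from any particular library):

```python
def context_recall(retrieved_ids, relevant_ids):
    """Fraction of the labeled answer-bearing chunks that were retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)


def context_precision(retrieved_ids, relevant_ids):
    """Fraction of the retrieved chunks that are labeled answer-bearing."""
    retrieved = set(retrieved_ids)
    if not retrieved:
        return 0.0  # nothing retrieved, so nothing relevant was retrieved
    return len(retrieved & set(relevant_ids)) / len(retrieved)


# Hypothetical chunk IDs for one question.
retrieved = ["c1", "c4", "c7", "c9"]
relevant = ["c1", "c2", "c4"]

recall = context_recall(retrieved, relevant)        # 2 of 3 labeled chunks found
precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved chunks are labeled
```

This is the plain set-based form. Rank-aware or model-judged variants exist and may suit your pipeline better if chunk order matters.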
Generation metrics
- Faithfulness — is the answer grounded in the retrieved context, or hallucinated?
- Answer relevance — does it actually address the question asked?
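One common way to make faithfulness a number is to split the answer into claims and report the share that the retrieved context supports. The sketch below assumes that framing. Its `lexical_support` judge is a deliberately crude word-overlap stand-in, not a real groundedness check; in practice you would likely pass a model-based judge through the hypothetical `is_supported` parameter. Answer relevance needs a semantic comparison and is not sketched here.

```python
import re


def split_claims(answer):
    """Naive claim splitter: treat each sentence as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def lexical_support(claim, context, threshold=0.8):
    """Stand-in judge: supported when most of the claim's words appear in the context."""
    words = set(re.findall(r"[a-z0-9]+", claim.lower()))
    if not words:
        return True
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    return len(words & context_words) / len(words) >= threshold


def faithfulness(answer, context, is_supported=lexical_support):
    """Share of the answer's claims that the judge finds supported by the context."""
    claims = split_claims(answer)
    if not claims:
        return 1.0  # an empty answer makes no unsupported claims
    return sum(is_supported(c, context) for c in claims) / len(claims)


# Hypothetical example: the first sentence is grounded, the second is invented.
context = "Refunds are available within 30 days of purchase. Shipping fees are not refundable."
answer = "Refunds are available within 30 days. You also get a free gift card."
faith_score = faithfulness(answer, context)
```

Keeping the judge pluggable means the scoring arithmetic stays fixed while you swap in a stronger support check.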
Make it runnable
Build a golden set of question/answer/context triples and score every change against it:
from evals import score

results = score(
    dataset="support-golden-v3",
    metrics=["context_recall", "faithfulness", "answer_relevance"],
    pipeline=my_rag_pipeline,
)

assert results["faithfulness"] > 0.9  # fail the build if it regresses
Faithfulness below your threshold should break CI the same way a failing unit test does. Hallucination is a regression, not a vibe.
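If you gate on more than one metric, it can help to report every failure at once instead of stopping at the first assert. A rough sketch, assuming the results behave like a dict of metric name to float; the helper name, scores, and floors below are all hypothetical:

```python
def check_thresholds(results, thresholds):
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for metric, minimum in thresholds.items():
        value = results.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from results")
        elif value < minimum:
            failures.append(f"{metric}: {value:.3f} < {minimum:.3f}")
    return failures


# Hypothetical scores and floors, for illustration only.
example_results = {"context_recall": 0.94, "faithfulness": 0.88, "answer_relevance": 0.91}
example_thresholds = {"context_recall": 0.85, "faithfulness": 0.90, "answer_relevance": 0.85}

failures = check_thresholds(example_results, example_thresholds)
# In CI, exit nonzero when the list is non-empty, e.g.:
# if failures: raise SystemExit("\n".join(failures))
```

Treating a missing metric as a failure keeps a silently dropped score from passing the gate.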
Watch it in production, too
Offline evals catch regressions before release; production sampling catches the long tail. Log a random slice of live traffic, score it nightly, and alert when faithfulness drifts. The questions your users actually ask are rarely the ones in your golden set, which is exactly why you sample.
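The sample-and-alert loop can be sketched in a few lines. This assumes a fixed per-request sampling rate and a simple "mean below baseline minus tolerance" drift rule. The rate, baseline, tolerance, and function names are placeholders to tune, not recommendations.

```python
import random
from statistics import mean


def should_sample(rng, rate=0.01):
    """Decide per request whether to log it for nightly scoring."""
    return rng.random() < rate


def faithfulness_drifted(nightly_scores, baseline, tolerance=0.05):
    """True when the night's mean faithfulness is more than `tolerance` below baseline."""
    if not nightly_scores:
        return False  # no samples means no signal; alert on empty nights separately
    return mean(nightly_scores) < baseline - tolerance


rng = random.Random(0)  # seeded only so the example is repeatable
sampled = [i for i in range(10_000) if should_sample(rng, rate=0.01)]

# Hypothetical nightly scores for the sampled requests.
alert = faithfulness_drifted([0.90, 0.80, 0.85, 0.75], baseline=0.92)
```

A mean-versus-baseline rule is the simplest option. With small nightly samples you may want a rolling window so one noisy night does not page anyone.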