AI & ML

Evaluating RAG: the metrics that actually matter

Retrieval quality, faithfulness, and answer relevance — what to measure when 'it feels better' isn't good enough for production.


Most RAG systems ship on vibes. Someone asks three questions, the answers look good, and it goes to prod. Then a customer asks the fourth question and the wheels come off. Here's how to measure what's actually happening.

Separate retrieval from generation

A RAG failure is either a retrieval problem (you fetched the wrong context) or a generation problem (you had the right context and the model still got it wrong). Measure them separately or you'll fix the wrong half.
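One way to make that split concrete is a small triage helper. This is a minimal sketch, assuming you already have a context recall score and a faithfulness score for a failing example (both metrics are defined in the sections that follow). The function name `triage` and the floor values are illustrative assumptions, not part of any library.

```python
def triage(context_recall: float, faithfulness: float,
           recall_floor: float = 0.8, faithfulness_floor: float = 0.9) -> str:
    """Attribute a failure to retrieval or generation.

    The floors are hypothetical defaults; tune them for your own data.
    """
    # Check retrieval first: if the right context never arrived,
    # the generator cannot be blamed for what came out.
    if context_recall < recall_floor:
        return "retrieval"
    # The context was there, but the answer was not grounded in it.
    if faithfulness < faithfulness_floor:
        return "generation"
    return "ok"


print(triage(context_recall=0.5, faithfulness=0.95))  # retrieval
print(triage(context_recall=0.9, faithfulness=0.70))  # generation
```

Checking retrieval first is a deliberate choice: a low faithfulness score on top of bad context usually says more about the retriever than about the model.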

Retrieval metrics

  • Context recall — did we retrieve the chunks that contain the answer?
  • Context precision — how much of what we retrieved was actually relevant?
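Both retrieval metrics reduce to set arithmetic once each golden question is labeled with the chunk IDs that contain its answer. The sketch below uses the simplest set-based definitions. The chunk ID format and function names are assumptions for illustration, and some evaluation frameworks use rank-aware variants instead.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Fraction of the answer-bearing chunks that were retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)


def context_precision(retrieved_ids, relevant_ids):
    """Fraction of the retrieved chunks that were actually relevant."""
    retrieved = list(dict.fromkeys(retrieved_ids))  # dedupe, keep order
    if not retrieved:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for chunk in retrieved if chunk in relevant) / len(retrieved)


# Hypothetical chunk IDs: four retrieved, three labeled as relevant.
retrieved = ["doc3#2", "doc7#1", "doc1#4", "doc9#0"]
relevant = ["doc3#2", "doc1#4", "doc5#5"]

recall = context_recall(retrieved, relevant)        # 2 of 3 relevant chunks found
precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved chunks relevant
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.67 precision=0.50
```

The two numbers pull against each other: retrieving more chunks tends to raise recall and lower precision, which is why it helps to track both.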

Generation metrics

  • Faithfulness — is the answer grounded in the retrieved context, or hallucinated?
  • Answer relevance — does it actually address the question asked?
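Faithfulness is commonly scored by splitting the answer into claims and checking each one against the retrieved context. The sketch below shows that shape with a pluggable judge. In practice the judge is usually an LLM or an entailment model; the word-overlap judge here is a deliberately crude stand-in so the example runs on its own, and every function name is an assumption.

```python
import re
from typing import Callable


def split_claims(answer: str) -> list[str]:
    """Naive claim splitter: treat each sentence as one claim."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def faithfulness(answer: str, context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of claims in the answer that the judge finds supported."""
    claims = split_claims(answer)
    if not claims:
        return 0.0
    return sum(is_supported(claim, context) for claim in claims) / len(claims)


def overlap_judge(claim: str, context: str, threshold: float = 0.6) -> bool:
    """Stand-in judge: share of the claim's words that appear in the context."""
    claim_words = set(re.findall(r"[a-z0-9]+", claim.lower()))
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    if not claim_words:
        return False
    return len(claim_words & context_words) / len(claim_words) >= threshold


context = ("Refunds are available within 30 days of purchase. "
           "Shipping fees are not refundable.")
answer = "Refunds are available within 30 days. You also get a free gift card."

score = faithfulness(answer, context, overlap_judge)
print(score)  # 0.5: the refund claim is supported, the gift card claim is not
```

Answer relevance can follow the same pattern, with a judge that compares the answer to the question instead of to the context.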

Make it runnable

Build a golden set of question/answer/context triples and score every change against it:

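from evals import score

# my_rag_pipeline is your pipeline entry point, defined elsewhere.
# Run it over the golden set and compute each metric.
results = score(
    dataset="support-golden-v3",
    metrics=["context_recall", "faithfulness", "answer_relevance"],
    pipeline=my_rag_pipeline,
)
# Fail the build if faithfulness regresses below the threshold.
assert results["faithfulness"] > 0.9, "faithfulness regressed below 0.9"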

Faithfulness below your threshold should break CI the same way a failing unit test does. Hallucination is a regression, not a vibe.
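A single assert works for one metric. Once several metrics have thresholds, a small gate that reports every regression and returns an exit code can be easier to read in CI logs. This is a sketch under stated assumptions: the threshold values are illustrative, and `results` is assumed to be a plain mapping of metric name to score, like the one indexed above.

```python
import sys

# Illustrative minimums, not recommendations.
THRESHOLDS = {
    "context_recall": 0.85,
    "faithfulness": 0.90,
    "answer_relevance": 0.80,
}


def find_regressions(results, thresholds):
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        value = results.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from results")
        elif value < minimum:
            failures.append(f"{metric}: {value:.3f} < {minimum:.3f}")
    return failures


def gate(results, thresholds=THRESHOLDS) -> int:
    """Print regressions to stderr and return a process exit code."""
    failures = find_regressions(results, thresholds)
    for line in failures:
        print(f"EVAL REGRESSION {line}", file=sys.stderr)
    return 1 if failures else 0


# Hypothetical scores: faithfulness has slipped below its minimum.
example = {"context_recall": 0.91, "faithfulness": 0.87, "answer_relevance": 0.88}
exit_code = gate(example)  # 1, so a CI step calling sys.exit(gate(...)) would fail
```

Returning an exit code instead of asserting also keeps the gate active when Python runs with `-O`, which strips assert statements.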

Watch it in production, too

Offline evals catch regressions before release; production sampling catches the long tail. Log a random slice of live traffic, score it nightly, and alert when faithfulness drifts. The questions your users actually ask are never the ones in your golden set — that's exactly why you sample.
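The sampling and alerting loop can be sketched in a few lines. The sample rate, tolerance, and minimum sample count below are assumptions to tune, and the per-request scores are assumed to come from whatever faithfulness scorer you already run offline.

```python
import random
from statistics import mean


def should_sample(rng: random.Random, rate: float) -> bool:
    """Decide per request whether to log it for nightly scoring."""
    return rng.random() < rate


def drifted(scores, baseline: float, tolerance: float = 0.05,
            min_samples: int = 30) -> bool:
    """True when the mean score falls more than `tolerance` below baseline.

    Too few samples returns False, so a quiet night does not page anyone.
    """
    if len(scores) < min_samples:
        return False
    return mean(scores) < baseline - tolerance


# Sample roughly 2% of 10,000 simulated requests.
rng = random.Random(42)
sampled = [i for i in range(10_000) if should_sample(rng, rate=0.02)]
print(len(sampled))

# Hypothetical nightly scores compared against a golden-set baseline of 0.90.
print(drifted([0.80] * 50, baseline=0.90))  # True: mean is 0.10 below baseline
print(drifted([0.88] * 50, baseline=0.90))  # False: within tolerance
```

Comparing against the golden-set baseline ties the production alert to the same number that gates CI, so both checks are measuring against one definition of "good".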
