
RAG Evaluation That Actually Correlates with Users

Ravinder · 8 min read
AI · RAG · Evaluation · LLM

A team ships a RAG system. RAGAS faithfulness: 0.87. Context precision: 0.81. Answer relevancy: 0.89. The dashboard is green. Three weeks later, users are filing bug reports because the system confidently answers from the wrong document. The metrics never caught it.

This is the metric ceiling problem. Automated RAG metrics measure what they can measure. They don't measure what users actually care about. The gap between those two things destroys trust in production systems.

Here's how to build evaluation that doesn't lie.

What RAGAS Actually Measures (And What It Doesn't)

RAGAS gives you four core metrics. Each has a precise definition that diverges from user intent in specific ways.

Metric | What it measures | What it misses
Faithfulness | Are claims in the answer supported by the context? | Whether the context itself is correct
Context Precision | Are the top-ranked chunks relevant? | Whether the right chunk was retrieved at all
Context Recall | Did the retrieved context cover the ground truth? | Requires a reference answer to compare against
Answer Relevancy | Does the answer address the question? | Factual accuracy; a relevant wrong answer still scores high
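# Minimal RAGAS run over a single example (assumes the ragas and datasets packages are installed)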
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset
 
data = {
    "question": ["What is the refund window?"],
    "answer": ["You have 30 days to request a refund."],
    "contexts": [["Customers may request refunds within 30 days of purchase."]],
    "ground_truth": ["The refund window is 30 days."],
}
 
dataset = Dataset.from_dict(data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(results)

The critical blind spot: RAGAS faithfulness checks if the answer is grounded in the retrieved context — not whether the context is from the right document, not whether the context is accurate, not whether the answer is safe to act on.

A system that retrieves the wrong policy document and answers faithfully from it will score 1.0 on faithfulness.
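To make that concrete, here is a sketch of the failure using the same RAGAS setup as above; the policy documents are hypothetical. Faithfulness only compares the answer against whatever was retrieved, so it has nothing to penalize here.

# Sketch of the blind spot: the retrieved chunk comes from the wrong
# (hypothetical) policy document, but the answer restates it faithfully.
from ragas import evaluate
from ragas.metrics import faithfulness
from datasets import Dataset

wrong_doc_case = {
    "question": ["What is the refund window for US customers?"],
    # Retrieved from the EU policy instead of the US policy
    "contexts": [["EU customers may request refunds within 14 days of purchase."]],
    # Fully grounded in the retrieved chunk, wrong for the user
    "answer": ["You have 14 days to request a refund."],
}

results = evaluate(Dataset.from_dict(wrong_doc_case), metrics=[faithfulness])
print(results)  # high faithfulness despite answering from the wrong document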

The Metric Ceiling Problem

RAGAS scores plateau. Once you get faithfulness above ~0.85, further improvements don't correlate with user satisfaction. This is because:

  1. LLMs grade leniently. RAGAS uses an LLM to judge faithfulness. The judge model tends to give credit for paraphrasing even when meaning shifts.
  2. Ground truth drift. Your reference answers go stale as your knowledge base updates.
  3. Distribution shift. Your eval set was built from common queries. Edge cases — where real failures happen — aren't represented.
The evaluation stack has four layers:

  • Automatic metrics (RAGAS / BLEU / ROUGE): covers speed and scale
  • Human-in-the-loop review (spot checks, adversarial probes): covers edge cases and intent
  • Regression tests (golden-set comparisons): covers regressions on known failures
  • Production monitoring (thumbs-downs, escalations): covers real user failures

You need all four layers. Automatic metrics alone will plateau and stop telling you anything useful.

Building a Real Eval Set

Most teams build eval sets the wrong way: they generate questions from their documents using an LLM and use those as ground truth. The problem is that LLM-generated questions are easy. They test the average case. They don't test the cases where the system fails users.

A real eval set has three populations:

Canonical queries (40%): The common, well-formed questions your system handles well. These catch regressions.

Adversarial queries (30%): Questions designed to expose failure modes.

  • Queries with correct keywords but wrong intent
  • Queries whose answer spans multiple chunks
  • Queries that have no good answer in the corpus
  • Ambiguous queries that require clarification

Real failure queries (30%): Queries taken from actual user sessions where the system gave a wrong or unhelpful answer. These are the most valuable because they represent confirmed failures.

# Structured eval record
from dataclasses import dataclass
from typing import Optional
 
@dataclass
class EvalRecord:
    query: str
    expected_answer: str
    relevant_doc_ids: list[str]       # for retrieval recall
    query_type: str                   # "canonical" | "adversarial" | "failure"
    failure_mode: Optional[str]       # "no_answer" | "multi_chunk" | "ambiguous"
    created_at: str
    annotator: str
 
# Save as JSONL for versioning
import json
 
def save_eval_set(records: list[EvalRecord], path: str):
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r.__dict__) + "\n")
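For illustration, one record from each population might look like this (the queries, document IDs, dates, and annotator names are hypothetical):

# Hypothetical example records, one per population
example_records = [
    EvalRecord(
        query="What is the refund window?",
        expected_answer="The refund window is 30 days.",
        relevant_doc_ids=["refund-policy-v3"],
        query_type="canonical",
        failure_mode=None,
        created_at="2024-05-01",
        annotator="annotator-1",
    ),
    EvalRecord(
        query="Can I return a gift card I bought for someone else?",
        expected_answer="Gift cards are non-refundable.",
        relevant_doc_ids=["refund-policy-v3", "gift-card-terms"],
        query_type="adversarial",
        failure_mode="multi_chunk",
        created_at="2024-05-01",
        annotator="annotator-1",
    ),
    EvalRecord(
        query="Why was my refund denied?",  # taken from a real failed session
        expected_answer="The system should ask which order the user means.",
        relevant_doc_ids=["refund-policy-v3"],
        query_type="failure",
        failure_mode="ambiguous",
        created_at="2024-05-01",
        annotator="annotator-1",
    ),
]

save_eval_set(example_records, "eval_set_v1.jsonl")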

Version your eval set. Treat changes to it as you would changes to production code — review them, don't just overwrite.

Automatic Metrics That Are Actually Useful

Beyond RAGAS, these metrics have better signal:

G-Eval with explicit criteria: Instead of asking the LLM "is this faithful?", give it a rubric.

import json

from openai import OpenAI

client = OpenAI()
 
G_EVAL_PROMPT = """
You are evaluating a RAG system's answer. Score the answer on this criterion:
 
CRITERION: Factual Grounding
Score 1–5 where:
5 = Every factual claim in the answer is directly supported by a verbatim or close paraphrase of the context.
4 = All key claims supported; minor stylistic paraphrasing.
3 = Most claims supported; one claim is implied but not explicit.
2 = Some claims unsupported or require inference beyond the context.
1 = Answer contains claims that contradict or are absent from the context.
 
Context: {context}
Answer: {answer}
 
Output: {{"score": <1-5>, "reasoning": "<one sentence>"}}
"""
 
def g_eval_grounding(context: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": G_EVAL_PROMPT.format(
                context=context, answer=answer
            )}
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

Citation accuracy: If your system returns source citations, verify they're correct.

def check_citation_accuracy(answer: str, cited_chunks: list[str]) -> float:
    """Fraction of cited chunks that actually support the answer."""
    supported = 0
    for chunk in cited_chunks:
        result = g_eval_grounding(context=chunk, answer=answer)
        if result["score"] >= 4:
            supported += 1
    return supported / len(cited_chunks) if cited_chunks else 0.0

No-answer detection: Your system should refuse to answer when there's no relevant context. Measure refusal rate on unanswerable queries.

UNANSWERABLE_QUERIES = [
    "What is the CEO's home address?",
    "What will the stock price be next quarter?",
    # ... domain-specific questions outside your corpus
]
 
def measure_refusal_rate(rag_pipeline, unanswerable: list[str]) -> float:
    refused = 0
    for query in unanswerable:
        answer = rag_pipeline.answer(query)
        # Check for refusal signals
        refusal_phrases = ["I don't have information", "not in my knowledge base", "I cannot find"]
        if any(p.lower() in answer.lower() for p in refusal_phrases):
            refused += 1
    return refused / len(unanswerable)

Human-in-the-Loop: Where to Actually Spend Human Time

Human eval is expensive. Spend it on the cases automated metrics can't handle.

Priority 1 — Adversarial failures: Run your adversarial eval set weekly. Have a human judge every case where the automated score is below threshold AND the user-facing behavior looks wrong.

Priority 2 — Automated disagreements: When faithfulness is high but answer relevancy is low (or vice versa), something is structurally wrong. Human review finds it.
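A simple filter over per-record scores can surface those disagreements automatically; the 0.8 and 0.5 cutoffs below are illustrative, not prescriptive.

# Flag records where automated metrics disagree, for human review.
# Thresholds are illustrative; tune them to your own score distributions.
def flag_metric_disagreements(
    scored_records: list[dict],
    high: float = 0.8,
    low: float = 0.5,
) -> list[dict]:
    flagged = []
    for r in scored_records:
        faith, rel = r["faithfulness"], r["answer_relevancy"]
        if (faith >= high and rel <= low) or (rel >= high and faith <= low):
            flagged.append(r)  # grounded but off-topic, or on-topic but ungrounded
    return flagged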

Priority 3 — New query clusters: As usage grows, cluster new queries by embedding. Any cluster that's large but not in your eval set is a gap. Sample from it for human annotation.

from sklearn.cluster import KMeans
import numpy as np
 
def find_uncovered_query_clusters(
    production_queries: list[str],
    eval_queries: list[str],
    embedder,
    n_clusters: int = 20,
) -> list[int]:
    """Return cluster IDs that have production queries but no eval coverage."""
    all_embeddings = embedder.encode(production_queries + eval_queries)
    prod_embs = all_embeddings[:len(production_queries)]
    eval_embs = all_embeddings[len(production_queries):]
 
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(all_embeddings)
 
    prod_clusters = set(kmeans.predict(prod_embs))
    eval_clusters = set(kmeans.predict(eval_embs))
 
    uncovered = prod_clusters - eval_clusters
    return list(uncovered)
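As a follow-up sketch, you can then pull a few production queries from each uncovered cluster for annotation; this assumes you also keep the per-query labels from kmeans.predict(prod_embs).

import random

def sample_uncovered_queries(
    production_queries: list[str],
    prod_cluster_labels: list[int],
    uncovered: list[int],
    n_per_cluster: int = 5,
) -> dict[int, list[str]]:
    """Pick a few production queries from each cluster with no eval coverage."""
    samples = {}
    for cluster_id in uncovered:
        members = [
            q for q, label in zip(production_queries, prod_cluster_labels)
            if label == cluster_id
        ]
        samples[cluster_id] = random.sample(members, min(n_per_cluster, len(members)))
    return samples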

Regression Testing Without the Pain

Regression testing for RAG is hard because LLM outputs are non-deterministic. You can't do string equality checks.

Two approaches that work:

Score thresholds: Run every release against your golden set. If any metric drops more than a defined threshold, block the release.

REGRESSION_THRESHOLDS = {
    "faithfulness": 0.05,    # max allowed drop
    "recall@10": 0.03,
    "citation_accuracy": 0.04,
    "refusal_rate_unanswerable": 0.10,
}
 
def regression_check(baseline_scores: dict, new_scores: dict) -> list[str]:
    failures = []
    for metric, max_drop in REGRESSION_THRESHOLDS.items():
        drop = baseline_scores[metric] - new_scores[metric]
        if drop > max_drop:
            failures.append(
                f"{metric}: dropped {drop:.3f} (threshold {max_drop})"
            )
    return failures

Semantic equivalence: For known queries, check if the new answer is semantically equivalent to the baseline.

import numpy as np

def semantic_equivalence(answer_a: str, answer_b: str, embedder) -> float:
    """Cosine similarity between the embeddings of two answers."""
    emb_a = embedder.encode([answer_a])[0]
    emb_b = embedder.encode([answer_b])[0]
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

# Flag if the new answer diverges too much from the baseline
EQUIVALENCE_THRESHOLD = 0.85
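A sketch of how this could drive a regression check over a golden set; the baseline answers are assumed to be stored from the last accepted release.

def semantic_regressions(
    golden_queries: list[str],
    baseline_answers: list[str],
    new_answers: list[str],
    embedder,
) -> list[str]:
    """Return queries whose new answer drifted too far from the baseline answer."""
    drifted = []
    for query, old, new in zip(golden_queries, baseline_answers, new_answers):
        if semantic_equivalence(old, new, embedder) < EQUIVALENCE_THRESHOLD:
            drifted.append(query)  # route to human review before release
    return drifted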

Production Signal Is Your Best Eval

Every user thumbs-down, every support escalation, every "that's wrong" is a free labeled example. Capture them.

# Log every RAG response with context for post-hoc analysis
import uuid
from datetime import datetime
 
def log_rag_response(query, contexts, answer, user_id):
    return {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.utcnow().isoformat(),
        "user_id": user_id,
        "query": query,
        "retrieved_chunk_ids": [c.id for c in contexts],
        "answer": answer,
        "feedback": None,  # filled when user signals good/bad
    }

A thumbs-down rate above 5% on any query cluster is a retrieval or generation problem waiting to be diagnosed. Below 2% is your bar for "good enough to ship."
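As a sketch, that per-cluster rate can be computed straight from the logs above, assuming each log record has been assigned a query cluster and its feedback field is set to "down" when the user gives a thumbs-down.

from collections import defaultdict

def thumbs_down_rate_by_cluster(logged: list[dict], cluster_of: dict) -> dict:
    """cluster_of maps a log record id to its (assumed precomputed) query cluster."""
    totals, downs = defaultdict(int), defaultdict(int)
    for rec in logged:
        if rec["feedback"] is None:
            continue  # only count records where the user left a signal
        cluster = cluster_of[rec["id"]]
        totals[cluster] += 1
        if rec["feedback"] == "down":
            downs[cluster] += 1
    # Above 0.05: diagnose retrieval/generation. Below 0.02: good enough to ship.
    return {c: downs[c] / totals[c] for c in totals}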

Key Takeaways

  • RAGAS faithfulness being high doesn't mean your answers are accurate — it means they're grounded in whatever you retrieved.
  • Build eval sets with adversarial and real-failure queries, not just LLM-generated canonical ones.
  • Human eval time is best spent on cases where automated metrics disagree with each other.
  • Run retrieval eval (recall@K) separately from generation eval — they fail independently.
  • Semantic equivalence thresholds are more useful than string matching for regression testing non-deterministic outputs.
  • Production thumbs-down rate is the ground truth; everything else is a proxy.