For the last few days I've been working on a project around replying to customer support tickets, and as part of it I was looking into RAG evaluations. This post is me writing down what I learned, mostly as a future reference for myself.
Imagine a human agent asks a customer support bot/copilot:
"What is our refund policy when a parcel is marked delivered but the customer says it never arrived?"
The copilot answers:
"Customers may return items within 30 days in original packaging."
That answer is grounded: it comes straight from the internal knowledge base, the wiki/Notion/Confluence pages we indexed. Every word traces back to a retrieved document. It's also accurate, fluent, and useless, because it answers a question nobody asked.
Now the other way: the copilot hallucinates an answer about refund gift cards. Also a bad answer, but a different kind of bad, because no document backs it.
The problem is that both answers score "bad", and a single "is the answer good" number can't tell you which kind of bad you're looking at, or what to fix.
A RAG pipeline has three components: the user query, the retrieved context, and the generated answer. Quality is a property of the relationships between the three, not of the answer alone.
The three scores
Context relevance
Context relevance judges the retrieved chunks against the query, before any generation happens. In practice it asks: do the chunks we pulled out of the knowledge base have anything to do with the question?
Take this question:
What do I do if a customer claims they didn't receive their order but the carrier marked it as completed?
A good system returns the disputed-delivery SOP and the carrier liability clause. A bad system returns the returns policy, a chunk about delivery delays, and the holiday shipping schedule, three chunks that are all about delivery and can't answer the question between them.
Suppose the knowledge base also holds a fraud rule, "two disputed deliveries in 90 days triggers manual review", and the retriever never surfaces it. Every chunk that did come back is relevant and the answer reads complete, but a crucial process was missed.
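One way to catch this kind of miss is to keep, per test question, a list of the documents a complete answer needs and compare it with what the retriever returned. A minimal sketch, assuming every chunk carries a document ID; the IDs and the `retrievalCoverage` helper are made up for illustration and don't come from any library:

```typescript
// Hypothetical helper: which required documents did the retriever miss?
function retrievalCoverage(
  retrievedIds: string[],
  requiredIds: string[],
): { recall: number; missing: string[] } {
  const retrieved = new Set(retrievedIds);
  const missing = requiredIds.filter((id) => !retrieved.has(id));
  // With nothing required there is nothing to miss, so treat recall as 1.
  const recall =
    requiredIds.length === 0
      ? 1
      : (requiredIds.length - missing.length) / requiredIds.length;
  return { recall, missing };
}

// Illustrative document IDs for the disputed-delivery question.
const coverage = retrievalCoverage(
  ["disputed-delivery-sop", "carrier-liability-clause"],
  ["disputed-delivery-sop", "carrier-liability-clause", "fraud-rule-two-disputes-90d"],
);
console.log(coverage.missing); // the fraud rule ID is the only entry
```

Both retrieved chunks are relevant here, so a relevance score alone looks fine. The `missing` list is what exposes the fraud rule.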
Answer faithfulness
Faithfulness (or groundedness) judges the answer against the retrieved context only, without caring about the query. The knowledge base says "we will issue a refund or reship the item once the carrier finishes their investigation, which takes up to 10 business days", and the copilot adds on top "...and customers will get a 10% discount voucher." Nobody wrote that voucher down anywhere. The model just felt generous.
The more subtle version of that failure is when the knowledge base says "we will refund you after the carrier investigation" and the copilot answers "we will refund you immediately, and the investigation will run in parallel". Every word is on topic, the tone is right, and the policy has been reversed.
Answer relevance
The third score flips it around: it judges the answer against the query, ignoring the context. The 30-day returns answer in the intro is fully faithful to a retrieved chunk and irrelevant to the agent's question about a parcel marked delivered that never arrived. 100% grounded in the wrong context.
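To make the "each score looks at a different pair" idea concrete, here is a rough sketch. The `Judge` type is an assumption: it stands in for whatever does the rating (an LLM prompt, a human) and returns a number between 0 and 1. The instruction strings are illustrative too, not tuned prompts.

```typescript
// One logged copilot interaction: the three components of a RAG pipeline.
type RagSample = {
  query: string;
  contexts: string[]; // retrieved chunks
  answer: string;
};

// Hypothetical judge: rates `judged` against `given`, returning a score in [0, 1].
// A real one would wrap an LLM call or a human rating.
type Judge = (instruction: string, given: string, judged: string) => number;

type RagScores = {
  contextRelevance: number;
  faithfulness: number;
  answerRelevance: number;
};

function scoreSample(sample: RagSample, judge: Judge): RagScores {
  const context = sample.contexts.join("\n");
  return {
    // query vs. context: the answer is not shown to the judge
    contextRelevance: judge("Do these chunks help answer the question?", sample.query, context),
    // context vs. answer: the query is not shown to the judge
    faithfulness: judge("Is every claim in the answer backed by the chunks?", context, sample.answer),
    // query vs. answer: the context is not shown to the judge
    answerRelevance: judge("Does the answer address the question?", sample.query, sample.answer),
  };
}
```

The point of the shape is that no single call ever sees all three components, which is what keeps the three scores independent.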
Debugging
Why track three scores instead of one? Because it helps debugging to see which part of the system fails.
Low context relevance is a retriever problem, caught at the source. A judge (either LLM or human) comparing the retrieved chunks to the query flags them as off-topic on the spot. Whatever the copilot answers afterwards does not change the diagnosis. The retriever has failed.
Low faithfulness with good context is a generator problem. The AI was handed the correct information on a silver platter and still made things up: the retriever delivered the disputed-delivery SOP with the 10-day investigation timeline, and the copilot invented the 10% discount voucher anyway. Context relevance is high, so the retriever is innocent and the fix is on the generation side.
High faithfulness with low answer relevance is the same retriever disease as the first case, diagnosed from the other end. The difference is that here the wrong document does not look wrong, so a judge (LLM or human) can't easily say that it is not relevant. The agent asks "How do I cancel an order that is already being picked in the warehouse?" and the retriever grabs the subscription cancellation policy, because "cancel" matches beautifully in embedding space. It's a cancellation policy after all, and the copilot writes a flawless, fully grounded walkthrough of how to end a subscription.
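The three cases above boil down to a small decision procedure. A sketch, assuming scores between 0 and 1 and an arbitrary cut-off of 0.5 that you would tune on your own data; the `diagnose` function and its labels are made up for this post:

```typescript
type RagScores = {
  contextRelevance: number;
  faithfulness: number;
  answerRelevance: number;
};

type Diagnosis = "retriever" | "generator" | "retriever-silent" | "ok";

// Hypothetical triage: maps the three scores to the component to look at first.
function diagnose(scores: RagScores, threshold = 0.5): Diagnosis {
  // Off-topic chunks: the retriever failed, whatever the answer looks like.
  if (scores.contextRelevance < threshold) return "retriever";
  // Good chunks, unsupported claims: the generator made things up.
  if (scores.faithfulness < threshold) return "generator";
  // Chunks looked fine and the answer is grounded, yet it misses the question:
  // the retriever picked a plausible-looking but wrong document.
  if (scores.answerRelevance < threshold) return "retriever-silent";
  return "ok";
}

// The subscription-cancellation case: context judged relevant, answer grounded, question missed.
console.log(diagnose({ contextRelevance: 0.8, faithfulness: 0.9, answerRelevance: 0.2 }));
```

The order of the checks matters: context relevance comes first because a low score there settles the diagnosis regardless of the other two.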
The four abilities
The three scores help identify the problem. A second set of questions measures the system as a whole, and the easiest way to keep them apart is to ask where the answer actually lives in your knowledge base.
Noise robustness measures how well the system can find the answer when it is surrounded by junk. Ask for the weight limit for express shipments to Switzerland and retrieval returns the rate card with the limit (30kg), a marketing post announcing that the route exists, and a customs FAQ. A robust system answers "30kg" from the rate card. A fragile one averages over its context and produces something like "Express shipping is available to Switzerland and may be subject to VAT."
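Negative rejection measures whether the system resists inventing an answer that doesn't exist. For example, ask about refunds for shipments to Antarctica when the knowledge base covers only the EU and US. The passing system says it has no policy for that case instead of stretching the nearest EU policy to fit.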
Information integration measures how well the system can synthesise an answer from several pieces of the knowledge base. "Can a customer in Norway return a damaged standing desk for free?" No single document answers it. The damage policy says damaged items return free with photo evidence, the country matrix says Norway is non-EU so the customer pays return customs unless it is seller error, and the furniture SOP says oversized items need a freight pickup. The correct answer composes all three.
Counterfactual robustness measures how well the system behaves when the knowledge base contains contradicting info. The knowledge base holds a 2019 policy PDF that nobody deleted, saying the insured maximum is 500 euros, next to the current policy saying 1,000. The passing system flags the conflict, or prefers the newest source. The failing system turns the stale number into a confident answer.
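The "prefer the newest source, flag the conflict" rule can be sketched like this. Big assumption: each chunk already comes with a year and a parsed value, which a real pipeline would have to extract first. `PolicyChunk`, `resolveInsuredMax`, and the source names are hypothetical.

```typescript
// Hypothetical shape: a retrieved policy chunk with its year and the value it states.
type PolicyChunk = { source: string; year: number; insuredMaxEur: number };

type Resolution = { value: number; source: string; conflict: boolean };

function resolveInsuredMax(chunks: PolicyChunk[]): Resolution | undefined {
  if (chunks.length === 0) return undefined; // nothing retrieved, nothing to resolve
  // Prefer the most recent policy.
  const newest = chunks.reduce((a, b) => (b.year > a.year ? b : a));
  // Flag a conflict if any other chunk states a different value.
  const conflict = chunks.some((c) => c.insuredMaxEur !== newest.insuredMaxEur);
  return { value: newest.insuredMaxEur, source: newest.source, conflict };
}

const resolved = resolveInsuredMax([
  { source: "policy-2019.pdf", year: 2019, insuredMaxEur: 500 },
  { source: "policy-2025", year: 2025, insuredMaxEur: 1000 },
]);
console.log(resolved); // value 1000 from the 2025 policy, with conflict set to true
```

Whether the copilot should answer with the newest value or surface the conflict to the agent is a product decision. The sketch only makes both signals available.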
Build the eval set by failure mode
The practical consequence: don't collect "100 representative questions". Collect cases per failure mode, because each ability needs a differently constructed test.
type EvalCase = {
  query: string;
  ability: "noise" | "rejection" | "integration" | "counterfactual";
  // what makes this case test that ability
  setup: string;
  // crude keyword check on the answer text
  pass: (answer: string) => boolean;
};

const cases: EvalCase[] = [
  {
    query: "Weight limit for express to Switzerland?",
    ability: "noise",
    setup: "rate card + 2 topical-but-empty chunks forced into context",
    pass: (a) => a.includes("30") && !a.includes("VAT"),
  },
  {
    query: "Refund policy for shipments to Antarctica?",
    ability: "rejection",
    setup: "no covering document exists, EU policy is nearest neighbor",
    // accept straight and curly apostrophes in "don't"
    pass: (a) => /don['’]t have|not defined|no policy/i.test(a),
  },
  {
    query: "Max insured value for a standard shipment?",
    ability: "counterfactual",
    setup: "stale 2019 policy (500) and current 2025 policy (1000) both indexed",
    // accept "1,000", "1.000", or "1000"; flagging the conflict also passes
    pass: (a) => /1[,.]?000/.test(a) || /conflict|superseded/i.test(a),
  },
];
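To use cases like these, I'd tally pass rates per ability rather than overall. A sketch under a few assumptions: answers were recorded beforehand and are keyed by query, a missing answer counts as a failure, and the integration case (built from the Norway example) uses a keyword check that is only a crude stand-in for a proper judge. `passRateByAbility` is a made-up helper.

```typescript
type Ability = "noise" | "rejection" | "integration" | "counterfactual";

type EvalCase = {
  query: string;
  ability: Ability;
  setup: string;
  pass: (answer: string) => boolean;
};

type Tally = { passed: number; total: number };

// Hypothetical integration case: the answer must draw on all three documents.
const integrationCase: EvalCase = {
  query: "Can a customer in Norway return a damaged standing desk for free?",
  ability: "integration",
  setup: "damage policy, country matrix, and furniture SOP all retrieved",
  pass: (a) => /photo/i.test(a) && /customs/i.test(a) && /freight/i.test(a),
};

// Counts passes per ability; `answers` maps each query to the recorded copilot answer.
function passRateByAbility(
  cases: EvalCase[],
  answers: Map<string, string>,
): Record<Ability, Tally> {
  const tally: Record<Ability, Tally> = {
    noise: { passed: 0, total: 0 },
    rejection: { passed: 0, total: 0 },
    integration: { passed: 0, total: 0 },
    counterfactual: { passed: 0, total: 0 },
  };
  for (const c of cases) {
    const answer = answers.get(c.query);
    tally[c.ability].total += 1;
    // A case with no recorded answer is treated as failed.
    if (answer !== undefined && c.pass(answer)) tally[c.ability].passed += 1;
  }
  return tally;
}
```

A per-ability breakdown keeps one weak ability from hiding behind a decent overall average.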
The three scores tell you which component broke. The four abilities tell you whether the system survives contact with a real knowledge base, which is never curated, never complete, and full of stale info nobody deleted.