RAG • Chapter 9
RAG engineering module on Evaluation Metrics & RAGAS.
You cannot improve what you cannot measure. Evaluating a RAG pipeline is harder than evaluating a single model because two components must be assessed: the retrieval system, which selects the context, and the generation system, which writes the answer from it.
Advanced System Mechanics
Standard ML metrics such as Precision and Recall evaluate the retrieval phase: of the chunks retrieved, how many are relevant, and of the relevant chunks, how many were retrieved. Evaluating the LLM's generated text calls for a different approach, such as the RAGAS framework (Retrieval Augmented Generation Assessment). RAGAS uses LLMs as judges to score metrics such as Faithfulness (is every claim in the answer grounded in the retrieved context?), Answer Relevance (does the answer actually address the query?), and Context Precision (were the relevant chunks ranked highest?).
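To make the retrieval-side metrics concrete, here is a minimal sketch of Precision@k, Recall@k, and a rank-aware Context Precision score. The function names, chunk IDs, and relevance labels are illustrative assumptions, not the RAGAS API. The Context Precision function assumes the average-precision style definition (mean of precision@k over the ranks that hold a relevant chunk). In RAGAS the relevance flags would come from an LLM judge; here they are supplied by hand.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunk IDs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def context_precision(relevance_flags: List[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


# Hypothetical example: four retrieved chunks, three relevant chunks overall.
retrieved = ["c3", "c7", "c1", "c9"]
relevant = {"c3", "c1", "c5"}
flags = [doc_id in relevant for doc_id in retrieved]

print(precision_at_k(retrieved, relevant, k=4))  # 2 of 4 retrieved are relevant
print(recall_at_k(retrieved, relevant, k=4))     # 2 of 3 relevant were retrieved
print(context_precision(flags))                  # (1/1 + 2/3) / 2
```

Precision and Recall ignore ordering within the top k, while the Context Precision score drops when relevant chunks appear lower in the ranking.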
Implementation Blueprint
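# Pseudocode for the RAGAS faithfulness metric.
# evaluator_llm is a hypothetical judge wrapper exposing extract_claims() and verify().
def calculate_faithfulness(question, context, generated_answer, evaluator_llm):
    # Prompt the evaluator to break the generated answer into individual claims
    claims = evaluator_llm.extract_claims(generated_answer)
    # An answer with no claims would divide by zero below, so score it 0.0
    if not claims:
        return 0.0
    # Prompt the evaluator to check whether the context supports each claim
    supported = 0
    for claim in claims:
        if evaluator_llm.verify(claim, context):
            supported += 1
    # Faithfulness = supported claims / total claims
    return supported / len(claims)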
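The faithfulness pseudocode above can be exercised without an LLM by swapping in a toy judge. The sketch below is an assumption-laden stand-in: `KeywordEvaluator` is a hypothetical class that splits the answer into sentences and checks each one by plain substring matching, which is far cruder than a real LLM judge but shows how the score is computed.

```python
from typing import List


class KeywordEvaluator:
    """Hypothetical stand-in for an LLM judge; uses string matching only."""

    def extract_claims(self, answer: str) -> List[str]:
        # Treat each sentence as one claim (a real judge would decompose further).
        return [s.strip() for s in answer.split(".") if s.strip()]

    def verify(self, claim: str, context: str) -> bool:
        # A claim counts as supported only if it appears verbatim in the context.
        return claim.lower() in context.lower()


def calculate_faithfulness(question, context, generated_answer, evaluator_llm) -> float:
    claims = evaluator_llm.extract_claims(generated_answer)
    if not claims:
        return 0.0  # assumption: an answer with no claims scores 0.0
    supported = sum(1 for claim in claims if evaluator_llm.verify(claim, context))
    return supported / len(claims)


question = "What is the capital of France?"
context = "Paris is the capital of France. The Seine flows through Paris."
answer = "Paris is the capital of France. Paris has ten million residents."

score = calculate_faithfulness(question, context, answer, KeywordEvaluator())
print(score)  # 0.5: one of the two claims is found in the context
```

The second sentence of the answer is not present in the context, so only half of the claims count as grounded.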