RAG • Chapter 9

Evaluation Metrics & RAGAS

RAG engineering module on Evaluation Metrics & RAGAS.


🎯 Exam Focus Areas

Evaluate chunking and embedding strategies.
Understand Vector DB indexing architectures like HNSW.
Analyze RAG prompts for injection vulnerabilities.
Calculate and utilize RAGAS evaluation metrics.

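You cannot improve what you cannot measure. Evaluating a RAG pipeline is harder than evaluating a single model because two systems must be assessed: the Retrieval system (did it fetch the right context?) and the Generation system (did the LLM use that context correctly?). A failure in either one degrades the final answer.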

Advanced System Mechanics

Standard ML metrics like Precision and Recall evaluate the retrieval phase: of the chunks retrieved, how many were relevant, and of the relevant chunks, how many were retrieved. Evaluating the LLM's generated text, however, requires frameworks like RAGAS (Retrieval Augmented Generation Assessment). RAGAS uses LLMs as judges to score metrics such as Faithfulness (is the answer grounded in the context?), Answer Relevance (does it actually answer the query?), and Context Precision (was the useful context ranked highest?).
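The retrieval-side metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the RAGAS implementation: the function names are hypothetical, relevance labels are assumed to be given (in RAGAS a judge LLM would supply them), and context_precision uses one common rank-weighted formulation (the mean of precision@k over the ranks that hold a relevant chunk).

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(
    retrieved: Sequence[str], relevant: Set[str], k: int
) -> Tuple[float, float]:
    """Precision@k and Recall@k for a ranked list of retrieved chunk IDs."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def context_precision(relevance_flags: Sequence[bool]) -> float:
    """Rank-weighted precision: rewards relevant chunks that appear early.

    relevance_flags[i] is True if the chunk at rank i+1 is relevant.
    (Assumed formulation; labels would come from a judge LLM in practice.)
    """
    total_relevant = sum(relevance_flags)
    if total_relevant == 0:
        return 0.0
    hits = 0
    score = 0.0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            score += hits / rank  # precision@rank, counted only at relevant ranks
    return score / total_relevant


# Same two relevant chunks, different ranking
print(context_precision([True, False, True, False]))  # relevant chunks ranked early
print(context_precision([False, True, False, True]))  # relevant chunks ranked late
```

With the same number of relevant chunks, the first ranking scores higher than the second, which is the behaviour Context Precision is meant to capture.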

1. Understand the vector space implications of this concept.
2. Identify potential hallucination risks.
3. Optimize for low latency and high relevance.
4. Ensure robust system prompts.

Implementation Blueprint

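# Sketch of the RAGAS-style Faithfulness metric.
# evaluator_llm is a hypothetical wrapper around a judge LLM that exposes
# extract_claims(text) -> list[str] and verify(claim, context) -> bool.
def calculate_faithfulness(question, context, generated_answer, evaluator_llm):
    # question is kept in the signature but unused in this simplified version
    # Prompt the evaluator to split the generated answer into atomic claims
    claims = evaluator_llm.extract_claims(generated_answer)

    # No claims means nothing to verify; return 0.0 (a convention chosen here)
    # instead of dividing by zero
    if not claims:
        return 0.0

    # Count the claims the evaluator judges to be supported by the context
    supported = 0
    for claim in claims:
        if evaluator_llm.verify(claim, context):
            supported += 1

    # Faithfulness = supported claims / total claims, a value in [0, 1]
    return supported / len(claims)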
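To see the arithmetic of a faithfulness score on a concrete input, here is a self-contained toy run. KeywordEvaluator is a hypothetical stand-in for the judge LLM: it splits the answer on periods and treats a claim as supported only if it appears verbatim in the context. A real evaluator would prompt an LLM for both steps, so this is only a sketch of the scoring logic.

```python
from typing import List


class KeywordEvaluator:
    """Hypothetical stand-in for a judge LLM (naive string matching only)."""

    def extract_claims(self, answer: str) -> List[str]:
        # Treat each sentence as one claim
        return [s.strip() for s in answer.split(".") if s.strip()]

    def verify(self, claim: str, context: str) -> bool:
        # A claim counts as supported only if it appears verbatim in the context
        return claim.lower() in context.lower()


def faithfulness(context: str, answer: str, evaluator: KeywordEvaluator) -> float:
    claims = evaluator.extract_claims(answer)
    if not claims:
        return 0.0  # nothing to verify; avoids division by zero
    supported = sum(1 for claim in claims if evaluator.verify(claim, context))
    return supported / len(claims)


context = "Paris is the capital of France. The Eiffel Tower opened in 1889."
answer = "Paris is the capital of France. The Eiffel Tower opened in 1925."

score = faithfulness(context, answer, KeywordEvaluator())
print(score)  # 0.5: one of the two claims is supported by the context
```

The second sentence contradicts the context, so only one of two claims is supported and the score is 0.5. An answer with no extractable claims scores 0.0 under the convention assumed here.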

📝 Quick Revision Points

1. Review the differences between similarity metrics.
2. Practice the LangChain/LlamaIndex code snippets.
3. Understand the HyDE architecture deeply.
4. Memorize the security guardrail implementations.
