RAG • Chapter 4

Document Chunking Strategies

RAG engineering module on Document Chunking Strategies.


🎯 Exam Focus Areas

- Evaluate chunking and embedding strategies.
- Understand Vector DB indexing architectures like HNSW.
- Analyze RAG prompts for injection vulnerabilities.
- Calculate and utilize RAGAS evaluation metrics.

Before text can be embedded and stored, large documents must be split into smaller, manageable pieces called 'chunks'. If chunks are too large, each embedding has to represent several topics at once and becomes diluted; if they are too small, a chunk loses the surrounding context needed to interpret it.

Advanced System Mechanics

Common strategies include Fixed-size chunking (e.g., 500 tokens with a 50-token overlap), Recursive Character chunking (splitting by paragraphs first, then by progressively finer units such as sentences and words), and Semantic chunking (using NLP to split at logical topic boundaries). Overlap repeats the tail of one chunk at the head of the next, so a sentence cut at a chunk boundary is more likely to appear intact in one of the two neighboring chunks.
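The fixed-size strategy can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the function name `fixed_size_chunks` is hypothetical, and whitespace-separated words stand in for tokens (a real pipeline would count tokens with the embedding model's own tokenizer).

```python
def fixed_size_chunks(tokens, chunk_size=500, overlap=50):
    """Split a list of tokens into windows of at most chunk_size tokens.

    Consecutive windows share `overlap` tokens. Hypothetical helper for
    illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Assumption: whitespace-separated words approximate tokens.
tokens = "the quick brown fox jumps over the lazy dog again and again".split()
for i, chunk in enumerate(fixed_size_chunks(tokens, chunk_size=5, overlap=2)):
    print(f"Chunk {i}: {' '.join(chunk)}")
```

With `chunk_size=5` and `overlap=2` the window advances three tokens at a time, so the last two tokens of each chunk reappear at the start of the next one.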

1. Understand the vector space implications of this concept.
2. Identify potential hallucination risks.
3. Optimize for low latency and high relevance.
4. Ensure robust system prompts.

Implementation Blueprint

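# Newer LangChain releases ship this class in the separate langchain_text_splitters package;
# if the import below fails, try: from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "Your massive document text goes here... It has multiple paragraphs."

# Recursive chunking tries each separator in order: paragraph breaks (\n\n),
# then line breaks (\n), then spaces (words), then individual characters.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,     # maximum chunk length, measured in characters by default
    chunk_overlap=20,   # maximum number of characters shared by neighboring chunks
    separators=["\n\n", "\n", " ", ""]
)

chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk}")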
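To make the "coarse separator first, finer separator only when needed" idea concrete, here is a dependency-free sketch. It is an illustration under simplifying assumptions, not LangChain's actual implementation: the function name `recursive_split` is hypothetical, lengths are counted in characters, and overlap is omitted to keep the fallback logic visible.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Tries the first separator; any piece that is still too long is split
    again with the remaining, finer separators. Hypothetical helper, no overlap.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    sep = separators[0]
    finer = separators[1:] or ("",)
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The piece alone is too long: fall back to the finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Alpha beta gamma.\n\nDelta epsilon zeta eta theta iota kappa."
for i, chunk in enumerate(recursive_split(sample, chunk_size=20)):
    print(f"Chunk {i}: {chunk!r}")
```

In the sample, the first paragraph fits in one chunk and stays whole, while the second paragraph exceeds the limit and is broken at word boundaries instead.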

📝 Quick Revision Points

1. Review the differences between similarity metrics.
2. Practice the LangChain/LlamaIndex code snippets.
3. Understand the HyDE architecture deeply.
4. Memorize the security guardrail implementations.
