RAG • Chapter 4
RAG engineering module on Document Chunking Strategies.
Before text can be embedded and stored, large documents must be split into smaller, manageable pieces called "chunks". If a chunk is too large, its single embedding has to represent several topics at once and becomes diluted; if it is too small, it loses the surrounding context needed to interpret it.
Advanced System Mechanics
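Common strategies include Fixed-size chunking (e.g., 500 tokens with a 50-token overlap), Recursive Character chunking (splitting by paragraphs first, then by progressively finer separators such as sentences and words), and Semantic chunking (splitting at logical topic boundaries, typically detected with NLP techniques such as comparing the embeddings of adjacent sentences). Overlap repeats the end of one chunk at the start of the next, so a sentence that falls on a chunk boundary is less likely to be cut off from the context it needs.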
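As a rough illustration of the fixed-size strategy, the sketch below slides a window over a list of tokens and advances by chunk size minus overlap. The function name `fixed_size_chunks` is hypothetical, and whitespace-separated words stand in for real tokenizer output, which is an assumption made only to keep the example self-contained.

```python
def fixed_size_chunks(tokens, chunk_size=500, overlap=50):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: words approximate tokens; a real pipeline would use the
# tokenizer that belongs to its embedding model.
words = "Your massive document text goes here".split()
for i, chunk in enumerate(fixed_size_chunks(words, chunk_size=4, overlap=1)):
    print(f"Chunk {i}: {' '.join(chunk)}")
```

The last `overlap` tokens of each chunk reappear as the first tokens of the next one, which is what keeps boundary sentences connected to their context.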
Implementation Blueprint
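# Note: recent LangChain releases also expose this class from the
# separate langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "Your massive document text goes here... It has multiple paragraphs."

# Recursive chunking tries the separators in order: paragraphs ("\n\n"),
# then lines ("\n"), then words (" "), and finally single characters ("").
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,     # maximum chunk length, measured in characters by default
    chunk_overlap=20,   # characters shared between neighbouring chunks
    separators=["\n\n", "\n", " ", ""],
)

chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk}")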
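To make the "coarse separators first" idea more concrete, here is a simplified pure-Python sketch of recursive splitting. It is not LangChain's actual implementation: the function `recursive_split` is hypothetical, it ignores overlap, and it drops empty pieces produced by repeated separators. It only shows the general pattern of splitting on the coarsest separator, recursing into pieces that are still too long, and then merging neighbours back together while they fit.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Return pieces of `text`, each at most `chunk_size` characters long."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # No usable separator left: fall back to a hard cut every chunk_size chars
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]

    # Split on the current separator; recurse into anything still too long
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, finer))

    # Greedily merge neighbouring pieces while the result still fits
    merged, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged


sample = "Para one is short.\n\nPara two is a bit longer than the limit allows."
for i, chunk in enumerate(recursive_split(sample, chunk_size=30)):
    print(f"Chunk {i}: {chunk!r}")
```

In this sketch the short first paragraph survives intact, while the longer second paragraph is broken at word boundaries rather than mid-word, which is the behaviour the separator ordering is meant to produce.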