Loading notes...
Loading notes...
RAG • Chapter 6
RAG engineering module on Integrating LLMs (OpenAI, HuggingFace).
Once relevant context is retrieved, it must be passed to a Large Language Model. You can utilize proprietary APIs like OpenAI or deploy open-weight models via HuggingFace.
Advanced System Mechanics
Frameworks like LangChain and LlamaIndex provide standardized interfaces to swap LLMs seamlessly. When using an API, the prompt, context, and user query are combined into a system and user message format. When running locally with HuggingFace, one must consider GPU VRAM limits, often utilizing quantization (like 4-bit AWQ or GGUF) to fit models like Llama-3 into memory.
Implementation Blueprint
import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
# Ensure OPENAI_API_KEY is in environment
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant. Use the following context to answer: {context}"),
("user", "{question}")
])
chain = prompt | llm
response = chain.invoke({"context": "RAG improves accuracy.", "question": "What does RAG do?"})
print(response.content)