RAG Architecture
RAG Pipeline Components
A complete RAG system has these parts:
┌──────────────────────────────────────────────────┐
│ │
│ 1. DOCUMENT PROCESSING │
│ • Load documents │
│ • Split into chunks │
│ • Generate embeddings │
│ • Store in vector DB │
│ │
└──────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ │
│ 2. QUERY PROCESSING │
│ • User asks question │
│ • Convert to embedding │
│ • Search vector DB │
│ • Retrieve top-k similar documents │
│ │
└──────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ │
│ 3. PROMPT AUGMENTATION │
│ • Format retrieved docs as context │
│ • Construct prompt with context + question │
│ • Send to LLM │
│ │
└──────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ │
│ 4. RESPONSE GENERATION │
│ • LLM reads context │
│ • Generates grounded response │
│ • (Optional) Cite sources │
│ │
└──────────────────────────────────────────────────┘
1. Document Processing (One-Time Setup)
Chunking Strategy
Split documents into manageable pieces:
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks
# Example
document = "Very long document text here..."
chunks = chunk_text(document, chunk_size=200, overlap=20)
Why chunk?:
- Embeddings work best on paragraphs, not entire books
- Retrieves precise, relevant sections
- Fits within context windows
Chunk Size Guidelines:
- Too small (< 100 words): Loses context
- Too large (> 1000 words): Less precise retrieval
- Sweet spot: 200-500 words with 10-20% overlap
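If you want chunks that respect document structure rather than fixed word counts, one variant (a sketch, not the only option) is to pack whole paragraphs up to a target size; oversized paragraphs still become their own chunk, and you could fall back to chunk_text for those:
def chunk_by_paragraphs(text, max_words=400):
    """Pack whole paragraphs into chunks of up to max_words each."""
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append('\n\n'.join(current))   # close the current chunk
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append('\n\n'.join(current))
    return chunks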
Embedding & Storage
import chromadb
import ollama
# Initialize
client = chromadb.PersistentClient(path="./rag_db")
collection = client.get_or_create_collection(name="documents")
# Add chunks
for i, chunk in enumerate(chunks):
    collection.add(
        documents=[chunk],
        ids=[f"chunk_{i}"],
        metadatas=[{"source": "document.pdf", "page": i // 10}]
    )
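Adding one chunk per call works, but collection.add also accepts parallel lists, so a single batched call is usually faster for large documents:
# Add every chunk in one batched call (parallel lists of documents, ids, metadatas)
collection.add(
    documents=chunks,
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    metadatas=[{"source": "document.pdf", "page": i // 10} for i in range(len(chunks))]
)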
2. Retrieval Process
Query Embedding
def retrieve_context(query, n_results=3):
    """Retrieve relevant chunks for query."""
    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )
    return results['documents'][0]  # Top results
Retrieval Strategies
1. Top-K Retrieval (Simplest):
# Get top 3 most similar chunks
chunks = retrieve_context(query, n_results=3)
2. Threshold-Based:
# Only use chunks above similarity threshold
results = collection.query(query_texts=[query], n_results=10)
filtered = [
    doc for doc, dist in zip(results['documents'][0], results['distances'][0])
    if dist < 0.5  # Lower distance = more similar
]
3. MMR (Maximal Marginal Relevance):
# Retrieve diverse results (avoid redundancy)
# ChromaDB doesn't support this natively, but you can implement:
def mmr_rerank(query_emb, doc_embs, lambda_param=0.5):
    """Re-rank to balance relevance and diversity."""
    # Implementation left as exercise
    pass
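One way the exercise could be filled in, as a rough sketch: this version adds a top_k parameter and assumes query_emb and doc_embs are unit-normalized NumPy arrays, so dot products act as cosine similarities.
import numpy as np

def mmr_rerank(query_emb, doc_embs, lambda_param=0.5, top_k=3):
    """Greedy MMR: favor docs relevant to the query but dissimilar to already-picked docs."""
    query_emb = np.asarray(query_emb)
    doc_embs = np.asarray(doc_embs)
    sims = doc_embs @ query_emb                       # relevance of each doc to the query
    selected, candidates = [], list(range(len(doc_embs)))
    while candidates and len(selected) < top_k:
        def mmr_score(i):
            redundancy = max((float(doc_embs[i] @ doc_embs[j]) for j in selected), default=0.0)
            return lambda_param * float(sims[i]) - (1 - lambda_param) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected                                   # indices into doc_embs, best first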
3. Prompt Construction
Basic Template
from jinja2 import Template
rag_template = Template("""
You are a helpful assistant. Answer the question based on the context provided.
Context:
{% for chunk in context_chunks %}
{{ chunk }}
{% endfor %}
Question: {{ question }}
Answer based ONLY on the context above. If the answer is not in the context, say "I don't have enough information to answer that."
Answer:
""")
def create_rag_prompt(question, context_chunks):
    return rag_template.render(
        question=question,
        context_chunks=context_chunks
    )
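For example, feeding retrieved chunks straight into the template (reusing the retrieve_context helper from above):
question = "What are the library's hours?"
prompt = create_rag_prompt(question, retrieve_context(question))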
Advanced Template (with Citations)
citation_template = Template("""
Answer the question using the provided context. Cite your sources using [1], [2], etc.
Context:
{% for chunk in context_chunks %}
[{{ loop.index }}] {{ chunk }}
{% endfor %}
Question: {{ question }}
Provide a detailed answer with citations.
Answer:
""")
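A render helper mirroring create_rag_prompt (create_citation_prompt is a name introduced here for illustration):
def create_citation_prompt(question, context_chunks):
    return citation_template.render(
        question=question,
        context_chunks=context_chunks
    )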
""")4. Generation
def rag_query(question, n_results=3):
    """Complete RAG pipeline."""
    # 1. Retrieve context
    context = retrieve_context(question, n_results)
    # 2. Create prompt
    prompt = create_rag_prompt(question, context)
    # 3. Generate response
    response = ollama.generate(
        model='llama3.2',
        prompt=prompt,
        options={'temperature': 0.3}  # Lower for factual
    )
    return {
        'answer': response['response'],
        'context': context,
        'sources': [...]  # Optional: track sources
    }
# Usage
result = rag_query("What are the library's hours?")
print(result['answer'])
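One possible way to populate the sources field, sketched here: ask the collection to return metadatas alongside documents and pass both through (retrieve_with_sources is a hypothetical helper, not part of the pipeline above).
def retrieve_with_sources(query, n_results=3):
    """Retrieve chunks plus their stored metadata, for source tracking."""
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        include=['documents', 'metadatas']
    )
    return results['documents'][0], results['metadatas'][0]

# Inside rag_query you could then do:
# context, sources = retrieve_with_sources(question, n_results)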
RAG Enhancements
1. Re-ranking
Improve retrieval quality:
def rerank_results(query, initial_results, model='llama3.2'):
    """Re-rank results using LLM."""
    reranked = []
    for doc in initial_results:
        prompt = f"""
Rate how relevant this document is to the query (0-10):
Query: {query}
Document: {doc}
Relevance score (just the number):
"""
        response = ollama.generate(model=model, prompt=prompt)
        try:
            score = float(response['response'].strip().split()[0])
        except (ValueError, IndexError):
            score = 0.0  # Model returned something non-numeric
        reranked.append((doc, score))
    reranked.sort(key=lambda x: x[1], reverse=True)
    return [doc for doc, score in reranked]
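A typical pattern is to over-retrieve and then keep only the best few after re-ranking, for example:
query = "What are the library's hours?"
candidates = retrieve_context(query, n_results=10)   # retrieve broadly first
top_docs = rerank_results(query, candidates)[:3]     # keep the best 3 after re-ranking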
2. Hybrid Search
Combine semantic + keyword:
def hybrid_search(query, semantic_weight=0.7):
    """Combine semantic and keyword search."""
    # Semantic results
    semantic = collection.query(query_texts=[query], n_results=10)
    # Keyword results (simplified - use BM25 in practice)
    keyword = keyword_search(query, n_results=10)
    # Merge with weights
    combined = merge_results(
        semantic,
        keyword,
        semantic_weight=semantic_weight
    )
    return combined
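keyword_search and merge_results above are placeholders. One possible merge_results, sketched under the assumption that both inputs have already been reduced to plain ranked lists of chunk texts (e.g. semantic['documents'][0]), is a weighted reciprocal-rank fusion:
def merge_results(semantic_docs, keyword_docs, semantic_weight=0.7, k=60):
    """Weighted reciprocal-rank fusion over two ranked lists of chunk texts."""
    scores = {}
    for rank, doc in enumerate(semantic_docs):
        scores[doc] = scores.get(doc, 0.0) + semantic_weight / (k + rank + 1)
    for rank, doc in enumerate(keyword_docs):
        scores[doc] = scores.get(doc, 0.0) + (1 - semantic_weight) / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)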
3. Query Expansion
Improve retrieval with multiple queries:
def expand_query(original_query):
    """Generate alternative phrasings."""
    prompt = f"""
Generate 3 alternative ways to ask this question:
{original_query}
Return just the 3 alternatives, one per line.
"""
    response = ollama.generate(model='llama3.2', prompt=prompt)
    alternatives = response['response'].strip().split('\n')
    return [original_query] + alternatives
# Use all variants for retrieval
queries = expand_query("library hours")
all_results = []
for q in queries:
    all_results.extend(retrieve_context(q, n_results=2))
# Deduplicate and use top results
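Because the same chunk can come back for several query variants, a simple order-preserving deduplication works, for example:
unique_results = list(dict.fromkeys(all_results))  # drop duplicates, keep first-seen order
top_results = unique_results[:3]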
Performance Considerations
Latency Breakdown
Typical RAG query:
- Embedding generation: 100-200ms
- Vector search: 10-50ms
- LLM generation: 2-10 seconds (depends on length)
Total: ~2-10 seconds, dominated by LLM generation
Optimization Strategies
- Cache Embeddings: Don’t re-embed the same queries (see the caching sketch after this list)
- Limit Retrieved Chunks: 3-5 is often enough
- Batch Processing: Process multiple queries together
- Use Smaller Models: Consider llama3.2:1b for speed
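A minimal sketch of that caching idea, memoizing the retrieval step (cached_retrieve is a name introduced here, wrapping the retrieve_context helper from above):
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_retrieve(query, n_results=3):
    """Memoize retrieval so a repeated query skips embedding and vector search."""
    return tuple(retrieve_context(query, n_results))  # tuple keeps the cached value immutable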
Next: Implementation
Ready to build a complete RAG system?