Here's the problem with ChatGPT: it knows EVERYTHING about the internet up to 2023, but NOTHING about your company's internal documentation, your product specifications, or that critical incident report from last week.

You could fine-tune a model (expensive, slow, needs ML expertise), or you could use RAG - Retrieval Augmented Generation. It's like giving GPT-4 a library card to your private knowledge base.

By 2025, RAG has gone from "experimental PDF chatbot" to the backbone of enterprise AI. Let me show you how to build it properly.

If you're new to AI systems, start with understanding generative AI fundamentals before diving into RAG architecture.

What is RAG? (Explain It Like I'm Five)

Normal GPT-4: A genius who memorized Wikipedia but has never heard of your company.

RAG: That same genius, but now you can hand them your company handbook and say "read this first, THEN answer the question."

The Process:

  1. User asks: "What's our remote work policy?"
  2. System searches your documents for relevant sections
  3. System gives GPT those sections as context
  4. GPT generates an answer based on YOUR data, not generic internet knowledge

Result? AI that knows your stuff without you spending $100k on fine-tuning.

The Two Pipelines

Pipeline 1: Ingestion (Preparing Your Knowledge)

This happens ONCE (or whenever you update your docs):

  1. Load documents (PDFs, Notion pages, Google Docs)
  2. Chunk them into pieces (can't feed a 200-page PDF in one go)
  3. Embed chunks into vectors (convert text to numbers that capture meaning)
  4. Store vectors in a database

Pipeline 2: Retrieval (Answering Questions)

This happens EVERY TIME a user asks something:

  1. Convert user's question to a vector
  2. Search vector database for similar chunks
  3. Rank results (best matches first)
  4. Feed top chunks to GPT as context
  5. Generate answer

Now let's build this.

Setup: The Stack

pip install langchain langchain-community langchain-openai \
    langchain-chroma chromadb pypdf tiktoken openai

What we're using:

  • LangChain - Framework for building AI apps
  • ChromaDB - Local vector database (great for development)
  • OpenAI - Embeddings and LLM
  • PyPDF - PDF loading

Set your API key:

import os
os.environ["OPENAI_API_KEY"] = "sk-your-key-here"

Step 1: Document Loading and Chunking

Chunking is THE critical decision. Too small (50 words)? You lose context. Too large (5000 words)? You get irrelevant noise.

The Sweet Spot (2025): 1000 characters with 200 character overlap.

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load PDF
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()

print(f"Loaded {len(documents)} pages")

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Characters per chunk
    chunk_overlap=200,      # Overlap to maintain context between chunks
    add_start_index=True    # Track where chunks came from
)

chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")

Why overlap? Imagine a chunk ends mid-sentence with "remote work requires..." and the next chunk starts with "...manager approval." Without overlap, neither chunk carries the complete rule; with overlap, at least one chunk contains the full sentence.
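
You can see the overlap for yourself with a toy example. The snippet below uses a deliberately tiny chunk size so the effect is visible; the exact chunk boundaries depend on the splitter's separators, but neighbouring chunks will repeat some text:

from langchain_text_splitters import RecursiveCharacterTextSplitter

demo_splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=20)
text = ("Remote work requires manager approval. "
        "Approved employees may work from home up to three days per week.")

# Neighbouring chunks repeat some words, so nothing is lost at a boundary
for piece in demo_splitter.split_text(text):
    print(repr(piece))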

Advanced: Semantic Chunking (respects document structure):

from langchain_text_splitters import MarkdownHeaderTextSplitter

# For Markdown docs, split by headers
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

This keeps related content together instead of arbitrary character limits.
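
To see it in action, feed the markdown_splitter defined above a small Markdown string (the policy text here is just illustrative). Each section comes back as a document whose metadata records the headers it sits under:

md_text = """# Policies

## Remote Work
Remote work requires manager approval.

## Vacation
Full-time employees accrue 20 days per year.
"""

# Each section becomes a Document; metadata records its headers
for section in markdown_splitter.split_text(md_text):
    print(section.metadata, "->", section.page_content[:50])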

Step 2: Embeddings and Vector Storage

Embeddings convert text to vectors (arrays of numbers). Similar meanings = similar vectors.

Example:

  • "What is RAG?" → [0.2, 0.8, 0.1, ...]
  • "Explain Retrieval Augmented Generation" → [0.21, 0.79, 0.11, ...]
  • "How to make pizza" → [0.9, 0.1, 0.5, ...]

First two are close (similar meaning). Third is far (different topic).
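
You can check this yourself by embedding a few strings and comparing them with cosine similarity. A small sketch (the numbers above are illustrative; your actual values will differ, but the ordering holds):

import numpy as np
from langchain_openai import OpenAIEmbeddings

emb = OpenAIEmbeddings(model="text-embedding-3-small")

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rag_q = emb.embed_query("What is RAG?")
rag_alt = emb.embed_query("Explain Retrieval Augmented Generation")
pizza = emb.embed_query("How to make pizza")

print(cosine(rag_q, rag_alt))  # high: similar meaning
print(cosine(rag_q, pizza))    # noticeably lower: different topic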

2025 Embedding Models:

Model                          Cost   Quality  Speed   Best For
OpenAI text-embedding-3-small  $$     Good     Fast    Most use cases
OpenAI text-embedding-3-large  $$$    Better   Medium  High accuracy needs
BGE-M3 (open-source)           Free   Good     Fast    Self-hosted


from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Create embeddings
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)

# Create vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"  # Save to disk
)

print("Vector store created!")

If your document corpus lives in object storage such as AWS S3, the ingestion pipeline can pull the files from there before chunking and embedding them.

What just happened?

  1. Each chunk was converted to a vector
  2. All vectors stored in ChromaDB
  3. Database saved to disk (so you don't rebuild every time)
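
Because the store is persisted, later sessions can reopen it without re-embedding anything. A minimal sketch, assuming the same persist_directory and embedding model as above:

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Reopen the existing index from disk instead of rebuilding it
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small")
)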

Step 3: Basic Retrieval

Now we can search semantically:

# Create a retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}  # Return top 5 matches
)

# Search
query = "What is the remote work policy?"
relevant_docs = retriever.invoke(query)

for i, doc in enumerate(relevant_docs):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content[:200])  # First 200 chars

This finds the 5 most relevant chunks. But we're not done yet.
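
If you want to see how strong each match actually is, the vector store can return scores alongside the documents. For Chroma the score is a distance, so lower means more similar:

# Inspect retrieval quality: documents plus their distance scores
for doc, score in vectorstore.similarity_search_with_score(query, k=5):
    print(f"{score:.3f}  {doc.page_content[:80]}")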

You can also combine RAG with LangChain agents that expose the retriever as a tool, letting a ReAct-style agent decide when to look things up.

Step 4: Building the RAG Chain

Now we connect retrieval to GPT:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Initialize LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Create prompt template
template = """You are an assistant answering questions about company policies.
Use ONLY the information from the context below. If the answer isn't in the context, say "I don't have that information."

Context:
{context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)

# Helper function to format documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Build the chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Use it!
response = rag_chain.invoke("What's the vacation policy?")
print(response)

Prompting patterns like few-shot and chain-of-thought can further improve how queries are rewritten and how answers are synthesized, and this same chain slots directly into production chatbots for knowledge-grounded conversations.

What's happening here?

  1. User question goes to retriever
  2. Retriever finds relevant chunks
  3. Chunks get formatted and inserted into prompt
  4. GPT generates answer based on retrieved context
  5. Answer returned to user

Step 5: Advanced Retrieval - Multi-Query

Users ask vague questions. "How do I work from home?" could match "remote work policy" or "home office equipment" or "VPN setup."

Solution: Generate multiple versions of the question, search for all, combine results.

from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm
)

# This automatically generates query variations and searches for all
results = multi_query_retriever.invoke("working remotely")

# Behind the scenes, it might generate:
# - "What is the remote work policy?"
# - "How do I work from home?"
# - "What are the guidelines for telecommuting?"

Step 6: Re-Ranking (The Secret Sauce)

Vector search isn't perfect. Sometimes chunk #7 is more relevant than chunk #2, but vector similarity missed it.

Solution: Retrieve 20 chunks, then use a specialized model to RE-RANK them, keep only top 3.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank

# Reranker (runs locally and is fast; requires `pip install flashrank`)
compressor = FlashrankRerank(top_n=3)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)

# This retrieves 20, reranks, returns best 3
compressed_docs = compression_retriever.invoke("remote work equipment")

Impact: Accuracy goes from 70% to 90%+ in my testing. Worth the extra complexity.

Step 7: Hybrid Search (Keyword + Semantic)

Semantic search is great for concepts, terrible for specific terms.

Example:

  • "Tell me about employee benefits" → semantic search handles this well
  • "What's the SKU for product ABC-123?" → keyword search handles this well

Solution: Combine both.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword search (BM25); requires `pip install rank_bm25`
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Vector search
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine both
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5]  # Equal weight (adjust based on testing)
)

results = ensemble_retriever.invoke("product SKU ABC-123")

Step 8: Metadata Filtering (Speed Up Search)

Searching through 100,000 chunks is slow. If you know the category, filter first.

# When creating chunks, add metadata
from langchain_core.documents import Document

doc_with_metadata = Document(
    page_content="Remote work requires manager approval...",
    metadata={
        "source": "employee_handbook.pdf",
        "category": "remote_work",
        "date": "2025-01-15",
        "department": "hr"
    }
)

# Later, filter during retrieval
filtered_retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {"category": "remote_work"}  # Only search this category
    }
)

Use case: User is asking about HR policies? Filter to HR docs only. 10x faster.
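
One simple way to wire this up is to pick the filter from the question itself. A rough sketch with a naive keyword heuristic (the routing logic here is purely illustrative):

def retriever_for(question: str):
    # Naive routing: choose a metadata filter based on keywords in the question
    search_kwargs = {"k": 5}
    if "remote" in question.lower() or "work from home" in question.lower():
        search_kwargs["filter"] = {"category": "remote_work"}
    return vectorstore.as_retriever(search_kwargs=search_kwargs)

question = "Can I work from home on Fridays?"
docs = retriever_for(question).invoke(question)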

Step 9: Streaming Responses

Nobody wants to wait 8 seconds for an answer. Stream it.

# Modify the chain to support streaming
for chunk in rag_chain.stream("What's the vacation policy?"):
    print(chunk, end="", flush=True)

Users see text appearing in real-time. Feels instant.
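
If you're serving this from an async web framework, the same chain also streams asynchronously. A minimal sketch:

import asyncio

async def stream_answer(question: str):
    # astream yields tokens as they arrive, like stream() but awaitable
    async for chunk in rag_chain.astream(question):
        print(chunk, end="", flush=True)

asyncio.run(stream_answer("What's the vacation policy?"))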

Step 10: Evaluation (Are You Getting Good Results?)

You can't improve what you don't measure.

The RAGAS Framework:

# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,           # Did the AI make stuff up?
    answer_relevancy,       # Is the answer relevant to the question?
    context_precision,      # Are the retrieved chunks relevant?
)

# Build an evaluation set: the question, your system's answer, the
# retrieved contexts, and a reference ("ground truth") answer.
# Column names follow ragas 0.1.x; newer releases rename them
# (user_input / response / retrieved_contexts / reference).
question = "What's the remote work policy?"
eval_set = Dataset.from_dict({
    "question": [question],
    "answer": [rag_chain.invoke(question)],
    "contexts": [[d.page_content for d in retriever.invoke(question)]],
    "ground_truth": ["Employees can work remotely 3 days per week with manager approval"],
})

# Run evaluation
results = evaluate(
    dataset=eval_set,
    metrics=[faithfulness, answer_relevancy, context_precision]
)

print(results)

Good scores: >0.8 on all metrics. Bad scores: <0.6 means your chunking or retrieval needs work.

Production Considerations

1. Database Choice

Development:

  • ChromaDB (local, simple)

Production (small-medium scale):

  • Pinecone (managed, easy)
  • Weaviate (open-source, self-hosted)

Production (large scale):

  • Elasticsearch with vector search
  • pgvector (Postgres extension)

2. Caching

Don't re-retrieve for identical questions:

from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_retrieve(query: str):
    # lru_cache keys on the query string, so repeated identical
    # questions skip the vector search entirely
    return tuple(retriever.invoke(query))

docs = cached_retrieve("What's the vacation policy?")

3. Update Strategy

When documents change:

Option A: Incremental Update

# Add new chunks
new_chunks = load_and_chunk_new_doc()
vectorstore.add_documents(new_chunks)

Option B: Full Rebuild

# Weekly rebuild of entire index
vectorstore = Chroma.from_documents(all_chunks, embeddings)

4. Cost Optimization

Embeddings cost money:

  • text-embedding-3-small: $0.02 / 1M tokens
  • 1000 chunks × 200 tokens each = 200k tokens = $0.004

Not bad, but if you're re-embedding often, it adds up.

Solution: Only embed new/changed documents.
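
A lightweight way to do that is to fingerprint each chunk and only embed the ones you haven't seen before. A sketch, where the hash file name and bookkeeping scheme are just illustrative assumptions:

import hashlib, json, os

HASH_FILE = "embedded_hashes.json"  # illustrative bookkeeping file
seen = set(json.load(open(HASH_FILE))) if os.path.exists(HASH_FILE) else set()

# Keep only chunks whose content hasn't been embedded before
new_chunks = []
for chunk in chunks:
    digest = hashlib.sha256(chunk.page_content.encode()).hexdigest()
    if digest not in seen:
        new_chunks.append(chunk)
        seen.add(digest)

if new_chunks:
    vectorstore.add_documents(new_chunks)  # pay for embeddings only once per chunk

with open(HASH_FILE, "w") as f:
    json.dump(sorted(seen), f)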

The Complete Production-Ready System

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

class RAGSystem:
    def __init__(self, persist_directory="./chroma_db"):
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
        self.vectorstore = None
        self.persist_directory = persist_directory

    def ingest_documents(self, pdf_paths):
        """Load and index documents"""
        all_chunks = []

        for pdf_path in pdf_paths:
            loader = PyPDFLoader(pdf_path)
            docs = loader.load()

            splitter = RecursiveCharacterTextSplitter(
                chunk_size=1000,
                chunk_overlap=200
            )
            chunks = splitter.split_documents(docs)
            all_chunks.extend(chunks)

        self.vectorstore = Chroma.from_documents(
            documents=all_chunks,
            embedding=self.embeddings,
            persist_directory=self.persist_directory
        )

        print(f"Indexed {len(all_chunks)} chunks")

    def load_vectorstore(self):
        """Load existing vector store"""
        self.vectorstore = Chroma(
            persist_directory=self.persist_directory,
            embedding_function=self.embeddings
        )

    def query(self, question):
        """Query the RAG system"""
        if not self.vectorstore:
            raise ValueError("No vector store loaded. Call ingest_documents() or load_vectorstore() first")

        retriever = self.vectorstore.as_retriever(search_kwargs={"k": 5})

        template = """Answer based only on this context:

{context}

Question: {question}

Answer:"""

        prompt = ChatPromptTemplate.from_template(template)

        def format_docs(docs):
            return "\n\n".join(doc.page_content for doc in docs)

        chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt
            | self.llm
            | StrOutputParser()
        )

        return chain.invoke(question)

# Usage
rag = RAGSystem()

# One-time ingestion
rag.ingest_documents(["handbook.pdf", "policies.pdf"])

# Later sessions, just load
rag.load_vectorstore()

# Query
answer = rag.query("What's the vacation policy?")
print(answer)

Common Pitfalls (And How to Avoid Them)

1. Chunks too small: Model lacks context to answer properly. Fix: Increase chunk size to 1000-1500 characters.

2. No overlap: Context breaks mid-concept. Fix: Use 10-20% overlap.

3. Bad prompting: AI still hallucinates despite RAG. Fix: Add "Answer ONLY from context. If not found, say 'I don't know'" to prompt.

4. Retrieving too few chunks: Miss relevant information. Fix: Retrieve 10-20, then re-rank to top 3-5.

5. No evaluation: You don't know if it's working well. Fix: Use RAGAS or manual testing with known questions.

The Bottom Line

RAG is the most practical way to make AI an expert on YOUR data without fine-tuning. The 2025 stack is mature, the tools are production-ready, and the cost is reasonable.

Start simple:

  1. Load documents
  2. Chunk them
  3. Embed and store
  4. Build a basic RAG chain

Then add sophistication:

  • Multi-query retrieval
  • Re-ranking
  • Hybrid search
  • Metadata filtering

Test it. Measure it. Iterate.

Within a week, you can have an AI that knows your company handbook better than most employees. And unlike employees, it never forgets and doesn't need coffee breaks.

Now go make GPT-4 an expert on your stuff.