
LangChain RAG with Chroma Vector Store

What is RAG?

  • RAG (Retrieval-Augmented Generation) is a technique where an LLM answers a question using retrieved document context in addition to its built-in knowledge
  • It connects an LLM with an external knowledge source (here: product_details.txt) using retrieval
  • Instead of relying only on model memory, RAG fetches relevant content and injects it into the prompt as context
  • Core idea: Query → Retrieve relevant chunks → Add as context → Generate answer

Why Do We Need RAG?

  • LLMs do not automatically know your local/custom data like product_details.txt
  • Without retrieval, answers can be:
    • incomplete (missing product-specific details)
    • inaccurate (guessing)
    • inconsistent (hallucination risk)
  • RAG improves answers by ensuring responses are grounded in the most relevant text chunks
  • Efficient because it sends only top relevant chunks, not the entire document, into the prompt

How RAG Works in This Pipeline

  • Step 1: Load Knowledge Source
    • Read product_details.txt as documents so the pipeline can process the data
  • Step 2: Split into Chunks
    • Break large text into smaller pieces to improve retrieval accuracy
    • Use overlap to avoid losing meaning at chunk boundaries
  • Step 3: Create Embeddings
    • Convert each chunk into a vector representation (semantic meaning as numbers)
  • Step 4: Store in Vector Database (Chroma)
    • Store embeddings + original chunk text for similarity search
    • Persist locally so embeddings don’t need to be rebuilt each run
  • Step 5: Retrieve Relevant Context
    • For every query, retrieve top matching chunks (example: k=2 chunks)
  • Step 6: Prompt + Generate
    • Insert retrieved chunks into the prompt as {context}
    • Pass {question} as the user query
    • LLM generates the answer using provided context

Data Loading and Chunking

Using product_details.txt as Knowledge Base

  • Acts as the source of truth for answering product-related queries
  • The RAG pipeline uses this file as its external knowledge store

Why Chunking is Required

  • Retrieval works best when searching over smaller meaningful chunks
  • Chunking supports:
    • more accurate similarity matching
    • better control over prompt context size
    • improved answer grounding

Chunk Size and Chunk Overlap

  • Chunk size controls how much text is in one chunk
  • Chunk overlap repeats a small portion across chunks to preserve continuity
  • Proper chunking directly impacts retrieval quality
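
As a rough sketch of how chunk size and overlap interact, the snippet below uses the same 1000/100 settings as the pipeline's splitter; the sample text is made up for illustration and stands in for product_details.txt:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Same settings used later in the pipeline: 1000-character chunks, 100-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# Hypothetical sample text standing in for product_details.txt (roughly 5,800 characters)
sample_text = "Product: TravelPro Backpack. " * 200

chunks = splitter.split_text(sample_text)
print(f"{len(chunks)} chunks")
print("end of chunk 1  :", chunks[0][-60:])
print("start of chunk 2:", chunks[1][:60])  # note the repeated text carried over by the overlap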

Vector Store and Retrieval

Why Chroma is Used

  • Chroma stores embeddings and supports fast similarity search
  • Works as the pipeline’s vector database to retrieve top relevant content
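
A minimal, self-contained sketch of Chroma's similarity search, using a hypothetical two-document knowledge base in place of the real product_details.txt chunks:

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Hypothetical mini knowledge base standing in for product_details.txt chunks
docs = [
    Document(page_content="TravelPro Backpack: 40L, waterproof, ideal for hiking and vacations."),
    Document(page_content="OfficeMate Desk Lamp: LED, adjustable arm, USB charging port."),
]

db = Chroma.from_documents(docs, embedding=OpenAIEmbeddings())

# Lower distance scores mean closer semantic matches with Chroma's default metric
for doc, score in db.similarity_search_with_score("something for a hiking trip", k=1):
    print(round(score, 3), doc.page_content)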

Persisting the Vector Store

  • Persist directory stores the DB locally
  • Prevents re-computation of embeddings on every run
  • Makes retrieval faster after first build
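
On later runs, the persisted directory can be reopened directly instead of re-embedding everything; a sketch assuming the chroma_db directory was already created by the implementation below:

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Reopen the existing store from disk; no documents are re-embedded here
vectordb = Chroma(
    persist_directory="chroma_db",
    embedding_function=OpenAIEmbeddings(),  # must match the embedding model used at build time
)

retriever = vectordb.as_retriever(search_kwargs={"k": 2})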

Retriever with Top-K Results

  • Retriever selects the most relevant chunks for the query
  • k=2 means:
    • only the top 2 most relevant chunks are used as context
    • keeps context focused and reduces token usage
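
To see what the retriever actually returns for k=2, it can be called directly; this sketch assumes the retriever built in the implementation below and uses an illustrative query:

# Retrievers are runnables, so they can be invoked on a query string directly
top_chunks = retriever.invoke("suggest products for vacation")

print(len(top_chunks))            # at most 2, because k=2
for doc in top_chunks:
    print(doc.page_content[:100])  # preview of each retrieved chunk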

Prompting and Answer Generation

Prompt Role in RAG

  • Prompt defines how the LLM should use:
    • {context} (retrieved chunks)
    • {question} (user query)
  • Clear prompt ensures:
    • answers stay grounded in retrieved content
    • responses remain concise and structured
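
To make the roles of {context} and {question} concrete, here is a sketch of how the template is filled before it reaches the LLM; the context string and question are illustrative, hard-coded values (in the real chain the retriever supplies {context}):

from langchain_core.prompts import PromptTemplate

template = """You are a helpful AI assistant.
Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Answer in a concise manner.
"""

prompt = PromptTemplate.from_template(template)

# Hard-coded context for illustration only
filled = prompt.format(
    context="TravelPro Backpack: 40L, waterproof, ideal for hiking and vacations.",
    question="suggest products for vacation",
)
print(filled)  # the exact text the LLM receives as its prompt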

End-to-End Flow Summary

  • Load product_details.txt
  • Split into chunks (size + overlap)
  • Embed chunks
  • Store embeddings in Chroma
  • Retrieve top relevant chunks for a query
  • Inject retrieved context into prompt
  • LLM generates final answer

Code Implementation

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Loading data
loader = TextLoader("product_details.txt", encoding="utf8")
docs = loader.load()

# Split documents into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
splits = splitter.split_documents(docs)

# Embed chunks and store them in a persistent Chroma vector store
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings(),
    persist_directory="chroma_db"
)

# Retriever returning the top-2 most relevant chunks
retriever = vectordb.as_retriever(search_kwargs={"k": 2})

template = """You are a helpful AI assistant.
Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Answer in a concise manner.
"""

prompt = PromptTemplate.from_template(template)
llm = ChatOpenAI(model="gpt-4o")

# Join the retrieved Document objects into one context string for the prompt
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

response = chain.invoke("suggest products for vacation")
print(response)
