LlamaIndex
LlamaIndex is a data framework for building LLM applications that need to ingest, structure, and query private data. Where LangChain is a general-purpose orchestration framework, LlamaIndex is laser-focused on the data problem: how do you take your documents, databases, APIs, and knowledge bases and make them queryable by a language model?
If LangChain is a Swiss Army knife for LLM applications, LlamaIndex is a purpose-built power tool for data indexing and retrieval. It provides specialized index structures, advanced query engines, and a rich ecosystem of data connectors that make building RAG pipelines faster and more flexible.
Why LlamaIndex Exists
Every RAG application has the same fundamental challenge: turning unstructured data into something an LLM can reason over. This involves:
- Loading — Getting data from PDFs, databases, APIs, Notion, Confluence, Slack, code repos
- Parsing — Extracting clean text from complex formats (tables, images, nested structures)
- Indexing — Organizing data into structures optimized for different query patterns
- Querying — Translating natural language questions into retrieval operations
- Synthesizing — Combining retrieved information into coherent answers
LlamaIndex provides purpose-built abstractions for each of these stages.
Core Architecture
Documents and Nodes
The two fundamental data structures:
- Document — A container for a source document (a PDF page, a web page, a database row). Has text content and metadata.
- Node — A chunk of a document. Nodes are the atomic units that get indexed, embedded, and retrieved. Each node maintains a reference to its parent document and optional references to sibling nodes.
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
# Create documents
documents = [
Document(
text="LlamaIndex is a data framework for LLM applications...",
metadata={"source": "docs", "category": "framework"},
),
]
# Parse into nodes (chunks)
parser = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
nodes = parser.get_nodes_from_documents(documents)
# Each node has:
# - node.text (the chunk content)
# - node.metadata (inherited from document + node-specific)
# - node.relationships (parent doc, prev/next siblings)import { Document, SentenceSplitter } from "llamaindex";
const documents = [
new Document({
text: "LlamaIndex is a data framework...",
metadata: { source: "docs", category: "framework" },
}),
];
const splitter = new SentenceSplitter({ chunkSize: 1024, chunkOverlap: 200 });
const nodes = splitter.getNodesFromDocuments(documents);Index Types
LlamaIndex provides multiple index structures, each optimized for different query patterns. This is its key differentiator from other frameworks.
Vector Store Index
The most common index type. Embeds every node and stores them in a vector database for similarity search.
from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# Build index from documents
index = VectorStoreIndex.from_documents(documents)
# Query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("How does LlamaIndex handle chunking?")
print(response)Summary Index (formerly List Index)
Stores all nodes in a flat list. At query time, it iterates through every node (or a subset) and synthesizes a response. Best for summarization tasks where you need the model to consider all the data.
from llama_index.core import SummaryIndex
index = SummaryIndex.from_documents(documents)
query_engine = index.as_query_engine(
response_mode="tree_summarize"
)
response = query_engine.query("Provide a comprehensive summary of all documents")Tree Index
Builds a tree structure where leaf nodes are the original chunks and parent nodes are LLM-generated summaries. Queries traverse the tree from root to relevant leaves.
from llama_index.core import TreeIndex
index = TreeIndex.from_documents(documents, num_children=5)
query_engine = index.as_query_engine()
response = query_engine.query("What are the main themes across these documents?")Keyword Table Index
Extracts keywords from each node and builds a keyword-to-node mapping. Queries are matched against keywords to find relevant nodes. Useful when semantic similarity is not the right retrieval strategy.
from llama_index.core import KeywordTableIndex
index = KeywordTableIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("error handling retry backoff")Index Type Comparison
| Index Type | Retrieval Strategy | Best For | Latency | Cost |
|---|---|---|---|---|
| VectorStoreIndex | Embedding similarity | Semantic search, Q&A | Low | Embedding calls |
| SummaryIndex | Iterate all nodes | Summarization | High | Many LLM calls |
| TreeIndex | Tree traversal | Hierarchical documents | Medium | Summary generation |
| KeywordTableIndex | Keyword matching | Exact term lookup | Low | Keyword extraction |
Start with VectorStoreIndex
For 90% of use cases, VectorStoreIndex is the right choice. Use other index types when your retrieval requirements specifically call for them — summarization, hierarchical navigation, or keyword matching.
Query Engines
Query engines are the primary interface for asking questions of your indexed data. They handle retrieval, optional reranking, and response synthesis.
Basic Query Engine
# Simple query engine from an index
query_engine = index.as_query_engine(
similarity_top_k=5,
response_mode="compact", # compact, refine, tree_summarize, simple
)
response = query_engine.query("What is the architecture of the system?")
print(response.response) # The synthesized answer
print(response.source_nodes) # The retrieved chunks
print(response.metadata) # Query metadataResponse Modes
| Mode | How It Works | Best For |
|---|---|---|
compact | Stuff as many nodes as possible into one LLM call | Short answers, most queries |
refine | Iterate through nodes, refining the answer with each | Detailed answers from many sources |
tree_summarize | Recursively summarize groups of nodes | Long document summarization |
simple | Just concatenate node text (no LLM synthesis) | Debugging, inspection |
accumulate | Generate a response per node, then combine | Multi-perspective answers |
Sub-Question Query Engine
Decomposes complex questions into sub-questions, queries different data sources for each, then synthesizes a final answer.
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.question_gen.openai import OpenAIQuestionGenerator
from llama_index.core.query_engine import SubQuestionQueryEngine
# Create tools for different data sources
tools = [
QueryEngineTool(
query_engine=financial_index.as_query_engine(),
metadata=ToolMetadata(
name="financials",
description="Financial reports and earnings data",
),
),
QueryEngineTool(
query_engine=product_index.as_query_engine(),
metadata=ToolMetadata(
name="products",
description="Product documentation and specifications",
),
),
]
query_engine = SubQuestionQueryEngine.from_defaults(
query_engine_tools=tools,
question_gen=OpenAIQuestionGenerator.from_defaults(),
)
# This decomposes into sub-questions automatically
response = query_engine.query(
"Compare the revenue growth with the number of new product launches"
)Router Query Engine
Routes queries to the most appropriate index or query engine based on the question content:
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
query_engine = RouterQueryEngine(
selector=LLMSingleSelector.from_defaults(),
query_engine_tools=[
QueryEngineTool(
query_engine=code_index.as_query_engine(),
metadata=ToolMetadata(
name="code_docs",
description="Technical API documentation and code examples",
),
),
QueryEngineTool(
query_engine=policy_index.as_query_engine(),
metadata=ToolMetadata(
name="policies",
description="Company policies, HR documents, compliance guides",
),
),
],
)Chat Engines
Chat engines add conversation memory to query engines, enabling multi-turn interactions:
chat_engine = index.as_chat_engine(
chat_mode="condense_plus_context",
similarity_top_k=5,
)
# Multi-turn conversation
response1 = chat_engine.chat("What is the system architecture?")
response2 = chat_engine.chat("What databases does it use?") # uses context from prior turn
response3 = chat_engine.chat("How are they configured?") # maintains full history
# Reset conversation
chat_engine.reset()| Chat Mode | How It Works | Best For |
|---|---|---|
condense_question | Rewrites follow-up questions as standalone queries | Simple conversational Q&A |
context | Retrieves context for every turn | When every question needs fresh retrieval |
condense_plus_context | Condenses question AND retrieves context | Most production use cases |
simple | Sends all history to LLM (no retrieval) | Casual chat, no knowledge base |
react | Full ReAct agent with tools | Complex multi-step reasoning |
Data Connectors (LlamaHub)
LlamaHub is the ecosystem of 130+ data connectors for loading data from virtually any source. Think of it as the ingestion layer.
from llama_index.readers.web import SimpleWebPageReader
from llama_index.readers.database import DatabaseReader
from llama_index.readers.notion import NotionPageReader
from llama_index.readers.github import GithubRepositoryReader
# Web pages
web_docs = SimpleWebPageReader(html_to_text=True).load_data(
["https://docs.example.com/api"]
)
# SQL database
db_docs = DatabaseReader(uri="postgresql://user:pass@localhost/mydb").load_data(
query="SELECT title, content FROM articles WHERE published = true"
)
# Notion
notion_docs = NotionPageReader(integration_token="secret_xxx").load_data(
page_ids=["page-id-1", "page-id-2"]
)
# GitHub repository
github_docs = GithubRepositoryReader(
owner="org",
repo="repo",
filter_file_extensions=([".py", ".md"], GithubRepositoryReader.FilterType.INCLUDE),
).load_data(branch="main")Common Data Connectors
| Connector | Source | Package |
|---|---|---|
SimpleDirectoryReader | Local files (PDF, DOCX, CSV, etc.) | Built-in |
SimpleWebPageReader | Web pages | llama-index-readers-web |
DatabaseReader | SQL databases | llama-index-readers-database |
NotionPageReader | Notion pages | llama-index-readers-notion |
SlackReader | Slack channels | llama-index-readers-slack |
GithubRepositoryReader | GitHub repos | llama-index-readers-github |
ConfluenceReader | Confluence spaces | llama-index-readers-confluence |
GoogleDocsReader | Google Docs | llama-index-readers-google |
S3Reader | AWS S3 buckets | llama-index-readers-s3 |
Advanced Patterns
Metadata Filtering
Combine vector similarity with metadata filters for precise retrieval:
from llama_index.core.vector_stores.types import (
MetadataFilters,
MetadataFilter,
FilterOperator,
)
filters = MetadataFilters(
filters=[
MetadataFilter(key="category", value="engineering", operator=FilterOperator.EQ),
MetadataFilter(key="year", value=2025, operator=FilterOperator.GTE),
]
)
query_engine = index.as_query_engine(
similarity_top_k=10,
filters=filters,
)Auto-Retrieval
Let the LLM generate metadata filters from the natural language query:
from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores.types import VectorStoreInfo, MetadataInfo
vector_store_info = VectorStoreInfo(
content_info="Technical documentation for software products",
metadata_info=[
MetadataInfo(name="product", type="str", description="Product name"),
MetadataInfo(name="version", type="str", description="Software version"),
MetadataInfo(name="category", type="str", description="Doc category"),
],
)
retriever = VectorIndexAutoRetriever(
index,
vector_store_info=vector_store_info,
)
# The retriever automatically infers filters from the query
# "How do I configure caching in ProductX v3?" ->
# filters: product=ProductX, version=v3
nodes = retriever.retrieve("How do I configure caching in ProductX v3?")Reranking
Add a reranker to improve retrieval quality by reordering initial results:
from llama_index.postprocessor.cohere_rerank import CohereRerank
reranker = CohereRerank(top_n=5)
query_engine = index.as_query_engine(
similarity_top_k=20, # retrieve more initially
node_postprocessors=[reranker], # rerank to top 5
)Persistent Storage
from llama_index.core import StorageContext, load_index_from_storage
# Save index to disk
index.storage_context.persist(persist_dir="./storage")
# Load from disk later
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)For vector store backends, see Vector Databases.
LlamaIndex vs LangChain
This is the question every team asks. They are not direct competitors — they have different strengths and can be used together.
| Dimension | LlamaIndex | LangChain |
|---|---|---|
| Core focus | Data indexing and retrieval | General LLM orchestration |
| Index structures | Multiple (vector, tree, keyword, summary) | Vector store only |
| Query engines | Rich (sub-question, router, SQL, etc.) | Basic retrieval chains |
| Data connectors | 130+ via LlamaHub | 160+ via community |
| Agent support | Basic (via query tools) | Extensive (LangGraph) |
| Memory | Chat engine memory | Multiple memory types |
| Observability | LlamaTrace, callbacks | LangSmith integration |
| Composability | Pipeline-based | LCEL pipe syntax |
| Learning curve | Moderate | Moderate-steep |
| Best for | Data-heavy RAG applications | Multi-step agent workflows |
When to Use LlamaIndex
- Your application is primarily about querying private data (documents, databases, APIs)
- You need advanced retrieval patterns (sub-question decomposition, router queries)
- You want multiple index types for different query patterns
- Your RAG pipeline is the core product, not a feature of a larger agent system
When to Use LangChain
- You are building agents with complex control flow
- You need the broader ecosystem (memory, callbacks, tool integrations)
- Your application involves multi-step reasoning beyond retrieval
- You want tight LangSmith/LangGraph integration
Using Both Together
LlamaIndex query engines can be used as tools within LangChain agents:
from llama_index.core.tools import QueryEngineTool
from langchain.agents import create_react_agent
# LlamaIndex query engine as a LangChain tool
llama_tool = QueryEngineTool.from_defaults(
query_engine=index.as_query_engine(),
name="knowledge_base",
description="Search the internal knowledge base for technical documentation",
)
# Use in a LangChain agent
agent = create_react_agent(model, tools=[llama_tool, other_tools])The best approach is often both
Use LlamaIndex for data ingestion and retrieval (what it does best), and LangChain/LangGraph for orchestration and agent logic (what they do best). They are complementary, not competing.
Common Pitfalls
- Not tuning chunk size. The default chunk size (1024 tokens) is not optimal for every use case. Test with 256, 512, 1024, and 2048 to find the best size for your data and queries.
- Ignoring metadata. Metadata filtering dramatically improves retrieval precision. Always attach source, category, date, and other relevant metadata to your documents.
- Using SummaryIndex for Q&A. SummaryIndex iterates through all nodes — this is slow and expensive for point questions. Use VectorStoreIndex instead.
- Not persisting indexes. Rebuilding indexes from scratch on every restart wastes time and API calls. Use persistent storage or a vector database backend.
- Skipping evaluation. Use LlamaIndex's built-in evaluation modules or external tools to measure retrieval quality and answer accuracy.
Further Reading
- RAG Architecture Deep Dive — The patterns LlamaIndex implements
- Embeddings & Semantic Search — How vector search works under the hood
- Vector Databases — Storage backends for LlamaIndex indexes
- LangChain Deep Dive — The complementary orchestration framework
- LangSmith & LLM Observability — Monitoring LlamaIndex pipelines
- LlamaIndex Documentation — Official docs
- LlamaHub — Data connector ecosystem