
LlamaIndex + Claude: A Basic RAG Pipeline

Minh Tuấn, CTO, Transform Group

26/03/2026 · 5 min read


1. Apply it right away: LlamaIndex addresses three core RAG problems, starting with data ingestion, that is, loading and parsing many file types (PDF, Word, CSV, web pages). This article walks through a concrete workflow for each step.
2. Configuration comes first: set up LlamaIndex to use Claude as the LLM and Voyage AI as the embedding model through Settings from llama_index.core.
3. Not to be skipped: once you have documents, create a VectorStoreIndex. LlamaIndex chunks, embeds, and stores them automatically.
4. Using Claude effectively: LlamaIndex offers several response modes. The default, compact, packs the retrieved context into as few prompts as possible. Supplying enough context generally yields more accurate answers than a generic prompt.
5. For a better UX, enable streaming so the response is displayed token by token: index.as_query_engine(similarity_top_k=3, streaming=True). Adjust these settings to fit your own use case.

LlamaIndex is an open-source framework specialized in data ingestion and RAG (Retrieval-Augmented Generation). Combined with Claude, it lets you build a complete RAG pipeline in a few dozen lines of code, from loading documents to semantic search and generation.

This article is for readers who are new to RAG. You will learn how LlamaIndex works, how to connect it to Claude, and how to create your first query engine.

What is LlamaIndex?

LlamaIndex addresses three core problems of RAG:

Data Ingestion: load and parse hundreds of file types (PDF, Word, CSV, web pages...)
Indexing: turn documents into a searchable vector index
Querying: run semantic search and generate answers with an LLM

Instead of writing each step yourself, you get simple abstractions from LlamaIndex and can focus on business logic.

Installation

pip install llama-index llama-index-llms-anthropic llama-index-embeddings-voyageai

Set up the API keys:

import os
os.environ["ANTHROPIC_API_KEY"] = "your_api_key"
os.environ["VOYAGE_API_KEY"] = "your_voyage_key"
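
Hard-coding keys in source is convenient for a demo, but they are easy to leak into version control. Below is a minimal sketch of an alternative, assuming the keys are already exported in your shell. The helper name require_env is hypothetical and not part of LlamaIndex.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# Check both keys at startup instead of failing on the first API call.
for key in ("ANTHROPIC_API_KEY", "VOYAGE_API_KEY"):
    try:
        require_env(key)
        print(f"{key}: found")
    except RuntimeError as err:
        print(err)
```

Checking at startup keeps a missing key from surfacing later as a less obvious authentication error.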

Configuring LlamaIndex with Claude

The first step is to configure LlamaIndex to use Claude as the LLM and Voyage AI as the embedding model:

import os

from llama_index.core import Settings
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.voyageai import VoyageEmbedding

# Configure the LLM
Settings.llm = Anthropic(
    model="claude-opus-4-5",
    max_tokens=1024
)

# Configure the embedding model
Settings.embed_model = VoyageEmbedding(
    model_name="voyage-3",
    voyage_api_key=os.environ.get("VOYAGE_API_KEY")
)

# Configure chunking: chunk size and the overlap between neighboring chunks
Settings.chunk_size = 512
Settings.chunk_overlap = 50

print("LlamaIndex configured with Claude + Voyage AI")
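
To make chunk_size and chunk_overlap more concrete, here is a toy sketch of overlapping windows. It is a simplification, not LlamaIndex's splitter: it counts whitespace-separated words, while LlamaIndex measures chunks in tokens and tries to respect sentence boundaries. The function name chunk_words is hypothetical.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into word windows of chunk_size that share chunk_overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window repeats the last word of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.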

Load Documents

LlamaIndex ships with many loaders. The simplest example loads everything from a directory:

from llama_index.core import SimpleDirectoryReader

# Load all files in the directory
loader = SimpleDirectoryReader("./data")
documents = loader.load_data()

print(f"Loaded {len(documents)} documents")
for doc in documents[:3]:
    print(f"  - {doc.metadata.get('file_name', 'unknown')}: {len(doc.text)} chars")
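
Conceptually, a directory loader walks a folder, reads each file, and attaches metadata such as the file name. The sketch below shows only that idea with the standard library and plain .txt files. It is an illustration, not how SimpleDirectoryReader is implemented, and load_text_files is a hypothetical name.

```python
import tempfile
from pathlib import Path


def load_text_files(directory: str) -> list[dict]:
    """Read every .txt file under directory into a {text, metadata} record."""
    records = []
    for path in sorted(Path(directory).rglob("*.txt")):
        records.append({
            "text": path.read_text(encoding="utf-8"),
            "metadata": {"file_name": path.name, "file_path": str(path)},
        })
    return records


# Demo on a throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "notes.txt").write_text("hello RAG", encoding="utf-8")
    for record in load_text_files(tmp):
        print(record["metadata"]["file_name"], len(record["text"]), "chars")
```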

Or create documents manually from text:

from llama_index.core import Document

documents = [
    Document(
        text="""Claude is Anthropic's AI assistant, designed to be safe and helpful.
        Claude can write code, analyze data, summarize documents, and handle many other tasks.
        The Claude API lets you integrate it into your own applications.""",
        metadata={"source": "claude_intro", "language": "en"}
    ),
    Document(
        text="""Anthropic is an AI company focused on AI safety. Founded in 2021,
        Anthropic develops Claude with the goal of building AI that is trustworthy and aligned with human values.
        It has an office in San Francisco and a team of more than 500 people.""",
        metadata={"source": "anthropic_about", "language": "en"}
    )
]

Creating a Vector Index

Once you have documents, create a VectorStoreIndex. LlamaIndex chunks, embeds, and stores them automatically:

from llama_index.core import VectorStoreIndex

# Build the index from the documents
# LlamaIndex handles chunk -> embed -> store automatically
index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True
)

print("Index created successfully!")
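
To see what "embed and store" buys you, here is a toy in-memory vector index. It rests on loud assumptions: the embedding is a bag-of-words count vector instead of a dense vector from Voyage AI, and ToyVectorIndex is a hypothetical class, not a LlamaIndex API. The retrieval idea is the same: embed the query, score it against every stored chunk, and keep the top k.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real model returns dense float vectors)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    def __init__(self, chunks: list[str]):
        # Embed once at build time and keep (chunk, vector) pairs in memory.
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def search(self, query: str, top_k: int = 3) -> list[tuple[float, str]]:
        q = embed(query)
        scored = [(cosine(q, vec), chunk) for chunk, vec in self.entries]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


toy_index = ToyVectorIndex([
    "Claude is an AI assistant built by Anthropic",
    "Anthropic is an AI safety company",
    "Paris is the capital of France",
])
for score, chunk in toy_index.search("who built Claude", top_k=2):
    print(f"{score:.3f}  {chunk}")
```

Word counts only match exact words, which is why real pipelines use a semantic embedding model: it can match a query to a chunk even when they share no vocabulary.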

Saving and loading the index

To avoid re-embedding on every run, save the index to disk:

from llama_index.core import StorageContext, load_index_from_storage

# Save the index
index.storage_context.persist(persist_dir="./index_storage")
print("Index saved to disk")

# Load the index back (no re-embedding needed on later runs)
storage_context = StorageContext.from_defaults(persist_dir="./index_storage")
index = load_index_from_storage(storage_context)
print("Index loaded from disk")
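
In a real script you would usually do one or the other on each run: load the saved index if it exists, otherwise build and save it. The sketch below shows that control flow with a JSON file standing in for the index. load_or_build and expensive_build are hypothetical names, and the JSON format is only an illustration, not LlamaIndex's storage layout.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_build(path: Path, build: Callable[[], dict]) -> tuple[dict, bool]:
    """Return (index_data, was_loaded). Builds and saves only when no cache exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8")), True
    data = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data, False


def expensive_build() -> dict:
    # Stand-in for chunking + embedding, which costs time and API calls.
    return {"chunks": ["chunk one", "chunk two"]}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "index_storage" / "index.json"
    _, loaded_first = load_or_build(cache, expensive_build)
    _, loaded_second = load_or_build(cache, expensive_build)
    print(loaded_first, loaded_second)
```

With LlamaIndex the same branch would check whether the persist directory exists, then call either load_index_from_storage or VectorStoreIndex.from_documents followed by persist.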

Basic Query Engine

A query engine is the interface for asking questions over your documents. LlamaIndex creates one from the index:

# Create the query engine
query_engine = index.as_query_engine(
    similarity_top_k=3,   # Retrieve the top 3 chunks
    response_mode="compact"  # Compact mode packs the retrieved context together
)

# Ask a question
response = query_engine.query("Which company created Claude?")

print(f"Answer: {response.response}")
print("\nSources:")
for node in response.source_nodes:
    print(f"  Score: {node.score:.3f} | {node.metadata.get('source', 'unknown')}")
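
The similarity scores printed above can also be used to decide which sources are worth citing. The sketch below filters retrieved chunks by a minimum score. Hit is a hypothetical stand-in for a retrieved node, and both the scores and the 0.4 threshold are made-up illustrative numbers, not values produced by LlamaIndex.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """Hypothetical stand-in for a retrieved chunk with its similarity score."""
    text: str
    score: float
    metadata: dict = field(default_factory=dict)


def confident_sources(hits: list[Hit], min_score: float) -> list[str]:
    """Return source labels for hits at or above min_score, best first."""
    kept = sorted(
        (h for h in hits if h.score >= min_score),
        key=lambda h: h.score,
        reverse=True,
    )
    return [h.metadata.get("source", "unknown") for h in kept]


hits = [
    Hit("Anthropic is ...", 0.41, {"source": "anthropic_about"}),
    Hit("Claude is ...", 0.83, {"source": "claude_intro"}),
    Hit("Unrelated text", 0.12),
]
print(confident_sources(hits, min_score=0.4))
```

A sensible threshold depends on the embedding model, so it is worth inspecting real scores on your own data before picking one.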

Response Modes

LlamaIndex has several response modes:

compact: packs as many retrieved chunks as fit into each prompt, which keeps the number of LLM calls low (default)
tree_summarize: summarizes groups of chunks, then combines the summaries (good for long documents)
refine: refines the answer chunk by chunk, one LLM call per chunk (slower but more thorough)
simple_summarize: truncates all chunks to fit a single prompt for a quick summary, which can lose detail

# Compare response modes
for mode in ["compact", "tree_summarize", "refine"]:
    engine = index.as_query_engine(
        similarity_top_k=3,
        response_mode=mode
    )
    response = engine.query("What is Anthropic?")
    print(f"\n[{mode}] {response.response[:200]}...")
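
The speed difference between compact and refine comes mostly from how many LLM calls each one makes. The sketch below counts prompts under a simplifying assumption: the budget is measured in characters, while LlamaIndex works with tokens and the model's context window. pack_compact and pack_refine are hypothetical names.

```python
def pack_compact(chunks: list[str], budget: int) -> list[str]:
    """Greedily concatenate chunks into prompts of up to `budget` characters.

    A single chunk longer than the budget still gets a prompt of its own.
    """
    prompts, current = [], ""
    for chunk in chunks:
        candidate = f"{current}\n\n{chunk}" if current else chunk
        if current and len(candidate) > budget:
            prompts.append(current)  # current prompt is full, start a new one
            current = chunk
        else:
            current = candidate
    if current:
        prompts.append(current)
    return prompts


def pack_refine(chunks: list[str]) -> list[str]:
    """One prompt per chunk: each call refines the previous answer."""
    return list(chunks)


chunks = ["a" * 40, "b" * 40, "c" * 40]
print("compact calls:", len(pack_compact(chunks, budget=100)))
print("refine calls:", len(pack_refine(chunks)))
```

With three 40-character chunks and a budget of 100, compact needs two prompts while refine needs three, and the gap widens as similarity_top_k grows.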

Chat Engine: Chatbot with Memory

Besides the query engine, LlamaIndex also provides a chat engine for building chatbots that keep conversation history:

# Create the chat engine
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    verbose=True
)

# Chat with memory
response1 = chat_engine.chat("What is Anthropic?")
print(f"Bot: {response1.response}")

response2 = chat_engine.chat("How many employees do they have?")
print(f"Bot: {response2.response}")  # Resolves "they" to Anthropic

# Reset the conversation
chat_engine.reset()
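
The condense_plus_context mode first rewrites the conversation plus the latest message into a standalone question, then retrieves context for that question. The sketch below shows roughly what such a rewriting prompt could look like. The wording is an assumption made for illustration and is not LlamaIndex's actual template. build_condense_prompt is a hypothetical helper.

```python
def build_condense_prompt(history: list[tuple[str, str]], follow_up: str) -> str:
    """Format chat history plus a follow-up into a question-rewriting prompt."""
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Given the conversation below, rewrite the follow-up as a standalone question.\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Follow-up: {follow_up}\n"
        "Standalone question:"
    )


history = [
    ("user", "What is Anthropic?"),
    ("assistant", "Anthropic is an AI company focused on AI safety."),
]
print(build_condense_prompt(history, "How many employees do they have?"))
```

The LLM's answer to this prompt, something like "How many employees does Anthropic have?", is what gets embedded for retrieval, which is why the pronoun in the follow-up does not derail the search.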

Streaming Response

For a better UX, enable streaming so the response is displayed token by token:

streaming_engine = index.as_query_engine(
    similarity_top_k=3,
    streaming=True
)

# Stream the response
streaming_response = streaming_engine.query("What can Claude do?")
for text in streaming_response.response_gen:
    print(text, end="", flush=True)
print()  # Final newline
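
The loop over response_gen is ordinary generator consumption, so you can test display logic without calling the API. The sketch below uses a fake generator in place of a real streaming response and also collects the pieces, since printing alone discards the full text. fake_token_stream and print_stream are hypothetical helpers.

```python
import time
from typing import Iterator


def fake_token_stream(answer: str, delay: float = 0.0) -> Iterator[str]:
    """Yield an answer piece by piece, the way a streaming generator would."""
    for word in answer.split(" "):
        time.sleep(delay)  # stand-in for latency between pieces
        yield word + " "


def print_stream(stream: Iterator[str]) -> str:
    """Print each piece as it arrives and return the full text for logging."""
    parts = []
    for piece in stream:
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts).strip()


full_text = print_stream(fake_token_stream("Claude can write code and summarize documents"))
```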

Custom Prompts

Customize the question-answering prompt to fit your use case:

from llama_index.core import PromptTemplate

# Custom QA prompt
qa_template = PromptTemplate(
    """You are a helpful AI assistant. Based on the information below,
answer the question accurately and concisely in English.
If the answer is not in the context, say "I don't have information about this."

Context:
{context_str}

Question: {query_str}
Answer:"""
)
)

query_engine = index.as_query_engine(
    similarity_top_k=3,
    text_qa_template=qa_template
)
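
The two placeholders, context_str and query_str, are filled in at query time with the retrieved chunks and the user's question. The sketch below mimics that substitution with plain str.format so you can preview a final prompt offline. It assumes the chunks are simply joined with blank lines, and render_prompt is a hypothetical helper rather than a LlamaIndex function.

```python
QA_TEMPLATE = (
    "Context:\n{context_str}\n\n"
    "Question: {query_str}\n"
    "Answer:"
)


def render_prompt(template: str, chunks: list[str], question: str) -> str:
    """Join retrieved chunks and substitute them into the two template slots."""
    return template.format(context_str="\n\n".join(chunks), query_str=question)


prompt = render_prompt(
    QA_TEMPLATE,
    ["Claude is built by Anthropic.", "Anthropic focuses on AI safety."],
    "Who built Claude?",
)
print(prompt)
```

Previewing the rendered prompt this way makes it easier to spot instructions that end up buried under a long context block.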

Conclusion

With fewer than 50 lines of code, you now have a complete RAG pipeline with LlamaIndex + Claude. The framework handles the complexity of RAG (chunking, embedding, retrieval) so you can focus on what matters most: answer quality.

Next steps: learn about the ReAct Agent with LlamaIndex to build AI agents capable of complex reasoning, or explore the Multi-Document Agent to query multiple data sources at once.
