Introduction to LlamaIndex

Unleash the power of LLMs over your data! 🚀 - @akshay_pachaar ✍️

LlamaIndex is the go-to framework for building RAG (Retrieval-Augmented Generation) based LLM applications.

LlamaIndex combines the generative power of an LLM with the knowledge of various data sources (YouTube, PDFs, Wikipedia, etc.). The knowledge gathered from these data sources acts as a source of truth that grounds the LLM's answers and reduces hallucination (generating inaccurate results).

Today, we'll understand what LlamaIndex is, its various components, and how to use it.

Vector Stores:

Vector stores are a special kind of database designed to store high-dimensional data and provide essential tools for retrieval and similarity search.

Essentially, a vector store stores the original documents as vector embeddings that capture the semantic meaning of the information within each document.

When searching for something in this database, the query is also converted into a vector embedding. Once everything is represented as vectors, it becomes easier to find relevant and similar vectors to the query vector, leading to the discovery of relevant information.

Semantic search can be applied to all data formats, as we vectorize the data before storing it in a database.
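To make this concrete, here's a minimal sketch of the similarity search happening inside a vector store. The toy 4-dimensional vectors and document names below are made up for illustration; in a real system an embedding model produces vectors with hundreds of dimensions:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: 1.0 means same direction, near 0.0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "document" embeddings (a real embedding model would produce these)
doc_vectors = {
    "delhi_history.txt": np.array([0.9, 0.1, 0.0, 0.2]),
    "tokyo_travel.txt":  np.array([0.1, 0.8, 0.3, 0.0]),
}

# The query is embedded into the same vector space
query_vector = np.array([0.85, 0.15, 0.05, 0.1])

# Rank documents by similarity to the query and return the best match
best = max(doc_vectors, key=lambda name: cosine_similarity(query_vector, doc_vectors[name]))
print(best)  # delhi_history.txt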

Here’s how a typical VectorDB can be represented:

[Figure: Vector Database]

Data Connectors:

The data that we store in our VectorDB comes from various sources, e.g., APIs, YouTube, PDFs, SQL databases, Wikipedia, etc.

Now, to connect these various data sources to our RAG system, LlamaIndex offers various data connectors, also called Readers. They parse data arriving in different formats into standard Document objects that contain the text along with its associated metadata.

LlamaHub is an open-source project that hosts data connectors. The LlamaHub repository provides data connectors for ingesting data in various formats into the LLM.

Here’s an example of using the WikipediaReader:

from llama_index import download_loader

# Download and instantiate the Wikipedia data connector
WikipediaReader = download_loader("WikipediaReader")
loader = WikipediaReader()

# Each Wikipedia page is loaded as one Document object
documents = loader.load_data(pages=['Delhi', 'Mumbai', 'Tokyo', 'Rome'])

print(len(documents))
4

Nodes:

Once the data is loaded as documents, LlamaIndex transforms these documents into Node objects. Nodes are created by breaking the original documents into smaller chunks (of a size specified by the user).

Apart from its original content, each node also contains metadata and contextual information recording where and how the chunk appeared in the original document.

LlamaIndex has a NodeParser class that does this.

Here’s an example of a NodeParser called SimpleNodeParser that converts a list of provided documents into Node objects:

from llama_index.node_parser import SimpleNodeParser


# Initialize the parser: 512-token chunks with a 20-token overlap
parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=20)

# Parse documents into nodes
nodes = parser.get_nodes_from_documents(documents)
print(len(nodes))
180
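
To see the metadata and contextual information in action, you can inspect one of the nodes. The attribute names below follow the 0.x-era llama_index API used throughout this post and may differ slightly in other versions:

# Inspect the first node
node = nodes[0]
print(node.node_id)              # unique id for this chunk
print(node.metadata)             # metadata carried over from the source Document
print(node.get_content()[:200])  # first 200 characters of the chunk's text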

Indices:

The ability to index and search through various data formats like docs, PDFs, database queries, videos, etc. is what makes LlamaIndex so powerful.

The unstructured data is first converted into embeddings that capture its semantic meaning; once the data is in the form of embeddings, it becomes much easier to query and search through.

Depending on the use case, LlamaIndex offers various indices (see the sketch after this list):

  • Summary Index: A summary from each document is extracted and stored in all the nodes corresponding to that document. It provides a larger context within each node.

  • Tree Index: The tree index is a tree-structured index, where each node is a summary of its child nodes. During index construction, the tree is built bottom-up until we end up with a set of root nodes.

  • VectorStoreIndex: It’s suitable for small-scale applications and easily scalable to accommodate larger datasets using high-performance vector databases.
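As a rough sketch, each index type can be built from the same list of documents; which one you choose depends on the query pattern. The imports below assume the same 0.x-era llama_index API used throughout this post, and note that constructing a Tree Index makes LLM calls to summarize nodes, so it costs tokens:

from llama_index import SummaryIndex, TreeIndex, VectorStoreIndex

# Same documents, different index structures for different query patterns
summary_index = SummaryIndex.from_documents(documents)     # good for summarization queries
tree_index = TreeIndex.from_documents(documents)           # hierarchical, bottom-up summaries
vector_index = VectorStoreIndex.from_documents(documents)  # semantic top-k retrieval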

Today we'll be using DeepLakeVectorStore. You can create an Activeloop account for free, get an access token (analogous to an openai_api_key), and store it in an environment variable.
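
For example, here's one way to make the tokens available; ACTIVELOOP_TOKEN is the environment variable Deep Lake conventionally reads, but do confirm the exact name in the Activeloop docs. In a real project, load secrets from a .env file rather than hardcoding them:

import os

# Token for the Deep Lake vector store (assumed env var name; see Activeloop docs)
os.environ["ACTIVELOOP_TOKEN"] = "<your_activeloop_token>"

# OpenAI key, used by LlamaIndex for embeddings and LLM calls
os.environ["OPENAI_API_KEY"] = "<your_openai_api_key>"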

Once you have your account created, here’s how you can connect to the VectorStore:

Using the DeepLakeVectorStore class, you provide the dataset path as an argument. Just replace your_org_id (which is basically your Activeloop username) and your_dataset_name (name it whatever you like).

from llama_index.vector_stores import DeepLakeVectorStore

my_activeloop_org_id = "your_org_id"
my_activeloop_dataset_name = "LlamaIndex-101"

dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

# Connect to (or create) the Deep Lake vector store; overwrite=False keeps existing data
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)
Your Deep Lake dataset has been successfully created!

Activeloop also provides a nice UI for viewing your created vector dataset.

Next, we need to establish a storage context using the StorageContext class (it basically lets you manage your vector database from LlamaIndex), selecting the Deep Lake dataset as the data source. This storage context is then passed to the VectorStoreIndex class, which constructs the index (generates embeddings) and saves the results to the specified dataset.

from llama_index.storage.storage_context import StorageContext
from llama_index import VectorStoreIndex

storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build the index: embeds the documents and writes them to the Deep Lake dataset
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
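
On subsequent runs you don't need to re-embed everything; as a sketch (again assuming the 0.x-era API), you can reconnect to the already-populated dataset and build the index directly from the vector store:

# Reconnect to the existing Deep Lake dataset (overwrite=False keeps the data)
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)

# Build the index from the stored embeddings instead of re-embedding the documents
index = VectorStoreIndex.from_vector_store(vector_store)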


Query Engines:

Now that we have all the data indexed, we can use these indices to query for the specific information we're looking for in our data.

This is where a Query Engine comes in: it's a wrapper that combines a Retriever and a Response Synthesizer into a single pipeline.

The pipeline uses the query string to fetch nodes and then sends them to the LLM to generate a response.
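
To make the pipeline concrete, here's roughly the same thing assembled by hand. Module paths follow the 0.x-era llama_index API used in this post, and similarity_top_k=3 is an illustrative choice:

from llama_index import get_response_synthesizer
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine

# Retriever: fetches the top-k most similar nodes for a query
retriever = VectorIndexRetriever(index=index, similarity_top_k=3)

# Response synthesizer: sends the retrieved nodes + query to the LLM
synthesizer = get_response_synthesizer()

# Query engine: the retriever and synthesizer combined into one pipeline
query_engine = RetrieverQueryEngine(retriever=retriever, response_synthesizer=synthesizer)

In practice, though, you rarely need to assemble this by hand.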

You can create a query engine by calling the as_query_engine() method on an index you've created.

Here’s how you can query the knowledge that you’ve stored in your vector DB!

query_engine = index.as_query_engine()
response = query_engine.query("What is the historical significance of Delhi?")
print(response.response)
Delhi has been historically significant as it has served as the capital of various empires and kingdoms throughout history. It has been a prominent political, cultural, and commercial center in India for centuries. Delhi's historical significance is rooted in its role as a seat of power, witnessing the rise and fall of different dynasties and playing a crucial part in shaping the country's history.

You can find all the code in my GitHub repository!

In the upcoming blogs, we will continue this journey by exploring various components involved in building a production-grade RAG application. Additionally, we will work on projects to enhance the learning experience.

Stay tuned!
