Using Langchain and Open Source Vector DB Chroma for Semantic Search with OpenAI's LLM

1. Introduction

In the world of AI-native applications, Chroma DB and Langchain have made significant strides. Chroma DB is an open-source embedding (vector) database, designed to provide efficient, scalable, and flexible ways to store and search embeddings. Langchain, on the other hand, is a comprehensive framework for developing applications powered by language models.

You might recall our previous blog where we covered Langchain's capabilities with Pinecone Vector Database. Today, we are here to showcase Chroma DB. Why? Because Chroma DB is open-source, is the default vector DB used by Langchain, and has gained significant popularity in recent times.

In this blog, we will delve into how to use Chroma DB for semantic search using Langchain's utilities. Specifically, we will discuss indexing documents, retrieving semantically similar documents, implementing persistence, integrating Large Language Models (LLMs), and employing question-answering and retriever chains.

2. Setting up the Environment

To start off, let's set up our environment. For this exercise, we will need the following libraries:

!pip install openai langchain sentence_transformers chromadb unstructured -q

3. Loading and Splitting the Documents

Now that we've set up our environment, let's start by loading and splitting documents using Langchain utilities.

from langchain.document_loaders import DirectoryLoader

directory = '/content/pets'

def load_docs(directory):
  loader = DirectoryLoader(directory)
  documents = loader.load()
  return documents

documents = load_docs(directory)
len(documents)

Once we load the documents, we split them using the RecursiveCharacterTextSplitter from Langchain.

from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_docs(documents, chunk_size=1000, chunk_overlap=20):
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
  docs = text_splitter.split_documents(documents)
  return docs

docs = split_docs(documents)
print(len(docs))

This text splitter, recommended for general text, works from an ordered list of separator characters. It tries to split the text on each separator in turn until the resulting chunks are small enough. By default, the list is ["\n\n", "\n", " ", ""].

The goal is to keep paragraphs, then sentences, then words together for as long as possible, since these are typically the most semantically coherent units in a text.
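
To make the separator-by-separator behaviour concrete, here is a minimal pure-Python sketch of the idea. It is not Langchain's implementation: the function name recursive_split is hypothetical, and the sketch ignores chunk_overlap entirely.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Toy recursive splitter: try each separator in order until pieces fit."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # The empty separator is the last resort: cut at fixed character widths.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: recurse with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee"
chunks = recursive_split(sample, chunk_size=10)
```

In this sample the first paragraph fits in one chunk, while the second is too long and falls back to splitting on spaces, which is the "paragraphs first, then words" priority described above.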

4. Embedding Text Using Langchain

After splitting the documents, the next step is to embed the text using Langchain. Let's go ahead and use the SentenceTransformerEmbeddings from Langchain.

from langchain.embeddings import SentenceTransformerEmbeddings
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

5. Creating Vector Store with Chroma DB

Vector stores are a common way to store and search unstructured data. The standard process is to create embeddings from the unstructured data, save the resulting vectors, and then, at query time, embed the query and retrieve the vectors that are 'most similar' to it. The vector store's job is to hold the embedded data and run that similarity search.
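
The store-then-search loop can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how Chroma works internally: ToyVectorStore is a hypothetical class, and the three-dimensional vectors are hand-written stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Keeps (text, vector) pairs in memory and ranks them by cosine similarity."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, vector))

    def similarity_search(self, query_vector, k=1):
        scored = [(cosine_similarity(query_vector, vec), text) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
        return [text for _, text in scored[:k]]


store = ToyVectorStore()
# Made-up vectors: a real system would get these from an embedding model.
store.add("dogs are loyal companions", [0.9, 0.1, 0.0])
store.add("cats are independent", [0.8, 0.3, 0.1])
store.add("stock markets fell today", [0.0, 0.2, 0.9])

top = store.similarity_search([1.0, 0.0, 0.0], k=2)
```

A real vector database adds indexing so that the search does not have to compare the query against every stored vector, but the contract is the same: vectors in, nearest texts out.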

Importantly, Langchain offers support for various vector stores, including Chroma, Pinecone, and others. This flexibility enables users to choose the most suitable vector store based on their specific requirements and preferences.

Let's create a vector store using the Chroma DB from the documents we loaded and split.

from langchain.vectorstores import Chroma
db = Chroma.from_documents(docs, embeddings)

6. Retrieving Semantically Similar Documents

Now that we've created the vector store, we can use it to execute a query and retrieve semantically similar documents.

query = "What are the different kinds of pets people commonly own?"
# similarity_search_with_score returns (Document, score) tuples, as in the output below
matching_docs = db.similarity_search_with_score(query)

matching_docs[0]

(Document(page_content='Pet animals come in all shapes and sizes, each suited to different lifestyles and home environments. Dogs and cats are the most common, known for their companionship and unique personalities.....', metadata={'source': '/content/pets/Different Types of Pet Animals.txt'}), 0.7325009703636169)

7. Persistence in Chroma DB

Persistence is an important aspect of any database. In this step, we will create a persistent Chroma DB instance.

If you want to save to disk, pass the directory where you want the data stored as persist_directory when creating the vector store.

persist_directory = "chroma_db"

vectordb = Chroma.from_documents(
    documents=docs, embedding=embeddings, persist_directory=persist_directory
)

vectordb.persist()
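
Once the data is on disk, a later session can reopen it instead of re-indexing the documents. The helper below is a small sketch of that step: the function name load_persisted_store is hypothetical, and it assumes the same Langchain version and Chroma wrapper used in the rest of this post.

```python
def load_persisted_store(persist_directory, embeddings):
    """Reopen a Chroma collection previously saved to persist_directory."""
    # Imported lazily so the helper can be defined before langchain is installed.
    from langchain.vectorstores import Chroma

    # embedding_function should be the same model that was used for indexing,
    # otherwise query vectors and stored vectors will not be comparable.
    return Chroma(persist_directory=persist_directory, embedding_function=embeddings)
```

Calling load_persisted_store("chroma_db", embeddings) with the embeddings object from section 4 should give back a store that can answer similarity_search queries straight away.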

8. Using OpenAI Large Language Models (LLMs) with Chroma DB

Next, we will see how to integrate OpenAI's Large Language Models (LLMs) with Chroma DB.

import os
os.environ["OPENAI_API_KEY"] = "key"  # replace "key" with your own OpenAI API key

from langchain.chat_models import ChatOpenAI
model_name = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name=model_name)

9. Extracting Answers from Documents

LangChain introduces a useful abstraction called a 'Chain' for representing sequences of calls to components. These components can include other chains, making it possible to build complex, nested sequences of operations. One specific type of chain is the question-answering (QA) chain.

The QA chain is designed for answering questions based on a provided set of documents. In the workflow below, we first run a similarity search for the input question against the embedded documents, then pass the most relevant documents to the chain, which uses the model to generate an answer from them.

By using the question-answering chain provided by Langchain, we can extract answers from documents.

from langchain.chains.question_answering import load_qa_chain
chain = load_qa_chain(llm, chain_type="stuff", verbose=True)

query = "What are the emotional benefits of owning a pet?"
matching_docs = db.similarity_search(query)
answer = chain.run(input_documents=matching_docs, question=query)
answer

Output: Owning a pet can provide emotional support, reduce stress and anxiety, and can even help their owners lead healthier lives......
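
The "stuff" chain type used above places all of the retrieved documents into a single prompt. The sketch below shows that idea in plain Python. The function name build_stuff_prompt and the prompt wording are hypothetical and are not Langchain's actual template.

```python
def build_stuff_prompt(documents, question):
    """Concatenate every document into one context block, then append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Illustrative snippets, not the contents of the pets dataset.
docs_text = ["Pets reduce stress.", "Dogs encourage daily exercise."]
prompt = build_stuff_prompt(docs_text, "What are the emotional benefits of owning a pet?")
```

Because everything goes into one prompt, this approach is simple but only works while the combined documents fit within the model's context window.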

10. Utilizing the RetrievalQA Chain

Finally, we use the RetrievalQA chain in Langchain, which combines the retrieval step and the question-answering step into a single call.

from langchain.chains import RetrievalQA
retrieval_chain = RetrievalQA.from_chain_type(llm, chain_type="stuff", retriever=db.as_retriever())
retrieval_chain.run(query)

Output: Owning a pet can provide emotional support and reduce stress. Pets can also offer comfort and consistency .....
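
Conceptually, a retrieval chain bundles the two steps from the previous section (fetch the relevant documents, then answer from them) behind one call. The sketch below illustrates that composition with stand-in functions. All names here are hypothetical, and the stubs replace the real retriever and LLM.

```python
def make_retrieval_qa(retrieve, answer_with_llm):
    """Return a function that retrieves documents for a question, then answers from them."""
    def run(question):
        documents = retrieve(question)
        return answer_with_llm(documents, question)
    return run


# Stand-ins for the vector store retriever and the LLM call.
def stub_retrieve(question):
    return ["Pets reduce stress.", "Pets offer comfort."]


def stub_answer(documents, question):
    return f"{len(documents)} docs used for: {question}"


qa = make_retrieval_qa(stub_retrieve, stub_answer)
result = qa("Why own a pet?")
```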

11. Further Reading

To help you deepen your understanding, we recommend the following articles:

A Detailed Exploration of Chroma DB: This blog post will provide you with in-depth knowledge about Chroma DB and its Python library.
Pinecone Vector Database and Langchain: This blog post discusses using Pinecone vector database in tandem with Langchain, similar to what we did in this blog post with Chroma DB.

12. Video Walkthrough

For those who prefer a more interactive form of learning, we have prepared a video walkthrough of this entire process. It complements this blog post and provides a step-by-step guide to help you visualize the process.

https://youtu.be/5NG8mefEsCU

13. Conclusion

In this blog post, we showcased how to use Chroma DB, an open-source embedding database, in tandem with Langchain for semantic search. We demonstrated how to load and split documents, create embeddings, and use those embeddings to store and search documents in Chroma DB.

We also discussed how to integrate Large Language Models (LLMs) provided by OpenAI with Chroma DB and extract answers from documents using Langchain's question-answering chain. Additionally, we used Langchain's RetrievalQA chain to combine retrieval and answering in one step.

By leveraging Langchain and Chroma DB, developers can create sophisticated applications powered by large language models that can handle complex information retrieval tasks. It opens the door to creating AI-native applications that can leverage the power of vector databases and language models.

We hope this guide has been informative and useful in your journey to develop AI-native applications. Remember, the power of AI is in your hands. Keep exploring and keep innovating!

Full Code: https://github.com/PradipNichite/Youtube-Tutorials/blob/main/Chroma_DB_with_Langchain.ipynb

FutureSmart AI Blog