Building a Document-based Question Answering System with LangChain, Pinecone, and LLMs like GPT-4 and ChatGPT

1. Introduction

In this blog post, we will delve into the creation of a document-based question-answering system using LangChain and Pinecone, taking advantage of the latest advancements in large language models (LLMs), such as OpenAI GPT-4 and ChatGPT.

LangChain is a powerful framework designed for developing applications driven by language models, while Pinecone serves as an efficient vector database for building high-performance vector search applications. Our use case focuses on answering questions over specific documents, relying solely on the information within those documents to generate accurate and context-aware answers.

By combining semantic search with the capabilities of LLMs like GPT-4 and ChatGPT, we will demonstrate how to build a state-of-the-art document QnA system.

2. Why is Semantic Search + GPT QnA better than fine-tuning GPT?

Before diving into the implementation, let's understand the advantages of using semantic search with GPT QnA over fine-tuning GPT:

Broader knowledge coverage:

Semantic Search + GPT QnA uses a two-step process that first finds relevant passages from a large corpus of documents and then generates answers based on those passages. This approach can provide more accurate and up-to-date information, leveraging the latest information from various sources. Fine-tuning GPT, on the other hand, relies on the knowledge encoded in the model during training, which may become outdated or incomplete over time.

Context-specific answers:

Semantic Search + GPT QnA can generate more context-specific and precise answers by grounding them in specific passages from the relevant documents. A fine-tuned GPT model, by contrast, generates answers from the general knowledge embedded in its weights, which may be less precise or unrelated to the question's context.

Adaptability:

The Semantic Search component can be easily updated with new information sources or tuned to different domains, making it more adaptable to specific use cases or industries. In contrast, fine-tuning GPT requires re-training the model, which can be time-consuming and computationally expensive.

Better handling of ambiguous queries:

Semantic Search can help disambiguate queries by identifying the most relevant passages related to the question. This can lead to more accurate and relevant answers compared to a fine-tuned GPT model, which may struggle with ambiguity without proper context.

3. LangChain Modules

LangChain provides support for several main modules; the ones relevant to this post are:

  • Models: The various model types and model integrations LangChain supports.

  • Indexes: Language models are often more powerful when combined with your own text data - this module covers best practices for doing exactly that.

  • Chains: Chains go beyond just a single LLM call, and are sequences of calls (whether to an LLM or a different utility). LangChain provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.
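
To make the Chains idea concrete, here is a minimal single-step chain sketched with LangChain's LLMChain and a simple prompt template (it assumes an OpenAI API key is already configured; the prompt text is just an illustration):

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# a prompt template with one input variable, filled in at call time
prompt = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in one sentence.",
)

# a single-step chain: format the prompt, call the LLM, return the text
chain = LLMChain(llm=OpenAI(), prompt=prompt)
print(chain.run("vector databases"))

The question-answering chain we load in section 10 follows the same pattern, with retrieved documents injected into the prompt.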

4. Setting up the environment

To start, we need to install the required packages and import the necessary libraries.

Installing required packages:

!pip install --upgrade langchain openai -q
!pip install unstructured -q
!pip install unstructured[local-inference] -q
!pip install detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2 -q
!apt-get install poppler-utils

Importing necessary libraries:

import os
import openai
import pinecone
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
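
The OpenAI classes used throughout this post read the API key from the OPENAI_API_KEY environment variable. A minimal way to set it in the notebook (replace the placeholder with your own key):

# make the key available to OpenAIEmbeddings and the OpenAI LLM wrapper
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"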

5. Loading documents

First, we need to load the documents from a directory using the DirectoryLoader from LangChain. In this example, we assume the documents are stored in a directory called 'data'.

directory = '/content/data'

def load_docs(directory):
  loader = DirectoryLoader(directory)
  documents = loader.load()
  return documents

documents = load_docs(directory)
len(documents)
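
If you want to sanity-check what was loaded, each LangChain Document exposes its text and source metadata, for example:

# peek at the first loaded document and where it came from
print(documents[0].page_content[:200])
print(documents[0].metadata)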

6. Splitting documents

Now, we need to split the documents into smaller chunks for processing. We will use the RecursiveCharacterTextSplitter from LangChain, which by default tries to split on the characters ["\n\n", "\n", " ", ""].

def split_docs(documents, chunk_size=1000, chunk_overlap=20):
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
  docs = text_splitter.split_documents(documents)
  return docs

docs = split_docs(documents)
print(len(docs))
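
The chunk size and overlap are tunable; a smaller chunk size produces more (and shorter) chunks. For example, you can compare the chunk counts (the values here are just for illustration):

# splitting the same documents with a smaller chunk size yields more chunks
smaller_chunks = split_docs(documents, chunk_size=500, chunk_overlap=20)
print(len(smaller_chunks))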

7. Embedding documents with OpenAI

Once the documents are split, we need to embed them using OpenAI's embedding model. First, we need to install the tiktoken library, which is used for token counting.

!pip install tiktoken -q

Now, we can use the OpenAIEmbeddings class from LangChain to embed the documents.

embeddings = OpenAIEmbeddings(model_name="ada")

query_result = embeddings.embed_query("Hello world")
len(query_result)
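
The printed length is the dimensionality of the embedding vector, which the Pinecone index will need to match later. The same object can also embed a batch of texts, which is what happens under the hood when the documents are indexed; a quick sketch:

# embed several texts at once; each result is a vector of the same dimension
doc_vectors = embeddings.embed_documents(["first chunk of text", "second chunk of text"])
print(len(doc_vectors), len(doc_vectors[0]))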

8. Vector search with Pinecone

Next, we will use Pinecone to create an index for our documents. First, we need to install the pinecone-client.

!pip install pinecone-client -q

Then, we can initialize Pinecone and create a Pinecone index.

pinecone.init(
    api_key="pinecone api key",
    environment="env"
)

index_name = "langchain-demo"

index = Pinecone.from_documents(docs, embeddings, index_name=index_name)

We are creating a new Pinecone vector index using the Pinecone.from_documents() method. This method takes three arguments:

  1. docs: A list of documents that have been split into smaller chunks using the RecursiveCharacterTextSplitter. These smaller chunks will be indexed in Pinecone to make it easier to search and retrieve relevant documents later on.

  2. embeddings: An instance of the OpenAIEmbeddings class, which is responsible for converting text data into embeddings (i.e., numerical representations) using OpenAI's language model. These embeddings will be stored in the Pinecone index and used for similarity search.

  3. index_name: A string representing the name of the Pinecone index. This name is used to identify the index in Pinecone's database, and it should be unique to avoid conflicts with other indexes.

The Pinecone.from_documents() method processes the input documents, generates embeddings using the provided OpenAIEmbeddings instance, and creates a new Pinecone index with the specified name. The resulting index object can perform similarity searches and retrieve relevant documents based on user queries.
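
Depending on your Pinecone client version and project setup, the index may need to exist before calling from_documents(). Here is one way to create it first, as a sketch; the dimension must match the embedding size from section 7 (1536 for OpenAI's text-embedding-ada-002, but use whatever len(query_result) returned):

# create the index if it does not exist yet; dimension must match the embedding size
if index_name not in pinecone.list_indexes():
    pinecone.create_index(name=index_name, dimension=1536, metric="cosine")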

9. Finding similar documents

Now, we can define a function to find similar documents based on a given query.

def get_similar_docs(query, k=2, score=False):
  if score:
    similar_docs = index.similarity_search_with_score(query, k=k)
  else:
    similar_docs = index.similarity_search(query, k=k)
  return similar_docs
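
You can test retrieval on its own before wiring in the LLM; the query below is just the same example we use later in section 11:

# sanity-check retrieval: print the top matching chunks and their scores
query = "How is India's economy?"
similar_docs = get_similar_docs(query, k=2, score=True)
print(similar_docs)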

10. Question answering using LangChain and OpenAI LLM

With the necessary components in place, we can now create a question-answering system using the OpenAI class from LangChain and a pre-built question-answering chain. The "stuff" chain type simply stuffs all of the retrieved documents into a single prompt that is passed to the LLM.

# model_name = "text-davinci-003"
# model_name = "gpt-3.5-turbo"
model_name = "gpt-4"
llm = OpenAI(model_name=model_name)

chain = load_qa_chain(llm, chain_type="stuff")

def get_answer(query):
  similar_docs = get_similar_docs(query)
  answer = chain.run(input_documents=similar_docs, question=query)
  return answer

11. Example queries and answers

Finally, let's test our question answering system with some example queries.

query = "How is India's economy?"
answer = get_answer(query)
print(answer)

query = "How have relations between India and the US improved?"
answer = get_answer(query)
print(answer)

Conclusion

In this blog post, we demonstrated how to build a document-based question-answering system using LangChain and Pinecone. By leveraging semantic search and large language models, this approach provides a powerful and flexible solution for extracting information from a large corpus of documents. You can further customize this system to suit your specific needs or domain.

Colab notebook: https://github.com/PradipNichite/Youtube-Tutorials/blob/main/Langchain_Semnatic_Serach_Pinecone.ipynb

Call to Action

Now that you have seen how to build a document-based question answering system using LangChain and Pinecone, we encourage you to explore further and try it out for yourself.

  • Watch the YouTube video: If you prefer a visual guide, we have created a video demonstrating the process. This video can help solidify your understanding and provide an alternative learning experience.

  • Read more in-depth articles: For more detailed information and insights on AI-related topics, don't forget to visit the FutureSmart AI Blog. Our blog features in-depth articles that cover various aspects of AI, machine learning, and natural language processing.

  • Check out AI Demos: If you're interested in exploring more AI tools and their applications, head over to AIDemos.com. AIDemos is a directory of video demos showcasing the latest AI tools and technologies. Our goal is to educate and inform users about the possibilities of AI and help them stay updated on the latest advancements.

Happy learning, and enjoy exploring the world of AI!