Summarizing Documents Made Easy With LangChain Summarizer

Introduction

In today's post, we will delve into the fascinating world of natural language processing and learn how we can harness LangChain's capabilities to create a document summarizer that is not just accurate but efficient too.

We will be exploring three different summarization techniques, each implemented using LangChain's unique chain types: stuff, map_reduce, and refine. This post will guide you through the process of using LangChain to summarize a list of documents, breaking down the steps involved in each technique.

Whether you are a seasoned developer or just starting with natural language processing, this post is a solid starting point for exploring document summarization with LangChain. So, let's get started and see how LangChain can help us build an effective document summarizer!

LangChain

LangChain, a cutting-edge framework, provides a seamless interface for creating advanced language model-based applications. It features data awareness and agentic behavior, enabling the model to interact with its environment and make dynamic decisions based on user input.

With LangChain's data-awareness feature, the language model can connect to external data sources and analyze data drawn from them. LangChain provides several core modules, such as Models, Prompts, and Indexes, and these modules can be combined in a variety of ways for use cases like summarization, evaluation, and more.

LangChain Chains

Language models have been revolutionizing natural language processing, enabling computers to understand and even generate human-like language. While a single large language model (LLM) may suffice for simpler applications, the power of these models truly shines when they are combined in a chain. But how can we chain these models together seamlessly and effectively? That's where LangChain comes in.

LangChain provides a standard interface for chaining LLMs, allowing users to easily combine multiple models to achieve more complex tasks. Whether it's using multiple LLMs in sequence or integrating them with other expert systems, LangChain streamlines the process of building and utilizing these chains.

With chains, we can bring together various building blocks to form a powerful sequence of actions. Just picture this: a chain that takes user input, formats it using a PromptTemplate, and then seamlessly passes the formatted prompt to a language model, as sketched below. But it doesn't stop there. We can create even more intricate pipelines by combining multiple chains or by integrating chains with other components. The possibilities are endless, and with chains, we have the key to unlocking the full potential of our applications.
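Here is a minimal sketch of that exact pattern, a PromptTemplate feeding an OpenAI model through an LLMChain (the prompt wording and temperature are illustrative, not from the original post):

from langchain import OpenAI, PromptTemplate, LLMChain

# Illustrative prompt; {product} is filled in from the user input
prompt = PromptTemplate(
    input_variables=["product"],
    template="What is a good name for a company that makes {product}?",
)

# The chain formats the input with the template, then calls the LLM
chain = LLMChain(llm=OpenAI(temperature=0.9), prompt=prompt)
print(chain.run("colorful socks"))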

Benefits of LangChain as a Summarizer Tool

When it comes to summarizing large or multiple documents using natural language processing (NLP), the sheer volume of data can be overwhelming, leading to slow processing times and even memory issues; handling it naively may require investing in high-performance computing infrastructure. With LangChain, however, we can break large documents into smaller chunks and process them either in parallel or serially, depending on the type of chain we use, without running into the model's maximum token limit.

Another challenge is combining information from multiple documents into a single summary, because the documents may use different terminology, contain conflicting information, or cover different aspects of the topic. LangChain can address this because some of its chains carry context from previously processed documents into the current one, creating a chain of documents that preserves the important context during summarization and keeps the sentences of the summarized content in a coherent order.

Document Summarizer App

Initialize OpenAI Key

import os

# Set the API key so LangChain's OpenAI wrapper can authenticate
os.environ["OPENAI_API_KEY"] = "Your openai key"

Summarization Chain Setup

from langchain import OpenAI, PromptTemplate
from langchain.text_splitter import CharacterTextSplitter

# temperature=0 keeps the generated summaries deterministic
llm = OpenAI(temperature=0)

Upload Document And Split It Into Chunks

from langchain.docstore.document import Document

text_splitter = CharacterTextSplitter()

with open("../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

# Split the text, keep the first four chunks, and wrap each chunk in a
# Document object, which is what the summarize chains expect
texts = text_splitter.split_text(state_of_the_union)
docs = [Document(page_content=t) for t in texts[:4]]

We load the document, split it into smaller chunks using the CharacterTextSplitter() method, wrap each chunk in a Document object, and store the resulting list in the docs variable.

It is important to chunk the document because processing large documents as a single unit can be computationally expensive and time-consuming.
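If the default settings don't suit your documents, CharacterTextSplitter can be tuned; here is a minimal sketch, where the separator, chunk_size, and chunk_overlap values are illustrative:

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",   # split on paragraph boundaries
    chunk_size=1000,    # maximum characters per chunk
    chunk_overlap=200,  # overlap preserves context across chunk boundaries
)
texts = text_splitter.split_text(state_of_the_union)

A larger chunk_overlap reduces the chance that a sentence's context is lost at a chunk boundary, at the cost of some redundant processing.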

Summarization With 'map_reduce' Chain

When it comes to document processing, breaking a large document into smaller, more manageable chunks is essential. But how do you combine those chunks into a comprehensive summary or answer? That's where LangChain's MapReduceDocumentsChain comes in.

This powerful tool uses an initial prompt on each chunk of data to generate a summary or answer based solely on that section of the document.

But that's not all - the MapReduceDocumentsChain takes things a step further by running a different prompt to combine all the initial outputs, creating a comprehensive and coherent summary or answer for the entire document. And with its implementation in LangChain, this method can handle even the largest and most complex documents with ease.

from langchain.chains.summarize import load_summarize_chain
import textwrap

# map_reduce: summarize each chunk independently, then combine the results
chain = load_summarize_chain(llm,
                             chain_type="map_reduce",
                             verbose=True)

output_summary = chain.run(docs)
wrapped_text = textwrap.fill(output_summary, width=100)
print(wrapped_text)

We create the chain with load_summarize_chain, passing three arguments:

- llm - the large language model of our choice, which will process the user input
- chain_type - the type of LangChain chain to use for summarizing the docs
- verbose - a boolean argument; if set to True, it shows all the intermediate steps between processing the user request and generating the output (for programmatic access to those steps, see the sketch below)
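If you want the per-chunk summaries as data rather than as verbose logs, the summarize chains can also return intermediate steps; a minimal sketch, assuming the docs list from above:

# return_intermediate_steps exposes each chunk's summary alongside the
# final combined summary
chain = load_summarize_chain(llm,
                             chain_type="map_reduce",
                             return_intermediate_steps=True)

result = chain({"input_documents": docs}, return_only_outputs=True)
print(result["intermediate_steps"])  # one summary per chunk
print(result["output_text"])         # the combined final summary

The summarized output shown next comes from the verbose chain.run call above.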

Summarized Output

TCS has a purpose-driven approach to business and values that have enabled it to cope with
industry-wide supply side challenges. It has invested in its people, providing hospitalization
support and a massive pan-India vaccination drive, resulting in a strong employer brand. The
workforce has grown to over half a million and is highly diverse, with employees logging 60.3
million learning hours and acquiring 3.5 million digital competencies. TCS has also made progress in
improving gender diversity in the senior management ranks and has been working with communities to
provide health, STEM education, skills development, and digital divides. It has also made a
financial contribution to international humanitarian organizations and reduced its absolute carbon
footprint by 66% over base year 2016.

Pros

This can scale to larger documents (and more documents) than StuffDocumentsChain. The calls to the LLM on individual documents are independent and can therefore be parallelized.

Cons

Requires many more calls to the LLM than StuffDocumentsChain. Loses some information during the final combining call.
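One more option worth knowing: load_summarize_chain lets you supply separate prompts for the map (per-chunk) and combine steps of map_reduce. A minimal sketch, with illustrative prompt wording:

# Illustrative prompts; map_prompt runs on each chunk, combine_prompt on
# the collected chunk summaries
map_prompt = PromptTemplate(
    input_variables=["text"],
    template="Summarize this section in two sentences:\n\n{text}",
)
combine_prompt = PromptTemplate(
    input_variables=["text"],
    template="Combine these section summaries into one short paragraph:\n\n{text}",
)

chain = load_summarize_chain(llm,
                             chain_type="map_reduce",
                             map_prompt=map_prompt,
                             combine_prompt=combine_prompt)
print(chain.run(docs))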

Summarization With 'stuff' Chain

Stuffing is the simplest method, whereby you simply stuff all the related data into the prompt as context to pass to the language model. This is implemented in LangChain as the StuffDocumentsChain.

The main downside of this method is that it only works on smaller pieces of data. Once you are working with many pieces of data, this approach is no longer feasible; chains like map_reduce and refine are designed to deal with that. A quick way to check whether your data still fits is sketched below.
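Before reaching for the stuff chain, it can help to estimate the prompt size. Most LangChain LLM wrappers expose a get_num_tokens() helper; a rough sketch, assuming the docs list from above:

# Estimate the total tokens the stuffed prompt would contain
total_tokens = sum(llm.get_num_tokens(doc.page_content) for doc in docs)
print(f"Total tokens across chunks: {total_tokens}")
# If this approaches the model's context window (roughly 4k tokens for
# older OpenAI completion models), prefer map_reduce or refine instead

The exact context limit depends on the model, so treat the 4k figure as an assumption rather than a rule.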

Creating Custom Prompt Template

prompt_template = """Write a concise bullet point summary of the following:
{text}

CONCISE SUMMARY IN BULLET POINTS:"""

BULLET_POINT_PROMPT = PromptTemplate(template=prompt_template,
                                     input_variables=["text"])

Generating Summarized Output

# stuff: place all chunks into a single prompt and summarize in one LLM call
chain = load_summarize_chain(llm,
                             chain_type="stuff",
                             prompt=BULLET_POINT_PROMPT)

output_summary = chain.run(docs)

wrapped_text = textwrap.fill(output_summary,
                             width=100,
                             break_long_words=False,
                             replace_whitespace=False)
print(wrapped_text)

Summarized Output

- Purpose-driven approach to business and values have shaped TCS' culture and work environment
- Investing in people and giving them opportunities to realize their potential
- Decentralized decision-making, empowering leaders on the front lines, and providing support
- Treating the organization as an extended family and standing by each member in their hour of need
- Strong employer brand validated by third-party assessments and accolades
- Workforce crossed the half-million mark in the first half of the year
- Highly diverse workforce with over 153 nationalities represented
- Women in the workforce exceeding 200,000
- Gender diversity in senior management ranks improved
- Organic talent development focus area
- Lowest attrition in the industry
- Community and planet programs reaching 1.7 million beneficiaries
- Net zero carbon footprint by 2030

Pros

Only makes a single call to the LLM. When generating text, the LLM has access to all the data at once.

Cons

Most LLMs have a context length, and for large documents (or many documents) this will not work as it will result in a prompt larger than the context length.

Summarization With 'refine' Chain

This method involves an initial prompt on the first chunk of data, generating some output. For the remaining documents, that output is passed in, along with the next document, asking the LLM to refine the output based on the new document.

# refine: summarize the first chunk, then iteratively refine that summary
# with each subsequent chunk
chain = load_summarize_chain(llm, chain_type="refine")

output_summary = chain.run(docs)
wrapped_text = textwrap.fill(output_summary, width=100)
print(wrapped_text)

Summarized Output

TCS has a purpose-driven approach to business and values that have shaped its culture and work
environment. It invests in its people and provides them with opportunities to reach their full
potential. During the pandemic, TCS provided hospitalization support and a massive pan-India
vaccination drive. This has resulted in a strong employer brand and helped the company cope with
industry-wide supply side challenges. The workforce has grown to over half a million people, with
153 nationalities represented and over 200,000 women. TCS has also been involved in community and
planet initiatives, helping over 1.7 million beneficiaries and making a financial contribution of 1
million Euros to international humanitarian organizations. The company has also made progress in
becoming net zero by 2030.

Pros

Can pull in more relevant context, and may be less lossy than MapReduceDocumentsChain.

Cons

Requires many more calls to the LLM than StuffDocumentsChain. The calls are also NOT independent, meaning they cannot be parallelized the way MapReduceDocumentsChain's can. There are also some potential dependencies on the ordering of the documents. If you need more control over the refine step, see the sketch below.
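For more control over how each chunk updates the running summary, load_summarize_chain also accepts a custom refine_prompt for the refine chain. A minimal sketch, with illustrative prompt wording (the question_prompt used for the first chunk keeps its default here):

# Illustrative refine prompt; {existing_answer} is the summary so far and
# {text} is the next chunk
refine_template = (
    "Your job is to produce a final summary.\n"
    "We have an existing summary up to a certain point: {existing_answer}\n"
    "Refine it (only if needed) with the additional context below.\n"
    "------------\n"
    "{text}\n"
    "------------\n"
)
refine_prompt = PromptTemplate(input_variables=["existing_answer", "text"],
                               template=refine_template)

chain = load_summarize_chain(llm, chain_type="refine", refine_prompt=refine_prompt)
print(chain.run(docs))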

Conclusion

In conclusion, LangChain makes it quite easy to create state-of-the-art AI applications by integrating its custom methods and agents. We learned how to build a summarizer by leveraging the power of LangChain chains and OpenAI's language models. We also saw how LangChain provides a better approach to summarizing large documents than conventional processes.

If you are interested in learning more about LangChain's integration with SQL, be sure to check out the YouTube tutorial below.

Building a Document-based Question Answering System with LangChain, Pinecone, and LLMs like GPT-4.

To learn about more interesting and cool applications of LLMs, check out our other blogs and YouTube channel.

Also, want to keep up with the state of the art in AI? Don't forget to subscribe to AI Demos, a place to learn about the latest and most cutting-edge tools in AI!