Langchain's Pandas & CSV Agents: Revolutionizing Data Querying using OpenAI LLMs
In this blog, we will explore Langchain's Pandas Agent and CSV Agent, explaining how they work and their key features. We will also compare the agents to traditional query languages, provide real-world examples of how they have been used to improve the querying of tabular data and discuss potential future developments and applications. This blog is aimed at data scientists, developers, or business analysts who want to learn more about using Langchain's agents to query tabular data efficiently.
Introduction:
Langchain is a framework designed to develop applications powered by language models, with a focus on being data-aware and agentic. It is particularly useful for handling large datasets and providing efficient querying capabilities.
Features:
Large Language Models (LLMs): LangChain provides different types of models for natural language processing, including LLMs, which can be used to process unstructured text data and retrieve information based on user queries.
PromptTemplates: This feature enables developers to construct input prompts for their models using multiple components. In the case of queries, developers can use PromptTemplates to construct prompts for user queries, which can then be passed to LLMs for processing.
Indexes and Memory Components: LangChain's indexing feature structures documents so that LLMs can efficiently retrieve information based on user queries, while memory components enable applications such as chatbots to remember previous interactions and provide personalized responses to users, including for similar queries.
Chains: Chains in LangChain are a sequence of models that are linked together to accomplish a specific goal. An instance of a chain for a chatbot application could involve utilizing a language model to comprehend user inputs, a memory component to store past interactions, and a decision-making component to create relevant responses.
Agents: Agents in LangChain interact with user inputs and process them using different models. Agents determine which actions to take and in what order. For example, the CSV Agent can be used to load data from CSV files and perform queries, while the Pandas Agent can be used to load data from Pandas data frames and process user queries. Agents can be chained together to build more complex applications.
Pandas and CSV agents:
LangChain's Pandas Agent is a tool used to process large datasets by loading data from Pandas data frames and performing advanced querying operations. Its key features include the ability to group and aggregate data, filter data based on complex conditions, and join multiple data frames. This agent is ideal for developers who work with large datasets and require advanced querying capabilities.
The CSV Agent in LangChain is another tool used for querying structured data. It loads data from CSV files and supports basic querying operations like selecting and filtering columns, sorting data, and querying based on a single condition. It is an efficient option for developers who require a simple, yet powerful querying tool for structured data.
Both of these allow developers to create more complex and efficient applications that utilize different data sources and processing steps.
Pandas Agent:
!pip install langchain
import os
os.environ["OPENAI_API_KEY"] = ""
The first line is a command to install a langchain
using pip
, which is a package installer for Python.
The second and third lines of code are Python statements that set an environment variable OPENAI_API_KEY
to a string value that represents an OpenAI API key. The OpenAI API key is used for authenticating API requests to the OpenAI language models.
from langchain.agents import create_pandas_dataframe_agent
from langchain.llms import OpenAI
import pandas as pd
This code imports modules from the langchain package and pandas package. It uses the create_pandas_dataframe_agent
function from langchain to create a data agent that can be used to convert data between different formats. It also imports the OpenAI language model from langchain.llms
. The pandas package is used to create a pandas DataFrame.
df = pd.read_csv('sales_data.csv')
df[0:10]
Now, we read a CSV file named sales_data.csv
and store the data in a pandas dataframe. Then, we'll print the first 10 rows of the dataframe.
I have only utilized a small amount of the actual sales_data
available here.
pd_agent = create_pandas_dataframe_agent(OpenAI(temperature=0), df, verbose=True)
The above code creates a LangChain agent for processing natural language using OpenAI's language model with a temperature of 0, and then creates a Pandas DataFrame agent from the provided CSV file sales_data.csv
.
pd_agent.run("Name of the city of first two sales?")
> Entering new AgentExecutor chain...
Thought: I need to find the city of the first two sales
Action: python_repl_ast
Action Input: df.head(2)['CITY']
Observation: 0 NYC
1 Reims
Name: CITY, dtype: object
Thought: I now know the final answer
Final Answer: NYC and Reims
> Finished chain.
NYC and Reims
The pd_agent
is a LangChain agent created using the create_pandas_dataframe_agent
function, which takes a language model as input and a Pandas dataframe containing the data. When the run
method is called on the pd_agent
with the input text "Name of the city of first two sales?", it goes through a sequence of steps to generate an answer.
Initially, the agent identifies the task, then selects the appropriate action to retrieve the required information from the dataframe. After that, it observes the output and combines the observations, and generates the final answer.
pd_agent.run("Find the total sales for each product line in the year 2003")
> Entering new AgentExecutor chain...
Thought: I need to group the data by product line and year
Action: python_repl_ast
Action Input: df.groupby(['PRODUCTLINE', 'YEAR_ID'])['SALES'].sum()
Observation: PRODUCTLINE YEAR_ID
Motorcycles 2003 32131.60
2004 51084.53
2005 3940.23
Name: SALES, dtype: float64
Thought: I now know the total sales for each product line in the year 2003
Final Answer: 32131.60
> Finished chain.
32131.60
Similarly, here agent identifies the task as finding the total sales for each product line in the year 2003. It then selects the appropriate action, which is to execute a Python REPL command to group the data by product line and year, and sum the sales for each group. It observes the output of the action, which is the total sales for each product line in the year 2003. Finally, the agent combines the observations to generate the final answer 32131.60
and returns it.
CSV Agent:
The CSV agent functions similarly to the Pandas DataFrame agent by utilizing the Python agent to execute code. However, the CSV agent specifically relies on the Pandas DataFrame agent for its interaction with CSV files.
from langchain.agents import create_csv_agent
csv_agent = create_csv_agent(OpenAI(temperature=0), 'sales_data.csv', verbose=True)
The code is importing the create_csv_agent
function from the langchain.agents
module, which is used to create an agent that can interact with a CSV file. The OpenAI object is passed as an argument to the function to provide the agent with access to the OpenAI language model. The CSV file named sales_data.csv
is also passed as an argument to the function.
pd_agent.run("Sales value of two of items?")
> Entering new AgentExecutor chain...
Thought: I need to find the sales value of two items
Action: python_repl_ast
Action Input: df.loc[[0,1], 'SALES'].sum()
Observation: 5636.9
Thought: I now know the final answer
Final Answer: 5636.9
> Finished chain.
5636.9
The agent receives a task to find the sales value of two items. It selects the appropriate action, which is to execute a Python REPL command to retrieve the sales values of the first two rows from the SALES
column of the dataframe and sum them up. It observes the output of the action, which is the sales value of 5636.9
. Finally, the agent combines the observations to generate the final answer of 5636.9
and returns it.
Comparison of pandas and CSV agents with traditional query languages
CSV and Pandas agents provide an alternative approach to querying data that differs from traditional query languages. Rather than writing code or SQL queries to extract and manipulate data, these agents allow users to pose natural language questions and receive answers in a conversational style.
This approach can be more intuitive for users who are not familiar with coding or SQL. Additionally, it can be more efficient for users who need to quickly retrieve information without spending time writing complex queries.
However, there are some limitations to using agents for data querying. The natural language approach can sometimes be imprecise, leading to inaccurate results. Additionally, agents may not be able to handle more complex queries that require multiple operations or logic.
For advanced users who require more control over data extraction and manipulation, traditional query language like SQL may be a better option. It provides a greater level of flexibility and customization, allowing users to write more complex queries and perform more advanced data analysis.
Future developments and applications
The future potential of pandas and CSV agents is extensive and diverse. As AI and NLP technologies continue to evolve, these agents can become even more robust tools for data analysis and manipulation.
One potential area for development is the integration of these agents with other APIs and technologies. For example, by leveraging APIs such as SERPAPI or web scraping tools, these agents could be used to collect data from a variety of sources and provide even more precise and insightful results.
Likewise, these agents can be used in chatbots and other conversational interfaces. By integrating a pandas or CSV agent into a chatbot, businesses could provide real-time insights and recommendations to their customers based on their data. For example, a bank could use a chatbot equipped with a pandas agent to help customers understand their spending patterns and make more informed decisions about their finances.
Also, these agents could be used in the development of automated data processing systems. By using a CSV agent to import data from various sources and a pandas agent to analyze and manipulate the data, businesses could streamline their data processing workflows. This could be particularly useful for businesses with large amounts of data that need to be analyzed and processed quickly and efficiently.
Conclusion
Hence, LangChain's Pandas Agent and CSV Agent are powerful tools that provide data scientists, and business analysts with efficient and advanced querying capabilities. The ability to chain different agents together to build more complex applications opens up a wide range of possibilities for data-driven applications. With further advancements in the LangChain framework, we can anticipate even more innovative and exciting developments in this field.
You can check out this video for more information on LangChain, SQL Agents & OpenAI LLMs: Query Database Using Natural Language.
Also, AIDemos.com is an incredible resource for anyone looking to explore the potential of AI.