
CSV

LLMs are great for building question-answering systems over many kinds of data sources. In this section we'll go over how to build Q&A systems over data stored in CSV files. As with SQL databases, the key to working with CSV files is to give the LLM access to tools for querying and interacting with the data. The two main ways to do this are:

  • Recommended: Load the CSV(s) into a SQL database, and use the approaches outlined in the SQL use case docs.
  • Give the LLM access to a Python environment where it can use libraries like Pandas to interact with the data.

⚠️ Security note ⚠️

Both approaches mentioned above carry significant risks. Using SQL requires executing model-generated SQL queries. Using a library like Pandas requires letting the model execute Python code. Since it is easier to tightly scope SQL connection permissions and sanitize SQL queries than it is to sandbox a Python environment, we HIGHLY recommend interacting with CSV data via SQL. For general security best practices, see here.
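As one concrete illustration of scoping connection permissions, a SQLite database can be opened read-only so that any write statement the model generates fails at the connection level. This is a minimal sketch, assuming the titanic.db file created in the SQL section below and SQLAlchemy's SQLite "file:" URI support:

from langchain_community.utilities import SQLDatabase
from sqlalchemy import create_engine

# Open the SQLite file (created in the SQL section below) in read-only mode,
# so that model-generated INSERT/UPDATE/DELETE statements are rejected.
ro_engine = create_engine("sqlite:///file:titanic.db?mode=ro&uri=true")
ro_db = SQLDatabase(engine=ro_engine)

# SELECTs still work; any write attempt raises an error at the driver level.
print(ro_db.run("SELECT COUNT(*) FROM titanic;"))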

Setup

Dependencies for this guide:

%pip install -qU langchain langchain-openai langchain-community langchain-experimental pandas

Set the required environment variables:

import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

# Using LangSmith is recommended but not required. Uncomment below lines to use.
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()

Download the Titanic dataset if you don't already have it:

!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv -O titanic.csv
import pandas as pd

df = pd.read_csv("titanic.csv")
print(df.shape)
print(df.columns.tolist())
(887, 8)
['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']

SQL

Using SQL to interact with CSV data is the recommended approach, because it is easier to limit permissions and sanitize queries than with arbitrary Python.

Most SQL databases make it easy to load a CSV file in as a table (DuckDB, SQLite, etc.). Once you've done this, you can use all of the chain- and agent-creating techniques outlined in the SQL use case guide. Here's a quick example of how we might do this with SQLite:

from langchain_community.utilities import SQLDatabase
from sqlalchemy import create_engine

engine = create_engine("sqlite:///titanic.db")
df.to_sql("titanic", engine, index=False)
887
db = SQLDatabase(engine=engine)
print(db.dialect)
print(db.get_usable_table_names())
db.run("SELECT * FROM titanic WHERE Age < 2;")
sqlite
['titanic']
"[(1, 2, 'Master. Alden Gates Caldwell', 'male', 0.83, 0, 2, 29.0), (0, 3, 'Master. Eino Viljami Panula', 'male', 1.0, 4, 1, 39.6875), (1, 3, 'Miss. Eleanor Ileen Johnson', 'female', 1.0, 1, 1, 11.1333), (1, 2, 'Master. Richard F Becker', 'male', 1.0, 2, 1, 39.0), (1, 1, 'Master. Hudson Trevor Allison', 'male', 0.92, 1, 2, 151.55), (1, 3, 'Miss. Maria Nakid', 'female', 1.0, 0, 2, 15.7417), (0, 3, 'Master. Sidney Leonard Goodwin', 'male', 1.0, 5, 2, 46.9), (1, 3, 'Miss. Helene Barbara Baclini', 'female', 0.75, 2, 1, 19.2583), (1, 3, 'Miss. Eugenie Baclini', 'female', 0.75, 2, 1, 19.2583), (1, 2, 'Master. Viljo Hamalainen', 'male', 0.67, 1, 1, 14.5), (1, 3, 'Master. Bertram Vere Dean', 'male', 1.0, 1, 2, 20.575), (1, 3, 'Master. Assad Alexander Thomas', 'male', 0.42, 0, 1, 8.5167), (1, 2, 'Master. Andre Mallet', 'male', 1.0, 0, 2, 37.0042), (1, 2, 'Master. George Sibley Richards', 'male', 0.83, 1, 1, 18.75)]"

Now let's create a SQL agent to interact with the SQLite database we just created:

from langchain_community.agent_toolkits import create_sql_agent
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
agent_executor = create_sql_agent(llm, db=db, agent_type="openai-tools", verbose=True)
agent_executor.invoke({"input": "what's the average age of survivors"})


> Entering new AgentExecutor chain...

Invoking: `sql_db_list_tables` with `{}`


titanic
Invoking: `sql_db_schema` with `{'table_names': 'titanic'}`



CREATE TABLE titanic (
"Survived" BIGINT,
"Pclass" BIGINT,
"Name" TEXT,
"Sex" TEXT,
"Age" FLOAT,
"Siblings/Spouses Aboard" BIGINT,
"Parents/Children Aboard" BIGINT,
"Fare" FLOAT
)

/*
3 rows from titanic table:
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
0 3 Mr. Owen Harris Braund male 22.0 1 0 7.25
1 1 Mrs. John Bradley (Florence Briggs Thayer) Cumings female 38.0 1 0 71.2833
1 3 Miss. Laina Heikkinen female 26.0 0 0 7.925
*/
Invoking: `sql_db_query` with `{'query': 'SELECT AVG(Age) AS AverageAge FROM titanic WHERE Survived = 1'}`
responded: To find the average age of survivors, I will query the "titanic" table and calculate the average of the "Age" column for the rows where "Survived" is equal to 1.

Here is the SQL query:

```sql
SELECT AVG(Age) AS AverageAge
FROM titanic
WHERE Survived = 1
```

Executing this query will give us the average age of the survivors.

[(28.408391812865496,)]The average age of the survivors is approximately 28.41 years.

> Finished chain.
{'input': "what's the average age of survivors",
'output': 'The average age of the survivors is approximately 28.41 years.'}

This approach easily generalizes to multiple CSVs, since we can just load each of them into our database as its own table. See the SQL guide for more.
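A minimal sketch of loading additional tables, reusing two slices of the Titanic dataframe as stand-ins for extra CSV files (the table names are only illustrative):

# Each additional CSV/dataframe gets its own table in the same database.
df[["Age", "Fare"]].to_sql("ages_fares", engine, index=False, if_exists="replace")
df[["Fare", "Survived"]].to_sql("fares_survival", engine, index=False, if_exists="replace")

db = SQLDatabase(engine=engine)
print(db.get_usable_table_names())  # now lists the new tables alongside 'titanic'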

Pandas

Instead of SQL, we can also use data analysis libraries like pandas and the code-generating abilities of LLMs to interact with CSV data. Again, this approach is not fit for production use cases unless you have extensive safeguards in place. For this reason, our code-execution utilities and constructors live in the langchain-experimental package.

Chain

Most LLMs have been trained on enough pandas Python code that they can generate it just by being asked to:

ai_msg = llm.invoke(
"I have a pandas DataFrame 'df' with columns 'Age' and 'Fare'. Write code to compute the correlation between the two columns. Return Markdown for a Python code snippet and nothing else."
)
print(ai_msg.content)
```python
correlation = df['Age'].corr(df['Fare'])
correlation
```

We can combine this ability with a Python-executing tool to create a simple data analysis chain. We'll first load our CSV table as a dataframe and give the tool access to it:

import pandas as pd
from langchain_core.prompts import ChatPromptTemplate
from langchain_experimental.tools import PythonAstREPLTool

df = pd.read_csv("titanic.csv")
tool = PythonAstREPLTool(locals={"df": df})
tool.invoke("df['Fare'].mean()")
32.30542018038331

To help enforce proper use of our Python tool, we'll use function calling:

llm_with_tools = llm.bind_tools([tool], tool_choice=tool.name)
llm_with_tools.invoke(
"I have a dataframe 'df' and want to know the correlation between the 'Age' and 'Fare' columns"
)
AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_6TZsNaCqOcbP7lqWudosQTd6', 'function': {'arguments': '{\n  "query": "df[[\'Age\', \'Fare\']].corr()"\n}', 'name': 'python_repl_ast'}, 'type': 'function'}]})

We'll add an OpenAI tools output parser to extract the function call as a dict:

from langchain.output_parsers.openai_tools import JsonOutputKeyToolsParser

parser = JsonOutputKeyToolsParser(tool.name, first_tool_only=True)
(llm_with_tools | parser).invoke(
"I have a dataframe 'df' and want to know the correlation between the 'Age' and 'Fare' columns"
)
{'query': "df[['Age', 'Fare']].corr()"}

And combine this with a prompt, so that we can just specify a question without needing to pass in the dataframe info on every invocation:

system = f"""You have access to a pandas dataframe `df`. \
Here is the output of `df.head().to_markdown()`:

```
{df.head().to_markdown()}
```

Given a user question, write the Python code to answer it. \
Return ONLY the valid Python code and nothing else. \
Don't assume you have access to any libraries other than built-in Python ones and pandas."""
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])
code_chain = prompt | llm_with_tools | parser
code_chain.invoke({"question": "What's the correlation between age and fare"})
{'query': "df[['Age', 'Fare']].corr()"}

And lastly we'll add our Python tool so that the generated code is actually executed:

chain = prompt | llm_with_tools | parser | tool  # noqa
chain.invoke({"question": "What's the correlation between age and fare"})
0.11232863699941621

And just like that we have a simple data analysis chain. We can take a peek at the intermediate steps by looking at the LangSmith trace.

We could add an additional LLM call at the end to generate a conversational response, so that we're not just responding with the raw tool output. For this we'll want to add a chat history MessagesPlaceholder to our prompt:

from operator import itemgetter

from langchain_core.messages import ToolMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough

system = f"""You have access to a pandas dataframe `df`. \
Here is the output of `df.head().to_markdown()`:

```
{df.head().to_markdown()}
```

Given a user question, write the Python code to answer it. \
Don't assume you have access to any libraries other than built-in Python ones and pandas.
Respond directly to the question once you have enough information to answer it."""
prompt = ChatPromptTemplate.from_messages(
[
(
"system",
system,
),
("human", "{question}"),
# This MessagesPlaceholder allows us to optionally append an arbitrary number of messages
# at the end of the prompt using the 'chat_history' arg.
MessagesPlaceholder("chat_history", optional=True),
]
)


def _get_chat_history(x: dict) -> list:
"""Parse the chain output up to this point into a list of chat history messages to insert in the prompt."""
ai_msg = x["ai_msg"]
tool_call_id = x["ai_msg"].additional_kwargs["tool_calls"][0]["id"]
tool_msg = ToolMessage(tool_call_id=tool_call_id, content=str(x["tool_output"]))
return [ai_msg, tool_msg]


chain = (
RunnablePassthrough.assign(ai_msg=prompt | llm_with_tools)
.assign(tool_output=itemgetter("ai_msg") | parser | tool)
.assign(chat_history=_get_chat_history)
.assign(response=prompt | llm | StrOutputParser())
.pick(["tool_output", "response"])
)
chain.invoke({"question": "What's the correlation between age and fare"})
{'tool_output': 0.11232863699941621,
'response': 'The correlation between age and fare is approximately 0.112.'}

Here's the LangSmith trace for this run: https://smith.langchain.com/public/ca689f8a-5655-4224-8bcf-982080744462/r

Agent

For complex questions, it can be helpful for an LLM to iteratively execute code while keeping the inputs and outputs of its previous executions. This is where Agents come into play. They allow the LLM to decide how many times a tool needs to be invoked and to keep track of the executions it has made so far. create_pandas_dataframe_agent is a built-in agent that makes it easy to work with dataframes:

from langchain_experimental.agents import create_pandas_dataframe_agent

agent = create_pandas_dataframe_agent(llm, df, agent_type="openai-tools", verbose=True)
agent.invoke(
{
"input": "What's the correlation between age and fare? is that greater than the correlation between fare and survival?"
}
)


> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': "df[['Age', 'Fare']].corr()"}`


Age Fare
Age 1.000000 0.112329
Fare 0.112329 1.000000
Invoking: `python_repl_ast` with `{'query': "df[['Fare', 'Survived']].corr()"}`


Fare Survived
Fare 1.000000 0.256179
Survived 0.256179 1.000000The correlation between age and fare is 0.112329, while the correlation between fare and survival is 0.256179. Therefore, the correlation between fare and survival is greater than the correlation between age and fare.

> Finished chain.
{'input': "What's the correlation between age and fare? is that greater than the correlation between fare and survival?",
'output': 'The correlation between age and fare is 0.112329, while the correlation between fare and survival is 0.256179. Therefore, the correlation between fare and survival is greater than the correlation between age and fare.'}

Here's the LangSmith trace for this run: https://smith.langchain.com/public/8e6c23cc-782c-4203-bac6-2a28c770c9f0/r

Multiple CSVs

To handle multiple CSVs (or dataframes) we just need to pass multiple dataframes to our Python tool. Our create_pandas_dataframe_agent constructor can do this out of the box: we can pass in a list of dataframes instead of just one. If we're constructing a chain ourselves, we can do something like:

df_1 = df[["Age", "Fare"]]
df_2 = df[["Fare", "Survived"]]

tool = PythonAstREPLTool(locals={"df_1": df_1, "df_2": df_2})
llm_with_tool = llm.bind_tools(tools=[tool], tool_choice=tool.name)
df_template = """```python
{df_name}.head().to_markdown()
>>> {df_head}
```"""
df_context = "\n\n".join(
df_template.format(df_head=_df.head().to_markdown(), df_name=df_name)
for _df, df_name in [(df_1, "df_1"), (df_2, "df_2")]
)

system = f"""You have access to a number of pandas dataframes. \
Here is a sample of rows from each dataframe and the python code that was used to generate the sample:

{df_context}

Given a user question about the dataframes, write the Python code to answer it. \
Don't assume you have access to any libraries other than built-in Python ones and pandas. \
Make sure to refer only to the variables mentioned above."""
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])

chain = prompt | llm_with_tool | parser | tool
chain.invoke(
{
"question": "return the difference in the correlation between age and fare and the correlation between fare and survival"
}
)
-0.14384991262954416

Here's the LangSmith trace for this run: https://smith.langchain.com/public/653e499f-179c-4757-8041-f5e2a5f11fcc/r

Sandboxed code execution

There are a number of tools like E2B and Bearly that provide sandboxed environments for Python code execution, to allow for safer code-executing chains and agents.
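Swapping one of these in doesn't change the shape of the chains above; only the tool does. Below is a minimal, hypothetical sketch of that idea: sandboxed_python and run_in_sandbox are placeholders for whichever sandbox integration you choose, not a real LangChain API:

from langchain_core.tools import tool


def run_in_sandbox(code: str) -> str:
    # Placeholder: send `code` to your sandbox service (e.g. E2B or Bearly)
    # and return its output.
    raise NotImplementedError("Wire this up to your sandbox backend.")


@tool
def sandboxed_python(query: str) -> str:
    """Execute Python code in a sandboxed environment and return the result."""
    return run_in_sandbox(query)


# The rest of the chain is unchanged; only the Python tool is swapped out.
safe_chain = (
    prompt
    | llm.bind_tools([sandboxed_python], tool_choice=sandboxed_python.name)
    | JsonOutputKeyToolsParser(sandboxed_python.name, first_tool_only=True)
    | sandboxed_python
)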

Next steps

For more advanced data analysis applications we recommend checking out:

  • SQL use case: Many of the challenges of working with SQL databases and CSVs are generic to any structured data type, so the SQL techniques are useful to read even if you're using Pandas for CSV data analysis.
  • Tool use: Guides on general best practices when working with chains and agents that invoke tools.
  • Agents: Understand the fundamentals of building LLM agents.
  • Integrations: Sandboxed environments like E2B and Bearly, utilities like SQLDatabase, and related agents like the Spark DataFrame agent.
