LLMs for SQL: A Translation and Commentary on "Text2SQL is Not Enough: Unifying AI and Databases with TAG"

一个处女座的程序猿 · 2024-10-18 13:31:01

Overview: On August 27, 2024, teams from UC Berkeley and Stanford released the TAG framework. The TAG model offers a new paradigm for answering natural language queries over databases: it combines the reasoning capabilities of language models with the computational power of database systems, can handle a broader and more complex range of queries, and gives users much stronger data-analysis capabilities.

>> Background and Pain Points

● User needs: Users' questions often require a sophisticated combination of domain knowledge, world knowledge, exact computation, and semantic reasoning. Current methods cannot effectively handle questions that call for semantic reasoning or world knowledge.

● Limitations of existing methods: Existing Text2SQL and RAG methods cannot meet users' complex needs when querying databases, for example queries that require semantic reasoning or world knowledge. Text2SQL covers only natural language queries expressible in relational algebra, a narrow subset. RAG, in turn, handles only simple point lookups over a few data records and fails to exploit the powerful computational capabilities of database systems.

>> Solution

● Proposes the Table-Augmented Generation (TAG) model, a unified, general-purpose paradigm for answering natural language questions over databases that integrates the interaction between natural language queries and the database into a single framework.

>> Core steps: the TAG model consists of three steps: query synthesis, query execution, and answer generation (a minimal code sketch follows the list below).

● Query Synthesis: translate the user's natural language query into an executable database query (SQL).

● Query Execution: execute the generated query on the database system to obtain the relevant data.

● Answer Generation: use the natural language query together with the retrieved data to produce the final natural language answer via the language model.
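
To make the three stages concrete, here is a minimal Python sketch of the pipeline. This is an illustration, not the paper's code: the call_lm helper stands in for any chat-completion client, and the sqlite3 connection for whatever DBMS an implementation actually uses.

```python
import sqlite3

def call_lm(prompt: str) -> str:
    """Hypothetical LM call; wire this to any chat-completion client."""
    raise NotImplementedError

def syn(request: str, schema: str) -> str:
    # Query synthesis: translate the natural language request R into
    # an executable database query Q (here, SQL over the given schema).
    return call_lm(f"Schema:\n{schema}\nWrite one SQL query answering: {request}")

def exec_query(conn: sqlite3.Connection, query: str) -> list:
    # Query execution: run Q on the database system, which handles the
    # exact computation (joins, filters, aggregation) at scale.
    return conn.execute(query).fetchall()

def gen(request: str, table: list) -> str:
    # Answer generation: the LM reasons over the request R and the
    # retrieved rows T to produce the natural language answer A.
    return call_lm(f"Question: {request}\nData: {table}\nAnswer concisely.")

def tag(conn: sqlite3.Connection, request: str, schema: str) -> str:
    query = syn(request, schema)      # syn(R) -> Q
    table = exec_query(conn, query)   # exec(Q) -> T
    return gen(request, table)        # gen(R, T) -> A
```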

>> Advantages

● Unification and breadth: TAG unifies Text2SQL and RAG and can serve a far wider range of user questions, handling all kinds of natural language queries, including those that require semantic reasoning or world knowledge.

● Efficiency: TAG uses the database's computational power to process data and the language model for reasoning and generation. Combining the two markedly improves answer accuracy, by 20% to 65% over the traditional baseline methods.

● Flexibility: TAG accommodates a variety of database systems and language models and can be customized to specific needs.

● Research potential: TAG opens new research opportunities for further exploiting the world knowledge and reasoning abilities of language models to enhance data querying.

>> Experimental Results

● The researchers built the TAG benchmark, containing queries that require world knowledge and semantic reasoning.

● The experiments show that traditional Text2SQL and RAG methods perform poorly on these queries, while hand-written TAG pipelines achieve significant performance gains.

By combining the exact computation of databases with the semantic reasoning of language models, TAG offers a more powerful and comprehensive way to handle complex natural language queries. The model both addresses the limitations of existing methods and opens new directions for future research and applications.

Table of Contents

Translation and Commentary on "Text2SQL is Not Enough: Unifying AI and Databases with TAG"

Abstract

1 Introduction

Figure 1: An example TAG implementation for answering the user’s natural language question over a table about movies. The TAG pipeline proceeds in three stages: query synthesis, query execution, and answer generation.

4 EVALUATION

4.1 Benchmark Methodology

Dataset

Queries

Evaluation metrics

Experimental setup

6 CONCLUSION


Translation and Commentary on "Text2SQL is Not Enough: Unifying AI and Databases with TAG"

Paper: https://arxiv.org/abs/2408.14717

Date: August 27, 2024

Authors: UC Berkeley and Stanford University

Abstract

AI systems that serve natural language questions over databases promise to unlock tremendous value. Such systems would allow users to leverage the powerful reasoning and knowledge capabilities of language models (LMs) alongside the scalable computational power of data management systems. These combined capabilities would empower users to ask arbitrary natural language questions over custom data sources. However, existing methods and benchmarks insufficiently explore this setting. Text2SQL methods focus solely on natural language questions that can be expressed in relational algebra, representing a small subset of the questions real users wish to ask. Likewise, Retrieval-Augmented Generation (RAG) considers the limited subset of queries that can be answered with point lookups to one or a few data records within the database. We propose Table-Augmented Generation (TAG), a unified and general-purpose paradigm for answering natural language questions over databases. The TAG model represents a wide range of interactions between the LM and database that have been previously unexplored and creates exciting research opportunities for leveraging the world knowledge and reasoning capabilities of LMs over data. We systematically develop benchmarks to study the TAG problem and find that standard methods answer no more than 20% of queries correctly, confirming the need for further research in this area. We release code for the benchmark at https://github.com/TAG-Research/TAG-Bench.

1 Introduction

Language models promise to revolutionize data management by letting users ask natural language questions over data, which has led to a great deal of research in Text2SQL and Retrieval-Augmented Generation (RAG) methods. In our experience, however (including from internal workloads and customers at Databricks), users’ questions often transcend the capabilities of these paradigms, demanding new research investment towards systems that combine the logical reasoning abilities of database systems with the natural language reasoning abilities of modern language models (LMs).

In particular, we find that real business users’ questions often require sophisticated combinations of domain knowledge, world knowledge, exact computation, and semantic reasoning. Database systems clearly provide a source of domain knowledge through the up-to-date data they store, as well as exact computation at scale (which LMs are bad at).

LMs offer to extend the existing capabilities of databases in two key ways. First, LMs possess semantic reasoning capabilities over textual data, a core element of many natural language user queries. For example, a Databricks customer survey showed users wish to ask questions like “which customer reviews of product X are positive?” or “why did my sales drop during this period?”. These questions present complex reasoning-based tasks, such as sentiment analysis over free-text fields or summarization of trends. LMs are well-suited to these tasks, which cannot be modeled by the exact computation or relational primitives in traditional database systems.
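
As a small hedged sketch of what such a reasoning-based task looks like in code (not from the paper; classify_sentiment is a hypothetical per-row LM call), the answer-generation stage might map the LM over each fetched review, something no relational operator can express:

```python
def classify_sentiment(review: str) -> str:
    """Hypothetical LM call returning 'positive' or 'negative'."""
    raise NotImplementedError

def positive_reviews(reviews: list) -> list:
    # Row-wise semantic reasoning: each free-text review is judged by
    # the LM, since sentiment has no exact relational equivalent.
    return [r for r in reviews if classify_sentiment(r) == "positive"]
```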

Secondly, the LM, using knowledge learned during model training and stored implicitly by the model’s weights, can powerfully augment the user’s data with world knowledge that is not captured explicitly by the database’s table schema. As an example, a Databricks internal AI user asked “what are the QoQ trends for the ‘retail’ vertical?” over a table containing attributes for account names, products and revenue. To answer this query the system must understand how the business defines QoQ (e.g., the quarter over quarter trends from the last quarter to the current quarter or this quarter last year to this quarter this year), as well as which companies are considered to be in the retail vertical. This task is well-suited to leverage the knowledge held by a pre-trained or fine-tuned LM.
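
A hypothetical sketch of how query synthesis could fold such world knowledge into SQL. The sales table, its columns, and the retail account list are invented for illustration; in a real TAG system the LM's parametric knowledge would supply the account list and the QoQ definition.

```python
# Knowledge the schema does not hold, assumed to come from the LM:
retail_accounts = ["Acme Retail", "ShopCo"]  # illustrative, LM-provided

placeholders = ", ".join("?" for _ in retail_accounts)
qoq_sql = f"""
    SELECT strftime('%Y', date) || '-Q' || ((strftime('%m', date) + 2) / 3)
               AS quarter,                  -- one reading of QoQ: calendar quarters
           SUM(revenue) AS total_revenue
    FROM sales
    WHERE account_name IN ({placeholders})  -- LM-resolved 'retail' vertical
    GROUP BY quarter
    ORDER BY quarter
"""
# rows = conn.execute(qoq_sql, retail_accounts).fetchall()
```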

Systems that efficiently leverage databases and LMs together to serve natural language queries, in their full generality, hold potential to transform the way users understand their data. Unfortunately, these questions cannot be answered today by common methods, such as Text2SQL and RAG. While Text2SQL methods [26, 28, 31, 32] are suitable for the subset of natural language queries that have direct relational equivalents, they cannot handle the vast array of user queries that require semantic reasoning or world knowledge. For instance, the previous user query asking which customer reviews are positive may require logical row-wise LM reasoning over reviews to classify each as positive or negative. Similarly the question which asks why sales dropped entails a reasoning question that must aggregate information across many table entries.

On the other hand, the RAG model is limited to simple relevance-based point lookups to a few data records, followed by a single LM invocation. This model serves only the subset of queries answerable by point lookups and also fails to leverage the richer query execution capabilities of many database systems, which leaves computational tasks (e.g., counting, math and filtering) to a single invocation of the error-prone LM. In addition to being error prone and inefficient at computational tasks, LMs have also been shown to perform poorly on long-context prompts, limiting their ability to reason about data at scale in the generation phase of RAG.

We instead propose table-augmented generation (TAG) as a unified paradigm for systems that answer natural language questions over databases. Specifically, TAG defines three key steps, as shown in Figure 1. First, the query synthesis step syn translates the user’s arbitrary natural language request R to an executable database query Q. Then, the query execution step exec executes Q on the database system to efficiently compute the relevant data T. Lastly, the answer generation step gen utilizes R and T, where the LM is orchestrated, possibly in iterative or recursive patterns over the data, to generate the final natural language answer A. The TAG model is simple, but powerful: it is defined by the following three equations, but captures a wide range of previously understudied interactions between LMs and databases.

Query Synthesis: syn(R) → Q (1)

Query Execution: exec(Q) → T (2)

Answer Generation: gen(R, T) → A (3)

Notably, the TAG model unifies prior methods, including both Text2SQL and RAG, which represent special cases of TAG and serve only a limited subset of user questions.
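
To see the unification concretely, here is an interpretive sketch (our reading, not the paper's formalism) of both methods as degenerate TAG pipelines, reusing syn, exec_query, and call_lm from the earlier sketch; the vector index and its search method are hypothetical:

```python
def text2sql(conn, request: str, schema: str):
    # Text2SQL as a TAG special case: answer generation is the identity,
    # so the raw table T is returned with no LM reasoning over it.
    return exec_query(conn, syn(request, schema))  # A = T

def rag(index, request: str, k: int = 5) -> str:
    # RAG as a TAG special case: synthesis/execution collapse into a
    # relevance-based point lookup, followed by a single LM invocation.
    records = index.search(request, k)  # hypothetical vector index API
    return call_lm(f"Context: {records}\nQuestion: {request}")
```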

While several prior works address these special cases of TAG, we provide the first end-to-end TAG benchmark composed of a broad set of realistic queries that require LM reasoning and knowledge capabilities. We demonstrate the significant research challenges posed by these types of questions, as well as the promise of efficient TAG implementations. Our evaluation analyzes the vanilla Text2SQL and RAG baselines as well as two stronger baselines, Text2SQL with LM generation and retrieval with LM-based re-ranking. Across a variety of query types, we find each baseline method consistently fails to achieve high accuracy, never surpassing 20% exact match on the benchmark. On the other hand, we implement hand-written TAG pipelines on top of the recent LOTUS runtime [21] and find they achieve up to 20-65% higher accuracy compared to the baselines. This significant performance gap demonstrates the promise of building efficient TAG systems.
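
For a flavor of what a hand-written TAG pipeline over LOTUS might look like, here is a hedged sketch using LOTUS's pandas-style semantic operators (sem_topk as documented in the library's README; the model name and exact arguments are assumptions, so check the current LOTUS docs):

```python
import pandas as pd
import lotus
from lotus.models import LM

# Assumed model configuration; LOTUS routes calls through its LM backend.
lotus.settings.configure(lm=LM(model="meta-llama/Meta-Llama-3.1-70B-Instruct"))

# Rows for one post, as computed by the exact SQL step of the pipeline.
comments = pd.DataFrame({"comment": ["Oh sure, that'll compile first try.",
                                     "Thanks, this fixed my bug!",
                                     "Great, another regex 'solution'."]})

# Semantic top-k: the LM ranks rows by a natural language criterion,
# e.g. the benchmark query asking for the 3 most sarcastic comments.
top3 = comments.sem_topk("Which {comment} is most sarcastic?", K=3)
```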

Figure 1: An example TAG implementation for answering the user’s natural language question over a table about movies. The TAG pipeline proceeds in three stages: query synthesis, query execution, and answer generation.

4 EVALUATION

In this section, we introduce the first TAG benchmark and evaluate a collection of baselines, aiming to address the following questions:

(1) How do existing methods for table question answering perform on queries requiring semantic reasoning or world knowledge?

(2) How does a hand-written implementation of the TAG model, which divides computational and reasoning steps across DBMS and LM operations, perform on these queries?

4.1 Benchmark Methodology

Existing benchmarks have explored how models perform on basic queries answerable entirely from data in the data source. We build upon prior work by modifying queries such that they require knowledge not directly available in the data source or semantic reasoning to answer. We select BIRD [17], a widely used Text2SQL benchmark on which LMs have been evaluated, for its large-scale tables along with its variety of domains and query types.

Dataset

Our queries span 5 domains selected from BIRD, each containing diversity in query types. We select california_schools, debit_card_specializing, formula_1, codebase_community, and european_football_2 as the DB sources for our queries.

Queries

The BIRD benchmark defines fundamental query types, including match-based, comparison, ranking, and aggregation queries. We select queries among these types from the BIRD benchmark and modify them to require either world knowledge or semantic reasoning for the model to answer. As an example of a modified query requiring world knowledge, in the california_schools DB, a modified query adds an additional clause asking for only schools in the Bay Area. This information is not in the table and requires the model’s world knowledge to answer. Next, a modified query requiring LM reasoning asks for the top 3 most sarcastic comments on a particular post in the codebase_community DB. For evaluation on these queries, we rely on human-labeled ground truth. Our final benchmark consists of 80 modified queries, 40 requiring parametric knowledge and 40 requiring reasoning, with 20 of each of the 4 chosen BIRD query types.

Evaluation metrics

We measure accuracy as the percentage of exact matches as compared to the labeled correct answer for the match-based, comparison, and ranking query types. For aggregation queries, we provide qualitative analysis on results using each baseline. We also measure execution time in seconds for each query.
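
A minimal sketch of the exact-match metric as we read it (the benchmark's released code may normalize answers differently):

```python
def exact_match_accuracy(predictions: list, labels: list) -> float:
    # Accuracy = share of answers string-identical to the labeled ground
    # truth; lowercase/whitespace normalization here is our assumption.
    norm = lambda s: " ".join(str(s).lower().split())
    hits = sum(norm(p) == norm(g) for p, g in zip(predictions, labels))
    return 100.0 * hits / len(labels)
```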

Experimental setup

We use the instruction-tuned variant of Meta’s Llama-3.1 model [2] with 70B parameters as our LM for both Text2SQL and final output generation. We use SQLite3 as our database API for baselines involving SQL and use an E5 base embedding model [23] for our RAG baseline. We run Llama-3.1-70B-Instruct with vLLM [14] on 8 A100 80GB GPUs.
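
For reference, a minimal sketch of this setup using vLLM's offline Python API, sharding the model across 8 GPUs as in the paper; the sampling parameters are assumptions:

```python
from vllm import LLM, SamplingParams

# Tensor-parallel across 8 GPUs, matching the paper's 8x A100 80GB setup.
llm = LLM(model="meta-llama/Meta-Llama-3.1-70B-Instruct",
          tensor_parallel_size=8)

params = SamplingParams(temperature=0.0, max_tokens=512)  # assumed values
outputs = llm.generate(["Write a SQL query that ..."], params)
print(outputs[0].outputs[0].text)
```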

6 CONCLUSION

In this work we proposed table-augmented generation (TAG) as a unified model for answering natural language questions over databases. We developed benchmarks to study two important types of queries: those that require world knowledge, and those that require semantic reasoning capabilities. Our systematic evaluation confirms that baseline methods are unable to make meaningful traction on these tasks. However, hand-written TAG pipelines can achieve up to 65% higher accuracy, demonstrating substantial research opportunities for building TAG systems.
