Google Releases Gemini-SQL2: Gemini 3.1 Pro Text-to-SQL Reaches 80.04% Execution Accuracy on the BIRD Single-Model Leaderboard

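The Google Research team has announced Gemini-SQL2, which it describes as a breakthrough text-to-SQL capability built on Gemini 3.1 Pro. Gemini-SQL2 reached 80.04% execution accuracy on the BIRD Text-to-SQL leaderboard (single-model track). Google's chart places it above the previous leader, Gemini-SQL. The metric measures whether the generated SQL runs and returns the correct result, not whether it merely looks right.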

Gemini-SQL2

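Gemini-SQL2 is a text-to-SQL capability, not a standalone base-model release. It turns natural-language questions into what Google calls "execution-ready SQL" queries. The capability is built on Gemini 3.1 Pro.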

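According to the announcement on X, the subtleties of data and complex business context make it very difficult to generate accurate SQL from natural language. The post adds that improved SQL understanding can enhance natural-language skills across Google's data services. This points to likely integration targets such as BigQuery Studio, AlloyDB AI, and Cloud SQL Studio, all of which already ship Gemini-based SQL generation. Google has not confirmed which products will receive Gemini-SQL2.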

Benchmarks

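BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) is the industry-standard benchmark for this task. It contains 12,751 question-SQL pairs over 95 databases spanning 37 professional domains, with 33.4 GB of data in total. Unlike older benchmarks such as Spider, its databases contain dirty data and require external domain knowledge.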

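BIRD measures execution accuracy (EX): the generated SQL must run and return results that match the reference answer. Google makes the point directly: on the BIRD benchmark, which measures execution-verified accuracy, Gemini-SQL2's SQL does not just look plausible, it also runs successfully.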
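To make the metric concrete, here is a minimal sketch of an execution-accuracy check using Python's built-in sqlite3 module. The toy `orders` table, the sample queries, and the `execution_match` helper are hypothetical, and the order-insensitive set comparison is a simplifying assumption for illustration rather than BIRD's official evaluation script.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """Return True only if the predicted SQL runs and yields the same rows as the gold SQL."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # SQL that fails to execute counts as a miss
    gold_rows = conn.execute(gold_sql).fetchall()
    # Assumption: compare rows as sets, so row order and duplicates are ignored
    return set(predicted_rows) == set(gold_rows)


# Hypothetical toy database for the demonstration
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL, status TEXT);
INSERT INTO orders VALUES
  (1, 'EU', 100.0, 'paid'),
  (2, 'EU', 50.0, 'refunded'),
  (3, 'US', 70.0, 'paid');
""")

gold = "SELECT region, SUM(amount) FROM orders WHERE status = 'paid' GROUP BY region"
predictions = {
    "equivalent": "SELECT region, SUM(amount) FROM orders "
                  "WHERE status IN ('paid') GROUP BY region ORDER BY region DESC",
    "plausible_but_wrong": "SELECT region, SUM(amount) FROM orders GROUP BY region",
    "does_not_run": "SELECT region, SUM(amt) FROM orders GROUP BY region",
}

results = {name: execution_match(conn, sql, gold) for name, sql in predictions.items()}
ex_accuracy = sum(results.values()) / len(results)
print(results)
print(f"EX = {ex_accuracy:.2%}")
```

Only the first prediction counts as correct. The second runs cleanly but includes refunded orders, which is exactly the "looks plausible, returns wrong rows" failure that an execution-based metric penalizes.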

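The single-model track restricts the preprocessing, retrieval, and agentic scaffolding common in ensemble approaches, so that it measures the model's core text-to-SQL ability. Google Cloud's previous record on this track was 76.13%, set on November 15, 2025. Google also puts the human-performance baseline at 92.96%, which is 12.92 percentage points above 80.04%.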

How the Leaderboard Stacks Up

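The chart in Google's X post shows Gemini-SQL2 ahead of eight named competitors plus several unlabeled points. Only the 80.04% figure is labeled in text. The values below are approximate readings from chart position; dates reflect each point's horizontal position.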

Gemini-SQL2 (Google): 80.04% (June 2026)
Gemini-SQL (Google): ~77.2% (March 2026)
Q-SQL (AWS): ~76.5% (December 2025)
Databricks RLVR 32B (Databricks): ~75.7% (July 2025)
SiriusAI-Text2SQL-32B-v2 (Tencent): ~75.0% (December 2025)
Arctic-Text2SQL-R1-32B (Snowflake): ~73.9% (June 2025)
GPT-5.5-xhigh (OpenAI): ~72.5% (April 2026)
SQLWeaver-32B (Alibaba): ~71.7% (May 2026)
Claude Opus 4.6 (Anthropic): ~70.1% (February 2026)

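Two patterns stand out. Google currently holds the top two named positions with Gemini-SQL2 and Gemini-SQL. Several specialized 32B SQL models outscore some general-purpose frontier models on this chart.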

Use Cases with Examples

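Self-service analytics: A revenue manager asks for monthly recurring revenue by month and region for accounts that churned within 90 days of an upgrade. That requires joins, window functions, and date arithmetic. Execution-verified generation catches SQL that runs but returns the wrong rows.
Data engineering drafts: Developers can draft BigQuery transformation logic from an English description, then review it rather than write it from scratch. Google's November 2025 work found that schema understanding is the hard part. A higher BIRD score reflects better handling of ambiguous columns and dirty data.
Embedded "ask your data" features: SaaS teams adding a natural-language query interface still need human review at 80% accuracy. Roughly one query in five may be wrong. The score sets expectations; it does not remove the need for review.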

Community Reception Dashboard

Verified public engagement on Google Research’s announcement posts • first ~3 hours • Jun 12, 2026

[Chart: BIRD Single-Model Leaderboard, execution accuracy]

Platform Engagement Breakdown

X / Twitter (main post): Views 144.4K, Likes 2,800, Reposts 267, Bookmarks 1,300, Replies 64, Engagement rate 3.1%
LinkedIn (main post): Reactions 349+, Comments 12, Reposts 27
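As a quick sanity check on the X numbers, the sketch below recomputes the engagement rate. The dashboard does not state its formula, so dividing total interactions (likes, reposts, bookmarks, replies) by views is an assumption, and the rounded 144.4K view count makes the result approximate.

```python
# Figures for the main X post, taken from the breakdown above
x_post = {
    "views": 144_400,  # rounded "144.4K"
    "likes": 2_800,
    "reposts": 267,
    "bookmarks": 1_300,
    "replies": 64,
}

# Assumed definition: every interaction type counts once, divided by views
interactions = sum(count for name, count in x_post.items() if name != "views")
engagement_rate = interactions / x_post["views"]

print(f"Interactions: {interactions}")
print(f"Engagement rate: {engagement_rate:.1%}")
```

Under that assumed definition the result rounds to the 3.1% shown on the dashboard.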

Reception signal: about 64 : 1 (bookmark-plus-like to reply ratio on X, from 4,100 likes and bookmarks against 64 replies. A high save rate with few replies typically signals approval over controversy. Comment-level sentiment not yet measurable; replies still loading at capture time.)

Implementation Pattern

Google has not published a Gemini-SQL2 model string or API yet. The schema-grounded pattern below works with current Gemini models via the google-genai SDK. Swap the model string when Gemini-SQL2 ships.

from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from environment

schema = """
CREATE TABLE orders (
  order_id INTEGER, customer TEXT, region TEXT,
  amount REAL, status TEXT, created_at DATE
);
"""

question = "Total paid order amount by region in 2026, highest first."

prompt = f"""You are a text-to-SQL system.
Schema:{schema}
Question: {question}
Return only one executable SQLite query. No explanation."""

resp = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # the base model named in the announcement; swap when a Gemini-SQL2 ID ships
    contents=prompt,
)
print(resp.text)

Production systems should add execution verification. Run the returned SQL, catch errors, and retry with the error message appended. That loop mirrors what BIRD's execution accuracy metric rewards.
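The run, catch, and retry loop described above can be sketched as follows. This is an illustration under stated assumptions, not a Google-published pattern: `run_with_retries` and `fake_generate_sql` are hypothetical names, the stub stands in for a real model call, and the toy SQLite table mirrors the `orders` schema from the example prompt.

```python
import sqlite3
from typing import Callable, Optional


def run_with_retries(
    conn: sqlite3.Connection,
    question: str,
    generate_sql: Callable[[str, Optional[str]], str],
    max_attempts: int = 3,
):
    """Generate SQL, execute it, and feed any database error back into the next attempt."""
    feedback = None
    attempts = []
    for _ in range(max_attempts):
        sql = generate_sql(question, feedback)
        attempts.append(sql)
        try:
            rows = conn.execute(sql).fetchall()
            return sql, rows, attempts
        except sqlite3.Error as exc:
            # Append the failed query and the error text for the next prompt
            feedback = f"Previous SQL:\n{sql}\nDatabase error: {exc}"
    raise RuntimeError(f"No executable SQL after {max_attempts} attempts")


def fake_generate_sql(question: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for a model call: misspells a column until it sees an error."""
    if feedback is None:
        return "SELECT region, SUM(amout) FROM orders GROUP BY region"
    return (
        "SELECT region, SUM(amount) AS total FROM orders "
        "WHERE status = 'paid' GROUP BY region ORDER BY total DESC"
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
  order_id INTEGER, customer TEXT, region TEXT,
  amount REAL, status TEXT, created_at DATE
);
INSERT INTO orders VALUES
  (1, 'acme', 'EU', 100.0, 'paid', '2026-01-05'),
  (2, 'acme', 'EU', 50.0, 'refunded', '2026-02-01'),
  (3, 'globex', 'US', 70.0, 'paid', '2026-03-10');
""")

final_sql, rows, attempts = run_with_retries(
    conn, "Total paid order amount by region, highest first.", fake_generate_sql
)
print(f"Attempts: {len(attempts)}")
print(rows)
```

In a real system, `generate_sql` would wrap the `generate_content` call and append the feedback string to the prompt. The loop only catches SQL that fails to execute; queries that run but return the wrong rows still need tests or human review, and running model-generated SQL against a read-only connection is a sensible precaution.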

Key Takeaways

Google reports Gemini-SQL2 at 80.04% execution accuracy on the BIRD single-model leaderboard.
The capability is powered by Gemini 3.1 Pro and targets "execution-ready SQL," not just plausible SQL.
On Google's chart, Gemini-SQL2 and Gemini-SQL hold the top two named positions; human performance is 92.96%.
No API, model card, technical report, or product integration details have been published yet.