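Introduction

Although Polars supports interaction with SQL, users are encouraged to become familiar with the expression syntax, which produces more readable and expressive code. Because the DataFrame interface is primary, new features are typically added to the expression API first. However, if you already have an existing SQL codebase or prefer to use SQL, Polars supports that too.

Note

There is no separate SQL engine: Polars translates SQL queries into expressions, which are then executed by its own engine. This approach ensures that Polars keeps the performance and scalability advantages of a native DataFrame library while still letting users work with SQL.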

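Context

Polars uses the SQLContext object to manage SQL queries. The context contains a mapping of DataFrame and LazyFrame identifier names to their corresponding datasets¹. The following example creates a SQLContext: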

import polars as pl

# Create an empty context; frames can be registered later
ctx = pl.SQLContext()

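Register DataFrames

There are several ways to register DataFrames during SQLContext initialization:

Register all LazyFrame and DataFrame objects in the global namespace.
Register explicitly via a dictionary mapping or kwargs.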

df = pl.DataFrame({"a": [1, 2, 3]})
lf = pl.LazyFrame({"b": [4, 5, 6]})

# Register all dataframes in the global namespace: registers both "df" and "lf"
ctx = pl.SQLContext(register_globals=True)

# Register an explicit mapping of identifier name to frame
ctx = pl.SQLContext(frames={"table_one": df, "table_two": lf})

# Register frames using kwargs; dataframe df as "df" and lazyframe lf as "lf"
ctx = pl.SQLContext(df=df, lf=lf)

We can also register Pandas DataFrames by converting them to Polars first.

import pandas as pd

df_pandas = pd.DataFrame({"c": [7, 8, 9]})
ctx = pl.SQLContext(df_pandas=pl.from_pandas(df_pandas))

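Note

Converting a Pandas DataFrame backed by Numpy to Polars can trigger a potentially expensive conversion; however, if the Pandas DataFrame is already backed by Arrow, the conversion is significantly cheaper (and in some cases close to free).

Once the SQLContext is initialized, we can register additional DataFrames or unregister existing ones with the following methods: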

register
register_globals
register_many
unregister

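Execute queries and collect results

SQL queries are always executed in lazy mode to take advantage of all query planning optimizations, so there are two ways to collect the result:

Set the parameter eager to True in SQLContext; this ensures that Polars automatically collects the LazyFrame results from execute calls.
Set the parameter eager to True when executing a query with execute, or collect the result explicitly using collect.

We execute SQL queries by calling execute on a SQLContext.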

# For local files use scan_csv instead
pokemon = pl.read_csv(
    "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv"
)
with pl.SQLContext(register_globals=True, eager=True) as ctx:
    df_small = ctx.execute("SELECT * from pokemon LIMIT 5")
    print(df_small)

shape: (5, 13)
┌─────┬───────────────────────┬────────┬────────┬───┬─────────┬───────┬────────────┬───────────┐
│ #   ┆ Name                  ┆ Type 1 ┆ Type 2 ┆ … ┆ Sp. Def ┆ Speed ┆ Generation ┆ Legendary │
│ --- ┆ ---                   ┆ ---    ┆ ---    ┆   ┆ ---     ┆ ---   ┆ ---        ┆ ---       │
│ i64 ┆ str                   ┆ str    ┆ str    ┆   ┆ i64     ┆ i64   ┆ i64        ┆ bool      │
╞═════╪═══════════════════════╪════════╪════════╪═══╪═════════╪═══════╪════════════╪═══════════╡
│ 1   ┆ Bulbasaur             ┆ Grass  ┆ Poison ┆ … ┆ 65      ┆ 45    ┆ 1          ┆ false     │
│ 2   ┆ Ivysaur               ┆ Grass  ┆ Poison ┆ … ┆ 80      ┆ 60    ┆ 1          ┆ false     │
│ 3   ┆ Venusaur              ┆ Grass  ┆ Poison ┆ … ┆ 100     ┆ 80    ┆ 1          ┆ false     │
│ 3   ┆ VenusaurMega Venusaur ┆ Grass  ┆ Poison ┆ … ┆ 120     ┆ 80    ┆ 1          ┆ false     │
│ 4   ┆ Charmander            ┆ Fire   ┆ null   ┆ … ┆ 50      ┆ 65    ┆ 1          ┆ false     │
└─────┴───────────────────────┴────────┴────────┴───┴─────────┴───────┴────────────┴───────────┘

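Execute queries from multiple sources

SQL queries can be executed just as easily across multiple sources. In the following example, we register:

a CSV file (loaded lazily)
an NDJSON file (loaded lazily)
a Pandas DataFrame

and join them together using SQL. Lazy reading allows only the necessary rows and columns to be loaded from the files.

In the same way, cloud data lakes (S3, Azure Data Lake) can be registered. A PyArrow dataset can point to the data lake, and Polars can then read it with scan_pyarrow_dataset.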

# Input data:
# products_masterdata.csv with schema {'product_id': Int64, 'product_name': String}
# products_categories.json with schema {'product_id': Int64, 'category': String}
# sales_data is a Pandas DataFrame with schema {'product_id': Int64, 'sales': Int64}

with pl.SQLContext(
    products_masterdata=pl.scan_csv("docs/assets/data/products_masterdata.csv"),
    products_categories=pl.scan_ndjson("docs/assets/data/products_categories.json"),
    sales_data=pl.from_pandas(sales_data),
    eager=True,
) as ctx:
    query = """
    SELECT
        product_id,
        product_name,
        category,
        sales
    FROM
        products_masterdata
    LEFT JOIN products_categories USING (product_id)
    LEFT JOIN sales_data USING (product_id)
    """
    print(ctx.execute(query))

shape: (5, 4)
┌────────────┬──────────────┬────────────┬───────┐
│ product_id ┆ product_name ┆ category   ┆ sales │
│ ---        ┆ ---          ┆ ---        ┆ ---   │
│ i64        ┆ str          ┆ str        ┆ i64   │
╞════════════╪══════════════╪════════════╪═══════╡
│ 1          ┆ Product A    ┆ Category 1 ┆ 100   │
│ 2          ┆ Product B    ┆ Category 1 ┆ 200   │
│ 3          ┆ Product C    ┆ Category 2 ┆ 150   │
│ 4          ┆ Product D    ┆ Category 2 ┆ 250   │
│ 5          ┆ Product E    ┆ Category 3 ┆ 300   │
└────────────┴──────────────┴────────────┴───────┘

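Compatibility

Polars does not support the complete SQL specification, but it does support a subset of the most common statement types.

Note

Where possible, Polars aims to follow PostgreSQL syntax definitions and function behavior.

For example, here is a non-exhaustive list of supported functionality:

Write a CREATE statement: CREATE TABLE xxx AS ...
Write a SELECT statement containing WHERE, ORDER BY, LIMIT, GROUP BY, UNION and JOIN clauses ...
Write common table expressions (CTEs), such as: WITH tablename AS
Explain a query: EXPLAIN SELECT ...
List registered tables: SHOW TABLES
Drop a table: DROP TABLE tablename
Truncate a table: TRUNCATE TABLE tablename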

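The following are some features that are not yet supported:

INSERT, UPDATE or DELETE statements
Meta queries such as ANALYZE

In the upcoming sections we will cover each of the statements in more detail.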

¹ Additionally, it also tracks common table expressions.