跳到内容

SELECT

在 Polars SQL 中,SELECT 语句用于从表中检索数据到 DataFrame 中。Polars SQL 中 SELECT 语句的基本语法如下:

SELECT column1, column2, ...
FROM table_name;

在这里,column1column2 等是您希望从表中选择的列。您也可以使用通配符 * 来选择所有列。table_name 是您想要从中检索数据的表的名称。在下面的章节中,我们将介绍一些更常见的 SELECT 变体。

register · execute

df = pl.DataFrame(
    {
        "city": [
            "New York",
            "Los Angeles",
            "Chicago",
            "Houston",
            "Phoenix",
            "Amsterdam",
        ],
        "country": ["USA", "USA", "USA", "USA", "USA", "Netherlands"],
        "population": [8399000, 3997000, 2705000, 2320000, 1680000, 900000],
    }
)

ctx = pl.SQLContext(population=df, eager=True)

print(ctx.execute("SELECT * FROM population"))

shape: (6, 3)
┌─────────────┬─────────────┬────────────┐
│ city        ┆ country     ┆ population │
│ ---         ┆ ---         ┆ ---        │
│ str         ┆ str         ┆ i64        │
╞═════════════╪═════════════╪════════════╡
│ New York    ┆ USA         ┆ 8399000    │
│ Los Angeles ┆ USA         ┆ 3997000    │
│ Chicago     ┆ USA         ┆ 2705000    │
│ Houston     ┆ USA         ┆ 2320000    │
│ Phoenix     ┆ USA         ┆ 1680000    │
│ Amsterdam   ┆ Netherlands ┆ 900000     │
└─────────────┴─────────────┴────────────┘

GROUP BY

GROUP BY 语句用于根据一个或多个列对表中的行进行分组,并对每个组计算聚合函数。

execute

result = ctx.execute(
    """
        SELECT country, AVG(population) as avg_population
        FROM population
        GROUP BY country
    """
)
print(result)

shape: (2, 2)
┌─────────────┬────────────────┐
│ country     ┆ avg_population │
│ ---         ┆ ---            │
│ str         ┆ f64            │
╞═════════════╪════════════════╡
│ Netherlands ┆ 900000.0       │
│ USA         ┆ 3.8202e6       │
└─────────────┴────────────────┘

ORDER BY

ORDER BY 语句用于按一个或多个列对查询结果集进行升序或降序排序。

execute

result = ctx.execute(
    """
        SELECT city, population
        FROM population
        ORDER BY population
    """
)
print(result)

shape: (6, 2)
┌─────────────┬────────────┐
│ city        ┆ population │
│ ---         ┆ ---        │
│ str         ┆ i64        │
╞═════════════╪════════════╡
│ Amsterdam   ┆ 900000     │
│ Phoenix     ┆ 1680000    │
│ Houston     ┆ 2320000    │
│ Chicago     ┆ 2705000    │
│ Los Angeles ┆ 3997000    │
│ New York    ┆ 8399000    │
└─────────────┴────────────┘

JOIN

register_many · execute

income = pl.DataFrame(
    {
        "city": [
            "New York",
            "Los Angeles",
            "Chicago",
            "Houston",
            "Amsterdam",
            "Rotterdam",
            "Utrecht",
        ],
        "country": [
            "USA",
            "USA",
            "USA",
            "USA",
            "Netherlands",
            "Netherlands",
            "Netherlands",
        ],
        "income": [55000, 62000, 48000, 52000, 42000, 38000, 41000],
    }
)
ctx.register_many(income=income)
result = ctx.execute(
    """
        SELECT country, city, income, population
        FROM population
        LEFT JOIN income on population.city = income.city
    """
)
print(result)

shape: (6, 4)
┌─────────────┬─────────────┬────────┬────────────┐
│ country     ┆ city        ┆ income ┆ population │
│ ---         ┆ ---         ┆ ---    ┆ ---        │
│ str         ┆ str         ┆ i64    ┆ i64        │
╞═════════════╪═════════════╪════════╪════════════╡
│ USA         ┆ New York    ┆ 55000  ┆ 8399000    │
│ USA         ┆ Los Angeles ┆ 62000  ┆ 3997000    │
│ USA         ┆ Chicago     ┆ 48000  ┆ 2705000    │
│ USA         ┆ Houston     ┆ 52000  ┆ 2320000    │
│ USA         ┆ Phoenix     ┆ null   ┆ 1680000    │
│ Netherlands ┆ Amsterdam   ┆ 42000  ┆ 900000     │
└─────────────┴─────────────┴────────┴────────────┘

函数

Polars 提供了多种 SQL 函数,包括:

  • 数学函数:ABS, EXP, LOG, ASIN, ACOS, ATAN, 等。
  • 字符串函数:LOWER, UPPER, LTRIM, RTRIM, STARTS_WITH,ENDS_WITH
  • 聚合函数:SUM, AVG, MIN, MAX, COUNT, STDDEV, FIRST 等。
  • 数组函数:EXPLODE, UNNEST,ARRAY_SUM,ARRAY_REVERSE, 等。

有关支持的函数完整列表,请参阅 API 文档。以下示例演示了如何在查询中使用函数

execute

result = ctx.execute(
    """
        SELECT city, population
        FROM population
        WHERE STARTS_WITH(country,'U')
    """
)
print(result)

shape: (5, 2)
┌─────────────┬────────────┐
│ city        ┆ population │
│ ---         ┆ ---        │
│ str         ┆ i64        │
╞═════════════╪════════════╡
│ New York    ┆ 8399000    │
│ Los Angeles ┆ 3997000    │
│ Chicago     ┆ 2705000    │
│ Houston     ┆ 2320000    │
│ Phoenix     ┆ 1680000    │
└─────────────┴────────────┘

表函数

在之前的示例中,我们首先生成了一个 DataFrame 并将其注册到 SQLContext 中。Polars 还支持在 SQL 查询中使用表函数 read_xxx 直接读取 CSV、Parquet、JSON 和 IPC 文件。

execute

result = ctx.execute(
    """
        SELECT *
        FROM read_csv('docs/assets/data/iris.csv')
    """
)
print(result)

shape: (150, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species   │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ Setosa    │
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ Setosa    │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ Setosa    │
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ Setosa    │
│ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ Setosa    │
│ …            ┆ …           ┆ …            ┆ …           ┆ …         │
│ 6.7          ┆ 3.0         ┆ 5.2          ┆ 2.3         ┆ Virginica │
│ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ Virginica │
│ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ Virginica │
│ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ Virginica │
│ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ Virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘