Window functions

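Window functions are expressions with superpowers. They let you perform aggregations on groups within the `select` context. Let's get a feel for what that means.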

First, we load a Pokémon dataset:

Python Rust

read_csv

import polars as pl

types = (
    "Grass Water Fire Normal Ground Electric Psychic Fighting Bug Steel "
    "Flying Dragon Dark Ghost Poison Rock Ice Fairy".split()
)
type_enum = pl.Enum(types)
# then let's load some csv data with information about pokemon
pokemon = pl.read_csv(
    "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
).cast({"Type 1": type_enum, "Type 2": type_enum})
print(pokemon.head())

CsvReader · Available on feature csv

use polars::prelude::*;
use reqwest::blocking::Client;

let data: Vec<u8> = Client::new()
    .get("https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv")
    .send()?
    .text()?
    .bytes()
    .collect();

let file = std::io::Cursor::new(data);
let df = CsvReadOptions::default()
    .with_has_header(true)
    .into_reader_with_file_handle(file)
    .finish()?;

println!("{}", df.head(Some(5)));

shape: (5, 13)
┌─────┬───────────────────────┬────────┬────────┬───┬─────────┬───────┬────────────┬───────────┐
│ #   ┆ Name                  ┆ Type 1 ┆ Type 2 ┆ … ┆ Sp. Def ┆ Speed ┆ Generation ┆ Legendary │
│ --- ┆ ---                   ┆ ---    ┆ ---    ┆   ┆ ---     ┆ ---   ┆ ---        ┆ ---       │
│ i64 ┆ str                   ┆ enum   ┆ enum   ┆   ┆ i64     ┆ i64   ┆ i64        ┆ bool      │
╞═════╪═══════════════════════╪════════╪════════╪═══╪═════════╪═══════╪════════════╪═══════════╡
│ 1   ┆ Bulbasaur             ┆ Grass  ┆ Poison ┆ … ┆ 65      ┆ 45    ┆ 1          ┆ false     │
│ 2   ┆ Ivysaur               ┆ Grass  ┆ Poison ┆ … ┆ 80      ┆ 60    ┆ 1          ┆ false     │
│ 3   ┆ Venusaur              ┆ Grass  ┆ Poison ┆ … ┆ 100     ┆ 80    ┆ 1          ┆ false     │
│ 3   ┆ VenusaurMega Venusaur ┆ Grass  ┆ Poison ┆ … ┆ 120     ┆ 80    ┆ 1          ┆ false     │
│ 4   ┆ Charmander            ┆ Fire   ┆ null   ┆ … ┆ 50      ┆ 65    ┆ 1          ┆ false     │
└─────┴───────────────────────┴────────┴────────┴───┴─────────┴───────┴────────────┴───────────┘

Operations per group

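Window functions are ideal when we want to perform an operation within a group. For example, suppose we want to rank the Pokémon by the column "Speed". Instead of a global ranking, however, we want to rank the speed within each group defined by the column "Type 1". We write the expression that ranks the data by the column "Speed" and then add the function `over` to specify that the ranking should be computed separately for each unique value of the column "Type 1":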

Python Rust

over

result = pokemon.select(
    pl.col("Name", "Type 1"),
    pl.col("Speed").rank("dense", descending=True).over("Type 1").alias("Speed rank"),
)

print(result)

over

let result = df
    .clone()
    .lazy()
    .select([
        col("Name"),
        col("Type 1"),
        col("Speed")
            .rank(
                RankOptions {
                    method: RankMethod::Dense,
                    descending: true,
                },
                None,
            )
            .over(["Type 1"])
            .alias("Speed rank"),
    ])
    .collect()?;

println!("{result}");

shape: (163, 3)
┌───────────────────────┬─────────┬────────────┐
│ Name                  ┆ Type 1  ┆ Speed rank │
│ ---                   ┆ ---     ┆ ---        │
│ str                   ┆ enum    ┆ u32        │
╞═══════════════════════╪═════════╪════════════╡
│ Bulbasaur             ┆ Grass   ┆ 6          │
│ Ivysaur               ┆ Grass   ┆ 3          │
│ Venusaur              ┆ Grass   ┆ 1          │
│ VenusaurMega Venusaur ┆ Grass   ┆ 1          │
│ Charmander            ┆ Fire    ┆ 7          │
│ …                     ┆ …       ┆ …          │
│ Moltres               ┆ Fire    ┆ 5          │
│ Dratini               ┆ Dragon  ┆ 3          │
│ Dragonair             ┆ Dragon  ┆ 2          │
│ Dragonite             ┆ Dragon  ┆ 1          │
│ Mewtwo                ┆ Psychic ┆ 2          │
└───────────────────────┴─────────┴────────────┘

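To help visualise this operation, you may imagine that Polars selects the subset of the data that shares the same value in the column "Type 1" and then computes the ranking expression only for those rows. The results for that specific group are then projected back onto the original rows, and Polars does this for every existing group. The diagram below highlights the ranking computation for the Pokémon whose "Type 1" equals "Grass".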

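Note that the Pokémon "Golbat" has a "Speed" value of `90`, which is greater than the value `80` of the Pokémon "Venusaur", and yet the latter is ranked 1, because "Golbat" and "Venusaur" do not share the same value in the column "Type 1".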

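The function `over` accepts an arbitrary number of expressions that specify the groups over which to perform the computation. We can repeat the ranking above, but over the combination of the columns "Type 1" and "Type 2", for a more fine-grained ranking: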

Python Rust

over

result = pokemon.select(
    pl.col("Name", "Type 1", "Type 2"),
    pl.col("Speed")
    .rank("dense", descending=True)
    .over("Type 1", "Type 2")
    .alias("Speed rank"),
)

print(result)

over

// Contribute the Rust translation of the Python example by opening a PR.

shape: (163, 4)
┌───────────────────────┬─────────┬────────┬────────────┐
│ Name                  ┆ Type 1  ┆ Type 2 ┆ Speed rank │
│ ---                   ┆ ---     ┆ ---    ┆ ---        │
│ str                   ┆ enum    ┆ enum   ┆ u32        │
╞═══════════════════════╪═════════╪════════╪════════════╡
│ Bulbasaur             ┆ Grass   ┆ Poison ┆ 6          │
│ Ivysaur               ┆ Grass   ┆ Poison ┆ 3          │
│ Venusaur              ┆ Grass   ┆ Poison ┆ 1          │
│ VenusaurMega Venusaur ┆ Grass   ┆ Poison ┆ 1          │
│ Charmander            ┆ Fire    ┆ null   ┆ 7          │
│ …                     ┆ …       ┆ …      ┆ …          │
│ Moltres               ┆ Fire    ┆ Flying ┆ 2          │
│ Dratini               ┆ Dragon  ┆ null   ┆ 2          │
│ Dragonair             ┆ Dragon  ┆ null   ┆ 1          │
│ Dragonite             ┆ Dragon  ┆ Flying ┆ 1          │
│ Mewtwo                ┆ Psychic ┆ null   ┆ 2          │
└───────────────────────┴─────────┴────────┴────────────┘

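In general, the results you get with the function `over` can also be obtained by aggregating and then calling the function `explode`, although the rows will be in a different order: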

Python Rust

explode

result = (
    pokemon.group_by("Type 1")
    .agg(
        pl.col("Name"),
        pl.col("Speed").rank("dense", descending=True).alias("Speed rank"),
    )
    .select(pl.col("Name"), pl.col("Type 1"), pl.col("Speed rank"))
    .explode("Name", "Speed rank")
)

print(result)

explode

// Contribute the Rust translation of the Python example by opening a PR.

shape: (163, 3)
┌───────────────────────────┬─────────┬────────────┐
│ Name                      ┆ Type 1  ┆ Speed rank │
│ ---                       ┆ ---     ┆ ---        │
│ str                       ┆ enum    ┆ u32        │
╞═══════════════════════════╪═════════╪════════════╡
│ Charmander                ┆ Fire    ┆ 7          │
│ Charmeleon                ┆ Fire    ┆ 6          │
│ Charizard                 ┆ Fire    ┆ 2          │
│ CharizardMega Charizard X ┆ Fire    ┆ 2          │
│ CharizardMega Charizard Y ┆ Fire    ┆ 2          │
│ …                         ┆ …       ┆ …          │
│ AlakazamMega Alakazam     ┆ Psychic ┆ 1          │
│ Drowzee                   ┆ Psychic ┆ 7          │
│ Hypno                     ┆ Psychic ┆ 6          │
│ Mr. Mime                  ┆ Psychic ┆ 5          │
│ Mewtwo                    ┆ Psychic ┆ 2          │
└───────────────────────────┴─────────┴────────────┘

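This shows that, in general, `group_by` and `over` produce results of different shapes:

`group_by` usually produces a dataframe with as many rows as there are groups used for aggregating; and
`over` usually produces a dataframe with the same number of rows as the original.

The function `over` does not always produce a result with the same number of rows as the original dataframe, and that is what we explore next.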

Mapping results to dataframe rows

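The function `over` accepts a parameter `mapping_strategy` that determines how the results of the expression on each group are mapped back to the rows of the dataframe.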

`group_to_rows`

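The default behaviour is `"group_to_rows"`: the result of the expression on a group should have the same length as the group, and the results are mapped back to the rows of that group.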

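If the order of the rows is not important, the option `"explode"` is more performant. Instead of mapping the resulting values back to the original rows, Polars creates a new dataframe in which values from the same group are next to each other. To help understand the distinction, consider the following dataframe: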

shape: (6, 3)
┌─────────┬─────────┬──────┐
│ athlete ┆ country ┆ rank │
│ ---     ┆ ---     ┆ ---  │
│ str     ┆ str     ┆ i64  │
╞═════════╪═════════╪══════╡
│ A       ┆ PT      ┆ 6    │
│ B       ┆ NL      ┆ 1    │
│ C       ┆ NL      ┆ 5    │
│ D       ┆ PT      ┆ 4    │
│ E       ┆ PT      ┆ 2    │
│ F       ┆ NL      ┆ 3    │
└─────────┴─────────┴──────┘

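We can sort the athletes by rank within their own countries. If we do so, the Dutch athletes, who were in the second, third, and sixth rows, remain in those rows. What changes is the order of the athletes' names, which goes from "B", "C", and "F" to "B", "F", and "C":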

Python Rust

over

result = athletes.select(
    pl.col("athlete", "rank").sort_by(pl.col("rank")).over(pl.col("country")),
    pl.col("country"),
)

print(result)

over

// Contribute the Rust translation of the Python example by opening a PR.

shape: (6, 3)
┌─────────┬──────┬─────────┐
│ athlete ┆ rank ┆ country │
│ ---     ┆ ---  ┆ ---     │
│ str     ┆ i64  ┆ str     │
╞═════════╪══════╪═════════╡
│ E       ┆ 2    ┆ PT      │
│ B       ┆ 1    ┆ NL      │
│ F       ┆ 3    ┆ NL      │
│ D       ┆ 4    ┆ PT      │
│ A       ┆ 6    ┆ PT      │
│ C       ┆ 5    ┆ NL      │
└─────────┴──────┴─────────┘

The diagram below represents this transformation:

`explode`

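If we set the parameter `mapping_strategy` to `"explode"`, athletes from the same country are grouped together, but the final order of the rows, with respect to the countries, is no longer the same, as the diagram below shows.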

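Because Polars does not need to keep track of the row positions of each group, using `"explode"` is typically faster than `"group_to_rows"`. However, `"explode"` also requires more care, because it implies reordering the other columns that we wish to keep. The code that produces this result follows: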

Python Rust

over

result = athletes.select(
    pl.all()
    .sort_by(pl.col("rank"))
    .over(pl.col("country"), mapping_strategy="explode"),
)

print(result)

over

// Contribute the Rust translation of the Python example by opening a PR.

shape: (6, 3)
┌─────────┬─────────┬──────┐
│ athlete ┆ country ┆ rank │
│ ---     ┆ ---     ┆ ---  │
│ str     ┆ str     ┆ i64  │
╞═════════╪═════════╪══════╡
│ E       ┆ PT      ┆ 2    │
│ D       ┆ PT      ┆ 4    │
│ A       ┆ PT      ┆ 6    │
│ B       ┆ NL      ┆ 1    │
│ F       ┆ NL      ┆ 3    │
│ C       ┆ NL      ┆ 5    │
└─────────┴─────────┴──────┘

`join`

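Another possible value for the parameter `mapping_strategy` is `"join"`, which aggregates the resulting values into a list and repeats that list on every row of the same group: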

Python Rust

over

result = athletes.with_columns(
    pl.col("rank").sort().over(pl.col("country"), mapping_strategy="join"),
)

print(result)

over

// Contribute the Rust translation of the Python example by opening a PR.

shape: (6, 3)
┌─────────┬─────────┬───────────┐
│ athlete ┆ country ┆ rank      │
│ ---     ┆ ---     ┆ ---       │
│ str     ┆ str     ┆ list[i64] │
╞═════════╪═════════╪═══════════╡
│ A       ┆ PT      ┆ [2, 4, 6] │
│ B       ┆ NL      ┆ [1, 3, 5] │
│ C       ┆ NL      ┆ [1, 3, 5] │
│ D       ┆ PT      ┆ [2, 4, 6] │
│ E       ┆ PT      ┆ [2, 4, 6] │
│ F       ┆ NL      ┆ [1, 3, 5] │
└─────────┴─────────┴───────────┘

Windowed aggregation expressions

If the expression applied to the values of a group produces a scalar value, that scalar is broadcast across all rows of the group:

Python Rust

over

result = pokemon.select(
    pl.col("Name", "Type 1", "Speed"),
    pl.col("Speed").mean().over(pl.col("Type 1")).alias("Mean speed in group"),
)

print(result)

over

let result = df
    .clone()
    .lazy()
    .select([
        col("Name"),
        col("Type 1"),
        col("Speed"),
        col("Speed")
            .mean()
            .over(["Type 1"])
            .alias("Mean speed in group"),
    ])
    .collect()?;

println!("{result}");

shape: (163, 4)
┌───────────────────────┬─────────┬───────┬─────────────────────┐
│ Name                  ┆ Type 1  ┆ Speed ┆ Mean speed in group │
│ ---                   ┆ ---     ┆ ---   ┆ ---                 │
│ str                   ┆ enum    ┆ i64   ┆ f64                 │
╞═══════════════════════╪═════════╪═══════╪═════════════════════╡
│ Bulbasaur             ┆ Grass   ┆ 45    ┆ 54.230769           │
│ Ivysaur               ┆ Grass   ┆ 60    ┆ 54.230769           │
│ Venusaur              ┆ Grass   ┆ 80    ┆ 54.230769           │
│ VenusaurMega Venusaur ┆ Grass   ┆ 80    ┆ 54.230769           │
│ Charmander            ┆ Fire    ┆ 65    ┆ 86.285714           │
│ …                     ┆ …       ┆ …     ┆ …                   │
│ Moltres               ┆ Fire    ┆ 90    ┆ 86.285714           │
│ Dratini               ┆ Dragon  ┆ 50    ┆ 66.666667           │
│ Dragonair             ┆ Dragon  ┆ 70    ┆ 66.666667           │
│ Dragonite             ┆ Dragon  ┆ 80    ┆ 66.666667           │
│ Mewtwo                ┆ Psychic ┆ 130   ┆ 99.25               │
└───────────────────────┴─────────┴───────┴─────────────────────┘

More examples

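For more practice, here are some window functions for us to compute:

sort all Pokémon by type;
select the first `3` Pokémon per type as `"Type 1"`;
sort the Pokémon within a type by speed in descending order and select the first `3` as `"fastest/group"`;
sort the Pokémon within a type by attack in descending order and select the first `3` as `"strongest/group"`; and
sort the Pokémon within a type by name and select the first `3` as `"sorted_by_alphabet"`.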

Python Rust

over

result = pokemon.sort("Type 1").select(
    pl.col("Type 1").head(3).over("Type 1", mapping_strategy="explode"),
    pl.col("Name")
    .sort_by(pl.col("Speed"), descending=True)
    .head(3)
    .over("Type 1", mapping_strategy="explode")
    .alias("fastest/group"),
    pl.col("Name")
    .sort_by(pl.col("Attack"), descending=True)
    .head(3)
    .over("Type 1", mapping_strategy="explode")
    .alias("strongest/group"),
    pl.col("Name")
    .sort()
    .head(3)
    .over("Type 1", mapping_strategy="explode")
    .alias("sorted_by_alphabet"),
)
print(result)

over

let result = df
    .clone()
    .lazy()
    .select([
        col("Type 1")
            .head(Some(3))
            .over_with_options(Some(["Type 1"]), None, WindowMapping::Explode)?
            .flatten(),
        col("Name")
            .sort_by(
                ["Speed"],
                SortMultipleOptions::default().with_order_descending(true),
            )
            .head(Some(3))
            .over_with_options(Some(["Type 1"]), None, WindowMapping::Explode)?
            .flatten()
            .alias("fastest/group"),
        col("Name")
            .sort_by(
                ["Attack"],
                SortMultipleOptions::default().with_order_descending(true),
            )
            .head(Some(3))
            .over_with_options(Some(["Type 1"]), None, WindowMapping::Explode)?
            .flatten()
            .alias("strongest/group"),
        col("Name")
            .sort(Default::default())
            .head(Some(3))
            .over_with_options(Some(["Type 1"]), None, WindowMapping::Explode)?
            .flatten()
            .alias("sorted_by_alphabet"),
    ])
    .collect()?;
println!("{result:?}");

shape: (43, 4)
┌────────┬───────────────────────┬───────────────────────┬─────────────────────────┐
│ Type 1 ┆ fastest/group         ┆ strongest/group       ┆ sorted_by_alphabet      │
│ ---    ┆ ---                   ┆ ---                   ┆ ---                     │
│ enum   ┆ str                   ┆ str                   ┆ str                     │
╞════════╪═══════════════════════╪═══════════════════════╪═════════════════════════╡
│ Grass  ┆ Venusaur              ┆ Victreebel            ┆ Bellsprout              │
│ Grass  ┆ VenusaurMega Venusaur ┆ VenusaurMega Venusaur ┆ Bulbasaur               │
│ Grass  ┆ Victreebel            ┆ Exeggutor             ┆ Exeggcute               │
│ Water  ┆ Starmie               ┆ GyaradosMega Gyarados ┆ Blastoise               │
│ Water  ┆ Tentacruel            ┆ Kingler               ┆ BlastoiseMega Blastoise │
│ …      ┆ …                     ┆ …                     ┆ …                       │
│ Rock   ┆ Kabutops              ┆ Kabutops              ┆ Geodude                 │
│ Ice    ┆ Jynx                  ┆ Articuno              ┆ Articuno                │
│ Ice    ┆ Articuno              ┆ Jynx                  ┆ Jynx                    │
│ Fairy  ┆ Clefable              ┆ Clefable              ┆ Clefable                │
│ Fairy  ┆ Clefairy              ┆ Clefairy              ┆ Clefairy                │
└────────┴───────────────────────┴───────────────────────┴─────────────────────────┘