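Version 1

Breaking changes

Properly apply `strict` parameter in Series constructor

The behavior of the Series constructor has been updated. In general it is now stricter, unless the user passes strict=False.

Strict construction is more efficient than non-strict construction, so pass values of the same data type to the constructor for the best performance.

Example

Before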

>>> s = pl.Series([1, 2, 3.5])
shape: (3,)
Series: '' [f64]
[
        1.0
        2.0
        3.5
]
>>> s = pl.Series([1, 2, 3.5], strict=False)
shape: (3,)
Series: '' [i64]
[
        1
        2
        null
]
>>> s = pl.Series([1, 2, 3.5], strict=False, dtype=pl.Int8)
Series: '' [i8]
[
        1
        2
        null
]

After

>>> s = pl.Series([1, 2, 3.5])
Traceback (most recent call last):
...
TypeError: unexpected value while building Series of type Int64; found value of type Float64: 3.5

Hint: Try setting `strict=False` to allow passing data with mixed types.
>>> s = pl.Series([1, 2, 3.5], strict=False)
shape: (3,)
Series: '' [f64]
[
        1.0
        2.0
        3.5
]
>>> s = pl.Series([1, 2, 3.5], strict=False, dtype=pl.Int8)
Series: '' [i8]
[
        1
        2
        3
]

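Change data orientation inference logic for DataFrame construction

Polars no longer inspects data types to infer the orientation of the data passed to the DataFrame constructor. Data orientation is inferred from the dimensions of the data and the schema.

Additionally, a warning is raised whenever row orientation is inferred. Because of some confusing edge cases, users should pass orient="row" to make explicit that their input is row-based.

Example

Before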

>>> data = [[1, "a"], [2, "b"]]
>>> pl.DataFrame(data)
shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i64      ┆ str      │
╞══════════╪══════════╡
│ 1        ┆ a        │
│ 2        ┆ b        │
└──────────┴──────────┘

After

>>> pl.DataFrame(data)
Traceback (most recent call last):
...
TypeError: unexpected value while building Series of type Int64; found value of type String: "a"

Hint: Try setting `strict=False` to allow passing data with mixed types.

Use instead

>>> pl.DataFrame(data, orient="row")
shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i64      ┆ str      │
╞══════════╪══════════╡
│ 1        ┆ a        │
│ 2        ┆ b        │
└──────────┴──────────┘

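Consistently convert to given time zone in Series constructor

Danger

This change may silently impact the results of your pipelines. If you work with time zones, make sure to account for this change.

Handling of time zone information in the Series and DataFrame constructors was inconsistent. Row-wise construction would convert to the given time zone, while column-wise construction would replace the time zone. The inconsistency has been fixed by always converting to the time zone specified in the data type.

Example

Before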

>>> from datetime import datetime
>>> pl.Series([datetime(2020, 1, 1)], dtype=pl.Datetime('us', 'Europe/Amsterdam'))
shape: (1,)
Series: '' [datetime[μs, Europe/Amsterdam]]
[
        2020-01-01 00:00:00 CET
]

After

>>> from datetime import datetime
>>> pl.Series([datetime(2020, 1, 1)], dtype=pl.Datetime('us', 'Europe/Amsterdam'))
shape: (1,)
Series: '' [datetime[μs, Europe/Amsterdam]]
[
        2020-01-01 01:00:00 CET
]
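
The difference between replacing a time zone and converting to it can be reproduced with the standard library alone. The sketch below is plain Python, not Polars code. It assumes, as the After output above implies, that a naive datetime is read as UTC before conversion, and it uses a fixed UTC+1 offset as a stand-in for Europe/Amsterdam in January so that it does not depend on a time zone database.

```python
from datetime import datetime, timedelta, timezone

# Stand-in for Europe/Amsterdam in winter (CET = UTC+1). The fixed offset is
# an assumption made to keep the example independent of tzdata.
CET = timezone(timedelta(hours=1), "CET")

naive = datetime(2020, 1, 1)

# "Replace": keep the wall-clock time and attach the zone (old Series behavior).
replaced = naive.replace(tzinfo=CET)

# "Convert": read the naive value as UTC, then shift it into the zone
# (the behavior that now applies everywhere).
converted = naive.replace(tzinfo=timezone.utc).astimezone(CET)

print(replaced.isoformat())   # 2020-01-01T00:00:00+01:00
print(converted.isoformat())  # 2020-01-01T01:00:00+01:00
```

The two results are different instants, one hour apart, which is why this change can silently alter pipeline output.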

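Update some error types to more appropriate variants

We have updated many error types to represent the underlying problem more accurately. Most commonly, ComputeError types were changed to InvalidOperationError or SchemaError.

Example

Before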

>>> s = pl.Series("a", [100, 200, 300])
>>> s.cast(pl.UInt8)
Traceback (most recent call last):
...
polars.exceptions.ComputeError: conversion from `i64` to `u8` failed in column 'a' for 1 out of 3 values: [300]

After

>>> s.cast(pl.UInt8)
Traceback (most recent call last):
...
polars.exceptions.InvalidOperationError: conversion from `i64` to `u8` failed in column 'a' for 1 out of 3 values: [300]

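Update `read/scan_parquet` to disable Hive partitioning by default for file inputs

Parquet reading functions now also support directory inputs. Hive partitioning is enabled by default for directories, but is now disabled by default for file inputs. File inputs include single files, globs, and lists of files. Explicitly pass hive_partitioning=True to restore the previous behavior.

Example

Before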

>>> pl.read_parquet("dataset/a=1/foo.parquet")
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ x   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 1.0 │
│ 1   ┆ 2.0 │
└─────┴─────┘

After

>>> pl.read_parquet("dataset/a=1/foo.parquet")
shape: (2, 1)
┌─────┐
│ x   │
│ --- │
│ f64 │
╞═════╡
│ 1.0 │
│ 2.0 │
└─────┘
>>> pl.read_parquet("dataset/a=1/foo.parquet", hive_partitioning=True)
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ x   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 1.0 │
│ 1   ┆ 2.0 │
└─────┴─────┘

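Update `reshape` to return Array types instead of List types

reshape now returns an Array type instead of a List type.

Users can restore the old functionality by calling .arr.to_list() on the output. Note that this is no more expensive than creating a List type directly, because reshaping into an array is essentially free.

Example

Before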

>>> s = pl.Series([1, 2, 3, 4, 5, 6])
>>> s.reshape((2, 3))
shape: (2,)
Series: '' [list[i64]]
[
        [1, 2, 3]
        [4, 5, 6]
]

After

>>> s.reshape((2, 3))
shape: (2,)
Series: '' [array[i64, 3]]
[
        [1, 2, 3]
        [4, 5, 6]
]

Read 2D NumPy arrays as `Array` type instead of `List` type

The Series constructor now parses 2D NumPy arrays as an Array type rather than a List type.

Example

Before

>>> import numpy as np
>>> arr = np.array([[1, 2], [3, 4]])
>>> pl.Series(arr)
shape: (2,)
Series: '' [list[i64]]
[
        [1, 2]
        [3, 4]
]

After

>>> import numpy as np
>>> arr = np.array([[1, 2], [3, 4]])
>>> pl.Series(arr)
shape: (2,)
Series: '' [array[i64, 2]]
[
        [1, 2]
        [3, 4]
]

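Split `replace` functionality into two separate methods

The API for replace has proven confusing to many users, particularly with regard to the default argument and the resulting data type.

It has been split into two methods: replace and replace_strict. replace now always keeps the existing data type (a breaking change, see the example below) and is meant for replacing some values in an existing column. Its parameters default and return_dtype have been deprecated.

The new method replace_strict is meant for creating a new column, mapping some or all of the values of the original column, and optionally specifying a default value. If no default is provided, it raises an error when any non-null value is not mapped.

Example

Before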

>>> s = pl.Series([1, 2, 3])
>>> s.replace(1, "a")
shape: (3,)
Series: '' [str]
[
        "a"
        "2"
        "3"
]

After

>>> s.replace(1, "a")
Traceback (most recent call last):
...
polars.exceptions.InvalidOperationError: conversion from `str` to `i64` failed in column 'literal' for 1 out of 1 values: ["a"]
>>> s.replace_strict(1, "a", default=s)
shape: (3,)
Series: '' [str]
[
        "a"
        "2"
        "3"
]
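
The division of labor between the two methods can be pictured with a small plain-Python model. This is only a sketch of the semantics described above, not the Polars implementation: the function names mirror the real methods, but they operate on lists, do not model data types, raise ValueError rather than a Polars exception, and the handling of nulls when a default is given is an assumption.

```python
_NO_DEFAULT = object()  # sentinel meaning "no default was given"


def replace(values, mapping):
    """Swap the mapped values and leave everything else untouched."""
    return [mapping.get(v, v) for v in values]


def replace_strict(values, mapping, default=_NO_DEFAULT):
    """Map every value; unmapped values need a default, otherwise raise."""
    out = []
    for v in values:
        if v in mapping:
            out.append(mapping[v])
        elif default is not _NO_DEFAULT:
            out.append(default)  # assumption: also applied to unmapped nulls
        elif v is None:
            out.append(None)  # unmapped nulls are not an error
        else:
            raise ValueError(f"value {v!r} is not mapped and no default was given")
    return out


print(replace([1, 2, 3], {1: 10}))                          # [10, 2, 3]
print(replace_strict([1, 2, 3], {1: "a"}, default="other"))  # ['a', 'other', 'other']
```

Calling the model's replace_strict([1, 2, 3], {1: "a"}) without a default raises, which mirrors the rule that every non-null value must be mapped.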

Preserve nulls in `ewm_mean`, `ewm_std`, and `ewm_var`

Polars no longer forward-fills null values in the ewm methods. Users can call .forward_fill() on the output to get the previous result.

Example

Before

>>> s = pl.Series([1, 4, None, 3])
>>> s.ewm_mean(alpha=.9, ignore_nulls=False)
shape: (4,)
Series: '' [f64]
[
        1.0
        3.727273
        3.727273
        3.007913
]

After

>>> s.ewm_mean(alpha=.9, ignore_nulls=False)
shape: (4,)
Series: '' [f64]
[
        1.0
        3.727273
        null
        3.007913
]
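
The numbers above can be reproduced with a short plain-Python reference, which also shows that forward-filling the new output recovers the old one. This is a sketch under stated assumptions: it models the adjusted weighting with ignore_nulls=False (a null still ages the earlier observations), and the helper names ewm_mean and forward_fill are local functions written for this illustration, not the Polars methods.

```python
def ewm_mean(values, alpha):
    """Adjusted EWM mean in which nulls still age the older observations."""
    out = []
    weighted_sum = 0.0
    weight_total = 0.0
    for x in values:
        # Every position, null or not, decays the earlier weights.
        weighted_sum *= 1 - alpha
        weight_total *= 1 - alpha
        if x is None:
            out.append(None)  # new behavior: the null is preserved
            continue
        weighted_sum += x
        weight_total += 1.0
        out.append(weighted_sum / weight_total)
    return out


def forward_fill(values):
    """Replace each null with the most recent non-null value."""
    out, last = [], None
    for x in values:
        if x is not None:
            last = x
        out.append(last)
    return out


result = ewm_mean([1, 4, None, 3], alpha=0.9)
filled = forward_fill(result)
print([None if v is None else round(v, 6) for v in result])
print([round(v, 6) for v in filled])
```

The first printed list matches the After output and the second matches the Before output.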

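Update `clip` to no longer propagate nulls in the given bounds

Null values in the bounds no longer set the value to null. Instead, the original value is retained.

Before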

>>> df = pl.DataFrame({"a": [0, 1, 2], "min": [1, None, 1]})
>>> df.select(pl.col("a").clip("min"))
shape: (3, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ null │
│ 2    │
└──────┘

After

>>> df.select(pl.col("a").clip("min"))
shape: (3, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 1    │
│ 2    │
└──────┘
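
The new rule for a lower bound can be written as a tiny plain-Python model. This is an illustrative sketch only: clip_lower is a hypothetical helper on lists, not a Polars function, and it covers just the lower-bound case from the example above.

```python
def clip_lower(values, lower_bounds):
    """Clip each value to its lower bound; a null bound leaves the value as is."""
    out = []
    for value, bound in zip(values, lower_bounds):
        if value is None or bound is None:
            out.append(value)  # nothing to clip against: keep the original
        else:
            out.append(max(value, bound))
    return out


clipped = clip_lower([0, 1, 2], [1, None, 1])
print(clipped)  # [1, 1, 2]
```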

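Change `str.to_datetime` to default to microsecond precision for format specifiers `"%f"` and `"%.f"`

In .str.to_datetime, specifying %.f as the format used to set the resulting data type to nanosecond precision by default. This has been changed to microsecond precision.

Example

Before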

>>> s = pl.Series(["2022-08-31 00:00:00.123456789"])
>>> s.str.to_datetime(format="%Y-%m-%d %H:%M:%S%.f")
shape: (1,)
Series: '' [datetime[ns]]
[
        2022-08-31 00:00:00.123456789
]

After

>>> s.str.to_datetime(format="%Y-%m-%d %H:%M:%S%.f")
shape: (1,)
Series: '' [datetime[μs]]
[
        2022-08-31 00:00:00.123456
]

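Update resulting column names in `pivot` when pivoting by multiple values

In DataFrame.pivot, when multiple values columns were specified, the resulting column names redundantly included the name of the columns column. This has been addressed.

Example

Before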

>>> df = pl.DataFrame(
...     {
...         "name": ["Cady", "Cady", "Karen", "Karen"],
...         "subject": ["maths", "physics", "maths", "physics"],
...         "test_1": [98, 99, 61, 58],
...         "test_2": [100, 100, 60, 60],
...     }
... )
>>> df.pivot(index='name', columns='subject', values=['test_1', 'test_2'])
shape: (2, 5)
┌───────┬──────────────────────┬────────────────────────┬──────────────────────┬────────────────────────┐
│ name  ┆ test_1_subject_maths ┆ test_1_subject_physics ┆ test_2_subject_maths ┆ test_2_subject_physics │
│ ---   ┆ ---                  ┆ ---                    ┆ ---                  ┆ ---                    │
│ str   ┆ i64                  ┆ i64                    ┆ i64                  ┆ i64                    │
╞═══════╪══════════════════════╪════════════════════════╪══════════════════════╪════════════════════════╡
│ Cady  ┆ 98                   ┆ 99                     ┆ 100                  ┆ 100                    │
│ Karen ┆ 61                   ┆ 58                     ┆ 60                   ┆ 60                     │
└───────┴──────────────────────┴────────────────────────┴──────────────────────┴────────────────────────┘

After

>>> df = pl.DataFrame(
...     {
...         "name": ["Cady", "Cady", "Karen", "Karen"],
...         "subject": ["maths", "physics", "maths", "physics"],
...         "test_1": [98, 99, 61, 58],
...         "test_2": [100, 100, 60, 60],
...     }
... )
>>> df.pivot('subject', index='name')
shape: (2, 5)
┌───────┬──────────────┬────────────────┬──────────────┬────────────────┐
│ name  ┆ test_1_maths ┆ test_1_physics ┆ test_2_maths ┆ test_2_physics │
│ ---   ┆ ---          ┆ ---            ┆ ---          ┆ ---            │
│ str   ┆ i64          ┆ i64            ┆ i64          ┆ i64            │
╞═══════╪══════════════╪════════════════╪══════════════╪════════════════╡
│ Cady  ┆ 98           ┆ 99             ┆ 100          ┆ 100            │
│ Karen ┆ 61           ┆ 58             ┆ 60           ┆ 60             │
└───────┴──────────────┴────────────────┴──────────────┴────────────────┘

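Note that the function signature has also changed:

columns has been renamed to on, and is now the first positional argument.
index and values are both optional. If index is not specified, all columns not specified in on and values are used. If values is not specified, all columns not specified in on and index are used.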

Support Decimal types by default when converting from Arrow

Conversion from Arrow now always converts Decimals into Polars Decimals, rather than casting to Float64. Config.activate_decimals has been removed.

Example

Before

>>> from decimal import Decimal as D
>>> import pyarrow as pa
>>> arr = pa.array([D("1.01"), D("2.25")])
>>> pl.from_arrow(arr)
shape: (2,)
Series: '' [f64]
[
        1.01
        2.25
]

After

>>> pl.from_arrow(arr)
shape: (2,)
Series: '' [decimal[3,2]]
[
        1.01
        2.25
]

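Remove serde functionality from `pl.read_json` and `DataFrame.write_json`

pl.read_json no longer supports reading JSON files produced by DataFrame.serialize. Users should use pl.DataFrame.deserialize instead.

DataFrame.write_json now only writes row-oriented JSON. The parameters row_oriented and pretty have been removed. Users should use DataFrame.serialize to serialize a DataFrame.

Example - write_json

Before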

>>> df = pl.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
>>> df.write_json()
'{"columns":[{"name":"a","datatype":"Int64","bit_settings":"","values":[1,2]},{"name":"b","datatype":"Float64","bit_settings":"","values":[3.0,4.0]}]}'

After

>>> df.write_json()  # Same behavior as previously `df.write_json(row_oriented=True)`
'[{"a":1,"b":3.0},{"a":2,"b":4.0}]'
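
To make "row-oriented" concrete, the same string can be built from column data with the standard library. This is a minimal sketch of the output shape only, assuming a plain dict of columns as a stand-in for the DataFrame; it says nothing about how Polars produces the JSON internally.

```python
import json

# Column-oriented input: one list per column (stand-in for the DataFrame above).
columns = {"a": [1, 2], "b": [3.0, 4.0]}

# Row-oriented form: one object per row, keyed by column name.
names = list(columns)
rows = [dict(zip(names, values)) for values in zip(*columns.values())]

row_json = json.dumps(rows, separators=(",", ":"))
print(row_json)  # [{"a":1,"b":3.0},{"a":2,"b":4.0}]
```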

Example - read_json

Before

>>> import io
>>> df_ser = '{"columns":[{"name":"a","datatype":"Int64","bit_settings":"","values":[1,2]},{"name":"b","datatype":"Float64","bit_settings":"","values":[3.0,4.0]}]}'
>>> pl.read_json(io.StringIO(df_ser))
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 3.0 │
│ 2   ┆ 4.0 │
└─────┴─────┘

After

>>> pl.read_json(io.StringIO(df_ser))  # Format no longer supported: data is treated as a single row
shape: (1, 1)
┌─────────────────────────────────┐
│ columns                         │
│ ---                             │
│ list[struct[4]]                 │
╞═════════════════════════════════╡
│ [{"a","Int64","",[1.0, 2.0]}, … │
└─────────────────────────────────┘

Use instead

>>> pl.DataFrame.deserialize(io.StringIO(df_ser))
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 3.0 │
│ 2   ┆ 4.0 │
└─────┴─────┘

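`Series.equals` no longer checks names by default

Previously, Series.equals would return False if the Series names did not match. The method now no longer checks the names by default. The previous behavior can be retained by setting check_names=True.

Example

Before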

>>> s1 = pl.Series("foo", [1, 2, 3])
>>> s2 = pl.Series("bar", [1, 2, 3])
>>> s1.equals(s2)
False

After

>>> s1.equals(s2)
True
>>> s1.equals(s2, check_names=True)
False

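Remove `columns` parameter from `nth` expression function

The columns parameter has been removed in favor of treating positional inputs as additional indices. Use Expr.get instead to get the same functionality.

Example

Before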

>>> df = pl.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
>>> df.select(pl.nth(1, "a"))
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 2   │
└─────┘

After

>>> df.select(pl.nth(1, "a"))
Traceback (most recent call last):
...
TypeError: argument 'indices': 'str' object cannot be interpreted as an integer

Use instead

>>> df.select(pl.col("a").get(1))
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 2   │
└─────┘

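Rename struct fields of `rle` output

The struct fields of the rle method have been renamed from lengths/values to len/value. The data type of the len field has also been updated to match the index type (previously Int32, now UInt32).

Before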

>>> s = pl.Series(["a", "a", "b", "c", "c", "c"])
>>> s.rle().struct.unnest()
shape: (3, 2)
┌─────────┬────────┐
│ lengths ┆ values │
│ ---     ┆ ---    │
│ i32     ┆ str    │
╞═════════╪════════╡
│ 2       ┆ a      │
│ 1       ┆ b      │
│ 3       ┆ c      │
└─────────┴────────┘

After

>>> s.rle().struct.unnest()
shape: (3, 2)
┌─────┬───────┐
│ len ┆ value │
│ --- ┆ ---   │
│ u32 ┆ str   │
╞═════╪═══════╡
│ 2   ┆ a     │
│ 1   ┆ b     │
│ 3   ┆ c     │
└─────┴───────┘

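Update `set_sorted` to only accept a single column

Calling set_sorted indicates that a column is sorted individually. Passing multiple columns indicates that each of those columns is also sorted individually. However, many users assumed this meant that the columns were sorted as a group, which led to incorrect results.

To help users avoid this pitfall, we removed the possibility of specifying multiple columns in set_sorted. To mark multiple columns as sorted, call set_sorted multiple times.

Example

Before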

>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0], "c": [9, 7, 8]})
>>> df.set_sorted("a", "b")

After

>>> df.set_sorted("a", "b")
Traceback (most recent call last):
...
TypeError: DataFrame.set_sorted() takes 2 positional arguments but 3 were given

Use instead

>>> df.set_sorted("a").set_sorted("b")
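
The distinction between "sorted as a group" and "each column sorted individually" is easy to check in plain Python. The data below is made up for this illustration and differs from the frame above; is_sorted is a hypothetical helper, not a Polars function.

```python
def is_sorted(column):
    """True if every element is <= its successor."""
    return all(x <= y for x, y in zip(column, column[1:]))


# Rows sorted as a group by (a, b): ties in "a" are broken by "b".
a = [1, 1, 2, 2]
b = [3, 4, 1, 2]
rows = list(zip(a, b))

print(is_sorted(rows))  # True: sorted as a group
print(is_sorted(a))     # True: "a" is sorted on its own
print(is_sorted(b))     # False: "b" is not sorted on its own
```

Marking "b" as sorted here would be wrong, which is the mistake the single-column signature is meant to prevent.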

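Default to raising on out-of-bounds indices in all `get`/`gather` operations

The default behavior of get and gather operations was inconsistent across different places. Now all such operations raise by default. Pass null_on_oob=True to restore the previous behavior.

Example

Before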

>>> s = pl.Series([[0, 1, 2], [0]])
>>> s.list.get(1)
shape: (2,)
Series: '' [i64]
[
        1
        null
]

After

>>> s.list.get(1)
Traceback (most recent call last):
...
polars.exceptions.ComputeError: get index is out of bounds

Use instead

>>> s.list.get(1, null_on_oob=True)
shape: (2,)
Series: '' [i64]
[
        1
        null
]
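
The role of the null_on_oob flag can be modeled in a few lines of plain Python. This is only a sketch: list_get is a hypothetical helper operating on nested lists, and it raises IndexError where Polars raises its own exception type.

```python
def list_get(rows, index, null_on_oob=False):
    """Take element `index` from every sub-list."""
    out = []
    for row in rows:
        if -len(row) <= index < len(row):
            out.append(row[index])
        elif null_on_oob:
            out.append(None)  # opt-in: out-of-bounds becomes null
        else:
            raise IndexError("get index is out of bounds")  # new default
    return out


values = list_get([[0, 1, 2], [0]], 1, null_on_oob=True)
print(values)  # [1, None]
```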

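Change default engine for `read_excel` to `"calamine"`

The calamine engine (available through the fastexcel package) was added to Polars relatively recently. It is much faster than the other engines and was already the default for xlsb and xls files. We now make it the default for all Excel files.

There may be subtle differences between this engine and the previous default (xlsx2csv). One clear difference is that the calamine engine does not support the engine_options parameter. If you cannot get the desired behavior with the calamine engine, specify engine="xlsx2csv" to restore the previous behavior.

Example

Before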

>>> pl.read_excel("data.xlsx", engine_options={"skip_empty_lines": True})

After

>>> pl.read_excel("data.xlsx", engine_options={"skip_empty_lines": True})
Traceback (most recent call last):
...
TypeError: read_excel() got an unexpected keyword argument 'skip_empty_lines'

Instead, explicitly specify the xlsx2csv engine or omit engine_options

>>> pl.read_excel("data.xlsx", engine="xlsx2csv", engine_options={"skip_empty_lines": True})

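Remove class variables from some DataTypes

Some DataType classes had class variables. The Datetime class, for example, had time_unit and time_zone as class variables. This was unintended: these should have been instance variables. This has now been corrected.

Example

Before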

>>> dtype = pl.Datetime
>>> dtype.time_unit is None
True

After

>>> dtype.time_unit is None
Traceback (most recent call last):
...
AttributeError: type object 'Datetime' has no attribute 'time_unit'

Use instead

>>> getattr(dtype, "time_unit", None) is None
True
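
The underlying Python distinction can be shown with two toy classes. These are hypothetical stand-ins written for this illustration, not the real Polars data types, but they behave the same way with respect to attribute lookup on the class versus on an instance.

```python
class OldStyleDatetime:
    # Class variables: visible on the class itself, even without an instance.
    time_unit = None
    time_zone = None

    def __init__(self, time_unit="us", time_zone=None):
        self.time_unit = time_unit
        self.time_zone = time_zone


class NewStyleDatetime:
    # Instance variables only: they exist once an object has been created.
    def __init__(self, time_unit="us", time_zone=None):
        self.time_unit = time_unit
        self.time_zone = time_zone


print(OldStyleDatetime.time_unit is None)                    # True
print(hasattr(NewStyleDatetime, "time_unit"))                # False
print(NewStyleDatetime("ms").time_unit)                      # ms
print(getattr(NewStyleDatetime, "time_unit", None) is None)  # True
```

The getattr form works for both an uninstantiated class and an instance, which is why it is the suggested replacement.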

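Change default `offset` in `group_by_dynamic` from 'negative `every`' to 'zero'

This affects the start of the first window in group_by_dynamic. The new behavior should align better with user expectations.

Example

Before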

>>> from datetime import date
>>> df = pl.DataFrame({
...     "ts": [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3)],
...     "value": [1, 2, 3],
... })
>>> df.group_by_dynamic("ts", every="1d", period="2d").agg("value")
shape: (4, 2)
┌────────────┬───────────┐
│ ts         ┆ value     │
│ ---        ┆ ---       │
│ date       ┆ list[i64] │
╞════════════╪═══════════╡
│ 2019-12-31 ┆ [1]       │
│ 2020-01-01 ┆ [1, 2]    │
│ 2020-01-02 ┆ [2, 3]    │
│ 2020-01-03 ┆ [3]       │
└────────────┴───────────┘

After

>>> df.group_by_dynamic("ts", every="1d", period="2d").agg("value")
shape: (3, 2)
┌────────────┬───────────┐
│ ts         ┆ value     │
│ ---        ┆ ---       │
│ date       ┆ list[i64] │
╞════════════╪═══════════╡
│ 2020-01-01 ┆ [1, 2]    │
│ 2020-01-02 ┆ [2, 3]    │
│ 2020-01-03 ┆ [3]       │
└────────────┴───────────┘
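
Both outputs above can be reproduced with a simplified plain-Python model of the windowing. This is a sketch under several assumptions and not the Polars algorithm: the timestamps are already aligned to `every` (so no truncation is needed), windows are closed on the left and open on the right, empty windows are dropped, and dynamic_windows is a hypothetical helper.

```python
from datetime import date, timedelta


def dynamic_windows(rows, every, period, offset):
    """Group (timestamp, value) pairs into windows [start, start + period)."""
    first = min(ts for ts, _ in rows)
    last = max(ts for ts, _ in rows)
    start = first + offset  # the default offset decides where the first window starts
    windows = []
    while start <= last:
        members = [v for ts, v in rows if start <= ts < start + period]
        if members:
            windows.append((start, members))
        start += every
    return windows


data = [(date(2020, 1, 1), 1), (date(2020, 1, 2), 2), (date(2020, 1, 3), 3)]
one_day = timedelta(days=1)

# Old default: offset = -every, so an extra leading window appears.
old = dynamic_windows(data, every=one_day, period=2 * one_day, offset=-one_day)
# New default: offset = 0, so the first window starts at the first timestamp.
new = dynamic_windows(data, every=one_day, period=2 * one_day, offset=timedelta(0))

print(len(old), old[0])  # 4 (datetime.date(2019, 12, 31), [1])
print(len(new), new[0])  # 3 (datetime.date(2020, 1, 1), [1, 2])
```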

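Change default serialization format of `LazyFrame/DataFrame/Expr`

Previously, the only serialization format available for the serialize/deserialize methods on Polars objects was JSON. We added a more optimized binary format and made it the default. JSON serialization is still available by passing format="json".

Example

Before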

>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> serialized = lf.serialize()
>>> serialized
'{"MapFunction":{"input":{"DataFrameScan":{"df":{"columns":[{"name":...'
>>> from io import StringIO
>>> pl.LazyFrame.deserialize(StringIO(serialized)).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘

After

>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> serialized = lf.serialize()
>>> serialized
b'\xa1kMapFunction\xa2einput\xa1mDataFrameScan\xa4bdf...'
>>> from io import BytesIO  # Note: using BytesIO instead of StringIO
>>> pl.LazyFrame.deserialize(BytesIO(serialized)).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘

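Constrain access to globals from `DataFrame.sql` in favor of `pl.sql`

The sql methods on DataFrame and LazyFrame can no longer access global variables. These methods should be used for operating on the frame itself. For global access, there is now the top-level sql function.

Example

Before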

>>> df1 = pl.DataFrame({"id1": [1, 2]})
>>> df2 = pl.DataFrame({"id2": [3, 4]})
>>> df1.sql("SELECT * FROM df1 CROSS JOIN df2")
shape: (4, 2)
┌─────┬─────┐
│ id1 ┆ id2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 1   ┆ 4   │
│ 2   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘

After

>>> df1.sql("SELECT * FROM df1 CROSS JOIN df2")
Traceback (most recent call last):
...
polars.exceptions.SQLInterfaceError: relation 'df1' was not found

Use instead

>>> pl.sql("SELECT * FROM df1 CROSS JOIN df2", eager=True)
shape: (4, 2)
┌─────┬─────┐
│ id1 ┆ id2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 1   ┆ 4   │
│ 2   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘

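Remove re-export of type aliases

We define many type aliases in the polars.type_aliases module. Some of these were re-exported at the top level and in the polars.datatypes module. These re-exports have been removed.

We plan to add a public polars.typing module in the future with a number of curated type aliases. Until then, define your own type aliases or import them from our polars.type_aliases module. Note that the type_aliases module is not technically public, so use it at your own risk.

Example

Before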

def foo(dtype: pl.PolarsDataType) -> None: ...

After

PolarsDataType = pl.DataType | type[pl.DataType]

def foo(dtype: PolarsDataType) -> None: ...

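Streamline optional dependency definitions in `pyproject.toml`

We revisited the optional dependency definitions and made some minor changes. If you were using the extras fastexcel, gevent, matplotlib, or async, this is a breaking change. Please update your Polars installation to use the new extras.

Example

Before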

pip install 'polars[fastexcel,gevent,matplotlib]'

After

pip install 'polars[calamine,async,graph]'

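Deprecations

Issue `PerformanceWarning` when LazyFrame properties `schema/dtypes/columns/width` are used

Recent improvements to the correctness of schema resolution in the lazy engine have had a significant impact on the cost of resolving the schema. It is no longer "free". In fact, in complex pipelines with lazy file reading, resolving the schema can be relatively expensive.

Because of this, the schema-related properties on LazyFrame were no longer good API design. Properties represent information that is already available and only needs to be retrieved. For the LazyFrame properties, however, accessing them could carry a significant performance cost.

To address this, we added the LazyFrame.collect_schema method, which retrieves the schema and returns a Schema object. The properties raise a PerformanceWarning and tell the user to use collect_schema instead. We chose not to deprecate the properties for now, to make it easier to write code that is generic over DataFrames and LazyFrames.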
