NumPy: how to do grouped reductions with np.ufunc.reduceat

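The core behavior of np.ufunc.reduceat is reduction over index-delimited slices: each entry of indices is the start of a half-open slice that ends at the next entry, the ufunc reduces every slice, and the last slice automatically extends to the end of the array. For grouping, the indices should be strictly increasing; where an entry is not smaller than the one after it, reduceat returns the single element at that index instead of reducing a slice.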

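np.ufunc.reduceat does not group by value. It cuts the array into half-open slices at the start positions you give it, then reduces each slice with the ufunc (np.add, np.maximum, and so on). The key point: group boundaries are set by the index array, not by the data values.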

For example, np.add.reduceat(a, [0, 2, 4]) is equivalent to [a[0]+a[1], a[2]+a[3], a[4]+a[5]+...] (assuming a has at least 6 elements).
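The equivalence can be checked directly. This is a minimal sketch on a made-up array; the names a and starts are illustrative and do not come from the article.

```python
import numpy as np

a = np.arange(8)        # [0, 1, 2, 3, 4, 5, 6, 7]
starts = [0, 2, 4]      # start index of each slice

# reduceat reduces a[0:2], a[2:4] and a[4:] in one call
sums = np.add.reduceat(a, starts)

# The same three reductions written out slice by slice
manual = np.array([a[0:2].sum(), a[2:4].sum(), a[4:].sum()])

# Any binary ufunc works the same way, e.g. a per-slice maximum
maxima = np.maximum.reduceat(a, starts)

print(sums, manual, maxima)
```

The last slice runs from index 4 to the end of the array, so no closing index is passed.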

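In real code you usually have labels such as group_ids = np.array([0,0,1,1,1,2]) and want to aggregate by value. You cannot pass group_ids straight to reduceat: it expects the index of the first element of each group.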

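First use np.unique to get the start position of each group. This assumes group_ids is already sorted, so that each group occupies one contiguous run:

_, idx_start = np.unique(group_ids, return_index=True)

Then reduce the target array with reduceat:

result = np.add.reduceat(values, idx_start)

Note: pass idx_start in full. It already holds exactly one start index per group, and reduceat extends the last segment to the end of the array on its own. Trimming with [:-1] is only needed when the index array is a list of boundaries whose final entry is len(values), which reduceat rejects as out of bounds.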
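When the labels are not sorted, the two steps above need a sort in front of them. The following is a sketch under that assumption; the sample data and the stable argsort step are additions for illustration and are not part of the original recipe.

```python
import numpy as np

# Hypothetical unsorted labels and the values to aggregate
group_ids = np.array([2, 0, 1, 0, 1, 1])
values = np.array([6.0, 1.0, 3.0, 2.0, 4.0, 5.0])

# Sort by label so that each group becomes one contiguous run
order = np.argsort(group_ids, kind="stable")
sorted_ids = group_ids[order]
sorted_vals = values[order]

# First index of each distinct label in the sorted array
labels, idx_start = np.unique(sorted_ids, return_index=True)

# One reduction per group; the last group runs to the end of the array
group_sums = np.add.reduceat(sorted_vals, idx_start)
group_max = np.maximum.reduceat(sorted_vals, idx_start)

print(labels, idx_start, group_sums, group_max)
```

The entries of group_sums and group_max line up with labels, so labels[i] names the group that produced result i.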

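np.bincount requires the group ids to be non-negative integers, preferably small ones because the output has max(id) + 1 entries, and the only reductions it offers are a count or a sum of weights. reduceat never looks at the labels, so it places no constraint on their type, and it works with any binary ufunc and any dtype that ufunc itself supports, for example np.maximum on floats.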
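A small side-by-side comparison, assuming sorted non-negative integer labels so that both functions apply. The data is made up for illustration.

```python
import numpy as np

group_ids = np.array([0, 0, 1, 1, 1, 2])
values = np.array([1.5, 2.5, 0.5, 4.0, 1.0, 3.0])

# bincount: labels index the output directly, reduction is always a sum
sums_bincount = np.bincount(group_ids, weights=values)

# reduceat: needs start indices, but accepts any binary ufunc
_, idx_start = np.unique(group_ids, return_index=True)
sums_reduceat = np.add.reduceat(values, idx_start)
mins_reduceat = np.minimum.reduceat(values, idx_start)

print(sums_bincount, sums_reduceat, mins_reduceat)
```

For plain sums the two agree; the per-group minimum has no bincount equivalent.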

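The point most easily missed in practice: the indices passed to reduceat must be exactly the position of the first element of each group, in ascending order. Most people's first instinct is to pass the group_id array itself, which produces output that has nothing to do with the intended groups.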
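The pitfall is easy to reproduce. In this sketch (illustrative data only) the label array is passed where start indices are expected, and the call succeeds silently.

```python
import numpy as np

group_ids = np.array([0, 0, 1, 1, 1, 2])
values = np.array([1, 2, 3, 4, 5, 6])

# Wrong: the labels are interpreted as slice start positions.
# No error is raised, and there is one output per label, not per group.
wrong = np.add.reduceat(values, group_ids)

# Right: pass the first index of each group
_, idx_start = np.unique(group_ids, return_index=True)
right = np.add.reduceat(values, idx_start)

print(wrong.shape, right.shape)
print(wrong, right)
```

A quick sanity check is the output length: a correct grouped reduction has one entry per distinct label, not one per input element.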