flox
This project explores strategies for fast GroupBy reductions with dask.array. It used to be called dask_groupby.
It was motivated by:
- Dask Dataframe GroupBy (blogpost)
- numpy_groupies in Xarray (issue)

(See a presentation about this package, from the Pangeo Showcase.)
Acknowledgements
This work was funded in part by
- NASA-ACCESS 80NSSC18M0156 "Community tools for analysis of NASA Earth Observing System Data in the Cloud" (PI J. Hamman, NCAR),
- NASA-OSTFL 80NSSC22K0345 "Enhancing analysis of NASA data with the open-source Python Xarray Library" (PIs Scott Henderson, University of Washington; Deepak Cherian, NCAR; Jessica Scheick, University of New Hampshire), and
- NCAR's Earth System Data Science Initiative.
It was motivated by very many discussions in the Pangeo community.
API
There are two main functions:
- flox.groupby_reduce(dask_array, by_dask_array, "mean"): "pure" dask array interface
- flox.xarray.xarray_reduce(xarray_object, by_dataarray, "mean"): "pure" xarray interface; though work is ongoing to integrate this package in xarray.
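
To make concrete what a grouped "mean" reduction is expected to compute, here is a minimal NumPy-only sketch for 1D data. It does not call flox, and the helper name `grouped_mean` is hypothetical; it only illustrates the result a call like the ones above is meant to produce, not how flox computes it.

```python
import numpy as np


def grouped_mean(array, by):
    """Reference grouped mean for 1D inputs (hypothetical helper, not part of flox)."""
    # Map each label to an integer code 0..ngroups-1; `groups` holds the sorted unique labels.
    groups, codes = np.unique(by, return_inverse=True)
    # Per-group sum and per-group count, then divide.
    sums = np.bincount(codes, weights=array, minlength=groups.size)
    counts = np.bincount(codes, minlength=groups.size)
    return sums / counts, groups


array = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
by = np.array(["a", "b", "a", "b", "c", "a"])

result, groups = grouped_mean(array, by)
for label, value in zip(groups, result):
    print(label, value)
```

The sketch returns one value per unique label in `by`, in sorted label order.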
Implementation
See the documentation for details on the implementation.
Custom reductions
flox implements all common reductions provided by numpy_groupies in aggregations.py.
It also allows you to specify a custom Aggregation (again inspired by dask.dataframe),
though this might not be fully functional at the moment. See aggregations.py
for examples.
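import numpy as np
from flox.aggregations import Aggregation

mean = Aggregation(
    name="mean",
    numpy="mean",
    # per-block intermediates, how to combine them, and the final step
    chunk=("sum", "count"),
    combine=("sum", "sum"),
    finalize=lambda sum_, count: sum_ / count,
    fill_value=0,
    final_fill_value=np.nan,
)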
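
The chunk, combine, and finalize entries above can be read as a three-step blockwise recipe. The following is a minimal NumPy-only sketch of that idea, not flox's implementation. The function names `chunk_stats`, `combine`, and `finalize` are hypothetical. It assumes 1D data, integer group codes, and a known number of groups.

```python
import numpy as np


def chunk_stats(values, codes, ngroups):
    """Per-block intermediates, mirroring chunk=("sum", "count")."""
    sums = np.bincount(codes, weights=values, minlength=ngroups)
    counts = np.bincount(codes, minlength=ngroups)
    return sums, counts


def combine(intermediates):
    """Merge per-block intermediates, mirroring combine=("sum", "sum")."""
    sums = np.sum([s for s, _ in intermediates], axis=0)
    counts = np.sum([c for _, c in intermediates], axis=0)
    return sums, counts


def finalize(sums, counts):
    """Final step, mirroring finalize=sum_ / count; empty groups become NaN."""
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, sums / counts, np.nan)


values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
codes = np.array([0, 1, 0, 1, 0, 1])
ngroups = 3  # group 2 never occurs in the data

# Split the data into two blocks, as a chunked array would be.
blocks = [(values[:4], codes[:4]), (values[4:], codes[4:])]
intermediates = [chunk_stats(v, c, ngroups) for v, c in blocks]
means = finalize(*combine(intermediates))
print(means)
```

In this sketch, groups absent from a block contribute 0 to both intermediates, which plays a role similar to fill_value=0, and groups absent from every block end up as NaN, similar to final_fill_value=np.nan.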