Flower Datasets
Flower Datasets (flwr-datasets) is a library to quickly and easily create datasets for federated learning, federated evaluation, and federated analytics. It was created by the Flower Labs team that also created Flower: A Friendly Federated AI Framework.
[!TIP]
For complete documentation, including API docs, how-to guides, and tutorials, please visit the Flower Datasets Documentation. For full FL examples, see the Flower Examples page.
Installation
For a complete installation guide, visit the Flower Datasets Documentation. To install the library with the vision extra:
pip install flwr-datasets[vision]
Overview
The Flower Datasets library supports:
- downloading datasets - choose the dataset from Hugging Face's datasets,
- partitioning datasets - customize the partitioning scheme,
- creating centralized datasets - leave parts of the dataset unpartitioned (e.g. for centralized evaluation).
Because it uses Hugging Face's datasets under the hood, Flower Datasets integrates with the following popular formats/frameworks:
- Hugging Face,
- PyTorch,
- TensorFlow,
- NumPy,
- Pandas,
- JAX,
- Arrow.
Create custom partitioning schemes or choose from the implemented partitioning schemes:
- Partitioner (the abstract base class): Partitioner
- IID partitioning: IidPartitioner(num_partitions)
- Dirichlet partitioning: DirichletPartitioner(num_partitions, partition_by, alpha)
- Distribution partitioning: DistributionPartitioner(distribution_array, num_partitions, num_unique_labels_per_partition, partition_by, preassigned_num_samples_per_label, rescale)
- InnerDirichlet partitioning: InnerDirichletPartitioner(partition_sizes, partition_by, alpha)
- Pathological partitioning: PathologicalPartitioner(num_partitions, partition_by, num_classes_per_partition, class_assignment_mode)
- Natural ID partitioning: NaturalIdPartitioner(partition_by)
- Size-based partitioning (the abstract base class for partitioners that divide the data based on the number of samples): SizePartitioner
- Linear partitioning: LinearPartitioner(num_partitions)
- Square partitioning: SquarePartitioner(num_partitions)
- Exponential partitioning: ExponentialPartitioner(num_partitions)
- more to come in future releases (contributions are welcome).
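The size-based partitioners in the list differ mainly in how partition size grows with the partition ID. As a rough illustration, the sketch below assumes sizes proportional to i, i squared, and exp(i) for partition number i = 1..n, with any rounding remainder given to the last partition. The function name partition_sizes and this exact sizing rule are assumptions made for illustration, not code taken from Flower Datasets.

```python
import math
from typing import Callable, List


def partition_sizes(
    num_samples: int,
    num_partitions: int,
    weight_fn: Callable[[int], float],
) -> List[int]:
    """Split num_samples into sizes proportional to weight_fn(1..num_partitions)."""
    weights = [weight_fn(i) for i in range(1, num_partitions + 1)]
    total = sum(weights)
    # Floor each share so that no partition is over-allocated.
    sizes = [int(num_samples * w / total) for w in weights]
    # Flooring can leave a few samples unassigned; hand them to the last partition.
    sizes[-1] += num_samples - sum(sizes)
    return sizes


# Hypothetical toy setting: 1000 samples divided among 4 partitions.
linear = partition_sizes(1000, 4, lambda i: i)
square = partition_sizes(1000, 4, lambda i: i ** 2)
exponential = partition_sizes(1000, 4, lambda i: math.exp(i))

print("linear:     ", linear)
print("square:     ", square)
print("exponential:", exponential)
```

In all three cases the sizes add up to the full dataset; what changes is how strongly the later partitions dominate.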
Figure: Comparison of Partitioning Schemes on CIFAR10
PS: This plot was generated using a library function (see the flwr_datasets.visualization package for more).
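To give a feel for what such a comparison shows, here is a minimal, library-free sketch that contrasts IID partitioning with Dirichlet-based label skew on toy labels and prints the per-partition label counts. The helper names (iid_partition, dirichlet_partition), the toy labels, and the way shares are rounded are all assumptions for illustration; the actual IidPartitioner and DirichletPartitioner may differ in their details.

```python
import random
from collections import Counter
from typing import Dict, List


def iid_partition(
    num_samples: int, num_partitions: int, rng: random.Random
) -> List[List[int]]:
    """Shuffle all sample indices and deal them out round-robin."""
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return [indices[p::num_partitions] for p in range(num_partitions)]


def dirichlet_partition(
    labels: List[int], num_partitions: int, alpha: float, rng: random.Random
) -> List[List[int]]:
    """For each label, split its samples across partitions using Dirichlet(alpha) shares."""
    by_label: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    partitions: List[List[int]] = [[] for _ in range(num_partitions)]
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        # A Dirichlet sample is a vector of Gamma(alpha, 1) draws normalised to sum to 1.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_partitions)]
        total = sum(draws) or 1.0  # guard against every draw underflowing to 0
        start, cumulative = 0, 0.0
        for p, draw in enumerate(draws):
            cumulative += draw / total
            if p == num_partitions - 1:
                end = len(idxs)  # the last partition takes whatever remains
            else:
                end = min(len(idxs), int(round(cumulative * len(idxs))))
            end = max(end, start)
            partitions[p].extend(idxs[start:end])
            start = end
    return partitions


# Toy data: 1000 samples with 10 balanced labels (a stand-in, not CIFAR10 itself).
labels = [i % 10 for i in range(1000)]
rng = random.Random(42)
iid_parts = iid_partition(len(labels), 5, rng)
dir_parts = dirichlet_partition(labels, 5, alpha=0.3, rng=rng)

for name, parts in (("iid", iid_parts), ("dirichlet", dir_parts)):
    print(name)
    for pid, part in enumerate(parts):
        counts = Counter(labels[i] for i in part)
        print(f"  partition {pid}: {dict(sorted(counts.items()))}")
```

With a small alpha, each label tends to concentrate in a few partitions, whereas the IID split gives every partition a roughly uniform label mix.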
Usage
Flower Datasets exposes the FederatedDataset abstraction to represent the dataset needed for federated learning/evaluation/analytics. It has two main methods that handle the dataset preprocessing: load_partition(partition_id, split), which returns a single partition of a split, and load_split(split), which returns a whole split.
Here's a basic quickstart example of how to partition the MNIST dataset:
from flwr_datasets import FederatedDataset
from flwr_datasets.partitioners import IidPartitioner

# The train split of the MNIST dataset will be partitioned into 100 partitions
partitioner = IidPartitioner(num_partitions=100)
fds = FederatedDataset("ylecun/mnist", partitioners={"train": partitioner})
# Load the first partition (partition_id=0) of the train split
partition = fds.load_partition(0)
# Load the whole test split, unpartitioned, e.g. for centralized evaluation
centralized_data = fds.load_split("test")
For more details, please refer to the specific how-to guides or tutorials. They showcase customization and more advanced features.
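To make the roles of load_partition and load_split concrete without downloading anything, the following toy model mimics the idea with plain Python lists. ToyFederatedDataset and ToyIidPartitioner are hypothetical stand-ins, not the library's classes, and the behaviour when split is omitted (fall back to the only partitioned split) is an assumption inferred from the quickstart call fds.load_partition(0).

```python
from typing import Dict, List, Optional, Sequence


class ToyIidPartitioner:
    """Hypothetical stand-in: deals rows into partitions round-robin."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions

    def load_partition(self, rows: Sequence[int], partition_id: int) -> List[int]:
        if not 0 <= partition_id < self.num_partitions:
            raise ValueError(f"partition_id must be in [0, {self.num_partitions})")
        return list(rows[partition_id :: self.num_partitions])


class ToyFederatedDataset:
    """Hypothetical stand-in: some splits are partitioned, the rest stay whole."""

    def __init__(
        self,
        splits: Dict[str, Sequence[int]],
        partitioners: Dict[str, ToyIidPartitioner],
    ) -> None:
        self._splits = splits
        self._partitioners = partitioners

    def load_partition(self, partition_id: int, split: Optional[str] = None) -> List[int]:
        # Assumption: with exactly one partitioned split, the split name may be omitted.
        if split is None:
            if len(self._partitioners) != 1:
                raise ValueError("split must be given when several splits are partitioned")
            split = next(iter(self._partitioners))
        return self._partitioners[split].load_partition(self._splits[split], partition_id)

    def load_split(self, split: str) -> List[int]:
        # Return the whole split, e.g. for centralized evaluation.
        return list(self._splits[split])


# Toy data: integers stand in for dataset rows.
toy_splits = {"train": list(range(10)), "test": list(range(100, 104))}
toy_fds = ToyFederatedDataset(
    toy_splits, partitioners={"train": ToyIidPartitioner(num_partitions=5)}
)

toy_partition = toy_fds.load_partition(0)
toy_centralized = toy_fds.load_split("test")
print(toy_partition)
print(toy_centralized)
```

The point of the design is that each federated client only ever asks for its own partition ID, while evaluation code can still request an unpartitioned split from the same object.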
Future release
Here are a few of the things that we will work on in future releases:
- ✅ Support for more datasets (especially the ones that have user id present).
- ✅ Creation of custom Partitioners.
- ✅ More out-of-the-box Partitioners.
- ✅ Passing Partitioners via FederatedDataset's partitioners argument.
- ✅ Customization of the dataset splitting before the partitioning.
- ✅ Simplification of the dataset transformation to the popular frameworks/types.
- Creation of synthetic data.
- Support for Vertical FL.