**Package renamed to torchdatasets!**

- Use `map`, `apply`, `reduce` or `filter` directly on `Dataset` objects
- `cache` data in RAM/disk or via your own method (partial caching supported)
- Full PyTorch's `Dataset` and `IterableDataset` support
- General `torchdatasets.maps` like `Flatten` or `Select`
- Extensible interface (your own cache methods, cache modifiers, maps etc.)
- Useful `torchdatasets.datasets` classes designed for general tasks (e.g. file reading)
- Support for `torchvision` datasets (e.g. `ImageFolder`, `MNIST`, `CIFAR10`) via `td.datasets.WrapDataset`
- Minimal overhead (single call to `super().__init__()`)
## :bulb: Examples

Check documentation here: https://szymonmaszke.github.io/torchdatasets

**General example**

- Create image dataset, convert it to Tensors, cache and concatenate with smoothed labels:
```python
import pathlib

import torchdatasets as td
import torchvision
from PIL import Image


class Images(td.Dataset):  # td.Dataset subclasses torch.utils.data.Dataset
    def __init__(self, path: str):
        super().__init__()  # the only extra call torchdatasets requires
        self.files = [file for file in pathlib.Path(path).glob("*")]

    def __getitem__(self, index):
        return Image.open(self.files[index])

    def __len__(self):
        return len(self.files)


images = Images("./data").map(torchvision.transforms.ToTensor()).cache()
```
You can concatenate the above dataset with another (say `labels`) and iterate over them as usual:

```python
for data, label in images | labels:
    # ...
```
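To make the idea behind `|` concrete, here is a minimal pure-Python sketch of zipping two dataset-like objects together; the `Zipped` class below is an illustrative stand-in, not torchdatasets' actual implementation:

```python
class Zipped:
    """Pair up samples from two indexable datasets (sketch of what `a | b` yields)."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    def __getitem__(self, index):
        # Each sample is a tuple of corresponding elements from both datasets
        return self.left[index], self.right[index]

    def __len__(self):
        return min(len(self.left), len(self.right))


images = ["img0", "img1"]
labels = [0, 1]
for data, label in Zipped(images, labels):
    print(data, label)
```

Iteration works because Python falls back to the sequence protocol (`__getitem__` until `IndexError`) when no `__iter__` is defined.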
- Cache the first `1000` samples in memory, save the rest on disk in the folder `./cache`:

```python
images = (
    ImageDataset.from_folder("./data").map(torchvision.transforms.ToTensor())
    # First 1000 samples are cached in RAM
    .cache(td.modifiers.UpToIndex(1000, td.cachers.Memory()))
    # Remaining samples are pickled to disk
    .cache(td.modifiers.FromIndex(1000, td.cachers.Pickle("./cache")))
)
```
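The modifier mechanism can be approximated in a few lines; `UpToIndex` below is an illustrative stand-in for the real `td.modifiers.UpToIndex`, using a plain dict where torchdatasets would use a cacher object:

```python
class UpToIndex:
    """Delegate caching to `cacher` only for indices below `last_index`
    (hypothetical sketch of the modifier idea, not the real class)."""

    def __init__(self, last_index, cacher):
        self.last_index, self.cacher = last_index, cacher

    def __contains__(self, index):
        return index < self.last_index and index in self.cacher

    def __setitem__(self, index, sample):
        if index < self.last_index:  # silently skip out-of-range indices
            self.cacher[index] = sample

    def __getitem__(self, index):
        return self.cacher[index]


memory = {}  # a dict serves as an in-memory cacher here
cache = UpToIndex(1000, memory)
cache[5] = "sample five"   # stored: below the threshold
cache[5000] = "ignored"    # skipped: above the threshold
print(5 in cache, 5000 in cache)  # True False
```

Chaining a second cache with a complementary modifier (as in the example above) then covers the remaining index range.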
To see what else you can do, please check the torchdatasets documentation.

## Integration with torchvision

Using `torchdatasets` you can easily split `torchvision` datasets and apply augmentations only to the training part of the data without any trouble:
```python
import torch
import torchvision

import torchdatasets as td

dataset = td.datasets.WrapDataset(torchvision.datasets.ImageFolder("./images"))

# Lengths passed to random_split must sum exactly to len(dataset)
train_size = int(0.6 * len(dataset))
validation_size = int(0.2 * len(dataset))
test_size = len(dataset) - train_size - validation_size

train_dataset, validation_dataset, test_dataset = torch.utils.data.random_split(
    dataset,
    (train_size, validation_size, test_size),
)

train_dataset.map(
    td.maps.To(
        torchvision.transforms.Compose(
            [
                torchvision.transforms.RandomResizedCrop(224),
                torchvision.transforms.RandomHorizontalFlip(),
                torchvision.transforms.ToTensor(),
                torchvision.transforms.Normalize(
                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
                ),
            ]
        )
    ),
    0,
)
```
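The trailing `0` appears to target the first element of each `(image, label)` sample, so the label passes through untouched. That element-targeted mapping can be sketched with a small helper; `apply_to` below is a hypothetical illustration, not part of the torchdatasets API:

```python
def apply_to(fn, sample, index):
    """Apply fn only to sample[index], leaving the other elements intact
    (illustrative helper, not a torchdatasets function)."""
    items = list(sample)
    items[index] = fn(items[index])
    return tuple(items)


# The transformation hits only the chosen element of the sample:
print(apply_to(str.upper, ("cat", 3), 0))  # ('CAT', 3)
```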
Please note that you can use `td.datasets.WrapDataset` with any existing `torch.utils.data.Dataset` instance to give it additional caching and mapping powers!
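Conceptually, such a wrapper only needs to delegate indexing and length to the wrapped dataset while layering the extra chainable methods on top. A rough pure-Python sketch (the real `td.datasets.WrapDataset` is far more featureful):

```python
class WrapDataset:
    """Wrap any object with __getitem__/__len__ and add a chainable .map
    (illustrative sketch, not the actual td.datasets.WrapDataset)."""

    def __init__(self, dataset):
        self.dataset = dataset
        self._maps = []  # transformations applied lazily per sample

    def map(self, fn):
        self._maps.append(fn)
        return self  # returning self makes calls chainable

    def __getitem__(self, index):
        sample = self.dataset[index]
        for fn in self._maps:
            sample = fn(sample)
        return sample

    def __len__(self):
        return len(self.dataset)


# Any indexable object gains mapping powers, here a plain list:
wrapped = WrapDataset([1, 2, 3]).map(lambda x: x + 1)
print([wrapped[i] for i in range(len(wrapped))])  # [2, 3, 4]
```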
## :wrench: Installation

### :snake: pip

Latest release:

```shell
pip install --user torchdatasets
```

Nightly:

```shell
pip install --user torchdatasets-nightly
```
### :whale2: Docker

CPU standalone and various versions of GPU-enabled images are available at dockerhub.

For a CPU quickstart, issue:

```shell
docker pull szymonmaszke/torchdatasets:18.04
```

Nightly builds are also available, just prefix the tag with `nightly_`. If you are going for a GPU image, make sure you have nvidia/docker installed and its runtime set.
## :question: Contributing

If you find an issue or think some functionality may be useful to others and fits this library, please open a new Issue or create a Pull Request.

To get an overview of things you can do to help this project, see the Roadmap.