
torchdatasets-nightly
A PyTorch-based library focused on data processing and input pipelines in general.
- Use map, apply, reduce or filter directly on Dataset objects
- Cache data in RAM/disk or via your own method (partial caching supported)
- Dataset and IterableDataset support
- torchdatasets.maps like Flatten or Select
- torchdatasets.datasets classes designed for general tasks (e.g. file reading)
- torchvision datasets (e.g. ImageFolder, MNIST, CIFAR10) supported via td.datasets.WrapDataset
- Minimal overhead (a single call to super().__init__())
Check documentation here: https://szymonmaszke.github.io/torchdatasets
import pathlib

import torchdatasets as td
import torchvision
from PIL import Image


class Images(td.Dataset):  # Different inheritance
    def __init__(self, path: str):
        super().__init__()  # This is the only change
        self.files = list(pathlib.Path(path).glob("*"))

    def __getitem__(self, index):
        return Image.open(self.files[index])

    def __len__(self):
        return len(self.files)


# Convert every image to a tensor and cache the results
images = Images("./data").map(torchvision.transforms.ToTensor()).cache()
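To make the idea of chaining map and cache more concrete, here is a minimal pure-Python sketch. It is not how torchdatasets implements these methods: `LazyDataset` and everything inside it are hypothetical names, and the sketch assumes that mapped functions run on item access and that cached results are stored per index.

```python
from typing import Any, Callable, Dict, List, Sequence


class LazyDataset:
    """Hypothetical stand-in for a dataset with chainable map/cache."""

    def __init__(self, items: Sequence[Any]):
        self._items = items
        self._maps: List[Callable[[Any], Any]] = []
        self._cache: Dict[int, Any] = {}
        self._use_cache = False

    def map(self, function: Callable[[Any], Any]) -> "LazyDataset":
        # Record the function; it runs only when an item is requested.
        self._maps.append(function)
        return self

    def cache(self) -> "LazyDataset":
        # From now on, remember every computed sample by its index.
        self._use_cache = True
        return self

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, index: int) -> Any:
        if self._use_cache and index in self._cache:
            return self._cache[index]
        sample = self._items[index]
        for function in self._maps:
            sample = function(sample)
        if self._use_cache:
            self._cache[index] = sample
        return sample


calls: List[int] = []


def expensive_square(x: int) -> int:
    calls.append(x)  # Track how often the transform actually runs
    return x * x


dataset = LazyDataset([1, 2, 3]).map(expensive_square).cache()
first_pass = [dataset[i] for i in range(len(dataset))]
second_pass = [dataset[i] for i in range(len(dataset))]  # Served from the cache
```

In this sketch the transform runs once per sample even though the data is read twice, which is the benefit caching is meant to give for expensive steps such as image decoding.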
You can concatenate the images dataset with another (say, labels) and iterate over them as usual:
for data, label in images | labels:
    # Do whatever you want with your data
    ...
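The loop above unpacks each element into a (data, label) pair, which suggests that `|` combines datasets sample by sample. One way such an operator could be built is sketched below in plain Python. `Zippable` is a hypothetical class, not part of torchdatasets, and stopping at the shortest source is an assumption made for the sketch.

```python
from typing import Any, Sequence, Tuple


class Zippable:
    """Hypothetical dataset wrapper supporting `a | b` concatenation."""

    def __init__(self, *sources: Sequence[Any]):
        self._sources = sources

    def __or__(self, other: "Zippable") -> "Zippable":
        # Combine the underlying sources so each sample becomes a tuple.
        return Zippable(*self._sources, *other._sources)

    def __len__(self) -> int:
        # Assumption: the shortest source bounds the combined length.
        return min(len(source) for source in self._sources)

    def __getitem__(self, index: int) -> Tuple[Any, ...]:
        if not 0 <= index < len(self):
            # IndexError also ends a plain `for` loop over this object.
            raise IndexError(index)
        return tuple(source[index] for source in self._sources)


images = Zippable(["img0", "img1", "img2"])
labels = Zippable([0, 1, 0])

pairs = [(data, label) for data, label in images | labels]
```

Because the class defines only `__getitem__` and `__len__`, Python's fallback iteration protocol drives the loop, calling `__getitem__` with 0, 1, 2, ... until `IndexError` is raised.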
Cache the first 1000 samples in memory and save the rest on disk in the folder ./cache:
images = (
    ImageDataset.from_folder("./data").map(torchvision.transforms.ToTensor())
    # First 1000 samples in memory
    .cache(td.modifiers.UpToIndex(1000, td.cachers.Memory()))
    # Samples from index 1000 to the end are pickled to disk
    .cache(td.modifiers.FromIndex(1000, td.cachers.Pickle("./cache")))
    # You can define your own cachers and modifiers; see the docs
)
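The routing idea behind partial caching (some indices go to memory, the rest to disk) can be sketched with the standard library alone. The classes and the `cached_get` helper below are hypothetical and do not mirror the internals of td.cachers or td.modifiers. The sketch assumes the "up to" rule covers indices below the threshold and the "from" rule covers the threshold and above, matching the comments in the snippet.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple


class MemoryCacher:
    """Hypothetical cacher keeping samples in a dict."""

    def __init__(self) -> None:
        self._store: Dict[int, Any] = {}

    def __contains__(self, index: int) -> bool:
        return index in self._store

    def __getitem__(self, index: int) -> Any:
        return self._store[index]

    def __setitem__(self, index: int, sample: Any) -> None:
        self._store[index] = sample


class PickleCacher:
    """Hypothetical cacher writing one pickle file per sample."""

    def __init__(self, folder: Path) -> None:
        self._folder = folder
        self._folder.mkdir(parents=True, exist_ok=True)

    def _path(self, index: int) -> Path:
        return self._folder / f"{index}.pkl"

    def __contains__(self, index: int) -> bool:
        return self._path(index).exists()

    def __getitem__(self, index: int) -> Any:
        with self._path(index).open("rb") as file:
            return pickle.load(file)

    def __setitem__(self, index: int, sample: Any) -> None:
        with self._path(index).open("wb") as file:
            pickle.dump(sample, file)


Route = Tuple[Callable[[int], bool], Any]


def cached_get(index: int, compute: Callable[[int], Any], routes: List[Route]) -> Any:
    """Return sample `index`, using the first cacher whose rule accepts it."""
    for accepts, cacher in routes:
        if accepts(index):
            if index not in cacher:
                cacher[index] = compute(index)
            return cacher[index]
    return compute(index)  # No rule matched: compute without caching


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "cache"
    memory, disk = MemoryCacher(), PickleCacher(cache_dir)
    threshold = 2  # Stands in for the 1000 used in the text
    routes: List[Route] = [
        (lambda i: i < threshold, memory),  # plays the role of "up to index"
        (lambda i: i >= threshold, disk),   # plays the role of "from index"
    ]
    samples = [cached_get(i, lambda i: i * 10, routes) for i in range(4)]
    in_memory = [i for i in range(4) if i in memory]
    on_disk = sorted(int(path.stem) for path in cache_dir.glob("*.pkl"))
```

Keeping the index rule separate from the storage backend is what lets the two be mixed freely, which is roughly the split the snippet expresses with modifiers and cachers.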
To see what else you can do, please check the torchdatasets documentation.
Using torchdatasets, you can easily split torchvision datasets and apply augmentation only to the training part of the data:
import torch
import torchvision
import torchdatasets as td

# Wrap torchvision dataset with WrapDataset
dataset = td.datasets.WrapDataset(torchvision.datasets.ImageFolder("./images"))

# Split dataset; integer lengths must sum to len(dataset)
train_size = int(0.6 * len(dataset))
validation_size = int(0.2 * len(dataset))
test_size = len(dataset) - train_size - validation_size
train_dataset, validation_dataset, test_dataset = torch.utils.data.random_split(
    dataset,
    (train_size, validation_size, test_size),
)

# Apply torchvision mappings ONLY to train dataset
train_dataset.map(
    td.maps.To(
        torchvision.transforms.Compose(
            [
                torchvision.transforms.RandomResizedCrop(224),
                torchvision.transforms.RandomHorizontalFlip(),
                torchvision.transforms.ToTensor(),
                torchvision.transforms.Normalize(
                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
                ),
            ]
        )
    ),
    # Apply this transformation to element 0 of each sample (the image);
    # element 1 is the label
    0,
)
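Each ImageFolder sample is an (image, label) pair, and the snippet targets index 0 so that only the image is augmented. The general pattern of transforming selected elements of a sample tuple can be sketched as follows. `apply_to` is a hypothetical helper written for illustration; it is not the implementation or signature of td.maps.To.

```python
from typing import Any, Callable, Tuple


def apply_to(function: Callable[[Any], Any], *indices: int):
    """Build a map that transforms only the chosen elements of a sample tuple."""
    chosen = set(indices)

    def mapper(sample: Tuple[Any, ...]) -> Tuple[Any, ...]:
        return tuple(
            function(element) if position in chosen else element
            for position, element in enumerate(sample)
        )

    return mapper


# str.upper stands in for an image transform; the label must stay untouched.
augment_image = apply_to(str.upper, 0)
result = augment_image(("image-bytes", 3))
```

Here `result` keeps the label as it was and changes only the first element, which is the behaviour you want when augmentation should never touch targets.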
Note that you can use td.datasets.WrapDataset with any existing torch.utils.data.Dataset
instance to give it additional caching and mapping powers!
Latest release:
pip install --user torchdatasets
Nightly build:
pip install --user torchdatasets-nightly
CPU-only and various GPU-enabled images are available on Docker Hub.
For a CPU quickstart, run:
docker pull szymonmaszke/torchdatasets:18.04
Nightly builds are also available; just prefix the tag with nightly_. If you want a GPU image, make sure you have
nvidia/docker installed and its runtime set.
If you find an issue, or you think some functionality may be useful to others and fits this library, please open a new issue or create a pull request.
For an overview of things you can do to help this project, see the Roadmap.