torchdatasets-nightly
PyTorch based library focused on data processing and input pipelines in general.
- map, apply, reduce or filter directly on Dataset objects
- cache data in RAM/disk or via your own method (partial caching supported)
- Dataset and IterableDataset support
- torchdatasets.maps like Flatten or Select
- torchdatasets.datasets classes designed for general tasks (e.g. file reading)
- torchvision datasets (e.g. ImageFolder, MNIST, CIFAR10) via td.datasets.WrapDataset
- Only one change to existing code (a call to super().__init__())
Check documentation here: https://szymonmaszke.github.io/torchdatasets
import pathlib

import torchdatasets as td
import torchvision
from PIL import Image


class Images(td.Dataset):  # Different inheritance
    def __init__(self, path: str):
        super().__init__()  # This is the only change
        self.files = [file for file in pathlib.Path(path).glob("*")]

    def __getitem__(self, index):
        return Image.open(self.files[index])

    def __len__(self):
        return len(self.files)


images = Images("./data").map(torchvision.transforms.ToTensor()).cache()
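To make the chaining above more concrete, here is a minimal pure-Python sketch of how a `map(...).cache()` chain could behave. `TinyDataset` is a hypothetical class written for illustration only; it is not the torchdatasets implementation, and it assumes that maps are applied lazily on access and that cached results are stored in a plain dictionary.

```python
from typing import Any, Callable, Dict, List


class TinyDataset:
    """Hypothetical stand-in for a map-and-cache dataset (not torchdatasets code)."""

    def __init__(self, items: List[Any]):
        self._items = items
        self._maps: List[Callable[[Any], Any]] = []
        self._cache: Dict[int, Any] = {}
        self._use_cache = False

    def map(self, function: Callable[[Any], Any]) -> "TinyDataset":
        # Register the function; it runs lazily when a sample is accessed
        self._maps.append(function)
        return self

    def cache(self) -> "TinyDataset":
        # From now on, remember every fully mapped sample in memory
        self._use_cache = True
        return self

    def __getitem__(self, index: int) -> Any:
        if self._use_cache and index in self._cache:
            return self._cache[index]
        sample = self._items[index]
        for function in self._maps:
            sample = function(sample)
        if self._use_cache:
            self._cache[index] = sample
        return sample

    def __len__(self) -> int:
        return len(self._items)


calls = []


def double(x: int) -> int:
    calls.append(x)  # Record each real computation
    return 2 * x


dataset = TinyDataset([1, 2, 3]).map(double).cache()
first_pass = [dataset[i] for i in range(len(dataset))]
second_pass = [dataset[i] for i in range(len(dataset))]
```

In this sketch the second pass is served entirely from the cache, so `double` runs only once per sample. A real library would also have to decide what happens to maps added after `cache()`; the sketch ignores that case.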
You can concatenate the above dataset with another (say labels) and iterate over them as usual:
for data, label in images | labels:
    # Do whatever you want with your data
    ...
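The loop above yields `(data, label)` pairs, which suggests that `|` pairs samples by index. The following is a small sketch of that idea in plain Python; `Pairable` and `Zipped` are hypothetical names introduced here, and stopping at the shorter dataset is an assumption of this sketch rather than documented torchdatasets behavior.

```python
from typing import Any, Iterable, Sequence, Tuple


class Zipped:
    """Hypothetical view pairing samples of several datasets by index."""

    def __init__(self, *datasets: Sequence[Any]):
        self.datasets = datasets

    def __len__(self) -> int:
        # Assumption: the combined length is that of the shortest dataset
        return min(len(dataset) for dataset in self.datasets)

    def __getitem__(self, index: int) -> Tuple[Any, ...]:
        if not 0 <= index < len(self):
            raise IndexError(index)  # Also ends plain `for` iteration
        return tuple(dataset[index] for dataset in self.datasets)


class Pairable:
    """Hypothetical list-backed dataset supporting the `|` operator."""

    def __init__(self, items: Iterable[Any]):
        self.items = list(items)

    def __getitem__(self, index: int) -> Any:
        return self.items[index]

    def __len__(self) -> int:
        return len(self.items)

    def __or__(self, other: "Pairable") -> Zipped:
        return Zipped(self, other)


images = Pairable(["img0", "img1", "img2"])
labels = Pairable([0, 1, 1])
pairs = [(data, label) for data, label in images | labels]
```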
Cache the first 1000 samples in memory and save the rest on disk in the folder ./cache:
images = (
    # Images is the class defined in the first example
    Images("./data").map(torchvision.transforms.ToTensor())
    # First 1000 samples in memory
    .cache(td.modifiers.UpToIndex(1000, td.cachers.Memory()))
    # Samples from index 1000 to the end are pickled to disk
    .cache(td.modifiers.FromIndex(1000, td.cachers.Pickle("./cache")))
    # You can define your own cachers, modifiers, see docs
)
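The partial caching described above can be pictured as a two-tier lookup: indices below a boundary stay in memory, the rest are pickled to a folder. The sketch below uses only the standard library. `SplitCache`, its `boundary` parameter, and the `<index>.pkl` file naming are assumptions made for illustration, not the library's cachers or modifiers.

```python
import pathlib
import pickle
import tempfile
from typing import Any, Callable, Dict


class SplitCache:
    """Hypothetical two-tier cache: low indices in memory, the rest pickled on disk."""

    def __init__(self, load: Callable[[int], Any], boundary: int, folder: str):
        self.load = load
        self.boundary = boundary
        self.folder = pathlib.Path(folder)
        self.folder.mkdir(parents=True, exist_ok=True)
        self.memory: Dict[int, Any] = {}

    def __getitem__(self, index: int) -> Any:
        if index < self.boundary:
            # In-memory tier
            if index not in self.memory:
                self.memory[index] = self.load(index)
            return self.memory[index]
        # On-disk tier: one pickle file per sample
        path = self.folder / f"{index}.pkl"
        if path.exists():
            with path.open("rb") as handle:
                return pickle.load(handle)
        sample = self.load(index)
        with path.open("wb") as handle:
            pickle.dump(sample, handle)
        return sample


loads = []


def expensive_load(index: int) -> int:
    loads.append(index)  # Record each real load
    return index * index


with tempfile.TemporaryDirectory() as folder:
    cached = SplitCache(expensive_load, boundary=2, folder=folder)
    values = [cached[i] for i in (0, 3, 0, 3)]
    on_disk = sorted(path.name for path in pathlib.Path(folder).iterdir())
```

Here index 0 is kept in memory and index 3 is written to disk, and each is loaded only once despite being requested twice.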
To see what else you can do, please check the torchdatasets documentation.
torchvision
Using torchdatasets you can easily split torchvision datasets and apply augmentation only to the training part of the data without any trouble:
import torch
import torchvision
import torchdatasets as td

# Wrap torchvision dataset with WrapDataset
dataset = td.datasets.WrapDataset(torchvision.datasets.ImageFolder("./images"))

# Split dataset; the test split takes the remainder so lengths sum to len(dataset)
train_length = int(0.6 * len(dataset))
validation_length = int(0.2 * len(dataset))
test_length = len(dataset) - train_length - validation_length
train_dataset, validation_dataset, test_dataset = torch.utils.data.random_split(
    dataset,
    (train_length, validation_length, test_length),
)

# Apply torchvision mappings ONLY to train dataset
train_dataset = train_dataset.map(
    td.maps.To(
        torchvision.transforms.Compose(
            [
                torchvision.transforms.RandomResizedCrop(224),
                torchvision.transforms.RandomHorizontalFlip(),
                torchvision.transforms.ToTensor(),
                torchvision.transforms.Normalize(
                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
                ),
            ]
        )
    ),
    # Apply this transformation to the zeroth element of each sample (the image);
    # the first element is the label
    0,
)
Please notice you can use td.datasets.WrapDataset with any existing torch.utils.data.Dataset instance to give it additional caching and mapping powers!
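The index argument in the example above selects which element of each `(image, label)` sample gets transformed. As a rough illustration of that idea, the sketch below builds a mapper that touches only the chosen positions of a tuple. `apply_to` is a hypothetical helper written for this example; it is inferred from the comments in the snippet and is not the actual td.maps.To code.

```python
from typing import Any, Callable, Tuple


def apply_to(function: Callable[[Any], Any], *indices: int) -> Callable[[Tuple], Tuple]:
    """Return a mapper that applies `function` only to the chosen elements of a sample."""
    chosen = set(indices)

    def mapper(sample: Tuple[Any, ...]) -> Tuple[Any, ...]:
        # Transform selected positions, pass the others through unchanged
        return tuple(
            function(element) if position in chosen else element
            for position, element in enumerate(sample)
        )

    return mapper


# str.upper stands in for an image augmentation; the label is left alone
augment_image = apply_to(str.upper, 0)
sample = ("pixels", "cat")
result = augment_image(sample)
```

With index 0 selected, only the first element changes, which mirrors augmenting the image while keeping its label intact.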
pip install --user torchdatasets
pip install --user torchdatasets-nightly
CPU standalone and various versions of GPU enabled images are available at dockerhub.
For CPU quickstart, issue:
docker pull szymonmaszke/torchdatasets:18.04
Nightly builds are also available, just prefix the tag with nightly_. If you are going for a GPU image, make sure you have nvidia/docker installed and its runtime set.
If you find any issue, or you think some functionality may be useful to others and fits this library, please open a new Issue or create a Pull Request.
To get an overview of things one can do to help this project, see the Roadmap.