KappaData: utilities for datasets and data loading with PyTorch

PyTorch datasets load all of a sample's data in __getitem__. KappaData decouples __getitem__ so that individual properties of a sample can be loaded independently.
Let's take an image classification dataset as an example. A sample consists of an image with an associated class label.
class ImageClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, image_paths):
        super().__init__()
        self.image_paths = image_paths

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = load_image(self.image_paths[idx])
        class_label = image_path_to_class_label(self.image_paths[idx])
        return img, class_label
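For example, collecting only the class labels with this dataset still decodes every image (a minimal sketch; load_image and image_path_to_class_label are the placeholder helpers from above):

ds = ImageClassificationDataset(image_paths=image_paths)
# slow: __getitem__ loads every image even though only the label is used
labels = [ds[i][1] for i in range(len(ds))]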
If your training process contains a step that only requires the class labels, the dataset nevertheless has to load all the images, which can take a long time (loading only the labels is fast). With KappaData, the __getitem__ method is split into subparts:
# inherit from kappadata.KDDataset
class ImageClassificationDataset(kappadata.KDDataset):
    def __init__(self, image_paths):
        super().__init__()
        self.image_paths = image_paths

    def __len__(self):
        return len(self.image_paths)

    # replace __getitem__ with getitem_x and getitem_y
    def getitem_x(self, idx, ctx=None):
        return load_image(self.image_paths[idx])

    def getitem_y(self, idx, ctx=None):
        return image_path_to_class_label(self.image_paths[idx])
Now each subpart of the dataset can be retrieved by wrapping the dataset into a ModeWrapper:
ds = ImageClassificationDataset(image_paths=...)
# iterate over labels only (images are never loaded)
for y in kappadata.ModeWrapper(ds, mode="y"):
    ...
# iterate over image/label pairs
for x, y in kappadata.ModeWrapper(ds, mode="x y"):
    ...
torch.utils.data.Subset and torch.utils.data.ConcatDataset can be used by simply replacing them with kappadata.KDSubset and kappadata.KDConcatDataset.
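A minimal sketch of this drop-in replacement (other_ds stands for any second KDDataset; the KDSubset and KDConcatDataset signatures are assumed here to mirror their torch counterparts):

ds = ImageClassificationDataset(image_paths=...)
# take the first 100 samples, then concatenate with another dataset
subset = kappadata.KDSubset(ds, indices=list(range(100)))
combined = kappadata.KDConcatDataset([subset, other_ds])
# the result is still a KDDataset, so ModeWrapper works as before
for y in kappadata.ModeWrapper(combined, mode="y"):
    ...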
KappaData implements various ways to manipulate datasets (kappadata.wrappers.dataset_wrappers):
kappadata.ClassFilterWrapper(ds, valid_classes=[0, 1])    # keep only samples of classes 0 and 1
kappadata.ClassFilterWrapper(ds, invalid_classes=[0, 1])  # discard samples of classes 0 and 1
kappadata.OversamplingWrapper(ds)                         # oversample underrepresented classes
kappadata.PercentFilterWrapper(ds, from_percent=0.25)     # keep only samples after the first 25%
kappadata.PercentFilterWrapper(ds, to_percent=0.75)       # keep only the first 75% of samples
kappadata.PercentFilterWrapper(ds, from_percent=0.25, to_percent=0.75)  # keep the middle 50%
kappadata.RepeatWrapper(ds, repetitions=2)                # repeat the dataset twice
kappadata.RepeatWrapper(ds, min_size=100)                 # repeat until at least 100 samples
kappadata.ShuffleWrapper(ds, seed=5)                      # shuffle deterministically with a seed
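Because every wrapper is itself a dataset, these wrappers should compose by stacking. A minimal sketch (the particular combination shown is illustrative):

ds = ImageClassificationDataset(image_paths=...)
ds = kappadata.ClassFilterWrapper(ds, valid_classes=[0, 1])
ds = kappadata.ShuffleWrapper(ds, seed=5)
for x, y in kappadata.ModeWrapper(ds, mode="x y"):
    ...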
KappaData implements various ways to manipulate how samples are drawn from the underlying dataset (kappadata.wrappers.sample_wrappers). "Sample Wrappers" are similar to transforms in that they transform the sample in some way, but they are more powerful: they have full access to the underlying dataset, whereas normal transforms only have access to a single sample.
class Transform:
    def forward(self, x):
        # only x can be manipulated (e.g. normalization, image transforms, ...)
        ...

class SampleWrapper(kappadata.KDWrapper):
    def getitem_x(self, idx, ctx=None):
        # full access to the underlying dataset via self.dataset,
        # e.g. return the sum of two different samples
        idx2 = np.random.randint(len(self))
        return self.dataset.getitem_x(idx, ctx) + self.dataset.getitem_x(idx2, ctx)
This allows implementing more complex transformations. KappaData implements the following SampleWrappers:
kappadata.MixupWrapper(dataset=ds, alpha=1., p=1.)         # mixup, applied with probability p
kappadata.CutmixWrapper(dataset=ds, alpha=1., p=1.)        # cutmix, applied with probability p
kappadata.MixWrapper(dataset=ds, cutmix_alpha=1., mixup_alpha=1., p=1., cutmix_p=0.5)  # cutmix with probability cutmix_p, otherwise mixup
kappadata.LabelSmoothingWrapper(dataset=ds, smoothing=.1)  # smooth the labels
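A sketch of how these might be stacked in practice; the order shown (smoothing before mixup) is an illustrative assumption, not a prescription from the library:

ds = ImageClassificationDataset(image_paths=...)
ds = kappadata.LabelSmoothingWrapper(dataset=ds, smoothing=.1)
ds = kappadata.MixupWrapper(dataset=ds, alpha=1., p=1.)
for x, y in kappadata.ModeWrapper(ds, mode="x y"):
    # y is now a soft-label vector instead of a class index
    ...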
With KappaData you can also retrieve various properties of your data preprocessing (e.g. augmentation parameters). The following example shows how to retrieve the parameters of torchvision.transforms.RandomResizedCrop.
import torchvision.transforms.functional as F

class MyRandomResizedCrop(torchvision.transforms.RandomResizedCrop):
    def forward(self, img, ctx=None):
        # make random resized crop
        i, j, h, w = self.get_params(img, self.scale, self.ratio)
        cropped = F.resized_crop(img, i, j, h, w, self.size, self.interpolation)
        # store parameters
        if ctx is not None:
            ctx["crop_parameters"] = (i, j, h, w)
        return cropped

class ImageClassificationDataset(kappadata.KDDataset):
    def __init__(self, ...):
        ...
        self.random_resized_crop = MyRandomResizedCrop()
        ...

    def getitem_x(self, idx, ctx=None):
        img = load_image(self.image_paths[idx])
        return self.random_resized_crop(img, ctx=ctx)
When you want to access the parameters, simply pass return_ctx=True to the ModeWrapper:
ds = ImageClassificationDataset(image_paths=...)
for x, ctx in kappadata.ModeWrapper(ds, mode="x", return_ctx=True):
    print(ctx["crop_parameters"])
for (x, y), ctx in kappadata.ModeWrapper(ds, mode="x y", return_ctx=True):
    ...
kappadata.SharedDictDataset provides a wrapper to store arbitrary datasets in-memory via a dictionary shared between all worker processes (using python multiprocessing data structures). The shared-memory part is important for dataloading with num_workers > 0. Small and medium-sized datasets can be cached in-memory to avoid bottlenecks when loading data from a disk. For example, even the full ImageNet (~130GB) can be cached on many servers, as it is not uncommon for GPU servers to have more RAM than that.
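A minimal usage sketch (assuming the cache is filled lazily on first access, as the caching example below suggests):

cached_ds = kappadata.SharedDictDataset(ds)
sample = cached_ds[0]  # first access: loaded from disk, stored in the shared dict
sample = cached_ds[0]  # second access: served from memory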
Naively caching image datasets can lead to high memory consumption because image data is usually stored in a compressed format and decompressed during loading; caching the decompressed samples blows up memory. To keep memory consumption low, the raw (still compressed) data should be cached and decompressed only when a sample is retrieved.
Example caching a torchvision.datasets.ImageFolder:
import torchvision
from kappadata.loading.image_folder import raw_image_loader, raw_image_folder_sample_to_pil_sample

class CachedImageFolder(kappadata.KDDataset):
    def __init__(self, ...):
        # modify ImageFolder to load raw samples (NOTE: transforms can't be applied to raw data)
        self.ds = torchvision.datasets.ImageFolder(..., transform=None, loader=raw_image_loader)
        # initialize the cached dataset, which decompresses the raw data into a PIL image on retrieval
        self.cached_ds = kappadata.SharedDictDataset(self.ds, transform=raw_image_folder_sample_to_pil_sample)
        # store transforms to apply after decompression
        self.transform = ...

    def getitem_x(self, idx, ctx=None):
        x, _ = self.cached_ds[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x
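To benefit from the shared cache across workers, the dataset can then be fed to a DataLoader as usual. A sketch (batch size and worker count are arbitrary):

ds = CachedImageFolder(...)
loader = torch.utils.data.DataLoader(
    kappadata.ModeWrapper(ds, mode="x"),
    batch_size=64,
    num_workers=8,  # all workers read from and fill the same shared dict
)
for x in loader:
    ...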
Datasets are often stored on a global (slow) storage and moved to a local (fast) disk before training. kappadata.copy_folder_from_global_to_local provides a utility function to do this automatically:
from pathlib import Path
from kappadata import copy_folder_from_global_to_local
global_path = Path("/system/data/ImageNet")
local_path = Path("/local/data")
# /system/data/ImageNet contains a 'train' and a 'val' folder -> copy whole dataset
copy_folder_from_global_to_local(global_path, local_path)
# copy only "train"
copy_folder_from_global_to_local(global_path, local_path, relative_path="train")
The above code will also work (without modification) if /system/data/ImageNet contains only two zip files, train.zip and val.zip.
kappadata.KDDataset automatically supports python slicing:
all_class_labels = ModeWrapper(ds, mode="y")[:]
all_class_labels = ModeWrapper(ds, mode="y")[5:-3:2]
kappadata.KDDataset implements __iter__:
for y in ModeWrapper(ds, mode="y"):
    ...
The innermost dataset of a stack of wrappers can be accessed via ds.root_dataset.
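For example (assuming root_dataset unwraps through all wrapper layers, which its name suggests but the docs above do not show):

ds = ImageClassificationDataset(image_paths=...)
wrapped = kappadata.ShuffleWrapper(kappadata.RepeatWrapper(ds, repetitions=2), seed=5)
assert wrapped.root_dataset is ds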