%matplotlib inline
import matplotlib.pyplot as plt
import torch.utils.data
import torch.nn
from random import randrange
import os
os.environ["WDS_VERBOSE_CACHE"] = "1"
os.environ["GOPEN_VERBOSE"] = "0"
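WebDataset format files are tar files, with two conventions:
- within each tar file, files that belong together and make up a training sample share the same basename when stripped of all filename extensions
- the shards of a dataset are numbered like something-000000.tar to something-012345.tar, usually specified using brace notation something-{000000..012345}.tar
You can find a longer, more detailed specification of the WebDataset format in the WebDataset Format Specification.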
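As a rough illustration of the brace notation, the following sketch expands a numeric range into individual shard names using only the standard library. The `expand_shards` helper is hypothetical and handles just a single `{start..end}` range; it is not the expansion code the library itself uses.

```python
import re

def expand_shards(pattern):
    """Expand one zero-padded {start..end} range into a list of names.

    Hypothetical helper for illustration; supports a single numeric range only.
    """
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]  # no range present: treat the pattern as one shard
    start, end = match.group(1), match.group(2)
    width = len(start)  # preserve the zero padding of the start value
    return [
        pattern[:match.start()] + str(i).zfill(width) + pattern[match.end():]
        for i in range(int(start), int(end) + 1)
    ]

shards = expand_shards("something-{000000..000003}.tar")
print(shards)
```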
WebDataset can read files from local disk or from any pipe, which allows it to access files using common cloud object stores. WebDataset can also read concatenated MsgPack and CBOR sources.
The WebDataset representation allows writing purely sequential I/O pipelines for large scale deep learning. This is important for achieving high I/O rates from local storage (3x-10x for local drives compared to random access) and for using object stores and cloud storage for training.
The WebDataset format represents images, movies, audio, etc. in their native file formats, making the creation of WebDataset format data as easy as just creating a tar archive. Because data is aligned on predictable boundaries, WebDataset also works well with block deduplication.
Standard tools can be used for accessing and processing WebDataset-format files.
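Since a shard is an ordinary tar archive, one can be written with Python's standard `tarfile` module. The sketch below is a minimal example under stated assumptions: the sample keys, the placeholder image bytes, and the `add_file` helper are all hypothetical, and the archive is written to memory rather than to a file such as something-000000.tar.

```python
import io
import json
import tarfile

def add_file(tar, name, data):
    """Append one in-memory file to an open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Hypothetical samples: (key, image bytes, metadata); the image bytes are placeholders.
samples = [
    ("sample_0001", b"<png bytes>", {"label": 3}),
    ("sample_0002", b"<png bytes>", {"label": 7}),
]

buffer = io.BytesIO()  # stands in for a shard file on disk
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for key, image_bytes, meta in samples:
        # Files of one sample share a basename and are stored next to each other.
        add_file(tar, key + ".png", image_bytes)
        add_file(tar, key + ".json", json.dumps(meta).encode("utf-8"))

# Read the archive back and list its members in storage order.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as tar:
    names = tar.getnames()
print(names)
```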
bucket = "https://storage.googleapis.com/webdataset/testdata/"
dataset = "publaynet-train-{000000..000009}.tar"
url = bucket + dataset
!curl -s {url} | tar tf - | sed 10q
PMC4991227_00003.json
PMC4991227_00003.png
PMC4537884_00002.json
PMC4537884_00002.png
PMC4323233_00003.json
PMC4323233_00003.png
PMC5429906_00004.json
PMC5429906_00004.png
PMC5592712_00002.json
PMC5592712_00002.png
tar: stdout: write error
Note that in these .tar files, we have pairs of .json and .png files; each such pair makes up a training sample.
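The pairing can be sketched in a few lines: split each file name at the first dot of its basename and group consecutive names that share the resulting key. The `split_key` helper is hypothetical and only illustrates the idea; the names are taken from the listing above.

```python
import os
from itertools import groupby

def split_key(name):
    """Split a member name into (key, extension) at the first dot of the basename."""
    directory, base = os.path.split(name)
    key, _, ext = base.partition(".")
    return os.path.join(directory, key), ext

names = [
    "PMC4991227_00003.json",
    "PMC4991227_00003.png",
    "PMC4537884_00002.json",
    "PMC4537884_00002.png",
]

# Consecutive files with the same key form one sample.
grouped = [
    (key, [split_key(n)[1] for n in group])
    for key, group in groupby(names, key=lambda n: split_key(n)[0])
]
print(grouped)
```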
There are several libraries supporting the WebDataset format:
- webdataset for Python3 (includes the wids library), this repository
The webdataset library can be used with PyTorch, TensorFlow, and JAX.
webdataset Library
The webdataset library is an implementation of PyTorch IterableDataset (or a mock implementation thereof if you aren't using PyTorch). It implements a form of stream processing. Some of its features are:
The main limitations people run into are that IterableDataset is less commonly used in PyTorch, so some existing code may not support it as well, and that achieving an exactly balanced number of training samples across many compute nodes for a fixed epoch size is tricky; for multinode training, webdataset is usually used with shard resampling.
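The idea behind shard resampling can be sketched without the library: each node draws shards at random with replacement and stops after a fixed number, so every node sees the same epoch size regardless of how many shards exist. The `resampled_shards` generator, the seeds, and `shards_per_epoch` below are hypothetical; this is an illustration of the concept, not the library's implementation.

```python
import random
from itertools import islice

def resampled_shards(shards, seed):
    """Yield shard names forever, drawing each one with replacement."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(shards)

shards = [f"something-{i:06d}.tar" for i in range(10)]

# Hypothetical setup: each node uses its own seed and reads a fixed
# number of shards per epoch.
shards_per_epoch = 4
for node_rank in range(2):
    stream = resampled_shards(shards, seed=1000 + node_rank)
    epoch = list(islice(stream, shards_per_epoch))
    print(node_rank, epoch)
```

Because sampling is with replacement, a shard may appear more than once in an epoch and some shards may not appear at all; that is the trade-off for an exactly fixed epoch size.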
There are two interfaces, the concise "fluid" interface and a longer "pipeline" interface. We'll show examples using the fluid interface, which is usually what you want.
import webdataset as wds
pil_dataset = wds.WebDataset(url).shuffle(1000).decode("pil").to_tuple("png", "json")
The resulting datasets are standard PyTorch IterableDataset instances.
isinstance(pil_dataset, torch.utils.data.IterableDataset)
True
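The shuffle(1000) stage in the pipeline above works on a stream, so it cannot permute the whole dataset at once. A common way to shuffle a stream is a bounded in-memory buffer, sketched below with a hypothetical `buffered_shuffle` generator. This shows the general technique and is not the library's actual code.

```python
import random

def buffered_shuffle(stream, bufsize, seed=0):
    """Yield items from stream in randomized order using a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < bufsize:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        index = rng.randrange(bufsize)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(20), bufsize=5))
print(shuffled)
```

A larger buffer gives a more thorough shuffle at the cost of more memory and a longer startup delay.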
for image, json in pil_dataset:
    break
plt.imshow(image)
<matplotlib.image.AxesImage at 0x7f73806db970>
We can add onto the existing pipeline for augmentation and data preparation.
import torchvision.transforms as transforms
from PIL import Image
# resize, convert to a tensor, and invert the image
preproc = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    lambda x: 1-x,
])
def preprocess(sample):
    image, json = sample
    try:
        label = json["annotations"][0]["category_id"]
    except Exception:
        # fall back to label 0 when the sample has no usable annotation
        label = 0
    return preproc(image), label
dataset = pil_dataset.map(preprocess)
for image, label in dataset:
    break
plt.imshow(image.numpy().transpose(1, 2, 0))
<matplotlib.image.AxesImage at 0x7f7375fc2230>
WebDataset is just an instance of a standard IterableDataset. It's a single-threaded way of iterating over a dataset. Since image decompression and data augmentation can be compute intensive, PyTorch usually uses the DataLoader class to parallelize data loading and preprocessing. WebDataset is fully compatible with the standard DataLoader.
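A minimal sketch of that combination follows. The `make_loader` helper and its default values are hypothetical; the only library call is the standard `torch.utils.data.DataLoader` constructor.

```python
def make_loader(dataset, batch_size=16, num_workers=4):
    """Wrap an IterableDataset in a DataLoader with parallel workers."""
    # Imported here so the sketch can be imported without PyTorch installed.
    from torch.utils.data import DataLoader

    # Each worker process runs its own copy of the dataset pipeline,
    # so decoding and augmentation happen in parallel.
    return DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
```

With the `dataset` built in the preprocessing example above, `for images, labels in make_loader(dataset): ...` would iterate over batches. On platforms that start workers by spawning, the lambda inside `preproc` may need to be replaced by a named function so that it can be pickled.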
Here are a number of notebooks showing how to use WebDataset for image classification and LLM training:
The wds-notes notebook contains some additional documentation and information about the library.
webdataset Pipeline API
The wds.WebDataset fluid interface is just a convenient shorthand for writing down pipelines. The underlying pipeline is an instance of the wds.DataPipeline class, and you can construct data pipelines explicitly, similar to the way you use nn.Sequential inside models.
dataset = wds.DataPipeline(
    wds.SimpleShardList(url),
    # at this point we have an iterator over all the shards
    wds.shuffle(100),
    # add wds.split_by_node here if you are using multiple nodes
    wds.split_by_worker,
    # at this point, we have an iterator over the shards assigned to each worker
    wds.tarfile_to_samples(),
    # this shuffles the samples in memory
    wds.shuffle(1000),
    # this decodes the images and json
    wds.decode("pil"),
    wds.to_tuple("png", "json"),
    wds.map(preprocess),
    wds.shuffle(1000),
    wds.batched(16)
)
batch = next(iter(dataset))
batch[0].shape, batch[1].shape
(torch.Size([16, 3, 224, 224]), (16,))
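The comment "an iterator over the shards assigned to each worker" can be made concrete with a small sketch. One simple assignment scheme is striding: worker k takes every n-th shard starting at position k. The `shards_for_worker` helper below is hypothetical, and the library's own wds.split_by_worker may differ in detail.

```python
from itertools import islice

def shards_for_worker(shards, worker_id, num_workers):
    """Return every num_workers-th shard, starting at worker_id."""
    return list(islice(shards, worker_id, None, num_workers))

shards = [f"publaynet-train-{i:06d}.tar" for i in range(10)]
assignment = {w: shards_for_worker(shards, w, 4) for w in range(4)}
for worker_id, assigned in assignment.items():
    print(worker_id, assigned)
```

Every shard ends up with exactly one worker, so no sample is read twice within a pass over the shard list.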
wids Library for Indexed WebDatasets
Installing the webdataset library installs a second library called wids. This library provides fully indexed/random access to the same datasets that webdataset accesses using iterators/streaming.
Like the webdataset library, wids is highly scalable and provides efficient access to very large datasets. Being indexed, it is easily backwards compatible with existing data pipelines based on indexed datasets, and it supports precise epochs for multinode training. The library comes with its own ChunkedSampler and DistributedChunkedSampler classes, which provide shuffling across nodes while still preserving enough locality of reference for efficient training.
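The trade-off between shuffling and locality can be illustrated with a sketch of chunked shuffling: indices are split into contiguous chunks, the chunk order is shuffled, and indices are shuffled within each chunk. The `chunked_shuffle` function is hypothetical and only shows the general idea; it is not the implementation of ChunkedSampler.

```python
import random

def chunked_shuffle(num_samples, chunk_size, seed=0):
    """Return all indices, shuffled across chunks and within each chunk."""
    rng = random.Random(seed)
    chunks = [
        list(range(start, min(start + chunk_size, num_samples)))
        for start in range(0, num_samples, chunk_size)
    ]
    rng.shuffle(chunks)  # randomize the order in which chunks are visited
    order = []
    for chunk in chunks:
        rng.shuffle(chunk)  # randomize within a chunk; indices stay close together
        order.extend(chunk)
    return order

order = chunked_shuffle(num_samples=20, chunk_size=5)
print(order)
```

Consecutive accesses stay within one chunk, so they tend to hit the same shard, while the overall order still varies from epoch to epoch when the seed changes.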
Internally, the library uses a mmap-based tar file reader implementation; this allows very fast access without precomputed indexes, and it also means that shards and the equivalent of "shuffle buffers" are shared in memory between workers on the same machine.
This additional power comes at some cost: the library requires a small metadata file that lists all the shards in a dataset and the number of samples contained in each; it requires local storage for as many shards as there are I/O workers on a node; it uses shared memory and mmap; and the availability of indexing makes it easy to accidentally use inefficient access patterns.
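The per-shard sample counts are what make random access possible: given the counts, a global index can be mapped to a shard and a position inside it. The sketch below assumes a hypothetical list of shard names and counts and a hypothetical `locate` function; it does not reproduce the wids metadata format or internals.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical metadata: shard names and sample counts, as a shard list would provide.
shards = [
    ("train-000000.tar", 100),
    ("train-000001.tar", 80),
    ("train-000002.tar", 120),
]

# Cumulative sample counts mark where each shard ends in the global index space.
ends = list(accumulate(count for _, count in shards))

def locate(index):
    """Map a global sample index to (shard name, index within that shard)."""
    if not 0 <= index < ends[-1]:
        raise IndexError(index)
    shard_number = bisect_right(ends, index)
    start = ends[shard_number - 1] if shard_number > 0 else 0
    return shards[shard_number][0], index - start

print(locate(0), locate(100), locate(299))
```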
Generally, the recommendation is to use webdataset for all data generation, data transformation, and training code, and to use wids only if you need fully random access to datasets (e.g., for browsing or sparse sampling), need an index-based sampler, or are converting tricky legacy code.
import wids
train_url = "https://storage.googleapis.com/webdataset/fake-imagenet/imagenet-train.json"
dataset = wids.ShardListDataset(train_url)
sample = dataset[1900]
print(sample.keys())
print(sample[".txt"])
plt.imshow(sample[".jpg"])
dict_keys(['.cls', '.jpg', '.txt', '__key__', '__dataset__', '__index__', '__shard__', '__shardindex__'])
a high quality color photograph of a dog
https://storage.googleapis.com/webdataset/fake-ima base: https://storage.googleapis.com/webdataset/fake-imagenet name: imagenet-train nfiles: 1282 nbytes: 31242280960 samples: 128200 cache: /tmp/_wids_cache
<matplotlib.image.AxesImage at 0x7f7373669e70>
There are several examples of how to use wids in the examples directory.
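The sample keys printed above keep the leading dot (".jpg", ".txt"), whereas the webdataset pipeline earlier refers to fields as "png" and "json". If code written for one style needs to consume samples in the other, a small adapter is one option. The `to_webdataset_keys` function and the sample values below are hypothetical.

```python
def to_webdataset_keys(sample):
    """Drop the leading "." from extension keys; leave "__...__" metadata keys alone."""
    return {
        (key[1:] if key.startswith(".") else key): value
        for key, value in sample.items()
    }

# Hypothetical sample with the same key layout as the output above.
wids_sample = {".cls": 5, ".txt": "a photo", "__key__": "000019"}
converted = to_webdataset_keys(wids_sample)
print(converted)
```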
Note that the APIs between webdataset and wids are not fully consistent:
- wids keeps the extension's "." in the keys, while webdataset removes it (".txt" vs "txt")
- wids doesn't have a fully fluid interface, and add_transformation just adds to a list of transformations
- webdataset currently can't read the wids JSON specifications
To install the released version:
$ pip install webdataset
For the GitHub version:
$ pip install git+https://github.com/tmbdev/webdataset.git
Here are some videos talking about WebDataset and large scale deep learning:
The WebDataset library only requires PyTorch, NumPy, and a small library called braceexpand.
WebDataset loads a few additional libraries dynamically only when they are actually needed and only in the decoder:
- torchvision, torchvideo, torchaudio for image/video/audio decoding
- msgpack for MessagePack decoding
- the curl command line tool for accessing HTTP servers
Loading of one of these libraries is triggered by configuring a decoder that attempts to decode content in the given format and encountering a file in that format during decoding. (Eventually, the torch... dependencies will be refactored into those libraries.)
FAQs
Record sequential storage for deep learning.
We found that webdataset demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.