streaming-wds

streaming-wds is a Python library that enables efficient streaming of WebDataset-format datasets from boto3-compliant object stores for PyTorch. It is designed to handle large-scale datasets with ease, especially in distributed training contexts.
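For context, a WebDataset shard is simply a tar archive in which files sharing a basename (e.g. `000001.jpg` and `000001.json`) form one sample. A minimal sketch of building and inspecting such a shard with only the standard library (the filenames and payloads are illustrative, not from the library):

```python
import io
import json
import tarfile

def write_shard(path, samples):
    # Write a tiny WebDataset-style shard: each sample is a (key, fields)
    # pair, and each field becomes a tar member named "<key><ext>".
    with tarfile.open(path, "w") as tar:
        for key, fields in samples:
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

samples = [
    ("000001", {".jpg": b"\xff\xd8fake-jpeg", ".json": json.dumps({"label": 3}).encode()}),
    ("000002", {".jpg": b"\xff\xd8more-bytes", ".json": json.dumps({"label": 7}).encode()}),
]
write_shard("shard-000000.tar", samples)

# Members sharing a basename group back into one sample on read
with tarfile.open("shard-000000.tar") as tar:
    names = tar.getnames()
print(names)  # ['000001.jpg', '000001.json', '000002.jpg', '000002.json']
```

This grouping-by-basename convention is what lets a streaming reader reassemble samples sequentially while downloading shards.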
Key features include mid-epoch resumption via `StreamingDataLoader` and custom per-sample processing via `StreamingDataset.process_sample`.
You can install streaming-wds using pip:

```
pip install streaming-wds
```
Here's a basic example of how to use streaming-wds:

```python
import torch
import torchvision.transforms.v2 as T

from streaming_wds import StreamingWebDataset, StreamingDataLoader

# Create the dataset
dataset = StreamingWebDataset(
    remote="s3://your-bucket/your-dataset",
    split="train",
    profile="your_aws_profile",
    shuffle=True,
    max_workers=4,
    schema={".jpg": "PIL", ".json": "json"},
)

# Or subclass it with a custom processing function
class ImageNetWebDataset(StreamingWebDataset):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.transforms = T.Compose([
            T.ToImage(),
            T.Resize((64,)),
            T.ToDtype(torch.float32),
            T.Normalize(mean=(128,), std=(128,)),
        ])

    def process_sample(self, sample):
        sample[".jpg"] = self.transforms(sample[".jpg"])
        return sample

# Create a StreamingDataLoader for mid-epoch resumption
dataloader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)

# Iterate through the data
for batch in dataloader:
    # Your training loop here
    pass

# You can save the state for resumption
state_dict = dataloader.state_dict()

# Later, you can resume from this state
dataloader.load_state_dict(state_dict)
```
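For resumption to survive a process restart, the state dict has to be persisted somewhere. Assuming it is a plain JSON-serializable dict (check the library's documentation for the actual shape; the keys below are invented for illustration), a minimal checkpointing sketch looks like:

```python
import json
from pathlib import Path

CKPT = Path("dataloader_state.json")

def save_state(state_dict):
    # Persist the loader state so a later run can resume mid-epoch
    CKPT.write_text(json.dumps(state_dict))

def load_state():
    # Return the saved state, or None on a fresh start
    return json.loads(CKPT.read_text()) if CKPT.exists() else None

# Illustrative state shape; the real dict comes from dataloader.state_dict()
save_state({"epoch": 2, "samples_seen": 4096})
restored = load_state()
print(restored)  # {'epoch': 2, 'samples_seen': 4096}
```

On restart you would call `dataloader.load_state_dict(restored)` only when `restored` is not None, so a fresh run begins cleanly from the start of the epoch.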
Constructor parameters:

- `remote` (str): The S3 URI of the dataset.
- `split` (Optional[str]): The dataset split (e.g., "train", "val", "test"). Defaults to None.
- `profile` (str): The AWS profile to use for authentication. Defaults to "default".
- `shuffle` (bool): Whether to shuffle the data. Defaults to False.
- `max_workers` (int): Maximum number of worker threads for download and extraction. Defaults to 2.
- `schema` (Dict[str, str]): A dictionary defining the decoding method for each data field. Defaults to {}.
- `memory_buffer_limit_bytes` (Union[Bytes, int, str]): The maximum size of the memory buffer in bytes per worker. Defaults to "2GB".
- `file_cache_limit_bytes` (Union[Bytes, int, str]): The maximum size of the file cache in bytes per worker. Defaults to "2GB".

Contributions to streaming-wds are welcome! Please feel free to submit a Pull Request.
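The two buffer limits accept either a raw integer byte count or a human-readable string such as "2GB". A hedged sketch of how such a value might be normalized to bytes (the function name and exact unit handling are assumptions, not the library's own code):

```python
import re

_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

def to_bytes(limit):
    # Accept a raw int (already bytes) or a string such as "2GB" / "512 MB"
    if isinstance(limit, int):
        return limit
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*", limit.upper())
    if not m:
        raise ValueError(f"unrecognized size: {limit!r}")
    value, unit = m.groups()
    return int(float(value) * _UNITS[unit])

print(to_bytes("2GB"))  # 2147483648
print(to_bytes(1024))   # 1024
```

This uses binary (1024-based) units; whether the library interprets "2GB" as 2^31 or 2*10^9 bytes is worth confirming in its documentation before tuning memory limits precisely.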
MIT License