🔥 We use fastdup, a free tool, to clean all datasets shared in this repo.
vl-datasets is a Python package that provides access to clean computer vision datasets with only 2 lines of code.
For example, to get access to the clean version of the Food-101 dataset, simply run:
from vl_datasets import VLFood101
train_dataset = VLFood101('./', split='train')
We support some of the most widely used computer vision datasets. Let us know if you would like us to support an additional dataset.
All the datasets are analyzed for issues such as:
Computer vision is an exciting and rapidly advancing field, with new techniques and models emerging all the time. However, to develop and evaluate these models, it's essential to have reliable and standardized datasets to work with.
Even with the recent success of generative models, data quality remains an issue that is largely overlooked. Training models with erroneous data reduces model accuracy and incurs costs in time, storage, and computational resources.
We believe that access to clean, high-quality computer vision datasets leads to accurate, non-biased, and efficient models.
By providing public access to vl-datasets, we hope to help advance the field of computer vision.
vl-datasets provides a convenient way to access the cleaned version of the datasets in Python.
Alternatively, for each dataset in this repo, we provide a .csv file that lists the problematic images from the dataset.
You can use the images listed in the .csv to improve your model, either by re-labeling them or by simply removing them from the dataset.
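As a minimal sketch of the removal route, the snippet below reads a list of flagged filenames from a .csv and drops them from a list of image paths. The column names (`filename`, `reason`) and the sample paths are hypothetical, chosen only for illustration; check the header of the actual .csv file before relying on them.

```python
import csv
import io


def load_flagged_filenames(csv_file, column="filename"):
    """Return the set of filenames found in `column` of an open CSV file.

    The column name is an assumption; the real .csv files may use another header.
    """
    reader = csv.DictReader(csv_file)
    return {row[column].strip() for row in reader if row.get(column)}


def drop_flagged(image_paths, flagged):
    """Keep only the image paths that are not listed as problematic."""
    return [path for path in image_paths if path not in flagged]


# Toy in-memory CSV standing in for a downloaded file (hypothetical content).
example_csv = io.StringIO(
    "filename,reason\n"
    "pizza/001.jpg,duplicate\n"
    "sushi/042.jpg,mislabel\n"
)

flagged = load_flagged_filenames(example_csv)
images = ["pizza/001.jpg", "pizza/002.jpg", "sushi/042.jpg", "ramen/007.jpg"]
clean_images = drop_flagged(images, flagged)
print(clean_images)
```

With a real file, `open("path/to/file.csv", newline="")` would replace the in-memory `io.StringIO` object.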
We're a startup and we'd like to offer free access to the datasets as much as we can afford to. But in doing so, we'd also need your support.
We're offering select .csv files completely free with no strings attached.
For access to our complete dataset and exclusive beta features, all we ask is that you sign up to be a beta tester – it's completely free and your feedback will help shape the future of our platform.
Here is a table of widely used computer vision datasets, the issues we found, and a link to access the .csv file.
Dataset | Issues | CSV | Import Statement |
---|---|---|---|
Food-101 | | Download here. | from vl_datasets import VLFood101 |
Oxford-IIIT Pet | | Download here. | from vl_datasets import VLOxfordIIITPet |
LAION-1B | | Request access here. | WIP |
ImageNet-21K | | Request access here. | WIP |
ImageNet-1K | | Request access here. | WIP |
KITTI | | Request access here. | WIP |
DeepFashion | | Request access here. | WIP |
CelebA-HQ | | Request access here. | WIP |
COCO | | Request access here. | WIP |
Learn more about how we clean the datasets using our profiling tool here.
Option 1 - Install the vl_datasets package from PyPI:
pip install vl-datasets
Option 2 - Install the bleeding-edge version from GitHub:
pip install git+https://github.com/visual-layer/vl-datasets.git@main --upgrade
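After either option, one way to confirm which version ended up in the environment is to query the installed distribution metadata from Python. This is only a sketch using the standard library; the helper name `installed_version` is made up for illustration, and the distribution name `vl-datasets` is taken from the `pip install` command above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("vl-datasets")
print(version if version else "vl-datasets is not installed")
```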
To start using vl-datasets, import the clean version of the dataset with:
from vl_datasets import VLFood101
This imports the clean version of the Food-101 dataset.
Next, you can load the dataset as a PyTorch Dataset.
train_dataset = VLFood101('./', split='train')
valid_dataset = VLFood101('./', split='test')
If you have a custom .csv file, you can optionally pass it in:
train_dataset = VLFood101('./', split='train', exclude_csv='my-file.csv')
The filenames listed in the .csv will be excluded from the dataset.
Next, you can load the train and validation datasets in a PyTorch training loop.
See the Learn from Examples section to learn more.
NOTE: Sign up here for free as a beta tester to get full access to all the .csv files for the datasets listed in this repo.
With the dataset loaded, you can train a model using a PyTorch training loop.
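The following is a minimal sketch of such a training loop, not the project's own example. To stay self-contained it uses random tensors wrapped in a `TensorDataset` as a stand-in for the loaded dataset, and it assumes the dataset yields (image tensor, label) pairs; the tiny linear model, batch size, and learning rate are arbitrary illustrative choices.

```python
import math

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: 32 random 3x8x8 "images" spread over 4 classes.
# In practice, the dataset object loaded earlier would be passed to DataLoader instead.
images = torch.randn(32, 3, 8, 8)
labels = torch.randint(0, 4, (32,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=8, shuffle=True)

# A deliberately tiny classifier, just enough to exercise the loop.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 4))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(3):
    model.train()
    running_loss = 0.0
    for batch_images, batch_labels in train_loader:
        optimizer.zero_grad()                      # clear gradients from the previous step
        loss = criterion(model(batch_images), batch_labels)
        loss.backward()                            # backpropagate
        optimizer.step()                           # update the weights
        running_loss += loss.item() * batch_images.size(0)
    # Average loss per sample over the epoch.
    epoch_losses.append(running_loss / len(train_loader.dataset))

print(epoch_losses)
```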
vl-datasets is licensed under the Apache 2.0 License. See LICENSE.
However, you are bound to the usage license of the original dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license. We provide no warranty or guarantee of accuracy or completeness.
This repository incorporates usage tracking using Sentry.io to monitor and collect valuable information about the usage of the application.
Usage tracking allows us to gain insights into how the application is being used in real-world scenarios. It provides us with valuable information that helps in understanding user behavior, identifying potential issues, and making informed decisions to improve the application.
We DO NOT collect folder names, user names, image names, image content, or other personally identifiable information.
What data is tracked?
To opt out, define an environment variable named SENTRY_OPT_OUT.
On Linux, run the following:
export SENTRY_OPT_OUT=True
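The same variable can also be defined from within Python, which may be convenient in notebooks or on platforms without a shell. The sketch below assumes the package checks the variable when it is imported, so the assignment is placed before the import; that timing is an assumption, not something stated in this README.

```python
import os

# Define the opt-out variable for the current process.
# Assumption: this must happen before vl_datasets is imported to take effect.
os.environ["SENTRY_OPT_OUT"] = "True"

# import vl_datasets  # commented out so the sketch runs without the package installed

opted_out = "SENTRY_OPT_OUT" in os.environ
print("Opt-out variable defined:", opted_out)
```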
Read more on Sentry's official webpage.
Get help from the Visual Layer team or community members via the following channels -
Visual Layer was founded by the authors of XGBoost, Apache TVM, and Turi Create: Danny Bickson, Carlos Guestrin, and Amir Alush.
Learn more about Visual Layer here.
FAQs
Open, Clean Datasets for Computer Vision.
We found that vl-datasets demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.