CleanVision automatically detects potential issues in image datasets, such as images that are blurry, under- or over-exposed, or (near) duplicates. This data-centric AI package is a quick first step for any computer vision project: it finds problems in the dataset that you will want to address before applying machine learning. CleanVision is super simple -- run the same couple lines of Python code to audit any image dataset!
pip install cleanvision
Download an example dataset (optional). Or just use any collection of image files you have.
wget -nc 'https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip'
from cleanvision import Imagelab
# Specify path to folder containing the image files in your dataset
imagelab = Imagelab(data_path="FOLDER_WITH_IMAGES/")
# Automatically check for a predefined list of issues within your dataset
imagelab.find_issues()
# Produce a neat report of the issues found in your dataset
imagelab.report()
# Alternatively, check only for specific issue types
issue_types = {"dark": {}, "blurry": {}}
imagelab.find_issues(issue_types=issue_types)
# Produce a report with only the specified issue_types
imagelab.report(issue_types=issue_types)
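The `issue_types` argument above is a plain dictionary keyed by issue key. As a minimal sketch, assuming the keys listed in this README's table of issue types and an empty dict per key as in the snippet above, a small helper could build that dictionary and catch misspelled keys early. `KNOWN_ISSUE_KEYS` and `build_issue_types` are hypothetical names for illustration and are not part of the cleanvision API.

```python
# Issue keys copied from the table of issue types in this README
KNOWN_ISSUE_KEYS = {
    "exact_duplicates", "near_duplicates", "blurry", "low_information",
    "dark", "light", "grayscale", "odd_aspect_ratio", "odd_size",
}


def build_issue_types(*keys):
    """Return a dict shaped like the issue_types argument shown above.

    Hypothetical helper: maps each requested key to an empty dict and
    raises ValueError for keys that are not in the README's table.
    """
    unknown = sorted(set(keys) - KNOWN_ISSUE_KEYS)
    if unknown:
        raise ValueError(f"Unknown issue keys: {unknown}")
    return {key: {} for key in keys}


issue_types = build_issue_types("dark", "blurry")
```

The resulting `issue_types` has the same shape as the hand-written dictionary above, so it could be passed to `find_issues` and `report` in the same way.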
python examples/run.py --path <FOLDER_WITH_IMAGES>
The quality of machine learning models hinges on the quality of the data used to train them, but it is hard to manually identify all of the low-quality data in a big dataset. CleanVision helps you automatically identify common types of data issues lurking in image datasets.
This package currently detects issues in the raw images themselves, making it a useful tool for any computer vision task such as: classification, segmentation, object detection, pose estimation, keypoint detection, generative modeling, etc. To detect issues in the labels of your image data, you can instead use the cleanlab package.
In any collection of image files (most formats supported), CleanVision can detect the following types of issues:
| | Issue Type | Description | Issue Key |
|---|---|---|---|
| 1 | Exact Duplicates | Images that are identical to each other | exact_duplicates |
| 2 | Near Duplicates | Images that are visually almost identical | near_duplicates |
| 3 | Blurry | Images where details are fuzzy (out of focus) | blurry |
| 4 | Low Information | Images lacking content (little entropy in pixel values) | low_information |
| 5 | Dark | Irregularly dark images (underexposed) | dark |
| 6 | Light | Irregularly bright images (overexposed) | light |
| 7 | Grayscale | Images lacking color | grayscale |
| 8 | Odd Aspect Ratio | Images with an unusual aspect ratio (overly skinny/wide) | odd_aspect_ratio |
| 9 | Odd Size | Images that are abnormally large or small compared to the rest of the dataset | odd_size |
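To make two of the rows above concrete, here is a minimal sketch of how exact duplicates and low-information images could be flagged using only the Python standard library. This is not CleanVision's actual implementation: the helper names `find_exact_duplicates` and `pixel_entropy` are hypothetical, the duplicate check compares raw file bytes (so identical pixels saved with different metadata would not match), and the entropy helper assumes pixel values have already been decoded into a flat list of 0-255 integers.

```python
import hashlib
import math
from collections import Counter, defaultdict
from pathlib import Path


def find_exact_duplicates(folder):
    """Group files under `folder` whose bytes are identical (hypothetical helper)."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path.name)
    # Keep only the hashes shared by two or more files
    return [names for names in groups.values() if len(names) > 1]


def pixel_entropy(pixels):
    """Shannon entropy, in bits, of a flat list of 0-255 pixel values."""
    counts = Counter(pixels)
    total = len(pixels)
    # Sum p * log2(1 / p) over the observed pixel values
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

A patch where every pixel has the same value has zero entropy under this definition, which matches the "little entropy in pixel values" description of the low-information issue. The cutoff below which an image counts as low information is a separate choice that this sketch does not make.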
CleanVision supports Linux, macOS, and Windows and runs on Python 3.7+.
The best place to learn is our Slack community. Join the discussion there to see how folks are using this library, discuss upcoming features, or ask for private support.
Need professional help with CleanVision? Join our #help Slack channel and message us there, or reach out via email: team@cleanlab.ai
Interested in contributing? See the contributing guide. An easy starting point is to consider issues marked good first issue or simply reach out in Slack. We welcome your help building a standard open-source library for data-centric computer vision!
Ready to start adding your own code? See the development guide.
Have an issue? Search existing issues or submit a new issue.
Have ideas for the future of data-centric computer vision? Check out our active/planned Projects and what we could use your help with.
Copyright (c) 2022 Cleanlab Inc.
cleanvision is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
cleanvision is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See GNU Affero General Public LICENSE for details.
Commercial licensing is available for enterprise teams that want to use CleanVision in production workflows, but are unable to open-source their code as is required by the current license. Please email us: team@cleanlab.ai