
SNIP is a very compact index (25GB) that has found roughly half a billion duplicates in the LAION-2B-en dataset. You may download the de-duplicated dataset below.
SNIP de-duplicated LAION-2B-en (L2B) on a standard home computer in just several days. We believe the community will benefit from such a dataset, in light of recent research showing the copyright and privacy risks associated with training generative models on highly duplicated datasets, and from SNIP itself as a de-duplication, compression and retrieval tool.
pip install --upgrade snip-dedup
# List available commands
snip --help
snip download --help
# Download and deduplicate the first 10 shards of the dataset
snip download --start 0 --end 10
Then, you may download (deduplicated) laion2b images with the awesome img2dataset.
You may check the fidelity of the duplicates by randomly sampling labeled duplicates and using SNIP to detect the duplicate of each sample. You can do that with retrieve_dup_urls_demo.py (note that you will need the original metadata files for this).
You can also do this with SNIP (coming soon...)
** DISCLAIMER ** Use at your own risk. Help with better de-duplication (higher accuracy, higher recall) is very much appreciated. Taking raw CLIP features as the ground truth for exact duplicates, we get nearly 81% precision (and likely much higher for near duplicates, see below).
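To make the precision figure above concrete, here is a minimal sketch of how precision and recall could be computed once you have two sets of duplicate flags: one from SNIP and one from a reference method such as raw CLIP features. The function name and the toy flags are invented for illustration; this is not the evaluation script used by the authors.

```python
def precision_recall(predicted, ground_truth):
    """Precision and recall of predicted duplicate flags against reference flags."""
    if len(predicted) != len(ground_truth):
        raise ValueError("flag lists must have the same length")
    pairs = list(zip(predicted, ground_truth))
    true_pos = sum(1 for p, g in pairs if p and g)
    false_pos = sum(1 for p, g in pairs if p and not g)
    false_neg = sum(1 for p, g in pairs if not p and g)
    # Guard against empty denominators when nothing is flagged.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy flags, invented for illustration only (True = "labeled as a duplicate").
snip_flags = [True, True, False, True, False, True]
clip_flags = [True, False, False, True, True, True]

precision, recall = precision_recall(snip_flags, clip_flags)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Note that an image flagged by SNIP but not by the reference counts against precision here even if it is a genuine near duplicate, which is why the reported figure is likely a lower bound.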
We release this index for public use and exploration of the LAION-2B-en dataset (more indices coming soon). Soon we will release tools to train your own SNIP indices as well as our scientific paper discussing the method in more detail.
You may find the following necessary files here:
Binary array of De-duplicated Images
Other:
cumulative sizes of features (for indexing sharded files)
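As a rough illustration of why cumulative sizes help with sharded files, the sketch below maps a global row index to a shard number and an offset inside that shard. It assumes that entry i of the cumulative-size list is the total row count of shards 0 through i; the actual layout of the released file is not described here, so treat both the layout and the `locate` helper as hypothetical.

```python
from bisect import bisect_right


def locate(global_index, cumulative_sizes):
    """Map a global row index to (shard number, row offset inside that shard).

    Assumes cumulative_sizes[i] is the total row count of shards 0..i inclusive.
    """
    if not 0 <= global_index < cumulative_sizes[-1]:
        raise IndexError("global index out of range")
    # The first shard whose cumulative size exceeds the index contains the row.
    shard = bisect_right(cumulative_sizes, global_index)
    shard_start = cumulative_sizes[shard - 1] if shard > 0 else 0
    return shard, global_index - shard_start


# Invented sizes: three shards holding 100, 150 and 150 rows.
cumulative_sizes = [100, 250, 400]

print(locate(0, cumulative_sizes))
print(locate(100, cumulative_sizes))
print(locate(399, cumulative_sizes))
```

The binary search keeps the lookup cheap even with thousands of shards, so a global index taken from the duplicate array can be resolved to a specific metadata file without scanning.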
By analyzing the most duplicated images, we have found several more images verbatim copied by Stable Diffusion, posing a copyright problem:
We noticed that many images labeled as duplicates by SNIP but not by raw features are in fact near duplicates, for example:
You may check a list of (randomly sampled) detected duplicate pairs here
SNIP can also be used for semantic search. At just 25GB, it can still return the same k-NNs as exhaustive search roughly a third of the time, over a database of 2.15B vectors.
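One way to read the "same k-NNs" figure is as the fraction of queries for which a compressed index returns exactly the neighbor set that a brute-force scan would. The sketch below measures that on toy data. It assumes squared Euclidean distance, and the `approx` results are hand-written stand-ins for the output of a compressed index; nothing here calls SNIP itself.

```python
import heapq


def exhaustive_knn(query, database, k):
    """Indices of the k database vectors closest to query (squared L2 distance)."""
    def sq_dist(vec):
        return sum((a - b) ** 2 for a, b in zip(query, vec))
    return heapq.nsmallest(k, range(len(database)), key=lambda i: sq_dist(database[i]))


def same_knn_rate(approx_results, exact_results):
    """Fraction of queries whose approximate neighbor set equals the exact one."""
    matches = sum(set(a) == set(e) for a, e in zip(approx_results, exact_results))
    return matches / len(exact_results)


# Toy 2-D vectors, invented for illustration only.
database = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
queries = [(0.1, 0.0), (0.9, 0.1), (4.0, 3.0)]

exact = [exhaustive_knn(q, database, k=2) for q in queries]
approx = [[0, 1], [1, 2], [3, 1]]  # hypothetical output of a compressed index

print(same_knn_rate(approx, exact))
```

This is a strict criterion: a query counts as a miss even if only one of its k neighbors differs, so it understates how often the returned neighbors are still useful.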
This Python project uses the hatch project manager. Dependencies are specified inside the pyproject.toml file, and build configs inside the hatch.toml file.
As such you can enter the isolated development environment with hatch shell from inside the repository.
To avoid silly mistakes, the code is checked with pyright. To ensure consistent styling, all Python code is formatted with black and we use the ruff linter. Once you have installed them, you can check that the code is consistent with:
hatch run check # check for mistakes via static analysis
hatch run format # check formatting of all python files
hatch run lint # check linting rules
TODO: check pyright, formatting and linter in CI
[ ] CI
[ ] check max file size on CI to prevent pushing data
[ ] add docs. numpy docstring standard https://numpydoc.readthedocs.io/en/latest/format.html
[ ] auto publish github action. example at https://github.com/ofek/hatch-showcase/blob/master/.github/workflows/build.yml
[ ] add tests?