
Security News
npm Adopts OIDC for Trusted Publishing in CI/CD Workflows
npm now supports Trusted Publishing with OIDC, enabling secure package publishing directly from CI/CD workflows without relying on long-lived tokens.
Command-line tool for extracting DINO, CLIP and SigLIP features from images and videos
DINOtool is a command-line tool for extracting visual features from images and videos using modern vision models like DINOv2, CLIP, SigLIP2, and OpenCLIP/timm compatible models. It supports both global (frame-level) and local (patch-level) features, and can optionally visualize feature maps using PCA.
pip install dinotool
dinotool test.jpg -o out.jpg
Works with:
🧠 Supports multiple model backends:
💾 Outputs standard formats:
🌈 Optional PCA-based side-by-side visualizations
⚡ Simple CLI with no coding required
DINOtool is designed for:
Researchers exploring vision models or needing feature extraction for experiments
Data scientists working with image/video datasets for tasks like clustering, retrieval, or classification
Developers who want to use DINO, CLIP, or SigLIP2 features without writing model code
Students and educators looking to visualize and understand patch-based ViT features
Anyone who wants to preprocess media into standardized visual features for downstream ML tasks — without building a custom pipeline
dinotool input.mp4 -o output.mp4
produces output:
DINOv2 accepts inputs of any size. The OpenCLIP/timm models resize the input. Here is an example of a 896x896 image:
dinotool test/data/bird1.jpg -o dinov2.jpg --model-name vit-b # Shortcut to dinov2_vitb14_reg
dinotool test/data/bird1.jpg -o siglip2.jpg --model-name siglip2 # Shortcut to hf-hub:timm/ViT-B-16-SigLIP2-512
produces outputs (DINOv2 / SigLIP2):
Processing image directories and extracting global or local features for each image is easy with DINOtool:
dinotool image_folder/ -o global_features --save-features 'frame'
produces a global_features.parquet
file with global features:
filename | feature_0 | feature_1 | feature_2 | ... | feature_383 |
---|---|---|---|---|---|
cat_001.jpg | 0.123 | -0.045 | 0.211 | ... | 0.009 |
dog_002.jpg | 0.097 | 0.033 | 0.187 | ... | -0.012 |
tree_003.jpg | -0.056 | 0.140 | 0.092 | ... | 0.034 |
car_004.jpg | 0.301 | -0.202 | 0.144 | ... | -0.019 |
Similar files can be also produced for local patch features, for videos etc.
More example commands can be found in test/test_cases.md
Example of reading output file formats is in docs/reading_outputs.ipynb
Example of PCA feature visualization by first masking objects using the first PCA features, similar to DINOv2 demos is in docs/masked_pca_demo.ipynb:
If you do not have ffmpeg installed:
sudo apt install ffmpeg
Install via pip:
pip install dinotool
You can check that dinotool is properly installed by testing it on an image:
dinotool test.jpg -o out.jpg
uv
If you have uv
installed, you can simply run DINOtool with
uv run --with dinotool dinotool test.jpg -o out.jpg
You still have to have ffmpeg
installed. uvx
does not work on linux due to xformers
dependencies.
If you want an isolated setup, especially useful for managing ffmpeg
and dependencies:
Install Miniforge.
conda create -n dinotool python=3.12
conda activate dinotool
conda install -c conda-forge ffmpeg
pip install dinotool
Extract and visualize DINO features from an image:
dinotool input.jpg -o output.jpg
This produces a .jpg
similar to the examples above.
For a easy-to-process Parquet file of the local features without visualization, run
dinotool input.jpg -o out_features --save-features 'flat' --no-vis
Extract global features from a video using SigLIP2:
dinotool input.mp4 -o features --model-name siglip2 --save-features frame
This produces a features.parquet
file with a row for each frame of the video.
Process a folder of images with patch-level output:
dinotool images/ -o results --save-features full
This produces a folder results
with visualization .jpg
and a NetCDF file for each image separately.
If the images in the folder can be resized to a fixed size, you can use batch processing by setting a fixed resize size (--input-size W H
) and --no-vis
:
dinotool images/ -o results2 --save-features 'frame' --input-size 512 512 --batch-size 4 --no-vis
This produces a parquet file with global features for each image.
Use --save-features
to export features for downstream tasks.
Mode | Format | Output shape | Best for |
---|---|---|---|
full | .nc (image) / .zarr (video, batched image folders) | (frames, height, width, feature) | Keeps spatial structure of patches. |
flat | partitioned .parquet | (frames * height * width, feature) | Reliable long video processing. Faster patch-level analysis |
frame | .parquet | (frames, feature) | One global feature vector per frame |
full
- Spatial local features.zarr
for video sequences.zarr
saving can be memory-intensive and might still fail for large videos.dinotool input.mp4 -o output.mp4 --save-features full
flat
- Flattened local featuresparquet
format..parquet
file with one row per patch..parquet
directory, with indices for frames and patches.dinotool input.mp4 -o output.mp4 --save-features flat
frame
- Global features.txt
file with a single vector.parquet
file with one row per frame/image.# For a video
dinotool input.mp4 -o output.mp4 --save-features frame
# For an image
dinotool input.jpg -o output.jpg --save-features frame
The output is a side-by-side visualization with PCA of the patch-level features.
--model-name
By default, the value passed to this argument is loaded from facebookresearch/dinov2
, meaning that the possible DINOv2 models are:
dinov2_vits14
dinov2_vitb14
dinov2_vitl14
dinov2_vitg14
and their reg
variants (recommended): i.e. dinov2_vits14_reg
.
See the DINOv2 github repo for more information.
OpenCLIP models:
DINOtool now supports also ViT models that follow the OpenCLIP/timm model API for feature extraction. These models are for example the SigLIP2 models in Huggingface hub. Additionally, other models in the Hub should also work, but have not been fully tested. These include SigLIP and CLIP models.
The OpenCLIP/timm model name has to be passed in the format hf-hub:timm/<model name>
.
Shortcuts:
There are some predefined shortcuts for popular models. These can be passed to --model-name
# DINOv2
"vit-s": "dinov2_vits14_reg"
"vit-b": "dinov2_vitb14_reg"
"vit-l": "dinov2_vitl14_reg"
"vit-g": "dinov2_vitg14_reg"
# SigLIP2
"siglip2": "hf-hub:timm/ViT-B-16-SigLIP2-512"
"siglip2-so400m-384": "hf-hub:timm/ViT-SO400M-16-SigLIP2-384"
"siglip2-so400m-512": "hf-hub:timm/ViT-SO400M-16-SigLIP2-512"
"siglip2-b16-256": "hf-hub:timm/ViT-B-16-SigLIP2-256"
"siglip2-b16-512": "hf-hub:timm/ViT-B-16-SigLIP2-512"
"siglip2-b32-256": "hf-hub:timm/ViT-B-32-SigLIP2-256"
"siglip2-b32-512": "hf-hub:timm/ViT-B-32-SigLIP2-512"
# CLIP
"clip": "hf-hub:timm/vit_base_patch16_clip_224.openai"
--input-size
Setting input size fixes the resolution for all inputs. This is useful for processing HD videos, and mandatory for batch processing of image folders.
# Processing a HD video faster:
dinotool input.mp4 -o output.mp4 --input-size 920 540 --batch-size 16
--batch-size
For faster processing, set batch size as large as your GPU memory allows. Batch processing is possible for video files and directories of video frames (following naming where each imagename can be converted to an integer, like 00001.jpg
), where all inputs are assumed to be the same size.
dinotool input.mp4 -o output.mp4 --batch-size 16
For batch processing image folders, --input-size
must be set. Visualization is also not possible.
🦕 DINOtool: Extract and visualize ViT features from images and videos.
Usage:
dinotool input_path -o output_path [options]
Arguments:
input Path to image, video file, or folder of frames.
-o, --output Path for the output (required).
Options:
-s, --save-features MODE Save extracted features: full, flat, or frame
-m, --model-name MODEL Model to use (default: dinov2_vits14_reg)
--input-size W H Resize input before processing. Must be set for batch
processing of image folders
-b, --batch-size N Batch size for faster processing
--only-pca Only visualize PCA features.
--no-vis Only output features with no visualization.
--save features must be set.
FAQs
Command-line tool for extracting DINO, CLIP and SigLIP features from images and videos
We found that dinotool demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
npm now supports Trusted Publishing with OIDC, enabling secure package publishing directly from CI/CD workflows without relying on long-lived tokens.
Research
/Security News
A RubyGems malware campaign used 60 malicious packages posing as automation tools to steal credentials from social media and marketing tool users.
Security News
The CNA Scorecard ranks CVE issuers by data completeness, revealing major gaps in patch info and software identifiers across thousands of vulnerabilities.