Shuffler is a Python library for data engineering in computer vision. It simplifies building, maintaining, and inspecting datasets for machine learning.
Suppose you are building a dataset to train a vehicle classifier. You may start by downloading the public BDD dataset. Then you (1) remove annotations of everything but vehicles, (2) filter out all tiny vehicles, (3) expand bounding boxes by 20% to include some context, (4) crop out the bounding boxes, and (5) save annotations in the ImageNet format to be fed to TensorFlow. Shuffler lets you do all of that by running a single command in the terminal (see use case #1).
Data engineering for machine learning means building and maintaining datasets.
Research groups in academia compare their algorithms on publicly available datasets, such as KITTI. In order to allow comparison, public datasets must be static. On the other hand, a data scientist in industry enhances both algorithms AND datasets in order to achieve the best performance on a task. That includes collecting data, cleaning data, and fitting data for a task. Some even treat data as code. This is data engineering.
You may need a data engineering package if you find yourself writing multiple scripts with a lot of boilerplate code for simple operations on data, if your scripts are becoming write-only code, or if you have multiple modifications of the same dataset, e.g. folders named "KITTI", "KITTI-only-vans", "KITTI-inspected", etc.
Shuffler requires Python 3. The installation instructions assume the Conda package management system.
Install dependencies:
conda install -c conda-forge imageio ffmpeg=4 opencv matplotlib
conda install lxml simplejson progressbar2 pillow scipy
conda install pandas seaborn # If desired, add support for plotting commands
Clone this project:
git clone https://github.com/kukuruza/shuffler
To test the installation, run the following command. The installation succeeded if an image opens up. Press Esc to close the window.
cd shuffler
python -m shuffler -i 'testdata/cars/micro1_v5.db' --rootdir 'testdata/cars' examineImages
Shuffler is a command line tool. It chains operations, such as importKitti to import a dataset from KITTI format and exportCoco to export it in COCO format.
python -m shuffler \
importKitti --images_dir ${IMAGES_DIR} --detection_dir ${OBJECT_LABELS_DIR} '|' \
exportCoco --coco_dir ${OUTPUT_DIR} --subset 'train'
importKitti and exportCoco above are examples of operations. There are over 60 operations that fall under the following broad categories:
Sub-commands can be chained via the vertical bar |, similar to pipes in Unix. The vertical bar should be quoted or escaped. Using single quotes '|' works in Windows, Linux, and Mac. Alternatively, in Unix, you can escape the vertical bar as \|.
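When Shuffler is launched from a script rather than typed into a shell, the vertical bar is simply one more argument and needs no quoting. The following is a minimal Python sketch under that assumption; the helper build_shuffler_argv is hypothetical (not part of Shuffler), and the directory names are placeholders.

```python
import shlex


def build_shuffler_argv(operations, global_args=()):
    """Join operations into one argument list, separated by a literal '|'.

    `operations` is a list of argument lists, one per Shuffler operation.
    This helper is illustrative only; it is not part of Shuffler.
    """
    argv = ["python", "-m", "shuffler", *global_args]
    for index, operation in enumerate(operations):
        if index > 0:
            argv.append("|")  # Passed as-is: no shell is involved, so no quoting.
        argv.extend(operation)
    return argv


argv = build_shuffler_argv(
    [
        ["importKitti", "--images_dir", "images", "--detection_dir", "labels"],
        ["exportCoco", "--coco_dir", "out", "--subset", "train"],
    ]
)

# shlex.join shows the equivalent shell command, with the bar quoted.
print(shlex.join(argv))

# To actually run it (requires Shuffler and real paths):
# import subprocess; subprocess.run(argv, check=True)
```

Printing the command with shlex.join is a convenient way to check that the generated command matches what you would type by hand.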
The next example (1) opens a database, (2) converts polygon labels to pixel-by-pixel image masks, (3) adds more images with their masks to the database, and (4) prints a summary.
python -m shuffler --rootdir 'testdata/cars' -i 'testdata/cars/micro1_v5.db' \
polygonsToMask --media='pictures' --mask_path 'testdata/cars/mask_polygons' '|' \
addPictures --image_pattern 'testdata/moon/images/*.jpg' --mask_pattern 'testdata/moon/masks/*.png' '|' \
examineImages --mask_alpha 0.5 '|' \
printInfo
Shuffler has an interface to PyTorch: classes ImageDataset and ObjectDataset implement torch.utils.data.Dataset.
A demo provides an example of using a Shuffler database as a Dataset in PyTorch inference.
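For orientation, a map-style dataset only has to provide __len__ and __getitem__. The sketch below is not Shuffler's ImageDataset or ObjectDataset. It is a hypothetical stand-in that reads from a tiny in-memory SQLite table with assumed columns objectid, imagefile, and name, and it returns file paths rather than decoded images so that it runs without PyTorch installed.

```python
import sqlite3


class TinyObjectDataset:
    """Hypothetical map-style dataset over an `objects` table.

    With PyTorch installed, one would subclass torch.utils.data.Dataset; the
    two methods below are what a map-style dataset has to provide.
    """

    def __init__(self, conn):
        self.conn = conn
        # Cache the ids once so that integer indices map to stable rows.
        self.objectids = [
            row[0]
            for row in conn.execute("SELECT objectid FROM objects ORDER BY objectid")
        ]

    def __len__(self):
        return len(self.objectids)

    def __getitem__(self, index):
        imagefile, name = self.conn.execute(
            "SELECT imagefile, name FROM objects WHERE objectid = ?",
            (self.objectids[index],),
        ).fetchone()
        return {"imagefile": imagefile, "name": name}


# Toy data; the schema is an assumption, reduced to three columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects (objectid INTEGER PRIMARY KEY, imagefile TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO objects (imagefile, name) VALUES (?, ?)",
    [
        ("images/000000.jpg", "car"),
        ("images/000000.jpg", "bus"),
        ("images/000001.jpg", "car"),
    ],
)

dataset = TinyObjectDataset(conn)
print(len(dataset), dataset[1])
```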
Shuffler also has an interface to Keras: the classes in interface/keras/generetors.py implement keras.utils.Sequence.
A demo provides an example of using a Shuffler database as a Generator in Keras inference.
Alternatively, data can be exported to one of the popular formats, e.g. COCO, if your deep learning project already has a loader for it.
Shuffler is for inspecting and modifying your datasets. Check out some use cases.
You can convert one format to another. Check out the dataset IO tutorial.
Shuffler provides an API to PyTorch and Keras.
Shuffler's database schema is designed to support computer vision tasks, in particular image classification, object and panoptic detection, image and instance segmentation, object tracking, and object detection in video.
Shuffler does not support versions inside the database SQL schema. The version can be a part of the database name, e.g. dataset.v1.db and dataset.v2.db.
A dataset consists of (1) image data, stored as image and video files, and (2) metadata, stored as an SQLite database. Shuffler's SQL schema is designed to support popular machine learning tasks in computer vision.
The public BDD dataset includes 100K images taken from a moving car with various objects annotated in each image. If a researcher wants to train a classifier between "car", "truck", and "bus", they may start by using this dataset. First, annotations of all objects except for these three classes must be filtered out. Second, the dataset contains annotations for a large number of tiny vehicles, which are not useful for training a classifier and must be filtered out as well. Third, it may be beneficial to expand bounding boxes to allow for data augmentation during training. Fourth, the remaining objects need to be cropped out. The cropped images and the annotations are saved in ImageNet format, which is easily consumable by TensorFlow. The KITTI dataset is assumed to be downloaded to directories ${IMAGES_DIR} and ${OBJECT_LABELS_DIR}.
python -m shuffler \
importKitti --images_dir ${IMAGES_DIR} --detection_dir ${OBJECT_LABELS_DIR} '|' \
filterObjectsByName --good_names 'car' 'truck' 'bus' '|' \
filterObjectsSQL --sql "SELECT objectid FROM objects WHERE width < 64 OR height < 64" '|' \
expandObjects --expand_fraction 0.2 '|' \
cropObjects --media 'pictures' --image_path ${NEW_CROPPED_IMAGE_PATH} --target_width 224 --target_height 224 '|' \
exportImagenet2012 --imagenet_home ${NEW_IMAGENET_DIRECTORY} --symlink_images
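The filtering and expansion steps in this pipeline amount to simple arithmetic on bounding boxes. The sketch below illustrates them in plain Python. It is not Shuffler's implementation: the Box type and both helpers are hypothetical, and it assumes that expanding by a fraction grows width and height by that fraction around the box center (expandObjects may define it differently).

```python
from dataclasses import dataclass


@dataclass
class Box:
    name: str
    x: float  # Left edge, in pixels.
    y: float  # Top edge, in pixels.
    width: float
    height: float


def keep(box, good_names, min_size):
    """True for a wanted class that is at least min_size in both dimensions."""
    return (
        box.name in good_names
        and box.width >= min_size
        and box.height >= min_size
    )


def expand(box, fraction):
    """Grow width and height by `fraction`, keeping the center fixed.

    These semantics are an assumption made for illustration.
    """
    dw = box.width * fraction
    dh = box.height * fraction
    return Box(box.name, box.x - dw / 2, box.y - dh / 2, box.width + dw, box.height + dh)


boxes = [
    Box("car", 100, 100, 80, 70),
    Box("car", 10, 10, 20, 15),          # Too small, dropped.
    Box("pedestrian", 50, 50, 90, 200),  # Wrong class, dropped.
]
result = [expand(b, 0.2) for b in boxes if keep(b, {"car", "truck", "bus"}, 64)]
print(result)
```

Note that an expanded box can extend past the image border, so a real cropping step has to clip or pad; that is left out here.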
A researcher has collected a dataset of images with cars. Images were handed out to a team of annotators. Each image was annotated with polygons by several annotators using LabelMeAnnotationTool. The researcher (1) imports all labels, (2) merges polygons corresponding to the same car made by all annotators, and (3) gets object masks, where the gray area marks the inconsistency across annotators. See the tutorial.
A user works on object detection in the autonomous vehicle setup, and would like to use as many annotated images as possible. In particular, they aim to combine certain classes from the public datasets KITTI, BDD, and PASCAL VOC 2012. The combined dataset is exported in COCO format for training. See the tutorial.
We have a dataset with objects given as bounding boxes. We would like to remove objects on the image boundary, expand bounding boxes by 10% for better training, remove objects of all types except "car", "bus", and "truck", and remove objects smaller than 30 pixels wide. We would like to use that subset for training.
In the previous use case we removed some objects for our object detection training task. Now we want to evaluate the trained model. We expect our model to detect only big objects, but we don't want to count it as a false positive if it detects a tiny object either.
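One common way to handle this is to keep the tiny ground-truth objects but mark them as ignored: a detection that matches an ignored object counts neither as a true positive nor as a false positive. The sketch below illustrates the idea in plain Python. It is an assumption about how such an evaluation could work, not a description of Shuffler's evaluation operations; the function names and the 0.5 IoU threshold are chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_detections(detections, ground_truth, min_width, iou_threshold=0.5):
    """Return (true_positives, false_positives, false_negatives).

    Ground-truth boxes narrower than min_width are "ignored": matching one
    is neither rewarded nor penalized, and missing one is not a false negative.
    Detections are processed in the order given.
    """
    matched = set()
    tp = fp = 0
    for det in detections:
        # Greedily pick the best still-unmatched ground-truth box.
        best_index, best_iou = None, iou_threshold
        for index, gt in enumerate(ground_truth):
            overlap = iou(det, gt)
            if index not in matched and overlap >= best_iou:
                best_index, best_iou = index, overlap
        if best_index is None:
            fp += 1  # Matches nothing at all, not even an ignored object.
            continue
        matched.add(best_index)
        gt = ground_truth[best_index]
        if gt[2] - gt[0] >= min_width:
            tp += 1
    fn = sum(
        1
        for index, gt in enumerate(ground_truth)
        if index not in matched and gt[2] - gt[0] >= min_width
    )
    return tp, fp, fn


ground_truth = [(0, 0, 100, 100), (200, 200, 210, 210), (300, 300, 400, 400)]
detections = [(5, 5, 100, 100), (200, 200, 210, 210), (500, 500, 600, 600)]
print(count_detections(detections, ground_truth, min_width=30))
```

In this toy input the second detection lands on a 10-pixel-wide object, so it is silently skipped, while the third detection overlaps nothing and is counted as a false positive.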
A neural network was trained to perform semantic segmentation of images. We have a directory with ground truth masks and a directory with predicted masks. We would like to (1) evaluate the results, and (2) write a video with images and their predicted segmentation masks side-by-side.
We have images with objects. Images have masks with those objects. We would like to crop out objects with name "car" bigger than 32 pixels wide, stretch the crops to 64x64 pixels, and write a new dataset of images (and the corresponding masks).
A dataset contains objects of class "car", among other classes. We would like to additionally classify cars by type for more fine-grained detection. An image annotator needs to go through all the "car" objects, and assign one of the following types to them: "passenger", "truck", "van", "bus", "taxi". See the tutorial.
Shuffler stores metadata as an SQLite database. Metadata includes image paths and annotations.
You can import some well-known formats and save them in Shuffler's format. For example, importing PASCAL VOC 2012 looks like this. We assume you have downloaded PASCAL VOC to ${VOC_DIR}:
python -m shuffler -o 'myPascal.db' importPascalVoc2012 ${VOC_DIR} --annotations
You can open myPascal.db with any SQLite3 editor/viewer and manually inspect data entries, or run some SQL on it.
The toolbox supports datasets consisting of 1) images and masks, 2) objects annotated with masks, polygons, and bounding boxes, and 3) matches between objects. It stores annotations as a SQL database of its custom format. This database can be viewed and edited manually with any SQL viewer.
The beauty of storing annotations in a relational SQLite database is that one can use any SQL editor to explore them. For example, Linux includes the command line tool sqlite3.
The commands below illustrate using sqlite3 to get some statistics and change testdata/cars/micro1_v5.db from this repository.
# Find the total number of images:
sqlite3 testdata/cars/micro1_v5.db 'SELECT COUNT(imagefile) FROM images'
# Find the total number of images with objects:
sqlite3 testdata/cars/micro1_v5.db 'SELECT COUNT(DISTINCT(imagefile)) FROM objects'
# Print out names of objects and their count:
sqlite3 testdata/cars/micro1_v5.db 'SELECT name, COUNT(1) FROM objects GROUP BY name'
# Print out dimensions of all objects of the class "car":
sqlite3 testdata/cars/micro1_v5.db 'SELECT width, height FROM objects WHERE name="car"'
# Change all names "car" to "vehicle".
sqlite3 testdata/cars/micro1_v5.db 'UPDATE objects SET name="vehicle" WHERE name="car"'
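The same queries can be run from Python with the standard sqlite3 module. The sketch below builds a small in-memory database so that it runs anywhere; its two tables contain only the columns used in the queries above, which is an assumption and far less than Shuffler's full schema. To query a real database, connect to a path such as testdata/cars/micro1_v5.db instead of ":memory:" and drop the CREATE and INSERT statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Assumed minimal schema: only the columns referenced by the queries above.
cursor.execute("CREATE TABLE images (imagefile TEXT PRIMARY KEY)")
cursor.execute(
    "CREATE TABLE objects (objectid INTEGER PRIMARY KEY, imagefile TEXT, "
    "name TEXT, width INTEGER, height INTEGER)"
)
cursor.executemany(
    "INSERT INTO images VALUES (?)", [("a.jpg",), ("b.jpg",), ("c.jpg",)]
)
cursor.executemany(
    "INSERT INTO objects (imagefile, name, width, height) VALUES (?, ?, ?, ?)",
    [("a.jpg", "car", 120, 80), ("a.jpg", "bus", 200, 150), ("b.jpg", "car", 40, 30)],
)

# Total number of images, and number of images that have objects.
num_images = cursor.execute("SELECT COUNT(imagefile) FROM images").fetchone()[0]
num_images_with_objects = cursor.execute(
    "SELECT COUNT(DISTINCT imagefile) FROM objects"
).fetchone()[0]

# Object names and their counts, as a dictionary.
counts_by_name = dict(
    cursor.execute("SELECT name, COUNT(1) FROM objects GROUP BY name").fetchall()
)

# Rename "car" to "vehicle"; query parameters avoid quoting problems.
cursor.execute("UPDATE objects SET name = ? WHERE name = ?", ("vehicle", "car"))
conn.commit()
renamed = cursor.execute(
    "SELECT COUNT(1) FROM objects WHERE name = 'vehicle'"
).fetchone()[0]

print(num_images, num_images_with_objects, counts_by_name, renamed)
```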
Please submit a pull request or open an issue with a suggestion.
If you find this project useful for you, please consider citing:
@inproceedings{10.1145/3332186.3333046,
author = {Toropov, Evgeny and Buitrago, Paola A. and Moura, Jos\'{e} M. F.},
title = {Shuffler: A Large Scale Data Management Tool for Machine Learning in Computer Vision},
year = {2019},
isbn = {9781450372275},
series = {PEARC '19}
}