Imgtovar is a Python module, developed in collaboration with researchers, which allows for variable extraction from image data. The pipeline consists of three steps: image extraction, cleaning and prediction. This module integrates some functions from the popular module DeepFace for facial attribute analysis.
It currently supports Natural vs Man-made background analysis, Chart recognition and identification, Age, Gender, Race and Emotion prediction, and Object Detection for a total of 100+ objects.
:heavy_exclamation_mark::heavy_exclamation_mark: ImgtoVar is currently undergoing a full transition to PyTorch. In the meantime, the stable version is 1.1.0, which is fully functional, but due to environment conflicts the object detection model can only run inference on CPU.
The current installation procedure requires setting up a TensorFlow GPU-enabled environment with a CPU installation of PyTorch. The easiest way to achieve that is through conda.
$ conda create -n imgtovar tensorflow-gpu
$ conda activate imgtovar
$ conda install pytorch torchvision torchaudio cpuonly -c pytorch
After the environment has been set up, the ImgtoVar module is installed through PyPI.
$ pip install imgtovar
An interactive tutorial showcasing all methods in a sample workflow extracting features from image data of a PDF document can be found in the following Google Colab notebook
Now we go over a quick explanation of all the methods.
:heavy_exclamation_mark: All code has detailed comments explaining the different parameters and their functions
Imgtovar allows the user to extract images from documents in case there is no image database.
ImgtoVar.extract(data, mode="PDF")
The data parameter can be either a single file or a directory.
This function extracts all images from one or more documents and stores them in a directory structure like so:
extract_output|
    |--pdf_name|
        |--images
        ...
:warning: Currently supported modes: PDF only!
During initial development, cleaning was found to be a crucial step, especially when the images were acquired through automatic image extraction from documents.
There are three methods used in the cleaning process. It is best to filter out corrupted images and infographics before running the color analysis, in order to limit false positives in that last step. Here are the methods in the intended order:
1. detect_infographics
This method detects whether an image is an infographic and, if so, predicts its type. The detection stage has an F1-score of 97%, showing strong performance. The overall classifier accuracy is 87%, which reflects the performance of the identification stage.
charts_df = ImgtoVar.detect_infographics(data)
The function can take as input a single file, a directory of files, or a directory of directories structured like the output of ImgtoVar.extract(), and returns a DataFrame with the image file names and the predicted chart_type.
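As a rough illustration of the three input modes, the sketch below resolves a path into groups of image files. It is not ImgtoVar's own code: `collect_images`, `IMAGE_EXTS` and the chosen set of extensions are assumptions made for this example only.

```python
from pathlib import Path

# Assumed set of image extensions; the module's real list is not documented here.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def _is_image(path):
    """True for regular files with one of the assumed image extensions."""
    return path.is_file() and path.suffix.lower() in IMAGE_EXTS


def collect_images(data):
    """Group image paths by folder for the three input modes (hypothetical helper)."""
    path = Path(data)

    # Mode 1: a single file.
    if path.is_file():
        return {path.parent.name: [path]}

    # Mode 2: a directory that directly contains image files.
    direct = sorted(p for p in path.iterdir() if _is_image(p))
    if direct:
        return {path.name: direct}

    # Mode 3: a directory of directories, e.g. one folder per PDF.
    groups = {}
    for sub in sorted(p for p in path.iterdir() if p.is_dir()):
        images = sorted(p for p in sub.rglob("*") if _is_image(p))
        if images:
            groups[sub.name] = images
    return groups
```

Grouping by folder keeps one entry per PDF, which is the unit the checkpoints described in the parameter list operate on.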
Parameters:
data: Three modes of operation: 'Single file', 'Directory of files', 'Directory of directories'
extract (boolean): Determines whether the images classified as infographics will be moved to a new directory.
run_on_gpu (bool): Determines whether TensorFlow should use GPU acceleration. The default is False for stability; if enabled, you could run into OOM errors depending on available GPU VRAM.
resume (bool): Determines whether a new experiment will be run or a previous one is resumed.
When the data directory contains a folder for each PDF, a checkpoint is made after each folder!
The output directory structure looks as follows:
Output|
    |--infographics|
        |exp_1|
            |--infographics.csv
            |--checkpoints # Keeps track of progress to allow for process resuming
            |--Infographics| # Directory created only if extract = True
                |--pdf_name/folder_name|
                    |--extracted_images
                    ...
                ...
This method should be used before color_analysis in order to reduce the false positives in that step.
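The resume parameter and the per-folder checkpoints described above can be pictured with the following minimal sketch. It is only an assumption about how such a mechanism might work, not the module's implementation: `process_with_checkpoints`, the JSON checkpoint file and the `process_folder` callback are all hypothetical names.

```python
import json
from pathlib import Path


def process_with_checkpoints(data_dir, checkpoint_file, process_folder, resume=False):
    """Process each sub-folder once and record finished folders (hypothetical helper)."""
    data_dir = Path(data_dir)
    checkpoint_file = Path(checkpoint_file)

    # When resuming, load the names of folders that were already completed.
    done = []
    if resume and checkpoint_file.exists():
        done = json.loads(checkpoint_file.read_text())

    results = {}
    for folder in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        if folder.name in done:
            continue  # finished in a previous run
        results[folder.name] = process_folder(folder)

        # Write the checkpoint after every folder so an interrupted run can resume.
        done.append(folder.name)
        checkpoint_file.parent.mkdir(parents=True, exist_ok=True)
        checkpoint_file.write_text(json.dumps(done))
    return results
```

Calling the helper again with resume=True skips every folder listed in the checkpoint file, while resume=False starts from scratch and overwrites it.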
2. detect_invertedImg
Through trial and error we have found that image extraction from PDF files results in a small percentage of images being corrupted. Those images have inverted channel values and additional problems with contrast and lightness. To identify those images ImgtoVar provides a method that has 93% accuracy in detecting the corrupted images.
inv_df = ImgtoVar.detect_invertedImg(data)
The function can take as input a single file, a directory of files, or a directory of directories structured like the output of ImgtoVar.extract(), and returns a DataFrame with the image file names and the predicted chart_type.
Parameters:
data: Three modes of operation: 'Single file', 'Directory of files', 'Directory of directories'
extract (boolean): Determines whether the images classified as inverted will be moved to a new directory.
run_on_gpu (bool): Determines whether TensorFlow should use GPU acceleration. The default is False for stability; if enabled, you could run into OOM errors depending on available GPU VRAM.
resume (bool): Determines whether a new experiment will be run or a previous one is resumed.
When the data directory contains a folder for each PDF, a checkpoint is made after each folder!
The output directory structure looks as follows:
Output|
    |--inverted_images|
        |exp_1|
            |--inverted.csv
            |--checkpoints # Keeps track of progress to allow for process resuming
            |--Inverted| # Directory created only if extract = True
                |--pdf_name/folder_name|
                    |--extracted_images
                    ...
                ...
3. color_analysis
To filter out the undesirable images, ImgtoVar provides a method for analysing the distribution of hue/lightness pairs across all pixels.
hl_pairs_df = ImgtoVar.color_analysis(data)
A pandas DataFrame is returned containing the image file name, the total number of H/L pairs found, and the proportion represented by the top 200 pairs (this is adjustable). Additionally, images identified as "Artificial" can be moved to a new directory if the extract parameter is set to True.
Parameters:
data: Three modes of operation: 'Single file', 'Directory of files', 'Directory of directories'
color_width (int): How many of the most populous hue:lightness pairs to sum together to determine the proportion of the final image they occupy.
threshold (float): 0 < threshold < 1, what percent of the image is acceptable for color_width hue:lightness pairs to occupy, more than this is tagged as artificial.
max_intensity (int): Ceiling value for the hue:lightness pair populations. This value will affect the pixel proportion if too low.
extract (boolean): Determines whether the images classified as artificial will be moved to a new directory.
run_on_gpu (bool): Determines whether TensorFlow should use GPU acceleration. The default is False for stability; if enabled, you could run into OOM errors depending on available GPU VRAM.
resume (bool): Determines whether a new experiment will be run or a previous one is resumed.
When the data directory contains a folder for each PDF, a checkpoint is made after each folder!
The output directory structure looks as follows:
Output|
    |--color_analysis|
        |exp_1|
            |--color_analysis.csv
            |--checkpoints # Keeps track of progress to allow for process resuming
            |--Artificial| # Directory created only if extract = True
                |--pdf_name/folder_name|
                    |--extracted_images
                    ...
                ...
The filtering is based on that proportion: real images will have a low proportion, while drawings, logos or single-color images will have a high proportion. By varying the threshold parameter, the user can make the filtering more or less aggressive.
| Color analysis | 200 H/L Width |
|:-------------------------:|:-------------------------:|
| Dominant pairs proportion: 3% | Dominant pairs proportion: 64% |
This method is very effective at identifying real photographs, but can mistakenly label simpler images like infographics as undesirable.
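To make the dominant-pairs proportion concrete, here is a small standard-library sketch of the idea. It is an illustration under stated assumptions rather than ImgtoVar's implementation: the function names, the bin counts and the 0.5 default threshold are hypothetical, and the max_intensity ceiling is not modelled.

```python
import colorsys
from collections import Counter


def dominant_pair_proportion(pixels, color_width=200, hue_bins=180, light_bins=256):
    """Share of pixels covered by the `color_width` most common hue/lightness pairs.

    `pixels` is an iterable of (r, g, b) tuples with values in 0..255.
    The bin counts are assumptions chosen for this sketch.
    """
    counts = Counter()
    total = 0
    for r, g, b in pixels:
        # colorsys returns hue, lightness, saturation, each in the range 0..1.
        h, l, _s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
        pair = (
            min(int(h * hue_bins), hue_bins - 1),
            min(int(l * light_bins), light_bins - 1),
        )
        counts[pair] += 1
        total += 1
    if total == 0:
        return 0.0
    top = sum(n for _pair, n in counts.most_common(color_width))
    return top / total


def is_artificial(pixels, threshold=0.5, color_width=200):
    """Tag an image as artificial when the dominant pairs exceed the threshold."""
    return dominant_pair_proportion(pixels, color_width) > threshold
```

A single-color image yields a proportion of 1.0 and is tagged as artificial, whereas an image whose pixels are spread over many hue/lightness pairs yields a low proportion.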
N.B.! In order to select an appropriate color width and max intensity, the user can experiment with different values by using:
ImgtoVar.functions.image_histogram(
image,
color_width = {color_width},
max_intensity = {max_intensity}
)
This method will return a useful report on the color histogram of the image analysed.
Once we have clean data, we can begin structuring our image data by using the methods included in ImgtoVar.
This section outlines the three main methods behind feature extraction. Several pre-trained models are included with the library, but the methods also allow for integration with custom models.
The facial attribute analysis was made based on the popular Python module DeepFace. ImgtoVar adds two important features.
First, the face detection function was reworked in order to return all detected faces in an image, and therefore run the analysis on each.
Second, the apparent age classifier was changed for a new custom model, as the original model included with DeepFace lacked training examples below 18 years old, and had limited examples in the higher age groups as well. This leads to poor performance on the test data. The new model classifies age in one of the following groups: child, young adult, adult, middle age, old with 72% accuracy.
Here is a potential workflow:
backends = ['opencv', 'ssd', 'dlib', 'mtcnn', 'retinaface', 'mediapipe']
#facial analysis
demography_df = ImgtoVar.face_analysis(data, actions=("emotion", "age", "gender", "race"), detector_backend = backends[4])
The method returns a DataFrame with the image file names and the predicted label for each action specified. By default, retinaface is used as a backend and all actions are predicted. To see a comparison of the different backends you can refer to this demo created by the author of DeepFace.
As written in the DeepFace documentation: "RetinaFace and MTCNN seem to overperform in detection and alignment stages but they are much slower. If the speed of your pipeline is more important, then you should use opencv or ssd. On the other hand, if you consider the accuracy, then you should use retinaface or mtcnn."
Here are additional parameters of the face_analysis function:
data: Three modes of operation: 'Single file', 'Directory of files', 'Directory of directories'
actions (tuple): The default is ('age', 'gender', 'emotion', 'race'). You can drop some of those attributes.
models (Optional[dict]): Facial attribute analysis models are built on every call of the analysis function. You can pass pre-built models to speed the function up or to use a custom model.
N.B. -- If using a custom model you need to pass it under a key == 'custom' (ex. models['custom'] = {custom_keras_model})
enforce_detection (boolean): The function throws exception if no faces were detected. Set to False by default.
detector_backend (string): set face detector backend as retinaface, mtcnn, opencv, ssd or dlib.
extract (boolean): Determines if the faces extracted from the images will be saved in a new dir
custom_model_channel (string): The channel order used to train your custom model. The default is "BGR"; set it to "RGB" if needed.
custom_target_size (tuple): The input size of the custom model, default is (224, 224)
run_on_gpu (bool): Determines whether TensorFlow should use GPU acceleration. The default is False for stability; if enabled, you could run into OOM errors depending on available GPU VRAM.
resume (bool): Determines whether a new experiment will be run or a previous one is resumed. When the data directory contains a folder for each PDF, a checkpoint is made after each folder!
return_JSON (boolean): Determines the return type, set to False by default.
The output directory structure looks as follows:
Output|
    |--face_analysis|
        |exp_1|
            |--facial_analysis.csv
            |--checkpoints # Keeps track of progress to allow for process resuming
            |--Faces| # Directory created only if extract = True
                |--pdf_name/folder_name|
                    |--extracted_faces
                    ...
                ...
The background analysis detects the context of an image, e.g. whether an image depicts an urban skyline or a forest. The classifier has 93% accuracy in identifying natural vs man-made image backgrounds.
background_df = ImgtoVar.background_analysis(data)
The method returns a DataFrame with the image file names and the predicted background. To train this classifier, a custom dataset was created.
Due to limitations of the training data, some nature examples contain a limited number of man-made structures. Therefore, this classifier cannot be used on its own to filter out images in which no man-made objects exist.
For example, if an image shows a natural landscape with a small house in the middle, that will be classified as natural.
| Label: | Natural | Natural | Man-made |
|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|
| Image: | | | |
If the researcher wants to detect images with nothing man-made in them, the background_analysis method can be used in combination with the object_detection method to identify false positive cases.
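One way to combine the two outputs is sketched below. The data shapes are assumptions for illustration: `backgrounds` maps file names to the predicted background label, `detections` is a list of (file name, object label, confidence) tuples, and `MAN_MADE_LABELS` is a hand-picked, hypothetical subset of the object labels listed in this README. The real DataFrame column names are not documented here, so adapt the input preparation accordingly.

```python
# Hypothetical selection of object labels that count as man-made.
MAN_MADE_LABELS = {
    "Building", "Office_building", "car", "truck",
    "Crane", "Wind turbine", "solar panels",
}


def natural_without_man_made(backgrounds, detections,
                             man_made=MAN_MADE_LABELS, min_conf=0.25):
    """Return files labelled "Natural" in which no man-made object was detected."""
    # Files where the object detector found a man-made object with enough confidence.
    flagged = {
        name
        for name, label, conf in detections
        if label in man_made and conf >= min_conf
    }
    return sorted(
        name
        for name, background in backgrounds.items()
        if background == "Natural" and name not in flagged
    )
```

Images labelled "Natural" that still contain a detected building or vehicle are dropped, which addresses the small-house-in-a-landscape case described above.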
These are the parameters for this method:
data: Three modes of operation: 'Single file', 'Directory of files', 'Directory of directories'
run_on_gpu (bool): Determines whether TensorFlow should use GPU acceleration. The default is False for stability; if enabled, you could run into OOM errors depending on available GPU VRAM.
resume (bool): Determines whether a new experiment will be run or a previous one is resumed. When the data directory contains a folder for each PDF, a checkpoint is made after each folder!
The output directory structure looks as follows:
Output|
    |--background_analysis|
        |exp_1|
            |--background_analysis.csv
            |--checkpoints # Keeps track of progress to allow for process resuming
            ...
For object detection, ImgtoVar uses the YoloV5 family of models.
There are 2 custom pre-trained models included with the module, as well as all the pre-trained models on the COCO dataset included with YoloV5 itself.
A number of the original parameters were wrapped to allow for increased flexibility of the model. Additionally, custom models and detection with checkpoints are supported:
data: Three modes of operation: 'Single file', 'Directory of files', 'Directory of directories'
model (string): Specifies model options. By default it is set to "custom", which allows the setting of custom weights.
model can also be set to:
* 'c_energy' -- a custom model trained to detect the following objects: ["Crane", "Wind turbine", "farm equipment", "oil pumps", "plant chimney", "solar panels"]
* 'sub_open_images' -- a custom model trained on a subset of the Google Open Images dataset. The included labels are: [ "Animal", "Tree", "Plant", "Flower", "Fruit", "Suitcase", "Motorcycle", "Helicopter", "Sports Equipment", "Office Building", "Tool", "Medical Equipment", "Mug", "Sunglasses", "Headphones", "Swimwear", "Suit", "Dress", "Shirt", "Desk", "Whiteboard", "Jeans", "Helmet", "Building"]
weights (path): Determines the weights used if model is set to 'custom'. By default, COCO pre-trained weights are used. Specify the labels parameter if you are using a custom model.
conf_thres (float): 0 < conf_thres < 1, the detection confidence cutoff
imgsz (int): The image size for detection, set to 640 by default. Best results are achieved if the image size is the same as the one used for training; for more information, check the YoloV5 documentation.
save_imgs (bool): Determines whether the images on which detection was run should be saved with the predictions overlaid. Set to False by default.
labels (list): A list with the labels used in custom prediction, must be set when using with custom model.
resume (bool): Determines whether a new experiment will be run or a previous one is resumed. When the data directory contains a folder for each PDF, a checkpoint is made after each folder!
The output of this method has the following structure:
Output|
    |--object_detection|
        |{model_name}_exp_1|
            |--object_detection.csv
            |--checkpoints # Keeps track of progress to allow for process resuming
            |--pdf_name/folder_name| # Directory created only if save_imgs = True
                |--labels|
                    |--image.txt
                    ...
                |image.jpg
                ...
            ...
        ...
The COCO dataset covers 80 classes, to which ImgtoVar adds 24 classes extracted from the OpenImages dataset and an additional 6 classes trained on a custom dataset. Finally, the module allows users to specify their own custom pre-trained weights and model architecture.
To use the COCO dataset:
coco_od_df = ImgtoVar.detect_objects(data, model="custom", weights="yolov5l.pt",labels=None)
The method returns a DataFrame with the image file name, the object detected, its position and the confidence of the prediction.
Since for most researchers the existence of an object is more important than its exact coordinates, we report mAP at 0.5 IoU, which for the yolov5l.pt model is 67.3%.
[
"person",
"bicycle",
"car",
"motorcycle",
"airplane",
"bus",
"train",
"truck",
"boat",
"traffic light",
"fire hydrant",
"stop sign",
"parking meter",
"bench",
"bird",
"cat",
"dog",
"horse",
"sheep",
"cow",
"elephant",
"bear",
"zebra",
"giraffe",
"backpack",
"umbrella",
"handbag",
"tie",
"suitcase",
"frisbee",
"skis",
"snowboard",
"sports ball",
"kite",
"baseball bat",
"baseball glove",
"skateboard",
"surfboard",
"tennis racket",
"bottle",
"wine glass",
"cup",
"fork",
"knife",
"spoon",
"bowl",
"banana",
"apple",
"sandwich",
"orange",
"broccoli",
"carrot",
"hot dog",
"pizza",
"donut",
"cake",
"chair",
"couch",
"potted plant",
"bed",
"dining table",
"toilet",
"tv",
"laptop",
"mouse",
"remote",
"keyboard",
"cell phone",
"microwave",
"oven",
"toaster",
"sink",
"refrigerator",
"book",
"clock",
"vase",
"scissors",
"teddy bear",
"hair drier",
"toothbrush",
]
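For readers unfamiliar with the "mAP at 0.5 IoU" metric reported for these models: a predicted box is counted as a correct detection when its intersection over union (IoU) with a ground-truth box of the same class is at least 0.5. The sketch below shows only that IoU check, not the full mAP computation; the function names and the (x1, y1, x2, y2) corner format are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(predicted_box, truth_box, iou_threshold=0.5):
    """A prediction matches a ground-truth box when their IoU reaches the threshold."""
    return iou(predicted_box, truth_box) >= iou_threshold
```

A threshold of 0.5 is relatively lenient about exact coordinates, which fits the point that the presence of an object matters more than its precise position.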
To use the subset of OpenImages dataset:
coco_od_df = ImgtoVar.detect_objects(data, model="sub_open_images")
The mAP at 0.5 IoU is 59%. This value is comparable with scores from the OpenImages data challenge where a much more complicated model achieved an overall score of 65% mAP at 0.5. The advantage of YoloV5 is its cutting edge speed, which allows researchers to extract variables from larger datasets.
[
"Animal",
"Tree",
"Plant",
"Flower",
"Fruit",
"Suitcase",
"Motorcycle",
"Helicopter",
"Sports_equipment",
"Office_building",
"Tool",
"Medical_equipment",
"Mug",
"Sunglasses",
"Headphones",
"Swimwear",
"Suit",
"Dress",
"Shirt",
"Desk",
"Whiteboard",
"Jeans",
"Helmet",
"Building",
]
The final dataset is a custom dataset, made in connection with the research that this module is being applied to. It includes 6 labels connected to sustainability, such as wind mills, solar panels, oil pumps etc.
coco_od_df = ImgtoVar.detect_objects(data, model="c_energy")
The mAP at 0.5 IoU is 91%.
[
"Crane",
"Wind turbine",
"farm equipment",
"oil pumps",
"plant chimney",
"solar panels",
]
Distributed under the GNU License. See LICENSE.txt for more information.
Dimitar Dimitrov
Email - dvdimitrov13@gmail.com
LinkedIn - https://www.linkedin.com/in/dimitarvalentindimitrov/
Special thanks goes to my thesis advisor, Francesco Grossetti, who has helped me develop and verify my work.