A lightweight python module and CLI for computing the hash of any directory based on its files' structure and content.
Hashing is done with Python's built-in hashlib module. The hash is computed according to the Dirhash Standard, which is designed to allow consistent and collision-resistant generation and verification of directory hashes across implementations.
From PyPI:
pip install dirhash
Or directly from source:
git clone git@github.com:andhus/dirhash-python.git
pip install dirhash-python/
Python module:
from dirhash import dirhash
dirpath = "path/to/directory"
dir_md5 = dirhash(dirpath, "md5")
pyfiles_md5 = dirhash(dirpath, "md5", match=["*.py"])
no_hidden_sha1 = dirhash(dirpath, "sha1", ignore=[".*", ".*/"])
CLI:
dirhash path/to/directory -a md5
dirhash path/to/directory -a md5 --match "*.py"
dirhash path/to/directory -a sha1 --ignore ".*" ".*/"
If you (or your application) need to verify the integrity of a set of files as well as their name and location, you might find this useful. Use-cases range from verification of your image classification dataset (before spending GPU-$$$ on training your fancy Deep Learning model) to validation of generated files in regression-testing.
There isn't really a standard way of doing this. There are plenty of recipes out there (see e.g. these SO-questions for linux and python) but I couldn't find one that is properly tested (there are some gotchas to cover!) and documented with a compelling user interface. dirhash was created with this as the goal.
checksumdir is another python module/tool with similar intent (that inspired this project) but it lacks much of the functionality offered here (most notably including file names/structure in the hash) and lacks tests.
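To illustrate why including file names and structure in the hash matters, here is a minimal standard-library sketch. It is not the Dirhash Standard and not how dirhash or checksumdir are implemented; the two helper functions are hypothetical and only contrast a content-only hash with one that also covers relative paths.

```python
import hashlib
import tempfile
from pathlib import Path


def content_only_hash(root: Path) -> str:
    """Hash file contents in sorted path order, ignoring file names."""
    digest = hashlib.md5()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.read_bytes())
    return digest.hexdigest()


def name_aware_hash(root: Path) -> str:
    """Hash each file's relative path together with its content digest."""
    digest = hashlib.md5()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        file_digest = hashlib.md5(path.read_bytes()).hexdigest()
        # The NUL and newline separators keep path and digest unambiguous.
        digest.update(f"{rel}\0{file_digest}\n".encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp, "first")
        second = Path(tmp, "second")
        first.mkdir()
        second.mkdir()
        # Same bytes, different file name.
        (first / "data.txt").write_bytes(b"hello")
        (second / "renamed.txt").write_bytes(b"hello")

        print(content_only_hash(first) == content_only_hash(second))
        print(name_aware_hash(first) == name_aware_hash(second))
```

The content-only variant cannot tell the two directories apart, while the name-aware variant changes as soon as a file is renamed or moved.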
Python's hashlib implementations of common hashing algorithms are highly optimised. dirhash mainly parses the file tree, pipes data to hashlib, and combines the output. Reasonable measures have been taken to minimize the overhead, and for common use-cases the majority of time is spent reading data from disk and executing hashlib code.
The main effort to boost performance is support for multiprocessing, where reading and hashing are parallelized over individual files.
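The idea of parallelizing over individual files can be sketched as follows. This is a simplified illustration, not dirhash's implementation: it uses a thread pool from the standard library instead of multiprocessing to stay self-contained (hashlib releases the GIL while hashing larger buffers), and the function names and the way per-file digests are combined are assumptions made for this example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def file_digest(path: Path, algorithm: str = "md5", chunk_size: int = 2**20) -> str:
    """Hash one file in chunks so large files are not loaded into memory at once."""
    hasher = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def parallel_dir_digest(root: Path, algorithm: str = "md5", workers: int = 8) -> str:
    """Hash files concurrently, then combine the results in a fixed order."""
    files = sorted(p for p in root.rglob("*") if p.is_file())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so the outcome is deterministic.
        digests = list(pool.map(lambda p: file_digest(p, algorithm), files))

    combined = hashlib.new(algorithm)
    for path, digest in zip(files, digests):
        rel = path.relative_to(root).as_posix()
        combined.update(f"{rel}\0{digest}\n".encode("utf-8"))
    return combined.hexdigest()
```

Because the per-file digests are combined in sorted path order, the result does not depend on the number of workers or on which file finishes first.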
As a reference, let's compare the performance of the dirhash CLI with the shell command:
find path/to/folder -type f -print0 | sort -z | xargs -0 md5 | md5
which is the top answer for the SO-question: Linux: compute a single hash for a given folder & contents? Results for two test cases are shown below. Both have 1 GiB of random data: in "flat_1k_1MB", split into 1k files (1 MiB each) in a flat structure, and in "nested_32k_32kB", into 32k files (32 KiB each) spread over the 256 leaf directories in a binary tree of depth 8.
Implementation | Test Case | Time (s) | Speed up |
---|---|---|---|
shell reference | flat_1k_1MB | 2.29 | -> 1.0 |
dirhash | flat_1k_1MB | 1.67 | 1.36 |
dirhash (8 workers) | flat_1k_1MB | 0.48 | 4.73 |
shell reference | nested_32k_32kB | 6.82 | -> 1.0 |
dirhash | nested_32k_32kB | 3.43 | 2.00 |
dirhash (8 workers) | nested_32k_32kB | 1.14 | 6.00 |
The benchmark was run on a MacBook Pro (2018); further details and source code are available here.
Please refer to dirhash -h, the Python source code and the Dirhash Standard.