dataintegrityfingerprint
Data Integrity Fingerprint (DIF) - A reference implementation in Python
by Oliver Lindemann & Florian Krause
This software calculates the Data Integrity Fingerprint (DIF) of multi-file datasets. It can be used via the command line, via a graphical user interface, or as a Python library for embedding in other software. In each case, the user can choose to calculate the DIF with one of several (cryptographic) algorithms, using serial (single CPU core) or parallel (multiple CPU cores) computing. In addition, a checksums file with the fingerprints of the individual files in a dataset can be created. Such a file can also serve as the basis for calculating the DIF, and it can be compared against a dataset to reveal content differences when a DIF cannot be verified.
Note: We strongly recommend using SHA-256 or one of the other cryptographic algorithms for calculating the DIF. The non-cryptographic algorithms are significantly faster, but also significantly less secure (i.e. collisions are much more likely, which breaks the uniqueness of a DIF and opens the door to potential manipulation). They might hence only be an option for very large datasets in scenarios where manipulation by a third party is not part of the threat model. The graphical user interface does not allow selecting non-cryptographic algorithms.
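To give a rough sense of why the non-cryptographic algorithms are weaker, the following sketch compares digest lengths using only Python's standard library (hashlib and zlib). It does not use this package: the helper name hex_digests is hypothetical, and the hexadecimal formatting is an illustration, not necessarily the format the package writes.

```python
import hashlib
import zlib


def hex_digests(payload: bytes) -> dict:
    """Return hexadecimal digests of `payload` for three algorithms.

    Hypothetical helper for illustration only; not part of
    dataintegrityfingerprint.
    """
    return {
        # Cryptographic hash with a 256-bit digest
        "SHA-256": hashlib.sha256(payload).hexdigest(),
        # Non-cryptographic checksums with 32-bit results
        "CRC-32": format(zlib.crc32(payload), "08x"),
        "ADLER-32": format(zlib.adler32(payload), "08x"),
    }


digests = hex_digests(b"example dataset content")
for name, value in digests.items():
    # Each hexadecimal character encodes 4 bits
    print(f"{name:9s} {len(value) * 4:3d} bits  {value}")
```

A 32-bit checksum can take only about 4.3 billion distinct values, so accidental and deliberate collisions are far easier to obtain than with a 256-bit cryptographic digest.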
The quickest way to use the application is to install it with pipx:
pipx install dataintegrityfingerprint
To also use the Python library, a regular pip installation is possible as well:
python -m pip install dataintegrityfingerprint
After successful installation, the command line interface is available as dataintegrityfingerprint:
dataintegrityfingerprint [-h] [-f] [-a ALGORITHM] [-C] [-D] [-G] [-L] [-s]
[-d CHECKSUMSFILE] [-n] [-p] [--non-cryptographic]
[PATH]
positional arguments:
PATH the path to the data directory
options:
-h, --help show this help message and exit
-f, --from-checksums-file
Calculate dif from checksums file. PATH is a checksums
file
-a ALGORITHM, --algorithm ALGORITHM
the hash algorithm to be used (default=SHA-256)
-C, --checksums print checksums only
-D, --dif-only print dif only
-G, --gui open graphical user interface
-L, --list-available-algorithms
print available algorithms
-s, --save-checksums-file
save checksums to file
-d CHECKSUMSFILE, --diff-checksums-file CHECKSUMSFILE
Calculate differences of checksums to CHECKSUMSFILE
-n, --no-multi-processing
switch of multi processing
-p, --progress show progressbar
--non-cryptographic allow non cryptographic algorithms (Not suggested,
please read documentation carefully!)
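The options listed above can also be assembled programmatically, for example when calling the tool from a script. The following is a minimal sketch that uses only flags shown in the help text (-a, -D, -p, -s); the helper build_dif_command is hypothetical and not part of the package.

```python
import shlex


def build_dif_command(path, algorithm="SHA-256", dif_only=False,
                      progress=False, save_checksums=False):
    """Assemble an argument list for the dataintegrityfingerprint CLI.

    Hypothetical helper; only flags from the help text are used.
    """
    command = ["dataintegrityfingerprint", "-a", algorithm]
    if dif_only:
        command.append("-D")   # print dif only
    if progress:
        command.append("-p")   # show progressbar
    if save_checksums:
        command.append("-s")   # save checksums to file
    command.append(path)       # positional PATH argument comes last
    return command


command = build_dif_command("/path/to/dataset", algorithm="SHA-512",
                            dif_only=True)
print(shlex.join(command))

# To actually run it (requires the package to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when the dataset path contains spaces.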
After successful installation, the graphical user interface is available as dataintegrityfingerprint-gui.
After successful installation, the Python package is available as dataintegrityfingerprint:
import dataintegrityfingerprint
A DIF can then be created in the following way:
dif = dataintegrityfingerprint.DataIntegrityFingerprint("/path/to/dataset")
print(dif) # get the DIF
print(dif.checksums) # get the list of checksums of individual files
The main functionality for usage in other code is made available via the class DataIntegrityFingerprint.
Create a DataIntegrityFingerprint object.
DataIntegrityFingerprint(data,
from_checksums_file=False,
hash_algorithm='SHA-256',
multiprocessing=True,
allow_non_cryptographic_algorithms=False)
Parameters
----------
data : str
the path to the data
from_checksums_file : bool
data argument is a checksums file
hash_algorithm : str
the hash algorithm (optional, default: SHA-256)
multiprocessing : bool
using multi CPU cores (optional, default: True)
speeds up the creation of checksums for large data files
allow_non_cryptographic_algorithms : bool
set to True only if you need non-cryptographic algorithms (see
note!)
Note
----
We do not recommend using non-cryptographic algorithms.
Non-cryptographic algorithms are, while much faster, not secure (e.g.
can be tampered with). Only use these algorithms to check for technical
file damage and in cases where security is not of critical concern.
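As an illustration of the constructor parameters documented above, the following sketch wraps the class in a small function. The wrapper compute_dif and its parameter names are hypothetical; the keyword arguments passed to DataIntegrityFingerprint are the ones listed in the signature, and converting the object with str() is assumed to yield the DIF because the earlier example prints the object to get it.

```python
def compute_dif(path, algorithm="SHA-256", use_multiprocessing=True):
    """Return the DIF of the dataset at `path` as a string.

    Hypothetical convenience wrapper around DataIntegrityFingerprint.
    """
    # Imported here so that this module can be loaded even when the
    # package is not installed.
    import dataintegrityfingerprint

    dif = dataintegrityfingerprint.DataIntegrityFingerprint(
        path,
        hash_algorithm=algorithm,
        multiprocessing=use_multiprocessing,
    )
    # Assumption: str(dif) is the DIF, as suggested by `print(dif)` above.
    return str(dif)


# Example call (requires the package and an existing dataset directory):
# print(compute_dif("/path/to/dataset", algorithm="SHA-512",
#                   use_multiprocessing=False))
```

Switching off multiprocessing can be useful for small datasets, where the cost of starting worker processes may outweigh the gain.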
The DataIntegrityFingerprint class includes a set of global variables which affect all instances.
Global variable.
Default value = '␣␣'
(i.e., two U+0020 whitespace characters)
Global variable.
Default value = ['MD5', 'SHA-1', 'SHA-224', 'SHA-256', 'SHA-384', 'SHA-512', 'SHA3-224', 'SHA3-256', 'SHA3-384', 'SHA3-512']
Global variable.
Default value = ['ADLER-32', 'CRC-32']
Once initiated, a DataIntegrityFingerprint object provides several methods and attributes.
Calculate differences of checksums to checksums file.
diff_checksums(filename)
Parameters
----------
filename : str
the name of the checksums file
Returns
-------
diff : str
the difference of checksums to the checksums file
(minus means checksums is missing something from checksums file,
plus means checksums has something in addition to checksums file)
Generate hash list to get Data Integrity Fingerprint.
generate(progress=None)
Parameters
----------
progress : function, optional
a callback function for progress reporting that takes the
following parameters:
count -- the current count
total -- the total count
status -- a string describing the status
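A progress callback matching the documented parameters (count, total, status) could look like the following minimal sketch. The function names format_progress and report_progress are hypothetical, and count and total are assumed to be numbers, which the documentation above does not state explicitly.

```python
def format_progress(count, total, status):
    """Format one progress line from the documented callback arguments."""
    # Guard against a total of zero (e.g. an empty dataset)
    percent = 100 * count / total if total else 0.0
    return f"[{count}/{total}] {percent:5.1f}% {status}"


def report_progress(count, total, status):
    """Callback with the (count, total, status) signature described above."""
    print(format_progress(count, total, status))


# Calling the callback by hand to show the line it prints
report_progress(1, 4, "hashing")

# Passing it to the library (requires the package to be installed):
# dif = dataintegrityfingerprint.DataIntegrityFingerprint("/path/to/dataset")
# dif.generate(progress=report_progress)
```

Separating the formatting from the printing keeps the callback easy to test and makes it simple to redirect the output to a log file or a GUI widget.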
Get all files to hash.
get_files(self)
Returns
-------
files : list
the list of files to hash
Save the checksums to a file.
save_checksums(filename=None)
Parameters
----------
filename : str, optional
the name of the file to save checksums to
Returns
-------
success : bool
whether saving was successful
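The documented methods save_checksums and diff_checksums can be combined into a simple store-then-verify workflow. The following is a sketch under assumptions: the function name save_and_compare is hypothetical, and it assumes the checksums are generated on demand, as the earlier example (which reads dif.checksums right after construction) suggests.

```python
def save_and_compare(dataset_path, checksums_path):
    """Save the checksums of a dataset, then diff the dataset against them.

    Hypothetical helper; returns the string produced by diff_checksums.
    """
    # Imported here so that this module can be loaded even when the
    # package is not installed.
    import dataintegrityfingerprint

    dif = dataintegrityfingerprint.DataIntegrityFingerprint(dataset_path)

    # save_checksums is documented to return whether saving succeeded
    if not dif.save_checksums(checksums_path):
        raise OSError(f"could not save checksums to {checksums_path}")

    # Per the documentation, minus marks entries missing from the current
    # checksums and plus marks entries the checksums file does not contain.
    return dif.diff_checksums(checksums_path)


# Example call (requires the package and an existing dataset directory):
# print(save_and_compare("/path/to/dataset", "/path/to/checksums-file"))
```

In practice the checksums file would be saved when the dataset is published and the comparison would be run later, on a copy whose DIF could not be verified.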
An initiated DataIntegrityFingerprint object also provides a set of read-only properties.
Read-only property.
Read-only property.
Read-only property.
Read-only property.
Read-only property.
Read-only property.
Read-only property.
Read-only property.
For any questions, please use the discussion section from the code repository. If you wish to contribute or report an issue, please use the issue tracker and pull requests.