
DumpCleaner
DumpCleaner is a tool that can randomize or anonymize your database dumps. Currently, it works with the MySQL Shell Dump format (other formats may be added later).
Even though we use DumpCleaner in our production setup, this project is still beta quality and may experience breaking changes.
The main purpose of this tool is to provide a safe way to work with production data during development. Often, production databases can easily fit on developers' machines, and when they don't, the database tools usually provide a way to dump a subset of the data (leaving the audit logs behind, for example).
We believe that working with production data has several benefits over developing against a near-empty or completely made-up data set.
That said, having an exact copy of production data on developers' machines is insecure and could lead to personal data leaks and violations of the GDPR or similar legislation. Moreover, developers usually do not (or should not) need to work with real individual data records; rather, they need a realistic-enough approximation of the data. That's where DumpCleaner's main feature, high-fidelity data anonymization/randomization, should come in handy.
All in all, DumpCleaner is just a “more specialized and configurable `awk`”, i.e. a text replacement tool.
To install the gem, add it to the application's Gemfile by executing:
$ bundle add dump_cleaner
If bundler is not being used to manage dependencies, install the gem by executing:
$ gem install dump_cleaner
The gem provides a `dump_cleaner` executable that must be called with the following arguments:
$ dump_cleaner -f <source_dump_path> -t <destination_dump_path> [-c <config_file>]
where:

- `-f` / `--from=` sets the path to the source (original, non-anonymized) data dump; for MySQL Shell this is the directory with the dump created by the MySQL Shell dump utility
- `-t` / `--to=` sets the path to the destination (anonymized) data dump; for MySQL Shell this is the directory with the dump which will be created or overwritten by DumpCleaner
- `-c` / `--config=` sets the path to the configuration file (default: `config/dump_cleaner.yml`); the configuration is documented below

The repository includes a sample MySQL Shell dump that has been produced by running the MySQL Shell dump utility against a `db` database:
MySQLShell JS> util.dumpSchemas(["db"], "mysql_shell_dump");
The dump contains a `users` table with the following sample contents:
$ zstdcat mysql_shell_dump/db@users@@0.tsv.zst
# id name email phone_number
1 Johnson johnson@gmail.com +420774678763
2 Smith amiright@example.com +420733653796
3 Williams anette.williams@example.com N/A
Now, after running DumpCleaner with the following options and a config file:
$ dump_cleaner -f mysql_shell_dump -t mysql_shell_anonymized_dump \
-c mysql_shell_dump_cleaner.yml
a destination dump directory gets created with a copy of the source dump, but with the data in the `users` table randomized, in this case in the following way:
$ zstdcat mysql_shell_anonymized_dump/db@users@@0.tsv.zst
# id name email phone_number
1 Jackson variety@gmail.com +420774443735
2 Allen contains@present.com +420733637921
3 Harrison should.visitors@program.com N/A
There are a few things to note here:

- Some values are kept intact (such as the `N/A`), as specified in the config file.
- If DumpCleaner was run once again over the same source data and using the same config, it would produce exactly the same randomized data in the output.
Read on if you are interested in more details about how DumpCleaner works, otherwise you can safely skip to the Configuration section.
DumpCleaner first reads the config file. From the configuration, it finds the tables and columns that need to be sanitized by the cleaning process. It parses the dump data for each table, extracts the fields from each record and runs the following workflows for each to-be-cleaned field:

- the `data_source` workflow, which prepares the data needed for the cleaning (a dictionary of replacement values, for example),
- the `cleaning` workflow, which converts the original field value into a sanitized one,
- the `failure` workflow, which takes over when the cleaning workflow fails (i.e. returns a `nil` value). This workflow usually replaces the field value with a random one.

After all configured table columns have been cleaned, the tool copies the remaining data from the original dump so that the destination dump is complete and ready for re-import.
The overall process is summarized in the diagram below, too:
flowchart LR
    A(start) --> AA[read\nconfig]
    AA --> B{{each\ntable}}
    B --> BB{{each\nrecord}}
    BB --> C{{each\nfield}}
    C --> D[run the\ndata source steps]
    D --> E[run the\ncleaning steps]
    E --> F{failed?}
    F -->|yes| G[run the\nfailure steps]
    G --> H
    F -->|no| H{result\nunique?}
    H -->|yes or\nirrelevant| L{more\ndata to\nclean?}
    H -->|no but wanted| E
    L -.->|yes| C
    L -.->|yes| BB
    L -.->|yes| B
    L -->|no| M[copy\nremaining\ndata]
    M --> Z(end)
A particular column in a table may be configured to require unique randomized data. In that case, the cleaning process is repeated until it produces a unique randomized value, or until the max retries limit is reached (currently 1000).
The cleaning workflow steps usually just add a numeric suffix to the randomized value, without increasing its length (and byte size). For example, if the sanitized value is `something`, its unique variant may become `somethin1` or even `somethi99`. Some cleaning steps, on the other hand, allow repeatedly taking a random value from the dictionary instead of adding a suffix.
When max retries is reached, DumpCleaner prints an error and switches to the failure workflow for further processing.
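For illustration, the suffix trick could look like this (a simplified sketch with a hypothetical `uniquify` helper, not DumpCleaner's actual code):

# Hypothetical sketch: overwrite the tail of the value with the retry counter,
# keeping the length (and, for single-byte characters, the byte size) intact.
def uniquify(value, attempt)
  suffix = attempt.to_s
  value[0, value.length - suffix.length] + suffix
end

uniquify("something", 1)  # => "somethin1"
uniquify("something", 99) # => "somethi99"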
To achieve deterministic randomness when cleaning the data, each random number generation is seeded with a value reflecting the identity of the current record (usually its primary key value) and the original value of the field. This guarantees that the same source data in the same record will always be cleaned up to the same randomized data.
There are some practical limits to this consistency, though.
NOTE: The fact that the RNG seed is also dependent on the primary key has one more potentially undesired consequence: the same original value will be cleaned to different values in records with different primary keys, thus adding an artificial variance to the data. We will look into this issue in a future release.
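To make the seeding scheme concrete, a minimal sketch might look like this (an assumption, not the gem's actual code; `deterministic_rng` is a hypothetical helper):

require "digest"

# Hypothetical sketch: derive the PRNG seed from the record identity and the
# field's original value. Because the record id is part of the seed, the same
# original value yields different outputs in different records (the variance
# mentioned in the NOTE above).
def deterministic_rng(record_id, original_value)
  seed = Digest::SHA256.hexdigest("#{record_id}:#{original_value}").to_i(16)
  Random.new(seed)
end

deterministic_rng(1, "Johnson").rand(100) # same result on every run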
Configuration

A basic DumpCleaner configuration file might look like this:
dump_cleaner:
  log_level: info

dump:
  format: mysql_shell

cleanup_tables:
  - db: db
    table: users
    columns:
      - name: name
        cleanup_type: last_name
      - name: email
        cleanup_type: email
        unique: true
      - name: phone_number
        cleanup_type: phone_number_intl
    record_context_columns:
      - id
      - admin_flag
    keep_same_record_conditions:
      - column: admin_flag
        condition: non_zero
    # id_column: user_id (not useful in this table)

cleanup_types:
  last_name:
    data_source:
      - step: LoadYamlFile
        params:
          file: spec/support/data/dict/last_names.yml
      - step: GroupByBytesize
    cleaning:
      - step: SelectDataByBytesize
      - step: TakeSample
    failure:
      - step: FillUpWithString

  email:
    data_source:
      ...
    cleaning:
      - step: RandomizeEmail
    failure:
      - step: FillUpWithString

  phone_number_intl:
    cleaning:
      - step: RandomizeFormattedNumber
        params:
          # +420123456789
          format: (?<front>\+(?:\d{6}))(?<x>\d{6})
    failure:
      - step: FillUpWithString
    keep_same_conditions:
      - condition: eq
        value: "N/A"
The individual config options are as follows:
dump_cleaner
This allows setting the log level using the `log_level` property. The DumpCleaner log output is printed to `STDOUT` and the default log level is `INFO`.
dump
This setting currently only defines the format of the data dump using the `format` property. The only recognized format now is `mysql_shell`.
cleanup_tables
This is where things get more interesting. The `cleanup_tables` key specifies which tables (via the `db` and `table` properties) and their columns (via the `name` property nested inside the `columns` array) should be cleaned, and what `cleanup_type` each column is, i.e. which variant of the cleanup process will be used for it. Optionally, a column may also include a `unique` property: when set to `true`, the randomized values in this column will be guaranteed to be unique across the table.
Optionally, an `id_column` key may be given that determines the foreign key responsible for the identity of the table records (see the Randomization section above). For example, a table that `belongs_to` the `users` table might have the `id_column` set to `user_id`, and this would ensure that the values in this table are randomized the same way as the corresponding values in the `users` table, keeping consistency across the associated tables. This property defaults to `"id"`.
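For example, a hypothetical `orders` table associated with `users` could be configured like this (the table and column names are illustrative only):

cleanup_tables:
  - db: db
    table: orders
    id_column: user_id # randomize consistently with the owning user record
    columns:
      - name: contact_email
        cleanup_type: email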
Optionally, the `keep_same_record_conditions` key may also hold conditions for skipping the cleanup of a record from the table. When the conditions evaluate to a truthy value for the record, none of its fields gets cleaned. This is useful if you want to keep some records (say, admin users) in their original state.
The optional `record_context_columns` property may define a list of columns whose source values should be available to the cleaning workflows. This is currently used when evaluating the `keep_same_record_conditions`. (This could probably be refactored out as it unnecessarily duplicates the configuration a bit.)
cleanup_types
The core of the sanitization process lies here. Under this key, the relevant steps for the `data_source`, `cleaning` or `failure` workflows are specified, each with optional `params`. In general, the output of one step makes the input of the following step. It is considered an error if a `cleaning` step returns a `nil` value, and that's when the processing switches to the `failure` workflow.
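Conceptually, each workflow behaves like a simple pipeline, something along these lines (a simplified sketch, not the gem's internals):

# Hypothetical sketch: each step receives the previous step's output; a nil
# result from the cleaning pipeline switches processing to the failure pipeline.
def run_steps(steps, value)
  steps.reduce(value) { |current, step| step.call(current) }
end

cleaning = [->(v) { v.strip }, ->(v) { v unless v.empty? }]
failure  = [->(_) { "anonymized" }]

result = run_steps(cleaning, "  Johnson ")
result = run_steps(failure, result) if result.nil?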
See the Workflow steps page for the individual steps documentation.
Optionally, under the `keep_same_conditions` property, conditions for skipping the cleanup of the given cleanup type may be given. If they evaluate to true for the currently processed field value, its cleanup is skipped and the original value is returned.
Finally, the optional `ignore_keep_same_record_conditions` property may be set to `true` to indicate that the current field value should always be cleaned, even if the `keep_same_record_conditions` were used for the whole record at the `cleanup_tables` level.
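For instance (an illustrative snippet; the `email` cleanup type comes from the sample config above):

cleanup_types:
  email:
    ignore_keep_same_record_conditions: true # clean e-mails even for kept records
    cleaning:
      - step: RandomizeEmail
    failure:
      - step: FillUpWithString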
keep_same_conditions
The `keep_same_conditions` and `keep_same_record_conditions` properties may define a list of conditions that will prevent cleaning up the given field or record. Each condition is a hash that consists of the following properties:

- `column` - defines the column in the table that the condition should take the field's value from (this is useful only when using the conditions under the `cleanup_tables` configuration key, i.e. in `keep_same_record_conditions`, whereas in the `cleanup_types` context the column is implied),
- `condition` - specifies the operator or function that the condition should evaluate; currently supported values here are:
  - `eq` - tests with `==`,
  - `ne` - tests with `!=`,
  - `start_with` - tests strings with the `start_with?` method,
  - `end_with` - tests strings with the `end_with?` method,
  - `non_zero` - converts the value to an integer and tests if it is different from zero,
- `value` - the value to evaluate the condition against (some operators, such as `non_zero`, may not use a value).

If multiple conditions are specified, they are logically OR-ed, i.e. if any of the conditions yields true, the whole statement yields true and the record or field cleaning is skipped.
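A compact sketch of how these conditions could be evaluated (assumed semantics based on the description above, not the gem's actual code):

# Hypothetical sketch: map each operator to a predicate and OR the results.
PREDICATES = {
  "eq"         => ->(field, value) { field == value },
  "ne"         => ->(field, value) { field != value },
  "start_with" => ->(field, value) { field.start_with?(value) },
  "end_with"   => ->(field, value) { field.end_with?(value) },
  "non_zero"   => ->(field, _)     { field.to_i != 0 },
}.freeze

def keep_same?(conditions, field_value)
  conditions.any? do |cond|
    PREDICATES.fetch(cond["condition"]).call(field_value, cond["value"])
  end
end

keep_same?([{ "condition" => "eq", "value" => "N/A" }], "N/A") # => true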
A few ideas for future development:

- The `RandomizeFormattedNumber` step could be generalized to `RandomizeFormattedString`, allowing it to replace any matching part of the string not only with numbers but with alphanumeric characters etc. as well. The `RandomizeEmail` step could then be rewritten using this new step.

After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run `bundle exec rake install`.
Bug reports and pull requests are welcome on GitHub.
The gem is available as open source under the terms of the MIT License.