The fuzzydata Workflow Generator
The fuzzydata workflow generator enables:
- Abstract specification of Dataframe-based Workflows
- Generation of randomized tables and workflows
- Loading and replay of workflows on multiple clients
fuzzydata currently runs with the following clients: pandas, modin, and sql (the values accepted by --wf_client).
fuzzydata is designed to be extensible, so you may implement your own client.
Please see the existing clients in fuzzydata/clients for ways to extend the abstract Artifact, Operation, and Workflow classes for your client.
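To give a feel for what extending abstract classes for a client involves, here is a minimal, self-contained sketch. It does not use the fuzzydata package: the class names mirror the ones mentioned above, but every method name and signature here (generate, to_rows, apply, run) is a hypothetical stand-in, so consult the clients in fuzzydata/clients for the real interface.

```python
from abc import ABC, abstractmethod


class Artifact(ABC):
    """Hypothetical stand-in for an abstract artifact: one labeled table."""

    def __init__(self, label):
        self.label = label
        self.table = None

    @abstractmethod
    def generate(self, num_rows, columns):
        """Fill the artifact with a table of the requested shape."""

    @abstractmethod
    def to_rows(self):
        """Return the table as a list of dicts."""


class Operation(ABC):
    """Hypothetical stand-in for an abstract operation over artifacts."""

    @abstractmethod
    def apply(self, source, new_label):
        """Return a new artifact derived from `source`."""


class ListArtifact(Artifact):
    """Toy client artifact that stores rows as plain Python dicts."""

    def generate(self, num_rows, columns):
        # Deterministic placeholder data; a real client would randomize it.
        self.table = [{c: i for c in columns} for i in range(num_rows)]

    def to_rows(self):
        return list(self.table)


class SampleEveryOther(Operation):
    """Toy operation: keep every second row of the source artifact."""

    def apply(self, source, new_label):
        result = ListArtifact(new_label)
        result.table = source.to_rows()[::2]
        return result


class Workflow:
    """Toy workflow: tracks artifacts and records the operations applied."""

    def __init__(self, name):
        self.name = name
        self.artifacts = {}
        self.history = []

    def add(self, artifact):
        self.artifacts[artifact.label] = artifact

    def run(self, operation, source_label, new_label):
        derived = operation.apply(self.artifacts[source_label], new_label)
        self.add(derived)
        self.history.append((type(operation).__name__, source_label, new_label))
        return derived
```

In this sketch a new client only has to supply concrete Artifact and Operation subclasses; the workflow bookkeeping stays generic. Whether fuzzydata divides responsibilities the same way is an assumption.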
Installation
Install using pip:
pip install fuzzydata
fuzzydata does not install modin or SQLAlchemy by default, but either can be requested as an install option (choose one of modin, sql, or all):
pip install fuzzydata[modin|sql|all]
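Because modin and SQLAlchemy are optional, it can help to check which of them are importable before picking a client. The sketch below uses only the standard library; the mapping from install option to module name (sql to sqlalchemy) is an assumption based on the text above, and the helper name available_extras is hypothetical.

```python
from importlib.util import find_spec

# Install option -> module it is assumed to make importable.
EXTRA_MODULES = {"modin": "modin", "sql": "sqlalchemy"}


def available_extras(extra_modules=EXTRA_MODULES):
    """Map each install option to True if its module can be found."""
    return {
        extra: find_spec(module) is not None
        for extra, module in extra_modules.items()
    }


if __name__ == "__main__":
    for extra, present in available_extras().items():
        print(f"{extra}: {'installed' if present else 'missing'}")
```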
Usage
Some examples of fuzzydata usage are in the examples directory. You can also run the fuzzydata command to get a list of the command-line options it supports:
$ fuzzydata --help
usage: fuzzydata [-h] [--wf_client WF_CLIENT] [--output_dir OUTPUT_DIR] [--wf_name WF_NAME]
[--columns COLUMNS] [--rows ROWS] [--versions VERSIONS] [--bfactor BFACTOR]
[--matfreq MATFREQ] [--npp NPP] [--log LOG] [--replay_dir REPLAY_DIR]
[--wf_options WF_OPTIONS] [--exclude_ops EXCLUDE_OPS] [--scale_artifact SCALE_ARTIFACT]
optional arguments:
-h, --help show this help message and exit
--wf_client WF_CLIENT
Workflow Client to be used (Default pandas). Available Workflows: pandas|modin|sql
--output_dir OUTPUT_DIR
Location of Output datasets to be stored
--wf_name WF_NAME prefix for each workflow to be generated dir to be the path prefix for these files.
--columns COLUMNS Number of columns in the base version
--rows ROWS Number of rows in the base version
--versions VERSIONS Number of artifact versions to generate
--bfactor BFACTOR Workflow Branching factor, 0.1 is linear, 100 is star-like
--matfreq MATFREQ Materialization frequency, i.e. how many operations before writing out an artifact
--log LOG Set Logging Level
--replay_dir REPLAY_DIR
Replay existing workflow in directory
--wf_options WF_OPTIONS
JSON-encoded workflow engine options like sql_string or modin_engine
--exclude_ops EXCLUDE_OPS
JSON-encoded list of ops to exclude e.g. ["pivot"]
--scale_artifact SCALE_ARTIFACT
JSON-encoded dict of {artifact_label: new_size} to be scaled up e.g. {"artifact_0": 1000000}
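Several of the options above take JSON-encoded values, which are easy to get wrong when quoting by hand in a shell. The following sketch builds the argument list in Python with json.dumps and prints a shell-safe command line; it does not run fuzzydata. The flag names come from the help text above, while the helper names, the default sizes, and the idea of combining --replay_dir with --wf_client for replay are assumptions for illustration.

```python
import json
import shlex


def build_generate_command(output_dir, wf_name, client="pandas",
                           columns=20, rows=1000, versions=10,
                           wf_options=None, exclude_ops=None,
                           scale_artifact=None):
    """Build an argv list for generating a workflow (default sizes are arbitrary)."""
    cmd = [
        "fuzzydata",
        "--wf_client", client,
        "--output_dir", output_dir,
        "--wf_name", wf_name,
        "--columns", str(columns),
        "--rows", str(rows),
        "--versions", str(versions),
    ]
    # JSON-encode the structured options so quoting is always valid.
    json_flags = (
        ("--wf_options", wf_options),
        ("--exclude_ops", exclude_ops),
        ("--scale_artifact", scale_artifact),
    )
    for flag, value in json_flags:
        if value is not None:
            cmd += [flag, json.dumps(value)]
    return cmd


def build_replay_command(replay_dir, client):
    """Build an argv list for replaying a stored workflow (assumed flag combination)."""
    return ["fuzzydata", "--replay_dir", replay_dir, "--wf_client", client]


if __name__ == "__main__":
    generate = build_generate_command(
        "out", "demo",
        exclude_ops=["pivot"],
        scale_artifact={"artifact_0": 1000000},
    )
    print(shlex.join(generate))
    print(shlex.join(build_replay_command("out/demo", "modin")))
```

Passing the resulting list to subprocess.run avoids shell quoting entirely; shlex.join is only used here to show a copy-pasteable form.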
Documentation
Our paper is available at https://doi.org/10.1145/3531348.3532178.
If you use fuzzydata in your research, please consider citing our paper:
@inproceedings{10.1145/3531348.3532178,
author = {Rehman, Mohammed Suhail and Elmore, Aaron},
title = {FuzzyData: A Scalable Workload Generator for Testing Dataframe Workflow Systems},
year = {2022},
isbn = {9781450393539},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3531348.3532178},
doi = {10.1145/3531348.3532178},
booktitle = {Proceedings of the 2022 Workshop on 9th International Workshop of Testing Database Systems},
pages = {17–24},
numpages = {8},
location = {Philadelphia, PA, USA},
series = {DBTest '22}
}
License
MIT License
Contributing to fuzzydata
Check out the current roadmap in docs/roadmap.md. You are always welcome to develop a new client for fuzzydata.
Contact
Suhail Rehman / ChiData Group @ UChicago CS