The Synthetic Data Gym (SDGym) is a benchmarking framework for modeling and generating synthetic data. Measure performance and memory usage across different synthetic data modeling techniques – classical statistics, deep learning and more!
The SDGym library integrates with the Synthetic Data Vault ecosystem. You can use any of its synthesizers, datasets or metrics for benchmarking. You can also customize the process to include your own work.
Install SDGym using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.
pip install sdgym
conda install -c pytorch -c conda-forge sdgym
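After installing, it can help to confirm which version landed in the active environment. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical and is not part of SDGym.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("sdgym")
    if version is None:
        print("sdgym is not installed in this environment")
    else:
        print(f"sdgym {version} is installed")
```

Running this inside the virtual environment you installed into shows whether that environment, rather than a system-wide one, holds the package.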
For more information about using SDGym, visit the SDGym Documentation.
Let's benchmark synthetic data generation for single tables. First, define which modeling techniques to use: a few synthesizers from the SDV library, plus a few others as baselines.
# these synthesizers come from the SDV library
# each one uses different modeling techniques
sdv_synthesizers = ['GaussianCopulaSynthesizer', 'CTGANSynthesizer']
# these basic synthesizers are available in SDGym
# as baselines
baseline_synthesizers = ['UniformSynthesizer']
Now, we can benchmark the different techniques:
import sdgym
sdgym.benchmark_single_table(
    synthesizers=(sdv_synthesizers + baseline_synthesizers)
)
The result is a detailed performance, memory and quality evaluation across the synthesizers on a variety of publicly available datasets.
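One way to read such an evaluation is to average a metric per synthesizer across datasets. The sketch below assumes the results can be viewed as a list of row dictionaries (for example via `to_dict('records')`, if the benchmark returns a pandas DataFrame). The column names `Synthesizer`, `Dataset` and `Quality_Score` are assumptions, the helper `average_by_synthesizer` is hypothetical, and the numbers are placeholders rather than real benchmark output.

```python
from collections import defaultdict
from statistics import mean


def average_by_synthesizer(rows, metric):
    """Average one metric per synthesizer across all benchmarked datasets."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Synthesizer"]].append(row[metric])
    return {name: mean(values) for name, values in grouped.items()}


# Placeholder rows for illustration only; these are NOT real benchmark results.
# The column names ("Synthesizer", "Dataset", "Quality_Score") are assumptions.
example_rows = [
    {"Synthesizer": "GaussianCopulaSynthesizer", "Dataset": "adult", "Quality_Score": 0.5},
    {"Synthesizer": "GaussianCopulaSynthesizer", "Dataset": "asia", "Quality_Score": 1.0},
    {"Synthesizer": "UniformSynthesizer", "Dataset": "adult", "Quality_Score": 0.25},
    {"Synthesizer": "UniformSynthesizer", "Dataset": "asia", "Quality_Score": 0.75},
]

print(average_by_synthesizer(example_rows, "Quality_Score"))
```

The same helper works for any numeric column, so timing or memory columns could be summarized the same way once their actual names are checked against the SDGym Documentation.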
Benchmark your own synthetic data generation techniques. Define your synthesizer by specifying the training logic (using machine learning) and the sampling logic.
def my_training_logic(data, metadata):
    # create an object to represent your synthesizer
    # train it using the data
    return synthesizer

def my_sampling_logic(trained_synthesizer, num_rows):
    # use the trained synthesizer to create
    # num_rows of synthetic data
    return synthetic_data
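To make the two-function shape concrete, here is a deliberately trivial sketch that fills in both functions. It assumes `data` arrives as a pandas DataFrame and ignores `metadata`. The "model" simply memorizes the real rows and resamples them with replacement, so it illustrates the interface only and offers no privacy protection; it is not how any SDV synthesizer works.

```python
def my_training_logic(data, metadata):
    """'Train' by memorizing a copy of the real rows (no model is fitted)."""
    # Assumption: data is a pandas DataFrame; metadata is unused in this sketch.
    return data.copy()


def my_sampling_logic(trained_synthesizer, num_rows):
    """Draw num_rows rows, with replacement, from the memorized table."""
    sampled = trained_synthesizer.sample(n=num_rows, replace=True)
    # Reset the index so the output does not carry the original row labels.
    return sampled.reset_index(drop=True)
```

A real synthesizer would replace the memorized copy with a fitted model object and generate new rows from it in the sampling step.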
Learn more in the Custom Synthesizers Guide.
The SDGym library includes many publicly available datasets that you can include right away.
List these using the get_available_datasets feature.
sdgym.get_available_datasets()
dataset_name   size_MB    num_tables
KRK_v1         0.072128   1
adult          3.907448   1
alarm          4.520128   1
asia           1.280128   1
...
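A listing like this can be used to pick a quick subset for a first run. The sketch below copies the four rows shown above into plain tuples and selects single-table datasets under a size limit; the helper `names_under` and the 2 MB threshold are illustrative choices, not part of SDGym.

```python
# Rows copied from the listing above: (dataset_name, size_MB, num_tables).
available_datasets = [
    ("KRK_v1", 0.072128, 1),
    ("adult", 3.907448, 1),
    ("alarm", 4.520128, 1),
    ("asia", 1.280128, 1),
]


def names_under(datasets, max_size_mb):
    """Return names of single-table datasets up to max_size_mb, smallest first."""
    small = [d for d in datasets if d[1] <= max_size_mb and d[2] == 1]
    return [name for name, _size, _tables in sorted(small, key=lambda d: d[1])]


print(names_under(available_datasets, max_size_mb=2.0))  # ['KRK_v1', 'asia']
```

Starting with the smallest datasets keeps a first benchmark run short before scaling up to the full collection.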
You can also include any custom, private datasets that are stored on your computer or in an Amazon S3 bucket.
my_datasets_folder = 's3://my-datasets-bucket'
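Before handing a folder string to a benchmark, it can be worth checking whether it points at S3 or at a local path. This is a small sketch using only the standard library; `describe_datasets_folder` is a hypothetical helper, not an SDGym function, and it only inspects the string without contacting S3.

```python
from urllib.parse import urlparse


def describe_datasets_folder(folder):
    """Classify a datasets folder string as an S3 location or a local path."""
    parsed = urlparse(folder)
    if parsed.scheme == "s3":
        return {
            "kind": "s3",
            "bucket": parsed.netloc,
            "prefix": parsed.path.lstrip("/"),
        }
    return {"kind": "local", "path": folder}


print(describe_datasets_folder("s3://my-datasets-bucket"))
```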
For more information, see the docs for Customized Datasets.
Visit the SDGym Documentation to learn more!
The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data.
Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.