SDQCPy is a comprehensive Python package designed for synthetic data management, quality control, and validation.
SDQCPy: A Comprehensive Python Package for Synthetic Data Management
SDQCPy offers a comprehensive toolkit for synthetic data generation, quality assessment, and analysis.
You can install SDQCPy using pip:
pip install sdqcpy
Alternatively, you can install it from the source:
git clone https://github.com/T0217/sdqcpy.git
cd sdqcpy
pip install -e .
SDQCPy provides a SequentialAnalysis class that performs the sequential analysis and stores the results in an HTML file.
The following code shows how to use it:
from sdqc_integration import SequentialAnalysis
from sdqc_data import read_data
import logging
import warnings
# Ignore warnings and set logging level to ERROR
warnings.filterwarnings('ignore')
logger = logging.getLogger()
logger.setLevel(logging.ERROR)
# Set random seed
random_seed = 17
# Replace with your own data path and use pandas to read the data
raw_data = read_data('3_raw')
synthetic_data = read_data('3_synth')
output_path = 'raw_synth.html'
# Perform sequential analysis
sequential = SequentialAnalysis(
    raw_data=raw_data,
    synthetic_data=synthetic_data,
    random_seed=random_seed,
    use_cols=None,
)
results = sequential.run()
sequential.visualize_html(output_path)
SDQCPy supports several synthesis methods, implemented with ydata-synthetic and SDV.
[!TIP]
Only minimal examples are shown here; the parameters of each model can be adjusted as needed.
YData Synthesizer
import pandas as pd
from sdqc_synthesize import YDataSynthesizer
raw_data = pd.read_csv("raw_data.csv") # Please replace with your own data path
ydata_synth = YDataSynthesizer(data=raw_data)
synthetic_data = ydata_synth.generate()
[!IMPORTANT]
In its latest version, ydata-synthetic has switched to using ydata-sdk. However, since synthetic data generation is only a supplementary feature of this library, SDQCPy has not been updated to match yet.
SDV Synthesizer
import pandas as pd
from sdqc_synthesize import SDVSynthesizer
raw_data = pd.read_csv("raw_data.csv") # Please replace with your own data path
sdv_synth = SDVSynthesizer(data=raw_data)
synthetic_data = sdv_synth.generate()
SDQCPy uses the process shown below to perform the quality check and analysis:
---
title: Main Idea
---
flowchart TB
%% Define the style
classDef default stroke:#000,fill:none
%% Define the nodes
initial([Input Real Data and Synthetic Data])
step1[Statistical Test]
step2[Classification]
step3[Explainability]
step4[Causal Analysis]
endprocess[Export HTML file]
%% Define the relationships between nodes
initial --> step1
step1 --> step2
step2 --> step3
step3 --> step4
step4 --> endprocess
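The flow in the diagram above can be sketched as an ordered list of steps whose results are collected and then written out as HTML. This is only an illustration of the control flow, not SDQCPy's implementation: `run_pipeline`, `to_html`, and the placeholder steps are hypothetical names, and each real step would run its own analysis instead of returning a summary string.

```python
from html import escape
from typing import Callable, Dict, List, Tuple

# Hypothetical step signature: takes (real, synthetic) and returns a text summary.
Step = Callable[[list, list], str]


def run_pipeline(real: list, synthetic: list,
                 steps: List[Tuple[str, Step]]) -> Dict[str, str]:
    """Run each named step in order and collect its result."""
    results: Dict[str, str] = {}
    for name, step in steps:
        results[name] = step(real, synthetic)
    return results


def to_html(results: Dict[str, str]) -> str:
    """Render the collected results as a minimal HTML table."""
    rows = "\n".join(
        f"<tr><td>{escape(name)}</td><td>{escape(summary)}</td></tr>"
        for name, summary in results.items()
    )
    return f"<html><body><table>\n{rows}\n</table></body></html>"


def placeholder(label: str) -> Step:
    """Build a stand-in step that only reports the sizes of its inputs."""
    def step(real: list, synthetic: list) -> str:
        return f"{label}: {len(real)} real rows, {len(synthetic)} synthetic rows"
    return step


# Step names follow the flowchart; the step bodies are placeholders.
STEP_NAMES = ("Statistical Test", "Classification",
              "Explainability", "Causal Analysis")
STEPS = [(name, placeholder(name)) for name in STEP_NAMES]

results = run_pipeline([1, 2, 3], [1, 2, 4], STEPS)
report = to_html(results)
```

Keeping the steps in a list makes the order explicit and lets a caller drop or replace a single step without touching the export code.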
Statistical Test: SDQCPy employs various methods for descriptive analysis, distribution comparison, and correlation testing, tailored to different data types.
Classification: SDQCPy employs machine learning models (SVC, RandomForestClassifier, XGBClassifier, LGBMClassifier) to evaluate the similarity between the real and synthetic data.
Explainability: SDQCPy employs several mainstream explainability methods (Model-Based, SHAP, PFI) to evaluate the explainability of the synthetic data.
Causal Analysis: SDQCPy employs several causal structure learning methods and evaluation metrics to compare the adjacency matrices of the raw and synthetic data. These methods are implemented with gCastle.
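As an illustration of the distribution comparison mentioned for the statistical test step, the sketch below computes a two-sample Kolmogorov-Smirnov statistic for one numeric column. The README does not say which tests SDQCPy actually runs, so the choice of this statistic and the function name `ks_statistic` are assumptions made for the example.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(real: Sequence[float], synthetic: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(real), sorted(synthetic)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)  # share of real values <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of synthetic values <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical columns: the synthetic one is shifted upwards by two.
real_column = [1, 2, 3, 4]
synthetic_column = [3, 4, 5, 6]
statistic = ks_statistic(real_column, synthetic_column)
```

A value near 0 means the two columns have similar distributions, and a value of 1 means they do not overlap at all. The sketch returns only the statistic and does not compute a p-value.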
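For the causal analysis step, comparing two adjacency matrices can be illustrated with a structural Hamming distance, which counts how many edges are missing, extra, or reversed between two graphs. This is a plain-Python sketch of one commonly used metric, not the gCastle code that SDQCPy relies on, and the README does not confirm which metrics are reported; `structural_hamming_distance` and the example matrices are hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def structural_hamming_distance(adj_a: Matrix, adj_b: Matrix) -> int:
    """Count node pairs whose edge differs between two directed graphs.

    A missing, extra, or reversed edge each counts once. Self-loops on the
    diagonal are ignored.
    """
    n = len(adj_a)
    if len(adj_b) != n or any(len(row) != n for row in adj_a + adj_b):
        raise ValueError("both matrices must be square and of the same size")
    distance = 0
    for i in range(n):
        for j in range(i + 1, n):
            edge_a = (bool(adj_a[i][j]), bool(adj_a[j][i]))
            edge_b = (bool(adj_b[i][j]), bool(adj_b[j][i]))
            if edge_a != edge_b:
                distance += 1
    return distance


# Hypothetical graphs learned from raw and synthetic data (entry [i][j] = edge i -> j).
raw_graph = [[0, 1, 0],
             [0, 0, 1],
             [0, 0, 0]]        # edges: 0 -> 1, 1 -> 2
synthetic_graph = [[0, 0, 0],
                   [1, 0, 0],
                   [0, 0, 0]]  # edges: 1 -> 0 (reversed), 1 -> 2 is missing
shd = structural_hamming_distance(raw_graph, synthetic_graph)
```

A distance of 0 means both datasets produced the same causal structure; larger values mean the synthetic data preserves fewer of the relationships found in the raw data.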
To streamline the process of calling individual modules one by one, SequentialAnalysis integrates all of these functions. If you have specific needs, you can also use the individual functions on their own.
Need help? Want to report a bug? Have ideas for collaboration? Reach out via GitHub Issues.
[!IMPORTANT]
Before reporting an issue on GitHub, please check the existing Issues to avoid duplicates.
If you wish to contribute to this library, please first open an Issue to discuss your proposed changes. Once discussed, you are welcome to submit a Pull Request.