
data-designer
Generate high-quality synthetic datasets from scratch or using your own seed data.
Data Designer helps you create synthetic datasets that go beyond simple LLM prompting. Whether you need diverse statistical distributions, meaningful correlations between fields, or validated high-quality outputs, Data Designer provides a flexible framework for building production-grade synthetic data.
pip install data-designer
Or install from source:
git clone https://github.com/NVIDIA-NeMo/DataDesigner.git
cd DataDesigner
make install
Start with one of our default model providers:
Grab your API key(s) using the above links and set one or more of the following environment variables:
export NVIDIA_API_KEY="your-api-key-here"
export OPENAI_API_KEY="your-openai-api-key-here"
export OPENROUTER_API_KEY="your-openrouter-api-key-here"
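Before running the example, it can help to confirm which of these keys are actually visible to Python. The following is a minimal sketch using only the standard library; `configured_keys` and `PROVIDER_KEYS` are hypothetical names for illustration and are not part of data-designer.

```python
import os

# Environment variable names taken from the export lines above.
PROVIDER_KEYS = ("NVIDIA_API_KEY", "OPENAI_API_KEY", "OPENROUTER_API_KEY")


def configured_keys(env=None):
    """Return the provider key names that are set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name in PROVIDER_KEYS if env.get(name, "").strip()]


if __name__ == "__main__":
    found = configured_keys()
    if found:
        print("Keys found:", ", ".join(found))
    else:
        print("No provider API key is set; export at least one before continuing.")
```

Passing a plain dictionary instead of the real environment makes the check easy to exercise without touching your shell.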
import data_designer.config as dd
from data_designer.interface import DataDesigner

# Initialize with default settings
data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

# Add a product category
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"],
        ),
    )
)

# Generate personalized customer reviews
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="Write a brief product review for a {{ product_category }} item you recently purchased.",
    )
)

# Preview your dataset
preview = data_designer.preview(config_builder=config_builder)
preview.display_sample_record()
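The review prompt refers to the sampled column by name inside double braces. The sketch below imitates that flow with the standard library only: draw a category, then fill the placeholder. It is a stand-in for illustration; `sample_category` and `render_prompt` are hypothetical names, uniform sampling is an assumption, and Data Designer's own sampling and prompt rendering may behave differently.

```python
import random
import re

CATEGORIES = ["Electronics", "Clothing", "Home & Kitchen", "Books"]
PROMPT = "Write a brief product review for a {{ product_category }} item you recently purchased."


def sample_category(rng):
    """Pick one category uniformly at random (stand-in for the category sampler)."""
    return rng.choice(CATEGORIES)


def render_prompt(template, record):
    """Replace each {{ name }} placeholder with the matching value from record."""
    def substitute(match):
        return str(record[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs pick the same category
    record = {"product_category": sample_category(rng)}
    print(render_prompt(PROMPT, record))
```

Each generated record would carry its own category value, so the text sent to the model changes from row to row even though the prompt template stays the same.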
data-designer config providers # Configure model providers
data-designer config models # Set up your model configurations
data-designer config list # View current settings
Data Designer collects telemetry to help us improve the library for developers. We collect:
No user or device information is collected. This data is not used to track any individual user behavior. It is used to see an aggregation of which models are the most popular for SDG. We will share this usage data with the community.
Specifically, the model name defined in a ModelConfig object is what is collected. In the example config below:
ModelConfig(
    alias="nv-reasoning",
    model="openai/gpt-oss-20b",
    provider="nvidia",
    inference_parameters=ChatCompletionInferenceParams(
        temperature=0.3,
        top_p=0.9,
        max_tokens=4096,
    ),
)
The value openai/gpt-oss-20b would be collected.
To disable telemetry capture, set NEMO_TELEMETRY_ENABLED=false.
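If you prefer to set this from Python rather than the shell, one option is sketched below. It assumes the library reads the variable at import time or when a job starts, which this document does not confirm, so the flag is set before any data_designer import. `telemetry_disabled` is a hypothetical helper, and its case-insensitive comparison is a choice made for this sketch, not documented library behavior.

```python
import os

# Set the flag before importing data_designer so it is in place whenever the library reads it.
os.environ["NEMO_TELEMETRY_ENABLED"] = "false"


def telemetry_disabled(env=None):
    """Return True when NEMO_TELEMETRY_ENABLED is set to 'false' (case-insensitive)."""
    env = os.environ if env is None else env
    return env.get("NEMO_TELEMETRY_ENABLED", "").strip().lower() == "false"


if __name__ == "__main__":
    print("Telemetry disabled:", telemetry_disabled())
```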
This chart represents the breakdown of models used for Data Designer across all synthetic data generation jobs from 1/5/2026 to 2/5/2026.

Last updated on 2/05/2026
Apache License 2.0 – see LICENSE for details.
If you use NeMo Data Designer in your research, please cite it using the following BibTeX entry:
@misc{nemo-data-designer,
  author = {The NeMo Data Designer Team, NVIDIA},
  title = {NeMo Data Designer: A framework for generating synthetic data from scratch or based on your own seed data},
  howpublished = {\url{https://github.com/NVIDIA-NeMo/DataDesigner}},
  year = {2025},
  note = {GitHub Repository},
}