The SDG Framework is a modular, scalable, and efficient solution for creating synthetic data generation workflows in a “no-code” manner. At its core, this framework is designed to simplify data creation for LLMs, allowing users to chain computational units and build powerful pipelines for generating data and processing tasks.
The framework is built around the following principles:
At the heart of the framework is the Block. Each block is a self-contained computational unit that performs specific tasks, such as:
Blocks are designed to be:
These blocks are implemented in the src/instructlab/sdg/blocks directory.
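One way to picture a block is as a small object with a single method that takes rows of data and returns new rows. The sketch below is a simplified illustration, not the framework's code: the class names `AddLengthBlock` and `MinLengthBlock`, the `generate` method name, and the `run_chain` helper are all assumptions made for this example, and it uses plain lists of dicts rather than the framework's dataset type.

```python
class AddLengthBlock:
    """Hypothetical block: adds a 'length' column computed from 'question'."""

    def generate(self, rows):
        return [{**row, "length": len(row["question"])} for row in rows]


class MinLengthBlock:
    """Hypothetical block: keeps rows whose 'length' is at least `minimum`."""

    def __init__(self, minimum):
        self.minimum = minimum

    def generate(self, rows):
        return [row for row in rows if row["length"] >= self.minimum]


def run_chain(blocks, rows):
    """Feed the output of each block into the next one, in order."""
    for block in blocks:
        rows = block.generate(rows)
    return rows


rows = [{"question": "Why?"}, {"question": "What is a pipeline?"}]
output = run_chain([AddLengthBlock(), MinLengthBlock(10)], rows)
print(output)
```

Because each block only sees rows in and rows out, blocks can be reordered or swapped without touching their neighbours.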
Blocks can be chained together to form a Pipeline. Pipelines enable:
There are three default pipelines shipped in SDG: simple, full, and eval. Each pipeline has specific hardware requirements.
The simple pipeline is designed to be used with quantized Merlinite as the teacher model. It enables basic data generation results on low-end consumer grade hardware, such as laptops and desktops with small or no discrete GPUs.
The full pipeline is designed to be used with Mixtral-8x7B-Instruct-v0.1 as the teacher model, but has also been successfully tested with smaller models such as Mistral-7B-Instruct-v0.2 and even some quantized versions of the two teacher models. This is the preferred data generation pipeline on higher-end consumer-grade hardware and all enterprise hardware.
The eval pipeline generates MMLU benchmark data that can later be used to evaluate a model trained on your knowledge dataset. It does not generate data for use during model training.
The Pipeline YAML configuration file is central to defining data generation workflows in the SDG Framework. This configuration file describes how blocks and pipelines are orchestrated to process and generate data efficiently. By leveraging YAML, users can create highly customizable and modular workflows without writing any code.
Pipeline configuration must adhere to our JSON schema to be considered valid.
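As a rough illustration of what such a validity check involves, the sketch below inspects an already-parsed configuration (a plain dict, as a YAML loader would produce) for the keys used in this README's example configuration. It is not the project's JSON schema: the function name `check_pipeline_config` and the rules it enforces are assumptions inferred from that example.

```python
def check_pipeline_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed.

    Hypothetical helper: the rules are inferred from the README example,
    not taken from the project's JSON schema.
    """
    problems = []
    if not isinstance(config.get("version"), str):
        problems.append("'version' must be a string")
    blocks = config.get("blocks")
    if not isinstance(blocks, list) or not blocks:
        problems.append("'blocks' must be a non-empty list")
        return problems
    for index, block in enumerate(blocks):
        if not isinstance(block, dict):
            problems.append(f"blocks[{index}] must be a mapping")
            continue
        # Every block in the README example carries these three keys.
        for key in ("name", "type", "config"):
            if key not in block:
                problems.append(f"blocks[{index}] is missing '{key}'")
    return problems


valid = {
    "version": "1.0",
    "blocks": [{"name": "gen_questions", "type": "LLMBlock", "config": {}}],
}
invalid = {
    "version": "1.0",
    "blocks": [{"name": "gen_responses", "type": "LLMBlock"}],
}

print(check_pipeline_config(valid))    # []
print(check_pipeline_config(invalid))  # ["blocks[0] is missing 'config'"]
```

Collecting all problems in a list, rather than raising on the first one, lets a user fix a configuration in a single pass.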
Modular Design:
Reusability:
Ease of Configuration:
Here is an example of a Pipeline configuration:
version: "1.0"
blocks:
  - name: gen_questions
    type: LLMBlock
    config:
      config_path: configs/skills/freeform_questions.yaml
      output_cols:
        - question
      batch_kwargs:
        num_samples: 30
    drop_duplicates:
      - question
  - name: filter_questions
    type: FilterByValueBlock
    config:
      filter_column: score
      filter_value: 1.0
      operation: eq
      convert_dtype: float
    drop_columns:
      - evaluation
      - score
      - num_samples
  - name: gen_responses
    type: LLMBlock
    config:
      config_path: configs/skills/freeform_responses.yaml
      output_cols:
        - response
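To make the filter_questions step concrete, here is a minimal sketch of how a filter-by-value step followed by column dropping could behave. It is written in plain Python over lists of dicts and is not the framework's FilterByValueBlock; the helper names `filter_by_value` and `drop_columns`, and the sample rows, are invented for this example.

```python
import operator


def filter_by_value(rows, filter_column, filter_value, operation, convert_dtype=None):
    """Keep rows where operation(row[filter_column], filter_value) is true."""
    compare = getattr(operator, operation)  # e.g. "eq" -> operator.eq
    kept = []
    for row in rows:
        value = row[filter_column]
        if convert_dtype is not None:
            value = convert_dtype(value)  # e.g. "1.0" -> 1.0
        if compare(value, filter_value):
            kept.append(row)
    return kept


def drop_columns(rows, columns):
    """Return copies of the rows without the named columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


# Hypothetical sample rows; scores arrive as strings, hence convert_dtype=float.
rows = [
    {"question": "What is Arrow?", "evaluation": "clear", "score": "1.0", "num_samples": 30},
    {"question": "Huh?", "evaluation": "vague", "score": "0.0", "num_samples": 30},
]

kept = filter_by_value(rows, "score", 1.0, "eq", convert_dtype=float)
result = drop_columns(kept, ["evaluation", "score", "num_samples"])
print(result)  # [{'question': 'What is Arrow?'}]
```

Converting the column before comparing matters here: without it, the string "1.0" would never equal the float 1.0 and every row would be filtered out.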
Data Representation: Data flow between blocks and pipelines is handled using Hugging Face Datasets, which are based on Arrow tables. This provides:
Data Checkpoints: Intermediate caches of generated data. Checkpoints allow users to:
Clone the library and navigate to the repo:
git clone https://github.com/instructlab/sdg
cd sdg
Install the library:
pip install .
You can import SDG into your Python files with the following items:
from instructlab.sdg.generate_data import generate_data
from instructlab.sdg.utils import GenerateException
|-- src/instructlab/ (1)
|-- docs/ (2)
|-- scripts/ (3)
|-- tests/ (4)
FAQs
Synthetic Data Generation
We found that instructlab-sdg demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 3 open source maintainers collaborating on the project.