megatron-energon


Megatron's multi-modal data loader

Version 7.2.1 · pip (PyPI) · Maintainers: 1


Megatron Energon



DISCLAIMER: This package contains research code. APIs may change.

What is this?

Megatron Energon is the multi-modal data loader of Megatron (you can also use it independently).

It's best at

  • loading large training data to train large multi-modal models
  • blending many different datasets together
  • distributing the work across many nodes and processes of a cluster
  • ensuring reproducibility and resumability
  • adapting easily to various types of data samples and processing

Try using it together with Megatron Core.

Quickstart

Megatron Energon is a pip-installable Python package that offers

  • dataset-related classes that you can import in your project
  • a command line utility for data preprocessing and conversion

This document is just a quick start. Please also check out the documentation.

Installation

To install the latest stable version:

pip install megatron-energon

Or to install the current development version:

pip install git+https://github.com/NVIDIA/Megatron-Energon.git

NOTE: We encourage you to install the package (and not just import a local copy). This will ensure you have all the needed dependencies and that you can use the command line tool.
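
To confirm that the package is actually installed and that the command line tool is reachable, a quick check like the following can help. This is a minimal sketch using only the Python standard library; the helper name `check_install` is hypothetical and not part of Energon.

```python
import shutil
from importlib import metadata


def check_install(dist_name="megatron-energon", cli_name="energon"):
    """Return (installed_version, cli_path); either value is None if not found."""
    try:
        # Looks up the installed distribution metadata (not a local source copy)
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    # shutil.which returns None when the command is not on PATH
    return version, shutil.which(cli_name)


if __name__ == "__main__":
    version, cli_path = check_install()
    print("megatron-energon version:", version or "not installed")
    print("energon command:", cli_path or "not found on PATH")
```

If the version is reported as not installed while `import megatron.energon` still works, you are probably importing a local copy rather than the installed package.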

For more details on installing this package, see the documentation.

Usage of command line tool

After installation, the command energon will be available.

Here are some examples of what you can do:

Command                        Description
energon prepare DATASET_ROOT   Take an existing WebDataset and add the required yaml files to turn it into an energon-compatible dataset
energon lint DATASET_ROOT      Verify that the dataset complies with the energon dataset format and that all samples are loadable
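
If you want to call these commands from a Python script (for example, to lint a dataset before a training job starts), one possible approach is to invoke the CLI through `subprocess`. This is only a sketch: the helpers `energon_cmd` and `run_lint` are hypothetical names, only the two subcommands listed above are used, and how the exit code should be interpreted is an assumption to verify against the documentation.

```python
import subprocess


def energon_cmd(subcommand, dataset_root):
    """Build the argument list for one of the two commands listed above."""
    if subcommand not in ("prepare", "lint"):
        raise ValueError(f"unsupported subcommand in this sketch: {subcommand!r}")
    return ["energon", subcommand, str(dataset_root)]


def run_lint(dataset_root):
    """Run `energon lint DATASET_ROOT` and return the process exit code.

    Assumption: the caller decides what a given exit code means.
    """
    return subprocess.run(energon_cmd("lint", dataset_root)).returncode
```

Note that `energon prepare` starts an interactive assistant, so it is usually easier to run it directly in a terminal than from a script.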

Usage of the library

To get started, pick a WebDataset-compliant dataset and run energon prepare DATASET_ROOT on it. This starts the interactive assistant and creates the .nv-meta folder. As an alternative to WebDataset, Energon also supports the JSONL format; see the documentation for details.

Once done, try to load it from your Python program:

from megatron.energon import get_train_dataset, get_loader, WorkerConfig

# Single-process setup: rank 0 in a world of size 1, with 2 loader workers
simple_worker_config = WorkerConfig(rank=0, world_size=1, num_workers=2)

# Create the training dataset from the prepared dataset folder
train_ds = get_train_dataset(
    '/my/dataset/path',
    batch_size=2,
    shuffle_buffer_size=None,
    max_samples_per_sequence=None,
    worker_config=simple_worker_config,
)

# Wrap the dataset in a loader that yields batches
train_loader = get_loader(train_ds)

for batch in train_loader:
    # Do something with batch
    # Infer, gradient step, ...
    pass

For more details, read the documentation.
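
The example above hardcodes rank=0 and world_size=1. When running across several processes, those values have to come from the launcher. The following sketch assumes the RANK and WORLD_SIZE environment variables are set (as launchers such as torchrun do) and falls back to the single-process values otherwise. The helper names are hypothetical, and Energon may offer its own way to do this, so check the documentation before relying on it.

```python
import os


def worker_config_kwargs(env=None, num_workers=2):
    """Derive WorkerConfig keyword arguments from launcher environment variables.

    Assumption: RANK and WORLD_SIZE are set by the launcher; if they are
    missing, fall back to a single-process setup (rank 0, world size 1).
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
        "num_workers": num_workers,
    }


def make_worker_config(num_workers=2):
    """Build a WorkerConfig using the same three arguments as the example above."""
    # Imported lazily so the helper above works without the package installed
    from megatron.energon import WorkerConfig

    return WorkerConfig(**worker_config_kwargs(num_workers=num_workers))
```

The resulting object can be passed as worker_config to get_train_dataset in place of simple_worker_config.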
