hf-dataset

use HuggingFace datasets from Node.js

Version 0.1.0 (latest), published on npm
A Node.js library for streaming HuggingFace datasets with support for Parquet, CSV, and JSONL formats.

Installation

npm install hf-dataset

Quick Start

import { HFDataset } from 'hf-dataset';

// Load a dataset and iterate through it
const dataset = await HFDataset.create('Salesforce/wikitext');

for await (const row of dataset) {
  console.log(row.text);
  break; // Just show the first row
}

Features

  • Multiple Formats: Supports Parquet, CSV, and JSONL files
  • Gzipped Files: Automatically handles .gz compressed files
  • Streaming: Memory-efficient iteration over large datasets
  • TypeScript: Full TypeScript support with generics
  • Authentication: Support for private/gated datasets with HF tokens

API Reference

HFDataset.create(dataset, options?)

Creates a new dataset instance.

Parameters:

  • dataset (string): HuggingFace dataset identifier (e.g., 'Salesforce/wikitext')
  • options (object, optional):
    • token (string): HuggingFace token for private or gated datasets (defaults to process.env.HF_TOKEN)
    • revision (string): Git revision or tag (defaults to 'main')

Returns: Promise<HFDataset>

// Public dataset
const publicDataset = await HFDataset.create('Salesforce/wikitext');

// Private dataset with token
const privateDataset = await HFDataset.create('my-org/private-dataset', {
  token: 'hf_xxxxxxxxxxxxx'
});

// Specific revision
const pinnedDataset = await HFDataset.create('Salesforce/wikitext', {
  revision: 'v1.0'
});

Iteration

The dataset implements AsyncIterable, so you can use for await loops:

const dataset = await HFDataset.create('Salesforce/wikitext');

// Process all rows
for await (const row of dataset) {
  console.log(row);
}

// Process first N rows
let count = 0;
for await (const row of dataset) {
  console.log(row);
  if (++count >= 100) break;
}
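The early-exit loop above can also be wrapped in a small reusable helper. The sketch below is not part of hf-dataset: `take` and `fakeRows` are hypothetical names, and `fakeRows` stands in for a real dataset so the example runs without network access. It relies only on the fact that the dataset is an AsyncIterable.

```javascript
// Yield at most `limit` rows from any async iterable, then stop.
async function* take(rows, limit) {
  if (limit <= 0) return;
  let count = 0;
  for await (const row of rows) {
    yield row;
    if (++count >= limit) return; // stop pulling rows from the source
  }
}

// Hypothetical stand-in for a dataset (any AsyncIterable works).
async function* fakeRows() {
  for (let i = 0; i < 1000; i++) yield { text: `row ${i}` };
}

// Collect the text of the first three rows.
async function demo() {
  const first = [];
  for await (const row of take(fakeRows(), 3)) first.push(row.text);
  return first;
}
```

With a real dataset, the same helper would presumably be used as `for await (const row of take(dataset, 100)) { ... }`.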

listFiles()

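Returns an array describing the files discovered in the dataset. Each entry has the file's path, its type, and a gz flag indicating whether it is gzip-compressed.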

const dataset = await HFDataset.create('Salesforce/wikitext');
const files = dataset.listFiles();

console.log(files);
// [
//   { path: 'train.parquet', type: 'parquet', gz: false },
//   { path: 'test.csv.gz', type: 'csv', gz: true }
// ]
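Because the result is a plain array of objects, it is easy to summarize before iterating. The following is a minimal sketch, assuming entries have exactly the `path`, `type`, and `gz` fields shown above; `summarizeFiles` is a hypothetical helper, not a library function.

```javascript
// Count discovered files per format, and how many are gzip-compressed.
function summarizeFiles(files) {
  const summary = { byType: {}, gzipped: 0 };
  for (const file of files) {
    summary.byType[file.type] = (summary.byType[file.type] ?? 0) + 1;
    if (file.gz) summary.gzipped += 1;
  }
  return summary;
}

// Sample entries copied from the listFiles() output shown above.
const sampleFiles = [
  { path: 'train.parquet', type: 'parquet', gz: false },
  { path: 'test.csv.gz', type: 'csv', gz: true },
];

console.log(summarizeFiles(sampleFiles));
// { byType: { parquet: 1, csv: 1 }, gzipped: 1 }
```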

Authentication

For private or gated datasets, provide your HuggingFace token in one of two ways.

Environment Variable

export HF_TOKEN=hf_xxxxxxxxxxxxx

Explicit Token

const dataset = await HFDataset.create('my-org/private-dataset', {
  token: 'hf_xxxxxxxxxxxxx'
});
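To avoid hardcoding a token in source files, the two approaches can be combined in a small wrapper. This is a sketch under assumptions: `buildCreateOptions` is a hypothetical helper, and it only builds the documented `token` option. It fails early when no token is available instead of leaving that to be discovered later.

```javascript
// Build the options object for HFDataset.create() without hardcoding a token.
// An explicit token wins; otherwise fall back to the HF_TOKEN environment variable.
function buildCreateOptions(explicitToken, env = process.env) {
  const token = explicitToken ?? env.HF_TOKEN;
  if (!token) {
    throw new Error('No HuggingFace token: set HF_TOKEN or pass one explicitly');
  }
  return { token };
}

// Intended usage (requires hf-dataset and a real token):
// const dataset = await HFDataset.create('my-org/private-dataset', buildCreateOptions());
```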

Examples

Working with Different File Formats

Parquet Files:

const dataset = await HFDataset.create('Salesforce/wikitext');

for await (const row of dataset) {
  console.log(row.text); // Parquet preserves column types
}

CSV Files:

const dataset = await HFDataset.create('lvwerra/red-wine');

for await (const row of dataset) {
  console.log(row); // CSV columns as string values
}
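Since CSV values arrive as strings, numeric columns need converting before any arithmetic. One possible approach is sketched below; `coerceNumericColumns` is a hypothetical helper and the sample row uses made-up column names, since the real ones depend on the dataset.

```javascript
// Convert string values that parse as finite numbers into numbers; leave the rest as-is.
function coerceNumericColumns(row) {
  const out = {};
  for (const [key, value] of Object.entries(row)) {
    const isNumeric =
      typeof value === 'string' &&
      value.trim() !== '' &&
      Number.isFinite(Number(value));
    out[key] = isNumeric ? Number(value) : value;
  }
  return out;
}

// Hypothetical row; real column names depend on the dataset.
const sampleRow = { alcohol: '9.4', quality: '5', note: 'dry' };
console.log(coerceNumericColumns(sampleRow));
// { alcohol: 9.4, quality: 5, note: 'dry' }
```

Empty strings are deliberately left untouched, because `Number('')` evaluates to 0 and would silently turn missing values into zeros.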

JSONL Files:

const dataset = await HFDataset.create('BeIR/scifact');

for await (const row of dataset) {
  console.log(row._id, row.title); // JSON structure preserved
}

TypeScript Usage

interface WikiTextRow {
  text: string;
}

const dataset = await HFDataset.create<WikiTextRow>('Salesforce/wikitext');

for await (const row of dataset) {
  console.log(row.text); // TypeScript knows this is a string
}

Processing Large Datasets

// processBatch is a placeholder for your own async handler
const dataset = await HFDataset.create('large-dataset');

let processedCount = 0;
const batchSize = 1000;
const batch = [];

for await (const row of dataset) {
  batch.push(row);

  if (batch.length === batchSize) {
    await processBatch(batch);
    processedCount += batch.length;
    batch.length = 0; // Clear batch
    console.log(`Processed ${processedCount} rows`);
  }
}

// Process remaining rows
if (batch.length > 0) {
  await processBatch(batch);
  processedCount += batch.length;
  console.log(`Processed ${processedCount} rows`);
}
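The batching logic above can be factored into a generic async generator so the processing loop stays short. This is a sketch, not part of hf-dataset: `batched`, `fakeRows`, and `batchSizes` are hypothetical names, and `fakeRows` replaces a real dataset so the example is self-contained.

```javascript
// Group rows from any async iterable into arrays of up to `size` rows.
async function* batched(rows, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  let batch = [];
  for await (const row of rows) {
    batch.push(row);
    if (batch.length === size) {
      yield batch;
      batch = []; // fresh array, so consumers may safely keep the previous batch
    }
  }
  if (batch.length > 0) yield batch; // final partial batch
}

// Hypothetical stand-in for a dataset.
async function* fakeRows(total) {
  for (let i = 0; i < total; i++) yield { id: i };
}

// Return the size of each batch produced for `total` rows.
async function batchSizes(total, size) {
  const sizes = [];
  for await (const batch of batched(fakeRows(total), size)) sizes.push(batch.length);
  return sizes;
}
```

Allocating a new array per batch, rather than resetting `batch.length`, costs a little more memory churn but avoids surprises if the handler holds on to a batch after returning.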

Requirements

  • Node.js >= 24.3.0

License

MIT - see LICENSE file for details.

Keywords

huggingface

Package last updated on 26 Aug 2025
