
A Node.js library for streaming HuggingFace datasets with support for Parquet, CSV, and JSONL formats.
npm install hf-dataset
import { HFDataset } from 'hf-dataset';

// Load a dataset and iterate through it
const dataset = await HFDataset.create('Salesforce/wikitext');
for await (const row of dataset) {
  console.log(row.text);
  break; // Just show the first row
}
Supports .gz compressed files.

HFDataset.create(dataset, options?)
Creates a new dataset instance.
Parameters:
dataset (string): HuggingFace dataset identifier (e.g., 'Salesforce/wikitext')
options (object, optional):
  token (string): HuggingFace token for private datasets (defaults to process.env.HF_TOKEN)
  revision (string): Git revision or tag (defaults to 'main')
Returns: Promise<HFDataset>
// Public dataset
const publicDataset = await HFDataset.create('Salesforce/wikitext');

// Private dataset with token
const privateDataset = await HFDataset.create('my-org/private-dataset', {
  token: 'hf_xxxxxxxxxxxxx'
});

// Specific revision
const pinnedDataset = await HFDataset.create('Salesforce/wikitext', {
  revision: 'v1.0'
});
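As a mental model for the defaults listed above, the following sketch shows how an explicit option could take precedence over the environment. The helper name resolveOptions and its env parameter are hypothetical and exist only for illustration; the library's actual internals may differ.

```javascript
// Hypothetical helper (not part of hf-dataset): mirrors the documented defaults.
function resolveOptions(options = {}, env = {}) {
  return {
    // An explicit token wins; otherwise fall back to HF_TOKEN from the environment.
    token: options.token ?? env.HF_TOKEN,
    // The documented default revision is 'main'.
    revision: options.revision ?? 'main',
  };
}

// Example call: resolveOptions({ revision: 'v1.0' }, process.env)
```

Passing the environment in as an argument keeps the function easy to test without touching the real process.env.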
The dataset implements AsyncIterable, so you can use for await loops:
const dataset = await HFDataset.create('Salesforce/wikitext');

// Process all rows
for await (const row of dataset) {
  console.log(row);
}

// Process first N rows
let count = 0;
for await (const row of dataset) {
  console.log(row);
  if (++count >= 100) break;
}
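Because the dataset is a plain AsyncIterable, the "first N rows" pattern above can be factored into a small reusable generator. The helper name take is an assumption made for this sketch and is not provided by hf-dataset.

```javascript
// Hypothetical helper (not part of hf-dataset): yields at most `limit` items
// from any async (or sync) iterable.
async function* take(source, limit) {
  if (limit <= 0) return;
  let seen = 0;
  for await (const item of source) {
    yield item;
    // Returning here exits the loop early, which also closes the source iterator.
    if (++seen >= limit) return;
  }
}

// Usage with a dataset:
//   for await (const row of take(dataset, 100)) console.log(row);
```

Keeping the limit logic in one place avoids repeating a manual counter in every loop.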
listFiles()
Returns information about discovered files in the dataset.
const dataset = await HFDataset.create('Salesforce/wikitext');
const files = dataset.listFiles();
console.log(files);
// [
//   { path: 'train.parquet', type: 'parquet', gz: false },
//   { path: 'test.csv.gz', type: 'csv', gz: true }
// ]
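The entries returned by listFiles() are plain objects, so they are easy to summarize before streaming. The sketch below assumes only the path, type, and gz fields shown in the sample output above; countByType is a hypothetical helper name.

```javascript
// Hypothetical helper (not part of hf-dataset): counts listFiles()-style
// entries by their `type` field.
function countByType(files) {
  const counts = {};
  for (const file of files) {
    counts[file.type] = (counts[file.type] ?? 0) + 1;
  }
  return counts;
}

// Usage: console.log(countByType(dataset.listFiles()));
```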
For private or gated datasets, provide your HuggingFace token, either through the HF_TOKEN environment variable:
export HF_TOKEN=hf_xxxxxxxxxxxxx
or by passing it directly in the options:
const dataset = await HFDataset.create('my-org/private-dataset', {
  token: 'hf_xxxxxxxxxxxxx'
});
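To avoid hardcoding a token in source files, one option is to read it from the environment and fail early with a clear message when it is missing. The helper requireToken below is a hypothetical sketch and not part of hf-dataset.

```javascript
// Hypothetical helper (not part of hf-dataset): reads HF_TOKEN from an
// env-like object and throws if it is missing or blank.
function requireToken(env) {
  const token = env.HF_TOKEN;
  if (typeof token !== 'string' || token.trim() === '') {
    throw new Error('HF_TOKEN is not set; export it before loading private datasets');
  }
  return token.trim();
}

// Usage:
//   HFDataset.create('my-org/private-dataset', { token: requireToken(process.env) });
```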
Parquet Files:
const dataset = await HFDataset.create('Salesforce/wikitext');
for await (const row of dataset) {
  console.log(row.text); // Parquet preserves column types
}

CSV Files:
const dataset = await HFDataset.create('lvwerra/red-wine');
for await (const row of dataset) {
  console.log(row); // CSV columns as string values
}

JSONL Files:
const dataset = await HFDataset.create('BeIR/scifact');
for await (const row of dataset) {
  console.log(row._id, row.title); // JSON structure preserved
}
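Since CSV columns arrive as string values, numeric columns usually need converting before any arithmetic. The sketch below is one possible approach; coerceNumericColumns is a hypothetical helper, and it only recognizes plain decimal notation (no exponents, hex, or thousands separators).

```javascript
// Hypothetical helper (not part of hf-dataset): returns a copy of the row
// with plain decimal strings converted to numbers; other values are kept.
const DECIMAL_PATTERN = /^[+-]?\d+(\.\d+)?$/;

function coerceNumericColumns(row) {
  const out = {};
  for (const [key, value] of Object.entries(row)) {
    const looksNumeric = typeof value === 'string' && DECIMAL_PATTERN.test(value.trim());
    out[key] = looksNumeric ? Number(value) : value;
  }
  return out;
}

// Usage:
//   for await (const row of dataset) console.log(coerceNumericColumns(row));
```

The original row is left untouched, so the raw string values remain available if a conversion turns out to be unwanted.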
interface WikiTextRow {
  text: string;
}

const dataset = await HFDataset.create<WikiTextRow>('Salesforce/wikitext');
for await (const row of dataset) {
  console.log(row.text); // TypeScript knows this is a string
}
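TypeScript type parameters are erased at compile time, so the generic above describes the expected row shape without checking it at runtime. If the data source is not fully trusted, a small runtime check can complement it. The function isWikiTextRow is a hypothetical sketch, written in plain JavaScript so it runs without a compile step.

```javascript
// Hypothetical runtime check (not part of hf-dataset) for the WikiTextRow
// shape: an object with a string `text` property.
function isWikiTextRow(row) {
  return typeof row === 'object' && row !== null && typeof row.text === 'string';
}

// Usage:
//   for await (const row of dataset) {
//     if (!isWikiTextRow(row)) continue; // skip rows with an unexpected shape
//     console.log(row.text);
//   }
```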
// processBatch is your own async function that handles an array of rows
const dataset = await HFDataset.create('large-dataset');
let processedCount = 0;
const batchSize = 1000;
const batch = [];
for await (const row of dataset) {
  batch.push(row);
  if (batch.length === batchSize) {
    await processBatch(batch);
    processedCount += batch.length;
    batch.length = 0; // Clear batch in place
    console.log(`Processed ${processedCount} rows`);
  }
}
// Process remaining rows
if (batch.length > 0) {
  await processBatch(batch);
  processedCount += batch.length;
  console.log(`Processed ${processedCount} rows`);
}
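The batching loop above can also be expressed as a reusable async generator, which keeps the leftover-rows handling in one place. The helper name batched is an assumption made for this sketch; it is not part of hf-dataset.

```javascript
// Hypothetical helper (not part of hf-dataset): groups items from any async
// (or sync) iterable into arrays of at most `size` elements.
async function* batched(source, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  let batch = [];
  for await (const item of source) {
    batch.push(item);
    if (batch.length === size) {
      yield batch;
      batch = []; // fresh array, so consumers may keep the yielded one
    }
  }
  // Emit the final, possibly smaller, batch.
  if (batch.length > 0) yield batch;
}

// Usage:
//   for await (const rows of batched(dataset, 1000)) await processBatch(rows);
```

Unlike clearing the array in place with batch.length = 0, allocating a new array per batch means a consumer that stores a batch for later will not see it emptied.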
MIT - see LICENSE file for details.