hyparquet
JavaScript parser for Apache Parquet files.
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
Dependency free since 2023!
Features
- Designed to work with huge ML datasets (things like starcoder)
- Can load metadata separately from data
- Data can be filtered by row and column ranges
- Only fetches the data needed
- Written in JavaScript, checked with TypeScript
- Fast data loading for large scale ML applications
- Bring data visualization closer to the user, in the browser
Why make a new parquet parser in JavaScript?
First, existing libraries like parquetjs are officially "inactive".
Importantly, they do not support the kind of stream processing needed to make a really performant parser in the browser.
And finally, no dependencies means that hyparquet is lean, and easy to package and deploy.
Demo
Online parquet file reader demo available at:
https://hyparam.github.io/hyparquet/
Demo source: index.html
Installation
npm install hyparquet
Usage
If you're in a Node.js environment, you can load a parquet file with the following example:
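const { parquetMetadata } = await import('hyparquet')
const fs = await import('fs')
// Read the whole file into a Node.js Buffer
const buffer = fs.readFileSync('example.parquet')
// Copy the bytes into a standalone ArrayBuffer
const arrayBuffer = new Uint8Array(buffer).buffer
// Parse the parquet metadata from the ArrayBuffer
const metadata = parquetMetadata(arrayBuffer)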
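The copy through `Uint8Array` in the example above is not incidental. Node.js can hand out small `Buffer` objects as views into a larger shared pool, so reading `buffer.buffer` directly may expose unrelated bytes. Below is a minimal sketch of an equivalent conversion; `toArrayBuffer` is a hypothetical helper name and not part of hyparquet.

```javascript
// Hypothetical helper (not part of hyparquet): copy exactly the bytes that a
// Node.js Buffer views into a standalone ArrayBuffer.
function toArrayBuffer(buffer) {
  // A Buffer can be a window into a larger ArrayBuffer, so honor its offset and length.
  return buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
}

// Parquet files start with the 4-byte magic "PAR1".
const magic = toArrayBuffer(Buffer.from('PAR1'))
console.log(magic.byteLength) // 4
```

Either form produces an `ArrayBuffer` that holds only the file's bytes, which is what `parquetMetadata` is given in the example above.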
If you're in a browser environment, you'll probably get parquet file data either from a file the user drags and drops, or from a download over the web.
To load parquet data in the browser from a remote server using fetch:
import { parquetMetadata } from 'hyparquet'
const res = await fetch(url)
const arrayBuffer = await res.arrayBuffer()
const metadata = parquetMetadata(arrayBuffer)
To parse parquet files from a user drag-and-drop action, see the example in index.html.
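For orientation, here is a rough sketch of how a dropped file could be turned into an `ArrayBuffer` for `parquetMetadata`. It is not necessarily what index.html does: the helper name `droppedFileToArrayBuffer` and the `dropzone` element id are hypothetical, and only standard browser APIs (`DataTransfer.files`, `Blob.arrayBuffer()`) are assumed.

```javascript
// Hypothetical helper: resolve with the bytes of the first file in a drop event.
async function droppedFileToArrayBuffer(event) {
  const file = event.dataTransfer?.files?.[0]
  if (!file) throw new Error('no file in drop event')
  // File inherits Blob.arrayBuffer(), which reads the whole file into memory.
  return file.arrayBuffer()
}

// Browser wiring, assuming an element with id "dropzone" exists on the page:
// const dropzone = document.getElementById('dropzone')
// dropzone.addEventListener('dragover', e => e.preventDefault())
// dropzone.addEventListener('drop', async e => {
//   e.preventDefault() // stop the browser from navigating to the file
//   const metadata = parquetMetadata(await droppedFileToArrayBuffer(e))
// })
```

Because this reads the whole file into memory, it suits files that comfortably fit in the browser tab.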
Async
Hyparquet supports asynchronous fetching of parquet files over a network.
You can provide an AsyncBuffer, which is like a JS ArrayBuffer, but the slice method returns Promise<ArrayBuffer>.
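As an illustration, the sketch below builds two objects with that shape. It assumes an AsyncBuffer needs only a `byteLength` property and an asynchronous `slice(start, end)`. Both helper names are hypothetical and not part of hyparquet, and the URL variant further assumes a server that honors HTTP `Range` headers and reports `Content-Length` on a `HEAD` request.

```javascript
// Hypothetical helper: expose an in-memory ArrayBuffer through the AsyncBuffer shape.
function asyncBufferFromArrayBuffer(arrayBuffer) {
  return {
    byteLength: arrayBuffer.byteLength,
    // Same arguments as ArrayBuffer.prototype.slice, but the result is a Promise.
    slice: (start, end) => Promise.resolve(arrayBuffer.slice(start, end)),
  }
}

// Hypothetical helper: fetch only the requested bytes with HTTP range requests.
// Assumes the server honors Range headers and reports Content-Length on HEAD.
async function asyncBufferFromUrl(url) {
  const head = await fetch(url, { method: 'HEAD' })
  const byteLength = Number(head.headers.get('Content-Length'))
  return {
    byteLength,
    async slice(start, end = byteLength) {
      // HTTP byte ranges are inclusive, while slice's end is exclusive.
      const res = await fetch(url, { headers: { Range: `bytes=${start}-${end - 1}` } })
      return res.arrayBuffer()
    },
  }
}
```

The in-memory variant is handy for tests. The URL variant downloads only the byte ranges that are asked for, and as written it does not handle negative offsets the way `ArrayBuffer.prototype.slice` does.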
Supported Parquet Files
The parquet format supports a number of different compression and encoding types.
Hyparquet does not support 100% of all parquet files, and probably never will: supporting every possible compression type would increase the size of the library, and many of those types are rarely used in practice.
Compression:
Page Type:
Contributions are welcome!
References