CSV

CSV Parser for Node Apps

Features:

  • Easy to use
  • Simple, lightweight, and fast
  • Parses CSV data to JS objects
  • Header field mapping
  • Value field mapping
  • Supports the CSV spec (RFC 4180)
  • Memory safe: keeps at most 1 row in memory during streaming
  • Supports streaming, string parsing, and promises
  • Supports single and multi-byte newlines (\n, \r\n)
  • Custom newline support
  • Works with buffers and strings
  • Only 1 dependency: lodash

About

I was working on a BI system at work, and as our user base kept growing, our jobs got slower and slower. A huge bottleneck for us was CSV parsing. We tried many of the popular CSV libraries on npm, but were unable to get a decent level of performance consistently from our many jobs. We are parsing around a billion rows per day, with the average row being around 1.5 KB and 80+ columns (it's not uncommon for a single job to parse over 10 GB of data). Some columns have complex JSON strings, some are empty, some are in languages like Chinese, and some use \r\n for newlines. We needed something fast, chunk-boundary safe, and character-encoding safe; something that worked well on all of our various CSV formats. Some of the other parsers failed to accurately parse \r\n newlines across chunk boundaries, some failed altogether, some caused out-of-memory errors, some were just crazy slow, and some had encoding errors with multi-byte characters.

This library is the result of those learnings. It's fast, stable, works on any data, and is highly configurable. Feel free to check out and run the benchmarks. I'm currently able to achieve around 100 MB per second of parsing (on an i7 9700K), depending on the data and CPU.

Usage

Add csv as a dependency for your app and install via npm

npm install @danmasta/csv --save

Require the package in your app

const csv = require('@danmasta/csv');

By default, csv() returns a transform stream interface. You can use the methods promise(), map(), each(), and parse() to consume row objects via promises or as a synchronous array.
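
As a minimal sketch of the stream interface, assuming the returned transform stream emits one parsed row object per 'data' event (consistent with the objectMode default noted in the options below):

const fs = require('fs');
const csv = require('@danmasta/csv');

// Pipe a CSV file through the parser and log each row object as it is emitted
fs.createReadStream('./data.csv')
    .pipe(csv())
    .on('data', row => {
        console.log('Row:', row);
    })
    .on('end', () => {
        console.log('Done parsing');
    });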

Options

  • headers (boolean|object|array|function): If truthy, reads the first line as header fields. If false, disables header fields and replaces them with integer values of 0-n. If an object, each original header field is replaced with its value in the mapping. If a function, each header field is set to the function's return value. Default is true
  • values (boolean|object|function): Same as headers, but for values. If you want to replace values on the fly, provide an object: {'null': null, 'True': true}. If a function, each value is replaced with the function's return value. Default is null
  • newline (string): Which character(s) to use for newlines. Default is \n
  • cb (function): Function to call when ready to flush a complete row. Used only for the Parser class. If you implement a custom parser you will need to include a cb function. Default is this.push
  • buffer (boolean): If true, uses a string decoder to parse buffers. This is set automatically by the convenience methods, but will need to be set on custom parser instances. Default is false
  • encoding (string): Which encoding to use when parsing rows in buffer mode. This doesn't matter when using strings or streams not in buffer mode. Default is utf8
  • stream (object): Options to be passed to the transform stream constructor. Default is { objectMode: true, encoding: null }
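
For example, a small sketch of the object form of values, assuming the keys are matched against the raw string value of each field (the mapping below is illustrative):

// Replace literal 'null'/'true'/'false' strings with real JS values while parsing
let values = {
    'null': null,
    'true': true,
    'false': false
};

let rows = csv.parse(str, { values });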

Methods

  • Parser(opts, str): Parser class for generating custom CSV parser instances. Accepts an options object and an optional string or buffer to parse
  • Parser.parse(str, opts): Synchronous parse function. Accepts a string or buffer and an optional options object. Returns an array of parsed rows
  • promise(): Returns a promise that resolves with an array of parsed rows
  • map(fn): Runs an iterator function over each row. Returns a promise that resolves with the new array
  • each(fn): Runs an iterator function over each row. Returns a promise that resolves with undefined
  • parse(str): Parses a string or buffer and pushes rows to the stream. You need to call flush() when finished. Returns the csv parser instance
  • flush(): Flushes remaining data and pushes final rows to the stream. Returns the csv parser instance
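
As a quick illustration of map(), here is a minimal sketch assuming it resolves with the array of values returned by the iterator (the 'count' field is hypothetical):

// Coerce a numeric column on each row, then use the mapped results
csv().parse(str).map(row => {
    row.count = Number(row.count);
    return row;
}).then(rows => {
    console.log('Mapped rows:', rows);
});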

Examples

Use CRLF line endings

csv.parse(str, { newline: '\r\n' });

Parse input from a stream and write rows to a destination stream

let read = fs.createReadStream('./data.csv');

read.pipe(csv()).pipe(myDestinationStream());

Parse a string and get results in a promise

csv().parse(str).promise().then(res => {
    console.log('Rows:', res);
});

Run an iteration function over each row using each()

csv().parse(str).each(row => {
    console.log('Row:', row);
}).then(() => {
    console.log('Parse complete!');
});

Update header values

let headers = {
    time: 'timestamp',
    latitude: 'lat',
    longitude: 'lng'
};

csv.parse(str, { headers });

Update headers and/or values with functions

function headers (str) {
    return str.toLowerCase();
}

function values (val) {
    if (val === 'false') return false;
    if (val === 'true') return true;
    if (val === 'null' || val === 'undefined') return null;
    return val.toLowerCase();
}

csv.parse(str, { headers, values });

Create a custom parser that pushes to a queue

const Queue = require('queue');
const q = new Queue();

const parser = csv({ cb: q.push.bind(q) });

parser.parse(str);
parser.flush();

Testing

Testing is currently run using mocha and chai. To execute tests, just run npm run test. To generate unit test coverage reports, just run npm run coverage.

Benchmarks

Benchmarks are currently built using gulp. Just run gulp bench to test timings and bytes per second.

Filename         Mode    Density  Rows    Bytes       Time (ms)  Rows/Sec      Bytes/Sec
---------------  ------  -------  ------  ----------  ---------  ------------  --------------
Earthquakes      String  0.11     72,689  11,392,064  121.66     597,492.68    93,641,058.06
Earthquakes      Buffer  0.11     72,689  11,392,064  104.8      693,582.44    108,700,567.12
Pop-By-Zip-LA    String  0.18     3,199   120,912     1.88       1,702,582.7   64,352,197.35
Pop-By-Zip-LA    Buffer  0.18     3,199   120,912     1.28       2,490,424.44  94,130,103.07
Stats-By-Zip-NY  String  0.41     2,369   263,168     15.68      151,058.46    16,780,816.02
Stats-By-Zip-NY  Buffer  0.41     2,369   263,168     14.52      163,183.15    18,127,726.59

Speed and throughput are highly dependent on the density (token matches / byte length) of the data, as well as the size of the objects created. The earthquakes test file represents a roughly average level of density for common CSV data. The stats-by-zip-code file represents an example of very dense CSV data.

Contact

If you have any questions, feel free to get in touch.
