Huge News!Announcing our $40M Series B led by Abstract Ventures.Learn More →

neek

Package Overview

Dependencies

Advanced tools

Install Socket

Detect and block malicious and high-risk dependencies

Install

neek

A simple implementation of *nix uniq in NodeJS to take advantage of streaming

0.1.0
Source
npm

Version published: 10 years ago

Weekly downloads: 35; decreased by-61.11%

Maintainers: 1

Weekly downloads

Created: 10 years ago

Source

Neek

A simple way to filter duplicate lines from a list, à la uniq. Takes an input and filters to an output removing duplicates.

Compatibility

This module is built on each commit with TravisCI on Node 0.8.x, 0.10.x and 0.11.x. It will not work on Node 0.6.x unfortunately. In order to maintain support throughout these versions, the Hashes library is used. There are more efficient alternatives (perhaps a gaining a second per 100,000 records), however they have native components and are unstable on 0.11.x at the moment. At some point in future, I'll revisit this and implement a better HashSet - perhaps when 0.12.x is live.

Setup

Depending on your use case, there are two different ways you can install Neek. The first is as a global module, mostly for use when scripting in a shell.

$ sudo npm install -g neek

You can also install it as a local module in case you wish to use it inside another tool:

$ npm install neek

Usage

As mentioned, there are two ways to use Neek. The first use, and probably the most common, is simply invoking via a shell, or using inside a shell to remove duplicate lines:

$ neek --input dup_file.txt > output.txt

$ cat dup_file.txt | neek > output.txt

The shell version takes these parameters:

-a, --algorithm     the cipher algorithm to use (default to SHA1)
-i, --input         an input file to process
-o, --output        a file to output to
-q, --quiet         only output the processed data

The other use is from within a Node module which requires some processing to output text without duplicates, although I expect this will be less common. Below is an example inside Node:

var Neek = require('neek);

new Neek()
    .setInput(fs.createReadStream('./test/resources/lines_with_dups.txt'))
    .setOutput(fs.createWriteStream('./test/resources/output_without_dups.txt'))
    .unique('md5', function(result){

    });

You can use setInput() and setOutput() to define your streams. You then call unique() to actually remove the duplicate data. setOutput() can take a parameter "string", which will pass the output to the callback as described below. unique() can take an optional algorithm param (defaulting to SHA1), and a callback function which is passed a result object.

This object contains three fields; output, size and count. These fields translate to the following:

output  - output of the process, if you chose a string output - otherwise null
total   - the number of lines processed
unique  - the final amount of lines (without duplicate data)

Comparison

On a test set of a 293MB file containing 576,905 total lines with 322,392 unique lines, below is a comparison of the performance of Unix tool uniq and neek. This is assuming that your data is sorted.

Uniq

$ time uniq test-set.txt

# output

real	0m33.951s
user	0m27.086s
sys     0m2.161s

Neek

$ time bin/neek --input test-set.txt

# output

real	0m16.354s
user	0m13.733s
sys     0m2.217s

In the unfortunately case that your data isn't sorted, you would have to use sort, however Neek behaves the same regardless of order.

Sort

$ time sort -u test-set.txt

# output

real	1m39.203s
user	1m32.484s
sys     0m1.518s

As you can see, Neek is roughly 45% faster to run than Uniq and almost 85% faster to run than Sort, meaning it's invaluable for larger files.

Redirection

One important thing to note here is that a shell redirection is slightly faster than using the --output flag. In the processing of the above file, the --output flag took an extra 9 seconds due to the overheads inside Node.

Where possible, I would recommend simply using a shell redirection. If you do use a redirection, make sure to pass -q. Here is a comparison:

$ time bin/neek --input test-set.txt -q > output.txt

real	0m16.354s
user	0m13.733s
sys     0m2.217s

$ time bin/neek --input test-set.txt --output output.txt

Processing complete: 576905 -> 322392

real	0m25.744s
user	0m14.974s
sys     0m6.657s

Keywords

FAQs

What is neek?

Is neek popular?

Is neek well maintained?

Package last updated on 02 Jul 2014

Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

neek

Compatibility

Setup

Usage

Comparison

Redirection

Keywords

Related posts

Data Theft Repackaged: A Case Study in Malicious Wrapper Packages on npm

Malicious npm Package Typosquats Popular TypeScript ESLint Plugin, Exfiltrates Data and Enables Remote Exploitation