mongodb-collection-sample
Sample documents from a MongoDB collection.
Install
npm install --save mongodb-collection-sample
Example
npm install mongodb lodash mongodb-collection-sample
const sample = require('mongodb-collection-sample');
const { MongoClient } = require('mongodb');
const _ = require('lodash');

// Pass the connection string to the MongoClient constructor.
const client = new MongoClient('mongodb://localhost:27017');

async function main() {
  await client.connect();
  // The database name 'test' is assumed for this example.
  const db = client.db('test');

  // Seed the collection with 1,000 documents to sample from.
  const docs = _.range(0, 1000).map(function(i) {
    return {
      _id: 'needle_' + i,
      is_even: i % 2 === 0
    };
  });
  await db.collection('haystack').insertMany(docs);

  const options = {};
  options.size = 5;
  options.query = {};

  const stream = sample(db, 'haystack', options);
  stream.on('error', function(err) {
    console.error('Error in sample stream', err);
    return process.exit(1);
  });
  stream.on('data', function(doc) {
    console.log('Got sampled document `%j`', doc);
  });
  stream.on('end', async function() {
    console.log('Sampling complete! Goodbye!');
    // Close the client (not the db handle) before exiting.
    await client.close();
    process.exit(0);
  });
}

main();
Options
Supported options that can be passed to sample(db, coll, options) are:
- query: the filter to be used, default is {}
- size: the number of documents to sample, default is 5
- fields: the fields you want returned (projection object), default is null
- raw: boolean to return documents as raw BSON buffers, default is false
- sort: the sort field and direction, default is {_id: -1}
- maxTimeMS: the maxTimeMS value after which the operation is terminated, default is undefined
- promoteValues: boolean controlling whether certain BSON values are cast to native JavaScript values, default is true
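As a rough illustration of how these options combine, the sketch below merges caller overrides onto the defaults listed above. The names DOCUMENTED_DEFAULTS and buildSampleOptions are hypothetical helpers written for this example; they are not part of mongodb-collection-sample.

```javascript
// Hypothetical helper, not part of mongodb-collection-sample.
// The values mirror the defaults documented in the Options list.
const DOCUMENTED_DEFAULTS = {
  query: {},
  size: 5,
  fields: null,
  raw: false,
  sort: { _id: -1 },
  maxTimeMS: undefined,
  promoteValues: true
};

// Shallow-merge overrides onto the documented defaults.
function buildSampleOptions(overrides = {}) {
  return { ...DOCUMENTED_DEFAULTS, ...overrides };
}

// Sample 10 documents matching a filter, returning only the _id field.
const exampleOptions = buildSampleOptions({
  query: { is_even: true },
  size: 10,
  fields: { _id: 1 }
});
// exampleOptions could then be passed as sample(db, 'haystack', exampleOptions).
```

Spelling out every option is not required, since the library applies its own defaults; the helper only makes the effective values visible in one place.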
How It Works
Native Sampler
MongoDB version 3.1.6 and above generally uses the $sample aggregation operator:
db.collectionName.aggregate([
  {$match: <query>},
  {$sample: {size: <size>}},
  {$project: <fields>},
  {$sort: <sort>}
])
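The pipeline template above can be sketched as a small function that fills in the placeholders from an options object. buildSamplePipeline is a hypothetical name used for illustration, and skipping the $project and $sort stages when no value is given is an assumption of this sketch, not confirmed library behavior.

```javascript
// Hypothetical sketch: build the aggregation pipeline shown above
// from the query, size, fields, and sort options.
function buildSamplePipeline({ query = {}, size = 5, fields = null, sort = { _id: -1 } } = {}) {
  const pipeline = [{ $match: query }, { $sample: { size } }];
  // Assumption: a null projection means "return all fields", so no $project stage.
  if (fields) {
    pipeline.push({ $project: fields });
  }
  // Assumption: the $sort stage is skipped when no sort is given.
  if (sort) {
    pipeline.push({ $sort: sort });
  }
  return pipeline;
}

const examplePipeline = buildSamplePipeline({ query: { is_even: true }, size: 10 });
// examplePipeline could be passed to db.collection('haystack').aggregate(examplePipeline).
```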
Two exceptions apply. If more documents are requested than are available, the $sample stage is omitted as a performance optimization. If the sample size is above 5% of the result set count (but less than 100%), the algorithm falls back to reservoir sampling to avoid a blocking sort stage on the server.
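The decision rules described here can be sketched as a single function. chooseSamplingMode and its return labels are hypothetical names for this example, and treating a request for exactly 100% of the result set the same as "more than available" is an assumption, since the text does not specify that boundary.

```javascript
// Hypothetical sketch of the mode selection described above.
// size:  number of documents requested
// count: number of documents matching the query
function chooseSamplingMode(size, count) {
  // At least as many requested as available: no $sample stage is needed.
  // (Assumption: size === count is handled like size > count.)
  if (size >= count) {
    return 'omit-sample-stage';
  }
  // More than 5% of the result set: use client-side reservoir sampling.
  // Written as size * 20 > count to avoid floating-point comparison.
  if (size * 20 > count) {
    return 'reservoir';
  }
  // Otherwise the server-side $sample stage is used.
  return 'native-sample';
}

const smallRequest = chooseSamplingMode(5, 1000);
const largeRequest = chooseSamplingMode(100, 1000);
const oversizedRequest = chooseSamplingMode(2000, 1000);
```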
Reservoir Sampling
For MongoDB version 3.1.5 and below we use a client-side reservoir sampling algorithm.
- Query for a stream of _id values, limit 10,000.
- Read the stream of _ids and save sampleSize randomly chosen values.
- Then query the selected random documents by _id.
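The steps above can be illustrated with a minimal in-memory sketch of reservoir sampling (the classic "Algorithm R"). This is not the library's implementation: reservoirSample is a hypothetical function, it reads from a plain iterable rather than a cursor stream, and the injectable random parameter exists only to make the sketch testable.

```javascript
// Minimal sketch of reservoir sampling (Algorithm R), not the library's code.
// ids:        any iterable of _id values
// sampleSize: number of values to keep
// random:     function returning a number in [0, 1), defaults to Math.random
function reservoirSample(ids, sampleSize, random = Math.random) {
  const reservoir = [];
  let seen = 0;
  for (const id of ids) {
    seen += 1;
    if (reservoir.length < sampleSize) {
      // Fill the reservoir with the first sampleSize values.
      reservoir.push(id);
    } else {
      // Replace a random slot with probability sampleSize / seen.
      const j = Math.floor(random() * seen);
      if (j < sampleSize) {
        reservoir[j] = id;
      }
    }
  }
  return reservoir;
}

// Pick 5 of 1,000 ids, then build the filter for the final lookup by _id.
const allIds = Array.from({ length: 1000 }, (unused, i) => 'needle_' + i);
const pickedIds = reservoirSample(allIds, 5);
const lookupFilter = { _id: { $in: pickedIds } };
```

Because each value is inspected once and only sampleSize values are held at a time, memory use stays constant no matter how many _ids the stream yields.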
The two modes, illustrated:
Performance Notes
For peak performance of the client-side reservoir sampler, keep the following guidelines in mind.
- The initial query for a stream of _id values must be limited to some finite value. (Default 10k)
- This query should be covered by an index.
- Since there's a limit, you may wish to bias for recent documents via a sort. (Default: {_id: -1})
- Don't sort on {$natural: -1}: this forces a collection scan!
  "Queries that include a sort by $natural order do not use indexes to fulfill the query predicate"
- When retrieving docs: batch using one $in to reduce network chattiness.
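As a hedged sketch of how these guidelines translate into query shapes, the helpers below build the arguments for the initial _id scan and for the single batched lookup. buildIdScan and buildFetchFilter are hypothetical names; rejecting a $natural sort outright is a design choice of this example, not documented library behavior.

```javascript
// Hypothetical sketch: describe the initial _id scan following the guidelines above.
function buildIdScan({ query = {}, sort = { _id: -1 }, limit = 10000 } = {}) {
  // Guard chosen for this sketch: a $natural sort would force a collection scan.
  if ('$natural' in sort) {
    throw new Error('Sorting on $natural forces a collection scan');
  }
  return {
    filter: query,
    // Project only _id so the scan can be covered by an index on _id.
    options: { projection: { _id: 1 }, sort, limit }
  };
}

// Fetch all sampled documents in one round trip instead of one query per _id.
function buildFetchFilter(ids) {
  return { _id: { $in: ids } };
}

const scan = buildIdScan({ query: {} });
const fetchFilter = buildFetchFilter(['needle_1', 'needle_42']);
// With the Node.js driver these could be used roughly as:
//   db.collection('haystack').find(scan.filter, scan.options)
//   db.collection('haystack').find(fetchFilter)
```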
License
Apache 2