
mongo-bigquery-sync

Basic setup to sync mongo data to GBQ

npm · latest version 1.0.2

MongoDB to BigQuery Streamer

This package allows you to stream data from a MongoDB collection to a Google BigQuery table in batches. It is designed to handle large datasets by processing them in chunks, transforming the data as needed before inserting it into BigQuery, and providing progress logs during the operation.

Features

  • Batch processing: Efficiently processes large datasets by streaming in batches of configurable size.
  • Data transformation: Supports custom transformation functions to modify documents before insertion.
  • Filtered queries: Allows MongoDB query filters to only sync specific records.
  • Sorting: Supports sorting based on MongoDB fields.
  • Progress tracking: Logs progress and returns the last inserted record for monitoring or resumption.

Requirements

Before using this package, ensure you have:

  • A MongoDB instance with a collection of documents.
  • A Google BigQuery project, dataset, and table where the data will be inserted.
  • A Google Cloud service account with permission to write data to BigQuery.

Installation

To install the package, use npm or yarn:

npm install mongodb-to-bigquery-streamer

or

yarn add mongodb-to-bigquery-streamer

Usage

Here's an example of how to use the package to stream data from MongoDB to BigQuery:

const { streamDataInBatchesFromMongoDB } = require('mongodb-to-bigquery-streamer');

const options = {
  mongodbUri: 'mongodb://localhost:27017',    // MongoDB URI
  mongoDbName: 'my_database',                 // MongoDB Database Name
  mongoCollectionName: 'my_collection',       // MongoDB Collection Name
  bigQueryDataset: 'my_dataset',              // BigQuery Dataset Name
  bigQueryTable: 'my_table',                  // BigQuery Table Name
  transformFn: (doc) => doc,                  // Optional transformation function (default: identity function)
  chunkSize: 100,                             // Optional batch size (default: 10)
  mongoQueryParams: { active: true },         // Optional query parameters for MongoDB (default: {})
  sortField: '_id',                           // Optional sorting field for MongoDB (default: '_id')
  gbqProjectId: 'my-gbq-project',             // Google BigQuery Project ID
  gbqKeyFile: './path/to/keyfile.json'        // Path to BigQuery service account key
};

streamDataInBatchesFromMongoDB(options)
  .then((result) => {
    if (result.success) {
      console.log('Data successfully streamed to BigQuery.');
      console.log('Last Inserted Record:', result.lastInsertedRecord);
    } else {
      console.error('Data streaming failed.');
    }
  })
  .catch((error) => {
    console.error('Error:', error);
  });

Parameters

| Parameter           | Type     | Description                                                                 | Default Value |
| ------------------- | -------- | --------------------------------------------------------------------------- | ------------- |
| mongodbUri          | String   | MongoDB connection URI.                                                     |               |
| mongoDbName         | String   | Name of the MongoDB database.                                               |               |
| mongoCollectionName | String   | Name of the MongoDB collection.                                             |               |
| bigQueryDataset     | String   | BigQuery dataset name.                                                      |               |
| bigQueryTable       | String   | BigQuery table name.                                                        |               |
| transformFn         | Function | (Optional) Function to transform documents before inserting into BigQuery. | (doc) => doc  |
| chunkSize           | Number   | (Optional) The number of documents to process in each batch.                | 10            |
| mongoQueryParams    | Object   | (Optional) Query parameters to filter documents from MongoDB.               | {}            |
| sortField           | String   | (Optional) Field to sort the documents by (for paging through the data).    | _id           |
| gbqProjectId        | String   | Google BigQuery project ID.                                                 |               |
| gbqKeyFile          | String   | Path to the Google BigQuery service account key file.                       |               |
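A minimal configuration can rely on the defaults for all optional parameters. As a sketch (the connection values and key file path below are placeholders):

const { streamDataInBatchesFromMongoDB } = require('mongodb-to-bigquery-streamer');

// Only the required parameters; transformFn, chunkSize, mongoQueryParams,
// and sortField fall back to their documented defaults.
const minimalOptions = {
  mongodbUri: 'mongodb://localhost:27017',
  mongoDbName: 'my_database',
  mongoCollectionName: 'my_collection',
  bigQueryDataset: 'my_dataset',
  bigQueryTable: 'my_table',
  gbqProjectId: 'my-gbq-project',
  gbqKeyFile: './path/to/keyfile.json'
};

streamDataInBatchesFromMongoDB(minimalOptions)
  .then((result) => console.log('Success:', result.success));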

Example Transformation Function

You can define a custom transformation function to modify the MongoDB documents before they are inserted into BigQuery. This is useful for changing field names, formatting, or adding new fields. For example:

const transformFn = (doc) => {
  return {
    id: doc._id.toString(),
    name: doc.name.toUpperCase(),
    createdAt: doc.created_at,
  };
};

In this example, the function:

  • Converts the MongoDB _id to a string.
  • Uppercases the name field.
  • Retains the created_at field unchanged.
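
The transformation is applied by passing the function as the transformFn option alongside the rest of the configuration, for example:

const options = {
  // ...connection and BigQuery settings as in the Usage example above...
  transformFn,   // apply the custom transformation to every document before insertion
};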

Error Handling

The streamDataInBatchesFromMongoDB function returns an object indicating the result of the sync process:

{
  "success": true,
  "lastInsertedRecord": { ... }
}

If the streaming process succeeds, success will be true, and lastInsertedRecord will contain the last document that was inserted into BigQuery.

If an error occurs, success will be false, and you can inspect the logs to understand what went wrong.
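
With async/await, the same result check could look like this (a sketch based on the return shape above; options is the configuration object from the Usage section):

async function runSync() {
  try {
    const result = await streamDataInBatchesFromMongoDB(options);
    if (result.success) {
      console.log('Last inserted record:', result.lastInsertedRecord);
    } else {
      console.error('Data streaming failed; check the logs for details.');
    }
  } catch (error) {
    console.error('Unexpected error during sync:', error);
  }
}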

Handling Large Data Sets

To efficiently stream large datasets, the process is divided into batches. You can adjust the chunkSize parameter to control how many documents are processed in each batch. The default batch size is 10; a larger value reduces the number of insert requests sent to BigQuery, at the cost of holding more documents in memory per batch.

chunkSize: 500,  // Process 500 documents in each batch

BigQuery Configuration

Before using the package, make sure your Google Cloud service account has sufficient permissions to write to the specified BigQuery dataset and table.

  • Set up a service account in Google Cloud with the necessary roles (e.g., BigQuery Data Editor).
  • Download the JSON key file for the service account.
  • Use the gbqKeyFile option in the configuration to provide the path to the service account key.
gbqKeyFile: './path/to/keyfile.json'
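
As a rough sketch of those steps using the gcloud CLI (the service account name and project ID are placeholders, and your organization's setup may differ):

# Create a service account for the streamer
gcloud iam service-accounts create bq-streamer --project=my-gbq-project

# Grant it the BigQuery Data Editor role on the project
gcloud projects add-iam-policy-binding my-gbq-project \
  --member="serviceAccount:bq-streamer@my-gbq-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

# Download a JSON key file and point the gbqKeyFile option at it
gcloud iam service-accounts keys create ./path/to/keyfile.json \
  --iam-account=bq-streamer@my-gbq-project.iam.gserviceaccount.com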

License

This package is licensed under the MIT License.
