dwh-mixpanel
Stream queries from your data warehouse to events, profiles, groups, or lookup tables in Mixpanel... rETL style 💫.
No intermediate staging/storage required.
Supported Data Warehouses: BigQuery, Snowflake, and Athena.
run the module, and provide a configuration file as the first argument:
npx dwh-mixpanel ./myConfig.json
for help building a configuration file, run the module with no arguments:
npx dwh-mixpanel
what next?
the sections below cover the CLI, using dwh-mixpanel as a module, the configuration file, warehouse-specific details, and environment variables.
as stated in the tldr, if you run dwh-mixpanel with no arguments you get a CLI which helps you build a configuration file:
npx dwh-mixpanel
at the end of this walkthrough, a JSON file will be saved to your current working directory; it will contain the details of what you typed into the CLI. the CLI will then ask you if you'd like to trigger a run.
once you have a configuration file, you can run (and re-run) that job by passing the configuration file as the first argument to this module:
npx dwh-mixpanel snowflake-mixpanel.json
as it runs, you'll get console output showing the status of your job.
once the pipeline is complete, it will stash logs in the current working directory, and you can see your data in mixpanel!
note: if you use this module frequently, consider a global install:
npm install --global dwh-mixpanel
and then you don't need the npx:
dwh-mixpanel ./myConfig.json
dwh-mixpanel can also be used as an ESM module inside any node.js environment. this is useful for automation and scheduled syncs.
install it from npm:
npm install dwh-mixpanel
and then you use it as you would any other dependency:
import dwhMp from "dwh-mixpanel";
dwh-mixpanel exports a single function, which takes in a single parameter: a configuration object.
this is the entry-point for the whole module:
const myConfig = {
dwh: "bigquery",
sql: "SELECT * FROM EVENTS",
//etc...
};
const bqToMpSummary = await dwhMp(myConfig);
the module returns a summary of the unload/load job, with statistics and logs about how many records were processed, throughput, and metadata from the warehouse.
{
mixpanel: {
success: 100000,
failed: 0,
total: 100000,
requests: 50,
recordType: "event",
duration: 15314,
human: "15.314 seconds",
retries: 0,
version: "2.2.287",
workers: 10,
eps: 6529,
rps: 3.265,
responses: [],
errors: [],
},
bigquery: {
job: {
//job infos
},
metadata: {
//query metadata
},
schema: {
//schema
}
}
}
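because the module resolves with this summary, automated or scheduled syncs can act on it; here's a minimal sketch (the failure handling is hypothetical, but the summary fields are the ones shown above):

import dwhMp from "dwh-mixpanel";

// same shape as the entry-point example above
const myConfig = {
	dwh: "bigquery",
	sql: "SELECT * FROM EVENTS",
	//etc...
};

try {
	const summary = await dwhMp(myConfig);
	const { success, failed, human } = summary.mixpanel;
	console.log(`sync complete: ${success} imported, ${failed} failed (${human})`);
	if (failed > 0) console.error(summary.mixpanel.errors); // error responses, per the sample above
} catch (err) {
	// hypothetical alerting hook for scheduled syncs
	console.error("sync failed:", err);
	process.exitCode = 1;
}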
your configuration is an object (or JSON) with the following structure:
{
dwh: "", // warehouse name
auth: {}, // warehouse auth details
sql: "", // a SQL query
mappings: {}, // col headers → mixpanel fields
mixpanel: {}, // mixpanel auth
options: {}, // job options
tags: {} // arbitrary tags
}
you can find examples in the repo for different warehouses. additionally, the module is typed using jsdoc, so you should have a good experience using it in your IDE.
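for orientation, here's a hypothetical end-to-end configuration for an event import; the warehouse, query, column names, and credentials are made up, and each key is explained in detail below:

const exampleConfig = {
	dwh: "bigquery",                 // warehouse name
	auth: {
		// credentials for your warehouse; see warehouse details below
	},
	sql: "SELECT event, user_id, event_time, row_id FROM `myProject.myDB.myTable`",
	mappings: {
		event_name_col: "event",
		distinct_id_col: "user_id",
		time_col: "event_time",
		insert_id_col: "row_id"
	},
	mixpanel: {
		project_id: "12345",
		type: "event",
		region: "US",
		service_account: "mySvcAcct",
		service_secret: "mySvcSecret"
	},
	options: {
		verbose: true,
		workers: 10
	},
	tags: {
		pipeline: "nightly-sync"       // arbitrary tag added to every record
	}
};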
here's a description of what each of those keys (and values) means.
dwh
a string representing the data warehouse you're connecting to: bigquery, athena, or snowflake.
auth
an object {} containing the service account/credentials to use when authenticating with the data warehouse.
each cloud warehouse has its own method(s) of authenticating API calls, which usually consist of a username or public key and a password, secret, or private key.
to read more about the ways you can authenticate with a supported data warehouse, see warehouse details
sql
a valid SQL query to run (as a job) in your data warehouse; this query will usually be in the form of a SELECT {fields} or SELECT * statement:
SELECT
eventName,
user_id,
timestamp,
prop_a,
prop_b,
prop_c,
rowId as insert_id
FROM
"myProject.myDB.myTable"
WHERE
env = "prod"
your SQL query should produce a flat, non-nested table that has the fields and records you wish to stream to mixpanel. your column headers can have any title; you will provide a mappings dictionary (detailed below) to describe how mixpanel should receive the fields.
note: property field labels can be retitled post-ingestion using lexicon, mixpanel's data governance suite.
mappings
an object {} containing a map of column headers to mixpanel property keys.
providing mappings is not a tedious task; mixpanel is a schemaless tool designed for semi-structured data, so any column present in the table that is not explicitly mapped will still become an event/user property key and value.
the fields you must provide mappings for depend on the type of data you're importing:
events mappings
{
// REQUIRED
'event_name_col': '', // column for event name
'distinct_id_col': '', // column for unique user id
'time_col': '', // column for event time
// OPTIONAL
'insert_id_col': '' // column for row id (deduplication)
}
note: insert_id_col is required when using strict mode
user or group profiles mappings
{
// REQUIRED
'distinct_id_col': '', // column for unique user id
// OPTIONAL
'profileOperation': '', // the $set style operation to use
'name_col': '', // column $name to use for the user/group profile
'email_col': '', // column $email to use for the user/group profile
'avatar_col': '', // column $avatar to use for the user/group profile
'created_col': '', // column $created (timestamp) to use
'phone_col': '', // column $phone to use for the user/group profile
'latitude_col': '', // column $latitude to use for the user/group profile
'longitude_col': '', // column $longitude to use for the user/group profile
'ip_co': '' // column $ip to use for the user/group profile
}
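as a hypothetical illustration (these column names aren't from the repo): a table with columns uuid, email, and plan, mapped as below, would produce profile updates roughly shaped like the second object; note that the unmapped plan column still comes along as a profile property:

// hypothetical mapping
{
	distinct_id_col: 'uuid',
	email_col: 'email'
}

// approximate shape of the resulting profile update
{
	$token: "myProjectToken",       // from mixpanel.token in your config
	$distinct_id: "alice",
	$set: {
		$email: "alice@example.com",
		plan: "pro"                   // unmapped column, passed through as a profile prop
	}
}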
lookup tables mappings
{
// REQUIRED
'lookup_col' : '' //the "join" column for the lookup table
//hint: ^ this is usually the first column in the table
}
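for example (hypothetical column names), a query whose first column is item_id could use the mapping below; the full result set then replaces the lookup table referenced by mixpanel.lookupTableId, joined on item_id:

// hypothetical query: SELECT item_id, name, category FROM "myProject.myDB.items"
{
	lookup_col: 'item_id'  // the first column of the result set; mixpanel joins on this key
}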
the key to remember about mappings is that you are giving the module a guide to understand how to map fields from your warehouse to required fields for the different mixpanel data types.
here's an example:
SELECT
insert_id, timestamp, action, uuid, theme, class
FROM
mydnd.campaign.db
which produces this table:
| insert_id | timestamp | action | uuid | theme | class |
|---|---|---|---|---|---|
| abc-123 | 4:19 PM | attack | ak | dark | cleric |
| xyz-345 | 4:20 PM | defend | alice | light | bard |
| cba-678 | 4:20 PM | attack | bob | light | paladin |
| zyx-901 | 4:21 PM | sneak | eve | dark | rogue |
with this mapping:
{
event_name_col: 'action',
distinct_id_col: 'uuid',
time_col: 'timestamp',
insert_id_col: 'insert_id'
}
then produces these events in mixpanel:
[
{
event: "attack",
properties: {
distinct_id: "ak",
time: 1234567890,
$insert_id: "abc-123",
theme: "dark",
class: "cleric",
}
},
{
event: "defend",
properties: {
distinct_id: "alice",
time: 1234577891,
$insert_id: "xyz-345",
theme: "light",
class: "bard",
}
},
//etc...
];
for more info on mixpanel's data structure, see this deep-dive
mixpanel
an object {} containing authentication details used to connect to your mixpanel project.
{
project_id: '', //your mixpanel project id
type: 'event', //type of record to import
region: 'US', //or EU
//one of service details OR api secret is required
service_account: '', //service account user name
service_secret: '', //service account secret
api_secret: '', //project api secret [deprecated]
//required for profiles
token: '', //mixpanel project token
//required for groups
groupKey: '', //the group key for this group
//required for lookup tables
lookupTableId: '' //the lookup table to replace
}
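as a concrete (hypothetical) example, an event import using a service account only needs the first group of fields, while a user-profile import also needs the project token:

// events, authenticated with a service account
{
	project_id: "12345",
	type: "event",
	region: "US",
	service_account: "mySvcAcct",
	service_secret: "mySvcSecret"
}

// user profiles: same auth, plus the project token
{
	project_id: "12345",
	type: "user",
	region: "US",
	service_account: "mySvcAcct",
	service_secret: "mySvcSecret",
	token: "myProjectToken"
}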
note: you can find most of these values in your mixpanel project's settings
options
an object {} containing various options for the job.
{
logFile: 'myLog.txt', // local path to write log files to
verbose: true, // display verbose console output
strict: false, // use strict mode when sending data to mixpanel
compress: false, // gzip data before egress
workers: 20 // number of concurrent workers to make requests to mixpanel
}
the workers option is important because it governs concurrency, which can greatly affect throughput. best results are observed with between 10 and 20 workers.
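for example, a reasonable starting point for a large event import might look like this (values are illustrative, not prescriptive):

{
	logFile: 'nightly-sync.log',
	verbose: true,
	strict: true,     // remember: strict mode requires insert_id_col in your mappings
	compress: true,
	workers: 15       // inside the 10-20 range where best results are observed
}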
tags
an object {} containing arbitrary key:value string pairs that will be used to tag the data. this is particularly useful if this module is being used as part of an automated pipeline, and you wish to tag the data with runIds or some other reference value.
{
mixpanel: {
type: "event"
},
tags: {
"foo": "bar"
// every event in mixpanel will have a {foo: 'bar'} prop
}
}
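given the event example from the mappings section above, each imported event would then carry the tag as an ordinary property:

{
	event: "attack",
	properties: {
		distinct_id: "ak",
		time: 1234567890,
		$insert_id: "abc-123",
		theme: "dark",
		class: "cleric",
		foo: "bar"      // added by the tags option
	}
}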
this works on all record types:
{
mixpanel: {
type: "user"
},
tags: {
"baz": "qux"
// every user profile updated will have a {baz: 'qux'} prop
}
}
warehouse details
the data warehouse connectors used by this module are implemented as middleware, and therefore they have different authentication strategies and dependencies.
in most cases, dwh-mixpanel wraps the vendor SDKs of each warehouse with its own API, so when passing auth params in your configuration, you can use any values that are supported by your warehouse, provided those credentials have the appropriate permissions.
below are details for the most commonly used authentication strategies in each supported warehouse, along with the permissions required to run a successful job. if you need an auth method or strategy that is not supported for your warehouse, please file an issue.
bigquery
most bigquery jobs will be authenticated with GCP service accounts.
the service account will need the following permissions in bigquery AND on the specific dataset being queried:
bigquery.jobs.create
bigquery.jobs.get
bigquery.datasets.get
bigquery.tables.get
bigquery.tables.getData
in my experience, the data viewer + bigquery job user roles, set together, satisfy these cases; if a required permission is missing, the output will tell you what it is.
the typical fields used for auth are project_id, private_key, and client_email; location will need to be added manually based on the region of your bigquery instance:
{
dwh: "bigquery",
auth : {
"project_id": "my-gcp-project", //GCP project
"client_email": "serviceAccount@email.com", //service acct email
"private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n", // service account private key
"location": "US" //bigquery location.. this is required!
}
}
in most cases, you can drop your exported JSON keys into the auth param, and it will work.
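if you're calling the module programmatically, one way to do that is to read the exported key file and spread it into auth, adding location yourself; this is a sketch, and the key file path is hypothetical:

import fs from "node:fs";
import dwhMp from "dwh-mixpanel";

// hypothetical path to the service account key exported from GCP
const keyFile = JSON.parse(fs.readFileSync("./my-service-account.json", "utf-8"));

const config = {
	dwh: "bigquery",
	auth: {
		...keyFile,      // contains project_id, client_email, private_key, etc.
		location: "US"   // remember: location is required
	},
	sql: "SELECT * FROM FOO",
	//...mappings, mixpanel, options
};

const summary = await dwhMp(config);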
if you are running dwh-mixpanel from your local computer, and you do not have IAM access in GCP to create service accounts, but you do have user-level access to the datasets in BigQuery, you can use Application Default Credentials (ADC), which leverages your GCP account to authenticate with bigquery's APIs.
the general steps here are:
1. install the gcloud CLI and initialize it: gcloud init
2. create application default credentials for your user account: gcloud auth application-default login
3. set the auth param in your configuration file to an empty object {}:
{
	"dwh": "bigquery",
	"auth": {},
	"sql": "SELECT * FROM FOO"
}
4. run dwh-mixpanel with your updated configuration file: npx dwh-mixpanel ./bigquery-mixpanel
snowflake
snowflake jobs will authenticate with a username + password.
the fields used for auth are the account identifier, username, and password; you will also need to provide your warehouse name, database name, and table schema. most of these values can be found in the UI or the SQL console.
{
dwh: "snowflake",
auth : {
"account": "foobar.us-central1.gcp", // your snowflake identifier
"username": "",
"password": "",
"database": "PROD1", // database to use
"schema": "PUBLIC", // schema to use
"warehouse": "COMPUTE_WH" //warehouse to use
}
}
no special permissions are required for snowflake - only that the user/pass you entered can view and query the dataset.
note: 2FA with Snowflake is not currently supported in this module.
athena
to query athena from this module, your user account (or service account) will need permission to start query executions and read their results in athena.
since athena depends on S3, your account will also need access to read, write, and delete objects in the S3 bucket athena uses for query results.
the S3 bucket you assign permissions to should be the same one athena uses for storage; you can see this in the athena UI's settings.
note: all athena query results are stored as CSV files in S3; this module uses the DeleteObject action to delete the materialized CSV after the data is imported into mixpanel.
most AWS accounts can be set up for programmatic access using an accessKeyId and a secretAccessKey; you'll also need to add the region of your S3 instance.
{
dwh: "athena",
auth : {
"accessKeyId": "",
"secretAccessKey": "",
"region": "us-east-2" //note this is important!
}
}
environment variables
if you would prefer to store your authentication details as environment variables or in an .env file, you may do so. this module will find those values provided they are correctly named.
here is a sample of how environment variables can be used for mixpanel:
MP_SERVICE_ACCOUNT=myServiceAcct
MP_SERVICE_SECRET=myServiceSecret
MP_API_SECRET=myAPISecret
MP_TOKEN=myToken
MP_LOOKUP_TABLE=myLookupTableId
for the warehouse, use the key DWH_AUTH, and for the value use the stringified JSON that you would pass to auth in the configuration file:
DWH_AUTH='{
"project_id": "ak-internal-tool-1613096051700",
"private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
"client_email": "mySerAcct@iam.google.com",
"location": "US"
}'
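with those variables set, a configuration file can omit credentials entirely; here's a hypothetical bigquery config that leans on DWH_AUTH and the MP_* variables (assuming, per the above, the module picks them up at runtime):

{
	"dwh": "bigquery",
	"auth": {},
	"sql": "SELECT * FROM FOO",
	"mappings": {
		"event_name_col": "event",
		"distinct_id_col": "user_id",
		"time_col": "event_time"
	},
	"mixpanel": {
		"project_id": "12345",
		"type": "event",
		"region": "US"
	}
}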
that's it for now. have fun!