Littlefork
A modular pipeline for sequential batch processing. We use it for iterative
data retrievals and transformations.
About
Installation
npm install --save littlefork
Usage
The application consists of a plugin runner and a set of plugins. To use
them, install Littlefork and any plugins you need into your Node.js project.
npm init
npm install --save littlefork
This gives you access to the littlefork command:
$(npm bin)/littlefork
By itself, littlefork does not do much. You have to install one or more
plugins to make it do anything.
npm install --save littlefork-plugin-twitter littlefork-plugin-mongodb
Configuration
Littlefork accepts configuration options as environment variables, command
line arguments and from a configuration file.
PROFILE=profile-name PLUGINS=ddg DDG_TERMS=term $(npm bin)/littlefork
is equivalent to
$(npm bin)/littlefork -i profile-name -p ddg --ddg.terms term
is equivalent to
$(npm bin)/littlefork -c pipeline.json
with pipeline.json being a file in JSON format:
{
  "plugins": "ddg",
  "profile": "profile-name",
  "ddg": {
    "terms": "term"
  }
}
The base pipeline must be configured with a profile id and the plugins that
form the pipeline. Both profile and plugins are required configuration
options.
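To illustrate how the three forms can describe the same configuration, here is a minimal sketch that maps environment-style variables onto the nested object shown in pipeline.json. The helper envToConfig is hypothetical and not part of littlefork, and the rule that the first underscore separates the plugin namespace from the option name is an assumption made for this example.

```javascript
// Hypothetical helper, not part of littlefork: turns environment-style
// variables into the nested object used in pipeline.json.
// Assumption: the first underscore separates the plugin namespace from
// the option name (DDG_TERMS -> ddg.terms).
const envToConfig = (env) => {
  const config = {};
  Object.keys(env).forEach((name) => {
    const key = name.toLowerCase();
    const idx = key.indexOf('_');
    if (idx === -1) {
      // Top-level options such as PROFILE and PLUGINS.
      config[key] = env[name];
    } else {
      // Namespaced options such as DDG_TERMS.
      const namespace = key.slice(0, idx);
      const option = key.slice(idx + 1);
      config[namespace] = Object.assign({}, config[namespace], {[option]: env[name]});
    }
  });
  return config;
};

const fromEnv = envToConfig({
  PROFILE: 'profile-name',
  PLUGINS: 'ddg',
  DDG_TERMS: 'term',
});
console.log(JSON.stringify(fromEnv, null, 2));
```

With these inputs the sketch produces an object with the same keys and values as the pipeline.json example.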
Every plugin that is installed can add additional configuration options. Print
the usage help of the command line tool to get a complete list of command line
options:
$(npm bin)/littlefork --help
Plugins
A plugin takes a piece of data and returns a transformed version of this
data. Littlefork starts with a profile and a pipeline configuration and
sequentially uses the output of a plugin as the input for the next
plugin.
A plugin resembles a mathematical function. It maps over profile data to
produce a new version of that profile data. But we are cheating: a plugin in
Littlefork is not pure. It has side effects that are managed using promises.
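The idea of feeding the output of one plugin into the next can be sketched with a promise chain. This is an illustration only, not littlefork's actual runner: runPipeline, fetchTweets and countUnits are hypothetical names, and the envelope shape is assumed to follow the simplified profile data described in the Data format section.

```javascript
// Hypothetical plugin with a side effect wrapped in a promise;
// a real plugin would fetch data from a remote service here.
const fetchTweets = (envelope) =>
  Promise.resolve(Object.assign({}, envelope, {
    data: envelope.data.concat([{_lf_source: 'twitter_tweets', tweet: 'bahh'}]),
  }));

// Hypothetical synchronous plugin; plain return values work in the chain too.
const countUnits = (envelope) =>
  Object.assign({}, envelope, {stats: {count: envelope.data.length}});

// Each plugin receives the resolved output of the previous plugin.
const runPipeline = (plugins, envelope) =>
  plugins.reduce((memo, plugin) => memo.then(plugin), Promise.resolve(envelope));

const result = runPipeline(
  [fetchTweets, countUnits],
  {profile: {profileId: 'some profileId'}, data: [], stats: {}}
);
result.then((envelope) => console.log(envelope.stats));
```

Because every step goes through then, a plugin may return either a plain value or a promise, and a rejected promise stops the remaining steps.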
A search on npm lists all available plugins.
Development
Data format
The envelope of the profile data is a nested object with values of various
types. This is a simplified version of the profile data:
{
  "profile": {
    "name": "Some story name",
    "profileId": "some profileId",
    "twitter_handle": "twetter-id"
  },
  "data": [
    {"_lf_source": "twitter_tweets", "tweet": "bahh"},
    {"_lf_source": "twitter_tweets", "tweet": "bahh"}
  ],
  "stats": {}
}
Every data unit is an atomic piece of data. Its fields depend on the plugin
that fetched it. Transformation plugins can extend the data format with
_lf_ prefixed attributes. The list below is the basic set of entries that
littlefork sets:
- _lf_source (String): The name of the plugin.
- _lf_title (String): The name of the attribute that functions as the title
  attribute for the data unit. This is a pointer to the real title attribute.
- _lf_pubdates (Object): Plugins register the publishing dates of the data
  unit. Different plugins can determine different publishing dates.
- _lf_links (Array): A list of links that were found in the data unit.
- _lf_images (Array): A list of images that were found in the data unit.
- _lf_created (String): The timestamp at which the data unit was created.
- _lf_profile (String): The name of the profile.
- _lf_id_hash (String): A sha1 hash of the data unit identities.
- _lf_content_hash (String): A sha1 hash of the data unit content.
- _lf_meta (Object): Meta information stored by transformations.
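As a rough illustration of a few of these entries, the sketch below annotates a data unit with _lf_profile, _lf_created and the two sha1 hashes using Node's crypto module. The helper annotate is hypothetical. Which fields count as the identity of a unit, how values are serialized before hashing, and the timestamp format are all assumptions for this example; littlefork's own rules may differ.

```javascript
const crypto = require('crypto');

// sha1 over the JSON serialization of a value (serialization is an assumption).
const sha1 = (value) =>
  crypto.createHash('sha1').update(JSON.stringify(value)).digest('hex');

// Hypothetical helper: idFields names the attributes that identify a unit.
const annotate = (unit, profileName, idFields) => {
  const identity = idFields.map((field) => unit[field]);
  return Object.assign({}, unit, {
    _lf_profile: profileName,
    _lf_created: new Date().toISOString(),
    _lf_id_hash: sha1(identity),
    // Hash the unit as fetched, before any annotations are added.
    _lf_content_hash: sha1(unit),
  });
};

const annotated = annotate(
  {_lf_source: 'twitter_tweets', tweet: 'bahh'},
  'profile-name',
  ['tweet']
);
console.log(annotated);
```

Hashing identity and content separately makes it possible to recognize the same unit across runs while still noticing that its content has changed.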
Debugging
There is support for the excellent debug library. Use * to print all debug
messages and to see which debug targets exist.
DEBUG=* $(npm bin)/littlefork -c config.json
Furthermore, littlefork can store the whole data set between transformation
steps in files. If the VERBOSE_LOG environment variable is set to a path,
the data after each transformation step is stored in that location.
VERBOSE_LOG=/tmp $(npm bin)/littlefork -p ddg,mongodb_store
Plugins
Plugins can extend the functionality of littlefork in three ways:
- transformations are functions that take data in the littlefork data format
and return data of the same format again.
- hooks are called before each step of transformation in a pipeline. The
verbose logging mentioned above is implemented using hooks.
- profiles define sources to look up profile data. Each pipeline probably
needs at least one profile source.
A plugin is a simple npm module. It can export either a single function or
an object with functions.
Most plugins are a minor transformation step. If a module is exporting a
single function it must be a transformation plugin.
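A single-function plugin could look like the following sketch. The transformation extractLinks is hypothetical. It assumes that a transformation receives the whole envelope (profile, data, stats), that tweets carry their text in a tweet attribute as in the simplified data example, and that _lf_links holds plain URL strings.

```javascript
// Hypothetical transformation: collect the links found in each tweet.
const extractLinks = (envelope) => {
  const data = envelope.data.map((unit) => {
    const text = typeof unit.tweet === 'string' ? unit.tweet : '';
    // match returns null when nothing is found, so fall back to an empty list.
    const links = text.match(/https?:\/\/\S+/g) || [];
    return Object.assign({}, unit, {_lf_links: links});
  });
  // Return data in the same format, leaving the input untouched.
  return Object.assign({}, envelope, {data});
};

// Exporting a single function marks the module as a transformation plugin.
module.exports = extractLinks;

const transformed = extractLinks({
  profile: {profileId: 'some profileId'},
  data: [{_lf_source: 'twitter_tweets', tweet: 'see https://example.com now'}],
  stats: {},
});
console.log(transformed.data[0]._lf_links);
```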
If a module wants to export more than one transformation step, or it wants
to provide a pipeline hook or a profile source, it must export an object
where each value is a function.
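import _ from 'lodash';

// openRun, closeRun and get are helpers defined elsewhere in the plugin.
export default {
  // Pipeline hook
  hook: (profile) => openRun(profile).disposer(closeRun),
  // Profile source: fails if no profile exists for the given id
  profile: (id) => get(id).then(result => {
    if (_.isNil(result)) {
      throw new Error(`Profile ${id} not found.`);
    }
    return result;
  }),
  // Transformations
  twitter_feed: (data) => ....,
  twitter_timeline: (data) => ....,
};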
hook and profile are special functions. Every other item is treated as a
transformation.
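The rule above can be sketched as a small function that sorts the exports of a plugin module. This is not littlefork's loader: classify is a hypothetical helper, and the name default used for a single exported function is an assumption made for this example.

```javascript
// Hypothetical helper: split a plugin module's exports into the hook, the
// profile source, and the remaining transformations.
const classify = (plugin) => {
  // A module that exports a single function is one transformation.
  if (typeof plugin === 'function') {
    return {hook: null, profile: null, transformations: {default: plugin}};
  }
  const transformations = {};
  Object.keys(plugin).forEach((name) => {
    // hook and profile are special; everything else is a transformation.
    if (name !== 'hook' && name !== 'profile') {
      transformations[name] = plugin[name];
    }
  });
  return {
    hook: plugin.hook || null,
    profile: plugin.profile || null,
    transformations,
  };
};

const classified = classify({
  hook: () => {},
  profile: (id) => Promise.resolve({profileId: id}),
  twitter_feed: (data) => data,
  twitter_timeline: (data) => data,
});
console.log(Object.keys(classified.transformations));
```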
Contributing