Wukong Deploy Pack
The Infochimps Platform is an end-to-end, managed solution for building Big Data applications. It integrates best-of-breed technologies like Hadoop, Storm, Kafka, MongoDB, ElasticSearch, HBase, &c. and provides simple interfaces for accessing these powerful tools.
Computation, analytics, scripting, &c. are all handled by Wukong within the platform. Wukong is an abstract framework for defining computations on data. Wukong processors and flows can run in many different execution contexts, including:
- locally on the command-line for testing or development purposes
- as a Hadoop mapper or reducer for batch analytics or ETL
- within Storm as part of a real-time data flow
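The idea that a processor is independent of where it runs can be sketched in plain Ruby. This sketch does not use the Wukong API: the Tokenizer class and the run_locally helper are hypothetical names chosen for illustration, and it assumes only that a processor receives one record at a time and yields zero or more output records.

```ruby
require 'stringio'

# Hypothetical processor (NOT the Wukong API, just an illustration).
# It receives one record and yields zero or more output records.
class Tokenizer
  def process(line)
    line.split.each { |token| yield token.downcase }
  end
end

# Hypothetical local runner: feeds each input line to the processor
# and writes every yielded record to the output stream.
def run_locally(processor, input, output)
  input.each_line do |line|
    processor.process(line.chomp) { |record| output.puts(record) }
  end
end

out = StringIO.new
run_locally(Tokenizer.new, StringIO.new("Hello World\nfoo\n"), out)
```

Because the processor only knows about records and a block to yield to, a different runner (for example, one reading from STDIN) could drive the same process method without any change to the processor itself.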
The Infochimps Platform uses the concept of a deploy pack: a single place where developers build all their processors, flows, and jobs. The deploy pack can be thought of as a container for all the Wukong code and plugins needed by an Infochimps Platform application. It includes the following libraries:
- wukong-hadoop: Run Wukong processors as mappers and reducers within the Hadoop framework. Model Hadoop jobs locally before you run them.
- wukong-storm: Run Wukong processors within the Storm framework. Model flows locally before you run them.
- wukong-load: Load the output data from your local Wukong jobs and flows into a variety of different data stores.
- wonderdog: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.
Installation
The deploy pack is installed as a RubyGem:
$ sudo gem install wukong-deploy
Usage
Wukong-Deploy provides a command-line tool, wu-deploy, which can be used to create or interact with deploy packs.
Creating a New Deploy Pack
Create a new deploy pack:
$ wu-deploy new my_app
Within /home/user/my_app:
create .
create app/models
create app/processors
...
This creates a directory named my_app in the current directory.
Passing the --dry_run option prints what would happen without actually doing anything:
$ wu-deploy new my_app --dry_run
Within /home/user/my_app:
create .
create app/models
create app/processors
...
You'll be prompted if there is a conflict. You can pass the force option to always overwrite files and the skip option to never overwrite files.
Working with an Existing Deploy Pack
If your current directory is within an existing deploy pack, you can
start up an IRB console with the deploy pack's environment already
loaded:
$ wu-deploy console
irb(main):001:0>
File Structure
A deploy pack is a repository with the following
Rails-like file structure:
├── app
│   ├── models
│   ├── processors
│   ├── flows
│   └── jobs
├── config
│   ├── environment.rb
│   ├── application.rb
│   ├── initializers
│   ├── settings.yml
│   └── environments
│       ├── development.yml
│       ├── production.yml
│       └── test.yml
├── data
├── Gemfile
├── Gemfile.lock
├── lib
├── log
├── Rakefile
├── spec
│   ├── spec_helper.rb
│   └── support
└── tmp
Let's look at it piece by piece:
- app: The directory with all the action. It's where you define:
  - models: Your domain models or "nouns", which define and wrap the different kinds of data elements in your application. They are built using whatever framework you like (defaults to Gorillib).
  - processors: Your fundamental operations or "verbs", which are passed records and parse, filter, augment, normalize, or split them.
  - flows: Chain together processors into streaming flows for ingestion, real-time processing, or complex event processing (CEP).
  - jobs: Pair processors together to create batch jobs to run in Hadoop.
- config: Where you place all application configuration for all environments.
  - environment.rb: Defines the runtime environment for all code, requiring and configuring all Wukong framework code. You shouldn't have to edit this file directly.
  - application.rb: Requires and configures libraries specific to your application. Choose a model framework and pick what application code gets loaded by default (vs. auto-loaded).
  - initializers: Holds any files you need to load before application.rb. Useful for requiring and configuring external libraries.
  - settings.yml: Defines application-wide settings.
  - environments: Defines environment-specific settings in YAML files named after the environment. Overrides config/settings.yml.
- data: Holds sample data in flat files. You'll develop and test your application using this data.
- Gemfile and Gemfile.lock: Define how libraries are resolved with Bundler.
- lib: Holds any code you want to use in your application but that isn't "part of" your application (like vendored libraries, Rake tasks, &c.).
- log: A good place to stash logs.
- Rakefile: Defines Rake tasks for developing, testing, and deploying your application.
- spec: Holds all your RSpec unit tests.
  - spec_helper.rb: Loads libraries you'll use during testing and includes spec helper libraries from Wukong.
  - support: Holds support code for your tests.
- tmp: A good place to stash temporary files.
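The relationship between config/settings.yml and the files under config/environments (environment-specific values override application-wide ones) can be illustrated with a small plain-Ruby sketch. This is not the deploy pack's actual loading code: the deep_merge helper is hypothetical, the setting names (log_level, elasticsearch, host, port) are invented for the example, and whether the real override is a deep or a shallow merge is an assumption.

```ruby
require 'yaml'

# Hypothetical helper: recursively merges override into base, so nested
# keys absent from the override keep their base values.
def deep_merge(base, override)
  base.merge(override) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# Stand-in for config/settings.yml (invented keys).
settings = YAML.safe_load(<<~SETTINGS)
  log_level: info
  elasticsearch:
    host: localhost
    port: 9200
SETTINGS

# Stand-in for config/environments/production.yml (invented keys).
production = YAML.safe_load(<<~PRODUCTION)
  elasticsearch:
    host: es.example.com
PRODUCTION

# Environment-specific values win; everything else falls through.
merged = deep_merge(settings, production)
```

Under these assumptions, merged keeps log_level and the Elasticsearch port from the application-wide settings and takes only the host from the production file.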