DAWGIE
Data and Algorithm Work-flow Generation, Introspection, and Execution
The DAWGIE software accomplishes the following:
- Data anonymity is required by the framework because the DAWGIE implementation is independent of the Algorithm Engine element of the framework. The framework allows DAWGIE to identify the author of the data through its name and version even though DAWGIE has no knowledge of the data itself. Because the language of the framework and its implementations is Python, DAWGIE places one further requirement on the data: it must be "pickle-able". This additional requirement was accepted by the framework and pushed to the Algorithm Engine element. DAWGIE uses a lazy load implementation -- data is loaded only when an Algorithm Engine sub-element requests it from a specific author -- which removes any routing that would require knowledge of the data itself (see the first sketch after this list).
- Data persistence has two independent implementations. For small and simple data author relationships, the Python shelve module is used for persistence. As the number of unique authors grows, the relations grow faster (non-linearly) and require a more sophisticated persistence tool; in that case, PostgreSQL is used. Both implementations share the same design, making them quickly interchangeable. Neither actually stores the data blobs themselves. They -- shelve and PostgreSQL -- store only the author's unique ID -- name and version -- and then reference the blob on disk. Using the lookup properties of shelve (Python dictionaries) and PostgreSQL (relational DB), any request for data from a specific author can quickly be resolved. Unique names for all of the data blobs are created using {md5 hash}_{sha1 hash}. These names also allow the persistence layer to recognize when the data is already known (see the first sketch after this list).
- For pipeline management, DAWGIE implements a worker farm, a work-flow Abstract Syntax Tree (AST), and signaling from data persistence to execute the tasks within the framework's Algorithm Engine element. The data persistence implementation signals the pipeline management element when new data has been authored. The manager searches the AST for all tasks that depend on the author and schedules them starting with the earliest dependent. New data signals can be generated at the end of any task. When a task moves from being scheduled to executing, the foreman of the worker farm passes the task to a waiting worker on a compute node. The worker loads the task's data via data persistence and begins executing the task. Upon completion, the worker saves its results via data persistence and notifies the foreman that it is ready for another task. In this fashion, DAWGIE walks the minimal, complete set of nodes in the AST that depend on any new data that has been generated (see the second sketch after this list). The pipeline management also offers periodic tasks that treat a temporal event as new data.
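The pickle requirement and the blob naming scheme go hand in hand: because every piece of authored data can be reduced to pickled bytes, those bytes can also be hashed to produce the {md5 hash}_{sha1 hash} blob name. A minimal sketch, assuming a hypothetical state-vector dictionary; the helper name and payload are illustrative, not the DAWGIE API:

```python
import hashlib
import pickle

def blob_name(value):
    """Pickle a value and derive its blob name as {md5 hash}_{sha1 hash}."""
    blob = pickle.dumps(value)  # the value must be "pickle-able"
    return f'{hashlib.md5(blob).hexdigest()}_{hashlib.sha1(blob).hexdigest()}', blob

# hypothetical state vector produced by an AE task
state_vector = {'author': ('foo.bar.ae.relations', '1.0'), 'payload': [1, 2, 3]}
name, blob = blob_name(state_vector)
# identical content always hashes to the same name, so the persistence layer
# can recognize data it already holds without inspecting the blob itself
print(name)
```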
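The second sketch illustrates only the scheduling idea: a breadth-first walk over an assumed dependency graph, starting from the author of the new data, which visits the minimal yet complete set of dependents. The task names and graph are invented for this example and are not DAWGIE's AST:

```python
from collections import deque

# hypothetical work-flow edges: author -> tasks that depend on it
DEPENDENTS = {
    'load': ['calibrate'],
    'calibrate': ['extract', 'monitor'],
    'extract': ['report'],
    'monitor': [],
    'report': [],
}

def schedule(new_author):
    """Return every task that must run because new_author produced new data,
    earliest dependents first (breadth-first over the dependency graph)."""
    ordered, seen, queue = [], set(), deque(DEPENDENTS.get(new_author, []))
    while queue:
        task = queue.popleft()
        if task in seen:
            continue
        seen.add(task)
        ordered.append(task)
        queue.extend(DEPENDENTS.get(task, []))
    return ordered

# new data from 'calibrate' schedules only its dependents, not the whole graph
print(schedule('calibrate'))  # ['extract', 'monitor', 'report']
```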
Organization
DAWGIE is configured and controlled by ENV variables and command line arguments. The command line arguments override the ENV variables. While this section covers nearly all of the command line options, please use the --help switch to see all of the available options.
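A minimal sketch of that precedence, assuming an argparse-style option whose default comes from the ENV variable; the fallback value 8080 is an assumption, not a documented default:

```python
import argparse
import os

# command line value wins; otherwise the ENV variable; otherwise the assumed fallback
parser = argparse.ArgumentParser()
parser.add_argument('--port', type=int,
                    default=int(os.environ.get('DAWGIE_FE_PORT', 8080)))
print(parser.parse_args().port)
```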
Access to Running DAWGIE
- DAWGIE_DB_PORT --context-db-port
- The database access port. See the specific database implementation being used for the detailed definition of this parameter.
- DAWGIE_FARM_PORT --context-farm-port
- The access port that workers in the farm will use to communicate with the DAWGIE server.
- DAWGIE_FE_PORT --port
- The web display port. All subsequent ports are computed from this one.
- DAWGIE_LOG_PORT --context-log-port
- Port number for distributed workers to log messages through the DAWGIE server.
Algorithm Engine (AE)
- DAWGIE_AE_BASE_PATH --context-ae-dir
- The complete path to the AE source code. It is the directory where DAWGIE starts walking down, checking all of the sub-directories and including those packages that implement the necessary factories to be identified as AE packages.
- DAWGIE_AE_BASE_PACKAGE --context-ae-pkg
- Because the AE code may be intermixed with non-AE code and, therefore, may be a subset of the code base, DAWGIE needs to know the package prefix.
** Example **
If all of the Python code starts in foo and the AE code starts in foo/bar/ae, then DAWGIE_AE_BASE_PATH should be 'foo/bar/ae' and DAWGIE_AE_BASE_PACKAGE should be 'foo.bar.ae'.
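A minimal sketch of how the two settings could work together to discover AE packages, walking the base path and importing with the base package prefix; the factory name 'task' and the walking logic are assumptions for illustration, not the DAWGIE implementation:

```python
import importlib
import os

def find_ae_packages(base_path, base_package, factory='task'):
    """Walk base_path and yield package names, rooted at base_package,
    that expose the named factory (assumed here to be 'task')."""
    for current, _dirs, files in os.walk(base_path):
        if '__init__.py' not in files:
            continue  # not a Python package
        relative = os.path.relpath(current, base_path)
        name = base_package if relative == '.' else \
            '.'.join([base_package] + relative.split(os.sep))
        module = importlib.import_module(name)
        if hasattr(module, factory):
            yield name

# with the example above: DAWGIE_AE_BASE_PATH='foo/bar/ae', DAWGIE_AE_BASE_PACKAGE='foo.bar.ae'
for pkg in find_ae_packages(os.environ.get('DAWGIE_AE_BASE_PATH', 'foo/bar/ae'),
                            os.environ.get('DAWGIE_AE_BASE_PACKAGE', 'foo.bar.ae')):
    print(pkg)
```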
Data
- DAWGIE_DATA_DBSTOR --context-data-dbs
- The location for DAWGIE to store the data generated by the AE, known as StateVectors. This area should be vast enough to hold all of the data generated by the AE over all time.
- DAWGIE_DATA_LOGDIR --context-data-log
- The location for DAWGIE to write its log files.
- DAWGIE_DATA_STAGED --context-data-stg
- The location for DAWGIE to store temporary data for the AE. It should be sized to fit the expected AE use, and DAWGIE will clean out the staging area. However, when there are hiccups in the
Database
DAWGIE supports two styles of databases: Python shelve for tiny applications and PostgreSQL for a much larger and scalable system.
Postgresql
- DAWGIE_DB_HOST --context-db-host
- The IP hostname of the POSTGRESQL server.
- DAWGIE_DB_IMPL --context-db-impl
- Must be 'post'
- DAWGIE_DB_NAME --context-db-name
- The name of the database to use.
- DAWGIE_DB_PATH --context-db-path
- The username:password for the database named with DAWGIE_DB_NAME.
- DAWGIE_DB_PORT --context-db-port
- The IP port number of the POSTGRESQL server. When DAWGIE_DB_IMPL is 'post', this value defaults to 5432 because POSTGRESQL is independent of DAWGIE.
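For illustration only, a PostgreSQL-backed deployment might set these variables before launching DAWGIE. All values below are placeholders; Python is used here simply to keep the examples in one language, and the same settings could be exported from the shell:

```python
import os

# illustrative placeholder values for a PostgreSQL-backed deployment
os.environ.update({
    'DAWGIE_DB_IMPL': 'post',
    'DAWGIE_DB_HOST': 'db.example.org',      # assumed PostgreSQL server hostname
    'DAWGIE_DB_PORT': '5432',                # default PostgreSQL port
    'DAWGIE_DB_NAME': 'dawgie',              # assumed database name
    'DAWGIE_DB_PATH': 'username:password',   # credentials for DAWGIE_DB_NAME
})
```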
Shelve
- DAWGIE_DB_HOST --context-db-host
- The IP hostname of the machine running DAWGIE.
- DAWGIE_DB_IMPL --context-db-impl
- Must be 'shelve'
- DAWGIE_DB_NAME --context-db-name
- The name of the database to use.
- DAWGIE_DB_PATH --context-db-path
- The directory path on DAWGIE_DB_HOST to write the shelve files.
- DAWGIE_DB_PORT --context-db-port
- The IP port number of the DAWGIE DB interface. It is automatically computed from the general port number where DAWGIE is being served (see --port).
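Likewise, for illustration only, a shelve-backed deployment might look like the following. The values are placeholders, and DAWGIE_DB_PORT is omitted because it is computed from --port:

```python
import os

# illustrative placeholder values for a shelve-backed deployment
os.environ.update({
    'DAWGIE_DB_IMPL': 'shelve',
    'DAWGIE_DB_HOST': 'localhost',           # the machine running DAWGIE
    'DAWGIE_DB_NAME': 'dawgie',              # assumed database name
    'DAWGIE_DB_PATH': '/var/lib/dawgie/db',  # assumed directory for the shelve files
})
```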
Tools
- DAWGIE_DB_POST2SHELVE_PREFIX
- Used when converting POSTGRESQL to shelve for development of new AE modules.
- DAWGIE_DB_ROTATE_PATH --context-db-rotate-path
- Allows the data to be backed up with every new run ID.
- DAWGIE_DB_COPY_PATH --context-db-copy-path
- Temporary working space for database work.
- DAWGIE_DB_ROTATES --context-db-rotate
- The number of database backups to preserve.
Source Code
The source code is organized by language:
- Bash : utilities for simpler access to the Python implementation
- Python : implementation of DAWGIE
The Python implementation has five key packages:
- dawgie.db : the database interface
- dawgie.de : the display engine that allows user requests to render state vectors to meaningful images
- dawgie.fe : the front-end that we see and interact with
- dawgie.pl : the actual pipeline code that exercises the algorithm engine
- dawgie.tools : a tool box used by the pipeline and administrators (mostly)
Documentation
Fundamental Brochure is a sales brochure used for the SOYA 2018 contest.
Fundamental Developer Overview is a mix of sales and development. It frames the problem and the solution provided. It then proceeds to a high-level description of how to use the tool using Gamma et al.'s Design Patterns. Armed with the patterns being used, a developer should be able to move to the HOW TO slides, connecting the minutiae in those slides with the highest-level view in these.
Fundamental Magic is a manager level explanation of what the pipeline does and how it can help development.
Fundamental How To is a beginner course on working with DAWGIE.
Fundamental Administration is a starter on how to administer DAWGIE.
Installation
python3 Python/setup.py build install
bash install.sh
Use