Go4Data
Automate all things
About Go4Data
Go4Data is a data processing tool.
The idea behind Go4Data is that you should be able to create automated concurrent data processing flows.
Go4Data is still under heavy development!
There are a few components that one would need to know more about to develop with Go4Data. But a regular user should be able to use Go4Data without too much knowledge about different indepth knowledge.
To learn more about components and what they do, view Components
Go4Data is built around Processors that is a component used by Go4Data to handle the data pipeline.
It is the processor that handels starting/stopping and making things work.
The idea in Go4Data is too try to make it as seamless as possible.
All Processors has to have an Handler assigned before it can start processing any data. It is the handler that contains processing capabilities.
The goal of handlers is to make them as seamless as possible.
An example of how Go4Data is intended to work with its pubsub system and handlers doing processing seamless .
Some people like UML, so I've used Dumels, great job those who made it.
Here you can see a UML of the project. Dumels
Installation
go get github.com/percybolmer/go4data
Usage
There are currently 3 different ways of using Go4Data.
You can either load processors from a yaml file or you can initialize them by hand. Loading from yaml is the recommended way to avoid alot of coding.
- Use the Tooling, there is a Go4Data runner that loads a yaml. This is the most easy to use way, but offers limited flexibility.
- Use the package in custom codebase. You can generate processors and apply handlers to them and use those to do things. Forexample if your intressted in monitoring a directory for new files and read the contents you could do that and subscribe to the output.
- Use the loader to load a yaml file in your code and run them.
See examples folder for examples.
Csv files to Elasticsearch with Filtering
Creating processor and subscribing to output
Using Redis as the Pub/Sub engine
Components in Go4Data
Below is a more indepth explaination of all the components that are found in Go4Data.
Processors
Processor is the default component that is used. It is used to make a standarized way of handeling the dataflow, error handeling and metrics.
A processor consists of the following fields
ID - which is a unique ID that each processor should have. This is done automagically when running NewProcessor
Name - This is a name of the processor, this does not have to be unique, its usage is mainly for the upcomming UI.
FailureHandler - is the assigned way of handeling errors that occur during processing. See FailureHandler.
Handler - is the processing action to apply, this determines what the processor should be doing. See Handler for more information, and see HandlerList.
Subscriptions - is all the topics to listen for data on.
Topics - is where to send data after processing it.
QueueSize - is how many payloads are allowed to be on queue in the Processor. This is to limit and avoid memory burning if a topic isnt drained.
Metric - is stored by both the Handler and Processor. The handler will inherit the Processors set metric. The default metric is Prometheus. But this can be changed by the user by setting a new metricProvider.
Workers - is how many concurrent workers the handler is allowed to run. Modify this only if you want to increase the amount of goroutines your handler should run. This can be increased to make certain handlers work faster, but remember that it can also slow things down if you set too many.
Handler
Handler is the data processing unit that will actually do any work.
Any struct that fulfills the handler interface is allowed to be used by a Processor.
Handler is an golang interface that looks like this
type Handler interface {
Handle(ctx context.Context, payload payload.Payload, topics ...string) error
ValidateConfiguration() (bool, []string)
GetConfiguration() *property.Configuration
GetHandlerName() string
Subscriptionless() bool
SetMetricProvider(p metric.Provider, prefix string) error
GetErrorChannel() chan error
}
Users can write their own Handlers if they want to add functionality.
The easiest way to start writing a handler is to take a look at handlergenerator
Payload
Payload is the items that are sent inside the data pipeline.
Items that are transferred between Processors are called Payloads. It is also interface based, so it is highly customizable and easy to create new payloads.
Too take a look at the currently available payloads see payload
Payload is a interface that looks like
type Payload interface {
GetPayloadLength() float64
GetPayload() []byte
SetPayload([]byte)
GetSource() string
SetSource(string)
GetMetaData() *property.Configuration
}
It is possible for Payloads to run through the Filter. This is a handler that will remove payloads that does not fulfill any filter requirements. Filters are regexpes that can be run on the payload.
If a payload is gonna be passed through the Filter, they needed to be part of the Filterable interface.
type Filterable interface {
ApplyFilter(f *Filter) bool
}
Properties
Properties are configurations that are applied to Handlers.
This is a way of configuring Handlers in a standard way. The Property is set by the Handlers.
A property can be a Required property, which means that a Handler will not start if this property does not contain a correct value. And a property can ofcourse be a nonrequired property, which is an optional configuration.
It is up to the Handler to make sure that all properties are accounted for, and this is done in ValidateConfiguartion for each handler.
Inside the property package there is also a struct called Configuration.
Configuration is used to easier handle Properties inside a Handler.
Metrics
Each Processor has a MetricProvider set that is inherited by the Handlers.
The goal of a metricprovider is to enable Handlers and Processors to publish metrics about their processing.
The default is Prometheus metrics, unless changed.
Pubsub
Payloads are transported between Handlers by using a Publish/Subscription model.
The main idea is that when processing is done, a payload is published onto a Topic, or Topics. The topics that will be published to is assigned when initializing the Processor with NewProcessor.
For another Processor to receive the published payloads, they have to Subscribe on the topics.
Currently there are two supported Pub/Sub engines that Go4Data can use.
It has a DefaultEngine that is set by default and no configuration is needed.
There is also a RedisEngine that allows the user to instead use Redis.
DefaultEngine - Used by default, works great for single node data flows.
RedisEngine - Can be configured to be used, works best if you have multiple Go4Data nodes that all should Pub/Sub on the same Topics.
Failures
So once in a while, a Processor or Handler may experience errors. This is ofcourse something that wants to be noticed.
A wrapper around the regular error is used in Go4Data, to add some context and posibility to recreate errors.
type Failure struct {
Err error `json:"error"`
Payload payload.Payload `json:"payload"`
Processor uint `json:"processor"`
}
Failures are handled by the Proccessors assigned FailureHandler.
A failurehandler is a simple function that can easily be changed by the user.
The default failurehandler is PrintFailure which will output the Payload into stdout.
The failurehandler looks like
FailureHandler func(f Failure)
Loader
The loader is used to load go4data yaml configurations into ready-to-use processors. It can also be used to Save configured processors.
The usage is fairly easy.
Example of loading a yml and then saving it again
loadedProcessors, err := go4data.Load("testing/loader/loadthis.yml")
if err != nil {
t.Fatal(err)
}
go4data.Save("testing/loader/loadthis.yml", loadedProcessorss)
Tooling
Running a Go4Data yaml
If only interessted in using go4data as a CLI tool then use runner.
After you have downloaded go4data inside that folder and run
go build -o runner
./runner -go4data /path/to/go4data.yml -port 2112
The port is where to host Prometheus metrics, currently runner only has support for prometheus.
Building a new Handler
To build a handler one should look at Handler to learn what a Handler is. Any struct that fullfills the Handler interface can be assigned to a Processor.
To help in building new handlers there is a tooling that will generate a fresh handler for you, the tool can be found here.
This is a code generator that can be used to build a template Handler for you.
Compile the code generator by going into the tooling folder after downloading the source.
run
go build -o handlergenerator
./handlergenerator -package $YOURHANDLERPACKAGE -location $HANDLERPACKAGEPATH -templatepath $PATHTOHANDLERTEMPLATE -handler $HANDLERNAME
You should now see a new Handler that is generated and be able to use it.
Offcourse, you still have to do some coding, The generated handler will only print stdout. View other handlers to see how they are setup.
Example when Creating the pcap reader I ran
handlergenerator -package network -location handlers/network -handler OpenPcap
The tooling will use HANDLERGENERATORPATH environment variable to know where the templates are if not specified. The templates can be found Template Location
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
License
MIT