Goresilience
Goresilience is a Go toolkit to increase the resilience of applications. At its core it is inspired by hystrix and similar libraries, but at the same time it is very different:
Features
- Increase resilience of the programs.
- Easy to extend and test, with a clean design.
- Go idiomatic.
- Use the decorator pattern (middleware), like Go's http.Handler does.
- Ability to create custom resilience flows, simple, advanced, specific... by combining different runners in chains.
- Safe defaults.
- Not coupled to any framework/library.
- Prometheus/OpenMetrics metrics as first-class citizens.
Motivation
You may be wondering: why another circuit breaker library?
Well, this is not a circuit breaker library. It is true that Go has some good circuit breaker libraries (like sony/gobreaker, afex/hystrix-go or rubyist/circuitbreaker). But there is a lack of a resilience toolkit that is easy to extend and customize and that establishes a design others can build on. That is why goresilience was born.
The aim of goresilience is to provide resilience runners that can be combined or used independently, depending on the nature of the execution logic (complex, simple, performance sensitive, very reliable...).
Another key part of goresilience is extensibility: you can create new runners yourself and use them in combination with the bulkhead, the circuit breaker, or any other runner from this library or from others.
Getting started
The usage of the library is simple. Everything is based on the Runner interface.
The runners can be used in two ways. In standalone mode (one runner):
package main

import (
	"context"
	"log"
	"time"

	"github.com/fairyhunter13/goresilience/timeout"
)

func main() {
	cmd := timeout.New(timeout.Config{
		Timeout: 100 * time.Millisecond,
	})

	for i := 0; i < 200; i++ {
		result := ""
		err := cmd.Run(context.TODO(), func(_ context.Context) error {
			if time.Now().Nanosecond()%2 == 0 {
				time.Sleep(5 * time.Second)
			}
			result = "all ok"
			return nil
		})
		if err != nil {
			result = "not ok, but fallback"
		}
		log.Printf("the result is: %s", result)
	}
}
Or in a chain of multiple runners, built by combining runner middlewares. In this example the execution is retried, subject to a timeout, and concurrency controlled using a runner chain:
package main

import (
	"context"
	"errors"
	"fmt"

	"github.com/fairyhunter13/goresilience"
	"github.com/fairyhunter13/goresilience/bulkhead"
	"github.com/fairyhunter13/goresilience/retry"
	"github.com/fairyhunter13/goresilience/timeout"
)

func main() {
	cmd := goresilience.RunnerChain(
		bulkhead.NewMiddleware(bulkhead.Config{}),
		retry.NewMiddleware(retry.Config{}),
		timeout.NewMiddleware(timeout.Config{}),
	)

	calledCounter := 0
	result := ""
	err := cmd.Run(context.TODO(), func(_ context.Context) error {
		calledCounter++
		if calledCounter%2 == 0 {
			return errors.New("you didn't expect this error")
		}
		result = "all ok"
		return nil
	})
	if err != nil {
		result = "not ok, but fallback"
	}
	fmt.Printf("result: %s", result)
}
As you can see, you can create any combination of resilient execution flows by combining the different runners of the toolkit.
Static Runners
Static runners are the ones that are based on a static configuration and don't change based on the environment (unlike the adaptive ones).
Timeout
This runner is based on the timeout pattern. It executes the goresilience.Func, but if the execution takes longer than a timeout duration T it returns a timeout error.
Check example.
Retry
This runner is based on the retry pattern. It will retry the execution of goresilience.Func up to N times in case it fails.
It uses an exponential backoff with some jitter (for more information check this).
Check example.
Bulkhead
This runner is based on the bulkhead pattern. It controls the concurrency of the goresilience.Func executions that go through the same runner.
It can also time out if a goresilience.Func has been waiting too long in the execution queue.
Check example.
Circuit breaker
This runner is based on the circuit breaker pattern. It stores the results of the executed goresilience.Func in N buckets of T time and changes the state of the circuit based on those measured metrics.
Check example.
Chaos
This runner is based on failure injection of errors and latency. It will inject those failures on the required executions (based on percent or all).
Check example.
Adaptive Runners
Concurrency limit
Concurrency limit is based on the Netflix concurrency-limit library. It tries to implement the same features but for the goresilience library (and compatible with other runners).
It limits the concurrency with less configuration and adapts to the environment it is running in at that moment (hardware, load...).
This Runner will limit the concurrency (like bulkhead) but it will use different TCP congestion algorithms to adapt the concurrency limit based on errors and latency.
The Runner is based on 4 components.
- Limiter: This is the one that measures and calculates the concurrency limit based on different algorithms that can be chosen, for example AIMD.
- Executor: This is the one executing the goresilience.Func itself. It has different queuing implementations that prioritize and drop executions based on the implementation.
- Runner: This is the runner itself that will be used by the user and is the glue between the Limiter and the Executor. It has a policy that treats the execution result as an error, a success, or ignored for the Limiter algorithm.
- Result policy: This is a function that can be configured on the concurrencylimit Runner. It receives the result of the executed function and returns a result for the limit algorithm. This policy is responsible for telling the limit algorithm whether the received error should count as a success, a failure, or be ignored in the calculation of the concurrency limit. For example: count only the errors that are 502 and ignore the other ones.
Check AIMD example.
Check CoDel example.
Executors
- FIFO: This executor is the default one. It executes the queued jobs in first-in-first-out order and also has a queue wait timeout.
- LIFO: This executor executes the queued jobs in last-in-first-out order and also has a queue wait timeout.
- AdaptiveLIFOCodel: Implementation of Facebook's CoDel+adaptive LIFO algorithm. This executor is used with the Static limiter.
Limiter
- Static: This limiter sets a constant limit that will not change.
- AIMD: This limiter is based on the AIMD TCP congestion algorithm. It increases the limit at a constant rate and, when congestion occurs (by timeout or result failure), it decreases the limit by a configured factor.
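The additive-increase/multiplicative-decrease rule behind the AIMD limiter can be sketched as follows. This is only the arithmetic of the algorithm under assumed parameters (an increase of 1 per success, a bounded range, and a backoff factor). The type and field names are hypothetical and are not the limiter's real configuration.

```go
package main

import "fmt"

// aimd holds a concurrency limit that grows additively and shrinks
// multiplicatively. It is not safe for concurrent use.
type aimd struct {
	limit, min, max int
	backoff         float64 // multiplicative decrease factor, e.g. 0.5
}

// onSuccess increases the limit by one, up to max.
func (a *aimd) onSuccess() {
	if a.limit < a.max {
		a.limit++
	}
}

// onFailure multiplies the limit by the backoff factor, never going below min.
func (a *aimd) onFailure() {
	next := int(float64(a.limit) * a.backoff)
	if next < a.min {
		next = a.min
	}
	a.limit = next
}

func main() {
	l := &aimd{limit: 10, min: 1, max: 12, backoff: 0.5}
	l.onSuccess()
	l.onSuccess()
	fmt.Println("after two successes:", l.limit)
	l.onFailure()
	fmt.Println("after one failure:", l.limit)
}
```

The asymmetry is the point of the algorithm: the limit probes upward slowly while things go well and backs off quickly when congestion is detected.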
Result policy
- FailureOnExternalErrorPolicy: Treats as a failure every error that is not from the concurrencylimit package.
- NoFailurePolicy: Never returns a failure and just ignores errors when they occur. This can be used to adapt only on RTT/latency.
- FailureOnRejectedPolicy: Treats as a failure every execution that has been rejected with an errors.ErrRejectedExecution error.
Other
Metrics
All the runners can be measured using a metrics.Recorder. Instead of the recorder being passed to every runner, the runners try to get it from the context. So you can wrap any runner using metrics.NewMiddleware and it will activate metrics support on the wrapped runners. This should be the first runner of the chain.
At this moment only Prometheus is supported.
In this example the runners are measured.
Measuring always has a performance hit (not too high). In most cases this is not a problem, but there is a benchmark that shows the numbers:
BenchmarkMeasuredRunner/Without_measurement_(Dummy).-4 300000 6580 ns/op 677 B/op 12 allocs/op
BenchmarkMeasuredRunner/With_prometheus_measurement.-4 200000 12901 ns/op 752 B/op 15 allocs/op
Hystrix-like
Using the different runners, a hystrix-like library flow can be obtained. You can see a simple example of how it can be done in this example.
http middleware
Creating HTTP middlewares with goresilience runners is simple and clean. You can see an example of how it can be done in this example. The example shows how you can protect the server by load shedding using an adaptive concurrencylimit goresilience.Runner.
Architecture
At its core, goresilience is based on a very simple idea: the Runner interface. The Runner interface is the unit of execution. It accepts a context.Context and a goresilience.Func, and returns an error.
The idea of the Runner is the same as Go's http.Handler: with an interface you can create chains of runners, also known as middlewares (also called the decorator pattern).
The library comes with decorators called Middleware. A Middleware is a function that wraps a runner with another runner. This gives us the ability to create a resilient execution flow by wrapping any runner with the pieces that we want, including custom ones that are not in this library.
This way we could create execution flow like this example:
Circuit breaker
└── Timeout
└── Retry
Extend using your own runners
To create your own runner, you need to keep 2 things in mind.
- Implement the goresilience.Runner interface.
- Provide constructors that return a goresilience.Middleware, so that your Runner can be chained with other Runners.
In this example (full example here) we create a new resilience runner for chaos engineering that fails at a constant rate set by the Config.FailEveryTimes setting.
Following the library convention, New returns the standalone Runner (the one that is not chainable), and NewMiddleware returns a Middleware that can be used with goresilience.RunnerChain to chain with other Runners.
Note: We can pass nil in New because NewMiddleware uses goresilience.SanitizeRunner, which returns a valid Runner as the last part of the chain when the given one is nil (for more information about this check goresilience.command).
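// Config is the configuration of the failer runner.
type Config struct {
	// FailEveryTimes makes one execution out of every FailEveryTimes fail.
	FailEveryTimes int
}

// New returns a standalone (not chainable) failer Runner.
func New(cfg Config) goresilience.Runner {
	return NewMiddleware(cfg)(nil)
}

// NewMiddleware returns a failer Middleware that can be chained with other runners.
func NewMiddleware(cfg Config) goresilience.Middleware {
	return func(next goresilience.Runner) goresilience.Runner {
		// A nil next runner is replaced by a valid one that ends the chain.
		next = goresilience.SanitizeRunner(next)
		calledTimes := 0
		// Use the RunnerFunc helper so we don't need to create a new type.
		return goresilience.RunnerFunc(func(ctx context.Context, f goresilience.Func) error {
			// Not safe for concurrent use: a real runner should guard this counter.
			calledTimes++
			if calledTimes == cfg.FailEveryTimes {
				calledTimes = 0
				// Report the configured period (the counter was just reset).
				return fmt.Errorf("failed due to %d call", cfg.FailEveryTimes)
			}
			return next.Run(ctx, f)
		})
	}
}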