Package monkit is a flexible code instrumenting and data collection library. I'm going to try and sell you as fast as I can on this library. Example usage We've got tools that capture distribution information (including quantiles) about int64, float64, and bool types. We have tools that capture data about events (we've got meters for deltas, rates, etc). We have rich tools for capturing information about tasks and functions, and literally anything that can generate a name and a number. Almost just as importantly, the amount of boilerplate and code you have to write to get these features is very minimal. Data that's hard to measure probably won't get measured. This data can be collected and sent to Graphite (http://graphite.wikidot.com/) or any other time-series database. Here's a selection of live stats from one of our storage nodes: This library generates call graphs of your live process for you. These call graphs aren't created through sampling. They're full pictures of all of the interesting functions you've annotated, along with quantile information about their successes, failures, how often they panic, return an error (if so instrumented), how many are currently running, etc. The data can be returned in dot format, in json, in text, and can be about just the functions that are currently executing, or all the functions the monitoring system has ever seen. Here's another example of one of our production nodes: https://raw.githubusercontent.com/spacemonkeygo/monkit/master/images/callgraph2.png This library generates trace graphs of your live process for you directly, without requiring standing up some tracing system such as Zipkin (though you can do that too). Inspired by Google's Dapper (http://research.google.com/pubs/pub36356.html) and Twitter's Zipkin (http://zipkin.io), we have process-internal trace graphs, triggerable by a number of different methods. 
You get this trace information for free whenever you use Go contexts (https://blog.golang.org/context) and function monitoring. The output formats are svg and json. Additionally, the library supports trace observation plugins, and we've written a plugin that sends this data to Zipkin (http://github.com/spacemonkeygo/monkit-zipkin). https://raw.githubusercontent.com/spacemonkeygo/monkit/master/images/trace.png Before our crazy Go rewrite of everything (https://www.spacemonkey.com/blog/posts/go-space-monkey) (and before we had even seen Google's Dapper paper), we were a Python shop, and all of our "interesting" functions were decorated with a helper that collected timing information and sent it to Graphite. When we transliterated to Go, we wanted to preserve that functionality, so the first version of our monitoring package was born. Over time it started to get janky, especially as we found Zipkin and started adding tracing functionality to it. We rewrote all of our Go code to use Google contexts, and then realized we could get call graph information. We decided a refactor and then an all-out rethinking of our monitoring package was best, and so now we have this library. Sometimes you really want callstack contextual information without having to pass arguments through everything on the call stack. In other languages, many people implement this with thread-local storage. Example: let's say you have written a big system that responds to user requests. All of your libraries log using your log library. During initial development everything is easy to debug, since there's low user load, but now you've scaled and there's OVER TEN USERS and it's kind of hard to tell what log lines were caused by what. Wouldn't it be nice to add request ids to all of the log lines kicked off by that request? Then you could grep for all log lines caused by a specific request id. Geez, it would suck to have to pass all contextual debugging information through all of your callsites. 
Google solved this problem by always passing a context.Context interface through from call to call. A Context is basically just a mapping of arbitrary keys to arbitrary values that users can add new values for. This way if you decide to add a request context, you can add it to your Context and then all callsites that descend from that place will have the new data in their contexts.

It is admittedly very verbose to add contexts to every function call. Painfully so. I hope to write more about it in the future, but Google also wrote up their thoughts about it (https://blog.golang.org/context), which you can go read. For now, just swallow your disgust and let's keep moving.

Let's make a super simple Varnish (https://www.varnish-cache.org/) clone. Open up gedit! (Okay just kidding, open whatever text editor you want.)

For this motivating program, we won't even add the caching, though there are comments for where to add it if you'd like. For now, let's just make a barebones system that will proxy HTTP requests. We'll call it VLite, but maybe we should call it VReallyLite.

Build and run this and open localhost:8080 in your browser. If you use the default proxy target, it should inform you that the world hasn't been destroyed yet.

The first thing you'll want to do is add the small amount of boilerplate to make the instrumentation we're going to add to your process observable later. Import the basic monkit packages:

and then register environmental statistics and kick off a goroutine in your main method to serve debug requests:

Rebuild, and then check out localhost:9000/stats (or localhost:9000/stats/json, if you prefer) in your browser!

Remember what I said about Google's contexts (https://blog.golang.org/context)? It might seem a bit overkill for such a small project, but it's time to add them. To help out here, I've created a library that constructs contexts for you for incoming HTTP requests.
Nothing that's about to happen requires my webhelp library (https://godoc.org/github.com/jtolds/webhelp), but here is the code now refactored to receive and pass contexts through our two per-request calls. You can create a new context for a request however you want. One reason to use something like webhelp is that the cancelation feature of Contexts is hooked up to the HTTP request getting canceled. Let's start to get statistics about how many requests we receive! First, this package (main) will need to get a monitoring Scope. Add this global definition right after all your imports, much like you'd create a logger with many logging libraries: Now, make the error return value of HandleHTTP named (so, (err error)), and add this defer line as the very first instruction of HandleHTTP: Let's also add the same line (albeit modified for the lack of error) to Proxy, replacing &err with nil: You should now have something like: We'll unpack what's going on here, but for now: For this new funcs dataset, if you want a graph, you can download a dot graph at localhost:9000/funcs/dot and json information from localhost:9000/funcs/json. You should see something like: with a similar report for the Proxy method, or a graph like: https://raw.githubusercontent.com/spacemonkeygo/monkit/master/images/handlehttp.png This data reports the overall callgraph of execution for known traces, along with how many of each function are currently running, the most running concurrently (the highwater), how many were successful along with quantile timing information, how many errors there were (with quantile timing information if applicable), and how many panics there were. Since the Proxy method isn't capturing a returned err value, and since HandleHTTP always returns nil, this example won't ever have failures. If you're wondering about the success count being higher than you expected, keep in mind your browser probably requested a favicon.ico. Cool, eh? 
How it works

The defer line you just added is an interesting line of code - there are three function calls. If you look at the Go spec, all of the function calls will run at the time the function starts except for the very last one.

The first function call, mon.Task(), creates or looks up a wrapper around a Func. You could get this yourself by requesting mon.Func() inside of the appropriate function or mon.FuncNamed(). Both mon.Task() and mon.Func() are inspecting runtime.Caller to determine the name of the function. Because this is a heavy operation, you can actually store the result of mon.Task() and reuse it if you prefer. So instead of calling mon.Task() inline in the defer statement, you could store its result once in a package-level variable (myFuncMon, say) and call that variable in the defer statement, which is more performant every time after the first time. runtime.Caller only gets called once.

Careful! Don't use the same myFuncMon in different functions unless you want to screw up your statistics!

The second function call starts all the various stop watches and bookkeeping to keep track of the function. It also mutates the context pointer it's given to extend the context with information about what current span (in Zipkin parlance) is active. Notably, you *can* pass nil for the context if you really don't want a context. You just lose callgraph information.

The last function call stops all the stop watches and makes a note of any observed errors or panics (it repanics after observing them).

Turns out, we don't even need to change our program anymore to get rich tracing information!

Open your browser and go to localhost:9000/trace/svg?regex=HandleHTTP. It won't load, and in fact, it's waiting for you to open another tab and refresh localhost:8080 again. Once you retrigger the actual application behavior, the trace regex will capture a trace starting on the first function that matches the supplied regex, and return an svg. Go back to your first tab, and you should see a relatively uninteresting but super promising svg.

Let's make the trace more interesting. Add a time.Sleep call to your HandleHTTP method, rebuild, and restart.
Load localhost:8080, then start a new request to your trace URL, then reload localhost:8080 again. Flip back to your trace, and you should see that the Proxy method only takes a portion of the time of HandleHTTP! https://cdn.rawgit.com/spacemonkeygo/monkit/master/images/trace.svg There's multiple ways to select a trace. You can select by regex using the preselect method (default), which first evaluates the regex on all known functions for sanity checking. Sometimes, however, the function you want to trace may not yet be known to monkit, in which case you'll want to turn preselection off. You may have a bad regex, or you may be in this case if you get the error "Bad Request: regex preselect matches 0 functions." Another way to select a trace is by providing a trace id, which we'll get to next! Make sure to check out what the addition of the time.Sleep call did to the other reports. It's easy to write plugins for monkit! Check out our first one that exports data to Zipkin (http://zipkin.io/)'s Scribe API: https://github.com/spacemonkeygo/monkit-zipkin We plan to have more (for HTrace, OpenTracing, etc, etc), soon!
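To make the evaluation order of the three-call defer line from the "how it works" discussion concrete, here is a small standard-library sketch. The task function below is a hypothetical stand-in that only mimics the shape of such a helper; it is not monkit's implementation, and it records events in a slice instead of collecting timing data.

```go
package main

import "fmt"

// events records the order in which the calls run.
var events []string

// task mimics a three-call instrumentation helper (hypothetical).
func task(name string) func(ctx *string) func(errp *error) {
	// First call: runs when the defer statement executes.
	events = append(events, "lookup "+name)
	return func(ctx *string) func(errp *error) {
		// Second call: also runs when the defer statement executes.
		events = append(events, "start "+name)
		return func(errp *error) {
			// Third call: the only one postponed until the function returns.
			if errp != nil && *errp != nil {
				events = append(events, "finish "+name+" with error")
				return
			}
			events = append(events, "finish "+name)
		}
	}
}

// handle uses a named error result so the deferred call can observe it.
func handle() (err error) {
	ctx := "request"
	defer task("handle")(&ctx)(&err)
	events = append(events, "body")
	return fmt.Errorf("boom")
}

func main() {
	_ = handle()
	for _, e := range events {
		fmt.Println(e)
	}
}
```

Because the deferred call receives a pointer to the named result, it sees the error that the return statement assigned, which is why the error return value has to be named for this pattern to work.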
This code is for loading database data that maps IP addresses to countries, for collecting and presenting statistics on snowflake use that might alert us to censorship events.

The functions here are heavily based on how tor maintains and searches their geoip database.

The tables used for geoip data must be structured as follows:

Recognized line format for IPv4 is:

Note that the IPv4 line format is not currently supported.

Recognized line format for IPv6 is:

It also recognizes, and skips over, blank lines and lines that start with '#' (comments).
Package stats defines a lightweight interface for collecting statistics. It doesn't provide an implementation, just the shared interface.
This package provides statistical functions. Largely inspired by https://github.com/leesper/go_rng/blob/master/gauss.go
Package throttled implements different throttling strategies for controlling access to HTTP handlers.

go get gopkg.in/throttled/throttled.v1/...

The Interval function creates a throttler that allows requests to go through at a controlled, constant interval. The interval may be applied to all requests (vary argument == nil) or independently based on vary-by criteria.

For example:

Creates a throttler that will allow one request every 100ms (10 requests per second), with a buffer of 100 exceeding requests before dropping requests with a status code 429 (by default, configurable using th.DeniedHandler or the package-global DefaultDeniedHandler variable). Different paths will be throttled independently, so that /path_a and /path_b both can serve 10 requests per second. The last argument, 50, indicates the maximum number of keys that the throttler will keep in memory.

The MemStats function creates a throttler that allows requests to go through only if the memory statistics of the current process are below specified thresholds.

For example:

Creates a throttler that will allow requests to go through until the number of garbage collections reaches the initial number + 10 (the MemThresholds function creates absolute memory stats thresholds from offsets). The second argument, 10ms, indicates the refresh rate of the memory stats.

The RateLimit function creates a throttler that allows a certain number of requests in a given time window, as is often implemented in public RESTful APIs.

For example:

Creates a throttler that will limit requests to 30 per minute, based on the remote address of the client, and will store the counter and remaining time of the current window in the provided memory store, limiting the number of keys to keep in memory to 1000. The store sub-package also provides a Redis-based Store implementation.

The RateLimit throttler sets the expected X-RateLimit-* headers on the response, and also sets a Retry-After header when the limit is exceeded.
The API documentation is available as usual on godoc.org: There is also a blog post explaining the package's usage on 0value.com: Finally, many examples are provided in the /examples sub-folder of the repository. The BSD 3-clause license. Copyright (c) 2014 Martin Angers and Contributors.
Package stats is a statistics library created by Engineers at Lyft with support for Counters, Gauges, and Timers.
Package tcpinfo implements encoding and decoding of TCP-level socket options regarding connection information. The Transmission Control Protocol (TCP) is defined in RFC 793. TCP Selective Acknowledgment Options is defined in RFC 2018. Management Information Base for the Transmission Control Protocol (TCP) is defined in RFC 4022. TCP Congestion Control is defined in RFC 5681. Computing TCP's Retransmission Timer is described in RFC 6298. TCP Options and Maximum Segment Size (MSS) is defined in RFC 6691. Shared Use of Experimental TCP Options is defined in RFC 6994. TCP Extensions for High Performance is defined in RFC 7323. NOTE: Older Linux kernels may not support extended TCP statistics described in RFC 4898.
Binary dnsmasq_exporter is a Prometheus exporter for dnsmasq statistics.
Package rtcp implements encoding and decoding of RTCP packets according to RFC 3550. RTCP is a sister protocol of the Real-time Transport Protocol (RTP). Its basic functionality and packet structure is defined in RFC 3550. RTCP provides out-of-band statistics and control information for an RTP session. It partners with RTP in the delivery and packaging of multimedia data, but does not transport any media data itself. The primary function of RTCP is to provide feedback on the quality of service (QoS) in media distribution by periodically sending statistics information such as transmitted octet and packet counts, packet loss, packet delay variation, and round-trip delay time to participants in a streaming multimedia session. An application may use this information to control quality of service parameters, perhaps by limiting flow, or using a different codec. Decoding RTCP packets: Encoding RTCP packets:
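As a rough illustration of the packet structure RFC 3550 defines, the following sketch decodes and re-encodes only the 4-byte header that starts every RTCP packet (version, padding flag, count, packet type, and length). It uses the standard library only; the header type and its functions are hypothetical and are not this package's API.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// header holds the fields of the 4-byte RTCP common header.
type header struct {
	Version uint8  // 2 bits, must be 2
	Padding bool   // 1 bit
	Count   uint8  // 5 bits (e.g. reception report count)
	Type    uint8  // packet type, e.g. 200 = SR, 201 = RR
	Length  uint16 // packet length in 32-bit words, minus one
}

// parseHeader decodes the common header from the start of b.
func parseHeader(b []byte) (header, error) {
	if len(b) < 4 {
		return header{}, errors.New("rtcp: header shorter than 4 bytes")
	}
	h := header{
		Version: b[0] >> 6,
		Padding: b[0]&0x20 != 0,
		Count:   b[0] & 0x1f,
		Type:    b[1],
		Length:  binary.BigEndian.Uint16(b[2:4]),
	}
	if h.Version != 2 {
		return header{}, fmt.Errorf("rtcp: unsupported version %d", h.Version)
	}
	return h, nil
}

// sizeBytes returns the total packet size implied by the length field.
func (h header) sizeBytes() int {
	return (int(h.Length) + 1) * 4
}

// marshal encodes the header back into 4 bytes.
func (h header) marshal() []byte {
	b := make([]byte, 4)
	b[0] = h.Version<<6 | h.Count&0x1f
	if h.Padding {
		b[0] |= 0x20
	}
	b[1] = h.Type
	binary.BigEndian.PutUint16(b[2:4], h.Length)
	return b
}

func main() {
	// Header of a receiver report carrying one report block.
	h, err := parseHeader([]byte{0x81, 0xc9, 0x00, 0x07})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v size=%d bytes\n", h, h.sizeBytes())
	fmt.Printf("% x\n", h.marshal())
}
```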
Package taskstats provides access to Linux's taskstats interface, for sending per-task, per-process, and cgroup statistics from the kernel to userspace. For more information on taskstats, please see:
Package onlinestats provides online, one-pass algorithms for descriptive statistics. The implementation is based on the public domain code available at http://www.johndcook.com/skewness_kurtosis.html . The linear regression code is from http://www.johndcook.com/running_regression.html .
Package kstat provides a Go interface to the Solaris/OmniOS kstat(s) system for user-level access to a lot of kernel statistics. For more documentation on kstats, see kstat(1) and kstat(3kstat). The package can retrieve what are called 'named' kstat statistics, IO statistics, and the most common additional types of 'raw' statistics, which covers almost all kstats you will normally find in the kernel. You can see the names and types of other kstats, but not currently retrieve data for them. Named statistics are the most common type for general information; IO statistics are exported by disks and some other things. Supported additional raw kstats are unix:0:sysinfo, unix:0:vminfo, unix:0:var, and mnt:*:mntinfo. General usage for named statistics: call Open() to obtain a Token, then call GetNamed() on it to obtain Named(s) for specific statistics. Note that this always gives you the very latest value for the statistic. If you want a number of statistics from the same module:inst:name triplet (eg several network counters from the same network interface) and you want them to all have been gathered at the same time, you need to call .Lookup() to obtain a KStat and then repeatedly call its .GetNamed() (this is also slightly more efficient). The short version: a kstat is a collection of some related statistics, eg various network counters for a particular network interface. A Token is a handle for a collection of kstats. You go collection (Token) -> kstat (KStat) -> specific statistic (Named) in order to retrieve the value of a specific statistic. (IO stats are retrieved all at once with GetIO(), because they come to us from the kernel as one single struct so that's what you get.) This is a cgo-based package. Cross compilation is up to you. Goroutine safety is in no way guaranteed because the underlying C kstat library is probably not thread or goroutine safe (and there are some all-Go concurrency races involving .Close()). 
This package may leak memory, especially since the Solaris kstat manpage is not clear on the requirements here. However I believe it's reasonably memory safe. It's possible to totally corrupt memory with use-after-free errors if you do operations on kstats after calling Token.Close(), although we try to avoid that. NOTE: this package is quite young. The API may well change as I (and other people) gain more experience with it. In general this is not going to be as lean and mean as calling C directly, partly because of intrinsic CGo overheads and partly because we do more memory allocation and deallocation than a C program would (partly because we prioritize not leaking memory). We support named kstats and IO kstats (KSTAT_TYPE_NAMED and KSTAT_TYPE_IO / kstat_io_t respectively). kstat(1) also knows about a number of magic specific 'raw' stats (which are generally custom C structs); of these we support unix:0:sysinfo, unix:0:vminfo, unix:0:var, and mnt:*:mntinfo for NFS filesystem mounts. In theory kstat supports general timer and interrupt stats. In practice there is no use of KSTAT_TYPE_TIMER in the current Illumos kernel source and very little use of KSTAT_TYPE_INTR (mostly by very old hardware drivers, although the vioif driver uses it too). Since I can't test KSTAT_TYPE_INTR stats, we don't currently support it. There are also a few additional KSTAT_TYPE_RAW raw stats that we don't support, mostly because they seem to be effectively obsolete. These specific raw stats can be found listed in the Illumos source code in cmd/stat/kstat/kstat.h in the ks_raw_lookup array. See cmd/stat/kstat/kstat.c for how they're interpreted. If you need access to one of these kstats, the KStat.CopyTo() and KStat.Raw() methods give you an escape hatch to roll your own. You'll probably need to use cgo to generate an appropriate Go struct that matches the C struct you need. 
My notes on this process may be helpful: https://utcc.utoronto.ca/~cks/space/blog/programming/GoCGoCompatibleStructs Author: Chris Siebenmann https://github.com/siebenmann/go-kstat Copyright: standard Go copyright. (If you're reading this documentation on a non-Solaris platform, you're probably not seeing the detailed API documentation for constants, types, and so on because of tooling limitations in godoc et al.)
Copyright 2015 Swisscom (Schweiz) AG

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Package toolbox is a middleware that provides health check, pprof, profile and statistic services for Macaron.
Package evidently provides the API client, operations, and parameter types for Amazon CloudWatch Evidently. You can use Amazon CloudWatch Evidently to safely validate new features by serving them to a specified percentage of your users while you roll out the feature. You can monitor the performance of the new feature to help you decide when to ramp up traffic to your users. This helps you reduce risk and identify unintended consequences before you fully launch the feature. You can also conduct A/B experiments to make feature design decisions based on evidence and data. An experiment can test as many as five variations at once. Evidently collects experiment data and analyzes it using statistical methods. It also provides clear recommendations about which variations perform better. You can test both user-facing features and backend features.
Package hercules contains the functions which are needed to gather various statistics from a Git repository.

The analysis is expressed in the form of a tree: there are nodes - "pipeline items" - which require some other nodes to be executed before themselves and in turn provide the data for dependent nodes. There are several service items which do not produce any useful statistics but rather provide the requirements for other items. The top-level items include:

- BurndownAnalysis - line burndown statistics for project, files and developers.
- CouplesAnalysis - coupling statistics for files and developers.
- ShotnessAnalysis - structural hotness and couples, by any Babelfish UAST XPath (functions by default).

The typical API usage is to initialize the Pipeline class:

Then add the required analysis:

This call will add all the needed intermediate pipeline items. Then link and execute the analysis tree:

Finally extract the result:

The actual usage example is cmd/hercules/root.go - the command line tool's code.

You can provide additional options via `facts` on initialization. For example, to provide your own logger, enable people-tracking, and set a custom tick size:

Hercules depends heavily on https://github.com/src-d/go-git and leverages the diff algorithm through https://github.com/sergi/go-diff.

Besides, BurndownAnalysis involves File and RBTree. These are low level data structures which enable incremental blaming. File carries an instance of RBTree and the current line burndown state. RBTree implements the red-black balanced binary tree and is based on https://github.com/yasushi-saito/rbtree.

Coupling stats are supposed to be further processed rather than observed directly. labours.py uses Swivel embeddings and visualises them in Tensorflow Projector.

Shotness analysis, as well as other UAST-featured items, relies on Babelfish (https://doc.bblf.sh) and requires the server to be running.
Package clistats implements progress-monitor functionality that exposes statistics in JSON format on an API endpoint bound to localhost.
Package dockerstats provides the ability to get currently running Docker container statistics, including memory and CPU usage. To get the statistics of running Docker containers, you can use the `Current()` function: Alternatively, you can use the `NewMonitor()` function to receive a constant stream of Docker container stats, available on the Monitor's `Stream` channel:
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include:

- Breiman and Cutler's Random Forest for Classification and Regression
- Adaptive Boosting (AdaBoost) Classification
- Gradient Boosting Tree Regression
- Entropy and Cost driven classification
- L1 regression
- Feature selection with artificial contrasts
- Proximity and model structure analysis
- Roughly balanced bagging for unbalanced classification

The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and can not yet be released.

Library documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/ryanbressler/CloudForest

Documentation of command line utilities and file formats can be found in README.md, which can be viewed formatted on github: http://github.com/ryanbressler/CloudForest

Pull requests and bug reports are welcome.

CloudForest was created by Ryan Bressler and is being developed in the Shmulevich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute.

CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code.

Data structures and file formats are chosen with use in multi threaded and cluster environments in mind.

Go's support for function types is used to provide an interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed.
Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method:

This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects.

Decision trees are grown with the goal of reducing "Impurity", which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface, which allows for alternative definitions of impurity. CloudForest includes several alternative targets:

Additional targets can be stacked on top of these targets to add boosting functionality:

Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning, and CloudForest includes optimized code to perform these tasks.

Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In Go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like:

can return left and right slices that point to the same underlying array as the original slice of cases, but these slices should not have their values changed.

Functions used while searching for the best split also accept pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum.
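The slice-sharing idea described above can be sketched with a small in-place partition. This is an illustration under stated assumptions, not CloudForest's own split code: the splitCases function, its parameters, and the sample values are hypothetical.

```go
package main

import "fmt"

// splitCases reorders the case indexes in place so that every case whose
// feature value is below threshold comes first. It returns the two halves
// as sub-slices of the same backing array, so nothing is allocated.
func splitCases(cases []int, values []float64, threshold float64) (left, right []int) {
	i := 0
	for j, c := range cases {
		if values[c] < threshold {
			// Move this case into the "left" region at the front.
			cases[i], cases[j] = cases[j], cases[i]
			i++
		}
	}
	return cases[:i], cases[i:]
}

func main() {
	values := []float64{0.9, 0.1, 0.5, 0.3, 0.7} // feature value per case
	cases := []int{0, 1, 2, 3, 4}                // indexes of the cases at this node

	left, right := splitCases(cases, values, 0.5)
	fmt.Println(left, right)

	// Both results alias the original slice's storage.
	fmt.Println(&left[0] == &cases[0])
}
```

Because left and right alias the original storage, a child node can partition its own sub-slice again without copying, which is the same bookkeeping that start and end indexes would otherwise carry.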
BestSplitAllocs contains pointers to these reusable slices and structs, and its use can be seen in the split-search functions. For categorical predictors, BestSplit will also attempt to intelligently choose between four different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit, which relies on Go's sort package.

Training a Random Forest is an inherently parallel process, and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk.

Trees can be grown in separate goroutines. The growforest utility provides an example of this that uses goroutines and channels to grow trees in parallel and write trees to disk as they are finished by the "worker" goroutines. The few summary statistics needed, like mean impurity decrease per feature (importance), can be calculated using thread-safe data structures like RunningMean.

Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation, and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk.

For data sets that are too big to fit in memory on a single machine, Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, a distributed database, etc.

By default CloudForest uses a fast heuristic for missing values.
When proposing a split on a feature with missing data, the missing cases are removed and the impurity value is corrected to use three-way impurity, which reduces the bias towards features with lots of missing data. Missing values in the target variable are left out of impurity calculations.

This has provided generally good results at a fraction of the computational cost of imputing data.

Optionally, Feature.ImputeMissing or FeatureMatrix.ImputeMissing can be called before forest growth to impute missing values to the feature mean/mode, which Breiman [1] suggests as a fast method for imputing values. The resulting forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity-weighted imputation Breiman describes.

Experimental support is provided for three-way splitting, which splits missing cases onto a third branch [2]. This has so far yielded mixed results in testing.

At some point in the future, support may be added for local imputing of missing values during tree growth as described in [3].

[1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1
[2] https://code.google.com/p/rf-ace/
[3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record

In CloudForest, data is stored using the FeatureMatrix struct, which contains Features. The Feature struct implements storage and methods for both categorical and numerical data, including calculations of impurity and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification.

Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow method that implements Breiman and Cutler's method (described above) for growing a tree.
A GrowForest method is also provided that implements the rest of the method, including sampling cases, but it may be faster to grow the forest to disk as in the growforest utility.

Prediction and voting are done using Tree.Vote together with CatBallotBox and NumBallotBox, which implement the VoteTallyer interface.