Package poller is a file-descriptor multiplexer. It allows concurent Read and Write operations from and to multiple file-descriptors without allocating one OS thread for every blocked operation. It operates similarly to Go's netpoller (which multiplexes network connections) without requiring special support from the Go runtime. It can be used with tty devices, character devices, pipes, FIFOs, and any file-descriptor that is poll-able (can be used with select(2), epoll(7), etc.) In addition, package poller allows the user to set timeouts (deadlines) for read and write operations, and also allows for safe cancelation of blocked read and write operations; a Close from another go-routine safely cancels ongoing (blocked) read and write operations. Typical usage All operations on poller FDs are thread-safe; you can use the same FD from multiple go-routines. It is, for example, safe to close a file descriptor blocked on a Read or Write call from another go-routine. Linux systems are supported using an implementations based on epoll(7). For other POSIX systems (that is, most Unix-like systems), a more portable fallback implementation based on select(2) is provided. It has the same semantics as the Linux epoll(7)-based implementation, but is expected to be of lower performance. The select(2)-based implementation uses CGo for some ancillary select-related operations. Ideally, system-specific io-multiplexing or async-io facilities (e.g. kqueue(2), or /dev/poll(7)) should be used to provide higher performance implementations for other systems. Patches for this will be greatly appreciated. If you wish, you can build package poller on Linux to use the select(2)-based implementation instead of the epoll(7) one. To do this define the build-tag "noepoll". Normally, there is no reason to do this.
Package queue provides multiple thread-safe generic queue implementations. Currently, there are 2 available implementations: A blocking queue, which provides methods that wait for the queue to have available elements when attempting to retrieve an element, and waits for a free slot when attempting to insert an element. A priority queue based on a container.Heap. The elements in the queue must implement the Lesser interface, and are ordered based on the Less method. The head of the queue is always the highest priority element. A circular queue, which is a queue that uses a fixed-size slice as if it were connected end-to-end. When the queue is full, adding a new element to the queue overwrites the oldest element. A linked queue, implemented as a singly linked list, offering O(1) time complexity for enqueue and dequeue operations. The queue maintains pointers to both the head (front) and tail (end) of the list for efficient operations without the need for traversal.
Package ring provides a high performance and thread safe bloom filter. Copyright (c) 2019 Tanner Ryan. All rights reserved. Use of this source code is governed by a BSD-style license that can be found in the LICENSE file.
Package memguard lets you easily handle sensitive values in memory. The number of LockedBuffers that you are able to create is limited by how much memory your system kernel allows each process to mlock/VirtualLock. Therefore you should call Destroy on LockedBuffers that you no longer need, or simply defer a Destroy call after creating a new LockedBuffer. If a function that you're using requires an array, you can cast the buffer to an array and then pass around a pointer. Make sure that you do not dereference the pointer and pass around the resulting value, as this will leave copies all over the place. The MemGuard API is thread-safe. You can extend this thread-safety to outside of the API functions by using the Mutex that each LockedBuffer exposes. Don't use the mutex when calling a function that is part of the MemGuard API though, or the process will deadlock. When terminating your application, care should be taken to securely cleanup everything.
Package pargo provides functions and data structures for expressing parallel algorithms. While Go is primarily designed for concurrent programming, it is also usable to some extent for parallel programming, and this library provides convenience functionality to turn otherwise sequential algorithms into parallel algorithms, with the goal to improve performance. For documentation that provides a more structured overview than is possible with Godoc, see the wiki at https://github.com/exascience/pargo/wiki Pargo provides the following subpackages: pargo/parallel provides simple functions for executing series of thunks or predicates, as well as thunks, predicates, or reducers over ranges in parallel. See also https://github.com/ExaScience/pargo/wiki/TaskParallelism pargo/speculative provides speculative implementations of most of the functions from pargo/parallel. These implementations not only execute in parallel, but also attempt to terminate early as soon as the final result is known. See also https://github.com/ExaScience/pargo/wiki/TaskParallelism pargo/sequential provides sequential implementations of all functions from pargo/parallel, for testing and debugging purposes. pargo/sort provides parallel sorting algorithms. pargo/sync provides an efficient parallel map implementation. pargo/pipeline provides functions and data structures to construct and execute parallel pipelines. Pargo has been influenced to various extents by ideas from Cilk, Threading Building Blocks, and Java's java.util.concurrent and java.util.stream packages. See http://supertech.csail.mit.edu/papers/steal.pdf for some theoretical background, and the sample chapter at https://mitpress.mit.edu/books/introduction-algorithms for a more practical overview of the underlying concepts.
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include: Breiman and Cutler's Random Forest for Classification and Regression Adaptive Boosting (AdaBoost) Classification Gradiant Boosting Tree Regression Entropy and Cost driven classification L1 regression Feature selection with artificial contrasts Proximity and model structure analysis Roughly balanced bagging for unbalanced classification The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and can not yet be released. Library Documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/ryanbressler/CloudForest Documentation of command line utilities and file formats can be found in README.md, which can be viewed fromated on github: http://github.com/ryanbressler/CloudForest Pull requests and bug reports are welcome. CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi threaded and cluster environments in mind. Go's support for function types is used to provide a interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method: This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects. Decision tree's are grown with the goal of reducing "Impurity" which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface which allows for alternative definitions of impurity. CloudForest includes several alternative targets: Additional targets can be stacked on top of these target to add boosting functionality: Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like: can return left and right slices that point to the same underlying array as the original slice of cases but these slices should not have their values changed. Functions used while searching for the best split also accepts pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like: For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit which relies on go's sorting package. Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate go routines. The growforest utility provides an example of this that uses go routines and channels to grow trees in parallel and write trees to disk as the are finished by the "worker" go routines. The few summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, distributed database etc. By default cloud forest uses a fast heuristic for missing values. When proposing a split on a feature with missing data the missing cases are removed and the impurity value is corrected to use three way impurity which reduces the bias towards features with lots of missing data: Missing values in the target variable are left out of impurity calculations. This provided generally good results at a fraction of the computational costs of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode which Brieman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Brieman describes. Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing. At some point in the future support may be added for local imputing of missing values during tree growth as described in [3] [1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 [2] https://code.google.com/p/rf-ace/ [3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record In CloudForest data is stored using the FeatureMatrix struct which contains Features. The Feature struct implements storage and methods for both categorical and numerical data and calculations of impurity etc and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow implements Brieman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method including sampling cases but it may be faster to grow the forest to disk as in the growforest utility. Prediction and Voting is done using Tree.Vote and CatBallotBox and NumBallotBox which implement the VoteTallyer interface.