Package twitter implements a client for the Twitter API v2. This package is in development and is not yet ready for production use. The general structure of an API call is to first construct a query, then invoke that query with a context on a client: Package "types" contains the type and constant definitions for the API. Queries to look up tweets by ID or username, to search recent tweets, and to search or sample streams of tweets are defined in package "tweets". Queries to look up users by ID or user name are defined in package "users". Queries to read or update search rules are defined in package "rules". Queries to create, edit, delete, and show the contents of lists are defined in package "lists".
Package conf is a configuration parser that loads configuration in a struct from files and automatically creates cli flags. Just define a struct with needed configuration. Values are then taken from multiple source in this order of precedence: Default is to use lower cased field names in struct to create command line arguments. Tags can be used to specify different names, command line help message and section in conf file. Format is: conf.Path variable can be set to the path of a configuration file. For default it is initialized to the value of environment variable: Files ending with: will be parsed as INI informal standard: First letter of every key found is upper cased and than, a struct field with same name is searched: If such field name is not found than comparison is made against key specified as first element in tag. conf tries to be modular and easily extensible to support different formats. This is a work in progress, APIs can change.
Package freegeoip provides an API for searching the geolocation of IP addresses. It uses a database that can be either a local file or a remote resource from a URL. Local databases are monitored by fsnotify and reloaded when the file is either updated or overwritten. Remote databases are automatically downloaded and updated in background so you can focus on using the API and not managing the database.
Package freegeoip provides an API for searching the geolocation of IP addresses. It uses a database that can be either a local file or a remote resource from a URL. Local databases are monitored by fsnotify and reloaded when the file is either updated or overwritten. Remote databases are automatically downloaded and updated in background so you can focus on using the API and not managing the database.
Package bst and its sub-packages specify an extensible Binary Search Tree (BST) API. Instead of trying to think of every method a user might want on a BST, we declare a set of extensible methods with the hopes of enabling a user to create any functionality they need. Packages and sub-packages: -------------------------- * bst Declares basic BST methods (Insert, Get, Remove, etc). * bst/finder Declares extensible method, Find, which allows you to direct navigation down a tree, looking for a specific value. * bst/visitor Declares extensible method, Visit, which allows you to defer taking an action on a tree node (insert, get, update, delete) until after you've visited that node (i.e. inspect then change with one visit to the node). * bst/walker Declares extensible method, Walk, which allows you to freely traverse a tree, either up and down (left, right, parent) or in sequential order (prev, next). * bst/simple Provides a reference implementation of an Extensible BST that implements all of the above-declared methods. License ------- This package is released under the MIT License. See included file 'LICENSE' for more details. Contributors ------------ David Farell <DavidPFarrell@yahoo.com>
Package typesense is a Go client for Typesense, an open-source, typo tolerant search engine. This package defines a client and some useful methods to communicate with the Typesense API without the drawback of writing actual API requests. Example of an actual program that can connect to the Typesense API creates a new collection, index some document and retrieve it using search. Highly inspired by the official Typesense guide https://typesense.org/docs/0.11.2/guide/ .
Package lingua accurately detects the natural language of written text, be it long or short. Its task is simple: It tells you which language some text is written in. This is very useful as a preprocessing step for linguistic data in natural language processing applications such as text classification and spell checking. Other use cases, for instance, might include routing e-mails to the right geographically located customer service department, based on the e-mails' languages. Language detection is often done as part of large machine learning frameworks or natural language processing applications. In cases where you don't need the full-fledged functionality of those systems or don't want to learn the ropes of those, a small flexible library comes in handy. So far, the only other comprehensive open source library in the Go ecosystem for this task is Whatlanggo (https://github.com/abadojack/whatlanggo). Unfortunately, it has two major drawbacks: 1. Detection only works with quite lengthy text fragments. For very short text snippets such as Twitter messages, it does not provide adequate results. 2. The more languages take part in the decision process, the less accurate are the detection results. Lingua aims at eliminating these problems. It nearly does not need any configuration and yields pretty accurate results on both long and short text, even on single words and phrases. It draws on both rule-based and statistical methods but does not use any dictionaries of words. It does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline. Compared to other language detection libraries, Lingua's focus is on quality over quantity, that is, getting detection right for a small set of languages first before adding new ones. Currently, 75 languages are supported. They are listed as variants of type Language. Lingua is able to report accuracy statistics for some bundled test data available for each supported language. The test data for each language is split into three parts: 1. a list of single words with a minimum length of 5 characters 2. a list of word pairs with a minimum length of 10 characters 3. a list of complete grammatical sentences of various lengths Both the language models and the test data have been created from separate documents of the Wortschatz corpora (https://wortschatz.uni-leipzig.de) offered by Leipzig University, Germany. Data crawled from various news websites have been used for training, each corpus comprising one million sentences. For testing, corpora made of arbitrarily chosen websites have been used, each comprising ten thousand sentences. From each test corpus, a random unsorted subset of 1000 single words, 1000 word pairs and 1000 sentences has been extracted, respectively. Given the generated test data, I have compared the detection results of Lingua, and Whatlanggo running over the data of Lingua's supported 75 languages. Additionally, I have added Google's CLD3 (https://github.com/google/cld3/) to the comparison with the help of the gocld3 bindings (https://github.com/jmhodges/gocld3). Languages that are not supported by CLD3 or Whatlanggo are simply ignored during the detection process. Lingua clearly outperforms its contenders. Every language detector uses a probabilistic n-gram (https://en.wikipedia.org/wiki/N-gram) model trained on the character distribution in some training corpus. Most libraries only use n-grams of size 3 (trigrams) which is satisfactory for detecting the language of longer text fragments consisting of multiple sentences. For short phrases or single words, however, trigrams are not enough. The shorter the input text is, the less n-grams are available. The probabilities estimated from such few n-grams are not reliable. This is why Lingua makes use of n-grams of sizes 1 up to 5 which results in much more accurate prediction of the correct language. A second important difference is that Lingua does not only use such a statistical model, but also a rule-based engine. This engine first determines the alphabet of the input text and searches for characters which are unique in one or more languages. If exactly one language can be reliably chosen this way, the statistical model is not necessary anymore. In any case, the rule-based engine filters out languages that do not satisfy the conditions of the input text. Only then, in a second step, the probabilistic n-gram model is taken into consideration. This makes sense because loading less language models means less memory consumption and better runtime performance. In general, it is always a good idea to restrict the set of languages to be considered in the classification process using the respective api methods. If you know beforehand that certain languages are never to occur in an input text, do not let those take part in the classifcation process. The filtering mechanism of the rule-based engine is quite good, however, filtering based on your own knowledge of the input text is always preferable. There might be classification tasks where you know beforehand that your language data is definitely not written in Latin, for instance. The detection accuracy can become better in such cases if you exclude certain languages from the decision process or just explicitly include relevant languages. Knowing about the most likely language is nice but how reliable is the computed likelihood? And how less likely are the other examined languages in comparison to the most likely one? In the example below, a slice of ConfidenceValue is returned containing those languages which the calling instance of LanguageDetector has been built from. The entries are sorted by their confidence value in descending order. Each value is a probability between 0.0 and 1.0. The probabilities of all languages will sum to 1.0. If the language is unambiguously identified by the rule engine, the value 1.0 will always be returned for this language. The other languages will receive a value of 0.0. By default, Lingua uses lazy-loading to load only those language models on demand which are considered relevant by the rule-based filter engine. For web services, for instance, it is rather beneficial to preload all language models into memory to avoid unexpected latency while waiting for the service response. If you want to enable the eager-loading mode, you can do it as seen below. Multiple instances of LanguageDetector share the same language models in memory which are accessed asynchronously by the instances. By default, Lingua returns the most likely language for a given input text. However, there are certain words that are spelled the same in more than one language. The word `prologue`, for instance, is both a valid English and French word. Lingua would output either English or French which might be wrong in the given context. For cases like that, it is possible to specify a minimum relative distance that the logarithmized and summed up probabilities for each possible language have to satisfy. It can be stated as seen below. Be aware that the distance between the language probabilities is dependent on the length of the input text. The longer the input text, the larger the distance between the languages. So if you want to classify very short text phrases, do not set the minimum relative distance too high. Otherwise Unknown will be returned most of the time as in the example below. This is the return value for cases where language detection is not reliably possible.
Package esquery provides a non-obtrusive, idiomatic and easy-to-use query and aggregation builder for the official Go client (https://github.com/elastic/go-elasticsearch) for the ElasticSearch database (https://www.elastic.co/products/elasticsearch). esquery alleviates the need to use extremely nested maps (map[string]interface{}) and serializing queries to JSON manually. It also helps eliminating common mistakes such as misspelling query types, as everything is statically typed. Using `esquery` can make your code much easier to write, read and maintain, and significantly reduce the amount of code you write. esquery provides a method chaining-style API for building and executing queries and aggregations. It does not wrap the official Go client nor does it require you to change your existing code in order to integrate the library. Queries can be directly built with `esquery`, and executed by passing an `*elasticsearch.Client` instance (with optional search parameters). Results are returned as-is from the official client (e.g. `*esapi.Response` objects). Getting started is extremely simple: esquery currently supports version 7 of the ElasticSearch Go client. The library cannot currently generate "short queries". For example, whereas ElasticSearch can accept this: { "query": { "term": { "user": "Kimchy" } } } The library will always generate this: This is also true for queries such as "bool", where fields like "must" can either receive one query object, or an array of query objects. `esquery` will generate an array even if there's only one query object.
Package freegeoip provides an API for searching the geolocation of IP addresses. It uses a database that can be either a local file or a remote resource from a URL. Local databases are monitored by fsnotify and reloaded when the file is either updated or overwritten. Remote databases are automatically downloaded and updated in background so you can focus on using the API and not managing the database.
The musicbrainzws2 package allows accessing the MusicBrainz web service. The package supports all MusicBrainz entities. The MusicBrainz web service provides three general types of requests: lookup, browse and search. All of them are supported. All API functions are available by creating a Client using NewClient. Please note that this package currently is still a beta and the API may still change. The Client supports lookup, browse and search queries for all entities. Submission of ISRCs, barcodes, folksonomy tags/genres and ratings is also supported, as is managing the contents of user collections. The deprecated CD stubs are not supported and there are no plans to support them. For a full documentation of the web service see https://musicbrainz.org/doc/MusicBrainz_API and the documentation of the search service at https://musicbrainz.org/doc/MusicBrainz_API/Search.
Package main (doc.go) : This is a CLI tool to execute Google Apps Script (GAS) on a terminal. Will you want to develop GAS on your local PC? Generally, when we develop GAS, we have to login to Google using own browser and develop it on the Script Editor. Recently, I have wanted to have more convenient local-environment for developing GAS. So I created this "ggsrun". The main work is to execute GAS on local terminal and retrieve the results from Google. 1. Develops GAS using your terminal and text editor which got accustomed to using. 2. Executes GAS by giving values to your script. 3. Executes GAS made of CoffeeScript. 4. Downloads spreadsheet, document and presentation, while executes GAS, simultaneously. 5. Downloads files from Google Drive and Uploads files to Google Drive. 6. Downloads standalone script and bound script. 7. Downloads all files and folders in a specific folder. 8. Upload script files and create project as standalone script and container-bound script. 9. Update project. 10. Retrieve revision files of Google Docs and retrieve versions of projects. 11. Rearranges scripts in project. 12. Modifies Manifests in project. 13. Seach files in Google Drive using search query and regex. 14. Manage Permissions of files. 15. Get Drive Information. 16. ggsrun got to be able to be used by not only OAuth2, but also Service Account from v1.7.0. You can see the release page https://github.com/tanaikech/ggsrun/releases ggsrun uses Execution API, Web Apps and Drive API on Google. About how to install ggsrun, please check my github repository. https://github.com/tanaikech/ggsrun/ You can read the detail information there. --------------------------------------------------------------- # How to Execute Google Apps Script Using ggsrun When you have the configure file `ggsrun.cfg`, you can execute GAS. If you cannot find it, please download `client_secret.json` and run $ ggsrun auth In the case of using Execution API, $ ggsrun e1 -s sample.gs If you want to execute a function except for `main()` of default, you can use an option like `-f foo`. This command `exe1` can be used to execute a function on project. $ ggsrun e1 -f foo $ ggsrun e2 -s sample.gs At `e2`, you cannot select the executing function except for `main()` of default. `e1`, `e2` and `-s` mean using Execution API and GAS script file name, respectively. Sample codes which are shown here will be used Execution API. At this time, the executing function is `main()`, which is a default, in the script. In the case of using Web Apps, $ ggsrun w -s sample.gs -p password -u [ WebApps URL ] `w` and `-p` mean using Web Apps and password you set at the server side, respectively. Using `-u` it imports Web Apps URL like `-u https://script.google.com/macros/s/#####/exec`. --------------------------------------------------------------- Package main (ggsrun.go) : This file is included all commands and options. Package main (handler.go) : Handler for ggsrun Package main (init.go) : These methods are for reading and writing configuration file (ggsrun.cfg). Package main (materials.go) : Materials for ggsrun. Package main (oauth.go) : Get accesstoken using refreshtoken, and confirm condition of accesstoken. Package main (projectupdater.go) : These methods are for updating project. Package main (scriptrearrange.go) : These methods are for rearranging scripts in a project. Package main (sender.go) : These methods are for sending GAS scripts to Google Drive.
Package openfec implements a client library for the OpenFEC API. See https://api.open.fec.gov This package all allows you to explore the way candidates and committees fund their campaigns by accessing FEC (Federal Election Committee) data. A few restrictions limit the way you can use FEC data. For example, you can’t use contributor lists for commercial purposes or to solicit donations. Get an API key here: https://api.data.gov/signup/ Endpoints are classified in the following groups: Candidate, Committee, Financial, Filings and Schedules. Candidate endpoints give you access to information about the people running for office. This information is organized by candidate_id. If you're unfamiliar with candidate IDs, using `/candidates/search` will help you locate a particular candidate. Officially, a candidate is an individual seeking nomination for election to a federal office. People become candidates when they (or agents working on their behalf) raise contributions or make expenditures that exceed $5,000. The candidate endpoints primarily use data from FEC registration [Form 1](http://www.fec.gov/pdf/forms/fecfrm1.pdf), for candidate information, and [Form 2](http://www.fec.gov/pdf/forms/fecfrm2.pdf), for committee information. Committees are entities that spend and raise money in an election. Their characteristics and relationships with candidates can change over time. You might want to use filters or search endpoints to find the committee you're looking for. Then you can use other committee endpoints to explore information about the committee that interests you. Financial information is organized by `committee_id`, so finding the committee you're interested in will lead you to more granular financial information. The committee endpoints include all FEC filers, even if they aren't registered as a committee. Officially, committees include the committees and organizations that file with the FEC. Several different types of organizations file financial reports with the FEC: * Campaign committees authorized by particular candidates to raise and spend funds in their campaigns * Non-party committees (e.g., PACs), some of which may be sponsored by corporations, unions, trade or membership groups, etc. * Political party committees at the national, state, and local levels * Groups and individuals making only independent expenditures * Corporations, unions, and other organizations making internal communications The committee endpoints primarily use data from FEC registration Form 1 and Form 2. Fetch key information about a committee's Form 3, Form 3X, or Form 3P financial reports. Most committees are required to summarize their financial activity in each filing; those summaries are included in these files. Generally, committees file reports on a quarterly or monthly basis, but some must also submit a report 12 days before primary elections. Therefore, during the primary season, the period covered by this file may be different for different committees. These totals also incorporate any changes made by committees, if any report covering the period is amended. Information is made available on the API as soon as it's processed. Keep in mind, complex paper filings take longer to process. The financial endpoints use data from FEC [form 5](http://www.fec.gov/pdf/forms/fecfrm5.pdf), for independent expenditors; or the summary and detailed summary pages of the FEC [form 3](http://www.fec.gov/pdf/forms/fecfrm3.pdf), for House and Senate committees; [form 3X](http://www.fec.gov/pdf/forms/fecfrm3x.pdf), for PACs and parties; and [form 3P](http://www.fec.gov/pdf/forms/fecfrm3p.pdf), for presidential committees. All official records and reports filed by or delivered to the FEC. Note: because the filings data includes many records, counts for large result sets are approximate. Schedules come from particular sections on forms and contain detailed transactional data. Schedule A explains where contributions come from. If you are interested in individual donors, this will be the endpoint you use. For the Schedule A aggregates, "memoed" items are not included to avoid double counting. Schedule B explains how money is spent.
Package aw is a "plug-and-play" workflow development library/framework for Alfred 3 & 4 (https://www.alfredapp.com/). It requires Go 1.13 or later. It provides everything you need to create a polished and blazing-fast Alfred frontend for your project. As of AwGo 0.26, all applicable features of Alfred 4.1 are supported. The main features are: AwGo is an opinionated framework that expects to be used in a certain way in order to eliminate boilerplate. It *will* panic if not run in a valid, minimally Alfred-like environment. At a minimum the following environment variables should be set to meaningful values: NOTE: AwGo is currently in development. The API *will* change and should not be considered stable until v1.0. Until then, be sure to pin a version using go modules or similar. Be sure to also check out the _examples/ subdirectory, which contains some simple, but complete, workflows that demonstrate the features of AwGo and useful workflow idioms. Typically, you'd call your program's main entry point via Workflow.Run(). This way, the library will rescue any panic, log the stack trace and show an error message to the user in Alfred. In the Script box (Language = "/bin/bash"): To generate results for Alfred to show in a Script Filter, use the feedback API of Workflow: You can set workflow variables (via feedback) with Workflow.Var, Item.Var and Modifier.Var. See Workflow.SendFeedback for more documentation. Alfred requires a different JSON format if you wish to set workflow variables. Use the ArgVars (named for its equivalent element in Alfred) struct to generate output from Run Script actions. Be sure to set TextErrors to true to prevent Workflow from generating Alfred JSON if it catches a panic: See ArgVars for more information. New() creates a *Workflow using the default values and workflow settings read from environment variables set by Alfred. You can change defaults by passing one or more Options to New(). If you do not want to use Alfred's environment variables, or they aren't set (i.e. you're not running the code in Alfred), use NewFromEnv() with a custom Env implementation. A Workflow can be re-configured later using its Configure() method. See the documentation for Option for more information on configuring a Workflow. AwGo can check for and install new versions of your workflow. Subpackage update provides an implementation of the Updater interface and sources to load updates from GitHub or Gitea releases, or from the URL of an Alfred `metadata.json` file. See subpackage update and _examples/update. AwGo can filter Script Filter feedback using a Sublime Text-like fuzzy matching algorithm. Workflow.Filter() sorts feedback Items against the provided query, removing those that do not match. See _examples/fuzzy for a basic demonstration, and _examples/bookmarks for a demonstration of implementing fuzzy.Sortable on your own structs and customising the fuzzy sort settings. Fuzzy matching is done by package https://godoc.org/go.deanishe.net/fuzzy AwGo automatically configures the default log package to write to STDERR (Alfred's debugger) and a log file in the workflow's cache directory. The log file is necessary because background processes aren't connected to Alfred, so their output is only visible in the log. It is rotated when it exceeds 1 MiB in size. One previous log is kept. AwGo detects when Alfred's debugger is open (Workflow.Debug() returns true) and in this case prepends filename:linenumber: to log messages. The Config struct (which is included in Workflow as Workflow.Config) provides an interface to the workflow's settings from the Workflow Environment Variables panel (see https://www.alfredapp.com/help/workflows/advanced/variables/#environment). Alfred exports these settings as environment variables, and you can read them ad-hoc with the Config.Get*() methods, and save values back to Alfred/info.plist with Config.Set(). Using Config.To() and Config.From(), you can "bind" your own structs to the settings in Alfred: See the documentation for Config.To and Config.From for more information, and _examples/settings for a demo workflow based on the API. The Alfred struct provides methods for the rest of Alfred's AppleScript API. Amongst other things, you can use it to tell Alfred to open, to search for a query, to browse/action files & directories, or to run External Triggers. See documentation of the Alfred struct for more information. AwGo provides a basic, but useful, API for loading and saving data. In addition to reading/writing bytes and marshalling/unmarshalling to/from JSON, the API can auto-refresh expired cache data. See Cache and Session for the API documentation. Workflow has three caches tied to different directories: These all share (almost) the same API. The difference is in when the data go away. Data saved with Session are deleted after the user closes Alfred or starts using a different workflow. The Cache directory is in a system cache directory, so may be deleted by the system or "system maintenance" tools. The Data directory lives with Alfred's application data and would not normally be deleted. Subpackage util provides several functions for running script files and snippets of AppleScript/JavaScript code. See util for documentation and examples. AwGo offers a simple API to start/stop background processes via Workflow's RunInBackground(), IsRunning() and Kill() methods. This is useful for running checks for updates and other jobs that hit the network or take a significant amount of time to complete, allowing you to keep your Script Filters extremely responsive. See _examples/update and _examples/workflows for demonstrations of this API.
Package gbsearch provides a way to search the Google books API.
Package freegeoip provides an API for searching the geolocation of IP addresses. It uses a database that can be either a local file or a remote resource from a URL. Local databases are monitored by fsnotify and reloaded when the file is either updated or overwritten. Remote databases are automatically downloaded and updated in background so you can focus on using the API and not managing the database. Also, the freegeoip package provides http handlers that any Go http server (net/http) can use. These handlers can process IP geolocation lookup requests and return data in multiple formats like CSV, XML, JSON and JSONP. It has also an API for supporting custom formats.
Package itunes_search_go provides client for access to iTunes API. See https://affiliate.itunes.apple.com/resources/documentation/itunes-store-web-service-search-api
Package stringset provides the capabilities of a standard Set with strings, including the standard set operations and a specialized ability to have a "negative" set for intersections. Negative sets are useful for inverting the sense of a set when combining them using Intersection. Negative sets do not change the behavior of any other set operations, including Union and Difference. Negative operations were created for managing a set of tags, where it is useful to be able to search for items that contain some tags but do not contain others. The API is designed to be chainable so that common operations become one-liners, and it is designed to be easy to use with slices of strings. This package does not return errors. Set operations should be fast and chainable; adding items more than once, or attempting to remove things that don't exist are not errors.
Package conf is a configuration parser that loads configuration in a struct from files and automatically creates cli flags. Just define a struct with needed configuration. Values are then taken from multiple source in this order of precedence: Default is to use lower cased field names in struct to create command line arguments. Tags can be used to specify different names, command line help message and section in conf file. Format is: conf.Path variable can be set to the path of a configuration file. For default it is initialized to the value of environment variable: Files ending with: will be parsed as INI informal standard: First letter of every key found is upper cased and than, a struct field with same name is searched: If such field name is not found than comparison is made against key specified as first element in tag. conf tries to be modular and easily extensible to support different formats. This is a work in progress, APIs can change.
Package postkr provides access to APIs of epost.kr, Korean post office. The epost.kr provide APIs for track snail-mail and search zip-code. But, I can't access to the tracking API with my key. So, It's not implemented currently. You need own key which is issued by eport.kr. Get one from this link:
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include: Breiman and Cutler's Random Forest for Classification and Regression Adaptive Boosting (AdaBoost) Classification Gradiant Boosting Tree Regression Entropy and Cost driven classification L1 regression Feature selection with artificial contrasts Proximity and model structure analysis Roughly balanced bagging for unbalanced classification The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and can not yet be released. Library Documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/IlyaLab/CloudForest Documentation of command line utilities and file formats can be found in README.md, which can be viewed fromated on github: http://github.com/IlyaLab/CloudForest Pull requests and bug reports are welcome. CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi threaded and cluster environments in mind. Go's support for function types is used to provide a interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method: This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects. Decision tree's are grown with the goal of reducing "Impurity" which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface which allows for alternative definitions of impurity. CloudForest includes several alternative targets: Additional targets can be stacked on top of these target to add boosting functionality: Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like: can return left and right slices that point to the same underlying array as the original slice of cases but these slices should not have their values changed. Functions used while searching for the best split also accepts pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like: For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit which relies on go's sorting package. Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate go routines. The growforest utility provides an example of this that uses go routines and channels to grow trees in parallel and write trees to disk as the are finished by the "worker" go routines. The few summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, distributed database etc. By default cloud forest uses a fast heuristic for missing values. When proposing a split on a feature with missing data the missing cases are removed and the impurity value is corrected to use three way impurity which reduces the bias towards features with lots of missing data: Missing values in the target variable are left out of impurity calculations. This provided generally good results at a fraction of the computational costs of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode which Brieman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Brieman describes. Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing. At some point in the future support may be added for local imputing of missing values during tree growth as described in [3] [1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 [2] https://code.google.com/p/rf-ace/ [3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record In CloudForest data is stored using the FeatureMatrix struct which contains Features. The Feature struct implements storage and methods for both categorical and numerical data and calculations of impurity etc and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow implements Brieman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method including sampling cases but it may be faster to grow the forest to disk as in the growforest utility. Prediction and Voting is done using Tree.Vote and CatBallotBox and NumBallotBox which implement the VoteTallyer interface.
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include: Breiman and Cutler's Random Forest for Classification and Regression Adaptive Boosting (AdaBoost) Classification Gradiant Boosting Tree Regression Entropy and Cost driven classification L1 regression Feature selection with artificial contrasts Proximity and model structure analysis Roughly balanced bagging for unbalanced classification The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and can not yet be released. Library Documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/IlyaLab/CloudForest Documentation of command line utilities and file formats can be found in README.md, which can be viewed fromated on github: http://github.com/IlyaLab/CloudForest Pull requests and bug reports are welcome. CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi threaded and cluster environments in mind. Go's support for function types is used to provide a interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method: This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects. Decision tree's are grown with the goal of reducing "Impurity" which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface which allows for alternative definitions of impurity. CloudForest includes several alternative targets: Additional targets can be stacked on top of these target to add boosting functionality: Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like: can return left and right slices that point to the same underlying array as the original slice of cases but these slices should not have their values changed. Functions used while searching for the best split also accepts pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like: For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit which relies on go's sorting package. Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate go routines. The growforest utility provides an example of this that uses go routines and channels to grow trees in parallel and write trees to disk as the are finished by the "worker" go routines. The few summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, distributed database etc. By default cloud forest uses a fast heuristic for missing values. When proposing a split on a feature with missing data the missing cases are removed and the impurity value is corrected to use three way impurity which reduces the bias towards features with lots of missing data: Missing values in the target variable are left out of impurity calculations. This provided generally good results at a fraction of the computational costs of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode which Brieman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Brieman describes. Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing. At some point in the future support may be added for local imputing of missing values during tree growth as described in [3] [1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 [2] https://code.google.com/p/rf-ace/ [3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record In CloudForest data is stored using the FeatureMatrix struct which contains Features. The Feature struct implements storage and methods for both categorical and numerical data and calculations of impurity etc and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow implements Brieman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method including sampling cases but it may be faster to grow the forest to disk as in the growforest utility. Prediction and Voting is done using Tree.Vote and CatBallotBox and NumBallotBox which implement the VoteTallyer interface.