** Copyright 2014 Edward Walker
**
** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
**
**     http://www.apache.org/licenses/LICENSE-2.0
**
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.
**
** Description: This header applies to each source file in the library. The individual files are described as follows:
**   - Caches Q matrix rows. The cache implements an LRU (Least Recently Used) eviction policy.
**   - Implements the linear, radial-basis function, sigmoid, and polynomial kernels.
**   - Model describes the properties of the Support Vector Machine after training.
**   - Input/output routines for the Support Vector Machine model.
**   - Useful types/methods for running loops in parallel.
**   - Describes the parameters of the Support Vector Machine solver.
**   - Prediction related APIs.
**   - Probability estimation APIs.
**   - Describes the problem, i.e. the label/vector set.
**   - Q matrix for Support Vector Classification (svcQ), Support Vector Regression (svrQ), and One-Class Support Vector Machines (oneClassQ).
**   - Sequential Minimal Optimization (SMO) solver. Ref: C.-C. Chang and C.-J. Lin. "LIBSVM: A library for support vector machines". ACM Transactions on Intelligent Systems and Technology 2 (2011).
**   - Functions for calling the solver for different problem scenarios, i.e. SVC, SVR, or One-Class.
**   - Useful functions used in various parts of the library.
**   - Working-set selection. Ref: R.-E. Fan, P.-H. Chen, and C.-J. Lin. "Working set selection using second order information for training SVM". Journal of Machine Learning Research 6 (2005).
**   - Cross validation API.
** @author: Ed Walker
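For reference, the four kernels named in these headers are conventionally defined as follows. These are the standard LIBSVM-style definitions, not formulas copied from this library's source; gamma, r, and d are the kernel parameters and x, y are feature vectors:

    \begin{aligned}
    K_{\mathrm{linear}}(x, y)  &= x^\top y \\
    K_{\mathrm{RBF}}(x, y)     &= \exp\!\left(-\gamma \lVert x - y \rVert^2\right) \\
    K_{\mathrm{sigmoid}}(x, y) &= \tanh\!\left(\gamma\, x^\top y + r\right) \\
    K_{\mathrm{poly}}(x, y)    &= \left(\gamma\, x^\top y + r\right)^d
    \end{aligned}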
Package golearn is a machine learning library for Go.
Package face implements face recognition for Go using dlib, a popular machine learning toolkit. This example shows the basic usage of the package: create a recognizer, recognize faces, and classify them using a few known ones.
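A minimal sketch of that flow, assuming the github.com/Kagami/go-face API; the models directory and the image file names are placeholders:

    package main

    import (
        "fmt"
        "log"

        "github.com/Kagami/go-face"
    )

    func main() {
        // Create a recognizer; "models" must contain the dlib model files.
        rec, err := face.NewRecognizer("models")
        if err != nil {
            log.Fatalf("cannot initialize recognizer: %v", err)
        }
        defer rec.Close()

        // Recognize all faces in a known image and register them as samples.
        known, err := rec.RecognizeFile("known.jpg")
        if err != nil {
            log.Fatalf("cannot recognize known faces: %v", err)
        }
        var samples []face.Descriptor
        var cats []int32
        for i, f := range known {
            samples = append(samples, f.Descriptor)
            cats = append(cats, int32(i))
        }
        rec.SetSamples(samples, cats)

        // Classify a face from a new image against the known samples.
        unknown, err := rec.RecognizeSingleFile("unknown.jpg")
        if err != nil || unknown == nil {
            log.Fatal("no single face found in unknown.jpg")
        }
        fmt.Println("matched category:", rec.Classify(unknown.Descriptor))
    }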
Package ml provides a set of functions for machine learning.
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include: Breiman and Cutler's Random Forest for classification and regression, Adaptive Boosting (AdaBoost) classification, Gradient Boosting Tree regression, entropy and cost driven classification, L1 regression, feature selection with artificial contrasts, proximity and model structure analysis, and roughly balanced bagging for unbalanced classification.

The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and cannot yet be released.

Library documentation is in code and can be viewed with godoc or live at http://godoc.org/github.com/IlyaLab/CloudForest. Documentation of command line utilities and file formats can be found in README.md, which can be viewed formatted on github: http://github.com/IlyaLab/CloudForest. Pull requests and bug reports are welcome.

CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data, with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute.

CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi-threaded and cluster environments in mind.

Go's support for function types is used to provide an interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method. This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc.) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects.

Decision trees are grown with the goal of reducing "Impurity", which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface, which allows for alternative definitions of impurity. CloudForest includes several alternative targets, and additional targets can be stacked on top of these targets to add boosting functionality.

Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning, and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In Go, slices pointing at the same underlying arrays make this sort of optimization transparent.
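As a hypothetical illustration of that idiom (splitCases is not CloudForest's actual function), an in-place partition can hand back aliasing sub-slices without allocating:

    // splitCases partitions the case indices in place so that cases sent
    // left come first. The returned slices alias the input's backing array,
    // so no allocation occurs -- but their contents must not be mutated
    // independently.
    func splitCases(cases []int, goesLeft func(c int) bool) (left, right []int) {
        i := 0
        for j, c := range cases {
            if goesLeft(c) {
                cases[i], cases[j] = cases[j], cases[i]
                i++
            }
        }
        return cases[:i], cases[i:]
    }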
Such a function can return left and right slices that point to the same underlying array as the original slice of cases, but these slices should not have their values changed. Functions used while searching for the best split also accept pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items, and its use can be seen in the best-split functions. For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories, implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit, which relies on Go's sorting package.

Training a Random Forest is an inherently parallel process, and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate goroutines. The growforest utility provides an example of this that uses goroutines and channels to grow trees in parallel and write trees to disk as they are finished by the "worker" goroutines (a schematic sketch of this worker pattern appears at the end of this package description). Summary statistics like mean impurity decrease per feature (importance) can be calculated using thread-safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation, and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine, Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, a distributed database, etc.

By default CloudForest uses a fast heuristic for missing values. When proposing a split on a feature with missing data, the missing cases are removed and the impurity value is corrected to use three-way impurity, which reduces the bias towards features with lots of missing data. Missing values in the target variable are left out of impurity calculations. This provides generally good results at a fraction of the computational cost of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode, which Breiman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity-weighted imputation Breiman describes. Experimental support is provided for 3-way splitting, which splits missing cases onto a third branch [2]. This has so far yielded mixed results in testing. At some point in the future, support may be added for local imputing of missing values during tree growth as described in [3].

[1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1
[2] https://code.google.com/p/rf-ace/
[3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record

In CloudForest data is stored using the FeatureMatrix struct, which contains Features.
The Feature struct implements storage and methods for both categorical and numerical data, calculations of impurity, etc., and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow method that implements Breiman and Cutler's method (see the extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method, including sampling cases, but it may be faster to grow the forest to disk as in the growforest utility. Prediction and voting are done using Tree.Vote and the CatBallotBox and NumBallotBox structs, which implement the VoteTallyer interface.
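To make the worker pattern mentioned above concrete, here is a schematic sketch of growing trees in parallel with goroutines and channels. Tree and growTree are hypothetical stand-ins, not the CloudForest API:

    package main

    import (
        "fmt"
        "sync"
    )

    // Tree is a hypothetical stand-in for a grown decision tree; growTree
    // stands in for the per-tree work done by the real Tree.Grow.
    type Tree struct{ ID int }

    func growTree(id int) *Tree { return &Tree{ID: id} }

    func main() {
        const nTrees, nWorkers = 100, 4

        jobs := make(chan int)
        grown := make(chan *Tree)

        var wg sync.WaitGroup
        for w := 0; w < nWorkers; w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for id := range jobs {
                    grown <- growTree(id)
                }
            }()
        }

        // Close the results channel once every worker has drained the jobs.
        go func() {
            wg.Wait()
            close(grown)
        }()

        // Feed tree indices to the workers.
        go func() {
            for i := 0; i < nTrees; i++ {
                jobs <- i
            }
            close(jobs)
        }()

        // A single consumer collects finished trees, e.g. appending each
        // one to a .sf file on disk as it arrives.
        for t := range grown {
            fmt.Println("finished tree", t.ID)
        }
    }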
Package provides easy integration with the AWS Comprehend APIs. Amazon Comprehend is a platform within Amazon Web Services (AWS) that uses machine learning to find insights in unstructured text. You can either use pre-trained text analysis models or customize your own to extract specific pieces of information, identify sentiment, and find topics in a collection of documents.
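Since this excerpt doesn't show the wrapper package's own API, here is a minimal sketch of calling the sentiment API directly through aws-sdk-go; the region and example text are arbitrary:

    package main

    import (
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/comprehend"
    )

    func main() {
        sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
        svc := comprehend.New(sess)

        // Ask the pre-trained sentiment model to score a piece of text.
        out, err := svc.DetectSentiment(&comprehend.DetectSentimentInput{
            LanguageCode: aws.String("en"),
            Text:         aws.String("I love this product."),
        })
        if err != nil {
            log.Fatalf("DetectSentiment failed: %v", err)
        }
        fmt.Println("sentiment:", aws.StringValue(out.Sentiment))
    }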
Package matf extracts the content from MAT-files and makes it available in Go. You can then use your favorite machine learning environment to make further use of the extracted data; for example, you can use the data in gonum or in gorgonia.
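A minimal sketch of the gonum side, assuming the values have already been extracted into a flat, row-major []float64 (the matf extraction calls themselves are not shown in this excerpt):

    package main

    import (
        "fmt"

        "gonum.org/v1/gonum/mat"
    )

    func main() {
        // Assume a 2x3 array has already been extracted from a MAT-file
        // into row-major order.
        data := []float64{1, 2, 3, 4, 5, 6}

        m := mat.NewDense(2, 3, data)
        fmt.Println(mat.Formatted(m))
    }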
Goiardi is an implementation of the Chef server (http://www.opscode.com) written in Go. It can either run entirely in memory, with the option to save and load the in-memory data and search indexes to and from disk, drawing inspiration from chef-zero, or it can use MySQL or PostgreSQL as its storage backend.

Like all software, it is a work in progress. Goiardi now, though, should have all the functionality of the open source Chef Server, plus some extras like reporting and event logging. It does not support other Enterprise Chef type features like organizations or pushy at this time. When used, knife works, and chef-client runs complete successfully. Almost all chef-pedant tests run successfully, with a few disagreements about error messages that don't impact the clients. It does pretty well against the official chef-pedant, but because goiardi handles some authentication matters a little differently than the official chef-server, there is also a fork of chef-pedant located at https://github.com/ctdk/chef-pedant that's more custom tailored to goiardi. Many go tests are present as well in different goiardi subdirectories.

Goiardi currently has eight dependencies: go-flags, go-cache, go-trie, toml, the mysql driver from go-sql-driver, the postgres driver, logger, and go-uuid. To install them, run the appropriate go get commands from your $GOROOT, or just use the -t flag when you go get goiardi. If you would like to modify the search grammar, you'll need the 'peg' package; install it with go get as well. In the 'search/' directory, run 'peg -switch -inline search-parse.peg' to generate the new grammar. If you don't plan on editing the search grammar, though, you won't need that.

1. Install go (http://golang.org/doc/install.html). You may need to upgrade to go 1.2 to compile all the dependencies. Go 1.3 is also confirmed to work.
2. Make sure your $GOROOT and PATH are set up correctly per the Go installation instructions.
3. Download goiardi and its dependencies.
4. Run tests, if desired. Several goiardi subdirectories have go tests, and chef-pedant can and should be used for testing goiardi as well.
5. Install the goiardi binaries.
6. Run goiardi.

Or, you can look at the goiardi releases page on github at https://github.com/ctdk/goiardi/releases and see if there are precompiled binaries available for your platform.

You can get a list of command-line options with the '-h' flag. Goiardi can also take a config file, run like: goiardi -c /path/to/conf-file. See etc/goiardi.conf-sample for an example documented configuration file. Options in the configuration file share the same name as the long command line arguments (so, for example, --ipaddress=127.0.0.1 on the command line would be ipaddress = "127.0.0.1" in the config file). For more documentation on Chef, see http://docs.opscode.com.

If goiardi is not running in use-auth mode, it does not actually care about .pem files at all. You still need to have one to keep knife and chef-client happy. It's like chef-zero in that regard. If goiardi is running in use-auth mode, then proper keys are needed. When goiardi is started, if the chef-webui and chef-validator clients, and the admin user, are not present, it will create new keys in the --conf-root directory. Use them as you would normally for validating clients, performing tasks with the admin user, or using chef-webui if webui will run in front of goiardi. In auth mode, goiardi supports versions 1.0, 1.1, and 1.2 of the Chef authentication protocol.
*Note:* The admin user, when created on startup, does not have a password. This prevents logging in to the webui with the admin user, so a password will have to be set for admin before doing so.

By default, goiardi logs to standard output. A log file may be specified with the `-L/--log-file` flag, or goiardi can log to syslog with the `-s/--syslog` flag on platforms that support syslog. Attempting to use syslog on a platform that does not support it (currently Windows and plan9, although plan9 doesn't build for other reasons) will result in an error. Log levels can be set in goiardi with either the `log-level` option in the configuration file, or with one to four -V flags on the command line. Log level options are "debug", "info", "warning", "error", and "critical". More -V on the command line means more spewing into the log.

Goiardi can now use MySQL to store its data, instead of keeping all its data in memory (and optionally freezing its data to disk for persistence). If you want to use MySQL, you (unsurprisingly) need a MySQL installation that goiardi can access. This document assumes that you are able to install, configure, and run MySQL. Once the MySQL server is set up to your satisfaction, you'll need to install sqitch to deploy the schema, and any changes to the database schema that may come along later. It can be installed out of CPAN or homebrew; see "Installation" on http://sqitch.org for details. The sqitch MySQL tutorial at https://metacpan.org/pod/sqitchtutorial-mysql explains how to deploy, verify, and revert changes to the database with sqitch, but the basic steps to deploy the schema are:

* Create goiardi's database: `mysql -u root --execute 'CREATE DATABASE goiardi'`
* Optionally, create a separate mysql user for goiardi and give it permissions on that database.
* In sql-files/mysql-bundle, deploy the bundle: `sqitch deploy db:mysql://root[:<password>]@/goiardi`

To update an existing database deployed by sqitch, run the `sqitch deploy` command above again. If you really really don't want to install sqitch, apply each SQL patch in sql-files/mysql-bundle by hand in the same order they're listed in the sqitch.plan file. The above values are for illustration, of course; nothing requires goiardi's database to be named "goiardi". Just make sure the right database is specified in the config file.

Set `use-mysql = true` in the configuration file, or specify `--use-mysql` on the command line. It is an error to specify both the `-D`/`--data-file` flag and `--use-mysql` at the same time. At this time, the mysql connection options have to be defined in the config file. An example configuration is available in `etc/goiardi.conf-sample`.

Goiardi can also use Postgres as a backend for storing its data, instead of using MySQL or the in-memory data store. The overall procedure is pretty similar to setting up goiardi to use MySQL. Specifically for Postgres, you may want to create a database especially for goiardi, but it's not mandatory. If you do, you may also want to create a user for it. If you decide to do that:

* Create the user: `$ createuser goiardi <additional options>`
* Create the database, if you decided to: `$ createdb goiardi_db <additional options>`. If you created a user, make it the owner of the goiardi db with `-O goiardi`.

After you've done that, or decided to use an existing database and user, deploy the sqitch bundle in sql-files/postgres-bundle. If you're using the default Postgres user on the local machine, `sqitch deploy db:pg:<dbname>` will be sufficient.
Otherwise, the deploy command will be something like `sqitch deploy db:pg://user:password@localhost/goiardi_db`. The Postgres sqitch tutorial at https://metacpan.org/pod/sqitchtutorial explains more about how to use sqitch and Postgres.

Set `use-postgresql` in the configuration file, or specify `--use-postgresql` on the command line. As in MySQL mode, it's an error to specify both the `-D`/`--data-file` flag and `--use-postgresql` at the same time. MySQL and Postgres cannot be used at the same time, either. Like MySQL, the Postgres connection options must be specified in the config file at this time. There is also an example Postgres configuration in `etc/goiardi.conf-sample`.

Goiardi has optional event logging. When enabled with the `--log-events` command line option, or with the `"log-events"` option in the config file, changes to clients, users, cookbooks, data bags, environments, nodes, and roles will be tracked. The event log can be viewed through the /events API endpoint. If the `-K`/`--log-event-keep` option is set, then once a minute the event log will be automatically purged, leaving that many events in the log. This is particularly recommended when using the event log in in-memory mode. A user or client must be an administrator account to use the `/events` endpoint. The easiest way to use the event log is with the knife-goiardi-event-log knife plugin. It's available on rubygems, or at github at https://github.com/ctdk/knife-goiardi-event-log.

Goiardi now supports, on an experimental basis, Chef's reporting facilities. Nothing needs to be enabled in goiardi to use this, but changes are required with the client. See http://docs.opscode.com/reporting.html for details on how to enable reporting and how to use it. There is a goiardi extension to reporting: a "status" query parameter may be passed in a GET request that lists reports, to limit the reports returned to ones that match the status, so you can read only reports of chef runs that were successful, failed, or started but haven't completed yet. Valid values for the "status" parameter are "started", "success", and "failure". To use reporting, you'll either need the Chef knife-reporting plugin, or use the knife-goiardi-reporting plugin that supports querying runs by status. It's available on rubygems, or on github at https://github.com/ctdk/knife-goiardi-reporting. As this is an experimental feature, it may not work entirely correctly. Bug reports are appreciated.

Goiardi can now import and export its data in a JSON file. This can help both when upgrading, when the on-disk data format changes between releases, and to convert your goiardi installation from in-memory to MySQL (or vice versa). The JSON file has a version number set (currently 1.0), so that in the future, if there is some sort of incompatible change to the JSON file format, the importer will be able to handle it. Before importing data, you should back up any existing data and index files (and take a snapshot of the SQL db, if applicable) if there's any reason you might want them around later. After exporting, you may wish to hold on to the old installation data until you're satisfied that the import went well. Remember that the JSON export file contains the client and user public keys (which for the purposes of goiardi and chef are private) and the user hashed passwords and password salts. The export file should be guarded closely.
The `-x/--export` and `-m/--import` flags control importing and exporting data. To export data, stop goiardi, then run it again with the same options as before but adding `-x <filename>` to the command. This will export all the data to the given filename, and goiardi will exit. Importing is ever so slightly trickier. You should remove any existing data store and index files, and if using an SQL database, use sqitch to revert and deploy all of the SQL files to set up a completely clean schema for goiardi. Then run goiardi with the new options like you normally would, but add `-m <filename>`. Goiardi will run, import the new data, and exit. Assuming it went well, all the data will be imported. The export dump does not contain the user and client .pem files, so those will need to be saved and moved as needed. Theoretically a properly crafted export file could be used to do bulk loading of data into goiardi; thus goiardi does not wipe out the existing data on its own, but rather leaves that task to the administrator. This functionality is merely theoretical and completely untested. If you try it, you should back your data up first.

Goiardi has been built and run with the native 6g compiler on Mac OS X (10.7, 10.8, and 10.9), Debian squeeze and wheezy, a fairly recent Arch Linux, FreeBSD 9.2, and Solaris. Using Go's cross compiling capabilities, goiardi builds for all of Go's supported platforms except Dragonfly BSD and plan9 (because of issues with the postgres client library). Windows support has not been tested extensively, but a cross compiled binary has been tested successfully on Windows. Goiardi has also been built and run with gccgo (using the "-compiler gccgo" option with the "go" command) on Arch Linux. Building it with gccgo without the go command probably works, but it hasn't happened yet. This is a priority, though, so goiardi can be built on platforms the native compiler doesn't support yet.

As mentioned above, goiardi can now freeze its in-memory data store and index to disk if specified. It will save before quitting if the program receives a SIGTERM or SIGINT signal, along with saving every "freeze-interval" seconds automatically. Saving automatically helps guard against the case where the server receives a signal that it can't handle and that forces it to quit. In addition, goiardi will not replace the old save files until the new one has finished writing. However, it's still not anywhere near a real database with transaction protection, etc., so while it should work fine in the general case, possibilities for data loss and corruption do exist. The appropriate caution is warranted.

In addition to the aforementioned Chef documentation at http://docs.opscode.com, more documentation specific to goiardi can be viewed with godoc. See http://godoc.org/code.google.com/p/go.tools/cmd/godoc for an explanation of how godoc works. See the TODO file for an up-to-date list of what needs to be done. There's a lot. As for bugs, there are going to be a lot of them for a while, so we'll just keep those in a BUGS file, won't we?

This started as a project to learn Go, and because I thought that an in-memory chef server would be handy. Then I found out about chef-zero, but I still wanted a project to learn Go, so I kept it up. Chef 11 Server also only runs under Linux at this time, while Goiardi is developed under Mac OS X and ought to run under any platform Go supports (only partially at this time though). If you feel like contributing, great! Just fork the repo, make your improvements, and submit a pull request.
Tests would, of course, be appreciated. Adding tests where there are no tests currently would be even more appreciated. At least, though, try not to break anything worse than it is. Test coverage has improved, but is still an ongoing concern.

Goiardi is authored and copyright (c) Jeremy Bingham, 2013. Like many Chef ecosystem programs, goiardi is licensed under the Apache 2.0 License. See the LICENSE file for details. Chef is copyright (c) 2008-2013 Opscode, Inc. and its various contributors. Thanks go out to the fine folks of Opscode and the Chef community for all their hard work.

Also, if you were wondering, Ettore Boiardi was the man behind Chef Boyardee. Wakka wakka.
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang). It includes implementations of Breiman and Cutler's Random Forest for classification and regression on heterogeneous numerical/categorical data with missing values, several related algorithms including entropy and cost driven classification, L1 regression and feature selection with artificial contrasts, and hooks for modifying the algorithms for your needs. Command line utilities to grow, apply and analyze forests are provided in sub directories.

CloudForest is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data, with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. Documentation has been generated with godoc and can be viewed live at http://godoc.org/github.com/ryanbressler/CloudForest. Pull requests and bug reports are welcome; the code repo and issue tracker can be found at https://github.com/ryanbressler/CloudForest.

CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi-threaded and cluster environments in mind.

Go's support for function types is used to provide an interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method. This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc.) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects.

Decision trees are grown with the goal of reducing "Impurity", which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface, which allows for alternative definitions of impurity; CloudForest includes several alternative targets.

Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning, and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In Go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example, a case-splitting function can return left and right slices that point to the same underlying array as the original slice of cases, but these slices should not have their values changed. Functions used while searching for the best split also accept pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum.
BestSplitAllocs contains pointers to these items, and its use can be seen in the best-split functions. For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories, implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit, which relies on Go's sorting package.

Training a Random Forest is an inherently parallel process, and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate goroutines. The growforest utility provides an example of this that uses goroutines and channels to grow trees in parallel and write trees to disk as they are finished by the "worker" goroutines. Summary statistics like mean impurity decrease per feature (importance) can be calculated using thread-safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation, and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine, Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, a distributed database, etc.

By default CloudForest uses a fast heuristic for missing values. When proposing a split on a feature with missing data, the missing cases are removed and the impurity value is corrected to use three-way impurity, which reduces the bias towards features with lots of missing data. Missing values in the target variable are left out of impurity calculations. This provides generally good results at a fraction of the computational cost of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode, which Breiman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity-weighted imputation Breiman describes. Experimental support is provided for 3-way splitting, which splits missing cases onto a third branch [2]. This has so far yielded mixed results in testing. At some point in the future, support may be added for local imputing of missing values during tree growth as described in [3].

[1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1
[2] https://code.google.com/p/rf-ace/
[3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record

Variable importance in CloudForest is calculated as the mean decrease in impurity over all of the splits made using a feature. To provide a baseline for evaluating importance, artificial contrast features can be used by including shuffled copies of existing features; a sketch of this idea follows below.

In CloudForest data is stored using the FeatureMatrix struct which contains Features. The Feature struct implements storage and methods for both categorical and numerical data, calculations of impurity, etc., and the search for the best split.
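A minimal, hypothetical sketch of building an artificial contrast. The addContrast helper is not part of the CloudForest API; it merely illustrates the shuffled-copy idea:

    import "math/rand"

    // addContrast returns a shuffled copy of a numerical feature's values.
    // The copy preserves the marginal distribution of the feature while
    // destroying any relationship with the target, so the importance it
    // earns provides a baseline for judging the real feature's importance.
    func addContrast(vals []float64, rng *rand.Rand) []float64 {
        contrast := append([]float64(nil), vals...)
        rng.Shuffle(len(contrast), func(i, j int) {
            contrast[i], contrast[j] = contrast[j], contrast[i]
        })
        return contrast
    }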
The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow method that implements Breiman and Cutler's method (see the extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method, including sampling cases, but it may be faster to grow the forest to disk as in the growforest utility. Prediction and voting are done using Tree.Vote and the CatBallotBox and NumBallotBox structs, which implement the VoteTallyer interface.

When compiled with go1.1, CloudForest achieves running times similar to implementations in other languages. Using gccgo (4.8.0 at least) results in longer running times and is not recommended until full go1.1 support is implemented in gcc 4.8.1.

Development of CloudForest is being driven by our needs as we analyze large biomedical data sets. As such, new and modified analysis will be added as needed. The basic functionality has stabilized, but we have discussed several possible changes that may require additional abstraction and/or changes in the api. These include allowing additional types of candidate features (some multidimensional data types may not be best served by decomposition into categorical and numerical features; it would be possible to allow arbitrary feature types by adding CandidateFeature (which should expose BestSplit), CodedSplitter and Splitter abstractions) and allowing data to reside anywhere (this would involve abstracting FeatureMatrix to allow database etc. driven implementations).

"growforest" trains a forest using parameters which can be listed with -h. "applyforest" applies a forest to the specified feature matrix and outputs predictions as a two column (caselabel predictedvalue) tsv. "errorrate" calculates the error of a forest vs a testing data set and reports it to standard out. "leafcount" outputs counts of case-case co-occurrence on leaf nodes (Breiman's proximity) and counts of the number of times a feature is used to split a node containing each case (a measure of relative/local importance).

CloudForest borrows the annotated feature matrix (.afm) and stochastic forest (.sf) file formats from Timo Erkkila's rf-ace, which can be found at https://code.google.com/p/rf-ace/.

An annotated feature matrix (.afm) file is a tab delimited file with column and row headers. Columns represent cases and rows represent features. A row header/feature id includes a prefix to specify the feature type. Categorical and boolean features use strings for their category labels. Missing values are represented by "?", "nan", "na", or "null" (case insensitive).

A stochastic forest (.sf) file contains a forest of decision trees. The main advantage of this format as opposed to an established format like json is that an sf file can be written iteratively, tree by tree, and multiple .sf files can be combined with minimal logic required, allowing for massively parallel growth of forests with low memory use. An .sf file consists of lines, each of which is a comma separated list of key value pairs. Lines can designate either a FOREST, TREE, or NODE. Each tree belongs to the preceding forest and each node to the preceding tree. Nodes must be written in order of increasing depth. CloudForest generates fewer fields than rf-ace but requires the following.
Other fields will be ignored. Forest lines require the forest type (only RF currently), the target, and ntrees. Tree lines require only an int, and the value is ignored, though the line is needed to designate a new tree. Node lines require a path encoded so that the root node is specified by "*" and each split left or right as "L" or "R". Leaf nodes should also define PRED, such as "PRED=1.5" or "PRED=red". Splitter nodes should define SPLITTER with a feature id inside of double quotes, SPLITTERTYPE=[CATEGORICAL|NUMERICAL], and an LVALUE term which can be either a float inside of double quotes representing the highest value sent left, or a ":" separated list of categorical values sent left. An example .sf file constructed from these rules appears at the end of this description. CloudForest can parse and apply .sf files generated by at least some versions of rf-ace.

The idea for (and trademark of the term) Random Forests originated with Leo Breiman and Adele Cutler. Their code and papers can be found at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm. All code in CloudForest is original, but some ideas for methods and optimizations were inspired by Timo Erkkila's rf-ace and by Andy Liaw and Matthew Wiener's randomForest R package based on Breiman and Cutler's code: https://code.google.com/p/rf-ace/ http://cran.r-project.org/web/packages/randomForest/index.html

The idea for artificial contrasts was found in Eugene Tuv, Alexander Borisov, George Runger and Kari Torkkola's paper "Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination": http://www.researchgate.net/publication/220320233_Feature_Selection_with_Ensembles_Artificial_Variables_and_Redundancy_Elimination/file/d912f5058a153a8b35.pdf

The idea for growing trees to minimize categorical entropy comes from Ross Quinlan's ID3: http://en.wikipedia.org/wiki/ID3_algorithm

"The Elements of Statistical Learning" 2nd edition by Trevor Hastie, Robert Tibshirani and Jerome Friedman was also consulted during development.
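A small .sf file constructed from the rules above might look like the following. The feature id, the values, and the "N:" numerical type prefix on the feature id are illustrative assumptions, not output from a real run:

    FOREST=RF,TARGET="N:target",NTREES=1
    TREE=0
    NODE=*,SPLITTER="N:feature1",SPLITTERTYPE=NUMERICAL,LVALUE="3.0"
    NODE=*L,PRED=1.5
    NODE=*R,PRED=4.2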
Package lingua accurately detects the natural language of written text, be it long or short. Its task is simple: it tells you which language some provided textual data is written in. This is very useful as a preprocessing step for linguistic data in natural language processing applications such as text classification and spell checking. Other use cases, for instance, might include routing e-mails to the right geographically located customer service department, based on the e-mails' languages.

Language detection is often done as part of large machine learning frameworks or natural language processing applications. In cases where you don't need the full-fledged functionality of those systems or don't want to learn the ropes of those, a small flexible library comes in handy. So far, the only other comprehensive open source library in the Go ecosystem for this task is Whatlanggo (https://github.com/abadojack/whatlanggo). Unfortunately, it has two major drawbacks:

1. Detection only works with quite lengthy text fragments. For very short text snippets such as Twitter messages, it does not provide adequate results.
2. The more languages take part in the decision process, the less accurate the detection results are.

Lingua aims at eliminating these problems. It needs almost no configuration and yields pretty accurate results on both long and short text, even on single words and phrases. It draws on both rule-based and statistical methods but does not use any dictionaries of words. It does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline.

Compared to other language detection libraries, Lingua's focus is on quality over quantity, that is, getting detection right for a small set of languages first before adding new ones. Currently, 75 languages are supported. They are listed as variants of type Language.

Lingua is able to report accuracy statistics for some bundled test data available for each supported language. The test data for each language is split into three parts:

1. a list of single words with a minimum length of 5 characters
2. a list of word pairs with a minimum length of 10 characters
3. a list of complete grammatical sentences of various lengths

Both the language models and the test data have been created from separate documents of the Wortschatz corpora (https://wortschatz.uni-leipzig.de) offered by Leipzig University, Germany. Data crawled from various news websites have been used for training, each corpus comprising one million sentences. For testing, corpora made of arbitrarily chosen websites have been used, each comprising ten thousand sentences. From each test corpus, a random unsorted subset of 1000 single words, 1000 word pairs and 1000 sentences has been extracted, respectively.

Given the generated test data, I have compared the detection results of Lingua and Whatlanggo running over the data of Lingua's 75 supported languages. Additionally, I have added Google's CLD3 (https://github.com/google/cld3/) to the comparison with the help of the gocld3 bindings (https://github.com/jmhodges/gocld3). Languages that are not supported by CLD3 or Whatlanggo are simply ignored during the detection process. The bar and box plots (https://github.com/pemistahl/lingua-go/blob/main/ACCURACY_PLOTS.md) show the measured accuracy values for all three performed tasks: single word detection, word pair detection and sentence detection. Lingua clearly outperforms its contenders.
Detailed statistics including mean, median and standard deviation values for each language and classifier are available in tabular form (https://github.com/pemistahl/lingua-go/blob/main/ACCURACY_TABLE.md) as well.

Every language detector uses a probabilistic n-gram (https://en.wikipedia.org/wiki/N-gram) model trained on the character distribution in some training corpus. Most libraries only use n-grams of size 3 (trigrams), which is satisfactory for detecting the language of longer text fragments consisting of multiple sentences. For short phrases or single words, however, trigrams are not enough. The shorter the input text is, the fewer n-grams are available, and the probabilities estimated from so few n-grams are not reliable. This is why Lingua makes use of n-grams of sizes 1 up to 5, which results in much more accurate prediction of the correct language.

A second important difference is that Lingua does not only use such a statistical model but also a rule-based engine. This engine first determines the alphabet of the input text and searches for characters which are unique to one or more languages. If exactly one language can be reliably chosen this way, the statistical model is not necessary anymore. In any case, the rule-based engine filters out languages that do not satisfy the conditions of the input text. Only then, in a second step, is the probabilistic n-gram model taken into consideration. This makes sense because loading fewer language models means less memory consumption and better runtime performance.

In general, it is always a good idea to restrict the set of languages to be considered in the classification process using the respective API methods. If you know beforehand that certain languages are never to occur in an input text, do not let those take part in the classification process. The filtering mechanism of the rule-based engine is quite good; however, filtering based on your own knowledge of the input text is always preferable. There might be classification tasks where you know beforehand that your language data is definitely not written in Latin, for instance. The detection accuracy can become better in such cases if you exclude certain languages from the decision process or just explicitly include relevant languages.

Knowing the most likely language is nice, but how reliable is the computed likelihood? And how much less likely are the other examined languages in comparison to the most likely one? In the example below, a slice of ConfidenceValue is returned containing all possible languages sorted by their confidence value in descending order. The values that this method computes are part of a relative confidence metric, not an absolute one. Each value is a number between 0.0 and 1.0. The most likely language is always returned with value 1.0. All other languages get values assigned which are lower than 1.0, denoting how much less likely those languages are in comparison to the most likely language.

The slice returned by this method does not necessarily contain all languages which the calling instance of LanguageDetector was built from. If the rule-based engine decides that a specific language is truly impossible, then it will not be part of the returned slice. Likewise, if no n-gram probabilities can be found within the detector's languages for the given input text, the returned slice will be empty. The confidence value for each language not being part of the returned slice is assumed to be 0.0.
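A sketch of querying the relative confidence values described above, again assuming the lingua-go API, where ConfidenceValue exposes Language and Value accessors:

    package main

    import (
        "fmt"

        "github.com/pemistahl/lingua-go"
    )

    func main() {
        detector := lingua.NewLanguageDetectorBuilder().
            FromLanguages(lingua.English, lingua.French, lingua.German, lingua.Spanish).
            Build()

        // All candidate languages, most likely first; the top language
        // always has value 1.0 and the rest are relative to it.
        confidenceValues := detector.ComputeLanguageConfidenceValues("languages are awesome")
        for _, elem := range confidenceValues {
            fmt.Printf("%s: %.2f\n", elem.Language(), elem.Value())
        }
    }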
By default, Lingua uses lazy-loading to load only those language models on demand which are considered relevant by the rule-based filter engine. For web services, for instance, it is rather beneficial to preload all language models into memory to avoid unexpected latency while waiting for the service response. If you want to enable the eager-loading mode, you can do it as seen below. Multiple instances of LanguageDetector share the same language models in memory, which are accessed asynchronously by the instances.

By default, Lingua returns the most likely language for a given input text. However, there are certain words that are spelled the same in more than one language. The word `prologue`, for instance, is both a valid English and French word. Lingua would output either English or French, which might be wrong in the given context. For cases like that, it is possible to specify a minimum relative distance that the logarithmized and summed up probabilities for each possible language have to satisfy. It can be stated as seen below.

Be aware that the distance between the language probabilities is dependent on the length of the input text. The longer the input text, the larger the distance between the languages. So if you want to classify very short text phrases, do not set the minimum relative distance too high. Otherwise Unknown will be returned most of the time, as in the example below. This is the return value for cases where language detection is not reliably possible.
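The two configuration knobs discussed above might look as follows; the builder method names are taken from the lingua-go README and should be treated as assumptions if your version differs:

    package main

    import (
        "fmt"

        "github.com/pemistahl/lingua-go"
    )

    func main() {
        // Eager loading: preload all language models at build time
        // instead of lazily on first use.
        _ = lingua.NewLanguageDetectorBuilder().
            FromAllLanguages().
            WithPreloadedLanguageModels().
            Build()

        // Minimum relative distance: demand a clear winner. With short
        // input and a high distance such as 0.9, detection often fails
        // and the second return value is false (Unknown).
        cautious := lingua.NewLanguageDetectorBuilder().
            FromLanguages(lingua.English, lingua.French, lingua.German).
            WithMinimumRelativeDistance(0.9).
            Build()

        language, exists := cautious.DetectLanguageOf("languages")
        fmt.Println(language, exists)
    }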
Package nlp provides implementations of selected machine learning algorithms for natural language processing of text corpora. The initial primary focus is on the implementation of algorithms supporting LSA (Latent Semantic Analysis), often referred to as Latent Semantic Indexing in the context of information retrieval.

The algorithms in the package typically support document input as text strings, which are then encoded as a matrix of numerical feature vectors called a `term document matrix`. Columns in this matrix represent the documents in the corpus and the rows represent terms occurring in the documents. The individual elements within the matrix contain counts of the number of occurrences of each term in the associated document. This matrix can be manipulated through the application of additional transformations for weighting features, identifying relationships or optimising the data for analysis, information retrieval and/or predictions. A common transformation is for the purpose of weighting features to remove natural biases which would skew results, e.g. commonly occurring words like `the`, `of`, `and`, etc., which should carry lower weight than unusual words.

Term document matrices typically have a very large number of dimensions, and so transformations are often applied to reduce the dimensionality using techniques such as Locality Sensitive Hashing or Latent Semantic Analysis (typically performed using matrix SVD - `Singular Value Decomposition`), which approximates the original term document matrix with a new matrix of much lower rank (typically around 100 rather than 1000s). Truncated SVD is a fundamental part of LSA (Latent Semantic Analysis aka Latent Semantic Indexing) and serves a number of purposes:

1. The reduced dimensionality of the data theoretically requires less memory.

2. As less significant dimensions are removed, there is less `noise` in the data which could have artificially skewed results.

3. Perhaps most importantly, the SVD effectively encodes the co-occurrence of terms within the documents to capture semantic meaning rather than simply the presence (or lack of presence) of words. This combats the problem of synonymy (a common challenge in NLP) where different words in the English language can be used to mean the same thing (synonyms). In LSA, documents can have a high degree of semantic similarity with very few words in common.

The post-SVD matrix (with each column being a feature vector representing a document within the corpus) can be compared for similarity, with documents compared to each other (for clustering) or to a query (also represented as a feature vector projected into the same dimensional space). Similarity is measured by the angle between the two feature vectors being considered, as sketched below.
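Since similarity here is the angle between feature vectors, cosine similarity is the usual measure. Below is a small self-contained sketch using gonum/mat rather than this package's own API:

    package main

    import (
        "fmt"

        "gonum.org/v1/gonum/mat"
    )

    // cosine returns the cosine of the angle between two vectors:
    // dot(a, b) / (|a| * |b|). A value of 1 means identical direction.
    func cosine(a, b mat.Vector) float64 {
        return mat.Dot(a, b) / (mat.Norm(a, 2) * mat.Norm(b, 2))
    }

    func main() {
        doc := mat.NewVecDense(3, []float64{1, 2, 0})
        query := mat.NewVecDense(3, []float64{2, 4, 0})
        fmt.Println(cosine(doc, query)) // ≈ 1: same direction despite different magnitudes
    }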
histogram_sketch is an implementation of the Histogram Sketch data structure described in Ben-Haim and Tom-Tov's "A Streaming Parallel Decision Tree Algorithm" in Journal of Machine Learning Research 11 (http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf).

Modifications from Ben-Haim and Tom-Tov's original description in this implementation include:

1. Adaptation of the "Uniform" function described in the paper into a "Quantile" function here.

2. Allowing the initial set of centroids to be bootstrapped with the optimal 1-D centroid decomposition based on squared distance to the centroid mean via dynamic programming (see Bellman's "A note on cluster analysis and dynamic programming" in Mathematical Biosciences, 18(3-4):311-312, 1973, or Haizhou Wang and Mingzhou Song's "Ckmeans.1d.dp: Optimal k-means clustering in one dimension by dynamic programming" in R Journal, 3(2), 2011, http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Wang+Song.pdf).

3. Storing the min and max value for better estimation of extreme quantiles.

4. Returning exact values for Sum and Quantile when they're known (before any centroid merging happens).

5. Improvements in handling some boundary conditions.
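To make the centroid-merging idea concrete, here is a minimal self-contained sketch of the Ben-Haim/Tom-Tov update step. It is illustrative only: it is not this package's API and omits the min/max tracking and exactness optimizations listed above.

    package main

    import (
        "fmt"
        "sort"
    )

    type centroid struct {
        p float64 // position
        m float64 // count of points represented
    }

    // add inserts v as a unit-weight centroid and, if the sketch now
    // exceeds maxBins, merges the two closest adjacent centroids into
    // their weighted mean, as in Ben-Haim and Tom-Tov's algorithm.
    func add(cs []centroid, v float64, maxBins int) []centroid {
        cs = append(cs, centroid{p: v, m: 1})
        sort.Slice(cs, func(i, j int) bool { return cs[i].p < cs[j].p })
        if len(cs) <= maxBins {
            return cs
        }
        best := 0
        for i := 1; i < len(cs)-1; i++ {
            if cs[i+1].p-cs[i].p < cs[best+1].p-cs[best].p {
                best = i
            }
        }
        a, b := cs[best], cs[best+1]
        cs[best] = centroid{p: (a.p*a.m + b.p*b.m) / (a.m + b.m), m: a.m + b.m}
        return append(cs[:best+1], cs[best+2:]...)
    }

    func main() {
        var cs []centroid
        for _, v := range []float64{1, 2, 2.1, 9, 10} {
            cs = add(cs, v, 3)
        }
        fmt.Println(cs) // three centroids summarizing five points
    }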
** Copyright 2014 Edward Walker
**
** Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
**
** Description: Q matrix for Support Vector Classification (svcQ), Support Vector Regression (svrQ), and One-Class Support Vector Machines (oneClassQ)
** Description: Sequential Minimal Optimization (SMO) solver. Ref: C.-C. Chang, C.-J. Lin. "LIBSVM: A library for support vector machines". ACM Transactions on Intelligent Systems and Technology 2 (2011)
** Description: Functions for calling the solver for different problem scenarios, i.e. SVC, SVR, or One-Class
** Description: Useful functions used in various parts of the library
** Description: Working-set selection. Ref: R.-E. Fan, P.-H. Chen, and C.-J. Lin. "Working set selection using second order information for training SVM". Journal of Machine Learning Research 6 (2005)
** Description: Cross validation API
** @author: Ed Walker
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include:

Breiman and Cutler's Random Forest for Classification and Regression
Adaptive Boosting (AdaBoost) Classification
Gradient Boosting Tree Regression
Entropy and Cost driven classification
L1 regression
Feature selection with artificial contrasts
Proximity and model structure analysis
Roughly balanced bagging for unbalanced classification

The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and cannot yet be released.

Library documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/ryanbressler/CloudForest

Documentation of command line utilities and file formats can be found in README.md, which can be viewed formatted on GitHub: http://github.com/ryanbressler/CloudForest

Pull requests and bug reports are welcome.

CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute.

CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analyses quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi-threaded and cluster environments in mind.

Go's support for function types is used to provide an interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method; a self-contained sketch of this pattern appears below. This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc.) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects.

Decision trees are grown with the goal of reducing "Impurity", which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface, which allows for alternative definitions of impurity. CloudForest includes several alternative targets, and additional targets can be stacked on top of these to add boosting functionality.

Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning, and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In Go, slices pointing at the same underlying arrays make this sort of optimization transparent.
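The paragraph above refers to a code extract that is not reproduced in this doc. Below is a self-contained sketch of the closure-passed-to-Recurse pattern; the node type and recurse method are illustrative stand-ins, not CloudForest's actual signatures:

    package main

    import "fmt"

    type node struct {
        left, right *node
    }

    // recurse hands each node, the cases routed to it, and the current
    // depth to a caller-supplied closure. The closure can tabulate
    // statistics or even extend the tree it is walking.
    func (n *node) recurse(fn func(*node, []int, int), cases []int, depth int) {
        fn(n, cases, depth)
        // A real implementation would partition cases with the node's
        // splitter; this sketch passes them through unchanged.
        if n.left != nil {
            n.left.recurse(fn, cases, depth+1)
        }
        if n.right != nil {
            n.right.recurse(fn, cases, depth+1)
        }
    }

    func main() {
        root := &node{left: &node{}, right: &node{left: &node{}, right: &node{}}}
        leaves := 0
        root.recurse(func(n *node, cases []int, depth int) {
            if n.left == nil && n.right == nil {
                leaves++ // the kind of tabulation leafcount performs
            }
        }, []int{0, 1, 2, 3}, 0)
        fmt.Println("leaves:", leaves) // 3
    }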
For example, functions like the case-splitting one sketched at the end of this passage can return left and right slices that point to the same underlying array as the original slice of cases, but these slices should not have their values changed.

Functions used while searching for the best split also accept pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items, and its use can be seen in the split-search functions described next. For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories, implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit, which relies on Go's sorting package.

Training a Random Forest is an inherently parallel process, and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk.

Trees can be grown in separate goroutines. The growforest utility provides an example of this that uses goroutines and channels to grow trees in parallel and write trees to disk as they are finished by the "worker" goroutines. The few summary statistics, like mean impurity decrease per feature (importance), can be calculated using thread-safe data structures like RunningMean.

Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation, and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine, Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, a distributed database, etc.

By default CloudForest uses a fast heuristic for missing values. When proposing a split on a feature with missing data, the missing cases are removed and the impurity value is corrected to use three-way impurity, which reduces the bias towards features with lots of missing data. Missing values in the target variable are left out of impurity calculations. This has provided generally good results at a fraction of the computational cost of imputing data.

Optionally, feature.ImputeMissing or featurematrix.ImputeMissing can be called before forest growth to impute missing values to the feature mean/mode, which Breiman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity-weighted imputation Breiman describes [1].

Experimental support is provided for 3-way splitting, which splits missing cases onto a third branch [2]. This has so far yielded mixed results in testing. At some point in the future, support may be added for local imputing of missing values during tree growth as described in [3].

[1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1
[2] https://code.google.com/p/rf-ace/
[3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record

In CloudForest data is stored using the FeatureMatrix struct, which contains Features.
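As a concrete illustration of the in-place splitting described at the top of this passage, the function below partitions a slice of case indices and returns left/right sub-slices that share the original backing array. This is illustrative, not CloudForest's exact function:

    package main

    import "fmt"

    // splitCases reorders cases in place so that cases sent left precede
    // cases sent right, then returns two sub-slices that share the
    // original backing array. Neither returned slice's values should be
    // modified while both are in use, for exactly the reason given above.
    func splitCases(cases []int, goesLeft func(c int) bool) (left, right []int) {
        i := 0
        for j, c := range cases {
            if goesLeft(c) {
                cases[i], cases[j] = cases[j], cases[i]
                i++
            }
        }
        return cases[:i], cases[i:]
    }

    func main() {
        cases := []int{0, 1, 2, 3, 4, 5}
        left, right := splitCases(cases, func(c int) bool { return c%2 == 0 })
        fmt.Println(left, right) // [0 2 4] [3 1 5]; both views share cases' array
    }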
The Feature struct implements storage and methods for both categorical and numerical data, including the calculation of impurity and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification.

Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow method that implements Breiman and Cutler's method (see the extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method, including sampling cases, but it may be faster to grow the forest to disk as in the growforest utility.

Prediction and voting are done using Tree.Vote with CatBallotBox and NumBallotBox, which implement the VoteTallyer interface.
Package golearn is a machine learning library for Go.
Package face implements face recognition for Go using dlib, a popular machine learning toolkit. This example shows the basic usage of the package: create a recognizer, recognize faces, and classify them using a few known ones.
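A basic-usage sketch. It assumes the go-face package (github.com/Kagami/go-face), whose README this description matches; the file paths and the category id are placeholders:

    package main

    import (
        "fmt"
        "log"

        "github.com/Kagami/go-face"
    )

    func main() {
        // The models directory must contain the dlib models the
        // library ships with (shape predictor, descriptor network, ...).
        rec, err := face.NewRecognizer("testdata/models")
        if err != nil {
            log.Fatalf("can't init face recognizer: %v", err)
        }
        defer rec.Close()

        // Learn one known face and register it under category 0.
        known, err := rec.RecognizeSingleFile("testdata/known.jpg")
        if err != nil || known == nil {
            log.Fatalf("can't recognize known face: %v", err)
        }
        rec.SetSamples([]face.Descriptor{known.Descriptor}, []int32{0})

        // Classify a new image against the known samples.
        unknown, err := rec.RecognizeSingleFile("testdata/unknown.jpg")
        if err != nil || unknown == nil {
            log.Fatalf("can't recognize query face: %v", err)
        }
        fmt.Println("matched category:", rec.Classify(unknown.Descriptor))
    }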
Package sparse provides implementations of selected sparse matrix formats. Matrices and linear algebra are used extensively in scientific computing and machine learning applications. Large datasets are analysed as collections of vectors of numerical features, each vector representing some object. The nature of feature encoding schemes, especially those like "one hot", tends to lead to vectors with mostly zero values for many of the features. In text mining applications, where features are typically terms from a vocabulary, it is not uncommon for 99% of the elements within these vectors to contain zero values. Sparse matrix formats take advantage of this fact to optimise memory usage and processing performance by only storing and processing non-zero values.

Sparse matrix formats can broadly be divided into 3 main categories:

1. Creational - sparse matrix formats suited to the construction and building of matrices. Matrix formats in this category include DOK (Dictionary Of Keys) and COO (COOrdinate, aka triplet).

2. Operational - sparse matrix formats suited to arithmetic operations, e.g. multiplication. Matrix formats in this category include CSR (Compressed Sparse Row, aka CRS - Compressed Row Storage) and CSC (Compressed Sparse Column, aka CCS - Compressed Column Storage).

3. Specialised - specialised matrix formats suited to specific sparsity patterns. Matrix formats in this category include DIA (DIAgonal) for efficiently storing and manipulating symmetric diagonal matrices.

A common practice is to construct sparse matrices using a creational format, e.g. DOK or COO, and then convert them to an operational format, e.g. CSR, for arithmetic operations, as sketched below.

All sparse matrix implementations in this package implement the Matrix interface defined within the gonum/mat package and so may be used interchangeably with matrix types defined within that package, e.g. mat.Dense, mat.VecDense, etc.
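A sketch of the construct-then-convert workflow described above, assuming the github.com/james-bowman/sparse package (method names per its README) together with gonum/mat:

    package main

    import (
        "fmt"

        "github.com/james-bowman/sparse"
        "gonum.org/v1/gonum/mat"
    )

    func main() {
        // Build incrementally in a creational format (DOK)...
        dok := sparse.NewDOK(3, 2)
        dok.Set(0, 0, 5)
        dok.Set(2, 1, 7)

        // ...then convert to an operational format (CSR) for arithmetic.
        csr := dok.ToCSR()

        // CSR implements gonum's mat.Matrix, so it composes with dense types.
        dense := mat.NewDense(2, 2, []float64{1, 2, 3, 4})
        var product mat.Dense
        product.Mul(csr, dense) // (3x2) * (2x2) -> 3x2

        fmt.Println(mat.Formatted(&product))
    }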
Package lda provides methods for calculating linear discriminant analysis (LDA). LDA can be used as a dimensionality reduction technique and as a classifier. Both capabilities are often used in the realm of machine learning and statistical modeling. This package provides a prediction method that can classify input data based on previous calculations and feature extraction from training data.
Package lookoutequipment provides the API client, operations, and parameter types for Amazon Lookout for Equipment. Amazon Lookout for Equipment is a machine learning service that uses advanced analytics to identify anomalies in machines from sensor data for use in predictive maintenance.
Package gorgonia is a library that helps facilitate machine learning in Go. Write and evaluate mathematical equations involving multidimensional arrays easily. Do differentiation with them just as easily. Its documented examples include the following.

Autodiff showcases automatic differentiation.

Basic shows how to represent mathematical equations as graphs.

Gorgonia provides an API that is fairly idiomatic - most of the functions in the API return (T, error). This is useful for many cases, such as an interactive shell for deep learning. However, it must also be acknowledged that this makes composing functions together a bit cumbersome. To that end, Gorgonia provides two alternative methods: first, the `Lift`-based functions; second, the `Must` function.

Linear Regression Example: the formula for a straight line is y = mx + c. We want to find an `m` and a `c` that fit the equation well. We'll do it in both float32 and float64 to showcase the extensibility of Gorgonia.

Two further examples showcase the reasons for the more confusing functions and how to deal with errors - part 2 of the raison d'être of the more complicated functions.

SymbolicDiff showcases symbolic differentiation.
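As a concrete illustration of the graph-building API and the `Must` convenience described above, here is a minimal sketch in the spirit of the Basic example: building a simple sum z = x + y, compiling it to a tape machine, and evaluating it. It assumes the gorgonia.org/gorgonia import path; treat it as a sketch rather than the package's verbatim example.

    package main

    import (
        "fmt"
        "log"

        G "gorgonia.org/gorgonia"
    )

    func main() {
        g := G.NewGraph()

        // Declare two scalar symbols in the expression graph.
        x := G.NewScalar(g, G.Float64, G.WithName("x"))
        y := G.NewScalar(g, G.Float64, G.WithName("y"))

        // Most API functions return (T, error); Must panics on error
        // instead, which makes composing expressions less cumbersome.
        z := G.Must(G.Add(x, y))

        // Compile and run the graph on a tape machine.
        machine := G.NewTapeMachine(g)
        defer machine.Close()

        G.Let(x, 2.0)
        G.Let(y, 2.5)
        if err := machine.RunAll(); err != nil {
            log.Fatal(err)
        }
        fmt.Println(z.Value()) // 4.5
    }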
Package machinelearning provides the client and types for making API requests to Amazon Machine Learning. See the machinelearning package documentation for more information. https://docs.aws.amazon.com/sdk-for-go/api/service/machinelearning/

To contact Amazon Machine Learning with the SDK, use the New function to create a new service client. With that client you can make API requests to the service. These clients are safe to use concurrently.

See the SDK's documentation for more information on how to use the SDK. https://docs.aws.amazon.com/sdk-for-go/api/

See aws.Config documentation for more information on configuring SDK clients. https://docs.aws.amazon.com/sdk-for-go/api/aws/#Config

See the Amazon Machine Learning client MachineLearning for more information on creating a client for this service. https://docs.aws.amazon.com/sdk-for-go/api/service/machinelearning/#New
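For orientation, here is a minimal sketch of the pattern just described - create a session, create the client with New, then call the service. It assumes the aws-sdk-go v1 import paths; the region is an illustrative choice, not a requirement.

    package main

    import (
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/machinelearning"
    )

    func main() {
        // Create a session; credentials come from the environment or
        // shared config. The region here is an assumption for the sketch.
        sess, err := session.NewSession(&aws.Config{
            Region: aws.String("us-east-1"),
        })
        if err != nil {
            log.Fatal(err)
        }

        // New creates the Amazon Machine Learning client; clients are
        // safe for concurrent use.
        svc := machinelearning.New(sess)

        // A sample call: list the ML models visible to the account.
        out, err := svc.DescribeMLModels(&machinelearning.DescribeMLModelsInput{})
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("models:", len(out.Results))
    }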