Package sorty is a type-specific, fast, efficient, concurrent / parallel sorting library. It is an innovative QuickSort implementation, hence in-place and does not require extra memory. You can call:
Package pargo provides functions and data structures for expressing parallel algorithms. While Go is primarily designed for concurrent programming, it is also usable to some extent for parallel programming, and this library provides convenience functionality to turn otherwise sequential algorithms into parallel algorithms, with the goal to improve performance. For documentation that provides a more structured overview than is possible with Godoc, see the wiki at https://github.com/exascience/pargo/wiki Pargo provides the following subpackages: pargo/parallel provides simple functions for executing series of thunks or predicates, as well as thunks, predicates, or reducers over ranges in parallel. See also https://github.com/ExaScience/pargo/wiki/TaskParallelism pargo/speculative provides speculative implementations of most of the functions from pargo/parallel. These implementations not only execute in parallel, but also attempt to terminate early as soon as the final result is known. See also https://github.com/ExaScience/pargo/wiki/TaskParallelism pargo/sequential provides sequential implementations of all functions from pargo/parallel, for testing and debugging purposes. pargo/sort provides parallel sorting algorithms. pargo/sync provides an efficient parallel map implementation. pargo/pipeline provides functions and data structures to construct and execute parallel pipelines. Pargo has been influenced to various extents by ideas from Cilk, Threading Building Blocks, and Java's java.util.concurrent and java.util.stream packages. See http://supertech.csail.mit.edu/papers/steal.pdf for some theoretical background, and the sample chapter at https://mitpress.mit.edu/books/introduction-algorithms for a more practical overview of the underlying concepts.
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include: Breiman and Cutler's Random Forest for Classification and Regression Adaptive Boosting (AdaBoost) Classification Gradiant Boosting Tree Regression Entropy and Cost driven classification L1 regression Feature selection with artificial contrasts Proximity and model structure analysis Roughly balanced bagging for unbalanced classification The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and can not yet be released. Library Documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/ryanbressler/CloudForest Documentation of command line utilities and file formats can be found in README.md, which can be viewed fromated on github: http://github.com/ryanbressler/CloudForest Pull requests and bug reports are welcome. CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi threaded and cluster environments in mind. Go's support for function types is used to provide a interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method: This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects. Decision tree's are grown with the goal of reducing "Impurity" which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface which allows for alternative definitions of impurity. CloudForest includes several alternative targets: Additional targets can be stacked on top of these target to add boosting functionality: Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like: can return left and right slices that point to the same underlying array as the original slice of cases but these slices should not have their values changed. Functions used while searching for the best split also accepts pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like: For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit which relies on go's sorting package. Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate go routines. The growforest utility provides an example of this that uses go routines and channels to grow trees in parallel and write trees to disk as the are finished by the "worker" go routines. The few summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, distributed database etc. By default cloud forest uses a fast heuristic for missing values. When proposing a split on a feature with missing data the missing cases are removed and the impurity value is corrected to use three way impurity which reduces the bias towards features with lots of missing data: Missing values in the target variable are left out of impurity calculations. This provided generally good results at a fraction of the computational costs of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode which Brieman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Brieman describes. Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing. At some point in the future support may be added for local imputing of missing values during tree growth as described in [3] [1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 [2] https://code.google.com/p/rf-ace/ [3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record In CloudForest data is stored using the FeatureMatrix struct which contains Features. The Feature struct implements storage and methods for both categorical and numerical data and calculations of impurity etc and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow implements Brieman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method including sampling cases but it may be faster to grow the forest to disk as in the growforest utility. Prediction and Voting is done using Tree.Vote and CatBallotBox and NumBallotBox which implement the VoteTallyer interface.
Package cgi implements the common gateway interface (CGI) for Caddy, a modern, full-featured, easy-to-use web server. This plugin lets you generate dynamic content on your website by means of command line scripts. To collect information about the inbound HTTP request, your script examines certain environment variables such as PATH_INFO and QUERY_STRING. Then, to return a dynamically generated web page to the client, your script simply writes content to standard output. In the case of POST requests, your script reads additional inbound content from standard input. The advantage of CGI is that you do not need to fuss with server startup and persistence, long term memory management, sockets, and crash recovery. Your script is called when a request matches one of the patterns that you specify in your Caddyfile. As soon as your script completes its response, it terminates. This simplicity makes CGI a perfect complement to the straightforward operation and configuration of Caddy. The benefits of Caddy, including HTTPS by default, basic access authentication, and lots of middleware options extend easily to your CGI scripts. CGI has some disadvantages. For one, Caddy needs to start a new process for each request. This can adversely impact performance and, if resources are shared between CGI applications, may require the use of some interprocess synchronization mechanism such as a file lock. Your server’s responsiveness could in some circumstances be affected, such as when your web server is hit with very high demand, when your script’s dependencies require a long startup, or when concurrently running scripts take a long time to respond. However, in many cases, such as using a pre-compiled CGI application like fossil or a Lua script, the impact will generally be insignificant. Another restriction of CGI is that scripts will be run with the same permissions as Caddy itself. This can sometimes be less than ideal, for example when your script needs to read or write files associated with a different owner. Serving dynamic content exposes your server to more potential threats than serving static pages. There are a number of considerations of which you should be aware when using CGI applications. CGI SCRIPTS SHOULD BE LOCATED OUTSIDE OF CADDY’S DOCUMENT ROOT. Otherwise, an inadvertent misconfiguration could result in Caddy delivering the script as an ordinary static resource. At best, this could merely confuse the site visitor. At worst, it could expose sensitive internal information that should not leave the server. MISTRUST THE CONTENTS OF PATH_INFO, QUERY_STRING AND STANDARD INPUT. Most of the environment variables available to your CGI program are inherently safe because they originate with Caddy and cannot be modified by external users. This is not the case with PATH_INFO, QUERY_STRING and, in the case of POST actions, the contents of standard input. Be sure to validate and sanitize all inbound content. If you use a CGI library or framework to process your scripts, make sure you understand its limitations. An error in a CGI application is generally handled within the application itself and reported in the headers it returns. Additionally, if the Caddy errors directive is enabled, any content the application writes to its standard error stream will be written to the error log. This can be useful to diagnose problems with the execution of the CGI application. Your CGI application can be executed directly or indirectly. In the direct case, the application can be a compiled native executable or it can be a shell script that contains as its first line a shebang that identifies the interpreter to which the file’s name should be passed. Caddy must have permission to execute the application. On Posix systems this will mean making sure the application’s ownership and permission bits are set appropriately; on Windows, this may involve properly setting up the filename extension association. In the indirect case, the name of the CGI script is passed to an interpreter such as lua, perl or python. The basic cgi directive lets you associate a single pattern with a particular script. The directive can be repeated any reasonable number of times. Here is the basic syntax: For example: When a request such as https://example.com/report or https://example.com/report/weekly arrives, the cgi middleware will detect the match and invoke the script named /usr/local/cgi-bin/report. The current working directory will be the same as Caddy itself. Here, it is assumed that the script is self-contained, for example a pre-compiled CGI application or a shell script. Here is an example of a standalone script, similar to one used in the cgi plugin’s test suite: The environment variables PATH_INFO and QUERY_STRING are populated and passed to the script automatically. There are a number of other standard CGI variables included that are described below. If you need to pass any special environment variables or allow any environment variables that are part of Caddy’s process to pass to your script, you will need to use the advanced directive syntax described below. The values used for the script name and its arguments are subject to placeholder replacement. In addition to the standard Caddy placeholders such as {method} and {host}, the following placeholder substitutions are made: - {.} is replaced with Caddy’s current working directory - {match} is replaced with the portion of the request that satisfies the match directive - {root} is replaced with Caddy’s specified root directory You can include glob wildcards in your matches. Basically, an asterisk represents a sequence of zero or more non-slash characters and a question mark represents a single non-slash character. These wildcards can be used multiple times in a match expression. See the documentation for path/Match in the Go standard library for more details about glob matching. Here is an example directive: In this case, the cgi middleware will match requests such as https://example.com/report/weekly.lua and https://example.com/report/report.lua/weekly but not https://example.com/report.lua. The use of the asterisk expands to any character sequence within a directory. For example, if the request is made, the following command is executed: Note that the portion of the request that follows the match is not included. That information is conveyed to the script by means of environment variables. In this example, the Lua interpreter is invoked directly from Caddy, so the Lua script does not need the shebang that would be needed in a standalone script. This method facilitates the use of CGI on the Windows platform. In order to specify custom environment variables, pass along one or more environment variables known to Caddy, or specify more than one match pattern for a given rule, you will need to use the advanced directive syntax. That looks like this: For example, With the advanced syntax, the exec subdirective must appear exactly once. The match subdirective must appear at least once. The env, pass_env, empty_env, and except subdirectives can appear any reasonable number of times. pass_all_env, dir may appear once. The dir subdirective specifies the CGI executable’s working directory. If it is not specified, Caddy’s current working directory is used. The except subdirective uses the same pattern matching logic that is used with the match subdirective except that the request must match a rule fully; no request path prefix matching is performed. Any request that matches a match pattern is then checked with the patterns in except, if any. If any matches are made with the except pattern, the request is rejected and passed along to subsequent handlers. This is a convenient way to have static file resources served properly rather than being confused as CGI applications. The empty_env subdirective is used to pass one or more empty environment variables. Some CGI scripts may expect the server to pass certain empty variables rather than leaving them unset. This subdirective allows you to deal with those situations. The values associated with environment variable keys are all subject to placeholder substitution, just as with the script name and arguments. If your CGI application runs properly at the command line but fails to run from Caddy it is possible that certain environment variables may be missing. For example, the ruby gem loader evidently requires the HOME environment variable to be set; you can do this with the subdirective pass_env HOME. Another class of problematic applications require the COMPUTERNAME variable. The pass_all_env subdirective instructs Caddy to pass each environment variable it knows about to the CGI excutable. This addresses a common frustration that is caused when an executable requires an environment variable and fails without a descriptive error message when the variable cannot be found. These applications often run fine from the command prompt but fail when invoked with CGI. The risk with this subdirective is that a lot of server information is shared with the CGI executable. Use this subdirective only with CGI applications that you trust not to leak this information. If you protect your CGI application with the Caddy JWT middleware, your program will have access to the token’s payload claims by means of environment variables. For example, the following token claims will be available with the following environment variables All values are conveyed as strings, so some conversion may be necessary in your program. No placeholder substitutions are made on these values. If you run into unexpected results with the CGI plugin, you are able to examine the environment in which your CGI application runs. To enter inspection mode, add the subdirective inspect to your CGI configuration block. This is a development option that should not be used in production. When in inspection mode, the plugin will respond to matching requests with a page that displays variables of interest. In particular, it will show the replacement value of {match} and the environment variables to which your CGI application has access. For example, consider this example CGI block: When you request a matching URL, for example, the Caddy server will deliver a text page similar to the following. The CGI application (in this case, wapptclsh) will not be called. This information can be used to diagnose problems with how a CGI application is called. To return to operation mode, remove or comment out the inspect subdirective. In this example, the Caddyfile looks like this: Note that a request for /show gets mapped to a script named /usr/local/cgi-bin/report/gen. There is no need for any element of the script name to match any element of the match pattern. The contents of /usr/local/cgi-bin/report/gen are: The purpose of this script is to show how request information gets communicated to a CGI script. Note that POST data must be read from standard input. In this particular case, posted data gets stored in the variable POST_DATA. Your script may use a different method to read POST content. Secondly, the SCRIPT_EXEC variable is not a CGI standard. It is provided by this middleware and contains the entire command line, including all arguments, with which the CGI script was executed. When a browser requests the response looks like When a client makes a POST request, such as with the following command the response looks the same except for the following lines: The fossil distributed software management tool is a native executable that supports interaction as a CGI application. In this example, /usr/bin/fossil is the executable and /home/quixote/projects.fossil is the fossil repository. To configure Caddy to serve it, use a cgi directive something like this in your Caddyfile: In your /usr/local/cgi-bin directory, make a file named projects with the following single line: The fossil documentation calls this a command file. When fossil is invoked after a request to /projects, it examines the relevant environment variables and responds as a CGI application. If you protect /projects with basic HTTP authentication, you may wish to enable the ALLOW REMOTE_USER AUTHENTICATION option when setting up fossil. This lets fossil dispense with its own authentication, assuming it has an account for the user. The agedu utility can be used to identify unused files that are taking up space on your storage media. Like fossil, it can be used in different modes including CGI. First, use it from the command line to generate an index of a directory, for example In your Caddyfile, include a directive that references the generated index: You will want to protect the /agedu resource with some sort of access control, for example HTTP Basic Authentication. This small example demonstrates how to write a CGI program in Go. The use of a bytes.Buffer makes it easy to report the content length in the CGI header. When this program is compiled and installed as /usr/local/bin/servertime, the following directive in your Caddy file will make it available: The cgit application provides an attractive and useful web interface to git repositories. Here is how to run it with Caddy. After compiling cgit, you can place the executable somewhere out of Caddy’s document root. In this example, it is located in /usr/local/cgi-bin. A sample configuration file is included in the project’s cgitrc.5.txt file. You can use it as a starting point for your configuration. The default location for this file is /etc/cgitrc but in this example the location /home/quixote/caddy/cgitrc. Note that changing the location of this file from its default will necessitate the inclusion of the environment variable CGIT_CONFIG in the Caddyfile cgi directive. When you edit the repository stanzas in this file, be sure each repo.path item refers to the .git directory within a working checkout. Here is an example stanza: Also, you will likely want to change cgit’s cache directory from its default in /var/cache (generally accessible only to root) to a location writeable by Caddy. In this example, cgitrc contains the line You may need to create the cgit subdirectory. There are some static cgit resources (namely, cgit.css, favicon.ico, and cgit.png) that will be accessed from Caddy’s document tree. For this example, these files are placed in a directory named cgit-resource. The following lines are part of the cgitrc file: Additionally, you will likely need to tweak the various file viewer filters such source-filter and about-filter based on your system. The following Caddyfile directive will allow you to access the cgit application at /cgit: Feeling reckless? You can run PHP in CGI mode. In general, FastCGI is the preferred method to run PHP if your application has many pages or a fair amount of database activity. But for small PHP programs that are seldom used, CGI can work fine. You’ll need the php-cgi interpreter for your platform. This may involve downloading the executable or downloading and then compiling the source code. For this example, assume the interpreter is installed as /usr/local/bin/php-cgi. Additionally, because of the way PHP operates in CGI mode, you will need an intermediate script. This one works in Posix environments: This script can be reused for multiple cgi directives. In this example, it is installed as /usr/local/cgi-bin/phpwrap. The argument following -c is your initialization file for PHP. In this example, it is named /home/quixote/.config/php/php-cgi.ini. Two PHP files will be used for this example. The first, /usr/local/cgi-bin/sample/min.php, looks like this: The second, /usr/local/cgi-bin/sample/action.php, follows: The following directive in your Caddyfile will make the application available at sample/min.php: This examples demonstrates printing a CGI rule