SBWeb is the web service of the developers34 team's course work. It is intended for searching for and hiring specialists in the building industry. The environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set. If the environment variable PORT is set, its value overrides the API address from the config. If the environment variable REDIS_URL is set, its value overrides the SM DBAddress from the config. If the environment variable DATABASE_URL is set, its value overrides the DB DBAddress from the config. To run the application you need to pass the "cfg" parameter, which takes the path to a config file formatted as JSON. The config has this structure:
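A minimal sketch of how the described config loading and environment overrides could be wired together. The struct and field names below are assumptions that merely mirror the fields named above (API address, SM DBAddress, DB DBAddress); they are not the actual SBWeb types.

```go
// Hypothetical sketch: load a JSON config given via -cfg, then apply the
// documented environment overrides. Field names are assumptions.
package main

import (
	"encoding/json"
	"flag"
	"log"
	"os"
)

type Config struct {
	API struct {
		Address string `json:"address"`
	} `json:"api"`
	SM struct {
		DBAddress string `json:"db_address"`
	} `json:"sm"`
	DB struct {
		DBAddress string `json:"db_address"`
	} `json:"db"`
}

func main() {
	cfgPath := flag.String("cfg", "", "path to JSON config file")
	flag.Parse()

	raw, err := os.ReadFile(*cfgPath)
	if err != nil {
		log.Fatal(err)
	}
	var cfg Config
	if err := json.Unmarshal(raw, &cfg); err != nil {
		log.Fatal(err)
	}

	// Environment variables override the corresponding config values.
	if port := os.Getenv("PORT"); port != "" {
		cfg.API.Address = ":" + port
	}
	if redis := os.Getenv("REDIS_URL"); redis != "" {
		cfg.SM.DBAddress = redis
	}
	if db := os.Getenv("DATABASE_URL"); db != "" {
		cfg.DB.DBAddress = db
	}
	// AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are read by the AWS SDK itself.
	_ = cfg
}
```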
Package enmime implements a MIME parsing library for Go. It's built on top of Go's included mime/multipart support, but is geared towards parsing MIME encoded emails. The enmime API has two conceptual layers. The lower layer is a tree of Part structs, representing each component of a decoded MIME message. The upper layer, called an Envelope, provides an intuitive way to interact with a MIME message. Calling ReadParts causes enmime to parse the body of a MIME message into a tree of Part objects, each of which is aware of its content type, filename and headers. Each Part implements io.Reader, providing access to the content it represents. If the part was encoded in quoted-printable or base64, it is decoded prior to being accessed by the Reader. If you need to locate a particular Part, you can pass a custom PartMatcher function into the BreadthMatchFirst() or DepthMatchFirst() methods to search the Part tree. BreadthMatchAll() and DepthMatchAll() will collect all Parts matching your criteria. EnvelopeFromMessage returns an Envelope struct. Behind the scenes a Part tree is constructed, and then sorted into the correct fields of the Envelope. The Envelope contains both the plain text and HTML portions of the email. If there was no plain text Part available, the HTML Part will be down-converted using the html2text library. The root of the Part tree, as well as slices of the inline and attachment Parts, are also available. Please note that enmime parses messages into memory, so it is not likely to perform well with multi-gigabyte attachments. enmime is open source software released under the MIT License. The latest version can be found at https://github.com/jhillyerd/enmime
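A short sketch of searching the Part tree with a custom PartMatcher, as described above. It assumes the jhillyerd/enmime package layout; exact field names can differ between enmime versions, and "message.eml" is a placeholder file.

```go
// Sketch: parse a MIME message into a Part tree and search it breadth-first.
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/jhillyerd/enmime"
)

func main() {
	f, err := os.Open("message.eml") // placeholder sample message
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	root, err := enmime.ReadParts(f) // parse the MIME body into a Part tree
	if err != nil {
		log.Fatal(err)
	}

	// Find the first PDF attachment, breadth-first.
	pdf := root.BreadthMatchFirst(func(p *enmime.Part) bool {
		return p.ContentType == "application/pdf"
	})
	if pdf != nil {
		fmt.Println("found:", pdf.FileName)
	}
}
```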
Package freegeoip provides an API for searching the geolocation of IP addresses. It uses a database that can be either a local file or a remote resource fetched from a URL. Local databases are monitored by fsnotify and reloaded when the file is updated or overwritten. Remote databases are automatically downloaded and updated in the background so you can focus on using the API rather than managing the database.
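A minimal lookup sketch, assuming the fiorix/freegeoip style Open/Lookup API; the database filename and the trimmed result struct below are assumptions based on the MaxMind field layout.

```go
// Sketch: open a local database file (watched by fsnotify) and look up an IP.
package main

import (
	"fmt"
	"log"
	"net"

	"github.com/fiorix/freegeoip"
)

func main() {
	db, err := freegeoip.Open("GeoLite2-City.mmdb") // local file; reloaded on change
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Trimmed-down result struct; only the country names are decoded here.
	var result struct {
		Country struct {
			Names map[string]string `maxminddb:"names"`
		} `maxminddb:"country"`
	}
	if err := db.Lookup(net.ParseIP("8.8.8.8"), &result); err != nil {
		log.Fatal(err)
	}
	fmt.Println(result.Country.Names["en"])
}
```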
This example opens https://github.com/, searches for "git", and then gets the header element which gives the description for Git. Rod uses https://golang.org/pkg/context to handle cancellation of IO-blocking operations; most of the time this means timeouts. The context is passed recursively to all sub-methods. For example, Page.Context(ctx) returns a clone of the page bound to ctx, and all methods of the returned page use that ctx for their IO-blocking operations. Page.Timeout and Page.WithCancel are just shortcuts for Page.Context. Browser and Element work the same way, of course. Shows how we can further customize the browser with the launcher library. The launcher lib is usually used to set the browser's command-line flags (switches). Doc for flags: https://peter.sh/experiments/chromium-command-line-switches Shows how to change the retry/polling options used to query elements. This is useful when you want to customize the element query retry logic. When rod doesn't have a feature you need, you can easily call the CDP directly to achieve it. List of CDP APIs: https://github.com/fspoettel/rod/tree/master/lib/proto Shows how to disable headless mode and debug. Rod provides a lot of debug options; you can set them with setter methods or use environment variables. Doc for environment variables: https://pkg.go.dev/github.com/fspoettel/rod/lib/defaults We use "Must"-prefixed functions in example code, but in production you may want to use the no-prefix versions. The reason we use "Must" as the prefix is similar to https://golang.org/pkg/regexp/#MustCompile Shows how to share a remote object reference between two Eval calls. Shows how to listen for events. Shows how to intercept requests and modify both the request and the response. In the entire process of hijacking one request, the --req-> and --res-> parts are the ones that can be modified. Shows how to handle multiple results of an action, such as when you log in to a page and the result can be either success or a wrong password. Example_search shows how to use Search to get elements inside nested iframes or shadow DOMs. It works the same as https://developers.google.com/web/tools/chrome-devtools/dom#search Shows how to update the state of the current page. In this example we enable the network domain. Rod uses the mouse cursor to simulate clicks, so if a button is moving because of an animation, the click may not work as expected. We usually use WaitStable to make sure the target isn't changing anymore. This example is useful when you want to wait for an ajax request to complete.
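A minimal sketch of the GitHub search example described above, using the Must-prefixed helpers. The CSS selectors are assumptions, and the import path shown is the upstream go-rod/rod module; the fork linked above may use a different path.

```go
// Sketch: open github.com, search for "git", print the first result text.
package main

import (
	"fmt"
	"time"

	"github.com/go-rod/rod"
	"github.com/go-rod/rod/lib/input"
)

func main() {
	browser := rod.New().MustConnect()
	defer browser.MustClose()

	// Timeout returns a clone of the page whose IO-blocking calls use the deadline.
	page := browser.MustPage("https://github.com/").Timeout(30 * time.Second)

	// Selectors are guesses for illustration only.
	page.MustElement("input[name=q]").MustInput("git").MustPress(input.Enter)

	el := page.MustElement(".repo-list p")
	fmt.Println(el.MustText())
}
```

In production code, prefer the error-returning (non-Must) variants so failures can be handled instead of panicking.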
< GoPlug > GoPlug is a pure Go plugin library project that provides flexibility, loose coupling and a modular approach to building software in/around Go. The goal of the project is to provide a simple, fast and reliable plugin architecture that is independent of the platform. The GoPlug plugin lifecycle is quite simple, as it consists of only three states. 1. Stopped : Plugin is not yet started or has been stopped 2. Discovered/Installed : Plugin is discovered and ready to be started 3. Started/Loaded : Plugin is started or loaded for serving requests Each application creates a Plugin Registry to manage plugins. The Plugin Registry is based on a plugin discovery service that provides an API to search, load and unload plugins to/from the registry. Auto discovery at the plugin registry can be disabled, in which case plugins are discovered at loading time. Each plugin makes itself available to the discovery service, and once discovered it is loaded by the application. On a successful load start() is called, and on a successful unload stop() is called. Lazy start can be enabled so that a plugin is loaded by an explicit call to the Plugin Registry rather than at discovery. Plugins run as separate processes that should be started explicitly. A Unix domain socket is used for IPC, where the communication is based on an HTTP request/response model. Step by step: 1> At startup the Plugin opens a Unix domain socket and listens for connections 2> Once initialized, it puts the .pconf file in a specific location for Plugin Discovery 3> The Plugin Registry discovers the .pconf and loads the configuration to get the properties and the UNIX socket 4> The Plugin Registry initializes the connection using the UNIX socket and loads the plugin information 5> HTTP requests are made over the connection as methods are executed
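A purely hypothetical sketch of the lifecycle just described (discover, load, call, unload). GoPlug's real type and method names are not given here, so everything below is an assumption used only to illustrate the flow.

```go
// Hypothetical illustration of the registry-driven plugin lifecycle.
// All names (PluginRegistry, Discover, Load, Execute, Unload) are placeholders.
package pluginsketch

import "log"

type PluginRegistry interface {
	Discover() error                              // scan the .pconf discovery location
	Load(name string) error                       // connect over the plugin's Unix socket, call start()
	Execute(name, method string) ([]byte, error)  // HTTP request/response over the socket
	Unload(name string) error                     // call stop() and drop the connection
}

func run(reg PluginRegistry) {
	if err := reg.Discover(); err != nil {
		log.Fatal(err)
	}
	if err := reg.Load("example-plugin"); err != nil {
		log.Fatal(err)
	}
	defer reg.Unload("example-plugin")

	out, err := reg.Execute("example-plugin", "status")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("plugin replied: %s", out)
}
```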
Package timezones provides an array of timezones as well as a simple API for accessing and searching timezones.
Package github provides a client for using the GitHub API. Usage: Construct a new GitHub client, then use the various services on the client to access different parts of the GitHub API. For example: Some API methods have optional parameters that can be passed. For example: The services of a client divide the API into logical chunks and correspond to the structure of the GitHub API documentation at https://developer.github.com/v3/. NOTE: Using the https://godoc.org/context package, one can easily pass cancelation signals and deadlines to various services of the client for handling a request. In case there is no context available, then context.Background() can be used as a starting point. For more sample code snippets, head over to the https://github.com/google/go-github/tree/master/example directory. The go-github library does not directly handle authentication. Instead, when creating a new client, pass an http.Client that can handle authentication for you. The easiest and recommended way to do this is using the golang.org/x/oauth2 library, but you can always use any other library that provides an http.Client. If you have an OAuth2 access token (for example, a personal API token), you can use it with the oauth2 library using: Note that when using an authenticated Client, all calls made by the client will include the specified OAuth token. Therefore, authenticated clients should almost never be shared between different users. See the oauth2 docs for complete instructions on using that library. For API methods that require HTTP Basic Authentication, use the BasicAuthTransport. GitHub Apps authentication can be provided by the https://github.com/bradleyfalzon/ghinstallation package. GitHub imposes a rate limit on all API clients. Unauthenticated clients are limited to 60 requests per hour, while authenticated clients can make up to 5,000 requests per hour. The Search API has a custom rate limit. Unauthenticated clients are limited to 10 requests per minute, while authenticated clients can make up to 30 requests per minute. To receive the higher rate limit when making calls that are not issued on behalf of a user, use UnauthenticatedRateLimitedTransport. The returned Response.Rate value contains the rate limit information from the most recent API call. If a recent enough response isn't available, you can use RateLimits to fetch the most up-to-date rate limit data for the client. To detect an API rate limit error, you can check if its type is *github.RateLimitError: Learn more about GitHub rate limiting at https://developer.github.com/v3/#rate-limiting. Some endpoints may return a 202 Accepted status code, meaning that the information required is not yet ready and was scheduled to be gathered on the GitHub side. Methods known to behave like this are documented specifying this behavior. To detect this condition of error, you can check if its type is *github.AcceptedError: The GitHub API has good support for conditional requests which will help prevent you from burning through your rate limit, as well as help speed up your application. go-github does not handle conditional requests directly, but is instead designed to work with a caching http.Transport. We recommend using https://github.com/gregjones/httpcache for that. Learn more about GitHub conditional requests at https://developer.github.com/v3/#conditional-requests. All structs for GitHub resources use pointer values for all non-repeated fields. This allows distinguishing between unset fields and those set to a zero-value. 
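An example of constructing an authenticated client with golang.org/x/oauth2 and calling a service method, following the pattern described above (the import path shown is the pre-module one; versioned module paths may differ).

```go
// Build an OAuth2-authenticated client and list a user's public repositories.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/google/go-github/github"
	"golang.org/x/oauth2"
)

func main() {
	ctx := context.Background()

	// Personal API token; every call made by this client sends the token.
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: "... your token ..."})
	tc := oauth2.NewClient(ctx, ts)
	client := github.NewClient(tc)

	// Optional parameters are passed via an options struct.
	opt := &github.RepositoryListOptions{Type: "public"}
	repos, _, err := client.Repositories.List(ctx, "octocat", opt)
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range repos {
		fmt.Println(r.GetFullName())
	}
}
```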
Helper functions have been provided to easily create these pointers for string, bool, and int values. For example: Users who have worked with protocol buffers should find this pattern familiar. All requests for resource collections (repos, pull requests, issues, etc.) support pagination. Pagination options are described in the github.ListOptions struct and passed to the list methods directly or as an embedded type of a more specific list options struct (for example github.PullRequestListOptions). Pages information is available via the github.Response struct. Go on App Engine Classic (which as of this writing uses Go 1.6) can not use the "context" import and still relies on "golang.org/x/net/context". As a result, if you wish to continue to use "go-github" on App Engine Classic, you will need to rewrite all the "context" imports using the following command: See "with_appengine.go" for more details.
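A sketch of paginating through a resource collection using github.ListOptions and the page information on github.Response, plus the pointer helpers mentioned above.

```go
// Collect all open issues of a repository page by page, then build a request
// body using the pointer helpers.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/google/go-github/github"
)

func main() {
	ctx := context.Background()
	client := github.NewClient(nil) // unauthenticated: 60 requests/hour

	opt := &github.IssueListByRepoOptions{
		State:       "open",
		ListOptions: github.ListOptions{PerPage: 50},
	}
	var all []*github.Issue
	for {
		issues, resp, err := client.Issues.ListByRepo(ctx, "golang", "go", opt)
		if err != nil {
			log.Fatal(err)
		}
		all = append(all, issues...)
		if resp.NextPage == 0 {
			break // no more pages
		}
		opt.Page = resp.NextPage
	}
	fmt.Println("open issues:", len(all))

	// Pointer helpers distinguish unset fields from zero values.
	req := &github.IssueRequest{
		Title: github.String("bug: ..."),
		Body:  github.String("details"),
	}
	_ = req
}
```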
Package spsa provides the Simultaneous Perturbation Stochastic Approximation method. Much of the notation is taken from Introduction to Stochastic Search and Optimization (ISSO), James Spall's book on stochastic optimization (published by Wiley, 2003). Example usage: The first example uses the core optimization API with access to all the tunable knobs. The second example uses the helper function Optimize, which shortens the boilerplate with default options.
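A generic illustration of a single SPSA step as described in ISSO; this is not the spsa package's API, just the underlying technique: perturb all coordinates simultaneously with a random ±1 direction, estimate the gradient from two loss evaluations, and descend.

```go
// Generic SPSA step (not the package API): gradient estimate from two
// evaluations of a possibly noisy loss, followed by a descent update.
package main

import (
	"fmt"
	"math/rand"
)

type Loss func(theta []float64) float64

// spsaStep uses perturbation size c and step size a. In practice both decay
// with the iteration count; constants are used here for brevity.
func spsaStep(l Loss, theta []float64, a, c float64, rng *rand.Rand) []float64 {
	n := len(theta)
	delta := make([]float64, n)
	plus := make([]float64, n)
	minus := make([]float64, n)
	for i := range theta {
		if rng.Intn(2) == 0 {
			delta[i] = 1
		} else {
			delta[i] = -1
		}
		plus[i] = theta[i] + c*delta[i]
		minus[i] = theta[i] - c*delta[i]
	}
	diff := l(plus) - l(minus)
	next := make([]float64, n)
	for i := range theta {
		g := diff / (2 * c * delta[i]) // gradient estimate for coordinate i
		next[i] = theta[i] - a*g
	}
	return next
}

func main() {
	rng := rand.New(rand.NewSource(1))
	quadratic := func(t []float64) float64 { return t[0]*t[0] + t[1]*t[1] }
	theta := []float64{5, -3}
	for k := 0; k < 200; k++ {
		theta = spsaStep(quadratic, theta, 0.1, 0.1, rng)
	}
	fmt.Printf("approx minimizer: %.3f\n", theta) // near [0 0]
}
```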
Package osdb is an API client for opensubtitles.org. It is a client for the OSDb protocol. Currently the package only allows movie identification, subtitle search, and download.
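A minimal search sketch, assuming the oz/osdb client shape (NewClient / LogIn / FileSearch); method and field names may differ between versions, and the movie path is a placeholder.

```go
// Sketch: anonymous login, then search subtitles for a local movie file.
package main

import (
	"fmt"
	"log"

	"github.com/oz/osdb"
)

func main() {
	c, err := osdb.NewClient()
	if err != nil {
		log.Fatal(err)
	}
	// Anonymous login is usually enough for searching.
	if err := c.LogIn("", "", ""); err != nil {
		log.Fatal(err)
	}

	subs, err := c.FileSearch("/path/to/movie.mkv", []string{"eng"})
	if err != nil {
		log.Fatal(err)
	}
	for _, s := range subs {
		fmt.Println(s.SubFileName)
	}
}
```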
Package enmime implements a MIME encoding and decoding library. It's built on top of Go's included mime/multipart support where possible, but is geared towards parsing MIME encoded emails. The enmime API has two conceptual layers. The lower layer is a tree of Part structs, representing each component of a decoded MIME message. The upper layer, called an Envelope, provides an intuitive way to interact with a MIME message. Calling ReadParts causes enmime to parse the body of a MIME message into a tree of Part objects, each of which is aware of its content type, filename and headers. The content of a Part is available as a slice of bytes via the Content field. If the part was encoded in quoted-printable or base64, it is decoded prior to being placed in Content. If the Part contains text in a character set other than utf-8, enmime will attempt to convert it to utf-8. To locate a particular Part, pass a custom PartMatcher function into the BreadthMatchFirst() or DepthMatchFirst() methods to search the Part tree. BreadthMatchAll() and DepthMatchAll() will collect all Parts matching your criteria. ReadEnvelope returns an Envelope struct. Behind the scenes a Part tree is constructed, and then sorted into the correct fields of the Envelope. The Envelope contains both the plain text and HTML portions of the email. If there was no plain text Part available, the HTML Part will be down-converted using the html2text library. The root of the Part tree, as well as slices of the inline and attachment Parts, are also available. Every MIME Part has its own headers, accessible via the Part.Header field. The raw headers for an Envelope are available in Root.Header. Envelope also provides helper methods to fetch headers: GetHeader(key) will return the RFC 2047 decoded value of the specified header. AddressList(key) will convert the specified address header into a slice of net/mail.Address values. enmime attempts to be tolerant of poorly encoded MIME messages. In situations where parsing is not possible, the ReadEnvelope and ReadParts functions will return a hard error. If enmime is able to continue parsing the message, it will add an entry to the Errors slice on the relevant Part. After parsing is complete, all Part errors will be appended to the Envelope Errors slice. The Error* constants can be used to identify a specific class of error. Please note that enmime parses messages into memory, so it is not likely to perform well with multi-gigabyte attachments. enmime is open source software released under the MIT License. The latest version can be found at https://github.com/igorrendulic/enmime
The todoist package contains a Todoist client that uses a subset of the Todoist Sync API v8 documented at https://developer.todoist.com/sync/v8. The client will be extended to support more functionality as required by consumers; at the time of writing the only consumer is the acme user interface in the cmd/todoist subdirectory. The client maintains lists of resources (projects, items, notes, labels), and all lookup and search operations scan through these lists to find the entities of interest. While this is obviously inefficient, there is little point in changing the implementation to improve performance, as project and task inventories should be small and the only current use case is the acme user interface in cmd/todoist. The only two client methods that make remote calls are Push and Pull. The former sends to the server, in bulk, the commands that were previously enqueued by the client, while the latter fetches all changes that happened since the previous time it was called, including locally initiated changes (the first time, it will download all the data). Methods that query the data, e.g. ItemByID or SearchProjects, use the local copy of the data. Methods that modify the data, e.g. QueueItemAdd, locally enqueue the changes to be later sent upstream by Push.
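A hypothetical usage sketch of the flow just described (Pull, query locally, queue a change, Push). Only the method names Pull, Push, SearchProjects and QueueItemAdd come from the description; the constructor, signatures and import path are assumptions.

```go
// Hypothetical sketch of the local-cache + queued-commands flow.
package main

import (
	"context"
	"log"

	todoist "example.org/todoist" // placeholder import path
)

func main() {
	ctx := context.Background()
	c := todoist.NewClient("api-token") // assumed constructor

	// Fetch all changes since the last sync (everything, on the first call).
	if err := c.Pull(ctx); err != nil {
		log.Fatal(err)
	}

	// Queries run against the local copy of the data.
	projects := c.SearchProjects("Inbox")
	if len(projects) == 0 {
		log.Fatal("no matching project")
	}

	// Modifications are queued locally and sent upstream in bulk by Push.
	c.QueueItemAdd("buy milk", projects[0].ID)
	if err := c.Push(ctx); err != nil {
		log.Fatal(err)
	}
}
```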
Package enmime implements a MIME encoding and decoding library. It's built on top of Go's included mime/multipart support where possible, but is geared towards parsing MIME encoded emails. The enmime API has two conceptual layers. The lower layer is a tree of Part structs, representing each component of a decoded MIME message. The upper layer, called an Envelope, provides an intuitive way to interact with a MIME message. Calling ReadParts causes enmime to parse the body of a MIME message into a tree of Part objects, each of which is aware of its content type, filename and headers. The content of a Part is available as a slice of bytes via the Content field. If the part was encoded in quoted-printable or base64, it is decoded prior to being placed in Content. If the Part contains text in a character set other than utf-8, enmime will attempt to convert it to utf-8. To locate a particular Part, pass a custom PartMatcher function into the BreadthMatchFirst() or DepthMatchFirst() methods to search the Part tree. BreadthMatchAll() and DepthMatchAll() will collect all Parts matching your criteria. ReadEnvelope returns an Envelope struct. Behind the scenes a Part tree is constructed, and then sorted into the correct fields of the Envelope. The Envelope contains both the plain text and HTML portions of the email. If there was no plain text Part available, the HTML Part will be down-converted using the html2text library. The root of the Part tree, as well as slices of the inline and attachment Parts, are also available. Every MIME Part has its own headers, accessible via the Part.Header field. The raw headers for an Envelope are available in Root.Header. Envelope also provides helper methods to fetch headers: GetHeader(key) will return the RFC 2047 decoded value of the specified header. AddressList(key) will convert the specified address header into a slice of net/mail.Address values. enmime attempts to be tolerant of poorly encoded MIME messages. In situations where parsing is not possible, the ReadEnvelope and ReadParts functions will return a hard error. If enmime is able to continue parsing the message, it will add an entry to the Errors slice on the relevant Part. After parsing is complete, all Part errors will be appended to the Envelope Errors slice. The Error* constants can be used to identify a specific class of error. Please note that enmime parses messages into memory, so it is not likely to perform well with multi-gigabyte attachments. enmime is open source software released under the MIT License. The latest version can be found at https://github.com/jhillyerd/enmime
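An example of the Envelope layer described above, using ReadEnvelope and the header helpers. Field names follow the current jhillyerd/enmime API; "message.eml" is a placeholder file.

```go
// Sketch: parse a message into an Envelope, then read headers, bodies,
// attachments and soft parse errors.
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/jhillyerd/enmime"
)

func main() {
	f, err := os.Open("message.eml")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	env, err := enmime.ReadEnvelope(f)
	if err != nil {
		log.Fatal(err) // hard error: parsing was not possible at all
	}

	fmt.Println("Subject:", env.GetHeader("Subject")) // RFC 2047 decoded
	fmt.Println("Text body:", env.Text)               // down-converted from HTML if needed
	for _, a := range env.Attachments {
		fmt.Println("attachment:", a.FileName, len(a.Content), "bytes")
	}
	for _, e := range env.Errors {
		fmt.Println("parse warning:", e) // soft errors collected during parsing
	}
}
```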
Package govidious implements a wrapper over the Invidious REST API v1, which is described in these documents: For each of the endpoints listed in those links, this module provides an API function which: Takes as many input arguments as needed, depending on the endpoint. For example, the "/api/v1/stat" endpoint is represented by the "Stat()" API function, which takes no arguments, while the "/api/v1/search" endpoint is represented by the "Search()" API function, which takes many arguments (search query, date, duration, ...). Returns a structure that matches the JSON response associated with each endpoint. In order to use this package, first create a new "InvidiousV1" instance using the "New()" function: ...then simply call the desired API function(s) on that object: For example, to get the URL of the most watched video featuring "kittens playing golf", you would do this: Notice that both URLs ("youtubeUrl" and "invidiousUrl") can be used in a web browser to actually watch the video. The main difference is that: Using "youtubeUrl" establishes a direct connection with YouTube servers. Using "invidiousUrl" might or might not establish a direct connection with YouTube servers, depending on how the Invidious server is configured. If you don't want to leave any trace on YouTube servers, make sure you use "invidiousUrl" with an Invidious server configured to proxy video files. Another thing you can do with this URL ("youtubeUrl" or "invidiousUrl") is to use the "mpv" player to start watching the video directly, without needing a web browser. Under the hood, "mpv" uses "youtube-dl" (another program) to parse the web page, find the URL of the video file (which might change depending on the options you call "mpv" with, such as the desired quality) and download it. Well... the good news is that you don't need "mpv" or "youtube-dl" to do any of this, as the Invidious API can also be used to find out the URL of the video file once you have the "videoId". First, you obtain all the associated data of the desired video: ...then you inspect the returned structure to find the URL of the video with the desired quality level: There are many other things you can do with this package, such as listing all the videos from a given channel, retrieving the subtitles associated with a video, etc. Check the documentation of each function (and the names of the fields of the structures they return) for more details.
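A hypothetical sketch of the kitten-search flow described above. The function names New() and Search() come from the description, but their parameters, the result field names and the import path are assumptions.

```go
// Hypothetical sketch: search Invidious and build watchable URLs from a videoId.
package main

import (
	"fmt"
	"log"

	govidious "example.org/govidious" // placeholder import path
)

func main() {
	inv := govidious.New("https://invidious.example.com") // assumed constructor argument

	// Assumed signature: query plus optional filters (date, duration, ...).
	results, err := inv.Search("kittens playing golf", govidious.SearchOptions{SortBy: "views"})
	if err != nil {
		log.Fatal(err)
	}
	if len(results) == 0 {
		log.Fatal("no results")
	}

	videoID := results[0].VideoID // assumed field name
	fmt.Println("invidious URL:", "https://invidious.example.com/watch?v="+videoID)
	fmt.Println("youtube URL:  ", "https://www.youtube.com/watch?v="+videoID)
}
```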
Package foursquarego provides a Client for the Foursquare API. Here are some example requests. There is a parameters struct if there is more than one parameter. If there are strict options for a parameter, there will be a struct for it, as seen in the search above. For authentication, just send either the Client Secret or the user's Access Token. If you send both to the client, it will send both to Foursquare. Foursquare expects that if you're making a request for a user you will send the Access Token. More information can be found on their auth page, https://developer.foursquare.com/overview/auth
Package hamt is the unifying package between the 32-bit and 64-bit implementations of Hash Array Mapped Tries (HAMT). HAMT data structures make an efficient hashed map. You can `import hamt "github.com/lleo/go-hamt"` and then instantiate either a hamt32 or hamt64 data structure with the `hamt.NewHamt32()` or `hamt.NewHamt64()` functions. Both data structures have the same exported API, defined by the Hamt interface. Given how wide a HAMT node is (either 32 or 64 entries), HAMT data structures are not very deep: 6 nodes deep for the 32-bit implementation, or 10 for the 64-bit one. This means HAMTs are effectively O(1) for search, insertion, and deletion. Both the 32- and 64-bit implementations of HAMTs are of fixed depth because they are [Tries](https://en.wikipedia.org/wiki/Trie). The key of a Trie is split into n smaller indices, and each node from the root uses each successive index. In this HAMT implementation the key is hashed into a 30- or 60-bit number. In the case of the stringkey, we take the []byte slice of the string and feed it to the hash/fnv New32() or New64() hash generator. Since these generate 32- and 64-bit hash values respectively and we need 30- and 60-bit values, we use the [xor-fold technique](http://www.isthe.com/chongo/tech/comp/fnv/index.html#xor-fold) to "fold" the high 2 or 4 bits of the 32- and 64-bit hash values into 30- and 60-bit values for our needs. We want 30- and 60-bit values because they split nicely into six 5-bit and ten 6-bit values respectively. Each of these 5- and 6-bit values becomes an index into our Trie nodes, with a maximum depth of 6 or 10 respectively. Further, 5 bits index into a 32-entry table node for 32-bit HAMTs, and 6 bits index into a 64-entry table node for 64-bit HAMTs; isn't that symmetrical :). In this HAMT implementation, when a key/value pair must be created, deleted, or changed, the key is hashed into a 30- or 60-bit value (described above), and that hash30 or hash60 value represents a path of 5- or 6-bit values to the place of a leaf containing the key/value pair. For a Get() or Del() operation we look up the deepest node along that path that is not nil. For a Put() operation we look up the deepest location that is nil and not beyond the length of the path. You may implement your own Key type by implementing the Key interface defined in "github.com/lleo/go-hamt/key", or you may use the example StringKey described in "github.com/lleo/go-hamt/stringkey".
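A small sketch of the hashing step described above, independent of the package API: hash a string key with hash/fnv and xor-fold the 32-bit value down to 30 bits, then read off six 5-bit indices, one per trie level.

```go
// Generic illustration of the 32-bit case: FNV hash, xor-fold to 30 bits,
// then split into six 5-bit trie indices.
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	mask30 = 1<<30 - 1 // low 30 bits
	nbits  = 5         // bits consumed per trie level (32-wide tables)
)

func hash30(key []byte) uint32 {
	h := fnv.New32()
	h.Write(key)
	h32 := h.Sum32()
	return (h32 >> 30) ^ (h32 & mask30) // xor-fold 32 -> 30 bits
}

func main() {
	h := hash30([]byte("example-key"))
	for depth := 0; depth < 6; depth++ {
		idx := (h >> (uint(depth) * nbits)) & (1<<nbits - 1)
		fmt.Printf("level %d index: %d\n", depth, idx)
	}
}
```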
Package roster This tiny service was built to manage rosters with very limited abilities. Check swagger.yml for more information. Tools, libraries and approaches were chosen with a focus on production readiness. Some technical decisions were made to bring more flexibility and scalability to the service. It may look overcomplicated, but there is extensibility and reliability behind the simplicity. There are no frameworks, due to the simple infrastructure. - MongoDB - chosen because it's easy to shard and has enough search-engine power to fulfill the service's needs; - Swagger - perfectly describes REST APIs and can be used to generate server and client API code in a single command; - JWT auth - used to secure changes via the API; chosen because it's a common choice for API auth; - Docker - used to satisfy the requirements and to build and run the service in an isolated environment. A multistage build keeps the resulting image as small as possible. It can be used in any cloud environment or as part of an orchestration system like k8s; - Heroku - chosen as the simplest hosting with CI/CD features because it is easy to set up and monitor. All core logic (request handlers) is covered with unit tests. There is no need to test the infrastructure and generated code, because it rarely changes but is really hard to test. The DB and web layers are isolated. The service is ready to run as a standalone microservice in any environment, e.g. k8s. It is built with the 12-factor app principles in mind and can be easily integrated into a 12-factor infrastructure with small changes or additions. To run the service as a library you need to run the Run function; it handles termination, so there is no need to set an extra context. To run from the command line, just call
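A hypothetical sketch of embedding the service as a library via the Run function mentioned above; the import path and Run's exact signature are assumptions.

```go
// Hypothetical sketch: start the roster service from another program.
package main

import (
	"log"

	roster "example.org/roster" // placeholder import path
)

func main() {
	// Run is described as handling termination itself, so no extra context
	// or signal handling is set up here.
	if err := roster.Run(); err != nil {
		log.Fatal(err)
	}
}
```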
Package scrape provides a searching api on top of golang.org/x/net/html
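A short sketch of the kind of searching API meant here, in the style of yhat/scrape on top of golang.org/x/net/html; the import path and the specific matcher helpers are assumptions if this is a different scrape package.

```go
// Sketch: parse an HTML page and find all links with a tag matcher.
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/yhat/scrape"
	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
)

func main() {
	resp, err := http.Get("https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	root, err := html.Parse(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Find every <a> element and print its text and href attribute.
	for _, link := range scrape.FindAll(root, scrape.ByTag(atom.A)) {
		fmt.Printf("%q -> %s\n", scrape.Text(link), scrape.Attr(link, "href"))
	}
}
```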
Package SQLittle provides pure Go, read-only, access to SQLite (version 3) database files. SQLittle reads SQLite3 tables and indexes. It iterates over tables, and can search efficiently using indexes. SQLittle will deal with all SQLite storage quirks, but otherwise it doesn't try to be smart; if you want to use an index you have to give the name of the index. There is no support for SQL, and if you want to do the most efficient joins possible you'll have to use the low level code. Based on https://sqlite.org/fileformat2.html and some SQLite source code reading. This whole thing is mostly for fun. The normal SQLite libraries are perfectly great, and there is no real need for this. However, since this library is pure Go, cross-compilation is much easier. Given the constraints, a valid use-case would, for example, be storing app configuration in read-only sqlite files. https://godoc.org/github.com/alicebob/sqlittle for the go doc and examples. See [LOWLEVEL.md](LOWLEVEL.md) about the low level reader. See [CODE.md](CODE.md) for an overview of how the code is structured. Things SQLittle can do: Things SQLittle should do: Things SQLittle can not do: SQLittle has a read-lock on the file during the whole execution of the select-like functions. It's safe to update the database using SQLite while the file is opened in SQLittle. The current level of abstraction is likely the final one (that is: deal with reading single tables; don't even try joins or SQL or query planning), but the API might still change.
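A minimal sketch of a table scan with the high-level API, assuming the alicebob/sqlittle Open/Select shape; the database file, table and column names are placeholders.

```go
// Sketch: open a SQLite file read-only and iterate over selected columns.
package main

import (
	"fmt"
	"log"

	"github.com/alicebob/sqlittle"
)

func main() {
	db, err := sqlittle.Open("music.sqlite") // placeholder database file
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Select scans the table; an indexed variant exists if you name an index.
	err = db.Select("tracks", func(r sqlittle.Row) {
		var artist, name string
		if err := r.Scan(&artist, &name); err != nil {
			log.Println("scan:", err)
			return
		}
		fmt.Printf("%s - %s\n", artist, name)
	}, "artist", "name")
	if err != nil {
		log.Fatal(err)
	}
}
```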
Package pipl provides a way to interact programmatically with the Pipl API in Golang. For more detailed information on the Pipl search API and what we're actually wrapping, check out their official API reference: https://docs.pipl.com/reference/#overview
* Weather API * * # Introduction WeatherAPI.com provides access to weather and geo data via a JSON/XML RESTful API. It allows developers to create desktop, web and mobile applications using this data very easily. We provide the following data through our API: - Real-time weather - 14 day weather forecast - Astronomy - Time zone - Location data - Search or Autocomplete API - NEW: Historical weather - NEW: Future weather (up to 300 days ahead) - Weather alerts - Air quality data # Getting Started You need to [signup](https://www.weatherapi.com/signup.aspx) and then you can find your API key under [your account](https://www.weatherapi.com/login.aspx), and start using the API right away! We have [code libraries](https://www.weatherapi.com/docs/code-libraries.aspx) for different programming languages like PHP, .net, JAVA, etc. If you find any features missing or have any suggestions, please [contact us](https://www.weatherapi.com/contact.aspx). # Authentication API access to the data is protected by an API key. If at any time you find the API key has become vulnerable, please regenerate the key using the Regenerate button next to the API key. Authentication to the WeatherAPI.com API is provided by passing your API key as a request parameter. ## key parameter key=<YOUR API KEY> * * API version: 1.0.0-oas3 * Generated by: Swagger Codegen (https://github.com/swagger-api/swagger-codegen.git)
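A minimal sketch of calling the API with the key request parameter described above, using net/http directly rather than the generated client; the /v1/current.json endpoint and the q parameter are assumptions based on the public WeatherAPI.com documentation.

```go
// Sketch: request current weather for a location, authenticating with key=<YOUR API KEY>.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	q := url.Values{}
	q.Set("key", "YOUR_API_KEY") // authentication: key request parameter
	q.Set("q", "London")         // location to query (assumed parameter name)

	resp, err := http.Get("https://api.weatherapi.com/v1/current.json?" + q.Encode())
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body)) // JSON payload with current weather for the location
}
```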
Package execstub provides stubbing for a usage of a command line API which is based on an executable discovered using the environment <PATH> or some sort of <HOME>+<Bin> configuration. What is a usage of a command line API? It's when you spawn a sub-process using a programming language API. Think Go exec.Cmd.Run(..), Java Runtime.exec(..), Python subprocess.run(..). The sub-process execution will yield an outcome consisting of: side effects (e.g. a file created when executing touch), bytes or strings on std-out and/or std-err, and an exit code. The executed binary is typically specified using a path-like argument. It can be: an absolute path, a relative path, or the name of the binary. Discovering the executable How the binary is discovered depends on the kind of path-like argument specified. The goal of the discovery process is to transform the path-like specification into an absolute path, so that if the specification is done with: an absolute path, nothing is left to be done; a relative path, it will be appended to the current directory to form an absolute path; the name of the binary, a file with that name will be searched for in the directories specified by the PATH environment variable, and the first one found will be used. On some platforms known and specified file extensions (e.g. .exe) will be appended to the name during the search. Even when using absolute paths, you may decide to add a customization strategy based on some process environment variable. We call this the home-bin-dir strategy. E.g. Java executed based on JAVA_HOME or JRE_HOME, with the java executable located at ${JAVA_HOME}/bin/java. Stubbing mechanism The aim of stubbing is to replace, at execution time, the actual executable with a fake one which will yield some kind of pre-defined, known outcome. This can be done if there is a step in the binary discovery which can be tweaked so that the stub executable is discovered instead of the actual one. That is the case if the binary is specified by name and discovered using PATH, or when an absolute path is used but with a customization layer in place based on a defined home environment variable. The mechanism will just have to mutate PATH or set the home variable accordingly. Stubbing becomes complicated if the specification is based on an absolute path (without a customization layer) or a relative path, because it will require e.g.: a file overlay to hide the original executable, a change of the current directory, or an in-flight replacement of the original executable and a rollback. These are either not desirable or too heavy, especially in a unit-test-ish context. The execstub module will therefore only provide stubbing for PATH and Home-Bin-Dir based discovery. Design elements Modeling the command line usage We have basically 3 aspects to model: Starting the process It is modeled using StubRequest, which holds the command name and the execution arguments so that the fake execution process can be a function of them. Execution The outcome can be a static sum of stderr, stdout and exit code known in advance. This is referred to as static mode. It may also be desirable to have some side effects, e.g. related to the execution itself, like creating a file, or related to the unit test, such as counting the number of calls. In other constellations the outcome may require some computation to determine stderr/stdout/exit code as a function of the request. We refer to these cases as dynamic mode. The execution is therefore modeled as a function, namely StubFunc. Thus, if required, the StubFunc corresponding to the current stubbing will be looked up and run.
Outcome The stderr/stdout/exit-code part of the outcome is modeled as ExecOutcome. The execution of the StubFunc may, however, result in an error. Such an error is not an actual erroneous sub-process execution outcome. It therefore must be modeled separately, as opposed to being added to stderr and setting a non-zero exit code. Such an error can be communicated using the ExecOutcome field InternalErrTxt. Stub-Executable It is the fake executable used to effectuate the overall outcome. It may release the outcome by itself. It may also cooperate with a test helper to achieve the outcome. The following command line is used: /tmp/go-build720053430/b001/xyz.test -test.run=TestHelperProcess -- arg1 arg2 arg3 It is not possible to use the test helper directly because we need to inject the args test.run and -- at the actual execution call site. Two kinds of stub-executable are provided: bash based, which uses a bash script; exec based, which uses a Go based executable. Obviously the Go based one is bound to support more platforms than the bash based one. Both executables are generic and need context information about the stubbing for which they will be used. This context data is modeled as CmdConfig and saved as a file alongside the executable. Inter process communication (IPC) to support dynamic outcomes The invocation of a StubFunc to produce a dynamic outcome is done in the unit test process. There is no means to access the stub function directly from the fake process execution. IPC is used here to allow the fake process to issue a StubRequest and receive an ExecOutcome. The serialization/deserialization mechanism needed for IPC must be understood by both parties (mind bash) and satisfy the domain-specific requirement of being able to cope with multiline stderr and stdout. We choose a combination of: base64 to encode the discrete data (e.g. stderr, stdout), and comma separated values (easily encoded and decoded in bash) for envelope encoding. Named pipes are used for the actual data transport, because they are file based and easily handled in different setups (e.g. in bash). Stubbing Setup An ExecStubber is provided to manage the stubbing setup. Its key feature is to set up an invocation of a StubFunc to replace the execution of a process; see ExecStubber.WhenExecDoStubFunc(...). It allows settings beyond the basic requirement of specifying the executable to be stubbed and a StubFunc replacement. This is modeled using Settings, which provides the following configuration options: selecting and customizing the discovery mode (PATH vs. Home-Bin-Dir); selecting the stub executable (bash vs. exec (Go based)); selecting the stubbing mode (static vs. dynamic); specifying the test process helper method name; specifying a timeout for a stub sub-process execution. A stubbing setup is identified by a key. The key can be used to: unwind the setup (Unregister(..)); access the stubbing data basis of static requests (FindAllPersistedStubRequests, DeleteAllPersistedStubRequests). Concurrency and parallelism The mutation of the process environment is the key enabling mechanism of Execstub. The process environment must therefore be guarded against issues of concurrency and parallelism. This also means that it is not possible to have a parallel stubbing setup for the same executable within a test process. The outcome would otherwise be non-deterministic because the settings would likely override each other. ExecStubber provides a locking mechanism to realize the serialization of the mutation of the environment.
For this to work correctly, however, there must be only one ExecStubber per unit test process. Note that the code under test can still execute its sub-processes concurrently or in parallel. The correctness of the outcome here depends on the implementation of the StubFunc function being used. Static outcomes without side effects are of course always deterministic.
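A hypothetical test sketch of the setup described above: register a StubFunc for an executable, run code under test, then unwind the setup. Only the names ExecStubber, WhenExecDoStubFunc, Settings, StubRequest, ExecOutcome and Unregister come from the description; signatures, struct fields and the import path are assumptions.

```go
// Hypothetical sketch of a dynamic-mode stubbing setup in a unit test.
package mypkg_test

import (
	"os/exec"
	"testing"

	execstub "example.org/execstub" // placeholder import path
)

func TestUsesStubbedGit(t *testing.T) {
	stubber := execstub.NewExecStubber() // assumed constructor

	// Dynamic mode: the outcome is computed from the incoming StubRequest.
	key, err := stubber.WhenExecDoStubFunc(
		func(req execstub.StubRequest) execstub.ExecOutcome {
			return execstub.ExecOutcome{Stdout: "stubbed " + req.CmdName, ExitCode: 0}
		},
		"git",               // executable to stub (discovered via PATH)
		execstub.Settings{}, // assumed defaults: PATH discovery, dynamic mode
	)
	if err != nil {
		t.Fatal(err)
	}
	defer stubber.Unregister(key) // unwind the setup

	// Code under test spawns the stubbed executable instead of the real one.
	out, err := exec.Command("git", "--version").Output()
	if err != nil {
		t.Fatal(err)
	}
	if string(out) != "stubbed git" {
		t.Fatalf("unexpected output: %q", out)
	}
}
```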
Package main (doc.go) : This is a CLI tool to execute Google Apps Script (GAS) on a terminal. Do you want to develop GAS on your local PC? Generally, when we develop GAS, we have to log in to Google using our own browser and develop it in the Script Editor. Recently, I have wanted a more convenient local environment for developing GAS. So I created "ggsrun". The main work is to execute GAS on a local terminal and retrieve the results from Google. 1. Develop GAS using the terminal and text editor you are accustomed to using. 2. Execute GAS by giving values to your script. 3. Execute GAS made of CoffeeScript. 4. Download spreadsheets, documents and presentations while executing GAS, simultaneously. 5. Download files from Google Drive and upload files to Google Drive. 6. Download standalone scripts and bound scripts. 7. Download all files and folders in a specific folder. 8. Upload script files and create projects as standalone scripts and container-bound scripts. 9. Update projects. 10. Retrieve revision files of Google Docs and retrieve versions of projects. 11. Rearrange scripts in a project. 12. Modify manifests in a project. 13. Search files in Google Drive using a search query and regex. 14. Manage permissions of files. 15. Get Drive information. 16. ggsrun can be used not only with OAuth2 but also with a Service Account, from v1.7.0. You can see the release page at https://github.com/tanaikech/ggsrun/releases ggsrun uses the Execution API, Web Apps and the Drive API on Google. About how to install ggsrun, please check my GitHub repository: https://github.com/tanaikech/ggsrun/ You can read the detailed information there. --------------------------------------------------------------- # How to Execute Google Apps Script Using ggsrun When you have the configuration file `ggsrun.cfg`, you can execute GAS. If you cannot find it, please download `client_secret.json` and run $ ggsrun auth In the case of using the Execution API, $ ggsrun e1 -s sample.gs If you want to execute a function other than the default `main()`, you can use an option like `-f foo`. The command `exe1` can be used to execute a function in a project. $ ggsrun e1 -f foo $ ggsrun e2 -s sample.gs With `e2`, you cannot select an executing function other than the default `main()`. `e1`, `e2` and `-s` mean using the Execution API and a GAS script file name, respectively. The sample code shown here uses the Execution API. At this time, the executing function is `main()`, which is the default, in the script. In the case of using Web Apps, $ ggsrun w -s sample.gs -p password -u [ WebApps URL ] `w` and `-p` mean using Web Apps and the password you set on the server side, respectively. Using `-u`, it imports the Web Apps URL, like `-u https://script.google.com/macros/s/#####/exec`. --------------------------------------------------------------- Package main (ggsrun.go) : This file includes all commands and options. Package main (handler.go) : Handler for ggsrun Package main (init.go) : These methods are for reading and writing the configuration file (ggsrun.cfg). Package main (materials.go) : Materials for ggsrun. Package main (oauth.go) : Get an access token using the refresh token, and confirm the condition of the access token. Package main (projectupdater.go) : These methods are for updating a project. Package main (scriptrearrange.go) : These methods are for rearranging scripts in a project. Package main (sender.go) : These methods are for sending GAS scripts to Google Drive.
Package lingua accurately detects the natural language of written text, be it long or short. Its task is simple: It tells you which language some provided textual data is written in. This is very useful as a preprocessing step for linguistic data in natural language processing applications such as text classification and spell checking. Other use cases, for instance, might include routing e-mails to the right geographically located customer service department, based on the e-mails' languages. Language detection is often done as part of large machine learning frameworks or natural language processing applications. In cases where you don't need the full-fledged functionality of those systems or don't want to learn the ropes of those, a small flexible library comes in handy. So far, the only other comprehensive open source library in the Go ecosystem for this task is Whatlanggo (https://github.com/abadojack/whatlanggo). Unfortunately, it has two major drawbacks: 1. Detection only works with quite lengthy text fragments. For very short text snippets such as Twitter messages, it does not provide adequate results. 2. The more languages take part in the decision process, the less accurate the detection results become. Lingua aims at eliminating these problems. It needs almost no configuration and yields pretty accurate results on both long and short text, even on single words and phrases. It draws on both rule-based and statistical methods but does not use any dictionaries of words. It does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline. Compared to other language detection libraries, Lingua's focus is on quality over quantity, that is, getting detection right for a small set of languages first before adding new ones. Currently, 75 languages are supported. They are listed as variants of type Language. Lingua is able to report accuracy statistics for some bundled test data available for each supported language. The test data for each language is split into three parts: 1. a list of single words with a minimum length of 5 characters 2. a list of word pairs with a minimum length of 10 characters 3. a list of complete grammatical sentences of various lengths Both the language models and the test data have been created from separate documents of the Wortschatz corpora (https://wortschatz.uni-leipzig.de) offered by Leipzig University, Germany. Data crawled from various news websites have been used for training, each corpus comprising one million sentences. For testing, corpora made of arbitrarily chosen websites have been used, each comprising ten thousand sentences. From each test corpus, a random unsorted subset of 1000 single words, 1000 word pairs and 1000 sentences has been extracted, respectively. Given the generated test data, I have compared the detection results of Lingua and Whatlanggo running over the data of Lingua's 75 supported languages. Additionally, I have added Google's CLD3 (https://github.com/google/cld3/) to the comparison with the help of the gocld3 bindings (https://github.com/jmhodges/gocld3). Languages that are not supported by CLD3 or Whatlanggo are simply ignored during the detection process. The bar and box plots (https://github.com/pemistahl/lingua-go/blob/main/ACCURACY_PLOTS.md) show the measured accuracy values for all three performed tasks: single word detection, word pair detection and sentence detection. Lingua clearly outperforms its contenders.
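A basic detection example following the lingua-go API referenced above.

```go
// Detect the language of a short phrase among a restricted set of languages.
package main

import (
	"fmt"

	"github.com/pemistahl/lingua-go"
)

func main() {
	languages := []lingua.Language{
		lingua.English,
		lingua.French,
		lingua.German,
		lingua.Spanish,
	}
	detector := lingua.NewLanguageDetectorBuilder().
		FromLanguages(languages...).
		Build()

	if language, exists := detector.DetectLanguageOf("languages are awesome"); exists {
		fmt.Println(language) // English
	}
}
```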
Detailed statistics including mean, median and standard deviation values for each language and classifier are available in tabular form (https://github.com/pemistahl/lingua-go/blob/main/ACCURACY_TABLE.md) as well. Every language detector uses a probabilistic n-gram (https://en.wikipedia.org/wiki/N-gram) model trained on the character distribution in some training corpus. Most libraries only use n-grams of size 3 (trigrams), which is satisfactory for detecting the language of longer text fragments consisting of multiple sentences. For short phrases or single words, however, trigrams are not enough. The shorter the input text is, the fewer n-grams are available. The probabilities estimated from such few n-grams are not reliable. This is why Lingua makes use of n-grams of sizes 1 up to 5, which results in much more accurate prediction of the correct language. A second important difference is that Lingua does not only use such a statistical model, but also a rule-based engine. This engine first determines the alphabet of the input text and searches for characters which are unique in one or more languages. If exactly one language can be reliably chosen this way, the statistical model is not necessary anymore. In any case, the rule-based engine filters out languages that do not satisfy the conditions of the input text. Only then, in a second step, is the probabilistic n-gram model taken into consideration. This makes sense because loading fewer language models means less memory consumption and better runtime performance. In general, it is always a good idea to restrict the set of languages to be considered in the classification process using the respective API methods. If you know beforehand that certain languages are never to occur in an input text, do not let those take part in the classification process. The filtering mechanism of the rule-based engine is quite good; however, filtering based on your own knowledge of the input text is always preferable. There might be classification tasks where you know beforehand that your language data is definitely not written in Latin, for instance. The detection accuracy can become better in such cases if you exclude certain languages from the decision process or just explicitly include relevant languages. Knowing about the most likely language is nice, but how reliable is the computed likelihood? And how much less likely are the other examined languages in comparison to the most likely one? In the example below, a slice of ConfidenceValue is returned containing all possible languages sorted by their confidence value in descending order. The values that this method computes are part of a relative confidence metric, not an absolute one. Each value is a number between 0.0 and 1.0. The most likely language is always returned with value 1.0. All other languages get values assigned which are lower than 1.0, denoting how much less likely those languages are in comparison to the most likely language. The slice returned by this method does not necessarily contain all languages which the calling instance of LanguageDetector was built from. If the rule-based engine decides that a specific language is truly impossible, then it will not be part of the returned slice. Likewise, if no n-gram probabilities can be found within the detector's languages for the given input text, the returned slice will be empty. The confidence value for each language not being part of the returned slice is assumed to be 0.0.
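A sketch of querying the relative confidence values described above; the method name ComputeLanguageConfidenceValues and the accessor methods follow the lingua-go API.

```go
// Print all candidate languages with their relative confidence values.
package main

import (
	"fmt"

	"github.com/pemistahl/lingua-go"
)

func main() {
	detector := lingua.NewLanguageDetectorBuilder().
		FromLanguages(lingua.English, lingua.French, lingua.German, lingua.Spanish).
		Build()

	// Most likely language first, with value 1.0; the rest are relative to it.
	for _, cv := range detector.ComputeLanguageConfidenceValues("languages are awesome") {
		fmt.Printf("%s: %.2f\n", cv.Language(), cv.Value())
	}
}
```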
By default, Lingua uses lazy-loading to load only those language models on demand which are considered relevant by the rule-based filter engine. For web services, for instance, it is rather beneficial to preload all language models into memory to avoid unexpected latency while waiting for the service response. If you want to enable the eager-loading mode, you can do it as seen below. Multiple instances of LanguageDetector share the same language models in memory which are accessed asynchronously by the instances. By default, Lingua returns the most likely language for a given input text. However, there are certain words that are spelled the same in more than one language. The word `prologue`, for instance, is both a valid English and French word. Lingua would output either English or French which might be wrong in the given context. For cases like that, it is possible to specify a minimum relative distance that the logarithmized and summed up probabilities for each possible language have to satisfy. It can be stated as seen below. Be aware that the distance between the language probabilities is dependent on the length of the input text. The longer the input text, the larger the distance between the languages. So if you want to classify very short text phrases, do not set the minimum relative distance too high. Otherwise Unknown will be returned most of the time as in the example below. This is the return value for cases where language detection is not reliably possible.
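The eager-loading and minimum-relative-distance options can be combined on the builder as sketched below. The option names (WithPreloadedLanguageModels, WithMinimumRelativeDistance) follow the lingua-go docs; the threshold value is only an example.

    package main

    import (
        "fmt"

        "github.com/pemistahl/lingua-go"
    )

    func main() {
        // Preload all language models up front (eager loading) and require a
        // minimum relative distance between the top candidates.
        detector := lingua.NewLanguageDetectorBuilder().
            FromAllLanguages().
            WithPreloadedLanguageModels().
            WithMinimumRelativeDistance(0.25).
            Build()

        // With a high minimum relative distance, ambiguous input such as a
        // single word shared by several languages may yield no reliable result.
        if language, exists := detector.DetectLanguageOf("prologue"); exists {
            fmt.Println(language)
        } else {
            fmt.Println("language could not be reliably detected")
        }
    }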
Package enmime implements a MIME encoding and decoding library. It's built on top of Go's included mime/multipart support where possible, but is geared towards parsing MIME encoded emails. The enmime API has two conceptual layers. The lower layer is a tree of Part structs, representing each component of a decoded MIME message. The upper layer, called an Envelope, provides an intuitive way to interact with a MIME message. Calling ReadParts causes enmime to parse the body of a MIME message into a tree of Part objects, each of which is aware of its content type, filename and headers. The content of a Part is available as a slice of bytes via the Content field. If the part was encoded in quoted-printable or base64, it is decoded prior to being placed in Content. If the Part contains text in a character set other than utf-8, enmime will attempt to convert it to utf-8. To locate a particular Part, pass a custom PartMatcher function into the BreadthMatchFirst() or DepthMatchFirst() methods to search the Part tree. BreadthMatchAll() and DepthMatchAll() will collect all Parts matching your criteria. ReadEnvelope returns an Envelope struct. Behind the scenes a Part tree is constructed, and then sorted into the correct fields of the Envelope. The Envelope contains both the plain text and HTML portions of the email. If there was no plain text Part available, the HTML Part will be down-converted using the html2text library. The root of the Part tree, as well as slices of the inline and attachment Parts, are also available. Every MIME Part has its own headers, accessible via the Part.Header field. The raw headers for an Envelope are available in Root.Header. Envelope also provides helper methods to fetch headers: GetHeader(key) will return the RFC 2047 decoded value of the specified header. AddressList(key) will convert the specified address header into a slice of net/mail.Address values. enmime attempts to be tolerant of poorly encoded MIME messages. In situations where parsing is not possible, the ReadEnvelope and ReadParts functions will return a hard error. If enmime is able to continue parsing the message, it will add an entry to the Errors slice on the relevant Part. After parsing is complete, all Part errors will be appended to the Envelope Errors slice. The Error* constants can be used to identify a specific class of error. Please note that enmime parses messages into memory, so it is not likely to perform well with multi-gigabyte attachments. enmime is open source software released under the MIT License. The latest version can be found at https://github.com/jhillyerd/enmime
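A short sketch of the Envelope layer described above (ReadEnvelope, GetHeader, the Text field and the Attachments slice); the rawMessage value is just a placeholder for a real email.

    package main

    import (
        "fmt"
        "log"
        "strings"

        "github.com/jhillyerd/enmime"
    )

    func main() {
        // rawMessage stands in for a raw email message read from disk or the network.
        rawMessage := "From: a@example.com\r\nTo: b@example.com\r\nSubject: Hello\r\n\r\nHi there\r\n"

        env, err := enmime.ReadEnvelope(strings.NewReader(rawMessage))
        if err != nil {
            log.Fatal(err)
        }

        fmt.Println("Subject:", env.GetHeader("Subject"))
        fmt.Println("Text:", env.Text)
        for _, att := range env.Attachments {
            fmt.Println("attachment:", att.FileName)
        }
    }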
Package rod is a high-level driver directly based on DevTools Protocol. This example opens https://github.com/, searches for "git", and then gets the header element which gives the description for Git. Rod uses https://golang.org/pkg/context to handle cancellations for IO blocking operations; most of the time it's a timeout. Context will be recursively passed to all sub-methods. For example, methods like Page.Context(ctx) will return a clone of the page with the ctx, and all the methods of the returned page will use the ctx if they have IO blocking operations. Page.Timeout or Page.WithCancel is just a shortcut for Page.Context. Of course, Browser or Element works the same way. Shows how we can further customize the browser with the launcher library. Usually you use the launcher lib to set the browser's command line flags (switches). Doc for flags: https://peter.sh/experiments/chromium-command-line-switches Shows how to change the retry/polling options that are used to query elements. This is useful when you want to customize the element query retry logic. When rod doesn't have a feature that you need, you can easily call the cdp to achieve it. List of cdp API: https://github.com/ScaleChamp/rod/tree/master/lib/proto Shows how to disable headless mode and debug. Rod provides a lot of debug options; you can set them with setter methods or use environment variables. Doc for environment variables: https://pkg.go.dev/github.com/ScaleChamp/rod/lib/defaults We use "Must" prefixed functions to write example code, but in production you may want to use the no-prefix version of them. About why we use "Must" as the prefix, it's similar to https://golang.org/pkg/regexp/#MustCompile Shows how to share a remote object reference between two Eval calls. Shows how to listen for events. Shows how to intercept requests and modify both the request and the response. The entire process of hijacking one request: The --req-> and --res-> are the parts that can be modified. Shows how to handle multiple results of an action, such as when you log in to a page and the result can be success or a wrong password. Example_search shows how to use Search to get an element inside nested iframes or shadow DOMs. It works the same as https://developers.google.com/web/tools/chrome-devtools/dom#search Shows how to update the state of the current page. In this example we enable the network domain. Rod uses the mouse cursor to simulate clicks, so if a button is moving because of an animation, the click may not work as expected. We usually use WaitStable to make sure the target isn't changing anymore. When you want to wait for an ajax request to complete, this example will be useful.
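The Must-style flow mentioned above can be sketched roughly as follows; the import path mirrors the fork URLs cited in this doc, and the page URL and selector are illustrative only.

    package main

    import (
        "fmt"

        "github.com/ScaleChamp/rod"
    )

    func main() {
        // Connect to a browser (launching one if needed) and close it at the end.
        browser := rod.New().MustConnect()
        defer browser.MustClose()

        // Open a page and read the text of the first matching element.
        page := browser.MustPage("https://github.com/git/git")
        fmt.Println(page.MustElement("p").MustText())
    }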
This example opens https://github.com/, searches for "git", and then gets the header element which gives the description for Git. Rod uses https://golang.org/pkg/context to handle cancellations for IO blocking operations; most of the time it's a timeout. Context will be recursively passed to all sub-methods. For example, methods like Page.Context(ctx) will return a clone of the page with the ctx, and all the methods of the returned page will use the ctx if they have IO blocking operations. Page.Timeout or Page.WithCancel is just a shortcut for Page.Context. Of course, Browser or Element works the same way. Shows how we can further customize the browser with the launcher library. Usually you use the launcher lib to set the browser's command line flags (switches). Doc for flags: https://peter.sh/experiments/chromium-command-line-switches Shows how to change the retry/polling options that are used to query elements. This is useful when you want to customize the element query retry logic. When rod doesn't have a feature that you need, you can easily call the cdp to achieve it. List of cdp API: https://github.com/go-rod/rod/tree/master/lib/proto Shows how to disable headless mode and debug. Rod provides a lot of debug options; you can set them with setter methods or use environment variables. Doc for environment variables: https://pkg.go.dev/github.com/go-rod/rod/lib/defaults We use "Must" prefixed functions to write example code, but in production you may want to use the no-prefix version of them. About why we use "Must" as the prefix, it's similar to https://golang.org/pkg/regexp/#MustCompile Shows how to share a remote object reference between two Eval calls. Shows how to listen for events. Shows how to intercept requests and modify both the request and the response. The entire process of hijacking one request: The --req-> and --res-> are the parts that can be modified. Shows how to handle multiple results of an action, such as when you log in to a page and the result can be success or a wrong password. Example_search shows how to use Search to get an element inside nested iframes or shadow DOMs. It works the same as https://developers.google.com/web/tools/chrome-devtools/dom#search Shows how to update the state of the current page. In this example we enable the network domain. Rod uses the mouse cursor to simulate clicks, so if a button is moving because of an animation, the click may not work as expected. We usually use WaitStable to make sure the target isn't changing anymore. When you want to wait for an ajax request to complete, this example will be useful.
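A small sketch of the context/timeout pattern described above, using Page.Timeout as a shortcut for Page.Context; the URL and selector are illustrative.

    package main

    import (
        "fmt"
        "time"

        "github.com/go-rod/rod"
    )

    func main() {
        browser := rod.New().MustConnect()
        defer browser.MustClose()

        page := browser.MustPage("https://github.com/git/git")

        // The cloned page returned by Timeout cancels its IO blocking
        // operations (element queries, navigation, etc.) after 10 seconds.
        el := page.Timeout(10 * time.Second).MustElement("p")
        fmt.Println(el.MustText())
    }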
Package freegeoip provides an API for searching the geolocation of IP addresses. It uses a database that can be either a local file or a remote resource from a URL. Local databases are monitored by fsnotify and reloaded when the file is either updated or overwritten. Remote databases are automatically downloaded and updated in the background so you can focus on using the API and not managing the database.
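A rough sketch of looking up an address with a local database; the import path, the Open/Lookup calls and the result struct tags are assumptions based on the package's public API and the MaxMind schema, so verify them against the actual package.

    package main

    import (
        "fmt"
        "log"
        "net"

        "github.com/fiorix/freegeoip"
    )

    func main() {
        // Open a local MaxMind database file; OpenURL can be used instead to
        // download and keep a remote database updated in the background.
        db, err := freegeoip.Open("GeoLite2-City.mmdb") // path is illustrative
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Decode only the fields we care about.
        var result struct {
            Country struct {
                ISOCode string `maxminddb:"iso_code"`
            } `maxminddb:"country"`
        }
        if err := db.Lookup(net.ParseIP("8.8.8.8"), &result); err != nil {
            log.Fatal(err)
        }
        fmt.Println(result.Country.ISOCode)
    }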
Package SQLittle provides pure Go, read-only access to SQLite (version 3) database files. SQLittle reads SQLite3 tables and indexes. It iterates over tables, and can search efficiently using indexes. SQLittle will deal with all SQLite storage quirks, but otherwise it doesn't try to be smart; if you want to use an index you have to give the name of the index. There is no support for SQL, and if you want to do the most efficient joins possible you'll have to use the low level code. Based on https://sqlite.org/fileformat2.html and some SQLite source code reading. This whole thing is mostly for fun. The normal SQLite libraries are perfectly great, and there is no real need for this. However, since this library is pure Go, cross-compilation is much easier. Given the constraints, a valid use case would be, for example, storing app configuration in read-only SQLite files. See https://godoc.org/github.com/alicebob/sqlittle for the Go doc and examples. See [LOWLEVEL.md](LOWLEVEL.md) about the low level reader. See [CODE.md](CODE.md) for an overview of how the code is structured. Things SQLittle can do: Things SQLittle should do: Things SQLittle can not do: SQLittle has a read-lock on the file during the whole execution of the select-like functions. It's safe to update the database using SQLite while the file is opened in SQLittle. The current level of abstraction is likely the final one (that is: deal with reading single tables; don't even try joins or SQL or query planning), but the API might still change.
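A small sketch of iterating over a table with the high-level API (Open, Select and Row.Scan); the database file, table and column names are placeholders.

    package main

    import (
        "fmt"
        "log"

        "github.com/alicebob/sqlittle"
    )

    func main() {
        // Open a SQLite file read-only.
        db, err := sqlittle.Open("music.sqlite") // placeholder file name
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Walk the "tracks" table, reading two columns from every row.
        err = db.Select("tracks", func(r sqlittle.Row) {
            var name string
            var length int
            if err := r.Scan(&name, &length); err != nil {
                log.Println(err)
                return
            }
            fmt.Println(name, length)
        }, "name", "length")
        if err != nil {
            log.Fatal(err)
        }
    }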
Package esql implements a search API.
Package resourcegroupstaggingapi provides the client and types for making API requests to AWS Resource Groups Tagging API. This guide describes the API operations for resource groups tagging. A tag is a label that you assign to an AWS resource. A tag consists of a key and a value, both of which you define. For example, if you have two Amazon EC2 instances, you might assign both a tag key of "Stack", but the value of "Stack" might be "Testing" for one and "Production" for the other. Tagging can help you organize your resources and enables you to simplify resource management, access management and cost allocation. For more information about tagging, see Working with Tag Editor (http://docs.aws.amazon.com/awsconsolehelpdocs/latest/gsg/tag-editor.html) and Working with Resource Groups (http://docs.aws.amazon.com/awsconsolehelpdocs/latest/gsg/resource-groups.html). For more information about permissions you need to use the resource groups tagging APIs, see Obtaining Permissions for Resource Groups (http://docs.aws.amazon.com/awsconsolehelpdocs/latest/gsg/obtaining-permissions-for-resource-groups.html) and Obtaining Permissions for Tagging (http://docs.aws.amazon.com/awsconsolehelpdocs/latest/gsg/obtaining-permissions-for-tagging.html). You can use the resource groups tagging APIs to complete the following tasks: tag and untag supported resources located in the specified region for the AWS account; use tag-based filters to search for resources located in the specified region for the AWS account; list all existing tag keys in the specified region for the AWS account; list all existing values for the specified key in the specified region for the AWS account. Not all resources can have tags. For a list of resources that you can tag, see Supported Resources (http://docs.aws.amazon.com/awsconsolehelpdocs/latest/gsg/supported-resources.html) in the AWS Resource Groups and Tag Editor User Guide. To make full use of the resource groups tagging APIs, you might need additional IAM permissions, including permission to access the resources of individual services as well as permission to view and apply tags to those resources. For more information, see Obtaining Permissions for Tagging (http://docs.aws.amazon.com/awsconsolehelpdocs/latest/gsg/obtaining-permissions-for-tagging.html) in the AWS Resource Groups and Tag Editor User Guide. See https://docs.aws.amazon.com/goto/WebAPI/resourcegroupstaggingapi-2017-01-26 for more information on this service. See the resourcegroupstaggingapi package documentation for more information. https://docs.aws.amazon.com/sdk-for-go/api/service/resourcegroupstaggingapi/ To use AWS Resource Groups Tagging API with the SDK, use the New function to create a new service client. With that client you can make API requests to the service. These clients are safe to use concurrently. See the SDK's documentation for more information on how to use the SDK. https://docs.aws.amazon.com/sdk-for-go/api/ See aws.Config documentation for more information on configuring SDK clients. https://docs.aws.amazon.com/sdk-for-go/api/aws/#Config See the AWS Resource Groups Tagging API client ResourceGroupsTaggingAPI for more information on creating a client for this service. https://docs.aws.amazon.com/sdk-for-go/api/service/resourcegroupstaggingapi/#New
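A compact sketch of creating the client with New and making one request (GetResources with a tag filter); the region, tag key and tag value are placeholders.

    package main

    import (
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi"
    )

    func main() {
        // Create a session and a service client; the client is safe for concurrent use.
        sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
        svc := resourcegroupstaggingapi.New(sess)

        // Find resources tagged Stack=Production in the configured region.
        out, err := svc.GetResources(&resourcegroupstaggingapi.GetResourcesInput{
            TagFilters: []*resourcegroupstaggingapi.TagFilter{{
                Key:    aws.String("Stack"),
                Values: []*string{aws.String("Production")},
            }},
        })
        if err != nil {
            log.Fatal(err)
        }
        for _, m := range out.ResourceTagMappingList {
            fmt.Println(aws.StringValue(m.ResourceARN))
        }
    }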
Package enmime implements a MIME encoding and decoding library. It's built on top of Go's included mime/multipart support where possible, but is geared towards parsing MIME encoded emails. The enmime API has two conceptual layers. The lower layer is a tree of Part structs, representing each component of a decoded MIME message. The upper layer, called an Envelope, provides an intuitive way to interact with a MIME message. Calling ReadParts causes enmime to parse the body of a MIME message into a tree of Part objects, each of which is aware of its content type, filename and headers. The content of a Part is available as a slice of bytes via the Content field. If the part was encoded in quoted-printable or base64, it is decoded prior to being placed in Content. If the Part contains text in a character set other than utf-8, enmime will attempt to convert it to utf-8. To locate a particular Part, pass a custom PartMatcher function into the BreadthMatchFirst() or DepthMatchFirst() methods to search the Part tree. BreadthMatchAll() and DepthMatchAll() will collect all Parts matching your criteria. ReadEnvelope returns an Envelope struct. Behind the scenes a Part tree is constructed, and then sorted into the correct fields of the Envelope. The Envelope contains both the plain text and HTML portions of the email. If there was no plain text Part available, the HTML Part will be down-converted using the html2text library. The root of the Part tree, as well as slices of the inline and attachment Parts, are also available. Every MIME Part has its own headers, accessible via the Part.Header field. The raw headers for an Envelope are available in Root.Header. Envelope also provides helper methods to fetch headers: GetHeader(key) will return the RFC 2047 decoded value of the specified header. AddressList(key) will convert the specified address header into a slice of net/mail.Address values. enmime attempts to be tolerant of poorly encoded MIME messages. In situations where parsing is not possible, the ReadEnvelope and ReadParts functions will return a hard error. If enmime is able to continue parsing the message, it will add an entry to the Errors slice on the relevant Part. After parsing is complete, all Part errors will be appended to the Envelope Errors slice. The Error* constants can be used to identify a specific class of error. Please note that enmime parses messages into memory, so it is not likely to perform well with multi-gigabyte attachments. enmime is open source software released under the MIT License. The latest version can be found at https://github.com/xoba/enmime
Package foursquarego provides a Client for the Foursquare API. Here are some example requests. There is a parameters struct if there is more than one parameter. If there are strict options for the parameters, then there will be a struct, as seen in the search above. For authentication, just send either the Client Secret or the user's Access Token. If you send both to the client, it will send both to Foursquare. Foursquare expects that if you're making a request for a user you will send the Access Token. More information can be found on their auth page, https://developer.foursquare.com/docs/api/configuration/authentication
A Go API wrapper around the arXiv.org API (arxiv.org/help/api/user-manual). Provides simple access to the search query and returns results in a simple Atom XML struct format. Made by Devin Carr (http://github.com/devincarr/goarxiv). Licensed under MIT.
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include: Breiman and Cutler's Random Forest for Classification and Regression; Adaptive Boosting (AdaBoost) Classification; Gradient Boosting Tree Regression; Entropy and Cost driven classification; L1 regression; Feature selection with artificial contrasts; Proximity and model structure analysis; Roughly balanced bagging for unbalanced classification. The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and cannot yet be released. Library documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/ryanbressler/CloudForest Documentation of command line utilities and file formats can be found in README.md, which can be viewed formatted on GitHub: http://github.com/ryanbressler/CloudForest Pull requests and bug reports are welcome. CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi-threaded and cluster environments in mind. Go's support for function types is used to provide an interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method: This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects. Decision trees are grown with the goal of reducing "Impurity", which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface, which allows for alternative definitions of impurity. CloudForest includes several alternative targets: Additional targets can be stacked on top of these targets to add boosting functionality: Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning, and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In Go, slices pointing at the same underlying arrays make this sort of optimization transparent.
For example a function like: can return left and right slices that point to the same underlying array as the original slice of cases, but these slices should not have their values changed. Functions used while searching for the best split also accept pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like: For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit which relies on Go's sorting package. Training a Random Forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate goroutines. The growforest utility provides an example of this that uses goroutines and channels to grow trees in parallel and write trees to disk as they are finished by the "worker" goroutines. The few summary statistics, like mean impurity decrease per feature (importance), can be calculated using thread safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation, and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine, Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, a distributed database, etc. By default CloudForest uses a fast heuristic for missing values. When proposing a split on a feature with missing data the missing cases are removed and the impurity value is corrected to use three way impurity, which reduces the bias towards features with lots of missing data: Missing values in the target variable are left out of impurity calculations. This provided generally good results at a fraction of the computational costs of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode, which Breiman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Breiman describes. Experimental support is provided for 3-way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing. At some point in the future support may be added for local imputing of missing values during tree growth as described in [3]. [1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 [2] https://code.google.com/p/rf-ace/ [3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record In CloudForest data is stored using the FeatureMatrix struct which contains Features.
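The in-place reordering trick described above is not specific to CloudForest and can be illustrated with plain Go slices; the sketch below is a standalone demonstration with hypothetical names, not CloudForest's actual splitting code.

    package main

    import "fmt"

    // partition reorders cases in place so that all cases for which keepLeft
    // returns true come first, then returns left and right sub-slices that
    // share the original backing array (no new allocation).
    func partition(cases []int, keepLeft func(int) bool) (left, right []int) {
        i := 0
        for j, c := range cases {
            if keepLeft(c) {
                cases[i], cases[j] = cases[j], cases[i]
                i++
            }
        }
        return cases[:i], cases[i:]
    }

    func main() {
        cases := []int{5, 2, 8, 1, 9, 3}
        left, right := partition(cases, func(c int) bool { return c < 4 })
        // left and right view the same array as cases; their values should not
        // be modified if the original ordering information must be preserved.
        fmt.Println(left, right)
    }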
The Feature struct implements storage and methods for both categorical and numerical data, calculations of impurity, etc., and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow method that implements Breiman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method, including sampling cases, but it may be faster to grow the forest to disk as in the growforest utility. Prediction and voting are done using Tree.Vote and the CatBallotBox and NumBallotBox structs, which implement the VoteTallyer interface.
Package log is an important part of the application and having a consistent logging mechanism and structure is mandatory. With several teams writing different components that talk to each other, being able to read each other's logs could be the difference between finding bugs quickly or wasting hours. With the log package in the standard library, we have the ability to create custom loggers that can be configured to write to one or many devices. Since we use syslog to send logging output to a central log repository, our logger can be configured to just write to stdout. This not only simplifies things for us, but will keep each log trace in the correct sequence. This package does not include logging levels. Everything needs to be logged to help trace the code and find bugs. There is no such thing as over logging. By the time you decide to change the logging level, it is always too late. The question of performance comes up quite a bit. If the only performance issue we see is coming from logging, we are doing very well. I have had these opinions for a long time, but if you want more clarity on the subject, listen to this recent podcast: Jon Gifford On Logging And Logging Infrastructure: Robert Blumen talks to Jon Gifford of Loggly about logging and logging infrastructure. Topics include logging defined, purposes of logging, uses of logging in understanding the run-time behavior of programs, who produces logs, who consumes logs and for what reasons, software as the consumer of logs, log formats (structured versus free form), log meta-data, logging APIs, logging as coding, logging and frameworks, the massive hairball of log file management, modern logging infrastructure in which log records are stored and indexed in a search engine, how searchable logs have transformed the uses of log data, log data and analytics, leveraging the log database for statistical insights, performance and resource issues of logging, whether logs are really different than other data that systems record in databases, and how log visualization gives users insights into their system. The show wraps up with a discussion of open source logging platforms versus commercial SaaS providers. There are two types of tracing lines we need to log. One is a trace line that describes where the program is, what it is doing and any data associated with that trace. The second is formatted data such as a JSON document or binary dump of data. Each serves a different purpose, but they both exist within the same scope of space and time. The format of each trace line needs to be consistent and helpful or else the logging will just be noise and ultimately useless. Here is a breakdown of each section and a sample value: Here are examples of how trace lines would show in the log: In the end, we want to see the flow of most functions starting and completing so we can follow the code in the logs. We want to quickly see and filter errors, which can be accomplished by using a capitalized version of the word ERROR. The context is an important value. The context allows us to extract trace lines for one context over others. Maybe in this case 8890 represents a user id. When there is a need to dump formatted data into the logs, there are three approaches.
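A minimal sketch of the single-logger-to-stdout idea using only the standard library log package; the prefix layout and the trace-line format shown here are illustrative, not this package's actual API.

    package main

    import (
        "log"
        "os"
    )

    func main() {
        // A single logger writing to stdout; syslog (or any io.Writer) could
        // be swapped in without changing the call sites.
        logger := log.New(os.Stdout, "", log.Ldate|log.Ltime|log.Lmicroseconds)

        // Trace lines: context id, function, status, message.
        logger.Printf("context[8890]: main: started: processing request")
        logger.Printf("context[8890]: main: ERROR: something failed")
        logger.Printf("context[8890]: main: completed")
    }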
If the data can be represented as key/value pairs, you can write each pair on its own line with the DATA tag: When there is a single block of data to dump, then it can be written as a single multi-line trace: When special block formatting is required, the Stringer interface can be implemented to format data in custom ways: The API for the log package is focused on initializing the logger and then provides function abstractions for the different tags we have defined.
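A short sketch of the Stringer approach for special block formatting; the Config type, its fields and the DATA/trace layout are hypothetical.

    package main

    import (
        "fmt"
        "log"
        "os"
    )

    // Config is a hypothetical data block we want to dump into the logs.
    type Config struct {
        Host string
        Port int
    }

    // String implements fmt.Stringer so the block is formatted one field per
    // line when it is written as a single multi-line trace.
    func (c Config) String() string {
        return fmt.Sprintf("CONFIG:\n\tHost: %s\n\tPort: %d", c.Host, c.Port)
    }

    func main() {
        logger := log.New(os.Stdout, "", log.Ldate|log.Ltime)
        logger.Printf("context[8890]: main: DATA:\n%v", Config{Host: "localhost", Port: 8080})
    }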