Package pointer implements Andersen's analysis, an inclusion-based pointer analysis algorithm first described in (Andersen, 1994). A pointer analysis relates every pointer expression in a whole program to the set of memory locations to which it might point. This information can be used to construct a call graph of the program that precisely represents the destinations of dynamic function and method calls. It can also be used to determine, for example, which pairs of channel operations operate on the same channel. The package allows the client to request a set of expressions of interest for which the points-to information will be returned once the analysis is complete. In addition, the client may request that a callgraph is constructed. The example program in example_test.go demonstrates both of these features. Clients should not request more information than they need since it may increase the cost of the analysis significantly. Our algorithm is INCLUSION-BASED: the points-to sets for x and y will be related by pts(y) ⊇ pts(x) if the program contains the statement y = x. It is FLOW-INSENSITIVE: it ignores all control flow constructs and the order of statements in a program. It is therefore a "MAY ALIAS" analysis: its facts are of the form "P may/may not point to L", not "P must point to L". It is FIELD-SENSITIVE: it builds separate points-to sets for distinct fields, such as x and y in struct { x, y *int }. It is mostly CONTEXT-INSENSITIVE: most functions are analyzed once, so values can flow in at one call to the function and return out at another. Only some smaller functions are analyzed with consideration of their calling context. It has a CONTEXT-SENSITIVE HEAP: objects are named by both allocation site and context, so the objects returned by two distinct calls to f: are distinguished up to the limits of the calling context. It is a WHOLE PROGRAM analysis: it requires SSA-form IR for the complete Go program and summaries for native code. See the (Hind, PASTE'01) survey paper for an explanation of these terms. The analysis is fully sound when invoked on pure Go programs that do not use reflection or unsafe.Pointer conversions. In other words, if there is any possible execution of the program in which pointer P may point to object O, the analysis will report that fact. By default, the "reflect" library is ignored by the analysis, as if all its functions were no-ops, but if the client enables the Reflection flag, the analysis will make a reasonable attempt to model the effects of calls into this library. However, this comes at a significant performance cost, and not all features of that library are yet implemented. In addition, some simplifying approximations must be made to ensure that the analysis terminates; for example, reflection can be used to construct an infinite set of types and values of those types, but the analysis arbitrarily bounds the depth of such types. Most but not all reflection operations are supported. In particular, addressable reflect.Values are not yet implemented, so operations such as (reflect.Value).Set have no analytic effect. The pointer analysis makes no attempt to understand aliasing between the operand x and result y of an unsafe.Pointer conversion: It is as if the conversion allocated an entirely new object: The analysis cannot model the aliasing effects of functions written in languages other than Go, such as runtime intrinsics in C or assembly, or code accessed via cgo. The result is as if such functions are no-ops. However, various important intrinsics are understood by the analysis, along with built-ins such as append. The analysis currently provides no way for users to specify the aliasing effects of native code. ------------------------------------------------------------------------ The remaining documentation is intended for package maintainers and pointer analysis specialists. Maintainers should have a solid understanding of the referenced papers (especially those by H&L and PKH) before making making significant changes. The implementation is similar to that described in (Pearce et al, PASTE'04). Unlike many algorithms which interleave constraint generation and solving, constructing the callgraph as they go, this implementation for the most part observes a phase ordering (generation before solving), with only simple (copy) constraints being generated during solving. (The exception is reflection, which creates various constraints during solving as new types flow to reflect.Value operations.) This improves the traction of presolver optimisations, but imposes certain restrictions, e.g. potential context sensitivity is limited since all variants must be created a priori. A type is said to be "pointer-like" if it is a reference to an object. Pointer-like types include pointers and also interfaces, maps, channels, functions and slices. We occasionally use C's x->f notation to distinguish the case where x is a struct pointer from x.f where is a struct value. Pointer analysis literature (and our comments) often uses the notation dst=*src+offset to mean something different than what it means in Go. It means: for each node index p in pts(src), the node index p+offset is in pts(dst). Similarly *dst+offset=src is used for store constraints and dst=src+offset for offset-address constraints. Nodes are the key datastructure of the analysis, and have a dual role: they represent both constraint variables (equivalence classes of pointers) and members of points-to sets (things that can be pointed at, i.e. "labels"). Nodes are naturally numbered. The numbering enables compact representations of sets of nodes such as bitvectors (or BDDs); and the ordering enables a very cheap way to group related nodes together. For example, passing n parameters consists of generating n parallel constraints from caller+i to callee+i for 0<=i<n. The zero nodeid means "not a pointer". For simplicity, we generate flow constraints even for non-pointer types such as int. The pointer equivalence (PE) presolver optimization detects which variables cannot point to anything; this includes not only all variables of non-pointer types (such as int) but also variables of pointer-like types if they are always nil, or are parameters to a function that is never called. Each node represents a scalar part of a value or object. Aggregate types (structs, tuples, arrays) are recursively flattened out into a sequential list of scalar component types, and all the elements of an array are represented by a single node. (The flattening of a basic type is a list containing a single node.) Nodes are connected into a graph with various kinds of labelled edges: simple edges (or copy constraints) represent value flow. Complex edges (load, store, etc) trigger the creation of new simple edges during the solving phase. Conceptually, an "object" is a contiguous sequence of nodes denoting an addressable location: something that a pointer can point to. The first node of an object has a non-nil obj field containing information about the allocation: its size, context, and ssa.Value. Objects include: Many objects have no Go types. For example, the func, map and chan type kinds in Go are all varieties of pointers, but their respective objects are actual functions (executable code), maps (hash tables), and channels (synchronized queues). Given the way we model interfaces, they too are pointers to "tagged" objects with no Go type. And an *ssa.Global denotes the address of a global variable, but the object for a Global is the actual data. So, the types of an ssa.Value that creates an object is "off by one indirection": a pointer to the object. The individual nodes of an object are sometimes referred to as "labels". For uniformity, all objects have a non-zero number of fields, even those of the empty type struct{}. (All arrays are treated as if of length 1, so there are no empty arrays. The empty tuple is never address-taken, so is never an object.) An tagged object has the following layout: The T node's typ field is the dynamic type of the "payload": the value v which follows, flattened out. The T node's obj has the otTagged flag. Tagged objects are needed when generalizing across types: interfaces, reflect.Values, reflect.Types. Each of these three types is modelled as a pointer that exclusively points to tagged objects. Tagged objects may be indirect (obj.flags ⊇ {otIndirect}) meaning that the value v is not of type T but *T; this is used only for reflect.Values that represent lvalues. (These are not implemented yet.) Variables of the following "scalar" types may be represented by a single node: basic types, pointers, channels, maps, slices, 'func' pointers, interfaces. Pointers: Nothing to say here, oddly. Basic types (bool, string, numbers, unsafe.Pointer): Currently all fields in the flattening of a type, including non-pointer basic types such as int, are represented in objects and values. Though non-pointer nodes within values are uninteresting, non-pointer nodes in objects may be useful (if address-taken) because they permit the analysis to deduce, in this example, that p points to s.x. If we ignored such object fields, we could only say that p points somewhere within s. All other basic types are ignored. Expressions of these types have zero nodeid, and fields of these types within aggregate other types are omitted. unsafe.Pointers are not modelled as pointers, so a conversion of an unsafe.Pointer to *T is (unsoundly) treated equivalent to new(T). Channels: An expression of type 'chan T' is a kind of pointer that points exclusively to channel objects, i.e. objects created by MakeChan (or reflection). 'chan T' is treated like *T. *ssa.MakeChan is treated as equivalent to new(T). *ssa.Send and receive (*ssa.UnOp(ARROW)) and are equivalent to store Maps: An expression of type 'map[K]V' is a kind of pointer that points exclusively to map objects, i.e. objects created by MakeMap (or reflection). map K[V] is treated like *M where M = struct{k K; v V}. *ssa.MakeMap is equivalent to new(M). *ssa.MapUpdate is equivalent to *y=x where *y and x have type M. *ssa.Lookup is equivalent to y=x.v where x has type *M. Slices: A slice []T, which dynamically resembles a struct{array *T, len, cap int}, is treated as if it were just a *T pointer; the len and cap fields are ignored. *ssa.MakeSlice is treated like new([1]T): an allocation of a *ssa.Index on a slice is equivalent to a load. *ssa.IndexAddr on a slice returns the address of the sole element of the slice, i.e. the same address. *ssa.Slice is treated as a simple copy. Functions: An expression of type 'func...' is a kind of pointer that points exclusively to function objects. A function object has the following layout: There may be multiple function objects for the same *ssa.Function due to context-sensitive treatment of some functions. The first node is the function's identity node. Associated with every callsite is a special "targets" variable, whose pts() contains the identity node of each function to which the call may dispatch. Identity words are not otherwise used during the analysis, but we construct the call graph from the pts() solution for such nodes. The following block of contiguous nodes represents the flattened-out types of the parameters ("P-block") and results ("R-block") of the function object. The treatment of free variables of closures (*ssa.FreeVar) is like that of global variables; it is not context-sensitive. *ssa.MakeClosure instructions create copy edges to Captures. A Go value of type 'func' (i.e. a pointer to one or more functions) is a pointer whose pts() contains function objects. The valueNode() for an *ssa.Function returns a singleton for that function. Interfaces: An expression of type 'interface{...}' is a kind of pointer that points exclusively to tagged objects. All tagged objects pointed to by an interface are direct (the otIndirect flag is clear) and concrete (the tag type T is not itself an interface type). The associated ssa.Value for an interface's tagged objects may be an *ssa.MakeInterface instruction, or nil if the tagged object was created by an instrinsic (e.g. reflection). Constructing an interface value causes generation of constraints for all of the concrete type's methods; we can't tell a priori which ones may be called. TypeAssert y = x.(T) is implemented by a dynamic constraint triggered by each tagged object O added to pts(x): a typeFilter constraint if T is an interface type, or an untag constraint if T is a concrete type. A typeFilter tests whether O.typ implements T; if so, O is added to pts(y). An untagFilter tests whether O.typ is assignable to T,and if so, a copy edge O.v -> y is added. ChangeInterface is a simple copy because the representation of tagged objects is independent of the interface type (in contrast to the "method tables" approach used by the gc runtime). y := Invoke x.m(...) is implemented by allocating contiguous P/R blocks for the callsite and adding a dynamic rule triggered by each tagged object added to pts(x). The rule adds param/results copy edges to/from each discovered concrete method. (Q. Why do we model an interface as a pointer to a pair of type and value, rather than as a pair of a pointer to type and a pointer to value? A. Control-flow joins would merge interfaces ({T1}, {V1}) and ({T2}, {V2}) to make ({T1,T2}, {V1,V2}), leading to the infeasible and type-unsafe combination (T1,V2). Treating the value and its concrete type as inseparable makes the analysis type-safe.) Type parameters: Type parameters are not directly supported by the analysis. Calls to generic functions will be left as if they had empty bodies. Users of the package are expected to use the ssa.InstantiateGenerics builder mode when building code that uses or depends on code containing generics. reflect.Value: A reflect.Value is modelled very similar to an interface{}, i.e. as a pointer exclusively to tagged objects, but with two generalizations. 1. a reflect.Value that represents an lvalue points to an indirect (obj.flags ⊇ {otIndirect}) tagged object, which has a similar layout to an tagged object except that the value is a pointer to the dynamic type. Indirect tagged objects preserve the correct aliasing so that mutations made by (reflect.Value).Set can be observed. Indirect objects only arise when an lvalue is derived from an rvalue by indirection, e.g. the following code: Whether indirect or not, the concrete type of the tagged object corresponds to the user-visible dynamic type, and the existence of a pointer is an implementation detail. (NB: indirect tagged objects are not yet implemented) 2. The dynamic type tag of a tagged object pointed to by a reflect.Value may be an interface type; it need not be concrete. This arises in code such as this: pts(eface) is a singleton containing an interface{}-tagged object. That tagged object's payload is an interface{} value, i.e. the pts of the payload contains only concrete-tagged objects, although in this example it's the zero interface{} value, so its pts is empty. reflect.Type: Just as in the real "reflect" library, we represent a reflect.Type as an interface whose sole implementation is the concrete type, *reflect.rtype. (This choice is forced on us by go/types: clients cannot fabricate types with arbitrary method sets.) rtype instances are canonical: there is at most one per dynamic type. (rtypes are in fact large structs but since identity is all that matters, we represent them by a single node.) The payload of each *rtype-tagged object is an *rtype pointer that points to exactly one such canonical rtype object. We exploit this by setting the node.typ of the payload to the dynamic type, not '*rtype'. This saves us an indirection in each resolution rule. As an optimisation, *rtype-tagged objects are canonicalized too. Aggregate types: Aggregate types are treated as if all directly contained aggregates are recursively flattened out. Structs: *ssa.Field y = x.f creates a simple edge to y from x's node at f's offset. *ssa.FieldAddr y = &x->f requires a dynamic closure rule to create The nodes of a struct consist of a special 'identity' node (whose type is that of the struct itself), followed by the nodes for all the struct's fields, recursively flattened out. A pointer to the struct is a pointer to its identity node. That node allows us to distinguish a pointer to a struct from a pointer to its first field. Field offsets are logical field offsets (plus one for the identity node), so the sizes of the fields can be ignored by the analysis. (The identity node is non-traditional but enables the distinction described above, which is valuable for code comprehension tools. Typical pointer analyses for C, whose purpose is compiler optimization, must soundly model unsafe.Pointer (void*) conversions, and this requires fidelity to the actual memory layout using physical field offsets.) *ssa.Field y = x.f creates a simple edge to y from x's node at f's offset. *ssa.FieldAddr y = &x->f requires a dynamic closure rule to create Arrays: We model an array by an identity node (whose type is that of the array itself) followed by a node representing all the elements of the array; the analysis does not distinguish elements with different indices. Effectively, an array is treated like struct{elem T}, a load y=x[i] like y=x.elem, and a store x[i]=y like x.elem=y; the index i is ignored. A pointer to an array is pointer to its identity node. (A slice is also a pointer to an array's identity node.) The identity node allows us to distinguish a pointer to an array from a pointer to one of its elements, but it is rather costly because it introduces more offset constraints into the system. Furthermore, sound treatment of unsafe.Pointer would require us to dispense with this node. Arrays may be allocated by Alloc, by make([]T), by calls to append, and via reflection. Tuples (T, ...): Tuples are treated like structs with naturally numbered fields. *ssa.Extract is analogous to *ssa.Field. However, tuples have no identity field since by construction, they cannot be address-taken. There are three kinds of function call: Cases 1 and 2 apply equally to methods and standalone functions. Static calls: A static call consists three steps: A static function call is little more than two struct value copies between the P/R blocks of caller and callee: Context sensitivity: Static calls (alone) may be treated context sensitively, i.e. each callsite may cause a distinct re-analysis of the callee, improving precision. Our current context-sensitivity policy treats all intrinsics and getter/setter methods in this manner since such functions are small and seem like an obvious source of spurious confluences, though this has not yet been evaluated. Dynamic function calls: Dynamic calls work in a similar manner except that the creation of copy edges occurs dynamically, in a similar fashion to a pair of struct copies in which the callee is indirect: (Recall that the function object's P- and R-blocks are contiguous.) Interface method invocation: For invoke-mode calls, we create a params/results block for the callsite and attach a dynamic closure rule to the interface. For each new tagged object that flows to the interface, we look up the concrete method, find its function object, and connect its P/R blocks to the callsite's P/R blocks, adding copy edges to the graph during solving. Recording call targets: The analysis notifies its clients of each callsite it encounters, passing a CallSite interface. Among other things, the CallSite contains a synthetic constraint variable ("targets") whose points-to solution includes the set of all function objects to which the call may dispatch. It is via this mechanism that the callgraph is made available. Clients may also elect to be notified of callgraph edges directly; internally this just iterates all "targets" variables' pts(·)s. We implement Hash-Value Numbering (HVN), a pre-solver constraint optimization described in Hardekopf & Lin, SAS'07. This is documented in more detail in hvn.go. We intend to add its cousins HR and HU in future. The solver is currently a naive Andersen-style implementation; it does not perform online cycle detection, though we plan to add solver optimisations such as Hybrid- and Lazy- Cycle Detection from (Hardekopf & Lin, PLDI'07). It uses difference propagation (Pearce et al, SQC'04) to avoid redundant re-triggering of closure rules for values already seen. Points-to sets are represented using sparse bit vectors (similar to those used in LLVM and gcc), which are more space- and time-efficient than sets based on Go's built-in map type or dense bit vectors. Nodes are permuted prior to solving so that object nodes (which may appear in points-to sets) are lower numbered than non-object (var) nodes. This improves the density of the set over which the PTSs range, and thus the efficiency of the representation. Partly thanks to avoiding map iteration, the execution of the solver is 100% deterministic, a great help during debugging. Andersen, L. O. 1994. Program analysis and specialization for the C programming language. Ph.D. dissertation. DIKU, University of Copenhagen. David J. Pearce, Paul H. J. Kelly, and Chris Hankin. 2004. Efficient field-sensitive pointer analysis for C. In Proceedings of the 5th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering (PASTE '04). ACM, New York, NY, USA, 37-42. http://doi.acm.org/10.1145/996821.996835 David J. Pearce, Paul H. J. Kelly, and Chris Hankin. 2004. Online Cycle Detection and Difference Propagation: Applications to Pointer Analysis. Software Quality Control 12, 4 (December 2004), 311-337. http://dx.doi.org/10.1023/B:SQJO.0000039791.93071.a2 David Grove and Craig Chambers. 2001. A framework for call graph construction algorithms. ACM Trans. Program. Lang. Syst. 23, 6 (November 2001), 685-746. http://doi.acm.org/10.1145/506315.506316 Ben Hardekopf and Calvin Lin. 2007. The ant and the grasshopper: fast and accurate pointer analysis for millions of lines of code. In Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation (PLDI '07). ACM, New York, NY, USA, 290-299. http://doi.acm.org/10.1145/1250734.1250767 Ben Hardekopf and Calvin Lin. 2007. Exploiting pointer and location equivalence to optimize pointer analysis. In Proceedings of the 14th international conference on Static Analysis (SAS'07), Hanne Riis Nielson and Gilberto Filé (Eds.). Springer-Verlag, Berlin, Heidelberg, 265-280. Atanas Rountev and Satish Chandra. 2000. Off-line variable substitution for scaling points-to analysis. In Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation (PLDI '00). ACM, New York, NY, USA, 47-56. DOI=10.1145/349299.349310 http://doi.acm.org/10.1145/349299.349310 This program demonstrates how to use the pointer analysis to obtain a conservative call-graph of a Go program. It also shows how to compute the points-to set of a variable, in this case, (C).f's ch parameter.
Package grpcreplay supports the capture and replay of gRPC calls. Its main goal is to improve testing. Once you capture the calls of a test that runs against a real service, you have an "automatic mock" that can be replayed against the same test, yielding a unit test that is fast and flake-free. To record a sequence of gRPC calls to a file, create a Recorder and pass its DialOptions to grpc.Dial: It is essential to close the Recorder when the interaction is finished. There is also a NewRecorderWriter function for capturing to an arbitrary io.Writer. To replay a captured file, create a Replayer and ask it for a (fake) connection. We don't actually have to dial a server. (Since we're reading the file and not writing it, we don't have to be as careful about the error returned from Close). A test might use random or time-sensitive values, for instance to create unique resources for isolation from other tests. The test therefore has initial values, such as the current time, or a random seed, that differ from run to run. You must record this initial state and re-establish it on replay. To record the initial state, serialize it into a []byte and pass it as the second argument to NewRecorder: On replay, get the bytes from Replayer.Initial: Recorders and replayers have support for running callbacks before messages are written to or read from the replay file. A Recorder has a BeforeFunc that can modify a request or response before it is written to the replay file. The actual RPCs sent to the service during recording remain unaltered; only what is saved in the replay file can be changed. A Replayer has a BeforeFunc that can modify a request before it is sent for matching. Example uses for these callbacks include customized logging, or scrubbing data before RPCs are written to the replay file. If requests are modified by the callbacks during recording, it is important to perform the same modifications to the requests when replaying, or RPC matching on replay will fail. A common way to analyze and modify the various messages is to use a type switch. A nondeterministic program may invoke RPCs in a different order each time it is run. The order in which RPCs are called during recording may differ from the order during replay. The replayer matches incoming to recorded requests by method name and request contents, so nondeterminism is only a concern for identical requests that result in different responses. A nondeterministic program whose behavior differs depending on the order of such RPCs probably has a race condition: since both the recorded sequence of RPCs and the sequence during replay are valid orderings, the program should behave the same under both. The same is not true of streaming RPCs. The replayer matches streams only by method name, since it has no other information at the time the stream is opened. Two streams with the same method name that are started concurrently may replay in the wrong order. Besides the differences in replay mentioned above, other differences may cause issues for some programs. We list them here. The Replayer delivers a response to an RPC immediately, without waiting for other incoming RPCs. This can violate causality. For example, in a Pub/Sub program where one goroutine publishes and another subscribes, during replay the Subscribe call may finish before the Publish call begins. For streaming RPCs, the Replayer delivers the result of Send and Recv calls in the order they were recorded. No attempt is made to match message contents. At present, this package does not record or replay stream headers and trailers, or the result of the CloseSend method.
Package analysis provides methods to work with a Swagger specification document from package go-openapi/spec. ## Analyzing a specification An analysed specification object (type Spec) provides methods to work with swagger definition. ## Flattening or expanding a specification Flattening a specification bundles all remote $ref in the main spec document. Depending on flattening options, additional preprocessing may take place: ## Merging several specifications Mixin several specifications merges all Swagger constructs, and warns about found conflicts. ## Fixing a specification Unmarshalling a specification with golang json unmarshalling may lead to some unwanted result on present but empty fields. ## Analyzing a Swagger schema Swagger schemas are analyzed to determine their complexity and qualify their content.
Package jump implements the "jump consistent hash" algorithm. Example Reference C++ implementation[1] Jump consistent hash works by computing when its output changes as the number of buckets increases. Let ch(key, num_buckets) be the consistent hash for the key when there are num_buckets buckets. Clearly, for any key, k, ch(k, 1) is 0, since there is only the one bucket. In order for the consistent hash function to balanced, ch(k, 2) will have to stay at 0 for half the keys, k, while it will have to jump to 1 for the other half. In general, ch(k, n+1) has to stay the same as ch(k, n) for n/(n+1) of the keys, and jump to n for the other 1/(n+1) of the keys. Here are examples of the consistent hash values for three keys, k1, k2, and k3, as num_buckets goes up: A linear time algorithm can be defined by using the formula for the probability of ch(key, j) jumping when j increases. It essentially walks across a row of this table. Given a key and number of buckets, the algorithm considers each successive bucket, j, from 1 to num_buckets1, and uses ch(key, j) to compute ch(key, j+1). At each bucket, j, it decides whether to keep ch(k, j+1) the same as ch(k, j), or to jump its value to j. In order to jump for the right fraction of keys, it uses a pseudorandom number generator with the key as its seed. To jump for 1/(j+1) of keys, it generates a uniform random number between 0.0 and 1.0, and jumps if the value is less than 1/(j+1). At the end of the loop, it has computed ch(k, num_buckets), which is the desired answer. In code: We can convert this to a logarithmic time algorithm by exploiting that ch(key, j+1) is usually unchanged as j increases, only jumping occasionally. The algorithm will only compute the destinations of jumps the j’s for which ch(key, j+1) ≠ ch(key, j). Also notice that for these j’s, ch(key, j+1) = j. To develop the algorithm, we will treat ch(key, j) as a random variable, so that we can use the notation for random variables to analyze the fractions of keys for which various propositions are true. That will lead us to a closed form expression for a pseudorandom variable whose value gives the destination of the next jump. Suppose that the algorithm is tracking the bucket numbers of the jumps for a particular key, k. And suppose that b was the destination of the last jump, that is, ch(k, b) ≠ ch(k, b+1), and ch(k, b+1) = b. Now, we want to find the next jump, the smallest j such that ch(k, j+1) ≠ ch(k, b+1), or equivalently, the largest j such that ch(k, j) = ch(k, b+1). We will make a pseudorandom variable whose value is that j. To get a probabilistic constraint on j, note that for any bucket number, i, we have j ≥ i if and only if the consistent hash hasn’t changed by i, that is, if and only if ch(k, i) = ch(k, b+1). Hence, the distribution of j must satisfy Fortunately, it is easy to compute that probability. Notice that since P( ch(k, 10) = ch(k, 11) ) is 10/11, and P( ch(k, 11) = ch(k, 12) ) is 11/12, then P( ch(k, 10) = ch(k, 12) ) is 10/11 * 11/12 = 10/12. In general, if n ≥ m, P( ch(k, n) = ch(k, m) ) = m / n. Thus for any i > b, Now, we generate a pseudorandom variable, r, (depending on k and j) that is uniformly distributed between 0 and 1. Since we want P(j ≥ i) = (b+1) / i, we set P(j ≥ i) iff r ≤ (b+1) / i. Solving the inequality for i yields P(j ≥ i) iff i ≤ (b+1) / r. Since i is a lower bound on j, j will equal the largest i for which P(j ≥ i), thus the largest i satisfying i ≤ (b+1) / r. Thus, by the definition of the floor function, j = floor((b+1) / r). Using this formula, jump consistent hash finds ch(key, num_buckets) by choosing successive jump destinations until it finds a position at or past num_buckets. It then knows that the previous jump destination is the answer. To turn this into the actual code of figure 1, we need to implement random. We want it to be fast, and yet to also to have well distributed successive values. We use a 64bit linear congruential generator; the particular multiplier we use produces random numbers that are especially well distributed in higher dimensions (i.e., when successive random values are used to form tuples). We use the key as the seed. (For keys that don’t fit into 64 bits, a 64 bit hash of the key should be used.) The congruential generator updates the seed on each iteration, and the code derives a double from the current seed. Tests show that this generator has good speed and distribution. It is worth noting that unlike the algorithm of Karger et al., jump consistent hash does not require the key to be hashed if it is already an integer. This is because jump consistent hash has an embedded pseudorandom number generator that essentially rehashes the key on every iteration. The hash is not especially good (i.e., linear congruential), but since it is applied repeatedly, additional hashing of the input key is not necessary. [1] http://arxiv.org/pdf/1406.2294v1.pdf
Package pbc provides structures for building pairing-based cryptosystems. It is a wrapper around the Pairing-Based Cryptography (PBC) Library authored by Ben Lynn (https://crypto.stanford.edu/pbc/). This wrapper provides access to all PBC functions. It supports generation of various types of elliptic curves and pairings, element initialization, I/O, and arithmetic. These features can be used to quickly build pairing-based or conventional cryptosystems. The PBC library is designed to be extremely fast. Internally, it uses GMP for arbitrary-precision arithmetic. It also includes a wide variety of optimizations that make pairing-based cryptography highly efficient. To improve performance, PBC does not perform type checking to ensure that operations actually make sense. The Go wrapper provides the ability to add compatibility checks to most operations, or to use unchecked elements to maximize performance. Since this library provides low-level access to pairing primitives, it is very easy to accidentally construct insecure systems. This library is intended to be used by cryptographers or to implement well-analyzed cryptosystems. Cryptographic pairings are defined over three mathematical groups: G1, G2, and GT, where each group is typically of the same order r. Additionally, a bilinear map e maps a pair of elements — one from G1 and another from G2 — to an element in GT. This map e has the following additional property: If G1 == G2, then a pairing is said to be symmetric. Otherwise, it is asymmetric. Pairings can be used to construct a variety of efficient cryptosystems. The PBC library currently supports 5 different types of pairings, each with configurable parameters. These types are designated alphabetically, roughly in chronological order of introduction. Type A, D, E, F, and G pairings are implemented in the library. Each type has different time and space requirements. For more information about the types, see the documentation for the corresponding generator calls, or the PBC manual page at https://crypto.stanford.edu/pbc/manual/ch05s01.html. This package must be compiled using cgo. It also requires the installation of GMP and PBC. During the build process, this package will attempt to include <gmp.h> and <pbc/pbc.h>, and then dynamically link to GMP and PBC. Most systems include a package for GMP. To install GMP in Debian / Ubuntu: For an RPM installation with YUM: For installation with Fink (http://www.finkproject.org/) on Mac OS X: For more information or to compile from source, visit https://gmplib.org/ To install the PBC library, download the appropriate files for your system from https://crypto.stanford.edu/pbc/download.html. PBC has three dependencies: the gcc compiler, flex (http://flex.sourceforge.net/), and bison (https://www.gnu.org/software/bison/). See the respective sites for installation instructions. Most distributions include packages for these libraries. For example, in Debian / Ubuntu: The PBC source can be compiled and installed using the usual GNU Build System: After installing, you may need to rebuild the search path for libraries: It is possible to install the package on Windows through the use of MinGW and MSYS. MSYS is required for installing PBC, while GMP can be installed through a package. Based on your MinGW installation, you may need to add "-I/usr/local/include" to CPPFLAGS and "-L/usr/local/lib" to LDFLAGS when building PBC. Likewise, you may need to add these options to CGO_CPPFLAGS and CGO_LDFLAGS when installing this package. This package is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. For additional details, see the COPYING and COPYING.LESSER files. This example generates a pairing and some random group elements, then applies the pairing operation. This example computes and verifies a Boneh-Lynn-Shacham signature in a simulated conversation between Alice and Bob.
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include: Breiman and Cutler's Random Forest for Classification and Regression Adaptive Boosting (AdaBoost) Classification Gradiant Boosting Tree Regression Entropy and Cost driven classification L1 regression Feature selection with artificial contrasts Proximity and model structure analysis Roughly balanced bagging for unbalanced classification The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and can not yet be released. Library Documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/ryanbressler/CloudForest Documentation of command line utilities and file formats can be found in README.md, which can be viewed fromated on github: http://github.com/ryanbressler/CloudForest Pull requests and bug reports are welcome. CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi threaded and cluster environments in mind. Go's support for function types is used to provide a interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method: This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects. Decision tree's are grown with the goal of reducing "Impurity" which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface which allows for alternative definitions of impurity. CloudForest includes several alternative targets: Additional targets can be stacked on top of these target to add boosting functionality: Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like: can return left and right slices that point to the same underlying array as the original slice of cases but these slices should not have their values changed. Functions used while searching for the best split also accepts pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like: For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit which relies on go's sorting package. Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate go routines. The growforest utility provides an example of this that uses go routines and channels to grow trees in parallel and write trees to disk as the are finished by the "worker" go routines. The few summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, distributed database etc. By default cloud forest uses a fast heuristic for missing values. When proposing a split on a feature with missing data the missing cases are removed and the impurity value is corrected to use three way impurity which reduces the bias towards features with lots of missing data: Missing values in the target variable are left out of impurity calculations. This provided generally good results at a fraction of the computational costs of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode which Brieman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Brieman describes. Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing. At some point in the future support may be added for local imputing of missing values during tree growth as described in [3] [1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 [2] https://code.google.com/p/rf-ace/ [3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record In CloudForest data is stored using the FeatureMatrix struct which contains Features. The Feature struct implements storage and methods for both categorical and numerical data and calculations of impurity etc and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow implements Brieman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method including sampling cases but it may be faster to grow the forest to disk as in the growforest utility. Prediction and Voting is done using Tree.Vote and CatBallotBox and NumBallotBox which implement the VoteTallyer interface.
Package pbc provides structures for building pairing-based cryptosystems. It is a wrapper around the Pairing-Based Cryptography (PBC) Library authored by Ben Lynn (https://crypto.stanford.edu/pbc/). This wrapper provides access to all PBC functions. It supports generation of various types of elliptic curves and pairings, element initialization, I/O, and arithmetic. These features can be used to quickly build pairing-based or conventional cryptosystems. The PBC library is designed to be extremely fast. Internally, it uses GMP for arbitrary-precision arithmetic. It also includes a wide variety of optimizations that make pairing-based cryptography highly efficient. To improve performance, PBC does not perform type checking to ensure that operations actually make sense. The Go wrapper provides the ability to add compatibility checks to most operations, or to use unchecked elements to maximize performance. Since this library provides low-level access to pairing primitives, it is very easy to accidentally construct insecure systems. This library is intended to be used by cryptographers or to implement well-analyzed cryptosystems. Cryptographic pairings are defined over three mathematical groups: G1, G2, and GT, where each group is typically of the same order r. Additionally, a bilinear map e maps a pair of elements — one from G1 and another from G2 — to an element in GT. This map e has the following additional property: If G1 == G2, then a pairing is said to be symmetric. Otherwise, it is asymmetric. Pairings can be used to construct a variety of efficient cryptosystems. The PBC library currently supports 5 different types of pairings, each with configurable parameters. These types are designated alphabetically, roughly in chronological order of introduction. Type A, D, E, F, and G pairings are implemented in the library. Each type has different time and space requirements. For more information about the types, see the documentation for the corresponding generator calls, or the PBC manual page at https://crypto.stanford.edu/pbc/manual/ch05s01.html. This package must be compiled using cgo. It also requires the installation of GMP and PBC. During the build process, this package will attempt to include <gmp.h> and <pbc/pbc.h>, and then dynamically link to GMP and PBC. Most systems include a package for GMP. To install GMP in Debian / Ubuntu: For an RPM installation with YUM: For installation with Fink (http://www.finkproject.org/) on Mac OS X: For more information or to compile from source, visit https://gmplib.org/ To install the PBC library, download the appropriate files for your system from https://crypto.stanford.edu/pbc/download.html. PBC has three dependencies: the gcc compiler, flex (http://flex.sourceforge.net/), and bison (https://www.gnu.org/software/bison/). See the respective sites for installation instructions. Most distributions include packages for these libraries. For example, in Debian / Ubuntu: The PBC source can be compiled and installed using the usual GNU Build System: After installing, you may need to rebuild the search path for libraries: It is possible to install the package on Windows through the use of MinGW and MSYS. MSYS is required for installing PBC, while GMP can be installed through a package. Based on your MinGW installation, you may need to add "-I/usr/local/include" to CPPFLAGS and "-L/usr/local/lib" to LDFLAGS when building PBC. Likewise, you may need to add these options to CGO_CPPFLAGS and CGO_LDFLAGS when installing this package. This package is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. For additional details, see the COPYING and COPYING.LESSER files. This example generates a pairing and some random group elements, then applies the pairing operation. This example computes and verifies a Boneh-Lynn-Shacham signature in a simulated conversation between Alice and Bob.
Package pbc provides structures for building pairing-based cryptosystems. It is a wrapper around the Pairing-Based Cryptography (PBC) Library authored by Ben Lynn (https://crypto.stanford.edu/pbc/). This wrapper provides access to all PBC functions. It supports generation of various types of elliptic curves and pairings, element initialization, I/O, and arithmetic. These features can be used to quickly build pairing-based or conventional cryptosystems. The PBC library is designed to be extremely fast. Internally, it uses GMP for arbitrary-precision arithmetic. It also includes a wide variety of optimizations that make pairing-based cryptography highly efficient. To improve performance, PBC does not perform type checking to ensure that operations actually make sense. The Go wrapper provides the ability to add compatibility checks to most operations, or to use unchecked elements to maximize performance. Since this library provides low-level access to pairing primitives, it is very easy to accidentally construct insecure systems. This library is intended to be used by cryptographers or to implement well-analyzed cryptosystems. Cryptographic pairings are defined over three mathematical groups: G1, G2, and GT, where each group is typically of the same order r. Additionally, a bilinear map e maps a pair of elements — one from G1 and another from G2 — to an element in GT. This map e has the following additional property: If G1 == G2, then a pairing is said to be symmetric. Otherwise, it is asymmetric. Pairings can be used to construct a variety of efficient cryptosystems. The PBC library currently supports 5 different types of pairings, each with configurable parameters. These types are designated alphabetically, roughly in chronological order of introduction. Type A, D, E, F, and G pairings are implemented in the library. Each type has different time and space requirements. For more information about the types, see the documentation for the corresponding generator calls, or the PBC manual page at https://crypto.stanford.edu/pbc/manual/ch05s01.html. This package must be compiled using cgo. It also requires the installation of GMP and PBC. During the build process, this package will attempt to include <gmp.h> and <pbc/pbc.h>, and then dynamically link to GMP and PBC. Most systems include a package for GMP. To install GMP in Debian / Ubuntu: For an RPM installation with YUM: For installation with Fink (http://www.finkproject.org/) on Mac OS X: For more information or to compile from source, visit https://gmplib.org/ To install the PBC library, download the appropriate files for your system from https://crypto.stanford.edu/pbc/download.html. PBC has three dependencies: the gcc compiler, flex (http://flex.sourceforge.net/), and bison (https://www.gnu.org/software/bison/). See the respective sites for installation instructions. Most distributions include packages for these libraries. For example, in Debian / Ubuntu: The PBC source can be compiled and installed using the usual GNU Build System: After installing, you may need to rebuild the search path for libraries: It is possible to install the package on Windows through the use of MinGW and MSYS. MSYS is required for installing PBC, while GMP can be installed through a package. Based on your MinGW installation, you may need to add "-I/usr/local/include" to CPPFLAGS and "-L/usr/local/lib" to LDFLAGS when building PBC. Likewise, you may need to add these options to CGO_CPPFLAGS and CGO_LDFLAGS when installing this package. This package is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. For additional details, see the COPYING and COPYING.LESSER files. This example generates a pairing and some random group elements, then applies the pairing operation. This example computes and verifies a Boneh-Lynn-Shacham signature in a simulated conversation between Alice and Bob.
Package analysis provides methods to work with a Swagger specification document from package protodev-site/spec. An analysed specification object (type Spec) provides methods to work with swagger definition. Flattening a specification bundles all remote $ref in the main spec document. Depending on flattening options, additional preprocessing may take place: Mixin several specifications merges all Swagger constructs, and warns about found conflicts. Unmarshalling a specification with golang json unmarshalling may lead to some unwanted result on present but empty fields. Swagger schemas are analyzed to determine their complexity and qualify their content.
Package deps analyzes and recursively installs Go package dependencies. It is library functionality similar to `go get`.
Package grpcreplay supports the capture and replay of gRPC calls. Its main goal is to improve testing. Once you capture the calls of a test that runs against a real service, you have an "automatic mock" that can be replayed against the same test, yielding a unit test that is fast and flake-free. To record a sequence of gRPC calls to a file, create a Recorder and pass its DialOptions to grpc.Dial: It is essential to close the Recorder when the interaction is finished. There is also a NewRecorderWriter function for capturing to an arbitrary io.Writer. To replay a captured file, create a Replayer and ask it for a (fake) connection. We don't actually have to dial a server. (Since we're reading the file and not writing it, we don't have to be as careful about the error returned from Close). A test might use random or time-sensitive values, for instance to create unique resources for isolation from other tests. The test therefore has initial values, such as the current time, or a random seed, that differ from run to run. You must record this initial state and re-establish it on replay. To record the initial state, serialize it into a []byte and pass it as the second argument to NewRecorder: On replay, get the bytes from Replayer.Initial: Recorders and replayers have support for running callbacks before messages are written to or read from the replay file. A Recorder has a BeforeFunc that can modify a request or response before it is written to the replay file. The actual RPCs sent to the service during recording remain unaltered; only what is saved in the replay file can be changed. A Replayer has a BeforeFunc that can modify a request before it is sent for matching. Example uses for these callbacks include customized logging, or scrubbing data before RPCs are written to the replay file. If requests are modified by the callbacks during recording, it is important to perform the same modifications to the requests when replaying, or RPC matching on replay will fail. A common way to analyze and modify the various messages is to use a type switch. A nondeterministic program may invoke RPCs in a different order each time it is run. The order in which RPCs are called during recording may differ from the order during replay. The replayer matches incoming to recorded requests by method name and request contents, so nondeterminism is only a concern for identical requests that result in different responses. A nondeterministic program whose behavior differs depending on the order of such RPCs probably has a race condition: since both the recorded sequence of RPCs and the sequence during replay are valid orderings, the program should behave the same under both. The same is not true of streaming RPCs. The replayer matches streams only by method name, since it has no other information at the time the stream is opened. Two streams with the same method name that are started concurrently may replay in the wrong order. Besides the differences in replay mentioned above, other differences may cause issues for some programs. We list them here. The Replayer delivers a response to an RPC immediately, without waiting for other incoming RPCs. This can violate causality. For example, in a Pub/Sub program where one goroutine publishes and another subscribes, during replay the Subscribe call may finish before the Publish call begins. For streaming RPCs, the Replayer delivers the result of Send and Recv calls in the order they were recorded. No attempt is made to match message contents. At present, this package does not record or replay stream headers and trailers, or the result of the CloseSend method.
Package jump implements the "jump consistent hash" algorithm. Example Reference C++ implementation[1] Jump consistent hash works by computing when its output changes as the number of buckets increases. Let ch(key, num_buckets) be the consistent hash for the key when there are num_buckets buckets. Clearly, for any key, k, ch(k, 1) is 0, since there is only the one bucket. In order for the consistent hash function to balanced, ch(k, 2) will have to stay at 0 for half the keys, k, while it will have to jump to 1 for the other half. In general, ch(k, n+1) has to stay the same as ch(k, n) for n/(n+1) of the keys, and jump to n for the other 1/(n+1) of the keys. Here are examples of the consistent hash values for three keys, k1, k2, and k3, as num_buckets goes up: A linear time algorithm can be defined by using the formula for the probability of ch(key, j) jumping when j increases. It essentially walks across a row of this table. Given a key and number of buckets, the algorithm considers each successive bucket, j, from 1 to num_buckets1, and uses ch(key, j) to compute ch(key, j+1). At each bucket, j, it decides whether to keep ch(k, j+1) the same as ch(k, j), or to jump its value to j. In order to jump for the right fraction of keys, it uses a pseudorandom number generator with the key as its seed. To jump for 1/(j+1) of keys, it generates a uniform random number between 0.0 and 1.0, and jumps if the value is less than 1/(j+1). At the end of the loop, it has computed ch(k, num_buckets), which is the desired answer. In code: We can convert this to a logarithmic time algorithm by exploiting that ch(key, j+1) is usually unchanged as j increases, only jumping occasionally. The algorithm will only compute the destinations of jumps the j’s for which ch(key, j+1) ≠ ch(key, j). Also notice that for these j’s, ch(key, j+1) = j. To develop the algorithm, we will treat ch(key, j) as a random variable, so that we can use the notation for random variables to analyze the fractions of keys for which various propositions are true. That will lead us to a closed form expression for a pseudorandom variable whose value gives the destination of the next jump. Suppose that the algorithm is tracking the bucket numbers of the jumps for a particular key, k. And suppose that b was the destination of the last jump, that is, ch(k, b) ≠ ch(k, b+1), and ch(k, b+1) = b. Now, we want to find the next jump, the smallest j such that ch(k, j+1) ≠ ch(k, b+1), or equivalently, the largest j such that ch(k, j) = ch(k, b+1). We will make a pseudorandom variable whose value is that j. To get a probabilistic constraint on j, note that for any bucket number, i, we have j ≥ i if and only if the consistent hash hasn’t changed by i, that is, if and only if ch(k, i) = ch(k, b+1). Hence, the distribution of j must satisfy Fortunately, it is easy to compute that probability. Notice that since P( ch(k, 10) = ch(k, 11) ) is 10/11, and P( ch(k, 11) = ch(k, 12) ) is 11/12, then P( ch(k, 10) = ch(k, 12) ) is 10/11 * 11/12 = 10/12. In general, if n ≥ m, P( ch(k, n) = ch(k, m) ) = m / n. Thus for any i > b, Now, we generate a pseudorandom variable, r, (depending on k and j) that is uniformly distributed between 0 and 1. Since we want P(j ≥ i) = (b+1) / i, we set P(j ≥ i) iff r ≤ (b+1) / i. Solving the inequality for i yields P(j ≥ i) iff i ≤ (b+1) / r. Since i is a lower bound on j, j will equal the largest i for which P(j ≥ i), thus the largest i satisfying i ≤ (b+1) / r. Thus, by the definition of the floor function, j = floor((b+1) / r). Using this formula, jump consistent hash finds ch(key, num_buckets) by choosing successive jump destinations until it finds a position at or past num_buckets. It then knows that the previous jump destination is the answer. To turn this into the actual code of figure 1, we need to implement random. We want it to be fast, and yet to also to have well distributed successive values. We use a 64bit linear congruential generator; the particular multiplier we use produces random numbers that are especially well distributed in higher dimensions (i.e., when successive random values are used to form tuples). We use the key as the seed. (For keys that don’t fit into 64 bits, a 64 bit hash of the key should be used.) The congruential generator updates the seed on each iteration, and the code derives a double from the current seed. Tests show that this generator has good speed and distribution. It is worth noting that unlike the algorithm of Karger et al., jump consistent hash does not require the key to be hashed if it is already an integer. This is because jump consistent hash has an embedded pseudorandom number generator that essentially rehashes the key on every iteration. The hash is not especially good (i.e., linear congruential), but since it is applied repeatedly, additional hashing of the input key is not necessary. [1] http://arxiv.org/pdf/1406.2294v1.pdf
A dynamic and extensible music library organizer Demlo is a music library organizer. It can encode, fix case, change folder hierarchy according to tags or file properties, tag from an online database, copy covers while ignoring duplicates or those below a quality threshold, and much more. It makes it possible to manage your libraries uniformly and dynamically. You can write your own rules to fit your needs best. Demlo aims at being as lightweight and portable as possible. Its major runtime dependency is the transcoder FFmpeg. The scripts are written in Lua for portability and speed while allowing virtually unlimited extensibility. Usage: For usage options, see: First Demlo creates a list of all input files. When a folder is specified, all files matching the extensions from the 'extensions' variable will be appended to the list. Identical files are appended only once. Next all files get analyzed: - The audio file details (tags, stream properties, format properties, etc.) are stored into the 'input' variable. The 'output' variable gets its default values from 'input', or from an index file if specified from command-line. If no index has been specified and if an attached cuesheet is found, all cuesheet details are appended accordingly. Cuesheet tags override stream tags, which override format tags. Finally, still without index, tags can be retrieved from Internet if the command-line option is set. - If a prescript has been specified, it gets executed. It makes it possible to adjust the input values and global variables before running the other scripts. - The scripts, if any, get executed in the lexicographic order of their basename. The 'output' variable is transformed accordingly. Scripts may contain rules such as defining a new file name, new tags, new encoding properties, etc. You can use conditions on input values to set the output properties, which makes it virtually possible to process a full music library in one single run. - If a postscript has been specified, it gets executed. It makes it possible to adjust the output of the script for the current run only. - Demlo makes some last-minute tweaking if need be: it adjusts the bitrate, the path, the encoding parameters, and so on. - A preview of changes is displayed. - When applying changes, the covers get copied if required and the audio file gets processed: tags are modified as specified, the file is re-encoded if required, and the output is written to the appropriate folder. When destination already exists, the 'exist' action is executed. The program's default behaviour can be changed from the user configuration file. (See the 'Files' section for a template.) Most command-line flags default value can be changed. The configuration file is loaded on startup, before parsing the command-line options. Review the default value of the CLI flags with 'demlo -h'. If you wish to use no configuration file, set the environment variable DEMLORC to ".". Scripts can contain any safe Lua code. Some functions like 'os.execute' are not available for security reasons. It is not possible to print to the standard output/error unless running in debug mode and using the 'debug' function. See the 'sandbox.go' file for a list of allowed functions and variables. Lua patterns are replaced by Go regexps. See https://github.com/google/re2/wiki/Syntax. Scripts have no requirements at all. However, to be useful, they should set values of the 'output' table detailed in the 'Variables' section. You can use the full power of the Lua to set the variables dynamically. For instance: 'input' and 'output' are both accessible from any script. All default functions and variables (excluding 'output') are reset on every script call to enforce consistency. Local variables are lost from one script call to another. Global variables are preserved. Use this feature to pass data like options or new functions. 'output' structure consistency is guaranteed at the start of every script. Demlo will only extract the fields with the right type as described in the 'Variables' section. Warning: Do not abuse of global variables, especially when processing non-fixed size data (e.g. tables). Data could grow big and slow down the program. By default, when the destination exists, Demlo will append a suffix to the output destination. This behaviour can be changed from the 'exist' action specified by the user. Demlo comes with a few default actions. The 'exist' action works just like scripts with the following differences: - Any change to 'output.path' will be skipped. - An additional variable is accessible from the action: 'existinfo' holds the file details of the existing files in the same fashion as 'input'. This allows for comparing the input file and the existing destination. The writing rules can be tweaked the following way: Word of caution: overwriting breaks Demlo's rule of not altering existing files. It can lead to undesired results if the overwritten file is also part of the (yet to be processed) input. The overwrite capability can be useful when syncing music libraries however. The user scripts should be generic. Therefore they may not properly handle some uncommon input values. Tweak the input with temporary overrides from command-line. The prescript and postscript defined on command-line will let you run arbitrary code that is run before and after all other scripts, respectively. Use global variables to transfer data and parameters along. If the prescript and postscript end up being too long, consider writing a demlo script. You can also define shell aliases or use wrapper scripts as convenience. The 'input' table describes the file: Bitrate is in bits per seconds (bps). That is, for 320 kbps you would specify The 'time' is the modification time of the file. It holds the sec seconds and nsec nanoseconds since January 1, 1970 UTC. The entry 'streams' and 'format' are as returned by It gives access to most metadata that FFmpeg can return. For instance, to get the duration of the track in seconds, query the variable 'input.format.duration'. Since there may be more than one stream (covers, other data), the first audio stream is assumed to be the music stream. For convenience, the index of the music stream is stored in 'audioindex'. The tags returned by FFmpeg are found in streams, format and in the cuesheet. To make tag queries easier, all tags are stored in the 'tags' table, with the following precedence: You can remove a tag by setting it to 'nil' or the empty string. This is equivalent, except that 'nil' saves some memory during the process. The 'output' table describes the transformation to apply to the file: The 'parameters' array holds the CLI parameters passed to FFmpeg. It can be anything supported by FFmpeg, although this variable is supposed to hold encoding information. See the 'Examples' section. The 'embeddedcovers', 'externalcovers' and 'onlinecover' variables are detailed in the 'Covers' section. The 'write' variable is covered in the 'Existing destination' section. The 'rmsrc' variable is a boolean: when true, Demlo removes the source file after processing. This can speed up the process when not re-encoding. This option is ignored for multi-track files. For convenience, the following shortcuts are provided: Demlo provides some non-standard Lua functions to ease scripting. Display a message on stderr if debug mode is on. Return lowercase string without non-alphanumeric characters nor leading zeros. Return the relation coefficient of the two input strings. The result is a float in 0.0...1.0, 0.0 means no relation at all, 1.0 means identical strings. A format is a container in FFmpeg's terminology. 'output.parameters' contains CLI flags passed to FFmpeg. They are meant to set the stream codec, the bitrate, etc. If 'output.parameters' is {'-c:a', 'copy'} and the format is identical, then taglib will be used instead of FFmpeg. Use this rule from a (post)script to disable encoding by setting the same format and the copy parameters. This speeds up the process. The official scripts are usually very smart at guessing the right values. They might make mistakes however. If you are unsure, you can (and you are advised to) preview the results before proceeding. The 'diff' preview is printed to stderr. A JSON preview of the changes is printed to stdout if stdout is redirected. The initial values of the 'output' table can be completed with tags fetched from the MusicBrainz database. Audio files are fingerprinted for the queries, so even with initially wrong file names and tags, the right values should still be retrieved. The front album cover can also be retrieved. Proxy parameters will be fetched automatically from the 'http_proxy' and 'https_proxy' environment variables. As this process requires network access it can be quite slow. Nevertheless, Demlo is specifically optimized for albums, so that network queries are used for only one track per album, when possible. Some tracks can be released on different albums: Demlo tries to guess it from the tags, but if the tags are wrong there is no way to know which one it is. There is a case where the selection can be controlled: let's assume we have tracks A, B and C from the same album Z. A and B were also released in album Y, whereas C was release in Z only. Tags for A will be checked online; let's assume it gets tagged to album Y. B will use A details, so album Y too. Then C does not match neither A's nor B's album, so another online query will be made and it will be tagged to album Z. This is slow and does not yield the expected result. Now let's call Tags for C will be queried online, and C will be tagged to Z. Then both A and B will match album Z so they will be tagged using C details, which is the desired result. Conclusion: when using online tagging, the first argument should be the lesser known track of the album. Demlo can set the output variables according to the values set in a text file before calling the script. The input values are ignored as well as online tagging, but it is still possible to access the input table from scripts. This 'index' file is formatted in JSON. It corresponds to what Demlo outputs when printing the JSON preview. This is valid JSON except for the missing beginning and the missing end. It makes it possible to concatenate and to append to existing index files. Demlo will automatically complete the missing parts so that it becomes valid JSON. The index file is useful when you want to edit tags manually: You can redirect the output to a file, edit the content manually with your favorite text editor, then run Demlo again with the index as argument. See the 'Examples' section. This feature can also be used to interface Demlo with other programs. Demlo can manage embedded covers as well as external covers. External covers are queried from files matching known extensions in the file's folder. Embedded covers are queried from static video streams in the file. Covers are accessed from The embedded covers are indexed numerically by order of appearance in the streams. The first cover will be at index 1 and so on. This is not necessarily the index of the stream. 'inputcover' is the following structure: 'format' is the picture format. FFmpeg makes a distinction between format and codec, but it is not useful for covers. The name of the format is specified by Demlo, not by FFmpeg. Hence the 'jpeg' name, instead of 'mjpeg' as FFmpeg puts it. 'width' and 'height' hold the size in pixels. 'checksum' can be used to identify files uniquely. For performance reasons, only a partial checksum is performed. This variable is typically used for skipping duplicates. Cover transformations are specified in 'outputcover' has the following structure: The format is specified by FFmpeg this time. See the comments on 'format' for 'inputcover'. 'parameters' is used in the same fashion as 'output.parameters'. User configuration: This must be a Lua file. See the 'demlorc' file provided with this package for an exhaustive list of options. Folder containing the official scripts: User script folder: Create this folder and add your own scripts inside. This folder takes precedence over the system folder, so scripts with the same name will be found in the user folder first. The following examples will not proceed unless the '-p' command-line option is true. Important: you _must_ use single quotes for the runtime Lua command to prevent expansion. Inside the Lua code, use double quotes for strings and escape single quotes. Show default options: Preview changes made by the default scripts: Use 'alternate' script if found in user or system script folder (user folder first): Add the Lua file to the list of scripts. This feature is convenient if you want to write scripts that are too complex to fit on the command-line, but not generic enough to fit the user or system script folders. Remove all script from the list, then add '30-case' and '60-path' scripts. Note that '30-case' will be run before '60-path'. Do not use any script but '60-path'. The file content is unchanged and the file is renamed to a dynamically computed destination. Demlo performs an instant rename if destination is on the same device. Otherwise it copies the file and removes the source. Use the default scripts (if set in configuration file), but do not re-encode: Set 'artist' to the value of 'composer', and 'title' to be preceded by the new value of 'artist', then apply the default script. Do not re-encode. Order in runtime script matters. Mind the double quotes. Set track number to first number in input file name: Use the default scripts but keep original value for the 'artist' tag: 1) Preview default scripts transformation and save it to an index. 2) Edit file to fix any potential mistake. 3) Run Demlo over the same files using the index information only. Same as above but generate output filename according to the custom '61-rename' script. The numeric prefix is important: it ensures that '61-rename' will be run after all the default tag related scripts and after '60-path'. Otherwise, if a change in tags would occur later on, it would not affect the renaming script. Retrieve tags from Internet: Same as above but for a whole album, and saving the result to an index: Only download the cover for the album corresponding to the track. Use 'rmsrc' to avoid duplicating the audio file. Change tags inplace with entries from MusicBrainz: Set tags to titlecase while casing AC-DC correctly: To easily switch between formats from command-line, create one script per format (see 50-encoding.lua), e.g. ogg.lua and flac.lua. Then Add support for non-default formats from CLI: Overwrite existing destination if input is newer: ffmpeg(1), ffprobe(1), http://www.lua.org/pil/contents.html
Package analysis provides methods to work with a Swagger specification document from package circl-dev/spec. An analysed specification object (type Spec) provides methods to work with swagger definition. Flattening a specification bundles all remote $ref in the main spec document. Depending on flattening options, additional preprocessing may take place: Mixin several specifications merges all Swagger constructs, and warns about found conflicts. Unmarshalling a specification with golang json unmarshalling may lead to some unwanted result on present but empty fields. Swagger schemas are analyzed to determine their complexity and qualify their content.
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include: Breiman and Cutler's Random Forest for Classification and Regression Adaptive Boosting (AdaBoost) Classification Gradiant Boosting Tree Regression Entropy and Cost driven classification L1 regression Feature selection with artificial contrasts Proximity and model structure analysis Roughly balanced bagging for unbalanced classification The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and can not yet be released. Library Documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/IlyaLab/CloudForest Documentation of command line utilities and file formats can be found in README.md, which can be viewed fromated on github: http://github.com/IlyaLab/CloudForest Pull requests and bug reports are welcome. CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi threaded and cluster environments in mind. Go's support for function types is used to provide a interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method: This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects. Decision tree's are grown with the goal of reducing "Impurity" which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface which allows for alternative definitions of impurity. CloudForest includes several alternative targets: Additional targets can be stacked on top of these target to add boosting functionality: Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like: can return left and right slices that point to the same underlying array as the original slice of cases but these slices should not have their values changed. Functions used while searching for the best split also accepts pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like: For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit which relies on go's sorting package. Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate go routines. The growforest utility provides an example of this that uses go routines and channels to grow trees in parallel and write trees to disk as the are finished by the "worker" go routines. The few summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, distributed database etc. By default cloud forest uses a fast heuristic for missing values. When proposing a split on a feature with missing data the missing cases are removed and the impurity value is corrected to use three way impurity which reduces the bias towards features with lots of missing data: Missing values in the target variable are left out of impurity calculations. This provided generally good results at a fraction of the computational costs of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode which Brieman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Brieman describes. Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing. At some point in the future support may be added for local imputing of missing values during tree growth as described in [3] [1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 [2] https://code.google.com/p/rf-ace/ [3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record In CloudForest data is stored using the FeatureMatrix struct which contains Features. The Feature struct implements storage and methods for both categorical and numerical data and calculations of impurity etc and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow implements Brieman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method including sampling cases but it may be faster to grow the forest to disk as in the growforest utility. Prediction and Voting is done using Tree.Vote and CatBallotBox and NumBallotBox which implement the VoteTallyer interface.
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include: Breiman and Cutler's Random Forest for Classification and Regression Adaptive Boosting (AdaBoost) Classification Gradiant Boosting Tree Regression Entropy and Cost driven classification L1 regression Feature selection with artificial contrasts Proximity and model structure analysis Roughly balanced bagging for unbalanced classification The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and can not yet be released. Library Documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/IlyaLab/CloudForest Documentation of command line utilities and file formats can be found in README.md, which can be viewed fromated on github: http://github.com/IlyaLab/CloudForest Pull requests and bug reports are welcome. CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi threaded and cluster environments in mind. Go's support for function types is used to provide a interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method: This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects. Decision tree's are grown with the goal of reducing "Impurity" which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface which allows for alternative definitions of impurity. CloudForest includes several alternative targets: Additional targets can be stacked on top of these target to add boosting functionality: Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like: can return left and right slices that point to the same underlying array as the original slice of cases but these slices should not have their values changed. Functions used while searching for the best split also accepts pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like: For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit which relies on go's sorting package. Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate go routines. The growforest utility provides an example of this that uses go routines and channels to grow trees in parallel and write trees to disk as the are finished by the "worker" go routines. The few summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, distributed database etc. By default cloud forest uses a fast heuristic for missing values. When proposing a split on a feature with missing data the missing cases are removed and the impurity value is corrected to use three way impurity which reduces the bias towards features with lots of missing data: Missing values in the target variable are left out of impurity calculations. This provided generally good results at a fraction of the computational costs of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode which Brieman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Brieman describes. Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing. At some point in the future support may be added for local imputing of missing values during tree growth as described in [3] [1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 [2] https://code.google.com/p/rf-ace/ [3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record In CloudForest data is stored using the FeatureMatrix struct which contains Features. The Feature struct implements storage and methods for both categorical and numerical data and calculations of impurity etc and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow implements Brieman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method including sampling cases but it may be faster to grow the forest to disk as in the growforest utility. Prediction and Voting is done using Tree.Vote and CatBallotBox and NumBallotBox which implement the VoteTallyer interface.
Package pbc provides structures for building pairing-based cryptosystems. It is a wrapper around the Pairing-Based Cryptography (PBC) Library authored by Ben Lynn (https://crypto.stanford.edu/pbc/). This wrapper provides access to all PBC functions. It supports generation of various types of elliptic curves and pairings, element initialization, I/O, and arithmetic. These features can be used to quickly build pairing-based or conventional cryptosystems. The PBC library is designed to be extremely fast. Internally, it uses GMP for arbitrary-precision arithmetic. It also includes a wide variety of optimizations that make pairing-based cryptography highly efficient. To improve performance, PBC does not perform type checking to ensure that operations actually make sense. The Go wrapper provides the ability to add compatibility checks to most operations, or to use unchecked elements to maximize performance. Since this library provides low-level access to pairing primitives, it is very easy to accidentally construct insecure systems. This library is intended to be used by cryptographers or to implement well-analyzed cryptosystems. Cryptographic pairings are defined over three mathematical groups: G1, G2, and GT, where each group is typically of the same order r. Additionally, a bilinear map e maps a pair of elements — one from G1 and another from G2 — to an element in GT. This map e has the following additional property: If G1 == G2, then a pairing is said to be symmetric. Otherwise, it is asymmetric. Pairings can be used to construct a variety of efficient cryptosystems. The PBC library currently supports 5 different types of pairings, each with configurable parameters. These types are designated alphabetically, roughly in chronological order of introduction. Type A, D, E, F, and G pairings are implemented in the library. Each type has different time and space requirements. For more information about the types, see the documentation for the corresponding generator calls, or the PBC manual page at https://crypto.stanford.edu/pbc/manual/ch05s01.html. This package must be compiled using cgo. It also requires the installation of GMP and PBC. During the build process, this package will attempt to include <gmp.h> and <pbc/pbc.h>, and then dynamically link to GMP and PBC. Most systems include a package for GMP. To install GMP in Debian / Ubuntu: For an RPM installation with YUM: For installation with Fink (http://www.finkproject.org/) on Mac OS X: For more information or to compile from source, visit https://gmplib.org/ To install the PBC library, download the appropriate files for your system from https://crypto.stanford.edu/pbc/download.html. PBC has three dependencies: the gcc compiler, flex (http://flex.sourceforge.net/), and bison (https://www.gnu.org/software/bison/). See the respective sites for installation instructions. Most distributions include packages for these libraries. For example, in Debian / Ubuntu: The PBC source can be compiled and installed using the usual GNU Build System: After installing, you may need to rebuild the search path for libraries: It is possible to install the package on Windows through the use of MinGW and MSYS. MSYS is required for installing PBC, while GMP can be installed through a package. Based on your MinGW installation, you may need to add "-I/usr/local/include" to CPPFLAGS and "-L/usr/local/lib" to LDFLAGS when building PBC. Likewise, you may need to add these options to CGO_CPPFLAGS and CGO_LDFLAGS when installing this package. This package is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. For additional details, see the COPYING and COPYING.LESSER files. This example generates a pairing and some random group elements, then applies the pairing operation. This example computes and verifies a Boneh-Lynn-Shacham signature in a simulated conversation between Alice and Bob.
Example Reference C++ implementation[1] Jump consistent hash works by computing when its output changes as the number of buckets increases. Let ch(key, num_buckets) be the consistent hash for the key when there are num_buckets buckets. Clearly, for any key, k, ch(k, 1) is 0, since there is only the one bucket. In order for the consistent hash function to balanced, ch(k, 2) will have to stay at 0 for half the keys, k, while it will have to jump to 1 for the other half. In general, ch(k, n+1) has to stay the same as ch(k, n) for n/(n+1) of the keys, and jump to n for the other 1/(n+1) of the keys. Here are examples of the consistent hash values for three keys, k1, k2, and k3, as num_buckets goes up: A linear time algorithm can be defined by using the formula for the probability of ch(key, j) jumping when j increases. It essentially walks across a row of this table. Given a key and number of buckets, the algorithm considers each successive bucket, j, from 1 to num_buckets1, and uses ch(key, j) to compute ch(key, j+1). At each bucket, j, it decides whether to keep ch(k, j+1) the same as ch(k, j), or to jump its value to j. In order to jump for the right fraction of keys, it uses a pseudorandom number generator with the key as its seed. To jump for 1/(j+1) of keys, it generates a uniform random number between 0.0 and 1.0, and jumps if the value is less than 1/(j+1). At the end of the loop, it has computed ch(k, num_buckets), which is the desired answer. In code: We can convert this to a logarithmic time algorithm by exploiting that ch(key, j+1) is usually unchanged as j increases, only jumping occasionally. The algorithm will only compute the destinations of jumps the j’s for which ch(key, j+1) ≠ ch(key, j). Also notice that for these j’s, ch(key, j+1) = j. To develop the algorithm, we will treat ch(key, j) as a random variable, so that we can use the notation for random variables to analyze the fractions of keys for which various propositions are true. That will lead us to a closed form expression for a pseudorandom variable whose value gives the destination of the next jump. Suppose that the algorithm is tracking the bucket numbers of the jumps for a particular key, k. And suppose that b was the destination of the last jump, that is, ch(k, b) ≠ ch(k, b+1), and ch(k, b+1) = b. Now, we want to find the next jump, the smallest j such that ch(k, j+1) ≠ ch(k, b+1), or equivalently, the largest j such that ch(k, j) = ch(k, b+1). We will make a pseudorandom variable whose value is that j. To get a probabilistic constraint on j, note that for any bucket number, i, we have j ≥ i if and only if the consistent hash hasn’t changed by i, that is, if and only if ch(k, i) = ch(k, b+1). Hence, the distribution of j must satisfy Fortunately, it is easy to compute that probability. Notice that since P( ch(k, 10) = ch(k, 11) ) is 10/11, and P( ch(k, 11) = ch(k, 12) ) is 11/12, then P( ch(k, 10) = ch(k, 12) ) is 10/11 * 11/12 = 10/12. In general, if n ≥ m, P( ch(k, n) = ch(k, m) ) = m / n. Thus for any i > b, Now, we generate a pseudorandom variable, r, (depending on k and j) that is uniformly distributed between 0 and 1. Since we want P(j ≥ i) = (b+1) / i, we set P(j ≥ i) iff r ≤ (b+1) / i. Solving the inequality for i yields P(j ≥ i) iff i ≤ (b+1) / r. Since i is a lower bound on j, j will equal the largest i for which P(j ≥ i), thus the largest i satisfying i ≤ (b+1) / r. Thus, by the definition of the floor function, j = floor((b+1) / r). Using this formula, jump consistent hash finds ch(key, num_buckets) by choosing successive jump destinations until it finds a position at or past num_buckets. It then knows that the previous jump destination is the answer. To turn this into the actual code of figure 1, we need to implement random. We want it to be fast, and yet to also to have well distributed successive values. We use a 64bit linear congruential generator; the particular multiplier we use produces random numbers that are especially well distributed in higher dimensions (i.e., when successive random values are used to form tuples). We use the key as the seed. (For keys that don’t fit into 64 bits, a 64 bit hash of the key should be used.) The congruential generator updates the seed on each iteration, and the code derives a double from the current seed. Tests show that this generator has good speed and distribution. It is worth noting that unlike the algorithm of Karger et al., jump consistent hash does not require the key to be hashed if it is already an integer. This is because jump consistent hash has an embedded pseudorandom number generator that essentially rehashes the key on every iteration. The hash is not especially good (i.e., linear congruential), but since it is applied repeatedly, additional hashing of the input key is not necessary. [1] http://arxiv.org/pdf/1406.2294v1.pdf
Example Reference C++ implementation[1] Jump consistent hash works by computing when its output changes as the number of buckets increases. Let ch(key, num_buckets) be the consistent hash for the key when there are num_buckets buckets. Clearly, for any key, k, ch(k, 1) is 0, since there is only the one bucket. In order for the consistent hash function to balanced, ch(k, 2) will have to stay at 0 for half the keys, k, while it will have to jump to 1 for the other half. In general, ch(k, n+1) has to stay the same as ch(k, n) for n/(n+1) of the keys, and jump to n for the other 1/(n+1) of the keys. Here are examples of the consistent hash values for three keys, k1, k2, and k3, as num_buckets goes up: A linear time algorithm can be defined by using the formula for the probability of ch(key, j) jumping when j increases. It essentially walks across a row of this table. Given a key and number of buckets, the algorithm considers each successive bucket, j, from 1 to num_buckets1, and uses ch(key, j) to compute ch(key, j+1). At each bucket, j, it decides whether to keep ch(k, j+1) the same as ch(k, j), or to jump its value to j. In order to jump for the right fraction of keys, it uses a pseudorandom number generator with the key as its seed. To jump for 1/(j+1) of keys, it generates a uniform random number between 0.0 and 1.0, and jumps if the value is less than 1/(j+1). At the end of the loop, it has computed ch(k, num_buckets), which is the desired answer. In code: We can convert this to a logarithmic time algorithm by exploiting that ch(key, j+1) is usually unchanged as j increases, only jumping occasionally. The algorithm will only compute the destinations of jumps the j’s for which ch(key, j+1) ≠ ch(key, j). Also notice that for these j’s, ch(key, j+1) = j. To develop the algorithm, we will treat ch(key, j) as a random variable, so that we can use the notation for random variables to analyze the fractions of keys for which various propositions are true. That will lead us to a closed form expression for a pseudorandom variable whose value gives the destination of the next jump. Suppose that the algorithm is tracking the bucket numbers of the jumps for a particular key, k. And suppose that b was the destination of the last jump, that is, ch(k, b) ≠ ch(k, b+1), and ch(k, b+1) = b. Now, we want to find the next jump, the smallest j such that ch(k, j+1) ≠ ch(k, b+1), or equivalently, the largest j such that ch(k, j) = ch(k, b+1). We will make a pseudorandom variable whose value is that j. To get a probabilistic constraint on j, note that for any bucket number, i, we have j ≥ i if and only if the consistent hash hasn’t changed by i, that is, if and only if ch(k, i) = ch(k, b+1). Hence, the distribution of j must satisfy Fortunately, it is easy to compute that probability. Notice that since P( ch(k, 10) = ch(k, 11) ) is 10/11, and P( ch(k, 11) = ch(k, 12) ) is 11/12, then P( ch(k, 10) = ch(k, 12) ) is 10/11 * 11/12 = 10/12. In general, if n ≥ m, P( ch(k, n) = ch(k, m) ) = m / n. Thus for any i > b, Now, we generate a pseudorandom variable, r, (depending on k and j) that is uniformly distributed between 0 and 1. Since we want P(j ≥ i) = (b+1) / i, we set P(j ≥ i) iff r ≤ (b+1) / i. Solving the inequality for i yields P(j ≥ i) iff i ≤ (b+1) / r. Since i is a lower bound on j, j will equal the largest i for which P(j ≥ i), thus the largest i satisfying i ≤ (b+1) / r. Thus, by the definition of the floor function, j = floor((b+1) / r). Using this formula, jump consistent hash finds ch(key, num_buckets) by choosing successive jump destinations until it finds a position at or past num_buckets. It then knows that the previous jump destination is the answer. To turn this into the actual code of figure 1, we need to implement random. We want it to be fast, and yet to also to have well distributed successive values. We use a 64bit linear congruential generator; the particular multiplier we use produces random numbers that are especially well distributed in higher dimensions (i.e., when successive random values are used to form tuples). We use the key as the seed. (For keys that don’t fit into 64 bits, a 64 bit hash of the key should be used.) The congruential generator updates the seed on each iteration, and the code derives a double from the current seed. Tests show that this generator has good speed and distribution. It is worth noting that unlike the algorithm of Karger et al., jump consistent hash does not require the key to be hashed if it is already an integer. This is because jump consistent hash has an embedded pseudorandom number generator that essentially rehashes the key on every iteration. The hash is not especially good (i.e., linear congruential), but since it is applied repeatedly, additional hashing of the input key is not necessary. [1] http://arxiv.org/pdf/1406.2294v1.pdf
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include: Breiman and Cutler's Random Forest for Classification and Regression Adaptive Boosting (AdaBoost) Classification Gradiant Boosting Tree Regression Entropy and Cost driven classification L1 regression Feature selection with artificial contrasts Proximity and model structure analysis Roughly balanced bagging for unbalanced classification The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and can not yet be released. Library Documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/ryanbressler/CloudForest Documentation of command line utilities and file formats can be found in README.md, which can be viewed fromated on github: http://github.com/ryanbressler/CloudForest Pull requests and bug reports are welcome. CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi threaded and cluster environments in mind. Go's support for function types is used to provide a interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method: This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects. Decision tree's are grown with the goal of reducing "Impurity" which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface which allows for alternative definitions of impurity. CloudForest includes several alternative targets: Additional targets can be stacked on top of these target to add boosting functionality: Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like: can return left and right slices that point to the same underlying array as the original slice of cases but these slices should not have their values changed. Functions used while searching for the best split also accepts pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like: For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit which relies on go's sorting package. Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate go routines. The growforest utility provides an example of this that uses go routines and channels to grow trees in parallel and write trees to disk as the are finished by the "worker" go routines. The few summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, distributed database etc. By default cloud forest uses a fast heuristic for missing values. When proposing a split on a feature with missing data the missing cases are removed and the impurity value is corrected to use three way impurity which reduces the bias towards features with lots of missing data: Missing values in the target variable are left out of impurity calculations. This provided generally good results at a fraction of the computational costs of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode which Brieman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Brieman describes. Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing. At some point in the future support may be added for local imputing of missing values during tree growth as described in [3] [1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 [2] https://code.google.com/p/rf-ace/ [3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record In CloudForest data is stored using the FeatureMatrix struct which contains Features. The Feature struct implements storage and methods for both categorical and numerical data and calculations of impurity etc and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow implements Brieman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method including sampling cases but it may be faster to grow the forest to disk as in the growforest utility. Prediction and Voting is done using Tree.Vote and CatBallotBox and NumBallotBox which implement the VoteTallyer interface.
Package grpcreplay supports the capture and replay of gRPC calls. Its main goal is to improve testing. Once you capture the calls of a test that runs against a real service, you have an "automatic mock" that can be replayed against the same test, yielding a unit test that is fast and flake-free. To record a sequence of gRPC calls to a file, create a Recorder and pass its DialOptions to grpc.Dial: It is essential to close the Recorder when the interaction is finished. There is also a NewRecorderWriter function for capturing to an arbitrary io.Writer. To replay a captured file, create a Replayer and ask it for a (fake) connection. We don't actually have to dial a server. (Since we're reading the file and not writing it, we don't have to be as careful about the error returned from Close). A test might use random or time-sensitive values, for instance to create unique resources for isolation from other tests. The test therefore has initial values, such as the current time, or a random seed, that differ from run to run. You must record this initial state and re-establish it on replay. To record the initial state, serialize it into a []byte and pass it as the second argument to NewRecorder: On replay, get the bytes from Replayer.Initial: Recorders and replayers have support for running callbacks before messages are written to or read from the replay file. A Recorder has a BeforeFunc that can modify a request or response before it is written to the replay file. The actual RPCs sent to the service during recording remain unaltered; only what is saved in the replay file can be changed. A Replayer has a BeforeFunc that can modify a request before it is sent for matching. Example uses for these callbacks include customized logging, or scrubbing data before RPCs are written to the replay file. If requests are modified by the callbacks during recording, it is important to perform the same modifications to the requests when replaying, or RPC matching on replay will fail. A common way to analyze and modify the various messages is to use a type switch. A nondeterministic program may invoke RPCs in a different order each time it is run. The order in which RPCs are called during recording may differ from the order during replay. The replayer matches incoming to recorded requests by method name and request contents, so nondeterminism is only a concern for identical requests that result in different responses. A nondeterministic program whose behavior differs depending on the order of such RPCs probably has a race condition: since both the recorded sequence of RPCs and the sequence during replay are valid orderings, the program should behave the same under both. The same is not true of streaming RPCs. The replayer matches streams only by method name, since it has no other information at the time the stream is opened. Two streams with the same method name that are started concurrently may replay in the wrong order. Besides the differences in replay mentioned above, other differences may cause issues for some programs. We list them here. The Replayer delivers a response to an RPC immediately, without waiting for other incoming RPCs. This can violate causality. For example, in a Pub/Sub program where one goroutine publishes and another subscribes, during replay the Subscribe call may finish before the Publish call begins. For streaming RPCs, the Replayer delivers the result of Send and Recv calls in the order they were recorded. No attempt is made to match message contents. At present, this package does not record or replay stream headers and trailers, or the result of the CloseSend method.
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang). It includes implementations of Breiman and Cutler's Random Forest for classification and regression on heterogeneous numerical/categorical data with missing values and several related algorithms including entropy and cost driven classification, L1 regression and feature selection with artificial contrasts and hooks for modifying the algorithms for your needs. Command line utilities to grow, apply and analyze forests are provided in sub directories. CloudForest is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. Documentation has been generated with godoc and can be viewed live at: http://godoc.org/github.com/ryanbressler/CloudForest Pull requests and bug reports are welcome; Code Repo and Issue tracker can be found at: https://github.com/ryanbressler/CloudForest CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi threaded and cluster environments in mind. Go's support for function types is used to provide a interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method: This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects. Decision tree's are grown with the goal of reducing "Impurity" which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface which allows for alternative definitions of impurity. CloudForest includes several alternative targets: Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like: can return left and right slices that point to the same underlying array as the origional slice of cases but these slices should not have their values changed. Functions used while searching for the best split also accepts pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like: For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit which relies on go's sorting package. Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate go routines. The growforest utility provides an example of this that uses go routines and channels to grow trees in parallel and write trees to disk as the are finished by the "worker" go routines. The few summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, distributed database etc. By default cloud forest uses a fast heuristic for missing values. When proposing a split on a feature with missing data the missing cases are removed and the impurity value is corrected to use three way impurity which reduces the bias towards features with lots of missing data: Missing values in the target variable are left out of impurity calculations. This provided generally good results at a fraction of the computational costs of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode which Brieman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Brieman describes. Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing. At some point in the future support may be added for local imputing of missing values during tree growth as described in [3] [1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 [2] https://code.google.com/p/rf-ace/ [3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record Variable Importance in CloudForest is calculated as the mean decrease in impurity over all of the splits made using a feature. To provide a baseline for evaluating importance, artificial contrast features can be used by including shuffled copies of existing features. In CloudForest data is stored using the FeatureMatrix struct which contains Features. The Feature struct implements storage and methods for both categorical and numerical data and calculations of impurity etc and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow implements Brieman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method including sampling cases but it may be faster to grow the forest to disk as in the growforest utility. Prediction and Voting is done using Tree.Vote and CatBallotBox and NumBallotBox which implement the VoteTallyer interface. When compiled with go1.1 CloudForest achieves running times similar to implementations in other languages. Using gccgo (4.8.0 at least) results in longer running times and is not recommended until full go1.1 support is implemented in gcc 4.8.1. Development of CloudForest is being driven by our needs as we analyze large biomedical data sets. As such new and modified analysis will be added as needed. The basic functionality has stabilized but we have discussed several possible changes that may require additional abstraction and/or changes in the api. These include: Allow additional types of candidate features. Some multidimensional data types may not be best served by decomposition into categorical and numerical features. It would be possible to allow arbitrary feature types by adding CanidateFeature (which should expose BestSplit), CodedSplitter and Splitter abstraction. Allowing data to reside anywhere. This would involve abstracting FeatureMatrix to allow database etc driven implementations. "growforest" trains a forest using the following parameters which can be listed with -h "applyforest" applies a forest to the specified feature matrix and outputs predictions as a two column (caselabel predictedvalue) tsv. errorrate calculates the error of a forest vs a testing data set and reports it to standard out leafcount outputs counts of case case co-occurrence on leaf nodes (Brieman's proximity) and counts of the number of times a feature is used to split a node containing each case (a measure of relative/local importance). CloudForest borrows the annotated feature matrix (.afm) and stoicastic forest (.sf) file formats from Timo Erkkila's rf-ace which can be found at https://code.google.com/p/rf-ace/ An annotated feature matrix (.afm) file is a tab delineated file with column and row headers. Columns represent cases and rows represent features. A row header/feature id includes a prefix to specify the feature type Categorical and boolean features use strings for their category labels. Missing values are represented by "?","nan","na", or "null" (case insensitive). A short example: A stochastic forest (.sf) file contains a forest of decision trees. The main advantage of this format as opposed to an established format like json is that an sf file can be written iteratively tree by tree and multiple .sf files can be combined with minimal logic required allowing for massively parallel growth of forests with low memory use. An .sf file consists of lines each of which is a comma separated list of key value pairs. Lines can designate either a FOREST, TREE, or NODE. Each tree belongs to the preceding forest and each node to the preceding tree. Nodes must be written in order of increasing depth. CloudForest generates fewer fields then rf-ace but requires the following. Other fields will be ignored Forest requires forest type (only RF currently), target and ntrees: Tree requires only an int and the value is ignored though the line is needed to designate a new tree: Node requires a path encoded so that the root node is specified by "*" and each split left or right as "L" or "R". Leaf nodes should also define PRED such as "PRED=1.5" or "PRED=red". Splitter nodes should define SPLITTER with a feature id inside of double quotes, SPLITTERTYPE=[CATEGORICAL|NUMERICAL] and a LVALUE term which can be either a float inside of double quotes representing the highest value sent left or a ":" separated list of categorical values sent left. An example .sf file: Cloud forest can parse and apply .sf files generated by at least some versions of rf-ace. The idea for (and trademark of the term) Random Forests originated with Leo Brieman and Adele Cuttler. Their code and paper's can be found at: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm All code in CloudForest is original but some ideas for methods and optimizations were inspired by Timo Erkilla's rf-ace and Andy Liaw and Matthew Wiener randomForest R package based on Brieman and Cuttler's code: https://code.google.com/p/rf-ace/ http://cran.r-project.org/web/packages/randomForest/index.html The idea for Artificial Contrasts was found in: Eugene Tuv, Alexander Borisov, George Runger and Kari Torkkola's paper "Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination" http://www.researchgate.net/publication/220320233_Feature_Selection_with_Ensembles_Artificial_Variables_and_Redundancy_Elimination/file/d912f5058a153a8b35.pdf The idea for growing trees to minimize categorical entropy comes from Ross Quinlan's ID3: http://en.wikipedia.org/wiki/ID3_algorithm "The Elements of Statistical Learning" 2nd edition by Trevor Hastie, Robert Tibshirani and Jerome Friedman was also consulted during development.
A dynamic and extensible music library organizer Demlo is a music library organizer. It can encode, fix case, change folder hierarchy according to tags or file properties, tag from an online database, copy covers while ignoring duplicates or those below a quality threshold, and much more. It makes it possible to manage your libraries uniformly and dynamically. You can write your own rules to fit your needs best. Demlo aims at being as lightweight and portable as possible. Its major runtime dependency is the transcoder FFmpeg. The scripts are written in Lua for portability and speed while allowing virtually unlimited extensibility. Usage: For usage options, see: First Demlo creates a list of all input files. When a folder is specified, all files matching the extensions from the 'extensions' variable will be appended to the list. Identical files are appended only once. Next all files get analyzed: - The audio file details (tags, stream properties, format properties, etc.) are stored into the 'input' variable. The 'output' variable gets its default values from 'input', or from an index file if specified from command-line. If no index has been specified and if an attached cuesheet is found, all cuesheet details are appended accordingly. Cuesheet tags override stream tags, which override format tags. Finally, still without index, tags can be retrieved from Internet if the command-line option is set. - If a prescript has been specified, it gets executed. It makes it possible to adjust the input values and global variables before running the other scripts. - The scripts, if any, get executed in the lexicographic order of their basename. The 'output' variable is transformed accordingly. Scripts may contain rules such as defining a new file name, new tags, new encoding properties, etc. You can use conditions on input values to set the output properties, which makes it virtually possible to process a full music library in one single run. - If a postscript has been specified, it gets executed. It makes it possible to adjust the output of the script for the current run only. - Demlo makes some last-minute tweaking if need be: it adjusts the bitrate, the path, the encoding parameters, and so on. - A preview of changes is displayed. - When applying changes, the covers get copied if required and the audio file gets processed: tags are modified as specified, the file is re-encoded if required, and the output is written to the appropriate folder. When destination already exists, the 'exist' action is executed. The program's default behaviour can be changed from the user configuration file. (See the 'Files' section for a template.) Most command-line flags default value can be changed. The configuration file is loaded on startup, before parsing the command-line options. Review the default value of the CLI flags with 'demlo -h'. If you wish to use no configuration file, set the environment variable DEMLORC to ".". Scripts can contain any safe Lua code. Some functions like 'os.execute' are not available for security reasons. It is not possible to print to the standard output/error unless running in debug mode and using the 'debug' function. See the 'sandbox.go' file for a list of allowed functions and variables. Lua patterns are replaced by Go regexps. See https://github.com/google/re2/wiki/Syntax. Scripts have no requirements at all. However, to be useful, they should set values of the 'output' table detailed in the 'Variables' section. You can use the full power of the Lua to set the variables dynamically. For instance: 'input' and 'output' are both accessible from any script. All default functions and variables (excluding 'output') are reset on every script call to enforce consistency. Local variables are lost from one script call to another. Global variables are preserved. Use this feature to pass data like options or new functions. 'output' structure consistency is guaranteed at the start of every script. Demlo will only extract the fields with the right type as described in the 'Variables' section. Warning: Do not abuse of global variables, especially when processing non-fixed size data (e.g. tables). Data could grow big and slow down the program. By default, when the destination exists, Demlo will append a suffix to the output destination. This behaviour can be changed from the 'exist' action specified by the user. Demlo comes with a few default actions. The 'exist' action works just like scripts with the following differences: - Any change to 'output.path' will be skipped. - An additional variable is accessible from the action: 'existinfo' holds the file details of the existing files in the same fashion as 'input'. This allows for comparing the input file and the existing destination. The writing rules can be tweaked the following way: Word of caution: overwriting breaks Demlo's rule of not altering existing files. It can lead to undesired results if the overwritten file is also part of the (yet to be processed) input. The overwrite capability can be useful when syncing music libraries however. The user scripts should be generic. Therefore they may not properly handle some uncommon input values. Tweak the input with temporary overrides from command-line. The prescript and postscript defined on command-line will let you run arbitrary code that is run before and after all other scripts, respectively. Use global variables to transfer data and parameters along. If the prescript and postscript end up being too long, consider writing a demlo script. You can also define shell aliases or use wrapper scripts as convenience. The 'input' table describes the file: Bitrate is in bits per seconds (bps). That is, for 320 kbps you would specify The 'time' is the modification time of the file. It holds the sec seconds and nsec nanoseconds since January 1, 1970 UTC. The entry 'streams' and 'format' are as returned by It gives access to most metadata that FFmpeg can return. For instance, to get the duration of the track in seconds, query the variable 'input.format.duration'. Since there may be more than one stream (covers, other data), the first audio stream is assumed to be the music stream. For convenience, the index of the music stream is stored in 'audioindex'. The tags returned by FFmpeg are found in streams, format and in the cuesheet. To make tag queries easier, all tags are stored in the 'tags' table, with the following precedence: You can remove a tag by setting it to 'nil' or the empty string. This is equivalent, except that 'nil' saves some memory during the process. The 'output' table describes the transformation to apply to the file: The 'parameters' array holds the CLI parameters passed to FFmpeg. It can be anything supported by FFmpeg, although this variable is supposed to hold encoding information. See the 'Examples' section. The 'embeddedcovers', 'externalcovers' and 'onlinecover' variables are detailed in the 'Covers' section. The 'write' variable is covered in the 'Existing destination' section. The 'rmsrc' variable is a boolean: when true, Demlo removes the source file after processing. This can speed up the process when not re-encoding. This option is ignored for multi-track files. For convenience, the following shortcuts are provided: Demlo provides some non-standard Lua functions to ease scripting. Display a message on stderr if debug mode is on. Return lowercase string without non-alphanumeric characters nor leading zeros. Return the relation coefficient of the two input strings. The result is a float in 0.0...1.0, 0.0 means no relation at all, 1.0 means identical strings. A format is a container in FFmpeg's terminology. 'output.parameters' contains CLI flags passed to FFmpeg. They are meant to set the stream codec, the bitrate, etc. If 'output.parameters' is {'-c:a', 'copy'} and the format is identical, then taglib will be used instead of FFmpeg. Use this rule from a (post)script to disable encoding by setting the same format and the copy parameters. This speeds up the process. The official scripts are usually very smart at guessing the right values. They might make mistakes however. If you are unsure, you can (and you are advised to) preview the results before proceeding. The 'diff' preview is printed to stderr. A JSON preview of the changes is printed to stdout if stdout is redirected. The initial values of the 'output' table can be completed with tags fetched from the MusicBrainz database. Audio files are fingerprinted for the queries, so even with initially wrong file names and tags, the right values should still be retrieved. The front album cover can also be retrieved. Proxy parameters will be fetched automatically from the 'http_proxy' and 'https_proxy' environment variables. As this process requires network access it can be quite slow. Nevertheless, Demlo is specifically optimized for albums, so that network queries are used for only one track per album, when possible. Some tracks can be released on different albums: Demlo tries to guess it from the tags, but if the tags are wrong there is no way to know which one it is. There is a case where the selection can be controlled: let's assume we have tracks A, B and C from the same album Z. A and B were also released in album Y, whereas C was release in Z only. Tags for A will be checked online; let's assume it gets tagged to album Y. B will use A details, so album Y too. Then C does not match neither A's nor B's album, so another online query will be made and it will be tagged to album Z. This is slow and does not yield the expected result. Now let's call Tags for C will be queried online, and C will be tagged to Z. Then both A and B will match album Z so they will be tagged using C details, which is the desired result. Conclusion: when using online tagging, the first argument should be the lesser known track of the album. Demlo can set the output variables according to the values set in a text file before calling the script. The input values are ignored as well as online tagging, but it is still possible to access the input table from scripts. This 'index' file is formatted in JSON. It corresponds to what Demlo outputs when printing the JSON preview. This is valid JSON except for the missing beginning and the missing end. It makes it possible to concatenate and to append to existing index files. Demlo will automatically complete the missing parts so that it becomes valid JSON. The index file is useful when you want to edit tags manually: You can redirect the output to a file, edit the content manually with your favorite text editor, then run Demlo again with the index as argument. See the 'Examples' section. This feature can also be used to interface Demlo with other programs. Demlo can manage embedded covers as well as external covers. External covers are queried from files matching known extensions in the file's folder. Embedded covers are queried from static video streams in the file. Covers are accessed from The embedded covers are indexed numerically by order of appearance in the streams. The first cover will be at index 1 and so on. This is not necessarily the index of the stream. 'inputcover' is the following structure: 'format' is the picture format. FFmpeg makes a distinction between format and codec, but it is not useful for covers. The name of the format is specified by Demlo, not by FFmpeg. Hence the 'jpeg' name, instead of 'mjpeg' as FFmpeg puts it. 'width' and 'height' hold the size in pixels. 'checksum' can be used to identify files uniquely. For performance reasons, only a partial checksum is performed. This variable is typically used for skipping duplicates. Cover transformations are specified in 'outputcover' has the following structure: The format is specified by FFmpeg this time. See the comments on 'format' for 'inputcover'. 'parameters' is used in the same fashion as 'output.parameters'. User configuration: This must be a Lua file. See the 'demlorc' file provided with this package for an exhaustive list of options. Folder containing the official scripts: User script folder: Create this folder and add your own scripts inside. This folder takes precedence over the system folder, so scripts with the same name will be found in the user folder first. The following examples will not proceed unless the '-p' command-line option is true. Important: you _must_ use single quotes for the runtime Lua command to prevent expansion. Inside the Lua code, use double quotes for strings and escape single quotes. Show default options: Preview changes made by the default scripts: Use 'alternate' script if found in user or system script folder (user folder first): Add the Lua file to the list of scripts. This feature is convenient if you want to write scripts that are too complex to fit on the command-line, but not generic enough to fit the user or system script folders. Remove all script from the list, then add '30-case' and '60-path' scripts. Note that '30-case' will be run before '60-path'. Do not use any script but '60-path'. The file content is unchanged and the file is renamed to a dynamically computed destination. Demlo performs an instant rename if destination is on the same device. Otherwise it copies the file and removes the source. Use the default scripts (if set in configuration file), but do not re-encode: Set 'artist' to the value of 'composer', and 'title' to be preceded by the new value of 'artist', then apply the default script. Do not re-encode. Order in runtime script matters. Mind the double quotes. Set track number to first number in input file name: Use the default scripts but keep original value for the 'artist' tag: 1) Preview default scripts transformation and save it to an index. 2) Edit file to fix any potential mistake. 3) Run Demlo over the same files using the index information only. Same as above but generate output filename according to the custom '61-rename' script. The numeric prefix is important: it ensures that '61-rename' will be run after all the default tag related scripts and after '60-path'. Otherwise, if a change in tags would occur later on, it would not affect the renaming script. Retrieve tags from Internet: Same as above but for a whole album, and saving the result to an index: Only download the cover for the album corresponding to the track. Use 'rmsrc' to avoid duplicating the audio file. Change tags inplace with entries from MusicBrainz: Set tags to titlecase while casing AC-DC correctly: To easily switch between formats from command-line, create one script per format (see 50-encoding.lua), e.g. ogg.lua and flac.lua. Then Add support for non-default formats from CLI: Overwrite existing destination if input is newer: ffmpeg(1), ffprobe(1), http://www.lua.org/pil/contents.html
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include: Breiman and Cutler's Random Forest for Classification and Regression Adaptive Boosting (AdaBoost) Classification Gradiant Boosting Tree Regression Entropy and Cost driven classification L1 regression Feature selection with artificial contrasts Proximity and model structure analysis Roughly balanced bagging for unbalanced classification The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and can not yet be released. Library Documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/ryanbressler/CloudForest Documentation of command line utilities and file formats can be found in README.md, which can be viewed fromated on github: http://github.com/ryanbressler/CloudForest Pull requests and bug reports are welcome. CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi threaded and cluster environments in mind. Go's support for function types is used to provide a interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method: This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects. Decision tree's are grown with the goal of reducing "Impurity" which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface which allows for alternative definitions of impurity. CloudForest includes several alternative targets: Additional targets can be stacked on top of these target to add boosting functionality: Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like: can return left and right slices that point to the same underlying array as the original slice of cases but these slices should not have their values changed. Functions used while searching for the best split also accepts pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like: For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit which relies on go's sorting package. Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate go routines. The growforest utility provides an example of this that uses go routines and channels to grow trees in parallel and write trees to disk as the are finished by the "worker" go routines. The few summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, distributed database etc. By default cloud forest uses a fast heuristic for missing values. When proposing a split on a feature with missing data the missing cases are removed and the impurity value is corrected to use three way impurity which reduces the bias towards features with lots of missing data: Missing values in the target variable are left out of impurity calculations. This provided generally good results at a fraction of the computational costs of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode which Brieman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Brieman describes. Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing. At some point in the future support may be added for local imputing of missing values during tree growth as described in [3] [1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 [2] https://code.google.com/p/rf-ace/ [3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record In CloudForest data is stored using the FeatureMatrix struct which contains Features. The Feature struct implements storage and methods for both categorical and numerical data and calculations of impurity etc and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow implements Brieman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method including sampling cases but it may be faster to grow the forest to disk as in the growforest utility. Prediction and Voting is done using Tree.Vote and CatBallotBox and NumBallotBox which implement the VoteTallyer interface.
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include: Breiman and Cutler's Random Forest for Classification and Regression Adaptive Boosting (AdaBoost) Classification Gradiant Boosting Tree Regression Entropy and Cost driven classification L1 regression Feature selection with artificial contrasts Proximity and model structure analysis Roughly balanced bagging for unbalanced classification The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and can not yet be released. Library Documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/ryanbressler/CloudForest Documentation of command line utilities and file formats can be found in README.md, which can be viewed fromated on github: http://github.com/ryanbressler/CloudForest Pull requests and bug reports are welcome. CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi threaded and cluster environments in mind. Go's support for function types is used to provide a interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method: This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects. Decision tree's are grown with the goal of reducing "Impurity" which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface which allows for alternative definitions of impurity. CloudForest includes several alternative targets: Additional targets can be stacked on top of these target to add boosting functionality: Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like: can return left and right slices that point to the same underlying array as the original slice of cases but these slices should not have their values changed. Functions used while searching for the best split also accepts pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like: For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit which relies on go's sorting package. Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate go routines. The growforest utility provides an example of this that uses go routines and channels to grow trees in parallel and write trees to disk as the are finished by the "worker" go routines. The few summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, distributed database etc. By default cloud forest uses a fast heuristic for missing values. When proposing a split on a feature with missing data the missing cases are removed and the impurity value is corrected to use three way impurity which reduces the bias towards features with lots of missing data: Missing values in the target variable are left out of impurity calculations. This provided generally good results at a fraction of the computational costs of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode which Brieman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Brieman describes. Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing. At some point in the future support may be added for local imputing of missing values during tree growth as described in [3] [1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 [2] https://code.google.com/p/rf-ace/ [3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record In CloudForest data is stored using the FeatureMatrix struct which contains Features. The Feature struct implements storage and methods for both categorical and numerical data and calculations of impurity etc and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow implements Brieman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method including sampling cases but it may be faster to grow the forest to disk as in the growforest utility. Prediction and Voting is done using Tree.Vote and CatBallotBox and NumBallotBox which implement the VoteTallyer interface.
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include: Breiman and Cutler's Random Forest for Classification and Regression Adaptive Boosting (AdaBoost) Classification Gradiant Boosting Tree Regression Entropy and Cost driven classification L1 regression Feature selection with artificial contrasts Proximity and model structure analysis Roughly balanced bagging for unbalanced classification The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and can not yet be released. Library Documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/ryanbressler/CloudForest Documentation of command line utilities and file formats can be found in README.md, which can be viewed fromated on github: http://github.com/ryanbressler/CloudForest Pull requests and bug reports are welcome. CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi threaded and cluster environments in mind. Go's support for function types is used to provide a interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method: This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects. Decision tree's are grown with the goal of reducing "Impurity" which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface which allows for alternative definitions of impurity. CloudForest includes several alternative targets: Additional targets can be stacked on top of these target to add boosting functionality: Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like: can return left and right slices that point to the same underlying array as the original slice of cases but these slices should not have their values changed. Functions used while searching for the best split also accepts pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like: For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit which relies on go's sorting package. Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate go routines. The growforest utility provides an example of this that uses go routines and channels to grow trees in parallel and write trees to disk as the are finished by the "worker" go routines. The few summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, distributed database etc. By default cloud forest uses a fast heuristic for missing values. When proposing a split on a feature with missing data the missing cases are removed and the impurity value is corrected to use three way impurity which reduces the bias towards features with lots of missing data: Missing values in the target variable are left out of impurity calculations. This provided generally good results at a fraction of the computational costs of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode which Brieman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Brieman describes. Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing. At some point in the future support may be added for local imputing of missing values during tree growth as described in [3] [1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 [2] https://code.google.com/p/rf-ace/ [3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record In CloudForest data is stored using the FeatureMatrix struct which contains Features. The Feature struct implements storage and methods for both categorical and numerical data and calculations of impurity etc and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow implements Brieman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method including sampling cases but it may be faster to grow the forest to disk as in the growforest utility. Prediction and Voting is done using Tree.Vote and CatBallotBox and NumBallotBox which implement the VoteTallyer interface.
Benchwrap automates running and analysing Go benchmarks. Usage: Benchwrap runs a set of benchmarks n times for one or more git revisions. It feeds the collected benchmark results to `benchstat`, which in turn analyzes the data and prints it to stdout. Each input rev must be a valid git commit or reference, e.g. a hash, tag or branch. Options to `go test` and `benchstat` can be given by using the appropriate flags. Options: Dependencies: In a git repository, run all `Foo` benchmarks 10 times each for git tag `v0.42`, commit `cdd48c8a` and branch master, and analyse results with benchstat:
Package pbc provides structures for building pairing-based cryptosystems. It is a wrapper around the Pairing-Based Cryptography (PBC) Library authored by Ben Lynn (https://crypto.stanford.edu/pbc/). This wrapper provides access to all PBC functions. It supports generation of various types of elliptic curves and pairings, element initialization, I/O, and arithmetic. These features can be used to quickly build pairing-based or conventional cryptosystems. The PBC library is designed to be extremely fast. Internally, it uses GMP for arbitrary-precision arithmetic. It also includes a wide variety of optimizations that make pairing-based cryptography highly efficient. To improve performance, PBC does not perform type checking to ensure that operations actually make sense. The Go wrapper provides the ability to add compatibility checks to most operations, or to use unchecked elements to maximize performance. Since this library provides low-level access to pairing primitives, it is very easy to accidentally construct insecure systems. This library is intended to be used by cryptographers or to implement well-analyzed cryptosystems. Cryptographic pairings are defined over three mathematical groups: G1, G2, and GT, where each group is typically of the same order r. Additionally, a bilinear map e maps a pair of elements — one from G1 and another from G2 — to an element in GT. This map e has the following additional property: If G1 == G2, then a pairing is said to be symmetric. Otherwise, it is asymmetric. Pairings can be used to construct a variety of efficient cryptosystems. The PBC library currently supports 5 different types of pairings, each with configurable parameters. These types are designated alphabetically, roughly in chronological order of introduction. Type A, D, E, F, and G pairings are implemented in the library. Each type has different time and space requirements. For more information about the types, see the documentation for the corresponding generator calls, or the PBC manual page at https://crypto.stanford.edu/pbc/manual/ch05s01.html. This package must be compiled using cgo. It also requires the installation of GMP and PBC. During the build process, this package will attempt to include <gmp.h> and <pbc/pbc.h>, and then dynamically link to GMP and PBC. Most systems include a package for GMP. To install GMP in Debian / Ubuntu: For an RPM installation with YUM: For installation with Fink (http://www.finkproject.org/) on Mac OS X: For more information or to compile from source, visit https://gmplib.org/ To install the PBC library, download the appropriate files for your system from https://crypto.stanford.edu/pbc/download.html. PBC has three dependencies: the gcc compiler, flex (http://flex.sourceforge.net/), and bison (https://www.gnu.org/software/bison/). See the respective sites for installation instructions. Most distributions include packages for these libraries. For example, in Debian / Ubuntu: The PBC source can be compiled and installed using the usual GNU Build System: After installing, you may need to rebuild the search path for libraries: It is possible to install the package on Windows through the use of MinGW and MSYS. MSYS is required for installing PBC, while GMP can be installed through a package. Based on your MinGW installation, you may need to add "-I/usr/local/include" to CPPFLAGS and "-L/usr/local/lib" to LDFLAGS when building PBC. Likewise, you may need to add these options to CGO_CPPFLAGS and CGO_LDFLAGS when installing this package. This package is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. For additional details, see the COPYING and COPYING.LESSER files. This example generates a pairing and some random group elements, then applies the pairing operation. This example computes and verifies a Boneh-Lynn-Shacham signature in a simulated conversation between Alice and Bob.
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include: Breiman and Cutler's Random Forest for Classification and Regression Adaptive Boosting (AdaBoost) Classification Gradiant Boosting Tree Regression Entropy and Cost driven classification L1 regression Feature selection with artificial contrasts Proximity and model structure analysis Roughly balanced bagging for unbalanced classification The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and can not yet be released. Library Documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/ryanbressler/CloudForest Documentation of command line utilities and file formats can be found in README.md, which can be viewed fromated on github: http://github.com/ryanbressler/CloudForest Pull requests and bug reports are welcome. CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi threaded and cluster environments in mind. Go's support for function types is used to provide a interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method: This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects. Decision tree's are grown with the goal of reducing "Impurity" which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface which allows for alternative definitions of impurity. CloudForest includes several alternative targets: Additional targets can be stacked on top of these target to add boosting functionality: Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like: can return left and right slices that point to the same underlying array as the original slice of cases but these slices should not have their values changed. Functions used while searching for the best split also accepts pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like: For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit which relies on go's sorting package. Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate go routines. The growforest utility provides an example of this that uses go routines and channels to grow trees in parallel and write trees to disk as the are finished by the "worker" go routines. The few summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, distributed database etc. By default cloud forest uses a fast heuristic for missing values. When proposing a split on a feature with missing data the missing cases are removed and the impurity value is corrected to use three way impurity which reduces the bias towards features with lots of missing data: Missing values in the target variable are left out of impurity calculations. This provided generally good results at a fraction of the computational costs of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode which Brieman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Brieman describes. Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing. At some point in the future support may be added for local imputing of missing values during tree growth as described in [3] [1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 [2] https://code.google.com/p/rf-ace/ [3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record In CloudForest data is stored using the FeatureMatrix struct which contains Features. The Feature struct implements storage and methods for both categorical and numerical data and calculations of impurity etc and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow implements Brieman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method including sampling cases but it may be faster to grow the forest to disk as in the growforest utility. Prediction and Voting is done using Tree.Vote and CatBallotBox and NumBallotBox which implement the VoteTallyer interface.
A dynamic and extensible music library organizer Demlo is a music library organizer. It can encode, fix case, change folder hierarchy according to tags or file properties, tag from an online database, copy covers while ignoring duplicates or those below a quality threshold, and much more. It makes it possible to manage your libraries uniformly and dynamically. You can write your own rules to fit your needs best. Demlo aims at being as lightweight and portable as possible. Its major runtime dependency is the transcoder FFmpeg. The scripts are written in Lua for portability and speed while allowing virtually unlimited extensibility. Usage: For usage options, see: First Demlo creates a list of all input files. When a folder is specified, all files matching the extensions from the 'extensions' variable will be appended to the list. Identical files are appended only once. Next all files get analyzed: - The audio file details (tags, stream properties, format properties, etc.) are stored into the 'input' variable. The 'output' variable gets its default values from 'input', or from an index file if specified from command-line. If no index has been specified and if an attached cuesheet is found, all cuesheet details are appended accordingly. Cuesheet tags override stream tags, which override format tags. Finally, still without index, tags can be retrieved from Internet if the command-line option is set. - If a prescript has been specified, it gets executed. It makes it possible to adjust the input values and global variables before running the other scripts. - The scripts, if any, get executed in the lexicographic order of their basename. The 'output' variable is transformed accordingly. Scripts may contain rules such as defining a new file name, new tags, new encoding properties, etc. You can use conditions on input values to set the output properties, which makes it virtually possible to process a full music library in one single run. - If a postscript has been specified, it gets executed. It makes it possible to adjust the output of the script for the current run only. - Demlo makes some last-minute tweaking if need be: it adjusts the bitrate, the path, the encoding parameters, and so on. - A preview of changes is displayed. - When applying changes, the covers get copied if required and the audio file gets processed: tags are modified as specified, the file is re-encoded if required, and the output is written to the appropriate folder. When destination already exists, the 'exist' action is executed. The program's default behaviour can be changed from the user configuration file. (See the 'Files' section for a template.) Most command-line flags default value can be changed. The configuration file is loaded on startup, before parsing the command-line options. Review the default value of the CLI flags with 'demlo -h'. If you wish to use no configuration file, set the environment variable DEMLORC to ".". Scripts can contain any safe Lua code. Some functions like 'os.execute' are not available for security reasons. It is not possible to print to the standard output/error unless running in debug mode and using the 'debug' function. See the 'sandbox.go' file for a list of allowed functions and variables. Lua patterns are replaced by Go regexps. See https://github.com/google/re2/wiki/Syntax. Scripts have no requirements at all. However, to be useful, they should set values of the 'output' table detailed in the 'Variables' section. You can use the full power of the Lua to set the variables dynamically. For instance: 'input' and 'output' are both accessible from any script. All default functions and variables (excluding 'output') are reset on every script call to enforce consistency. Local variables are lost from one script call to another. Global variables are preserved. Use this feature to pass data like options or new functions. 'output' structure consistency is guaranteed at the start of every script. Demlo will only extract the fields with the right type as described in the 'Variables' section. Warning: Do not abuse of global variables, especially when processing non-fixed size data (e.g. tables). Data could grow big and slow down the program. By default, when the destination exists, Demlo will append a suffix to the output destination. This behaviour can be changed from the 'exist' action specified by the user. Demlo comes with a few default actions. The 'exist' action works just like scripts with the following differences: - Any change to 'output.path' will be skipped. - An additional variable is accessible from the action: 'existinfo' holds the file details of the existing files in the same fashion as 'input'. This allows for comparing the input file and the existing destination. The writing rules can be tweaked the following way: Word of caution: overwriting breaks Demlo's rule of not altering existing files. It can lead to undesired results if the overwritten file is also part of the (yet to be processed) input. The overwrite capability can be useful when syncing music libraries however. The user scripts should be generic. Therefore they may not properly handle some uncommon input values. Tweak the input with temporary overrides from command-line. The prescript and postscript defined on command-line will let you run arbitrary code that is run before and after all other scripts, respectively. Use global variables to transfer data and parameters along. If the prescript and postscript end up being too long, consider writing a demlo script. You can also define shell aliases or use wrapper scripts as convenience. The 'input' table describes the file: Bitrate is in bits per seconds (bps). That is, for 320 kbps you would specify The 'time' is the modification time of the file. It holds the sec seconds and nsec nanoseconds since January 1, 1970 UTC. The entry 'streams' and 'format' are as returned by It gives access to most metadata that FFmpeg can return. For instance, to get the duration of the track in seconds, query the variable 'input.format.duration'. Since there may be more than one stream (covers, other data), the first audio stream is assumed to be the music stream. For convenience, the index of the music stream is stored in 'audioindex'. The tags returned by FFmpeg are found in streams, format and in the cuesheet. To make tag queries easier, all tags are stored in the 'tags' table, with the following precedence: You can remove a tag by setting it to 'nil' or the empty string. This is equivalent, except that 'nil' saves some memory during the process. The 'output' table describes the transformation to apply to the file: The 'parameters' array holds the CLI parameters passed to FFmpeg. It can be anything supported by FFmpeg, although this variable is supposed to hold encoding information. See the 'Examples' section. The 'embeddedcovers', 'externalcovers' and 'onlinecover' variables are detailed in the 'Covers' section. The 'write' variable is covered in the 'Existing destination' section. The 'rmsrc' variable is a boolean: when true, Demlo removes the source file after processing. This can speed up the process when not re-encoding. This option is ignored for multi-track files. For convenience, the following shortcuts are provided: Demlo provides some non-standard Lua functions to ease scripting. Display a message on stderr if debug mode is on. Return lowercase string without non-alphanumeric characters nor leading zeros. Return the relation coefficient of the two input strings. The result is a float in 0.0...1.0, 0.0 means no relation at all, 1.0 means identical strings. A format is a container in FFmpeg's terminology. 'output.parameters' contains CLI flags passed to FFmpeg. They are meant to set the stream codec, the bitrate, etc. If 'output.parameters' is {'-c:a', 'copy'} and the format is identical, then taglib will be used instead of FFmpeg. Use this rule from a (post)script to disable encoding by setting the same format and the copy parameters. This speeds up the process. The official scripts are usually very smart at guessing the right values. They might make mistakes however. If you are unsure, you can (and you are advised to) preview the results before proceeding. The 'diff' preview is printed to stderr. A JSON preview of the changes is printed to stdout if stdout is redirected. The initial values of the 'output' table can be completed with tags fetched from the MusicBrainz database. Audio files are fingerprinted for the queries, so even with initially wrong file names and tags, the right values should still be retrieved. The front album cover can also be retrieved. Proxy parameters will be fetched automatically from the 'http_proxy' and 'https_proxy' environment variables. As this process requires network access it can be quite slow. Nevertheless, Demlo is specifically optimized for albums, so that network queries are used for only one track per album, when possible. Some tracks can be released on different albums: Demlo tries to guess it from the tags, but if the tags are wrong there is no way to know which one it is. There is a case where the selection can be controlled: let's assume we have tracks A, B and C from the same album Z. A and B were also released in album Y, whereas C was release in Z only. Tags for A will be checked online; let's assume it gets tagged to album Y. B will use A details, so album Y too. Then C does not match neither A's nor B's album, so another online query will be made and it will be tagged to album Z. This is slow and does not yield the expected result. Now let's call Tags for C will be queried online, and C will be tagged to Z. Then both A and B will match album Z so they will be tagged using C details, which is the desired result. Conclusion: when using online tagging, the first argument should be the lesser known track of the album. Demlo can set the output variables according to the values set in a text file before calling the script. The input values are ignored as well as online tagging, but it is still possible to access the input table from scripts. This 'index' file is formatted in JSON. It corresponds to what Demlo outputs when printing the JSON preview. This is valid JSON except for the missing beginning and the missing end. It makes it possible to concatenate and to append to existing index files. Demlo will automatically complete the missing parts so that it becomes valid JSON. The index file is useful when you want to edit tags manually: You can redirect the output to a file, edit the content manually with your favorite text editor, then run Demlo again with the index as argument. See the 'Examples' section. This feature can also be used to interface Demlo with other programs. Demlo can manage embedded covers as well as external covers. External covers are queried from files matching known extensions in the file's folder. Embedded covers are queried from static video streams in the file. Covers are accessed from The embedded covers are indexed numerically by order of appearance in the streams. The first cover will be at index 1 and so on. This is not necessarily the index of the stream. 'inputcover' is the following structure: 'format' is the picture format. FFmpeg makes a distinction between format and codec, but it is not useful for covers. The name of the format is specified by Demlo, not by FFmpeg. Hence the 'jpeg' name, instead of 'mjpeg' as FFmpeg puts it. 'width' and 'height' hold the size in pixels. 'checksum' can be used to identify files uniquely. For performance reasons, only a partial checksum is performed. This variable is typically used for skipping duplicates. Cover transformations are specified in 'outputcover' has the following structure: The format is specified by FFmpeg this time. See the comments on 'format' for 'inputcover'. 'parameters' is used in the same fashion as 'output.parameters'. User configuration: This must be a Lua file. See the 'demlorc' file provided with this package for an exhaustive list of options. Folder containing the official scripts: User script folder: Create this folder and add your own scripts inside. This folder takes precedence over the system folder, so scripts with the same name will be found in the user folder first. The following examples will not proceed unless the '-p' command-line option is true. Important: you _must_ use single quotes for the runtime Lua command to prevent expansion. Inside the Lua code, use double quotes for strings and escape single quotes. Show default options: Preview changes made by the default scripts: Use 'alternate' script if found in user or system script folder (user folder first): Add the Lua file to the list of scripts. This feature is convenient if you want to write scripts that are too complex to fit on the command-line, but not generic enough to fit the user or system script folders. Remove all script from the list, then add '30-case' and '60-path' scripts. Note that '30-case' will be run before '60-path'. Do not use any script but '60-path'. The file content is unchanged and the file is renamed to a dynamically computed destination. Demlo performs an instant rename if destination is on the same device. Otherwise it copies the file and removes the source. Use the default scripts (if set in configuration file), but do not re-encode: Set 'artist' to the value of 'composer', and 'title' to be preceded by the new value of 'artist', then apply the default script. Do not re-encode. Order in runtime script matters. Mind the double quotes. Set track number to first number in input file name: Use the default scripts but keep original value for the 'artist' tag: 1) Preview default scripts transformation and save it to an index. 2) Edit file to fix any potential mistake. 3) Run Demlo over the same files using the index information only. Same as above but generate output filename according to the custom '61-rename' script. The numeric prefix is important: it ensures that '61-rename' will be run after all the default tag related scripts and after '60-path'. Otherwise, if a change in tags would occur later on, it would not affect the renaming script. Retrieve tags from Internet: Same as above but for a whole album, and saving the result to an index: Only download the cover for the album corresponding to the track. Use 'rmsrc' to avoid duplicating the audio file. Change tags inplace with entries from MusicBrainz: Set tags to titlecase while casing AC-DC correctly: To easily switch between formats from command-line, create one script per format (see 50-encoding.lua), e.g. ogg.lua and flac.lua. Then Add support for non-default formats from CLI: Overwrite existing destination if input is newer: ffmpeg(1), ffprobe(1), http://www.lua.org/pil/contents.html
Package pbc provides structures for building pairing-based cryptosystems. It is a wrapper around the Pairing-Based Cryptography (PBC) Library authored by Ben Lynn (https://crypto.stanford.edu/pbc/). This wrapper provides access to all PBC functions. It supports generation of various types of elliptic curves and pairings, element initialization, I/O, and arithmetic. These features can be used to quickly build pairing-based or conventional cryptosystems. The PBC library is designed to be extremely fast. Internally, it uses GMP for arbitrary-precision arithmetic. It also includes a wide variety of optimizations that make pairing-based cryptography highly efficient. To improve performance, PBC does not perform type checking to ensure that operations actually make sense. The Go wrapper provides the ability to add compatibility checks to most operations, or to use unchecked elements to maximize performance. Since this library provides low-level access to pairing primitives, it is very easy to accidentally construct insecure systems. This library is intended to be used by cryptographers or to implement well-analyzed cryptosystems. Cryptographic pairings are defined over three mathematical groups: G1, G2, and GT, where each group is typically of the same order r. Additionally, a bilinear map e maps a pair of elements — one from G1 and another from G2 — to an element in GT. This map e has the following additional property: If G1 == G2, then a pairing is said to be symmetric. Otherwise, it is asymmetric. Pairings can be used to construct a variety of efficient cryptosystems. The PBC library currently supports 5 different types of pairings, each with configurable parameters. These types are designated alphabetically, roughly in chronological order of introduction. Type A, D, E, F, and G pairings are implemented in the library. Each type has different time and space requirements. For more information about the types, see the documentation for the corresponding generator calls, or the PBC manual page at https://crypto.stanford.edu/pbc/manual/ch05s01.html. This package must be compiled using cgo. It also requires the installation of GMP and PBC. During the build process, this package will attempt to include <gmp.h> and <pbc/pbc.h>, and then dynamically link to GMP and PBC. Most systems include a package for GMP. To install GMP in Debian / Ubuntu: For an RPM installation with YUM: For installation with Fink (http://www.finkproject.org/) on Mac OS X: For more information or to compile from source, visit https://gmplib.org/ To install the PBC library, download the appropriate files for your system from https://crypto.stanford.edu/pbc/download.html. PBC has three dependencies: the gcc compiler, flex (http://flex.sourceforge.net/), and bison (https://www.gnu.org/software/bison/). See the respective sites for installation instructions. Most distributions include packages for these libraries. For example, in Debian / Ubuntu: The PBC source can be compiled and installed using the usual GNU Build System: After installing, you may need to rebuild the search path for libraries: It is possible to install the package on Windows through the use of MinGW and MSYS. MSYS is required for installing PBC, while GMP can be installed through a package. Based on your MinGW installation, you may need to add "-I/usr/local/include" to CPPFLAGS and "-L/usr/local/lib" to LDFLAGS when building PBC. Likewise, you may need to add these options to CGO_CPPFLAGS and CGO_LDFLAGS when installing this package. This package is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. For additional details, see the COPYING and COPYING.LESSER files. This example generates a pairing and some random group elements, then applies the pairing operation. This example computes and verifies a Boneh-Lynn-Shacham signature in a simulated conversation between Alice and Bob.
Package grpcreplay supports the capture and replay of gRPC calls. Its main goal is to improve testing. Once you capture the calls of a test that runs against a real service, you have an "automatic mock" that can be replayed against the same test, yielding a unit test that is fast and flake-free. To record a sequence of gRPC calls to a file, create a Recorder and pass its DialOptions to grpc.Dial: It is essential to close the Recorder when the interaction is finished. There is also a NewRecorderWriter function for capturing to an arbitrary io.Writer. To replay a captured file, create a Replayer and ask it for a (fake) connection. We don't actually have to dial a server. (Since we're reading the file and not writing it, we don't have to be as careful about the error returned from Close). A test might use random or time-sensitive values, for instance to create unique resources for isolation from other tests. The test therefore has initial values, such as the current time, or a random seed, that differ from run to run. You must record this initial state and re-establish it on replay. To record the initial state, serialize it into a []byte and pass it as the second argument to NewRecorder: On replay, get the bytes from Replayer.Initial: Recorders and replayers have support for running callbacks before messages are written to or read from the replay file. A Recorder has a BeforeFunc that can modify a request or response before it is written to the replay file. The actual RPCs sent to the service during recording remain unaltered; only what is saved in the replay file can be changed. A Replayer has a BeforeFunc that can modify a request before it is sent for matching. Example uses for these callbacks include customized logging, or scrubbing data before RPCs are written to the replay file. If requests are modified by the callbacks during recording, it is important to perform the same modifications to the requests when replaying, or RPC matching on replay will fail. A common way to analyze and modify the various messages is to use a type switch. A nondeterministic program may invoke RPCs in a different order each time it is run. The order in which RPCs are called during recording may differ from the order during replay. The replayer matches incoming to recorded requests by method name and request contents, so nondeterminism is only a concern for identical requests that result in different responses. A nondeterministic program whose behavior differs depending on the order of such RPCs probably has a race condition: since both the recorded sequence of RPCs and the sequence during replay are valid orderings, the program should behave the same under both. The same is not true of streaming RPCs. The replayer matches streams only by method name, since it has no other information at the time the stream is opened. Two streams with the same method name that are started concurrently may replay in the wrong order. Besides the differences in replay mentioned above, other differences may cause issues for some programs. We list them here. The Replayer delivers a response to an RPC immediately, without waiting for other incoming RPCs. This can violate causality. For example, in a Pub/Sub program where one goroutine publishes and another subscribes, during replay the Subscribe call may finish before the Publish call begins. For streaming RPCs, the Replayer delivers the result of Send and Recv calls in the order they were recorded. No attempt is made to match message contents. At present, this package does not record or replay stream headers and trailers, or the result of the CloseSend method.
Package pbc provides structures for building pairing-based cryptosystems. It is a wrapper around the Pairing-Based Cryptography (PBC) Library authored by Ben Lynn (https://crypto.stanford.edu/pbc/). This wrapper provides access to all PBC functions. It supports generation of various types of elliptic curves and pairings, element initialization, I/O, and arithmetic. These features can be used to quickly build pairing-based or conventional cryptosystems. The PBC library is designed to be extremely fast. Internally, it uses GMP for arbitrary-precision arithmetic. It also includes a wide variety of optimizations that make pairing-based cryptography highly efficient. To improve performance, PBC does not perform type checking to ensure that operations actually make sense. The Go wrapper provides the ability to add compatibility checks to most operations, or to use unchecked elements to maximize performance. Since this library provides low-level access to pairing primitives, it is very easy to accidentally construct insecure systems. This library is intended to be used by cryptographers or to implement well-analyzed cryptosystems. Cryptographic pairings are defined over three mathematical groups: G1, G2, and GT, where each group is typically of the same order r. Additionally, a bilinear map e maps a pair of elements — one from G1 and another from G2 — to an element in GT. This map e has the following additional property: If G1 == G2, then a pairing is said to be symmetric. Otherwise, it is asymmetric. Pairings can be used to construct a variety of efficient cryptosystems. The PBC library currently supports 5 different types of pairings, each with configurable parameters. These types are designated alphabetically, roughly in chronological order of introduction. Type A, D, E, F, and G pairings are implemented in the library. Each type has different time and space requirements. For more information about the types, see the documentation for the corresponding generator calls, or the PBC manual page at https://crypto.stanford.edu/pbc/manual/ch05s01.html. This package must be compiled using cgo. It also requires the installation of GMP and PBC. During the build process, this package will attempt to include <gmp.h> and <pbc/pbc.h>, and then dynamically link to GMP and PBC. Most systems include a package for GMP. To install GMP in Debian / Ubuntu: For an RPM installation with YUM: For installation with Fink (http://www.finkproject.org/) on Mac OS X: For more information or to compile from source, visit https://gmplib.org/ To install the PBC library, download the appropriate files for your system from https://crypto.stanford.edu/pbc/download.html. PBC has three dependencies: the gcc compiler, flex (http://flex.sourceforge.net/), and bison (https://www.gnu.org/software/bison/). See the respective sites for installation instructions. Most distributions include packages for these libraries. For example, in Debian / Ubuntu: The PBC source can be compiled and installed using the usual GNU Build System: After installing, you may need to rebuild the search path for libraries: It is possible to install the package on Windows through the use of MinGW and MSYS. MSYS is required for installing PBC, while GMP can be installed through a package. Based on your MinGW installation, you may need to add "-I/usr/local/include" to CPPFLAGS and "-L/usr/local/lib" to LDFLAGS when building PBC. Likewise, you may need to add these options to CGO_CPPFLAGS and CGO_LDFLAGS when installing this package. This package is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. For additional details, see the COPYING and COPYING.LESSER files. This example generates a pairing and some random group elements, then applies the pairing operation. This example computes and verifies a Boneh-Lynn-Shacham signature in a simulated conversation between Alice and Bob.
Package grpcreplay supports the capture and replay of gRPC calls. Its main goal is to improve testing. Once you capture the calls of a test that runs against a real service, you have an "automatic mock" that can be replayed against the same test, yielding a unit test that is fast and flake-free. To record a sequence of gRPC calls to a file, create a Recorder and pass its DialOptions to grpc.Dial: It is essential to close the Recorder when the interaction is finished. There is also a NewRecorderWriter function for capturing to an arbitrary io.Writer. To replay a captured file, create a Replayer and ask it for a (fake) connection. We don't actually have to dial a server. (Since we're reading the file and not writing it, we don't have to be as careful about the error returned from Close). A test might use random or time-sensitive values, for instance to create unique resources for isolation from other tests. The test therefore has initial values, such as the current time, or a random seed, that differ from run to run. You must record this initial state and re-establish it on replay. To record the initial state, serialize it into a []byte and pass it as the second argument to NewRecorder: On replay, get the bytes from Replayer.Initial: Recorders and replayers have support for running callbacks before messages are written to or read from the replay file. A Recorder has a BeforeFunc that can modify a request or response before it is written to the replay file. The actual RPCs sent to the service during recording remain unaltered; only what is saved in the replay file can be changed. A Replayer has a BeforeFunc that can modify a request before it is sent for matching. Example uses for these callbacks include customized logging, or scrubbing data before RPCs are written to the replay file. If requests are modified by the callbacks during recording, it is important to perform the same modifications to the requests when replaying, or RPC matching on replay will fail. A common way to analyze and modify the various messages is to use a type switch. A nondeterministic program may invoke RPCs in a different order each time it is run. The order in which RPCs are called during recording may differ from the order during replay. The replayer matches incoming to recorded requests by method name and request contents, so nondeterminism is only a concern for identical requests that result in different responses. A nondeterministic program whose behavior differs depending on the order of such RPCs probably has a race condition: since both the recorded sequence of RPCs and the sequence during replay are valid orderings, the program should behave the same under both. The same is not true of streaming RPCs. The replayer matches streams only by method name, since it has no other information at the time the stream is opened. Two streams with the same method name that are started concurrently may replay in the wrong order. Besides the differences in replay mentioned above, other differences may cause issues for some programs. We list them here. The Replayer delivers a response to an RPC immediately, without waiting for other incoming RPCs. This can violate causality. For example, in a Pub/Sub program where one goroutine publishes and another subscribes, during replay the Subscribe call may finish before the Publish call begins. For streaming RPCs, the Replayer delivers the result of Send and Recv calls in the order they were recorded. No attempt is made to match message contents. At present, this package does not record or replay stream headers and trailers, or the result of the CloseSend method.
Package grpcreplay supports the capture and replay of gRPC calls. Its main goal is to improve testing. Once you capture the calls of a test that runs against a real service, you have an "automatic mock" that can be replayed against the same test, yielding a unit test that is fast and flake-free. To record a sequence of gRPC calls to a file, create a Recorder and pass its DialOptions to grpc.Dial: It is essential to close the Recorder when the interaction is finished. There is also a NewRecorderWriter function for capturing to an arbitrary io.Writer. To replay a captured file, create a Replayer and ask it for a (fake) connection. We don't actually have to dial a server. (Since we're reading the file and not writing it, we don't have to be as careful about the error returned from Close). A test might use random or time-sensitive values, for instance to create unique resources for isolation from other tests. The test therefore has initial values, such as the current time, or a random seed, that differ from run to run. You must record this initial state and re-establish it on replay. To record the initial state, serialize it into a []byte and pass it as the second argument to NewRecorder: On replay, get the bytes from Replayer.Initial: Recorders and replayers have support for running callbacks before messages are written to or read from the replay file. A Recorder has a BeforeFunc that can modify a request or response before it is written to the replay file. The actual RPCs sent to the service during recording remain unaltered; only what is saved in the replay file can be changed. A Replayer has a BeforeFunc that can modify a request before it is sent for matching. Example uses for these callbacks include customized logging, or scrubbing data before RPCs are written to the replay file. If requests are modified by the callbacks during recording, it is important to perform the same modifications to the requests when replaying, or RPC matching on replay will fail. A common way to analyze and modify the various messages is to use a type switch. A nondeterministic program may invoke RPCs in a different order each time it is run. The order in which RPCs are called during recording may differ from the order during replay. The replayer matches incoming to recorded requests by method name and request contents, so nondeterminism is only a concern for identical requests that result in different responses. A nondeterministic program whose behavior differs depending on the order of such RPCs probably has a race condition: since both the recorded sequence of RPCs and the sequence during replay are valid orderings, the program should behave the same under both. The same is not true of streaming RPCs. The replayer matches streams only by method name, since it has no other information at the time the stream is opened. Two streams with the same method name that are started concurrently may replay in the wrong order. Besides the differences in replay mentioned above, other differences may cause issues for some programs. We list them here. The Replayer delivers a response to an RPC immediately, without waiting for other incoming RPCs. This can violate causality. For example, in a Pub/Sub program where one goroutine publishes and another subscribes, during replay the Subscribe call may finish before the Publish call begins. For streaming RPCs, the Replayer delivers the result of Send and Recv calls in the order they were recorded. No attempt is made to match message contents. At present, this package does not record or replay stream headers and trailers, or the result of the CloseSend method.
go-bliss are Go bindings to bliss, a small library that can analyze songs and compute the distance between two songs. It can be useful for creating "intelligent" playlists of songs that are similar. The main algorithm works by outputting a vector V = (a,b,c,d) for each song. The euclidean distance between these vectors corresponds to the actual distance felt by listening to them: a playlist can then be built by queuing close songs, or do other stuff using mathematical tools available on euclidean 4-dimensionnal spaces. go-bliss depends on bliss (compile-time and runtime), which itself depends on some libav* libraries (compile-time and runtime). Check the project README for detailed setup instructions. Check the project README for links to technical details about the analysis process. go-bliss can compute: - for each song, a force (float) for the overall song intensity, corresponding to a force rating (ForceRating), either calm, loud, or unknown - for each song, a vector containg 4 floats (ForceVector), each rating an aspect of the song: tempo is the BPM of the song, amplitude is the physical force of the song, frequency is the ratio between high and low frequencies, attack is a sum of intensity of all the attacks divided by the song length - for two songs, the distance between them (float) (the euclidian distance between their force vectors), or their cosine similarity There is no global library setup or teardown method. go-bliss can decode an audio file with Decode into a struct, Song, that contains metadata about the file and its samples. Song also contains fields corresponding to the song analysis result, that is not filled by this call. A Song owns internal memory that can be freed explicitly by using Close when done using it, or automatically by the GC after some time (with runtime.SetFinalizer). go-bliss can analyze an audio file with Analyze into a Song, like Decode, except it also analyzes the song and fills its analysis-related fields. go-bliss can compute the distance between two songs (either audio files with DistanceFile, or files already decoded to songs with Distance), or their cosine similarity (with CosineSimilarity and CosineSimilarityFile). go-bliss can also compute a specific value of a song rather than all of them with EnvelopeSort, AmplitudeSort, and FrequencySort. go-bliss also has Mean, Variance, and RectangularFilter helpers, as well as a Version function. As far as I know bliss does not store global state, so processing two different songs concurrently should be fine.
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include: Breiman and Cutler's Random Forest for Classification and Regression Adaptive Boosting (AdaBoost) Classification Gradiant Boosting Tree Regression Entropy and Cost driven classification L1 regression Feature selection with artificial contrasts Proximity and model structure analysis Roughly balanced bagging for unbalanced classification The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and can not yet be released. Library Documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/ryanbressler/CloudForest Documentation of command line utilities and file formats can be found in README.md, which can be viewed fromated on github: http://github.com/ryanbressler/CloudForest Pull requests and bug reports are welcome. CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi threaded and cluster environments in mind. Go's support for function types is used to provide a interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method: This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects. Decision tree's are grown with the goal of reducing "Impurity" which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface which allows for alternative definitions of impurity. CloudForest includes several alternative targets: Additional targets can be stacked on top of these target to add boosting functionality: Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like: can return left and right slices that point to the same underlying array as the original slice of cases but these slices should not have their values changed. Functions used while searching for the best split also accepts pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like: For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit which relies on go's sorting package. Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate go routines. The growforest utility provides an example of this that uses go routines and channels to grow trees in parallel and write trees to disk as the are finished by the "worker" go routines. The few summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, distributed database etc. By default cloud forest uses a fast heuristic for missing values. When proposing a split on a feature with missing data the missing cases are removed and the impurity value is corrected to use three way impurity which reduces the bias towards features with lots of missing data: Missing values in the target variable are left out of impurity calculations. This provided generally good results at a fraction of the computational costs of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode which Brieman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Brieman describes. Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing. At some point in the future support may be added for local imputing of missing values during tree growth as described in [3] [1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 [2] https://code.google.com/p/rf-ace/ [3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record In CloudForest data is stored using the FeatureMatrix struct which contains Features. The Feature struct implements storage and methods for both categorical and numerical data and calculations of impurity etc and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow implements Brieman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method including sampling cases but it may be faster to grow the forest to disk as in the growforest utility. Prediction and Voting is done using Tree.Vote and CatBallotBox and NumBallotBox which implement the VoteTallyer interface.
A dynamic and extensible music library organizer Demlo is a music library organizer. It can encode, fix case, change folder hierarchy according to tags or file properties, tag from an online database, copy covers while ignoring duplicates or those below a quality threshold, and much more. It makes it possible to manage your libraries uniformly and dynamically. You can write your own rules to fit your needs best. Demlo aims at being as lightweight and portable as possible. Its major runtime dependency is the transcoder FFmpeg. The scripts are written in Lua for portability and speed while allowing virtually unlimited extensibility. Usage: For usage options, see: First Demlo creates a list of all input files. When a folder is specified, all files matching the extensions from the 'extensions' variable will be appended to the list. Identical files are appended only once. Next all files get analyzed: - The audio file details (tags, stream properties, format properties, etc.) are stored into the 'input' variable. The 'output' variable gets its default values from 'input', or from an index file if specified from command-line. If no index has been specified and if an attached cuesheet is found, all cuesheet details are appended accordingly. Cuesheet tags override stream tags, which override format tags. Finally, still without index, tags can be retrieved from Internet if the command-line option is set. - If a prescript has been specified, it gets executed. It makes it possible to adjust the input values and global variables before running the other scripts. - The scripts, if any, get executed in the lexicographic order of their basename. The 'output' variable is transformed accordingly. Scripts may contain rules such as defining a new file name, new tags, new encoding properties, etc. You can use conditions on input values to set the output properties, which makes it virtually possible to process a full music library in one single run. - If a postscript has been specified, it gets executed. It makes it possible to adjust the output of the script for the current run only. - Demlo makes some last-minute tweaking if need be: it adjusts the bitrate, the path, the encoding parameters, and so on. - A preview of changes is displayed. - When applying changes, the covers get copied if required and the audio file gets processed: tags are modified as specified, the file is re-encoded if required, and the output is written to the appropriate folder. When destination already exists, the 'exist' action is executed. The program's default behaviour can be changed from the user configuration file. (See the 'Files' section for a template.) Most command-line flags default value can be changed. The configuration file is loaded on startup, before parsing the command-line options. Review the default value of the CLI flags with 'demlo -h'. If you wish to use no configuration file, set the environment variable DEMLORC to ".". Scripts can contain any safe Lua code. Some functions like 'os.execute' are not available for security reasons. It is not possible to print to the standard output/error unless running in debug mode and using the 'debug' function. See the 'sandbox.go' file for a list of allowed functions and variables. Lua patterns are replaced by Go regexps. See https://github.com/google/re2/wiki/Syntax. Scripts have no requirements at all. However, to be useful, they should set values of the 'output' table detailed in the 'Variables' section. You can use the full power of the Lua to set the variables dynamically. For instance: 'input' and 'output' are both accessible from any script. All default functions and variables (excluding 'output') are reset on every script call to enforce consistency. Local variables are lost from one script call to another. Global variables are preserved. Use this feature to pass data like options or new functions. 'output' structure consistency is guaranteed at the start of every script. Demlo will only extract the fields with the right type as described in the 'Variables' section. Warning: Do not abuse of global variables, especially when processing non-fixed size data (e.g. tables). Data could grow big and slow down the program. By default, when the destination exists, Demlo will append a suffix to the output destination. This behaviour can be changed from the 'exist' action specified by the user. Demlo comes with a few default actions. The 'exist' action works just like scripts with the following differences: - Any change to 'output.path' will be skipped. - An additional variable is accessible from the action: 'existinfo' holds the file details of the existing files in the same fashion as 'input'. This allows for comparing the input file and the existing destination. The writing rules can be tweaked the following way: Word of caution: overwriting breaks Demlo's rule of not altering existing files. It can lead to undesired results if the overwritten file is also part of the (yet to be processed) input. The overwrite capability can be useful when syncing music libraries however. The user scripts should be generic. Therefore they may not properly handle some uncommon input values. Tweak the input with temporary overrides from command-line. The prescript and postscript defined on command-line will let you run arbitrary code that is run before and after all other scripts, respectively. Use global variables to transfer data and parameters along. If the prescript and postscript end up being too long, consider writing a demlo script. You can also define shell aliases or use wrapper scripts as convenience. The 'input' table describes the file: Bitrate is in bits per seconds (bps). That is, for 320 kbps you would specify The 'time' is the modification time of the file. It holds the sec seconds and nsec nanoseconds since January 1, 1970 UTC. The entry 'streams' and 'format' are as returned by It gives access to most metadata that FFmpeg can return. For instance, to get the duration of the track in seconds, query the variable 'input.format.duration'. Since there may be more than one stream (covers, other data), the first audio stream is assumed to be the music stream. For convenience, the index of the music stream is stored in 'audioindex'. The tags returned by FFmpeg are found in streams, format and in the cuesheet. To make tag queries easier, all tags are stored in the 'tags' table, with the following precedence: You can remove a tag by setting it to 'nil' or the empty string. This is equivalent, except that 'nil' saves some memory during the process. The 'output' table describes the transformation to apply to the file: The 'parameters' array holds the CLI parameters passed to FFmpeg. It can be anything supported by FFmpeg, although this variable is supposed to hold encoding information. See the 'Examples' section. The 'embeddedcovers', 'externalcovers' and 'onlinecover' variables are detailed in the 'Covers' section. The 'write' variable is covered in the 'Existing destination' section. The 'rmsrc' variable is a boolean: when true, Demlo removes the source file after processing. This can speed up the process when not re-encoding. This option is ignored for multi-track files. For convenience, the following shortcuts are provided: Demlo provides some non-standard Lua functions to ease scripting. Display a message on stderr if debug mode is on. Return lowercase string without non-alphanumeric characters nor leading zeros. Return the relation coefficient of the two input strings. The result is a float in 0.0...1.0, 0.0 means no relation at all, 1.0 means identical strings. A format is a container in FFmpeg's terminology. 'output.parameters' contains CLI flags passed to FFmpeg. They are meant to set the stream codec, the bitrate, etc. If 'output.parameters' is {'-c:a', 'copy'} and the format is identical, then taglib will be used instead of FFmpeg. Use this rule from a (post)script to disable encoding by setting the same format and the copy parameters. This speeds up the process. The official scripts are usually very smart at guessing the right values. They might make mistakes however. If you are unsure, you can (and you are advised to) preview the results before proceeding. The 'diff' preview is printed to stderr. A JSON preview of the changes is printed to stdout if stdout is redirected. The initial values of the 'output' table can be completed with tags fetched from the MusicBrainz database. Audio files are fingerprinted for the queries, so even with initially wrong file names and tags, the right values should still be retrieved. The front album cover can also be retrieved. Proxy parameters will be fetched automatically from the 'http_proxy' and 'https_proxy' environment variables. As this process requires network access it can be quite slow. Nevertheless, Demlo is specifically optimized for albums, so that network queries are used for only one track per album, when possible. Some tracks can be released on different albums: Demlo tries to guess it from the tags, but if the tags are wrong there is no way to know which one it is. There is a case where the selection can be controlled: let's assume we have tracks A, B and C from the same album Z. A and B were also released in album Y, whereas C was release in Z only. Tags for A will be checked online; let's assume it gets tagged to album Y. B will use A details, so album Y too. Then C does not match neither A's nor B's album, so another online query will be made and it will be tagged to album Z. This is slow and does not yield the expected result. Now let's call Tags for C will be queried online, and C will be tagged to Z. Then both A and B will match album Z so they will be tagged using C details, which is the desired result. Conclusion: when using online tagging, the first argument should be the lesser known track of the album. Demlo can set the output variables according to the values set in a text file before calling the script. The input values are ignored as well as online tagging, but it is still possible to access the input table from scripts. This 'index' file is formatted in JSON. It corresponds to what Demlo outputs when printing the JSON preview. This is valid JSON except for the missing beginning and the missing end. It makes it possible to concatenate and to append to existing index files. Demlo will automatically complete the missing parts so that it becomes valid JSON. The index file is useful when you want to edit tags manually: You can redirect the output to a file, edit the content manually with your favorite text editor, then run Demlo again with the index as argument. See the 'Examples' section. This feature can also be used to interface Demlo with other programs. Demlo can manage embedded covers as well as external covers. External covers are queried from files matching known extensions in the file's folder. Embedded covers are queried from static video streams in the file. Covers are accessed from The embedded covers are indexed numerically by order of appearance in the streams. The first cover will be at index 1 and so on. This is not necessarily the index of the stream. 'inputcover' is the following structure: 'format' is the picture format. FFmpeg makes a distinction between format and codec, but it is not useful for covers. The name of the format is specified by Demlo, not by FFmpeg. Hence the 'jpeg' name, instead of 'mjpeg' as FFmpeg puts it. 'width' and 'height' hold the size in pixels. 'checksum' can be used to identify files uniquely. For performance reasons, only a partial checksum is performed. This variable is typically used for skipping duplicates. Cover transformations are specified in 'outputcover' has the following structure: The format is specified by FFmpeg this time. See the comments on 'format' for 'inputcover'. 'parameters' is used in the same fashion as 'output.parameters'. User configuration: This must be a Lua file. See the 'demlorc' file provided with this package for an exhaustive list of options. Folder containing the official scripts: User script folder: Create this folder and add your own scripts inside. This folder takes precedence over the system folder, so scripts with the same name will be found in the user folder first. The following examples will not proceed unless the '-p' command-line option is true. Important: you _must_ use single quotes for the runtime Lua command to prevent expansion. Inside the Lua code, use double quotes for strings and escape single quotes. Show default options: Preview changes made by the default scripts: Use 'alternate' script if found in user or system script folder (user folder first): Add the Lua file to the list of scripts. This feature is convenient if you want to write scripts that are too complex to fit on the command-line, but not generic enough to fit the user or system script folders. Remove all script from the list, then add '30-case' and '60-path' scripts. Note that '30-case' will be run before '60-path'. Do not use any script but '60-path'. The file content is unchanged and the file is renamed to a dynamically computed destination. Demlo performs an instant rename if destination is on the same device. Otherwise it copies the file and removes the source. Use the default scripts (if set in configuration file), but do not re-encode: Set 'artist' to the value of 'composer', and 'title' to be preceded by the new value of 'artist', then apply the default script. Do not re-encode. Order in runtime script matters. Mind the double quotes. Set track number to first number in input file name: Use the default scripts but keep original value for the 'artist' tag: 1) Preview default scripts transformation and save it to an index. 2) Edit file to fix any potential mistake. 3) Run Demlo over the same files using the index information only. Same as above but generate output filename according to the custom '61-rename' script. The numeric prefix is important: it ensures that '61-rename' will be run after all the default tag related scripts and after '60-path'. Otherwise, if a change in tags would occur later on, it would not affect the renaming script. Retrieve tags from Internet: Same as above but for a whole album, and saving the result to an index: Only download the cover for the album corresponding to the track. Use 'rmsrc' to avoid duplicating the audio file. Change tags inplace with entries from MusicBrainz: Set tags to titlecase while casing AC-DC correctly: To easily switch between formats from command-line, create one script per format (see 50-encoding.lua), e.g. ogg.lua and flac.lua. Then Add support for non-default formats from CLI: Overwrite existing destination if input is newer: ffmpeg(1), ffprobe(1), http://www.lua.org/pil/contents.html
Package grpcreplay supports the capture and replay of gRPC calls. Its main goal is to improve testing. Once you capture the calls of a test that runs against a real service, you have an "automatic mock" that can be replayed against the same test, yielding a unit test that is fast and flake-free. To record a sequence of gRPC calls to a file, create a Recorder and pass its DialOptions to grpc.Dial: It is essential to close the Recorder when the interaction is finished. There is also a NewRecorderWriter function for capturing to an arbitrary io.Writer. To replay a captured file, create a Replayer and ask it for a (fake) connection. We don't actually have to dial a server. (Since we're reading the file and not writing it, we don't have to be as careful about the error returned from Close). A test might use random or time-sensitive values, for instance to create unique resources for isolation from other tests. The test therefore has initial values, such as the current time, or a random seed, that differ from run to run. You must record this initial state and re-establish it on replay. To record the initial state, serialize it into a []byte and pass it as the second argument to NewRecorder: On replay, get the bytes from Replayer.Initial: Recorders and replayers have support for running callbacks before messages are written to or read from the replay file. A Recorder has a BeforeFunc that can modify a request or response before it is written to the replay file. The actual RPCs sent to the service during recording remain unaltered; only what is saved in the replay file can be changed. A Replayer has a BeforeFunc that can modify a request before it is sent for matching. Example uses for these callbacks include customized logging, or scrubbing data before RPCs are written to the replay file. If requests are modified by the callbacks during recording, it is important to perform the same modifications to the requests when replaying, or RPC matching on replay will fail. A common way to analyze and modify the various messages is to use a type switch. A nondeterministic program may invoke RPCs in a different order each time it is run. The order in which RPCs are called during recording may differ from the order during replay. The replayer matches incoming to recorded requests by method name and request contents, so nondeterminism is only a concern for identical requests that result in different responses. A nondeterministic program whose behavior differs depending on the order of such RPCs probably has a race condition: since both the recorded sequence of RPCs and the sequence during replay are valid orderings, the program should behave the same under both. The same is not true of streaming RPCs. The replayer matches streams only by method name, since it has no other information at the time the stream is opened. Two streams with the same method name that are started concurrently may replay in the wrong order. Besides the differences in replay mentioned above, other differences may cause issues for some programs. We list them here. The Replayer delivers a response to an RPC immediately, without waiting for other incoming RPCs. This can violate causality. For example, in a Pub/Sub program where one goroutine publishes and another subscribes, during replay the Subscribe call may finish before the Publish call begins. For streaming RPCs, the Replayer delivers the result of Send and Recv calls in the order they were recorded. No attempt is made to match message contents. At present, this package does not record or replay stream headers and trailers, or the result of the CloseSend method.
Package CloudForest implements ensembles of decision trees for machine learning in pure Go (golang to search engines). It allows for a number of related algorithms for classification, regression, feature selection and structure analysis on heterogeneous numerical/categorical data with missing values. These include: Breiman and Cutler's Random Forest for Classification and Regression Adaptive Boosting (AdaBoost) Classification Gradiant Boosting Tree Regression Entropy and Cost driven classification L1 regression Feature selection with artificial contrasts Proximity and model structure analysis Roughly balanced bagging for unbalanced classification The API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed only on embargoed data sets and can not yet be released. Library Documentation is in code and can be viewed with godoc or live at: http://godoc.org/github.com/ryanbressler/CloudForest Documentation of command line utilities and file formats can be found in README.md, which can be viewed fromated on github: http://github.com/ryanbressler/CloudForest Pull requests and bug reports are welcome. CloudForest was created by Ryan Bressler and is being developed in the Shumelivich Lab at the Institute for Systems Biology for use on genomic/biomedical data with partial support from The Cancer Genome Atlas and the Inova Translational Medicine Institute. CloudForest is intended to provide fast, comprehensible building blocks that can be used to implement ensembles of decision trees. CloudForest is written in Go to allow a data scientist to develop and scale new models and analysis quickly instead of having to modify complex legacy code. Data structures and file formats are chosen with use in multi threaded and cluster environments in mind. Go's support for function types is used to provide a interface to run code as data is percolated through a tree. This method is flexible enough that it can extend the tree being analyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous function/closure passed to a tree's root node's Recurse method: This allows a researcher to include whatever additional analysis they need (importance scores, proximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests to tabulate scores or extract structure. Utilities like leafcount and errorrate use this method to tabulate data about the tree in collection objects. Decision tree's are grown with the goal of reducing "Impurity" which is usually defined as Gini Impurity for categorical targets or mean squared error for numerical targets. CloudForest grows trees against the Target interface which allows for alternative definitions of impurity. CloudForest includes several alternative targets: Additional targets can be stacked on top of these target to add boosting functionality: Repeatedly splitting the data and searching for the best split at each node of a decision tree are the most computationally intensive parts of decision tree learning and CloudForest includes optimized code to perform these tasks. Go's slices are used extensively in CloudForest to make it simple to interact with optimized code. Many previous implementations of Random Forest have avoided reallocation by reordering data in place and keeping track of start and end indexes. In go, slices pointing at the same underlying arrays make this sort of optimization transparent. For example a function like: can return left and right slices that point to the same underlying array as the original slice of cases but these slices should not have their values changed. Functions used while searching for the best split also accepts pointers to reusable slices and structs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains pointers to these items and its use can be seen in functions like: For categorical predictors, BestSplit will also attempt to intelligently choose between 4 different implementations depending on user input and the number of categories. These include exhaustive, random, and iterative searches for the best combination of categories implemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter, BestCatSplitBig and BestCatSplitIterBig. All numerical predictors are handled by BestNumSplit which relies on go's sorting package. Training a Random forest is an inherently parallel process and CloudForest is designed to allow parallel implementations that can tackle large problems while keeping memory usage low by writing and using data structures directly to/from disk. Trees can be grown in separate go routines. The growforest utility provides an example of this that uses go routines and channels to grow trees in parallel and write trees to disk as the are finished by the "worker" go routines. The few summary statistics like mean impurity decrease per feature (importance) can be calculated using thread safe data structures like RunningMean. Trees can also be grown on separate machines. The .sf stochastic forest format allows several small forests to be combined by concatenation and the ForestReader and ForestWriter structs allow these forests to be accessed tree by tree (or even node by node) from disk. For data sets that are too big to fit in memory on a single machine Tree.Grow and FeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk, distributed database etc. By default cloud forest uses a fast heuristic for missing values. When proposing a split on a feature with missing data the missing cases are removed and the impurity value is corrected to use three way impurity which reduces the bias towards features with lots of missing data: Missing values in the target variable are left out of impurity calculations. This provided generally good results at a fraction of the computational costs of imputing data. Optionally, feature.ImputeMissing or featurematrixImputeMissing can be called before forest growth to impute missing values to the feature mean/mode which Brieman [2] suggests as a fast method for imputing values. This forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the more accurate proximity weighted imputation Brieman describes. Experimental support is provided for 3 way splitting which splits missing cases onto a third branch. [2] This has so far yielded mixed results in testing. At some point in the future support may be added for local imputing of missing values during tree growth as described in [3] [1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 [2] https://code.google.com/p/rf-ace/ [3] http://projecteuclid.org/DPubS?verb=Display&version=1.0&service=UI&handle=euclid.aoas/1223908043&page=record In CloudForest data is stored using the FeatureMatrix struct which contains Features. The Feature struct implements storage and methods for both categorical and numerical data and calculations of impurity etc and the search for the best split. The Target interface abstracts the methods of Feature that are needed for a feature to be predictable. This allows for the implementation of alternative types of regression and classification. Trees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow implements Brieman and Cutler's method (see extract above) for growing a tree. A GrowForest method is also provided that implements the rest of the method including sampling cases but it may be faster to grow the forest to disk as in the growforest utility. Prediction and Voting is done using Tree.Vote and CatBallotBox and NumBallotBox which implement the VoteTallyer interface.
The sockdrawer command is an analysis and visualization tool to help you reorganize a complex Go package into several simpler ones. sockdrawer operates on three kinds of graphs at different levels of abstraction. The lowest level is the NODE GRAPH. A node is a package-level declaration of a named entity (func, var, const or type). An entire constant declaration is treated as a single node, even if it contains multiple "specs" each defining multiple names, since constants so grouped are typically closely related; an important special case is an enumerated set data type. Also, we treat each "spec" of a var or type declaration as a single node. Each reference to a package-level entity E forms an edge in the node graph, from the node in which it appears to the node E. For example: Each method declaration depends on its receiver named type; in addition we add an edge from each receiver type to its methods: to ensure that a type and its methods stay together. The node graph is highly cyclic, and obviously all nodes in a cycle must belong to the same package for the package import graph to remain acyclic. So, we compute the second graph, the SCNODE GRAPH. In essence, the scnode graph is the graph of strongly connected components (SCCs) of the (ordinary) node graph. By construction, the scnode graph is acyclic. We optionally perform an optimization at this point, which is to fuse single-predecessor scnodes with their sole predecessor, as this tends to reduce clutter in big graphs. This means that the scnodes are no longer true SCCs; however, the scnode graph remains acyclic. We define a valid PARTITION P of the scnode graph as a mapping from scnodes to CLUSTERS such that the projection of the scnode graph using mapping P is an acyclic graph. This third graph is the CLUSTER GRAPH. Every partition represents a valid refactoring of the original package into hypothetical subpackages, each cluster being a subpackage. Two partitions define the extreme ends of a spectrum: the MINIMAL partition maps every scnode to a single cluster; it represents the status quo, a monolithic package. The MAXIMAL partition maps each scnode to a unique cluster; this breaks the package up into an impractically large number of small fragments. The ideal partition lies somewhere in between. The --clusters=<file> argument specifies a CLUSTERS FILE that constrains the partition algorithm. The file consists of a number of stanzas, each assigning an import path to a cluster ("mypkg/internal/util") and assigning a set of initial nodes ({x, y, z}) to it: Order of stanzas is important: clusters must be be declared bottom to top. After each stanza, all nodes transitively reachable (via the node graph) from that cluster are assigned to that cluster, if they have not yet been assigned to some other cluster. Thus we need only mention the root nodes of the cluster, not all its internal nodes. A warning is reported if a node mentioned in a stanza already belongs to a previously defined cluster. There is an implicit cluster, "residue", that holds all remaining nodes after the clusters defined by the file have been processed. Initially, when the clusters file is empty, the residue cluster contains the entire package. (It is logically at the top.) The task for the user is to iteratively define new clusters until the residue becomes empty. When sockdrawer is run, it analyzes the source package, builds the node graph and the scgraph, loads the clusters file, computes the clusters for every node, and then emits SVG renderings of the three levels of graphs, with nodes colors coded as follows: The graphs of all clusters, a DAG, has green nodes; clicking one takes you to the graph over scnodes for that cluster, also a DAG. Each pink node in this graph represents a cyclical bunch of the node graph, collapsed together for ease of viewing. Each blue node here represents a singleton SCC, a single declaration; singular SCCs are replaced by their sole element for simplicity. Clicking a pink (plural) scnode shows the cyclical portion of the node graph that it represents. (If the fusion optimization was enabled, it may not be fully cyclic.) All of its nodes are blue. Clicking a blue node shows the definition of that node in godoc. (The godoc server's base URL is specified by the --godoc flag.) Initially, all nodes belong to the "residue" cluster. (GraphViz graph rendering can be slow for the first several iterations. A large monitor is essential.) The sockdrawer user's task when decomposing a package into clusters is to identify the lowest-hanging fruit (so to speak) in the residue cluster---a coherent group of related scnodes at the bottom of the graph---and to "snip off" a bunch at the "stem" by appending a new stanza to the clusters file and listing the roots of that bunch in the stanza, and then to re-run the tool. Nodes may be added to an existing stanza if appropriate, but if they are added to a cluster that is "too low", this may create conflicts; keep an eye out for warnings. This process continues iteratively until the residue has become empty and the sets of clusters are satisfactory. The tool prints the assignments of nodes to clusters: the "shopping list" for the refactoring work. Clusters should be split off into subpackages in dependency order, lowest first. The analysis chooses a single configuration, such as linux/amd64. Declarations for other configurations (e.g. windows/arm) will be absent from the node graph. There may be some excessively large SCCs in the node graph that reflect a circularity in the design. For the purposes of analysis, you can break them arbitrarily by commenting out some code, though more thought will be required for a principled fix (e.g. dependency injection).