Socket
Socket
Sign inDemoInstall

github.com/mna/pigeon

Package Overview
Dependencies
3
Maintainers
0
Alerts
File Explorer

Install Socket

Protect your apps from supply chain attacks

Install

github.com/mna/pigeon

Command pigeon generates parsers in Go from a PEG grammar. From Wikipedia [0]: Its features and syntax are inspired by the PEG.js project [1], while the implementation is loosely based on [2]. Formal presentation of the PEG theory by Bryan Ford is also an important reference [3]. An introductory blog post can be found at [4]. The pigeon tool must be called with PEG input as defined by the accepted PEG syntax below. The grammar may be provided by a file or read from stdin. The generated parser is written to stdout by default. The following options can be specified: If the code blocks in the grammar (see below, section "Code block") are golint- and go vet-compliant, then the resulting generated code will also be golint- and go vet-compliant. The generated code doesn't use any third-party dependency unless code blocks in the grammar require such a dependency. The accepted syntax for the grammar is formally defined in the grammar/pigeon.peg file, using the PEG syntax. What follows is an informal description of this syntax. Identifiers, whitespace, comments and literals follow the same notation as the Go language, as defined in the language specification (http://golang.org/ref/spec#Source_code_representation): The grammar must be Unicode text encoded in UTF-8. New lines are identified by the \n character (U+000A). Space (U+0020), horizontal tabs (U+0009) and carriage returns (U+000D) are considered whitespace and are ignored except to separate tokens. A PEG grammar consists of a set of rules. A rule is an identifier followed by a rule definition operator and an expression. An optional display name - a string literal used in error messages instead of the rule identifier - can be specified after the rule identifier. E.g.: The rule definition operator can be any one of those: A rule is defined by an expression. The following sections describe the various expression types. Expressions can be grouped by using parentheses, and a rule can be referenced by its identifier in place of an expression. The choice expression is a list of expressions that will be tested in the order they are defined. The first one that matches will be used. Expressions are separated by the forward slash character "/". E.g.: Because the first match is used, it is important to think about the order of expressions. For example, in this rule, "<=" would never be used because the "<" expression comes first: The sequence expression is a list of expressions that must all match in that same order for the sequence expression to be considered a match. Expressions are separated by whitespace. E.g.: A labeled expression consists of an identifier followed by a colon ":" and an expression. A labeled expression introduces a variable named with the label that can be referenced in the code blocks in the same scope. The variable will have the value of the expression that follows the colon. E.g.: The variable is typed as an empty interface, and the underlying type depends on the following: For terminals (character and string literals, character classes and the any matcher), the value is []byte. E.g.: For predicates (& and !), the value is always nil. E.g.: For a sequence, the value is a slice of empty interfaces, one for each expression value in the sequence. The underlying types of each value in the slice follow the same rules described here, recursively. E.g.: For a repetition (+ and *), the value is a slice of empty interfaces, one for each repetition. The underlying types of each value in the slice follow the same rules described here, recursively. E.g.: For a choice expression, the value is that of the matching choice. E.g.: For the optional expression (?), the value is nil or the value of the expression. E.g.: Of course, the type of the value can be anything once an action code block is used. E.g.: An expression prefixed with the ampersand "&" is the "and" predicate expression: it is considered a match if the following expression is a match, but it does not consume any input. An expression prefixed with the exclamation point "!" is the "not" predicate expression: it is considered a match if the following expression is not a match, but it does not consume any input. E.g.: The expression following the & and ! operators can be a code block. In that case, the code block must return a bool and an error. The operator's semantic is the same, & is a match if the code block returns true, ! is a match if the code block returns false. The code block has access to any labeled value defined in its scope. E.g.: An expression followed by "*", "?" or "+" is a match if the expression occurs zero or more times ("*"), zero or one time "?" or one or more times ("+") respectively. The match is greedy, it will match as many times as possible. E.g. A literal matcher tries to match the input against a single character or a string literal. The literal may be a single-quoted single character, a double-quoted string or a backtick-quoted raw string. The same rules as in Go apply regarding the allowed characters and escapes. The literal may be followed by a lowercase "i" (outside the ending quote) to indicate that the match is case-insensitive. E.g.: A character class matcher tries to match the input against a class of characters inside square brackets "[...]". Inside the brackets, characters represent themselves and the same escapes as in string literals are available, except that the single- and double-quote escape is not valid, instead the closing square bracket "]" must be escaped to be used. Character ranges can be specified using the "[a-z]" notation. Unicode classes can be specified using the "[\pL]" notation, where L is a single-letter Unicode class of characters, or using the "[\p{Class}]" notation where Class is a valid Unicode class (e.g. "Latin"). As for string literals, a lowercase "i" may follow the matcher (outside the ending square bracket) to indicate that the match is case-insensitive. A "^" as first character inside the square brackets indicates that the match is inverted (it is a match if the input does not match the character class matcher). E.g.: The any matcher is represented by the dot ".". It matches any character except the end of file, thus the "!." expression is used to indicate "match the end of file". E.g.: Code blocks can be added to generate custom Go code. There are three kinds of code blocks: the initializer, the action and the predicate. All code blocks appear inside curly braces "{...}". The initializer must appear first in the grammar, before any rule. It is copied as-is (minus the wrapping curly braces) at the top of the generated parser. It may contain function declarations, types, variables, etc. just like any Go file. Every symbol declared here will be available to all other code blocks. Although the initializer is optional in a valid grammar, it is usually required to generate a valid Go source code file (for the package clause). E.g.: Action code blocks are code blocks declared after an expression in a rule. Those code blocks are turned into a method on the "*current" type in the generated source code. The method receives any labeled expression's value as argument (as any) and must return two values, the first being the value of the expression (an any), and the second an error. If a non-nil error is returned, it is added to the list of errors that the parser will return. E.g.: Predicate code blocks are code blocks declared immediately after the and "&" or the not "!" operators. Like action code blocks, predicate code blocks are turned into a method on the "*current" type in the generated source code. The method receives any labeled expression's value as argument (as any) and must return two opt, the first being a bool and the second an error. If a non-nil error is returned, it is added to the list of errors that the parser will return. E.g.: State change code blocks are code blocks starting with "#". In contrast to action and predicate code blocks, state change code blocks are allowed to modify values in the global "state" store (see below). State change code blocks are turned into a method on the "*current" type in the generated source code. The method is passed any labeled expression's value as an argument (of type any) and must return a value of type error. If a non-nil error is returned, it is added to the list of errors that the parser will return, note that the parser does NOT backtrack if a non-nil error is returned. E.g: The "*current" type is a struct that provides four useful fields that can be accessed in action, state change, and predicate code blocks: "pos", "text", "state" and "globalStore". The "pos" field indicates the current position of the parser in the source input. It is itself a struct with three fields: "line", "col" and "offset". Line is a 1-based line number, col is a 1-based column number that counts runes from the start of the line, and offset is a 0-based byte offset. The "text" field is the slice of bytes of the current match. It is empty in a predicate code block. The "state" field is a global store, with backtrack support, of type "map[string]any". The values in the store are tied to the parser's backtracking, in particular if a rule fails to match then all updates to the state that occurred in the process of matching the rule are rolled back. For a key-value store that is not tied to the parser's backtracking, see the "globalStore". The values in the "state" store are available for read access in action and predicate code blocks, any changes made to the "state" store will be reverted once the action or predicate code block is finished running. To update values in the "state" use state change code blocks ("#{}"). IMPORTANT: The "globalStore" field is a global store of type "map[string]any", which allows to store arbitrary values, which are available in action and predicate code blocks for read as well as write access. It is important to notice, that the global store is completely independent from the backtrack mechanism of PEG and is therefore not set back to its old state during backtrack. The initialization of the global store may be achieved by using the GlobalStore function (http://godoc.org/github.com/mna/pigeon/test/predicates#GlobalStore). Be aware, that all keys starting with "_pigeon" are reserved for internal use of pigeon and should not be used nor modified. Those keys are treated as internal implementation details and therefore there are no guarantees given in regards of API stability. With options -support-left-recursion pigeon supports left recursion. E.g.: Supports indirect recursion: The implementation is based on the [Left-recursive PEG Grammars][9] article that links to [Left Recursion in Parsing Expression Grammars][10] and [Packrat Parsers Can Support Left Recursion][11] papers. References: pigeon supports an extension of the classical PEG syntax called failure labels, proposed by Maidl et al. in their paper "Error Reporting in Parsing Expression Grammars" [7]. The used syntax for the introduced expressions is borrowed from their lpeglabel [8] implementation. This extension allows to signal different kinds of errors and to specify, which recovery pattern should handle a given label. With labeled failures it is possible to distinguish between an ordinary failure and an error. Usually, an ordinary failure is produced when the matching of a character fails, and this failure is caught by ordered choice. An error (a non-ordinary failure), by its turn, is produced by the throw operator and may be caught by the recovery operator. In pigeon, the recovery expression consists of the regular expression, the recovery expression and a set of labels to be matched. First, the regular expression is tried. If this fails with one of the provided labels, the recovery expression is tried. If this fails as well, the error is propagated. E.g.: To signal a failure condition, the throw expression is used. E.g.: For concrete examples, how to use throw and recover, have a look at the examples "labeled_failures" and "thrownrecover" in the "test" folder. The implementation of the throw and recover operators work as follows: The failure recover expression adds the recover expression for every failure label to the recovery stack and runs the regular expression. The throw expression checks the recovery stack in reversed order for the provided failure label. If the label is found, the respective recovery expression is run. If this expression is successful, the parser continues the processing of the input. If the recovery expression is not successful, the parsing fails and the parser starts to backtrack. If throw and recover expressions are used together with global state, it is the responsibility of the author of the grammar to reset the global state to a valid state during the recovery operation. The parser generated by pigeon exports a few symbols so that it can be used as a package with public functions to parse input text. The exported API is: See the godoc page of the generated parser for the test/predicates grammar for an example documentation page of the exported API: http://godoc.org/github.com/mna/pigeon/test/predicates. Like the grammar used to generate the parser, the input text must be UTF-8-encoded Unicode. The start rule of the parser is the first rule in the PEG grammar used to generate the parser. A call to any of the Parse* functions returns the value generated by executing the grammar on the provided input text, and an optional error. Typically, the grammar should generate some kind of abstract syntax tree (AST), but for simple grammars it may evaluate the result immediately, such as in the examples/calculator example. There are no constraints imposed on the author of the grammar, it can return whatever is needed. When the parser returns a non-nil error, the error is always of type errList, which is defined as a slice of errors ([]error). Each error in the list is of type *parserError. This is a struct that has an "Inner" field that can be used to access the original error. So if a code block returns some well-known error like: The original error can be accessed this way: By defaut the parser will continue after an error is returned and will cumulate all errors found during parsing. If the grammar reaches a point where it shouldn't continue, a panic statement can be used to terminate parsing. The panic will be caught at the top-level of the Parse* call and will be converted into a *parserError like any error, and an errList will still be returned to the caller. The divide by zero error in the examples/calculator grammar leverages this feature (no special code is needed to handle division by zero, if it happens, the runtime panics and it is recovered and returned as a parsing error). Providing good error reporting in a parser is not a trivial task. Part of it is provided by the pigeon tool, by offering features such as filename, position, expected literals and rule name in the error message, but an important part of good error reporting needs to be done by the grammar author. For example, many programming languages use double-quotes for string literals. Usually, if the opening quote is found, the closing quote is expected, and if none is found, there won't be any other rule that will match, there's no need to backtrack and try other choices, an error should be added to the list and the match should be consumed. In order to do this, the grammar can look something like this: This is just one example, but it illustrates the idea that error reporting needs to be thought out when designing the grammar. Because the above mentioned error types (errList and parserError) are not exported, additional steps have to be taken, ff the generated parser is used as library package in other packages (e.g. if the same parser is used in multiple command line tools). One possible implementation for exported errors (based on interfaces) and customized error reporting (caret style formatting of the position, where the parsing failed) is available in the json example and its command line tool: http://godoc.org/github.com/mna/pigeon/examples/json Generated parsers have user-provided code mixed with pigeon code in the same package, so there is no package boundary in the resulting code to prevent access to unexported symbols. What is meant to be implementation details in pigeon is also available to user code - which doesn't mean it should be used. For this reason, it is important to precisely define what is intended to be the supported API of pigeon, the parts that will be stable in future versions. The "stability" of the version 1.0 API attempts to make a similar guarantee as the Go 1 compatibility [5]. The following lists what part of the current pigeon code falls under that guarantee (features may be added in the future): The pigeon command-line flags and arguments: those will not be removed and will maintain the same semantics. The explicitly exported API generated by pigeon. See [6] for the documentation of this API on a generated parser. The PEG syntax, as documented above. The code blocks (except the initializer) will always be generated as methods on the *current type, and this type is guaranteed to have the fields pos (type position) and text (type []byte). There are no guarantees on other fields and methods of this type. The position type will always have the fields line, col and offset, all defined as int. There are no guarantees on other fields and methods of this type. The type of the error value returned by the Parse* functions, when not nil, will always be errList defined as a []error. There are no guarantees on methods of this type, other than the fact it implements the error interface. Individual errors in the errList will always be of type *parserError, and this type is guaranteed to have an Inner field that contains the original error value. There are no guarantees on other fields and methods of this type. The above guarantee is given to the version 1.0 (https://github.com/mna/pigeon/releases/tag/v1.0.0) of pigeon, which has entered maintenance mode (bug fixes only). The current master branch includes the development toward a future version 2.0, which intends to further improve pigeon. While the given API stability should be maintained as far as it makes sense, breaking changes may be necessary to be able to improve pigeon. The new version 2.0 API has not yet stabilized and therefore changes to the API may occur at any time. References:

    v1.2.1

Version published
Maintainers
0

Readme

# pigeon - a PEG parser generator for Go

[![Go Reference](https://pkg.go.dev/badge/github.com/mna/pigeon.svg)](https://pkg.go.dev/github.com/mna/pigeon)
[![Test Status](https://github.com/mna/pigeon/workflows/Go%20Matrix/badge.svg)](https://github.com/mna/pigeon/actions?query=workflow%3AGo%20Matrix)
[![GoReportCard](https://goreportcard.com/badge/github.com/mna/pigeon)](https://goreportcard.com/report/github.com/mna/pigeon)
[![Software License](https://img.shields.io/badge/license-BSD-blue.svg)](LICENSE)

The pigeon command generates parsers based on a [parsing expression grammar (PEG)][0]. Its grammar and syntax is inspired by the [PEG.js project][1], while the implementation is loosely based on the [parsing expression grammar for C# 3.0][2] article. It parses Unicode text encoded in UTF-8.

See the [godoc page][3] for detailed usage. Also have a look at the [Pigeon Wiki](https://github.com/mna/pigeon/wiki) for additional information about Pigeon and PEG in general.

## Releases

* v1.0.0 is the tagged release of the original implementation.
* Work has started on v2.0.0 with some planned breaking changes.

Github user [@mna][6] created the package in April 2015, and [@breml][5] is the package's maintainer as of May 2017.

### Release policy

Starting of June 2023, the backwards compatibility support for `pigeon` is changed to follow the official [Go Security Policy](https://github.com/golang/go/security/policy).

Over time, the Go ecosystem is evolving.

On one hand side, packages like [golang.org/x/tools](https://pkg.go.dev/golang.org/x/tools), which are critical dependencies of `pigeon`, do follow the official Security Policy and with `pigeon` not following the same guidelines, it was no longer possible to include recent versions of these dependencies and with this it was no longer possible to include critical bugfixes.
On the other hand there are changes to what is considered good practice by the greater community (e.g. change from `interface{}` to `any`). For users following (or even enforcing) these good practices, the code generated by `pigeon` does no longer meet the bar of expectations.
Last but not least, following the Go Security Policy over the last years has been a smooth experience and therefore updating Go on a regular bases feels like duty that is reasonable to be put on users of `pigeon`.

This observations lead to the decision to follow the same Security Policy as Go.

## Installation

Provided you have Go correctly installed with the $GOPATH and $GOBIN environment variables set, run:

```
$ go get -u github.com/mna/pigeon
```

This will install or update the package, and the `pigeon` command will be installed in your $GOBIN directory. Neither this package nor the parsers generated by this command require any third-party dependency, unless such a dependency is used in the code blocks of the grammar.

## Basic usage

```
$ pigeon [options] [PEG_GRAMMAR_FILE]
```

By default, the input grammar is read from `stdin` and the generated code is printed to `stdout`. You may save it in a file using the `-o` flag.

## Example

Given the following grammar:

```
{
// part of the initializer code block omitted for brevity

var ops = map[string]func(int, int) int {
    "+": func(l, r int) int {
        return l + r
    },
    "-": func(l, r int) int {
        return l - r
    },
    "*": func(l, r int) int {
        return l * r
    },
    "/": func(l, r int) int {
        return l / r
    },
}

func toAnySlice(v any) []any {
    if v == nil {
        return nil
    }
    return v.([]any)
}

func eval(first, rest any) int {
    l := first.(int)
    restSl := toAnySlice(rest)
    for _, v := range restSl {
        restExpr := toAnySlice(v)
        r := restExpr[3].(int)
        op := restExpr[1].(string)
        l = ops[op](l, r)
    }
    return l
}
}


Input <- expr:Expr EOF {
    return expr, nil
}

Expr <- _ first:Term rest:( _ AddOp _ Term )* _ {
    return eval(first, rest), nil
}

Term <- first:Factor rest:( _ MulOp _ Factor )* {
    return eval(first, rest), nil
}

Factor <- '(' expr:Expr ')' {
    return expr, nil
} / integer:Integer {
    return integer, nil
}

AddOp <- ( '+' / '-' ) {
    return string(c.text), nil
}

MulOp <- ( '*' / '/' ) {
    return string(c.text), nil
}

Integer <- '-'? [0-9]+ {
    return strconv.Atoi(string(c.text))
}

_ "whitespace" <- [ \n\t\r]*

EOF <- !.
```

The generated parser can parse simple arithmetic operations, e.g.:

```
18 + 3 - 27 * (-18 / -3)

=> -141
```

More examples can be found in the `examples/` subdirectory.

See the [package documentation][3] for detailed usage.

## Contributing

See the CONTRIBUTING.md file.

## License

The [BSD 3-Clause license][4]. See the LICENSE file.

[0]: http://en.wikipedia.org/wiki/Parsing_expression_grammar
[1]: http://pegjs.org/
[2]: http://www.codeproject.com/Articles/29713/Parsing-Expression-Grammar-Support-for-C-Part
[3]: https://pkg.go.dev/github.com/mna/pigeon
[4]: http://opensource.org/licenses/BSD-3-Clause
[5]: https://github.com/breml
[6]: https://github.com/mna

FAQs

Last updated on 16 Oct 2023

Did you know?

Socket installs a GitHub app to automatically flag issues on every pull request and report the health of your dependencies. Find out what is inside your node modules and prevent malicious activity before you update the dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc