A dead simple parser package for Go
Introduction
The goal of this package is to provide a simple, idiomatic and elegant way of
defining parsers in Go.
Participle's method of defining grammars should be familiar to any Go
programmer who has used the encoding/json package: struct field tags define
what and how input is mapped to those same fields. This is not unusual for Go
encoders, but is unusual for a parser.
Changes
See the Change Log for details.
Tutorial
A tutorial is available, walking through the creation of an .ini parser.
Overview
A grammar is an annotated Go structure used both to define the parser grammar
and to be the AST output by the parser. As an example, the following is the
final INI parser from the tutorial.
type INI struct {
	Properties []*Property `{ @@ }`
	Sections   []*Section  `{ @@ }`
}

type Section struct {
	Identifier string      `"[" @Ident "]"`
	Properties []*Property `{ @@ }`
}

type Property struct {
	Key   string `@Ident "="`
	Value *Value `@@`
}

type Value struct {
	String *string  `  @String`
	Number *float64 `| @Float`
}
Note: Participle also supports named struct tags (eg. Hello string `parser:"@Ident"`).
A parser is constructed from a grammar and a lexer:
parser, err := participle.Build(&INI{})
Once constructed, the parser is applied to input to produce an AST:
ast := &INI{}
err := parser.ParseString("", "size = 10", ast)
Annotation syntax
- @<expr> Capture expression into the field.
- @@ Recursively capture using the field's own type.
- <identifier> Match named lexer token.
- ( ... ) Group.
- "..." Match the literal (note that the lexer must emit tokens matching this literal exactly).
- "...":<identifier> Match the literal, specifying the exact lexer token type to match.
- <expr> <expr> ... Match expressions.
- <expr> | <expr> Match one of the alternatives.
- !<expr> Match any token that is not the start of the expression (eg: @!";" matches anything but the ; character into the field).
The following modifiers can be used after any expression:
- * Expression can match zero or more times.
- + Expression must match one or more times.
- ? Expression can match zero or one times.
- ! Require a non-empty match (this is useful with a sequence of optional matches eg. ("a"? "b"? "c"?)!).
Supported but deprecated:
- { ... } Match 0 or more times (DEPRECATED - prefer ( ... )*).
- [ ... ] Optional (DEPRECATED - prefer ( ... )?).
Notes:
- Each struct is a single production, with each field applied in sequence.
- @<expr> is the mechanism for capturing matches into the field.
- If a struct field is not keyed with "parser", the entire struct tag
  will be used as the grammar fragment. This allows the grammar syntax to remain
  clear and simple to maintain.
Capturing
Prefixing any expression in the grammar with @ will capture matching values
for that expression into the corresponding field.
For example:
// The grammar definition.
type Grammar struct {
	Hello string `@Ident`
}

// The source text to parse.
source := "world"

// After parsing, the resulting AST.
result == &Grammar{
	Hello: "world",
}
For slice and string fields, each instance of @ will accumulate into the
field (including repeated patterns). Accumulation into other types is not
supported.
A successful capture match into a boolean field will set the field to true.
For integer and floating point types, a successful capture will be parsed
with strconv.ParseInt() and strconv.ParseFloat() respectively.
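As a rough illustration of that conversion step, the sketch below applies the two strconv functions to the text of a matched token. The helpers captureInt and captureFloat are hypothetical names invented for this example and are not part of Participle; the base and bit-size arguments are likewise assumptions.

```go
package main

import (
	"fmt"
	"strconv"
)

// captureInt is a hypothetical helper (not a Participle function) showing how
// the text of a matched token could be converted for an integer field.
func captureInt(text string) (int64, error) {
	// Base 0 lets ParseInt infer the base from a prefix such as 0x (assumed).
	return strconv.ParseInt(text, 0, 64)
}

// captureFloat is the floating point counterpart, also hypothetical.
func captureFloat(text string) (float64, error) {
	return strconv.ParseFloat(text, 64)
}

func main() {
	n, err := captureInt("10")
	fmt.Println(n, err)

	f, err := captureFloat("1.5")
	fmt.Println(f, err)

	// Text that is not a number is reported as an error rather than captured.
	_, err = captureInt("ten")
	fmt.Println(err != nil)
}
```

A token whose text cannot be converted surfaces as an error rather than a zero value, which is why the lexer's number tokens and the field type need to agree.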
Tokens can also be captured directly into fields of type lexer.Token and
[]lexer.Token.
Custom control of how values are captured into fields can be achieved by a
field type implementing the Capture interface (Capture(values []string) error).
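A minimal sketch of such a field type follows. The Capture interface is redeclared locally, using the signature quoted above, so the example compiles on its own; the Boolean type and the keywords it accepts are hypothetical, and main calls Capture by hand to stand in for the parser.

```go
package main

import (
	"fmt"
	"strings"
)

// Capture is declared locally so the sketch compiles without Participle; the
// method signature is the one quoted in the text.
type Capture interface {
	Capture(values []string) error
}

// Boolean is a hypothetical field type that accepts several keyword spellings.
type Boolean bool

// Capture converts the single captured value into true or false.
func (b *Boolean) Capture(values []string) error {
	if len(values) != 1 {
		return fmt.Errorf("expected one value, got %d", len(values))
	}
	switch strings.ToLower(values[0]) {
	case "true", "yes", "on":
		*b = true
	case "false", "no", "off":
		*b = false
	default:
		return fmt.Errorf("invalid boolean %q", values[0])
	}
	return nil
}

func main() {
	var enabled Boolean
	var c Capture = &enabled // *Boolean satisfies the interface
	err := c.Capture([]string{"yes"})
	fmt.Println(enabled, err)
}
```

The method has a pointer receiver because it has to modify the field in place.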
Additionally, any field implementing the encoding.TextUnmarshaler interface
will be capturable too. One caveat is that UnmarshalText() will be called once
for each captured token, so eg. @(Ident Ident Ident) will result in three
calls.
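Because of that caveat, an UnmarshalText() implementation usually needs to accumulate rather than overwrite. The sketch below shows one way to do that; the Path type is hypothetical, and the loop in main only simulates the three calls a three-token capture would make.

```go
package main

import (
	"encoding"
	"fmt"
	"strings"
)

// Path is a hypothetical field type that collects every captured token.
type Path struct {
	Parts []string
}

// UnmarshalText appends instead of overwriting, because it is expected to be
// invoked once per captured token.
func (p *Path) UnmarshalText(text []byte) error {
	if len(text) == 0 {
		return fmt.Errorf("empty path segment")
	}
	p.Parts = append(p.Parts, string(text))
	return nil
}

// String joins the collected segments with "/".
func (p *Path) String() string { return strings.Join(p.Parts, "/") }

// Compile-time check that *Path satisfies encoding.TextUnmarshaler.
var _ encoding.TextUnmarshaler = (*Path)(nil)

func main() {
	var p Path
	// Simulate the three calls made for a capture of three Ident tokens.
	for _, tok := range []string{"usr", "local", "bin"} {
		if err := p.UnmarshalText([]byte(tok)); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println(p.String())
}
```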
Streaming
Participle supports streaming parsing. Simply pass a channel of your grammar into
Parse*(). The grammar will be repeatedly parsed and sent to the channel. Note that
the Parse*() call will not return until parsing completes, so it should generally be
started in a goroutine.
type token struct {
	Str string `  @Ident`
	Num int    `| @Int`
}

parser, err := participle.Build(&token{})

tokens := make(chan *token, 128)

// This short input fits in the channel buffer, so the call can run inline
// here; for larger inputs start it in a goroutine.
err = parser.ParseString("", `hello 10 11 12 world`, tokens)
for token := range tokens {
	fmt.Printf("%#v\n", token)
}
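The goroutine advice can be sketched without Participle. In the example below, parseStream is a hypothetical stand-in for a blocking Parse*() call: it sends one value per word and closes the channel when the input is exhausted. Closing on completion is an assumption of this sketch, not a statement about the library.

```go
package main

import (
	"fmt"
	"strings"
)

// parseStream is a hypothetical stand-in for a blocking Parse*() call. It
// sends one value per whitespace-separated word and closes the channel when
// the input is exhausted (assumed behaviour for this sketch).
func parseStream(src string, out chan<- string) {
	defer close(out)
	for _, word := range strings.Fields(src) {
		out <- word // blocks whenever the buffer is full
	}
}

// collect runs the producer in a goroutine and drains the channel.
func collect(src string, buffer int) []string {
	tokens := make(chan string, buffer)
	go parseStream(src, tokens)
	var got []string
	for tok := range tokens {
		got = append(got, tok)
	}
	return got
}

func main() {
	// The buffer is smaller than the input, yet nothing deadlocks because the
	// producer and the consumer run concurrently.
	fmt.Println(collect("hello 10 11 12 world", 1))
}
```

Running the producer inline with a buffer smaller than the input would block on the second send, which is the situation the goroutine avoids.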
Lexing
Participle relies on distinct lexing and parsing phases. The lexer takes raw
bytes and produces tokens which the parser consumes. The parser transforms
these tokens into Go values.
The default lexer is based on the Go text/scanner package and thus produces
tokens for Go-like source code. This is surprisingly useful, but if you do require
more control over lexing the builtin participle/lexer/stateful lexer should
cover most other cases. If that in turn is not flexible enough, you can implement
your own lexer.
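To get a feel for what "Go-like" tokens look like, the sketch below runs the standard library text/scanner directly over the INI input used earlier. It shows only what text/scanner itself emits; how Participle wraps or renames these tokens is not covered, and the tokenize helper is a name chosen for this example.

```go
package main

import (
	"fmt"
	"strings"
	"text/scanner"
)

// tokenize returns "<token type> <token text>" for every token that
// text/scanner produces with its default (Go-like) settings.
func tokenize(src string) []string {
	var s scanner.Scanner
	s.Init(strings.NewReader(src)) // Init selects the GoTokens mode
	var out []string
	for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
		out = append(out, scanner.TokenString(tok)+" "+s.TokenText())
	}
	return out
}

func main() {
	for _, t := range tokenize("size = 10") {
		fmt.Println(t)
	}
}
```

Identifiers and numbers come back as named token types, while punctuation such as = is reported as its own character. That split is why grammars can mix names like Ident with quoted literals.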
Configure your parser with a lexer via participle.Lexer().
To use your own Lexer you will need to implement two interfaces: Definition
and Lexer.
Experimental - code generation
Participle v1 now has experimental support for generating code to perform
lexing. Use participle/experimental/codegen.GenerateLexer() to compile a
stateful lexer to Go code.
This will generally provide around a 10x improvement in lexing performance
while producing O(1) garbage.
Options
The Parser's behaviour can be configured via Options.
Examples
Several examples are included. Below is a full GraphQL lexer and parser:
package main

import (
	"fmt"
	"os"

	"github.com/alecthomas/kong"
	"github.com/alecthomas/repr"

	"github.com/alecthomas/participle"
	"github.com/alecthomas/participle/lexer"
	"github.com/alecthomas/participle/lexer/stateful"
)
type File struct {
	Entries []*Entry `@@*`
}

type Entry struct {
	Type   *Type   `  @@`
	Schema *Schema `| @@`
	Enum   *Enum   `| @@`
	Scalar string  `| "scalar" @Ident`
}

type Enum struct {
	Name  string   `"enum" @Ident`
	Cases []string `"{" { @Ident } "}"`
}

type Schema struct {
	Fields []*Field `"schema" "{" { @@ } "}"`
}

type Type struct {
	Name       string   `"type" @Ident`
	Implements string   `[ "implements" @Ident ]`
	Fields     []*Field `"{" { @@ } "}"`
}

type Field struct {
	Name       string      `@Ident`
	Arguments  []*Argument `[ "(" [ @@ { "," @@ } ] ")" ]`
	Type       *TypeRef    `":" @@`
	Annotation string      `[ "@" @Ident ]`
}

type Argument struct {
	Name    string   `@Ident`
	Type    *TypeRef `":" @@`
	Default *Value   `[ "=" @@ ]`
}

type TypeRef struct {
	Array       *TypeRef `(   "[" @@ "]"`
	Type        string   `  | @Ident )`
	NonNullable bool     `[ @"!" ]`
}

type Value struct {
	Symbol string `@Ident`
}
var (
	graphQLLexer = lexer.Must(stateful.NewSimple([]stateful.Rule{
		{"Comment", `(?:#|//)[^\n]*\n?`, nil},
		{"Ident", `[a-zA-Z]\w*`, nil},
		{"Number", `(?:\d*\.)?\d+`, nil},
		{"Punct", `[-[!@#$%^&*()+_={}\|:;"'<,>.?/]|]`, nil},
		{"Whitespace", `[ \t\n\r]+`, nil},
	}))

	parser = participle.MustBuild(&File{},
		participle.Lexer(graphQLLexer),
		participle.Elide("Comment", "Whitespace"),
		participle.UseLookahead(2),
	)
)
var cli struct {
	EBNF  bool     `help:"Dump EBNF."`
	Files []string `arg:"" optional:"" type:"existingfile" help:"GraphQL schema files to parse."`
}
func main() {
	ctx := kong.Parse(&cli)
	if cli.EBNF {
		fmt.Println(parser.String())
		ctx.Exit(0)
	}
	for _, file := range cli.Files {
		ast := &File{}
		r, err := os.Open(file)
		ctx.FatalIfErrorf(err)
		err = parser.Parse(file, r, ast)
		r.Close()
		repr.Println(ast)
		ctx.FatalIfErrorf(err)
	}
}
Performance
One of the included examples is a complete Thrift parser
(shell-style comments are not supported). This gives
a convenient baseline for comparing to the PEG based
pigeon, which is the parser used by
go-thrift. Additionally, the pigeon
parser is utilising a generated parser, while the participle parser is built at
run time.
You can run the benchmarks yourself, but here's the output on my machine:
BenchmarkParticipleThrift-12 5941 201242 ns/op 178088 B/op 2390 allocs/op
BenchmarkGoThriftParser-12 3196 379226 ns/op 157560 B/op 2644 allocs/op
On a real life codebase of 47K lines of Thrift, Participle takes 200ms and
go-thrift takes 630ms, which aligns quite closely with the benchmarks.
Concurrency
A compiled Parser instance can be used concurrently. A LexerDefinition
can be used concurrently. A Lexer instance cannot be used concurrently.
Error reporting
There are a few areas where Participle can provide useful feedback to users of your parser.
- Errors returned by Parser.Parse*() will be of type Error. This will contain positional information where available.
- Participle will make a best effort to return as much of the AST up to the error location as possible.
- Any node in the AST containing a field Pos lexer.Position will be automatically populated from the nearest matching token.
- Any node in the AST containing a field EndPos lexer.Position will be automatically populated from the token at the end of the node.
- Any node in the AST containing a field Tokens []lexer.Token will be automatically populated with all tokens captured by the node, including elided tokens.
These related pieces of information can be combined to provide fairly comprehensive error reporting.
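One way to combine positional information with the source text is sketched below. The Position struct is a local stand-in whose field names are assumptions for this example, and formatError is a hypothetical helper rather than anything provided by Participle.

```go
package main

import (
	"fmt"
	"strings"
)

// Position is a local stand-in for the positional information described
// above; the field names are assumptions made for this sketch.
type Position struct {
	Filename string
	Line     int // 1-based
	Column   int // 1-based
}

// formatError renders "file:line:col: msg", followed by the offending source
// line and a caret under the reported column.
func formatError(src string, pos Position, msg string) string {
	header := fmt.Sprintf("%s:%d:%d: %s", pos.Filename, pos.Line, pos.Column, msg)
	lines := strings.Split(src, "\n")
	if pos.Line < 1 || pos.Line > len(lines) || pos.Column < 1 {
		return header // position lies outside the source: no snippet
	}
	caret := strings.Repeat(" ", pos.Column-1) + "^"
	return header + "\n" + lines[pos.Line-1] + "\n" + caret
}

func main() {
	src := "size = 10\nname = = 5"
	pos := Position{Filename: "config.ini", Line: 2, Column: 8}
	fmt.Println(formatError(src, pos, `unexpected "="`))
}
```

The caret alignment assumes the source line contains no tabs; a fuller version would expand them before computing the offset.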
Limitations
Internally, Participle is a recursive descent parser with backtracking (see
UseLookahead(K)).
Among other things, this means that it does not support left recursion. Left
recursion must be eliminated by restructuring your grammar.
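The usual restructuring replaces a left-recursive rule such as Expr = Expr "-" Number | Number with the iterative form Expr = Number ( "-" Number )*. In the annotation syntax described earlier, that would be along the lines of `@@ ( "-" @@ )*`. The hand-written recursive descent parser below is only a sketch of why the rewritten form terminates and still associates to the left; it does not use Participle, and all names in it are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parser walks a slice of pre-split tokens.
type parser struct {
	toks []string
	pos  int
}

// peek returns the current token, or "" at end of input.
func (p *parser) peek() string {
	if p.pos < len(p.toks) {
		return p.toks[p.pos]
	}
	return ""
}

// number parses a single integer token.
func (p *parser) number() (int, error) {
	if p.pos >= len(p.toks) {
		return 0, fmt.Errorf("unexpected end of input")
	}
	n, err := strconv.Atoi(p.toks[p.pos])
	if err != nil {
		return 0, fmt.Errorf("expected number, got %q", p.toks[p.pos])
	}
	p.pos++
	return n, nil
}

// expr implements Expr = Number ( "-" Number )*. It always consumes a token
// before looping, so it cannot recurse forever, and it folds operands left to
// right, so "10 - 3 - 2" still means (10 - 3) - 2.
func (p *parser) expr() (int, error) {
	left, err := p.number()
	if err != nil {
		return 0, err
	}
	for p.peek() == "-" {
		p.pos++
		right, err := p.number()
		if err != nil {
			return 0, err
		}
		left -= right
	}
	return left, nil
}

// eval parses a whitespace-separated expression and rejects trailing tokens.
func eval(src string) (int, error) {
	p := &parser{toks: strings.Fields(src)}
	v, err := p.expr()
	if err == nil && p.pos != len(p.toks) {
		err = fmt.Errorf("unexpected token %q", p.peek())
	}
	return v, err
}

func main() {
	fmt.Println(eval("10 - 3 - 2"))
}
```

A direct translation of the left-recursive rule would call expr() as its first step without consuming any input, which is the infinite recursion the rewrite avoids.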
EBNF
Participle supports outputting an EBNF grammar from a Participle parser. Once
the parser is constructed simply call String().
For example, the GraphQL example produces the following EBNF:
File = Entry* .
Entry = Type | Schema | Enum | "scalar" ident .
Type = "type" ident ("implements" ident)? "{" Field* "}" .
Field = ident ("(" (Argument ("," Argument)*)? ")")? ":" TypeRef ("@" ident)? .
Argument = ident ":" TypeRef ("=" Value)? .
TypeRef = "[" TypeRef "]" | ident "!"? .
Value = ident .
Schema = "schema" "{" Field* "}" .
Enum = "enum" ident "{" ident* "}" .