Python Left-Right Parser
Pyleri is an easy-to-use parser created for SiriDB. We first used lrparsing and wrote jsleri for auto-completion and suggestions in our web console. Later we found small issues within the lrparsing
module and also had difficulties keeping the language the same in all projects. That is when we decided to create Pyleri which can export a created grammar to JavaScript, C, Python, Go and Java.
Gabriele Tomassetti wrote a tutorial about the pyleri library.
Related projects
Installation
The easiest way is to use PyPI:
sudo pip3 install pyleri
Quick usage
from pyleri import (
Grammar,
Keyword,
Regex,
Sequence)
class MyGrammar(Grammar):
r_name = Regex('(?:"(?:[^"]*)")+')
k_hi = Keyword('hi')
START = Sequence(k_hi, r_name)
my_grammar = MyGrammar()
print(my_grammar.parse('hi "Iris"').is_valid)
print(my_grammar.parse('bye "Iris"').is_valid)
print(my_grammar.parse('bye "Iris"').as_str())
Grammar
When writing a grammar you should subclass Grammar. A Grammar expects at least a START
property so the parser knows where to start parsing. Grammar has some default properties which can be overwritten like RE_KEYWORDS
, which will be explained later. Grammar also has a parse method: parse()
, and a few export methods: export_js(), export_c(), export_py(), export_go() and export_java() which are explained below.
parse
syntax:
Grammar().parse(string)
The parse()
method returns a result object which has the following properties that are further explained in Result:
export_js
syntax:
Grammar().export_js(
js_module_name='jsleri',
js_template=Grammar.JS_TEMPLATE,
js_indent=' ' * 4)
Optional keyword arguments:
js_module_name
: Name of the JavaScript module. (default: 'jsleri')js_template
: Template String used for the export. You might want to look at the default string which can be found at Grammar.JS_TEMPLATE.js_indent
: indentation used in the JavaScript file. (default: 4 spaces)
For example when using our Quick usage grammar, this is the output when running my_grammar.export_js()
:
'use strict';
(function (
Regex,
Sequence,
Keyword,
Grammar
) {
var r_name = Regex('^(?:"(?:[^"]*)")+');
var k_hi = Keyword('hi');
var START = Sequence(
k_hi,
r_name
);
window.MyGrammar = Grammar(START, '^\w+');
})(
window.jsleri.Regex,
window.jsleri.Sequence,
window.jsleri.Keyword,
window.jsleri.Grammar
);
export_c
syntax:
Grammar().export_c(
target=Grammar.C_TARGET,
c_indent=' ' * 4)
Optional keyword arguments:
target
: Name of the c module. (default: 'grammar')c_indent
: indentation used in the c files. (default: 4 spaces)
The return value is a tuple containing the source (c) file and header (h) file.
For example when using our Quick usage grammar, this is the output when running my_grammar.export_c()
:
#include "grammar.h"
#include <stdio.h>
#define CLERI_CASE_SENSITIVE 0
#define CLERI_CASE_INSENSITIVE 1
#define CLERI_FIRST_MATCH 0
#define CLERI_MOST_GREEDY 1
cleri_grammar_t * compile_grammar(void)
{
cleri_t * r_name = cleri_regex(CLERI_GID_R_NAME, "^(?:\"(?:[^\"]*)\")+");
cleri_t * k_hi = cleri_keyword(CLERI_GID_K_HI, "hi", CLERI_CASE_INSENSITIVE);
cleri_t * START = cleri_sequence(
CLERI_GID_START,
2,
k_hi,
r_name
);
cleri_grammar_t * grammar = cleri_grammar(START, "^\\w+");
return grammar;
}
and the header file...
#ifndef CLERI_EXPORT_GRAMMAR_H_
#define CLERI_EXPORT_GRAMMAR_H_
#include <grammar.h>
#include <cleri/cleri.h>
cleri_grammar_t * compile_grammar(void);
enum cleri_grammar_ids {
CLERI_NONE,
CLERI_GID_K_HI,
CLERI_GID_R_NAME,
CLERI_GID_START,
CLERI_END
};
#endif
export_go
syntax:
Grammar().export_go(
go_template=Grammar.GO_TEMPLATE,
go_indent='\t',
go_package='grammar')
Optional keyword arguments:
go_template
: Template String used for the export. You might want to look at the default string which can be found at Grammar.GO_TEMPLATE.go_indent
: indentation used in the Go file. (default: one tab)go_package
: Name of the go package. (default: 'grammar')
For example when using our Quick usage grammar, this is the output when running my_grammar.export_go()
:
package grammar
import (
"regexp"
"github.com/cesbit/goleri"
)
const (
NoGid = iota
GidKHi = iota
GidRName = iota
GidSTART = iota
)
func MyGrammar() *goleri.Grammar {
rName := goleri.NewRegex(GidRName, regexp.MustCompile(`^(?:"(?:[^"]*)")+`))
kHi := goleri.NewKeyword(GidKHi, "hi", false)
START := goleri.NewSequence(
GidSTART,
kHi,
rName,
)
return goleri.NewGrammar(START, regexp.MustCompile(`^\w+`))
}
export_java
syntax:
Grammar().export_java(
java_template=Grammar.JAVA_TEMPLATE,
java_indent=' ' * 4,
java_package=None,
is_public=True)
Optional keyword arguments:
java_template
: Template String used for the export. You might want to look at the default string which can be found at Grammar.JAVA_TEMPLATE.java_indent
: indentation used in the Java file. (default: four spaces)java_package
: Name of the Java package or None when no package is specified. (default: None)is_public
: Class and constructor are defined as public when True, else they will be defined as package private.
For example when using our Quick usage grammar, this is the output when running my_grammar.export_java()
:
import jleri.Grammar;
import jleri.Element;
import jleri.Sequence;
import jleri.Regex;
import jleri.Keyword;
public class MyGrammar extends Grammar {
enum Ids {
K_HI,
R_NAME,
START
}
private static final Element R_NAME = new Regex(Ids.R_NAME, "^(?:\"(?:[^\"]*)\")+");
private static final Element K_HI = new Keyword(Ids.K_HI, "hi", false);
private static final Element START = new Sequence(
Ids.START,
K_HI,
R_NAME
);
public MyGrammar() {
super(START, "^\\w+");
}
}
export_py
syntax:
Grammar().export_py(
py_module_name='pyleri',
py_template=Grammar.PY_TEMPLATE,
py_indent=' ' * 4)
Optional keyword arguments:
py_module_name
: Name of the Pyleri Module. (default: 'pyleri')py_template
: Template String used for the export. You might want to look at the default string which can be found at Grammar.PY_TEMPLATE.py_indent
: indentation used in the Python file. (default: 4 spaces)
For example when using our Quick usage grammar, this is the output when running my_grammar.export_py()
:
"""
This grammar is generated using the Grammar.export_py() method and
should be used with the pyleri python module.
Source class: MyGrammar
Created at: 2017-03-14 19:14:51
"""
import re
from pyleri import Sequence
from pyleri import Keyword
from pyleri import Grammar
from pyleri import Regex
class MyGrammar(Grammar):
RE_KEYWORDS = re.compile('^\\w+')
r_name = Regex('^(?:"(?:[^"]*)")+')
k_hi = Keyword('hi')
START = Sequence(
k_hi,
r_name
)
Result
The result of the parse()
method contains 4 properties that will be explained next. A function as_str(translate=None)
is also available which will
show the result as a string. The translate
argument should be a function which accepts an element as argument. This function can be used to
return custom strings for certain elements. If the return value of translate
is None
then the function will fall try to generate a string value. If
the return value is an empty string, the value will be ignored.
Example of translate functions:
def translate(elem):
return ''
def translate(elem):
if elem is some_elem:
return 'A'
elif elem is other_elem:
return ''
else:
return None
print(my_grammar.parse('some string').as_str(translate=translate))
is_valid
is_valid
returns a boolean value, True
when the given string is valid according to the given grammar, False
when not valid.
Let us take the example from Quick usage.
res = my_grammar.parse('bye "Iris"')
print(res.is_valid)
Position
pos
returns the position where the parser had to stop. (when is_valid
is True
this value will be equal to the length of the given string with str.rstrip()
applied)
Let us take the example from Quick usage.
result = my_grammar.parse('hi Iris')
print(res.is_valid, result.pos)
Tree
tree
contains the parse tree. Even when is_valid
is False
the parse tree is returned but will only contain results as far as parsing has succeeded. The tree is the root node which can include several children
nodes. The structure will be further clarified in the following example which explains a way of visualizing the parse tree.
Example:
import json
from pyleri import Choice
from pyleri import Grammar
from pyleri import Keyword
from pyleri import Regex
from pyleri import Repeat
from pyleri import Sequence
class MyGrammar(Grammar):
r_name = Regex('(?:"(?:[^"]*)")+')
k_hi = Keyword('hi')
k_bye = Keyword('bye')
START = Repeat(Sequence(Choice(k_hi, k_bye), r_name))
def node_props(node, children):
return {
'start': node.start,
'end': node.end,
'name': node.element.name if hasattr(node.element, 'name') else None,
'element': node.element.__class__.__name__,
'string': node.string,
'children': children}
def get_children(children):
return [node_props(c, get_children(c.children)) for c in children]
def view_parse_tree(res):
start = res.tree.children[0] \
if res.tree.children else res.tree
return node_props(start, get_children(start.children))
if __name__ == '__main__':
my_grammar = MyGrammar()
res = my_grammar.parse('hi "pyleri" bye "pyleri"')
print(json.dumps(view_parse_tree(res), indent=2))
Part of the output is shown below.
{
"start": 0,
"end": 23,
"name": "START",
"element": "Repeat",
"string": "hi \"pyleri\" bye \"pyleri\"",
"children": [
{
"start": 0,
"end": 11,
"name": null,
"element": "Sequence",
"string": "hi \"pyleri\"",
"children": [
{
"start": 0,
"end": 2,
"name": null,
"element": "Choice",
"string": "hi",
"children": [
{
"start": 0,
"end": 2,
"name": "k_hi",
"element": "Keyword",
"string": "hi",
"children": []
}
]
},
{
"start": 3,
"end": 11,
"name": "r_name",
"element": "Regex",
"string": "\"pyleri\"",
"children": []
}
"..."
"..."
A node contains 5 properties that will be explained next:
start
property returns the start of the node object.end
property returns the end of the node object.element
returns the Element's type (e.g. Repeat, Sequence, Keyword, etc.). An element can be assigned to a variable; for instance in the example above Keyword('hi')
was assigned to k_hi
. With element.name
the assigned name k_hi
will be returned. Note that it is not a given that an element is named; in our example Sequence
was not assigned, thus in this case the element has no attribute name
.string
returns the string that is parsed.children
can return a node object containing deeper layered nodes provided that there are any. In our example the root node has an element type Repeat()
, starts at 0 and ends at 24, and it has two children
. These children are node objects that have both an element type Sequence
, start at 0 and 12 respectively, and so on.
Expecting
expecting
returns a Python set() containing elements which pyleri expects at pos
. Even if is_valid
is true there might be elements in this set, for example when an Optional()
element could be added to the string. "Expecting" is useful if you want to implement things like auto-completion, syntax error handling, auto-syntax-correction etc. The following example will illustrate a way of implementation.
Example:
import re
import random
from pyleri import Choice
from pyleri import Grammar
from pyleri import Keyword
from pyleri import Repeat
from pyleri import Sequence
from pyleri import end_of_statement
class MyGrammar(Grammar):
RE_KEYWORDS = re.compile(r'\S+')
r_name = Keyword('"pyleri"')
k_hi = Keyword('hi')
k_bye = Keyword('bye')
START = Repeat(Sequence(Choice(k_hi, k_bye), r_name), mi=2)
def print_expecting(node_expecting, string_expecting):
for loop, e in enumerate(node_expecting):
string_expecting = '{}\n\t({}) {}'.format(string_expecting, loop, e)
print(string_expecting)
def auto_correction(string, my_grammar):
node = my_grammar.parse(string)
print('\nParsed string: {}'.format(node.tree.string))
if node.is_valid:
string_expecting = 'String is valid. \nExpected: '
print_expecting(node.expecting, string_expecting)
else:
string_expecting = 'String is NOT valid.\nExpected: ' \
if not node.pos \
else 'String is NOT valid. \nAfter "{}" expected: '.format(
node.tree.string[:node.pos])
print_expecting(node.expecting, string_expecting)
selected = random.choice(list(node.expecting))
string = '{} {}'.format(node.tree.string[:node.pos],
selected
if selected
is not end_of_statement else '')
auto_correction(string, my_grammar)
if __name__ == '__main__':
my_grammar = MyGrammar()
string = 'hello "pyleri"'
auto_correction(string, my_grammar)
Output:
Parsed string: hello "pyleri"
String is NOT valid.
Expected:
(1) hi
(2) bye
Parsed string: bye
String is NOT valid.
After " bye" expected:
(1) "pyleri"
Parsed string: bye "pyleri"
String is NOT valid.
After " bye "pyleri"" expected:
(1) hi
(2) bye
Parsed string: bye "pyleri" hi
String is NOT valid.
After " bye "pyleri" hi" expected:
(1) "pyleri"
Parsed string: bye "pyleri" hi "pyleri"
String is valid.
Expected:
(1) hi
(2) bye
In the above example we parsed an invalid string according to the grammar class. The auto-correction()
method that we built for this example combines all properties from the parse()
to create a valid string. The output shows every recursion of the auto-correction()
method and prints successively the set of expected elements. It takes one randomly and adds it to the string. When the string corresponds to the grammar, the property is_valid
will return True
. Notably the expecting
property still contains elements even if the is_valid
returned True
. The reason in this example is due to the Repeat element.
Elements
Pyleri has several elements which are all subclasses of Element and can be used to create a grammar.
Keyword
syntax:
Keyword(keyword, ign_case=False)
The parser needs to match the keyword which is just a string. When matching keywords we need to tell the parser what characters are allowed in keywords. By default Pyleri uses ^\w+
which is both in Python and JavaScript equal to ^[A-Za-z0-9_]+
. We can overwrite the default by setting RE_KEYWORDS
in the grammar. Keyword() accepts one keyword argument ign_case
to tell the parser if we should match case insensitive.
Example:
class TicTacToe(Grammar):
RE_KEYWORDS = re.compile('^[A-Za-z-]+')
START = Keyword('tic-tac-toe', ign_case=True)
ttt_grammar = TicTacToe()
ttt_grammar.parse('Tic-Tac-Toe').is_valid
Regex
syntax:
Regex(pattern, flags=0)
The parser compiles a regular expression using the re
module. The current version of pyleri has only support for the re.IGNORECASE
flag.
See the Quick usage example for how to use Regex
.
Token
syntax:
Token(token)
A token can be one or more characters and is usually used to match operators like +
, -
, //
and so on. When we parse a string object where pyleri expects an element, it will automatically be converted to a Token()
object.
Example:
class Ni(Grammar):
t_dash = Token('-')
START = List(Keyword('ni'), delimiter=t_dash)
ni = Ni()
ni.parse('ni-ni-ni-ni-ni').is_valid
Tokens
syntax:
Tokens(tokens)
Can be used to register multiple tokens at once. The tokens
argument should be a string with tokens separated by spaces. If given tokens are different in size the parser will try to match the longest tokens first.
Example:
class Ni(Grammar):
tks = Tokens('+ - !=')
START = List(Keyword('ni'), delimiter=tks)
ni = Ni()
ni.parse('ni + ni != ni - ni').is_valid
Sequence
syntax:
Sequence(element, element, ...)
The parser needs to match each element in a sequence.
Example:
class TicTacToe(Grammar):
START = Sequence(Keyword('Tic'), Keyword('Tac'), Keyword('Toe'))
ttt_grammar = TicTacToe()
ttt_grammar.parse('Tic Tac Toe').is_valid
Choice
syntax:
Choice(element, element, ..., most_greedy=True)
The parser needs to choose between one of the given elements. Choice accepts one keyword argument most_greedy
which is True
by default. When most_greedy
is set to False
the parser will stop at the first match. When True
the parser will try each element and returns the longest match. Setting most_greedy
to False
can provide some extra performance. Note that the parser will try to match each element in the exact same order they are parsed to Choice.
Example: let us use Choice
to modify the Quick usage example to allow the string 'bye "Iris"'
class MyGrammar(Grammar):
r_name = Regex('(?:"(?:[^"]*)")+')
k_hi = Keyword('hi')
k_bye = Keyword('bye')
START = Sequence(Choice(k_hi, k_bye), r_name)
my_grammar = MyGrammar()
my_grammar.parse('hi "Iris"').is_valid
my_grammar.parse('bye "Iris"').is_valid
Repeat
syntax:
Repeat(element, mi=0, ma=None)
The parser needs at least mi
elements and at most ma
elements. When ma
is set to None
we allow unlimited number of elements. mi
can be any integer value equal or higher than 0 but not larger then ma
.
Example:
class Ni(Grammar):
START = Repeat(Keyword('ni'))
ni = Ni()
ni.parse('ni ni ni ni ni').is_valid
It is not allowed to bind a name to the same element twice and Repeat(element, 1, 1) is a common solution to bind the element a second (or more) time(s).
For example consider the following:
class MyGrammar(Grammar):
r_name = Regex('(?:"(?:[^"]*)")+')
r_address = r_name
r_address = Repeat(r_name, 1, 1)
List
syntax:
List(element, delimiter=',', mi=0, ma=None, opt=False)
List is like Repeat but with a delimiter. A comma is used as default delimiter but any element is allowed. When a string is used as delimiter it will be converted to a Token
element. mi
and ma
work exactly like with Repeat. An optional keyword argument opt
can be set to True
to allow the list to end with a delimiter. By default this is set to False
which means the list has to end with an element.
Example:
class Ni(Grammar):
START = List(Keyword('ni'))
ni = Ni()
ni.parse('ni, ni, ni, ni, ni').is_valid
Optional
syntax:
Optional(element)
The parser looks for an optional element. It is like using Repeat(element, 0, 1)
but we encourage to use Optional
since it is more readable. (and slightly faster)
Example:
class MyGrammar(Grammar):
r_name = Regex('(?:"(?:[^"]*)")+')
k_hi = Keyword('hi')
START = Sequence(k_hi, Optional(r_name))
my_grammar = MyGrammar()
my_grammar.parse('hi "Iris"').is_valid
my_grammar.parse('hi').is_valid
Ref
syntax:
Ref()
The grammar can make a forward reference to make recursion possible. In the example below we create a forward reference to START but note that
a reference to any element can be made.
Warning: A reference is not protected against testing the same position in
a string. This could potentially lead to an infinite loop.
For example:
r = Ref()
r = Optional(r)
Use Prio if such recursive construction is required.
Example:
class NestedNi(Grammar):
START = Ref()
ni_item = Choice(Keyword('ni'), START)
START = Sequence('[', List(ni_item), ']')
nested_ni = NestedNi()
nested_ni.parse('[ni, ni, [ni, [], [ni, ni]]]').is_valid
Prio
syntax:
Prio(element, element, ...)
Choose the first match from the prio elements and allow THIS
for recursive operations. With THIS
we point to the Prio
element. Probably the example below explains how Prio
and THIS
can be used.
Note: Use a Ref when possible.
A Prio
element is required when the same position in a string is potentially
checked more than once.
Example:
class Ni(Grammar):
k_ni = Keyword('ni')
START = Prio(
k_ni,
Sequence('(', THIS, ')'),
Sequence(THIS, Keyword('or'), THIS),
Sequence(THIS, Keyword('and'), THIS))
ni = Ni()
ni.parse('(ni or ni) and (ni or ni)').is_valid