Comparing version 2.10.2 to 2.10.3
{ | ||
"name": "nearley", | ||
"version": "2.10.2", | ||
"version": "2.10.3", | ||
"description": "Simple, fast, powerful parser toolkit for JavaScript.", | ||
@@ -27,5 +27,6 @@ "main": "lib/nearley.js", | ||
"scripts": { | ||
"bootstrap": "node bin/nearleyc.js lib/nearley-language-bootstrapped.ne >lib/nearley-language-bootstrapped.js.new && mv lib/nearley-language-bootstrapped.js.new lib/nearley-language-bootstrapped.js", | ||
"bootstrap": "bin/nearleyc.js lib/nearley-language-bootstrapped.ne > tmp && mv tmp lib/nearley-language-bootstrapped.js", | ||
"test": "mocha test/launch.js", | ||
"benchmark": "node test/benchmark.js" | ||
"benchmark": "node test/benchmark.js", | ||
"doctoc": "doctoc --notitle README.md" | ||
}, | ||
@@ -45,4 +46,5 @@ "author": "Hardmath123", | ||
"coffee-script": "^1.10.0", | ||
"doctoc": "^1.3.0", | ||
"microtime": "^2.1.2", | ||
"mocha": "^2.3.4", | ||
"mocha": "^2.5.3", | ||
"moo": "^0.3.2", | ||
@@ -49,0 +51,0 @@ "typescript": "^2.3.4" |
README.md
![](www/logo/nearley-purple.png) | ||
[nearley](http://nearley.js.org) [![JS.ORG](https://img.shields.io/badge/js.org-nearley-ffb400.svg?style=flat-square)](http://js.org) | ||
============== | ||
# [nearley](http://nearley.js.org) | ||
[![JS.ORG](https://img.shields.io/badge/js.org-nearley-ffb400.svg?style=flat-square)](http://js.org) | ||
[![npm version](https://badge.fury.io/js/nearley.svg)](https://badge.fury.io/js/nearley) | ||
Simple parsing for node.js. | ||
nearley is a simple, fast and powerful parsing toolkit based on the [Earley | ||
algorithm](https://en.wikipedia.org/wiki/Earley_parser). | ||
For this reason, nearley can often parse what other JavaScript parsers simply | ||
cannot! nearley can handle *any* grammar you can define in BNF... and also some | ||
grammars that you *cannot* define in BNF. Indeed, while most existing JS | ||
parsers such as PEGjs and Jison choke on certain grammars (in particular [left | ||
recursive ones](http://en.wikipedia.org/wiki/Left_recursion)), nearley handles | ||
them easily and efficiently. | ||
nearley also has capabilities to catch errors gracefully, and deal with | ||
ambiguous grammars (for strings that can be parsed in multiple ways). It comes | ||
with fantastic tooling, and works in both node and the browser. | ||
nearley is used by a wide variety of projects: | ||
- [artificial | ||
intelligence](https://github.com/ChalmersGU-AI-course/shrdlite-course-project) | ||
and | ||
- [computational | ||
linguistics](https://wiki.eecs.yorku.ca/course_archive/2014-15/W/6339/useful_handouts) | ||
classes at universities; | ||
- [file format parsers](https://github.com/raymond-h/node-dmi); | ||
- [data-driven markup languages](https://github.com/idyll-lang/idyll-compiler); | ||
- [compilers for real programming languages](https://github.com/sizigi/lp5562); | ||
- and nearley itself! The nearley compiler is written in *itself*. | ||
nearley is an npm [staff | ||
pick](https://www.npmjs.com/package/npm-collection-staff-picks). | ||
## Contents | ||
<!-- $ npm install -g doctoc --> | ||
@@ -14,82 +45,59 @@ <!-- $ doctoc --notitle README.md --> | ||
- [What is nearley?](#what-is-nearley) | ||
- [Why do I care?](#why-do-i-care) | ||
- [Installation and Usage](#installation-and-usage) | ||
- [Parser specification](#parser-specification) | ||
- [Installation](#installation) | ||
- [Getting started: nearley in 3 steps](#getting-started-nearley-in-3-steps) | ||
- [Writing a parser](#writing-a-parser) | ||
- [Terminals, nonterminals, rules](#terminals-nonterminals-rules) | ||
- [Postprocessors](#postprocessors) | ||
- [Epsilon rules](#epsilon-rules) | ||
- [Charsets](#charsets) | ||
- [Case-insensitive String Literals](#case-insensitive-string-literals) | ||
- [EBNF](#ebnf) | ||
- [Target languages](#target-languages) | ||
- [Catching errors](#catching-errors) | ||
- [More syntax: tips and tricks](#more-syntax-tips-and-tricks) | ||
- [Comments](#comments) | ||
- [Charsets](#charsets) | ||
- [Case-insensitive string literals](#case-insensitive-string-literals) | ||
- [EBNF](#ebnf) | ||
- [Macros](#macros) | ||
- [Additional JS](#additional-js) | ||
- [Importing](#importing) | ||
- [Custom lexers](#custom-lexers) | ||
- [Custom tokens](#custom-tokens) | ||
- [Using a parser](#using-a-parser) | ||
- [Catching errors](#catching-errors) | ||
- [Exploring a parser interactively](#exploring-a-parser-interactively) | ||
- [The Unparser](#the-unparser) | ||
- [Automagical Railroad Diagrams](#automagical-railroad-diagrams) | ||
- [Other Tools](#other-tools) | ||
- [Importing other grammars](#importing-other-grammars) | ||
- [Tokenizers](#tokenizers) | ||
- [Tools](#tools) | ||
- [nearley-test: Exploring a parser interactively](#nearley-test-exploring-a-parser-interactively) | ||
- [nearley-unparse: The Unparser](#nearley-unparse-the-unparser) | ||
- [nearley-railroad: Automagical Railroad Diagrams](#nearley-railroad-automagical-railroad-diagrams) | ||
- [Other Tools](#other-tools) | ||
- [Still confused?](#still-confused) | ||
- [Contributing](#contributing) | ||
- [Further reading](#further-reading) | ||
- [Documentation](#documentation) | ||
- [Recipes](#recipes) | ||
- [Details](#details) | ||
<!-- END doctoc generated TOC please keep comment here to allow auto update --> | ||
What is nearley? | ||
---------------- | ||
nearley uses the Earley parsing algorithm augmented with Joop Leo's | ||
optimizations to parse complex data structures easily. nearley is über-fast and | ||
really powerful. It can parse literally anything you throw at it. | ||
nearley is used by [artificial | ||
intelligence](https://github.com/ChalmersGU-AI-course/shrdlite-course-project) | ||
and [computational | ||
linguistics](https://wiki.eecs.yorku.ca/course_archive/2014-15/W/6339/useful_handouts) | ||
classes at universities, as well as [file format | ||
parsers](https://github.com/raymond-h/node-dmi), [markup | ||
languages](https://github.com/bobbybee/uPresent) and [complete programming | ||
languages](https://github.com/bobbybee/carbon). It's an npm [staff | ||
pick](https://www.npmjs.com/package/npm-collection-staff-picks). | ||
## Installation | ||
Why do I care? | ||
-------------- | ||
The nearley *compiler* converts grammar definitions from a simple | ||
[BNF](https://en.wikipedia.org/wiki/Backus–Naur_form)-based syntax to a small | ||
JS module. You can use that module to construct a nearley *parser*, which | ||
parses input strings. | ||
nearley can parse what other JS parse engines cannot, because it uses a | ||
different algorithm. The Earley algorithm is *general*, which means it can | ||
handle *any* grammar you can define in BNF. In fact, the nearley syntax is | ||
written in *itself* (this is called bootstrapping). | ||
Both components are published as a single | ||
[NPM](https://docs.npmjs.com/getting-started/what-is-npm) package compatible | ||
with [Node.js](https://nodejs.org/en/) and most browsers. | ||
PEGjs and Jison are recursive-descent based, and so they will choke on a lot of | ||
grammars, in particular [left recursive | ||
ones](http://en.wikipedia.org/wiki/Left_recursion). | ||
To use the nearley *parser*, you need to install nearley **locally**. | ||
nearley also has capabilities to catch errors gracefully, and detect ambiguous | ||
grammars (grammars that can be parsed in multiple ways). | ||
```bash | ||
$ npm install --save nearley | ||
``` | ||
Installation and Usage | ||
---------------------- | ||
To use the nearley *compiler*, you need to *additionally* install nearley | ||
**globally**. | ||
> **Note:** For beginners, Guillermo Webster's | ||
> [nearley-playground](https://omrelli.ug/nearley-playground/) is a wonderful | ||
> way to explore nearley interactively in your browser: | ||
> | ||
> ![A screenshot of the playground](www/playground.png) | ||
> | ||
> Enjoy! | ||
```bash | ||
$ npm install -g nearley | ||
``` | ||
This actually adds several new commands to your `$PATH`: | ||
To use nearley, you need both a *global* and a *local* installation. The two | ||
types of installations are described separately below. | ||
--- | ||
To *compile* a nearley parser (a `.ne` file), you need to install the | ||
`nearleyc` command from npm: | ||
$ npm install -g nearley | ||
$ nearleyc parser.ne | ||
nearley ships with three additional tools: | ||
- `nearleyc` compiles grammar files to JavaScript. | ||
- `nearley-test` lets you quickly test a grammar against some input and see the | ||
@@ -103,69 +111,110 @@ results. It also lets you explore the internal state of nearley's Earley | ||
You can uninstall the nearley compiler using `npm uninstall -g nearley`. | ||
These are documented below. | ||
--- | ||
> NOTE: If you're not ready to install nearley yet, you can follow along in | ||
> your browser using the [nearley | ||
> playground](https://omrelli.ug/nearley-playground/), an online interface for | ||
> exploring nearley grammars interactively. | ||
To *use* a generated grammar, you need to install `nearley` as a per-project | ||
dependency via npm (note that there is no `-g` in the first command): | ||
## Getting started: nearley in 3 steps | ||
nearley was written with users in mind: getting started with nearley is as | ||
simple as: | ||
**Step 1: Describe your grammar** using the nearley syntax. In a file called | ||
`grammar.ne`, write: | ||
```js | ||
main -> (statement "\n"):+ | ||
statement -> "foo" | "bar" | ||
``` | ||
$ npm install nearley | ||
$ node | ||
> var nearley = require("nearley"); | ||
> var grammar = require("./my-generated-grammar.js"); | ||
**Step 2: Compile** the grammar to a JavaScript module. On the command line, | ||
run: | ||
```bash | ||
$ nearleyc grammar.ne -o grammar.js | ||
``` | ||
Alternatively, to use a generated grammar in a browser runtime, include the | ||
`nearley.js` file in a `<script>` tag. | ||
**Step 3: Parse** some data! In a new JavaScript file, write: | ||
```html | ||
<script src="nearley.js"></script> | ||
<script src="my-generated-grammar.js"></script> | ||
```js | ||
const nearley = require("nearley"); | ||
const grammar = require("./grammar.js"); | ||
// Create a Parser object from our grammar. | ||
const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar)); | ||
// Parse something! | ||
parser.feed("foo\n"); | ||
// parser.results is an array of possible parsings. | ||
console.log(parser.results); // [[[[ "foo" ],"\n" ]]] | ||
``` | ||
## Writing a parser | ||
Parser specification | ||
-------------------- | ||
Let's explore the building blocks of a nearley parser. | ||
This is a basic overview of nearley syntax and usage. For an advanced | ||
styleguide, see [this file](how-to-grammar-good.md). | ||
### Terminals, nonterminals, rules | ||
A parser consists of several *nonterminals*, which are constructions in a | ||
language. A nonterminal is made up of a series of either other nonterminals or | ||
strings. In nearley, you define a nonterminal by giving its name and its | ||
expansions. | ||
- A *terminal* is a string or a token. For example, the keyword `"if"` is a | ||
terminal. | ||
- A *nonterminal* is a combination of terminals and other nonterminals. For | ||
example, an if statement defined as `"if" condition statement` is a | ||
nonterminal. | ||
- A *rule* (or production rule) is a definition of a nonterminal. For example, | ||
`"if" condition "then" statement "endif"` is the rule according to which the | ||
if statement nonterminal is parsed. | ||
Strings are *terminals*, which match that exact text (specified as | ||
JSON-compatible strings). | ||
The first nonterminal of the grammar is the one the whole input must match. | ||
With the following grammar, nearley will try to parse text as `expression`. | ||
The following grammar matches a number, a plus sign, and another number: | ||
```js | ||
expression -> number "+" number | ||
number -> [0-9]:+ | ||
``` | ||
expression -> number "+" number | ||
Anything from a `#` to the end of a line is ignored as a comment: | ||
Use the pipe character `|` to separate alternative rules for a nonterminal. | ||
expression -> number "+" number # sum of two numbers | ||
```js | ||
expression -> | ||
number "+" number | ||
| number "-" number | ||
| number "*" number | ||
| number "/" number | ||
``` | ||
A nonterminal can have multiple expansions, separated by vertical bars (`|`): | ||
The keyword `null` stands for the **epsilon rule**, which matches nothing. The | ||
following nonterminal matches zero or more `cow`s in a row, such as | ||
`cowcowcow`: | ||
expression -> | ||
number "+" number | ||
| number "-" number | ||
| number "*" number | ||
| number "/" number | ||
```js | ||
a -> null | ||
| a "cow" | ||
``` | ||
The parser tries to parse the first nonterminal that you define in a file. | ||
However, you can (and should!) introduce more nonterminals as "helpers". In | ||
this example, we would have to define the expansion of `number`. | ||
Keep in mind that nearley syntax is not sensitive to formatting. You're welcome | ||
to keep rules on the same line: `foo -> bar | qux`. | ||
### Postprocessors | ||
Each meaning (called a *production rule*) can have a postprocessing function | ||
that formats the data the way you would like: | ||
By default, nearley wraps everything matched by a rule into an array. For | ||
example, when `rule -> "foo" "bar"` matches, it creates the "parse tree" | ||
`["foo", "bar"]`. Most of the time, however, you need to process that data in | ||
some way: for example, you may want to filter out whitespace, or transform the | ||
results into a custom JavaScript object. | ||
For this purpose, each rule can have a *postprocessor*: a JavaScript function | ||
that transforms the array and returns a "processed" version of the result. | ||
Postprocessors are wrapped in `{% %}`: | ||
```js | ||
expression -> number "+" number {% | ||
function (data, location, reject) { | ||
return ["sum", data[0], data[2]]; | ||
function(data, location, reject) { | ||
return { | ||
operator: "sum", | ||
leftOperand: data[0], | ||
rightOperand: data[2] // data[1] is "+" | ||
}; | ||
} | ||
@@ -175,50 +224,118 @@ %} | ||
`data` is an array whose elements match the nonterminals in order. The | ||
postprocessor `id` returns the first token in the match (literally | ||
`function(data) {return data[0];}`). | ||
The rule above will parse the string `5+10` into `{ operator: "sum", | ||
leftOperand: "5", rightOperand: "10" }`. | ||
`location` is the index at which that rule was found. Retaining this | ||
information in a syntax tree is useful if you're writing an interpreter and | ||
want to give fancy error messages for runtime errors. | ||
The postprocessor can be any function. It will be passed three arguments: | ||
If, after examining the data, you want to force the rule to fail anyway, return | ||
`reject`. An example of this is allowing a variable name to be a word that is | ||
not a string: | ||
- `data: Array` - an array that contains the results of parsing each part of | ||
the rule. Note that it is still an array, even if the rule only has one part! | ||
You can use the built-in `{% id %}` postprocessor to convert a one-item array | ||
into the item itself. | ||
- `location: number` - the index (zero-based) at which the rule match starts. | ||
This is useful, for example, to construct an error message that tells you where | ||
in the source the error occurred. | ||
- `reject: Object` - return this object to signal that this rule doesn't | ||
*actually* match. This is necessary in certain edge-conditions. For example, | ||
suppose you want sequences of letters to match variables, except for the | ||
keyword `var`. In this case, your rule may be | ||
```js | ||
word -> [a-z]:+ {% | ||
function(d,l, reject) { | ||
if (d[0] == 'var') { | ||
return reject; | ||
} else { | ||
return {'var': d[0]}; | ||
} | ||
} | ||
%} | ||
``` | ||
Please note that grammars using `reject` are not context-free, and are often | ||
much slower to parse. Use it wisely! You can usually avoid the need for | ||
`reject` by using a [tokenizer](#tokenizers). | ||
Remember that a postprocessor is scoped to a single rule, not the whole | ||
nonterminal. If a nonterminal has multiple alternative rules, each of them can | ||
have its own postprocessor. | ||
For arrow-function users, a convenient pattern is to destructure the `data` | ||
array in the argument of the arrow function: | ||
```js | ||
variable -> word {% | ||
function(data, location, reject) { | ||
if (KEYWORDS.indexOf(data[0]) === -1) { | ||
return data[0]; // It's a valid name | ||
} else { | ||
return reject; // It's a keyword, so reject it | ||
} | ||
} | ||
%} | ||
expression -> | ||
number "+" number {% ([first, _, second]) => first + second %} | ||
| number "-" number {% ([first, _, second]) => first - second %} | ||
| number "*" number {% ([first, _, second]) => first * second %} | ||
| number "/" number {% ([first, _, second]) => first / second %} | ||
``` | ||
You can write your postprocessors in CoffeeScript by adding `@preprocessor | ||
coffee` to the top of your file. Similarly, you can write them in TypeScript by | ||
adding `@preprocessor typescript` to the top of your file. If you would like to | ||
support a different postprocessor language, feel free to file a PR! | ||
There are two built-in postprocessors for the most common scenarios: | ||
### Epsilon rules | ||
- `id` - returns the first element of the `data` array. This is useful to | ||
extract the content of a single-element array: `foo -> bar {% id %}` | ||
The **epsilon rule** is the empty rule that matches nothing. The constant | ||
`null` is the epsilon rule, so: | ||
- `nuller` - returns null. This is useful for whitespace rules: `space -> " " | ||
{% nuller %}` | ||
a -> null | ||
| a "cow" | ||
#### Target languages | ||
will match 0 or more `cow`s in a row. | ||
By default, `nearleyc` compiles your grammar to JavaScript. You can also choose | ||
CoffeeScript or TypeScript by adding `@preprocessor coffee` or `@preprocessor | ||
typescript` at the top of your grammar file. This can be useful to write your | ||
postprocessors in a different language, and to get type annotations if you wish | ||
to use nearley in a statically typed dialect of JavaScript. | ||
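As a sketch, switching the target language is a one-line directive at the top of the grammar file (the TypeScript postprocessor body below is a hypothetical illustration):

```ini
@preprocessor typescript

# Postprocessors may now use TypeScript syntax, including type annotations.
main -> [0-9]:+ {% (d): number => parseInt(d[0].join("")) %}
```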
### Charsets | ||
### Catching errors | ||
You can use valid RegExp charsets in a rule: | ||
nearley is a *streaming* parser: you can keep feeding it more strings. This | ||
means that there are two error scenarios in nearley. | ||
Consider the simple parser below for the examples to follow. | ||
```js | ||
main -> "Cow goes moo." {% function(d) {return "yay!"; } %} | ||
``` | ||
If there are no possible parsings given the current input, but in the *future* | ||
there *might* be results if you feed it more strings, then nearley will | ||
temporarily set the `results` array to the empty array, `[]`. | ||
```js | ||
parser.feed("Cow "); // parser.results is [] | ||
parser.feed("goes "); // parser.results is [] | ||
parser.feed("moo."); // parser.results is ["yay!"] | ||
``` | ||
If there are no possible parsings, and there is no way to "recover" by feeding | ||
more data, then nearley will throw an error whose `offset` property is the | ||
index of the offending token. | ||
```js | ||
try { | ||
parser.feed("Cow goes% moo."); | ||
} catch(parseError) { | ||
console.log("Error at character " + parseError.offset); // "Error at character 9" | ||
} | ||
``` | ||
### More syntax: tips and tricks | ||
#### Comments | ||
Comments are marked with '#'. Everything from `#` to the end of a line is | ||
ignored: | ||
```ini | ||
expression -> number "+" number # sum of two numbers | ||
``` | ||
#### Charsets | ||
You can use valid RegExp charsets in a rule (unless you're using a | ||
[tokenizer](#tokenizers)): | ||
not_a_letter -> [^a-zA-Z] | ||
The `.` character can be used to represent "any character". | ||
The `.` character can be used to represent any character. | ||
### Case-insensitive String Literals | ||
#### Case-insensitive string literals | ||
@@ -228,3 +345,3 @@ You can create case-insensitive string literals by adding an `i` after the | ||
cow -> "cow"i # matches CoW, COW, etc. | ||
cow -> "cow"i # matches CoW, COW, and so on. | ||
@@ -235,8 +352,10 @@ Note that if you are using a lexer, your lexer should use the `i` flag in its | ||
### EBNF | ||
#### EBNF | ||
nearley compiles some higher-level constructs into BNF for you. In particular, | ||
the `*`, `?`, and `+` operators from Regexes (or EBNF) are available as shown: | ||
nearley supports the `*`, `?`, and `+` operators from | ||
[EBNF](https://en.wikipedia.org/wiki/Extended_Backus–Naur_form) as shown: | ||
batman -> "na":* "batman" # nananana...nanabatman | ||
```ini | ||
batman -> "na":* "batman" # nananana...nanabatman | ||
``` | ||
@@ -246,182 +365,148 @@ You can also use capture groups with parentheses. Its contents can be anything | ||
banana -> "ba" ("na" {% id %} | "NA" {% id %}):+ | ||
```js | ||
banana -> "ba" ("na" {% id %} | "NA" {% id %}):+ | ||
``` | ||
### Macros | ||
You can create "polymorphic" rules through macros: | ||
Macros allow you to create polymorphic rules: | ||
match3[X] -> $X $X $X | ||
quote[X] -> "'" $X "'" | ||
```ini | ||
# Matches "'Hello?' 'Hello?' 'Hello?'" | ||
main -> matchThree[inQuotes["Hello?"]] | ||
main -> match3[quote["Hello?"]] | ||
# matches "'Hello?''Hello?''Hello?'" | ||
matchThree[X] -> $X " " $X " " $X | ||
Macros are dynamically scoped: | ||
inQuotes[X] -> "'" $X "'" | ||
``` | ||
foo[X, Y] -> bar[("moo" | "oink" | "baa")] $Y | ||
bar[Z] -> $X " " $Z # 'remembers' $X from its caller | ||
main -> foo["Cows", "."] | ||
# matches "Cows oink." and "Cows moo." | ||
Macros are dynamically scoped, which means they see arguments passed to parent | ||
macros: | ||
Macros *cannot* be recursive (`nearleyc` will go into an infinite loop trying | ||
to expand the macro-loop). | ||
```ini | ||
# Matches "Cows oink." and "Cows moo!" | ||
main -> sentence["Cows", ("." | "!")] | ||
sentence[ANIMAL, PUNCTUATION] -> animalGoes[("moo" | "oink" | "baa")] $PUNCTUATION | ||
animalGoes[SOUND] -> $ANIMAL " " $SOUND # uses $ANIMAL from its caller | ||
``` | ||
Macros are expanded at compile time and inserted in places they are used. They | ||
are not "real" rules. Therefore, macros *cannot* be recursive (`nearleyc` will | ||
go into an infinite loop trying to expand the macro-loop). | ||
### Additional JS | ||
For more intricate postprocessors, or any other functionality you may need, you | ||
can include parts of literal JavaScript between production rules by surrounding | ||
can include chunks of JavaScript code between production rules by surrounding | ||
it with `@{% ... %}`: | ||
```js | ||
@{% var makeCowWithString = require('./cow.js') %} | ||
cow -> "moo" {% function(d) {makeCowWithString(d[0]); } %} | ||
@{% | ||
const cowSays = require("./cow.js"); | ||
%} | ||
cow -> "moo" {% ([moo]) => cowSays(moo) %} | ||
``` | ||
Note that it doesn't matter where you define these; they all get hoisted to the | ||
Note that it doesn't matter where you add these; they all get hoisted to the | ||
top of the generated code. | ||
### Importing | ||
### Importing other grammars | ||
You can include the content of other parser files: | ||
You can include the content of other grammar files: | ||
@include "../misc/primitives.ne" # path relative to file being compiled | ||
sum -> number "+" number | ||
```ini | ||
@include "../misc/primitives.ne" # path relative to file being compiled | ||
sum -> number "+" number # uses "number" from the included file | ||
``` | ||
There are also some built-in parsers whose contents you can include: | ||
There are several builtin helper files that you can include: | ||
@builtin "cow.ne" | ||
main -> cow:+ | ||
```ini | ||
@builtin "cow.ne" | ||
main -> cow:+ | ||
``` | ||
See the `builtin/` directory for an index of this library. Contributions are | ||
See the [`builtin/`](builtin) directory for more details. Contributions are | ||
welcome here! | ||
Including a parser imports *all* of the nonterminals defined in the parser, as | ||
well as any JS, macros, and config options defined there. | ||
Including a file imports *all* of the nonterminals defined in it, as well as | ||
any JS, macros, and config options defined there. | ||
## Tokenizers | ||
### Custom lexers | ||
By default, nearley splits the input into a stream of characters. This is | ||
called *scannerless* parsing. | ||
You can pass a `lexer` instance to Parser, which must have the following interface: | ||
A tokenizer splits the input into a stream of larger units called *tokens*. | ||
This happens in a separate stage before parsing. For example, a tokenizer might | ||
convert `512 + 10` into `["512", "+", "10"]`: notice how it removed the | ||
whitespace, and combined multi-digit numbers into a single number. | ||
* `reset(chunk, info)`: set the internal buffer to `chunk`, and restore the line/col/state info produced by `save()`. | ||
* `next() -> Token`: return the next token, e.g. `{type, value, line, col, …}`. Only the `value` attribute is required. | ||
* `save() -> Info`: return an object describing the current line/col, etc. This lets nearley preserve that information between `feed()` calls, and also supports `Parser#rewind()`. The exact structure is lexer-specific; nearley doesn't care what's in it. | ||
* `formatError(token)`: return a string with an error message describing the line/col of the offending token. You might like to include a preview of the line in question. | ||
* `has(tokenType)`: return true if the lexer can emit tokens with that name. Used to resolve `%`-specifiers in compiled nearley grammars. | ||
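As a rough sketch of that interface, here is a toy character-level lexer (a hypothetical example written for illustration; it is not moo, and a real lexer would emit larger tokens):

```javascript
// Toy lexer implementing the interface nearley expects.
// Hypothetical illustration only -- not moo, and not part of nearley itself.
class CharLexer {
  reset(chunk, info) {
    // Set the internal buffer; restore line/col state from a prior save().
    this.buffer = chunk;
    this.index = 0;
    this.line = info ? info.line : 1;
    this.col = info ? info.col : 1;
  }
  next() {
    // Return the next token, or undefined at the end of the buffer.
    // Only `value` is required; line/col help with error reporting.
    if (this.index >= this.buffer.length) return undefined;
    const value = this.buffer[this.index++];
    const token = { type: "char", value, line: this.line, col: this.col };
    if (value === "\n") { this.line++; this.col = 1; } else { this.col++; }
    return token;
  }
  save() {
    // Snapshot position state so it survives between feed() calls.
    return { line: this.line, col: this.col };
  }
  formatError(token) {
    return `invalid token ${JSON.stringify(token.value)} at line ${token.line} col ${token.col}`;
  }
  has(tokenType) {
    // This lexer only ever emits "char" tokens.
    return tokenType === "char";
  }
}

const lexer = new CharLexer();
lexer.reset("ab\nc");
const tokens = [];
let token;
while ((token = lexer.next()) !== undefined) tokens.push(token);
console.log(tokens.length + " tokens; last at line " + tokens[tokens.length - 1].line);
// → 4 tokens; last at line 2
```

A real lexer would typically batch characters into numbers, identifiers, and keywords; moo does exactly that, and is the recommended choice.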
Using a tokenizer has many benefits. It... | ||
If Parser isn't given a lexer option, it will look for a `.lexer` attribute on its Grammar. The `@lexer` directive allows exporting a lexer object from your `.ne` grammar file. (See `json.ne` for an example.) | ||
- ...often makes your parser faster by more than an order of magnitude. | ||
- ...allows you to write cleaner, more maintainable grammars. | ||
- ...helps avoid ambiguous grammars in some cases. For example, a tokenizer can | ||
easily tell you that `superclass` is a single keyword, not a sequence of | ||
`super` and `class` keywords. | ||
- ...gives you *lexical information* such as line numbers for each token. This | ||
lets you make better error messages. | ||
nearley supports and recommends [Moo](https://github.com/tjvr/moo), a | ||
super-fast tokenizer. Here is a simple example: | ||
### Custom tokens | ||
```js | ||
@{% | ||
const moo = require("moo"); | ||
Nearley assumes by default that your fundamental unit of parsing, called a | ||
*token*, is a character. That is, you're parsing a list of characters. However, | ||
sometimes you want to preprocess your string to turn it into a list of *lexical | ||
tokens*. This means that instead of seeing "1", "2", "3", nearley might just | ||
see a single list item "123". This is called *tokenizing*, and it can bring you | ||
decent performance gains. It also allows you to write cleaner, more | ||
maintainable grammars and to prevent ambiguous grammars. | ||
Tokens can be defined in two ways: literal tokens and testable tokens. A | ||
literal token matches exactly, while a testable token runs a function to test | ||
whether it is a match or not. | ||
``` | ||
@{% | ||
var print_tok = {literal: "print"}; | ||
var number_tok = {test: function(x) {return x.constructor === Number; }} | ||
const lexer = moo.compile({ | ||
ws: /[ \t]+/, | ||
number: /[0-9]+/, | ||
times: /\*|x/ | ||
}); | ||
%} | ||
main -> %print_tok %number_tok | ||
``` | ||
# Pass your lexer object using the @lexer option: | ||
@lexer lexer | ||
Now, instead of parsing the string `"print 12"`, you would parse the array | ||
`["print", 12]`. | ||
You can write your own tokenizer using regular expressions, or choose from | ||
several existing tokenizing libraries for node. | ||
Using a parser | ||
-------------- | ||
nearley exposes the following API: | ||
```js | ||
var grammar = require("generated-code.js"); | ||
var nearley = require("nearley"); | ||
// Create a Parser object from our grammar. | ||
var p = new nearley.Parser(grammar.ParserRules, grammar.ParserStart); | ||
// Parse something | ||
p.feed("1+1"); | ||
// p.results --> [ ["sum", "1", "1"] ] | ||
# Use %token to match any token of that type instead of "token": | ||
multiplication -> %number %ws %times %ws %number {% ([first, , , , second]) => first * second %} | ||
``` | ||
The `Parser` object can be fed data in parts with `.feed(data)`. You can then | ||
find an array of parsings with the `.results` property. If `results` is empty, | ||
then there are no parsings. If `results` contains multiple values, then that | ||
combination is ambiguous. | ||
Have a look at [the Moo documentation](https://github.com/tjvr/moo#usage) to | ||
learn more about the tokenizer. | ||
The incremental feeding design lets you parse data from stream-like inputs, | ||
or even interactive readline inputs. For example, you can create a | ||
Python-style REPL that keeps prompting until you have entered a | ||
complete block. | ||
Note that when using a tokenizer, raw strings match full tokens parsed by Moo. | ||
This is convenient for matching keywords. | ||
```js | ||
p.feed(prompt_user(">>> ")); | ||
while (p.results.length < 1) { | ||
p.feed(prompt_user("... ")); | ||
} | ||
console.log(p.results); | ||
```ini | ||
ifStatement -> "if" condition "then" block | ||
``` | ||
The `nearley.Parser` constructor takes an optional third parameter, `options`, | ||
which is an object with the following possible keys: | ||
You use the parser as usual: call `parser.feed(data)`, and nearley will give | ||
you the parsed results in return. | ||
- `keepHistory` (boolean, default `false`): if set to `true`, nearley will | ||
preserve the internal state of the parser in the parser's `.table` property. | ||
Preserving the state has some performance cost (because it can potentially be | ||
very large), so we recommend leaving this as `false` unless you are familiar | ||
with the Earley parsing algorithm and are planning to do something exciting | ||
with the parse table. | ||
Catching errors | ||
--------------- | ||
## Tools | ||
If there are no possible parsings, nearley will throw an error whose `offset` | ||
property is the index of the offending token. | ||
As mentioned above, nearley ships with a host of tools. | ||
```js | ||
try { | ||
p.feed("1+gorgonzola"); | ||
} catch(parseError) { | ||
console.log( | ||
"Error at character " + parseError.offset | ||
); // "Error at character 2" | ||
} | ||
``` | ||
### nearley-test: Exploring a parser interactively | ||
A global install of nearley provides `nearley-test`, a simple command-line tool | ||
you can use to inspect what a parser is doing. You input a generated | ||
`grammar.js` file, and then give it some input to test the parser against. | ||
`nearley-test` prints out the output if successful, and optionally | ||
pretty-prints the internal parse table used by the algorithm. This is helpful | ||
to test a new parser. | ||
Exploring a parser interactively | ||
-------------------------------- | ||
### nearley-unparse: The Unparser | ||
The global install will provide `nearley-test`, a simple command-line tool you | ||
can use to inspect what a parser is doing. You input a generated `grammar.js` | ||
file, and then give it some input to test the parser against. `nearley-test` | ||
prints out the output if successful, and also gives you the complete parse | ||
table used by the algorithm. This is very helpful when you're testing a new | ||
parser. | ||
This was previously called `bin/nearleythere.js` and written by Robin. | ||
The Unparser | ||
------------ | ||
The Unparser takes a (compiled) parser and outputs a random string that would | ||
be accepted by the parser. | ||
```bash
$ nearley-unparse -s number <(nearleyc builtin/prims.ne)
-6.22E94
```
- ...test your parser specification by generating lots of random expressions
  and making sure all of them are "correct".
- ...generate random strings from a schema (for example, random email addresses
  or telephone numbers).
- ...create fuzzers and combinatorial stress-testers.
- ...play "Mad-Libs" automatically! (Practical application: automatically
  generating grammatically valid lorem text.)
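The idea behind an unparser can be sketched in a few lines: walk the grammar top-down, pick a random alternative at each rule, and emit literals as they are reached. The grammar format below is invented for this sketch; nearley's compiled grammar format differs.

```javascript
// Toy unparser: randomly expands rules until only literal characters remain.
// Grammar format here is made up for illustration, not nearley's internal one.
const rules = {
    number: [["digit"], ["digit", "number"]], // one or more digits
    digit: "0123456789".split("").map(d => [{ literal: d }])
};

function unparse(name) {
    const alternatives = rules[name];
    const choice = alternatives[Math.floor(Math.random() * alternatives.length)];
    return choice
        .map(sym => (typeof sym === "string" ? unparse(sym) : sym.literal))
        .join("");
}

console.log(unparse("number")); // a random digit string
```

This recursion terminates with probability 1, but a production-quality unparser also has to bound recursion depth so that recursive rules cannot produce unboundedly long output.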
The Unparser outputs as a stream by continuously writing characters to its
output.

### nearley-railroad: Automagical Railroad Diagrams
nearley lets you convert your grammars to pretty SVG railroad diagrams that you | ||
can include in webpages, documentation, and even papers. | ||
```bash
$ nearley-railroad regex.ne -o grammar.html | ||
```

### Other Tools
*This section lists nearley tooling created by other developers. These tools
are not distributed with nearley, so if you have problems, please contact the
tool's author.*
Gulp users can use | ||
[gulp-nearley](https://github.com/JosephJNK/gulp-nearley) by Joseph Junker to | ||
compile grammars with a gulpfile. | ||
## Still confused?
You can read [the calculator example](examples/calculator/arithmetic.ne) to get
a feel for the syntax (see it live on [nearley.js.org](http://nearley.js.org)).
## Contributing

Tests live in `test/` and can be called with `npm test`. Please run the
benchmarks before and after your changes: parsing is tricky, and small changes
can kill efficiency. We learned this the hard way!
If you're looking for something to do, here's a short list of things that would
make me happy:
## Further reading
### Documentation | ||
- [Best practices for writing grammars](docs/how-to-grammar-good.md) | ||
- [More on tokenizers](docs/custom-tokens-and-lexers.md) | ||
- [Accessing the internal parse table](docs/accessing-parse-table.md) | ||
- [Using `nearleyc` in browsers](docs/using-in-frontend.md) | ||
### Recipes | ||
- [Transforming parse trees](docs/generating-cst-ast.md) | ||
- [Writing an indentation-aware (Python-like) lexer](https://gist.github.com/nathan/d8d1adea38a1ef3a6d6a06552da641aa) | ||
- [Making a REPL for your language](docs/making-a-repl.md) | ||
### Details | ||
- Read my [blog post](http://hardmath123.github.io/earley.html) to learn more
  about the algorithm.
- Read about [Marpa](http://savage.net.au/Marpa.html) to learn more than you
  ever thought you wanted to know about parsing.
The accompanying change to `test/launch.js` adds a 5-second timeout to the
TypeScript build test:

```js
it('should build for TypeScript', function() {
    this.timeout(5000);
    externalNearleyc("test/typescript-test.ne -o test/tmp.typescript-test.ts").should.equal("");
    sh("node ./node_modules/typescript/bin/tsc --project test");
});
```