The nearley npm package is a fast, feature-rich, and modern parser toolkit for JavaScript. It is based on Earley's algorithm and can be used to create parsers for complex, context-free grammars. nearley is designed to be simple to use and extend, making it a good choice for building compilers, interpreters, and other language-related tools.
Grammar Definition
This feature lets you define a grammar for your language. The grammar is written in nearley's BNF-like .ne syntax and compiled into a JavaScript parser with nearleyc. For example:
# hello.ne
main -> "hello " world
world -> "world"
Parsing Input
Once you have defined a grammar, you can create a parser and feed it input to parse. The parser will output a parse tree or a list of possible parse trees if the input is ambiguous.
{"const nearley = require('nearley');\nconst grammar = require('./your-grammar.js');\nconst parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));\nparser.feed('hello world');\nconst results = parser.results;\nconsole.log(results);"}
Error Reporting
nearley provides error reporting features that help you understand where and why a parse failed, which is useful for debugging grammars and providing feedback to users.
{"const nearley = require('nearley');\nconst grammar = require('./your-grammar.js');\nconst parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));\ntry {\n parser.feed('hello wor');\n} catch (error) {\n console.error(error.message);\n}"}
PEG.js is a simple parser generator for JavaScript that produces fast parsers with excellent error reporting. It uses Parsing Expression Grammars (PEG) as the input. Compared to nearley, PEG.js grammars are arguably easier to read and write but are less powerful in terms of expressing certain types of grammars.
Chevrotain is a high-performance, self-optimizing parser building toolkit for JavaScript. Unlike nearley, which uses Earley's algorithm, Chevrotain is based on parsing techniques that do not require a separate parser generation step. It provides a rich feature set and is particularly well-suited for building complex parsers.
ANTLR (ANother Tool for Language Recognition) is a powerful parser generator that supports multiple languages, including JavaScript. ANTLR is more complex than nearley but offers a very rich set of features for building sophisticated language processors. It uses LL(*) parsing which is different from nearley's Earley-based approach.
Jison is an npm package that generates bottom-up parsers in JavaScript. Inspired by Bison, it is capable of handling LR and LALR grammars. Jison can be considered more traditional compared to nearley's modern approach, and it might be more familiar to those with experience in classic parser generators.
nearley
Simple parsing for node.js.
nearley uses the Earley parsing algorithm augmented with Joop Leo's optimizations to parse complex data structures easily. nearley is über-fast and really powerful. It can parse literally anything you throw at it.
nearley is used by artificial intelligence and computational linguistics classes at universities, as well as file format parsers, markup languages and complete programming languages. It's an npm staff pick.
nearley can parse what other JS parse engines cannot, because it uses a different algorithm. The Earley algorithm is general, which means it can handle any grammar you can define in BNF. In fact, the nearley syntax is written in itself (this is called bootstrapping).
PEG.js, for example, is recursive-descent based and will choke on many grammars, in particular left-recursive ones, while LALR generators like Jison cannot handle ambiguity and reject many grammars outright.
nearley also has capabilities to catch errors gracefully, and detect ambiguous grammars (grammars that can be parsed in multiple ways).
To compile a parser, use the nearleyc command:
npm install -g nearley
nearleyc parser.ne
Run nearleyc --help for more options.
To use a generated grammar in a node runtime, install nearley as a module:
npm install nearley
...
var nearley = require("nearley");
var grammar = require("./my-generated-grammar.js");
To use a generated grammar in a browser runtime, include nearley.js (you can hardlink from GitHub if you want):
<script src="https://raw.githubusercontent.com/Hardmath123/nearley/master/lib/nearley.js"></script>
<script src="my-generated-grammar.js"></script>
This is a basic overview of nearley syntax and usage. For an advanced styleguide, see this file.
A parser consists of several nonterminals, which are constructions in a language. A nonterminal is made up of a series of either other nonterminals or strings. In nearley, you define a nonterminal by giving its name and its expansions.
Strings are the terminals, which match those string literals (specified as JSON-compatible strings).
The following grammar matches a number, a plus sign, and another number:
expression -> number "+" number
Anything from a # to the end of a line is ignored as a comment:
expression -> number "+" number # sum of two numbers
A nonterminal can have multiple expansions, separated by vertical bars (|):
expression ->
number "+" number
| number "-" number
| number "*" number
| number "/" number
The parser tries to parse the first nonterminal that you define in a file.
However, you can (and should!) introduce more nonterminals as "helpers". In
this example, we would have to define the expansion of number.
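For instance, a minimal sketch of such a helper (using the charset and :+ sugar described further below) could be:
number -> [0-9]:+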
Each meaning (called a production rule) can have a postprocessing function, that can format the data in a way that you would like:
expression -> number "+" number {%
function (data, location, reject) {
return ["sum", data[0], data[2]];
}
%}
data is an array whose elements match the nonterminals in order. The postprocessor id returns the first token in the match (literally function(data) {return data[0];}).
location is the index at which that rule was found. Retaining this information in a syntax tree is useful if you're writing an interpreter and want to give fancy error messages for runtime errors.
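For instance (an illustrative variation on the sum rule above, not from the original docs), you could keep the offset on the node you build:
expression -> number "+" number {%
    function (data, location) {
        return {type: "sum", left: data[0], right: data[2], offset: location};
    }
%}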
If, after examining the data, you want to force the rule to fail anyway, return reject. An example of this is allowing a variable name to be a word that is not a keyword:
variable -> word {%
function(data, location, reject) {
if (KEYWORDS.indexOf(data[0]) === -1) {
return data[0]; // It's a valid name
} else {
return reject; // It's a keyword, so reject it
}
}
%}
You can write your postprocessors in CoffeeScript by adding @preprocessor coffee to the top of your file. If you would like to support a different postprocessor language, feel free to file a PR!
The epsilon rule is the empty rule that matches nothing. The constant null is the epsilon rule, so:
a -> null
| a "cow"
will match 0 or more cows in a row.
You can use valid RegExp charsets in a rule:
not_a_letter -> [^a-zA-Z]
The . character can be used to represent "any character".
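For example (an illustrative rule, not from the original docs), this matches any single character wrapped in single quotes:
quoted_char -> "'" . "'"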
nearley compiles some higher-level constructs into BNF for you. In particular, the *, ?, and + operators from Regexes (or EBNF) are available as shown:
batman -> "na":* "batman" # nananana...nanabatman
You can also use capture groups with parentheses. Their contents can be anything that a rule can have:
banana -> "ba" ("na" {% id %} | "NA" {% id %}):+
You can create "polymorphic" rules through macros:
match3[X] -> $X $X $X
quote[X] -> "'" $X "'"
main -> match3[quote["Hello?"]]
# matches "'Hello?''Hello?''Hello?'"
Macros are dynamically scoped:
foo[X, Y] -> bar["moo" | "oink" | "baa"] $Y
bar[Z] -> $X " " $Z # 'remembers' $X from its caller
main -> foo["Cows", "."]
# matches "Cows oink." and "Cows moo."
Macros cannot be recursive (nearleyc will go into an infinite loop trying to expand the macro-loop).
For more intricate postprocessors, or any other functionality you may need, you can include parts of literal JavaScript between production rules by surrounding it with @{% ... %}:
@{% var makeCowWithString = require('./cow.js') %}
cow -> "moo" {% function(d) {makeCowWithString(d[0]); } %}
Note that it doesn't matter where you define these; they all get hoisted to the top of the generated code.
You can include the content of other parser files:
@include "../misc/primitives.ne" # path relative to file being compiled
sum -> number "+" number
There are also some built-in parsers whose contents you can include:
@builtin "cow.ne"
main -> cow:+
See the builtin/ directory for an index of this library. Contributions are welcome here!
Including a parser imports all of the nonterminals defined in the parser, as well as any JS, macros, and config options defined there.
nearley assumes by default that your fundamental unit of parsing, called a token, is a character; that is, you're parsing a list of characters. However, sometimes you want to preprocess your string to turn it into a list of lexical tokens. This means that instead of seeing "1", "2", "3", nearley might just see a single list item "123". This is called tokenizing, and it can bring you decent performance gains. It also allows you to write cleaner, more maintainable grammars and to prevent ambiguous grammars.
Tokens can be defined in two ways: literal tokens and testable tokens. A literal token matches exactly, while a testable token runs a function to test whether it is a match or not.
@{%
var print_tok = {literal: "print"};
var number_tok = {test: function(x) {return x.constructor === Number; }}
%}
main -> %print_tok %number_tok
Now, instead of parsing the string "print 12", you would parse the array ["print", 12].
You can write your own tokenizer using regular expressions, or choose from several existing tokenizing libraries for node.
(If someone writes a tokenizer plugin for nearley, I would wholeheartedly accept it!)
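As a rough sketch (the grammar filename is a placeholder and the tokenizer is deliberately naive), a regex-based tokenizer feeding the token grammar defined above might look like:
var nearley = require("nearley");
var grammar = require("./tokenized-grammar.js"); // placeholder: the compiled token grammar from above

// Turn "print 12" into the token array ["print", 12]
function tokenize(input) {
    return input.split(/\s+/).map(function (tok) {
        return /^\d+$/.test(tok) ? Number(tok) : tok;
    });
}

var parser = new nearley.Parser(grammar.ParserRules, grammar.ParserStart);
parser.feed(tokenize("print 12"));
console.log(parser.results);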
nearley exposes the following API:
var grammar = require("./generated-code.js");
var nearley = require("nearley");
// Create a Parser object from our grammar.
var p = new nearley.Parser(grammar.ParserRules, grammar.ParserStart);
// Parse something
p.feed("1+1");
// p.results --> [ ["sum", "1", "1"] ]
The Parser object can be fed data in parts with .feed(data). You can then find an array of parsings with the .results property. If results is empty, then there are no parsings. If results contains multiple values, then that combination is ambiguous.
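For example (a hypothetical grammar, not one of the bundled examples), the following rules can split the input "aaa" in two different ways, so results would contain two parse trees:
main -> as as
as -> "a" | "aa" | "aaa"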
The incremental feeding design lets you parse data from stream-like inputs, or even dynamic readline inputs. For example, you can create a Python-style REPL that keeps prompting you until you have entered a complete block.
p.feed(prompt_user(">>> "));
while (p.results.length < 1) {
p.feed(prompt_user("... "));
}
console.log(p.results);
If there are no possible parsings, nearley will throw an error whose offset property is the index of the offending token.
try {
p.feed("1+gorgonzola");
} catch(parseError) {
console.log(
"Error at character " + parseError.offset
); // "Error at character 2"
}
The global install will provide nearley-test, a simple command-line tool you can use to inspect what a parser is doing. You give it a generated grammar.js file and some input to test the parser against. nearley-test prints out the output if successful, and also gives you the complete parse table used by the algorithm. This is very helpful when you're testing a new parser.
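Invocation looks roughly like this (the -i flag for passing input inline is an assumption on my part; run nearley-test --help for the actual options):
$ nearley-test my-generated-grammar.js -i "1+1"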
This was previously called bin/nearleythere.js and written by Robin.
The Unparser takes a (compiled) parser and outputs a random string that would be accepted by the parser.
$ nearley-unparse -s number <(nearleyc builtin/prims.ne)
-6.22E94
You can use the Unparser to...
The Unparser outputs as a stream by continuously writing characters to its output pipe. So, if it "goes off the deep end" and generates a huge string, you will still see output scrolling by in real-time.
As far as I know, nearley is the only parser generator with this feature. It is inspired by Roly Fentanes' randexp, which does the same thing with regular expressions.
nearley lets you convert your grammars to pretty SVG railroad diagrams that you can include in webpages, documentation, and even papers.
$ nearley-railroad regex.ne -o grammar.html
See a bigger example here.
(This feature is powered by railroad-diagrams by tabatkins.)
You can read the calculator example to get a feel for the syntax (see it live here). There are more sample grammars in the /examples directory. For larger examples, we also have experimental parsers for CSV, Lua, and JavaScript.
Clone, hack, PR. Tests live in test/ and can be called with npm test. Make sure you read test/profile.log after tests run, and that nothing has died (parsing is tricky, and small changes can kill efficiency).
If you're looking for something to do, here's a short list of things that would make me happy: pearley for Python and cearley for C would be awesome.
nearley is MIT licensed.
A big thanks to Nathan Dinsmore for teaching me how to Earley, Aria Stewart for helping structure nearley into a mature module, and Robin Windels for bootstrapping the grammar. Additionally, Jacob Edelman wrote an experimental JavaScript parser with nearley and contributed ideas for EBNF support. Joshua T. Corbin refactored the compiler to be much, much prettier. Bojidar Marinov implemented postprocessors-in-other-languages. Shachar Itzhaky fixed a subtle bug with nullables.
Atom users can write nearley grammars with this plugin by Bojidar Marinov.
Sublime Text users can write nearley grammars with this syntax by liam4.
FAQs
Simple, fast, powerful parser toolkit for JavaScript.
The npm package nearley receives a total of 2,728,429 weekly downloads. As such, nearley's popularity was classified as popular.
We found that nearley demonstrates an unhealthy version release cadence and level of project activity, as the last version was released a year ago. It has 2 open source maintainers collaborating on the project.