regular-grammar - npm Package Compare versions

Comparing version 0.1.0 to 0.1.1

new_bench.diff


package.json
 {
 "name": "regular-grammar",
-"version": "0.1.0",
+"version": "0.1.1",
 "description": "A simple regular grammar/parser generator",

@@ -16,3 +16,3 @@ "author": "Victor Grishchenko <victor.grishchenko@gmail.com>",

 "scripts": {
-"test": "tape test/01_grammar.js | tap-diff"
+"test": "tape test/0?_*.js | tap-diff"
 },

@@ -19,0 +19,0 @@ "keywords": [

# Simple regular grammar parser generator
The Swarm protocol has evolved since 2012 in more or less the same direction, gradually becoming richer and richer.
Things became a bit hairy last November (2016), once the object metadata structure was unified and made part of the protocol.
This weekend (11 Feb 2017) I finally realized how to write the production rules in a concise way (see index.js).
As a result, 20 rather simple rules describe the entire syntax.
This package creates regex parsers for simple regular grammars.
That is mostly valuable for implementing various text-based protocols, serialization formats, and smallish languages.
By the Chomsky–Schützenberger hierarchy, Swarm is a regular language, so it can be parsed by regular expressions.
Only *regular* grammars are supported; for example, arbitrarily deep nesting and recursion are not possible.
The next degree of sophistication in the hierarchy is a context-free language, which is probably too much for our use case.
After all, our goal is simplicity and predictability, not Turing completeness.
The package exports the Grammar class, which converts simple grammar definitions into regex parsers.
In particular, you may create parsers for reasonably functional subsets of SMTP, HTTP, or CSS.
You can't create a parser for general-case HTML, XML, or JSON, since those languages allow arbitrarily deep nesting.
Still, you can create a parser for a subset of such a language (see the JSON example).
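For instance, the request line of an HTTP-like protocol fits a regular grammar. A minimal sketch, assuming the package's default export is the Grammar class (as the tests' `require('..')` suggests); the rule names METHOD, TARGET, VERSION and REQUEST are made up for illustration:
```js
const Grammar = require('regular-grammar');

// Illustrative only: an HTTP-style request line as a regular grammar.
const REQUEST_LINE = new Grammar({
  METHOD: /GET|HEAD|POST|PUT|DELETE|OPTIONS/,
  TARGET: /\S+/,
  VERSION: /HTTP\/\d\.\d/,
  REQUEST: 'METHOD TARGET VERSION',
});

console.log(REQUEST_LINE.is('GET /index.html HTTP/1.1', 'REQUEST')); // expected: true
console.log(REQUEST_LINE.is('GET /index.html', 'REQUEST'));          // expected: false (no version)
```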
Advantages of the approach are:
* simple languages,
* uniform treatment of whitespace and punctuation,
* correct formal parsers created in 5 minutes.
## Examples
(see `test/`)
```js
const THREE_WORD = new Grammar({
  WORD: /\w+/,
  SENTENCE: ' WORD{1,3} ',
});
tap.ok(THREE_WORD.is('what a grammar', 'SENTENCE'));
tap.ok(THREE_WORD.is('attention', 'SENTENCE'));
tap.notOk(THREE_WORD.is('you talk too much', 'SENTENCE'));
tap.notOk(THREE_WORD.is(' ', 'SENTENCE'));
const words = THREE_WORD.split(' exactly three\nwords', 'SENTENCE');
tap.deepEqual(words, [['exactly', 'three', 'words']]);
```
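The test suite (diffed below) also exercises quoted literal markers and `|` alternation (grammar.02.C and grammar.02.D). A condensed sketch of the quoted-marker feature; the expected results are inferred from those tests rather than from documentation:
```js
const Grammar = require('regular-grammar');

// A quoted token in a rule body is a literal marker that must appear in the input.
const CONCRETE = new Grammar({
  WORD: /\w+/,
  SENTENCE: '"the"WORD+',
});

console.log(CONCRETE.is('the quick fox', 'SENTENCE'));   // expected: true
console.log(CONCRETE.is('quick brown fox', 'SENTENCE')); // expected: false -- the literal "the" is missing
```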
## Why not JSON?
JSON is an expensive-to-parse language that gives no guarantees about the structure of the resulting objects; it is lax.
A parsed JSON object still has to be traversed anew to derive *your* data structures.
In fact, it is possible to define restricted JSONey languages using regular grammars.
```js
const INVENTORY_JSON = new Grammar({
  FLOAT: /\d{1,16}(\.\d{1,15})?/,
  INT: /\d{1,16}/,
  STRING: /"(\\.|[^"])*"/,
  IDKV: '"\\"id\\"" :STRING',
  NAMEKV: '"\\"name\\"" :STRING',
  QUANTKV: '"\\"quantity\\"" :INT',
  PRICEKV: '"\\"price\\"" :FLOAT',
  ENTRY: '{ IDKV ,NAMEKV? ,QUANTKV? ,PRICEKV? }',
  LIST: '[ ENTRY ,ENTRY* ]'
});
// This particular language can be parsed by a JSON parser,
// but it is way more strict than JSON.
// Field names, their order and value types are all fixed.
tap.ok(INVENTORY_JSON.is('{"id":"A123"}', 'ENTRY'));
tap.ok(INVENTORY_JSON.is('' + Math.PI, 'FLOAT'));
const bears = '{"id":"A345", "name":"teddy bear", "price": 5.45}';
tap.ok(INVENTORY_JSON.is(bears, 'ENTRY'));
tap.notOk(INVENTORY_JSON.is('{"id":123}', 'ENTRY'));
```
For that JSONey grammar, the parser looks like:
```
/^\s*\[\s*(\{\s*(?:(?:"id"\s*)\:\s*(?:"(?:\\.|[^"])*"))\s*(?:\,\s*(?:(?:"name"\s*)\:\s*(?:"(?:\\.|[^"])*")))?\s*(?:\,\s*(?:(?:"quantity"\s*)\:\s*(?:\d+)))?\s*(?:\,\s*(?:(?:"price"\s*)\:\s*(?:\d+(?:\.\d{1,15})?)))?\s*\}\s*)((?:\,\s*\{\s*(?:(?:"id"\s*)\:\s*(?:"(?:\\.|[^"])*"))\s*(?:\,\s*(?:(?:"name"\s*)\:\s*(?:"(?:\\.|[^"])*")))?\s*(?:\,\s*(?:(?:"quantity"\s*)\:\s*(?:\d+)))?\s*(?:\,\s*(?:(?:"price"\s*)\:\s*(?:\d+(?:\.\d{1,15})?)))?\s*\}\s*)*)\s*\]\s*\s*$/m
```
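To see the pattern a given rule compiles to, one option is to peek at the internal `_patterns` map that the builder fills in (visible in the diff below as `this._patterns[rule_name] = joined`). A hedged sketch; `_patterns` is an implementation detail, not a documented API:
```js
const Grammar = require('regular-grammar');

const THREE_WORD = new Grammar({
  WORD: /\w+/,
  SENTENCE: ' WORD{1,3} ',
});

// _patterns is internal; use it only for debugging and expect it to change between versions.
console.log(THREE_WORD._patterns['SENTENCE']);
```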
## Historical note
The package was initially written to parse a new version of the Swarm protocol.
The protocol has gradually evolved since circa 2012, and its parsers became more and more hairy, with no obvious guarantees of correctness, security, or performance.
A formal grammar saved the day by describing the protocol in about a dozen simple rules and producing a provably correct parser with reasonable performance.

@@ -49,3 +49,3 @@ /** The class generates regex _parsers for simple regular languages

 const formula = m[0];
-const marker = m[2] ? m[2] : (m[1] || '');
+const marker = m[2] ? JSON.parse(m[2]) : (m[1] || '');
 const rule = m[3] || 'EMPTY';
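A plausible reading of this change (inferred from the widened TRIPLET_RE further down, not stated by the author): a quoted marker is now captured together with its quotes and backslash escapes, and `JSON.parse` decodes that token back into literal text. For the IDKV marker used in the README example above:
```js
// The raw token the new TRIPLET_RE would capture for the IDKV rule's marker: "\"id\""
const token = '"\\"id\\""';
console.log(JSON.parse(token)); // prints "id" -- the marker text, literal quotes included
```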

@@ -60,2 +60,3 @@ const quantifier = m[4] || '';

 repeating,
+empty: rule==='EMPTY'
 };

@@ -101,3 +102,3 @@ ret.push(triplet);

 } else {
-ret = m + '(' + ret + ')';
+ret = m + '\\s*(' + ret + ')';
 }
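The added `\\s*` appears to be what makes whitespace after a punctuation or quoted marker optional, matching the `\:\s*`, `\{\s*` and similar fragments in the generated parser shown in the README above. A small check against the README's key/value rules; the expected results are an inference from that generated pattern, not documented behaviour:
```js
const Grammar = require('regular-grammar');

const KV = new Grammar({
  STRING: /"(\\.|[^"])*"/,
  IDKV: '"\\"id\\"" :STRING',
});

console.log(KV.is('"id":"A123"', 'IDKV'));  // expected: true
console.log(KV.is('"id": "A123"', 'IDKV')); // expected: true -- the space after ':' is tolerated
```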

@@ -137,2 +138,9 @@ } else {

 });
+joined = joined.replace(/\\s\*(\(+\?:|\(+|\)+)\\s\*/g, '\\s*$1');
+joined = joined.replace(/\((\?\:)?\)/g, "");
+joined = joined.replace(/(\\s\*)+/g, "\\s*");
+// TODO test: no capture group for bodyless triplets
+//console.log(rule_name, joined)
 this._patterns[rule_name] = joined;
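These three `replace` calls read like pattern clean-up: hoisting `\s*` out of bracket runs, dropping empty groups, and collapsing repeated `\s*`. A standalone sketch of the same string rewrites on a made-up pattern fragment:
```js
// Illustrative input: a fragment with \s* duplicated around brackets and an empty group.
let joined = '\\s*(?:\\s*"id"\\s*)\\s*\\s*(?:)';
joined = joined.replace(/\\s\*(\(+\?:|\(+|\)+)\\s\*/g, '\\s*$1'); // drop the redundant \s* on the far side of a bracket run
joined = joined.replace(/\((\?\:)?\)/g, "");                      // drop empty () and (?:) groups
joined = joined.replace(/(\\s\*)+/g, "\\s*");                     // collapse runs of \s*
console.log(joined); // \s*(?:"id"\s*)\s*
```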

@@ -180,3 +188,3 @@

-Grammar.TRIPLET_RE = /(\[.*?\]|"([^"]+)"|[^A-Za-z0-9\s])?([A-Z][A-Z0-9_]*)?([*+?|]|{\d+(?:,\d+)?})?/g;
+Grammar.TRIPLET_RE = /(\[\S*?\]|("(?:\\.|[^"])*")|[^A-Za-z0-9\s])?([A-Z][A-Z0-9_]*)?([*+?|]|{\d+(?:,\d+)?})?/g;
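One way to see the effect of the widened quoted-string alternative: run both patterns against a marker that contains escaped quotes, such as the IDKV marker from the README. The old pattern stops inside the token, while the new one captures it whole, which is what makes the `JSON.parse` change above work:
```js
// Both patterns copied verbatim from the diff above.
const OLD_TRIPLET_RE = /(\[.*?\]|"([^"]+)"|[^A-Za-z0-9\s])?([A-Z][A-Z0-9_]*)?([*+?|]|{\d+(?:,\d+)?})?/g;
const NEW_TRIPLET_RE = /(\[\S*?\]|("(?:\\.|[^"])*")|[^A-Za-z0-9\s])?([A-Z][A-Z0-9_]*)?([*+?|]|{\d+(?:,\d+)?})?/g;

const triplet = '"\\"id\\"" :STRING'; // the IDKV rule body from the README example

console.log(OLD_TRIPLET_RE.exec(triplet)[0]); // "\"      -- the match ends at the first escaped quote
console.log(NEW_TRIPLET_RE.exec(triplet)[2]); // "\"id\"" -- the whole quoted marker, ready for JSON.parse
```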

@@ -183,0 +191,0 @@ function sterilize (pattern) {

const tape = require('tape');
const Grammar = require('..');
// TODO
// 5. local |

@@ -10,3 +8,3 @@ tape('grammar.02.A EMPTY', (tap) => {

 WORD: /\w+/,
-ANY: '{EMPTY WORD* }EMPTY',
+ANY: '{ WORD* }',
 });

@@ -30,6 +28,5 @@

 tape('grammar.02.B COUNTS', (tap) => {
 const THREE_WORD = new Grammar({
 WORD: /\w+/,
-SENTENCE: ' WORD{1,3} '
+SENTENCE: ' WORD{1,3} ',
 });

@@ -41,12 +38,14 @@

 tap.notOk(THREE_WORD.is('you talk too much', 'SENTENCE'));
 tap.notOk(THREE_WORD.is(' ', 'SENTENCE'));
 const words = THREE_WORD.split(' exactly three\nwords', 'SENTENCE');
 tap.deepEqual(words, [['exactly', 'three', 'words']]);
 tap.end();
 });
 tape('grammar.02.C QUOTES', (tap) => {
 const CONCRETE = new Grammar({
 WORD: /\w+/,
-SENTENCE: '"the"WORD+'
+SENTENCE: '"the"WORD+',
 });

@@ -60,7 +59,5 @@

 tap.end();
 });
 tape('grammar.02.D OR', (tap) => {
 const GEOMETRY = new Grammar({

@@ -72,3 +69,3 @@ COLOR: /WHITE|BLACK|RED/,

 SIZE: '"BIG"| "SMALL"',
-SENTENCE: 'COLOR| SIZE SHAPE'
+SENTENCE: 'COLOR| SIZE SHAPE',
 });

@@ -86,3 +83,22 @@

 tap.end();
 });
+tape('grammar.02.E BABY ENGLISH', (tap) => {
+// TODO word boundaries
+const BABY_ENGLISH = new Grammar({
+NOUN: /dog|cat|toy|mom/,
+IS: /is|was|will be/,
+ADJECTIVE: /funny|happy|good|bad/,
+SENTENCE: 'NOUN IS ADJECTIVE',
+});
+tap.ok(BABY_ENGLISH.is('mom is good', 'SENTENCE'));
+tap.ok(BABY_ENGLISH.is('cat is bad', 'SENTENCE'));
+tap.ok(BABY_ENGLISH.is('dog was funny', 'SENTENCE'));
+tap.notOk(BABY_ENGLISH.is('cat is dog', 'SENTENCE'));
+tap.notOk(BABY_ENGLISH.is('was cat good', 'SENTENCE'));
+tap.notOk(BABY_ENGLISH.is('dog is dog', 'SENTENCE'));
+tap.end();
+});