regular-grammar
Comparing version 0.1.0 to 0.1.1
{
"name": "regular-grammar",
- "version": "0.1.0",
+ "version": "0.1.1",
"description": "A simple regular grammar/parser generator",
@@ -16,3 +16,3 @@ "author": "Victor Grishchenko <victor.grishchenko@gmail.com>",
"scripts": {
- "test": "tape test/01_grammar.js | tap-diff"
+ "test": "tape test/0?_*.js | tap-diff"
},
@@ -19,0 +19,0 @@ "keywords": [
# Simple regular grammar parser generator
The Swarm protocol has evolved since 2012 in more or less the same direction, gradually becoming richer and richer.
Things became a bit hairy in November 2016, once the object metadata structure was unified and made part of the protocol.
This weekend (11 Feb 2017) I finally realized how to write the production rules in a concise way (see `index.js`).
As a result, 20 rather simple rules describe the entire syntax.
This package creates regex parsers for simple regular grammars.
That is mostly valuable for implementing various text-based
protocols, serialization formats, and smallish languages.
In terms of the Chomsky–Schützenberger hierarchy, Swarm is a regular language;
hence, it can be parsed by regular expressions.
We only support *regular* grammars, so, for example,
arbitrarily deep nesting and recursion are not possible.
The next degree of sophistication in the hierarchy is a context-free language,
which is probably too much for our use case.
After all, our goal is simplicity and predictability, not Turing completeness.
This package exports the `Grammar` class, which converts simple grammar
definitions into regex parsers.
In particular, you may create parsers for reasonably functional subsets of SMTP, HTTP, or CSS, as in the sketch below.
You can't create a parser for general-case HTML, XML, or JSON, as those languages allow arbitrarily deep nesting.
Still, you can create a parser for a subset of such a language (see the JSON example).
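For instance, here is a rough sketch of a grammar for an HTTP-style request line. This is an illustration only: the rule names and terminal patterns below are mine, not something shipped with the package.
```js
const Grammar = require('regular-grammar');

// Hypothetical subset of HTTP: just the request line.
const REQUEST_LINE = new Grammar({
  METHOD:  /GET|HEAD|POST|PUT|DELETE/,  // terminal rules are plain regexes
  PATH:    /\/\S*/,
  VERSION: /HTTP\/1\.[01]/,
  LINE:    'METHOD PATH VERSION',       // production rule: three terminals
});

REQUEST_LINE.is('GET /index.html HTTP/1.1', 'LINE'); // expected: true
REQUEST_LINE.is('BREW /pot HTCPCP/1.0', 'LINE');     // expected: false
```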
Advantages of the approach are:
* simple languages,
* uniform treatment of whitespace and punctuation,
* correct formal parsers created in 5 minutes.
## Examples
(see `test/`)
```js
const Grammar = require('regular-grammar');

const THREE_WORD = new Grammar({
  WORD: /\w+/,
  SENTENCE: ' WORD{1,3} ',
});
// `tap` is a tape test object, as in the tests under test/
tap.ok(THREE_WORD.is('what a grammar', 'SENTENCE'));
tap.ok(THREE_WORD.is('attention', 'SENTENCE'));
tap.notOk(THREE_WORD.is('you talk too much', 'SENTENCE'));
tap.notOk(THREE_WORD.is(' ', 'SENTENCE'));
const words = THREE_WORD.split(' exactly three\nwords', 'SENTENCE');
tap.deepEqual(words, [['exactly', 'three', 'words']]);
```
## Why not JSON?
JSON is an expensive-to-parse language that gives no guarantees
about the structure of the resulting objects. It is lax.
A parsed JSON object still has to be parsed anew
to derive *your* data structures.
In fact, it is possible to define restricted JSONey languages
using regular grammars.
```js
const INVENTORY_JSON = new Grammar({
  FLOAT:   /\d{1,16}(\.\d{1,15})?/,
  INT:     /\d{1,16}/,
  STRING:  /"(\\.|[^"])*"/,
  IDKV:    '"\\"id\\"" :STRING',
  NAMEKV:  '"\\"name\\"" :STRING',
  QUANTKV: '"\\"quantity\\"" :INT',
  PRICEKV: '"\\"price\\"" :FLOAT',
  ENTRY:   '{ IDKV ,NAMEKV? ,QUANTKV? ,PRICEKV? }',
  LIST:    '[ ENTRY ,ENTRY* ]'
});
// This particular language can be parsed by a JSON parser,
// but it is way more strict than JSON.
// Field names, their order and value types are all fixed.
tap.ok(INVENTORY_JSON.is('{"id":"A123"}', 'ENTRY'));
tap.ok(INVENTORY_JSON.is('' + Math.PI, 'FLOAT'));
const bears = '{"id":"A345", "name":"teddy bear", "price": 5.45}';
tap.ok(INVENTORY_JSON.is(bears, 'ENTRY'));
tap.notOk(INVENTORY_JSON.is('{"id":123}', 'ENTRY'));
```
For that JSONey grammar, the parser looks like:
```
/^\s*\[\s*(\{\s*(?:(?:"id"\s*)\:\s*(?:"(?:\\.|[^"])*"))\s*(?:\,\s*(?:(?:"name"\s*)\:\s*(?:"(?:\\.|[^"])*")))?\s*(?:\,\s*(?:(?:"quantity"\s*)\:\s*(?:\d+)))?\s*(?:\,\s*(?:(?:"price"\s*)\:\s*(?:\d+(?:\.\d{1,15})?)))?\s*\}\s*)((?:\,\s*\{\s*(?:(?:"id"\s*)\:\s*(?:"(?:\\.|[^"])*"))\s*(?:\,\s*(?:(?:"name"\s*)\:\s*(?:"(?:\\.|[^"])*")))?\s*(?:\,\s*(?:(?:"quantity"\s*)\:\s*(?:\d+)))?\s*(?:\,\s*(?:(?:"price"\s*)\:\s*(?:\d+(?:\.\d{1,15})?)))?\s*\}\s*)*)\s*\]\s*\s*$/m
```
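One possible usage pattern (a sketch of mine, not from the package's own docs) is to use such a grammar as a strict validator in front of a standard JSON parser:
```js
// Validate the wire form against the LIST rule first, then let JSON.parse
// build the objects. If is() returns true, the field names, their order
// and their value types are already known to be correct.
const list = '[{"id":"A345", "name":"teddy bear", "price": 5.45},' +
             ' {"id":"A346", "quantity": 4}]';
if (INVENTORY_JSON.is(list, 'LIST')) {
  const entries = JSON.parse(list);
  console.log(entries.length); // 2
}
```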
## Historical note
The package was initially made to parse a new version of the Swarm
protocol. The protocol has gradually evolved since circa 2012, and its parsers
became more and more hairy, with no obvious guarantees of correctness,
security, or performance.
A formal grammar saved the day by describing the protocol in about
a dozen simple rules and producing a provably correct parser of
reasonable performance.
@@ -49,3 +49,3 @@ /** The class generates regex _parsers for simple regular languages
const formula = m[0];
- const marker = m[2] ? m[2] : (m[1] || '');
+ const marker = m[2] ? JSON.parse(m[2]) : (m[1] || '');
const rule = m[3] || 'EMPTY';
@@ -60,2 +60,3 @@ const quantifier = m[4] || '';
repeating,
+ empty: rule==='EMPTY'
};
@@ -101,3 +102,3 @@ ret.push(triplet);
} else {
- ret = m + '(' + ret + ')';
+ ret = m + '\\s*(' + ret + ')';
}
@@ -137,2 +138,9 @@ } else {
});
+ joined = joined.replace(/\\s\*(\(+\?:|\(+|\)+)\\s\*/g, '\\s*$1');
+ joined = joined.replace(/\((\?\:)?\)/g, "");
+ joined = joined.replace(/(\\s\*)+/g, "\\s*");
+ // TODO test: no capture group for bodyless triplets
+ //console.log(rule_name, joined)
this._patterns[rule_name] = joined;
@@ -180,3 +188,3 @@
- Grammar.TRIPLET_RE = /(\[.*?\]|"([^"]+)"|[^A-Za-z0-9\s])?([A-Z][A-Z0-9_]*)?([*+?|]|{\d+(?:,\d+)?})?/g;
+ Grammar.TRIPLET_RE = /(\[\S*?\]|("(?:\\.|[^"])*")|[^A-Za-z0-9\s])?([A-Z][A-Z0-9_]*)?([*+?|]|{\d+(?:,\d+)?})?/g;
@@ -183,0 +191,0 @@ function sterilize (pattern) {
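As a side note on the `TRIPLET_RE` and `JSON.parse(m[2])` changes above, here is a small standalone sketch (my own illustration, not code from the package) of what the new triplet regex extracts from a rule body:
```js
// The updated triplet regex from the diff above, applied to one rule body.
const TRIPLET_RE = /(\[\S*?\]|("(?:\\.|[^"])*")|[^A-Za-z0-9\s])?([A-Z][A-Z0-9_]*)?([*+?|]|{\d+(?:,\d+)?})?/g;

const m = TRIPLET_RE.exec('"the"WORD+');
// m[1], m[2] -> the quoted marker '"the"'
// m[3]       -> the rule name 'WORD'
// m[4]       -> the quantifier '+'
// The marker is now unquoted via JSON.parse(m[2]), so escaped characters
// inside quoted markers can be decoded.
console.log(JSON.parse(m[2]), m[3], m[4]); // the WORD +
```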
const tape = require('tape');
const Grammar = require('..');
// TODO
// 5. local |
@@ -10,3 +8,3 @@ tape('grammar.02.A EMPTY', (tap) => {
WORD: /\w+/,
- ANY: '{EMPTY WORD* }EMPTY',
+ ANY: '{ WORD* }',
});
@@ -30,6 +28,5 @@
tape('grammar.02.B COUNTS', (tap) => {
const THREE_WORD = new Grammar({
WORD: /\w+/,
- SENTENCE: ' WORD{1,3} '
+ SENTENCE: ' WORD{1,3} ',
});
@@ -41,12 +38,14 @@
tap.notOk(THREE_WORD.is('you talk too much', 'SENTENCE'));
tap.notOk(THREE_WORD.is(' ', 'SENTENCE'));
const words = THREE_WORD.split(' exactly three\nwords', 'SENTENCE');
tap.deepEqual(words, [['exactly', 'three', 'words']]);
tap.end();
});
tape('grammar.02.C QUOTES', (tap) => {
const CONCRETE = new Grammar({
WORD: /\w+/,
- SENTENCE: '"the"WORD+'
+ SENTENCE: '"the"WORD+',
});
@@ -60,7 +59,5 @@
tap.end();
});
tape('grammar.02.D OR', (tap) => {
const GEOMETRY = new Grammar({
@@ -72,3 +69,3 @@ COLOR: /WHITE|BLACK|RED/,
SIZE: '"BIG"| "SMALL"',
- SENTENCE: 'COLOR| SIZE SHAPE'
+ SENTENCE: 'COLOR| SIZE SHAPE',
});
@@ -86,3 +83,22 @@
tap.end();
});
+ tape('grammar.02.E BABY ENGLISH', (tap) => {
+ // TODO word boundaries
+ const BABY_ENGLISH = new Grammar({
+ NOUN: /dog|cat|toy|mom/,
+ IS: /is|was|will be/,
+ ADJECTIVE: /funny|happy|good|bad/,
+ SENTENCE: 'NOUN IS ADJECTIVE',
+ });
+ tap.ok(BABY_ENGLISH.is('mom is good', 'SENTENCE'));
+ tap.ok(BABY_ENGLISH.is('cat is bad', 'SENTENCE'));
+ tap.ok(BABY_ENGLISH.is('dog was funny', 'SENTENCE'));
+ tap.notOk(BABY_ENGLISH.is('cat is dog', 'SENTENCE'));
+ tap.notOk(BABY_ENGLISH.is('was cat good', 'SENTENCE'));
+ tap.notOk(BABY_ENGLISH.is('dog is dog', 'SENTENCE'));
+ tap.end();
+ });