The moo npm package is a tokenizer for compilers and other tools that operate on code. It takes a string of text and breaks it down into tokens that can be easily parsed or analyzed. This is useful for creating programming languages, interpreters, compilers, and other syntax-aware utilities.
Tokenization
This code sample demonstrates how to define a simple lexer with the moo package. It includes rules for whitespace, comments, numbers, strings, parentheses, keywords, and newlines. The lexer is then used to tokenize a sample string.
{"lexer": "const moo = require('moo');\nlet lexer = moo.compile({\n WS: /[ \t]+/,\n comment: /\/\/.*?$/,\n number: /\d+/,\n string: /\"(\\\\.|[^\\\\\"])*\"/,\n lparen: '('\n rparen: ')',\n keyword: ['while', 'if', 'else', 'moo'],\n NL: { match: /\n/, lineBreaks: true },\n});\nlexer.reset('if (42) moo');\nconsole.log(lexer.next()); // -> { type: 'keyword', value: 'if' }"}
Custom Error Handling
This code sample shows how to handle errors in tokenization with moo. An 'error' token type is defined using moo.error, which is used to throw an exception when the lexer encounters invalid syntax.
{"lexerWithErrorHandling": "const moo = require('moo');\nlet lexer = moo.compile({\n WS: /[ \t]+/,\n comment: /\/\/.*?$/,\n number: /\d+/,\n string: /\"(\\\\.|[^\\\\\"])*\"/,\n error: moo.error,\n});\nlexer.reset('invalid input');\ntry {\n while (true) {\n let token = lexer.next();\n if (!token) break;\n if (token.type === 'error') throw Error('Invalid syntax');\n }\n} catch (e) {\n console.error(e.message);\n}"}
Chevrotain is a JavaScript parser building toolkit which is similar to moo but more powerful and complex. It provides not only tokenization but also parsing capabilities, allowing users to define both the lexer and the parser for their language. Chevrotain is typically used when you need a full parser rather than just a tokenizer.
Nearley is a simple, fast, and powerful parsing toolkit for JavaScript. It is similar to moo in that it can be used to tokenize input strings, but it also includes a parser that is generated from a grammar specification. Nearley is often used for more complex parsing tasks where a grammar-based approach is beneficial.
PEG.js is a parser generator for JavaScript based on the parsing expression grammar formalism. It generates parsers with excellent performance and error reporting from a grammar. While moo focuses on tokenization, PEG.js provides a full parsing solution with a focus on creating parsers from a domain-specific language (DSL).
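To make the contrast concrete, here is a hedged sketch: a tiny PEG.js grammar yields parsed values directly, where moo would only hand you a token stream. The grammar below is purely illustrative (not from either project's docs), and it assumes pegjs 0.10+, where the API is peg.generate.

const peg = require('pegjs')

// PEG.js generates a full parser from the grammar, so parse()
// returns a computed value rather than a list of tokens.
const parser = peg.generate(`
  sum    = left:number "+" right:number { return left + right; }
  number = digits:[0-9]+ { return parseInt(digits.join(''), 10); }
`)

console.log(parser.parse('1+2')) // -> 3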
Moo is a highly-optimised tokenizer/lexer generator. Use it to tokenize your strings, before parsing 'em with a parser like nearley or whatever else you're into.
Yup! Flying-cows-and-singed-steak fast.
Moo is the fastest JS tokenizer around. It's ~2–10x faster than most other tokenizers; it's a couple orders of magnitude faster than some of the slower ones.
Define your tokens using regular expressions. Moo will compile 'em down to a single RegExp for performance. It uses the new ES6 sticky flag where possible to make things faster; otherwise it falls back to an almost-as-efficient workaround. (For more than you ever wanted to know about this, read adventures in the land of substrings and RegExps.)
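To illustrate the idea, here's a simplified sketch of that technique; it is not moo's actual source, and real moo does far more (keywords, states, error tokens). Each rule becomes one alternative in a single combined RegExp, and the sticky flag forces every match to start exactly where the previous one ended.

// Simplified sketch of a combined sticky-RegExp tokenizer.
const rules = { WS: /[ \t]+/, number: /\d+/, name: /[a-z]+/ }
const names = Object.keys(rules)
const combined = new RegExp(
  names.map(n => `(${rules[n].source})`).join('|'),
  'y' // sticky: each exec() must match at lastIndex, no scanning ahead
)

function* tokenize(input) {
  combined.lastIndex = 0
  let m
  while ((m = combined.exec(input))) {
    // Whichever capture group matched tells us the token type.
    // (Real moo throws on unmatchable input; this sketch just stops.)
    const idx = m.slice(1).findIndex(g => g !== undefined)
    yield { type: names[idx], value: m[0] }
  }
}

console.log([...tokenize('if 42')])
// -> [ { type: 'name', value: 'if' }, { type: 'WS', value: ' ' },
//      { type: 'number', value: '42' } ]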
You might be able to go faster still by writing your lexer by hand rather than using RegExps, but that's icky.
Oh, and it avoids parsing RegExps by itself. Because that would be horrible.
First, you need to do the needful: $ npm install moo, or $ yarn add moo, or whatever will ship this code to your computer. Alternatively, grab the moo.js file by itself and slap it into your web page via a <script> tag; it's completely standalone.
Then you can start roasting your very own lexer/tokenizer:
const moo = require('moo')
let lexer = moo.compile({
  WS: /[ \t]+/,
  comment: /\/\/.*?$/,
  number: /(0|[1-9][0-9]*)/,
  string: /"((?:\\["\\]|[^\n"\\])*)"/,
  lparen: '(',
  rparen: ')',
  keyword: ['while', 'if', 'else', 'moo', 'cows'],
  NL: { match: /\n/, lineBreaks: true },
})
And now throw some text at it:
lexer.reset('while (10) cows\nmoo')
lexer.next() // -> { type: 'keyword', value: 'while' }
lexer.next() // -> { type: 'WS', value: ' ' }
lexer.next() // -> { type: 'lparen', value: '(' }
lexer.next() // -> { type: 'number', value: '10' }
// ...
You can also feed it chunks of input at a time.
lexer.reset()
lexer.feed('while')
lexer.feed(' 10 cows\n')
lexer.next() // -> { type: 'keyword', value: 'while' }
// ...
If you've reached the end of Moo's internal buffer, next() will return undefined. You can always feed() it more if that happens.
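For example (a minimal sketch with made-up word/WS rules, assuming next() behaves exactly as described above: undefined at the end of the buffer, picking up again after feed()):

let lexer = moo.compile({
  word: /[a-z]+/,
  WS: / +/,
})
lexer.reset('moo')
lexer.next() // -> { type: 'word', value: 'moo' }
lexer.next() // -> undefined: the buffer is exhausted
lexer.feed(' cows')
lexer.next() // -> { type: 'WS', value: ' ' }
lexer.next() // -> { type: 'word', value: 'cows' }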
RegExps are nifty for making tokenizers, but they can be a bit of a pain. Here are some things to be aware of:
You often want to use non-greedy quantifiers: e.g. *? instead of *. Otherwise your tokens will be longer than you expect:
let lexer = moo.compile({
  string: /"(.*)"/, // greedy quantifier *
  // ...
})
lexer.reset('"foo" "bar"')
lexer.next() // -> { type: 'string', value: 'foo" "bar' }
Better:
let lexer = moo.compile({
  string: /"(.*?)"/, // non-greedy quantifier *?
  // ...
})
lexer.reset('"foo" "bar"')
lexer.next() // -> { type: 'string', value: 'foo' }
lexer.next() // -> { type: 'space', value: ' ' }
lexer.next() // -> { type: 'string', value: 'bar' }
The order of your rules matters. Earlier ones will take precedence.
moo.compile({
  word: /[a-z]+/,
  foo: 'foo',
}).reset('foo').next() // -> { type: 'word', value: 'foo' }

moo.compile({
  foo: 'foo',
  word: /[a-z]+/,
}).reset('foo').next() // -> { type: 'foo', value: 'foo' }
Moo uses multiline RegExps. This has a few quirks: for example, the dot /./ doesn't include newlines. Use [^] instead if you want to match newlines too.
Since negated character classes like /[^ ]/ (no spaces) will include newlines, you have to be careful not to include them by accident! In particular, the whitespace metacharacter \s includes newlines.
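You can check all three quirks directly in Node, no Moo required:

/./.test('\n')   // -> false: the dot never matches a newline
/[^]/.test('\n') // -> true:  [^] matches absolutely anything, newlines included
/\s/.test('\n')  // -> true:  \s includes newlines, easy to match by accident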
Moo tracks detailed information about the input for you.
It will track line numbers, as long as you apply the lineBreaks: true option to any tokens which might contain newlines. Moo will try to warn you if you forget to do this.
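For example (a small sketch with made-up word/NL rules; note the lineBreaks: true on the NL rule, without which the line counts below would drift):

let lexer = moo.compile({
  word: /[a-z]+/,
  NL: { match: /\n/, lineBreaks: true },
})
lexer.reset('moo\ncows')
lexer.next() // -> { type: 'word', value: 'moo', line: 1, col: 1, offset: 0, ... }
lexer.next() // -> { type: 'NL', value: '\n', line: 1, col: 4, lineBreaks: 1, ... }
lexer.next() // -> { type: 'word', value: 'cows', line: 2, col: 1, offset: 4, ... }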
Token objects (returned from next()) have the following attributes:

- type: the name of the group, as passed to compile.
- value: the contents of the capturing group (or the whole match, if the token RegExp doesn't define a capture).
- size: the total length of the match (value may be shorter if you have capturing groups).
- offset: the number of bytes from the start of the buffer where the match starts.
- lineBreaks: the number of line breaks found in the match. (Always zero if this rule has lineBreaks: false.)
- line: the line number of the beginning of the match, starting from 1.
- col: the column where the match begins, starting from 1.

Calling reset() on your lexer will empty its internal buffer, and set the line, column, and offset counts back to their initial value. If you don't want this, you can save() the state, and later pass it as the second argument to reset() to explicitly control the internal state of the lexer.
let state = lexer.save() // -> { line: 10 }
lexer.feed('some line\n')
lexer.next() // -> { line: 10 }
lexer.next() // -> { line: 11 }
// ...
lexer.reset('a different line\n', state)
lexer.next() // -> { line: 10 }
Moo makes it convenient to define literals and keywords.
moo.compile({
  lparen: '(',
  rparen: ')',
  keyword: ['while', 'if', 'else', 'moo', 'cows'],
})
It'll automatically compile them into regular expressions, escaping them where necessary.
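For instance (a quick sketch with hypothetical arrow/times literals):

let lexer = moo.compile({
  arrow: '->', // escaped for you where necessary
  times: '*',  // '*' on its own isn't even a valid RegExp, but it's fine as a string literal
  word: /[a-z]+/,
})
lexer.reset('a*b->c')
lexer.next() // -> { type: 'word', value: 'a' }
lexer.next() // -> { type: 'times', value: '*' }
lexer.next() // -> { type: 'word', value: 'b' }
lexer.next() // -> { type: 'arrow', value: '->' }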
Important! Always write your literals like this:
['while', 'if', 'else', 'moo', 'cows']
And not like this:
/while|if|else|moo|cows/
The reason: Moo special-cases keywords to ensure the longest match principle applies, even in edge cases.
Imagine trying to parse the input className with the following rules:

keyword: ['class'],
identifier: /[a-zA-Z]+/,

You'll get two tokens, ['class', 'Name'], which is not what you want! If you swap the order of the rules, you'll fix this example; but now you'll lex class wrong (as an identifier).
Moo solves this by checking to see if any of your literals can be matched by one of your other rules; if so, it doesn't lex the keyword separately, but instead handles it at a later stage (by checking identifiers against a list of keywords).
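Here's a sketch of that behaviour with the class example above. (This assumes the automatic keyword detection described in this README; newer moo releases spell the same thing explicitly with moo.keywords.)

let lexer = moo.compile({
  keyword: ['class'],
  identifier: /[a-zA-Z]+/,
  WS: / +/,
})
lexer.reset('class className')
lexer.next() // -> { type: 'keyword', value: 'class' }
lexer.next() // -> { type: 'WS', value: ' ' }
lexer.next() // -> { type: 'identifier', value: 'className' }, one token, not two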
Sometimes you want your lexer to support different states. This is useful for string interpolation, for example: to tokenize `a${{c: d}}e`, you might use:
let lexer = moo.states({
  main: {
    strstart: {match: '`', push: 'lit'},
    ident: /\w+/,
    lbrace: {match: '{', push: 'main'},
    rbrace: {match: '}', pop: 1},
    colon: ':',
    space: {match: /\s+/, lineBreaks: true},
  },
  lit: {
    interp: {match: '${', push: 'main'},
    escape: /\\./,
    strend: {match: '`', pop: 1},
    const: {match: /(?:[^$`]|\$(?!\{))+/, lineBreaks: true},
  },
})
// <= `a${{c: d}}e`
// => strstart const interp lbrace ident colon space ident rbrace rbrace const strend
It's also nice to let states inherit rules from other states and be able to count things: e.g. the interpolated expression state needs a } rule that can tell if it's a closing brace or the end of the interpolation, but is otherwise identical to the normal expression state. To support this, Moo allows annotating tokens with push, pop and next:

- push moves the lexer to a new state, and pushes the old state onto the stack.
- pop returns to a previous state, by removing one or more states from the stack.
- next moves to a new state, but does not affect the stack.

If no token matches, Moo will throw an Error.
If you'd rather treat errors as just another kind of token, you can ask Moo to do so.
let lexer = moo.compile({
  // ...
  myError: moo.error,
})
lexer.reset('invalid')
lexer.next() // -> { type: 'myError', value: 'invalid' }
You can have a token type that both matches tokens and contains error values.
moo.compile({
  // ...
  myError: {match: /[\$?`]/, error: true},
})
Iterators: we got 'em.
for (let here of lexer) {
  // here = { type: 'number', value: '123', ... }
}
Use itt's iteration tools with Moo.
for (let [here, next] of itt(lexer).lookahead()) { // pass a number if you need more tokens
  // enjoy!
}
Before submitting an issue, remember...
The npm package moo receives a total of 7,158,657 weekly downloads. As such, moo's popularity is classified as popular. We found that moo demonstrated an unhealthy version release cadence and level of project activity, because the last version was released a year ago. It has 2 open source maintainers collaborating on the project.