
lexr is a lightweight tokenizer written in JavaScript, intended to be more modern and cleaner than its C ancestor.
lexr is compartmentalized so that it can work on its own; however, it is intended to be used hand in hand with grammr once that project is finished.
lexr and grammr are an effort to rethink how flex and bison traditionally work together and to move development to a more modern process.
Both projects are being developed for use in the rework of ivi, a project that aims to visualize code for teaching introductory programming.
The lexical analyzer currently has built-in support for JavaScript, with plans to extend it to other languages.
If your language is not supported, or you would simply like to use custom tokens, that is possible as well.
What is currently supported is
The entire library wraps around a Tokenizer class.
First import the library
let lexr = require('lexr');
In order to use built-in languages, initialize the tokenizer as so:
let tokenizer = new lexr.Tokenizer("Javascript");
If you would like to use fully custom tokens then simply initialize as so:
let tokenizer = new lexr.Tokenizer("");
If you have selected a built-in language you will not be able to add or remove tokens until you disable strict mode for tokens.
To do so call the disableStrict() function on the tokenizer instance.
Once you have done so, or if you are working with a fully custom tokenizer, you can add tokens in two ways:
// Add a single token
// Arguments: tokenName, RegExp pattern
tokenizer.addToken("L_PAREN", /\(/);
// Add multiple tokens
// Must be in the form of a JavaScript object
const tokens = {
    L_BRACE : /{/,
    R_BRACE : new RegExp('}'),
};
tokenizer.addTokenSet(tokens);
You can also remove pre-existing tokens if you are using a custom language or have disabled strict mode.
tokenizer.removeToken("L_PAREN");
The Tokenizer can ignore whitespace and newlines; call the corresponding methods on the instance.
Example:
let x = new lexr.Tokenizer("Json");
x.ignoreWhiteSpace();
x.ignoreNewLine();
If you would like to run functions when tokens are recognized, you can add them as a set or individually by calling the corresponding method.
// Add functions through set
let funs = {
    WHITESPACE : function() { whitespaceCount += 1 },
    IDENTIFIER : function() { idNum += 1 }
}
tokenizer.addFunctionSet(funs);
// Add single function
tokenizer.addFunction("NEW_LINE", function() { lineCount += 1 });
// Remove function
tokenizer.removeFunction("IDENTIFIER");
lexr differs slightly from flex in how functions are handled.
Your functions should not use any information taken from the current token, since you have access to that information after tokenization.
This keeps the functions that are executed smaller and cleaner.
An example of code that would go with the functions is as follows:
// Counters the functions update; they must exist before tokenize() runs
let whitespaceCount = 0;
let lineCount = 0;
let funs = {
    WHITESPACE : function() { whitespaceCount += 1 },
    NEW_LINE : function() { lineCount += 1 }
}
tokenizer.addFunctionSet(funs);
let input = `var a = 4;
var b = 3;`;
tokenizer.tokenize(input);
Because the functions are stored in an object inside the tokenizer, scoping can get tricky.
You can use the example above; however, a suggested usage is to make a functions.js in order to:
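A sketch of what such a functions.js might contain, under stated assumptions: the counters and the handler set are created together so every handler closes over the same state. The names `createCounters`, `counts`, and `funs` are hypothetical; only the shape of the handler set (token name mapped to a zero-argument function) comes from the examples above.

```javascript
// Sketch of a possible functions.js (all names here are illustrative).
// The counters live in the same closure as the handlers, so nothing
// depends on variables declared elsewhere in the calling file.
function createCounters() {
    const counts = { whitespace: 0, newLines: 0, identifiers: 0 };

    const funs = {
        WHITESPACE: function () { counts.whitespace += 1; },
        NEW_LINE: function () { counts.newLines += 1; },
        IDENTIFIER: function () { counts.identifiers += 1; },
    };

    return { counts, funs };
}

const counters = createCounters();

// With lexr, the set would be handed over via
// tokenizer.addFunctionSet(counters.funs).
// Here the handlers are called directly to show that they share state.
counters.funs.WHITESPACE();
counters.funs.WHITESPACE();
counters.funs.NEW_LINE();

console.log(counters.counts);
```

Keeping the state next to the handlers means the file that calls `tokenize` only reads `counters.counts` afterwards and never has to declare the counters itself.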
Instead of using yytext within your functions, the suggested usage is to analyze the output after tokenization.
An example of grabbing all identifier names and inserting them into, say, a symbol table would look like:
let input = `var a = 4;
var b = 3;`;
let output = tokenizer.tokenize(input);
let symbolTable = {};
for (let i = 0; i < output.length; i++) {
    if (output[i].token === "IDENTIFIER") {
        symbolTable[output[i].value] = undefined;
    }
}
By default, the error token name used when an unmatched token is detected is ERROR; however, if you would like to change the name, you can do so by calling setErrTok:
tokenizer.setErrTok("DIFF_ERROR");
You can also keep certain tokens from appearing in the output, either by calling addIgnore:
tokenizer.addIgnore("WHITESPACE");
Or by adding an entire set through an array or an object:
let ignore = ["WHITESPACE", "VAR"];
tokenizer.addIgnoreSet(ignore);
// Or through an object which allows true or false
let ignore2 = {
    "WHITESPACE" : true,
    "VAR" : false,
};
tokenizer.addIgnoreSet(ignore2);
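To make the difference between the two forms concrete, here is a small sketch that does not use lexr at all. It assumes, from the description above, that every name in the array form is ignored and that the object form ignores a token only when its flag is true. The helper `toIgnoreMap` and the sample token list are hypothetical and are not part of lexr's API.

```javascript
// Hypothetical helper: accept either form of ignore set described above
// and return a plain { TOKEN_NAME: boolean } map.
function toIgnoreMap(ignoreSet) {
    if (Array.isArray(ignoreSet)) {
        // Array form: every listed token is ignored.
        return Object.fromEntries(ignoreSet.map((name) => [name, true]));
    }
    // Object form: copy the flags, coercing them to booleans.
    return Object.fromEntries(
        Object.entries(ignoreSet).map(([name, flag]) => [name, Boolean(flag)])
    );
}

// Sample token list in the { token, value } shape used by the examples.
const sampleTokens = [
    { token: "VAR", value: "var" },
    { token: "WHITESPACE", value: " " },
    { token: "IDENTIFIER", value: "a" },
    { token: "SEMI_COLON", value: ";" },
];

const fromArray = toIgnoreMap(["WHITESPACE", "VAR"]);
const fromObject = toIgnoreMap({ WHITESPACE: true, VAR: false });

// Names of the tokens that would remain visible under a given ignore map.
const keep = (map) =>
    sampleTokens.filter((t) => !map[t.token]).map((t) => t.token);

console.log(keep(fromArray).join(", "));  // IDENTIFIER, SEMI_COLON
console.log(keep(fromObject).join(", ")); // VAR, IDENTIFIER, SEMI_COLON
```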
If you would like to unIgnore tokens programmatically, just call the unIgnore method:
tokenizer.unIgnore("WHITESPACE");
There are options to make your output more verbose by adding a customOut field to the output object.
As with the other operations, you can add a set of custom outputs or a single one, and you can remove them again.
// Add a set of custom outputs
let customOut = {
    "WHITESPACE" : 2,
    "VAR" : 'declaration',
}
tokenizer.addCustomOutSet(customOut);
// Add a single custom output
tokenizer.addCustomOut("SEMI_COLON", 111);
// Remove a custom out
tokenizer.removeCustomOut("VAR");
A sample output object would then look like:
{ token: 'WHITESPACE', value: ' ', customOut: 2 }
Lastly, to tokenize your input code, simply call the tokenizer's tokenize method.
let output = tokenizer.tokenize(aString);
In its current form, the output of tokenize(aString) is a list of objects, each with two properties:
token, the name of the token that was captured, and value, the text that determined the token.
let lexr = require('lexr');
let tokenizer = new lexr.Tokenizer("Javascript");
let input = "var a = null;";
let output = tokenizer.tokenize(input);
console.log(output);
Output would then be
[ { token: 'VAR', value: 'var' },
{ token: 'WHITESPACE', value: ' ' },
{ token: 'IDENTIFIER', value: 'a' },
{ token: 'WHITESPACE', value: ' ' },
{ token: 'ASSIGN', value: '=' },
{ token: 'WHITESPACE', value: ' ' },
{ token: 'NULL_LIT', value: 'null' },
{ token: 'SEMI_COLON', value: ';' } ]
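Because the result is a plain array of objects, ordinary array methods are enough for post-processing. The sketch below works on a literal copy of the output shown above, so it runs without lexr. The variable names are illustrative, and the round trip back to the input holds here only because no tokens were ignored.

```javascript
// Literal copy of the output from the example above.
const output = [
    { token: 'VAR', value: 'var' },
    { token: 'WHITESPACE', value: ' ' },
    { token: 'IDENTIFIER', value: 'a' },
    { token: 'WHITESPACE', value: ' ' },
    { token: 'ASSIGN', value: '=' },
    { token: 'WHITESPACE', value: ' ' },
    { token: 'NULL_LIT', value: 'null' },
    { token: 'SEMI_COLON', value: ';' },
];

// Joining every value gives back the original input for this sample.
const source = output.map((t) => t.value).join("");

// Drop whitespace before handing the tokens to a later stage.
const significant = output.filter((t) => t.token !== "WHITESPACE");

// Count how often each token name occurs.
const counts = {};
for (const { token } of output) {
    counts[token] = (counts[token] || 0) + 1;
}

console.log(source);             // var a = null;
console.log(significant.length); // 5
console.log(counts.WHITESPACE);  // 3
```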
let lexr = require('lexr');
let tokenizer = new lexr.Tokenizer("");
tokenizer.addToken("PLUS", /\+/);
tokenizer.setErrTok("DIFF_ERROR");
let input = "5+5;";
let output = tokenizer.tokenize(input);
console.log(output);
Output would then be
[ { token: 'DIFF_ERROR', value: '5' },
{ token: 'PLUS', value: '+' },
{ token: 'DIFF_ERROR', value: '5;' } ]
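The output above shows that consecutive unmatched characters (the trailing 5;) come back as a single error token. The following is a minimal sketch of one way such behavior could be produced. It is not lexr's actual source: the function name `sketchTokenize`, the first-match-wins ordering, and the use of sticky regular expressions are all assumptions made for illustration.

```javascript
// Hypothetical stand-in for a tokenizer, written only to illustrate how
// unmatched input can be grouped into error tokens. Not lexr's code.
function sketchTokenize(input, tokens, errTok = "ERROR") {
    const output = [];
    let pos = 0;
    let pendingError = ""; // unmatched characters collected so far

    // Emit the collected unmatched characters as one error token.
    const flushError = () => {
        if (pendingError !== "") {
            output.push({ token: errTok, value: pendingError });
            pendingError = "";
        }
    };

    while (pos < input.length) {
        let matched = false;
        for (const [name, pattern] of Object.entries(tokens)) {
            // Sticky copy so the pattern can only match exactly at `pos`.
            const flags = pattern.flags.replace(/[gy]/g, "") + "y";
            const sticky = new RegExp(pattern.source, flags);
            sticky.lastIndex = pos;
            const m = sticky.exec(input);
            if (m !== null && m[0].length > 0) {
                flushError();
                output.push({ token: name, value: m[0] });
                pos += m[0].length;
                matched = true;
                break;
            }
        }
        if (!matched) {
            // No pattern matched here: extend the current error run.
            pendingError += input[pos];
            pos += 1;
        }
    }
    flushError();
    return output;
}

const result = sketchTokenize("5+5;", { PLUS: /\+/ }, "DIFF_ERROR");
console.log(result);
```

For the input 5+5; this sketch yields the same three tokens as the example above: an error token for 5, PLUS for +, and one error token for 5;.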
If you are not using built-in languages, I suggest separating each part of the tokenization into its own file.
If you are using a complex language where the regexes can become very large, build those regexes up in a separate file and export only the final regex to your token object.
Since the Tokenizer can take in sets of information, it is easiest to separate everything and use exports between files.
+-- src/
|   +-- index.js
|   +-- functions/
|   |   +-- functions.js
|   +-- tokens/
|   |   +-- tokens.js
|   |   +-- regexPatterns.js
|   +-- ignore/
|   |   +-- ignoreTokens.js
|   +-- customOut/
|   |   +-- customOutput.js
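As a rough illustration of the regexPatterns.js and tokens.js split, the sketch below builds one larger pattern from small named pieces and exposes only the final regex to the token object. It is shown as a single file so it runs on its own. The helpers `concat` and `optional`, the piece names, and the NUMBER token are hypothetical and are not part of lexr.

```javascript
// --- regexPatterns.js (sketch) ---
// Small named pieces, combined once into the final pattern.
const DIGITS = /[0-9]+/;
const FRACTION = /\.[0-9]+/;
const EXPONENT = /[eE][+-]?[0-9]+/;

// Hypothetical helper: join the sources of several RegExp pieces in order.
function concat(...parts) {
    return new RegExp(parts.map((p) => p.source).join(""));
}

// Hypothetical helper: make a piece optional via a non-capturing group.
function optional(part) {
    return new RegExp("(?:" + part.source + ")?");
}

const NUMBER = concat(DIGITS, optional(FRACTION), optional(EXPONENT));

// --- tokens.js (sketch) ---
// Only the finished regex appears in the token object, which could then
// be passed to tokenizer.addTokenSet(tokens).
const tokens = {
    NUMBER: NUMBER,
    PLUS: /\+/,
    SEMI_COLON: /;/,
};

console.log(tokens.NUMBER.source);
```

In a real project each section would live in its own file and be shared through exports, as the layout above suggests.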
If there are any good features missing, feel free to open an issue for a feature request.