Research
Security News
Quasar RAT Disguised as an npm Package for Detecting Vulnerabilities in Ethereum Smart Contracts
Socket researchers uncover a malicious npm package posing as a tool for detecting vulnerabilities in Etherium smart contracts.
Lexing vs. Parsing: lexers make only a single pass (no back-tracking); parsers have transition tables and do lookahead. The lexer can have state, but it should only make state transitions based on the current input, and not look at old input while processing new input.
More formally, lexers can parse Chomsky's Level 3 grammars, but you need a parser for more Level 2 grammars (you'll probably still want to use a lexer, but it won't be enough). Level 3 is the most basic of the levels, and a Level 3 language can be represented by a single regular expression, though that's probably not the most maintainable representation. Level 2 includes things like TeX's nested braces, e.g., \textit{You don't want to \textbf{stop at the X} and not realize that the Y should still be italicized}
, or PDF's nested strings, e.g., (This is a long string (with (in parentheses) a parenthetical))
. You'll use a lexer to split the raw input into a stream of tokens, but then you'll need a parser to resolve the stream of tokens into discrete representations, e.g., a tree in the TeX case:
var body = [{
style: null,
children: [{
style: 'italic',
children: [
'You don't want to ',
{
style: 'bold',
children: ['stop at the X'],
},
' and not realize that the Y should still be italicized'
]
}]
}]
Or just a string in the PDF string case:
"This is a long string (with (in parentheses) a parenthetical)"
The new lexing.Tokenizer(default_rules [, state_rules])
implementation provided in this module is the most basic lexer provided, representing state as a stack of strings. The lexing.Tokenizer
constructor takes an optional second argument: an object mapping state names to lists of rules that apply only in those states. These operate like exclusive conditional states in flex
, except there are no exceptions to the exclusivity, i.e., there is no <*>
condition specifier. The current state is the last (top) state in the state stack. The default_rules
rules apply only when the state stack is empty (the default).
The tokenizer has one main function, tokenizer.map(string_iterable)
, which returns a TokenIterable
. string_iterable
should implement the StringIterable
interface, i.e.:
interface StringIterable {
position: number;
size: number;
next(length: number): string;
peek(length: number): string;
skip(length: number): number;
}
The following readers defined in lexing
all return instances implementing the StringIterable
interface:
new lexing.StringIterable(str)
lexing.StringIterable.fromBuffer(buffer, encoding)
new lexing.FileStringIterator(file_descriptor)
There are other Buffer-based readers as well:
new lexing.BufferIterator(buffer)
lexing.BufferIterator.fromString(str, encoding)
new lexing.FileBufferIterator(file_descriptor)
The TokenIterable
instance returned by tokenizer.map(...)
has one method: next()
, which returns a non-null Token
.
Every Token
has a non-null name
field (a string) and a value
field (of any type; potentially null or undefined).
Each rule is a [RegExp, Function]
tuple. When a rule's regular expression matches the input, the following happens:
input_string.match(rule[0])
as match
.match[0]
).input_string.match(rule[0])
, with the tokenizer bound as this
inside the rule's function.If no rules in the current state match the current input, the tokenizer will throw an "Invalid language" error.
From the shell in your project's root directory:
npm install lexing --save
In your code:
var lexing = require('lexing');
// It will use the first rule with a matching regex, so go from more specific
// to more catch-all. The ^ anchor before every regex is required!
var rules = [
[/^$/, function(match) {
return lexing.Token('EOF', null);
}],
[/^\s+/, function(match) {
return null; // ignore whitespace
}],
[/^[^!"#$%&'()*+,\-./:;<=>?@[\\\]\^_`{|}~\s]+/, function(match) {
return lexing.Token('WORD', match[0]);
}],
[/^./, function(match) {
return lexing.Token('PUNCTUATION', match[0]);
}],
];
var tokenizer = new lexing.Tokenizer(rules);
var input = new lexing.StringIterator("'It wasn't at all my fault', I cried.");
var output = tokenizer.map(input);
do {
var token = output.next();
console.log('token=%s => %j', token.name, token.value);
} while (token.name !== 'EOF');
You should have the Node.js type declarations in your project at type_declarations/DefinitelyTyped/node/node.d.ts
.
If you have an index.d.ts
reference collector at type_declarations/index.d.ts
, like I do, you can configure it so to pull in the lexing.d.ts
declarations:
// the usual imports:
/// <reference path="DefinitelyTyped/node/node.d.ts" />
/// <reference path="DefinitelyTyped/async/async.d.ts" />
/// <reference path="DefinitelyTyped/yargs/yargs.d.ts" />
// the self-declaring packages:
/// <reference path="../node_modules/lexing/lexing.d.ts" />
You'll need the TypeScript compiler and the node.d.ts
type declarations from the DefinitelyTyped
repo. The following command will install / download those, compile index.ts
to index.js
and build lexing.d.ts
from index.ts
, as needed:
make
TypeScript doesn't make exporting an importable set of type declarations easy. adts was my first foray into hacking a tsc --declaration
build-step together, but this module takes a different approach. There are a couple requirements for making this work:
////
markers in the TypeScript source to designate module boundaries when generating the declarations. TypeScript uses ///
to configure the compiler, so I use ////
to confuse their compiler into giving me what I want. The declaration build step in the Makefile strips these markers.lexing
installs things with npm
as usual, and has Node.js type declarations at type_declarations/DefinitelyTyped/node/node.d.ts
. If lexing
includes its own Node.js type declarations, the TypeScript compiler will complain about all kinds of duplicates / conflicts. (This is even if the two Node.js type declarations are identical -- they must be at the same filepath for TypeScript to recognize that they are the same.)Copyright 2015 Christopher Brown. MIT Licensed.
FAQs
Regex-based lexer
The npm package lexing receives a total of 722 weekly downloads. As such, lexing popularity was classified as not popular.
We found that lexing demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Research
Security News
Socket researchers uncover a malicious npm package posing as a tool for detecting vulnerabilities in Etherium smart contracts.
Security News
Research
A supply chain attack on Rspack's npm packages injected cryptomining malware, potentially impacting thousands of developers.
Research
Security News
Socket researchers discovered a malware campaign on npm delivering the Skuld infostealer via typosquatted packages, exposing sensitive data.