Lexing vs. Parsing: lexers make only a single pass (no back-tracking); parsers have transition tables and do lookahead. The lexer can have state, but it should only make state transitions based on the current input, and not look at old input while processing new input.
More formally, lexers can parse Chomsky's Level 3 (regular) grammars, but you need a parser for Level 2 (context-free) grammars (you'll probably still want to use a lexer, but it won't be enough). Level 3 is the most basic of the levels, and a Level 3 language can be represented by a single regular expression, though that's probably not the most maintainable representation. Level 2 includes things like TeX's nested braces, e.g., \textit{You don't want to \textbf{stop at the X} and not realize that the Y should still be italicized}, or PDF's nested strings, e.g., (This is a long string (with (in parentheses) a parenthetical)). You'll use a lexer to split the raw input into a stream of tokens, but then you'll need a parser to resolve the stream of tokens into discrete representations, e.g., a tree in the TeX case:
var body = [{
  style: null,
  children: [{
    style: 'italic',
    children: [
      // double quotes, because the text itself contains an apostrophe
      "You don't want to ",
      {
        style: 'bold',
        children: ['stop at the X'],
      },
      ' and not realize that the Y should still be italicized'
    ]
  }]
}];
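One way to get from a token stream to a tree of that shape is a small recursive-descent parser. The following is only a sketch and does not use the lexing API: `STYLES`, `tokenize`, and `parse` are hypothetical names, only the two commands from the example are recognized, and characters the regular expression does not cover (such as a bare `{`) are silently dropped.

```javascript
// Hypothetical map from the two TeX commands in the example to style names.
var STYLES = {textit: 'italic', textbf: 'bold'};

// Lexer: a single regular expression splits the input into tokens:
// a command with its opening brace, a closing brace, or a run of plain text.
function tokenize(input) {
  return input.match(/\\[a-z]+\{|\}|[^\\{}]+/g) || [];
}

// Parser: recursive descent over the token stream. Each call consumes tokens
// until the "}" that closes the current group, or until the input runs out.
function parse(tokens, style) {
  var node = {style: style, children: []};
  while (tokens.length > 0) {
    var token = tokens.shift();
    if (token === '}') {
      return node;
    }
    if (token.charAt(0) === '\\') {
      // strip the leading backslash and the trailing brace: "\textit{" => "textit"
      var command = token.slice(1, -1);
      node.children.push(parse(tokens, STYLES[command] || null));
    }
    else {
      node.children.push(token);
    }
  }
  return node;
}

var tex = "\\textit{You don't want to \\textbf{stop at the X} " +
  "and not realize that the Y should still be italicized}";
var body = [parse(tokenize(tex), null)];
console.log(JSON.stringify(body, null, 2));
```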
Or just a string in the PDF string case:
"This is a long string (with (in parentheses) a parenthetical)"
The lexing.Tokenizer implementation in this module, constructed with new lexing.Tokenizer(default_rules [, state_rules]), is its most basic lexer; it represents state as a stack of strings. The constructor takes an optional second argument: an object mapping state names to lists of rules that apply only in those states. These operate like exclusive conditional states in flex, except there are no exceptions to the exclusivity, i.e., there is no <*> condition specifier. The current state is the last (top) state in the state stack. The default_rules apply only when the state stack is empty (the default).
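The selection logic described above might look roughly like the following. This is a sketch, not the library's source: `currentRules` is a hypothetical helper, and the rule lists are placeholder strings rather than real [RegExp, Function] tuples, so that only the stack behavior is on display.

```javascript
// Hypothetical helper: pick the rule list for the current state stack.
// An empty stack selects the default rules; otherwise only the rules
// registered for the top state apply (exclusive states, no <*> fallback).
function currentRules(states, default_rules, state_rules) {
  if (states.length === 0) {
    return default_rules;
  }
  var top = states[states.length - 1];
  var rules = state_rules[top];
  if (rules === undefined) {
    throw new Error('Unknown state: ' + top);
  }
  return rules;
}

// Placeholder rule lists, just to make the selection visible.
var default_rules = ['default-rule'];
var state_rules = {STRING: ['string-rule'], ESCAPE: ['escape-rule']};

var states = [];
var selected = [];
selected.push(currentRules(states, default_rules, state_rules)[0]); // empty stack
states.push('STRING');
selected.push(currentRules(states, default_rules, state_rules)[0]); // top is STRING
states.push('ESCAPE');
selected.push(currentRules(states, default_rules, state_rules)[0]); // top is ESCAPE
states.pop();
selected.push(currentRules(states, default_rules, state_rules)[0]); // back to STRING
console.log(selected);
```

Note that while a state is on the stack, the default rules are not consulted at all; a rule that should apply everywhere has to be repeated in each state's list.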
The tokenizer has one main function, tokenizer.map(string_iterable), which returns a TokenIterable. string_iterable should implement the StringIterable interface, i.e.:
interface StringIterable {
  position: number;
  size: number;
  next(length: number): string;
  peek(length: number): string;
  skip(length: number): number;
}
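To make the interface concrete, here is a minimal in-memory object with the same shape. It is a sketch under assumptions rather than the library's own reader: `SketchStringIterator` is a hypothetical name, and the interface does not say what `skip` returns, so this version assumes it returns the number of characters actually skipped.

```javascript
// Hypothetical in-memory reader with the StringIterable shape.
function SketchStringIterator(str) {
  this.str = str;
  this.position = 0;
  this.size = str.length;
}

// Return up to `length` characters without advancing the cursor.
SketchStringIterator.prototype.peek = function(length) {
  return this.str.slice(this.position, this.position + length);
};

// Return up to `length` characters and advance the cursor past them.
SketchStringIterator.prototype.next = function(length) {
  var chunk = this.peek(length);
  this.position += chunk.length;
  return chunk;
};

// Advance the cursor without returning the text.
// Assumption: the return value is the number of characters actually skipped.
SketchStringIterator.prototype.skip = function(length) {
  var start = this.position;
  this.position = Math.min(this.size, this.position + length);
  return this.position - start;
};

var reader = new SketchStringIterator('hello world');
var peeked = reader.peek(5);   // cursor stays at 0
var first = reader.next(5);    // cursor moves to 5
var skipped = reader.skip(1);  // cursor moves to 6
var rest = reader.next(100);   // reads to the end of the string
console.log(peeked, first, skipped, rest, reader.position);
```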
The following readers defined in lexing all return instances implementing the StringIterable interface:
new lexing.StringIterator(str)
lexing.StringIterator.fromBuffer(buffer, encoding)
new lexing.FileStringIterator(file_descriptor)
There are other Buffer-based readers as well:
new lexing.BufferIterator(buffer)
lexing.BufferIterator.fromString(str, encoding)
new lexing.FileBufferIterator(file_descriptor)
The TokenIterable instance returned by tokenizer.map(...) has one method: next(), which returns a non-null Token.
Every Token has a non-null name field (a string) and a value field (of any type; potentially null or undefined).
Each rule is a [RegExp, Function] tuple. When a rule's regular expression matches the input, the following happens:
1. The result of input_string.match(rule[0]) is kept as match.
2. The input is advanced by the length of the full match (match[0]).
3. The rule's function (rule[1]) is called with match, i.e., the result of input_string.match(rule[0]), with the tokenizer bound as this inside the rule's function.
If no rules in the current state match the current input, the tokenizer will throw an "Invalid language" error.
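The steps above can be sketched as a single function over a plain string. This is an illustration rather than the library's implementation: `step` and the stand-in `tokenizer` object are hypothetical, tokens are plain `{name, value}` objects, and a rule that returns null is assumed to mean "emit nothing".

```javascript
// Hypothetical single step of a rule-based tokenizer.
// Returns {token, rest}, where `rest` is the input left after the match.
function step(tokenizer, rules, input_string) {
  for (var i = 0; i < rules.length; i++) {
    var match = input_string.match(rules[i][0]);
    if (match) {
      // advance past the full match, then call the rule's function with the
      // tokenizer bound as `this`
      var rest = input_string.slice(match[0].length);
      var token = rules[i][1].call(tokenizer, match);
      return {token: token, rest: rest};
    }
  }
  throw new Error('Invalid language');
}

// Stand-in for the tokenizer instance; rules can reach it through `this`.
var tokenizer = {matches: 0};
var rules = [
  [/^\s+/, function(match) {
    return null; // ignore whitespace
  }],
  [/^\d+/, function(match) {
    this.matches++;
    return {name: 'NUMBER', value: parseInt(match[0], 10)};
  }],
  [/^\w+/, function(match) {
    this.matches++;
    return {name: 'WORD', value: match[0]};
  }],
];

var tokens = [];
var input = '42 apples';
while (input.length > 0) {
  var result = step(tokenizer, rules, input);
  if (result.token !== null) {
    tokens.push(result.token);
  }
  input = result.rest;
}
console.log(tokens, tokenizer.matches);
```

Because the first matching rule wins, the order of the list matters here just as it does in the real tokenizer: `/^\d+/` has to come before `/^\w+/`, or digits would be reported as words.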
From the shell in your project's root directory:
npm install lexing --save
In your code:
var lexing = require('lexing');
// It will use the first rule with a matching regex, so go from more specific
// to more catch-all. The ^ anchor before every regex is required!
var rules = [
  [/^$/, function(match) {
    return lexing.Token('EOF', null);
  }],
  [/^\s+/, function(match) {
    return null; // ignore whitespace
  }],
  [/^[^!"#$%&'()*+,\-./:;<=>?@[\\\]\^_`{|}~\s]+/, function(match) {
    return lexing.Token('WORD', match[0]);
  }],
  [/^./, function(match) {
    return lexing.Token('PUNCTUATION', match[0]);
  }],
];
var tokenizer = new lexing.Tokenizer(rules);
var input = new lexing.StringIterator("'It wasn't at all my fault', I cried.");
var output = tokenizer.map(input);
do {
  var token = output.next();
  console.log('token=%s => %j', token.name, token.value);
} while (token.name !== 'EOF');
You should have the Node.js type declarations in your project at type_declarations/DefinitelyTyped/node/node.d.ts.
If you have an index.d.ts reference collector at type_declarations/index.d.ts, like I do, you can configure it to pull in the lexing.d.ts declarations:
// the usual imports:
/// <reference path="DefinitelyTyped/node/node.d.ts" />
/// <reference path="DefinitelyTyped/async/async.d.ts" />
/// <reference path="DefinitelyTyped/yargs/yargs.d.ts" />
// the self-declaring packages:
/// <reference path="../node_modules/lexing/lexing.d.ts" />
You'll need the TypeScript compiler and the node.d.ts type declarations from the DefinitelyTyped repo. The following command will install / download those, compile index.ts to index.js and build lexing.d.ts from index.ts, as needed:
make
TypeScript doesn't make exporting an importable set of type declarations easy. adts was my first foray into hacking a tsc --declaration build-step together, but this module takes a different approach. There are a couple of requirements for making this work:
- The TypeScript source contains //// markers to designate module boundaries when generating the declarations. TypeScript uses /// to configure the compiler, so I use //// to confuse their compiler into giving me what I want. The declaration build step in the Makefile strips these markers.
- The project that uses lexing installs things with npm as usual, and has Node.js type declarations at type_declarations/DefinitelyTyped/node/node.d.ts. If lexing includes its own Node.js type declarations, the TypeScript compiler will complain about all kinds of duplicates / conflicts. (This is even if the two Node.js type declarations are identical -- they must be at the same filepath for TypeScript to recognize that they are the same.)
Copyright 2015 Christopher Brown. MIT Licensed.