# @adguard/css-tokenizer
This library provides two distinct CSS tokenizers: a standard CSS tokenizer and an Extended CSS tokenizer that also supports special pseudo-classes such as `:contains()` and `:xpath()`.
You can install the library using one of the following commands:

```shell
yarn add @adguard/css-tokenizer
# or
npm install @adguard/css-tokenizer
# or
pnpm add @adguard/css-tokenizer
```
To appreciate the necessity for a custom tokenizer, it's essential to understand the concept of Extended CSS, recognize the challenges it poses, and discover how we can effectively address these issues.
Extended CSS is a superset of CSS used by adblockers to provide more robust filtering capabilities. In practical terms, Extended CSS introduces additional pseudo-classes that are not defined in the CSS specification.
A standard CSS tokenizer cannot handle Extended CSS pseudo-classes in every case. For example, the `:contains()` pseudo-class can have the following syntax:

```css
div:contains(i'm a parameter)
```

A standard CSS tokenizer interprets the single quotation mark (`'`) as a string delimiter; since there is no closing quote, the string token swallows the rest of the input, including the closing `)` character, and the selector fails to parse.
The `:xpath()` pseudo-class poses a similar challenge for a standard CSS tokenizer, as it can have syntax like this:

```css
div:xpath(//*...)
```

A standard tokenizer mistakenly identifies the `/*` sequence as the start of a comment, leading to incorrect parsing; however, here the `/*` sequence is part of the XPath expression.
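To make the first failure mode concrete, here is a deliberately simplified sketch of the string-consuming step a spec-compliant tokenizer runs once it sees a quote character. All names here are illustrative; this is not the library's code.

```javascript
// Simplified sketch of how a spec-compliant tokenizer consumes a string
// token after encountering a quote character. Illustrative only.
function consumeString(source, start) {
  const quote = source[start];
  for (let i = start + 1; i < source.length; i += 1) {
    if (source[i] === quote) {
      return { value: source.slice(start + 1, i), end: i + 1, error: null };
    }
  }
  // Per CSS Syntax Level 3, an unterminated string is a parse error.
  return {
    value: source.slice(start + 1),
    end: source.length,
    error: 'unterminated string',
  };
}

// In `div:contains(i'm a parameter)` the quote after `i` starts a string
// that swallows the closing `)` and runs to the end of the input.
const selector = "div:contains(i'm a parameter)";
const result = consumeString(selector, selector.indexOf("'"));
console.log(result.error); // 'unterminated string'
console.log(result.value); // "m a parameter)" — the `)` was consumed
```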
We've designed the standard CSS tokenizer to rigorously adhere to the CSS Syntax Level 3 specification. However, we've also introduced the ability to handle certain pseudo-classes in a custom manner, akin to how the `<url-token>` is managed in the CSS specs. When the tokenizer encounters a function token (pattern: `function-name(`), it looks up a handler in the `functionHandlers` map by function name and calls the custom handler if one exists.
The custom handler receives a single argument: the shared tokenizer context object, which can be used to manage the function, similar to how other tokens are handled in the library.
This approach allows us to maintain a native, specification-compliant CSS tokenizer with minimal overhead while also providing the flexibility to manage special pseudo-classes in a custom way.
In essence, the Extended CSS tokenizer is a standard CSS tokenizer with custom function handlers for special pseudo-classes.
It's important to emphasize that our implementation sticks to the token types specified in the CSS Syntax Level 3 standard: we do not introduce new token types. Preserving the standard token set keeps the tokenizer compatible and consistent with other CSS-related tools and workflows.
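The dispatch idea described above can be sketched with a toy tokenizer. Everything below is illustrative and not the library's internals; a real tokenizer emits proper CSS tokens, while this sketch only shows how a handler registered for a function name can take over consumption of the function's argument.

```javascript
// Toy sketch of function-handler dispatch: when `name(` is found, a handler
// registered under `name` decides where the raw argument ends. Names are
// illustrative, not the library's API.
function tokenizeFunctions(source, handlers) {
  const tokens = [];
  const re = /([a-zA-Z-]+)\(/g; // matches `function-name(`
  let match;
  while ((match = re.exec(source)) !== null) {
    const name = match[1];
    const argStart = re.lastIndex;
    const handler = handlers.get(name);
    if (handler) {
      // The custom handler consumes the argument as raw text.
      const argEnd = handler(source, argStart);
      tokens.push({ type: 'function', name, arg: source.slice(argStart, argEnd) });
      re.lastIndex = argEnd;
    }
  }
  return tokens;
}

// Handler for `contains(...)`: treat everything up to the last `)` as raw
// text, so quotes inside the argument cannot break tokenization.
const handlers = new Map([
  ['contains', (source) => source.lastIndexOf(')')],
]);

const tokens = tokenizeFunctions("div:contains(i'm a parameter)", handlers);
console.log(tokens[0].arg); // "i'm a parameter"
```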
Here's a straightforward example of how to use the library:

```js
// `tokenize` is a regular CSS tokenizer (and doesn't support Extended CSS),
// `tokenizeExtended` is an Extended CSS tokenizer
const { tokenize, tokenizeExtended, getFormattedTokenName } = require('@adguard/css-tokenizer');

// Input to tokenize
const CSS_SOURCE = `div:contains(aa'bb) { display: none !important; }`;

const COLUMNS = Object.freeze({
    TOKEN: 'Token',
    START: 'Start',
    END: 'End',
    FRAGMENT: 'Fragment',
});

// Prepare the data array
const data = [];

// Tokenize the input - feel free to try both `tokenize` and `tokenizeExtended`
tokenizeExtended(CSS_SOURCE, (token, start, end) => {
    data.push({
        [COLUMNS.TOKEN]: getFormattedTokenName(token),
        [COLUMNS.START]: start,
        [COLUMNS.END]: end,
        [COLUMNS.FRAGMENT]: CSS_SOURCE.substring(start, end),
    });
});

// Print the tokenization result as a table
console.table(data, Object.values(COLUMNS));
```
### tokenize

```ts
/**
 * CSS tokenizer function
 *
 * @param source Source code to tokenize
 * @param onToken Tokenizer callback which is called for each token found in the source code
 * @param onError Error callback which is called when a parsing error is found (optional)
 * @param functionHandlers Custom function handlers (optional)
 */
function tokenize(
    source: string,
    onToken: OnTokenCallback,
    onError: OnErrorCallback = () => {},
    functionHandlers?: Map<number, TokenizerContextFunction>,
): void;
```
where
```ts
/**
 * Callback which is called when a token is found
 *
 * @param type Token type
 * @param start Token start offset
 * @param end Token end offset
 * @param props Other token properties (if any)
 * @param stop Function to halt the tokenization process
 * @note Hash tokens have a type flag set to either "id" or "unrestricted". The type flag defaults to "unrestricted" if
 * not otherwise set.
 */
type OnTokenCallback = (
    type: TokenType,
    start: number,
    end: number,
    props: Record<string, unknown> | undefined,
    stop: () => void
) => void;
```
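The `stop` parameter lets the callback halt tokenization early. The mechanism can be illustrated with a toy tokenizer that checks a flag after each callback; this is a sketch of the pattern, not the library's implementation.

```javascript
// Toy illustration of a `stop` parameter in an OnTokenCallback-style API:
// the tokenizer hands the callback a function that sets a flag, and checks
// that flag before emitting the next token.
function toyTokenize(items, onToken) {
  let stopped = false;
  const stop = () => { stopped = true; };
  for (let i = 0; i < items.length && !stopped; i += 1) {
    // (type, start, end, props, stop) mirrors the callback shape above
    onToken(items[i], i, i + 1, undefined, stop);
  }
}

const seen = [];
toyTokenize(['a', 'b', 'c'], (type, start, end, props, stop) => {
  seen.push(type);
  if (type === 'b') stop(); // halt tokenization early
});
console.log(seen); // ['a', 'b']
```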
```ts
/**
 * Callback which is called when a parsing error is found. According to the spec, parsing errors are not fatal and
 * therefore the tokenizer is quite permissive, but if needed, the error callback can be used.
 *
 * @param message Error message
 * @param start Error start offset
 * @param end Error end offset
 * @see {@link https://www.w3.org/TR/css-syntax-3/#error-handling}
 */
type OnErrorCallback = (message: string, start: number, end: number) => void;
```
```ts
/**
 * Function handler
 *
 * @param context Reference to the tokenizer context instance
 * @param ...args Additional arguments (if any)
 */
type TokenizerContextFunction = (context: TokenizerContext, ...args: any[]) => void;
```
### tokenizeExtended

`tokenizeExtended` is an extended version of the `tokenize` function that supports custom function handlers. This function is designed to handle special pseudo-classes like `:contains()` and `:xpath()`.
```ts
/**
 * Extended CSS tokenizer function
 *
 * @param source Source code to tokenize
 * @param onToken Tokenizer callback which is called for each token found in the source code
 * @param onError Error callback which is called when a parsing error is found (optional)
 * @param functionHandlers Custom function handlers (optional)
 * @note If you specify custom function handlers, they will be merged with the default function handlers. If you
 * duplicate a function handler, the custom one will be used instead of the default one, so you can override the default
 * function handlers this way, if you want to.
 */
function tokenizeExtended(
    source: string,
    onToken: OnTokenCallback,
    onError: OnErrorCallback = () => {},
    functionHandlers: Map<number, TokenizerContextFunction> = new Map(),
): void;
```
### hasToken

```ts
/**
 * Checks if the given raw string contains any of the specified tokens.
 *
 * @param raw - The raw string to be tokenized and checked.
 * @param tokens - A set of token types to check for in the raw string.
 * @param tokenizer - The tokenizer function to use. Defaults to `tokenizeExtended`.
 *
 * @example hasToken('div:contains("foo")', new Set([TokenType.Function]), tokenizeExtended); // true
 *
 * @returns `true` if any of the specified tokens are found in the raw string, otherwise `false`.
 */
function hasToken(
    raw: string,
    tokens: Set<TokenType>,
    tokenizer: TokenizerFunction = tokenizeExtended,
): boolean;
```
### TokenizerContext

A class that represents the tokenizer context. It is used to manage the tokenizer state and provides access to the source code, the current position, and other relevant information.
### decodeIdent

```ts
/**
 * Decodes a CSS identifier according to the CSS Syntax Module Level 3 specification.
 *
 * @param ident CSS identifier to decode.
 *
 * @example
 * ```ts
 * decodeIdent(String.raw`\00075\00072\0006C`); // 'url'
 * decodeIdent('url'); // 'url'
 * ```
 *
 * @returns Decoded CSS identifier.
 */
function decodeIdent(ident: string): string;
```
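To show what such decoding involves, here is a simplified re-implementation that handles only hexadecimal escape sequences. The real `decodeIdent` follows the full CSS Syntax Level 3 escaping rules; `decodeIdentSketch` is a name invented for this illustration.

```javascript
// Simplified sketch of CSS identifier decoding: handles only hexadecimal
// escapes (`\` followed by 1-6 hex digits and an optional trailing
// whitespace terminator). Not the library's implementation.
function decodeIdentSketch(ident) {
  return ident.replace(/\\([0-9a-fA-F]{1,6})\s?/g, (_, hex) =>
    String.fromCodePoint(parseInt(hex, 16)),
  );
}

console.log(decodeIdentSketch(String.raw`\00075\00072\0006C`)); // 'url'
console.log(decodeIdentSketch('url')); // 'url' (no escapes, unchanged)
```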
### CSS_TOKENIZER_VERSION

```ts
/**
 * @adguard/css-tokenizer version
 */
const CSS_TOKENIZER_VERSION: string;
```
### TokenType

An enumeration of token types recognized by the tokenizer. They are strictly based on the CSS Syntax Level 3 specification. See https://www.w3.org/TR/css-syntax-3/#tokenization for more details.
### getBaseTokenName

```ts
/**
 * Get base token name by token type
 *
 * @param type Token type
 *
 * @example
 * ```ts
 * getBaseTokenName(TokenType.Ident); // 'ident'
 * getBaseTokenName(-1); // 'unknown'
 * ```
 *
 * @returns Base token name or 'unknown' if token type is unknown
 */
function getBaseTokenName(type: TokenType): string;
```
### getFormattedTokenName

```ts
/**
 * Get formatted token name by token type
 *
 * @param type Token type
 *
 * @example
 * ```ts
 * getFormattedTokenName(TokenType.Ident); // '<ident-token>'
 * getFormattedTokenName(-1); // '<unknown-token>'
 * ```
 *
 * @returns Formatted token name or `'<unknown-token>'` if token type is unknown
 */
function getFormattedTokenName(type: TokenType): string;
```
> [!NOTE]
> Our API and token list are also compatible with CSSTree's tokenizer API, and in the long term we plan to integrate this library into CSSTree via our ECSSTree library; see this issue for more details.
You can find the benchmark results in the benchmark/RESULTS.md file.
If you have any questions or ideas for new features, please open an issue or a discussion. We will be happy to discuss it with you.
This project is licensed under the MIT license. See the LICENSE file for details.