Selector Schema codegen
Introduction
ssc-gen is a Python-based DSL for describing parsers of HTML documents; a schema written in it is translated into a standalone parsing module.
For a better experience using this library, you should know:
- CSS selectors (CSS3 standard at minimum) and XPath
- regular expressions (PCRE)
The project addresses the following problems:
- designed for parsers of SSR (server-side rendered) HTML pages, NOT for REST API or GraphQL endpoints
- reduces boilerplate code
- generates standalone modules that can be reused independently of the project
- generates docstring documentation and the signature of the parser output
- for a better IDE experience, generates typedefs and type annotations (if the target programming language supports them)
- supports annotating and parsing JSON-like strings from a document
- provides an AST-based codegen API for developing new converters
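To illustrate the typed output mentioned in the list above: for the Python targets the generated annotations are TypedDict-based, so a result is a plain dict that can be adapted to any structure your application prefers. The sketch below is not ssc-gen output; the names HelloWorldResult, Page and to_page are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, TypedDict


class HelloWorldResult(TypedDict):
    """Hypothetical result shape; names generated by ssc-gen may differ."""

    title: str
    a_hrefs: List[str]


@dataclass(frozen=True)
class Page:
    """Application-side structure that the dict is adapted to."""

    title: str
    links: List[str]


def to_page(result: HelloWorldResult) -> Page:
    # A TypedDict is an ordinary dict at runtime, so plain key access works.
    return Page(title=result["title"], links=list(result["a_hrefs"]))


sample: HelloWorldResult = {"title": "Example Domain", "a_hrefs": ["/about"]}
page = to_page(sample)
```

The same adapter pattern should carry over to attrs, pydantic or msgspec models, since only the to_page function depends on the target structure.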
Supported converters
Currently supported converters:
| Language | HTML parser backend | XPath | CSS3 | CSS4 | Generated annotations, types, structs | Formatter dependency |
|---|---|---|---|---|---|---|
| Python (3.8-3.13) | bs4, lxml (typing_extensions if py < 3.10) | N | Y | Y | TypedDict (1), list, dict | ruff |
| ... | parsel (typing_extensions if py < 3.10) | Y | Y | N | ... | ... |
| ... | selectolax (lexbor) (typing_extensions if py < 3.10) | N | Y | N | ... | ... |
| ... | lxml (typing_extensions if py < 3.10) | Y | Y | N | ... | ... |
| js (ES6) (2) | pure (firefox/chrome extension/nodejs) | Y | Y | Y | JSDoc | prettier |
| go (1.10+) (UNSTABLE) | goquery, gjson (4) | N | Y | N | struct (+json anchors), array, map | gofmt |
| lua (5.2+), luajit (2+) (UNSTABLE) (5) | lua-htmlparser, lrexlib (opt), dkjson | N | Y | N | EmmyLua | LuaFormatter |
- CSS3 means support for the following selectors:
  - basic: (tag, .class, #id, tag1,tag2)
  - combined: (div p, ul > li, h2 +p, title ~head)
  - attribute: (a[href], input[type='text'], a[href*='...'], ...)
  - CSS3 pseudo-classes: (:nth-child(n), :first-child, :last-child)
- CSS4 means support for the following selectors: :nth-of-type(), :where(), :is(), :not(), etc.
- (1) This annotation type was deliberately chosen as a compromise:
  - Python has many serialization options: namedtuple, dataclass, attrs, pydantic, msgspec, etc.
  - TypedDict behaves like the built-in dict but adds IDE and linter hint support, and you can easily implement an adapter for the structure you need.
- (2) ES2018 (ES9) or newer is required if you need the PCRE re.S (re.DOTALL) flag, because the JavaScript s (dotAll) regex flag was introduced in that edition.
- (3) The js output has no built-in serialization methods; it uses the standard Array and Map types. The focus is on documenting the signature!
- (4) The golang converter has not been tested much; there may be issues.
- Formatter dependency: an optional dependency used to prettify the generated code and fix its code style.
- (5) lua:
  - Experimental research PoC; performance and stability are not guaranteed.
  - Priority is on generating pure Lua without C library dependencies, using mva/htmlparser and dhkolf/dkjson.
  - Unsupported CSS3 selectors are translated into equivalent function calls:
    - for example, div +p is equivalent to CssExt.combine_plus(root:select("div"), "p")
  - PCRE regexes are translated to Lua string pattern matching (with restrictions); see lua_re_compat.py for more information.
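The selector-to-function-call translation described for the lua target can be sketched in a few lines. This is only an illustration of the idea, not the converter's actual implementation: the function translate_adjacent and its regex are hypothetical, and only the output string for div +p is taken from the example above.

```python
import re

# Matches exactly "<left> + <right>" where both sides are simple selectors.
# Rough textual check: a "+" inside an attribute value is not handled.
_ADJACENT = re.compile(r"^\s*([^\s+>~,]+)\s*\+\s*([^\s+>~,]+)\s*$")


def translate_adjacent(selector: str, root: str = "root") -> str:
    """Rewrite 'A + B' as a Lua call string; pass other selectors to select()."""
    match = _ADJACENT.match(selector)
    if match is None:
        # Assumed fallback: the backend handles the selector directly.
        return f'{root}:select("{selector.strip()}")'
    left, right = match.groups()
    return f'CssExt.combine_plus({root}:select("{left}"), "{right}")'


lua_call = translate_adjacent("div +p")
```

A real converter would work on a parsed selector rather than on raw text, but the shape of the result is the same: the part the backend supports stays inside select(), and the unsupported combinator becomes a helper call.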
Limitations
For maximum portability of the configuration to the target language:
- If possible, use CSS selectors: they are guaranteed to be convertible to XPath.
- Unlike JavaScript, most HTML parser libraries implement the CSS3 selectors standard, and some implement it only partially!
Check the HTML parser library's documentation on CSS selectors before writing code. Examples:
  - Some HTML parser libraries may not support the attribute selectors *=, ~=, |=, ^=, $=
  - Several libraries do not support pseudo-classes (e.g. the standard dart.html library lacks this feature).
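One way to act on the limitations above is to flag risky selectors before committing them to a schema. The helper below is a rough, hypothetical sketch (portability_warnings is not part of ssc-gen); it is a textual check, not a CSS parser, so a colon inside an attribute value can produce a false positive.

```python
import re
from typing import List

# Attribute operators listed above as possibly unsupported: *=, ~=, |=, ^=, $=
_ATTR_OPS = re.compile(r"\[[^\]=]*?([*~|^$])=")
# Pseudo-classes such as :first-child; "::" pseudo-elements are skipped.
_PSEUDO = re.compile(r"(?<!:):([a-z][a-z-]*)")


def portability_warnings(selector: str) -> List[str]:
    """Return human-readable warnings for features some backends may lack."""
    warnings = []
    for op in _ATTR_OPS.findall(selector):
        warnings.append(f"attribute operator {op}= may be unsupported")
    for name in _PSEUDO.findall(selector):
        warnings.append(f"pseudo-class :{name} may be unsupported")
    return warnings


checked = portability_warnings("a[href*='shop']")
```

An empty list means the selector uses only tags, classes, ids, plain attribute tests and combinators, which is the most portable subset.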
Getting started
ssc_gen requires Python 3.10 or higher
Install
pip:
pip install ssc_codegen
uv:
uv pip install ssc_codegen
Example
Create a file schema.py with:
from ssc_codegen import ItemSchema, D


class HelloWorld(ItemSchema):
    title = D().css('title').text()
    a_hrefs = D().css_all('a').attr('href')
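To make the expected result of this schema concrete, here is a hand-written, standard-library-only sketch that extracts the same two fields. It is not what ssc-gen generates; the names _Collector and parse_hello_world are hypothetical, and returning None for an anchor without href is a choice made for this sketch.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _Collector(HTMLParser):
    """Collects the <title> text and the href of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.hrefs: List[Optional[str]] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # Missing href yields None in this sketch.
            self.hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_hello_world(html: str) -> dict:
    collector = _Collector()
    collector.feed(html)
    collector.close()
    # Keys mirror the field names declared in the schema.
    return {"title": collector.title, "a_hrefs": collector.hrefs}


result = parse_hello_world(
    "<html><head><title>Demo</title></head>"
    "<body><a href='/a'>A</a><a href='/b'>B</a></body></html>"
)
```

The two schema lines replace all of this bookkeeping, which is the boilerplate reduction the project aims for.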
Try it in the CLI
[!note]
These tools were developed for testing purposes, not for web-scraping tasks.
Evaluate from a file
Download any HTML file and pass it as an argument:
ssc-gen parse-from-file index.html -t schema.py:HelloWorld
Short option descriptions:
-t --target - the schema config file and the class from which to start the parser
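The FILE:CLASS form of the target can be read as "load this file, then pick this class". The sketch below shows one plausible way to resolve such a string with the standard library; split_target and load_target are hypothetical helpers, and ssc-gen's own loader may work differently.

```python
import importlib.util
from pathlib import Path
from typing import Tuple


def split_target(target: str) -> Tuple[Path, str]:
    """Split 'schema.py:HelloWorld' into (Path('schema.py'), 'HelloWorld')."""
    # rpartition keeps drive letters such as C:\ inside the file part.
    file_part, sep, class_name = target.rpartition(":")
    if not sep or not file_part or not class_name:
        raise ValueError(f"expected FILE:CLASS, got {target!r}")
    return Path(file_part), class_name


def load_target(target: str) -> type:
    """Import the file as a module and return the named class."""
    path, class_name = split_target(target)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, class_name)


parts = split_target("schema.py:HelloWorld")
```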

Send a GET request to a URL and parse the response
ssc-gen parse-from-url https://example.com -t schema.py:HelloWorld

Send a request via a Chromium browser (CDP)
ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld
[!note]
If the script cannot find the Chrome executable, provide it manually:
ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld -sc /usr/bin/chromium
Convert to code
Convert the schema to code for use in your projects:
[!note]
This example uses js because the result can be tested quickly in the browser developer console.
ssc-gen js schema.py -o .
Code output looks like this:
class HelloWorld {
  constructor(doc) {
    if (typeof doc === "string") {
      this._doc = new DOMParser().parseFromString(doc, "text/html");
    } else if (doc instanceof Document || doc instanceof Element) {
      this._doc = doc;
    } else {
      throw new Error("Invalid input: Expected a Document, Element, or string");
    }
  }
  _parseTitle(v) {
    let v0 = v.querySelector("title");
    return typeof v0.textContent === "undefined"
      ? v0.documentElement.textContent
      : v0.textContent;
  }
  _parseAHrefs(v) {
    let v0 = Array.from(v.querySelectorAll("a"));
    return v0.map((e) => e.getAttribute("href"));
  }
  parse() {
    return {
      title: this._parseTitle(this._doc),
      a_hrefs: this._parseAHrefs(this._doc),
    };
  }
}
Copy the code output and paste it into the developer console:
Print output:
alert(JSON.stringify(new HelloWorld(document).parse()));

You can use any HTML source:
- parse from HTML files
- parse from HTTP responses
- parse from browsers: playwright, selenium, chrome-cdp, etc.
- call curl in a shell and parse STDIN
- use in STDIN pipelines with third-party tools like projectdiscovery/httpx
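For the file and STDIN cases in the list above, a small loader is usually all the glue that is needed. The sketch below is a hypothetical helper (load_html is not part of ssc-gen) that follows the common convention of treating "-" as STDIN; the returned string would then be handed to whichever generated parser you use.

```python
import io
import sys
from pathlib import Path
from typing import Optional, TextIO


def load_html(source: str, stdin: Optional[TextIO] = None) -> str:
    """Return HTML text from a file path, or from STDIN when source is "-"."""
    if source == "-":
        # Resolve sys.stdin lazily so callers and tests can inject a stream.
        return (stdin or sys.stdin).read()
    return Path(source).read_text(encoding="utf-8")


# Demo with an in-memory stream standing in for piped input.
demo = load_html("-", io.StringIO("<title>Hi</title>"))
```

In a pipeline such as curl piped into a Python script, the script would call load_html("-") and pass the result to the parser class.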
See also
- Brief: about CSS selectors and regular expressions
- Explain: a short document on how to understand the DSL syntax
- LLM: an experimental prompt for generating code
- Explain: a short note on how to read ssc-gen schema configs
- Quickstart: about CSS selectors and regular expressions
- Tutorial: basic usage of ssc-gen
- AST reference: about generating code from the AST