- Designed for parsers of SSR (server-side rendered) HTML pages, NOT for REST API or GraphQL endpoints.
- Reduces boilerplate code.
- Generates modules that are independent of the project and can be reused.
- Generates docstring documentation and the signature of the parser output.
- For a better IDE experience, generates typedefs and type annotations (if the target programming language supports them).
- Supports annotating and parsing JSON-like strings from a document.
- Provides an AST codegen API for developing new parser converters.

Supported converters

Currently supported converters

| Language | HTML parser lib + dependencies | XPath | CSS3 | CSS4 | Generated annotations, types, structs | Formatter dependency |
|---|---|---|---|---|---|---|
| Python (3.8-3.13) | bs4, lxml (typing_extensions if py < 3.10) | N | Y | Y | TypedDict`1`, list, dict | ruff |
| ... | parsel (typing_extensions if py < 3.10) | Y | Y | N | ... | ... |
| ... | selectolax (lexbor) (typing_extensions if py < 3.10) | N | Y | N | ... | ... |
| ... | lxml (typing_extensions if py < 3.10) | Y | Y | N | ... | ... |
| js (ES6)`2` | pure (firefox/chrome extension/nodejs) | Y | Y | Y | JSDoc | prettier |
| go (1.10+) (UNSTABLE) | goquery, gjson (`4`) | N | Y | N | struct(+json anchors), array, map | gofmt |
| lua (5.2+), luajit(2+) (UNSTABLE)`5` | lua-htmlparser, lrexlib(opt), dkjson | N | Y | N | EmmyLua | LuaFormatter |

CSS3 means the following selectors are supported:
- basic: (tag, .class, #id, tag1,tag2)
- combined: (div p, ul > li, h2 + p, title ~ head)
- attribute: (a[href], input[type='text'], a[href*='...'], ...)
- CSS3 pseudo-classes: (:nth-child(n), :first-child, :last-child)
CSS4 means the following selectors are supported:
- :nth-of-type(), :where(), :is(), :not(), etc.
`1` This annotation type was deliberately chosen as a compromise, because Python has many ways to serialize data: namedtuple, dataclass, attrs, pydantic, msgspec, etc.
- TypedDict is like a built-in dict, but with IDE and linter hint support, and you can easily implement an adapter for the required structure.
`2` The ES8 standard is required if you need the PCRE re.S | re.DOTALL flag.
`3` js excludes built-in serialization methods and uses the standard Array and Map types. The focus is on the signature documentation!
`4` golang has not been tested much; there may be issues.
`5` lua:
- Experimental research PoC; performance and stability are not guaranteed.
- Priority is given to generating pure Lua without C-lib dependencies, using mva/htmlparser and dhkolf/dkjson.
- Translates unsupported CSS3 selectors into equivalent function calls:
  - for example, div +p is equivalent to CssExt.combine_plus(root:select("div"), "p")
- Translates PCRE regex to string pattern matching (with restrictions); see lua_re_compat.py for more information.
Formatter dependency: an optional dependency used to prettify the output and fix code style.
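Footnote 1 says that an adapter from the generated TypedDict to another structure is easy to write. A minimal sketch of such an adapter follows, assuming a generated type with `title` and `a_hrefs` keys. The names `T_HelloWorld` and `HelloWorldItem` are hypothetical and are not taken from actual generator output.

```python
from dataclasses import dataclass
from typing import List, TypedDict


class T_HelloWorld(TypedDict):
    """Hypothetical stand-in for a generated TypedDict annotation."""

    title: str
    a_hrefs: List[str]


@dataclass(frozen=True)
class HelloWorldItem:
    """Project-side structure that the parsed dict is adapted into."""

    title: str
    a_hrefs: List[str]

    @classmethod
    def from_typed_dict(cls, data: T_HelloWorld) -> "HelloWorldItem":
        # Copy the list so the dataclass does not share state with the dict.
        return cls(title=data["title"], a_hrefs=list(data["a_hrefs"]))


raw: T_HelloWorld = {"title": "Example Domain", "a_hrefs": ["/about"]}
item = HelloWorldItem.from_typed_dict(raw)
```

The same pattern should carry over to attrs, pydantic, or msgspec models, since the parser output is a plain dict at runtime.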

Limitations

For maximum portability of the configuration to the target language:

If possible, use CSS selectors: they are guaranteed to be converted to XPath.
Unlike JavaScript, most HTML parser libs implement only the CSS3 selectors standard, and they may not implement it fully! Check the HTML parser lib documentation about CSS selectors before writing code. Examples:
- Several libs do not support the + combinator (e.g. selectolax(modest), dart.universal_html)
- For research purposes, the lua_htmlparser target includes a converter for unsupported CSS3 query syntax
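The advice to prefer CSS selectors relies on each CSS query having an XPath equivalent. The sketch below illustrates that mapping for a very small subset (tag, .class, #id, descendant and child combinators). It is not the converter ssc_codegen uses, and the function names are hypothetical.

```python
import re

# One compound selector: optional tag followed by any number of .class / #id parts.
_COMPOUND = re.compile(r"^(?P<tag>[a-zA-Z][\w-]*|\*)?(?P<rest>(?:[.#][\w-]+)*)$")
_PART = re.compile(r"([.#])([\w-]+)")


def compound_to_xpath(compound: str) -> str:
    """Translate e.g. 'a.link#main' into a single XPath step."""
    match = _COMPOUND.match(compound)
    if match is None:
        raise ValueError(f"unsupported selector part: {compound!r}")
    step = match.group("tag") or "*"
    for kind, name in _PART.findall(match.group("rest")):
        if kind == "#":
            step += f"[@id='{name}']"
        else:
            # Match a whole class token, not a substring of the class attribute.
            step += (
                "[contains(concat(' ', normalize-space(@class), ' '), "
                f"' {name} ')]"
            )
    return step


def css_to_xpath(selector: str) -> str:
    """Translate a chain of compounds joined by ' ' (descendant) or '>' (child)."""
    xpath = ""
    axis = "//"  # descendant axis is the default between compounds
    for token in selector.replace(">", " > ").split():
        if token == ">":
            axis = "/"
            continue
        xpath += axis + compound_to_xpath(token)
        axis = "//"
    return xpath


queries = {css: css_to_xpath(css) for css in ("div p", "ul > li", "a#main")}
```

Attribute operators, pseudo-classes and the + and ~ combinators are left out of this sketch on purpose; these are the parts where library support varies.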

HTML parser libs may not support the attribute selectors *=, ~=, |=, ^=, $=.
Several libs do not support pseudo-classes (e.g. the standard dart.html lib lacks this feature).
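When a target library lacks the attribute operators listed above, the match can be done by hand after selecting elements with a simpler query. A minimal sketch of the operator semantics, following the CSS Selectors Level 3 definitions, is shown here. The helper name `attr_matches` is hypothetical and not part of ssc_codegen.

```python
from typing import Optional


def attr_matches(value: Optional[str], op: str, expected: str) -> bool:
    """Return True if an attribute value satisfies a CSS attribute operator."""
    if value is None:  # attribute is absent on the element
        return False
    if op == "=":
        return value == expected
    if op == "~=":
        # Whitespace-separated word match; an empty or multi-word operand never matches.
        if not expected or any(ch.isspace() for ch in expected):
            return False
        return expected in value.split()
    if op == "|=":
        # Exact value, or the value followed by a hyphen (language subcodes).
        return value == expected or value.startswith(expected + "-")
    if op in ("^=", "$=", "*="):
        if not expected:  # an empty operand never matches for these operators
            return False
        if op == "^=":
            return value.startswith(expected)
        if op == "$=":
            return value.endswith(expected)
        return expected in value
    raise ValueError(f"unknown attribute operator: {op!r}")


# Example: keep only hrefs that end with ".pdf", as a[href$='.pdf'] would.
hrefs = ["/report.pdf", "/index.html", None]
pdf_links = [h for h in hrefs if attr_matches(h, "$=", ".pdf")]
```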

Getting started

ssc_gen requires Python 3.10 or higher.

Install

pip:

pip install ssc_codegen

uv:

uv pip install ssc_codegen

Example

Create a file `schema.py` with:

from ssc_codegen import ItemSchema, D

class HelloWorld(ItemSchema):
    title = D().css('title').text()
    a_hrefs = D().css_all('a').attr('href')
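For comparison, here is a hand-written sketch of what this schema extracts (the text of `title` and the `href` of every `a`), using only the Python standard library. It is not the code ssc-gen generates. The names `HelloWorldResult` and `parse_hello_world` are hypothetical, and anchors without an `href` are simply skipped here, which may differ from the generated parsers.

```python
from html.parser import HTMLParser
from typing import List, TypedDict


class HelloWorldResult(TypedDict):
    title: str
    a_hrefs: List[str]


class _HelloWorldExtractor(HTMLParser):
    """Collects the first <title> text and every <a href> value."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._title_done = False
        self.title_parts: List[str] = []
        self.a_hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._title_done:
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.a_hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._title_done = True  # only the first <title> is used

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def parse_hello_world(html: str) -> HelloWorldResult:
    extractor = _HelloWorldExtractor()
    extractor.feed(html)
    extractor.close()
    return {"title": "".join(extractor.title_parts), "a_hrefs": extractor.a_hrefs}
```

The schema version states the same intent in two lines, which is the boilerplate the generator is meant to remove.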

Try it in the CLI

[!note] these tools were developed for testing purposes, not for web-scraping tasks

eval from file

Download any HTML file and pass it as an argument:

ssc-gen parse-from-file index.html -t schema.py:HelloWorld

Short option descriptions:

-t --target - schema config file and the class from which the parser starts
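The target uses a `file.py:ClassName` form. As an illustration only, one way such a string could be resolved in Python is sketched below. This is an assumption about the general technique, not ssc-gen's actual implementation, and `load_target` is a hypothetical name.

```python
import importlib.util
from pathlib import Path


def load_target(target: str) -> type:
    """Resolve 'path/to/schema.py:ClassName' to the class object."""
    # rpartition splits on the last colon, so paths containing ':' still work.
    file_path, sep, class_name = target.rpartition(":")
    if not sep or not file_path or not class_name:
        raise ValueError(f"expected 'file.py:ClassName', got {target!r}")

    spec = importlib.util.spec_from_file_location(Path(file_path).stem, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {file_path!r}")

    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the schema file once
    return getattr(module, class_name)
```

Calling `load_target("schema.py:HelloWorld")` would then return the `HelloWorld` class defined in the example above, provided `ssc_codegen` is installed so that the file's import succeeds.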

send GET request to url and parse response

ssc-gen parse-from-url https://example.com -t schema.py:HelloWorld

send request via Chromium browser (CDP protocol)

ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld

[!note] if the script cannot find the Chrome executable, provide it manually:

ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld -sc /usr/bin/chromium

Convert to code

Convert to code for use in projects:

[!note] JS is used in this example because it can be tested quickly in the browser developer console

ssc-gen js schema.py -o .

Code output looks like this:

// autogenerated by ssc-gen DO NOT_EDIT
/***
 *
 * {
 *     "title": "String",
 *     "a_hrefs": "Array<String>"
 * }*/
class HelloWorld {
  constructor(doc) {
    if (typeof doc === "string") {
      this._doc = new DOMParser().parseFromString(doc, "text/html");
    } else if (doc instanceof Document || doc instanceof Element) {
      this._doc = doc;
    } else {
      throw new Error("Invalid input: Expected a Document, Element, or string");
    }
  }

  _parseTitle(v) {
    let v0 = v.querySelector("title");
    return typeof v0.textContent === "undefined"
      ? v0.documentElement.textContent
      : v0.textContent;
  }

  _parseAHrefs(v) {
    let v0 = Array.from(v.querySelectorAll("a"));
    return v0.map((e) => e.getAttribute("href"));
  }

  parse() {
    return {
      title: this._parseTitle(this._doc),
      a_hrefs: this._parseAHrefs(this._doc),
    };
  }
}

Copy the code output and paste it into the developer console:

Print output:

alert(JSON.stringify(new HelloWorld(document).parse()));

You can use any HTML source:

- parse from HTML files
- parse from HTTP responses
- parse from browsers: playwright, selenium, chrome-cdp, etc.
- call curl in a shell and parse STDIN
- use in STDIN pipelines with third-party tools like projectdiscovery/httpx

