ssc_codegen

Python DSL code converter to HTML parsers for web scraping

Source: PyPI (pip)
Version: 0.15.3
Maintainers: 1

Selector Schema codegen

Introduction

ssc-gen is a Python-based DSL for describing parsers for HTML documents. A schema is translated into a standalone parsing module.

For a better experience using this library, you should know:

  • HTML CSS selectors (CSS3 standard at minimum) and XPath
  • regular expressions (PCRE)

The project addresses the following problems:

  • designed for parsers of SSR (server-side rendered) HTML pages, NOT FOR REST API OR GRAPHQL ENDPOINTS
  • reduces boilerplate code
  • generates independent modules that can be reused outside the project
  • generates docstring documentation and the signature of the parser output
  • for a better IDE experience, generates typedefs and type annotations (if the target programming language supports them)
  • supports annotation and parsing of JSON-like strings from a document
  • provides an AST API for developing new code converters

Supported converters

Currently supported converters:

| Language | HTML parser lib + dependencies | XPath | CSS3 | CSS4 | Generated annotations, types, structs | Formatter dependency |
|---|---|---|---|---|---|---|
| Python (3.8-3.13) | bs4, lxml (typing_extensions if py < 3.10) | N | Y | Y | TypedDict [1], list, dict | ruff |
| ... | parsel (typing_extensions if py < 3.10) | Y | Y | N | ... | ... |
| ... | selectolax (lexbor) (typing_extensions if py < 3.10) | N | Y | N | ... | ... |
| ... | lxml (typing_extensions if py < 3.10) | Y | Y | N | ... | ... |
| js (ES6) [2] | pure (firefox/chrome extension/nodejs) | Y | Y | Y | JSDoc | prettier |
| go (1.10+) (UNSTABLE) | goquery, gjson [4] | N | Y | N | struct (+json anchors), array, map | gofmt |
| lua (5.2+), luajit (2+) (UNSTABLE) [5] | lua-htmlparser, lrexlib (opt), dkjson | N | Y | N | EmmyLua | LuaFormatter |

  • CSS3 means support for the following selectors:

    • basic: (tag, .class, #id, tag1,tag2)
    • combined: (div p, ul > li, h2 + p, title ~ head)
    • attribute: (a[href], input[type='text'], a[href*='...'], ...)
    • CSS3 pseudo-classes: (:nth-child(n), :first-child, :last-child)
  • CSS4 means support for the following selectors:

    • :nth-of-type(), :where(), :is(), :not(), etc.
  • [1] This annotation type was deliberately chosen as a compromise: Python has many serialization options (namedtuple, dataclass, attrs, pydantic, msgspec, etc.).

    • TypedDict is like a built-in dict, but with IDE and linter hint support, and you can easily implement an adapter for the required structure.
  • [2] The ES8 standard is required if you need the PCRE re.S | re.DOTALL flag.

  • [3] The js output excludes built-in serialization methods and uses the standard Array and Map types. Rely on the signature documentation!

  • [4] golang has not been tested much; there may be issues.

  • formatter dependency: an optional dependency used to prettify the output and fix code style.

  • [5] lua

    • Experimental research PoC; performance and stability are not guaranteed.
    • Priority is on generating pure lua without C-lib dependencies, using mva/htmlparser and dhkolf/dkjson.
    • Translates unsupported CSS3 selectors into equivalent function calls:
      • for example, div +p is equivalent to CssExt.combine_plus(root:select("div"), "p")
    • Translates PCRE regex to lua string pattern matching (with restrictions); for more information, see lua_re_compat.py.
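
The TypedDict note above says an adapter to another structure is easy to write. A minimal sketch of such an adapter follows, assuming a parser output with a `title` string and an `a_hrefs` list. The names `T_Page` and `Page` are hypothetical and are not produced by ssc-gen; the real generated type names may differ.

```python
from dataclasses import dataclass
from typing import List, TypedDict


class T_Page(TypedDict):
    """Hypothetical shape of a parser result (plain dict at runtime)."""

    title: str
    a_hrefs: List[str]


@dataclass(frozen=True)
class Page:
    """Application-side structure we actually want to work with."""

    title: str
    links: List[str]

    @classmethod
    def from_parsed(cls, data: T_Page) -> "Page":
        # Normalise while adapting: trim the title, drop empty hrefs.
        return cls(
            title=data["title"].strip(),
            links=[href for href in data["a_hrefs"] if href],
        )


# Sample data standing in for a real parser result.
parsed: T_Page = {
    "title": " Example Domain ",
    "a_hrefs": ["https://www.iana.org/domains/example", ""],
}
page = Page.from_parsed(parsed)
```

The same pattern applies to attrs, pydantic or msgspec models: the TypedDict stays the neutral exchange format, and the adapter is the only place that knows about the target structure.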

Limitations

For maximum portability of the configuration to the target language:

  • If possible, use CSS selectors: they are guaranteed to be converted to XPath.
  • Unlike JavaScript, most HTML parser libs implement only the CSS3 selector standard, and some do not implement it fully. Check the HTML parser lib documentation about CSS selectors before writing code. Examples:
    • Some HTML parser libs may not support the attribute selectors *=, ~=, |=, ^=, $=
    • Several libs do not support pseudo-classes (e.g. the standard dart.html lib lacks this feature).
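
To illustrate why simple CSS selectors map cleanly onto XPath, here is a toy converter for a very small subset (tag, #id, [attr], [attr='value'], descendant and child combinators). It is a sketch for intuition only and is not the conversion ssc-gen performs; `css_to_xpath` and `compound_to_xpath` are hypothetical names, and attribute values containing spaces are not handled.

```python
import re
import xml.etree.ElementTree as ET

# One compound selector: optional tag, optional #id, optional [attr] or [attr='value'].
_COMPOUND = re.compile(
    r"^(?P<tag>[a-zA-Z][\w-]*)?"
    r"(?:#(?P<id>[\w-]+))?"
    r"(?:\[(?P<attr>[\w-]+)(?:=['\"](?P<value>[^'\"]*)['\"])?\])?$"
)


def compound_to_xpath(compound: str) -> str:
    """Translate one compound selector (no combinators) into an XPath step."""
    match = _COMPOUND.match(compound)
    if not compound or match is None:
        raise ValueError(f"unsupported selector part: {compound!r}")
    step = match["tag"] or "*"
    if match["id"]:
        step += f"[@id='{match['id']}']"
    if match["attr"]:
        if match["value"] is None:
            step += f"[@{match['attr']}]"
        else:
            step += f"[@{match['attr']}='{match['value']}']"
    return step


def css_to_xpath(css: str) -> str:
    """Translate 'a b' (descendant) and 'a > b' (child) chains into XPath."""
    tokens = css.replace(">", " > ").split()
    xpath = "."
    axis = "//"  # a space between compounds means "any descendant"
    for token in tokens:
        if token == ">":
            axis = "/"  # the next compound must be a direct child
            continue
        xpath += axis + compound_to_xpath(token)
        axis = "//"
    return xpath


# Check the result against the limited XPath support in the standard library.
doc = ET.fromstring(
    "<html><body><ul id='nav'><li><a href='/x'>x</a></li></ul><a>no</a></body></html>"
)
links = [e.text for e in doc.findall(css_to_xpath("ul#nav a[href]"))]
```

Pseudo-classes and the substring attribute operators are exactly where such a mapping stops being trivial, which is why the support for them varies between parser libs.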

Getting started

ssc-gen requires Python 3.10 or higher.

Install

pip:

pip install ssc_codegen

uv:

uv pip install ssc_codegen

Example

Create a file schema.py with:

from ssc_codegen import ItemSchema, D

class HelloWorld(ItemSchema):
    title = D().css('title').text()
    a_hrefs = D().css_all('a').attr('href')
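
One way to picture how a chained field description could become target code: each call records an operation, and a converter later renders the recorded list. The toy below is only a sketch under that assumption. `FieldBuilder` and `render_js` are hypothetical names and are not part of ssc_codegen, whose real AST and converters are more involved.

```python
import json
from typing import List, Tuple

Op = Tuple[str, str]  # (operation name, argument)


class FieldBuilder:
    """Toy stand-in for a chained field description; NOT ssc_codegen's D."""

    def __init__(self) -> None:
        self.ops: List[Op] = []

    def _push(self, name: str, arg: str = "") -> "FieldBuilder":
        self.ops.append((name, arg))
        return self  # returning self is what makes chaining possible

    def css(self, query: str) -> "FieldBuilder":
        return self._push("css", query)

    def css_all(self, query: str) -> "FieldBuilder":
        return self._push("css_all", query)

    def text(self) -> "FieldBuilder":
        return self._push("text")

    def attr(self, name: str) -> "FieldBuilder":
        return self._push("attr", name)


def render_js(ops: List[Op], var: str = "v") -> str:
    """Render the recorded operations as a single JavaScript expression."""
    expr = var
    is_list = False  # becomes True once a css_all has produced an array
    for name, arg in ops:
        quoted = json.dumps(arg)  # safe double-quoted string literal
        if name == "css":
            expr = f"{expr}.querySelector({quoted})"
        elif name == "css_all":
            expr = f"Array.from({expr}.querySelectorAll({quoted}))"
            is_list = True
        elif name == "text":
            expr = f"{expr}.map((e) => e.textContent)" if is_list else f"{expr}.textContent"
        elif name == "attr":
            getter = f"getAttribute({quoted})"
            expr = f"{expr}.map((e) => e.{getter})" if is_list else f"{expr}.{getter}"
        else:
            raise ValueError(f"unknown operation: {name}")
    return expr


title_js = render_js(FieldBuilder().css("title").text().ops)
hrefs_js = render_js(FieldBuilder().css_all("a").attr("href").ops)
```

Because the schema only records intent, the same recorded operations could be rendered for a different target language by swapping the render function.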

Try it in the CLI

[!note] These tools were developed for testing purposes, not for web-scraping tasks.

eval from file

Download any html file and pass as argument:

ssc-gen parse-from-file index.html -t schema.py:HelloWorld

Short option descriptions:

  • -t --target - the schema config file and the class from which to start the parser


send GET request to url and parse response

ssc-gen parse-from-url https://example.com -t schema.py:HelloWorld


send request via Chromium browser (CDP protocol)

ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld

[!note] If the script cannot find the Chrome executable, provide it manually:

ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld -sc /usr/bin/chromium

Convert to code

Convert the schema to code for use in projects:

[!note] js is used in this example because the result can be tested quickly in the developer console.

ssc-gen js schema.py -o .

Code output looks like this:

// autogenerated by ssc-gen DO NOT_EDIT
/***
 *
 * {
 *     "title": "String",
 *     "a_hrefs": "Array<String>"
 * }*/
class HelloWorld {
  constructor(doc) {
    if (typeof doc === "string") {
      this._doc = new DOMParser().parseFromString(doc, "text/html");
    } else if (doc instanceof Document || doc instanceof Element) {
      this._doc = doc;
    } else {
      throw new Error("Invalid input: Expected a Document, Element, or string");
    }
  }

  _parseTitle(v) {
    let v0 = v.querySelector("title");
    return typeof v0.textContent === "undefined"
      ? v0.documentElement.textContent
      : v0.textContent;
  }

  _parseAHrefs(v) {
    let v0 = Array.from(v.querySelectorAll("a"));
    return v0.map((e) => e.getAttribute("href"));
  }

  parse() {
    return {
      title: this._parseTitle(this._doc),
      a_hrefs: this._parseAHrefs(this._doc),
    };
  }
}

Copy the code output and paste it into the developer console.

Print the output:

alert(JSON.stringify(new HelloWorld(document).parse()));


You can use any html source:

  • parse from html files
  • parse from http responses
  • parse from browsers: playwright, selenium, chrome-cdp, etc.
  • call curl in shell and parse STDIN
  • use in STDIN pipelines with third-party tools like projectdiscovery/httpx
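
As a rough sketch of the STDIN case, the script below reads HTML from a stream and prints the parsed result as JSON. It uses a hand-written standard-library parser as a stand-in for a generated module; `HelloWorldParser` and `run` are hypothetical names, and a real project would call the module produced by ssc-gen instead.

```python
import json
import sys
from html.parser import HTMLParser
from typing import List, Optional, TextIO


class HelloWorldParser(HTMLParser):
    """Hand-written stand-in that collects the page title and all <a> hrefs."""

    def __init__(self) -> None:
        super().__init__()
        self.title: Optional[str] = None
        self.a_hrefs: List[Optional[str]] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""
        elif tag == "a":
            # None when the attribute is absent, like getAttribute("href") in js.
            self.a_hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def run(stream: TextIO) -> str:
    """Read a whole HTML document from the stream and return the result as JSON."""
    parser = HelloWorldParser()
    parser.feed(stream.read())
    parser.close()
    return json.dumps({"title": parser.title, "a_hrefs": parser.a_hrefs})


# In a pipeline script, finish with: print(run(sys.stdin))
```

With that last line enabled and the file saved as, say, parse_stdin.py (a hypothetical name), it could be fed from the shell with `curl -s https://example.com | python parse_stdin.py`.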

See also

  • Brief: CSS selectors and regular expressions
  • Explain: a short document on how to understand the DSL syntax
  • LLM: an experimental prompt for generating code
  • Explain: a short note on how to read and explain ssc-gen schema configs
  • Quickstart: CSS selectors and regular expressions
  • Tutorial: basic usage of ssc-gen
  • AST reference: generating code from the AST
