
ssc_codegen
ssc-gen is a Python-based DSL for describing parsers for HTML documents. A schema written in it is translated into a standalone parsing module.
Currently supported converters
| Language | HTML parser lib + dependencies | XPath | CSS3 | CSS4 | Generated annotations, types, structs | formatter dependency |
|---|---|---|---|---|---|---|
| Python (3.8-3.13) | bs4, lxml (typing_extensions if py < 3.10) | N | Y | Y | TypedDict (1), list, dict | ruff |
| ... | parsel (typing_extensions if py < 3.10) | Y | Y | N | ... | ... |
| ... | selectolax (lexbor) (typing_extensions if py < 3.10) | N | Y | N | ... | ... |
| ... | lxml (typing_extensions if py < 3.10) | Y | Y | N | ... | ... |
| js (ES6) (2) | pure (firefox/chrome extension/nodejs) | Y | Y | Y | JSDoc | prettier |
| go (1.10+) (UNSTABLE) | goquery, gjson (4) | N | Y | N | struct (+json anchors), array, map | gofmt |
| lua (5.2+), luajit (2+) (UNSTABLE) (5) | lua-htmlparser, lrexlib (opt), dkjson | N | Y | N | EmmyLua | LuaFormatter |
CSS3 means support for the following selectors:

- basic (`tag`, `.class`, `#id`, `tag1,tag2`)
- combinators (`div p`, `ul > li`, `h2 +p`, `title ~head`)
- attribute (`a[href]`, `input[type='text']`, `a[href*='...']`, ...)
- pseudo-classes (`:nth-child(n)`, `:first-child`, `:last-child`)

CSS4 means support for the following selectors:

- `:nth-of-type()`, `:where()`, `:is()`, `:not()`, etc.

Notes:

- (1) This annotation type was deliberately chosen as a compromise, because Python has many ways of serialization: namedtuple, dataclass, attrs, pydantic, msgspec, etc.
- (2) The ES8 standard is required if the PCRE `re.S | re.DOTALL` flag is needed.
- (3) js excludes built-in serialization methods and uses the standard Array and Map types. Rely on the signature documentation!
- (4) golang has not been tested much, so there may be issues.
- formatter dependency: an optional dependency used to prettify the output and fix code style.
- (5) lua: `div +p` is equivalent to `CssExt.combine_plus(root:select("div"), "p")`

For maximum portability of the configuration to the target language:

- `+` operations (e.g. selectolax (modest), dart.universal_html)
- `*=`, `~=`, `|=`, `^=`, `$=`

ssc_gen requires Python 3.10 or higher.
pip:
pip install ssc_codegen
uv:
uv pip install ssc_codegen
Create a `schema.py` file with:

from ssc_codegen import ItemSchema, D


class HelloWorld(ItemSchema):
    title = D().css('title').text()
    a_hrefs = D().css_all('a').attr('href')
[!note] These tools are intended for testing purposes, not for web-scraping tasks.
Download any HTML file and pass it as an argument:
ssc-gen parse-from-file index.html -t schema.py:HelloWorld
Short option descriptions:
`-t`, `--target`: the schema config file and the class from which the parser starts
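The `schema.py:HelloWorld` form of `-t` can be read as "import this file, then pick this class". The sketch below illustrates that idea with the standard library only. It makes no claim about how ssc-gen resolves targets internally, and `load_target` is a hypothetical helper name.

```python
import importlib.util
from pathlib import Path


def load_target(target: str):
    """Resolve a 'path/to/file.py:ClassName' string to the class object.

    Hypothetical helper for illustration; not ssc-gen's actual loader.
    """
    # Split on the last colon so that paths containing colons still work.
    file_part, sep, class_name = target.rpartition(":")
    if not sep or not file_part or not class_name:
        raise ValueError(f"expected 'file.py:ClassName', got {target!r}")

    path = Path(file_part)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {path}")

    # Execute the file as a module, then look the class up by name.
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, class_name)
```

Under these assumptions, `load_target("schema.py:HelloWorld")` would return the `HelloWorld` class defined in `schema.py`.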
ssc-gen parse-from-url https://example.com -t schema.py:HelloWorld

ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld
[!note] If the script cannot find the Chrome executable, provide it manually:
ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld -sc /usr/bin/chromium
Convert to code for use in projects:
[!note] JS is used in this example because it can be tested quickly in the developer console.
ssc-gen js schema.py -o .
Code output looks like this:
// autogenerated by ssc-gen DO NOT_EDIT
/***
 *
 * {
 *     "title": "String",
 *     "a_hrefs": "Array<String>"
 * }*/
class HelloWorld {
  constructor(doc) {
    if (typeof doc === "string") {
      this._doc = new DOMParser().parseFromString(doc, "text/html");
    } else if (doc instanceof Document || doc instanceof Element) {
      this._doc = doc;
    } else {
      throw new Error("Invalid input: Expected a Document, Element, or string");
    }
  }
  _parseTitle(v) {
    let v0 = v.querySelector("title");
    return typeof v0.textContent === "undefined"
      ? v0.documentElement.textContent
      : v0.textContent;
  }
  _parseAHrefs(v) {
    let v0 = Array.from(v.querySelectorAll("a"));
    return v0.map((e) => e.getAttribute("href"));
  }
  parse() {
    return {
      title: this._parseTitle(this._doc),
      a_hrefs: this._parseAHrefs(this._doc),
    };
  }
}
Print output:
alert(JSON.stringify(new HelloWorld(document).parse()));
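
For comparison, here is a hand-written Python sketch of the same extraction: the text of the first `title` element and the `href` of every `a` element. It is not ssc-gen output. It assumes only the standard library `html.parser`, and the names `HelloWorldResult`, `_Collector`, and `parse_hello_world` are hypothetical. The `TypedDict` return type mirrors the annotation style that the converter table lists for the Python targets.

```python
from html.parser import HTMLParser
from typing import List, Optional, TypedDict


class HelloWorldResult(TypedDict):
    title: str
    a_hrefs: List[Optional[str]]


class _Collector(HTMLParser):
    """Collects the first <title> text and the href of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._seen_title = False
        self.title_parts: List[str] = []
        self.a_hrefs: List[Optional[str]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._seen_title:
            self._in_title = True
        elif tag == "a":
            # A missing href yields None, like getAttribute() returning null.
            self.a_hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._seen_title = True

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def parse_hello_world(html: str) -> HelloWorldResult:
    collector = _Collector()
    collector.feed(html)
    collector.close()
    return {
        "title": "".join(collector.title_parts),
        "a_hrefs": collector.a_hrefs,
    }
```

Unlike the generated JS, this sketch returns an empty title instead of raising when the document has no `title` element.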

You can use any html source: