
.. image:: docs/logo.png
  :alt: selectolax logo

.. image:: https://img.shields.io/pypi/v/selectolax.svg
  :target: https://pypi.python.org/pypi/selectolax

A fast HTML5 parser with CSS selectors using `Modest <https://github.com/lexborisov/Modest/>`_ and `Lexbor <https://github.com/lexbor/lexbor>`_ engines.

From PyPI using pip:

.. code-block:: bash

    pip install selectolax

If installation fails due to compilation errors, you may need to install `Cython <https://github.com/cython/cython>`_:

.. code-block:: bash

    pip install selectolax[cython]

This usually happens when you try to install an outdated version of selectolax on a newer version of Python.

Development version from GitHub:

.. code-block:: bash

    git clone --recursive https://github.com/rushter/selectolax
    cd selectolax
    pip install -r requirements_dev.txt
    python setup.py install

How to compile selectolax while developing:

.. code-block:: bash

    make clean
    make dev

Here are some basic examples to get you started with selectolax:

Parsing HTML and extracting text:

.. code:: python

    In [1]: from selectolax.parser import HTMLParser
       ...:
       ...: html = """
       ...: <h1 id="title" data-updated="20201101">Hi there</h1>
       ...: <div class="post">Lorem Ipsum is simply dummy text of the printing and typesetting industry. </div>
       ...: <div class="post">Lorem ipsum dolor sit amet, consectetur adipiscing elit.</div>
       ...: """
       ...: tree = HTMLParser(html)

    In [2]: tree.css_first('h1#title').text()
    Out[2]: 'Hi there'

    In [3]: tree.css_first('h1#title').attributes
    Out[3]: {'id': 'title', 'data-updated': '20201101'}

    In [4]: [node.text() for node in tree.css('.post')]
    Out[4]:
    ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. ',
     'Lorem ipsum dolor sit amet, consectetur adipiscing elit.']

Using advanced CSS selectors:

.. code:: python

    In [1]: html = "<div><p id=p1><p id=p2><p id=p3><a>link</a><p id=p4><p id=p5>text<p id=p6></div>"
       ...: selector = "div > :nth-child(2n+1):not(:has(a))"

    In [2]: for node in HTMLParser(html).css(selector):
       ...:     print(node.attributes, node.text(), node.tag)
       ...:     print(node.parent.tag)
       ...:     print(node.html)
       ...:
    {'id': 'p1'}  p
    div
    <p id="p1"></p>
    {'id': 'p5'} text p
    div
    <p id="p5">text</p>

`Detailed overview <https://github.com/rushter/selectolax/blob/master/examples/walkthrough.ipynb>`_

Selectolax supports two backends: ``Modest`` and ``Lexbor``. By default, all examples use the Modest backend.
Most features are nearly identical between the two backends, but there are still some differences.

As of 2024, the preferred backend is ``Lexbor``. The ``Modest`` backend is still available for compatibility reasons, although the underlying C library that selectolax uses is no longer maintained.

To use ``lexbor``, import the parser and use it in the same way as ``HTMLParser``.

.. code:: python

    In [1]: from selectolax.lexbor import LexborHTMLParser

    In [2]: html = """
       ...: <title>Hi there</title>
       ...: <div id="updated">2021-08-15</div>
       ...: """

    In [3]: parser = LexborHTMLParser(html)

    In [4]: parser.root.css_first("#updated").text()
    Out[4]: '2021-08-15'

See ``examples/benchmark.py`` for more information.

============================ ===========
Package                      Time
============================ ===========
Beautiful Soup (html.parser) 61.02 sec.
lxml / Beautiful Soup (lxml) 9.09 sec.
html5_parser                 16.10 sec.
selectolax (Modest)          2.94 sec.
selectolax (Lexbor)          2.39 sec.
============================ ===========

* `selectolax API reference <http://selectolax.readthedocs.io/en/latest/parser.html>`_
* `Video introduction to web scraping using selectolax <https://youtu.be/HpRsfpPuUzE>`_
* `How to Scrape 7k Products with Python using selectolax and httpx <https://www.youtube.com/watch?v=XpGvq755J2U>`_
* `Detailed overview <https://github.com/rushter/selectolax/blob/master/examples/walkthrough.ipynb>`_
* `Modest introduction <https://lexborisov.github.io/Modest/>`_
* `Modest benchmark <http://lexborisov.github.io/benchmark-html-persers/>`_
* `Python benchmark <https://rushter.com/blog/python-fast-html-parser/>`_
* `Another Python benchmark <https://www.peterbe.com/plog/selectolax-or-pyquery>`_

* `LGPL2.1 <https://github.com/lexborisov/Modest/blob/master/LICENSE>`_
* `MIT <https://github.com/rushter/selectolax/blob/master/LICENSE>`_