Research
Security News
Malicious npm Packages Inject SSH Backdoors via Typosquatted Libraries
Socket’s threat research team has detected six malicious npm packages typosquatting popular libraries to insert SSH backdoors.
A lightweight package for crawling the web with the minimalist of code.
from air_web import get
# Crawl example.com and convert to Markdown
get("https://example.com")
This is proof that web crawling can be done simply and at the highest level of API. No need to install the required 102 dependencies for one feature, no need to create a lot of class instances in order to get something up and running. With a simple get()
, you get almost everything from the Internet.
air_web
uses PyO3 as backend, utilizing the html2text
crate for the to_markdown
function, as well as some .pyi
code to get everything typed nicely.
For the Python side, I used primp
for browser fingerprint impersonation and selectolax
for selecting HTML nodes.
Redirectors are used to redirect the HTML selector to one specific node to tidy up the results. For example, if the page has navbars, we can select the main content part to skip it through (if the nav isn't important).
air_web
comes with a pre-built redirector for Medium posts, you can pass it to the get()
function via redirectors=[...]
to get a cleaner result for posts:
from air_web import redirectors
get(
"https://medium.com/p/post-id-here",
redirectors=[
redirectors.medium, # skip footers, navs, and straight to the post
]
)
📝 Note: The reason why the redirectors
argument takes a list is that I want to make it sequential, meaning you can add multiple redirectors at once from other providers or custom ones made by you!
You can create a custom redirector via functions or string literals containing a CSS selector. Below is an example that selects an element with the class .article
and is inside of the main
tag.
🖱️ CSS selectors:
from air_web import Redirector # type
MY_REDIRECTOR: Redirector = "main .article" # CSS
Alternatively, you can use functions to manipulate the HTML nodes to clean everything up. Below is an example that removes advertisements from the node.
🏭 Functional selectors:
from air_web import (
Node, # an HTML node
ok, # indicates the node exists and is not None
redirector # a decorator for typing (optional)
)
@redirector
def my_redirector(node: Node):
main = ok(node.css_first("main"))
# Remove advertisement
ad = ok(main.css_first(".advertisement"))
ad.decompose()
return main
def get(
url: str,
*,
redirectors: List[Redirector] = [],
ok_codes: list[int] = [],
**kwargs,
) -> str: ...
Sends an HTTP GET request to the specified website and returns the Markdown result.
Args:
url
(str): The URL to fetch.redirectors
(list[Redirector]): A list of redirectors for indexing into or manipulating specific nodes before converting the HTML to Markdown. See the "Redirectors" section.ok_codes
(list[int]): A list of OK codes indicating the success status. Even if provided custom ones or not, the code 200
is always on the list.def to_markdown(t: str) -> str: ...
Converts HTML to Markdown. (src/lib.rs
)
FAQs
A lightweight package for crawling the web with the minimalist of code.
We found that air-web demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Research
Security News
Socket’s threat research team has detected six malicious npm packages typosquatting popular libraries to insert SSH backdoors.
Security News
MITRE's 2024 CWE Top 25 highlights critical software vulnerabilities like XSS, SQL Injection, and CSRF, reflecting shifts due to a refined ranking methodology.
Security News
In this segment of the Risky Business podcast, Feross Aboukhadijeh and Patrick Gray discuss the challenges of tracking malware discovered in open source softare.