.. image:: https://img.shields.io/pypi/v/acrawler.svg
   :target: https://pypi.org/project/acrawler/
   :alt: PyPI

.. image:: https://readthedocs.org/projects/acrawler/badge/?version=latest
   :target: https://acrawler.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status
🔍 A powerful web-crawling framework, based on aiohttp.
It is built on `Parsel <https://parsel.readthedocs.io/en/latest/>`_ for parsing and has optional `pyppeteer <https://github.com/miyakogi/pyppeteer>`_ support.

To install, simply use pip:
.. code-block:: bash

    $ pip install acrawler

    (Optional)
    $ pip install uvloop      # only Linux/macOS, for a faster asyncio event loop
    $ pip install aioredis    # if you need Redis support
    $ pip install motor       # if you need MongoDB support
    $ pip install aiofiles    # if you need FileRequest
Documentation and tutorial are available online at https://acrawler.readthedocs.io/ and in the docs
directory.
Scrape imdb.com
^^^^^^^^^^^^^^^
.. code-block:: python

    from acrawler import Crawler, Request, ParselItem, Handler, register, get_logger


    class MovieItem(ParselItem):
        log = True
        css = {
            # just some normal css rules
            # see Parsel for detailed information
            "date": ".subtext a[href*=releaseinfo]::text",
            "time": ".subtext time::text",
            "rating": "span[itemprop=ratingValue]::text",
            "rating_count": "span[itemprop=ratingCount]::text",
            "metascore": ".metacriticScore span::text",

            # if you provide a list with additional functions,
            # they are considered as field processor functions
            "title": ["h1::text", str.strip],

            # the following four rules collect all matching values;
            # such a rule starts with [ and ends with ], unlike normal rules
            "genres": "[.subtext a[href*=genres]::text]",
            "director": "[h4:contains(Director) ~ a[href*=name]::text]",
            "writers": "[h4:contains(Writer) ~ a[href*=name]::text]",
            "stars": "[h4:contains(Star) ~ a[href*=name]::text]",
        }


    class IMDBCrawler(Crawler):
        config = {"MAX_REQUESTS": 4, "DOWNLOAD_DELAY": 1}

        async def start_requests(self):
            yield Request("https://www.imdb.com/chart/moviemeter", callback=self.parse)

        def parse(self, response):
            yield from response.follow(
                ".lister-list tr .titleColumn a::attr(href)", callback=self.parse_movie
            )

        def parse_movie(self, response):
            url = response.url_str
            yield MovieItem(response.sel, extra={"url": url.split("?")[0]})


    @register()
    class HorrorHandler(Handler):
        family = "MovieItem"
        logger = get_logger("horrorlog")

        async def handle_after(self, item):
            if item["genres"] and "Horror" in item["genres"]:
                self.logger.warning(f"({item['title']}) is a horror movie!!!!")


    @MovieItem.bind()
    def process_time(value):
        # a self-defined field-processing function
        # processes time to minutes: '3h 1min' -> 181
        if value:
            res = 0
            segs = value.split(" ")
            for seg in segs:
                if seg.endswith("min"):
                    res += int(seg.replace("min", ""))
                elif seg.endswith("h"):
                    res += 60 * int(seg.replace("h", ""))
            return res
        return value


    if __name__ == "__main__":
        IMDBCrawler().run()
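Field processors like ``process_time`` are plain functions, so their logic can be checked outside the crawler. A minimal standalone sketch (the helper name ``duration_to_minutes`` is hypothetical; no acrawler imports are needed):

```python
def duration_to_minutes(value):
    # hypothetical standalone version of the process_time logic above:
    # converts strings like '3h 1min' into total minutes
    if value:
        res = 0
        for seg in value.split(" "):
            if seg.endswith("min"):
                res += int(seg.replace("min", ""))
            elif seg.endswith("h"):
                res += 60 * int(seg.replace("h", ""))
        return res
    return value


print(duration_to_minutes("3h 1min"))  # -> 181
print(duration_to_minutes("45min"))    # -> 45
```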
Scrape quotes.toscrape.com
^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: python

    from acrawler import Parser, Crawler, ParselItem, Request, get_logger

    logger = get_logger("quotes")


    class QuoteItem(ParselItem):
        log = True
        default = {"type": "quote"}
        css = {"author": "small.author::text"}
        xpath = {"text": ['.//span[@class="text"]/text()', lambda s: s.strip("“")[:20]]}


    class AuthorItem(ParselItem):
        log = True
        default = {"type": "author"}
        css = {"name": "h3.author-title::text", "born": "span.author-born-date::text"}


    class QuoteCrawler(Crawler):
        main_page = r"quotes.toscrape.com/page/\d+"
        author_page = r"quotes.toscrape.com/author/.*"
        parsers = [
            Parser(
                in_pattern=main_page,
                follow_patterns=[main_page, author_page],
                item_type=QuoteItem,
                css_divider=".quote",
            ),
            Parser(in_pattern=author_page, item_type=AuthorItem),
        ]

        async def start_requests(self):
            yield Request(url="http://quotes.toscrape.com/page/1/")


    if __name__ == "__main__":
        QuoteCrawler().run()
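The ``in_pattern`` and ``follow_patterns`` arguments are plain regular-expression strings matched against URLs. A quick sketch of how the two patterns above behave, using Python's ``re`` module directly (this illustrates the patterns themselves, not acrawler's internal matching):

```python
import re

# the same patterns as in QuoteCrawler above
main_page = r"quotes.toscrape.com/page/\d+"
author_page = r"quotes.toscrape.com/author/.*"

# the main-page pattern matches paginated listing URLs...
print(bool(re.search(main_page, "http://quotes.toscrape.com/page/1/")))  # True
# ...but not author pages, which the second pattern picks up
print(bool(re.search(main_page, "http://quotes.toscrape.com/author/Albert-Einstein/")))    # False
print(bool(re.search(author_page, "http://quotes.toscrape.com/author/Albert-Einstein/")))  # True
```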
See `examples <examples/>`_.