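cobweb is a Python-based distributed crawler scheduling framework. It currently supports both distributed and single-machine crawling, along with custom databases, custom data storage, and custom data processing.
cobweb consists of three modules and one configuration file: the Launcher, the Crawler, the Pipeline (storage), and the setting configuration file.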
pip3 install --upgrade cobweb-launcher
from cobweb import LauncherAir
# Create the launcher
app = LauncherAir(task="test", project="test")
# Set the seeds to crawl
app.SEEDS = [{
    "url": "https://www.baidu.com"
}]
...
# Start the task
app.start()
2. Customizing configuration file parameters
from cobweb import LauncherPro
# Create the launcher
app = LauncherPro(
    task="test",
    project="test"
)
...
# Start the task
app.start()
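Default configuration file: import cobweb.setting
Not recommended! This option currently has a bug, so use it at your own risk. For example, suppose a custom setting.py file has been created in the same directory: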
from cobweb import LauncherAir
app = LauncherAir(
    task="test",
    project="test",
    setting="import setting"
)
...
app.start()
from cobweb import LauncherPro
# Create the launcher
app = LauncherPro(
    task="test",
    project="test",
    REDIS_CONFIG={
        "host": ...,
        "password": ...,
        "port": ...,
        "db": ...
    }
)
...
# Start the task
app.start()
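The call above passes REDIS_CONFIG as a keyword argument in place of the value from the default settings. How cobweb applies such an override internally is not shown here. The following is only a minimal sketch of the general idea, assuming uppercase keyword arguments are treated as settings. DEFAULT_SETTINGS, its values, and build_settings are hypothetical names, not part of cobweb.

```python
# Hypothetical defaults standing in for a settings module; not cobweb's real values.
DEFAULT_SETTINGS = {
    "REDIS_CONFIG": {"host": "localhost", "port": 6379, "db": 0, "password": None},
    "TIMEOUT": 5,
}


def build_settings(**overrides):
    """Return a copy of the defaults with UPPER_CASE keyword overrides applied."""
    settings = dict(DEFAULT_SETTINGS)  # shallow copy, so the defaults stay untouched
    for name, value in overrides.items():
        if name.isupper():  # assumption: only UPPER_CASE names are settings
            settings[name] = value
    return settings


settings = build_settings(
    task="test",  # lowercase, so this sketch ignores it
    REDIS_CONFIG={"host": "10.0.0.5", "port": 6380, "db": 1, "password": "secret"},
)
print(settings["REDIS_CONFIG"]["host"])
print(settings["TIMEOUT"])
```

In this sketch an override replaces the whole REDIS_CONFIG dictionary rather than merging key by key, so all four keys are supplied together, as in the example above.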
@app.request
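Use this decorator to register a custom request method. It runs before the request is sent and returns either a Request object or an object that inherits from BaseItem, which lets you control the request parameters.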
from typing import Union
from cobweb import LauncherAir
from cobweb.base import Seed, Request, BaseItem
app = LauncherAir(
    task="test",
    project="test"
)
...
@app.request
def request(seed: Seed) -> Union[Request, BaseItem]:
    # Customize headers, proxies, request parameters, and so on here
    proxies = {"http": ..., "https": ...}
    yield Request(seed.url, seed, ..., proxies=proxies, timeout=15)
    # yield xxxItem(seed, ...)  # skip the request and parse steps and go straight to data storage
...
app.start()
Default request method
def request(seed: Seed) -> Union[Request, BaseItem]:
    yield Request(seed.url, seed, timeout=5)
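To illustrate how a decorator such as @app.request can put a custom function in place of a default one, here is a minimal sketch. MiniLauncher, its attributes, and the dictionary-based requests are hypothetical stand-ins chosen for this example. They are not cobweb's implementation.

```python
class MiniLauncher:
    """Hypothetical stand-in for a launcher with a replaceable request hook."""

    def __init__(self):
        # Use the built-in behaviour until a custom hook is registered.
        self._request = self._default_request

    @staticmethod
    def _default_request(seed):
        yield {"url": seed["url"], "timeout": 5}

    def request(self, func):
        """Decorator: remember func as the request hook and return it unchanged."""
        self._request = func
        return func

    def build_requests(self, seed):
        # Collect everything the current hook yields for one seed.
        return list(self._request(seed))


app = MiniLauncher()
before = app.build_requests({"url": "https://example.com"})


@app.request
def request(seed):
    # Custom hook: same URL, longer timeout.
    yield {"url": seed["url"], "timeout": 15}


after = app.build_requests({"url": "https://example.com"})
print(before[0]["timeout"], after[0]["timeout"])
```

Because the decorator returns the function unchanged, the custom hook can still be called directly, for instance in a unit test.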
@app.download
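Use this decorator to register a custom download method. It runs when the request is made and returns either a Response object or an object that inherits from BaseItem, which lets you control the request parameters.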
from typing import Union
from cobweb import LauncherAir
from cobweb.base import Request, Response, BaseItem
app = LauncherAir(
    task="test",
    project="test"
)
...
@app.download
def download(item: Request) -> Union[BaseItem, Response]:
    ...
    response = ...
    ...
    yield Response(item.seed, response, ...)  # yield a Response object to send it on to parsing
    # yield xxxItem(seed, ...)  # skip the request and parse steps and go straight to data storage
...
app.start()
Default download method
def download(item: Request) -> Union[Seed, BaseItem, Response, str]:
    response = item.download()
    yield Response(item.seed, response, **item.to_dict)
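The comments in the examples above indicate that what a hook yields decides the next stage: a Response goes on to parsing, while an item skips ahead to storage. A minimal sketch of that kind of type-based routing might look like the following. The Request, Response, and Item classes and the route function here are hypothetical stand-ins defined only for this example, not cobweb's own classes or scheduler.

```python
from dataclasses import dataclass


@dataclass
class Request:  # stand-in: something still to be downloaded
    url: str


@dataclass
class Response:  # stand-in: downloaded content waiting to be parsed
    url: str
    text: str


@dataclass
class Item:  # stand-in: a finished record ready for storage
    data: dict


def route(obj):
    """Name the stage that should handle an object yielded by a hook."""
    if isinstance(obj, Request):
        return "download"
    if isinstance(obj, Response):
        return "parse"
    if isinstance(obj, Item):
        return "store"
    raise TypeError(f"unexpected object: {obj!r}")


def download(request):
    # Yield an Item directly for "cached" pages, otherwise a Response.
    if request.url.endswith("/cached"):
        yield Item({"url": request.url, "source": "cache"})
    else:
        yield Response(request.url, "<html>...</html>")


for url in ("https://example.com/page", "https://example.com/cached"):
    for result in download(Request(url)):
        print(url, "->", route(result))
```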
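Custom parsing consists of a storage data class and a parse method. The storage data class inherits from BaseItem and defines the target table name and its fields. The parse method yields objects that inherit from BaseItem, and those yielded objects drive the data storage flow.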
from typing import Union
from cobweb import LauncherAir
from cobweb.base import Seed, Response, BaseItem
class TestItem(BaseItem):
    __TABLE__ = "test_data"  # table name
    __FIELDS__ = "field1, field2, field3"  # field names
app = LauncherAir(
    task="test",
    project="test"
)
...
@app.parse
def parse(item: Response) -> Union[Seed, BaseItem]:
    ...
    yield TestItem(item.seed, field1=..., field2=..., field3=...)
    # yield Seed(...)  # build a new seed and push it to the consumer queue
...
app.start()
Default parse method
def parse(item: Response) -> Union[Seed, BaseItem]:
    upload_item = item.to_dict
    upload_item["text"] = item.response.text
    yield ConsoleItem(item.seed, data=json.dumps(upload_item, ensure_ascii=False))
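As a rough illustration of how a class carrying __TABLE__ and __FIELDS__ (as in TestItem above) could turn keyword arguments into a storable row, here is a minimal sketch. SketchItem and DemoItem are hypothetical stand-ins. cobweb's BaseItem may validate and store fields differently.

```python
class SketchItem:
    """Hypothetical stand-in for an item base class; not cobweb's BaseItem."""

    __TABLE__ = ""
    __FIELDS__ = ""

    def __init__(self, seed, **kwargs):
        self.seed = seed
        # Split the comma-separated field list into clean names.
        allowed = [name.strip() for name in self.__FIELDS__.split(",") if name.strip()]
        unknown = set(kwargs) - set(allowed)
        if unknown:
            raise ValueError(f"unknown fields for {self.__TABLE__}: {sorted(unknown)}")
        # Missing fields default to None so every row has the same columns.
        self.row = {name: kwargs.get(name) for name in allowed}


class DemoItem(SketchItem):
    __TABLE__ = "test_data"
    __FIELDS__ = "field1, field2, field3"


item = DemoItem({"url": "https://example.com"}, field1="a", field2="b")
print(item.__TABLE__, item.row)
```

Rejecting unknown field names early is a design choice of this sketch. It surfaces typos in the parse method before anything reaches storage.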
The flowchart has not been updated yet!
FAQs
spider_hole
We found that cobweb-launcher demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.