
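cobweb is a Python-based distributed crawler scheduling framework. It currently supports both distributed and single-machine crawling, and allows custom databases, custom data storage, and custom data processing.
cobweb consists of three modules and one configuration file: the Launcher, the Crawler, the Pipeline (storage), and the setting configuration file.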
pip3 install --upgrade cobweb-launcher
from cobweb import Launcher

# Create the launcher
app = Launcher(task="test", project="test")

# Set the crawl seeds
app.SEEDS = [{
    "url": "https://www.baidu.com"
}]
...
# Start the task
app.start()
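Default configuration file: import cobweb.setting
Not recommended! This currently has bugs, so use it at your own risk. Example: a custom setting.py file has been created in the same directory.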
from cobweb import Launcher

app = Launcher(
    task="test",
    project="test",
    setting="import setting"
)
...
app.start()
from cobweb import Launcher

# Create the launcher
app = Launcher(
    task="test",
    project="test",
    REDIS_CONFIG={
        "host": ...,
        "password": ...,
        "port": ...,
        "db": ...
    }
)
...
# Start the task
app.start()
@app.request
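Use the decorator to wrap a custom request method. It runs before the request is sent and returns a Request object or an object that inherits from BaseItem, which lets you control the request parameters.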
from typing import Union
from cobweb import Launcher
from cobweb.base import Seed, Request, BaseItem

app = Launcher(
    task="test",
    project="test"
)
...
@app.request
def request(seed: Seed) -> Union[Request, BaseItem]:
    # Customize headers, proxies, request parameters, and so on here
    proxies = {"http": ..., "https": ...}
    yield Request(seed.url, seed, ..., proxies=proxies, timeout=15)
    # yield xxxItem(seed, ...)  # skip the request and parsing steps and go straight to data storage
...
app.start()
Default request method
def request(seed: Seed) -> Union[Request, BaseItem]:
    yield Request(seed.url, seed, timeout=5)
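To make the relationship between a custom request method and the default one more concrete, here is a minimal sketch of a decorator-registration pattern. It is not cobweb's implementation: `MiniLauncher`, `build_requests`, and the plain dicts standing in for Seed and Request are all hypothetical names invented for illustration. The only behavior taken from the text above is that a function registered through the decorator replaces a default that requests the seed URL with a 5 second timeout.

```python
from typing import Callable, Iterator, List, Optional


class MiniLauncher:
    """Hypothetical stand-in used only for this sketch; not cobweb's Launcher."""

    def __init__(self) -> None:
        self._custom_request: Optional[Callable[[dict], Iterator[dict]]] = None

    def request(self, func: Callable[[dict], Iterator[dict]]):
        # Used as a decorator: remember the user's function and return it unchanged.
        self._custom_request = func
        return func

    def _default_request(self, seed: dict) -> Iterator[dict]:
        # Fallback mirroring the documented default: seed URL, 5 second timeout.
        yield {"url": seed["url"], "seed": seed, "timeout": 5}

    def build_requests(self, seed: dict) -> List[dict]:
        # Prefer the registered function, otherwise use the default.
        handler = self._custom_request or self._default_request
        return list(handler(seed))


app = MiniLauncher()
seed = {"url": "https://www.baidu.com"}

# No custom function registered yet, so the default is used.
default_result = app.build_requests(seed)


@app.request
def request(seed: dict) -> Iterator[dict]:
    # Custom version: same URL, longer timeout.
    yield {"url": seed["url"], "seed": seed, "timeout": 15}


# The registered function now takes over.
custom_result = app.build_requests(seed)
```

In this sketch the decorator returns the function unchanged, so it can still be called directly in tests.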
@app.download
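Use the decorator to wrap a custom download method. It runs at the moment the request is made and returns a Response object or an object that inherits from BaseItem, which lets you control the request parameters.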
from typing import Union
from cobweb import Launcher
from cobweb.base import Request, Response, BaseItem

app = Launcher(
    task="test",
    project="test"
)
...
@app.download
def download(item: Request) -> Union[BaseItem, Response]:
    ...
    response = ...
    ...
    yield Response(item.seed, response, ...)  # return a Response object to be parsed
    # yield xxxItem(seed, ...)  # skip the request and parsing steps and go straight to data storage
...
app.start()
Default download method
def download(item: Request) -> Union[Seed, BaseItem, Response, str]:
    response = item.download()
    yield Response(item.seed, response, **item.to_dict)
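The download method can yield either a Response (which continues to parsing) or a BaseItem (which goes straight to storage). The sketch below shows one way such type-based routing could look. It is an assumption about the general idea, not cobweb's code: the `Response` and `BaseItem` dataclasses here are simplified stand-ins, and `route` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Response:
    """Simplified stand-in for a downloaded response (not cobweb.base.Response)."""
    seed: dict
    body: str


@dataclass
class BaseItem:
    """Simplified stand-in for a storable item (not cobweb.base.BaseItem)."""
    seed: dict
    data: dict = field(default_factory=dict)


def route(yielded: Iterable[object]) -> Tuple[List[Response], List[BaseItem]]:
    """Split yielded objects into 'parse next' and 'store directly' lists."""
    to_parse: List[Response] = []
    to_store: List[BaseItem] = []
    for obj in yielded:
        if isinstance(obj, Response):
            to_parse.append(obj)   # continue to the parse step
        elif isinstance(obj, BaseItem):
            to_store.append(obj)   # skip parsing, go to storage
        else:
            raise TypeError(f"unsupported object: {type(obj).__name__}")
    return to_parse, to_store


def download(seed: dict) -> Iterator[object]:
    # Hypothetical download function that yields one object of each kind.
    yield Response(seed, "<html>...</html>")
    yield BaseItem(seed, {"note": "stored without parsing"})


to_parse, to_store = route(download({"url": "https://www.baidu.com"}))
```

Because routing depends only on the yielded type, a single generator can mix both outcomes.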
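Custom parsing requires a storage item class and a parse method. The storage item class inherits from BaseItem and defines the target table name and fields. The parse method yields objects that inherit from BaseItem, and these yields control the data storage flow.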
from typing import Union
from cobweb import Launcher
from cobweb.base import Seed, Response, BaseItem

class TestItem(BaseItem):
    __TABLE__ = "test_data"  # table name
    __FIELDS__ = "field1, field2, field3"  # field names

app = Launcher(
    task="test",
    project="test"
)
...
@app.parse
def parse(item: Response) -> Union[Seed, BaseItem]:
    ...
    yield TestItem(item.seed, field1=..., field2=..., field3=...)
    # yield Seed(...)  # build a new seed and push it to the consumer queue
...
app.start()
Default parse method
def parse(item: Request) -> Union[Seed, BaseItem]:
    upload_item = item.to_dict
    upload_item["text"] = item.response.text
    yield ConsoleItem(item.seed, data=json.dumps(upload_item, ensure_ascii=False))
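The storage item class described above declares its table through `__TABLE__` and its columns through a comma-separated `__FIELDS__` string. As a rough illustration of how such declarations might be turned into a row, here is a self-contained sketch. The `BaseItem` below is a hypothetical stand-in written for this example, and its behavior (missing fields default to `None`, unknown fields raise an error) is an assumption rather than documented cobweb behavior.

```python
class BaseItem:
    """Hypothetical stand-in; the real cobweb.base.BaseItem may behave differently."""

    __TABLE__ = ""
    __FIELDS__ = ""

    def __init__(self, seed: dict, **kwargs) -> None:
        self.seed = seed
        # Split the comma-separated declaration into clean field names.
        names = [name.strip() for name in self.__FIELDS__.split(",") if name.strip()]
        unknown = set(kwargs) - set(names)
        if unknown:
            raise ValueError(f"unknown fields for {self.__TABLE__}: {sorted(unknown)}")
        # Assumption: fields that were not supplied are stored as None.
        self.row = {name: kwargs.get(name) for name in names}


class TestItem(BaseItem):
    __TABLE__ = "test_data"                 # table name
    __FIELDS__ = "field1, field2, field3"   # field names


item = TestItem({"url": "https://www.baidu.com"}, field1="a", field2="b")
```

Declaring the fields on the class keeps the parse method short: it only needs to pass keyword arguments that match the declared names.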
The flowchart has not been updated yet!