
Advanced web crawler and scraper with built-in parsing, following, rate limiting, and plugin system
A modern, high-performance web crawling and data-extraction library built with TypeScript, featuring a powerful parsing engine, plugin system, and configuration management.
```bash
npm install crawlx
```
```typescript
import { CrawlX } from 'crawlx';

const crawler = new CrawlX();

// Crawl a single URL
const result = await crawler.crawl('https://example.com');
console.log('Title:', result.response.$.find('title').text());

await crawler.destroy();
```
```typescript
import { createScraper } from 'crawlx';

const scraper = createScraper();

const result = await scraper.crawl('https://example.com', {
  parse: {
    title: 'title',
    headings: ['h1', 'h2'],
    links: '[a@href]',
    metadata: {
      description: 'meta[name="description"]@content',
      keywords: 'meta[name="keywords"]@content',
    },
  },
});

console.log('Extracted data:', result.parsed);

await scraper.destroy();
```
```typescript
import { quickCrawl } from 'crawlx';

const result = await quickCrawl('https://example.com', {
  title: 'title',
  description: 'meta[name="description"]@content',
});

console.log(result.parsed);
```
CrawlX provides several preconfigured factory functions:
```typescript
import {
  createLightweightCrawler,
  createHighPerformanceCrawler,
  createScraper,
  createSpider,
  createMonitor,
  createValidator
} from 'crawlx';

// Lightweight crawler - suited to simple tasks
const lightweight = createLightweightCrawler();

// High-performance crawler - suited to large-scale operations
const highPerf = createHighPerformanceCrawler();

// Scraper - optimized data extraction
const scraper = createScraper();

// Spider - link following and content discovery
const spider = createSpider();

// Monitor - website change detection
const monitor = createMonitor();

// Validator - link health checking
const validator = createValidator();
```
A powerful CSS-selector-based parsing system:
```typescript
const parseRule = {
  title: 'title',                   // text content
  links: '[a@href]',                // attribute value
  images: ['img@src'],              // array of matches
  price: '.price | trim | number',  // filters
};
```
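The `| trim | number` syntax chains filters onto a selector's raw value. Purely as an illustration (this is not crawlx's internal implementation), a minimal sketch of how such a pipeline could be split and applied, assuming hypothetical `trim` and `number` filters:

```typescript
// Illustrative sketch of a `selector | filter | filter` pipeline; not crawlx internals.
type Filter = (value: string | number) => string | number;

// Hypothetical filter registry
const filters: Record<string, Filter> = {
  trim: (v) => String(v).trim(),
  number: (v) => parseFloat(String(v)),
};

// Split a rule like '.price | trim | number' into a selector and a filter chain,
// then run the chain left-to-right over the raw extracted string.
function applyFilters(raw: string, rule: string): { selector: string; value: string | number } {
  const [selector, ...names] = rule.split('|').map((part) => part.trim());
  const value = names.reduce<string | number>((acc, name) => filters[name](acc), raw);
  return { selector, value };
}

// applyFilters('  19.99  ', '.price | trim | number') → { selector: '.price', value: 19.99 }
```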
Rules can nest, and `_scope` confines a sub-rule to its matching elements:

```typescript
const parseRule = {
  products: {
    _scope: '.product',  // scope
    name: '.name',
    price: '.price | trim | number',
    image: 'img@src',
    details: {
      _scope: '.details',
      description: '.desc',
      specs: ['.spec'],
    },
  },
};
```
Parse rules can also contain custom extraction functions:

```typescript
const parseRule = {
  title: 'title',
  url: () => window.location.href,
  timestamp: () => new Date().toISOString(),
  productCount: ($) => $('.product').length,
};
```
```typescript
const crawler = new CrawlX({
  mode: 'high-performance',
  concurrency: 10,
  timeout: 30000,
  userAgent: 'MyBot/1.0',
  headers: {
    'Accept': 'text/html,application/xhtml+xml',
  },
});
```
Built-in plugins are configured through the `plugins` option:

```typescript
const crawler = new CrawlX({
  plugins: {
    delay: {
      enabled: true,
      defaultDelay: 1000,
      randomDelay: true,
    },
    rateLimit: {
      enabled: true,
      globalLimit: { requests: 100, window: 60000 },
    },
    retry: {
      enabled: true,
      maxRetries: 3,
      exponentialBackoff: true,
    },
  },
});
```
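With `exponentialBackoff` enabled, a retry plugin typically doubles the wait between attempts. A hypothetical sketch of such a schedule (the base delay and doubling factor are assumptions here, not crawlx's documented behavior):

```typescript
// Hypothetical exponential backoff: the delay doubles on each retry attempt.
function backoffSchedule(baseDelayMs: number, maxRetries: number): number[] {
  return Array.from({ length: maxRetries }, (_, attempt) => baseDelayMs * 2 ** attempt);
}

// backoffSchedule(1000, 3) → [1000, 2000, 4000]
```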
Configuration can also be supplied through `CRAWLX_`-prefixed environment variables:

```bash
CRAWLX_MODE=high-performance
CRAWLX_CONCURRENCY=10
CRAWLX_TIMEOUT=30000
CRAWLX_PLUGINS_DELAY_ENABLED=true
```
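The variable names suggest a convention where underscore-separated keys map onto nested config options. A rough sketch of that mapping (hypothetical; crawlx's actual parsing, e.g. of camelCase keys like `defaultDelay`, may differ):

```typescript
// Hypothetical: turn CRAWLX_* environment variables into a nested config object.
// e.g. CRAWLX_PLUGINS_DELAY_ENABLED=true → { plugins: { delay: { enabled: true } } }
function envToConfig(env: Record<string, string | undefined>): Record<string, unknown> {
  const config: Record<string, unknown> = {};
  for (const [key, raw] of Object.entries(env)) {
    if (!key.startsWith('CRAWLX_') || raw === undefined) continue;
    const path = key.slice('CRAWLX_'.length).toLowerCase().split('_');
    // Coerce booleans and numbers; leave everything else as a string.
    const value =
      raw === 'true' ? true :
      raw === 'false' ? false :
      Number.isNaN(Number(raw)) ? raw : Number(raw);
    let node: any = config;
    for (const part of path.slice(0, -1)) {
      node = node[part] ??= {};  // create intermediate objects as needed
    }
    node[path[path.length - 1]] = value;
  }
  return config;
}
```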
```typescript
import { ConfigPresets } from 'crawlx';

// Development preset
const devCrawler = ConfigPresets.development();

// Production preset
const prodCrawler = ConfigPresets.production();

// Testing preset
const testCrawler = ConfigPresets.testing();
```
```typescript
class CustomPlugin {
  name = 'custom';
  version = '1.0.0';
  priority = 100;

  async onTaskComplete(result) {
    result.customData = {
      processedAt: new Date().toISOString(),
    };
    return result;
  }
}

const crawler = new CrawlX();
crawler.addPlugin(new CustomPlugin());
```
```typescript
const crawler = new CrawlX();

crawler.on('task-start', (task) => {
  console.log(`Started: ${task.url}`);
});

crawler.on('task-complete', (result) => {
  console.log(`Completed: ${result.response.url}`);
});

crawler.on('data-extracted', (data, url) => {
  console.log(`Data extracted from ${url}:`, data);
});

crawler.on('task-error', (error, task) => {
  console.log(`Failed: ${task.url} - ${error.message}`);
});
```
```typescript
import { CrawlXError, NetworkError, TimeoutError } from 'crawlx';

try {
  const result = await crawler.crawl('https://example.com');
} catch (error) {
  if (error instanceof NetworkError) {
    console.log('Network error:', error.statusCode);
  } else if (error instanceof TimeoutError) {
    console.log('Timed out after:', error.timeout);
  } else if (error instanceof CrawlXError) {
    console.log('CrawlX error:', error.code, error.context);
  }
}
```
```typescript
const stats = crawler.getStats();

console.log('Crawler stats:', {
  isRunning: stats.isRunning,
  results: stats.results,
  scheduler: stats.scheduler,
  httpClient: stats.httpClient,
  plugins: stats.plugins,
});
```
```text
┌─────────────────┐
│   CrawlX Core   │ ← Main coordinator
├─────────────────┤
│ Plugin Manager  │ ← Extensibility layer
├─────────────────┤
│ Task Scheduler  │ ← Concurrency and queueing
├─────────────────┤
│   HTTP Client   │ ← Network layer
├─────────────────┤
│  Parser Engine  │ ← Data extraction
├─────────────────┤
│ Config Manager  │ ← Configuration management
└─────────────────┘
```
Contributions are welcome! Please see our contributing guide.
Ready to start crawling? Check out the Quick Start guide to begin your CrawlX journey!