crawlx

Advanced web crawler and scraper with built-in parsing, following, rate limiting, and plugin system

Latest version: 2.0.1 (npm) · Maintainers: 1
CrawlX 2.0

A modern, high-performance web crawling and data-extraction library, built in TypeScript, with a powerful parsing engine, plugin system, and configuration management.


✨ Features

  • 🚀 High performance: lightweight and high-performance dual modes for different scenarios
  • 🔧 TypeScript: full type safety and a first-class developer experience
  • 🧩 Plugin system: extensible plugin architecture with 6 built-in plugins plus custom plugin support
  • 📊 Powerful parsing: CSS selectors + filter pipelines + scoped parsing
  • 🕷️ Link following: smart link discovery and depth control
  • ⏱️ Rate control: advanced rate limiting via the token-bucket algorithm
  • 🔄 Smart retries: retry mechanism with exponential backoff
  • 📝 Structured logging: logging system with multiple transports
  • 🛡️ Error handling: well-defined error types and recovery mechanisms
  • ⚙️ Flexible configuration: schema validation + environment variables + preset configurations
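The rate-control feature above names the token-bucket algorithm. As a rough illustration of that idea (a standalone sketch, not crawlx's internal implementation), a bucket holds up to `capacity` tokens, refills at a fixed rate, and each request consumes one token:

```javascript
// Minimal token-bucket sketch: allows short bursts up to `capacity`,
// sustained throughput of `ratePerSec` requests per second.
class TokenBucket {
  constructor(capacity, ratePerSec) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.ratePerSec = ratePerSec;
    this.last = Date.now();
  }

  // Refill based on elapsed time, then try to take one token.
  tryRemove(now = Date.now()) {
    const elapsed = (now - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.ratePerSec);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

const bucket = new TokenBucket(2, 1); // burst of 2, then 1 request/second
console.log(bucket.tryRemove()); // true  (burst token)
console.log(bucket.tryRemove()); // true  (burst token)
console.log(bucket.tryRemove()); // false (bucket empty, must wait ~1s)
```

A crawler would sleep and retry when `tryRemove()` returns false rather than dropping the request.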

📦 Installation

npm install crawlx

🚀 Quick Start

Basic crawling

import { CrawlX } from 'crawlx';

const crawler = new CrawlX();

// Crawl a single URL
const result = await crawler.crawl('https://example.com');
console.log('Title:', result.response.$.find('title').text());

await crawler.destroy();

Data extraction

import { createScraper } from 'crawlx';

const scraper = createScraper();

const result = await scraper.crawl('https://example.com', {
  parse: {
    title: 'title',
    headings: ['h1', 'h2'],
    links: '[a@href]',
    metadata: {
      description: 'meta[name="description"]@content',
      keywords: 'meta[name="keywords"]@content',
    },
  },
});

console.log('Extracted data:', result.parsed);
await scraper.destroy();

Quick crawl

import { quickCrawl } from 'crawlx';

const result = await quickCrawl('https://example.com', {
  title: 'title',
  description: 'meta[name="description"]@content',
});

console.log(result.parsed);

🏭 Factory Functions

CrawlX provides several preconfigured factory functions:

import { 
  createLightweightCrawler,
  createHighPerformanceCrawler,
  createScraper,
  createSpider,
  createMonitor,
  createValidator 
} from 'crawlx';

// Lightweight crawler - for simple tasks
const lightweight = createLightweightCrawler();

// High-performance crawler - for large-scale operations
const highPerf = createHighPerformanceCrawler();

// Data extractor - optimized for data extraction
const scraper = createScraper();

// Web spider - link following and content discovery
const spider = createSpider();

// Monitor - website change detection
const monitor = createMonitor();

// Validator - link health checking
const validator = createValidator();

📊 Data Parsing

A powerful CSS-selector-based parsing system:

Basic selectors

const parseRule = {
  title: 'title',                    // text content
  links: '[a@href]',                 // attribute value
  images: ['img@src'],               // array
  price: '.price | trim | number',   // filters
};
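The `| trim | number` syntax above chains filters over the raw extracted text. The pipeline idea can be sketched on its own (a standalone illustration with two hypothetical filter implementations; crawlx's actual filter registry may behave differently):

```javascript
// Minimal filter pipeline: split a "f1 | f2" chain and apply each
// named filter to the extracted raw value, left to right.
const filters = {
  trim: (s) => s.trim(),
  number: (s) => parseFloat(s.replace(/[^0-9.\-]/g, '')),
};

function applyFilters(raw, chain) {
  return chain
    .split('|')
    .map((name) => name.trim())
    .filter(Boolean)
    .reduce((value, name) => filters[name](value), raw);
}

console.log(applyFilters('  $19.99  ', 'trim | number')); // 19.99
```

Each filter is a plain function from value to value, which is what makes the pipeline composable.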

Nested structures

const parseRule = {
  products: {
    _scope: '.product',              // scope
    name: '.name',
    price: '.price | trim | number',
    image: 'img@src',
    details: {
      _scope: '.details',
      description: '.desc',
      specs: ['.spec'],
    },
  },
};

Custom functions

const parseRule = {
  title: 'title',
  url: () => window.location.href,
  timestamp: () => new Date().toISOString(),
  productCount: ($) => $('.product').length,
};

⚙️ Configuration

Basic configuration

const crawler = new CrawlX({
  mode: 'high-performance',
  concurrency: 10,
  timeout: 30000,
  userAgent: 'MyBot/1.0',
  headers: {
    'Accept': 'text/html,application/xhtml+xml',
  },
});

Plugin configuration

const crawler = new CrawlX({
  plugins: {
    delay: {
      enabled: true,
      defaultDelay: 1000,
      randomDelay: true,
    },
    rateLimit: {
      enabled: true,
      globalLimit: { requests: 100, window: 60000 },
    },
    retry: {
      enabled: true,
      maxRetries: 3,
      exponentialBackoff: true,
    },
  },
});
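With `exponentialBackoff` enabled, each retry waits longer than the last. A common formula for that schedule (a sketch of the general technique, with an assumed base delay, cap, and optional jitter, not necessarily crawlx's exact implementation) is:

```javascript
// Exponential backoff: delay doubles per attempt, capped at maxMs,
// with optional jitter to avoid synchronized retry storms.
function backoffDelay(attempt, { baseMs = 1000, maxMs = 30000, jitter = false } = {}) {
  const delay = Math.min(maxMs, baseMs * 2 ** attempt);
  return jitter ? delay * (0.5 + Math.random() * 0.5) : delay;
}

console.log(backoffDelay(0));  // 1000
console.log(backoffDelay(1));  // 2000
console.log(backoffDelay(2));  // 4000
console.log(backoffDelay(10)); // 30000 (capped)
```

With `maxRetries: 3` as configured above, a failing request would wait roughly 1s, 2s, then 4s before giving up.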

Environment variables

CRAWLX_MODE=high-performance
CRAWLX_CONCURRENCY=10
CRAWLX_TIMEOUT=30000
CRAWLX_PLUGINS_DELAY_ENABLED=true
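The variable names above appear to map underscore-separated segments onto nested config keys (e.g. CRAWLX_PLUGINS_DELAY_ENABLED → plugins.delay.enabled). A sketch of that mapping under that assumption (it would not hold for multi-word keys like userAgent, and crawlx's real loader may differ):

```javascript
// Map CRAWLX_* environment variables onto a nested config object,
// assuming each underscore after the prefix introduces a nesting level.
function envToConfig(env) {
  const config = {};
  for (const [name, raw] of Object.entries(env)) {
    if (!name.startsWith('CRAWLX_')) continue;
    const path = name.slice('CRAWLX_'.length).toLowerCase().split('_');
    // Coerce obvious scalars: booleans and numbers stay typed.
    let value = raw;
    if (raw === 'true') value = true;
    else if (raw === 'false') value = false;
    else if (raw !== '' && !Number.isNaN(Number(raw))) value = Number(raw);
    let node = config;
    for (const key of path.slice(0, -1)) node = node[key] ??= {};
    node[path[path.length - 1]] = value;
  }
  return config;
}

console.log(envToConfig({ CRAWLX_CONCURRENCY: '10', CRAWLX_PLUGINS_DELAY_ENABLED: 'true' }));
// → { concurrency: 10, plugins: { delay: { enabled: true } } }
```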

Configuration presets

import { ConfigPresets } from 'crawlx';

// Development preset
const devCrawler = ConfigPresets.development();

// Production preset
const prodCrawler = ConfigPresets.production();

// Testing preset
const testCrawler = ConfigPresets.testing();

🧩 Plugin System

Built-in plugins

  • ParsePlugin: data parsing and extraction
  • FollowPlugin: link following and discovery
  • RetryPlugin: automatic retry mechanism
  • DelayPlugin: request delays and politeness
  • DuplicateFilterPlugin: URL de-duplication
  • RateLimitPlugin: advanced rate limiting
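The DuplicateFilterPlugin's job (skipping URLs already seen) can be sketched with a Set plus light normalization; this is a standalone illustration, and crawlx's actual normalization rules may differ:

```javascript
// URL de-duplication: normalize (drop fragment and trailing slash;
// the URL class lowercases the host) and track what has been seen.
function normalizeUrl(url) {
  const u = new URL(url);
  u.hash = ''; // "#section" never changes the page fetched
  u.pathname = u.pathname.replace(/\/+$/, '') || '/';
  return u.toString();
}

class DuplicateFilter {
  constructor() {
    this.seen = new Set();
  }
  // Returns true only the first time a URL (after normalization) is offered.
  shouldVisit(url) {
    const key = normalizeUrl(url);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

const filter = new DuplicateFilter();
console.log(filter.shouldVisit('https://example.com/page/'));    // true
console.log(filter.shouldVisit('https://example.com/page#top')); // false (same page)
```

Normalization matters more than the Set itself: without it, trivially different spellings of the same URL would all be crawled.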

Custom plugins

class CustomPlugin {
  name = 'custom';
  version = '1.0.0';
  priority = 100;

  async onTaskComplete(result) {
    result.customData = {
      processedAt: new Date().toISOString(),
    };
    return result;
  }
}

const crawler = new CrawlX();
crawler.addPlugin(new CustomPlugin());

📡 Event Handling

const crawler = new CrawlX();

crawler.on('task-start', (task) => {
  console.log(`Started: ${task.url}`);
});

crawler.on('task-complete', (result) => {
  console.log(`Completed: ${result.response.url}`);
});

crawler.on('data-extracted', (data, url) => {
  console.log(`Data extracted from ${url}:`, data);
});

crawler.on('task-error', (error, task) => {
  console.log(`Failed: ${task.url} - ${error.message}`);
});

🛡️ Error Handling

import { CrawlXError, NetworkError, TimeoutError } from 'crawlx';

try {
  const result = await crawler.crawl('https://example.com');
} catch (error) {
  if (error instanceof NetworkError) {
    console.log('Network error:', error.statusCode);
  } else if (error instanceof TimeoutError) {
    console.log('Timeout:', error.timeout);
  } else if (error instanceof CrawlXError) {
    console.log('CrawlX error:', error.code, error.context);
  }
}

📈 Monitoring and Stats

const stats = crawler.getStats();
console.log('Crawler stats:', {
  isRunning: stats.isRunning,
  results: stats.results,
  scheduler: stats.scheduler,
  httpClient: stats.httpClient,
  plugins: stats.plugins,
});

🏗️ Architecture

┌─────────────────┐
│   CrawlX Core   │ ← Main coordinator
├─────────────────┤
│ Plugin Manager  │ ← Extensibility layer
├─────────────────┤
│ Task Scheduler  │ ← Concurrency & queueing
├─────────────────┤
│  HTTP Client    │ ← Network layer
├─────────────────┤
│ Parser Engine   │ ← Data extraction
├─────────────────┤
│ Config Manager  │ ← Configuration management
└─────────────────┘

📚 Documentation

  • Quick start
  • Advanced examples
  • Plugin development
  • Performance tuning
  • API reference

🤝 Contributing

Contributions are welcome! Please see our contributing guide.

📄 License

MIT License

🆘 Support

Ready to start crawling? Check out the quick start guide to begin your CrawlX journey!

Keywords

web-crawler

Package last updated on 15 Jun 2025