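# Show help
npx web-scraper-mcp --help

# Scrape images from a website
npx web-scraper-mcp scrape images https://example.com

# Scrape text from a website
npx web-scraper-mcp scrape text https://example.com

# List scraped images
npx web-scraper-mcp list images

# Show scraping status
npx web-scraper-mcp status

This is the simplest way to use web-scraper-mcp. It suits occasional use and users who prefer not to install the package locally.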

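Quick start

Running as an MCP server

Install dependencies:

cd web-scraper-mcp
npm install
npm run build

Start the server:

npm start

Alternatively, skip the steps above and add the server directly to your IDE (the most recommended option). To configure the MCP server in Claude Code: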

{
  "mcpServers": {
    "web-scraper-mcp": {
      "command": "npx",
      "args": [
        "-y",
        "web-scraper-mcp"
      ]
    }
  }
}

Using the API directly

import { ImageScraper, TextScraper } from 'web-scraper-mcp';

// Scrape images
const imageScraper = new ImageScraper('./images');
const images = await imageScraper.scrapeImagesFromUrl('https://example.com');

// Scrape text
const textScraper = new TextScraper('./text');
const content = await textScraper.scrapeTextFromUrl('https://example.com');

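Available tools

1. scrape_images

Scrapes all images from the given website and saves them locally.

Parameters:

url (string, required): URL of the website to scrape images from
outputDir (string, optional): directory to save images in; defaults to ./scraped-images
maxConcurrent (number, optional): number of concurrent downloads; defaults to 5

Example: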

{
  "name": "scrape_images",
  "arguments": {
    "url": "https://example.com",
    "outputDir": "./my-images",
    "maxConcurrent": 10
  }
}
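
The maxConcurrent parameter caps how many downloads run at the same time. How the package enforces this is not shown in this README; the following is a minimal sketch of one common approach, a fixed pool of workers pulling from a shared queue. The helper name runWithLimit is hypothetical and not part of web-scraper-mcp.

```typescript
// Hypothetical helper: runs async tasks with at most `limit` in flight.
// It illustrates the idea behind maxConcurrent; it is not the package's code.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results: T[] = new Array(tasks.length);
  let next = 0; // index of the next task that has not been started

  // Each worker repeatedly claims the next unstarted task until none remain.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // results are in the same order as `tasks`
}
```

With this shape, raising the limit trades memory and load on the remote server for throughput, which is presumably why the tool exposes it as a parameter.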

2. scrape_text

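Scrapes the text content of a website and saves it as a Markdown file.

Parameters:

url (string, required): URL of the website to scrape text from
outputDir (string, optional): directory to save text in; defaults to ./scraped-text

Example: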

{
  "name": "scrape_text",
  "arguments": {
    "url": "https://example.com",
    "outputDir": "./my-texts"
  }
}

3. list_images

Lists information about all downloaded images.

Parameters:

outputDir (string, optional): path of the image directory; defaults to ./scraped-images

Example:

{
  "name": "list_images",
  "arguments": {
    "outputDir": "./my-images"
  }
}

4. list_texts

Lists all extracted text files.

Parameters:

outputDir (string, optional): path of the text directory; defaults to ./scraped-text

Example:

{
  "name": "list_texts",
  "arguments": {
    "outputDir": "./my-texts"
  }
}

5. cleanup_images

Deletes all downloaded image files.

Parameters:

outputDir (string, optional): path of the image directory to clean; defaults to ./scraped-images

Example:

{
  "name": "cleanup_images",
  "arguments": {
    "outputDir": "./my-images"
  }
}

6. cleanup_texts

Deletes all extracted text files.

Parameters:

outputDir (string, optional): path of the text directory to clean; defaults to ./scraped-text

Example:

{
  "name": "cleanup_texts",
  "arguments": {
    "outputDir": "./my-texts"
  }
}

7. get_scraping_status

Returns status information about the scraper.

Parameters:

outputDir (string, optional): path of the output directory

Example:

{
  "name": "get_scraping_status",
  "arguments": {
    "outputDir": "./output"
  }
}
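
For callers written in TypeScript, the seven tools above can be summarized as a single union type so that the compiler catches misspelled tool names or missing arguments. This is a sketch transcribed from the parameter lists in this README; the type ToolCall and the helper validateToolCall are hypothetical names, and treating arguments as optional for the tools that only take outputDir is an assumption.

```typescript
// Argument shapes transcribed from the tool list; names are illustrative.
type ToolCall =
  | {
      name: 'scrape_images';
      arguments: { url: string; outputDir?: string; maxConcurrent?: number };
    }
  | {
      name: 'scrape_text';
      arguments: { url: string; outputDir?: string };
    }
  | {
      name:
        | 'list_images'
        | 'list_texts'
        | 'cleanup_images'
        | 'cleanup_texts'
        | 'get_scraping_status';
      arguments?: { outputDir?: string }; // assumed optional
    };

// Hypothetical pre-flight check: returns a list of problems, empty if none.
function validateToolCall(call: ToolCall): string[] {
  const problems: string[] = [];
  if (call.name === 'scrape_images' || call.name === 'scrape_text') {
    if (call.arguments.url.trim() === '') {
      problems.push('url is required');
    }
  }
  if (call.name === 'scrape_images') {
    const n = call.arguments.maxConcurrent;
    if (n !== undefined && (!Number.isInteger(n) || n < 1)) {
      problems.push('maxConcurrent must be a positive integer');
    }
  }
  return problems;
}
```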

CLI commands

web-scraper-mcp provides a command-line interface that can be run through npx or after installation.

start

Starts the MCP server.

npx web-scraper-mcp start

scrape

Scrapes website content.

# Scrape images
npx web-scraper-mcp scrape images https://example.com

# Scrape text
npx web-scraper-mcp scrape text https://example.com

# Specify the output directory
npx web-scraper-mcp scrape images https://example.com -o ./my-images

# Set the concurrency (applies to images only)
npx web-scraper-mcp scrape images https://example.com -c 10

list

Lists scraped content.

# List images
npx web-scraper-mcp list images

# List texts
npx web-scraper-mcp list texts

# Specify the directory
npx web-scraper-mcp list images -o ./my-images

status

Shows the scraping status.

# Show status for the default directory
npx web-scraper-mcp status

# Specify the directory
npx web-scraper-mcp status -o ./output

cleanup

Deletes scraped content.

# Clean up images
npx web-scraper-mcp cleanup images

# Clean up texts
npx web-scraper-mcp cleanup texts

# Specify the directory
npx web-scraper-mcp cleanup images -o ./my-images

Usage examples

Example 1: Scraping a website's images and text

// 1. Scrape images
const imageResult = await server.requestToolCall({
  name: 'scrape_images',
  arguments: {
    url: 'https://example.com',
    outputDir: './example-images'
  }
});

// 2. Scrape text
const textResult = await server.requestToolCall({
  name: 'scrape_text',
  arguments: {
    url: 'https://example.com',
    outputDir: './example-texts'
  }
});

// 3. Check the status
const status = await server.requestToolCall({
  name: 'get_scraping_status'
});

Example 2: Batch-processing multiple websites

const websites = [
  'https://example1.com',
  'https://example2.com',
  'https://example3.com'
];

for (const website of websites) {
  // Scrape images
  await server.requestToolCall({
    name: 'scrape_images',
    arguments: {
      url: website,
      outputDir: `./images-${new Date().toISOString().split('T')[0]}`
    }
  });

  // Scrape text
  await server.requestToolCall({
    name: 'scrape_text',
    arguments: {
      url: website,
      outputDir: `./texts-${new Date().toISOString().split('T')[0]}`
    }
  });
}
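
In the loop above, every site writes into the same date-stamped directory. If one directory per site is preferred, a possible variation is to fold the hostname into the path. The helper below is a hypothetical sketch: the name datedOutputDir and the directory layout are illustrative choices, and whether the tools create nested directories automatically is not stated in this README.

```typescript
// Hypothetical helper: builds "./<prefix>-<YYYY-MM-DD>/<hostname>".
// The date is taken in UTC, matching the toISOString() call in the example.
function datedOutputDir(
  prefix: string,
  siteUrl: string,
  date: Date = new Date()
): string {
  const day = date.toISOString().split('T')[0];
  // Keep only characters that are safe in a directory name.
  const host = new URL(siteUrl).hostname.replace(/[^a-z0-9.-]/gi, '_');
  return `./${prefix}-${day}/${host}`;
}
```

The result can be passed as outputDir in place of the template string, for example datedOutputDir('images', website).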

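Configuration options

Environment variables

SCRAPER_OUTPUT_DIR: default output directory
SCRAPER_MAX_CONCURRENT: default number of concurrent downloads
SCRAPER_USER_AGENT: custom User-Agent
SCRAPER_TIMEOUT: request timeout (milliseconds)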
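
Environment variables arrive as strings, so numeric settings need parsing and a fallback. The sketch below shows one way to read the four variables into a typed object. The function name loadEnvConfig is hypothetical; the fallbacks of ./scraped-images and 5 are borrowed from the scrape_images defaults as an assumption, and no default is assumed for the User-Agent or the timeout because this README does not state one.

```typescript
interface ScraperEnvConfig {
  outputDir: string;
  maxConcurrent: number;
  userAgent?: string;
  timeoutMs?: number;
}

// Hypothetical loader; pass process.env in real use.
function loadEnvConfig(
  env: Record<string, string | undefined>
): ScraperEnvConfig {
  // Accept only positive integers; anything else counts as "not set".
  const parsePositiveInt = (raw: string | undefined): number | undefined => {
    if (raw === undefined || raw.trim() === '') return undefined;
    const n = Number(raw);
    return Number.isInteger(n) && n > 0 ? n : undefined;
  };

  return {
    outputDir: env['SCRAPER_OUTPUT_DIR'] || './scraped-images', // assumed default
    maxConcurrent: parsePositiveInt(env['SCRAPER_MAX_CONCURRENT']) ?? 5, // assumed default
    userAgent: env['SCRAPER_USER_AGENT'] || undefined,
    timeoutMs: parsePositiveInt(env['SCRAPER_TIMEOUT']),
  };
}
```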

Configuration file

Create a config.json file:

{
  "imageScraper": {
    "outputDir": "./scraped-images",
    "maxConcurrent": 5,
    "allowedFormats": ["jpg", "png", "gif", "webp", "svg"]
  },
  "textScraper": {
    "outputDir": "./scraped-text",
    "excludeSelectors": ["script", "style", "nav", "footer"]
  }
}
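
The allowedFormats list suggests that image URLs are filtered by file extension. How the package actually matches formats is not documented here; the sketch below assumes a simple case-insensitive comparison against the extension in the URL path. The function name hasAllowedFormat is hypothetical.

```typescript
// Hypothetical filter: true if the URL path ends in one of the allowed extensions.
function hasAllowedFormat(imageUrl: string, allowedFormats: string[]): boolean {
  let pathname: string;
  try {
    pathname = new URL(imageUrl).pathname; // drops the query string and fragment
  } catch {
    return false; // not an absolute URL
  }
  const dot = pathname.lastIndexOf('.');
  // Ignore dots that belong to a directory name rather than the file name.
  if (dot === -1 || dot < pathname.lastIndexOf('/')) return false;
  const ext = pathname.slice(dot + 1).toLowerCase();
  return allowedFormats.map((f) => f.toLowerCase()).includes(ext);
}
```

Under this assumption, a file named photo.jpeg would not match the "jpg" entry, and images served from extensionless URLs would be skipped, so a real implementation might also inspect the Content-Type header.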

Development guide

Project structure

web-scraper-mcp/
├── src/
│   ├── scrapers/
│   │   ├── imageScraper.ts     # Image scraper
│   │   └── textScraper.ts      # Text scraper
│   ├── utils/
│   │   ├── validators.ts       # Input validation
│   │   └── logger.ts           # Logging utility
│   └── index.ts               # MCP server entry point
├── package.json
├── tsconfig.json
├── README.md
└── dist/                      # Build output

Running tests

npm test

Building the project

npm run build

Development mode

npm run dev

CLI development mode

# Run the CLI script (TypeScript)
npm run dev:cli -- --help

# Scrape images (development mode)
npm run dev:cli -- scrape images https://example.com

# Scrape text (development mode)
npm run dev:cli -- scrape text https://example.com

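Troubleshooting

Common problems

Network errors
- Check your network connection
- Confirm that the URL is well-formed
- Verify that the target website is reachable

Permission errors
- Confirm that the output directory is writable
- Check that the file path is valid

Out of memory
- Lower the number of concurrent downloads
- Process large image sets in batches

Timeout errors
- Increase the timeout
- Check your network speed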
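
One of the checks listed for network errors, confirming that the URL is well-formed, is easy to do before calling any tool. The sketch below uses the standard URL constructor available in Node.js. The function name checkUrl is hypothetical, and restricting input to http and https is an assumption about what the scraper accepts; the project's own validators.ts may behave differently.

```typescript
type UrlCheck = { ok: true; url: URL } | { ok: false; reason: string };

// Hypothetical pre-flight check for a scrape target.
function checkUrl(raw: string): UrlCheck {
  let url: URL;
  try {
    url = new URL(raw.trim()); // throws on relative or malformed input
  } catch {
    return { ok: false, reason: 'not a valid absolute URL' };
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    return { ok: false, reason: `unsupported protocol ${url.protocol}` };
  }
  return { ok: true, url };
}
```

A common mistake this catches is omitting the scheme: checkUrl('example.com') fails, while checkUrl('https://example.com') passes.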

Debug mode

Enable debug logging:

import { Logger } from './utils/logger';
Logger.setLevel('DEBUG');

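Performance tuning

Image scraping

Use an appropriate number of concurrent downloads
Limit image file size
Exclude image formats you do not need

Text scraping

Exclude unnecessary HTML elements
Use an efficient text-extraction algorithm
Optimize file writes

Memory management

Release resources promptly once they are no longer needed
Use streaming for large files
Avoid memory leaks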

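License

MIT License

Contributing

Fork the project
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

Support

Report issues: GitHub Issues
Documentation: Wiki
Examples: Examples

Changelog

v1.1.0

Added CLI support; the tool can be run directly through npx
Added a command-line interface with the scrape, list, status, and cleanup commands
Improved the README with detailed usage instructions

v1.0.0

Initial release
Image and text scraping
MCP protocol integration
Error handling and validation
Complete API documentation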

Keywords

mcp, web-scraper, anthropic, model-context-protocol, web-crawler, image-scraping

Package last updated on 24 Nov 2025
