gpt-web-crawler

A web crawler for GPTs to build knowledge bases

  • 0.0.2
  • PyPI

Introduction

GPT-Web-Crawler is a web crawler based on Python and Puppeteer. It crawls web pages and extracts their content, including each page's title, URL, keywords, description, full text, images, and a screenshot, in just a few lines of code. It is well suited for people who are unfamiliar with web crawling and want to extract content from web pages for a knowledge base.

The spider's output is a JSON file, which can easily be converted to a CSV file, imported into a database, or used to build an AI agent.
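For example, a JSON output file of page records can be flattened to CSV with the standard library alone (the field names below are illustrative, not the package's exact output schema):

```python
import csv
import json

# Hypothetical spider output: a JSON list of page records.
pages = json.loads("""[
  {"title": "Home", "url": "https://example.com/", "description": "Landing page"},
  {"title": "Docs", "url": "https://example.com/docs", "description": "Documentation"}
]""")

# Write the records to CSV, one column per field.
with open("pages.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url", "description"])
    writer.writeheader()
    writer.writerows(pages)
```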

Getting Started

Step 1. Install the package.

pip install gpt-web-crawler

Step 2. Copy config_template.py, rename it to config.py, and edit it to set your OpenAI API key and other settings if you want to use ProSpider, which uses AI to help extract content from web pages. If you don't need AI-assisted extraction, you can leave config.py unchanged.
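The exact variable names live in config_template.py; a config.py for ProSpider would plausibly look like the following sketch (check the template for the names the package actually expects):

```python
# config.py -- illustrative sketch only; copy config_template.py and keep
# its actual variable names. The key below is a placeholder, not a real key.
OPENAI_API_KEY = "sk-..."  # needed only when using ProSpider's AI extraction
```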

Step 3. Run the following code to start a spider.

from gpt_web_crawler import run_spider, NoobSpider

run_spider(NoobSpider,
           max_page_count=10,
           start_urls="https://www.jiecang.cn/",
           output_file="test_packages.json",
           extract_rules=r'.*\.html')
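The extract_rules value is a regular expression used to decide which links the spider follows. In isolation, the pattern from the example behaves like this (an assumption: the package's own matching semantics, e.g. anchoring, may differ):

```python
import re

# The pattern from the run_spider example: follow only URLs containing ".html".
pattern = re.compile(r'.*\.html')

print(bool(pattern.match("https://www.jiecang.cn/products.html")))  # matches
print(bool(pattern.match("https://www.jiecang.cn/contact")))        # no match
```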

Spiders

Spider Type    Description
NoobSpider     Basic web page scraping
CatSpider      Web page scraping with screenshots
ProSpider      Web page scraping with AI-extracted content
LionSpider     Web page scraping with all images extracted

Cat Spider

The Cat spider takes screenshots of web pages. It is based on the Noob spider and uses Puppeteer to simulate browser operations, capturing a screenshot of the entire page and saving it as an image. To use the Cat spider, install Puppeteer first:

npm install puppeteer

TODO

  • Support running without configuring config.py
