Huge News!Announcing our $40M Series B led by Abstract Ventures.Learn More
Socket
Sign inDemoInstall
Socket

scrab

Package Overview
Dependencies
Maintainers
1
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

scrab

Fast and easy to use scraper for the content-centered web pages, e.g. blog posts, news, etc.

  • 0.0.6
  • PyPI
  • Socket score

Maintainers
1

scrab - Fuzzy content scraper

Python package PyPI - Python Version GitHub Release GitHub Release License: MIT

Fast and easy to use content scraper for topic-centred web pages, e.g. blog posts, news and wikis.

The tool uses heuristics to extract main content and ignores surrounding noise. No processing rules. No XPath. No configuration.

Installing

pip install scrab

Usage

scrab https://blog.post

Store extracted content to a file:

scrab https://blog.post > content.txt

ToDo List

  • Support <main> tag
  • Add support for lists
  • Add support for scripts
  • Add support for markdown output format
  • Download and save referenced images
  • Extract and embed links

Development

# Lint with flake8
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics

# Check with mypy
mypy ./scrab
mypy ./tests

# Run tests
pytest

Publish to PyPI:

rm -rf dist/*
python setup.py sdist bdist_wheel
twine upload dist/*

License

This project is licensed under the MIT License.

Keywords

FAQs


Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap
  • Changelog

Packages

npm

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc