datahtml

A library for working with HTML and web data

  • 0.6.0

Maintainers
1

datahtml



datahtml is a library for crawling HTML and XML content and extracting data from it.

datahtml lets you:

  • Extract ld+json data from HTML
  • Extract frequently used meta tags from HTML (those used for SEO and social media, among others)
  • Extract article data from an HTML page, usually from newspaper sites
  • Parse RSS feeds from sites
  • Crawl some specific social media sites, such as Google and YouTube
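To make the first two items concrete, here is a minimal sketch of what extracting ld+json blocks and meta tags from a page involves. It uses only the Python standard library and is not datahtml's API or implementation; `MetadataParser`, `extract_metadata`, and the sample page are hypothetical names and data introduced for illustration.

```python
import json
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Hypothetical helper: collects ld+json blocks and <meta> key/content pairs."""

    def __init__(self):
        super().__init__()
        self.ld_json = []   # parsed JSON-LD objects, in document order
        self.meta = {}      # e.g. {"og:title": "...", "description": "..."}
        self._in_ld_json = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and (attrs.get("type") or "").lower() == "application/ld+json":
            # Start buffering the script body until the closing tag.
            self._in_ld_json = True
            self._buffer = []
        elif tag == "meta":
            # Open Graph tags use "property"; classic SEO tags use "name".
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]

    def handle_data(self, data):
        if self._in_ld_json:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld_json:
            self._in_ld_json = False
            try:
                self.ld_json.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def extract_metadata(html):
    """Return (list of JSON-LD objects, dict of meta tags) for an HTML string."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.ld_json, parser.meta


# Made-up sample page for illustration.
sample = """<html><head>
<meta property="og:title" content="Example headline">
<meta name="description" content="A short summary">
<script type="application/ld+json">{"@type": "NewsArticle", "headline": "Example headline"}</script>
</head><body></body></html>"""

ld_json, meta = extract_metadata(sample)
print(ld_json)
print(meta)
```

Real pages often carry several ld+json blocks and duplicated meta keys, so a production extractor would need a policy for merging or ranking them; this sketch simply keeps every JSON-LD block and the last value seen for each meta key.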

Under the hood, datahtml uses libraries such as BeautifulSoup, Newspaper2k, and feedparser, among others, but it takes an opinionated approach to crawling based on our experience doing so.
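As a rough idea of what RSS parsing yields, the sketch below reads an RSS 2.0 document into a title and a list of entries. It deliberately swaps feedparser for the standard library's `xml.etree.ElementTree` so that it stays self-contained; it does not show how datahtml itself parses feeds. `parse_rss` and the sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Hypothetical helper: parse an RSS 2.0 string into a small dict."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    if channel is None:
        return {"title": None, "entries": []}
    entries = []
    for item in channel.findall("item"):
        entries.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),  # None when the tag is absent
        })
    return {"title": channel.findtext("title"), "entries": entries}


# Made-up feed for illustration.
sample_feed = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>First story</title><link>https://example.com/a</link>
<pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate></item>
<item><title>Second story</title><link>https://example.com/b</link></item>
</channel></rss>"""

feed = parse_rss(sample_feed)
for entry in feed["entries"]:
    print(entry["title"], entry["link"])
```

This handles only plain RSS 2.0; Atom feeds and namespaced extensions need extra handling, which is one reason to rely on a dedicated parser such as feedparser in practice.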

Quickstart

pip install datahtml

from datahtml import web, crawler

c = crawler.LocalCrawler()
w = web.download("https://www.infobae.com", crawler=c)
w.links()
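
The quickstart ends by asking the downloaded page for its links. The README does not document what `links()` returns, so the following is only a sketch of what link extraction generally involves: collect `href` values from anchor tags and resolve them against the page URL. It uses the standard library, and `LinkParser`, `extract_links`, and the sample page are hypothetical, not part of datahtml.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Hypothetical helper: collects absolute, de-duplicated links from <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip in-page anchors and non-navigational schemes.
        if href and not href.startswith(("#", "javascript:", "mailto:")):
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return the links found in html, resolved against base_url."""
    parser = LinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up page for illustration.
page = """<body>
<a href="/sports">Sports</a>
<a href="article-1.html">Article</a>
<a href="https://other.example.org/x">External</a>
<a href="#top">Back to top</a>
<a href="/sports">Sports again</a>
</body>"""

links = extract_links(page, "https://www.example.com/news/")
print(links)
```

Resolving relative URLs against the page address matters for crawling, because the next request needs an absolute URL regardless of how the link was written in the markup.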

License

datahtml is distributed under the terms of the MPL-2.0 license.
