harvest-webforum

A toolkit for extracting posts and post metadata from web forums

  • 1.1.0
  • PyPI

Harvest - A toolkit for extracting posts and post metadata from web forums

Automatically extracting forum posts and their metadata is challenging because forums do not expose their content in a standardized structure. Harvest performs this task reliably for many web forums, offering an easy way to extract their data.

Installation

At the command line:

$ pip install harvest-webforum

If you want to install from the latest sources, you can do:

$ git clone https://github.com/fhgr/harvest.git
$ cd harvest
$ python3 setup.py install
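
To confirm that the installation worked, a small check along the following lines may help. This is a minimal sketch: `installed_version` is a hypothetical helper and not part of harvest, and it assumes the distribution is registered under the name used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="harvest-webforum"):
    """Return the installed version of a distribution, or None if it is missing.

    Hypothetical helper for illustration; the default name is taken from the
    pip command in this README.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


print(installed_version())
```

A result of `None` suggests that the package is not visible to the Python interpreter running the check, for example because it was installed into a different environment.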

Python library

Embedding harvest into your code is easy, as outlined below:

from urllib.request import urlopen, Request
from harvest import extract_data

# Send a browser-like User-Agent header with the request.
USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0"

# Download the forum thread as HTML.
url = "https://forum.videolan.org/viewtopic.php?f=14&t=145604"
req = Request(url, headers={'User-Agent': USER_AGENT})
html = urlopen(req).read().decode('utf-8')

# Extract the posts and post metadata from the downloaded page.
result = extract_data(html, url)
print(result)
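
The example above decodes every page as UTF-8, which may not hold for all forums. The following sketch shows one way to make the download step more tolerant by using the charset declared in the response headers. `fetch_html` and its parameters (`timeout`, `opener`) are hypothetical helpers introduced here for illustration and are not part of harvest.

```python
from urllib.request import Request, urlopen

USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0"


def fetch_html(url, user_agent=USER_AGENT, timeout=30, opener=urlopen):
    """Download a page and return its content as text.

    Hypothetical helper, not part of harvest. `opener` defaults to
    urllib's urlopen and can be swapped out, e.g. for testing.
    """
    req = Request(url, headers={"User-Agent": user_agent})
    with opener(req, timeout=timeout) as response:
        # Prefer the charset declared in the Content-Type header;
        # fall back to UTF-8 when none is given.
        charset = response.headers.get_content_charset() or "utf-8"
        # Replace undecodable bytes instead of failing on a single bad character.
        return response.read().decode(charset, errors="replace")
```

The returned string can then be passed to `extract_data(html, url)` exactly as in the example above.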

WEB-FORUM-52 gold standard

The corpus currently contains gold standard documents from 52 different web forums. These documents are also used by harvest's integration tests.

Publication

  • Weichselbraun, Albert, Brasoveanu, Adrian M. P., Waldvogel, Roger and Odoni, Fabian. (2020). “Harvest - An Open Source Toolkit for Extracting Posts and Post Metadata from Web Forums”. IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2020), Melbourne, Australia, Accepted 27 October 2020.
