Thresher
Thresher is a library for modern web scraping in Node.js. Two things set it apart:
- it is headless: URLs are rendered in a GUI-less browser, so the HTML you scrape is the same version visitors would see on their screen
- it is declarative: scrapers are defined in separate JSON files. This means no programming is required! It also means any other software supporting the same format can use the same scraper definitions.
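To give a feel for what a declarative scraper definition can look like, here is a minimal sketch. The definition schema is not described in this README, so every field name below (url, elements, selector, attribute) and both helper functions are assumptions for illustration only. The code does not call Thresher itself.

```javascript
// A hypothetical scraper definition, shown as the JSON text that would
// live in its own file. Field names are illustrative assumptions, not a
// documented Thresher schema.
const definitionJson = `{
  "url": "example.org/articles/",
  "elements": {
    "title": { "selector": "//h1", "attribute": "text" },
    "pdf": { "selector": "//a[@class='pdf']", "attribute": "href" }
  }
}`;

const definition = JSON.parse(definitionJson);

// Hypothetical helper: true when the definition's URL pattern
// (treated as a regular expression) matches the page URL.
function appliesTo(def, pageUrl) {
  return new RegExp(def.url).test(pageUrl);
}

// Hypothetical helper: names of the fields this definition would extract.
function fieldNames(def) {
  return Object.keys(def.elements);
}

console.log(appliesTo(definition, 'http://example.org/articles/42'));
console.log(fieldNames(definition));
```

Because the definition is plain data, any tool that understands the same format could load it without sharing code with the scraper that wrote it.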
Thresher was developed as part of the ContentMine stack for mining the academic literature.
Installation
thresher is very easy to install. Simply:
npm install --save thresher
Contributing
We are not yet accepting contributions. If you'd like to help, please drop me an email (richard@contentmine.org) and I'll let you know when we're ready.
Release History
- 0.1.0 - fork thresher from quickscrape
License
Copyright (c) 2014 Richard Smith-Unna
Licensed under the MIT license.