Horseman Article Parser
A web page article parser that returns an object containing the article's formatted text and other attributes, including sentiment, keyphrases, people, places, organisations and spelling suggestions.
Prerequisites
Node.js, npm and Chrome / Chromium
Install
npm install horseman-article-parser --save
Usage Example
var parser = require('horseman-article-parser');
var options = {
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
  url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth"
};
parser.parseArticle(options)
  .then(function (article) {
    var response = {
      title: article.title.text,
      excerpt: article.excerpt,
      metadescription: article.meta.description.text,
      url: article.url,
      sentiment: { score: article.sentiment.score, comparative: article.sentiment.comparative },
      keyphrases: article.processed.keyphrases,
      people: article.people,
      orgs: article.orgs,
      places: article.places,
      text: {
        raw: article.processed.text.raw,
        formatted: article.processed.text.formatted,
        html: article.processed.text.html
      },
      spelling: article.spelling,
      lighthouse: article.lighthouse
    };
    console.log(response);
  })
  .catch(function (error) {
    console.log(error.message);
    console.log(error.stack);
  });
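The same call can also be written with async/await. The following is only a minimal sketch, assuming nothing beyond the promise returned by parseArticle(options) and the article fields used above. summariseArticle is a hypothetical helper name, and the parser is passed in as an argument so the function can be tried with a stub before Chrome / Chromium is set up.

```javascript
// Hypothetical helper: calls parseArticle() and picks a few fields.
// `parser` is any object exposing parseArticle(options) that returns a promise.
async function summariseArticle(parser, url) {
  const article = await parser.parseArticle({ url: url });
  return {
    title: article.title.text,
    excerpt: article.excerpt,
    sentiment: article.sentiment.score,
    keyphrases: article.processed.keyphrases
  };
}

// Usage with the real module (needs Chrome / Chromium and network access):
// const parser = require('horseman-article-parser');
// summariseArticle(parser, 'https://example.com/some-article')
//   .then(console.log)
//   .catch(function (error) { console.error(error.message); });
```

Any rejection from parseArticle propagates out of summariseArticle unchanged, so the caller still handles errors in one place.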
parseArticle(options, <socket>)
Accepts an optional socket for piping the response object, status messages and errors to a front-end UI.
See horseman-article-parser-ui as an example.
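To see what the parser sends without building a UI, a stand-in socket can record the messages. This is a rough sketch that assumes the parser reports through socket.emit(event, payload), as a Socket.IO socket would; neither that call shape nor the event names are documented here, and createRecordingSocket is a hypothetical helper.

```javascript
// Stand-in for a socket: records every emitted event in memory.
// Assumption: the parser calls socket.emit(event, payload) to report progress.
function createRecordingSocket() {
  const events = [];
  return {
    events: events,
    emit: function (event, payload) {
      events.push({ event: event, payload: payload });
      return true;
    }
  };
}

// Usage with the real module:
// const socket = createRecordingSocket();
// parser.parseArticle(options, socket).then(function () {
//   console.log(socket.events);
// });
```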
Options
The options below are set by default:
var options = {
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
  // node-horseman options (https://ghub.io/node-horseman)
  horseman: {
    timeout: 10000,
    cookies: './cookies.json'
  },
  // clean-html options (https://ghub.io/clean-html)
  cleanhtml: {
    'add-remove-tags': ['blockquote', 'span'],
    'remove-empty-tags': ['span'],
    'replace-nbsp': true
  },
  // html-to-text options (https://ghub.io/html-to-text)
  htmltotext: {
    wordwrap: 100,
    noLinkBrackets: true,
    ignoreHref: true,
    tables: true,
    uppercaseHeadings: true
  },
  // retext-keywords options (https://ghub.io/retext-keywords)
  retextkeywords: { maximum: 10 },
  // lighthouse options (https://github.com/GoogleChrome/lighthouse)
  lighthouse: { chromeFlags: ['--headless'] }
}
At a minimum you should pass a URL:
var options = {
  url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth"
}
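Since the URL is the one required value, it can be worth checking it before launching a parse. The sketch below is one possible approach and not part of the parser: buildOptions is a hypothetical helper that validates the URL with Node's built-in URL class and layers any overrides on top. Whether the parser merges a partial nested object with its own defaults is not stated here, so the example passes the horseman object in full.

```javascript
// Hypothetical helper: validates the URL and combines it with overrides.
function buildOptions(url, overrides) {
  // new URL() throws a TypeError when the input is not a valid absolute URL.
  const parsed = new URL(url);
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error('Expected an http(s) URL, got ' + parsed.protocol);
  }
  // Shallow copy: top-level keys in overrides are kept, url is always set last.
  return Object.assign({}, overrides, { url: parsed.href });
}

// Raise the timeout while keeping the default cookies file.
const exampleOptions = buildOptions('https://example.com/news/story', {
  horseman: { timeout: 20000, cookies: './cookies.json' }
});
console.log(exampleOptions);
```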
There are also some additional "complex" options available:
var options = {
  // array of html elements to strip before analysis
  striptags: [],
  // readability options (https://ghub.io/node-readability)
  readability: {},
  // retext spell options (https://ghub.io/retext-spell)
  retextspell: {}
}
Development
Please feel free to fork the repo or open pull requests to the development branch. I've used ESLint for linting.
Install the dependencies with:
npm install
Lint the index.js file with:
npm run lint -- --fix
License
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.