Horseman Article Parser
A web page article parser which returns an object containing the article's formatted text and other attributes including sentiment, keyphrases, people, places, organisations, spelling suggestions, in-article links, meta data & lighthouse audit results.
Prerequisites
Node.js & NPM
Install
npm install horseman-article-parser --save
Usage Example
var parser = require('horseman-article-parser');
var options = {
url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
lighthouse: {
enabled: true
}
}
parser.parseArticle(options)
.then(function (article) {
var response = {
title: article.title.text,
excerpt: article.excerpt,
metadescription: article.meta.description.text,
url: article.url,
sentiment: { score: article.sentiment.score, comparative: article.sentiment.comparative },
keyphrases: article.processed.keyphrases,
keywords: article.processed.keywords,
people: article.people,
orgs: article.orgs,
places: article.places,
text: {
raw: article.processed.text.raw,
formatted: article.processed.text.formatted,
html: article.processed.text.html
},
spelling: article.spelling,
meta: article.meta,
links: article.links,
lighthouse: article.lighthouse
}
console.log(response);
})
.catch(function (error) {
console.log(error.message)
console.log(error.stack);
})
parseArticle(options, <socket>)
accepts an optional socket for pipeing the response object, status messages and errors to a front end UI.
See horseman-article-parser-ui as an example.
Options
The options below are set by default
var options = {
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
// puppeteer options (https://github.com/GoogleChrome/puppeteer)
puppeteer: {
headless: true,
defaultViewport: null,
},
// clean-html options (https://ghub.io/clean-html)
cleanhtml: {
'add-remove-tags': ['blockquote', 'span'],
'remove-empty-tags': ['span'],
'replace-nbsp': true
},
// html-to-text options (https://ghub.io/html-to-text)
htmltotext: {
wordwrap: 100,
noLinkBrackets: true,
ignoreHref: true,
tables: true,
uppercaseHeadings: true
},
// retext-keywords options (https://ghub.io/retext-keywords)
retextkeywords: { maximum: 10 },
// lighthouse options (https://github.com/GoogleChrome/lighthouse)
lighthouse: {
enabled: false
}
}
For more Puppeteer launch options see https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions
At a minimum you should pass a url
var options = {
url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth"
}
If you want to enable lighthouse analysis pass the following
var options = {
url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
lighthouse: {
enabled: true
}
}
there are some additional "complex" options available
var options = {
// array of html elements to stip before analysis
striptags: [],
// readability options (https://ghub.io/node-readability)
readability: {},
// retext spell options (https://ghub.io/retext-spell)
retextspell: {}
}
Development
Please feel free to fork the repo or open pull requests to the development branch. I've used eslint for linting.
Build the dependencies with:
npm install
Lint the project files with:
npm run lint
Test the package with:
npm run test
Dependencies
Dev Dependencies
License
This project is licensed under the GNU GENERAL PUBLIC LICENSE Version 3 - see the LICENSE file for details
Notes
Due to node-readability being stale I have imported the relevent functions into this project and refactored it so it doesn't use request and therfor has no vulnrabilities.