Horseman Article Parser
A web page article parser which returns an object containing the article's formatted text and other attributes, including sentiment, keyphrases, people, places, organisations, spelling suggestions, in-article links, metadata and Lighthouse audit results.
Prerequisites
Node.js & NPM
Install
npm install horseman-article-parser --save
Usage Example
var parser = require('horseman-article-parser');

var options = {
  url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
  enabled: ['lighthouse', 'screenshot', 'links', 'sentiment', 'entities', 'spelling', 'keywords']
}

parser.parseArticle(options)
  .then(function (article) {
    var response = {
      title: article.title.text,
      excerpt: article.excerpt,
      metadescription: article.meta.description.text,
      url: article.url,
      sentiment: { score: article.sentiment.score, comparative: article.sentiment.comparative },
      keyphrases: article.processed.keyphrases,
      keywords: article.processed.keywords,
      people: article.people,
      orgs: article.orgs,
      places: article.places,
      text: {
        raw: article.processed.text.raw,
        formatted: article.processed.text.formatted,
        html: article.processed.text.html
      },
      spelling: article.spelling,
      meta: article.meta,
      links: article.links,
      lighthouse: article.lighthouse
    }
    console.log(response);
  })
  .catch(function (error) {
    console.log(error.message)
    console.log(error.stack);
  })
parseArticle(options, <socket>)
accepts an optional socket for piping the response object, status messages and errors to a front-end UI.
See horseman-article-parser-ui as an example.
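For example, the socket from a Socket.IO connection handler can be passed straight through as the second argument so the parser can push progress and results to the browser. This is a minimal sketch assuming a socket.io server; the 'parse', 'parsed' and 'parseError' event names are illustrative only and are not part of the library's API.

var parser = require('horseman-article-parser');
var http = require('http').createServer();
var io = require('socket.io')(http);

http.listen(3000);

io.on('connection', function (socket) {
  // 'parse' is a hypothetical event emitted by the front end with a URL to analyse
  socket.on('parse', function (url) {
    var options = { url: url, enabled: ['links', 'sentiment', 'keywords'] };

    // Passing the socket lets the parser emit its own status messages to the client
    parser.parseArticle(options, socket)
      .then(function (article) {
        socket.emit('parsed', { title: article.title.text }); // illustrative event
      })
      .catch(function (error) {
        socket.emit('parseError', error.message); // illustrative event
      });
  });
});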
Options
The options below are set by default
var options = {
  // puppeteer options (https://github.com/GoogleChrome/puppeteer)
  puppeteer: {
    // puppeteer launch options (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions)
    launch: {
      headless: true,
      defaultViewport: null
    },
    // puppeteer goto options (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagegotourl-options)
    goto: {
      waitUntil: 'domcontentloaded'
    }
  },
  // clean-html options (https://ghub.io/clean-html)
  cleanhtml: {
    'add-remove-tags': ['blockquote', 'span'],
    'remove-empty-tags': ['span'],
    'replace-nbsp': true
  },
  // html-to-text options (https://ghub.io/html-to-text)
  htmltotext: {
    wordwrap: 100,
    noLinkBrackets: true,
    ignoreHref: true,
    tables: true,
    uppercaseHeadings: true
  },
  // retext-keywords options (https://ghub.io/retext-keywords)
  retextkeywords: { maximum: 10 }
}
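To change any of these, pass the same keys in your own options object. The sketch below assumes that the options you provide are merged over the defaults above, so anything you leave out keeps its default value; here Puppeteer is launched with a visible browser window and the keyword limit is lowered.

var parser = require('horseman-article-parser');

var options = {
  url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
  enabled: ['keywords'],
  // run the browser with a visible window instead of the default headless mode
  puppeteer: {
    launch: { headless: false, defaultViewport: null }
  },
  // return at most 5 keywords/keyphrases instead of the default 10
  retextkeywords: { maximum: 5 }
}

parser.parseArticle(options)
  .then(function (article) {
    console.log(article.processed.keywords);
  })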
At a minimum you should pass a URL:
var options = {
  url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth"
}
If you want to enable the advanced features, you should pass the following:
var options = {
  url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
  enabled: ['lighthouse', 'screenshot', 'links', 'sentiment', 'entities', 'spelling', 'keywords']
}
If you want to pass cookies to Puppeteer, use the following:
var options = {
  puppeteer: {
    cookies: [
      { name: 'cookie1', value: 'val1', domain: '.domain1' },
      { name: 'cookie2', value: 'val2', domain: '.domain2' }
    ]
  }
}
To strip tags before processing, use the following:
var options = {
  striptags: ['.something', '#somethingelse']
}
If you need to dismiss any popups, e.g. a privacy popup, use the following:
var options = {
  clickelements: ['#button1', '#button2']
}
There are some additional "complex" options available:
var options = {
  // array of html elements to strip before analysis
  striptags: [],
  // readability options (https://ghub.io/node-readability)
  readability: {},
  // retext spell options (https://ghub.io/retext-spell)
  retextspell: {}
}
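As a sketch of how these might be used together: the striptags selectors below are hypothetical and anything matching them would be removed before analysis, and the retextspell object is assumed to be handed through to retext-spell (https://ghub.io/retext-spell), so its ignore option can stop the spell checker flagging proper nouns.

var parser = require('horseman-article-parser');

var options = {
  url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
  enabled: ['spelling'],
  // hypothetical selectors - elements matching these are stripped before analysis
  striptags: ['.ad-slot', '#comments'],
  // assumed to be passed through to retext-spell; ignore suppresses false positives
  retextspell: {
    ignore: ['Brexit', 'Theresa']
  }
}

parser.parseArticle(options)
  .then(function (article) {
    console.log(article.spelling);
  })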
Development
Please feel free to fork the repo or open pull requests to the development branch. I've used ESLint for linting.
Install the dependencies with:
npm install
Lint the project files with:
npm run lint
Test the package with:
npm run test
Dependencies
Dev Dependencies
License
This project is licensed under the GNU GENERAL PUBLIC LICENSE Version 3 - see the LICENSE file for details
Notes
Due to node-readability being stale, I have imported the relevant functions into this project and refactored them so they do not use request and therefore have no vulnerabilities.