Postlight's Parser extracts the bits that humans care about from any URL you give it. That includes article content, titles, authors, published dates, excerpts, lead images, and more.
Postlight Parser powers Postlight Reader, a browser extension that removes ads and distractions, leaving only text and images for a beautiful reading view on any site.
Postlight Parser lets you create custom parsers using simple JavaScript and CSS selectors, so you can proactively manage parsing and migration edge cases. Many examples are available, along with documentation.
How? Like this.
Installation
yarn add @postlight/parser
# or
npm install @postlight/parser
Usage
import Parser from '@postlight/parser';
Parser.parse(url).then(result => console.log(result));
The result looks like this:
{
  "title": "Thunder (mascot)",
  "content": "... <p><b>Thunder</b> is the <a href=\"https://en.wikipedia.org/wiki/Stage_name\">stage name</a> for the...",
  "author": "Wikipedia Contributors",
  "date_published": "2016-09-16T20:56:00.000Z",
  "lead_image_url": null,
  "dek": null,
  "next_page_url": null,
  "url": "https://en.wikipedia.org/wiki/Thunder_(mascot)",
  "domain": "en.wikipedia.org",
  "excerpt": "Thunder Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos",
  "word_count": 4677,
  "direction": "ltr",
  "total_pages": 1,
  "rendered_pages": 1
}
If Parser is unable to find a field, that field will return null.
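Because missing fields come back as null, downstream code usually needs a fallback. The following is a minimal sketch of one way to handle that; `sampleResult` is a trimmed copy of the example result above, and `missingFields` and `fieldOr` are hypothetical helper names, not part of Postlight Parser.

```javascript
// A result shaped like the example above, trimmed to a few fields.
const sampleResult = {
  title: 'Thunder (mascot)',
  author: 'Wikipedia Contributors',
  lead_image_url: null,
  dek: null,
  word_count: 4677,
};

// Hypothetical helper: list the fields whose value is null.
function missingFields(result) {
  return Object.keys(result).filter(key => result[key] === null);
}

// Hypothetical helper: read a field, falling back when it is missing.
function fieldOr(result, name, fallback) {
  // `??` only falls back on null or undefined, so 0 and '' are kept as-is.
  return result[name] ?? fallback;
}

console.log(missingFields(sampleResult)); // [ 'lead_image_url', 'dek' ]
console.log(fieldOr(sampleResult, 'dek', sampleResult.title)); // Thunder (mascot)
```

Checking for null explicitly, rather than for any falsy value, keeps legitimate values such as a word count of 0 from being replaced by the fallback.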
parse() Options
Content Formats
By default, Postlight Parser returns the content field as HTML. However, you can override this behavior by passing in options to the parse function, specifying whether or not to scrape all pages of an article, and what type of output to return (valid values are 'html', 'markdown', and 'text'). For example:
Parser.parse(url, { contentType: 'markdown' }).then(result =>
  console.log(result)
);
This returns the page's content as GitHub-flavored Markdown:
"content": "...**Thunder** is the [stage name](https://en.wikipedia.org/wiki/Stage_name) for the..."
You can include custom headers in requests by passing name-value pairs to the parse function as follows:
Parser.parse(url, {
  headers: {
    Cookie: 'name=value; name2=value2; name3=value3',
    'User-Agent':
      'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1',
  },
}).then(result => console.log(result));
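When the options object is assembled from user input or configuration, it can help to build it in one place. The sketch below assumes that contentType and headers can be passed together in the same options object; `CONTENT_TYPES`, `cookieHeader`, and `buildParseOptions` are hypothetical names introduced for illustration, not library APIs.

```javascript
// Content types listed in this README.
const CONTENT_TYPES = ['html', 'markdown', 'text'];

// Hypothetical helper: turn { name: 'value', ... } into a Cookie header string.
// Values are joined as-is; nothing is URL-encoded here.
function cookieHeader(cookies) {
  return Object.entries(cookies)
    .map(([name, value]) => `${name}=${value}`)
    .join('; ');
}

// Hypothetical helper: build the second argument for parse(url, options).
function buildParseOptions({ contentType = 'html', cookies = {}, userAgent } = {}) {
  // Fail early on a typo such as 'markdwon' instead of passing it along.
  if (!CONTENT_TYPES.includes(contentType)) {
    throw new TypeError(`Unsupported contentType: ${contentType}`);
  }
  const headers = {};
  if (Object.keys(cookies).length > 0) headers.Cookie = cookieHeader(cookies);
  if (userAgent) headers['User-Agent'] = userAgent;
  return { contentType, headers };
}

const options = buildParseOptions({
  contentType: 'markdown',
  cookies: { name: 'value', name2: 'value2' },
});
console.log(options.headers.Cookie); // name=value; name2=value2
```

The resulting object could then be passed as the second argument, for example `Parser.parse(url, options)`.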
Pre-fetched HTML
You can use Postlight Parser to parse custom or pre-fetched HTML by passing an HTML string to the parse function as follows:
Parser.parse(url, {
  html:
    '<html><body><article><h1>Thunder (mascot)</h1><p>Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos</p></article></body></html>',
}).then(result => console.log(result));
Note that the URL argument is still supplied so that Parser can identify the website and use its custom parser, if it has one; it is not used to fetch content.
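One reason to pre-fetch is to control the request yourself (retries, caching, authentication) and only hand the markup to the parser. The following is a rough sketch of that flow, not the library's own fetching logic. `parsePrefetched` is a hypothetical helper; `parser` is assumed to be any object exposing parse(url, options), such as the default export of @postlight/parser, and `fetchFn` is assumed to behave like the standard fetch API (global in Node 18+ and in browsers).

```javascript
// Hypothetical helper: fetch the page yourself, then pass the HTML to the parser.
async function parsePrefetched(parser, url, fetchFn = globalThis.fetch) {
  const response = await fetchFn(url);
  if (!response.ok) {
    throw new Error(`Request for ${url} failed with status ${response.status}`);
  }
  const html = await response.text();
  // The URL is still passed so a site-specific custom parser can be selected.
  return parser.parse(url, { html });
}

// Example call, assuming `Parser` has been imported from '@postlight/parser':
// parsePrefetched(Parser, 'https://en.wikipedia.org/wiki/Thunder_(mascot)')
//   .then(result => console.log(result.title));
```

Taking the parser and the fetch function as arguments keeps the helper easy to exercise with stand-ins in tests.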
The command-line parser
Postlight Parser also ships with a CLI, meaning you can use it from your command line like so:
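# Install Postlight Parser globally
yarn global add @postlight/parser
# or
npm -g install @postlight/parser
# Then parse a URL
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source
# Pass the optional --format argument to set the content type (html|markdown|text)
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --format=markdown
# Pass optional --header.name=value arguments to include custom headers in the request
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --header.Cookie="name=value; name2=value2; name3=value3" --header.User-Agent="Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
# Pass the optional --extend argument to add a custom type to the response
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend credit="p:last-child em"
# Pass the optional --extend-list argument to add a custom type with multiple matches
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list categories=".meta__tags-list a"
# Get the value of attributes by adding a pipe to --extend or --extend-list
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list links=".body a|href"
# Pass the optional --add-extractor argument to add a custom extractor at runtime
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --add-extractor ./src/extractors/fixtures/postlight.com/index.js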
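If you drive the CLI from a script, the flags above can be assembled programmatically. This is a small sketch under that assumption; `buildCliArgs` is a hypothetical helper, and it only emits flags that appear in the examples above.

```javascript
// Hypothetical helper: build an argument list for the postlight-parser CLI.
function buildCliArgs(url, { format, headers = {}, extend = {}, extendList = {} } = {}) {
  const args = [url];
  if (format) args.push(`--format=${format}`);
  // One --header.name=value argument per custom header.
  for (const [name, value] of Object.entries(headers)) {
    args.push(`--header.${name}=${value}`);
  }
  // --extend and --extend-list each take a name=selector pair.
  for (const [name, selector] of Object.entries(extend)) {
    args.push('--extend', `${name}=${selector}`);
  }
  for (const [name, selector] of Object.entries(extendList)) {
    args.push('--extend-list', `${name}=${selector}`);
  }
  return args;
}

const args = buildCliArgs(
  'https://postlight.com/trackchanges/mercury-goes-open-source',
  { format: 'markdown', extendList: { links: '.body a|href' } }
);
console.log(args);
```

Passing such an array to Node's `child_process.execFile('postlight-parser', args, callback)` avoids shell quoting problems with selectors that contain spaces or pipes, since no shell is involved.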
License
Licensed under either of the below, at your preference:
Apache License, Version 2.0
MIT license
Contributing
For details on how to contribute to Postlight Parser, including how to write a custom content extractor for any site, see CONTRIBUTING.md.
Unless it is explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual licensed as above without any additional terms or conditions.
🔬 A Labs project from your friends at Postlight. Happy coding!