#node-htmlcarve
Extract essential meta information from any web page, fast and dead simple. Do you need general information about a given HTML page, like the title, a summary, a favicon or a possible RSS feed? Just throw a URL into this module, and it'll try to find that stuff for you.
##Installation
Clone this repository, grab the single CoffeeScript/JavaScript file, or simply use npm:

```
npm install htmlcarve
```
##Usage
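Use it from the command line:

```
htmlcarve http://venturebeat.com/
```

Or from inside a script (I'll use CoffeeScript):

```coffeescript
htmlcarve = require "htmlcarve"

htmlcarve.fromUrl "http://venturebeat.com/", (error, data) ->
  # Bail out early if the page could not be fetched or parsed.
  return console.error error if error?
  # data.result holds the merged metadata (see the samples below).
  console.log data.result
```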
##Samples
```
{ source:
   { html_meta:
      { title: 'Ouch: HP is now promoting PCs running Windows 7 (because Windows 8 isn\'t doing so hot) | VentureBeat | Business | by Ricardo Bilton',
        summary: undefined,
        image: 'http://venturebeat.files.wordpress.com/2014/01/patrick-collison-headshot.jpg?w=311&h=150&crop=1',
        language: 'en',
        feed: 'http://feeds.venturebeat.com/VentureBeat',
        favicon: 'http://0.gravatar.com/blavatar/6a5449d7551fc1e8f149b0920ca4b6f6?s=16',
        keywords: undefined,
        author: undefined },
     open_graph:
      { title: 'Ouch: HP is now promoting PCs running Windows 7 (because Windows 8 isn\'t doing so hot)',
        summary: 'HP\'s new Windows 7 promotion should tell you all you need to know about the state of its Windows 8 hardware. With its latest promotion, HP is heavily pushing PCs running Windows 7, which it says it...',
        image: 'http://venturebeat.files.wordpress.com/2014/01/hp-windows.png',
        language: undefined },
     twitter_card:
      { title: undefined,
        summary: undefined,
        image: undefined,
        author: '@chernandburn' } },
  result:
   { title: 'Ouch: HP is now promoting PCs running Windows 7 (because Windows 8 isn\'t doing so hot)',
     summary: 'HP\'s new Windows 7 promotion should tell you all you need to know about the state of its Windows 8 hardware. With its latest promotion, HP is heavily pushing PCs running Windows 7, which it says it...',
     image: 'http://venturebeat.files.wordpress.com/2014/01/hp-windows.png',
     author: '@chernandburn',
     language: 'en',
     feed: 'http://feeds.venturebeat.com/VentureBeat',
     favicon: 'http://0.gravatar.com/blavatar/6a5449d7551fc1e8f149b0920ca4b6f6?s=16',
     keywords: undefined },
  links:
   { deep: 'http://venturebeat.com/2014/01/20/ouch-hp-is-now-promoting-pcs-running-windows-7-because-windows-8-isnt-doing-so-hot/',
     shallow: 'http://venturebeat.com/2014/01/20/ouch-hp-is-now-promoting-pcs-running-windows-7-because-windows-8-isnt-doing-so-hot/',
     base: 'http://venturebeat.com' } }
```
```
$ htmlcarve http://www.spin.com/articles/miserable-halloween-dream-stream/
{ source:
   { html_meta:
      { title: 'Stream Miserable\'s Cratering \'Halloween Dream\' | SPIN | SPIN Mix | Premieres',
        summary: 'Ex-Whirr singer preps solo EP for February 18 release',
        image: 'http://www.spin.com/sites/all/themes/zen_spin/assets/images/default-images/spin-logo.png',
        language: 'en',
        feed: 'http://www.spin.com/rss.xml',
        favicon: 'http://www.spin.com/favicon.ico',
        keywords: 'miserable',
        author: undefined },
     open_graph:
      { title: 'Stream Miserable\'s Cratering \'Halloween Dream\'',
        summary: 'Ex-Whirr singer preps solo EP for February 18 release',
        image: 'http://www.spin.com/sites/all/files/140122-miserable.jpg',
        language: undefined },
     twitter_card:
      { title: 'Stream Miserable\'s Cratering \'Halloween Dream\'',
        summary: 'Ex-Whirr singer preps solo EP for February 18 release',
        image: 'http://www.spin.com/sites/all/files/140122-miserable.jpg',
        author: undefined } },
  result:
   { title: 'Stream Miserable\'s Cratering \'Halloween Dream\'',
     summary: 'Ex-Whirr singer preps solo EP for February 18 release',
     image: 'http://www.spin.com/sites/all/files/140122-miserable.jpg',
     author: undefined,
     language: 'en',
     feed: 'http://www.spin.com/rss.xml',
     favicon: 'http://www.spin.com/favicon.ico',
     keywords: 'miserable' },
  links:
   { deep: 'http://www.spin.com/articles/miserable-halloween-dream-stream/',
     shallow: 'http://www.spin.com/articles/miserable-halloween-dream-stream/',
     base: 'http://www.spin.com' } }
```
##How does this stuff work?
Htmlcarve runs through several steps to gather all that information.
- Scan for Open Graph protocol (OGP) metadata, and use it. Usually this information (if present) is the most valuable and desirable.
- Look for Twitter Card metadata. Append whatever is found.
- Go through the general HTML meta tags and extract information from there.
- Merge the results. If a piece of information is present in more than one of the steps above, use the value from the higher-priority source. Priority order: OGP > TwitterCard > HtmlMetaTags.
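
The merge step can be sketched roughly like this in plain JavaScript. This is only an illustration of the priority rule, not the module's actual code: `mergeSources` and `PRIORITY` are made-up names, and the input is assumed to look like the `source` object from the samples (`open_graph`, `twitter_card`, `html_meta`).

```javascript
// Hypothetical helper, not part of htmlcarve's API: merge the per-source
// metadata objects, taking each field from the highest-priority source
// that actually defines it.
const PRIORITY = ["open_graph", "twitter_card", "html_meta"];

function mergeSources(source, priority = PRIORITY) {
  const result = {};
  for (const name of priority) {
    const fields = source[name] || {};
    for (const [key, value] of Object.entries(fields)) {
      // Keep the first defined value; lower-priority sources only fill gaps.
      if (result[key] === undefined && value !== undefined) {
        result[key] = value;
      }
    }
  }
  return result;
}

// Assumed input, shaped like the `source` object in the samples.
const merged = mergeSources({
  html_meta: { title: "Page | Site", language: "en", author: undefined },
  open_graph: { title: "Page", language: undefined },
  twitter_card: { title: undefined, author: "@someone" },
});
console.log(merged);
```

Here the title comes from OGP, the author from the Twitter Card, and the language from the HTML meta tags, which matches the pattern in the first sample. Unlike the sample output, this sketch simply leaves out fields that no source defines instead of listing them as `undefined`.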
##ToDo/Roadmap:
- summarize the HTML content of the given page if no other information is found.
- extract keywords if none are present
- include the full protocols, not only this quick'n'dirty hack.
- include schema.org
- include microdata formats
##License
MIT.