soupselect

Adds CSS selector support to htmlparser for scraping activities - port of soupselect (python)

0.2.0

Version published: 14 years ago

Weekly downloads: 201; increased by 3.61%

Maintainers: 0

Created: 14 years ago

node-soupselect

A port of Simon Willison's soupselect for use with node.js and node-htmlparser.

$ npm install soupselect

Minimal example...

var select = require('soupselect').select;
// dom provided by htmlparser...
select(dom, "#main a.article").forEach(function(element) {
    //...
});
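To make the selector in the minimal example concrete, here is a small self-contained sketch of what matching "#main a.article" against an htmlparser-style tree involves. It is not soupselect's implementation: `sketchSelect`, `parseCompound`, `matches` and `findAll` are hypothetical names, the node shape (`type`, `name`, `attribs`, `children`, `raw`) is an assumption based on the fields used elsewhere in this README, and only tag, #id, .class and the descendant combinator are handled.

```javascript
// Hand-built stand-in for a DOM produced by htmlparser (node shape assumed).
var dom = [
    { type: 'tag', name: 'div', attribs: { id: 'main' }, children: [
        { type: 'tag', name: 'a', attribs: { 'class': 'article featured', href: '/one' },
          children: [{ type: 'text', raw: 'One', data: 'One' }] },
        { type: 'tag', name: 'a', attribs: { href: '/two' },
          children: [{ type: 'text', raw: 'Two', data: 'Two' }] }
    ] },
    { type: 'tag', name: 'a', attribs: { 'class': 'article', href: '/three' }, children: [] }
];

// Split one compound selector such as "a.article" or "#main" into its parts.
function parseCompound(token) {
    var parts = { tag: null, id: null, classes: [] };
    var re = /([#.]?)([\w-]+)/g, m;
    while ((m = re.exec(token)) !== null) {
        if (m[1] === '#') { parts.id = m[2]; }
        else if (m[1] === '.') { parts.classes.push(m[2]); }
        else { parts.tag = m[2]; }
    }
    return parts;
}

// Check whether a single node satisfies the compound selector.
function matches(node, parts) {
    if (node.type !== 'tag') { return false; }
    var attribs = node.attribs || {};
    var classList = (attribs['class'] || '').split(/\s+/);
    if (parts.tag && node.name !== parts.tag) { return false; }
    if (parts.id && attribs.id !== parts.id) { return false; }
    return parts.classes.every(function (c) { return classList.indexOf(c) !== -1; });
}

// Collect every matching node among `nodes` and their descendants.
function findAll(nodes, parts, out) {
    nodes.forEach(function (node) {
        if (matches(node, parts)) { out.push(node); }
        findAll(node.children || [], parts, out);
    });
    return out;
}

// Tiny stand-in for select(): each whitespace-separated token narrows the
// search to descendants of the nodes matched by the previous token.
function sketchSelect(dom, selector) {
    var current = null;
    selector.trim().split(/\s+/).forEach(function (token) {
        var parts = parseCompound(token);
        var found = [];
        if (current === null) {
            findAll(dom, parts, found);
        } else {
            current.forEach(function (node) {
                findAll(node.children || [], parts, found);
            });
        }
        // Nested matches can yield the same node twice, so drop duplicates.
        current = found.filter(function (n, i) { return found.indexOf(n) === i; });
    });
    return current || [];
}

sketchSelect(dom, '#main a.article').forEach(function (element) {
    console.log(element.attribs.href);
});
```

With the sample tree above, only the first link qualifies: the second has no `article` class and the third sits outside `#main`.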

I wanted a friendly way to scrape HTML using node.js. I tried jsdom, prompted by this article, but unfortunately jsdom takes a strict view of lax HTML, which makes it unusable for scraping the kind of soup found in real-world web pages. Luckily, htmlparser is more forgiving. More details can be found here.

A complete example, including fetching the HTML:

var select = require('soupselect').select,
    htmlparser = require("htmlparser"),
    http = require('http');

// fetch some HTML...
var host = 'www.reddit.com';
var request = http.request({ host: host, port: 80, path: '/', method: 'GET' });

request.on('response', function (response) {
    response.setEncoding('utf8');

    // accumulate the response body chunk by chunk
    var body = "";
    response.on('data', function (chunk) {
        body = body + chunk;
    });

    response.on('end', function() {

        // now we have the whole body, parse it and select the nodes we want...
        var handler = new htmlparser.DefaultHandler(function(err, dom) {
            if (err) {
                console.error("Error: " + err);
            } else {

                // soupselect happening here...
                var titles = select(dom, 'a.title');

                console.log("Top stories from reddit");
                titles.forEach(function(title) {
                    console.log("- " + title.children[0].raw + " [" + title.attribs.href + "]\n");
                });
            }
        });

        var parser = new htmlparser.Parser(handler);
        parser.parseComplete(body);
    });
});
request.end();
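The output line in the example above reads `title.children[0].raw` and `title.attribs.href` directly, which throws if a matched link has no children. A slightly more defensive formatting step could look like the sketch below. `describeLink` is a hypothetical helper, not part of soupselect or htmlparser, and the sample nodes are hand-built stand-ins for the result of `select(dom, 'a.title')`.

```javascript
// Hypothetical helper: turn a matched <a> node into a printable line,
// tolerating links with no first child or no href attribute.
function describeLink(node) {
    var first = (node.children && node.children[0]) || null;
    var text = first && first.raw ? first.raw.trim() : '(no text)';
    var href = (node.attribs && node.attribs.href) || '(no href)';
    return '- ' + text + ' [' + href + ']';
}

// Hand-built nodes standing in for the result of select(dom, 'a.title').
var sampleTitles = [
    { type: 'tag', name: 'a', attribs: { 'class': 'title', href: '/r/example/1' },
      children: [{ type: 'text', raw: ' First story ', data: ' First story ' }] },
    { type: 'tag', name: 'a', attribs: { 'class': 'title' }, children: [] }
];

sampleTitles.map(describeLink).forEach(function (line) {
    console.log(line);
});
```

Keeping the formatting in its own function also makes it easy to check without fetching a live page.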

Notes:

Requires node-htmlparser > 1.6.2 and node.js 2+
Calls to select are synchronous; not worth trying to make it asynchronous IMO, given the use case

Package last updated on 28 Apr 2011
