An automatic web page content extractor for Node.js!
Automatically grab the main text out of a webpage like this:
extractor = require('unfluff');
data = extractor(my_html_data);
console.log(data.text);
In other words, it turns pretty webpages into boring plain text/json data:
This might be useful for:
Please don't use this for:
This library is largely based on python-goose by Xavier Grangier, which is in turn based on goose by Gravity Labs. However, it is not an exact port, so it may behave differently on some pages, and the feature set is slightly different. If you are looking for a Python or Scala/Java/JVM solution, check out those libraries!
npm install --save unfluff
You can use unfluff from node or right on the command line!
This is what unfluff will try to grab from a web page:
title - The document's title (from the <title> tag)
text - The main text of the document with all the junk thrown away
tags - Any tags or keywords that could be found by checking <rel> tags or by looking at href urls.
canonicalLink - The canonical url of the document, if given.
lang - The language of the document, either detected or supplied by you.
description - The description of the document, from <meta> tags
favicon - The url of the document's favicon.
This is returned as a simple json object.
You can pass a webpage to unfluff and it will try to parse out the interesting bits.
You can either pass in a file name:
unfluff my_file.html
Or you can pipe it in:
curl -s "http://somesite.com/page" | unfluff
You can easily chain this together with other unix commands to do cool stuff. For example, you can download a web page, parse it, and then use jq to print just the body text.
curl -s "http://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u" | unfluff | jq -r .text
And here's how to find the top 10 most common words in an article:
curl -s "http://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u" | unfluff | tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -10
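The same word count can be sketched in node once you have the extracted text. This is a minimal illustration, not part of unfluff: topWords is a hypothetical helper, and the sample string stands in for data.text. Unlike the shell pipeline above, it skips empty tokens and only looks at the text you pass in rather than the whole json output.

```javascript
// Hypothetical helper (not part of unfluff): count word frequencies in a string.
function topWords(text, limit = 10) {
  const counts = new Map();
  // Split on runs of non-alphanumeric ASCII characters, similar to tr -c '[:alnum:]'.
  for (const word of text.split(/[^A-Za-z0-9]+/)) {
    if (word === '') continue; // skip empty tokens at the edges of the string
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  // Sort by descending count and keep the first `limit` [word, count] pairs.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit);
}

// In real use, `sample` would be the `text` field returned by the extractor.
const sample = 'the knight and the shovel and the castle';
console.log(topWords(sample, 2));
```

Counting is case-sensitive here, as it is in the shell version; lowercase the text first if you want "The" and "the" merged.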
extractor(html, language)
html: The html you want to parse
language (optional): The document's two-letter language code. This will be auto-detected as best as possible, but there might be cases where you want to override it.
The extraction algorithm depends heavily on the language, so it probably won't work if you have the language set incorrectly.
extractor = require('unfluff');
data = extractor(my_html_data);
Or supply the language code yourself:
extractor = require('unfluff');
data = extractor(my_html_data, 'en');
data will then be a json object that looks like this:
{
"title": "Shovel Knight review: rewrite history",
"text": "Shovel Knight is inspired by the past in all the right ways — but it's far from stuck in it. [.. snip ..]",
"tags": [],
"canonicalLink": "http://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u",
"lang": "en",
"description": "Shovel Knight is inspired by the past in all the right ways — but it's far from stuck in it.",
"favicon": "http://cdn1.vox-cdn.com/community_logos/42931/favicon.ico"
}
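As a rough sketch of how the returned object might be consumed, the snippet below builds a one-line summary from a few of its fields. The summarize function and its maxLength parameter are hypothetical and not part of unfluff; the sample object is a shortened copy of the one above, and the fallbacks for missing fields are an assumption about how you may want to handle sparse pages.

```javascript
// Hypothetical helper (not part of unfluff): build a one-line summary
// from an object shaped like the one the extractor returns.
function summarize(data, maxLength = 60) {
  // Fall back to placeholders when a field is missing or empty.
  const title = data.title || '(untitled)';
  const lang = data.lang || 'unknown';
  const text = data.text || '';
  // Truncate long body text so the summary stays short.
  const preview = text.length > maxLength ? text.slice(0, maxLength) + '...' : text;
  return `[${lang}] ${title}: ${preview}`;
}

// Sample object for illustration; in real use this comes from the extractor.
const data = {
  title: 'Shovel Knight review: rewrite history',
  text: 'Shovel Knight is inspired by the past in all the right ways',
  tags: [],
  lang: 'en'
};
console.log(summarize(data, 20));
```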