
read-art

Automatically scrape the article from any web page and make it readable.

0.3.0

Version published: 10 years ago

Weekly downloads: 55; increased by 587.5%

Maintainers: 1

Created: 11 years ago

A readability implementation modeled on Arc90's Readability.
Scrapes the article from any page, automatically.
Makes any web page readable, whether Chinese or English; very useful for an ElasticSearch data spider.

Quickly extracts the title and content of web articles; suitable for node.js crawlers and built to serve ElasticSearch.

NOTE: the dataType option has been renamed to output; sorry for the inconvenience.

Features

Fast, Built on Cheerio
Automatically Reads Title & Content
Follows Redirects
Automatic Decoding of Content Encodings (Avoids Garbled Text, Especially in Chinese)
Gzip/Deflate Encoding (Automatic Decompression)
Proxy Support
Generated User-Agent

Installation

npm install read-art

Usage

read(html/uri [, options], callback)

read-art is designed to be the simplest possible way to scrape a web article. It accepts the following arguments:

html/uri An HTML string or a URI string.
options An optional options object, including:
- output The data type of the article content: text, html or json. See Output for details.
- killBreaks Whether to collapse line breaks, blanks and tab characters (\r\t\n) into a single <br />; defaults to true.
- options from cheerio
- options from req-fast
callback The callback to run: callback(error, article, options)
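
To make the killBreaks description more concrete, here is a minimal sketch of the kind of collapsing it describes. The function name collapseBreaks is hypothetical and this is not read-art's own code; the library's actual rules (for example, how ordinary spaces are treated) may differ.

```javascript
// Hypothetical helper, not part of read-art: replaces every run of
// \r, \t and \n characters with a single <br />.
function collapseBreaks(html) {
  return html.replace(/[\r\t\n]+/g, '<br />');
}

console.log(collapseBreaks('line one\r\n\r\n\tline two'));
// prints: line one<br />line two
```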

See the test or examples folder for a complete example.

Just try it

var read = require('read-art');
// read from a URI, with options as the second argument
read('http://google.com', { charset: 'utf8' }, function(err, art, options){
  if(err){
    throw err;
  }
  var title = art.title,      // title of the article
      content = art.content,  // content of the article
      html = art.html;        // the whole original innerHTML
});
// or pass the URI inside the options object
read({ uri: 'http://google.com', charset: 'utf8' }, function(err, art, options){
  // handle err and art as above
});
// an HTML string works in place of the URI
read('<title>node-art</title><body><div><p>hello, read-art!</p></div></body>', { charset: 'utf8' }, function(err, art, options){
  // handle err and art as above
});
// the HTML string can also be passed as the uri option
read({ uri: '<title>node-art</title><body><div><p>hello, read-art!</p></div></body>', charset: 'utf8' }, function(err, art, options){
  // handle err and art as above
});

CAUTION: Title must be wrapped in a <title> tag and content must be wrapped in a <body> tag.
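
If you prefer promises to callbacks, a function with the read(input, options, callback) shape can be wrapped as in the sketch below. readAsync and fakeRead are hypothetical names introduced here, not part of read-art; fakeRead is only a stand-in so the sketch runs without the package or a network connection. In real use you would pass require('read-art') as the first argument.

```javascript
// Hypothetical wrapper: turns any read(input, options, callback) style
// function into one that returns a Promise for the article.
function readAsync(readFn, input, options) {
  return new Promise(function (resolve, reject) {
    readFn(input, options || {}, function (err, art) {
      if (err) {
        reject(err);   // surface the error through the promise
        return;
      }
      resolve(art);    // art carries title, content and html
    });
  });
}

// Stand-in for read-art (assumption: same callback signature as documented).
function fakeRead(input, options, callback) {
  callback(null, { title: 'node-art', content: '<p>hello, read-art!</p>', html: input }, options);
}

readAsync(fakeRead, '<title>node-art</title><body><div><p>hello, read-art!</p></div></body>', { output: 'html' })
  .then(function (art) {
    console.log(art.title);
  })
  .catch(function (err) {
    console.error(err);
  });
```

Rejecting the promise keeps the error on the caller's side, which is usually easier to handle than throwing inside the callback.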

Output

The output option controls the format of art.content. Three types are available:

text

Returns the inner text of the article content (HTML tags stripped), e.g.:

read('http://example.com', {
  output: 'text'
}, function(err, art){
  // art.content will be formatted as TEXT
});
// or
read('http://example.com', {
  output: {
    type: 'text',
    stripSpaces: true
  }
}, function(err, art){
  // art.content will be formatted as TEXT
});

html

Returns the inner HTML of the article content, e.g.:

read('http://example.com', {
  output: 'html'
}, function(err, art){
  // art.content will be formatted as HTML
});
// or
read('http://example.com', {
  output: {
    type: 'html',
    stripSpaces: true
  }
}, function(err, art){
  // art.content will be formatted as HTML
});

json

Returns the article content as a JSON-style array of items, e.g.:

read('http://example.com', {
  output: 'json'
}, function(err, art){
  // art.content will be formatted as JSON
});
// or
read('http://example.com', {
  output: {
    type: 'json',
    stripSpaces: true
  }
}, function(err, art){
  // art.content will be formatted as JSON
});

art.content will be an array such as:

[
  { "type": "img", "value": "http://example.com/jpg/site1/20140519/00188b1996f214e3a25417.jpg" },
  { "type": "text", "value": "TEXT goes here..." }
]

Only two item types are supported for now: img and text.
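
As an illustration of working with this array, the sketch below separates image URLs from text. splitContent is a hypothetical helper, not part of read-art, and the sample data simply repeats the example above.

```javascript
// Sample art.content in the shape shown above.
var content = [
  { type: 'img', value: 'http://example.com/jpg/site1/20140519/00188b1996f214e3a25417.jpg' },
  { type: 'text', value: 'TEXT goes here...' }
];

// Hypothetical helper: collects image URLs into a list and joins
// the text items into one string, one item per line.
function splitContent(items) {
  var images = [];
  var texts = [];
  items.forEach(function (item) {
    if (item.type === 'img') {
      images.push(item.value);
    } else if (item.type === 'text') {
      texts.push(item.value);
    }
    // items of any other type are ignored
  });
  return { images: images, text: texts.join('\n') };
}

var result = splitContent(content);
console.log(result.images.length); // prints: 1
console.log(result.text);          // prints: TEXT goes here...
```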

As shown above, the output option can be defined in two ways:

Simple string: one of text, html or json.
Complex object, including:

type: one of text, html or json; defaults to 'html'.
stripSpaces: whether to strip whitespace characters (\r\t\n); defaults to false.
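
The two forms can be thought of as shorthand for one normalized object. The sketch below shows one way such normalization could work, using the defaults stated above; normalizeOutput is a hypothetical name, this is not read-art's internal code, and the fallback to 'html' for unknown types is an assumption.

```javascript
var OUTPUT_TYPES = ['text', 'html', 'json'];

// Hypothetical helper: accepts a string or an object and returns
// { type, stripSpaces } with the documented defaults filled in.
function normalizeOutput(output) {
  var type = 'html';        // documented default
  var stripSpaces = false;  // documented default
  if (typeof output === 'string') {
    type = output;
  } else if (output && typeof output === 'object') {
    if (output.type !== undefined) {
      type = output.type;
    }
    stripSpaces = Boolean(output.stripSpaces);
  }
  type = String(type).toLowerCase();
  if (OUTPUT_TYPES.indexOf(type) === -1) {
    type = 'html';          // assumption: unknown types fall back to the default
  }
  return { type: type, stripSpaces: stripSpaces };
}

console.log(normalizeOutput('text'));
console.log(normalizeOutput({ type: 'json', stripSpaces: true }));
```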

Features

Avoid garbled text by specifying the charset

read('http://game.163.com/14/0506/10/9RI8M9AO00314SDA.html', {
  charset: 'gbk'
}, function(err, art){
  // ...
});

Generate agent to simulate browsers

read('http://example.com', {
  agent: true // defaults to true
}, function(err, art){
  // ...
});

Use a proxy to avoid being blocked

read('http://example.com', {
  proxy: {
    host: 'http://myproxy.com/',
    port: 8081,
    proxyAuth: 'user:password'
  }
}, function(err, art){
  // ...
});

Test

npm test

Other Libraries

luin/node-readability

luin/node-readability was the first module to implement Readability in node.js. It has many strengths and is easy to use, but it is too slow. It is based on JSDOM, so the HTML must be strictly well-formed and any mistake breaks parsing, e.g.:

<P>Paragraphs</p>
<p>My book name is <read-art></p>
<div><p>Hey, dude!</div>

All of the above cause hierarchy errors. In addition, JSDOM uses a lot of memory.

bndr/node-read

bndr/node-read is good, and I contributed to it for a while, but it is hard to communicate with Vadim (we are in different time zones) and we have very different ideas, so I decided to write my own.

TODO

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


Package last updated on 05 Nov 2014
