read-art

Automatically scrape the article from any page and make web pages readable.

Version: 0.3.2

Version published: 10 years ago

Weekly downloads: 55; increased by 587.5%

Maintainers: 1

Created: 11 years ago

A readability module modeled on Arc90's Readability.
Scrape the article from any page (automatically).
Make any web page readable, whether it is in Chinese or English.

Quickly scrapes the title and content of web articles; well suited to node.js crawlers and to feeding ElasticSearch.

Features

- Faster Than Any Other Readability Module
- High Performance - Less Memory
- Automatic Reading of Title & Content
- Follows Redirects
- Automatic Decoding of Content Encodings (Avoids Garbled Text, Especially for Chinese)
- Gzip/Deflate Support
- Proxy Support
- Generates a User-Agent

Performance

In my case, about 400 thousand documents are indexed per day (about 10 million per month), the maximum indexing speed is 35 per second, and memory usage stays under 100 megabytes.

Gotchas

- All the spiders are managed by PM2 (I am currently working on it with friends; you are very welcome to use this amazing tool).
- Spiders, Indexers and Data are loosely coupled; they are queued by NSQ.

Installation

npm install read-art

Usage

read(html/uri [, options], callback)

The arguments are:

html/uri An HTML or URI string.
options An optional options object, including:
- output The data type of the article content: html, text or json. See Output for details.
- killBreaks Whether to collapse line breaks, blanks and tab symbols (\r\t\n) into a single <br />; true by default.
- minTextLength If a block of content has fewer than [minTextLength] characters, it is not counted at all; 25 by default.
- options from cheerio
- options from req-fast
- scoreRule A function that customizes the score of each node; see Score Rule for more information. One argument is passed to it:
  - node The cheerio object.
callback The callback to run - callback(error, article, options)

See the test or examples folder for a complete example.
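
Because the callback takes the error as its first argument, read can be wrapped in a Promise. The following is a minimal sketch that assumes only the read(html/uri, options, callback) signature described above. readAsync is a hypothetical helper, not part of read-art, and it receives the read function as a parameter so the sketch stays self-contained.

```javascript
// Hypothetical helper (not part of read-art): wraps a read-style function
// with the signature read(input, options, callback) in a Promise.
function readAsync(read, input, options) {
  return new Promise(function (resolve, reject) {
    read(input, options || {}, function (err, article) {
      if (err) {
        reject(err);
        return;
      }
      resolve(article);
    });
  });
}

// Possible usage, assuming read-art is installed:
//   var read = require('read-art');
//   readAsync(read, 'http://example.com', { output: 'text' })
//     .then(function (art) { console.log(art.title); })
//     .catch(function (err) { console.error(err); });
```

Passing the read function in explicitly also makes the wrapper easy to exercise with a stub in tests.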

Examples

With High Availability: spider2

var read = require('read-art');
// read from google:
read('http://google.com', function(err, art, options){
    if(err){
      throw err;
    }
    var title = art.title,      // title of article
        content = art.content,  // content of article
        html = art.html;        // whole original innerHTML
});
// or pass everything in a single options object:
read({
    uri: 'http://google.com',
    charset: 'utf8'
  }, function(err, art, options){
    // ...
});
// an HTML string works too:
read('<title>node-art</title><body><div><p>hello, read-art!</p></div></body>', function(err, art, options){
    // ...
});
// or, equivalently:
read({
    uri: '<title>node-art</title><body><div><p>hello, read-art!</p></div></body>'
  }, function(err, art, options){
    // ...
});

CAUTION: Title must be wrapped in a <title> tag and content must be wrapped in a <body> tag.
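
When the input is an HTML fragment rather than a full page, it can be wrapped before it is passed to read. This is a minimal sketch; wrapFragment and escapeText are hypothetical helpers and are not part of read-art.

```javascript
// Hypothetical helpers (not part of read-art).

// Escape characters that would otherwise be parsed as markup in the title.
function escapeText(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Put the title in a <title> tag and the content in a <body> tag,
// which is the shape the CAUTION above asks for.
function wrapFragment(title, fragmentHtml) {
  return '<title>' + escapeText(title) + '</title>' +
         '<body>' + fragmentHtml + '</body>';
}

var html = wrapFragment('node-art', '<div><p>hello, read-art!</p></div>');
// html === '<title>node-art</title><body><div><p>hello, read-art!</p></div></body>'
```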

Score Rule

In some situations, we need to customize the score rules to grab the correct article content, for example on BBS and Q&A forums. There are two effective ways to do this:

- minTextLength It is useful for getting rid of useless elements (`P` / `DIV`), e.g. `minTextLength: 100` will drop all the blocks whose `node.text().length` is less than `100`.
- scoreRule You can customize the score rules manually, e.g.:
```
var options = {
  // give nodes that have the w740 class a bonus of 100 points
  scoreRule: function(node){
    if (node.hasClass('w740')) {
      return 100;
    }
  }
};
```
Elements with the w740 class name get 100 bonus points, which makes that node the topCandidate. In other words, it is enough to make the text of DIV/P.w740 the content of the current article.

Example

read('http://club.autohome.com.cn/bbs/thread-c-66-37239726-1.html', {
  minTextLength: 0,
  scoreRule: function(node){
    if (node.hasClass('w740')) {
      return 100;
    }
  }
}, function(err, art){

});
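
If several class names deserve different bonuses, the rule can be generated from a table. This is a minimal sketch that assumes only what is described above: scoreRule receives a cheerio node, and the returned number is treated as bonus points. classBonusRule and the class name post-body are hypothetical.

```javascript
// Hypothetical helper (not part of read-art): builds a scoreRule function
// from a { className: bonus } table.
function classBonusRule(bonuses) {
  var classNames = Object.keys(bonuses);
  return function (node) {
    var total = 0;
    classNames.forEach(function (name) {
      if (node.hasClass(name)) {
        total += bonuses[name];
      }
    });
    // Return nothing when no class matched, like the hand-written rule.
    if (total !== 0) {
      return total;
    }
  };
}

var options = {
  minTextLength: 0,
  // 'post-body' is a made-up class name used only for illustration
  scoreRule: classBonusRule({ w740: 100, 'post-body': 50 })
};
```

The options object can then be passed as the second argument of read, in the same way as in the example above.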

Output

The article content can be returned in different formats. The output option can be:

String One of text, html and json; html by default.
Object Key-value pairs including:
- type One of text, html and json.
- stripSpaces Whether to strip the whitespace symbols (\r\n\t); false by default.
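
Both forms describe the same setting. The sketch below is not read-art's own code; normalizeOutput is a hypothetical helper that shows how a string value and an object value map onto the same type / stripSpaces pair, using the defaults stated above (html and false). Throwing on an unknown type is a choice made for this sketch only.

```javascript
// Hypothetical helper (not part of read-art): turns either form of the
// output option into a { type, stripSpaces } object.
function normalizeOutput(output) {
  var allowed = ['text', 'html', 'json'];
  var result = { type: 'html', stripSpaces: false }; // documented defaults

  if (typeof output === 'string') {
    result.type = output;
  } else if (output && typeof output === 'object') {
    if (output.type) {
      result.type = output.type;
    }
    result.stripSpaces = Boolean(output.stripSpaces);
  }

  // Assumption of this sketch: reject anything other than the three
  // documented types instead of guessing a fallback.
  if (allowed.indexOf(result.type) === -1) {
    throw new Error('Unknown output type: ' + result.type);
  }
  return result;
}
```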

text

Returns the inner text, e.g.:

read('http://example.com', {
  output: 'text'
}, function(err, art){
  // art.content will be formatted as TEXT
});
// or
read('http://example.com', {
  output: {
    type: 'text',
    stripSpaces: true
  }
}, function(err, art){
  // art.content will be formatted as TEXT
});

html

Returns the inner HTML, e.g.:

read('http://example.com', {
  output: 'html'
}, function(err, art){
  // art.content will be formatted as HTML
});
// or
read('http://example.com', {
  output: {
    type: 'html',
    stripSpaces: true
  }
}, function(err, art){
  // art.content will be formatted as HTML
});

json

Returns the content as structured data suitable for a RESTful result, e.g.:

read('http://example.com', {
  output: 'json'
}, function(err, art){
  // art.content will be formatted as JSON
});
// or
read('http://example.com', {
  output: {
    type: 'json',
    stripSpaces: true
  }
}, function(err, art){
  // art.content will be formatted as Array
});

The art.content will be an Array such as:

[
  { "type": "img", "value": "http://example.com/jpg/site1/20140519/00188b1996f214e3a25417.jpg" },
  { "type": "text", "value": "TEXT goes here..." }
]

Until now there are only two types, img and text. The src of an img element is absolute even if the original is relative.
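
A consumer of the json output will often want the text and the images separately. This is a minimal sketch that assumes only the item shape shown above (objects with a type of img or text and a value); splitContent is a hypothetical helper, not part of read-art.

```javascript
// Hypothetical helper (not part of read-art): separates the items of a
// json-formatted art.content array into text blocks and image URLs.
function splitContent(items) {
  var result = { text: [], images: [] };
  items.forEach(function (item) {
    if (item.type === 'img') {
      result.images.push(item.value);
    } else if (item.type === 'text') {
      result.text.push(item.value);
    }
    // Any other type is ignored; only img and text are documented.
  });
  return result;
}

var parts = splitContent([
  { type: 'img', value: 'http://example.com/jpg/site1/20140519/00188b1996f214e3a25417.jpg' },
  { type: 'text', value: 'TEXT goes here...' }
]);
// parts.text   is ['TEXT goes here...']
// parts.images is ['http://example.com/jpg/site1/20140519/00188b1996f214e3a25417.jpg']
```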

Notes: Video sources differ a lot from site to site, so it is hard to handle them all in a common way. I haven't found a good way to solve that; PRs are welcome.

You Should Know

Pass the charset manually to avoid garbled text

read('http://game.163.com/14/0506/10/9RI8M9AO00314SDA.html', {
  charset: 'gbk'
}, function(err, art){
  // ...
});

Generate a User-Agent to simulate browsers

read('http://example.com', {
  agent: true // true by default
}, function(err, art){
  // ...
});

Use a proxy to avoid being blocked

read('http://example.com', {
  proxy: {
    host: 'http://myproxy.com/',
    port: 8081,
    proxyAuth: 'user:password'
  }
}, function(err, art){
  // ...
});

Test

npm test

Other Library

luin/node-readability

luin/node-readability is an older Readability port derived from Arc90's. It is easy to use, but the problem is that it is TOO SLOW. It is based on jsdom, so the HTML must be written strictly, which means you cannot make any mistakes, e.g.:

<P>Paragraphs</p>
<p>My book name is <read-art></p>
<div><p>Hey, dude!</div>

All of the above will cause hierarchy errors. More seriously, jsdom is a memory killer.

bndr/node-read

I contributed to this project for a while, but it's hard to communicate with Vadim (we are in different time zones), and we have very different ideas, so I decided to write my own.

TODO

get published time
get author
get source
pagination

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Package last updated on 03 Jan 2015
