read-art


- A Readability implementation modeled on Arc90's.
- Scrape the article from any page, automatically.
- Make any web page readable, whether Chinese or English; very useful for an ElasticSearch data spider.
Quickly scrape the title and content of web articles; well suited to node.js crawlers that feed ElasticSearch.
NOTES: the property dataType was changed to output, sorry for that.
Features
- Fast, built on Cheerio
- Automatically reads the title and content
- Follows redirects
- Automatically decodes content encodings (avoids garbled text, especially for Chinese)
- Gzip/Deflate encoding (automatic decompression)
- Proxy
- Generates a User-Agent
Installation
npm install read-art
Usage
read(html/uri [, options], callback)
read-art is designed to be the simplest possible way to scrape web articles. It accepts the following arguments:
- html/uri An HTML or URI string.
- options An optional options object, including:
  - output The data type of the article content: html, text or json. See Output for details.
  - killBreaks A value indicating whether to collapse line breaks, blanks and tab symbols (\r\t\n) into a single <br />; true by default.
  - options from cheerio
  - options from req-fast
- callback The callback to run: callback(error, article, options)
See the test or examples folder for a complete example.
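Since the callback follows the callback(error, article, options) shape, it can be wrapped in a Promise when that style is preferred. The sketch below is only an illustration: readAsync and fakeRead are hypothetical names that are not part of read-art, and fakeRead merely stands in for the real module so the snippet runs without network access.

```javascript
// Wrap a callback-style read function in a Promise.
// readFn is assumed to follow the documented signature:
//   readFn(input, options, callback(error, article, options))
function readAsync(readFn, input, options) {
  return new Promise(function (resolve, reject) {
    readFn(input, options || {}, function (err, article, usedOptions) {
      if (err) {
        return reject(err);
      }
      // Resolve with both values so the caller keeps access to the options.
      resolve({ article: article, options: usedOptions });
    });
  });
}

// Hypothetical stand-in for require('read-art'); it performs no scraping.
function fakeRead(input, options, callback) {
  callback(null, { title: 'demo', content: '<p>hello</p>' }, options);
}

readAsync(fakeRead, 'http://example.com', { output: 'html' }).then(function (result) {
  console.log(result.article.title);
});
```

In real use, the first argument would be the function returned by require('read-art').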
Just try it
var read = require('read-art');

// Read from a URI, passing the options separately.
read('http://google.com', { charset: 'utf8' }, function(err, art, options){
  if(err){
    throw err;
  }
  var title = art.title,
      content = art.content,
      html = art.html;
});

// Read from a URI carried inside the options object.
read({ uri: 'http://google.com', charset: 'utf8' }, function(err, art, options){
});

// Read from an HTML string.
read('<title>node-art</title><body><div><p>hello, read-art!</p></div></body>', { charset: 'utf8' }, function(err, art, options){
});

// Read from an HTML string carried in the uri property.
read({ uri: '<title>node-art</title><body><div><p>hello, read-art!</p></div></body>', charset: 'utf8' }, function(err, art, options){
});
CAUTION: The title must be wrapped in a <title> tag and the content must be wrapped in a <body> tag.
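The call forms shown above (input as a string or inside an object, options optional) can be reduced to a single shape before any work is done. The following is a minimal sketch of that idea, not read-art's actual implementation; normalizeArgs is a hypothetical helper name.

```javascript
// Hypothetical helper: reduce the documented call forms to
// { input, options, callback }.
function normalizeArgs(input, options, callback) {
  // read(input, callback): the options argument was omitted.
  if (typeof options === 'function') {
    callback = options;
    options = {};
  }
  options = Object.assign({}, options);

  // read({ uri: ..., charset: ... }, callback): input travels inside the object.
  if (input !== null && typeof input === 'object') {
    options = Object.assign({}, input, options);
    input = options.uri;
    delete options.uri;
  }

  if (typeof input !== 'string' || typeof callback !== 'function') {
    throw new TypeError('expected an html/uri string and a callback');
  }
  return { input: input, options: options, callback: callback };
}

var noop = function () {};
console.log(normalizeArgs({ uri: 'http://google.com', charset: 'utf8' }, noop).input);
console.log(normalizeArgs('<title>node-art</title><body></body>', noop).options);
```

Copying the options instead of mutating them keeps the caller's object untouched, which is a design choice of this sketch only.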
Output
You can choose among several output types for the article content:
text
Returns the inner text of the article content (HTML tags stripped), e.g.:
read('http://example.com', {
  output: 'text'
}, function(err, art){
});
read('http://example.com', {
  output: {
    type: 'text',
    stripSpaces: true
  }
}, function(err, art){
});
html
Returns the inner HTML of the article content, e.g.:
read('http://example.com', {
  output: 'html'
}, function(err, art){
});
read('http://example.com', {
  output: {
    type: 'html',
    stripSpaces: true
  }
}, function(err, art){
});
json
Returns the article content as a structured array suitable for RESTful responses, e.g.:
read('http://example.com', {
  output: 'json'
}, function(err, art){
});
read('http://example.com', {
  output: {
    type: 'json',
    stripSpaces: true
  }
}, function(err, art){
});
art.content will then be an Array such as:
[
  { "type": "img", "value": "http://example.com/jpg/site1/20140519/00188b1996f214e3a25417.jpg" },
  { "type": "text", "value": "TEXT goes here..." }
]
Only two types are supported for now: img and text.
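With json output, a consumer typically walks the array and branches on type. Here is a small sketch of that, using a hard-coded sample shaped like the array above; splitContent is a hypothetical helper, not something read-art provides.

```javascript
// Sample shaped like the art.content array produced by output: 'json'.
var content = [
  { type: 'img', value: 'http://example.com/jpg/site1/20140519/00188b1996f214e3a25417.jpg' },
  { type: 'text', value: 'TEXT goes here...' }
];

// Hypothetical helper: separate image URLs from text fragments.
function splitContent(items) {
  var result = { images: [], text: [] };
  items.forEach(function (item) {
    if (item.type === 'img') {
      result.images.push(item.value);
    } else if (item.type === 'text') {
      result.text.push(item.value);
    }
    // Any other type is ignored, since only img and text are documented.
  });
  return result;
}

var parts = splitContent(content);
console.log(parts.images.length + ' image(s)');
console.log(parts.text.join('\n'));
```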
As shown above, the output option can be defined in two ways:
- Simple String: one of text, html or json.
- Complex Object, including:
  - type: one of text, html or json; defaults to 'html'.
  - stripSpaces: a value indicating whether to strip tab symbols (\r\t\n); defaults to false.
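The two ways of defining output can be folded into one object with the stated defaults. The sketch below illustrates this; normalizeOutput is a hypothetical helper, and throwing on an unknown type is an assumption of this example, not documented read-art behavior.

```javascript
var OUTPUT_TYPES = ['text', 'html', 'json'];

// Hypothetical helper: accept either a string or an object and return
// { type, stripSpaces } with the documented defaults ('html', false).
function normalizeOutput(output) {
  var normalized = { type: 'html', stripSpaces: false };

  if (typeof output === 'string') {
    normalized.type = output;
  } else if (output && typeof output === 'object') {
    if (output.type !== undefined) {
      normalized.type = output.type;
    }
    normalized.stripSpaces = output.stripSpaces === true;
  }

  // Assumption of this sketch: reject anything outside the documented types.
  if (OUTPUT_TYPES.indexOf(normalized.type) === -1) {
    throw new Error('unsupported output type: ' + normalized.type);
  }
  return normalized;
}

console.log(normalizeOutput('text'));
console.log(normalizeOutput({ type: 'json', stripSpaces: true }));
console.log(normalizeOutput());
```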
Features
Avoid garbled characters by specifying the page's charset
read('http://game.163.com/14/0506/10/9RI8M9AO00314SDA.html', {
  charset: 'gbk'
}, function(err, art){
});
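To see why the charset matters, here is a short demonstration of how the same bytes turn into garbled text when decoded with the wrong encoding. It uses UTF-8 and Latin-1 rather than GBK, because Node's Buffer has no built-in GBK support; the sample string is made up for illustration.

```javascript
// The same bytes decoded with the right and the wrong charset.
var original = '中文标题';
var bytes = Buffer.from(original, 'utf8');

var garbled = bytes.toString('latin1'); // wrong charset: unreadable characters
var decoded = bytes.toString('utf8');   // right charset: original text restored

console.log(garbled === original); // false
console.log(decoded === original); // true
```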
Generate a User-Agent to simulate a browser
read('http://example.com', {
  agent: true
}, function(err, art){
});
Use a proxy to avoid being blocked
read('http://example.com', {
  proxy: {
    host: 'http://myproxy.com/',
    port: 8081,
    proxyAuth: 'user:password'
  }
}, function(err, art){
});
Test
npm test
Other Libraries
luin/node-readability is the first module to implement Readability in node.js. It is popular and easy to use, but it is too slow. It is based on JSDOM, so the HTML must be written in strict mode with no mistakes at all, e.g.:
<P>Paragraphs</p>
<p>My book name is <read-art></p>
<div><p>Hey, dude!</div>
All of the above cause hierarchy errors. Moreover, JSDOM is a memory killer.
bndr/node-read is good, and I contributed to it for a while, but it is hard to communicate with Vadim (we are in different timezones), and we have very different ideas. So I decided to write my own.
TODO
License
Copyright 2014 Tjatse
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.