Comparing version 0.3.0 to 0.3.1
@@ -0,1 +1,5 @@ | ||
# 2014/11/28 | ||
- RegExp of videos | ||
- Update documentation | ||
# 2014/11/05 | ||
@@ -2,0 +6,0 @@ - Decode HTML entities manually |
@@ -30,3 +30,3 @@ // Copyright 2014 Tjatse | ||
re_stopwords = /[\.。::!;;](\s|$)/, | ||
re_videos = /http:\/\/(?:www\.)?(?:youtube|vimeo|youku|tudou|56|letv|iqiyi)\.com/i, | ||
re_videos = /(youtube|vimeo|youku|tudou|56|letv|iqiyi|sohu|sina|163)\.(com|com\.cn|cn|net)/i, | ||
re_imgUrl = /\.(gif|jpe?g|png)$/i, | ||
@@ -33,0 +33,0 @@ re_commas = /[,,.。;;??、]/g; |
{ | ||
"name": "read-art", | ||
"version": "0.3.0", | ||
"version": "0.3.1", | ||
"description": "Scrape article from any page, automatically, make web page readability.", | ||
@@ -5,0 +5,0 @@ "main": "index.js", |
README.md
@@ -6,18 +6,37 @@ read-art [](http://badge.fury.io/js/read-art) [](https://travis-ci.org/Tjatse/node-readability) | ||
1. Readability reference to Arc90's. | ||
2. Scrape article from any page, automatically. | ||
3. Make any web page readability, no matter Chinese or English,very useful for ElasticSearch data spider. | ||
2. Scrape article from any page (automatically). | ||
3. Make any web page readable, whether Chinese or English. | ||
> *Quickly scrapes article titles and content from web pages; suited to node.js crawlers and serving ElasticSearch.* | ||
**NOTES: the property `dataType` was changed to `output`, sorry for that.** | ||
## Features | ||
- Fast, based on Cheerio | ||
- Faster Than Any Readability Module | ||
- High Performance - Less memory | ||
- Automatic Read Title & Content | ||
- Follow Redirects | ||
- Automatic Decoding Content Encodings(Avoid Messy Codes, Especially Chinese) | ||
- Gzip/Deflate Encoding(Automatic Decompress) | ||
- Proxy | ||
- Gzip/Deflate Support | ||
- Proxy Support | ||
- Generate User-Agent | ||
## Performance | ||
In my case, the indexed data amounts to about **400 thousand per day** (**10 million per month**), the maximum indexing speed is **35/second**, and memory usage stays **under 100 megabytes**. | ||
**Pictures don't lie:** | ||
 | ||
 | ||
 | ||
 | ||
You may also want to know: | ||
- All the spiders are managed by [PM2](https://github.com/Unitech/PM2) (I am currently working on it with friends; you are very welcome to use this amazing tool). | ||
- Spiders, Indexers and Data are loosely coupled; they're queued by NSQ. | ||
## Pure Example With High Availability | ||
[spider2](https://github.com/Tjatse/spider2) | ||
## Installation | ||
@@ -33,8 +52,8 @@ ```javascript | ||
read-art is designed to be the simplest way possible to make web-article scrape, it supports the definitions such as: | ||
It accepts the following arguments: | ||
* **html/uri** Html or Uri string. | ||
* **options** An optional options object, including: | ||
- **output** The data type of article content, including: html, text. see more from [Output](#output) | ||
- **killBreaks** A value indicating whether kill breaks, blanks, tab symbols(\r\t\n) into one `<br />` or not, `true` as default. | ||
- **output** The data type of the article content: `html`, `text` or `json`. See [Output](#output) for details. | ||
- **killBreaks** A value indicating whether to collapse line breaks, blanks and tab symbols (\r\t\n) into one `<br />`, `true` by default. | ||
- **options from [cheerio](https://github.com/cheeriojs/cheerio)** | ||
@@ -46,24 +65,29 @@ - **options from [req-fast](https://github.com/Tjatse/req-fast)** | ||
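As a rough illustration of what `killBreaks` describes, the sketch below collapses each run of `\r`, `\t` and `\n` into a single `<br />`. The helper name `collapseBreaks` and the exact regular expression are assumptions made for this example; read-art's internal implementation may differ.

```javascript
// Hypothetical helper, not part of read-art's API:
// replaces every run of \r, \t and \n with one <br />.
function collapseBreaks(html) {
  return html.replace(/[\r\t\n]+/g, '<br />');
}

console.log(collapseBreaks('first line\r\n\r\nsecond line\tthird'));
// prints "first line<br />second line<br />third"
```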
Just try it | ||
### Simple Examples | ||
```javascript | ||
var read = require('read-art'); | ||
// read from google could be | ||
read('http://google.com', { charset: 'utf8' }, function(err, art, options){ | ||
if(err){ | ||
throw err; | ||
} | ||
var title = art.title, // title of article | ||
content = art.content, // content of article | ||
html = art.html; // whole original innerHTML | ||
// read from google: | ||
read('http://google.com', function(err, art, options){ | ||
if(err){ | ||
throw err; | ||
} | ||
var title = art.title, // title of article | ||
content = art.content, // content of article | ||
html = art.html; // whole original innerHTML | ||
}); | ||
// or | ||
read({ uri: 'http://google.com', charset: 'utf8' }, function(err, art, options){ | ||
// or: | ||
read({ | ||
uri: 'http://google.com', | ||
charset: 'utf8' | ||
}, function(err, art, options){ | ||
}); | ||
// what about html? | ||
read('<title>node-art</title><body><div><p>hello, read-art!</p></div></body>', { charset: 'utf8' }, function(err, art, options){ | ||
read('<title>node-art</title><body><div><p>hello, read-art!</p></div></body>', function(err, art, options){ | ||
}); | ||
// of course could be | ||
read({ uri: '<title>node-art</title><body><div><p>hello, read-art!</p></div></body>', charset: 'utf8' }, function(err, art, options){ | ||
read({ | ||
uri: '<title>node-art</title><body><div><p>hello, read-art!</p></div></body>' | ||
}, function(err, art, options){ | ||
@@ -75,5 +99,14 @@ }); | ||
## Output | ||
You can set different types to wrap the outputs | ||
You can wrap the article content in different types. The `output` option can be: | ||
- **String** | ||
One of `text`, `html` and `json`, `html` by default. | ||
- **Object** | ||
Key-value pairs including: | ||
- **type** | ||
One of `text`, `html` and `json`. | ||
- **stripSpaces** | ||
A value indicating whether to strip the tab symbols (\r\n\t), `false` by default. | ||
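To make the two accepted forms concrete, here is a minimal sketch of how a string or object `output` value could be normalized into one shape. `normalizeOutput` is a hypothetical helper written for this example, not a function exported by read-art, and the fallback to `html` for unknown types is an assumption.

```javascript
// Hypothetical helper: turns either form of `output` into { type, stripSpaces }.
var OUTPUT_TYPES = ['text', 'html', 'json'];

function normalizeOutput(output) {
  // Defaults as documented: type 'html', stripSpaces false.
  var normalized = { type: 'html', stripSpaces: false };

  if (typeof output === 'string') {
    normalized.type = output;
  } else if (output && typeof output === 'object') {
    if (output.type) {
      normalized.type = output.type;
    }
    normalized.stripSpaces = !!output.stripSpaces;
  }

  // Assumption of this sketch: unknown types fall back to 'html'.
  if (OUTPUT_TYPES.indexOf(normalized.type) === -1) {
    normalized.type = 'html';
  }
  return normalized;
}

console.log(normalizeOutput('text'));
console.log(normalizeOutput({ type: 'json', stripSpaces: true }));
```

Under this sketch, `output: 'text'` and `output: { type: 'text' }` are treated the same way.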
### text | ||
Returns the inner text of article content(strip html tags), e.g.: | ||
Returns the inner text, e.g.: | ||
```javascript | ||
@@ -97,3 +130,3 @@ read('http://example.com', { | ||
### html | ||
Returns the inner HTML of article content, e.g.: | ||
Returns the inner HTML, e.g.: | ||
```javascript | ||
@@ -116,4 +149,6 @@ read('http://example.com', { | ||
**Notes** Videos can be scraped now. The currently supported domains are: *youtube|vimeo|youku|tudou|56|letv|iqiyi|sohu|sina|163*. | ||
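The domain list comes from the `re_videos` pattern in the source. The sketch below copies both the 0.3.0 and 0.3.1 patterns from the diff and runs them against a few sample URLs; the wrapper `isVideoUrl` and the URLs themselves are made up for illustration.

```javascript
// Pattern used in 0.3.1 (copied from the source diff).
var re_videos = /(youtube|vimeo|youku|tudou|56|letv|iqiyi|sohu|sina|163)\.(com|com\.cn|cn|net)/i;
// Pattern used in 0.3.0, kept here for comparison.
var re_videos_old = /http:\/\/(?:www\.)?(?:youtube|vimeo|youku|tudou|56|letv|iqiyi)\.com/i;

// Hypothetical wrapper for this example.
function isVideoUrl(url) {
  return re_videos.test(url);
}

var samples = [
  'http://v.youku.com/v_show/id_1.html',  // subdomain other than www
  'https://www.youtube.com/watch?v=1',    // https scheme
  'http://video.sina.com.cn/p/1.html',    // newly added domain
  'http://example.org/clip.mp4'           // unsupported domain
];

samples.forEach(function (url) {
  console.log(url, 'new:', isVideoUrl(url), 'old:', re_videos_old.test(url));
});
```

Note that the newer pattern is not anchored to the host part, so it also matches a supported domain that appears elsewhere in the string, for example inside a query parameter.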
### json | ||
Returns the restful result of article content, e.g.: | ||
Returns the restful result, e.g.: | ||
```javascript | ||
@@ -132,5 +167,6 @@ read('http://example.com', { | ||
}, function(err, art){ | ||
// art.content will be formatted as JSON | ||
// art.content will be formatted as Array | ||
}); | ||
``` | ||
The art.content will be an Array such as: | ||
@@ -143,12 +179,10 @@ ```json | ||
``` | ||
There only two types are supported now: *img* and *text* | ||
As you see, the output could be defined in two ways: | ||
1. Simple String, should be one of *text*, *html* and *json*. | ||
2. Complex Object, including: | ||
- type: one of *text*, *html* and *json*, default as 'html'. | ||
- stripSpaces: a value indicating whether strip tab symbols(\r\t\n), default as false. | ||
Until now there are only two types, *img* and *text*. The `src` of an `img` element is absolute even if the original is relative. | ||
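One way to picture the relative-to-absolute conversion is with Node's built-in WHATWG `URL` class, as in the sketch below. The helper `toAbsoluteSrc` and the example URLs are assumptions for illustration; the source does not say how read-art resolves the address internally.

```javascript
// Hypothetical helper: resolves a possibly relative `src` against the page URL.
function toAbsoluteSrc(src, pageUrl) {
  return new URL(src, pageUrl).href;
}

console.log(toAbsoluteSrc('/img/a.png', 'http://example.com/post/1.html'));
// prints "http://example.com/img/a.png"
console.log(toAbsoluteSrc('../b.jpg', 'http://example.com/a/post/1.html'));
// prints "http://example.com/a/b.jpg"
```

A `src` that is already absolute is returned unchanged.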
## Features | ||
### Refrain from the crazy messy codes | ||
**Notes** The video sources of these sites are quite different and hard to handle in a common way. I haven't found a good way to solve that; PRs are welcome. | ||
## You Should Know | ||
### Pass the charset manually to avoid garbled text | ||
```javascript | ||
@@ -191,3 +225,3 @@ read('http://game.163.com/14/0506/10/9RI8M9AO00314SDA.html', { | ||
### [luin/node-readability](https://github.com/luin/node-readability) | ||
luin/node-readability is the first module which implements Readability in node.js, lots of hit points, easy to use, but the problem is - Too slow. It was based on `JSDOM`, the HTML must be written in strict mode, you can not make any mistake, e.g.: | ||
luin/node-readability is an older Readability port derived from **Arc90**'s. It is easy to use, but the problem is that it is TOO SLOW. It is based on `jsdom`, so the HTML must be written in strict mode, which means you cannot make any mistake, e.g.: | ||
@@ -199,9 +233,9 @@ ```html | ||
``` | ||
All above will cause hiberarchy errors, and otherwise, `JSDOM` is a memory killer. | ||
All of the above will cause `hierarchy errors`; more seriously, `jsdom` is a memory killer. | ||
### [bndr/node-read](https://github.com/bndr/node-read) | ||
bndr/node-read is good, and I've contributed on this for a while, but it's hard to communicate with Vadim(we are in a different timezone), and we have very different ideas. So I decided to write it on my own. | ||
I contributed to this for a while, but it's hard to communicate with Vadim (we are in different timezones), and we have very different ideas. So I decided to write my own. | ||
## TODO | ||
- [x] get video, img tags | ||
- [ ] get published time | ||
@@ -211,3 +245,2 @@ - [ ] get author | ||
- [ ] pagination | ||
- [x] more tests | ||
@@ -214,0 +247,0 @@ ## License |
Sorry, the diff of this file is not supported yet
Sorry, the diff of this file is not supported yet