@@ -30,3 +30,3 @@ // Copyright 2014 Tjatse
		re_stopwords = /[\.。:：！;；](\s|$)/,
		re_videos = /http:\/\/(?:www\.)?(?:youtube|vimeo|youku|tudou|56|letv|iqiyi)\.com/i,
		re_videos = /(youtube|vimeo|youku|tudou|56|letv|iqiyi|sohu|sina|163)\.(com|com\.cn|cn|net)/i,
		re_imgUrl = /\.(gif|jpe?g|png)$/i,
		@@ -33,0 +33,0 @@ re_commas = /[,，.。;；?？、]/g;
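
To make the effect of the `re_videos` change concrete, here is a minimal sketch that runs the old and new patterns from this hunk against a few sample URLs. The URLs are hypothetical and chosen only to exercise the patterns; they do not come from the package or its tests.

```javascript
// Old and new patterns, copied from the hunk above.
var oldVideos = /http:\/\/(?:www\.)?(?:youtube|vimeo|youku|tudou|56|letv|iqiyi)\.com/i;
var newVideos = /(youtube|vimeo|youku|tudou|56|letv|iqiyi|sohu|sina|163)\.(com|com\.cn|cn|net)/i;

// Hypothetical sample URLs (assumptions, for illustration only).
var samples = [
  'http://www.youtube.com/embed/abc',     // plain http with www
  'https://player.vimeo.com/video/1',     // https and a subdomain
  'http://tv.sohu.com/upload/swf/x.swf',  // newly added domain
  'http://video.sina.com.cn/play.swf',    // newly added domain, .com.cn
  'http://example.org/clip.mp4'           // unrelated host
];

// Record whether each pattern matches each URL.
var results = samples.map(function (url) {
  return { url: url, old: oldVideos.test(url), now: newVideos.test(url) };
});

console.log(results);
```

The new pattern no longer requires `http://` or an optional `www.` prefix, so it accepts https URLs and subdomains. The trade-off is that it is unanchored and will also match any host that merely ends in one of the listed names.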

package.json

		{
		"name": "read-art",
		"version": "0.3.0",
		"version": "0.3.1",
		"description": "Scrape article from any page, automatically, make web page readability.",
		@@ -5,0 +5,0 @@ "main": "index.js",

README.md

		@@ -6,18 +6,37 @@ read-art [![NPM version](https://badge.fury.io/js/read-art.svg)](http://badge.fury.io/js/read-art) [![Build Status](https://travis-ci.org/Tjatse/node-readability.svg?branch=master)](https://travis-ci.org/Tjatse/node-readability)
		1. Readability, with reference to Arc90's.
		2. Scrape article from any page, automatically.
		3. Make any web page readability, no matter Chinese or English,very useful for ElasticSearch data spider.
		2. Scrape article from any page (automatically).
		3. Make any web page readable, whether Chinese or English.

		> Quickly scrape article titles and content from web pages; suited to node.js crawlers and serving ElasticSearch.

		NOTE: the property `dataType` was renamed to `output`, sorry for that.

		## Features
		- Fast, based on Cheerio
		- Faster Than Any Readability Module
		- High Performance - Less memory
		- Automatic Read Title & Content
		- Follow Redirects
		- Automatic Decoding of Content Encodings (Avoids Garbled Text, Especially Chinese)
		- Gzip/Deflate Encoding(Automatic Decompress)
		- Proxy
		- Gzip/Deflate Support
		- Proxy Support
		- Generate User-Agent

		## Performance
		In my case, the indexed data amounts to about 400 thousand items per day (10 million per month), the maximum indexing speed is 35 per second, and memory usage stays under 100 megabytes.

		Pictures don't lie:

		![image](screenshots/es.jpg)

		![image](screenshots/performance.jpg)

		![image](screenshots/mem.jpg)

		![image](screenshots/search.jpg)

		You may also want to know:
		- All the spiders are managed by [PM2](https://github.com/Unitech/PM2) (I am currently working on it with friends; you are very welcome to try this amazing tool).
		- Spiders, Indexers and Data are loosely coupled; they are queued through NSQ.

		## Pure Example With High Availability
		[spider2](https://github.com/Tjatse/spider2)

		## Installation
		@@ -33,8 +52,8 @@ ```javascript

		read-art is designed to be the simplest possible way to scrape web articles; it supports definitions such as:
		It supports definitions such as:

		* html/uri Html or Uri string.
		* options An optional options object, including:
		- output The data type of article content, including: html, text. See [Output](#output) for more.
		- killBreaks Whether to collapse line breaks, blanks and tab symbols (\r\t\n) into one `<br />`; `true` as default.
		- output The data type of article content, one of `html`, `text` or `json`. See [Output](#output) for more.
		- killBreaks Whether to collapse line breaks, blanks and tab symbols (\r\t\n) into one `<br />`; `true` by default.
		- options from [cheerio](https://github.com/cheeriojs/cheerio)
		@@ -46,24 +65,29 @@ - options from [req-fast](https://github.com/Tjatse/req-fast)

		Just try it
		### Simple Examples
		```javascript
		var read = require('read-art');
		// read from google could be
		read('http://google.com', { charset: 'utf8' }, function(err, art, options){
		if(err){
		throw err;
		}
		var title = art.title, // title of article
		content = art.content, // content of article
		html = art.html; // whole original innerHTML
		// read from google:
		read('http://google.com', function(err, art, options){
		if(err){
		throw err;
		}
		var title = art.title, // title of article
		content = art.content, // content of article
		html = art.html; // whole original innerHTML
		});
		// or
		read({ uri: 'http://google.com', charset: 'utf8' }, function(err, art, options){
		// or:
		read({
		uri: 'http://google.com',
		charset: 'utf8'
		}, function(err, art, options){

		});
		// what about html?
		read('<title>node-art</title><body><div><p>hello, read-art!</p></div></body>', { charset: 'utf8' }, function(err, art, options){
		read('<title>node-art</title><body><div><p>hello, read-art!</p></div></body>', function(err, art, options){

		});
		// of course could be
		read({ uri: '<title>node-art</title><body><div><p>hello, read-art!</p></div></body>', charset: 'utf8' }, function(err, art, options){
		read({
		uri: '<title>node-art</title><body><div><p>hello, read-art!</p></div></body>'
		}, function(err, art, options){

		@@ -75,5 +99,14 @@ });
		## Output
		You can set different types to wrap the outputs
		You can wrap the article content in different types; the `output` option can be:
		- String
		One of `text`, `html` and `json`, `html` by default.
		- Object
		Key-value pairs including:
		- type
		One of `text`, `html` and `json`.
		- stripSpaces
		Whether to strip the tab symbols (\r\n\t), `false` by default.
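
The two accepted forms of `output` can be pictured as one normalized shape. The sketch below is only an illustration: `normalizeOutput` is a hypothetical helper, not a function exported by read-art, and the fallback for unknown types is an assumption rather than documented behaviour.

```javascript
// Hypothetical helper (not part of read-art): maps the String or Object form
// of the `output` option onto a single { type, stripSpaces } object.
function normalizeOutput(output) {
  var allowed = ['text', 'html', 'json'];
  // Defaults as documented above: `html` type, `stripSpaces` off.
  var normalized = { type: 'html', stripSpaces: false };

  if (typeof output === 'string') {
    normalized.type = output;
  } else if (output && typeof output === 'object') {
    if (output.type) {
      normalized.type = output.type;
    }
    normalized.stripSpaces = output.stripSpaces === true;
  }

  // Assumption: an unknown type falls back to the default instead of throwing.
  if (allowed.indexOf(normalized.type) === -1) {
    normalized.type = 'html';
  }
  return normalized;
}

console.log(normalizeOutput('text'));
console.log(normalizeOutput({ type: 'json', stripSpaces: true }));
console.log(normalizeOutput());
```

Under these assumptions, `output: 'text'` and `output: { type: 'text' }` describe the same request.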

		### text
		Returns the inner text of article content(strip html tags), e.g.:
		Returns the inner text, e.g.:
		```javascript
		@@ -97,3 +130,3 @@ read('http://example.com', {
		### html
		Returns the inner HTML of article content, e.g.:
		Returns the inner HTML, e.g.:
		```javascript
		@@ -116,4 +149,6 @@ read('http://example.com', {

		Note: videos can now be scraped; the currently supported domains are: youtube|vimeo|youku|tudou|56|letv|iqiyi|sohu|sina|163.

		### json
		Returns the restful result of article content, e.g.:
		Returns the restful result, e.g.:
		```javascript
		@@ -132,5 +167,6 @@ read('http://example.com', {
		}, function(err, art){
		// art.content will be formatted as JSON
		// art.content will be formatted as Array
		});
		```

		The art.content will be an Array such as:
		@@ -143,12 +179,10 @@ ```json
		```
		Only two types are supported now: img and text

		As you see, the output could be defined in two ways:
		1. Simple String, should be one of text, html and json.
		2. Complex Object, including:
		- type: one of text, html and json, default as 'html'.
		- stripSpaces: a value indicating whether strip tab symbols(\r\t\n), default as false.
		Until now there are only two types, img and text; the `src` of an `img` element is absolute even if the original is relative.
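
For readers wondering what "absolute even if the original is relative" means in practice, here is a small sketch of resolving an image `src` against the page address. It uses the WHATWG `URL` class that is global in current Node.js versions, which is an assumption made for illustration; it is not necessarily how read-art resolves URLs internally, and `toAbsolute` is a hypothetical name.

```javascript
// Hypothetical helper: resolve an <img> src against the URL of the page it
// was found on. Absolute inputs are returned unchanged.
function toAbsolute(src, pageUrl) {
  return new URL(src, pageUrl).href;
}

var page = 'http://example.com/news/1.html'; // hypothetical page URL

console.log(toAbsolute('/img/a.png', page));                   // root-relative
console.log(toAbsolute('b.jpg', page));                        // document-relative
console.log(toAbsolute('http://cdn.example.com/c.gif', page)); // already absolute
```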

		## Features
		### Refrain from the crazy messy codes
		Note: the video sources of these sites differ considerably, so it is hard to handle them all in a common way. I haven't found a good solution yet; PRs are welcome.


		## You Should Know
		### Pass the charset manually to avoid garbled text
		```javascript
		@@ -191,3 +225,3 @@ read('http://game.163.com/14/0506/10/9RI8M9AO00314SDA.html', {
		### [luin/node-readability](https://github.com/luin/node-readability)
		luin/node-readability is the first module which implements Readability in node.js, lots of hit points, easy to use, but the problem is - Too slow. It was based on `JSDOM`, the HTML must be written in strict mode, you can not make any mistake, e.g.:
		luin/node-readability is an older Readability port derived from Arc90's. It is easy to use, but the problem is that it is TOO SLOW. It is based on `jsdom`, so the HTML must be written in strict mode, which means you cannot make any mistakes, e.g.:

		@@ -199,9 +233,9 @@ ```html
		```
		All of the above will cause hierarchy errors; besides, `JSDOM` is a memory killer.

		All of the above will cause `hierarchy errors`; more seriously, `jsdom` is a memory killer.

		### [bndr/node-read](https://github.com/bndr/node-read)
		bndr/node-read is good, and I contributed to it for a while, but it's hard to communicate with Vadim (we are in different timezones), and we have very different ideas. So I decided to write my own.
		I contributed to this project for a while, but it's hard to communicate with Vadim (we are in different timezones), and we have very different ideas. So I decided to write my own.

		## TODO
		- [x] get video, img tags
		- [ ] get published time
		@@ -211,3 +245,2 @@ - [ ] get author
		- [ ] pagination
		- [x] more tests

		@@ -214,0 +247,0 @@ ## License

.npmignore

Sorry, the diff of this file is not supported yet

.travis.yml

Sorry, the diff of this file is not supported yet
