Comparing version 0.4.1 to 0.4.3
```diff
 var read = require('../');
-read('http://www.cqn.com.cn/auto/news/73572.html', {
+read('http://news.sohu.com/20151228/n432833902.shtml', {
   timeout : 15000,
   output : {
     type : 'json',
     stripSpaces: true,
     break: true
   },
   minTextLength: 0,
   scoreRule: function(node){
     if (node.hasClass('w740')) {
       if (node.attr('itemprop') == 'articleBody') {
         return 100;
@@ -14,0 +9,0 @@ }
```
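The custom `scoreRule` in the test diff above is easier to read in isolation. Below is a sketch of the same rule; the cheerio-like `hasClass`/`attr` node interface is taken from the diff, but the stand-in node object and the claim that a numeric return value overrides the built-in score are assumptions for illustration:

```javascript
// Custom scoreRule as in the diff above. read-art invokes it per candidate
// node; (assumption) returning a number overrides the built-in score, while
// returning nothing falls back to the default heuristics.
function scoreRule(node) {
  if (node.hasClass('w740') && node.attr('itemprop') == 'articleBody') {
    return 100; // a large score forces this node to win the candidate ranking
  }
}

// Stand-in for a cheerio node, purely for illustration:
var articleNode = {
  hasClass: function (cls) { return cls === 'w740'; },
  attr: function (name) { return name === 'itemprop' ? 'articleBody' : undefined; }
};
console.log(scoreRule(articleNode)); // 100
```

It is wired up exactly as the diff shows: `read(uri, { scoreRule: scoreRule, ... }, callback)`.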
```diff
@@ -0,1 +1,4 @@
+# 2015/12/29
+- fix scoreRule on grandparent node
+# 2015/12/18
@@ -2,0 +5,0 @@ - only fetch body when uri is provided but html is empty
```
```diff
 "use strict";
-var URI = require('URIjs'),
+var URI = require('urijs'),
   util = require('util'),
@@ -539,3 +539,3 @@ entities = require('entities');
   if (grandParent && grandParent.length > 0) {
-    scoreNode(grandParent, score / 2, cans);
+    scoreNode(grandParent, score / 2, cans, options.scoreRule);
   }
@@ -542,0 +542,0 @@ }
```
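The changelog entry "fix scoreRule on grandparent node" is this extra `options.scoreRule` argument: before 0.4.3, half of a node's score bubbled up to its grandparent without the custom rule, so the rule was silently ignored at that level. A minimal sketch of the idea follows; only the `scoreNode` signature comes from the diff, and its body here is a hypothetical stand-in, not the library's actual implementation:

```javascript
// Hypothetical scoreNode to show why forwarding the rule matters.
// (Assumption: a numeric result from scoreRule replaces the passed-in score.)
function scoreNode(node, score, cans, scoreRule) {
  if (typeof scoreRule === 'function') {
    var custom = scoreRule(node);
    if (typeof custom === 'number') score = custom; // custom rule wins
  }
  cans.set(node, (cans.get(node) || 0) + score);
}

var grandParent = {};
var rule = function () { return 100; };

var before = new Map(), after = new Map();
scoreNode(grandParent, 50 / 2, before);       // 0.4.1: rule dropped, plain half-score
scoreNode(grandParent, 50 / 2, after, rule);  // 0.4.3: rule forwarded and applied
console.log(before.get(grandParent), after.get(grandParent)); // 25 100
```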
```diff
 {
   "name": "read-art",
-  "version": "0.4.1",
+  "version": "0.4.3",
   "description": "Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English.",
@@ -34,3 +34,3 @@ "main": "index.js",
     "req-fast": "^0.2.9",
-    "URIjs": "^1.16.1",
+    "urijs": "~1.17.0",
     "entities": "~1.1.1"
@@ -37,0 +37,0 @@ },
```
````diff
@@ -31,26 +31,20 @@ read-art [](http://badge.fury.io/js/read-art) [](https://travis-ci.org/Tjatse/node-readability)
 - Proxy Support
-- Generate User-Agent
+- Auto-generate User-Agent
 - Free and extensible
 <a name="perfs" />
 ## Performance
-In my case, the indexed data is about **400 thousand per day**, **10 million per month**, and the maximize indexing speed is **35/second**, the memory cost are limited **under 100 megabytes**.
+In my case, the [spider](https://github.com/Tjatse/spider2) handles about **700 thousand documents per day** (**22 million per month**); the maximum crawling speed is **450 per minute** (**80 per minute on average**), memory cost is about **200 megabytes** per spider kernel, and accuracy is about 90%; the remaining 10% can be fixed by customizing [Score Rules](#score_rule) or [Selectors](#selectors). It's better than any other readability module.
 **Pictures don't lie:**
-
-
-
-> Server infos:
-> * 20M bandwidth of fibre-optical
-> * 8 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz cpus
-> * 32G memory
+
 **Notes**
 - All the spiders are managed by [PM2](https://github.com/Unitech/PM2) (I am currently working on that with friends, very welcome to use this amazing tool).
 - Loose coupling between Spiders, Indexers and Data; they're queued by NSQ.
 <a name="ins" />
 ## Installation
 ```javascript
-npm install read-art
+npm install read-art --production
 ```
@@ -124,3 +118,3 @@
 ## Score Rule
-In some situations, we need to customize score rules to grab the correct content of article, such as BBS and QA forums.
+In some situations, we need to customize score rules to crawl the correct content of an article, such as on BBS and Q&A forums.
 There are two effective ways to do this:
@@ -160,2 +154,4 @@ - **minTextLength**
 Sometimes we want to extract parts of the article in a custom way, e.g. pick the text of `.article>h3` as the title, and `.article>.author` as the author data:
 ### Example
 ```javascript
@@ -185,2 +181,6 @@ read({
 Properties:
 - **selector** the query selector, e.g. `#article>.title`, `.articles:nth-child(3)`
 - **extract** the data that you want to extract; can be a `String`, `Array` or `Object`.
 **Notes** The binding data will be an object or an array (one object per item) if the `extract` option is an array; `title` and `content` override the default extracting methods, and the output of `content` depends on the `output` option.
````
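The `selector`/`extract` properties described above compose into an options object along these lines. This is a sketch: the URL is a placeholder, and the `selectors` option key and overall shape follow the README's Selectors section rather than code visible in this diff:

```javascript
// Hypothetical selectors configuration built from the two properties above.
// Everything except the `selector`/`extract` keys is a placeholder.
var options = {
  uri: 'http://example.com/article.html',
  selectors: {
    title:  { selector: '.article>h3' },                       // text of .article>h3 becomes the title
    author: { selector: '.article>.author', extract: 'text' }  // extract the author as a plain string
  }
};
// read(options, function (err, art) { ... });  // wiring shown for context only
console.log(options.selectors.author.selector); // .article>.author
```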
````diff
@@ -396,23 +396,2 @@
-## Other Library
-### [luin/node-readability](https://github.com/luin/node-readability)
-luin/node-readability is an old Readability port transformed from **Arc90**; easy to use, but the problem is: TOO SLOW. It is based on `jsdom`, so the HTML must be written in strict mode, meaning you cannot make any mistake, e.g.:
-```html
-<P>Paragraphs</p>
-<p>My book name is <read-art></p>
-<div><p>Hey, dude!</div>
-```
-All of the above will cause hierarchy errors; more seriously, `jsdom` is a memory killer.
-### [bndr/node-read](https://github.com/bndr/node-read)
-I've contributed to this for a while, but it's hard to communicate with Vadim (we are in different timezones), and we have very different ideas. So I decided to write my own.
-## TODO
-- [ ] get published time
-- [ ] get author
-- [ ] get source
-- [ ] pagination
 ## License
@@ -419,0 +398,0 @@ Licensed under the Apache License, Version 2.0 (the "License");
````
+ Added urijs@~1.17.0
+ Added urijs@1.17.1 (transitive)
- Removed URIjs@^1.16.1
- Removed URIjs@1.16.1 (transitive)