Comparing version 0.4.1 to 0.4.3
```diff
 var read = require('../');
-read('http://www.cqn.com.cn/auto/news/73572.html', {
+read('http://news.sohu.com/20151228/n432833902.shtml', {
   timeout : 15000,
   output : {
     type : 'json',
     stripSpaces: true,
     break: true
   },
   minTextLength: 0,
   scoreRule: function(node){
     if (node.hasClass('w740')) {
       if (node.attr('itemprop') == 'articleBody') {
         return 100;
@@ -14,0 +9,0 @@ }
```
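The custom `scoreRule` in the test diff above is easier to read in isolation. Below is a sketch of the same rule; the cheerio-like `hasClass`/`attr` node interface is taken from the diff, but the stand-in node object and the claim that a numeric return value overrides the built-in score are assumptions for illustration:

```javascript
// Custom scoreRule as in the diff above. read-art invokes it per candidate
// node; (assumption) returning a number overrides the built-in score, while
// returning nothing falls back to the default heuristics.
function scoreRule(node) {
  if (node.hasClass('w740') && node.attr('itemprop') == 'articleBody') {
    return 100; // a large score forces this node to win the candidate ranking
  }
}

// Stand-in for a cheerio node, purely for illustration:
var articleNode = {
  hasClass: function (cls) { return cls === 'w740'; },
  attr: function (name) { return name === 'itemprop' ? 'articleBody' : undefined; }
};
console.log(scoreRule(articleNode)); // 100
```

It is wired up exactly as the diff shows: `read(uri, { scoreRule: scoreRule, ... }, callback)`.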
```diff
@@ -0,1 +1,4 @@
+# 2015/12/29
+- fix scoreRule on grandparent node
+# 2015/12/18
@@ -2,0 +5,0 @@ - only fetch body when uri is provided but html is empty
```
```diff
 "use strict";
-var URI = require('URIjs'),
+var URI = require('urijs'),
   util = require('util'),
@@ -539,3 +539,3 @@ entities = require('entities');
   if (grandParent && grandParent.length > 0) {
-    scoreNode(grandParent, score / 2, cans);
+    scoreNode(grandParent, score / 2, cans, options.scoreRule);
   }
@@ -542,0 +542,0 @@ }
```
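The changelog entry "fix scoreRule on grandparent node" is this extra `options.scoreRule` argument: before 0.4.3, half of a node's score bubbled up to its grandparent without the custom rule, so the rule was silently ignored at that level. A minimal sketch of the idea follows; only the `scoreNode` signature comes from the diff, and its body here is a hypothetical stand-in, not the library's actual implementation:

```javascript
// Hypothetical scoreNode to show why forwarding the rule matters.
// (Assumption: a numeric result from scoreRule replaces the passed-in score.)
function scoreNode(node, score, cans, scoreRule) {
  if (typeof scoreRule === 'function') {
    var custom = scoreRule(node);
    if (typeof custom === 'number') score = custom; // custom rule wins
  }
  cans.set(node, (cans.get(node) || 0) + score);
}

var grandParent = {};
var rule = function () { return 100; };

var before = new Map(), after = new Map();
scoreNode(grandParent, 50 / 2, before);       // 0.4.1: rule dropped, plain half-score
scoreNode(grandParent, 50 / 2, after, rule);  // 0.4.3: rule forwarded and applied
console.log(before.get(grandParent), after.get(grandParent)); // 25 100
```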
```diff
 {
   "name": "read-art",
-  "version": "0.4.1",
+  "version": "0.4.3",
   "description": "Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English.",
@@ -34,3 +34,3 @@ "main": "index.js",
     "req-fast": "^0.2.9",
-    "URIjs": "^1.16.1",
+    "urijs": "~1.17.0",
     "entities": "~1.1.1"
@@ -37,0 +37,0 @@ },
```
````diff
@@ -31,26 +31,20 @@ read-art [](http://badge.fury.io/js/read-art) [](https://travis-ci.org/Tjatse/node-readability)
 - Proxy Support
-- Generate User-Agent
+- Auto-generate User-Agent
 - Free and extensible
 <a name="perfs" />
 ## Performance
-In my case, the indexed data is about **400 thousand per day**, **10 million per month**, and the maximize indexing speed is **35/second**, the memory cost are limited **under 100 megabytes**.
+In my case, the [spider](https://github.com/Tjatse/spider2) handles about **700 thousand documents per day** (**22 million per month**); the maximum crawling speed is **450 per minute** (**80 per minute on average**), memory cost is about **200 megabytes** per spider kernel, and accuracy is about 90%; the remaining 10% can be fixed by customizing [Score Rules](#score_rule) or [Selectors](#selectors). It's better than any other readability module.
 **Pictures don't lie:**
-
-
-
-> Server infos:
-> * 20M bandwidth of fibre-optical
-> * 8 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz cpus
-> * 32G memory
+
 **Notes**
 - All the spiders are managed by [PM2](https://github.com/Unitech/PM2) (I am currently working on that with friends, very welcome to use this amazing tool).
 - Loose coupling between Spiders, Indexers and Data; they're queued by NSQ.
 <a name="ins" />
 ## Installation
 ```javascript
-npm install read-art
+npm install read-art --production
 ```
@@ -124,3 +118,3 @@
 ## Score Rule
-In some situations, we need to customize score rules to grab the correct content of article, such as BBS and QA forums.
+In some situations, we need to customize score rules to crawl the correct content of an article, such as on BBS and Q&A forums.
 There are two effective ways to do this:
@@ -160,2 +154,4 @@ - **minTextLength**
 Sometimes we want to extract parts of the article in a custom way, e.g. pick the text of `.article>h3` as the title, and `.article>.author` as the author data:
 ### Example
 ```javascript
@@ -185,2 +181,6 @@ read({
 Properties:
 - **selector** the query selector, e.g. `#article>.title`, `.articles:nth-child(3)`
 - **extract** the data that you want to extract; can be a `String`, `Array` or `Object`.
 **Notes** The binding data will be an object or an array (one object per item) if the `extract` option is an array; `title` and `content` override the default extracting methods, and the output of `content` depends on the `output` option.
````
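The `selector`/`extract` properties described above compose into an options object along these lines. This is a sketch: the URL is a placeholder, and the `selectors` option key and overall shape follow the README's Selectors section rather than code visible in this diff:

```javascript
// Hypothetical selectors configuration built from the two properties above.
// Everything except the `selector`/`extract` keys is a placeholder.
var options = {
  uri: 'http://example.com/article.html',
  selectors: {
    title:  { selector: '.article>h3' },                       // text of .article>h3 becomes the title
    author: { selector: '.article>.author', extract: 'text' }  // extract the author as a plain string
  }
};
// read(options, function (err, art) { ... });  // wiring shown for context only
console.log(options.selectors.author.selector); // .article>.author
```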
````diff
@@ -396,23 +396,2 @@
-## Other Library
-### [luin/node-readability](https://github.com/luin/node-readability)
-luin/node-readability is an old Readability port transformed from **Arc90**; easy to use, but the problem is: TOO SLOW. It is based on `jsdom`, so the HTML must be written in strict mode, meaning you cannot make any mistake, e.g.:
-```html
-<P>Paragraphs</p>
-<p>My book name is <read-art></p>
-<div><p>Hey, dude!</div>
-```
-All of the above will cause hierarchy errors; more seriously, `jsdom` is a memory killer.
-### [bndr/node-read](https://github.com/bndr/node-read)
-I've contributed to this for a while, but it's hard to communicate with Vadim (we are in different timezones), and we have very different ideas. So I decided to write my own.
-## TODO
-- [ ] get published time
-- [ ] get author
-- [ ] get source
-- [ ] pagination
 ## License
@@ -419,0 +398,0 @@ Licensed under the Apache License, Version 2.0 (the "License");
````
+ Added urijs@~1.17.0
+ Added urijs@1.17.1 (transitive)
- Removed URIjs@^1.16.1
- Removed URIjs@1.16.1 (transitive)