read-art - npm Package Compare versions

Comparing version 0.4.1 to 0.4.3


examples/simple.js
var read = require('../');
-read('http://www.cqn.com.cn/auto/news/73572.html', {
+read('http://news.sohu.com/20151228/n432833902.shtml', {
timeout : 15000,
output : {
type : 'json',
stripSpaces: true,
break: true
},
minTextLength: 0,
scoreRule: function(node){
-if (node.hasClass('w740')) {
+if (node.attr('itemprop') == 'articleBody') {
return 100;

@@ -14,0 +9,0 @@ }
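For context, here is how the updated example is typically run end to end. This is a minimal sketch, not part of the diff: the `require('read-art')` line and the logging callback are assumptions based on the package's documented `read(uri, options, callback)` usage, while the URL and options are taken from the example above.

```javascript
// Minimal sketch of running the updated example outside the repo.
// Assumes read-art's documented read(uri, options, callback) form;
// the callback body below is illustrative, not part of examples/simple.js.
var read = require('read-art');

read('http://news.sohu.com/20151228/n432833902.shtml', {
  timeout: 15000,
  output: {
    type: 'json',        // return the article body as JSON blocks
    stripSpaces: true,
    break: true
  },
  minTextLength: 0,
  scoreRule: function (node) {
    // Boost the node that carries the article body on this site.
    if (node.attr('itemprop') == 'articleBody') {
      return 100;
    }
  }
}, function (err, art) {
  if (err) {
    return console.error(err);
  }
  console.log('title:', art.title);
  console.log('content:', art.content);
});
```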

@@ -0,1 +1,4 @@

+# 2015/12/29
+- fix scoreRule on grandparent node
# 2015/12/18

@@ -2,0 +5,0 @@ - only fetch body when uri is provided but html is empty

"use strict";
-var URI = require('URIjs'),
+var URI = require('urijs'),
util = require('util'),

@@ -539,3 +539,3 @@ entities = require('entities');

if (grandParent && grandParent.length > 0) {
-scoreNode(grandParent, score / 2, cans);
+scoreNode(grandParent, score / 2, cans, options.scoreRule);
}

@@ -542,0 +542,0 @@ }
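This is the change behind the 2015/12/29 changelog entry ("fix scoreRule on grandparent node"): the user-supplied `scoreRule` is now forwarded when a candidate's grandparent is scored, so custom rules also influence that node. A simplified, hypothetical sketch of what the extra argument means (the real `scoreNode` in read-art's internals differs in detail):

```javascript
// Hypothetical, simplified illustration only; not read-art's actual implementation.
// It shows why forwarding scoreRule matters: a custom rule can now affect the
// grandparent candidate as well, instead of only the node it was called on.
function scoreNode(node, baseScore, cans, scoreRule) {
  var score = baseScore;
  if (typeof scoreRule === 'function') {
    var custom = scoreRule(node);  // e.g. 100 when itemprop="articleBody"
    if (typeof custom === 'number') {
      score += custom;             // how the value is combined is illustrative
    }
  }
  // ...record `score` on `node` and keep it among the candidates (`cans`)...
}
```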

{
"name": "read-art",
"version": "0.4.1",
"version": "0.4.3",
"description": "Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English.",

@@ -34,3 +34,3 @@ "main": "index.js",

"req-fast": "^0.2.9",
"URIjs": "^1.16.1",
"urijs": "~1.17.0",
"entities": "~1.1.1"

@@ -37,0 +37,0 @@ },
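Two things change in the dependencies: the URI library is now pulled in under its lowercase npm name `urijs` (matching the `require('urijs')` change above), and the version range moves from a caret to a tilde. For reference, the standard npm range semantics (not specific to read-art):

```javascript
// Standard npm semver range semantics, shown here only for reference:
// "^1.16.1" matches >=1.16.1 <2.0.0   (any later 1.x release)
// "~1.17.0" matches >=1.17.0 <1.18.0  (patch releases of 1.17 only)
var URI = require('urijs'); // the package is published under the lowercase name
```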

@@ -31,26 +31,20 @@ read-art [![NPM version](https://badge.fury.io/js/read-art.svg)](http://badge.fury.io/js/read-art) [![Build Status](https://travis-ci.org/Tjatse/node-readability.svg?branch=master)](https://travis-ci.org/Tjatse/node-readability)

- Proxy Support
-- Generate User-Agent
+- Auto-generate User-Agent
- Free and extensible
<a name="perfs" />
## Performance
-In my case, the indexed data is about **400 thousand per day**, **10 million per month**, and the maximum indexing speed is **35/second**; the memory cost is limited to **under 100 megabytes**.
+In my case, the speed of the [spider](https://github.com/Tjatse/spider2) is about **700 thousand documents per day**, **22 million per month**, and the maximum crawling speed is **450 per minute** (**avg. 80 per minute**); the memory cost is about **200 megabytes** per spider kernel, and the accuracy is about 90%. The remaining 10% can be fixed by customizing [Score Rules](#score_rule) or [Selectors](#selectors). It's better than any other readability module.
**Pictures don't lie:**
![image](screenshots/es.jpg)
![image](screenshots/performance.jpg)
![image](screenshots/mem.jpg)
> Server info:
> * 20M bandwidth of fibre-optical
> * 8 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz cpus
> * 32G memory
![image](screenshots/search.jpg)
**Notes**
- All the spiders are managed by [PM2](https://github.com/Unitech/PM2) (I am currently working on that with friends; you're very welcome to use this amazing tool).
- Spiders, Indexers and Data are loosely coupled; they're queued by NSQ.
<a name="ins" />
## Installation
```javascript
-npm install read-art
+npm install read-art --production
```

@@ -124,3 +118,3 @@

## Score Rule
-In some situations, we need to customize score rules to grab the correct content of an article, such as on BBS and QA forums.
+In some situations, we need to customize score rules to crawl the correct content of an article, such as on BBS and QA forums.
There are two effective ways to do this:
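(The two ways themselves are elided in this diff view.) As a minimal sketch of the in-options form already used in `examples/simple.js` above, a rule tuned for a hypothetical forum layout might look like this; the class name and URL are purely illustrative:

```javascript
// Illustrative only: a scoreRule for a hypothetical forum page whose post
// bodies carry the class "post-content". The option name and callback shape
// follow the scoreRule usage shown in examples/simple.js above.
var read = require('read-art');

read('http://bbs.example.com/thread/12345', {
  scoreRule: function (node) {
    if (node.hasClass('post-content')) {
      return 100; // strongly favor forum post bodies over sidebars and ads
    }
  }
}, function (err, art) {
  if (err) throw err;
  console.log(art.title, art.content);
});
```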

@@ -160,2 +154,4 @@ - **minTextLength**

Sometimes we want to extract specific pieces of an article, e.g. pick the text of `.article>h3` as the title, and `.article>.author` as the author data:
### Example
```javascript

@@ -185,2 +181,6 @@ read({

Properties:
- **selector** the query selector, e.g.: `#article>.title`, `.articles:nth-child(3)`
- **extract** the data that you want to extract; can be a `String`, `Array` or `Object`.
**Notes** The bound data will be an object or an array (one object per item) if the `extract` option is an array. `title` and `content` will override the default extraction methods, and the output of `content` depends on the `output` option.
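A sketch of what such a configuration might look like. The entry shape follows the `selector`/`extract` properties described above, but the surrounding option name (`selectors` here) and the `extract: 'text'` value are assumptions; consult the full README for the exact API:

```javascript
// Illustrative sketch only: the entry shape follows the selector/extract
// properties described above; the option name and 'text' value are assumptions.
var read = require('read-art');

read('http://example.com/some-article', {
  selectors: {
    title:  { selector: '.article>h3',      extract: 'text' }, // override default title extraction
    author: { selector: '.article>.author', extract: 'text' }  // bind extra author data
  }
}, function (err, art) {
  if (err) throw err;
  console.log(art.title);   // text of .article>h3
  console.log(art.author);  // text of .article>.author, per the custom selector
});
```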

@@ -396,23 +396,2 @@

## Other Library
### [luin/node-readability](https://github.com/luin/node-readability)
luin/node-readability is an old Readability port transformed from **Arc90**. It's easy to use, but the problem is that it's TOO SLOW. It is based on `jsdom`, so the HTML must be written in strict mode, which means you cannot make any mistakes, e.g.:
```html
<P>Paragraphs</p>
<p>My book name is <read-art></p>
<div><p>Hey, dude!</div>
```
All of the above will cause `hierarchy errors`; more seriously, `jsdom` is a memory killer.
### [bndr/node-read](https://github.com/bndr/node-read)
I contributed to this for a while, but it's hard to communicate with Vadim (we are in different timezones), and we have very different ideas. So I decided to write my own.
## TODO
- [ ] get published time
- [ ] get author
- [ ] get source
- [ ] pagination
## License

@@ -419,0 +398,0 @@ Licensed under the Apache License, Version 2.0 (the "License");

