read-art - npm Package Compare versions

Comparing version 0.3.0 to 0.3.1


HISTORY.md

@@ -0,1 +1,5 @@

# 2014/11/28
- RegExp of videos
- Update documentation
# 2014/11/05

@@ -2,0 +6,0 @@ - Decode HTML entities manually


lib/reader.js

@@ -30,3 +30,3 @@ // Copyright 2014 Tjatse

re_stopwords = /[\.。::!;;](\s|$)/,
re_videos = /http:\/\/(?:www\.)?(?:youtube|vimeo|youku|tudou|56|letv|iqiyi)\.com/i,
re_videos = /(youtube|vimeo|youku|tudou|56|letv|iqiyi|sohu|sina|163)\.(com|com\.cn|cn|net)/i,
re_imgUrl = /\.(gif|jpe?g|png)$/i,

@@ -33,0 +33,0 @@ re_commas = /[,,.。;;??、]/g;
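For reference, a minimal standalone Node.js sketch (not part of the package, and the URL is made up) contrasting the old and new `re_videos` patterns; the new one also recognizes the additional hosts and TLDs:

```javascript
// Old pattern: only matched http:// URLs on a fixed set of .com hosts.
var oldVideos = /http:\/\/(?:www\.)?(?:youtube|vimeo|youku|tudou|56|letv|iqiyi)\.com/i;
// New pattern: also matches sohu, sina and 163, plus .com.cn, .cn and .net.
var newVideos = /(youtube|vimeo|youku|tudou|56|letv|iqiyi|sohu|sina|163)\.(com|com\.cn|cn|net)/i;

// A made-up 163.com video URL: rejected by the old pattern, accepted by the new one.
console.log(oldVideos.test('http://v.163.com/movie/demo.html')); // false
console.log(newVideos.test('http://v.163.com/movie/demo.html')); // true
```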

package.json

{
"name": "read-art",
"version": "0.3.0",
"version": "0.3.1",
"description": "Scrape article from any page, automatically, make web page readability.",

@@ -5,0 +5,0 @@ "main": "index.js",

README.md

@@ -6,18 +6,37 @@ read-art [![NPM version](https://badge.fury.io/js/read-art.svg)](http://badge.fury.io/js/read-art) [![Build Status](https://travis-ci.org/Tjatse/node-readability.svg?branch=master)](https://travis-ci.org/Tjatse/node-readability)

1. Readability reference to Arc90's.
2. Scrape article from any page, automatically.
3. Make any web page readability, no matter Chinese or English,very useful for ElasticSearch data spider.
2. Scrape articles from any page, automatically.
3. Make any web page readable, whether it is in Chinese or English.
> *Quickly scrape web article titles and content; well suited for node.js crawlers feeding ElasticSearch.*
**NOTE: the property `dataType` has been renamed to `output`; sorry for the breaking change.**
## Features
- Fast speed based on Cheerio
- Faster Than Any Readability Module
- High Performance - Less memory
- Automatically Read Title & Content
- Follow Redirects
- Automatic Content Encoding Detection and Decoding (avoids garbled characters, especially in Chinese)
- Gzip/Deflate Encoding(Automatic Decompress)
- Proxy
- Gzip/Deflate Support
- Proxy Support
- Generate User-Agent
## Performance
In my case, about **400 thousand** documents are indexed per day (**10 million per month**), the maximum indexing speed is **35 documents/second**, and memory usage stays **under 100 megabytes**.
**Pictures don't lie:**
![image](screenshots/es.jpg)
![image](screenshots/performance.jpg)
![image](screenshots/mem.jpg)
![image](screenshots/search.jpg)
You may want to know:
- All the spiders are managed by [PM2](https://github.com/Unitech/PM2) (which I am currently working on with friends; you are very welcome to try this amazing tool).
- Spiders, indexers and data are loosely coupled and queued through NSQ.
## Full Example With High Availability
[spider2](https://github.com/Tjatse/spider2)
## Installation

@@ -33,8 +52,8 @@ ```javascript

read-art is designed to be the simplest way possible to make web-article scrape, it supports the definitions such as:
It supports the following arguments (a short usage sketch follows the list of options):
* **html/uri** An HTML or URI string.
* **options** An optional options object, including:
- **output** The data type of article content, including: html, text. see more from [Output](#output)
- **killBreaks** A value indicating whether kill breaks, blanks, tab symbols(\r\t\n) into one `<br />` or not, `true` as default.
- **output** The data type of the article content: `html`, `text` or `json`. See [Output](#output) for details.
- **killBreaks** Whether to collapse breaks, blanks and tab symbols (\r\t\n) into a single `<br />`, `true` by default.
- **options from [cheerio](https://github.com/cheeriojs/cheerio)**

@@ -46,24 +65,29 @@ - **options from [req-fast](https://github.com/Tjatse/req-fast)**
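A minimal usage sketch (the URL is just a placeholder) showing the `output` and `killBreaks` options described above passed together:

```javascript
var read = require('read-art');

read('http://example.com/article', {
  output: 'text',    // plain text instead of the default html
  killBreaks: true   // collapse breaks, blanks and tab symbols into one <br />
}, function(err, art){
  if(err){
    throw err;
  }
  console.log(art.title);
  console.log(art.content);
});
```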

Just try it
### Simple Examples
```javascript
var read = require('read-art');
// read from google could be
read('http://google.com', { charset: 'utf8' }, function(err, art, options){
if(err){
throw err;
}
var title = art.title, // title of article
content = art.content, // content of article
html = art.html; // whole original innerHTML
// read from google:
read('http://google.com', function(err, art, options){
if(err){
throw err;
}
var title = art.title, // title of article
content = art.content, // content of article
html = art.html; // whole original innerHTML
});
// or
read({ uri: 'http://google.com', charset: 'utf8' }, function(err, art, options){
// or:
read({
uri: 'http://google.com',
charset: 'utf8'
}, function(err, art, options){
});
// what about html?
read('<title>node-art</title><body><div><p>hello, read-art!</p></div></body>', { charset: 'utf8' }, function(err, art, options){
read('<title>node-art</title><body><div><p>hello, read-art!</p></div></body>', function(err, art, options){
});
// of course could be
read({ uri: '<title>node-art</title><body><div><p>hello, read-art!</p></div></body>', charset: 'utf8' }, function(err, art, options){
read({
uri: '<title>node-art</title><body><div><p>hello, read-art!</p></div></body>'
}, function(err, art, options){

@@ -75,5 +99,14 @@ });

## Output
You can set different types to wrap the outputs
You can wrap the article content in different types; the `output` option can be one of the following (a short sketch follows this list):
- **String**
One of `text`, `html` and `json`, `html` by default.
- **Object**
Key-value pairs including:
- **type**
One of `text`, `html` and `json`.
- **stripSpaces**
Whether to strip the tab symbols (\r\n\t), `false` by default.
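For instance, a minimal sketch (placeholder URL) passing `output` as an object with the keys described above:

```javascript
var read = require('read-art');

read('http://example.com/article', {
  output: {
    type: 'text',      // one of text, html and json
    stripSpaces: true  // also strip \r\n\t from the result
  }
}, function(err, art){
  if(err){
    throw err;
  }
  console.log(art.content);
});
```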
### text
Returns the inner text of article content(strip html tags), e.g.:
Returns the inner text, e.g.:
```javascript

@@ -97,3 +130,3 @@ read('http://example.com', {

### html
Returns the inner HTML of article content, e.g.:
Returns the inner HTML, e.g.:
```javascript

@@ -116,4 +149,6 @@ read('http://example.com', {

**Note** Videos can now be scraped; the currently supported domains are: *youtube|vimeo|youku|tudou|56|letv|iqiyi|sohu|sina|163*.
### json
Returns the restful result of article content, e.g.:
Returns a structured (JSON-style) result, e.g.:
```javascript

@@ -132,5 +167,6 @@ read('http://example.com', {

}, function(err, art){
// art.content will be formatted as JSON
// art.content will be formatted as an Array
});
```
`art.content` will be an Array such as:

@@ -143,12 +179,10 @@ ```json

```
There only two types are supported now: *img* and *text*
As you see, the output could be defined in two ways:
1. Simple String, should be one of *text*, *html* and *json*.
2. Complex Object, including:
- type: one of *text*, *html* and *json*, default as 'html'.
- stripSpaces: a value indicating whether strip tab symbols(\r\t\n), default as false.
For now only two types are supported - *img* and *text* - and the `src` of an `img` element is made absolute even if the original is relative.
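Purely as a hypothetical illustration (only the `type` values *img*/*text* and the absolute `src` of `img` entries come from the notes above; the other field names are assumed, not documented here), `art.content` might look roughly like:

```javascript
// Hypothetical shape of art.content when the output type is json
// (field names other than "type" and "src" are assumed).
var example = [
  { type: 'text', value: 'hello, read-art!' },
  { type: 'img', src: 'http://example.com/images/pic.jpg' } // src resolved to an absolute URL
];
```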
## Features
### Refrain from the crazy messy codes
**Note** The video sources of these sites differ widely and are hard to handle in a single common way; I haven't found a good solution yet, so PRs are welcome.
## You Should Know
### Pass the charset manually to avoid garbled text
```javascript

@@ -191,3 +225,3 @@ read('http://game.163.com/14/0506/10/9RI8M9AO00314SDA.html', {

### [luin/node-readability](https://github.com/luin/node-readability)
luin/node-readability is the first module which implements Readability in node.js, lots of hit points, easy to use, but the problem is - Too slow. It was based on `JSDOM`, the HTML must be written in strict mode, you can not make any mistake, e.g.:
luin/node-readability is an older Readability port of **Arc90**; it is easy to use, but the problem is that it is TOO SLOW. It is based on `jsdom`, so the HTML must be written in strict mode, which means you cannot make any mistakes, e.g.:

@@ -199,9 +233,9 @@ ```html

```
All above will cause hiberarchy errors, and otherwise, `JSDOM` is a memory killer.
All of the above will cause hierarchy errors; more seriously, `jsdom` is a memory killer.
### [bndr/node-read](https://github.com/bndr/node-read)
bndr/node-read is good, and I've contributed on this for a while, but it's hard to communicate with Vadim(we are in a different timezone), and we have very different ideas. So I decided to write it on my own.
I've contributed to it for a while, but it's hard to communicate with Vadim (we are in different timezones), and we have very different ideas, so I decided to write my own.
## TODO
- [x] get video, img tags
- [ ] get published time

@@ -211,3 +245,2 @@ - [ ] get author

- [ ] pagination
- [x] more tests

@@ -214,0 +247,0 @@ ## License

