### Example
```javascript
read('http://club.autohome.com.cn/bbs/thread-c-66-37239726-1.html', {
  minTextLength: 0,
  scoreRule: function(node){
    if (node.hasClass('w740')) {
      return 100;
    }
  }
}, function(err, art){
});
```
<a name="selectors" />
Sometimes you want to control how parts of the article are extracted, e.g. use the text of `.article>h3` as the title and `.article>.author` as the author data:
```javascript
read({
  html: '<title>read-art</title><body><div class="article"><h3 title="--read-art--">Who Am I</h3><p class="section1">hi, dude, i am <b>readability</b></p><p class="section2">aka read-art...</p><small class="author" data-author="Tjatse X">Tjatse</small></div></body>',
  selectors: {
    title: {
      selector: '.article>h3',
      extract: ['text', 'title']
    },
    content: '.article p.section1',
    author: {
      selector: '.article>small.author',
      extract: {
        shot_name: 'text',
        full_name: 'data-author'
      }
    }
  },
}, function (err, art) {
  // art.title === {text: 'Who Am I', title: '--read-art--'}
  // art.content === 'hi, dude, i am <b>readability</b>'
  // art.author === {shot_name: 'Tjatse', full_name: 'Tjatse X'}
});
```
**Notes** The bound value is an object when the `extract` option is an array or an object, but `content` depends on the `output` option.
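
To make the two `extract` forms concrete, here is a minimal sketch of how the bound object takes its shape. It is not read-art's implementation: `pick` and `bindExtract` are hypothetical helpers, and the plain objects stand in for the cheerio nodes the library actually works on.

```javascript
// Hypothetical stand-ins for matched elements: node text plus attributes.
// read-art works on cheerio nodes; these plain objects only illustrate shapes.
const h3 = { text: 'Who Am I', attrs: { title: '--read-art--' } };
const small = { text: 'Tjatse', attrs: { 'data-author': 'Tjatse X' } };

// Reads one property: 'text' means the node text, anything else an attribute.
function pick(node, prop) {
  return prop === 'text' ? node.text : node.attrs[prop];
}

// Illustrative only (not read-art's code): builds the bound object.
function bindExtract(node, extract) {
  const bound = {};
  if (Array.isArray(extract)) {
    // Array form: each entry is both the source property and the output key.
    extract.forEach(function (prop) {
      bound[prop] = pick(node, prop);
    });
  } else {
    // Object form: keys are the output names, values the source properties.
    Object.keys(extract).forEach(function (key) {
      bound[key] = pick(node, extract[key]);
    });
  }
  return bound;
}

const boundTitle = bindExtract(h3, ['text', 'title']);
const boundAuthor = bindExtract(small, { shot_name: 'text', full_name: 'data-author' });
console.log(boundTitle);
console.log(boundAuthor);
```

The array form reuses the property names as keys, while the object form lets you rename them, which is why `art.author` in the example carries `shot_name` and `full_name`.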
## Customize Settings
We use several regexps while iterating over elements (cheerio objects) to remove undesirable nodes. You can customize them:
```javascript
read.use(function(){
  //[usage]
});
```
`[usage]` can be any of the following:
- `this.reset()`
  Resets the settings to their defaults.
- `this.skipTags([tags], [override])`
  Removes unwanted elements by tag name, e.g. `this.skipTags('b,span')`. If `[override]` is `true`, `skipTags` becomes `"b,span"`; otherwise the tags are appended to the defaults, i.e.:
  `aside,footer,label,nav,noscript,script,link,meta,style,select,textarea,iframe,b,span`
- `this.regexps.positive([re], [override])`
  If the `positive` regexp matches the `id` + `className` of a node, the node is taken as a candidate. `[re]` is a regexp, e.g. `/dv101|dv102/` matches elements like `<div class="dv101">...` or `<div id="dv102">...`. If `[override]` is `true`, `positive` becomes `/dv101|dv102/i`; otherwise it is appended to the default, i.e.:
  `/article|blog|body|content|entry|main|news|pag(?:e|ination)|post|story|text|dv101|dv102/i`
- `this.regexps.negative([re], [override])`
  If the `negative` regexp matches the `id` + `className` of a node, the node is not taken as a candidate. `[re]` is a regexp, e.g. `/dv101|dv102/` matches elements like `<div class="dv101">...` or `<div id="dv102">...`. If `[override]` is `true`, `negative` becomes `/dv101|dv102/i`; otherwise it is appended to the default, i.e.:
  `/com(?:bx|ment|-)|contact|comment|captcha|foot(?:er|note)?|link|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|util|shopping|tags|tool|widget|tip|dialog|copyright|bottom|dv101|dv102/i`
- `this.regexps.unlikely([re], [override])`
  If the `unlikely` regexp matches the `id` + `className` of a node, the node will probably not be taken as a candidate. `[re]` is a regexp, e.g. `/dv101|dv102/` matches elements like `<div class="dv101">...` or `<div id="dv102">...`. If `[override]` is `true`, `unlikely` becomes `/dv101|dv102/i`; otherwise it is appended to the default, i.e.:
  `/agegate|auth?or|bookmark|cat|com(?:bx|ment|munity)|date|disqus|extra|foot|header|ignore|link|menu|nav|pag(?:er|ination)|popup|related|remark|rss|share|shoutbox|sidebar|similar|social|sponsor|teaserlist|time|tweet|twitter|\bad[\s_-]?\b|dv101|dv102/i`
- `this.regexps.maybe([re], [override])`
  If the `maybe` regexp matches the `id` + `className` of a node, the node will probably be taken as a candidate. `[re]` is a regexp, e.g. `/dv101|dv102/` matches elements like `<div class="dv101">...` or `<div id="dv102">...`. If `[override]` is `true`, `maybe` becomes `/dv101|dv102/i`; otherwise it is appended to the default, i.e.:
  `/and|article|body|column|main|column|dv101|dv102/i`
- `this.regexps.div2p([re], [override])`
  The `div2p` regexp lists the tags treated as block-level children; every div that has no such child elements is turned into a `<p>`. `[re]` is a regexp, e.g. `/<(span|label)/` matches elements like `<span>...` or `<label>...`. If `[override]` is `true`, `div2p` becomes `/<(span|label)/i`; otherwise it is appended to the default, i.e.:
  `/<(a|blockquote|dl|div|img|ol|p|pre|table|ul|span|label)/i`
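
The append-versus-override behaviour described in the list above can be sketched in plain JavaScript. This is only an illustration under stated assumptions, not read-art's code: `mergeTags` and `mergeRegexp` are hypothetical helpers, and the default values are inferred from the appended results shown in the list. Note that plain alternation does not reproduce the `div2p` form, where the new tags end up inside the existing group.

```javascript
// Hypothetical helpers (not read-art's API) mirroring the described behaviour.
function mergeTags(origin, tags, override) {
  // Override replaces the list; otherwise the new tags are appended.
  return override ? tags : origin + ',' + tags;
}

function mergeRegexp(origin, re, override) {
  // Override keeps only the new pattern; otherwise both are alternated.
  // Either way the result is case-insensitive, as in the list above.
  const source = override ? re.source : origin.source + '|' + re.source;
  return new RegExp(source, 'i');
}

// Defaults inferred from the appended results shown in the list (assumption).
const defaultTags = 'aside,footer,label,nav,noscript,script,link,meta,style,select,textarea,iframe';
const defaultMaybe = /and|article|body|column|main|column/i;

const appendedTags = mergeTags(defaultTags, 'b,span', false);
const overriddenTags = mergeTags(defaultTags, 'b,span', true);
const appendedMaybe = mergeRegexp(defaultMaybe, /dv101|dv102/, false);
const overriddenMaybe = mergeRegexp(defaultMaybe, /dv101|dv102/, true);

console.log(appendedTags);
console.log(overriddenTags);
console.log(String(appendedMaybe));
console.log(String(overriddenMaybe));
```

With `override` left unset, the defaults keep matching and your pattern is added on top; with `override` set to `true`, only your pattern remains, so a node whose class is `article` would no longer match `maybe`.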
<a name="cus_sets_eg" />
### Example
```javascript
read.use(function(){
  this.reset();
  this.skipTags('b,span');
  this.regexps.div2p(/<(span|b)/, true);
});
```
I contributed to this for a while, but it was hard to communicate with Vadim (we are in different time zones), and we have very different ideas, so I decided to write my own.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.