## Score Rule
In some situations, such as BBS and Q&A forums, we need to customize the score rules to grab the correct content of an article.
There are two effective ways to do this:
- **minTextLength**
It's useful for getting rid of useless elements (`P` / `DIV`), e.g. `minTextLength: 100` will discard all blocks whose `node.text().length` is less than `100`.
- **scoreRule**
  You can customize the score rules manually, e.g.:

      scoreRule: function(node){
        if (node.hasClass('w740')) {
          return 100;
        }
      }

  Elements that have the `w740` class name get `100` bonus points. That makes the node the top candidate, which is enough to make the text of `DIV/P.w740` the content of the current article.
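
To make the interplay of the two options concrete, here is a minimal sketch of how a length filter and a bonus score could combine when a top candidate is chosen. It is not the library's actual scoring code: `makeNode`, `pickCandidate` and the base score (text length divided by 100) are assumptions made up for illustration. Only `minTextLength`, `scoreRule`, `node.hasClass()` and `node.text()` come from the description above.

```javascript
// Hypothetical stand-in for a parsed element; it only mimics the two
// methods used in this section: hasClass() and text().
function makeNode(classNames, text) {
  return {
    hasClass: function (name) { return classNames.indexOf(name) !== -1; },
    text: function () { return text; }
  };
}

// Illustrative candidate selection (assumed, not the library's algorithm).
function pickCandidate(nodes, options) {
  var best = null;
  var bestScore = -Infinity;
  nodes.forEach(function (node) {
    var length = node.text().length;
    // minTextLength: blocks shorter than the threshold are discarded.
    if (length < options.minTextLength) { return; }
    // Assumed base score: longer text scores higher.
    var score = length / 100;
    // scoreRule: whatever the rule returns is added as bonus points.
    var bonus = options.scoreRule ? options.scoreRule(node) : 0;
    score += bonus || 0;
    if (score > bestScore) { best = node; bestScore = score; }
  });
  return best;
}

function scoreRule(node) {
  if (node.hasClass('w740')) {
    return 100;
  }
}

var sidebar = makeNode(['sidebar'], new Array(501).join('x')); // 500 characters
var post = makeNode(['w740'], 'Short forum post');

var withoutFilter = pickCandidate([sidebar, post], { minTextLength: 0, scoreRule: scoreRule });
var withFilter = pickCandidate([sidebar, post], { minTextLength: 100, scoreRule: scoreRule });
```

In this sketch, `minTextLength: 0` lets the short `w740` post win thanks to its bonus, while `minTextLength: 100` would discard it before the score rule is ever consulted.
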
### Example

    read('http://club.autohome.com.cn/bbs/thread-c-66-37239726-1.html', {
      minTextLength: 0,
      scoreRule: function(node){
        if (node.hasClass('w740')) {
          return 100;
        }
      }
    }, function(err, art){
    });
## Output
You can wrap the content of the article in different formats. The `output` option can be:
- **String**
  One of `text`, `html` and `json`; `html` by default.
- **Object**
  Key-value pairs including:
  - **type**
    One of `text`, `html` and `json`.
  - **stripSpaces**
    A value indicating whether to strip the whitespace control characters (`\r\n\t`) or not, `false` by default.
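
The two accepted forms of `output` can be reduced to one shape. The following is a rough sketch of such a normalization, not code taken from the library: `normalizeOutput` and `OUTPUT_TYPES` are hypothetical names, and both the `html` default for the Object form and the error on unknown types are assumptions.

```javascript
// Hypothetical helper: turns either accepted form of `output` into
// a { type, stripSpaces } object using the defaults described above.
var OUTPUT_TYPES = ['text', 'html', 'json'];

function normalizeOutput(output) {
  var normalized = { type: 'html', stripSpaces: false };
  if (typeof output === 'string') {
    normalized.type = output;
  } else if (output && typeof output === 'object') {
    // Assumption: a missing `type` in the Object form also falls back to 'html'.
    if (output.type !== undefined) { normalized.type = output.type; }
    normalized.stripSpaces = Boolean(output.stripSpaces);
  }
  // Assumption: unknown types are rejected instead of silently ignored.
  if (OUTPUT_TYPES.indexOf(normalized.type) === -1) {
    throw new Error('Unsupported output type: ' + normalized.type);
  }
  return normalized;
}

var fromDefault = normalizeOutput();
var fromString = normalizeOutput('text');
var fromObject = normalizeOutput({ type: 'json', stripSpaces: true });
```
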
### text
Returns the inner text, e.g.:

    read('http://example.com', {
      output: 'text'
    }, function(err, art){
    });

    read('http://example.com', {
      output: {
        type: 'text',
        stripSpaces: true
      }
    }, function(err, art){
    });
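
As an illustration of what `stripSpaces: true` might do to the text, the sketch below removes every `\r`, `\n` and `\t` character. The function is a hypothetical stand-in, and removing the characters outright (rather than replacing them with a space) is an assumption about the behaviour.

```javascript
// Assumed behaviour of stripSpaces: drop every \r, \n and \t character.
function stripSpaces(text) {
  return text.replace(/[\r\n\t]/g, '');
}

var raw = 'First line\r\n\tSecond line\n';
var stripped = stripSpaces(raw); // 'First lineSecond line'
```
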
### html
Returns the inner HTML, e.g.:

    read('http://example.com', {
      output: 'html'
    }, function(err, art){
    });

    read('http://example.com', {
      output: {
        type: 'html',
        stripSpaces: true
      }
    }, function(err, art){
    });
**Notes** Videos can now be scraped. The currently supported domains are: youtube|vimeo|youku|tudou|56|letv|iqiyi|sohu|sina|163.
### json
Returns the RESTful result, e.g.:

    read('http://example.com', {
      output: 'json'
    }, function(err, art){
    });

    read('http://example.com', {
      output: {
        type: 'json',
        stripSpaces: true
      }
    }, function(err, art){
    });
The `art.content` will be an Array such as:

    [
      { "type": "img", "value": "http://example.com/jpg/site1/20140519/00188b1996f214e3a25417.jpg" },
      { "type": "text", "value": "TEXT goes here..." }
    ]

Until now there are only two types, `img` and `text`. The `src` of an `img` element is absolute even if the original is a relative one.

**Notes** The video sources of the sites are quite different and it's hard to fit them all in a common way. I haven't found a good way to solve that, so PRs are welcome.
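
A consumer of the `json` output will usually walk the array and branch on `type`. The sketch below shows one possible way to do that on sample data in the shape shown above. `summarize` is a hypothetical helper, and the `URL` line only illustrates the idea of resolving a relative `src`; it is not necessarily how the library does it.

```javascript
// Sample data in the shape shown above (the values are placeholders).
var content = [
  { type: 'img', value: 'http://example.com/jpg/site1/20140519/00188b1996f214e3a25417.jpg' },
  { type: 'text', value: 'TEXT goes here...' }
];

// Hypothetical helper: split the array into text blocks and image URLs.
function summarize(items) {
  var result = { text: [], images: [] };
  items.forEach(function (item) {
    if (item.type === 'text') {
      result.text.push(item.value);
    } else if (item.type === 'img') {
      result.images.push(item.value);
    }
    // Unknown types are skipped so the helper keeps working if more are added.
  });
  return result;
}

var summary = summarize(content);

// Resolving a relative src against the page URL with the standard URL class.
var absolute = new URL('/jpg/a.jpg', 'http://example.com/news/1.html').href;
```
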
## You Should Know
Pass the charset manually to avoid garbled characters:

    read('http://game.163.com/14/0506/10/9RI8M9AO00314SDA.html', {
      charset: 'gbk'
    }, function(err, art){
    });
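
Why the charset matters can be shown without any network access. Node's `Buffer` does not decode `gbk` by itself, so this sketch uses `utf8` and `latin1` as stand-ins to reproduce the same effect: identical bytes decoded with the wrong charset turn into garbage. The sample string is arbitrary.

```javascript
// 'café', written with an escape so the snippet stays ASCII-only.
var original = 'caf\u00e9';

// The bytes as a server would send them for a UTF-8 page.
var bytes = Buffer.from(original, 'utf8');

// Decoding with the wrong charset garbles the non-ASCII character...
var wrong = bytes.toString('latin1');

// ...while decoding with the right one restores the text.
var right = bytes.toString('utf8');
```
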
Generate a user agent to simulate a browser:

    read('http://example.com', {
      agent: true
    }, function(err, art){
    });
Use a proxy to avoid being blocked:

    read('http://example.com', {
      proxy: {
        host: 'http://myproxy.com/',
        port: 8081,
        proxyAuth: 'user:password'
      }
    }, function(err, art){
    });
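
Proxy settings are often kept as a single URL, for example in an environment variable. As a convenience sketch, the hypothetical helper below (`proxyFromUrl` is not part of the library) splits such a URL into the three fields used above. It assumes the URL carries an explicit port.

```javascript
// Hypothetical helper: derive host, port and proxyAuth from one proxy URL.
function proxyFromUrl(proxyUrl) {
  var parsed = new URL(proxyUrl);
  var proxy = {
    // Same "scheme://hostname/" form as in the example above.
    host: parsed.protocol + '//' + parsed.hostname + '/',
    // Assumption: the port is given explicitly in the URL.
    port: Number(parsed.port)
  };
  if (parsed.username) {
    proxy.proxyAuth = decodeURIComponent(parsed.username) + ':' +
      decodeURIComponent(parsed.password);
  }
  return proxy;
}

var proxy = proxyFromUrl('http://user:password@myproxy.com:8081');
```

The resulting object can then be passed as the `proxy` option.
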
## Test

    npm test
## Other Library
I contributed to this for a while, but it was hard to communicate with Vadim (we are in different time zones), and we have very different ideas, so I decided to write my own.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.