
A 2nd generation spider to crawl any article site, automatically reading the title and content.
In my case, the spider crawls about 700 thousand documents per day (22 million per month); the maximum crawling speed is 450 documents per minute (80 per minute on average), memory cost is about 200 megabytes per spider kernel, and accuracy is about 90%. The remaining 10% can be fixed by customizing Score Rules or Selectors. It performs better than any other readability module.
Server info:
- 20M fibre-optic bandwidth
- 8 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz CPUs
- 32GB memory
This is not a single spider running in a single thread. To take advantage of multi-core systems, we may want to launch a cluster of processes to handle the load, and that is exactly what spider2 does: it crawls fast and in parallel for maximum performance.
The multi-core crawling feature simply makes spiders work in fork mode, while concurrency makes them work together in the same thread at the same time.
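To make the distinction concrete, here is a minimal sketch using the workers and concurrency options documented below; the values are illustrative, not recommendations:

var Spider = require('spider2');

// Fork mode: spread the load across 4 worker processes,
// each handling one job at a time.
var spider = Spider({ workers: 4, concurrency: 1 });

// Or lean on concurrency instead: a single worker handling
// 5 jobs in the same thread at the same time.
// var spider = Spider({ workers: 1, concurrency: 5 });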
The old school crawls links/articles manually, e.g. sending a request to the server, getting the response (HTML), then analyzing the links/articles with jQuery or something else, all hard-coded. That feels painful. With spider2, you just make a list of the websites you want to scrape; spider2 handles everything else. Take a cup of coffee and wait to harvest the fruit.
All jobs are managed by an async queue, so you can keep pushing the URLs to be crawled/read.
npm install spider2 --production
var Spider = require('spider2');
var spider = Spider({
  timeout: 5000,
  debug: true,
  domain: true,
  workers: 7,
  concurrency: 1
});
The options include:
- timeout: 10000 by default.
- debug: false by default; it can also be set with process.env.SP_DEBUG.
- domain: true by default.
- concurrency: 1 by default.
The error event is emitted when an error has been caught. The arguments include:
- err: an Error object.
- req: the request data. If req.worker is defined and is a number, the error comes from a worker and req.worker is the id of that worker; otherwise it is a normal error.
Example:
spider.on('error', function (err, req) {
  if (req.worker) {
    console.error('worker #', req.worker, 'has an error:', err.message);
  } else {
    console.error(req.uri, err.message);
  }
});
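If a job matters enough to retry, the error handler can push the request back onto the async queue. A minimal sketch, assuming req.uri survives the failure; the retried map is a hypothetical guard against endless loops:

var retried = {};
spider.on('error', function (err, req) {
  // Re-queue each failed URL at most once, preserving the job type.
  if (req && req.uri && !retried[req.uri]) {
    retried[req.uri] = true;
    if (req._type === Spider.type.ARTICLE) {
      spider.read([req.uri]);
    } else {
      spider.crawl([req.uri]);
    }
  }
});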
Data sent by the spider is obtained through the data event. The arguments include:
- req: the request data.
- res: the response data. If req._type equals Spider.type.LINK, res is an array of key-value pairs like {title: [ANCHOR_TITLE], uri: [ANCHOR_HREF]}; if it equals Spider.type.ARTICLE, res is an object whose keys include title and content.
Example:
spider.on('data', function (req, res) {
  if (req._type == Spider.type.LINK) {
    spider.read(_.filter(res, validLink));
  } else if (req._type == Spider.type.ARTICLE) {
    console.log(req.uri, res.title);
  }
});
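The example above assumes an underscore/lodash _ and a validLink predicate; neither is part of spider2. A minimal sketch of such a predicate, working on the {title, uri} pairs described above (the path pattern is a placeholder to tune per site):

function validLink(link) {
  // Keep only absolute http(s) links that look like article pages.
  return !!link.uri &&
         /^https?:\/\//.test(link.uri) &&
         /\/(article|post|news)\//.test(link.uri);
}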
The end event is emitted after all the spiders have terminated, e.g.:
spider.on('end', function () {
  console.log('[END]');
});
Crawl links. OPTION could be one of the following: [String, String, ...] and [Object, Object, ...] will both be fine; each object must have a uri property. E.g.:
spider.crawl([OPTION]);
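For instance, both of these forms are valid; the URLs are placeholders:

spider.crawl(['http://example.com/news', 'http://example.com/tech']);
spider.crawl([{ uri: 'http://example.com/news' }, { uri: 'http://example.com/tech' }]);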
Read the title and content of articles. OPTION is the same as above, e.g.:
spider.read([OPTION]);
Quit peacefully, e.g.:
spider.destroy();
Ping the spider and return the workers' status as an Array, e.g.:
var pong = spider.ping();
console.log(pong);
pong will be printed like:
[
  {id: 1, count: 12},
  {id: 2, count: 90},
  ...
]
id is the id of the worker, and count is the count of remaining jobs.
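Since ping() returns the status Array directly, it can drive simple progress reporting. A sketch, assuming a one-second polling interval is acceptable for the workload:

// Poll worker status until the queue drains.
var timer = setInterval(function () {
  var pong = spider.ping();
  var remaining = pong.reduce(function (sum, w) {
    return sum + w.count;
  }, 0);
  console.log('jobs remaining:', remaining);
  if (remaining === 0) {
    clearInterval(timer);
  }
}, 1000);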
npm test
Head over to the test/ or /examples directories.
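For a quick overall picture, here is a minimal end-to-end sketch in the spirit of those examples; the seed URL is a placeholder:

var Spider = require('spider2');
var spider = Spider({ workers: 2, concurrency: 1 });

spider.on('data', function (req, res) {
  if (req._type == Spider.type.LINK) {
    // Feed every discovered link straight into the article reader.
    spider.read(res.map(function (link) { return link.uri; }));
  } else if (req._type == Spider.type.ARTICLE) {
    console.log(req.uri, '->', res.title);
  }
});

spider.on('error', function (err, req) {
  console.error(req.uri, err.message);
});

spider.on('end', function () {
  console.log('all jobs done');
});

spider.crawl(['http://example.com/news']);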
Copyright 2014 Tjatse
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.