For configuration, major changes have been made to each crawling API configuration, and the same API supports more crawling configuration methods, each of which has its own significance.
For the result, the result of each request will be wrapped in an object, which provides information about the result of this request, such as: id, result, success, maximum retry, number of retries, collected error information, etc. . Automatically determine whether the return value is wrapped in an array according to the configuration method you choose, and the type is perfectly matched in TS.
For obtaining results through the callback function, the callback is no longer executed after a single request is completed like the v4.x version, but will be executed sequentially after the crawling is completed, which will not block subsequent crawling.
🚀 Features
Added a retry mechanism, which can be set for all crawling requests, for a single crawling request, and for a single request.
A new priority queue is added to use priority crawling according to the priority of a single request.
For more configurations that may be reused, you can set the baseConfig settings passed in when requesting configuration, API crawling configuration, and generating crawler instances, such as: timeout, proxy, intervalTime, etc., and the weight is: requestConfig > APIConfig > baseConfig.
For crawlFile API, file path, name, suffix and other information can be set individually for each file. Added the beforeSave life cycle function before saving the file. You can get the file data of the Buffer type, and you can perform operations such as compression on the data in the callback. The returned new Buffer data will replace the original data and write it into the file.
Update the output of crawling on the console, and collect the error information generated by crawling into an error queue. After the crawling is completed, you can get the error message queue through the return value.
🚨 重大改变
对于配置,每个爬取 API 配置发生重大改变,同一个 API 支持更多爬取配置方式,每种方式都有其存在的意义。