The crawlFile API has become more flexible again. The changes mainly concern fileName, storeDir, and extension.
- In the advanced usage (CrawlFileAdvancedConfig), storeDir and extension are renamed to storeDirs and extensions, each of type string | (string | null)[]. A new fileNames option of type (string | null)[] is added. Array values are assigned to the crawl targets in order.
- In the detailed target usage (CrawlFileDetailTargetConfig), fileName now also accepts null, which makes the target use the default file name instead of the corresponding entry in the fileNames option of the advanced usage (CrawlFileAdvancedConfig).
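
The ordering rule above can be modeled with a small self-contained sketch. It is not x-crawl's implementation: the types below only borrow the option names from this changelog, the `targets` field, the `pick` and `resolveFileOptions` helpers, and the example URLs are hypothetical, and a `null` in the output simply stands for "fall back to the default".

```typescript
// Illustrative model only: these types borrow option names from the
// changelog and are not x-crawl's real type definitions.
type PerTarget = string | (string | null)[];

interface FileTarget {
  url: string;
  // undefined: inherit from fileNames; null: force the default file name
  fileName?: string | null;
}

interface AdvancedFileConfig {
  targets: FileTarget[];
  storeDirs?: PerTarget;
  extensions?: PerTarget;
  fileNames?: (string | null)[];
}

// A plain string applies to every target, an array is consumed in order,
// and a null or missing entry means "no value, use the default".
function pick(option: PerTarget | undefined, index: number): string | null {
  if (option === undefined) return null;
  if (typeof option === "string") return option;
  return option[index] ?? null;
}

function resolveFileOptions(config: AdvancedFileConfig) {
  return config.targets.map((target, i) => ({
    url: target.url,
    storeDir: pick(config.storeDirs, i),
    extension: pick(config.extensions, i),
    // A fileName set on the target (including null) wins over fileNames[i].
    fileName:
      target.fileName !== undefined
        ? target.fileName
        : pick(config.fileNames, i),
  }));
}

const resolved = resolveFileOptions({
  targets: [
    { url: "https://example.com/a.jpg" },
    { url: "https://example.com/b.jpg", fileName: null },
    { url: "https://example.com/c.jpg" },
  ],
  storeDirs: "./upload",
  extensions: [".jpg", null, ".png"],
  fileNames: ["first", "second", null],
});
```

In this model the second target ignores `"second"` because its own `fileName` is `null`, while the single `storeDirs` string is shared by all three targets.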

7.0.1

coderhxl

published 7.0.1 • 2 years ago

v7.0.1 (2023-05-04)

🐞 Bug fixes

Fixed the params configuration option of the crawlData API, which had no effect.

7.0.0

v7.0.0 (2023-04-26)

🚨 Breaking Changes

Fingerprint upgrade:
- The fingerprint option of the advanced usage is renamed to fingerprints and is now an array of DetailTargetFingerprintCommon objects, which makes customization easier. Internally, the objects are randomly assigned to the targets.
- crawlPage fingerprint options: the maximum width and maximum height in the fingerprint configuration of the advanced usage and the detailed target usage are now optional.

Proxy upgrade: the proxy option used when creating a crawler instance, in the advanced usage, and in the detailed target usage is now an object with three properties:
- urls: one or more proxy URLs; the first one is used by default.
- switchByHttpStatus: the response status codes that trigger a switch to another proxy.
- switchByErrorCount: the number of errors, such as timeouts, after which the proxy is switched.

Proxy rotation only works together with error retries.

Return value type adjustment: CrawlCommonRes, CrawlPageSingleRes, CrawlDataSingleRes, and CrawlFileSingleRes are renamed to CrawlCommonResult, CrawlPageSingleResult, CrawlDataSingleResult, and CrawlFileSingleResult, respectively.
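
To make the proxy switching rules concrete, here is a minimal sketch of how such a rotation could behave. It is a hypothetical model, not x-crawl's code: the `ProxyRotator` class, its method names, and the localhost URLs are invented for illustration, and staying on the last proxy once the list is exhausted is an assumption that the changelog does not confirm.

```typescript
// Hypothetical model of the proxy object described above; not x-crawl code.
interface ProxyConfig {
  urls: string[];
  switchByHttpStatus?: number[];
  switchByErrorCount?: number;
}

class ProxyRotator {
  private config: ProxyConfig;
  private index = 0;
  private errorCount = 0;

  constructor(config: ProxyConfig) {
    this.config = config;
  }

  // The first URL is used until a switch condition is met.
  get current(): string | undefined {
    return this.config.urls[this.index];
  }

  // Report a completed response; switch if its status code is listed.
  reportStatus(status: number): void {
    if (this.config.switchByHttpStatus?.includes(status)) this.next();
  }

  // Report a failure such as a timeout; switch once the threshold is reached.
  reportError(): void {
    this.errorCount += 1;
    const limit = this.config.switchByErrorCount;
    if (limit !== undefined && this.errorCount >= limit) this.next();
  }

  private next(): void {
    // Assumption: stay on the last proxy when there is nothing left to try.
    if (this.index < this.config.urls.length - 1) this.index += 1;
    this.errorCount = 0;
  }
}

const rotator = new ProxyRotator({
  urls: ["http://localhost:14892", "http://localhost:14893"],
  switchByHttpStatus: [401, 403],
  switchByErrorCount: 2,
});
```

A retry loop would call `reportStatus` or `reportError` after each attempt and read `current` before the next one, which is why rotation has no effect unless retries are enabled.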

🚀 Features

Setting an option to null cancels the value inherited from the higher-level unified configuration.
The userAgent option in DetailTargetFingerprintCommon is now an object, which allows customizing the maximum and minimum values of the major version, minor version, and revision number. Each crawl target gets a new userAgent.
A new proxyDetails property in the crawl results records the proxy status.
The mobile option of the fingerprint configuration accepts a new 'random' value, which lets the library decide randomly.
Terminal messages are simplified and their colors adjusted.
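
The null-cancels-inheritance rule from the first item can be sketched as a small resolution function. This is an assumed model rather than the library's logic: the `resolveOption` helper, the three-level ordering (detailed target, advanced usage, crawler instance), and the timeout values are all hypothetical.

```typescript
// undefined = not set at this level (keep looking upward);
// null = explicitly cancel whatever a higher level has set.
function resolveOption<T>(...levels: (T | null | undefined)[]): T | undefined {
  // Levels are ordered from most specific to least specific.
  for (const value of levels) {
    if (value === null) return undefined;
    if (value !== undefined) return value;
  }
  return undefined;
}

// Hypothetical values for one option at three levels:
// detailed target, advanced usage, crawler instance.
const instanceTimeout = 10_000;
const inherited = resolveOption<number>(undefined, undefined, instanceTimeout);
const overridden = resolveOption<number>(undefined, 5_000, instanceTimeout);
const cancelled = resolveOption<number>(null, 5_000, instanceTimeout);
```

Here `inherited` takes the instance value, `overridden` takes the advanced-usage value, and `cancelled` ends up unset because the most specific level is `null`.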

🐞 Bug fixes

Nested folders that do not exist could not be created on Linux.

6.0.1

v6.0.1 (2023-04-21)

🚀 Features

Improved the documentation.

6.0.0

v6.0.0 (2023-04-19)

🚨 Breaking Changes

Result processing for each crawl target: a target's result is now processed as soon as that target completes, which saves time and improves performance. Previously, processing waited until all targets had completed, leaving idle time during the crawl.
Execution timing of the callback passed as the second parameter to the crawlPage, crawlData, and crawlFile APIs: the callback now runs last and receives the same result as the Promise.
Types: PageRequestConfig, DataRequestConfig, and FileRequestConfig are renamed to CrawlPageDetailTargetConfig, CrawlDataDetailTargetConfig, and CrawlFileDetailTargetConfig, respectively, because they can now hold more than request configuration; this is called the detailed target usage. CrawlPageConfigObject, CrawlDataConfigObject, and CrawlFileConfigObject are renamed to CrawlPageAdvancedConfig, CrawlDataAdvancedConfig, and CrawlFileAdvancedConfig, respectively; this is called the advanced usage.
fileConfig of crawlFile: its configuration options can now be set directly on the root configuration object. The beforeSave lifecycle function is renamed to onBeforeSaveItemFile.
Result objects of crawlPage, crawlData, and crawlFile: the crawlCount property is removed; the number of attempts is retryCount + 1. errorQueue is renamed to crawlErrorQueue.
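
The relationship between retryCount and the number of attempts can be illustrated with a generic retry loop. This is a sketch under assumptions, not the library's retry code: the field names retryCount and crawlErrorQueue come from this changelog, while `crawlWithRetry`, `attemptCount`, the `maxRetry` parameter, and the remaining result fields are hypothetical.

```typescript
// Hypothetical result shape; only retryCount and crawlErrorQueue are
// names taken from the changelog.
interface AttemptResult<T> {
  isSuccess: boolean;
  retryCount: number;
  crawlErrorQueue: Error[];
  data: T | null;
}

// Run the task once, then retry up to maxRetry more times on failure.
async function crawlWithRetry<T>(
  task: () => Promise<T>,
  maxRetry: number,
): Promise<AttemptResult<T>> {
  const crawlErrorQueue: Error[] = [];
  for (let retryCount = 0; retryCount <= maxRetry; retryCount++) {
    try {
      const data = await task();
      return { isSuccess: true, retryCount, crawlErrorQueue, data };
    } catch (error) {
      // Keep every failure so the caller can inspect what went wrong.
      crawlErrorQueue.push(
        error instanceof Error ? error : new Error(String(error)),
      );
    }
  }
  return { isSuccess: false, retryCount: maxRetry, crawlErrorQueue, data: null };
}

// Total attempts, the value the removed crawlCount property used to hold.
function attemptCount(result: { retryCount: number }): number {
  return result.retryCount + 1;
}
```

A task that fails twice and then succeeds finishes with retryCount 2, so `attemptCount` reports 3 attempts and the error queue holds the two failures.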

🚀 Features

Added device fingerprints to make it harder to identify and track the crawler across locations through fingerprinting. The default fingerprint can be enabled with a switch; custom fingerprints can be set once for all crawl targets in the advanced usage or per target in the detailed target usage.
Each advanced usage gains several properties that can be set once on the advanced configuration object instead of repeating them for every target. The new onCrawlItemComplete lifecycle function runs after each crawl target completes and receives that target's result.
The crawler application configuration gains a crawlPage option; browser launch settings go in crawlPage.launchBrowser (type PuppeteerLaunchOptions, from Puppeteer).
crawlPage gains a viewport option for setting the page viewport.
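
The timing of onCrawlItemComplete relative to the final callback can be sketched with plain Promises. This is an illustrative model, not how x-crawl is implemented: `crawlAll`, the `ItemResult` shape, and the `id` numbering are hypothetical, and only the name onCrawlItemComplete is taken from this changelog.

```typescript
// Hypothetical per-target result; the id is the 1-based target position.
type ItemResult<T> = { id: number; data: T };

async function crawlAll<T>(
  tasks: (() => Promise<T>)[],
  onCrawlItemComplete: (result: ItemResult<T>) => void,
  callback?: (results: ItemResult<T>[]) => void,
): Promise<ItemResult<T>[]> {
  const results = await Promise.all(
    tasks.map(async (task, i) => {
      const result: ItemResult<T> = { id: i + 1, data: await task() };
      // Runs as soon as this target finishes, without waiting for the rest.
      onCrawlItemComplete(result);
      return result;
    }),
  );
  // Runs last, with the same value the returned Promise resolves to.
  callback?.(results);
  return results;
}
```

With a slow first target and a fast second one, onCrawlItemComplete fires for the second target first, while the final results array still follows the original target order.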
