x-crawl
x-crawl is a multifunctional Node.js crawler library. Provide a configuration and it can batch-crawl HTML, JSON, images, and more.
Install
Take npm as an example:
npm install x-crawl
Example
Fetching the title of https://docs.github.com/zh/get-started:
import XCrawl from 'x-crawl'
const docsXCrawl = new XCrawl({
baseUrl: 'https://docs.github.com',
timeout: 10000,
intervalTime: { max: 2000, min: 1000 }
})
docsXCrawl.fetchHTML('/zh/get-started').then((jsdom) => {
console.log(jsdom.window.document.querySelector('title')?.textContent)
})
Key concepts
XCrawl
Create a crawler instance via new XCrawl.
class XCrawl {
private readonly baseConfig
constructor(baseConfig?: IXCrawlBaseConifg)
fetch<T = any>(config: IFetchConfig): Promise<T>
fetchFile(config: IFetchFileConfig): Promise<IFetchFile>
fetchHTML(url: string): Promise<JSDOM>
}
myXCrawl is the crawler instance used in the examples that follow. Relative URLs passed to its methods are resolved against baseUrl. An intervalTime given as { max, min } makes each request in a batch wait a random duration between min and max milliseconds; a plain number means a fixed wait.
const myXCrawl = new XCrawl({
baseUrl: 'https://xxx.com',
timeout: 10000,
intervalTime: {
max: 2000,
min: 1000
}
})
fetch
fetch is a method of the myXCrawl instance above. It is usually used to crawl APIs and obtain JSON data.
function fetch<T = any>(config: IFetchConfig): Promise<T>
const requestConifg = [
{ url: '/xxxx', method: 'GET' },
{ url: '/xxxx', method: 'GET' },
{ url: '/xxxx', method: 'GET' }
]
myXCrawl.fetch({
requestConifg,
intervalTime: 800
}).then(res => {
console.log(res)
})
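
Since fetch is generic, the resolved value can be typed. A minimal sketch, assuming a hypothetical /api/list endpoint whose JSON body is an array of items (the endpoint, the IListItem shape, and the assumption that a single request resolves to its body are illustrative, not part of the library):

interface IListItem {
  id: number
  name: string
}

// The type parameter types the resolved value; '/api/list' is a hypothetical endpoint
myXCrawl.fetch<IListItem[]>({
  requestConifg: { url: '/api/list', method: 'GET' }
}).then((list) => {
  list.forEach((item) => console.log(item.id, item.name))
})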
fetchFile
fetchFile is a method of the myXCrawl instance above. It is usually used to crawl files such as images and PDFs.
function fetchFile(config: IFetchFileConfig): Promise<IFetchFile>
import path from 'node:path'

const requestConifg = [
{ url: '/xxxx', method: 'GET' },
{ url: '/xxxx', method: 'GET' },
{ url: '/xxxx', method: 'GET' }
]
myXCrawl.fetchFile({
requestConifg,
fileConfig: {
storeDir: path.resolve(__dirname, './upload')
}
}).then(fileInfos => {
console.log(fileInfos)
})
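
The resolved value has the IFetchFile type listed under Types below: one entry per stored file. A sketch of consuming it, with the field names taken from that type:

myXCrawl.fetchFile({
  requestConifg: { url: '/xxxx', method: 'GET' },
  fileConfig: {
    storeDir: path.resolve(__dirname, './upload')
  }
}).then((fileInfos) => {
  // fileName, mimeType, size, and filePath follow the IFetchFile item shape
  for (const { fileName, mimeType, size, filePath } of fileInfos) {
    console.log(`${fileName} (${mimeType}, ${size} bytes) -> ${filePath}`)
  }
})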
fetchHTML
fetchHTML is a method of the myXCrawl instance above, usually used to crawl HTML.
function fetchHTML(url: string): Promise<JSDOM>
myXCrawl.fetchHTML('/xxx').then((jsdom) => {
console.log(jsdom.window.document.querySelector('title')?.textContent)
})
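
The resolved JSDOM instance exposes a standard DOM, so any selector works. For example, collecting every link on the page:

myXCrawl.fetchHTML('/xxx').then((jsdom) => {
  // querySelectorAll behaves as in the browser; jsdom implements the DOM standard
  const links = jsdom.window.document.querySelectorAll('a')
  links.forEach((a) => console.log(a.href))
})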
Types
interface IAnyObject extends Object {
[key: string | number | symbol]: any
}
export type IMethod = 'get' | 'GET' | 'delete' | 'DELETE' | 'head' | 'HEAD' | 'options' | 'OPTIONS' | 'post' | 'POST' | 'put' | 'PUT' | 'patch' | 'PATCH' | 'purge' | 'PURGE' | 'link' | 'LINK' | 'unlink' | 'UNLINK'
export interface IRequestConfig {
url: string
method?: IMethod
headers?: IAnyObject
params?: IAnyObject
data?: any
timeout?: number
}
type IIntervalTime = number | {
max: number
min?: number
}
interface IFetchBaseConifg {
requestConifg: IRequestConfig | IRequestConfig[]
intervalTime?: IIntervalTime
}
type IFetchFile = {
fileName: string
mimeType: string
size: number
filePath: string
}[]
interface IXCrawlBaseConifg {
baseUrl?: string
timeout?: number
intervalTime?: IIntervalTime
}
interface IFetchConfig extends IFetchBaseConifg {
}
interface IFetchFileConfig extends IFetchBaseConifg {
fileConfig: {
storeDir: string
}
}
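
A single IRequestConfig can carry per-request headers, query params, a body, and a timeout. A sketch with placeholder values; that params are serialized into the query string and that the per-request timeout overrides the instance-level one are assumptions here:

const requestConifg: IRequestConfig = {
  url: '/xxxx',
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  params: { page: 1 }, // assumed to be appended to the URL as query parameters
  data: { keyword: 'x-crawl' }, // request body
  timeout: 5000 // assumed to override the instance-level timeout
}

myXCrawl.fetch({ requestConifg })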
More
If you have any questions or needs, please submit an issue at https://github.com/coder-hxl/x-crawl.