English | 简体中文
XCrawl is a multifunctional Node.js crawler library. It can crawl HTML, JSON, file resources, and more through simple configuration.
Take npm as an example:
npm install x-crawl
As an example, get the title of https://docs.github.com/zh/get-started:
// Import module ES/CJS
import XCrawl from 'x-crawl'
// Create a crawler instance
const docsXCrawl = new XCrawl({
baseUrl: 'https://docs.github.com',
timeout: 10000,
intervalTime: { max: 2000, min: 1000 }
})
// Call fetchHTML API to crawl
docsXCrawl.fetchHTML('/zh/get-started').then((jsdom) => {
console.log(jsdom.window.document.querySelector('title')?.textContent)
})
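The import comment above mentions ES/CJS; in a CommonJS project the import would be a require call instead. A hedged sketch (assumption: the bundled output exposes the constructor directly; some setups may need require('x-crawl').default):
// CommonJS form of the import above
const XCrawl = require('x-crawl')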
Create a crawler instance via new XCrawl.
class XCrawl {
private readonly baseConfig
constructor(baseConfig?: IXCrawlBaseConifg)
fetchHTML(config: string | IFetchHTMLConfig): Promise<JSDOM>
fetchData<T = any>(config: IFetchDataConfig): Promise<IFetchCommon<T>>
fetchFile(config: IFetchFileConfig): Promise<IFetchCommon<IFileInfo>>
}
myXCrawl is the crawler instance used in the following examples.
const myXCrawl = new XCrawl({
baseUrl: 'https://xxx.com',
timeout: 10000,
// Interval before the next request; only takes effect when multiple requests are made
intervalTime: {
max: 2000,
min: 1000
}
})
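intervalTime also accepts a plain number (see IIntervalTime in the type declarations below), so a fixed delay can be used instead of a random range. A minimal sketch, with the same placeholder base URL:
const fixedIntervalXCrawl = new XCrawl({
  baseUrl: 'https://xxx.com',
  timeout: 10000,
  intervalTime: 1500 // fixed delay of 1500 ms before each subsequent request
})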
fetchHTML is a method of the myXCrawl instance above, usually used to crawl HTML pages.
function fetchHTML(config: string | IFetchHTMLConfig): Promise<JSDOM>
myXCrawl.fetchHTML('/xxx').then((jsdom) => {
console.log(jsdom.window.document.querySelector('title')?.textContent)
})
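Since the config parameter also accepts an IFetchHTMLConfig object (it extends IRequestConfig, see the type declarations below), per-request options such as headers or a timeout can be passed as well. A sketch with placeholder values:
myXCrawl.fetchHTML({
  url: '/xxx',
  timeout: 5000, // per-request timeout in ms
  headers: { 'user-agent': 'my-crawler/1.0' } // placeholder header
}).then((jsdom) => {
  console.log(jsdom.window.document.querySelector('title')?.textContent)
})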
fetchData is a method of the myXCrawl instance above, usually used to crawl APIs and obtain JSON data.
function fetchData<T = any>(config: IFetchDataConfig): Promise<IFetchCommon<T>>
const requestConifg = [
{ url: '/xxxx', method: 'GET' },
{ url: '/xxxx', method: 'GET' },
{ url: '/xxxx', method: 'GET' }
]
myXCrawl.fetchData({
requestConifg, // Request configuration; can be IRequestConfig | IRequestConfig[]
intervalTime: 800 // Interval before the next request; only takes effect when multiple requests are made
}).then(res => {
console.log(res)
})
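The generic parameter types each result's data field (see IFetchCommon<T> in the type declarations below). A hedged sketch, using a hypothetical IArticle shape for the API response:
interface IArticle {
  id: number
  title: string
}

myXCrawl.fetchData<IArticle>({ requestConifg }).then((res) => {
  res.forEach((item) => {
    // each item carries an id, statusCode, headers and the typed data
    console.log(item.id, item.statusCode, item.data.title)
  })
})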
fetchFile is a method of the myXCrawl instance above, usually used to crawl files such as images and PDF files.
function fetchFile(config: IFetchFileConfig): Promise<IFetchCommon<IFileInfo>>
import path from 'node:path' // needed for the storeDir path below

const requestConifg = [
{ url: '/xxxx' },
{ url: '/xxxx' },
{ url: '/xxxx' }
]
myXCrawl.fetchFile({
requestConifg,
fileConfig: {
storeDir: path.resolve(__dirname, './upload') // storage folder
}
}).then(fileInfos => {
console.log(fileInfos)
})
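Each entry in the resolved array wraps an IFileInfo describing the stored file (see the type declarations below). A minimal sketch of inspecting the results:
myXCrawl.fetchFile({
  requestConifg,
  fileConfig: {
    storeDir: path.resolve(__dirname, './upload')
  }
}).then((fileInfos) => {
  fileInfos.forEach(({ data }) => {
    // data is the IFileInfo for the stored file
    console.log(data.fileName, data.mimeType, data.size, data.filePath)
  })
})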
interface IAnyObject extends Object {
[key: string | number | symbol]: any
}
type IMethod = 'get' | 'GET' | 'delete' | 'DELETE' | 'head' | 'HEAD' | 'options' | 'OPTIONS' | 'post' | 'POST' | 'put' | 'PUT' | 'patch' | 'PATCH' | 'purge' | 'PURGE' | 'link' | 'LINK' | 'unlink' | 'UNLINK'
interface IRequestConfig {
url: string
method?: IMethod
headers?: IAnyObject
params?: IAnyObject
data?: any
timeout?: number
}
type IIntervalTime = number | {
max: number
min?: number
}
interface IFetchBaseConifg {
requestConifg: IRequestConfig | IRequestConfig[]
intervalTime?: IIntervalTime
}
type IFetchCommon<T> = {
id: number
statusCode: number | undefined
headers: IncomingHttpHeaders // node:http type
data: T
}[]
interface IFileInfo {
fileName: string
mimeType: string
size: number
filePath: string
}
interface IXCrawlBaseConifg {
baseUrl?: string
timeout?: number
intervalTime?: IIntervalTime
}
interface IFetchHTMLConfig extends IRequestConfig {}
interface IFetchDataConfig extends IFetchBaseConifg {}
interface IFetchFileConfig extends IFetchBaseConifg {
fileConfig: {
storeDir: string
}
}
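Putting the declarations together, a single request configuration with headers, query parameters, and a per-request timeout might look like the following sketch (all values are placeholders):
const detailedRequest: IRequestConfig = {
  url: '/xxxx',
  method: 'GET',
  headers: { accept: 'application/json' }, // placeholder header
  params: { page: 1 }, // presumably appended as query parameters
  timeout: 5000 // per-request timeout in ms
}

myXCrawl.fetchData({ requestConifg: detailedRequest })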
If you have any questions or needs, please submit an issue at https://github.com/coder-hxl/x-crawl/issues.