x-crawl is a multifunctional Node.js crawler library. It can crawl HTML, JSON, file resources, and more through simple configuration.
Take npm as an example:
npm install x-crawl
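Other package managers work as well, for example Yarn or pnpm:
yarn add x-crawl
pnpm add x-crawl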
Example: get the title of https://docs.github.com/zh/get-started:
// Import the module (ESM shown; CJS is also supported)
import XCrawl from 'x-crawl'
// Create a crawler instance
const docsXCrawl = new XCrawl({
baseUrl: 'https://docs.github.com',
timeout: 10000,
intervalTime: { max: 2000, min: 1000 }
})
// Call fetchHTML API to crawl
docsXCrawl.fetchHTML('/zh/get-started').then((jsdom) => {
console.log(jsdom.window.document.querySelector('title')?.textContent)
})
Create a crawler instance via new XCrawl.
class XCrawl {
private readonly baseConfig
constructor(baseConfig?: IXCrawlBaseConifg)
fetchHTML(config: string | IFetchHTMLConfig): Promise<JSDOM>
fetchData<T = any>(config: IFetchDataConfig): Promise<IFetchCommon<T>>
fetchFile(config: IFetchFileConfig): Promise<IFetchCommon<IFileInfo>>
}
myXCrawl is the crawler instance used in the following examples.
const myXCrawl = new XCrawl({
baseUrl: 'https://xxx.com',
timeout: 10000,
// The interval between requests; takes effect when multiple requests are made
intervalTime: {
max: 2000,
min: 1000
}
})
The mode option defaults to 'async'. If an interval time is set, the crawler waits for the interval to elapse before sending the next request.
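For example, to send requests one at a time instead of concurrently, pass mode: 'sync' when creating the instance (a minimal sketch based on the IXCrawlBaseConifg type below):
const syncXCrawl = new XCrawl({
  baseUrl: 'https://xxx.com',
  mode: 'sync', // send requests one after another
  intervalTime: 1000 // wait 1 second between requests (fixed interval)
})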
fetchHTML is a method of the myXCrawl instance above, usually used to crawl HTML pages.
function fetchHTML(config: string | IFetchHTMLConfig): Promise<JSDOM>
myXCrawl.fetchHTML('/xxx').then((jsdom) => {
console.log(jsdom.window.document.querySelector('title')?.textContent)
})
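Since config can also be an IFetchHTMLConfig object (see the type declarations below), request options such as headers and timeout can be passed too; a sketch with assumed values:
myXCrawl.fetchHTML({
  url: '/xxx',
  headers: { 'User-Agent': 'my-crawler' }, // assumed example header
  timeout: 5000
}).then((jsdom) => {
  console.log(jsdom.window.document.querySelector('title')?.textContent)
})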
fetchData is a method of the myXCrawl instance above, usually used to crawl APIs and obtain JSON data.
function fetchData<T = any>(config: IFetchDataConfig): Promise<IFetchCommon<T>>
const requestConifg = [
{ url: '/xxxx', method: 'GET' },
{ url: '/xxxx', method: 'GET' },
{ url: '/xxxx', method: 'GET' }
]
myXCrawl.fetchData({
requestConifg, // Request configuration, can be IRequestConfig | IRequestConfig[]
intervalTime: 800 // Interval between requests; takes effect when multiple requests are made
}).then(res => {
console.log(res)
})
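Because fetchData is generic, the expected shape of the response data can be supplied as the type argument T. A minimal sketch, assuming a hypothetical IUser shape and /api/user endpoint:
interface IUser {
  id: number
  name: string
}

myXCrawl.fetchData<IUser>({ requestConifg: { url: '/api/user' } }).then((res) => {
  // res is IFetchCommon<IUser>: each item has id, statusCode, headers and typed data
  res.forEach((item) => console.log(item.data.name))
})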
fetchFile is a method of the myXCrawl instance above, usually used to crawl files such as images and PDF files.
function fetchFile(config: IFetchFileConfig): Promise<IFetchCommon<IFileInfo>>
import path from 'node:path'

const requestConifg = [
{ url: '/xxxx' },
{ url: '/xxxx' },
{ url: '/xxxx' }
]
myXCrawl.fetchFile({
requestConifg,
fileConfig: {
storeDir: path.resolve(__dirname, './upload') // storage folder
}
}).then(fileInfos => {
console.log(fileInfos)
})
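The resolved value is IFetchCommon<IFileInfo> (see the type declarations below), so each result's data field exposes the stored file's name, MIME type, size, and path; a sketch:
myXCrawl.fetchFile({
  requestConifg: { url: '/xxxx' },
  fileConfig: { storeDir: path.resolve(__dirname, './upload') }
}).then((fileInfos) => {
  fileInfos.forEach(({ statusCode, data }) => {
    // data is an IFileInfo
    console.log(statusCode, data.fileName, data.mimeType, data.size, data.filePath)
  })
})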
interface IAnyObject extends Object {
[key: string | number | symbol]: any
}
type IMethod = 'get' | 'GET' | 'delete' | 'DELETE' | 'head' | 'HEAD' | 'options' | 'OPTIONS' | 'post' | 'POST' | 'put' | 'PUT' | 'patch' | 'PATCH' | 'purge' | 'PURGE' | 'link' | 'LINK' | 'unlink' | 'UNLINK'
interface IRequestConfig {
url: string
method?: IMethod
headers?: IAnyObject
params?: IAnyObject
data?: any
timeout?: number
}
type IIntervalTime = number | {
max: number
min?: number
}
interface IFetchBaseConifg {
requestConifg: IRequestConfig | IRequestConfig[]
intervalTime?: IIntervalTime
}
type IFetchCommon<T> = {
id: number
statusCode: number | undefined
headers: IncomingHttpHeaders // node:http type
data: T
}[]
interface IFileInfo {
fileName: string
mimeType: string
size: number
filePath: string
}
interface IXCrawlBaseConifg {
baseUrl?: string
timeout?: number
intervalTime?: IIntervalTime
mode?: 'async' | 'sync' // default: 'async'
}
interface IFetchHTMLConfig extends IRequestConfig {}
interface IFetchDataConfig extends IFetchBaseConifg {}
interface IFetchFileConfig extends IFetchBaseConifg {
fileConfig: {
storeDir: string
}
}
If you have any questions or needs, please submit an issue at https://github.com/coder-hxl/x-crawl/issues.