# crawler-ts-fetch
Comparing version 1.1.0 to 1.1.1
### package.json

```diff
 {
   "name": "crawler-ts-fetch",
-  "version": "1.1.0",
+  "version": "1.1.1",
   "description": "Lightweight crawler written in TypeScript using ES6 generators.",
   "keywords": [
     "crawl",
     "crawler",
     "crawling-framework",
     "crawling",
     "es6-generators",
     "typescript",
     "web-crawler",
     "web-crawling"
   ],
   "author": {
     "name": "Gillis Van Ginderacter",
```
# crawler-ts

<p align="center">
  Lightweight crawler written in TypeScript using ES6 generators.
</p>

<p align="center">
  <a href="https://www.npmjs.com/package/crawler-ts">
    <img alt="npm" src="https://img.shields.io/npm/v/crawler-ts.svg?color=green"/>
  </a>
  <a href="https://bundlephobia.com/result?p=crawler-ts">
    <img alt="bundle size" src="https://img.shields.io/bundlephobia/minzip/crawler-ts?label=bundle%20size"/>
  </a>
  <img alt="license" src="https://img.shields.io/npm/l/crawler-ts?label=license&color=green"/>
</p>
## Installation
- [Crawl NASA Mars News](./examples/mars-news/src/index.ts)
- [Crawl Hacker News](./examples/hacker-news/src/index.ts)
- [Crawl the file system](./examples/fs/src/index.ts)
- [Crawl Github](./examples/http/src/index.ts)

## API
The `createCrawler` function expects the following options as the first parameter.
```typescript
/**
 * @type {L} The type of the locations to crawl, e.g. `URL` or `string` that represents a path.
 * @type {R} The type of the response at the location that is crawled, e.g. a Cheerio object or file system `fs.Stats`.
 * @type {P} The intermediate parsed result that can be parsed from the response and generated by the crawler.
 */
interface Options<L, R, P> {
  /**
   * This function should return the response for the given location.
   */
  requester(location: L): ValueOrPromise<R | undefined>;

  /**
   * This function should return true if the crawler should parse the response, or false if not.
   */
  shouldParse(props: PreParseProps<L, R>): ValueOrPromise<boolean>;

  /**
   * This function should parse the response and convert it to the parsed type.
   */
  parser(props: PreParseProps<L, R>): ValueOrPromise<P | undefined>;

  /**
   * This function should return true if the crawler should yield the parsed result, or false if not.
   */
  shouldYield(props: PostParseProps<L, R, P>): ValueOrPromise<boolean>;

  /**
   * This function should yield all the locations to follow in the given parsed result.
   */
  follower(props: PostParseProps<L, R, P>): AsyncGenerator<L>;

  /**
   * This function should return true if the crawler should queue the location for crawling, or false if not.
   */
  shouldQueue(props: { location: L; origin: L; response: R; parsed: P }): ValueOrPromise<boolean>;

  /**
   * The logger can be set to `console` to output debug information to the `console`.
   */
  logger?: Logger;
}

interface PreParseProps<L, R> {
  location: L;
  response: R;
}

interface PostParseProps<L, R, P> extends PreParseProps<L, R> {
  parsed: P;
}

type ValueOrPromise<T> = T | Promise<T>;
```
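To show how these options fit together, here is a minimal, self-contained sketch that drives callbacks of this shape over a tiny in-memory site. The `crawl` driver below is an illustration only, not `createCrawler`'s actual implementation, and `shouldParse`/`shouldQueue` receive simplified props here:

```typescript
// Illustrative only: an in-memory "site" crawled with callbacks of the shape
// described above. The `crawl` driver is NOT the real createCrawler.
const site: Record<string, { body: string; links: string[] }> = {
  "/": { body: "home", links: ["/a", "/b"] },
  "/a": { body: "page a", links: ["/b"] },
  "/b": { body: "page b", links: [] },
};

const options = {
  // Return the "response" for a location, or undefined when it does not exist.
  requester: (location: string) => site[location],
  // Skip parsing "/b" to show how shouldParse filters locations.
  shouldParse: ({ location }: { location: string }) => location !== "/b",
  // Parse the response into the intermediate result type (here: its body).
  parser: ({ response }: { response: { body: string } }) => response.body,
  shouldYield: () => true,
  // Yield all locations found in the page.
  follower: async function* ({ response }: { response: { links: string[] } }) {
    yield* response.links;
  },
  shouldQueue: ({ location }: { location: string }) => location in site,
};

// Minimal breadth-first driver over the options above, with deduplication.
async function* crawl(start: string) {
  const queue = [start];
  const visited = new Set<string>();
  while (queue.length > 0) {
    const location = queue.shift()!;
    if (visited.has(location)) continue;
    visited.add(location);
    const response = await options.requester(location);
    if (!response || !(await options.shouldParse({ location }))) continue;
    const parsed = await options.parser({ response });
    if (await options.shouldYield()) yield { location, parsed };
    for await (const next of options.follower({ response })) {
      if (await options.shouldQueue({ location: next })) queue.push(next);
    }
  }
}
```

Iterating `crawl("/")` yields `"/"` and `"/a"`; `"/b"` is visited but filtered out by `shouldParse`.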
### crawler-ts-fetch

This module implements a `requester` that uses `node-fetch` to request content over HTTP.

See [modules/crawler-ts-fetch](./modules/crawler-ts-fetch).
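The module's exact exports are not shown here, but a requester of this shape could be sketched as follows (using Node 18+'s built-in `fetch` rather than `node-fetch`; the name `htmlRequester` is illustrative):

```typescript
// Illustrative only: fetch a URL and return the body as text, or undefined
// when the request fails, so the crawler can simply skip that location.
async function htmlRequester(location: string): Promise<string | undefined> {
  try {
    const res = await fetch(location);
    if (!res.ok) return undefined; // non-2xx responses are skipped
    return await res.text();
  } catch {
    return undefined; // invalid URLs and network errors are skipped too
  }
}
```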
### crawler-ts-htmlparser2

This module implements a `requester`, `parser` and `follower` for HTML. The `requester` uses `crawler-ts-fetch` to request content over HTTP. The `parser` uses `htmlparser2` to parse HTML files. The `follower` uses the parser result to find `<a>` anchor elements and yields their `href` attributes.

See [modules/crawler-ts-htmlparser2](./modules/crawler-ts-htmlparser2).
### crawler-ts-fs

<p>
  <a href="https://www.npmjs.com/package/crawler-ts-fs">
    <img alt="npm" src="https://img.shields.io/npm/v/crawler-ts-fs.svg?color=green"/>
  </a>
  <a href="https://bundlephobia.com/result?p=crawler-ts-fs">
    <img alt="bundle size" src="https://img.shields.io/bundlephobia/minzip/crawler-ts-fs?label=bundle%20size"/>
  </a>
</p>

This module implements a `requester`, `parser` and `follower` for the file system. The `requester` uses `fs.stat` to request file information. The `parser` by default just returns the response from the `requester`. The `follower` follows directories.
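The package's actual export names are not listed above, but the idea of a follower that follows directories can be sketched with plain Node APIs (illustrative, not the module's code):

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

// Illustrative only: follow a location by yielding its directory entries.
// Files yield nothing; directories yield the full path of each child, which
// the crawler can then queue via shouldQueue.
async function* directoryFollower(location: string): AsyncGenerator<string> {
  const stats = await fs.stat(location);
  if (!stats.isDirectory()) return;
  for (const entry of await fs.readdir(location)) {
    yield path.join(location, entry);
  }
}
```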
## Author