@crawlee/types
Comparing version 3.0.0-beta.11 to 3.0.0-beta.12
package.json
 {
   "name": "@crawlee/types",
-  "version": "3.0.0-beta.11",
+  "version": "3.0.0-beta.12",
   "description": "Shared types for the crawlee projects",
@@ -5,0 +5,0 @@ "engines": {
README.md
@@ -10,7 +10,7 @@ # Crawlee: The scalable web crawling and scraping library for JavaScript
-Apify SDK simplifies the development of web crawlers, scrapers, data extractors and web automation jobs.
+Crawlee simplifies the development of web crawlers, scrapers, data extractors and web automation jobs.
 It provides tools to manage and automatically scale a pool of headless browsers,
 to maintain queues of URLs to crawl, store crawling results to a local filesystem or into the cloud,
 rotate proxies and much more.
-The SDK is available as the [`apify`](https://www.npmjs.com/package/apify) NPM package.
+The SDK is available as the [`crawlee`](https://www.npmjs.com/package/crawlee) NPM package.
 It can be used either stand-alone in your own applications
@@ -20,5 +20,5 @@ or in [actors](https://docs.apify.com/actor)
-**View full documentation, guides and examples on the [Apify SDK project website](https://sdk.apify.com)**
+**View full documentation, guides and examples on the [Crawlee project website](https://apify.github.io/apify-ts/)**
-> Would you like to work with us on Apify SDK or similar projects? [We are hiring!](https://apify.com/jobs#senior-node.js-engineer)
+> Would you like to work with us on Crawlee or similar projects? [We are hiring!](https://apify.com/jobs#senior-node.js-engineer)
@@ -40,3 +40,3 @@ ## Motivation
-The goal of the Apify SDK is to fill this gap and provide a toolbox for generic web scraping, crawling and automation tasks in JavaScript. So don't
+The goal of Crawlee is to fill this gap and provide a toolbox for generic web scraping, crawling and automation tasks in JavaScript. So don't
 reinvent the wheel every time you need data from the web, and focus on writing code specific to the target website, rather than developing
@@ -47,25 +47,27 @@ commonalities.
-The Apify SDK is available as the [`apify`](https://www.npmjs.com/package/apify) NPM package and it provides the following tools:
+Crawlee is available as the [`crawlee`](https://www.npmjs.com/package/crawlee) NPM package and is also available via `@crawlee/*` packages. It provides the following tools:
+[//]: # (TODO add links to the docs about `@crawlee/` packages and the `crawlee` metapackage)
-- [`CheerioCrawler`](https://sdk.apify.com/docs/api/cheerio-crawler) - Enables the parallel crawling of a large
+- [`CheerioCrawler`](https://apify.github.io/apify-ts/api/cheerio-crawler/class/CheerioCrawler) - Enables the parallel crawling of a large
   number of web pages using the [cheerio](https://www.npmjs.com/package/cheerio) HTML parser. This is the most
   efficient web crawler, but it does not work on websites that require JavaScript.
-- [`PuppeteerCrawler`](https://sdk.apify.com/docs/api/puppeteer-crawler) - Enables the parallel crawling of
+- [`PuppeteerCrawler`](https://apify.github.io/apify-ts/api/puppeteer-crawler/class/PuppeteerCrawler) - Enables the parallel crawling of
   a large number of web pages using the headless Chrome browser and [Puppeteer](https://github.com/puppeteer/puppeteer).
   The pool of Chrome browsers is automatically scaled up and down based on available system resources.
-- [`PlaywrightCrawler`](https://sdk.apify.com/docs/api/playwright-crawler) - Unlike `PuppeteerCrawler`
+- [`PlaywrightCrawler`](https://apify.github.io/apify-ts/api/playwright-crawler/class/PlaywrightCrawler) - Unlike `PuppeteerCrawler`
   you can use [Playwright](https://github.com/microsoft/playwright) to manage almost any headless browser.
   It also provides a cleaner and more mature interface while keeping the ease of use and advanced features.
-- [`BasicCrawler`](https://sdk.apify.com/docs/api/basic-crawler) - Provides a simple framework for the parallel
+- [`BasicCrawler`](https://apify.github.io/apify-ts/api/basic-crawler/class/BasicCrawler) - Provides a simple framework for the parallel
   crawling of web pages whose URLs are fed either from a static list or from a dynamic queue of URLs. This class
   serves as a base for the more specialized crawlers above.
-- [`RequestList`](https://sdk.apify.com/docs/api/request-list) - Represents a list of URLs to crawl.
+- [`RequestList`](https://apify.github.io/apify-ts/api/core/class/RequestList) - Represents a list of URLs to crawl.
   The URLs can be passed in code or in a text file hosted on the web. The list persists its state so that crawling
   can resume when the Node.js process restarts.
-- [`RequestQueue`](https://sdk.apify.com/docs/api/request-queue) - Represents a queue of URLs to crawl,
+- [`RequestQueue`](https://apify.github.io/apify-ts/api/core/class/RequestQueue) - Represents a queue of URLs to crawl,
   which is stored either on a local filesystem or in the [Apify Cloud](https://apify.com). The queue is used
@@ -75,25 +77,22 @@ for deep crawling of websites, where you start with several URLs and then recursively follow links to other pages.
-- [`Dataset`](https://sdk.apify.com/docs/api/dataset) - Provides a store for structured data and enables their export
+- [`Dataset`](https://apify.github.io/apify-ts/api/core/class/Dataset) - Provides a store for structured data and enables their export
   to formats like JSON, JSONL, CSV, XML, Excel or HTML. The data is stored on a local filesystem or in the Apify Cloud.
   Datasets are useful for storing and sharing large tabular crawling results, such as a list of products or real estate offers.
-- [`KeyValueStore`](https://sdk.apify.com/docs/api/key-value-store) - A simple key-value store for arbitrary data
+- [`KeyValueStore`](https://apify.github.io/apify-ts/api/core/class/KeyValueStore) - A simple key-value store for arbitrary data
   records or files, along with their MIME content type. It is ideal for saving screenshots of web pages, PDFs
   or to persist the state of your crawlers. The data is stored on a local filesystem or in the Apify Cloud.
-- [`AutoscaledPool`](https://sdk.apify.com/docs/api/autoscaled-pool) - Runs asynchronous background tasks,
+- [`AutoscaledPool`](https://apify.github.io/apify-ts/api/core/class/AutoscaledPool) - Runs asynchronous background tasks,
   while automatically adjusting the concurrency based on free system memory and CPU usage. This is useful for running
   web scraping tasks at the maximum capacity of the system.
-- [`Browser Utils`](https://sdk.apify.com/docs/api/puppeteer) - Provides several helper functions useful
-  for web scraping. For example, to inject jQuery into web pages or to hide browser origin.
 Additionally, the package provides various helper functions to simplify running your code on the Apify Cloud and thus
 take advantage of its pool of proxies, job scheduler, data storage, etc.
-For more information, see the [Apify SDK Programmer's Reference](https://sdk.apify.com).
+For more information, see the [Crawlee Programmer's Reference](https://apify.github.io/apify-ts/).
 ## Quick Start
-This short tutorial will set you up to start using Apify SDK in a minute or two.
-If you want to learn more, proceed to the [Getting Started](https://sdk.apify.com/docs/guides/getting-started)
+This short tutorial will set you up to start using Crawlee in a minute or two.
+If you want to learn more, proceed to the [Getting Started](https://apify.github.io/apify-ts/docs/guides/getting-started)
 tutorial that will take you step by step through creating your first scraper.
@@ -103,50 +102,38 @@
-Apify SDK requires [Node.js](https://nodejs.org/en/) 15.10 or later.
-Add Apify SDK to any Node.js project by running:
+Crawlee requires [Node.js](https://nodejs.org/en/) 16 or later.
+Add Crawlee to any Node.js project by running:
 ```bash
-npm install apify playwright
+npm install @crawlee/playwright playwright
 ```
-> Neither `playwright` nor `puppeteer` are bundled with the SDK to reduce install size and allow greater
-> flexibility. That's why we install it with NPM. You can choose one, both, or neither.
+> Neither `playwright` nor `puppeteer` are bundled with the SDK to reduce install size and allow greater flexibility. That's why we install it with NPM. You can choose one, both, or neither.
-Run the following example to perform a recursive crawl of a website using Playwright. For more examples showcasing various features of the Apify SDK,
-[see the Examples section of the documentation](https://sdk.apify.com/docs/examples/crawl-multiple-urls).
+Run the following example to perform a recursive crawl of a website using Playwright. For more examples showcasing various features of Crawlee,
+[see the Examples section of the documentation](https://apify.github.io/apify-ts/docs/examples/crawl-multiple-urls).
 ```javascript
-const Apify = require('apify');
-
-// Apify.main is a helper function, you don't need to use it.
-Apify.main(async () => {
-    const requestQueue = await Apify.openRequestQueue();
-    // Choose the first URL to open.
-    await requestQueue.addRequest({ url: 'https://www.iana.org/' });
-
-    const crawler = new Apify.PlaywrightCrawler({
-        requestQueue,
-        handlePageFunction: async ({ request, page }) => {
-            // Extract HTML title of the page.
-            const title = await page.title();
-            console.log(`Title of ${request.url}: ${title}`);
-
-            // Add URLs that match the provided pattern.
-            await Apify.utils.enqueueLinks({
-                page,
-                requestQueue,
-                pseudoUrls: ['https://www.iana.org/[.*]'],
-            });
-        },
-    });
-
-    await crawler.run();
-});
+import { PlaywrightCrawler } from '@crawlee/playwright';
+
+const crawler = new PlaywrightCrawler({
+    async requestHandler({ request, page, enqueueLinks }) {
+        // Extract HTML title of the page.
+        const title = await page.title();
+        console.log(`Title of ${request.url}: ${title}`);
+
+        // Add URLs from the same subdomain.
+        await enqueueLinks();
+    },
+});
+
+// Choose the first URL to open and run the crawler.
+await crawler.addRequests(['https://www.iana.org/']);
+await crawler.run();
 ```
-When you run the example, you should see Apify SDK automating a Chrome browser.
+When you run the example, you should see Crawlee automating a Chrome browser.
-![Chrome Scrape](https://sdk.apify.com/img/chrome_scrape.gif)
+![Chrome Scrape](https://apify.github.io/apify-ts/img/chrome_scrape.gif)
-By default, Apify SDK stores data to `./apify_storage` in the current working directory. You can override this behavior by setting either the
-`APIFY_LOCAL_STORAGE_DIR` or `APIFY_TOKEN` environment variable. For details, see [Environment variables](https://sdk.apify.com/docs/guides/environment-variables), [Request storage](https://sdk.apify.com/docs/guides/request-storage) and [Result storage](https://sdk.apify.com/docs/guides/result-storage).
+By default, Crawlee stores data to `./crawlee_storage` in the current working directory. You can override this directory via the `CRAWLEE_STORAGE_DIR` env var. For details, see [Environment variables](https://apify.github.io/apify-ts/docs/guides/environment-variables), [Request storage](https://apify.github.io/apify-ts/docs/guides/request-storage) and [Result storage](https://apify.github.io/apify-ts/docs/guides/result-storage).
@@ -156,3 +143,3 @@ ### Local usage with Apify command-line interface (CLI)
 To avoid the need to set the environment variables manually, to create a boilerplate of your project, and to enable pushing and running your code on
-the [Apify platform](https://sdk.apify.com/docs/guides/apify-platform), you can use the [Apify command-line interface (CLI)](https://github.com/apify/apify-cli) tool.
+the [Apify platform](https://apify.github.io/apify-ts/docs/guides/apify-platform), you can use the [Apify command-line interface (CLI)](https://github.com/apify/apify-cli) tool.
@@ -197,3 +184,3 @@ Install the CLI by running:
-You can also develop your web scraping project in an online code editor directly on the [Apify platform](https://sdk.apify.com/docs/guides/apify-platform).
+You can also develop your web scraping project in an online code editor directly on the [Apify platform](https://apify.github.io/apify-ts/docs/guides/apify-platform).
 You'll need to have an Apify Account. Go to [Actors](https://console.apify.com/actors), page in the Apify Console, click <i>Create new</i>
@@ -206,3 +193,3 @@ and then go to the <i>Source</i> tab and start writing your code or paste one of the examples from the Examples section.
-If you find any bug or issue with the Apify SDK, please [submit an issue on GitHub](https://github.com/apify/apify-js/issues).
+If you find any bug or issue with Crawlee, please [submit an issue on GitHub](https://github.com/apify/apify-js/issues).
 For questions, you can ask on [Stack Overflow](https://stackoverflow.com/questions/tagged/apify) or contact support@apify.com
@@ -209,0 +196,0 @@
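A note on the example rewrite in this diff: the old snippet filtered links explicitly with a `pseudoUrls` pattern, while the new bare `enqueueLinks()` call defaults to enqueueing links from the same hostname. A minimal plain-JavaScript sketch of that same-hostname idea (an illustration only, not Crawlee's actual implementation; the `filterSameHostname` helper is hypothetical):

```javascript
// Hypothetical helper illustrating same-hostname link filtering,
// the default behavior of Crawlee's enqueueLinks(). Not Crawlee source.
function filterSameHostname(baseUrl, links) {
    const baseHost = new URL(baseUrl).hostname;
    return links.filter((link) => {
        try {
            // Relative URLs resolve against the base before comparison.
            return new URL(link, baseUrl).hostname === baseHost;
        } catch {
            return false; // skip malformed URLs
        }
    });
}

const links = [
    'https://www.iana.org/domains',
    'https://www.iana.org/numbers',
    'https://example.com/other',
    '/protocols',
];

// Keeps the two absolute iana.org links plus the relative path.
console.log(filterSameHostname('https://www.iana.org/', links));
```

The old `pseudoUrls: ['https://www.iana.org/[.*]']` pattern expressed the same intent by hand; the new API makes the common case the default.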
License Policy Violation
License: This package is not allowed per your license policy. Review the package's license to ensure compliance.
Found 1 instance in 1 package