CrawleeOne
The web scraping framework you can't refuse.
CrawleeOne is a feature-rich and highly configurable web scraping framework that empowers both scraper developers and their users.
It is built on top of Crawlee and Apify*. Read here for a recap of how Crawlee and Apify work.
The appeal of CrawleeOne is that it works seamlessly with the Apify platform,
but can also be easily re-purposed to work with other web scraping platforms or your custom services.
When deployed to Apify, or otherwise made available to be used by others,
the users of your scraper will have the freedom to transform, filter, limit, or otherwise
modify both the scraped data and the requests to scrape.
CrawleeOne should especially be your choice if:
- You're developing a long-lasting integration.
- Or your scraper will be part of a data pipeline.
- Or you wish to make your scrapers available to others in your team / org, whether programmatically or via the Apify UI.
NOTE: crawleeOne allows you to easily switch between different implementations - Playwright, Cheerio, Puppeteer, ...
However, you still need to write data extraction logic that's specific to the implementation.
To make the transition between different implementations seamless, you can use portadom, which offers a single interface across all these implementations.
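To illustrate the idea, extraction logic can be written once against a single DOM interface and reused across crawler types. This is only a sketch - the findOne and text methods below are assumptions for the sake of the example, not portadom's documented API:

// Extraction logic written once, against a generic DOM interface.
// NOTE: `findOne` and `text` are assumed method names for illustration;
// check portadom's docs for the actual API.
type Dom = { findOne: (selector: string) => { text: () => Promise<string | null> } };

const extractTitle = async (dom: Dom) => {
  const title = await dom.findOne('h1').text();
  return { title };
};

// The same extractTitle can then be called from Cheerio-, Playwright-,
// or Puppeteer-based handlers, with only the DOM wrapper differing.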
Prerequisites
To make the most of CrawleeOne, you should be familiar with:
- Crawlee (AKA how to scrape data).
- Apify platform (AKA how to manage a scraped dataset and request queue).
Table of contents
Minimal example
The following example defines a CheerioCrawler scraper with 2 routes (mainPage and otherPage) that process the incoming URLs based either on the URL or on the page HTML.
pushData is used to save the scraped data, while pushRequests enqueues more URLs to be scraped.
import { crawleeOne } from 'crawlee-one';
await crawleeOne({
type: 'cheerio',
routes: {
mainPage: {
match: /example\.com\/home/i,
handler: async (ctx) => {
const { $, request, pushData, pushRequests } = ctx;
const data = [ ... ];
await pushData(data, {
privacyMask: { author: true },
});
const reqs = ['https://...'].map((url) => ({ url }));
await pushRequests(reqs);
},
},
otherPage: {
match: (url, ctx) => url.startsWith('/') && ctx.$('.author').length,
handler: async (ctx) => {
},
},
},
hooks: {
onReady: async (inst) => {
await inst.runCrawler(['https://...']);
},
},
});
If you're familiar with Crawlee,
the minimal example above is roughly equivalent to:
import { Actor } from 'apify';
import { CheerioCrawler, createCheerioRouter } from 'crawlee';
await Actor.main(async () => {
const rawInput = await Actor.getInput();
const input = {
...rawInput,
...(await fetchInput(rawInput.inputFromUrl)),
...(await runFunc(rawInput.inputFromFunc)),
};
const router = createCheerioRouter();
router.addHandler('mainPage', async (ctx) => {
await onBeforeHandler(ctx);
const data = [ ... ];
const finalData = await transformAndFilterDataWithUserInput(data, ctx, input);
const dataset = await Actor.openDataset(input.datasetId);
await dataset.pushData(finalData);
const reqs = ['https://...'].map((url) => ({ url }));
const finalReqs = await transformAndFilterReqsWithUserInput(reqs, ctx, input);
const queue = await Actor.openRequestQueue(input.requestQueueId);
await queue.addRequests(finalReqs);
await onAfterHandler(ctx);
});
router.addDefaultHandler(async (ctx) => {
await onBeforeHandler(ctx);
const url = ctx.request.loadedUrl || ctx.request.url;
if (url.match(/example\.com\/home/i)) {
const req = { url, userData: { label: 'mainPage' } };
const finalReqs = await transformAndFilterReqsWithUserInput([req], ctx, input);
const queue = await Actor.openRequestQueue(input.requestQueueId);
await queue.addRequests(finalReqs);
}
await onAfterHandler(ctx);
});
const crawler = new CheerioCrawler({
...input,
requestHandler: router,
});
if (onReadyFn) await onReadyFn({ crawler, router, input });
else await crawler.run(['https://...']);
});
As you can see, there's a lot going on behind the scenes, and that's far from everything.
* Apify can be replaced with your own implementation, so the data can be sent elsewhere, not just to Apify. This is set by the io option.
What can CrawleeOne do?
Besides the main crawleeOne
function for running crawlers,
CrawleeOne also includes helpers and types for:
- Actor boilerplating
- Code generation
- Configuring logging and error handling
- E.g. Save errors to separate dataset or send to telemetry
- Data and request filtering and post-processing
- E.g. Enrich data with metadata
- Routing
- Testing actors
- Actor migration (conceptually similar to database migration)
- CLI utility for updating actors via apify-client
- Privacy compliance
- Metamorphing
CrawleeOne supports many common and advanced web scraping use cases; see Use cases for an overview.
See the section Usage (for end users) for how CrawleeOne looks from the user's perspective.
Playbook & Use cases
Web crawlers written with CrawleeOne can be configured via their input
field to handle a range of advanced use cases.
Actor input reference
See here the full list of all input options that a CrawleeOne crawler can have.
All of these can be configured via the crawler's input.
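For illustration, here is a sketch of what a crawler's input could look like, using only fields that appear in the examples in this document (check the input reference for exact names, types, and defaults; function fields are written as strings when the input is passed as JSON):

{
  "startUrls": ["https://example.com/home"],
  "requestHandlerTimeoutSecs": 180,
  "datasetId": "my-dataset",
  "requestQueueId": "my-queue",
  "outputTransform": "(item) => ({ ...item, scrapedAt: new Date().toISOString() })"
}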
Usage (for developers)
Let's revisit the previous example, this time with more options and explanations:
import { Actor } from 'apify';
import { crawleeOne, apifyIO, createSentryTelemetry } from 'crawlee-one';
await crawleeOne({
type: 'cheerio',
// Input overrides defined in code
input: {
outputTransform: (item) => { ... },
},
// Default input values, used unless overridden
inputDefaults: {
},
// Whether the inputs from the different sources should be merged
mergeInput: true,
// Or, to control how the input sources are merged:
// mergeInput: ({ defaults, overrides, env }) => ({ ...defaults, ...env, ...overrides }),
// Crawlee crawler options
crawlerConfig: {
maxRequestsPerMinute: 120,
requestHandlerTimeoutSecs: 180,
headless: true,
},
// Default crawler options, used unless overridden
crawlerConfigDefaults: {
},
routes: {
mainPage: {
match: /example\.com\/home/i,
handler: async (ctx) => {
const { $, request, pushData, pushRequests } = ctx;
const data = [ ... ];
await pushData(data, {
privacyMask: { author: true },
});
const reqs = ['https://...'].map((url) => ({ url }));
await pushRequests(reqs);
},
},
},
hooks: {
// Custom startup logic - replaces the default `crawler.run(startUrls)`
onReady: async (actor) => {
const startUrls: string[] = [];
if (!actor.startUrls.length && actor.input?.datasetType) {
startUrls.push(datasetTypeToUrl[actor.input?.datasetType]);
}
await actor.runCrawler(startUrls);
},
// Run code before / after every route handler
onBeforeHandler: (ctx) => { },
onAfterHandler: (ctx) => { },
// Validate the resolved input, e.g. with Joi
validateInput: (input) => {
const schema = Joi.object({ ... });
Joi.assert(input, schema);
},
},
// Proxy configuration (here created via Apify)
proxy: await Actor.createProxyConfiguration({ ... }),
// Error tracking (see Custom telemetry integration below)
telemetry: createSentryTelemetry({
dsn: 'https://xxxxxxxxxxxxxxxxxxxxxxx@yyyyyyy.ingest.sentry.io/zzzzzzzzzzzzzzzzzzzzz',
tracesSampleRate: 1.0,
serverName: 'myCrawler',
}),
// Platform / storage integration (see Custom platform and storage integration below)
io: apifyIO,
// Optionally, pass your own Crawlee router instance
router: myCustomRouter(),
});
You can find the full type definition of crawleeOne and its arguments here.
To learn more about pushData and pushRequests, see:
- pushData
  - NOTE: When you use pushData from within a handler, you omit the first argument (ctx).
- pushRequests
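For instance, inside a handler (using only the options that appear elsewhere in this document):

const items = [{ author: 'Jane Doe', title: '...' }];
// Save items, marking the `author` field as personal data (see Privacy compliance)
await ctx.pushData(items, { privacyMask: { author: true } });
// Enqueue more URLs, targeting a specific (e.g. shared) request queue
await ctx.pushRequests([{ url: 'https://example.com/next' }], { requestQueueId: 'myQueue' });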
Route handler context
Each route handler receives a context object, as defined by Crawlee Router.
CrawleeOne extends this context object with extra properties:
await crawleeOne({
routes: {
mainPage: {
match: /example\.com\/page/i,
handler: async (ctx) => {
// Crawlee's context: logger, request, response, Cheerio parsing, ...
ctx.log.info('bla bla...');
const url = ctx.request.loadedUrl || ctx.request.url;
ctx.response;
const $ = await ctx.parseWithCheerio();
// CrawleeOne's `actor` object
await ctx.actor.pushData(scrapedItems);
const id = Math.floor(Math.random() * 100);
const detailUrl = `https://example.com/resource/${id}`;
await ctx.actor.pushRequests([{ url: detailUrl }]);
// Storages, via the `io` integration (Apify by default)
const dataset = await ctx.actor.io.openDataset();
const reqQueue = await ctx.actor.io.openRequestQueue();
const keyValStore = await ctx.actor.io.openKeyValueStore();
// The resolved actor input and start URLs
if (ctx.actor.input.myCustomInput) {
}
if (ctx.actor.startUrls.length) {
}
// State shared across handlers
ctx.actor.state.myVar = 1;
// Convenience shorthands available directly on the context
await ctx.pushData(scrapedItems);
await ctx.pushRequests(urlsToScrape);
await ctx.metamorph('nextCrawlerId', ...);
},
}
},
});
The actor object is integral to CrawleeOne. See here the full list of its properties.
Deploying to Apify
See either of the two projects as examples:
1. Write the crawler with CrawleeOne
Either use the example projects above or use your own boilerplate project, but remember that Apify requires you to Dockerize the project for it to be deployed on their platform.
Remember to install the crawlee-one package:
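npm i crawlee-one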
2. Define the crawler's input
You need to tell Apify what kind of input can be passed to your crawler.
This is done by defining the actor.json file.
You need to set this if you want to support the described use cases.
For that, you will need to:
- Install apify-actor-config as a dev dependency:
npm i -D apify-actor-config
apify-actor-config is a sister package focused solely on working with and generating Apify's actor.json config files.
- Write a JS/TS file where you will only define your config and export it as the default export.
See here the example config file from Profesia.sk Scraper.
Note that to make use of the CrawleeOne inputs, we need to import allActorInputs and pass it to the properties field of createActorInputSchema.
import { allActorInputs } from 'crawlee-one';
import { createActorConfig, createActorInputSchema } from 'apify-actor-config';
const inputSchema = createActorInputSchema({
schemaVersion: 1,
properties: {
// customActorInput - your own scraper-specific input fields (defined elsewhere)
...customActorInput,
...allActorInputs,
},
});
const config = createActorConfig({
actorSpecification: 1,
input: inputSchema,
});
export default config;
Also note that we are able to override the defaults set in allActorInputs by directly modifying the object:
allActorInputs.requestHandlerTimeoutSecs.prefill = 60 * 3;
- Build / transpile the config to vanilla JS if necessary.
In Profesia.sk Scraper, the config is defined as a TypeScript file, but apify-actor-config currently supports only JS files.
So if you are using anything other than plain JavaScript, you will need to build / transpile your project. Do so only once you're happy with the input fields and their defaults.
- Generate the actor.json file.
Run the npx apify-actor-config gen command and point it to the config JS file:
npx apify-actor-config gen -c ./path/to/dist/config.js
Optionally, set this as a script in package.json:
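For example (the script name and the path to the built config file are illustrative):

{
  "scripts": {
    "gen:config": "apify-actor-config gen -c ./path/to/dist/config.js"
  }
}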
The command should generate a config file in ./actor/actor.json, with all the inputs from crawlee-one. 🚀
- Deploy the project to Apify.
Now head over to Apify to deploy the crawler there. See their docs on deployment.
- Verify that the crawler offers all the inputs.
When you now go to see your crawler on Apify, you should see that you can configure all kinds of inputs. Congrats, you've got it working! 🚀
See the screenshot in the next section (Usage (for end users)) to see how the input looks in the Apify UI.
Usage (for end users)
As a user of a crawler that was written with CrawleeOne, you have the option to
configure the crawler, and transform, filter & limit the scraped data and the "requests" (URLs to scrape).
CrawleeOne crawlers allow you to do practically anything with the scraped data.
See the common use cases here.
See here for how to use CrawleeOne web scrapers through the Apify platform.
Codegen & Config file
With CrawleeOne, you can generate TypeScript types and helper functions to create new instances of CrawleeOne with full type support.
With these types:
- You get fully-typed scraper definition.
- You can easily split the project across multiple files, as the corresponding types can be imported.
The final result can look like this:
import { profesiaRoute } from './__generated__/crawler';
const otherPageRoute: profesiaRoute = {
match: (url) => url.match(/example\.com\/home/i),
handler: async (ctx) => {
await ctx.pushData(...);
},
};
import { profesiaCrawler, profesiaRoute } from './__generated__/crawler';
import { otherPageRoute } from './routes';
await profesiaCrawler({
hooks: {
validateInput,
},
routes: {
mainPage: {
match: /example\.com\/home/i,
handler: (ctx) => {
ctx.parseWithCheerio();
},
},
otherPage: otherPageRoute,
},
});
1. Define the crawler schema in a config
To get started, you need to define the scraper schema. The config may look like this:
// E.g. in crawlee-one.config.js
module.exports = {
version: 1,
schema: {
crawlers: {
mainCrawler: {
type: 'playwright',
routes: ['listingPage', 'detailPage'],
},
},
},
};
Here is an example where the config is written in YAML and defines multiple crawlers:
version: 1
schema:
  crawlers:
    main:
      type: 'playwright'
      routes: ['listingPage', 'detailPage']
    other:
      type: 'cheerio'
      routes: ['someNoJSPage']
CrawleeOne uses cosmiconfig to import the config. This means that you can define the config as any of the following:
- A crawlee-one property in package.json
- A .crawlee-onerc file in JSON or YAML format
- A .crawlee-onerc.json, .crawlee-onerc.yaml, .crawlee-onerc.yml, .crawlee-onerc.js, .crawlee-onerc.ts, .crawlee-onerc.mjs, or .crawlee-onerc.cjs file
- A crawlee-onerc, crawlee-onerc.json, crawlee-onerc.yaml, crawlee-onerc.yml, crawlee-onerc.js, crawlee-onerc.ts, or crawlee-onerc.cjs file inside a .config subdirectory
- A crawlee-one.config.js, crawlee-one.config.ts, crawlee-one.config.mjs, or crawlee-one.config.cjs file
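For example, the same schema defined as a crawlee-one property in package.json would look like this:

{
  "crawlee-one": {
    "version": 1,
    "schema": {
      "crawlers": {
        "mainCrawler": {
          "type": "playwright",
          "routes": ["listingPage", "detailPage"]
        }
      }
    }
  }
}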
2. Generate types
To generate the types from the config, run the generate
command:
npx crawlee-one generate -o ./path/to/__generated__/file.ts
3. Use generated types
Once generated, we can use the types right away:
import { mainCrawler } from './__generated__/file.ts';
await mainCrawler({
routes: {
listingPage: {
match: /example\.com\/home/i,
handler: (ctx) => {
ctx.parseWithCheerio();
},
},
detailPage: {
},
},
});
Or we can even run multiple crawlers simultaneously. This can be useful in cases where for some pages you need browser automation like Playwright, whereas for others you don't.
import { mainCrawler, otherCrawler } from './__generated__/file.ts';
const mainPromise = mainCrawler({
routes: {
listingPage: {
match: /example\.com\/home/i,
handler: async (ctx) => {
ctx.page.locator('...');
await ctx.pushRequests([{ url: ... }], { requestQueueId: 'crawleeQueue' });
},
},
detailPage: {
},
},
});
const otherPromise = otherCrawler({
input: {
requestQueueId: 'crawleeQueue',
},
routes: {
someNoJSPage: {
match: /example\.com\/home/i,
handler: async (ctx) => {
ctx.parseWithCheerio();
await ctx.pushData(...)
},
},
},
});
await Promise.all([mainPromise, otherPromise]);
Custom telemetry integration (CrawleeOneTelemetry)
You may want to track errors to a custom service. In that case, you can define and pass
a custom telemetry instance to the telemetry
argument of
crawleeOne
.
The instance needs to implement the
CrawleeOneTelemetry
interface:
interface CrawleeOneTelemetry {
setup: (actor: CrawleeOneActorInst) => Promise<void> | void;
onSendErrorToTelemetry: (
error: Error,
report: object,
options: {
io?: CrawleeOneIO;
allowScreenshot?: boolean;
reportingDatasetId?: string;
},
ctx: CrawleeOneCtx
) => Promise<void> | void;
}
See existing integrations for inspiration:
Based on the above, here's an example of a custom telemetry implementation
that saves the errors to the local file system:
import fs from 'fs';
import type { CrawleeOneCtx, CrawleeOneTelemetry } from 'crawlee-one';
export const createFsTelemetry = <T extends CrawleeOneTelemetry<CrawleeOneCtx>>() => {
const timestamp = new Date().getTime();
let errors = 0;
return {
setup: async (actor) => {
await fs.promises.mkdir('./temp/error', { recursive: true });
},
onSendErrorToTelemetry: async (error, report, options, ctx) => {
const filename = timestamp + '_' + (errors++).toString().padStart(5, '0') + '.json';
const data = JSON.stringify({ error, report });
await fs.promises.writeFile(`./temp/error/${filename}`, data, 'utf-8');
},
} as T;
};
await crawleeOne({
telemetry: createFsTelemetry(),
});
Custom platform and storage integration (CrawleeOneIO)
By default, CrawleeOne uses
Apify
to manage datasets, request queue, and other platform-specific features.
In most of the cases, this should be fine, because Apify uses local file system
when the crawler is not running inside Apify's cloud platform.
Sometimes, you may want to send the data to a custom dataset, or use a shared service
for accessing requests or cache storage, or otherwise override the default behaviour.
In those cases, you can define and pass a custom
CrawleeOneIO
instance to the io
argument of
crawleeOne
.
The instance needs to implement the
CrawleeOneIO
interface:
interface CrawleeOneIO {
openDataset: (id?: string | null) => MaybePromise<CrawleeOneDataset>;
openRequestQueue: (id?: string | null) => MaybePromise<CrawleeOneRequestQueue>;
openKeyValueStore: (id?: string | null) => MaybePromise<CrawleeOneKeyValueStore>;
getInput: () => Promise<Input | null>;
triggerDownstreamCrawler: (
targetActorId: string,
input?: TInput,
options?: {
build?: string;
}
) => Promise<void>;
runInContext: (userFunc: () => MaybePromise<unknown>, options?: ExitOptions) => Promise<void>;
createDefaultProxyConfiguration: (
input?: T | Readonly<T>
) => MaybePromise<ProxyConfiguration | undefined>;
isTelemetryEnabled: () => MaybePromise<boolean>;
generateErrorReport: (
input: CrawleeOneErrorHandlerInput,
options: PickRequired<CrawleeOneErrorHandlerOptions, 'io'>
) => MaybePromise<object>;
generateEntryMetadata: (ctx: Ctx) => MaybePromise<TMetadata>;
}
See existing integrations for inspiration:
Based on the above, here's an example of a custom CrawleeOneIO implementation
that overrides the datasets to send them to a custom HTTP endpoint.
import { apifyIO } from 'crawlee-one';
import type { CrawleeOneIO } from 'crawlee-one';
export const createCustomIO = (baseUrl: string) => {
const createDatasetIO = (id?: string) => {
const fetchAllItems = () => {
const endpoint = `${baseUrl}/dataset/${id ?? 'default'}/all`;
return fetch(endpoint).then((d) => d.json());
};
const postItems = (items: any[]) => {
const endpoint = `${baseUrl}/dataset/${id ?? 'default'}`;
return fetch(endpoint, {
method: 'POST',
body: JSON.stringify(items),
}).then((d) => d.json());
};
return {
pushData: postItems,
getItems: fetchAllItems,
getItemsCount: () => fetchAllItems().then((d) => d.length),
};
};
return {
...apifyIO,
openDataset: createDatasetIO,
} as CrawleeOneIO;
};
await crawleeOne({
io: createCustomIO('https://...'),
});
Example projects
Contributing
Found a bug or have a feature request? Please open a new issue.
When contributing with your code, please follow the standard best practices:
- Make a fork with your changes, then open a Merge Request to merge them
- Be polite
Supporting CrawleeOne
CrawleeOne is a labour of love. If you like what I do, you can support me on BuyMeACoffee.