@xapp/arachne
An extremely simple web crawler, based on puppeteer.
Usage in a Lambda
Puppeteer requires Chromium, and Chromium's size is typically the limiting factor when running Puppeteer in a Lambda. A Lambda Layer works around this, specifically the community-maintained chrome-aws-lambda layer.
You can reference this layer directly in your Serverless Framework (SLS) configuration file or your SAM template.
A Serverless Framework example:
functions:
  eventReceiver:
    handler: dist/index.receiver
    layers:
      - "arn:aws:lambda:us-east-1:764866452798:layer:chrome-aws-lambda:31"
In your Lambda source:
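import { Browser, LaunchOptions, BrowserConnectOptions, BrowserLaunchArgumentOptions } from "puppeteer";
import { Arachne, ArachnePage, ArachneRequest, MemoryRequestQueue } from "@xapp/arachne";

// Run this inside an async function (for example, your Lambda handler), since it uses `await`.
// `log()` stands for your application's logger.

// The crawler only needs to open pages and close the browser.
let browser: Pick<Browser, "close" | "newPage">;

try {
    log().debug('Looking for chrome-aws-lambda');
    // Provided by the Lambda Layer at runtime.
    const chromium = require('@sparticuz/chrome-aws-lambda');
    browser = await chromium.puppeteer.launch({
        args: chromium.args,
        defaultViewport: chromium.defaultViewport,
        executablePath: await chromium.executablePath,
        headless: chromium.headless,
        ignoreHTTPSErrors: true,
    });
} catch (e) {
    // If the layer is missing, `browser` stays unassigned.
    log().debug("Could not find chrome-aws-lambda layer");
    console.error(e);
}

// `launchOptions` and `queue` are not defined in this snippet; create them before this call.
const crawler = Arachne.crawler({
    stealth: true,
    launchOptions,
    queue,
    browser,
    pageHandler: async (page: ArachnePage, request: ArachneRequest) => {
        // Handle each crawled page here.
    }
});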
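The try/catch in the snippet leaves `browser` unassigned when the layer cannot be loaded. One way to make the fallback explicit is to try several launch strategies in order and use the first that succeeds. The following is only a sketch of that pattern: `BrowserLike`, `BrowserLauncher`, and `launchFirstAvailable` are hypothetical names for illustration and are not part of `@xapp/arachne` or puppeteer.

```typescript
// Minimal shape of the browser the crawler needs
// (mirrors Pick<Browser, "close" | "newPage"> from the snippet).
interface BrowserLike {
  close(): Promise<void>;
  newPage(): Promise<unknown>;
}

// A launcher is any async function that resolves to a browser or throws.
type BrowserLauncher = () => Promise<BrowserLike>;

// Hypothetical helper: try each launcher in order and return the first
// browser that starts. Throws if every launcher fails.
async function launchFirstAvailable(launchers: BrowserLauncher[]): Promise<BrowserLike> {
  const errors: unknown[] = [];
  for (const launch of launchers) {
    try {
      return await launch();
    } catch (e) {
      // Remember the failure and move on to the next strategy.
      errors.push(e);
    }
  }
  throw new Error(`No browser could be launched (${errors.length} attempt(s) failed)`);
}
```

In a Lambda, the first launcher would presumably wrap the `chromium.puppeteer.launch` call from the snippet, and a later launcher could start a locally installed browser for development. Throwing when nothing launches surfaces a missing layer immediately instead of passing an unassigned `browser` to the crawler.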
Lambda Layer Resources