@crawlee/browser-pool

Rotate multiple browsers using popular automation libraries such as Playwright or Puppeteer.

  • 3.0.0-alpha.3
Browser Pool - the headless browser manager

Browser Pool is a small but powerful and extensible library that lets you control multiple headless browsers at the same time with only a little configuration and a single function call. It currently supports Puppeteer and Playwright, and it can be easily extended with plugins.

We created Browser Pool because we regularly needed to execute tasks concurrently in many headless browsers and their pages, but we did not want to worry about launching browsers, closing browsers, restarting them after crashes and so on. We also wanted to easily and reliably manage the whole browser / page lifecycle.

You can use Browser Pool for scraping the internet at scale, testing your website in multiple browsers at the same time or launching web automation robots.

Installation

Use NPM or Yarn to install @crawlee/browser-pool. Note that @crawlee/browser-pool does not come preinstalled with browser automation libraries. This allows you to choose your own libraries and their versions, and it also makes @crawlee/browser-pool much smaller.

Run this command to install @crawlee/browser-pool and the playwright browser automation library.

npm install @crawlee/browser-pool playwright

Usage

This simple example shows how to open a page in a browser using Browser Pool. We use the provided PlaywrightPlugin to wrap a Playwright installation of your own. By calling browserPool.newPage() you launch a new Chromium browser and open a new page in that browser.

const { BrowserPool, PlaywrightPlugin } = require('@crawlee/browser-pool');
const playwright = require('playwright');

const browserPool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(playwright.chromium)],
});

// An asynchronous IIFE (immediately invoked function expression)
// allows us to use the 'await' keyword.
(async () => {
    // Launches Chromium with Playwright and returns a Playwright Page.
    const page1 = await browserPool.newPage();
    // You can interact with the page as you're used to.
    await page1.goto('https://example.com');
    // When you're done, close the page.
    await page1.close();

    // Opens a second page in the same browser.
    const page2 = await browserPool.newPage();
    // When everything's finished, tear down the pool.
    await browserPool.destroy();
})();

Browser Pool uses the same asynchronous API as the underlying automation libraries which means extensive use of Promises and the async / await pattern. Visit MDN to learn more.

Launching multiple browsers

The basic example shows how to launch a single browser, but the purpose of Browser Pool is to launch many browsers. This is done automatically in the background. You only need to provide the relevant plugins and call browserPool.newPage().

const { BrowserPool, PlaywrightPlugin } = require('@crawlee/browser-pool');
const playwright = require('playwright');

const browserPool = new BrowserPool({
    browserPlugins: [
        new PlaywrightPlugin(playwright.chromium),
        new PlaywrightPlugin(playwright.firefox),
        new PlaywrightPlugin(playwright.webkit),
    ],
});

(async () => {
    // Open 4 pages in 3 browsers. The browsers are launched
    // in a round-robin fashion based on the plugin order.
    const chromiumPage = await browserPool.newPage();
    const firefoxPage = await browserPool.newPage();
    const webkitPage = await browserPool.newPage();
    const chromiumPage2 = await browserPool.newPage();

    // Don't forget to close pages / destroy pool when you're done.
})();
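
The round-robin rotation mentioned in the comment above can be pictured with a small stand-alone sketch. It does not use Browser Pool at all: createRoundRobin is a hypothetical helper, plain strings stand in for the plugins, and the real library may choose differently in some situations (for example, when a browser's page limit is reached).

```javascript
// Hypothetical helper, not part of @crawlee/browser-pool:
// returns a function that cycles through `items` in order.
function createRoundRobin(items) {
    let index = 0;
    return () => {
        const item = items[index];
        index = (index + 1) % items.length;
        return item;
    };
}

// Plain strings stand in for the three plugins.
const nextPlugin = createRoundRobin(['chromium', 'firefox', 'webkit']);
const picked = [nextPlugin(), nextPlugin(), nextPlugin(), nextPlugin()];
console.log(picked); // [ 'chromium', 'firefox', 'webkit', 'chromium' ]
```

The fourth call wraps around to the first entry, which matches the chromiumPage2 variable in the example above.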

This round-robin way of opening pages may not be useful if you need to consistently run tasks in multiple environments. For that, there's the newPageWithEachPlugin function.

const { BrowserPool, PlaywrightPlugin, PuppeteerPlugin } = require('@crawlee/browser-pool');
const playwright = require('playwright');
const puppeteer = require('puppeteer');

const browserPool = new BrowserPool({
    browserPlugins: [
        new PlaywrightPlugin(playwright.chromium),
        new PuppeteerPlugin(puppeteer),
    ],
});

(async () => {
    const pages = await browserPool.newPageWithEachPlugin();
    const promises = pages.map(async page => {
        // Run some task with each page
        // pages are in order of plugins:
        // [playwrightPage, puppeteerPage]
        await page.close();
    });
    await Promise.all(promises);

    // Continue with some more work.
})();

Features

Besides a simple interface for launching browsers, Browser Pool includes other helpful features that make browser management more convenient.

Simple configuration

You can easily set the maximum number of pages that can be open in a given browser and also the maximum number of pages to process before a browser is retired.

const browserPool = new BrowserPool({
    maxOpenPagesPerBrowser: 20,
    retireBrowserAfterPageCount: 100,
});
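
To see how the two limits might interact, here is a stand-alone simulation. Nothing in it comes from @crawlee/browser-pool: createPoolModel is a hypothetical model, and it assumes that a browser accepts a page only while it is not retired and is below the open-page limit, and that it is retired once it has opened the configured total number of pages.

```javascript
// Stand-alone simulation; none of these names come from @crawlee/browser-pool.
function createPoolModel({ maxOpenPagesPerBrowser, retireBrowserAfterPageCount }) {
    const browsers = [];
    return {
        openPage() {
            // Reuse a browser that is not retired and still has a free slot.
            let browser = browsers.find(
                (b) => !b.retired && b.openPages < maxOpenPagesPerBrowser,
            );
            if (!browser) {
                browser = { id: browsers.length + 1, openPages: 0, totalPages: 0, retired: false };
                browsers.push(browser);
            }
            browser.openPages += 1;
            browser.totalPages += 1;
            // Assumption: the browser retires once it has opened enough pages.
            if (browser.totalPages >= retireBrowserAfterPageCount) browser.retired = true;
            return browser.id;
        },
        closePage(id) {
            browsers.find((b) => b.id === id).openPages -= 1;
        },
    };
}

const model = createPoolModel({ maxOpenPagesPerBrowser: 2, retireBrowserAfterPageCount: 3 });
const ids = [];
ids.push(model.openPage()); // browser 1
ids.push(model.openPage()); // browser 1, now full
model.closePage(1);
ids.push(model.openPage()); // browser 1, third page overall: retired
model.closePage(1);
ids.push(model.openPage()); // browser 1 is retired, so browser 2 launches
console.log(ids); // [ 1, 1, 1, 2 ]
```

Small limits are used here only to keep the trace short.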

You can configure the browser launch options either right in the plugins:

const playwrightPlugin = new PlaywrightPlugin(playwright.chromium, {
    launchOptions: {
        headless: true,
    }
});

Or dynamically in pre-launch hooks:

const browserPool = new BrowserPool({
    preLaunchHooks: [(pageId, launchContext) => {
        if (pageId === 'headful') {
            launchContext.launchOptions.headless = false;
        }
    }]
});

Proxy management

When scraping at scale or testing websites from multiple geolocations, one often needs to use proxy servers. Setting up an authenticated proxy in Puppeteer can be cumbersome, so we created a helper that does all the heavy lifting for you. Simply provide a proxy URL with authentication credentials, and you're done. It works the same for Playwright too.

const puppeteerPlugin = new PuppeteerPlugin(puppeteer, {
    proxyUrl: 'http://<username>:<password>@proxy.com:8000'
});

We plan to extend this by adding a proxy-per-page functionality, allowing you to rotate proxies per page, rather than per browser.
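
Part of what makes authenticated proxies cumbersome is that the credentials have to be separated from the server address. The sketch below shows that split using only the URL class built into Node.js. It is not how Browser Pool processes proxyUrl internally, and the host and credentials are placeholders.

```javascript
// Uses only the WHATWG URL class that ships with Node.js.
// The credentials and host below are placeholders.
const proxy = new URL('http://user:secret@proxy.example.com:8000');

const proxyParts = {
    // Server address without the credentials.
    server: `${proxy.protocol}//${proxy.host}`,
    // Credentials are percent-encoded inside a URL, so decode them.
    username: decodeURIComponent(proxy.username),
    password: decodeURIComponent(proxy.password),
};
console.log(proxyParts);
```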

Lifecycle management with hooks

Browser Pool allows you to manage the full browser / page lifecycle by attaching hooks to the most important events. Asynchronous hooks are supported, and their execution order is guaranteed.

The first parameter of each hook is either a pageId for the hooks executed before a page is created or a page afterwards. This is useful to keep track of which hook was triggered by which newPage() call.

const browserPool = new BrowserPool({
    browserPlugins: [
        new PlaywrightPlugin(playwright.chromium),
    ],
    preLaunchHooks: [(pageId, launchContext) => {
        // You can use pre-launch hooks to make dynamic changes
        // to the launchContext, such as changing a proxyUrl
        // or updating the browser launchOptions

        pageId === 'my-page' // true
    }],
    postPageCreateHooks: [(page, browserController) => {
        // It makes sense to make global changes to pages
        // in post-page-create hooks. For example, you can
        // inject some JavaScript library, such as jQuery.

        browserPool.getPageId(page) === 'my-page' // true
    }]
});

await browserPool.newPage({ id: 'my-page' });

See the API Documentation for all hooks and their arguments.
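
The ordering guarantee for asynchronous hooks can be illustrated without a browser. In this sketch, runHooksInOrder is a hypothetical helper that awaits each hook before starting the next. It mirrors the guarantee described above but is not the library's implementation.

```javascript
// Hypothetical helper: runs hooks one after another,
// awaiting each one before starting the next.
async function runHooksInOrder(hooks, ...args) {
    for (const hook of hooks) {
        await hook(...args);
    }
}

const callOrder = [];
const hooks = [
    async (pageId) => {
        // A slow asynchronous hook still finishes before the next one starts.
        await new Promise((resolve) => setTimeout(resolve, 20));
        callOrder.push(`slow:${pageId}`);
    },
    (pageId) => {
        callOrder.push(`fast:${pageId}`);
    },
];

const hooksDone = runHooksInOrder(hooks, 'my-page').then(() => {
    console.log(callOrder); // [ 'slow:my-page', 'fast:my-page' ]
    return callOrder;
});
```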

Manipulating playwright context using pageOptions or launchOptions

Playwright allows customizing multiple browser attributes through the browser context. You can customize some of them once the context is created, but some must be set when it is created. This part of the documentation explains how to customize the browser context effectively.

First, consider which context strategy you are using. You choose between two strategies with the useIncognitoPages LaunchContext option.

Suppose you keep the default useIncognitoPages value of false, so that one context is shared across all pages launched by a browser. In this case, you should pass the context options in launchOptions, since the context is created when the browser launches. The launchOptions correspond to these playwright options. As you can see, they contain not only ordinary Playwright launch options but also the context options.

If you set useIncognitoPages to true, a new context is created for each new page, so each page has its own cookies and application data. This lets you pass the context options as pageOptions, because a new context is created whenever you create a new page. In this case, the pageOptions correspond to these playwright options.

Changing context options with LaunchContext:

This will only work if you keep the default value for useIncognitoPages (false).

const browserPool = new BrowserPool({
    browserPlugins: [
        new PlaywrightPlugin(
            playwright.chromium,
            {
                launchOptions: {
                    deviceScaleFactor: 2,
                },
            },
        ),
    ],
});

Changing context options with browserPool.newPage options:

const browserPool = new BrowserPool({
    browserPlugins: [
        new PlaywrightPlugin(
            playwright.chromium,
            {
                useIncognitoPages: true, // You must turn on incognito pages.
                launchOptions: {
                    // launch options
                    headless: false,
                    devtools: true,
                },
            },
        ),
    ],
});

(async () => {
    // Launches Chromium with Playwright and returns a Playwright Page.
    const page = await browserPool.newPage({
        pageOptions: {
            // context options
            deviceScaleFactor: 2,
            colorScheme: 'light',
            locale: 'de-DE',
        },
    });
})();

Changing context options with prePageCreateHooks options:

const browserPool = new BrowserPool({
    browserPlugins: [
        new PlaywrightPlugin(
            playwright.chromium,
            {
                useIncognitoPages: true,
                launchOptions: {
                    // launch options
                    headless: false,
                    devtools: true,
                },
            },
        ),
    ],
    prePageCreateHooks: [
        (pageId, browserController, pageOptions) => {
            pageOptions.deviceScaleFactor = 2;
            pageOptions.colorScheme = 'dark';
            pageOptions.locale = 'de-DE';

            // You must modify the 'pageOptions' object, not assign to the variable.
            // pageOptions = {deviceScaleFactor: 2, ...etc} => This will not work!
        },
    ],
});

(async () => {
    // Launches Chromium with Playwright and returns a Playwright Page.
    const page = await browserPool.newPage();
})();
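
The warning in the hook above (modify pageOptions, do not assign to it) follows from how JavaScript passes object references. The stand-alone sketch below uses a hypothetical callHook function in place of Browser Pool to show the difference.

```javascript
// Stand-alone demonstration with a hypothetical caller; no browser involved.
function callHook(hook) {
    const pageOptions = {};
    hook('page-id', null, pageOptions);
    return pageOptions; // the caller keeps using its own object
}

// Mutating the object is visible to the caller.
const mutated = callHook((pageId, browserController, pageOptions) => {
    pageOptions.locale = 'de-DE';
});

// Reassigning the parameter only rebinds the hook's local variable.
const reassigned = callHook((pageId, browserController, pageOptions) => {
    pageOptions = { locale: 'de-DE' };
});

console.log(mutated); // { locale: 'de-DE' }
console.log(reassigned); // {}
```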

Single API for common operations

Puppeteer and Playwright handle some things differently. Browser Pool attempts to remove those differences for the most common use-cases.

// Playwright
const cookies = await context.cookies();
await context.addCookies(cookies);

// Puppeteer
const cookies = await page.cookies();
await page.setCookie(...cookies);

// BrowserPool uses the same API for all plugins
const cookies = await browserController.getCookies(page);
await browserController.setCookies(page, cookies);
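
One way to picture this unification is an adapter per library behind a shared interface. The sketch below uses in-memory stand-ins for the pages and two hypothetical controller objects. It is an illustration of the pattern, not the actual PuppeteerController or PlaywrightController code.

```javascript
// In-memory stand-ins that mimic the two libraries' cookie methods.
function createFakePlaywrightPage() {
    const jar = [];
    const context = {
        cookies: async () => [...jar],
        addCookies: async (cookies) => { jar.push(...cookies); },
    };
    return { context: () => context };
}

function createFakePuppeteerPage() {
    const jar = [];
    return {
        cookies: async () => [...jar],
        setCookie: async (...cookies) => { jar.push(...cookies); },
    };
}

// Hypothetical controllers exposing one shared interface.
const playwrightLikeController = {
    getCookies: (page) => page.context().cookies(),
    setCookies: (page, cookies) => page.context().addCookies(cookies),
};
const puppeteerLikeController = {
    getCookies: (page) => page.cookies(),
    setCookies: (page, cookies) => page.setCookie(...cookies),
};

// The calling code is identical for both controllers.
async function storeAndReadCookie(controller, page) {
    await controller.setCookies(page, [{ name: 'session', value: 'abc' }]);
    return controller.getCookies(page);
}

const cookieResults = Promise.all([
    storeAndReadCookie(playwrightLikeController, createFakePlaywrightPage()),
    storeAndReadCookie(puppeteerLikeController, createFakePuppeteerPage()),
]);
cookieResults.then((results) => console.log(results));
```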

Graceful browser closing

With Browser Pool, browsers are not closed, but retired. A retired browser will no longer open new pages, but it will wait until the open pages are closed, allowing your running tasks to finish. If a browser gets stuck in limbo, it will be killed after a timeout to prevent hanging browser processes.
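
The retire-then-close behavior can be modeled in a few lines. ModelBrowser below is a hypothetical class written for this illustration only, and it leaves out the timeout that kills a stuck browser.

```javascript
// Hypothetical model of a retired browser; not the library's classes.
class ModelBrowser {
    constructor() {
        this.openPages = new Set();
        this.retired = false;
        this.closed = false;
    }

    newPage(id) {
        if (this.retired) throw new Error('Retired browsers do not open new pages.');
        this.openPages.add(id);
    }

    retire() {
        this.retired = true;
        this.closeIfIdle();
    }

    closePage(id) {
        this.openPages.delete(id);
        this.closeIfIdle();
    }

    closeIfIdle() {
        // A retired browser closes only after its last page is gone.
        if (this.retired && this.openPages.size === 0) this.closed = true;
    }
}

const browser = new ModelBrowser();
browser.newPage('a');
browser.newPage('b');
browser.retire();
const closedWhileBusy = browser.closed; // false: two pages are still open
browser.closePage('a');
browser.closePage('b');
const closedWhenIdle = browser.closed; // true: the last page was closed
console.log({ closedWhileBusy, closedWhenIdle });
```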

Changing browser fingerprints a.k.a. browser signatures

Changing browser fingerprints helps you avoid getting blocked and simulate real user browsers. With Browser Pool, you can apply this otherwise complicated technique by enabling the useFingerprints option. By default, fingerprints are tied to their respective proxy URLs so that the same unique fingerprint is not used from multiple IP addresses. You can disable this behavior in fingerprintsOptions, where you can also control which fingerprints are generated, using parameters such as browser, operating system, and browser version.
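
Tying one fingerprint to each proxy URL amounts to a keyed cache with a size limit. The sketch below is a guess at the general idea, not the library's code: createFingerprintCache and its generate callback are hypothetical, and evicting the oldest entry when the cache is full is an assumption made for this example.

```javascript
// Hypothetical bounded cache; `generate` stands in for a real fingerprint generator.
function createFingerprintCache(maxSize, generate) {
    const cache = new Map(); // Map preserves insertion order
    return (proxyUrl) => {
        if (cache.has(proxyUrl)) return cache.get(proxyUrl);
        if (cache.size >= maxSize) {
            // Assumption: evict the oldest entry to respect the size limit.
            cache.delete(cache.keys().next().value);
        }
        const fingerprint = generate();
        cache.set(proxyUrl, fingerprint);
        return fingerprint;
    };
}

let counter = 0;
const getFingerprint = createFingerprintCache(2, () => ({ id: ++counter }));

const first = getFingerprint('http://proxy-a.example.com:8000');
const again = getFingerprint('http://proxy-a.example.com:8000');
const other = getFingerprint('http://proxy-b.example.com:8000');
console.log(first === again, first === other); // true false
```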

(UNSTABLE) Extensibility with plugins

A new super cool browser automation library appears? No problem, we add a simple plugin to Browser Pool, and it automagically works.

The BrowserPlugin and BrowserController interfaces are unstable and may change if we find some implementation to be suboptimal.

API Reference

All public classes, methods and their parameters can be inspected in this API reference.

@crawlee/browser-pool

The @crawlee/browser-pool module exports three constructors. One for BrowserPool itself and two for the included Puppeteer and Playwright plugins.

Example:

const {
    BrowserPool,
    PuppeteerPlugin,
    PlaywrightPlugin,
} = require('@crawlee/browser-pool');
const puppeteer = require('puppeteer');
const playwright = require('playwright');

const browserPool = new BrowserPool({
    browserPlugins: [
        new PuppeteerPlugin(puppeteer),
        new PlaywrightPlugin(playwright.chromium),
    ]
});

Properties

| Name | Type |
| --- | --- |
| BrowserPool | BrowserPool |
| PuppeteerPlugin | PuppeteerPlugin |
| PlaywrightPlugin | PlaywrightPlugin |

BrowserPool

The BrowserPool class is the most important class of the @crawlee/browser-pool module. It manages opening and closing of browsers and their pages and its constructor options allow easy configuration of the browsers' and pages' lifecycle.

The most important and useful constructor options are the various lifecycle hooks. Those allow you to sequentially call a list of (asynchronous) functions at each stage of the browser / page lifecycle.

Example:

const { BrowserPool, PlaywrightPlugin } = require('@crawlee/browser-pool');
const playwright = require('playwright');

const browserPool = new BrowserPool({
    browserPlugins: [ new PlaywrightPlugin(playwright.chromium)],
    preLaunchHooks: [(pageId, launchContext) => {
        // do something before a browser gets launched
        launchContext.launchOptions.headless = false;
    }],
    postLaunchHooks: [(pageId, browserController) => {
        // manipulate the browser right after launch
        console.dir(browserController.browser.contexts());
    }],
    prePageCreateHooks: [(pageId, browserController) => {
        if (pageId === 'my-page') {
            // make changes right before a specific page is created
        }
    }],
    postPageCreateHooks: [async (page, browserController) => {
        // update some or all new pages
        await page.evaluate(() => {
            // now all pages will have 'foo'
            window.foo = 'bar'
        })
    }],
    prePageCloseHooks: [async (page, browserController) => {
        // collect information just before a page closes
        await page.screenshot();
    }],
    postPageCloseHooks: [(pageId, browserController) => {
        // clean up or log after a job is done
        console.log('Page closed: ', pageId)
    }]
});

new BrowserPool(options)
| Param | Type | Default | Description |
| --- | --- | --- | --- |
| options | object | | |
| options.browserPlugins | Array.<BrowserPlugin> | | Browser plugins are wrappers of browser automation libraries that allow BrowserPool to control browsers with those libraries. @crawlee/browser-pool comes with a PuppeteerPlugin and a PlaywrightPlugin. |
| [options.maxOpenPagesPerBrowser] | number | 20 | Sets the maximum number of pages that can be open in a browser at the same time. Once reached, a new browser will be launched to handle the excess. |
| [options.retireBrowserAfterPageCount] | number | 100 | Browsers tend to get bloated after processing a lot of pages. This option configures the number of processed pages after which the browser will automatically retire and close. A new browser will launch in its place. |
| [options.operationTimeoutSecs] | number | 15 | As we know from experience, async operations of the underlying libraries, such as launching a browser or opening a new page, can get stuck. To prevent BrowserPool from getting stuck, we add a timeout to those operations and you can configure it with this option. |
| [options.closeInactiveBrowserAfterSecs] | number | 300 | Browsers normally close immediately after their last page is processed. However, there could be situations where this does not happen. Browser Pool makes sure all inactive browsers are closed regularly, to free resources. |
| [options.preLaunchHooks] | Array.<function()> | | Pre-launch hooks are executed just before a browser is launched and provide a good opportunity to dynamically change the launch options. The hooks are called with two arguments: pageId: string and launchContext: LaunchContext. |
| [options.postLaunchHooks] | Array.<function()> | | Post-launch hooks are executed as soon as a browser is launched. The hooks are called with two arguments: pageId: string and browserController: BrowserController. To guarantee order of execution before other hooks in the same browser, the BrowserController methods cannot be used until the post-launch hooks complete. If you attempt to call await browserController.close() from a post-launch hook, it will deadlock the process. This API is subject to change. |
| [options.prePageCreateHooks] | Array.<function()> | | Pre-page-create hooks are executed just before a new page is created. They are useful to make dynamic changes to the browser before opening a page. The hooks are called with three arguments: pageId: string, browserController: BrowserController and pageOptions: object. |
| [options.postPageCreateHooks] | Array.<function()> | | Post-page-create hooks are called right after a new page is created and all internal actions of Browser Pool are completed. This is the place to make changes that you would like to apply to all pages, such as injecting a JavaScript library. The hooks are called with two arguments: page: Page and browserController: BrowserController. |
| [options.prePageCloseHooks] | Array.<function()> | | Pre-page-close hooks give you the opportunity to make last-second changes in a page that's about to be closed, such as saving a snapshot or updating state. The hooks are called with two arguments: page: Page and browserController: BrowserController. |
| [options.postPageCloseHooks] | Array.<function()> | | Post-page-close hooks allow you to do page-related cleanup. The hooks are called with two arguments: pageId: string and browserController: BrowserController. |
| [options.useFingerprints] | boolean | false | If true, Browser Pool will automatically generate fingerprints and inject them into browsers. |
| [options.fingerprintsOptions] | FingerprintOptions | | Fingerprint options that allow customizing the fingerprinting behavior. |
| [options.fingerprintsOptions.fingerprintGeneratorOptions] | | | See the Fingerprint generator documentation. |
| [options.fingerprintsOptions.useFingerprintPerProxyCache] | boolean | true | Fingerprints are automatically assigned to an IP address so that 1 IP equals 1 fingerprint. You can disable this behavior by setting this property to false. |
| [options.fingerprintsOptions.fingerprintPerProxyCacheSize] | number | 10000 | Maximum number of IP to fingerprint pairs. |
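
The idea behind operationTimeoutSecs, guarding an operation that may never settle, can be sketched with Promise.race. withTimeout is a hypothetical helper written for this illustration, and the library's own timeout handling may differ.

```javascript
// Hypothetical helper: rejects if `operation` does not settle in time.
function withTimeout(operation, timeoutSecs, label) {
    let timer;
    const timeout = new Promise((resolve, reject) => {
        timer = setTimeout(
            () => reject(new Error(`${label} timed out after ${timeoutSecs} s`)),
            timeoutSecs * 1000,
        );
    });
    // Whichever promise settles first wins; the timer is always cleared.
    return Promise.race([operation, timeout]).finally(() => clearTimeout(timer));
}

// A promise that never settles stands in for a stuck browser launch.
const stuckLaunch = new Promise(() => {});
const timeoutResult = withTimeout(stuckLaunch, 0.05, 'launch').then(
    () => 'finished',
    (error) => error.message,
);
timeoutResult.then((message) => console.log(message));
```

A very short timeout is used here only so the example finishes quickly.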

browserPool.newPage(options) ⇒ Promise.<Page>

Opens a new page in one of the running browsers or launches a new browser and opens a page there, if no browsers are active, or their page limits have been exceeded.

| Param | Type | Description |
| --- | --- | --- |
| options | object | |
| [options.id] | string | Assign a custom ID to the page. If you don't, a random string ID will be generated. |
| [options.pageOptions] | object | Some libraries (Playwright) allow you to open new pages with specific options. Use this property to set those options. |
| [options.browserPlugin] | BrowserPlugin | Choose a plugin to open the page with. If none is provided, one of the pool's available plugins will be used. It must be one of the plugins the browser pool was created with. If you wish to start a browser with a different configuration, see the newPageInNewBrowser function. |

browserPool.newPageInNewBrowser(options) ⇒ Promise.<Page>

Unlike newPage, newPageInNewBrowser always launches a new browser to open the page in. Use the launchOptions option to configure the new browser.

| Param | Type | Description |
| --- | --- | --- |
| options | object | |
| [options.id] | string | Assign a custom ID to the page. If you don't, a random string ID will be generated. |
| [options.pageOptions] | object | Some libraries (Playwright) allow you to open new pages with specific options. Use this property to set those options. |
| [options.launchOptions] | object | Options that will be used to launch the new browser. |
| [options.browserPlugin] | BrowserPlugin | Provide a plugin to launch the browser. If none is provided, one of the pool's available plugins will be used. If you configured BrowserPool to rotate multiple libraries, such as both Puppeteer and Playwright, you should always set the browserPlugin when using the launchOptions option. The plugin will not be added to the list of plugins used by the pool. You can either use one of those, to launch a specific browser, or provide a completely new configuration. |

browserPool.newPageWithEachPlugin(optionsList) ⇒ Promise.<Array.<Page>>

Opens new pages with all available plugins and returns an array of pages in the same order as the plugins were provided to BrowserPool. This is useful when you want to run a script in multiple environments at the same time, typically in testing or website analysis.

Example:

const browserPool = new BrowserPool({
    browserPlugins: [
        new PlaywrightPlugin(playwright.chromium),
        new PlaywrightPlugin(playwright.firefox),
        new PlaywrightPlugin(playwright.webkit),
        new PuppeteerPlugin(puppeteer),
    ]
});

const pages = await browserPool.newPageWithEachPlugin();
const [chromiumPage, firefoxPage, webkitPage, puppeteerPage] = pages;

| Param | Type |
| --- | --- |
| optionsList | Array.<object> |

browserPool.getBrowserControllerByPage(page) ⇒ BrowserController

Retrieves a BrowserController for a given page. This is useful when you're working only with pages and need to access the browser manipulation functionality.

You could access the browser directly from the page, but that would circumvent BrowserPool and most likely cause weird things to happen, so please always use BrowserController to control your browsers. The function returns undefined if the browser is closed.

| Param | Type | Description |
| --- | --- | --- |
| page | Page | Browser plugin page |

browserPool.getPage(id) ⇒ Page

If you provided a custom ID to one of your pages or saved the randomly generated one, you can use this function to retrieve the page. If the page is no longer open, the function will return undefined.

| Param | Type |
| --- | --- |
| id | string |

browserPool.getPageId(page) ⇒ string

Page IDs are used throughout BrowserPool as a method of linking events. You can use a page ID to track the full lifecycle of the page. It is created even before a browser is launched and stays with the page until it's closed.

| Param | Type |
| --- | --- |
| page | Page |
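
A two-way lookup between pages and IDs, like the one getPage and getPageId provide, could be kept in a pair of maps. The registry below is a hypothetical illustration with a plain object standing in for a Page. It says nothing about how BrowserPool stores this information internally.

```javascript
// Hypothetical two-way registry; not the library's internals.
const pageToId = new WeakMap(); // page object -> id
const idToPage = new Map(); // id -> page object

function registerPage(id, page) {
    pageToId.set(page, id);
    idToPage.set(id, page);
}

function unregisterPage(page) {
    idToPage.delete(pageToId.get(page));
    pageToId.delete(page);
}

const fakePage = { url: 'https://example.com' }; // stands in for a real Page
registerPage('my-page', fakePage);
const foundId = pageToId.get(fakePage); // 'my-page'
const foundPage = idToPage.get('my-page'); // the same object as fakePage
unregisterPage(fakePage);
const afterClose = idToPage.get('my-page'); // undefined once the page is gone
console.log(foundId, foundPage === fakePage, afterClose);
```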

browserPool.retireBrowserController(browserController)

Removes a browser controller from the pool. The underlying browser will be closed after all its pages are closed.

| Param | Type |
| --- | --- |
| browserController | BrowserController |

browserPool.retireBrowserByPage(page)

Removes a browser from the pool. It will be closed after all its pages are closed.

| Param | Type |
| --- | --- |
| page | Page |

browserPool.retireAllBrowsers()

Removes all active browsers from the pool. The browsers will be closed after all their pages are closed.


browserPool.closeAllBrowsers() ⇒ Promise.<void>

Closes all managed browsers without waiting for pages to close.


browserPool.destroy() ⇒ Promise.<void>

Closes all managed browsers and tears down the pool.


BrowserController

The BrowserController serves two purposes. First, it is the base class that specialized controllers like PuppeteerController or PlaywrightController extend. Second, it defines the public interface of the specialized classes which provide only private methods. Therefore, we do not keep documentation for the specialized classes, because it's the same for all of them.

Properties

| Name | Type | Description |
| --- | --- | --- |
| id | string | |
| browserPlugin | BrowserPlugin | The BrowserPlugin instance used to launch the browser. |
| browser | Browser | Browser representation of the underlying automation library. |
| launchContext | LaunchContext | The configuration the browser was launched with. |

browserController.close() ⇒ Promise.<void>

Gracefully closes the browser and makes sure there will be no lingering browser processes.

Emits 'browserClosed' event.


browserController.kill() ⇒ Promise.<void>

Immediately kills the browser process.

Emits 'browserClosed' event.


browserController.setCookies(page, cookies) ⇒ Promise.<void>

| Param | Type |
| --- | --- |
| page | Object |
| cookies | Array.<object> |

browserController.getCookies(page) ⇒ Promise.<Array.<object>>

| Param | Type |
| --- | --- |
| page | Object |

BrowserPlugin

The BrowserPlugin serves two purposes. First, it is the base class that specialized plugins like PuppeteerPlugin or PlaywrightPlugin extend. Second, it allows the user to configure the automation libraries and feed them to BrowserPool for use.

Properties

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| [useIncognitoPages] | boolean | false | By default pages share the same browser context. If set to true each page uses its own context that is destroyed once the page is closed or crashes. |
| [userDataDir] | object | | Path to a User Data Directory, which stores browser session data like cookies and local storage. |

new BrowserPlugin(library, [options])
| Param | Type | Description |
| --- | --- | --- |
| library | object | Each plugin expects an object with a .launch() property. For Puppeteer, it is the puppeteer module itself, whereas for Playwright it is one of the browser types, such as playwright.chromium. BrowserPlugin does not include the library. You can choose any version or fork of the library. It also keeps the @crawlee/browser-pool installation small. |
| [options] | object | |
| [options.launchOptions] | object | Options that will be passed down to the automation library, e.g. puppeteer.launch(launchOptions). This is a good place to set options that you want to apply as defaults. To dynamically override those options per-browser, see the preLaunchHooks of BrowserPool. |
| [options.proxyUrl] | string | Automation libraries configure proxies differently. This helper allows you to set a proxy URL without worrying about specific implementations. It also allows you to use an authenticated proxy without extra code. |

LaunchContext

LaunchContext holds information about the launched browser. It's useful to retrieve the launchOptions, the proxy the browser was launched with or any other information user chose to add to the LaunchContext by calling its extend function. This is very useful to keep track of browser-scoped values, such as session IDs.

Properties

| Name | Type | Description |
| --- | --- | --- |
| id | string | To make identification of LaunchContext easier, BrowserPool assigns the LaunchContext an id that's equal to the id of the page that triggered the browser launch. This is useful, because many pages share a single launch context (single browser). |
| browserPlugin | BrowserPlugin | The BrowserPlugin instance used to launch the browser. |
| launchOptions | object | The actual options the browser was launched with, after changes. Those changes would be typically made in pre-launch hooks. |
| [useIncognitoPages] | boolean | By default pages share the same browser context. If set to true each page uses its own context that is destroyed once the page is closed or crashes. |
| [userDataDir] | object | Path to a User Data Directory, which stores browser session data like cookies and local storage. |

launchContext.proxyUrl

Sets a proxy URL for the browser. Use undefined to unset existing proxy URL.

| Param | Type |
| --- | --- |
| url | string |

launchContext.proxyUrl ⇒ string

Returns the proxy URL of the browser.


launchContext.extend(fields)

Extend the launch context with any extra fields. This is useful to keep state information relevant to the browser being launched. It ensures that no internal fields are overridden and should be used instead of property assignment.

| Param | Type |
| --- | --- |
| fields | object |
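
To show why an extend function is safer than plain property assignment, here is a stand-alone sketch. createContext is a hypothetical stand-in for a launch context, and rejecting a field name that already exists is an assumption about how such protection could work, not a description of the real LaunchContext.

```javascript
// Hypothetical stand-in for a launch context; the real class may behave differently.
// Assumption: extending with a name that already exists is rejected.
function createContext(internalFields) {
    const context = { ...internalFields };
    context.extend = (fields) => {
        for (const [key, value] of Object.entries(fields)) {
            if (key in context) {
                throw new Error(`Cannot override existing field "${key}".`);
            }
            context[key] = value;
        }
    };
    return context;
}

const context = createContext({ id: 'my-page', launchOptions: { headless: true } });
context.extend({ sessionId: 'session-1' }); // new field: accepted

let rejected = false;
try {
    context.extend({ id: 'other' }); // existing field: refused
} catch (error) {
    rejected = true;
}
console.log(context.sessionId, context.id, rejected); // session-1 my-page true
```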


Package last updated on 22 Apr 2022
