
Easy-to-use, comprehensive wrapper for the Bright Data APIs (scrapers, Web Unlocker, Browser API) with async support.

`pip install brightdata` → one import away from grabbing JSON rows from Amazon, Instagram, LinkedIn, TikTok, YouTube, X, Reddit and more in a production-grade way.
(Scroll down on https://brightdata.com/products/web-scraper to see all the specialized scrapers.)
Note: This is an unofficial SDK
- `scrape_url` method provides the simplest yet most production-ready scraping experience.
- `fallback_to_browser_api` boolean parameter: when enabled, if no specialized scraper is found for a URL, Bright Data's Browser API is used to scrape the website.
- `scrape_urls` method for scraping multiple links. It is built with native asyncio support, so all URLs can be scraped asynchronously at the same time, and the `fallback_to_browser_api` parameter is available here as well (see the sketch after this list).
- Supports the Bright Data discovery and search APIs.
- To enable agentic workflows, the package ships a JSON file describing every scraper and its methods.
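
For example, a minimal sketch of scraping several links at once with the Browser API fallback. This assumes `scrape_urls` is importable from the package root like `scrape_url` and accepts `bearer_token` and `fallback_to_browser_api` keyword arguments; the second URL is a made-up address with no specialized scraper:

```python
import os

from dotenv import load_dotenv
from brightdata import scrape_urls  # assumed top-level export, like scrape_url

load_dotenv()
TOKEN = os.getenv("BRIGHTDATA_TOKEN")

urls = [
    "https://www.amazon.com/dp/B0CRMZHDG8",   # has a specialized scraper
    "https://example.com/some-article",       # hypothetical URL without one
]

# URLs with a specialized scraper use it; the rest fall back to the Browser API.
results = scrape_urls(urls, bearer_token=TOKEN, fallback_to_browser_api=True)
print(results)
```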
1. Obtain a `BRIGHTDATA_TOKEN` from brightdata.com.
2. Create a `.env` file and paste the token like this:
   BRIGHTDATA_TOKEN=AJKSHKKJHKAJ…  # your token
3. Install the brightdata package from PyPI:
   pip install brightdata
`brightdata.auto.scrape_url` looks at the domain of a URL and returns the scraper class that declared itself responsible for that domain. All you have to do is feed it the URL.
from brightdata import trigger_scrape_url, scrape_url
# trigger+wait and get the actual data
rows = scrape_url("https://www.amazon.com/dp/B0CRMZHDG8")
# just get the snapshot ID so you can collect the data later
snap = trigger_scrape_url("https://www.amazon.com/dp/B0CRMZHDG8")
It also works for sites where Bright Data exposes several distinct "collect" endpoints. `LinkedInScraper` is a good example:
LinkedIn dataset | Method exposed by the scraper
---|---
people profile – collect by URL | `collect_people_by_url()`
company page – collect by URL | `collect_company_by_url()`
job post – collect by URL | `collect_jobs_by_url()`
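
If you already know which dataset you are targeting, you can call these methods directly. A minimal sketch, assuming `LinkedInScraper` is importable from `brightdata.ready_scrapers.linkedin` (mirroring the Amazon scraper's import path) and that `collect_people_by_url` accepts a list of profile URLs:

```python
import os

from dotenv import load_dotenv
from brightdata.ready_scrapers.linkedin import LinkedInScraper  # assumed import path
from brightdata.utils.poll import poll_until_ready

load_dotenv()
TOKEN = os.getenv("BRIGHTDATA_TOKEN")

scraper = LinkedInScraper(bearer_token=TOKEN)

# collect a single people profile by URL and block until the snapshot is ready
snap = scraper.collect_people_by_url(["https://www.linkedin.com/in/enes-kuzucu/"])
rows = poll_until_ready(scraper, snap).data
print(rows)
```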
Each scraper also contains a smart dispatcher method that calls the right method based on the link structure:
import os

from dotenv import load_dotenv
from brightdata import scrape_url

load_dotenv()
TOKEN = os.getenv("BRIGHTDATA_TOKEN")

links_with_different_types = [
    "https://www.linkedin.com/in/enes-kuzucu/",
    "https://www.linkedin.com/company/105448508/",
    "https://www.linkedin.com/jobs/view/4231516747/",
]

for link in links_with_different_types:
    rows = scrape_url(link, bearer_token=TOKEN)
    print(rows)
Note: the `trigger_scrape_url` and `scrape_url` methods only cover the "collect by URL" use-case. Discovery endpoints (keyword, category, …) are still called directly on a specific scraper class.
import os
import sys

from dotenv import load_dotenv
from brightdata.ready_scrapers.amazon import AmazonScraper
from brightdata.utils.poll import poll_until_ready  # blocking helper

load_dotenv()
TOKEN = os.getenv("BRIGHTDATA_TOKEN")
if not TOKEN:
    sys.exit("Set BRIGHTDATA_TOKEN environment variable first")

scraper = AmazonScraper(bearer_token=TOKEN)

snap = scraper.collect_by_url([
    "https://www.amazon.com/dp/B0CRMZHDG8",
    "https://www.amazon.com/dp/B07PZF3QS3",
])

rows = poll_until_ready(scraper, snap).data  # list[dict]
print(rows[0]["title"])
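
The same pattern applies to the discovery endpoints mentioned in the note above. A minimal sketch, assuming `discover_by_keyword` takes a list of keywords (as in the async example further down) and returns a snapshot id that `poll_until_ready` can consume:

```python
import os

from dotenv import load_dotenv
from brightdata.ready_scrapers.amazon import AmazonScraper
from brightdata.utils.poll import poll_until_ready

load_dotenv()
TOKEN = os.getenv("BRIGHTDATA_TOKEN")

scraper = AmazonScraper(bearer_token=TOKEN)

# discovery endpoint, called directly on the scraper class
snap = scraper.discover_by_keyword(["dog food"])
rows = poll_until_ready(scraper, snap).data
print(len(rows), "products discovered")
```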
With `fetch_snapshot_async` you can trigger 1,000 snapshots and each polling task yields control whenever it is waiting. All polls share one `aiohttp.ClientSession` (connection pool), so you are not tearing down TCP connections for every check.

`fetch_snapshots_async` is a convenience helper that wraps all the boilerplate needed when you fire off hundreds or thousands of scraping jobs, so you do not have to manually spawn tasks and gather their results. It preserves the order of your snapshot list and surfaces all `ScrapeResult`s in a single list, so you can correlate inputs → outputs easily.
import asyncio
import os

from dotenv import load_dotenv
from brightdata.ready_scrapers.amazon import AmazonScraper
from brightdata.utils.async_poll import fetch_snapshots_async

# token comes from your .env
load_dotenv()
TOKEN = os.getenv("BRIGHTDATA_TOKEN")

scraper = AmazonScraper(bearer_token=TOKEN)

# kick off 100 keyword-discover jobs (each call returns a snapshot-id)
keywords = ["dog food", "ssd", ...]  # 100 items
snapshots = [scraper.discover_by_keyword([kw])  # one snapshot per call
             for kw in keywords]

# wait for *all* snapshots to finish (poll every 15 s, 10 min timeout)
results = asyncio.run(
    fetch_snapshots_async(scraper, snapshots, poll=15, timeout=600)
)

# split the outcome
ready = [r.data for r in results if r.status == "ready"]
errors = [r for r in results if r.status != "ready"]

print("ready :", len(ready))
print("errors:", len(errors))
Memory footprint: a few kB per job → thousands of parallel polls on a single VM.
Need fire-and-forget? `brightdata.utils.thread_poll.PollWorker` (one line to start) runs in a daemon thread, writes the JSON to disk or fires a callback, and never blocks your main code.
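
A sketch of how that can look. The constructor arguments shown here (`output_dir`, `on_ready`) are assumptions for illustration, not a verified signature; check `brightdata.utils.thread_poll` for the real one:

```python
from brightdata.ready_scrapers.amazon import AmazonScraper
from brightdata.utils.thread_poll import PollWorker

scraper = AmazonScraper(bearer_token=TOKEN)  # TOKEN from your .env, as above
snap = scraper.collect_by_url(["https://www.amazon.com/dp/B0CRMZHDG8"])

# Hypothetical keyword arguments — the real PollWorker signature may differ.
worker = PollWorker(
    scraper,
    snap,
    output_dir="results/",                             # assumed: write the JSON here
    on_ready=lambda res: print("done:", res.status),   # assumed: callback hook
)
worker.start()  # daemon thread: polling happens in the background

# ... your main code keeps running here without blocking
```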
Bright Data also supports batch triggering, which means you can do something like this:
# trigger all 1 000 keywords at once ----------------------------
payload = [{"keyword": kw} for kw in keywords]  # 1 000 items
snap_id = scraper.discover_by_keyword(payload)  # ONE call

# the rest is the same as before
results = asyncio.run(
    fetch_snapshot_async(scraper, snap_id, poll=15, timeout=600)
)
rows = results.data
import asyncio

from brightdata.ready_scrapers.amazon import AmazonScraper
from brightdata.utils.concurrent_trigger import trigger_keywords_concurrently
from brightdata.utils.async_poll import fetch_snapshots_async

scraper = AmazonScraper(bearer_token=TOKEN)

# 1) trigger – now takes seconds, not minutes
snapshot_map = trigger_keywords_concurrently(scraper, keywords, max_workers=64)

# 2) poll the 1 000 snapshot-ids in parallel
results = asyncio.run(
    fetch_snapshots_async(scraper,
                          list(snapshot_map.values()),
                          poll=15, timeout=600)
)

# 3) reconnect keyword ↔ result if you need to
kw_to_result = {
    kw: res
    for kw, sid in snapshot_map.items()
    for res in results
    if res.input_snapshot_id == sid  # you can add that attribute yourself
}
Dataset family | Ready-made class | Implemented methods
---|---|---
Amazon products / search | `AmazonScraper` | `collect_by_url`, `discover_by_keyword`, `discover_by_category`, `search_products`
Digi-Key parts | `DigiKeyScraper` | `collect_by_url`, `discover_by_category`
Mouser parts | `MouserScraper` | `collect_by_url`
LinkedIn profiles / companies / jobs | `LinkedInScraper` | `collect_people_by_url`, `discover_people_by_name`, `collect_company_by_url`, `collect_jobs_by_url`, `discover_jobs_by_keyword`
Each call returns a `snapshot_id` string (`sync_mode = async`). Use one of the helpers to fetch the final data:

- `brightdata.utils.poll.poll_until_ready()` – blocking, linear
- `brightdata.utils.async_poll.wait_ready()` – single coroutine
- `brightdata.utils.async_poll.monitor_snapshots()` – fan out hundreds using asyncio + aiohttp
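
For a single snapshot inside an existing event loop, `wait_ready()` is the natural fit. A minimal sketch, assuming it takes the scraper and snapshot id (plus `poll`/`timeout` keywords, like the other helpers) and returns a `ScrapeResult`:

```python
import asyncio
import os

from dotenv import load_dotenv
from brightdata.ready_scrapers.amazon import AmazonScraper
from brightdata.utils.async_poll import wait_ready

load_dotenv()
TOKEN = os.getenv("BRIGHTDATA_TOKEN")

async def main() -> None:
    scraper = AmazonScraper(bearer_token=TOKEN)
    snap = scraper.collect_by_url(["https://www.amazon.com/dp/B0CRMZHDG8"])
    # assumed signature: wait_ready(scraper, snapshot_id, poll=..., timeout=...)
    result = await wait_ready(scraper, snap, poll=15, timeout=600)
    if result.status == "ready":
        print(result.data[0]["title"])

asyncio.run(main())
```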
Smoke tests for each dataset live in `ready_scrapers/<dataset>/tests.py`.