desync-search

API for the internet

0.2.25

Desync Search Documentation

Overview

Desync Search is a next-generation Python library engineered for fast, stealthy, and scalable web data extraction. It combines low-detectability techniques, massive concurrency, and easy integration to deliver high performance at competitive pricing.

Key Features:

  • Stealth Mode:
    Operates with minimal detection, even on pages protected against bot traffic.

  • Massive Concurrency:
    Supports up to 50,000 concurrent operations, with any additional requests automatically queued.

  • Minimal Integration:
    Start using Desync Search in just three lines of code:

    import desync_search
    client = desync_search.DesyncClient(user_api_key="YOUR_API_KEY")
    result = client.search("https://example.com")
    
  • Best-in-Class Pricing:
    Enjoy highly competitive pricing that offers exceptional value for high-volume operations.

  • Low Latency:
    Experience quick response times and efficient data extraction with consistently low latency.

Installation & Setup

1. Installing the Library

To install Desync Search via pip, run:

pip install desync_search

2. Setting Up Your API Key

Desync Search authenticates requests with your API key. If you don't pass the key directly, DesyncClient automatically reads it from an environment variable named DESYNC_API_KEY, which keeps the key out of your source code.

Setting the Environment Variable

  • Unix/Linux/MacOS (bash):
    export DESYNC_API_KEY="your_api_key_here"
    
  • Windows (Command Prompt):
    set DESYNC_API_KEY=your_api_key_here
    
  • Windows (PowerShell):
    $env:DESYNC_API_KEY="your_api_key_here"
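
To confirm the variable is actually visible to your Python process before creating the client, a quick check like this can help (a minimal sketch using only the standard library):

import os

if not os.environ.get("DESYNC_API_KEY"):
    raise RuntimeError("DESYNC_API_KEY is not set in this environment.")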
    

3. Initializing the Client

Once your API key is set, you can initialize the client without specifying the API key:

from desync_search import DesyncClient

client = DesyncClient()

Alternatively, you can pass a different API key directly:

client = DesyncClient(user_api_key="your_api_key_here")

Quickstart

Below are ready-to-run code examples that demonstrate the core features of Desync Search. Simply copy these snippets into your IDE, update your API key if necessary (or set it in your environment), and run!

1. Searching a Single URL

What It Does:
Searches a single URL and returns detailed page data—including the URL, links, and content length—packaged in a PageData object.

from desync_search import DesyncClient

client = DesyncClient()
target_url = "https://example.com"
result = client.search(target_url)

print("URL:", result.url)
print("Internal Links:", len(result.internal_links))
print("External Links:", len(result.external_links))
print("Text Content Length:", len(result.text_content))

2. Crawling an Entire Domain

What It Does:
Recursively crawls a website. The starting page is considered "depth 0". Any link on that page (pointing to the same domain) is considered "depth 1", links from those pages are "depth 2", and so on. This continues until the maximum depth is reached or no new unique pages are found.

from desync_search import DesyncClient

client = DesyncClient()

pages = client.crawl(
    start_url="https://example.com",
    max_depth=2,
    scrape_full_html=False,
    remove_link_duplicates=True
)

print(f"Discovered {len(pages)} pages.")
for page in pages:
    print("URL:", page.url, "| Depth:", getattr(page, "depth", "N/A"))

3. Initiating a Bulk Search

What It Does:
Processes a list of URLs asynchronously in one operation. Up to 1000 URLs can be processed per bulk search. This method returns metadata including a unique bulk search ID that you can later use to retrieve the complete results.

from desync_search import DesyncClient

client = DesyncClient()
urls = [
    "https://example.com",
    "https://another-example.com",
    # Add additional URLs here (up to 1000 per bulk search)
]

bulk_info = client.bulk_search(target_list=urls, extract_html=False)
print("Bulk Search ID:", bulk_info.get("bulk_search_id"))
print("Total Links Scheduled:", bulk_info.get("total_links"))

Note: Once you have the bulk_search_id, you can retrieve the results asynchronously using the collect_results method. For a fully managed experience, consider using simple_bulk_search.

4. Collecting Bulk Search Results

What It Does:
After initiating a bulk search, this snippet polls for and collects the complete results. The method waits until a specified fraction of the URLs have been processed (or a timeout is reached) and then retrieves the full page data.

from desync_search import DesyncClient

client = DesyncClient()
urls = [
    "https://example.com",
    "https://another-example.com",
    # Add more URLs as needed
]

# Initiate a bulk search
bulk_info = client.bulk_search(target_list=urls, extract_html=False)

# Poll and collect results once enough pages are complete
results = client.collect_results(
    bulk_search_id=bulk_info["bulk_search_id"],
    target_links=urls,
    wait_time=30.0,
    poll_interval=2.0,
    completion_fraction=0.975
)

print(f"Retrieved {len(results)} pages from the bulk search.")
for result in results:
    print("URL:", result.url)

5. Simple Bulk Search

What It Does:
For large lists of URLs (even exceeding 1000 elements), the simple_bulk_search method splits the list into manageable chunks, starts a bulk search for each chunk, and then aggregates all the results. This provides a fully managed bulk search experience.

from desync_search import DesyncClient

client = DesyncClient()
urls = [
    "https://example.com",
    "https://another-example.com",
    # Add as many URLs as needed; this method handles splitting automatically.
]

results = client.simple_bulk_search(
    target_list=urls,
    extract_html=False,
    poll_interval=2.0,
    wait_time=30.0,
    completion_fraction=1
)

print(f"Retrieved {len(results)} pages using simple_bulk_search.")
for result in results:
    print("URL:", result.url)

API Reference

DesyncClient Class

The DesyncClient class provides a high-level interface to the Desync Search API, managing individual searches, bulk operations, domain crawling, and credit balance checks.

__init__(user_api_key="", developer_mode=False)

Signature:

def __init__(self, user_api_key="", developer_mode=False)

Description:
Initializes the client with the provided API key or reads it from the DESYNC_API_KEY environment variable. If developer_mode is True, the client uses a test endpoint; otherwise, it uses the production endpoint.

Parameters:

  • user_api_key (str, optional): Your Desync API key.
  • developer_mode (bool, optional): Toggle between test and production endpoints.

Example:

from desync_search import DesyncClient

client = DesyncClient(user_api_key="YOUR_API_KEY", developer_mode=False)

search(url, search_type="stealth_search", scrape_full_html=False, remove_link_duplicates=True) -> PageData

Signature:

def search(self, url, search_type="stealth_search", scrape_full_html=False, remove_link_duplicates=True) -> PageData

Description:
Performs a single search on a specified URL and returns a PageData object containing the page’s text, links, timestamps, and other metadata.

Parameters:

  • url (str): The URL to scrape.
  • search_type (str): Either "stealth_search" (default) or "test_search".
  • scrape_full_html (bool): If True, returns the full HTML content.
  • remove_link_duplicates (bool): If True, removes duplicate links from the results.

Example:

result = client.search("https://example.com")
print(result.text_content)

bulk_search(target_list, extract_html=False) -> dict

Signature:

def bulk_search(self, target_list, extract_html=False) -> dict

Description:
Initiates an asynchronous bulk search on up to 1000 URLs at once. Returns a dictionary containing a bulk_search_id and other metadata.

Parameters:

  • target_list (list[str]): List of URLs to process.
  • extract_html (bool): If True, includes the full HTML content in results.

Example:

bulk_info = client.bulk_search(target_list=["https://example.com", "https://another-example.net"])
print(bulk_info["bulk_search_id"])

list_available(url_list=None, bulk_search_id=None) -> list

Signature:

def list_available(self, url_list=None, bulk_search_id=None) -> list

Description:
Retrieves minimal data about previously collected search results (IDs, domains, timestamps, etc.). Returns a list of PageData objects with limited fields.

Parameters:

  • url_list (list[str], optional): Filters results by specific URLs.
  • bulk_search_id (str, optional): Filters results by a particular bulk search ID.

Example:

partial_records = client.list_available(bulk_search_id="some-bulk-id")
for rec in partial_records:
    print(rec.url, rec.complete)

pull_data(record_id=None, url=None, domain=None, timestamp=None, bulk_search_id=None, search_type=None, latency_ms=None, complete=None, created_at=None) -> list

Signature:

def pull_data(self, record_id=None, url=None, domain=None, timestamp=None, bulk_search_id=None, search_type=None, latency_ms=None, complete=None, created_at=None) -> list

Description:
Retrieves full data (including text and optional HTML content) for one or more records matching the provided filters. Returns a list of PageData objects.

Example:

detailed_records = client.pull_data(url="https://example.com")
for record in detailed_records:
    print(record.html_content)

pull_credits_balance() -> dict

Signature:

def pull_credits_balance(self) -> dict

Description:
Checks the user’s current credit balance and returns it as a dictionary.

Example:

balance_info = client.pull_credits_balance()
print(balance_info["credits_balance"])

collect_results(bulk_search_id, target_links, wait_time=30.0, poll_interval=2.0, completion_fraction=0.975) -> list

Signature:

def collect_results(self, bulk_search_id: str, target_links: list, wait_time=30.0, poll_interval=2.0, completion_fraction=0.975) -> list

Description:
Polls periodically for bulk search completion until a specified fraction of pages are done or a maximum wait time elapses, then retrieves full data. Returns a list of PageData objects.

Parameters:

  • bulk_search_id (str): The unique identifier for the bulk search.
  • target_links (list[str]): The list of URLs in the bulk job.
  • wait_time (float): Maximum polling duration in seconds.
  • poll_interval (float): Interval between status checks.
  • completion_fraction (float): Fraction of completed results needed to stop polling.

Example:

results = client.collect_results(
    bulk_search_id="bulk-id-123",
    target_links=["https://example.com", "https://another.com"]
)
print(len(results))

simple_bulk_search(target_list: list, extract_html=False, poll_interval=2.0, wait_time=30.0, completion_fraction=1) -> list

Signature:

def simple_bulk_search(self, target_list: list, extract_html=False, poll_interval=2.0, wait_time=30.0, completion_fraction=1) -> list

Description:
Splits a large list of URLs into chunks (up to 1000 URLs each), initiates a bulk search for each chunk, then collects and aggregates the results.

Example:

all_pages = client.simple_bulk_search(
    target_list=["https://site1.com", "https://site2.com", ...],
    extract_html=False
)
print(len(all_pages))

crawl(start_url, max_depth=2, scrape_full_html=False, remove_link_duplicates=True, poll_interval=2.0, wait_time_per_depth=30.0, completion_fraction=0.975) -> list

Signature:

def crawl(self, start_url: str, max_depth=2, scrape_full_html=False, remove_link_duplicates=True, poll_interval=2.0, wait_time_per_depth=30.0, completion_fraction=0.975) -> list

Description:
Recursively crawls the specified start_url up to max_depth levels. It performs a stealth search on the starting page, collects same-domain links, and uses bulk searches to fetch pages at each depth.
Think of it this way: the starting page is "depth 0". Any same-domain link on that page is "depth 1", links on depth 1 pages become "depth 2", and so on until the maximum depth is reached or no new pages are found.

Example:

crawled_pages = client.crawl(
    start_url="https://example.com",
    max_depth=3,
    scrape_full_html=False
)
print(len(crawled_pages))

_post_and_parse(payload)

Signature:

def _post_and_parse(self, payload)

Description:
An internal helper method that sends the given payload to the API, parses the JSON response, and raises an error if the request fails.
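
The exact implementation of this internal helper is not part of the public documentation, but conceptually it behaves like the sketch below. The endpoint URL, the use of the requests library, and the timeout value are assumptions for illustration only:

import requests

API_URL = "https://example.invalid/desync"  # placeholder endpoint, not the real API URL

def post_and_parse(payload: dict) -> dict:
    # Send the payload as JSON, raise on HTTP errors, and return the parsed JSON response.
    response = requests.post(API_URL, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()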

PageData Class

The PageData class packages all the information extracted from a web page during a search. It includes both details about the page itself and metadata about the search operation (such as timestamps and latency).

Attributes

  • id (int):
    A unique identifier for the search result.

  • url (str):
    The URL targeted by the search, often referred to as the "target URL" or "target page" (e.g., abc.com/news).

  • domain (str):
    The domain of the targeted URL (e.g., if the URL is abc.com/news, the domain is abc.com).

  • timestamp (int):
    A Unix timestamp marking when the result was received.

  • bulk_search_id (str):
    A unique identifier for the bulk search batch this result belongs to. May be None if the result is not part of a bulk search.

  • search_type (str):
    Indicates the type of search performed. Options include:

    • stealth_search (default): Uses JavaScript rendering and stealth techniques.
    • test_search: Does not render JavaScript; intended for prototyping.
  • text_content (str):
    The text extracted from the page’s DOM, ideal for data extraction.

  • html_content (str):
    The full HTML content of the page (optional and not returned by default to save bandwidth).

  • internal_links (list[str]):
    A list of URLs on the page that point to the same domain.

  • external_links (list[str]):
    A list of URLs on the page that point to different domains.

  • latency_ms (int):
    The time in milliseconds between the start of the search and when the results were collected.

  • complete (bool):
    Indicates whether the search operation is complete.

  • created_at (int):
    A Unix timestamp marking when the search was initiated on the client-side.
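
Putting the attributes together, a minimal sketch that runs a single search and reads the documented fields:

from desync_search import DesyncClient

client = DesyncClient()
page = client.search("https://example.com")

# Identity and search metadata
print("URL:", page.url, "| Domain:", page.domain)
print("Search type:", page.search_type, "| Latency (ms):", page.latency_ms)
print("Complete:", page.complete)

# Extracted content and links
print("First 200 characters of text:", page.text_content[:200])
print("Internal links:", len(page.internal_links), "| External links:", len(page.external_links))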

This documentation provides you with everything you need to get started with Desync Search—from installation and quickstart examples to detailed API reference for both the client and the page data structure. Enjoy building your web data extraction projects!
