desync-search

API for the internet

0.2.25

Desync Search Documentation

Overview

Desync Search is a next-generation Python library engineered for fast, stealthy, and scalable web data extraction. It combines low-detectability techniques, massive concurrency, and easy integration to deliver high performance at competitive pricing.

Key Features:

  • Stealth Mode:
    Operates with minimal detection, even on pages protected against bot traffic.

  • Massive Concurrency:
    Supports up to 50,000 concurrent operations, with any additional requests automatically queued.

  • Minimal Integration:
    Start using Desync Search in just three lines of code:

    import desync_search
    client = desync_search.DesyncClient(user_api_key="YOUR_API_KEY")
    result = client.search("https://example.com")
    
  • Best-in-Class Pricing:
    Enjoy highly competitive pricing that offers exceptional value for high-volume operations.

  • Low Latency:
    Experience quick response times and efficient data extraction with consistently low latency.

Installation & Setup

1. Installing the Library

To install Desync Search via pip, run:

pip install desync_search

2. Setting Up Your API Key

Desync Search authenticates requests with your API key. If you don't pass the key directly, DesyncClient automatically reads it from an environment variable named DESYNC_API_KEY, which keeps the key out of your source code.

Setting the Environment Variable

  • Unix/Linux/MacOS (bash):
    export DESYNC_API_KEY="your_api_key_here"
    
  • Windows (Command Prompt):
    set DESYNC_API_KEY=your_api_key_here
    
  • Windows (PowerShell):
    $env:DESYNC_API_KEY="your_api_key_here"
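
To confirm the variable is actually visible to your Python process before creating the client, a quick check like this can help (a minimal sketch using only the standard library):

import os

if not os.environ.get("DESYNC_API_KEY"):
    raise RuntimeError("DESYNC_API_KEY is not set in this environment.")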
    

3. Initializing the Client

Once your API key is set, you can initialize the client without specifying the API key:

from desync_search import DesyncClient

client = DesyncClient()

Alternatively, you can pass a different API key directly:

client = DesyncClient(user_api_key="your_api_key_here")

Quickstart

Below are ready-to-run code examples that demonstrate the core features of Desync Search. Simply copy these snippets into your IDE, update your API key if necessary (or set it in your environment), and run!

1. Searching a Single URL

What It Does:
Searches a single URL and returns detailed page data—including the URL, links, and content length—packaged in a PageData object.

from desync_search import DesyncClient

client = DesyncClient()
target_url = "https://example.com"
result = client.search(target_url)

print("URL:", result.url)
print("Internal Links:", len(result.internal_links))
print("External Links:", len(result.external_links))
print("Text Content Length:", len(result.text_content))

2. Crawling an Entire Domain

What It Does:
Recursively crawls a website. The starting page is considered "depth 0". Any link on that page (pointing to the same domain) is considered "depth 1", links from those pages are "depth 2", and so on. This continues until the maximum depth is reached or no new unique pages are found.

from desync_search import DesyncClient

client = DesyncClient()

pages = client.crawl(
    start_url="https://example.com",
    max_depth=2,
    scrape_full_html=False,
    remove_link_duplicates=True
)

print(f"Discovered {len(pages)} pages.")
for page in pages:
    print("URL:", page.url, "| Depth:", getattr(page, "depth", "N/A"))

3. Initiating a Bulk Search

What It Does:
Processes a list of URLs asynchronously in one operation. Up to 1000 URLs can be processed per bulk search. This method returns metadata including a unique bulk search ID that you can later use to retrieve the complete results.

from desync_search import DesyncClient

client = DesyncClient()
urls = [
    "https://example.com",
    "https://another-example.com",
    # Add additional URLs here (up to 1000 per bulk search)
]

bulk_info = client.bulk_search(target_list=urls, extract_html=False)
print("Bulk Search ID:", bulk_info.get("bulk_search_id"))
print("Total Links Scheduled:", bulk_info.get("total_links"))

Note: Once you have the bulk_search_id, you can retrieve the results asynchronously using the collect_results method. For a fully managed experience, consider using simple_bulk_search.

4. Collecting Bulk Search Results

What It Does:
After initiating a bulk search, this snippet polls for and collects the complete results. The method waits until a specified fraction of the URLs have been processed (or a timeout is reached) and then retrieves the full page data.

from desync_search import DesyncClient

client = DesyncClient()
urls = [
    "https://example.com",
    "https://another-example.com",
    # Add more URLs as needed
]

# Initiate a bulk search
bulk_info = client.bulk_search(target_list=urls, extract_html=False)

# Poll and collect results once enough pages are complete
results = client.collect_results(
    bulk_search_id=bulk_info["bulk_search_id"],
    target_links=urls,
    wait_time=30.0,
    poll_interval=2.0,
    completion_fraction=0.975
)

print(f"Retrieved {len(results)} pages from the bulk search.")
for result in results:
    print("URL:", result.url)

5. Simple Bulk Search

What It Does:
For large lists of URLs (even exceeding 1000 elements), the simple_bulk_search method splits the list into manageable chunks, starts a bulk search for each chunk, and then aggregates all the results. This provides a fully managed bulk search experience.

from desync_search import DesyncClient

client = DesyncClient()
urls = [
    "https://example.com",
    "https://another-example.com",
    # Add as many URLs as needed; this method handles splitting automatically.
]

results = client.simple_bulk_search(
    target_list=urls,
    extract_html=False,
    poll_interval=2.0,
    wait_time=30.0,
    completion_fraction=1
)

print(f"Retrieved {len(results)} pages using simple_bulk_search.")
for result in results:
    print("URL:", result.url)

API Reference

DesyncClient Class

The DesyncClient class provides a high-level interface to the Desync Search API, managing individual searches, bulk operations, domain crawling, and credit balance checks.

__init__(user_api_key="", developer_mode=False)

Signature:

def __init__(self, user_api_key="", developer_mode=False)

Description:
Initializes the client with the provided API key or reads it from the DESYNC_API_KEY environment variable. If developer_mode is True, the client uses a test endpoint; otherwise, it uses the production endpoint.

Parameters:

  • user_api_key (str, optional): Your Desync API key.
  • developer_mode (bool, optional): Toggle between test and production endpoints.

Example:

from desync_search import DesyncClient

client = DesyncClient(user_api_key="YOUR_API_KEY", developer_mode=False)

search(url, search_type="stealth_search", scrape_full_html=False, remove_link_duplicates=True) -> PageData

Signature:

def search(self, url, search_type="stealth_search", scrape_full_html=False, remove_link_duplicates=True) -> PageData

Description:
Performs a single search on a specified URL and returns a PageData object containing the page’s text, links, timestamps, and other metadata.

Parameters:

  • url (str): The URL to scrape.
  • search_type (str): Either "stealth_search" (default) or "test_search".
  • scrape_full_html (bool): If True, returns the full HTML content.
  • remove_link_duplicates (bool): If True, removes duplicate links from the results.

Example:

result = client.search("https://example.com")
print(result.text_content)

bulk_search(target_list, extract_html=False) -> dict

Signature:

def bulk_search(self, target_list, extract_html=False) -> dict

Description:
Initiates an asynchronous bulk search on up to 1000 URLs at once. Returns a dictionary containing a bulk_search_id and other metadata.

Parameters:

  • target_list (list[str]): List of URLs to process.
  • extract_html (bool): If True, includes the full HTML content in results.

Example:

bulk_info = client.bulk_search(target_list=["https://example.com", "https://another-example.net"])
print(bulk_info["bulk_search_id"])

list_available(url_list=None, bulk_search_id=None) -> list

Signature:

def list_available(self, url_list=None, bulk_search_id=None) -> list

Description:
Retrieves minimal data about previously collected search results (IDs, domains, timestamps, etc.). Returns a list of PageData objects with limited fields.

Parameters:

  • url_list (list[str], optional): Filters results by specific URLs.
  • bulk_search_id (str, optional): Filters results by a particular bulk search ID.

Example:

partial_records = client.list_available(bulk_search_id="some-bulk-id")
for rec in partial_records:
    print(rec.url, rec.complete)

pull_data(record_id=None, url=None, domain=None, timestamp=None, bulk_search_id=None, search_type=None, latency_ms=None, complete=None, created_at=None) -> list

Signature:

def pull_data(self, record_id=None, url=None, domain=None, timestamp=None, bulk_search_id=None, search_type=None, latency_ms=None, complete=None, created_at=None) -> list

Description:
Retrieves full data (including text and optional HTML content) for one or more records matching the provided filters. Returns a list of PageData objects.

Example:

detailed_records = client.pull_data(url="https://example.com")
for record in detailed_records:
    print(record.html_content)

pull_credits_balance() -> dict

Signature:

def pull_credits_balance(self) -> dict

Description:
Checks the user’s current credit balance and returns it as a dictionary.

Example:

balance_info = client.pull_credits_balance()
print(balance_info["credits_balance"])

collect_results(bulk_search_id, target_links, wait_time=30.0, poll_interval=2.0, completion_fraction=0.975) -> list

Signature:

def collect_results(self, bulk_search_id: str, target_links: list, wait_time=30.0, poll_interval=2.0, completion_fraction=0.975) -> list

Description:
Polls periodically for bulk search completion until a specified fraction of pages are done or a maximum wait time elapses, then retrieves full data. Returns a list of PageData objects.

Parameters:

  • bulk_search_id (str): The unique identifier for the bulk search.
  • target_links (list[str]): The list of URLs in the bulk job.
  • wait_time (float): Maximum polling duration in seconds.
  • poll_interval (float): Interval between status checks.
  • completion_fraction (float): Fraction of completed results needed to stop polling.

Example:

results = client.collect_results(
    bulk_search_id="bulk-id-123",
    target_links=["https://example.com", "https://another.com"]
)
print(len(results))

simple_bulk_search(target_list: list, extract_html=False, poll_interval=2.0, wait_time=30.0, completion_fraction=1) -> list

Signature:

def simple_bulk_search(self, target_list: list, extract_html=False, poll_interval=2.0, wait_time=30.0, completion_fraction=1) -> list

Description:
Splits a large list of URLs into chunks (up to 1000 URLs each), initiates a bulk search for each chunk, then collects and aggregates the results.

Example:

all_pages = client.simple_bulk_search(
    target_list=["https://site1.com", "https://site2.com", ...],
    extract_html=False
)
print(len(all_pages))

crawl(start_url, max_depth=2, scrape_full_html=False, remove_link_duplicates=True, poll_interval=2.0, wait_time_per_depth=30.0, completion_fraction=0.975) -> list

Signature:

def crawl(self, start_url: str, max_depth=2, scrape_full_html=False, remove_link_duplicates=True, poll_interval=2.0, wait_time_per_depth=30.0, completion_fraction=0.975) -> list

Description:
Recursively crawls the specified start_url up to max_depth levels. It performs a stealth search on the starting page, collects same-domain links, and uses bulk searches to fetch pages at each depth.
Think of it this way: the starting page is "depth 0". Any same-domain link on that page is "depth 1", links on depth 1 pages become "depth 2", and so on until the maximum depth is reached or no new pages are found.

Example:

crawled_pages = client.crawl(
    start_url="https://example.com",
    max_depth=3,
    scrape_full_html=False
)
print(len(crawled_pages))

_post_and_parse(payload)

Signature:

def _post_and_parse(self, payload)

Description:
An internal helper method that sends the given payload to the API, parses the JSON response, and raises an error if the request fails.
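
The exact implementation of this internal helper is not part of the public documentation, but conceptually it behaves like the sketch below. The endpoint URL, the use of the requests library, and the timeout value are assumptions for illustration only:

import requests

API_URL = "https://example.invalid/desync"  # placeholder endpoint, not the real API URL

def post_and_parse(payload: dict) -> dict:
    # Send the payload as JSON, raise on HTTP errors, and return the parsed JSON response.
    response = requests.post(API_URL, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()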

PageData Class

The PageData class packages all the information extracted from a web page during a search. It includes both details about the page itself and metadata about the search operation (such as timestamps and latency).

Attributes

  • id (int):
    A unique identifier for the search result.

  • url (str):
    The URL targeted by the search, often referred to as the "target URL" or "target page" (e.g., abc.com/news).

  • domain (str):
    The domain of the targeted URL (e.g., if the URL is abc.com/news, the domain is abc.com).

  • timestamp (int):
    A Unix timestamp marking when the result was received.

  • bulk_search_id (str):
    A unique identifier for the bulk search batch this result belongs to. May be None if the result is not part of a bulk search.

  • search_type (str):
    Indicates the type of search performed. Options include:

    • stealth_search (default): Uses JavaScript rendering and stealth techniques.
    • test_search: Does not render JavaScript; intended for prototyping.
  • text_content (str):
    The text extracted from the page’s DOM, ideal for data extraction.

  • html_content (str):
    The full HTML content of the page (optional and not returned by default to save bandwidth).

  • internal_links (list[str]):
    A list of URLs on the page that point to the same domain.

  • external_links (list[str]):
    A list of URLs on the page that point to different domains.

  • latency_ms (int):
    The time in milliseconds between the start of the search and when the results were collected.

  • complete (bool):
    Indicates whether the search operation is complete.

  • created_at (int):
    A Unix timestamp marking when the search was initiated on the client-side.
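
Putting the attributes together, a minimal sketch that runs a single search and reads the documented fields:

from desync_search import DesyncClient

client = DesyncClient()
page = client.search("https://example.com")

# Identity and search metadata
print("URL:", page.url, "| Domain:", page.domain)
print("Search type:", page.search_type, "| Latency (ms):", page.latency_ms)
print("Complete:", page.complete)

# Extracted content and links
print("First 200 characters of text:", page.text_content[:200])
print("Internal links:", len(page.internal_links), "| External links:", len(page.external_links))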

This documentation provides you with everything you need to get started with Desync Search—from installation and quickstart examples to detailed API reference for both the client and the page data structure. Enjoy building your web data extraction projects!
