
The Spider Cloud Python SDK offers a toolkit for straightforward website scraping, crawling at scale, and other utilities like extracting links and taking screenshots, enabling you to collect data formatted for compatibility with large language models (LLMs). It features a user-friendly interface for seamless integration with the Spider Cloud API.
To install the Spider Cloud Python SDK, you can use pip:
pip install spider-client
Set your API key as an environment variable named SPIDER_API_KEY, or pass it as a parameter to the Spider class. Here's an example of how to use the SDK:
from spider import Spider
# Initialize the Spider with your API key
app = Spider(api_key='your_api_key')
# Scrape a single URL
url = 'https://spider.cloud'
scraped_data = app.scrape_url(url)
# Crawl a website
crawler_params = {
    'limit': 1,
    'proxy_enabled': True,
    'store_data': False,
    'metadata': False,
    'request': 'http'
}
crawl_result = app.crawl_url(url, params=crawler_params)
To scrape data from a single URL:
url = 'https://example.com'
scraped_data = app.scrape_url(url)
To automate crawling a website:
url = 'https://example.com'
crawl_params = {
    'limit': 200,
    'request': 'smart_mode'
}
crawl_result = app.crawl_url(url, params=crawl_params)
Stream crawl the website in chunks to scale, processing each result with a callback:
def handle_json(json_obj: dict) -> None:
    assert json_obj["url"] is not None

url = 'https://example.com'
crawl_params = {
    'limit': 200,
    'store_data': False
}
response = app.crawl_url(
    url,
    params=crawl_params,
    stream=True,
    callback=handle_json,
)
Perform a search for websites to crawl or gather search results:
query = 'a sports website'
crawl_params = {
    'request': 'smart_mode',
    'search_limit': 5,
    'limit': 5,
    'fetch_page_content': True
}
crawl_result = app.search(query, params=crawl_params)
Extract all links from a specified URL:
url = 'https://example.com'
links = app.links(url)
Transform HTML to markdown or text lightning fast:
data = [ { 'html': '<html><body><h1>Hello world</h1></body></html>' } ]
params = {
    'readability': False,
    'return_format': 'markdown',
}
result = app.transform(data, params=params)
Capture a screenshot of a given URL:
url = 'https://example.com'
screenshot = app.screenshot(url)
Extract contact details from a specified URL:
url = 'https://example.com'
contacts = app.extract_contacts(url)
Label the data extracted from a particular URL:
url = 'https://example.com'
labeled_data = app.label(url)
You can check the crawl state of the website:
url = 'https://example.com'
state = app.get_crawl_state(url)
You can create a signed URL to download the stored results for a website:
url = 'https://example.com'
params = {
    'page': 0,
    'limit': 100,
    'expiresIn': 3600  # Optional, add if needed
}
stream = True
state = app.create_signed_url(url, params, stream)
You can check the remaining credits on your account:
credits = app.get_credits()
The Spider client can now interact with specific data tables to create, retrieve, and delete data.
To fetch data from a specified table by applying query parameters:
table_name = 'pages'
query_params = {'limit': 20 }
response = app.data_get(table_name, query_params)
print(response)
To delete data from a specified table based on certain conditions:
table_name = 'websites'
delete_params = {'domain': 'www.example.com'}
response = app.data_delete(table_name, delete_params)
print(response)
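The create operation mentioned above has no example here; a minimal sketch, assuming a data_post method that mirrors the data_get and data_delete calls shown above, might look like this:
# Create a record in a specified table
# Note: data_post is assumed by analogy with data_get/data_delete above
table_name = 'websites'
post_data = {'url': 'https://example.com'}
response = app.data_post(table_name, post_data)
print(response)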
If you need to stream the request, use the third parameter:
url = 'https://example.com'
crawler_params = {
    'limit': 1,
    'proxy_enabled': True,
    'store_data': False,
    'metadata': False,
    'request': 'http'
}
links = app.links(url, crawler_params, True)
The following Content-Type headers are supported via the fourth parameter:
application/json
text/csv
application/xml
application/jsonl
url = 'https://example.com'
crawler_params = {
    'limit': 1,
    'proxy_enabled': True,
    'store_data': False,
    'metadata': False,
    'request': 'http'
}
# stream json lines back to the client
links = app.crawl(url, crawler_params, True, "application/jsonl")
The SDK handles errors returned by the Spider Cloud API and raises appropriate exceptions. If an error occurs during a request, an exception will be raised with a descriptive error message.
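For example, you can wrap calls in a try/except block. This is a minimal sketch that catches a generic Exception, since the specific exception classes raised by the SDK are not listed here:
url = 'https://example.com'
try:
    scraped_data = app.scrape_url(url)
except Exception as error:
    # The SDK raises an exception with a descriptive message when the API returns an error
    print(f"Scrape failed: {error}")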
Contributions to the Spider Cloud Python SDK are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.
The Spider Cloud Python SDK is open-source and released under the MIT License.