arc-crawler is a flexible Python module designed to simplify complex web scraping tasks. It focuses on efficient, resumable data acquisition, structured output management, and customizable data processing.
You can easily install arc-crawler using pip:
pip install arc-crawler
This example shows how to quickly crawl a few Wikipedia pages and save only their <body> contents using the built-in html_body_processor.
from arc_crawler import Crawler, html_body_processor

def crawl_wikipedia():
    urls_to_fetch = [
        "https://en.wikipedia.org/wiki/JavaScript",
        "https://en.wikipedia.org/wiki/Go_(programming_language)",
        "https://en.wikipedia.org/wiki/Python_(programming_language)",
        "https://en.wikipedia.org/wiki/Node.js",
    ]

    # Initialize the crawler, outputting to a new './output' directory
    crawler = Crawler(out_file_path="./output")

    # Fetch URLs, process responses to save only the <body> tag,
    # and save the results to 'wiki-programming' files
    reader = crawler.get(
        urls_to_fetch,
        out_file_name="wiki-programming",
        response_processor=html_body_processor,  # Use the built-in processor
    )

    print(f"Successfully gathered {len(reader)} Wikipedia entries.")
    print(f"Dataset file is located at: {reader.path}")

if __name__ == "__main__":
    crawl_wikipedia()
For more detailed examples demonstrating advanced features like custom response/request processors, metadata indexing, and handling JSON content, please refer to the basic.py and advanced.py files within the repository.
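To give a feel for what a custom processor might look like, here is a minimal sketch. It assumes — and this is an assumption about the callback shape, not the documented API; check help(Crawler.get) for the actual signature — that a response_processor is a plain callable receiving the raw response body as a string and returning the string to store. The function name json_slim_processor is hypothetical.

```python
import json

def json_slim_processor(body: str) -> str:
    # Hypothetical processor: assumes the callback receives the raw
    # response body as a string and returns the string to persist.
    # Keeps only two fields instead of storing the full payload.
    data = json.loads(body)
    slimmed = {"id": data.get("id"), "title": data.get("title")}
    return json.dumps(slimmed)

# Because it is a plain function, it can be exercised without any crawling:
print(json_slim_processor('{"id": 7, "title": "Hello", "body": "ignored"}'))
# → {"id": 7, "title": "Hello"}
```

Keeping processors as small, pure functions like this makes them easy to unit-test in isolation before wiring them into a crawl.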
Additionally, for comprehensive API documentation, including all available parameters, return types, and internal workings, explore the detailed docstrings within the arc-crawler module's source code (e.g., run help(Crawler) or help(Crawler.get) in your Python interpreter, or view the docs through your IDE's tooling).
For optimal efficiency and maintainability in web scraping, it is strongly recommended to save raw, unaltered response data first within your response_processor callbacks, and to avoid implementing complex parsing logic during the fetching phase. This approach offers a significant advantage: saved raw data can be re-parsed at any time, so refining or fixing your parsing logic never costs you another round of network requests.
In essence, with arc-crawler, focus on saving the raw source data. Parsing is a distinct, subsequent step best performed after fetching.
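That two-phase workflow can be sketched with the standard library alone (no arc-crawler specifics are assumed here): phase one stores the response verbatim, and phase two parses the saved HTML whenever, and however often, you like.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Phase-two parser: walks raw HTML that the fetch phase saved verbatim."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href attribute of every anchor tag encountered.
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

# Pretend this string was saved untouched during the fetch phase;
# parsing it is now a cheap, repeatable, fully offline operation.
raw_page = '<body><a href="/wiki/Python">Python</a> <a href="/wiki/Go">Go</a></body>'
extractor = LinkExtractor()
extractor.feed(raw_page)
print(extractor.links)  # → ['/wiki/Python', '/wiki/Go']
```

If the parser turns out to miss a field, you simply run it again over the stored pages — no re-crawl required.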
FAQs
Configurable crawler for web-scraping
We found that arc-crawler demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.