
github.com/mkralla11/quick_crawler
QuickCrawler is a Rust crate that provides a completely async, declarative web crawler with domain-specific request rate-limiting built-in.
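To try the example below in your own project, add the crate (and the helper crates the example uses) to Cargo.toml. The versions here are placeholders, not pins taken from this README; check crates.io for current releases:

[dependencies]
quick_crawler = "*"  # placeholder; pin to the latest published version
surf = "*"           # used by the example's request handler
async-std = "*"      # used by the example's executor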
Let's say you are trying to crawl a subset of pages for a given domain:
https://bike-site.com/search?q=red-bikes
and a regular GET request will return:
<html>
    <body>
        <div>
            <a class="bike-item" href="https://bike-site.com/red-bike-1">
                cool red bike 1
            </a>
            <a class="bike-item" href="https://bike-site.com/red-bike-2">
                cool red bike 2
            </a>
            <a class="bike-item" href="https://bike-site.com/red-bike-3">
                cool red bike 3
            </a>
            <div>
                <a class="bike-other-item" href="https://bike-site.com/other-red-bike-4">
                    other cool red bike 4
                </a>
            </div>
        </div>
    </body>
</html>
and when navigating to links 1 through 3 on that page, EACH PAGE returns:
<html>
    <body>
        <div class='awesome-bike'>
            <div class='bike-info'>
                The best bike ever.
            </div>
            <ul class='bike-specs'>
                <li>
                    Super fast.
                </li>
                <li>
                    Jumps high.
                </li>
            </ul>
        </div>
    </body>
</html>
and when navigating to the last link on that page, it returns:
<html>
    <body>
        <div class='other-bike'>
            <div class='other-bike-info'>
                The best bike ever.
            </div>
            <ul class='other-bike-specs'>
                <li>
                    Super slow.
                </li>
                <li>
                    Doesn't jump.
                </li>
            </ul>
        </div>
    </body>
</html>
QuickCrawler helps you declaratively crawl and scrape data from each of these pages with ease:
use quick_crawler::{
    QuickCrawler,
    QuickCrawlerBuilder,
    QuickCrawlerError,
    RequestHandlerConfig,
    limiter::Limiter,
    scrape::{
        ResponseLogic::Parallel,
        StartUrl,
        Scrape,
        ElementUrlExtractor,
        ElementDataExtractor
    }
};

fn main() {
    let mut builder = QuickCrawlerBuilder::new();

    let start_urls = vec![
        StartUrl::new()
            .url("https://bike-site.com/search?q=red-bikes")
            .method("GET")
            .response_logic(Parallel(vec![
                // Each Scrape below will be provided the html page response body
                Scrape::new()
                    .find_elements_with_urls(".bike-item")
                    .extract_urls_from_elements(ElementUrlExtractor::Attr("href".to_string()))
                    // now set up the logic to execute on each of the returned html pages
                    .response_logic(Parallel(vec![
                        Scrape::new()
                            .find_elements_with_data(".awesome-bike .bike-info")
                            .extract_data_from_elements(ElementDataExtractor::Text)
                            .store(|vec: Vec<String>| async move {
                                println!("store bike info in DB: {:?}", vec);
                            }),
                        Scrape::new()
                            .find_elements_with_data(".bike-specs li")
                            .extract_data_from_elements(ElementDataExtractor::Text)
                            .store(|vec: Vec<String>| async move {
                                println!("store bike specs in DB: {:?}", vec);
                            }),
                    ])),
                Scrape::new()
                    .find_elements_with_urls(".bike-other-item")
                    .extract_urls_from_elements(ElementUrlExtractor::Attr("href".to_string()))
                    .response_logic(Parallel(vec![
                        Scrape::new()
                            .find_elements_with_data(".other-bike .other-bike-info")
                            .extract_data_from_elements(ElementDataExtractor::Text)
                            .store(|vec: Vec<String>| async move {
                                println!("store other bike info in DB: {:?}", vec);
                            }),
                        Scrape::new()
                            .find_elements_with_data(".other-bike-specs li")
                            .extract_data_from_elements(ElementDataExtractor::Text)
                            .store(|vec: Vec<String>| async move {
                                println!("store other bike specs in DB: {:?}", vec);
                            }),
                    ])),
            ])),
        // add more StartUrl::new's if you feel ambitious
    ];

    // It's smart to use a limiter - for now it's automatically set to
    // 3 requests per second per domain. This will soon be configurable.
    let limiter = Limiter::new();

    builder
        .with_start_urls(start_urls)
        .with_limiter(limiter)
        // Optionally configure how to make a request and return an html string
        .with_request_handler(|config: RequestHandlerConfig| async move {
            // ... use any request library, like surf or reqwest
            surf::get(config.url.clone())
                .recv_string()
                .await
                .map_err(|_| QuickCrawlerError::RequestErr)
        });

    let crawler = builder
        .finish()
        .map_err(|_| "Builder could not finish")
        .expect("no error");

    // QuickCrawler is async, so choose your favorite executor.
    // (Tested and working for both async-std and tokio.)
    let res = async_std::task::block_on(async {
        crawler.process().await
    });
}
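The example above drives the crawler with async_std's block_on. Since the crate is noted as tested with tokio as well, an equivalent tokio entry point might look like the following. This is a minimal sketch, assuming a tokio 1.x dependency with the "rt-multi-thread" feature enabled; the builder setup is identical to the example above and is elided here.

fn main() {
    // ... build `crawler` exactly as in the example above ...

    // Build a multi-threaded tokio runtime and block on the crawl,
    // mirroring the async_std::task::block_on call above.
    let rt = tokio::runtime::Runtime::new().expect("failed to build tokio runtime");
    let res = rt.block_on(async {
        crawler.process().await
    });
}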
To run the tests, clone the repo and run:

cargo watch -x check -x 'test -- --nocapture'
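This assumes the cargo-watch subcommand is installed; if it isn't, it can be added with:

cargo install cargo-watch

A plain cargo test -- --nocapture works too if you don't want the file watcher.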
See the test in src/lib.rs for example usage.
If you use this crate and find it helpful to your project, please star it!
License: MIT