WebCollector
WebCollector is an open-source web crawler framework based on Java. It provides simple interfaces for crawling the Web, so you can set up a multi-threaded web crawler in less than 5 minutes.
In addition to a general crawler framework, WebCollector also integrates CEPF, a well-designed state-of-the-art web content extraction algorithm proposed by Wu et al.:
- Wu GQ, Hu J, Li L, Xu ZH, Liu PC, Hu XG, Wu XD. Online Web news extraction via tag path feature fusion. Ruan Jian Xue Bao/Journal of Software, 2016,27(3):714-735 (in Chinese). http://www.jos.org.cn/1000-9825/4868.htm
HomePage
https://github.com/CrawlScript/WebCollector
Installation
Using Maven
<dependency>
    <groupId>cn.edu.hfut.dmic.webcollector</groupId>
    <artifactId>WebCollector</artifactId>
    <version>2.73-alpha</version>
</dependency>
Without Maven
WebCollector jars are available on the HomePage.
- webcollector-version-bin.zip contains core jars.
Example Index
Annotation versions are named DemoAnnotatedxxxxxx.java.
- Basic
- CrawlDatum and MetaData
- Http Request and Javascript
- NextFilter
Quickstart
Let's crawl some news from the GitHub Blog. This demo prints out the titles and contents extracted from its news pages.
Automatically Detecting URLs
DemoAutoNewsCrawler.java:
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.rocks.BreadthCrawler;

public class DemoAutoNewsCrawler extends BreadthCrawler {

    /**
     * @param crawlPath path of the directory which maintains the crawling history
     * @param autoParse if true, the crawler automatically extracts links matching the regex rules
     */
    public DemoAutoNewsCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /* seed pages: the blog home page and list pages 2-5 */
        this.addSeed("https://blog.github.com/");
        for (int pageIndex = 2; pageIndex <= 5; pageIndex++) {
            String seedUrl = String.format("https://blog.github.com/page/%d/", pageIndex);
            this.addSeed(seedUrl);
        }
        /* only URLs matching this regex (news pages) are fetched */
        this.addRegex("https://blog.github.com/[0-9]{4}-[0-9]{2}-[0-9]{2}-[^/]+/");
        setThreads(50);
        getConf().setTopN(100);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();
        if (page.matchUrl("https://blog.github.com/[0-9]{4}-[0-9]{2}-[0-9]{2}[^/]+/")) {
            /* extract the title and content of news pages */
            String title = page.select("h1[class=lh-condensed]").first().text();
            String content = page.selectText("div.content.markdown-body");
            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);
        }
    }

    public static void main(String[] args) throws Exception {
        DemoAutoNewsCrawler crawler = new DemoAutoNewsCrawler("crawl", true);
        /* start crawling with a depth of 4 */
        crawler.start(4);
    }
}
Manually Detecting URLs
DemoManualNewsCrawler.java:
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.rocks.BreadthCrawler;

public class DemoManualNewsCrawler extends BreadthCrawler {

    public DemoManualNewsCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /* seed pages are marked with type "list" */
        this.addSeedAndReturn("https://blog.github.com/").type("list");
        for (int pageIndex = 2; pageIndex <= 5; pageIndex++) {
            String seedUrl = String.format("https://blog.github.com/page/%d/", pageIndex);
            this.addSeed(seedUrl, "list");
        }
        setThreads(50);
        getConf().setTopN(100);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();
        if (page.matchType("list")) {
            /* detect news links on list pages and mark them with type "content" */
            next.add(page.links("h1.lh-condensed>a")).type("content");
        } else if (page.matchType("content")) {
            /* extract the title and content of news pages */
            String title = page.select("h1[class=lh-condensed]").first().text();
            String content = page.selectText("div.content.markdown-body");

            /* read the custom configuration set in main() */
            title = getConf().getString("title_prefix") + title;
            /* guard against contents shorter than the limit */
            int limit = Math.min(content.length(), getConf().getInteger("content_length_limit"));
            content = content.substring(0, limit);

            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);
        }
    }

    public static void main(String[] args) throws Exception {
        DemoManualNewsCrawler crawler = new DemoManualNewsCrawler("crawl", false);
        crawler.getConf().setExecuteInterval(5000);
        /* custom key-value configuration used in visit() */
        crawler.getConf().set("title_prefix", "PREFIX_");
        crawler.getConf().set("content_length_limit", 20);
        /* start crawling with a depth of 4 */
        crawler.start(4);
    }
}
CrawlDatum
CrawlDatum is an important data structure in WebCollector: it corresponds to the URL of a web page. Both crawled URLs and detected URLs are maintained as CrawlDatums.
There are some differences between a CrawlDatum and a plain URL:
- A CrawlDatum contains a key and a URL. The key is the URL by default, but you can set the key manually with CrawlDatum.key("xxxxx"), so CrawlDatums with the same URL may have different keys. This is very useful in tasks like crawling data from an API, which often requests different data from the same URL with different POST parameters.
- A CrawlDatum may contain metadata, which can maintain information beyond the URL itself.
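For example, two CrawlDatums can share one API URL but stay distinct through their keys. A minimal sketch (the URL, the key scheme, and the page_num metadata name are all hypothetical):

// hypothetical API URL; the key suffix distinguishes the two requests
CrawlDatum page1 = new CrawlDatum("https://api.example.com/news")
        .key("https://api.example.com/news#page1")
        .meta("page_num", 1);
CrawlDatum page2 = new CrawlDatum("https://api.example.com/news")
        .key("https://api.example.com/news#page2")
        .meta("page_num", 2);
next.add(page1);
next.add(page2);

Because the two CrawlDatums have different keys, both will be crawled even though they share a URL.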
Manually Detecting URLs
In both void visit(Page page, CrawlDatums next) and void execute(Page page, CrawlDatums next), the second parameter CrawlDatums next is a container into which you should put the detected URLs:
next.add("detected URL");
next.add("detected URL", "type");
next.add(new CrawlDatum("detected URL"));
next.add("detected URL list");
next.add("detected URL list", "type");
next.add(new CrawlDatums("detected URL list"));
next.addAndReturn("detected URL").key("key").type("type");
next.addAndReturn("detected URL list").type("type").meta("page_num",10);
next.add("detected URL").type("type").meta("page_num", 10);
next.add("detected URL list").type("type").meta("page_num", 10);
You don't need to consider how to filter duplicated URLs; the crawler filters them automatically.
Plugins
Plugins provide a large part of the functionality of WebCollector. There are several kinds of plugins:
- Executor: plugins which define how to download webpages, how to parse webpages, and how to detect new CrawlDatums (URLs)
- DBManager: plugins which maintain the crawling history
- GeneratorFilter: plugins which generate the CrawlDatums (URLs) to be crawled
- NextFilter: plugins which filter the CrawlDatums (URLs) detected by the crawler
BreadthCrawler and RamCrawler are the most commonly used crawlers; both extend AutoParseCrawler. The following plugins only work in crawlers which extend AutoParseCrawler:
- Requester: plugins which define how to perform HTTP requests
- Visitor: plugins which define how to parse webpages and how to detect new CrawlDatums (URLs)
Plugins can be mounted as follows:
crawler.setRequester(xxxxx);
crawler.setVisitor(xxxxx);
crawler.setNextFilter(xxxxx);
crawler.setGeneratorFilter(xxxxx);
crawler.setExecutor(xxxxx);
crawler.setDBManager(xxxxx);
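For instance, a standalone Visitor plugin can be mounted on a crawler instead of overriding visit directly. A minimal sketch, assuming the Visitor interface lives in the fetcher package and declares the same visit(Page, CrawlDatums) method used by BreadthCrawler above:

import cn.edu.hfut.dmic.webcollector.fetcher.Visitor; // assumed package
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;

public class PrintUrlVisitor implements Visitor {
    @Override
    public void visit(Page page, CrawlDatums next) {
        // just print each visited URL; detected URLs would be added to next
        System.out.println("visiting: " + page.url());
    }
}

It can then be mounted with crawler.setVisitor(new PrintUrlVisitor()).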
AutoParseCrawler is also an Executor plugin, a Requester plugin, and a Visitor plugin. By default it uses itself as the Executor plugin, Requester plugin, and Visitor plugin. So if you want to write a plugin for AutoParseCrawler, you have two ways:
- Just override the corresponding methods of your AutoParseCrawler. For example, if you are using BreadthCrawler, all you have to do is override its Page getResponse(CrawlDatum crawlDatum) method.
- Create a new class which implements the Requester interface and implement its Page getResponse(CrawlDatum crawlDatum) method. Instantiate the class and use crawler.setRequester(the instance) to mount the plugin on the crawler.
Customizing Requester Plugin
Creating a Requester plugin is very easy: create a new class which implements the Requester interface and implement its Page getResponse(CrawlDatum crawlDatum) method. OkHttpRequester is a Requester plugin provided by WebCollector. You can find the code here: OkHttpRequester.class.
Most of the time you don't need to write a Requester plugin from scratch. Extending OkHttpRequester is a convenient way to create one.
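For example, a custom requester that attaches extra headers might look like the following minimal sketch. It assumes OkHttpRequester lives in the plugin.net package and exposes a createRequestBuilder(CrawlDatum) hook, as in recent 2.7x releases; the User-Agent value is a placeholder.

import cn.edu.hfut.dmic.webcollector.model.CrawlDatum;
import cn.edu.hfut.dmic.webcollector.plugin.net.OkHttpRequester; // assumed package
import okhttp3.Request;

public class MyRequester extends OkHttpRequester {
    @Override
    public Request.Builder createRequestBuilder(CrawlDatum crawlDatum) {
        // reuse the default builder and attach a custom header
        return super.createRequestBuilder(crawlDatum)
                .header("User-Agent", "MyCrawler/1.0"); // placeholder value
    }
}

Mount it with crawler.setRequester(new MyRequester()).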
Configuration Details
The configuration mechanism of WebCollector was redesigned in version 2.70. The above example DemoManualNewsCrawler.java also shows how to use configuration to customize your crawler.
Before version 2.70, configuration was maintained by static variables in the class cn.edu.hfut.dmic.webcollector.util.Config, hence it was cumbersome to assign different configurations to different crawlers.
Since version 2.70, each crawler can have its own configuration. You can use crawler.getConf() to get it or crawler.setConf(xxx) to set it. By default, all crawlers use a singleton default configuration, which can be obtained by Configuration.getDefault(). So in the above example DemoManualNewsCrawler.java, crawler.getConf().set("xxx", "xxx") would affect the default configuration, which may be used by other crawlers.
If you want to change the configuration of a crawler without affecting other crawlers, you should manually create a configuration and specify it to the crawler. For example:
Configuration conf = Configuration.copyDefault();
conf.set("test_string_key", "test_string_value");
conf.setReadTimeout(1000 * 5);
// mount the copied configuration so the default configuration stays untouched
crawler.setConf(conf);
// getConf() now returns the crawler's own configuration
crawler.getConf().set("test_int_key", 10);
crawler.getConf().setConnectTimeout(1000 * 5);
Configuration.copyDefault() is suggested because it creates a copy of the singleton default configuration, which contains some necessary key-value pairs, while new Configuration() creates an empty configuration.
Resumable Crawling
If you want to stop a crawler and continue crawling the next time, you should do two things:
- Add crawler.setResumable(true) to your code.
- Don't delete the history directory generated by the crawler, which is specified by the crawlPath parameter.
When you call crawler.start(depth), the crawler deletes the history if resumable is set to false, which is the default. So it doesn't matter if you forget to put crawler.setResumable(true) in your code before starting your crawler for the first time: at that point there is no history directory yet.
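For example, reusing the DemoAutoNewsCrawler shown above, a resumable run looks like this minimal sketch:

DemoAutoNewsCrawler crawler = new DemoAutoNewsCrawler("crawl", true);
crawler.setResumable(true); // keep the history under the "crawl" directory between runs
crawler.start(4);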
Content Extraction
WebCollector can automatically extract content from news web pages:
News news = ContentExtractor.getNewsByHtml(html, url);
News news = ContentExtractor.getNewsByHtml(html);
News news = ContentExtractor.getNewsByUrl(url);
String content = ContentExtractor.getContentByHtml(html, url);
String content = ContentExtractor.getContentByHtml(html);
String content = ContentExtractor.getContentByUrl(url);
Element contentElement = ContentExtractor.getContentElementByHtml(html, url);
Element contentElement = ContentExtractor.getContentElementByHtml(html);
Element contentElement = ContentExtractor.getContentElementByUrl(url);
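For example, a minimal usage sketch (the URL is hypothetical, and the getTitle()/getContent() accessors on News are assumptions):

News news = ContentExtractor.getNewsByUrl("https://blog.github.com/xxxx-xx-xx-some-post/"); // hypothetical URL
System.out.println(news.getTitle());   // assumed accessor
System.out.println(news.getContent()); // assumed accessor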