# com.blogspot.geekabyte.webmuncher:webmuncher

A general purpose web crawler written in Java

webmuncher is a tool that makes it easy to retrieve all the contents of a website; more precisely, all the contents under a single domain. This is the perfect use case, reflecting the original need for which it was created. You can read more about that [here](http://geekabyte.blogspot.be/2014/12/a-web-scrapercrawler-in-java-krwkrw.html).

webmuncher is available on Maven Central, and you can easily drop it into your project with the following coordinates:

Maven:

    <dependency>
        <groupId>com.blogspot.geekabyte.webmuncher</groupId>
        <artifactId>webmuncher</artifactId>
        <version>${webmuncher.version}</version>
    </dependency>

Gradle:

    dependencies {
        compile "com.blogspot.geekabyte.webmuncher:webmuncher:${webmuncher.version}"
    }

Alternatively, you can build from source and put the resulting jar on your classpath.

The available releases can be seen [here](https://github.com/dadepo/webmuncher/releases).

The announcement for the most recent release can be seen here.

### How to use webmuncher

webmuncher is designed around the [Strategy pattern](http://en.wikipedia.org/wiki/Strategy_pattern). The main object you work with is the webmuncher object, while the client using webmuncher needs to provide an implementation of the webAction interface, which contains the code that operates on every fetched page, represented by the FetchedPage object.

The webAction interface has only one method that needs to be implemented: the execute() method. The execute() method is given a FetchedPage object which contains the information extracted from every crawled page, e.g. the HTML content of the page, the URL of the page, the title of the page, the time it took webmuncher to retrieve the page, etc.
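
For illustration only, a minimal action that simply prints details of each fetched page might look like the sketch below. TitlePrintingAction is a hypothetical name, and the FetchedPage accessors used here are the ones that appear in the JPA example further down this README.

    // Illustrative sketch: a minimal webAction that prints basic details
    // of every fetched page (class name is hypothetical)
    class TitlePrintingAction implements webAction {

        @Override
        public void execute(FetchedPage page) {
            System.out.println(page.getTitle() + " -> " + page.getUrl());
            System.out.println("Load time: " + page.getLoadTime());
        }
    }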

Since version 0.1.2, webmuncher comes with utility FetchActions that make it easy to persist crawled pages. The included utility actions are:

  1. JDBCAction - for persisting web pages into a relational database. (since 0.1.2)
  2. ElasticSearchAction - for indexing web pages into ElasticSearch. (since 0.1.2)
  3. CSVAction - for saving web pages into a CSV file. (since 0.1.2)

For example, to use webmuncher to extract all the contents of http://www.example.com into a CSV file, you do:


    // Use the builder to build the CSVAction
    CSVAction action = CSVAction.builder()
            .convertToPlainText(true) // converts HTML to plain text
            .setDestination(Paths.get("example-com.csv"))
            .buildAction();

    // Create an instance of the crawler with the action
    webmuncher crawler = new webmuncher(action);

    // Configure the crawler to your heart's desire

    // The crawler will wait 20 seconds between requests
    crawler.setDelay(20);

    // When at first you don't succeed?
    // Give up and move on to the next one, after 3 attempts!
    crawler.setMaxRetry(3);

    // The crawler selects randomly from the list of user agents
    // you give for each request
    crawler.setUserAgents(Arrays.asList(
        "Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6)...",
        "Opera/9.80 (X11; Linux i686; Ubuntu/14.10)..."));

    // Provide the list of addresses to use as the referrer, so that when the
    // folks at example.com check their web server logs, sometimes the request
    // appears to come from Google, sometimes from Yahoo, sometimes from Bing...
    crawler.setReferrals(Arrays.asList(
        "http://www.google.com",
        "http://www.yahoo.com",
        "http://www.bing.com"));

    // Start the crawling operation as a blocking call.
    Set<String> strings = crawler.crawl("http://www.example.com");

    // If you want to execute the crawling in another thread,
    // so the current thread does not block, use instead:
    Set<String> strings = crawler.crawlAsync("http://www.example.com");

    // In case you do the crawling in another thread, you most likely
    // want to be notified when the crawling operation terminates.
    // In such a case, use crawler.onExit(FetchExitCallback callback)
    // to register the callback.

The above steps make use of the CSVAction that comes with the library. In case there are custom operations you want applied to the fetched web pages, you can easily implement your own FetchAction. For example, a JPA-backed webAction implementation may look like:

    class CustomJpaAction implements webAction {

        private EntityManager em;
        private EntityManagerFactory emf;

        /**
         * Operates on a given {@link com.blogspot.geekabyte.webmuncher.FetchedPage}
         *
         * @param page the fetched page to persist
         */
        @Override
        public void execute(FetchedPage page) {
            emf = Persistence.createEntityManagerFactory("FetchedPage");
            em = emf.createEntityManager();
            em.getTransaction().begin();

            FetchedPageEntity entity = new FetchedPageEntity();
            entity.setHtml(page.getHtml());
            entity.setLoadTime(page.getLoadTime());
            entity.setStatus(page.getStatus());
            entity.setTitle(page.getTitle());
            entity.setUrl(page.getUrl());
            entity.setSourceUrl(page.getSourceUrl());

            em.persist(entity);
            em.flush();
            em.getTransaction().commit();
        }
    }
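
A custom action is wired into the crawler in the same way as the bundled actions. The sketch below simply reuses the constructor and the crawl() call from the CSV example above:

    // Create the crawler with the custom JPA-backed action
    webmuncher crawler = new webmuncher(new CustomJpaAction());

    // Crawl the site; every fetched page is passed to CustomJpaAction.execute()
    Set<String> urls = crawler.crawl("http://www.example.com");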

### Overview of the webmuncher API

The accompanying Javadoc should be helpful for getting an overview of the API. It can be generated using the [Javadoc tool](http://www.oracle.com/technetwork/articles/java/index-jsp-135444.html) or via Maven using the [Maven Javadoc plugin](http://maven.apache.org/plugins/maven-javadoc-plugin/).

More conveniently, thanks to Javadoc.io, you can also access the most recent Javadoc online.

The API documentation for the older version (Krwkrw) can be found online here.

### License

[The MIT License (MIT)](http://www.opensource.org/licenses/mit-license.php)
