Web Article Curator
A crawler program which can be used for downloading the content of portals (news, forums, blogs) and converting it to the desired output format, in accordance with the configuration.
Requirements
- Python 3.8+
- (optional for corpus converter if installed as
webarticlecurator[newspaper3k]
) for Newspaper3k, the installation of the following packages must precede the installation of this program: python3-dev libxml2-dev libxslt-dev libjpeg-dev zlib1g-dev libpng12-dev
Install
pip
pip3 install webarticlecurator
The following extras can be installed:
- Newspaper3k:
newspaper
- All the above:
full
E.g. pip3 install webarticlecurator[full]
Manual
Poetry and (optionally) GNU Make are required.
git clone https://github.com/ELTE-DH/webarticlecurator.git
- Run
make
On Windows or without Make (after cloning the repository):
poetry install --no-root
poetry build
poetry run pip install --upgrade dist/*.whl
(the correct filename must be specified on Windows)
To install extras run: poetry install -E [NAME OF THE EXTRA TO INSTALL]
Usage
The program can be used in multiple ways:
- Crawling (see the options below):
python3 -m webarticlecurator crawl CONFIGURATION [parameters]
- Listing URLs in a previously created WARC file:
python3 -m webarticlecurator listurls -s SOURCE_WARC
- Validating a previously created WARC file (with warcio):
python3 -m webarticlecurator validate -s SOURCE_WARC
- Sampling a previously created WARC file based on a list of URLs (one URL per line, URLs not present in the source archive are downloaded if
--offline
is False. If --negative
is specified all URLs are sampled except ones from the list): python3 -m webarticlecurator sample -s SOURCE_WARC -i selected_urls.txt TARGET_WARC --offline True/False --negative True/False
- Printing the content of the selected URLs into an empty directory:
python3 -m webarticlecurator cat -s SOURCE_WARC -i selected_urls.txt TARGET_DIR
- Downloading a single URL (for testing purposes):
python3 -m webarticlecurator download SOURCE_URL TARGET_WARC
- Check URLs in the extracted article urls of an archive warc (for debugging a portal):
python3 -m webarticlecurator checkurls -s SOURCE_WARC -i selected_urls.txt -d TARGET_DIR CONFIGURATION
- As a library: Check strategies.py and enhanced_downloader.py for details !
Configuration schema
The configuration is divided into three levels. On the first two levels, YAML is used for the configuration format with schema checks.
The third level of configuration uses Python functions.
Crawl-level configuration
It specifies the configuration for the current crawling process with the following fields:
schema
: The filename pointing to the schema of the portal to be crawled (second level configuration)output_corpus
(optional): The desired filename of the output corpus (default: no output corpus)log_file_archive
(optional): The log file for the archive crawler (default: log is not saved)log_file_articles
(optional): The log file for the article crawler (default: log is no saved)new_problematic_archive_urls
(optional): The file where the problematic archive URLs should be written (default: URLs are not saved)new_problematic_urls
(optional): The file where the problematic article URLs should be written (default: URLs are not saved)new_good_archive_urls
(optional): The file where the newly downloaded, good archive URLs should be written (default: URLs are not saved)new_good_urls
(optional): The file where the newly downloaded, good article URLs should be written (default: URLs are not saved)date_from
(optional): The inclusive minimal date of the required articles in ISO 8601 format, YYYY-MM-DD (default: from the schema of the portal if applies)date_until
(optional): The inclusive maximal date of the required articles in ISO 8601 format, YYYY-MM-DD (default: yesterday if applies)
Site schemas
The following parameters must be filled in the case of every portal:
site_name
: A friendly name for the portalnew_article_url_threshold
: The minimal amount of new URLs required on an archive page (e.g. the archive pages slid due to new articles in case of an active portal)
Python functions:
portal_specific_exctractor_functions_file
: The filename pointing to the python file which contains the required extractor functionsextract_next_page_url_fun
(it can be NULL): The name of the function to be imported from the portal_specific_exctractor_functions_file
to extract the "next page URL"extract_article_urls_from_page_fun
: The name of the function to be imported from the portal_specific_exctractor_functions_file
to extract the article URLs from the archive pageextract_article_urls_from_page_plus_fun
: The name of the function to be imported from the portal_specific_exctractor_functions_file
to extract the article URLs from the archive page with metadata form the portal's archive (for checkurls
mode)next_page_of_article_fun
(it can be NULL): The name of the function to be imported from the portal_specific_exctractor_functions_file
if there are multipage articles. This function extracts the "next page URL" for the rest of the pages in a multipage article. (It must be used with MultiPageArticleConverter
or similar as corpus_converter
to work!)corpus_converter_file
: The filename pointing to the python file which contains the required corpus extractor classcorpus_converter
: The name of the class to be imported from the corpus_converter_file
. The default is to do nothing (dummy-converter
).
Boolean features to describe the site:
next_url_by_pagenum
: Use page numbering for pagination of the archive, e.g. infinite scrolling (false means no pages or pages handled by extract_next_page_url_fun
)infinite_scrolling
: The crawler increment page numbers until the first page with zero article urlsarchive_page_urls_by_date
: Group the archive page URLs by their datesgo_reverse_in_archive
: Go reverse (backwards in time) in the archive by date (when the earliest article is not known)verify_request
: Suppress complaining about invalid HTTPS certificatesignore_archive_cache
: Ignore archive cache (for those portals which only use pagination)stop_on_empty_archive_page
(optional): Stop archive crawling if no articles extracted from page (default: false)stop_on_taboo_set
(optional): Stop archive crawling when one or more URLs in taboo_article_urls
list is specified (default: false)
Column definitions:
In the columns
dictionary, the following features can be set for each column (defined with a friendly name):
date_first_article
(optional): The date of the first article on the portal/column (also used for archive crawling)date_first_article
(optional): The date of the last article on the portal/column (also used for archive crawling)initial_pagenum
(optional): The initial page number which could be omitted (an empty string if not set, else it should be min_pagenum
- 1)min_pagenum
(optional): The "first" page number to increment (e.g. initial_pagenum + 1 = min_pagenum <= max_pagenum if not a single page column where only initial_pagenum must be specified, min_pagenum and max_pagenum must be omitted)max_pagenum
(optional): The upper bound of the number of pages for safety or for stop criteriaarchive_url_format
: The schema for the archive URL of the portal/column (supply #year
, #month
, #day
and
#next-year
, #next-month
, #next-day
tags which have to be replaced with the actual field of date, and
#pagenum
with the actual page number during the crawling)max_tries
(optional): How many times should we try to download an archive URL? (defautlt: 1)
Note: One can iterate the archive by months or years by omitting #day
(#next-day
) or #month
(#next-month
)
There are maximum three types of extractors to be included for each portal.
See the examples in the configs
directory for further information and DummyConverter
for the converter API.
Command line parameters
The first two command-line parameters should be crawl
and the filename pointing to the configuration file of the current crawl. These can be followed by some optional parameters:
--old-archive-warc OLD_ARCHIVE_WARC
: Existing WARC archives of the portal's archive (use them as cache)--archive-warc ARCHIVE_WARC
: New WARC archive of the portal's archive (copy all cached pages if --old-archive-warc
is specified)--old-articles-warc OLD_ARTICLES_WARC
: Existing WARC archives of the portal's archive (use them as cache)--articles-warc ARTICLES_WARC
: New WARC archive of the portal's archive (copy all cached pages if --old-archive-warc
is specified)--archive-just-cache [ARCHIVE_JUST_CACHE]
: Use only cached pages (no output WARC file): --old-archive-warc
must be specified!--articles-just-cache [ARTICLES_JUST_CACHE]
: Use only cached pages (no output WARC file): --old-articles-warc
must be specified!--debug-news-archive [DEBUG_NEWS_ARCHIVE]
: Set DEBUG logging on NewsArchiveCrawler and print the number of extracted URLs per page--strict [STRICT]
: Set strict-mode in WARCReader to enable validation--crawler-name CRAWLER_NAME
: The name of the crawler for the WARC info record--user-agent USER_AGENT
: The User-Agent string to use in headers while downloading--no-overwrite-warc
: Do not overwrite --{archive,articles}-warc
if needed--cumulative-error-threshold CUMULATIVE_ERROR_THRESHOLD
: The sum of download errors before giving up--known-bad-urls KNOWN_BAD_URLS
: Known bad URLs to be excluded from download (filename, one URL per line)--known-article-urls KNOWN_ARTICLE_URLS
: Known article URLs to mark the desired end of the archive (filename, one URL per line)--max-no-of-calls-in-period MAX_NO_OF_CALLS_IN_PERIOD
: Limit the number of HTTP requests per period--limit-period LIMIT_PERIOD
: Limit the period of HTTP requests (in seconds), see also --max-no-of-calls-in-period
--proxy-url PROXY_URL
: SOCKS Proxy URL to use, e.g. socks5h://localhost:9050--allow-cookies [ALLOW_COOKIES]
: Allow session cookies--stay-offline [STAY_OFFLINE]
: Do not download but write output WARC (see --just-cache
when no output WARC file is needed)--archive
: Crawl only the portal's archive--articles
: Crawl articles (and optionally use cached WARC for the portal's archive), DEFAULT behaviour--corpus
: Use --old-articles-warc
to create a corpus (no crawling, equals to --archive-just-cache
and --articles-just-cache
)
Licence
This project is licensed under the terms of the GNU LGPL 3.0 license.
Acknowledgement
This software is the direct continuation of corpusbuilder.
The authors gratefully acknowledge the groundbreaking work of all pioneers who inspired this program.
Special thanks go to Tibor Kákonyi who put the initial implementation under the GNU LGPL 3.0 license and allowing us to continue his work.
References
The DOI of the code is: https://doi.org/10.5281/zenodo.3755323
If you use this program, please cite the following paper:
The ELTE.DH Pilot Corpus – Creating a Handcrafted Gigaword Web Corpus with Metadata Balázs Indig, Árpád Knap, Zsófia Sárközi-Lindner, Mária Timári, Gábor Palkó In the Proceedings of the 12th Web as Corpus Workshop (WAC XII), pages 33-41 Marseille, France 2020
@inproceedings{indig-etal-2020-elte,
title = "The {ELTE}.{DH} Pilot Corpus {--} Creating a Handcrafted {G}igaword Web Corpus with Metadata",
author = {Indig, Bal{\'a}zs and
Knap, {\'A}rp{\'a}d and
S{\'a}rk{\"o}zi-Lindner, Zs{\'o}fia and
Tim{\'a}ri, M{\'a}ria and
Palk{\'o}, G{\'a}bor},
booktitle = "Proceedings of the 12th Web as Corpus Workshop",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://www.aclweb.org/anthology/2020.wac-1.5",
pages = "33--41",
abstract = "In this article, we present the method we used to create a middle-sized corpus using targeted web crawling. Our corpus contains news portal articles along with their metadata, that can be useful for diverse audiences, ranging from digital humanists to NLP users. The method presented in this paper applies rule-based components that allow the curation of the text and the metadata content. The curated data can thereon serve as a reference for various tasks and measurements. We designed our workflow to encourage modification and customisation. Our concept can also be applied to other genres of portals by using the discovered patterns in the architecture of the portals. We found that for a systematic creation or extension of a similar corpus, our method provides superior accuracy and ease of use compared to The Wayback Machine, while requiring minimal manpower and computational resources. Reproducing the corpus is possible if changes are introduced to the text-extraction process. The standard TEI format and Schema.org encoded metadata is used for the output format, but we stress that placing the corpus in a digital repository system is recommended in order to be able to define semantic relations between the segments and to add rich annotation.",
language = "English",
ISBN = "979-10-95546-68-9",
}