Warc2graph extracts a graph data structure from WARC files. The module was built to dig deeper into WARC files. It extracts (almost) all internal and external references from a WARC file by analyzing the WARC header and the payload. Several extraction methods are available and can be used individually or in combination. Warc2graph provides a command line interface (CLI) and can also be used as a Python module. The CLI outputs graph data in GEXF, a standard XML graph format, along with several visualizations of that data produced by different visualization algorithms. We acknowledge that visualizations carry an epistemic value and thus need to be designed according to the analyzed objects and research questions. Warc2graph uses NetworkX as its graph data and analytics backend, so more involved graph analytics are possible when warc2graph is used as a Python module.
The initial purpose of warc2graph was to analyze and visualize the textual structure of net literature works in the DLA corpus, which comprises net literature works and blogs dating from the early days of the web in the 1990s up to the 2000s. Development is part of the Science Data Center for Literature research project. We now consider warc2graph a tool for detailed WARC analytics of the referential structure of archived sites and hope that it will be useful for the web archiving and web research community.
Warc2graph is under active development.
If you consider using warc2graph for a research project or in an archival context, please get in touch! We'd love to hear about your work.
Warc2graph has been presented at the Electronic Literature Organization Conference 2020:
Overview and Video: https://elmcip.net/critical-writing/networks-net-literature-modelling-extracting-and-visualizing-link-based-networks
Conference Paper (PDF): https://elmcip.net/sites/default/files/media/critical_writing/attachments/claus-michael_schlesinger_mona_ulrich_pascal_hein_and_andre_blessing_networks_of_net_literature_-_modelling_extracting_and_visualizing_192.pdf
warc2graph requires Python >= 3.6.
Use the package manager pip to install warc2graph.
pip install warc2graph
Alternatively, you can install manually using the Python package setuptools.
git clone https://github.com/dla-marbach/warc2graph.git
cd warc2graph
python3 setup.py build
python3 setup.py install --user
To use the dot algorithm for visualizing the graph, make sure Graphviz is installed.
You can use the package in your Python projects, or you can use the provided command line interface. While the former offers more possibilities, the latter might be more intuitive.
Installing the package provides the warc2graph command for your terminal. Call warc2graph --help to get an overview of the available options.
If you want to create a model for a single WARC file, simply call:
warc2graph path/to/warc.warc.gz
If the WARC file is not on your file system and you want it to be downloaded from the internet, you can pass a URL instead. In that case you have to pass the parameter d.
warc2graph url/to/warc.warc.gz d
If you want to create a model from several WARC files that together archive one large website, first create a list of all the WARC files.
ls path/to/warcs/*.warc.gz >> list_of_warcs.txt
You can also create the file manually. It should contain one path per line, as follows.
path/to/warc1.warc.gz
path/to/warc2.warc.gz
path/to/warc3.warc.gz
path/to/warc4.warc.gz
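The same list can also be generated from Python, for example on systems without a shell glob. The following is a minimal sketch using only the standard library; the function name write_warc_list and the paths in the usage comment are placeholders chosen for illustration and are not part of warc2graph.

```python
from pathlib import Path


def write_warc_list(warc_dir, list_path):
    """Write one WARC path per line (sorted by name) and return the paths.

    Hypothetical helper for illustration; not part of warc2graph.
    """
    # Collect every *.warc.gz file directly inside warc_dir.
    warc_files = sorted(Path(warc_dir).glob("*.warc.gz"))
    lines = [str(path) for path in warc_files]
    # One path per line, matching the list format shown above.
    Path(list_path).write_text(
        "".join(line + "\n" for line in lines), encoding="utf-8"
    )
    return lines


# Example usage (placeholder paths):
# write_warc_list("path/to/warcs", "list_of_warcs.txt")
```

Unlike the shell command above, which appends with >>, this sketch overwrites the list file on every run.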
Then call warc2graph with the parameter wl and the list as an input file.
warc2graph list_of_warcs.txt wl
You can also model a website that is not archived. Create a plain text file containing the urls to all the webpages you want to consider. This file should look as follows.
url/to/webpage1.html
url/to/webpage2.htm
Then call warc2graph with the parameter ll and the list as an input file.
warc2graph list_of_webpages.txt ll
You can open examples.ipynb in Jupyter Notebook for some interactive examples.
Our package relies heavily on the networkx package. Read its documentation for further information about the possibilities and interfaces for the analysis of networkx graphs.
import warc2graph # our package
import matplotlib.pyplot as plt # plot graphs
import networkx as nx # handle graphs
# assign the path to a warc file to a variable
warc_path = "tests/WEB-20210202165627638-00000-24143~clarin02~8443.warc.gz"
# create a basic model with all resources as nodes and all links and embeddings as edges
basic_model = warc2graph.create_graph(warc_path)
# visualizing the graph using the graphviz "dot" algorithm
fig, ax = plt.subplots(1, figsize=(8, 4))
pos = nx.drawing.nx_agraph.graphviz_layout(basic_model, prog="dot")
nx.draw_networkx(basic_model, with_labels=False, pos=pos, ax=ax)
plt.show()  # display the figure; plt.draw() alone does not open a window in a plain script
import warc2graph # our package
import networkx as nx # handle graphs
from pprint import PrettyPrinter # print dicts nicely
pp = PrettyPrinter()
warc_path = "tests/WEB-20210202165627638-00000-24143~clarin02~8443.warc.gz"
basic_model = warc2graph.create_graph(warc_path)
degree_centralities = nx.algorithms.centrality.degree_centrality(basic_model)
pp.pprint(degree_centralities)
Outputs:
{'http://httpd.apache.org/': 0.07692307692307693,
'http://www.scientificlinux.org/': 0.07692307692307693,
'https://clarin09.ims.uni-stuttgart.de/': 0.23076923076923078,
'https://clarin09.ims.uni-stuttgart.de/icons/apache_pb2.gif': 0.07692307692307693,
'https://clarin09.ims.uni-stuttgart.de/sdc_warc/angular1.html': 0.23076923076923078,
'https://clarin09.ims.uni-stuttgart.de/sdc_warc/index.html': 0.8461538461538463,
'https://clarin09.ims.uni-stuttgart.de/sdc_warc/jquery.html': 0.23076923076923078,
'https://clarin09.ims.uni-stuttgart.de/sdc_warc/js/angular.min.js': 0.07692307692307693,
'https://clarin09.ims.uni-stuttgart.de/sdc_warc/js/jquery-1.11.3.min.js': 0.07692307692307693,
'https://clarin09.ims.uni-stuttgart.de/sdc_warc/page1.html': 0.15384615384615385,
'https://clarin09.ims.uni-stuttgart.de/sdc_warc/page2.html': 0.15384615384615385,
'https://clarin09.ims.uni-stuttgart.de/sdc_warc/page_target_ang1.html': 0.07692307692307693,
'https://clarin09.ims.uni-stuttgart.de/sdc_warc/page_target_jquery1.html': 0.07692307692307693,
'https://clarin09.ims.uni-stuttgart.de/sdc_warc/page_target_jquery2.html': 0.07692307692307693}
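These values follow the definition of degree centrality used by networkx: the degree of a node divided by n - 1, where n is the number of nodes (for directed graphs, the degree counts incoming plus outgoing edges). With the 14 nodes listed above, a node with a single connection scores 1/13, roughly 0.0769. The following pure-Python sketch reproduces that calculation for a small edge list; the function name and the example edges are illustrative assumptions and are not taken from warc2graph or networkx.

```python
from collections import Counter


def degree_centrality(edges):
    """Return degree / (n - 1) for every node that appears in an edge.

    Illustrative re-implementation; nodes without any edge are not
    represented in a plain edge list and therefore do not appear here.
    """
    degree = Counter()
    for source, target in edges:
        # Each edge contributes to the degree of both endpoints.
        degree[source] += 1
        degree[target] += 1
    if len(degree) <= 1:
        return {node: 1.0 for node in degree}
    scale = 1.0 / (len(degree) - 1)
    return {node: count * scale for node, count in degree.items()}


# Hypothetical example: an index page linking to three other pages.
edges = [
    ("index.html", "page1.html"),
    ("index.html", "page2.html"),
    ("index.html", "jquery.html"),
]
centralities = degree_centrality(edges)
for node, value in sorted(centralities.items()):
    print(node, round(value, 4))
```

In this toy graph the index page is connected to every other node and therefore gets the maximum value, while each linked page gets one third.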
You can also enrich the models using the original data.
import warc2graph # our package
# assign the path to a warc file to a variable
warc_path = "tests/WEB-20210202165627638-00000-24143~clarin02~8443.warc.gz"
# create an enriched model, structured like the basic model but containing the html content and counts of all tags
enriched_model = warc2graph.create_graph(warc_path, include_content=True, count_tags=True)
index_node = "https://clarin09.ims.uni-stuttgart.de/sdc_warc/index.html"
print(enriched_model.nodes[index_node]["counted_tags"])
# prints:
# {'html': 1, 'head': 1, 'meta': 1, 'title': 1, 'body': 1, 'a': 4, 'br': 6}
print(enriched_model.nodes[index_node]["content"])
Prints:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Insert title here</title>
</head>
<body>
<a href="page1.html">page1</a>
<br>
<br>
<a href="page2.html">page2</a>
<br>
<br>
<a href="angular1.html">angular1</a>
<br>
<br>
<a href="jquery.html">jquery</a>
</body>
</html>
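To illustrate how tag counts and link targets relate to the HTML above, here is a minimal sketch based on Python's standard html.parser module. It is not warc2graph's actual implementation; the class name TagAndLinkCollector is a hypothetical helper, and restricting the extraction to href and src attributes is a simplifying assumption.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class TagAndLinkCollector(HTMLParser):
    """Count start tags and collect absolute href/src targets (illustrative)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.tag_counts = Counter()
        self.references = []

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page URL.
                self.references.append(urljoin(self.base_url, value))


index_url = "https://clarin09.ims.uni-stuttgart.de/sdc_warc/index.html"
index_html = """<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Insert title here</title>
</head>
<body>
<a href="page1.html">page1</a>
<br>
<br>
<a href="page2.html">page2</a>
<br>
<br>
<a href="angular1.html">angular1</a>
<br>
<br>
<a href="jquery.html">jquery</a>
</body>
</html>
"""

collector = TagAndLinkCollector(index_url)
collector.feed(index_html)
collector.close()

print(dict(collector.tag_counts))
for target in collector.references:
    print(index_url, "->", target)
```

Each printed pair corresponds to one candidate edge from the index page to a linked resource, and the tag counts match the counted_tags value shown earlier for the same page.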
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
All contributed code will be licensed under the GNU Lesser General Public License.
By contributing you accept the following terms and conditions:
warc2graph is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
warc2graph is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with warc2graph. If not, see https://www.gnu.org/licenses/lgpl-3.0.html.
See COPYING and COPYING.LGPL.
warc2graph makes heavy and critical use of the following open source libraries: