WikiRevParser is a Python library that parses Wikipedia revision histories. It allows you to analyse the development of pages on Wikipedia over time and across language versions.
The library takes a language code and Wikipedia page title as input, extracts the revision history, and parses the noisy, unstructured content into clean, accessible data for each timestamp in the revision history. You can use this library to access the development of references of a page, analyse the content or images over time, compare the tables of contents across languages, create editor networks, and much more.
Besides WikiRevParser, you'll need our version of the Wikipedia API wrapper, which extracts and returns the entire revision history of a Wikipedia page. Note that Python 3 is required.
$ pip3 install wikirevparser
$ git clone git@github.com:ajoer/Wikipedia.git
To get the revision history for the page on Marie Curie on the English Wikipedia, run:
>>> from wikirevparser import wikirevparser
>>> parser_instance = wikirevparser.ProcessRevisions("en", "Marie Curie")
>>> parser_instance.wikipedia_page()
>>> data = parser_instance.parse_revisions()
Now you have the revisions of the Marie Curie page in a structured dictionary format, and you can start exploring the data.
Let's look at the use of links. I want to know whether the links on the page are the same now as when the page was first created.
>>> edits = list(data.items())
>>> first_links = edits[-1][1]["links"]
>>> latest_links = edits[0][1]["links"]
>>> present_now = first_links[0] in latest_links
>>> print("The only link in the first version was '%s'.\nThat link is still present in the current version: %s." % (first_links[0], present_now))
The only link in the first version was 'pierre and marie curie'.
That link is still present in the current version: False.
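The membership test above checks a single link. A minimal sketch of comparing the full link sets of two revisions could look like the following. The lists here are small stand-in values (hypothetical, not real parser output), and the `compare_links` helper is illustrative only and not part of the library; it assumes, as in the example above, that each revision's "links" entry is a list of strings.

```python
# Stand-in data: hypothetical values shaped like the "links" lists above.
first_links = ["pierre and marie curie"]
latest_links = ["polonium", "radium", "polonium", "university of paris"]


def compare_links(first, latest):
    """Return the links kept, removed, and added between two revisions."""
    first_set, latest_set = set(first), set(latest)
    return {
        "kept": sorted(first_set & latest_set),
        "removed": sorted(first_set - latest_set),
        "added": sorted(latest_set - first_set),
    }


diff = compare_links(first_links, latest_links)
print(diff)
```

With real data, you would pass `edits[-1][1]["links"]` and `edits[0][1]["links"]` instead of the stand-in lists.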
Okay, but which links are the most frequent on the page now?
>>> from collections import Counter
>>> links = Counter()
>>> for l in latest_links:
...     links[l] += 1
...
>>> print(links)
Counter({'polonium': 5, 'radium': 5, 'university of paris': 5, 'russian empire': 4, 'gabriel lippmann': 4, 'nobel prize in physics': 4, 'nobel prize in chemistry': 4, ... })
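As a side note, `Counter` can also build the tally directly from a list, and its `most_common(n)` method returns only the top `n` entries. The sketch below uses a short stand-in list (hypothetical values, not real parser output) in place of `latest_links`.

```python
from collections import Counter

# Stand-in data: hypothetical values shaped like the "links" list above.
latest_links = ["polonium", "radium", "polonium", "russian empire", "radium", "polonium"]

# Counter accepts any iterable and counts each distinct element.
link_counts = Counter(latest_links)

# most_common(2) returns the two most frequent links as (link, count) pairs.
top_two = link_counts.most_common(2)
print(top_two)
```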
With the parsed revision history, you could also get answers for questions like these:
Read the documentation for more inspiration and functionality, and see the FAQ or file a bug if you run into issues!
Read the docs at wikirevparser.readthedocs.io for more details and use case examples.
This work is MIT licensed. See the LICENSE file for full details.