
Batteries (tagger) not included.
This package is intended to cover the following use cases:
Text can be extracted from the XML files at different granularities (paragraph, utterance, speech, who, protocol). The text can be grouped (combined) into larger temporal blocks based on time (year, lustrum, decade or custom periods). Within each of these blocks the text can in turn be grouped by speaker attributes (who, party, gender).
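The two-level grouping described above (temporal block first, then speaker attributes) can be illustrated with a small standalone sketch. It does not use pyriksprot. The helper names temporal_block and group_texts, the segment dictionaries, and the assumption that lustrums and decades start on multiples of 5 and 10 are all hypothetical and only serve the illustration.

```python
from collections import defaultdict


def temporal_block(year: int, key: str = "year") -> str:
    """Return a label for the temporal block that `year` falls into."""
    if key == "year":
        return str(year)
    if key == "lustrum":
        start = year - year % 5  # assumed five-year blocks: 1950-1954, 1955-1959, ...
        return f"{start}-{start + 4}"
    if key == "decade":
        start = year - year % 10  # assumed ten-year blocks: 1950-1959, ...
        return f"{start}-{start + 9}"
    raise ValueError(f"unknown temporal key: {key!r}")


def group_texts(segments, temporal_key, group_keys):
    """Combine segment texts per (temporal block, speaker attributes...)."""
    groups = defaultdict(list)
    for seg in segments:
        label = (temporal_block(seg["year"], temporal_key), *(seg[k] for k in group_keys))
        groups[label].append(seg["text"])
    return {label: "\n".join(texts) for label, texts in groups.items()}


# Made-up segments, for illustration only
segments = [
    {"year": 1953, "party": "S", "gender": "man", "text": "Herr talman!"},
    {"year": 1957, "party": "S", "gender": "man", "text": "Jag yrkar bifall."},
    {"year": 1958, "party": "M", "gender": "woman", "text": "Fru talman!"},
]
blocks = group_texts(segments, "decade", ["party"])
for label, text in blocks.items():
    print(label, repr(text))
```

Grouping by several attributes at once only means passing more keys, for example ["party", "gender"].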
The text extraction can be done using the riksprot2text utility, a command-line tool installed with the package, or in Python code using the API that this package exposes. The Python API exposes both streaming (SAX-based) methods and a domain model API (i.e. Python classes representing protocols, speeches and utterances).
Both the CLI and the API support dehyphenation using the method described in Anföranden: Annotated and Augmented Parliamentary Debates from Sweden, Stian Rødven Eide, 2020. The API also supports user-defined text transformations.
Part-of-speech tagged versions of the protocols can be extracted with the same granularity and aggregation as described above for the raw text. The returned documents are tab-separated files with fields for text, baseform and pos-tag (UPOS, XPOS). Note that the actual part-of-speech tagging is done using tools found in the pyriksprot_tagging repository (link).
Currently there are no open-source tagged versions of the corpus available. The tagging is done using Stanza with Swedish language models produced and made publicly available by Språkbanken Text.
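As a rough illustration of consuming such a tab-separated document, the sketch below reads it with Python's standard csv module. The header names (text, baseform, pos, xpos) and the sample rows are assumptions made for this example. The actual column names and tag values in the tagged files may differ.

```python
import csv
import io

# Made-up sample in the assumed layout: one token per row, tab-separated
SAMPLE = (
    "text\tbaseform\tpos\txpos\n"
    "Herr\therr\tNOUN\tNN\n"
    "talman\ttalman\tNOUN\tNN\n"
    "!\t!\tPUNCT\tMAD\n"
)


def read_tagged(tsv_text: str) -> list[dict[str, str]]:
    """Parse a tagged document into one dict per token."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # quote characters are ordinary tokens here
    )
    return list(reader)


tokens = read_tagged(SAMPLE)
# Keep the baseform of every token that is not punctuation
lemmas = [t["baseform"] for t in tokens if t["pos"] != "PUNCT"]
print(lemmas)
```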
The extracted text can be stored as optionally compressed plain text files on disk, or in a ZIP-archive.
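The storage options can be pictured with a standard-library sketch. This is not how pyriksprot writes its output. The function store_documents, its mode names, and the document naming scheme are hypothetical and only show the difference between one ZIP archive and individually compressed files.

```python
import gzip
import zipfile
from pathlib import Path


def store_documents(documents: dict[str, str], target: str, mode: str = "zip") -> None:
    """Store named texts in a single ZIP archive, or as (gzipped) files in a folder."""
    if mode == "zip":
        # One archive, one member per document
        with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            for name, text in documents.items():
                zf.writestr(f"{name}.txt", text)
        return
    if mode not in ("plain", "gzip"):
        raise ValueError(f"unsupported mode: {mode!r}")
    # One file per document inside the target folder
    folder = Path(target)
    folder.mkdir(parents=True, exist_ok=True)
    for name, text in documents.items():
        if mode == "gzip":
            with gzip.open(folder / f"{name}.txt.gz", "wt", encoding="utf-8") as fp:
                fp.write(text)
        else:
            (folder / f"{name}.txt").write_text(text, encoding="utf-8")
```

A call such as store_documents({"1955_S": "Herr talman!"}, "output.zip") would then produce an archive with a single member named 1955_S.txt.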
cd some-folder
git clone --branch "tag" --depth 1 https://github.com/welfare-state-analytics/riksdagen-corpus.git
cd riksdagen-corpus
git config core.quotepath off
Create a new isolated virtual environment for pyriksprot:
mkdir /path/to/new/pyriksprot-folder
cd /path/to/new/pyriksprot-folder
python -m venv .venv
Activate the environment:
cd /path/to/new/pyriksprot-folder
source .venv/bin/activate
Install pyriksprot in the activated virtual environment:
pip install pyriksprot
λ riksprot2text --help
Usage: riksprot2text [OPTIONS] SOURCE_FOLDER TARGET

Options:
  -m, --mode [plain|zip|gzip|bz2|lzma]
                                  Target type
  -t, --temporal-key TEXT         Temporal partition key(s)
  -y, --years TEXT                Years to include in output
  -g, --group-key TEXT            Partition key(s)
  -p, --processes INTEGER RANGE   Number of processes to use
  -l, --segment-level [protocol|speech|utterance|paragraph|who]
                                  Protocol extract segment level
  -e, --keep-order                Keep output in filename order (slower, multiproc)
  -s, --skip-size INTEGER RANGE   Skip blocks of char length less than
  -d, --dedent                    Remove indentation
  -k, --dehyphen                  Dehyphen text
  --help                          Show this message and exit.
λ metadata2db --help
Usage: metadata2db.py [OPTIONS] COMMAND [ARGS]...

  CLI tool to manage riksprot metadata

Options:
  --help  Show this message and exit.

Commands:
  columns
  database
  download
  filenames
  index

λ metadata2db.py database --help
Usage: metadata2db.py database [OPTIONS] TARGET

Options:
  --tag TEXT             Metadata version
  --source-folder TEXT
  --force                Force overwrite
  --load-index           Load utterance index
  --scripts-folder TEXT  Apply scripts in specified folder to DB. If not
                         specified the scripts are loaded from SQL-module.
  --skip-scripts         Skip loading SQL scripts
  --help                 Show this message and exit.

λ metadata2db index --help
Usage: metadata2db.py index [OPTIONS] CORPUS_FOLDER TARGET_FOLDER

Options:
  --help  Show this message and exit.
Aggregate text per year grouped by speaker. Store result in a single zip. Skip documents less than 50 characters.
riksprot2text /path/to/corpus output.zip -m zip -t year -l protocol -g who --skip-size 50
Aggregate text per decade grouped by speaker. Store result in a single zip. Remove indentations and hyphenations.
riksprot2text /path/to/corpus output.zip -m zip -t decade -l who -g who --dedent --dehyphen
Aggregate text using customized temporal periods and grouped by party.
riksprot2text /path/to/corpus output.zip -m zip -t "1920-1938,1939-1945,1946-1989,1990-2020" -l who -g party
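To make the custom period specification concrete, here is a minimal sketch of how a comma-separated list of year ranges could be parsed and used to assign a year to a period. The helpers parse_periods and find_period are hypothetical and are not part of pyriksprot. How the tool itself handles gaps or overlapping ranges is not covered here.

```python
from typing import Optional


def parse_periods(spec: str) -> list[tuple[int, int]]:
    """Parse 'START-END,START-END,...' into a list of (start, end) year pairs."""
    periods = []
    for part in spec.split(","):
        start, end = (int(value) for value in part.strip().split("-"))
        if start > end:
            raise ValueError(f"period starts after it ends: {part!r}")
        periods.append((start, end))
    return periods


def find_period(year: int, periods: list[tuple[int, int]]) -> Optional[str]:
    """Return the label of the first period containing `year`, or None."""
    for start, end in periods:
        if start <= year <= end:
            return f"{start}-{end}"
    return None


periods = parse_periods("1920-1938,1939-1945,1946-1989,1990-2020")
print(find_period(1940, periods))
print(find_period(1900, periods))
```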
Aggregate text per document and group by gender and party.
riksprot2text /path/to/corpus output.zip -m zip -t protocol -l who -g party -g gender
Aggregate text per year grouped by gender and party and include only 1946-1989.
riksprot2text /path/to/corpus output.zip -m zip -t year -l who -g party -g gender -y 1946-1989
Aggregate text per protocol for the years 1955-1965, grouped by party and gender. Store the result in a single zip and remove indentation.
import pyriksprot
# The enums are assumed to be importable from pyriksprot.interface
from pyriksprot.interface import GroupingKey, SegmentLevel, TemporalKey

opts = {
    'source_folder': '/path/to/corpus',
    'target': 'output.zip',
    'target_type': 'files-in-zip',
    'segment_level': SegmentLevel.Who,
    'dedent': True,
    'dehyphen': False,
    'years': '1955-1965',
    'temporal_key': TemporalKey.Protocol,
    'group_keys': (GroupingKey.Party, GroupingKey.Gender),
}
pyriksprot.extract_corpus_text(**opts)
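Iterate over protocol and speaker:
from typing import Iterable
from pyriksprot import interface, iterators
# filenames must be defined first: a list of paths to the protocol XML files
# SegmentLevel is assumed to be exposed by pyriksprot.interface
items: Iterable[interface.ProtocolSegment] = iterators.XmlProtocolTextIterator(
    filenames=filenames, segment_level=interface.SegmentLevel.Who, segment_skip_size=0, processes=4
)
for item in items:
    print(item.who, len(item.text))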
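Iterate over protocol and speech, skip empty:
from typing import Iterable
from pyriksprot import interface, iterators
# filenames must be defined first: a list of paths to the protocol XML files
# segment_skip_size=1 skips segments with fewer than one character, i.e. empty ones
items: Iterable[interface.ProtocolSegment] = iterators.XmlProtocolTextIterator(
    filenames=filenames, segment_level=interface.SegmentLevel.Speech, segment_skip_size=1, processes=4
)
for item in items:
    print(item.who, len(item.text))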
Iterate over protocol and speech, apply preprocess function(s):
from typing import Callable, Iterable
import ftfy.fixes  # pip install ftfy
import unidecode  # pip install unidecode
import pyriksprot
from pyriksprot import interface, iterators
# filenames must be defined first: a list of paths to the protocol XML files
# fix_character_width lives in the ftfy.fixes module
fix_text: Callable[[str], str] = pyriksprot.compose(
    [str.lower, pyriksprot.dedent, ftfy.fixes.fix_character_width, unidecode.unidecode]
)
items: Iterable[interface.ProtocolSegment] = iterators.XmlProtocolTextIterator(
    filenames=filenames, segment_level=interface.SegmentLevel.Speech, segment_skip_size=1, processes=4, preprocessor=fix_text,
)
for item in items:
    print(item.who, len(item.text))
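The preprocessor passed above is an ordinary str-to-str function built by chaining smaller ones. The sketch below shows one plausible way such a compose helper could work, using only the standard library. It is not pyriksprot's implementation, and the left-to-right application order is an assumption.

```python
import textwrap
from functools import reduce
from typing import Callable, Sequence


def compose(fns: Sequence[Callable[[str], str]]) -> Callable[[str], str]:
    """Chain functions left to right: compose([f, g])(x) == g(f(x))."""
    return lambda text: reduce(lambda acc, fn: fn(acc), fns, text)


# Lowercase, drop common leading whitespace, then trim both ends
fix_text = compose([str.lower, textwrap.dedent, str.strip])
print(fix_text("   Herr TALMAN!  "))
```

Because each step only needs to accept and return a string, any user-defined transformation can be added to the list.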
Python API for Riksdagens Protokoll