
Security News
CVE Volume Surges Past 48,000 in 2025 as WordPress Plugin Ecosystem Drives Growth
CVE disclosures hit a record 48,185 in 2025, driven largely by vulnerabilities in third-party WordPress plugins.
kathairo
Advanced tools
Scripture processing pipeline for converting USFM and USX formats into structured TSV files
kathairo is a comprehensive text processing pipeline for Scripture that converts USFM and USX formats into structured TSV files. Built on SIL's machine.py, kathairo provides both verse-level and token-level outputs with enhanced support for psalm superscriptions and advanced tokenization.
pip install kathairo
sil-machine==1.0.2spacy>=3.7.5polars>=1.4.0Note: Kathairo works cross-platform (Windows, macOS, Linux). Use os.path.join() for portable paths, though forward slashes generally work on all platforms.
The simplest way to use kathairo:
from kathairo.tsvs.build_tsv_json_parser import process_corpus
config = {
"projectName": "MyBible",
"language": "eng",
"targetUsfmCorpusPath": "/path/to/usfm/",
"targetVersificationPath": "/path/to/eng.vrs",
"latinWhiteSpaceIncludedTokenizer": True
}
process_corpus(config)
# Output: output/eng/MyBible/verse/verse_MyBible.tsv
# output/eng/MyBible/token/token_MyBible.tsv
Here's a realistic workflow from installation to processing:
# Step 1: Install kathairo
# pip install kathairo
# Step 2: Import and setup
import os
from kathairo.tsvs.build_tsv_json_parser import process_corpus
import kathairo
# Step 3: Locate built-in versification file
package_dir = os.path.dirname(kathairo.__file__)
vrs_path = os.path.join(package_dir, "versification", "eng.vrs")
# Step 4: Configure your project
config = {
"projectName": "ESV",
"language": "eng",
"targetUsfmCorpusPath": "/home/user/bibles/ESV/usfm/",
"targetVersificationPath": vrs_path,
"latinWhiteSpaceIncludedTokenizer": True,
"treatApostropheAsSingleQuote": False,
"excludeBracketedText": False,
"excludeCrossReferences": True,
"psalmSuperscriptionTag": "d"
}
# Step 5: Process the corpus
print(f"Processing {config['projectName']}...")
result = process_corpus(config)
print("Done!")
# Output files created at:
# - output/eng/ESV/verse/verse_ESV.tsv
# - output/eng/ESV/token/token_ESV.tsv
Create a JSON configuration file (e.g., projects_config.json):
[
{
"projectName": "ESV",
"language": "eng",
"targetUsfmCorpusPath": "/scripture/ESV/usfm/",
"targetVersificationPath": "/versification/eng.vrs",
"latinWhiteSpaceIncludedTokenizer": true,
"treatApostropheAsSingleQuote": false,
"excludeBracketedText": false,
"excludeCrossReferences": true,
"stopWordsPath": "/config/english_stopwords.tsv",
"zwRemovalPath": "/config/zw_removal.tsv",
"psalmSuperscriptionTag": "d"
},
{
"projectName": "RVR1960",
"language": "spa",
"targetUsfmCorpusPath": "/scripture/RVR1960/usfm/",
"targetVersificationPath": "/versification/spa.vrs",
"latinWhiteSpaceIncludedTokenizer": true,
"excludeCrossReferences": true
}
]
Then process all projects:
from kathairo.tsvs.build_tsv_json_parser import main
main("projects_config.json")
This processes all projects in parallel using Python's ProcessPoolExecutor.
| Option | Type | Description |
|---|---|---|
projectName | string | Name of your project (used in output file naming) |
language | string | Language code (e.g., "eng", "spa") |
targetUsfmCorpusPath | string | Path to USFM files directory |
targetUsxCorpusPath | string | Path to USX files directory (alternative to USFM) |
targetVersificationPath | string | Path to versification file (.vrs) |
latinTokenizer | boolean | Use standard Latin word tokenizer |
latinWhiteSpaceIncludedTokenizer | boolean | Use Latin tokenizer with whitespace preservation |
chineseTokenizer | boolean | Use Chinese Bible word tokenizer |
treatApostropheAsSingleQuote | boolean | Handle apostrophes as single quotes |
excludeBracketedText | boolean | Exclude text within square brackets |
excludeCrossReferences | boolean | Exclude cross-reference text |
stopWordsPath | string | Path to TSV file containing stop words |
zwRemovalPath | string | Path to TSV file for zero-width character removal |
regexRulesPath | string | Path to custom regex rules module |
psalmSuperscriptionTag | string | USFM tag for psalm superscriptions (default: "d") |
tsvPath | string | Path to existing token TSV (for re-versification) |
solo | boolean | If present, only this project will be processed |
At minimum, you must provide:
projectName: Your project identifierlanguage: Language code (e.g., "eng", "spa")targetUsfmCorpusPath OR targetUsxCorpusPath OR tsvPathtargetVersificationPath: Path to .vrs filelatinTokenizer, latinWhiteSpaceIncludedTokenizer, or chineseTokenizerAll other fields are optional and have sensible defaults.
You must specify exactly one tokenizer in your configuration:
latinTokenizer - Standard Latin word tokenizer from SIL Machine
latinWhiteSpaceIncludedTokenizer - Enhanced Latin tokenizer with whitespace preservation
chineseTokenizer - Chinese Bible word tokenizer
Example:
# Pick ONE of these:
"latinTokenizer": True # Standard
"latinWhiteSpaceIncludedTokenizer": True # Recommended
"chineseTokenizer": True # Chinese only
Generated at: output/{language}/{projectName}/verse/verse_{projectName}.tsv
| Column | Description |
|---|---|
id | Verse identifier (BBCCCVVV format) |
source_verse | Source versification verse ID |
text | Complete verse text |
id_range_end | End verse for verse ranges |
source_verse_range_end | End verse in source versification |
Generated at: output/{language}/{projectName}/token/token_{projectName}.tsv
| Column | Description |
|---|---|
id | Token identifier (BBCCCVVVWWW format) |
source_verse | Source versification verse ID |
text | Token text |
skip_space_after | "y" if no space should follow this token, empty otherwise |
exclude | "y" if token should be excluded from analysis (footnotes, cross-references based on excludeCrossReferences), empty otherwise |
id_range_end | End verse for verse ranges |
source_verse_range_end | End verse in source versification |
required | "y" if token contains non-punctuation, "n" otherwise |
Here's what the actual output looks like for Genesis 1:1-2:
Verse-Level Sample:
id source_verse text
01001001 01001001 In the beginning God created the heavens and the earth.
01001002 01001002 Now the earth was formless and void, and darkness was over the surface of the deep. And the Spirit of God was hovering over the surface of the waters.
Token-Level Sample:
id source_verse text skip_space_after exclude required
01001001001 01001001 In y
01001001002 01001001 the y
01001001003 01001001 beginning y
01001001004 01001001 God y
01001001005 01001001 created y
01001001006 01001001 the y
01001001007 01001001 heavens y
01001001008 01001001 and y
01001001009 01001001 the y
01001001010 01001001 earth y y
01001001011 01001001 . y n
01001002001 01001002 Now y
01001002002 01001002 the y
01001002003 01001002 earth y
...
Notice:
skip_space_after is "y" for "earth" before the periodrequired is "n" for punctuation like "."from kathairo.tsvs.build_tsv_json_parser import process_corpus
config = {
"projectName": "ESV",
"language": "eng",
"targetUsfmCorpusPath": "/data/scripture/ESV/usfm/",
"targetVersificationPath": "/data/versification/eng.vrs",
"latinWhiteSpaceIncludedTokenizer": True,
"excludeCrossReferences": True,
"psalmSuperscriptionTag": "d"
}
process_corpus(config)
config = {
"projectName": "NIV",
"language": "eng",
"targetUsxCorpusPath": "/data/scripture/NIV/usx/", # Use USX instead of USFM
"targetVersificationPath": "/data/versification/eng.vrs",
"latinTokenizer": True # Standard Latin tokenizer
}
process_corpus(config)
config = {
"projectName": "CUV",
"language": "zho",
"targetUsfmCorpusPath": "/data/scripture/CUV/usfm/",
"targetVersificationPath": "/data/versification/chinese.vrs",
"chineseTokenizer": True, # Use Chinese-specific tokenizer
"excludeBracketedText": True
}
process_corpus(config)
If you already have a token-level TSV and want to re-versify it:
config = {
"projectName": "ESV_Reversified",
"language": "eng",
"tsvPath": "/data/output/eng/ESV/token/token_ESV.tsv", # Use existing TSV
"targetVersificationPath": "/data/versification/new_versification.vrs",
"latinWhiteSpaceIncludedTokenizer": True,
"stopWordsPath": "/data/config/stopwords.tsv",
"zwRemovalPath": "/data/config/zw_removal.tsv"
}
process_corpus(config)
Create stop words file (stopwords.tsv):
stop_words
\u200b
\u200c
\u200d
Create zero-width removal file (zw_removal.tsv):
words
word\u200bwith\u200bzwsp
another\u200cword
Then use them:
config = {
"projectName": "MyBible",
"language": "hin",
"targetUsfmCorpusPath": "/data/hindi/usfm/",
"targetVersificationPath": "/data/versification/hindi.vrs",
"latinWhiteSpaceIncludedTokenizer": True,
"stopWordsPath": "/data/config/stopwords.tsv",
"zwRemovalPath": "/data/config/zw_removal.tsv"
}
process_corpus(config)
Create a custom Python module (custom_regex.py):
import re
class CustomRegexRules:
# Define custom punctuation patterns for word-level punctuation
WORD_LEVEL_PUNCT_REGEX = re.compile(
r'(?<=[a-zA-Z])[\''](?=[a-zA-Z])|' # Apostrophes within words
r'(?<=[0-9])[,.](?=[0-9])' # Commas/periods in numbers
)
Reference it in your config:
config = {
"projectName": "MyBible",
"language": "eng",
"targetUsfmCorpusPath": "/data/usfm/",
"targetVersificationPath": "/data/versification/eng.vrs",
"latinWhiteSpaceIncludedTokenizer": True,
"regexRulesPath": "/path/to/custom_regex.py" # Use custom rules
}
process_corpus(config)
When working with multiple projects in a JSON config, add "solo": true to process only one:
[
{
"projectName": "TestProject",
"language": "eng",
"targetUsfmCorpusPath": "/test/usfm/",
"targetVersificationPath": "/test/eng.vrs",
"latinTokenizer": true,
"solo": true
},
{
"projectName": "OtherProject",
"language": "spa",
...
}
]
Only TestProject will be processed.
Scripture Files - Either USFM or USX format:
.SFM01-GEN.usfm, 01GENBSB.SFM, GEN.usfm)Versification File (.vrs) - Defines verse structure:
# Versification "English"
GEN 1:31 2:25 3:24 4:26 5:32 ...
EXO 1:22 2:25 3:22 ...
eng.vrs at src/kathairo/versification/eng.vrsBOOK chapter:lastVerse chapter:lastVerse ...Optional: Stop Words TSV:
stop_words
\u200b
\u200c
\u200d
Optional: Zero-Width Removal TSV:
words
problematic\u200bword
another\u200cexample
kathairo automatically creates this directory structure in your current working directory:
output/
└── {language}/
└── {projectName}/
├── verse/
│ └── verse_{projectName}.tsv
└── token/
└── token_{projectName}.tsv
Note: The output/ directory is created relative to where you run the script, not relative to the input files.
kathairo handles complex versification mappings between different Bible versification systems.
import os
from kathairo.tsvs.build_tsv_json_parser import process_corpus
# Get the package's built-in versification file
import kathairo
package_dir = os.path.dirname(kathairo.__file__)
vrs_path = os.path.join(package_dir, "versification", "eng.vrs")
config = {
"projectName": "MyBible",
"language": "eng",
"targetUsfmCorpusPath": "/data/usfm/",
"targetVersificationPath": vrs_path, # Use built-in versification
"latinWhiteSpaceIncludedTokenizer": True
}
process_corpus(config)
If your versification file is in the same directory as your USFM files:
import os
from kathairo.tsvs.build_tsv_json_parser import process_corpus
# Get the current script directory
current_dir = os.path.dirname(os.path.abspath(__file__))
config = {
"projectName": "BSB",
"language": "eng",
"targetUsfmCorpusPath": os.path.join(current_dir, "bsb_usfm"),
"targetVersificationPath": os.path.join(current_dir, "bsb_usfm", "versification.vrs"),
"latinWhiteSpaceIncludedTokenizer": True,
"excludeCrossReferences": True
}
process_corpus(config)
from kathairo.versification.fix_versification import save_versification
from machine.scripture import Versification, VerseRef
# Load and modify a versification
versification = Versification.load("path/to/base_versification.vrs")
# Make modifications (example: adjust Genesis 1:1 mapping)
versification.mappings.add_mapping(
VerseRef("GEN", 1, 1, versification),
VerseRef("GEN", 1, 2, versification)
)
# Save the modified versification
save_versification(versification)
# Creates: src/kathairo/versification/output.vrs
If you need to apply a different versification to already-processed scripture:
config = {
"projectName": "ESV_NewVersification",
"language": "eng",
"tsvPath": "/output/eng/ESV/token/token_ESV.tsv", # Existing token TSV
"targetVersificationPath": "/path/to/new_versification.vrs", # New versification
"latinWhiteSpaceIncludedTokenizer": True
}
process_corpus(config)
If you get ModuleNotFoundError: No module named 'kathairo':
pip install kathairo
# or
pip install kathairo --upgrade
If you see errors about missing versification files:
# Use the built-in English versification
import os
import kathairo
package_dir = os.path.dirname(kathairo.__file__)
vrs_path = os.path.join(package_dir, "versification", "eng.vrs")
config["targetVersificationPath"] = vrs_path
Error: TypeError or no tokenizer found
Solution: Add exactly one tokenizer to your config:
config["latinWhiteSpaceIncludedTokenizer"] = True
If your TSV files are empty:
By default, process_corpus() runs quietly with minimal output. To monitor progress:
print(f"Processing {config['projectName']}...")
result = process_corpus(config)
print("Done!")
print(f"Output files created:")
print(f" - output/{config['language']}/{config['projectName']}/verse/verse_{config['projectName']}.tsv")
print(f" - output/{config['language']}/{config['projectName']}/token/token_{config['projectName']}.tsv")
For large Bibles (66 books), processing typically takes 5-30 seconds depending on system performance.
For large projects:
from kathairo.tsvs.build_tsv_json_parser import main
main("projects.json") # Processes all projects in parallel
process_corpus(config: dict) -> dictProcess a single scripture project and generate TSV files.
Parameters:
config (dict): Configuration dictionary with project settingsReturns:
Note: Currently returns minimal information; primarily useful for error checking
Example:
from kathairo.tsvs.build_tsv_json_parser import process_corpus
result = process_corpus({
"projectName": "MyBible",
"language": "eng",
"targetUsfmCorpusPath": "/path/to/usfm/",
"targetVersificationPath": "/path/to/eng.vrs",
"latinWhiteSpaceIncludedTokenizer": True
})
main(json_path: str) -> NoneProcess multiple projects in parallel from a JSON configuration file.
Parameters:
json_path (str): Path to JSON configuration fileExample:
from kathairo.tsvs.build_tsv_json_parser import main
main("projects_config.json")
save_versification(versification: Versification) -> NoneSave a versification object to a .vrs file.
Parameters:
versification (Versification): SIL Machine Versification objectOutput:
src/kathairo/versification/output.vrsExample:
from kathairo.versification.fix_versification import save_versification
from machine.scripture import Versification
vrs = Versification.load("input.vrs")
save_versification(vrs)
For advanced usage, you can access lower-level functions:
corpus_to_tsv(targetVersification, sourceVersification, corpus, tokenizer, ...)Convert a corpus directly to TSV files.
tokens_to_tsv(targetVersification, sourceVersification, tsvPath, ...)Re-versify an existing token TSV file.
kathairo extends sil-machine with:
kathairo/
├── parsing/ # USFM and USX parsers
│ ├── usfm/ # USFM parsing with modified handlers
│ └── usx/ # USX parsing utilities
├── tokenization/ # Tokenizer implementations
├── tsvs/ # TSV building and processing
├── versification/ # Versification utilities
└── helpers/ # Utility functions
Contributions are welcome! This project processes Scripture text and aims to provide reliable, accurate text processing for Bible translation and analysis workflows.
Robertson Brinker - robertson.brinker@biblica.com
Built on SIL's machine.py - a machine learning and natural language processing library for Scripture.
poetry config repositories.pypi https://upload.pypi.org/legacy/$env:PYPI_USERNAME="__token__"$env:PYPI_PASSWORD="<api-token>"poetry publish --build --username $env:PYPI_USERNAME --password $env:PYPI_PASSWORDFAQs
Scripture processing pipeline for converting USFM and USX formats into structured TSV files
We found that kathairo demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Security News
CVE disclosures hit a record 48,185 in 2025, driven largely by vulnerabilities in third-party WordPress plugins.

Security News
Socket CEO Feross Aboukhadijeh joins Insecure Agents to discuss CVE remediation and why supply chain attacks require a different security approach.

Security News
Tailwind Labs laid off 75% of its engineering team after revenue dropped 80%, as LLMs redirect traffic away from documentation where developers discover paid products.