kathairo

Scripture processing pipeline for converting USFM and USX formats into structured TSV files

Install with pip from PyPI
Version: 0.1.0
Maintainers: 1

Scripture Processing Pipeline: Parse, Tokenize, and Versify

kathairo is a comprehensive text processing pipeline for Scripture that converts USFM and USX formats into structured TSV files. Built on SIL's machine.py, kathairo provides both verse-level and token-level outputs with enhanced support for psalm superscriptions and advanced tokenization.

Installation

pip install kathairo

Requirements

  • Python 3.10 or 3.11
  • Dependencies (automatically installed):
    • sil-machine==1.0.2
    • spacy>=3.7.5
    • polars>=1.4.0

Note: Kathairo works cross-platform (Windows, macOS, Linux). Use os.path.join() for portable paths, though forward slashes generally work on all platforms.

Quick Start

Minimal Example

The simplest way to use kathairo:

from kathairo.tsvs.build_tsv_json_parser import process_corpus

config = {
    "projectName": "MyBible",
    "language": "eng",
    "targetUsfmCorpusPath": "/path/to/usfm/",
    "targetVersificationPath": "/path/to/eng.vrs",
    "latinWhiteSpaceIncludedTokenizer": True
}

process_corpus(config)
# Output: output/eng/MyBible/verse/verse_MyBible.tsv
#         output/eng/MyBible/token/token_MyBible.tsv

Complete Workflow Example

Here's a realistic workflow from installation to processing:

# Step 1: Install kathairo
# pip install kathairo

# Step 2: Import and setup
import os
from kathairo.tsvs.build_tsv_json_parser import process_corpus
import kathairo

# Step 3: Locate built-in versification file
package_dir = os.path.dirname(kathairo.__file__)
vrs_path = os.path.join(package_dir, "versification", "eng.vrs")

# Step 4: Configure your project
config = {
    "projectName": "ESV",
    "language": "eng",
    "targetUsfmCorpusPath": "/home/user/bibles/ESV/usfm/",
    "targetVersificationPath": vrs_path,
    "latinWhiteSpaceIncludedTokenizer": True,
    "treatApostropheAsSingleQuote": False,
    "excludeBracketedText": False,
    "excludeCrossReferences": True,
    "psalmSuperscriptionTag": "d"
}

# Step 5: Process the corpus
print(f"Processing {config['projectName']}...")
result = process_corpus(config)
print("Done!")

# Output files created at:
# - output/eng/ESV/verse/verse_ESV.tsv
# - output/eng/ESV/token/token_ESV.tsv

Processing Multiple Projects in Parallel

Create a JSON configuration file (e.g., projects_config.json):

[
  {
    "projectName": "ESV",
    "language": "eng",
    "targetUsfmCorpusPath": "/scripture/ESV/usfm/",
    "targetVersificationPath": "/versification/eng.vrs",
    "latinWhiteSpaceIncludedTokenizer": true,
    "treatApostropheAsSingleQuote": false,
    "excludeBracketedText": false,
    "excludeCrossReferences": true,
    "stopWordsPath": "/config/english_stopwords.tsv",
    "zwRemovalPath": "/config/zw_removal.tsv",
    "psalmSuperscriptionTag": "d"
  },
  {
    "projectName": "RVR1960",
    "language": "spa",
    "targetUsfmCorpusPath": "/scripture/RVR1960/usfm/",
    "targetVersificationPath": "/versification/spa.vrs",
    "latinWhiteSpaceIncludedTokenizer": true,
    "excludeCrossReferences": true
  }
]

Then process all projects:

from kathairo.tsvs.build_tsv_json_parser import main

main("projects_config.json")

This processes all projects in parallel using ProcessPoolExecutor from Python's concurrent.futures module; each project runs in its own process.
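If you prefer to keep project definitions in Python, the JSON file can be generated rather than written by hand. The following is a minimal sketch: write_projects_config is a hypothetical helper (not part of kathairo), and the paths are the placeholder paths from the example above. Note that Python's True/False are serialized as JSON true/false.

```python
import json
from pathlib import Path


def write_projects_config(projects: list[dict], path: str) -> None:
    """Write a list of project config dicts as a JSON array (hypothetical helper)."""
    text = json.dumps(projects, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")


# Options shared by every project in this illustration.
shared = {"latinWhiteSpaceIncludedTokenizer": True, "excludeCrossReferences": True}

projects = [
    {
        "projectName": "ESV",
        "language": "eng",
        "targetUsfmCorpusPath": "/scripture/ESV/usfm/",
        "targetVersificationPath": "/versification/eng.vrs",
        **shared,
    },
    {
        "projectName": "RVR1960",
        "language": "spa",
        "targetUsfmCorpusPath": "/scripture/RVR1960/usfm/",
        "targetVersificationPath": "/versification/spa.vrs",
        **shared,
    },
]
```

Calling write_projects_config(projects, "projects_config.json") then produces a file that can be passed to main().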

Configuration Options

Option                            Type     Description
projectName                       string   Name of your project (used in output file naming)
language                          string   Language code (e.g., "eng", "spa")
targetUsfmCorpusPath              string   Path to USFM files directory
targetUsxCorpusPath               string   Path to USX files directory (alternative to USFM)
targetVersificationPath           string   Path to versification file (.vrs)
latinTokenizer                    boolean  Use standard Latin word tokenizer
latinWhiteSpaceIncludedTokenizer  boolean  Use Latin tokenizer with whitespace preservation
chineseTokenizer                  boolean  Use Chinese Bible word tokenizer
treatApostropheAsSingleQuote      boolean  Handle apostrophes as single quotes
excludeBracketedText              boolean  Exclude text within square brackets
excludeCrossReferences            boolean  Exclude cross-reference text
stopWordsPath                     string   Path to TSV file containing stop words
zwRemovalPath                     string   Path to TSV file for zero-width character removal
regexRulesPath                    string   Path to custom regex rules module
psalmSuperscriptionTag            string   USFM tag for psalm superscriptions (default: "d")
tsvPath                           string   Path to existing token TSV (for re-versification)
solo                              boolean  If present, only this project will be processed

Required Configuration Fields

At minimum, you must provide:

  • projectName: Your project identifier
  • language: Language code (e.g., "eng", "spa")
  • ONE input source: targetUsfmCorpusPath OR targetUsxCorpusPath OR tsvPath
  • targetVersificationPath: Path to .vrs file
  • ONE tokenizer: latinTokenizer, latinWhiteSpaceIncludedTokenizer, or chineseTokenizer

All other fields are optional and have sensible defaults.
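These rules can be checked before calling process_corpus. The sketch below is a hypothetical helper, not part of kathairo; it reads "ONE input source" and "ONE tokenizer" as "exactly one", which is an assumption about how strictly kathairo itself enforces them.

```python
INPUT_KEYS = ("targetUsfmCorpusPath", "targetUsxCorpusPath", "tsvPath")
TOKENIZER_KEYS = (
    "latinTokenizer",
    "latinWhiteSpaceIncludedTokenizer",
    "chineseTokenizer",
)
REQUIRED_KEYS = ("projectName", "language", "targetVersificationPath")


def validate_config(config: dict) -> list[str]:
    """Return a list of problems found in a config dict (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if not config.get(key):
            problems.append(f"missing required field: {key}")
    # Count how many input sources and tokenizers are set to a truthy value.
    inputs = [key for key in INPUT_KEYS if config.get(key)]
    if len(inputs) != 1:
        problems.append(f"expected exactly one input source, found {len(inputs)}")
    tokenizers = [key for key in TOKENIZER_KEYS if config.get(key)]
    if len(tokenizers) != 1:
        problems.append(f"expected exactly one tokenizer, found {len(tokenizers)}")
    return problems
```

An empty list means the config satisfies the minimum requirements listed above; it says nothing about whether the paths actually exist.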

Choosing a Tokenizer

You must specify exactly one tokenizer in your configuration:

  • latinTokenizer - Standard Latin word tokenizer from SIL Machine

    • Use for: Most Latin-script languages (English, Spanish, French, etc.)
    • Behavior: Standard word boundary detection
  • latinWhiteSpaceIncludedTokenizer - Enhanced Latin tokenizer with whitespace preservation

    • Use for: Latin-script languages where you need precise whitespace tracking
    • Behavior: Preserves whitespace information for accurate reconstruction
    • Recommended for most projects
  • chineseTokenizer - Chinese Bible word tokenizer

    • Use for: Chinese scripture text
    • Behavior: Specialized Chinese word segmentation

Example:

# Pick ONE of these:
"latinTokenizer": True                      # Standard
"latinWhiteSpaceIncludedTokenizer": True    # Recommended
"chineseTokenizer": True                    # Chinese only

Output Format

Verse-Level TSV

Generated at: output/{language}/{projectName}/verse/verse_{projectName}.tsv

Column                  Description
id                      Verse identifier (BBCCCVVV format)
source_verse            Source versification verse ID
text                    Complete verse text
id_range_end            End verse for verse ranges
source_verse_range_end  End verse in source versification

Token-Level TSV

Generated at: output/{language}/{projectName}/token/token_{projectName}.tsv

Column                  Description
id                      Token identifier (BBCCCVVVWWW format)
source_verse            Source versification verse ID
text                    Token text
skip_space_after        "y" if no space should follow this token, empty otherwise
exclude                 "y" if token should be excluded from analysis (footnotes, cross-references based on excludeCrossReferences), empty otherwise
id_range_end            End verse for verse ranges
source_verse_range_end  End verse in source versification
required                "y" if token contains non-punctuation, "n" otherwise
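The skip_space_after column is what makes it possible to rebuild running text from tokens. The following is a minimal sketch using only the standard library; rebuild_verses is a hypothetical helper, and it assumes the token TSV is tab-separated with a header row using the column names above. The sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def rebuild_verses(tsv_file) -> dict[str, str]:
    """Join token texts per verse, honoring skip_space_after (hypothetical helper)."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    parts = defaultdict(list)
    for row in reader:
        verse_id = row["id"][:8]  # BBCCCVVV prefix of the BBCCCVVVWWW token id
        parts[verse_id].append(row["text"])
        if row.get("skip_space_after") != "y":
            parts[verse_id].append(" ")
    # Drop the trailing space left after the last token of each verse.
    return {verse_id: "".join(chunks).rstrip() for verse_id, chunks in parts.items()}


# Illustrative rows only; real files come from output/{language}/{projectName}/token/.
sample = (
    "id\tsource_verse\ttext\tskip_space_after\texclude\trequired\n"
    "01001001009\t01001001\tthe\t\t\ty\n"
    "01001001010\t01001001\tearth\ty\t\ty\n"
    "01001001011\t01001001\t.\t\t\tn\n"
)
verses = rebuild_verses(io.StringIO(sample))
print(verses)  # {'01001001': 'the earth.'}
```

To read a real file, open it with open(path, encoding="utf-8", newline="") and pass the file object to rebuild_verses. This sketch keeps every token; filter on the exclude column first if excluded tokens should be dropped.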

Sample Output

Here's what the output looks like for Genesis 1:1-2 (the id_range_end and source_verse_range_end columns are not shown):

Verse-Level Sample:

id          source_verse  text
01001001    01001001      In the beginning God created the heavens and the earth.
01001002    01001002      Now the earth was formless and void, and darkness was over the surface of the deep. And the Spirit of God was hovering over the surface of the waters.

Token-Level Sample:

id           source_verse  text       skip_space_after  exclude  required
01001001001  01001001      In                                    y
01001001002  01001001      the                                   y
01001001003  01001001      beginning                             y
01001001004  01001001      God                                   y
01001001005  01001001      created                               y
01001001006  01001001      the                                   y
01001001007  01001001      heavens                               y
01001001008  01001001      and                                   y
01001001009  01001001      the                                   y
01001001010  01001001      earth      y                          y
01001001011  01001001      .                            y        n
01001002001  01001002      Now                                   y
01001002002  01001002      the                                   y
01001002003  01001002      earth                                 y
...

Notice:

  • Verse IDs use BBCCCVVV format (01001001 = Genesis 1:1)
  • Token IDs add WWW for word position (01001001010 = Genesis 1:1, word 10)
  • skip_space_after is "y" for "earth" before the period
  • required is "n" for punctuation like "."
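The fixed-width ID layout described above can be split into its numeric parts with plain string slicing. This is a minimal sketch; parse_id and ScriptureId are hypothetical names, not part of kathairo.

```python
from typing import NamedTuple, Optional


class ScriptureId(NamedTuple):
    book: int
    chapter: int
    verse: int
    word: Optional[int]  # None for verse-level (BBCCCVVV) ids


def parse_id(identifier: str) -> ScriptureId:
    """Split a BBCCCVVV or BBCCCVVVWWW identifier into integers."""
    if len(identifier) not in (8, 11) or not (identifier.isascii() and identifier.isdigit()):
        raise ValueError(f"not a BBCCCVVV or BBCCCVVVWWW id: {identifier!r}")
    book = int(identifier[0:2])
    chapter = int(identifier[2:5])
    verse = int(identifier[5:8])
    word = int(identifier[8:11]) if len(identifier) == 11 else None
    return ScriptureId(book, chapter, verse, word)


print(parse_id("01001001010"))  # ScriptureId(book=1, chapter=1, verse=1, word=10)
```

Keep the IDs as strings when reading the TSV files; converting the whole ID to an integer would drop the leading zero of the book number.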

Usage Examples

Example 1: Processing USFM Files

from kathairo.tsvs.build_tsv_json_parser import process_corpus

config = {
    "projectName": "ESV",
    "language": "eng",
    "targetUsfmCorpusPath": "/data/scripture/ESV/usfm/",
    "targetVersificationPath": "/data/versification/eng.vrs",
    "latinWhiteSpaceIncludedTokenizer": True,
    "excludeCrossReferences": True,
    "psalmSuperscriptionTag": "d"
}

process_corpus(config)

Example 2: Processing USX Files

config = {
    "projectName": "NIV",
    "language": "eng",
    "targetUsxCorpusPath": "/data/scripture/NIV/usx/",  # Use USX instead of USFM
    "targetVersificationPath": "/data/versification/eng.vrs",
    "latinTokenizer": True  # Standard Latin tokenizer
}

process_corpus(config)

Example 3: Chinese Scripture with the Chinese Tokenizer

config = {
    "projectName": "CUV",
    "language": "zho",
    "targetUsfmCorpusPath": "/data/scripture/CUV/usfm/",
    "targetVersificationPath": "/data/versification/chinese.vrs",
    "chineseTokenizer": True,  # Use Chinese-specific tokenizer
    "excludeBracketedText": True
}

process_corpus(config)

Example 4: Re-versification from Existing Token TSV

If you already have a token-level TSV and want to re-versify it:

config = {
    "projectName": "ESV_Reversified",
    "language": "eng",
    "tsvPath": "/data/output/eng/ESV/token/token_ESV.tsv",  # Use existing TSV
    "targetVersificationPath": "/data/versification/new_versification.vrs",
    "latinWhiteSpaceIncludedTokenizer": True,
    "stopWordsPath": "/data/config/stopwords.tsv",
    "zwRemovalPath": "/data/config/zw_removal.tsv"
}

process_corpus(config)

Example 5: With Stop Words and Zero-Width Character Removal

Create stop words file (stopwords.tsv):

stop_words
\u200b
\u200c
\u200d

Create zero-width removal file (zw_removal.tsv):

words
word\u200bwith\u200bzwsp
another\u200cword

Then use them:

config = {
    "projectName": "MyBible",
    "language": "hin",
    "targetUsfmCorpusPath": "/data/hindi/usfm/",
    "targetVersificationPath": "/data/versification/hindi.vrs",
    "latinWhiteSpaceIncludedTokenizer": True,
    "stopWordsPath": "/data/config/stopwords.tsv",
    "zwRemovalPath": "/data/config/zw_removal.tsv"
}

process_corpus(config)

Advanced Usage

Custom Regex Rules

Create a custom Python module (custom_regex.py):

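import re

class CustomRegexRules:
    # Define custom punctuation patterns for word-level punctuation.
    # Double-quoted raw strings are used so that the apostrophe inside the
    # character class does not terminate the string literal.
    WORD_LEVEL_PUNCT_REGEX = re.compile(
        r"(?<=[a-zA-Z])['\u2019](?=[a-zA-Z])|"  # Apostrophes within words (straight, or curly U+2019)
        r"(?<=[0-9])[,.](?=[0-9])"              # Commas/periods in numbers
    )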

Reference it in your config:

config = {
    "projectName": "MyBible",
    "language": "eng",
    "targetUsfmCorpusPath": "/data/usfm/",
    "targetVersificationPath": "/data/versification/eng.vrs",
    "latinWhiteSpaceIncludedTokenizer": True,
    "regexRulesPath": "/path/to/custom_regex.py"  # Use custom rules
}

process_corpus(config)

Solo Mode for Testing

When working with multiple projects in a JSON config, add "solo": true to process only one:

[
  {
    "projectName": "TestProject",
    "language": "eng",
    "targetUsfmCorpusPath": "/test/usfm/",
    "targetVersificationPath": "/test/eng.vrs",
    "latinTokenizer": true,
    "solo": true
  },
  {
    "projectName": "OtherProject",
    "language": "spa",
    ...
  }
]

Only TestProject will be processed.
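One way to picture the solo behavior is as a filter over the project list. The sketch below is an illustration of the rule as documented ("if present, only this project will be processed"), not kathairo's actual implementation; select_projects is a hypothetical name, and treating mere presence of the key as the trigger is an assumption.

```python
def select_projects(projects: list[dict]) -> list[dict]:
    """Return only the projects marked solo, or all projects if none are."""
    solo = [project for project in projects if "solo" in project]
    return solo if solo else projects


projects = [
    {"projectName": "TestProject", "solo": True},
    {"projectName": "OtherProject"},
]
print([p["projectName"] for p in select_projects(projects)])  # ['TestProject']
```

Remove the "solo" entry again once testing is done so that the full list is processed.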

Required Files and Structure

Input Files

  • Scripture Files - Either USFM or USX format:

    • USFM: Plain text files with USFM markup
      • Extensions: .SFM
      • Naming: Flexible (e.g., 01-GEN.usfm, 01GENBSB.SFM, GEN.usfm)
      • Directory should contain all books you want to process
    • USX: XML files following USX standard
      • Directory should contain all books you want to process
  • Versification File (.vrs) - Defines verse structure:

    # Versification  "English"
    GEN 1:31 2:25 3:24 4:26 5:32 ...
    EXO 1:22 2:25 3:22 ...
    
    • kathairo includes eng.vrs at src/kathairo/versification/eng.vrs
    • Format: BOOK chapter:lastVerse chapter:lastVerse ...
  • Optional: Stop Words TSV:

    stop_words
    \u200b
    \u200c
    \u200d
    
  • Optional: Zero-Width Removal TSV:

    words
    problematic\u200bword
    another\u200cexample
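
The chapter-length lines of a .vrs file (BOOK chapter:lastVerse chapter:lastVerse ...) are simple enough to inspect by hand. The following is a minimal sketch for checking a file's contents; parse_vrs_line is a hypothetical helper and handles only lines of that one form, not comments or verse mappings. For real work, load the file with the Versification class from SIL Machine instead.

```python
def parse_vrs_line(line: str) -> tuple[str, dict[int, int]]:
    """Parse 'BOOK chapter:lastVerse ...' into (book, {chapter: last_verse})."""
    book, *pairs = line.split()
    last_verses = {}
    for pair in pairs:
        chapter, last_verse = pair.split(":")
        last_verses[int(chapter)] = int(last_verse)
    return book, last_verses


book, chapters = parse_vrs_line("GEN 1:31 2:25 3:24")
print(book, chapters)  # GEN {1: 31, 2: 25, 3: 24}
```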
    

Output Structure

kathairo automatically creates this directory structure in your current working directory:

output/
└── {language}/
    └── {projectName}/
        ├── verse/
        │   └── verse_{projectName}.tsv
        └── token/
            └── token_{projectName}.tsv

Note: The output/ directory is created relative to where you run the script, not relative to the input files.
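Because the layout is fixed, the expected output paths can be computed from the config. This is a minimal sketch following the directory structure shown above; expected_outputs is a hypothetical helper, and the base argument stands in for the working directory the script is run from.

```python
from pathlib import Path


def expected_outputs(config: dict, base: Path = Path(".")) -> dict[str, Path]:
    """Return the verse and token TSV paths for a project config."""
    project = config["projectName"]
    root = base / "output" / config["language"] / project
    return {
        "verse": root / "verse" / f"verse_{project}.tsv",
        "token": root / "token" / f"token_{project}.tsv",
    }


paths = expected_outputs({"projectName": "ESV", "language": "eng"})
print(paths["token"].as_posix())  # output/eng/ESV/token/token_ESV.tsv
```

Checking paths["verse"].exists() after a run is a quick way to confirm that the script was started from the directory you expected.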

Versification Support

kathairo handles complex versification mappings between different Bible versification systems.

Using Built-in English Versification

import os
from kathairo.tsvs.build_tsv_json_parser import process_corpus

# Get the package's built-in versification file
import kathairo
package_dir = os.path.dirname(kathairo.__file__)
vrs_path = os.path.join(package_dir, "versification", "eng.vrs")

config = {
    "projectName": "MyBible",
    "language": "eng",
    "targetUsfmCorpusPath": "/data/usfm/",
    "targetVersificationPath": vrs_path,  # Use built-in versification
    "latinWhiteSpaceIncludedTokenizer": True
}

process_corpus(config)

Using a Local Versification File

If your versification file is in the same directory as your USFM files:

import os
from kathairo.tsvs.build_tsv_json_parser import process_corpus

# Get the current script directory
current_dir = os.path.dirname(os.path.abspath(__file__))

config = {
    "projectName": "BSB",
    "language": "eng",
    "targetUsfmCorpusPath": os.path.join(current_dir, "bsb_usfm"),
    "targetVersificationPath": os.path.join(current_dir, "bsb_usfm", "versification.vrs"),
    "latinWhiteSpaceIncludedTokenizer": True,
    "excludeCrossReferences": True
}

process_corpus(config)

Creating Custom Versification

from kathairo.versification.fix_versification import save_versification
from machine.scripture import Versification, VerseRef

# Load and modify a versification
versification = Versification.load("path/to/base_versification.vrs")

# Make modifications (example: adjust Genesis 1:1 mapping)
versification.mappings.add_mapping(
    VerseRef("GEN", 1, 1, versification),
    VerseRef("GEN", 1, 2, versification)
)

# Save the modified versification
save_versification(versification)
# Creates: src/kathairo/versification/output.vrs

Re-versifying Existing TSV Output

If you need to apply a different versification to already-processed scripture:

config = {
    "projectName": "ESV_NewVersification",
    "language": "eng",
    "tsvPath": "/output/eng/ESV/token/token_ESV.tsv",  # Existing token TSV
    "targetVersificationPath": "/path/to/new_versification.vrs",  # New versification
    "latinWhiteSpaceIncludedTokenizer": True
}

process_corpus(config)

Common Issues and Troubleshooting

Import Errors

If you get ModuleNotFoundError: No module named 'kathairo':

pip install kathairo
# or
pip install kathairo --upgrade

Versification File Not Found

If you see errors about missing versification files:

# Use the built-in English versification
import os
import kathairo
package_dir = os.path.dirname(kathairo.__file__)
vrs_path = os.path.join(package_dir, "versification", "eng.vrs")

config["targetVersificationPath"] = vrs_path

No Tokenizer Specified

Error: TypeError or no tokenizer found

Solution: Add exactly one tokenizer to your config:

config["latinWhiteSpaceIncludedTokenizer"] = True

Empty Output Files

If your TSV files are empty:

  • Check that your USFM/USX files are in the correct directory
  • Verify the files have proper USFM/USX formatting
  • Ensure the versification file matches your scripture structure
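
The first of these checks can be automated before running the pipeline. The sketch below is a hypothetical pre-flight helper, not part of kathairo; the file patterns are assumptions and should be adjusted to match how your corpus files are actually named.

```python
from pathlib import Path

# Assumed file patterns; adjust to match your corpus.
DEFAULT_PATTERNS = ("*.SFM", "*.usfm", "*.usx")


def preflight(config: dict, patterns=DEFAULT_PATTERNS) -> list[str]:
    """Return filesystem problems that are likely to cause empty output."""
    problems = []
    vrs = config.get("targetVersificationPath")
    if not vrs or not Path(vrs).is_file():
        problems.append(f"versification file not found: {vrs}")
    corpus = config.get("targetUsfmCorpusPath") or config.get("targetUsxCorpusPath")
    if corpus:
        folder = Path(corpus)
        if not folder.is_dir():
            problems.append(f"corpus directory not found: {corpus}")
        elif not any(True for pattern in patterns for _ in folder.glob(pattern)):
            problems.append(f"no files matching {patterns} in {corpus}")
    return problems
```

An empty list only means the files are where the config says they are; it does not validate the USFM/USX markup itself.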

Monitoring Progress

By default, process_corpus() runs quietly with minimal output. To monitor progress:

project, lang = config["projectName"], config["language"]

print(f"Processing {project}...")
result = process_corpus(config)
print("Done!")
print("Output files created:")
print(f"  - output/{lang}/{project}/verse/verse_{project}.tsv")
print(f"  - output/{lang}/{project}/token/token_{project}.tsv")

For a full Bible (66 books), processing typically takes 5-30 seconds, depending on system performance.
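To see how long a run takes on your own system, the call can be wrapped in a timer. This is a minimal sketch; timed_run is a hypothetical helper that accepts the processing function as an argument, so it works with process_corpus or with any stand-in function.

```python
import time
from typing import Any, Callable


def timed_run(process_fn: Callable[[dict], Any], config: dict) -> tuple[Any, float]:
    """Call process_fn(config) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = process_fn(config)
    elapsed = time.perf_counter() - start
    print(f"{config['projectName']}: finished in {elapsed:.1f}s")
    return result, elapsed
```

With kathairo installed, usage would look like result, seconds = timed_run(process_corpus, config).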

Performance Issues

For large projects:

  • Use the JSON config approach with multiple projects to leverage parallel processing
  • Each project is processed in a separate process for better performance

from kathairo.tsvs.build_tsv_json_parser import main

main("projects.json")  # Processes all projects in parallel

API Reference

Main Functions

process_corpus(config: dict) -> dict

Process a single scripture project and generate TSV files.

Parameters:

  • config (dict): Configuration dictionary with project settings

Returns:

  • Dictionary with processing results including:
    • Success status
    • Output file paths
    • Processing statistics (if applicable)

Note: Currently returns minimal information; primarily useful for error checking

Example:

from kathairo.tsvs.build_tsv_json_parser import process_corpus

result = process_corpus({
    "projectName": "MyBible",
    "language": "eng",
    "targetUsfmCorpusPath": "/path/to/usfm/",
    "targetVersificationPath": "/path/to/eng.vrs",
    "latinWhiteSpaceIncludedTokenizer": True
})

main(json_path: str) -> None

Process multiple projects in parallel from a JSON configuration file.

Parameters:

  • json_path (str): Path to JSON configuration file

Example:

from kathairo.tsvs.build_tsv_json_parser import main

main("projects_config.json")

save_versification(versification: Versification) -> None

Save a versification object to a .vrs file.

Parameters:

  • versification (Versification): SIL Machine Versification object

Output:

  • Creates src/kathairo/versification/output.vrs

Example:

from kathairo.versification.fix_versification import save_versification
from machine.scripture import Versification

vrs = Versification.load("input.vrs")
save_versification(vrs)

Lower-Level Functions

For advanced usage, you can access lower-level functions:

corpus_to_tsv(targetVersification, sourceVersification, corpus, tokenizer, ...)

Convert a corpus directly to TSV files.

tokens_to_tsv(targetVersification, sourceVersification, tsvPath, ...)

Re-versify an existing token TSV file.

Technical Details

Enhanced SIL Machine Features

kathairo extends sil-machine with:

  • Psalm Superscription Support: Modified USFM parser handles psalm superscription tags
  • Improved Tokenizers: Enhanced Latin tokenizer with whitespace preservation
  • Versification Mapping: Automatic verse mapping between source and target versifications

Text Processing Pipeline

  • Parsing: USFM/USX files parsed into structured corpus
  • Tokenization: Text tokenized using selected tokenizer
  • Versification: Verses mapped between source and target versifications
  • Filtering: Optional exclusion of bracketed text, cross-references, stop words
  • Output: Structured TSV files generated for verse and token levels

Project Structure

kathairo/
├── parsing/           # USFM and USX parsers
│   ├── usfm/         # USFM parsing with modified handlers
│   └── usx/          # USX parsing utilities
├── tokenization/      # Tokenizer implementations
├── tsvs/             # TSV building and processing
├── versification/    # Versification utilities
└── helpers/          # Utility functions

Contributing

Contributions are welcome! This project processes Scripture text and aims to provide reliable, accurate text processing for Bible translation and analysis workflows.

Author

Robertson Brinker - robertson.brinker@biblica.com

Acknowledgments

Built on SIL's machine.py - a machine learning and natural language processing library for Scripture.

Note to self: to update the Python package on PyPI (PowerShell syntax)

  • poetry config repositories.pypi https://upload.pypi.org/legacy/
  • $env:PYPI_USERNAME="__token__"
  • $env:PYPI_PASSWORD="<api-token>"
  • poetry publish --build --username $env:PYPI_USERNAME --password $env:PYPI_PASSWORD

Keywords

bible
