kathairo

Package Overview

Dependencies

Maintainers

Versions

Alerts

File Explorer

Advanced tools

License

Install Socket

Detect and block malicious and high-risk dependencies

Install

kathairo

Scripture processing pipeline for converting USFM and USX formats into structured TSV files

PyPI

Version: 0.1.0

Maintainers: 1

kathairo

Scripture Processing Pipeline: Parse, Tokenize, and Versify

kathairo is a comprehensive text processing pipeline for Scripture that converts USFM and USX formats into structured TSV files. Built on SIL's machine.py, kathairo provides both verse-level and token-level outputs with enhanced support for psalm superscriptions and advanced tokenization.

Installation

pip install kathairo

Requirements

Python 3.10 or 3.11
Dependencies (automatically installed):
- sil-machine==1.0.2
- spacy>=3.7.5
- polars>=1.4.0

Note: Kathairo works cross-platform (Windows, macOS, Linux). Use os.path.join() for portable paths, though forward slashes generally work on all platforms.

Quick Start

Minimal Example

The simplest way to use kathairo:

from kathairo.tsvs.build_tsv_json_parser import process_corpus

config = {
    "projectName": "MyBible",
    "language": "eng",
    "targetUsfmCorpusPath": "/path/to/usfm/",
    "targetVersificationPath": "/path/to/eng.vrs",
    "latinWhiteSpaceIncludedTokenizer": True
}

process_corpus(config)
# Output: output/eng/MyBible/verse/verse_MyBible.tsv
#         output/eng/MyBible/token/token_MyBible.tsv

Complete Workflow Example

Here's a realistic workflow from installation to processing:

# Step 1: Install kathairo
# pip install kathairo

# Step 2: Import and setup
import os
from kathairo.tsvs.build_tsv_json_parser import process_corpus
import kathairo

# Step 3: Locate built-in versification file
package_dir = os.path.dirname(kathairo.__file__)
vrs_path = os.path.join(package_dir, "versification", "eng.vrs")

# Step 4: Configure your project
config = {
    "projectName": "ESV",
    "language": "eng",
    "targetUsfmCorpusPath": "/home/user/bibles/ESV/usfm/",
    "targetVersificationPath": vrs_path,
    "latinWhiteSpaceIncludedTokenizer": True,
    "treatApostropheAsSingleQuote": False,
    "excludeBracketedText": False,
    "excludeCrossReferences": True,
    "psalmSuperscriptionTag": "d"
}

# Step 5: Process the corpus
print(f"Processing {config['projectName']}...")
result = process_corpus(config)
print("Done!")

# Output files created at:
# - output/eng/ESV/verse/verse_ESV.tsv
# - output/eng/ESV/token/token_ESV.tsv

Processing Multiple Projects in Parallel

Create a JSON configuration file (e.g., projects_config.json):

[
  {
    "projectName": "ESV",
    "language": "eng",
    "targetUsfmCorpusPath": "/scripture/ESV/usfm/",
    "targetVersificationPath": "/versification/eng.vrs",
    "latinWhiteSpaceIncludedTokenizer": true,
    "treatApostropheAsSingleQuote": false,
    "excludeBracketedText": false,
    "excludeCrossReferences": true,
    "stopWordsPath": "/config/english_stopwords.tsv",
    "zwRemovalPath": "/config/zw_removal.tsv",
    "psalmSuperscriptionTag": "d"
  },
  {
    "projectName": "RVR1960",
    "language": "spa",
    "targetUsfmCorpusPath": "/scripture/RVR1960/usfm/",
    "targetVersificationPath": "/versification/spa.vrs",
    "latinWhiteSpaceIncludedTokenizer": true,
    "excludeCrossReferences": true
  }
]

Then process all projects:

from kathairo.tsvs.build_tsv_json_parser import main

main("projects_config.json")

This processes all projects in parallel using Python's ProcessPoolExecutor.

Configuration Options

Option	Type	Description
`projectName`	string	Name of your project (used in output file naming)
`language`	string	Language code (e.g., "eng", "spa")
`targetUsfmCorpusPath`	string	Path to USFM files directory
`targetUsxCorpusPath`	string	Path to USX files directory (alternative to USFM)
`targetVersificationPath`	string	Path to versification file (.vrs)
`latinTokenizer`	boolean	Use standard Latin word tokenizer
`latinWhiteSpaceIncludedTokenizer`	boolean	Use Latin tokenizer with whitespace preservation
`chineseTokenizer`	boolean	Use Chinese Bible word tokenizer
`treatApostropheAsSingleQuote`	boolean	Handle apostrophes as single quotes
`excludeBracketedText`	boolean	Exclude text within square brackets
`excludeCrossReferences`	boolean	Exclude cross-reference text
`stopWordsPath`	string	Path to TSV file containing stop words
`zwRemovalPath`	string	Path to TSV file for zero-width character removal
`regexRulesPath`	string	Path to custom regex rules module
`psalmSuperscriptionTag`	string	USFM tag for psalm superscriptions (default: "d")
`tsvPath`	string	Path to existing token TSV (for re-versification)
`solo`	boolean	If present, only this project will be processed

Required Configuration Fields

At minimum, you must provide:

projectName: Your project identifier
language: Language code (e.g., "eng", "spa")
ONE input source: targetUsfmCorpusPath OR targetUsxCorpusPath OR tsvPath
targetVersificationPath: Path to .vrs file
ONE tokenizer: latinTokenizer, latinWhiteSpaceIncludedTokenizer, or chineseTokenizer

All other fields are optional and have sensible defaults.

Choosing a Tokenizer

You must specify exactly one tokenizer in your configuration:

latinTokenizer - Standard Latin word tokenizer from SIL Machine
- Use for: Most Latin-script languages (English, Spanish, French, etc.)
- Behavior: Standard word boundary detection
latinWhiteSpaceIncludedTokenizer - Enhanced Latin tokenizer with whitespace preservation
- Use for: Latin-script languages where you need precise whitespace tracking
- Behavior: Preserves whitespace information for accurate reconstruction
- Recommended for most projects
chineseTokenizer - Chinese Bible word tokenizer
- Use for: Chinese scripture text
- Behavior: Specialized Chinese word segmentation

Example:

# Pick ONE of these:
"latinTokenizer": True                      # Standard
"latinWhiteSpaceIncludedTokenizer": True    # Recommended
"chineseTokenizer": True                    # Chinese only

Output Format

Verse-Level TSV

Generated at: output/{language}/{projectName}/verse/verse_{projectName}.tsv

Column	Description
`id`	Verse identifier (BBCCCVVV format)
`source_verse`	Source versification verse ID
`text`	Complete verse text
`id_range_end`	End verse for verse ranges
`source_verse_range_end`	End verse in source versification

Token-Level TSV

Generated at: output/{language}/{projectName}/token/token_{projectName}.tsv

Column	Description
`id`	Token identifier (BBCCCVVVWWW format)
`source_verse`	Source versification verse ID
`text`	Token text
`skip_space_after`	"y" if no space should follow this token, empty otherwise
`exclude`	"y" if token should be excluded from analysis (footnotes, cross-references based on `excludeCrossReferences`), empty otherwise
`id_range_end`	End verse for verse ranges
`source_verse_range_end`	End verse in source versification
`required`	"y" if token contains non-punctuation, "n" otherwise

Sample Output

Here's what the actual output looks like for Genesis 1:1-2:

Verse-Level Sample:

id          source_verse  text
01001001    01001001      In the beginning God created the heavens and the earth.
01001002    01001002      Now the earth was formless and void, and darkness was over the surface of the deep. And the Spirit of God was hovering over the surface of the waters.

Token-Level Sample:

id           source_verse  text       skip_space_after  exclude  required
01001001001  01001001      In                                    y
01001001002  01001001      the                                   y
01001001003  01001001      beginning                             y
01001001004  01001001      God                                   y
01001001005  01001001      created                               y
01001001006  01001001      the                                   y
01001001007  01001001      heavens                               y
01001001008  01001001      and                                   y
01001001009  01001001      the                                   y
01001001010  01001001      earth      y                          y
01001001011  01001001      .                            y        n
01001002001  01001002      Now                                   y
01001002002  01001002      the                                   y
01001002003  01001002      earth                                 y
...

Notice:

Verse IDs use BBCCCVVV format (01001001 = Genesis 1:1)
Token IDs add WWW for word position (01001001010 = Genesis 1:1, word 10)
skip_space_after is "y" for "earth" before the period
required is "n" for punctuation like "."

Usage Examples

Example 1: Processing USFM Files

from kathairo.tsvs.build_tsv_json_parser import process_corpus

config = {
    "projectName": "ESV",
    "language": "eng",
    "targetUsfmCorpusPath": "/data/scripture/ESV/usfm/",
    "targetVersificationPath": "/data/versification/eng.vrs",
    "latinWhiteSpaceIncludedTokenizer": True,
    "excludeCrossReferences": True,
    "psalmSuperscriptionTag": "d"
}

process_corpus(config)

Example 2: Processing USX Files

config = {
    "projectName": "NIV",
    "language": "eng",
    "targetUsxCorpusPath": "/data/scripture/NIV/usx/",  # Use USX instead of USFM
    "targetVersificationPath": "/data/versification/eng.vrs",
    "latinTokenizer": True  # Standard Latin tokenizer
}

process_corpus(config)

Example 3: Chinese Scripture with Custom Tokenizer

config = {
    "projectName": "CUV",
    "language": "zho",
    "targetUsfmCorpusPath": "/data/scripture/CUV/usfm/",
    "targetVersificationPath": "/data/versification/chinese.vrs",
    "chineseTokenizer": True,  # Use Chinese-specific tokenizer
    "excludeBracketedText": True
}

process_corpus(config)

Example 4: Re-versification from Existing Token TSV

If you already have a token-level TSV and want to re-versify it:

config = {
    "projectName": "ESV_Reversified",
    "language": "eng",
    "tsvPath": "/data/output/eng/ESV/token/token_ESV.tsv",  # Use existing TSV
    "targetVersificationPath": "/data/versification/new_versification.vrs",
    "latinWhiteSpaceIncludedTokenizer": True,
    "stopWordsPath": "/data/config/stopwords.tsv",
    "zwRemovalPath": "/data/config/zw_removal.tsv"
}

process_corpus(config)

Example 5: With Stop Words and Zero-Width Character Removal

Create stop words file (stopwords.tsv):

stop_words
\u200b
\u200c
\u200d

Create zero-width removal file (zw_removal.tsv):

words
word\u200bwith\u200bzwsp
another\u200cword

Then use them:

config = {
    "projectName": "MyBible",
    "language": "hin",
    "targetUsfmCorpusPath": "/data/hindi/usfm/",
    "targetVersificationPath": "/data/versification/hindi.vrs",
    "latinWhiteSpaceIncludedTokenizer": True,
    "stopWordsPath": "/data/config/stopwords.tsv",
    "zwRemovalPath": "/data/config/zw_removal.tsv"
}

process_corpus(config)

Advanced Usage

Custom Regex Rules

Create a custom Python module (custom_regex.py):

import re

class CustomRegexRules:
    # Define custom punctuation patterns for word-level punctuation
    WORD_LEVEL_PUNCT_REGEX = re.compile(
        r'(?<=[a-zA-Z])[\''](?=[a-zA-Z])|'  # Apostrophes within words
        r'(?<=[0-9])[,.](?=[0-9])'          # Commas/periods in numbers
    )

Reference it in your config:

config = {
    "projectName": "MyBible",
    "language": "eng",
    "targetUsfmCorpusPath": "/data/usfm/",
    "targetVersificationPath": "/data/versification/eng.vrs",
    "latinWhiteSpaceIncludedTokenizer": True,
    "regexRulesPath": "/path/to/custom_regex.py"  # Use custom rules
}

process_corpus(config)

Solo Mode for Testing

When working with multiple projects in a JSON config, add "solo": true to process only one:

[
  {
    "projectName": "TestProject",
    "language": "eng",
    "targetUsfmCorpusPath": "/test/usfm/",
    "targetVersificationPath": "/test/eng.vrs",
    "latinTokenizer": true,
    "solo": true
  },
  {
    "projectName": "OtherProject",
    "language": "spa",
    ...
  }
]

Only TestProject will be processed.

Required Files and Structure

Input Files

Scripture Files - Either USFM or USX format:
- USFM: Plain text files with USFM markup
  - Extensions: .SFM
  - Naming: Flexible (e.g., 01-GEN.usfm, 01GENBSB.SFM, GEN.usfm)
  - Directory should contain all books you want to process
- USX: XML files following USX standard
  - Directory should contain all books you want to process
Versification File (.vrs) - Defines verse structure:
```
# Versification  "English"
GEN 1:31 2:25 3:24 4:26 5:32 ...
EXO 1:22 2:25 3:22 ...
```
- kathairo includes eng.vrs at src/kathairo/versification/eng.vrs
- Format: BOOK chapter:lastVerse chapter:lastVerse ...
Optional: Stop Words TSV:
```
stop_words
\u200b
\u200c
\u200d
```

Optional: Zero-Width Removal TSV:

words
problematic\u200bword
another\u200cexample

Output Structure

kathairo automatically creates this directory structure in your current working directory:

output/
└── {language}/
    └── {projectName}/
        ├── verse/
        │   └── verse_{projectName}.tsv
        └── token/
            └── token_{projectName}.tsv

Note: The output/ directory is created relative to where you run the script, not relative to the input files.

Versification Support

kathairo handles complex versification mappings between different Bible versification systems.

Using Built-in English Versification

import os
from kathairo.tsvs.build_tsv_json_parser import process_corpus

# Get the package's built-in versification file
import kathairo
package_dir = os.path.dirname(kathairo.__file__)
vrs_path = os.path.join(package_dir, "versification", "eng.vrs")

config = {
    "projectName": "MyBible",
    "language": "eng",
    "targetUsfmCorpusPath": "/data/usfm/",
    "targetVersificationPath": vrs_path,  # Use built-in versification
    "latinWhiteSpaceIncludedTokenizer": True
}

process_corpus(config)

Using a Local Versification File

If your versification file is in the same directory as your USFM files:

import os
from kathairo.tsvs.build_tsv_json_parser import process_corpus

# Get the current script directory
current_dir = os.path.dirname(os.path.abspath(__file__))

config = {
    "projectName": "BSB",
    "language": "eng",
    "targetUsfmCorpusPath": os.path.join(current_dir, "bsb_usfm"),
    "targetVersificationPath": os.path.join(current_dir, "bsb_usfm", "versification.vrs"),
    "latinWhiteSpaceIncludedTokenizer": True,
    "excludeCrossReferences": True
}

process_corpus(config)

Creating Custom Versification

from kathairo.versification.fix_versification import save_versification
from machine.scripture import Versification, VerseRef

# Load and modify a versification
versification = Versification.load("path/to/base_versification.vrs")

# Make modifications (example: adjust Genesis 1:1 mapping)
versification.mappings.add_mapping(
    VerseRef("GEN", 1, 1, versification),
    VerseRef("GEN", 1, 2, versification)
)

# Save the modified versification
save_versification(versification)
# Creates: src/kathairo/versification/output.vrs

Re-versifying Existing TSV Output

If you need to apply a different versification to already-processed scripture:

config = {
    "projectName": "ESV_NewVersification",
    "language": "eng",
    "tsvPath": "/output/eng/ESV/token/token_ESV.tsv",  # Existing token TSV
    "targetVersificationPath": "/path/to/new_versification.vrs",  # New versification
    "latinWhiteSpaceIncludedTokenizer": True
}

process_corpus(config)

Common Issues and Troubleshooting

Import Errors

If you get ModuleNotFoundError: No module named 'kathairo':

pip install kathairo
# or
pip install kathairo --upgrade

Versification File Not Found

If you see errors about missing versification files:

# Use the built-in English versification
import os
import kathairo
package_dir = os.path.dirname(kathairo.__file__)
vrs_path = os.path.join(package_dir, "versification", "eng.vrs")

config["targetVersificationPath"] = vrs_path

No Tokenizer Specified

Error: TypeError or no tokenizer found

Solution: Add exactly one tokenizer to your config:

config["latinWhiteSpaceIncludedTokenizer"] = True

Empty Output Files

If your TSV files are empty:

Check that your USFM/USX files are in the correct directory
Verify the files have proper USFM/USX formatting
Ensure the versification file matches your scripture structure

Monitoring Progress

By default, process_corpus() runs quietly with minimal output. To monitor progress:

print(f"Processing {config['projectName']}...")
result = process_corpus(config)
print("Done!")
print(f"Output files created:")
print(f"  - output/{config['language']}/{config['projectName']}/verse/verse_{config['projectName']}.tsv")
print(f"  - output/{config['language']}/{config['projectName']}/token/token_{config['projectName']}.tsv")

For large Bibles (66 books), processing typically takes 5-30 seconds depending on system performance.

Performance Issues

For large projects:

Use the JSON config approach with multiple projects to leverage parallel processing
Each project is processed in a separate process for better performance

from kathairo.tsvs.build_tsv_json_parser import main
main("projects.json")  # Processes all projects in parallel

API Reference

Main Functions

`process_corpus(config: dict) -> dict`

Process a single scripture project and generate TSV files.

Parameters:

config (dict): Configuration dictionary with project settings

Returns:

Dictionary with processing results including:
- Success status
- Output file paths
- Processing statistics (if applicable)

Note: Currently returns minimal information; primarily useful for error checking

Example:

from kathairo.tsvs.build_tsv_json_parser import process_corpus

result = process_corpus({
    "projectName": "MyBible",
    "language": "eng",
    "targetUsfmCorpusPath": "/path/to/usfm/",
    "targetVersificationPath": "/path/to/eng.vrs",
    "latinWhiteSpaceIncludedTokenizer": True
})

`main(json_path: str) -> None`

Process multiple projects in parallel from a JSON configuration file.

Parameters:

json_path (str): Path to JSON configuration file

Example:

from kathairo.tsvs.build_tsv_json_parser import main

main("projects_config.json")

`save_versification(versification: Versification) -> None`

Save a versification object to a .vrs file.

Parameters:

versification (Versification): SIL Machine Versification object

Output:

Creates src/kathairo/versification/output.vrs

Example:

from kathairo.versification.fix_versification import save_versification
from machine.scripture import Versification

vrs = Versification.load("input.vrs")
save_versification(vrs)

Lower-Level Functions

For advanced usage, you can access lower-level functions:

`corpus_to_tsv(targetVersification, sourceVersification, corpus, tokenizer, ...)`

Convert a corpus directly to TSV files.

`tokens_to_tsv(targetVersification, sourceVersification, tsvPath, ...)`

Re-versify an existing token TSV file.

Technical Details

Enhanced SIL Machine Features

kathairo extends sil-machine with:

Psalm Superscription Support: Modified USFM parser handles psalm superscription tags
Improved Tokenizers: Enhanced Latin tokenizer with whitespace preservation
Versification Mapping: Automatic verse mapping between source and target versifications

Text Processing Pipeline

Parsing: USFM/USX files parsed into structured corpus
Tokenization: Text tokenized using selected tokenizer
Versification: Verses mapped between source and target versifications
Filtering: Optional exclusion of bracketed text, cross-references, stop words
Output: Structured TSV files generated for verse and token levels

Project Structure

kathairo/
├── parsing/           # USFM and USX parsers
│   ├── usfm/         # USFM parsing with modified handlers
│   └── usx/          # USX parsing utilities
├── tokenization/      # Tokenizer implementations
├── tsvs/             # TSV building and processing
├── versification/    # Versification utilities
└── helpers/          # Utility functions

Contributing

Contributions are welcome! This project processes Scripture text and aims to provide reliable, accurate text processing for Bible translation and analysis workflows.

Author

Robertson Brinker - robertson.brinker@biblica.com

Acknowledgments

Built on SIL's machine.py - a machine learning and natural language processing library for Scripture.

Note to self: To update python package

poetry config repositories.pypi https://upload.pypi.org/legacy/
$env:PYPI_USERNAME="__token__"
$env:PYPI_PASSWORD="<api-token>"
poetry publish --build --username $env:PYPI_USERNAME --password $env:PYPI_PASSWORD

Keywords

FAQs

What is kathairo?

Is kathairo well maintained?

Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

kathairo

kathairo

Scripture Processing Pipeline: Parse, Tokenize, and Versify

Installation

Requirements

Quick Start

Minimal Example

Complete Workflow Example

Processing Multiple Projects in Parallel

Configuration Options

Required Configuration Fields

Choosing a Tokenizer

Output Format

Verse-Level TSV

Token-Level TSV

Sample Output

Usage Examples

Example 1: Processing USFM Files

Example 2: Processing USX Files

Example 3: Chinese Scripture with Custom Tokenizer

Example 4: Re-versification from Existing Token TSV

Example 5: With Stop Words and Zero-Width Character Removal

Advanced Usage

Custom Regex Rules

Solo Mode for Testing

Required Files and Structure

Input Files

Output Structure

Versification Support

Using Built-in English Versification

Using a Local Versification File

Creating Custom Versification

Re-versifying Existing TSV Output

Common Issues and Troubleshooting

Import Errors

Versification File Not Found

No Tokenizer Specified

Empty Output Files

Monitoring Progress

Performance Issues

API Reference

Main Functions

process_corpus(config: dict) -> dict

main(json_path: str) -> None

save_versification(versification: Versification) -> None

Lower-Level Functions

corpus_to_tsv(targetVersification, sourceVersification, corpus, tokenizer, ...)

tokens_to_tsv(targetVersification, sourceVersification, tsvPath, ...)

Technical Details

Enhanced SIL Machine Features

Text Processing Pipeline

Project Structure

Contributing

Author

Acknowledgments

Note to self: To update python package

Keywords

Related posts

Insecure Agents Podcast: Certified Patches, Supply Chain Security, and AI Agents

Tailwind CSS Announces 75% Layoffs as LLMs Reshape OSS Business Models

`process_corpus(config: dict) -> dict`

`main(json_path: str) -> None`

`save_versification(versification: Versification) -> None`

`corpus_to_tsv(targetVersification, sourceVersification, corpus, tokenizer, ...)`

`tokens_to_tsv(targetVersification, sourceVersification, tsvPath, ...)`