Piraye: Advanced NLP Utilities for Persian, Arabic, and English
Piraye is a Python library providing flexible text normalization and tokenization utilities for Persian, Arabic,
and English NLP tasks. With comprehensive type hints, extensive documentation, and a clean architecture, Piraye is
production-ready for modern NLP pipelines.
🚀 Key Features
| Feature | Description |
| --- | --- |
| Multi-Language Normalization | Normalize alphabets, digits, punctuation, and whitespace for Persian, Arabic, and English. |
| Advanced Tokenization | Regex-based, NLTK-based, spaCy-based, and custom tokenizers with hierarchical support. |
| Tokenizer Pipeline | Chain multiple tokenizers for sophisticated text processing workflows. |
| Position Tracking | Map positions between original and normalized text. |
| Multi-Lingual Detection | Automatic language detection and appropriate normalization. |
| Type Safe | Complete type hints for modern Python development. |
| Well Documented | Comprehensive documentation and usage examples. |
| Production Ready | Clean architecture, extensive testing, and easy integration. |
📦 Installation
Basic Installation
pip install piraye
Full Installation (with spaCy support)
pip install piraye[full]
Requirements: Python 3.11+
🧠 Quick Start: Text Normalization
Normalize Persian text by correcting and standardizing letters, digits, and punctuation, performing tokenization,
and removing extra spaces to produce clean, consistent text ready for NLP processing.
Basic Normalization (Builder Pattern)
from piraye import NormalizerBuilder
text = "این یک متن تسة اسﺘ , 24/12/1400 "
normalizer = (NormalizerBuilder()
              .alphabet_fa()
              .digit_fa()
              .punctuation_fa()
              .tokenizing()
              .remove_extra_spaces()
              .build())
normalized_text, result = normalizer.normalize(text)
print(normalized_text)
print(result.shifts)
print(result.punc_positions)
Using the Config Constructor
from piraye import NormalizerBuilder
from piraye.tasks.normalizer.normalizer_builder import Config
text = "این یک متن تسة اسﺘ , 24/12/1400 "
normalizer = NormalizerBuilder(
    configs=[Config.PUNCTUATION_FA, Config.ALPHABET_FA, Config.DIGIT_FA],
    remove_extra_spaces=True,
    tokenization=True
).build()
normalized_text, result = normalizer.normalize(text)
print(normalized_text)
📖 For more examples and usage patterns, see Normalizer Examples.
📊 Normalizer Output
The normalize() method returns a tuple containing the normalized text and a NormalizationResult object with
metadata.
Return Value Structure
normalized_text, result = normalizer.normalize(text)
NormalizationResult Properties
| Property | Type | Description |
| --- | --- | --- |
| shifts | list[tuple[int, int]] | Position shifts tracking character position changes during normalization. Format: (position, shift) |
| punc_positions | list[int] | Punctuation character positions in the normalized text |
Example
from piraye import NormalizerBuilder
normalizer = (NormalizerBuilder()
              .alphabet_fa()
              .punctuation_fa()
              .digit_fa()
              .remove_extra_spaces()
              .build())
text = "سلام، این ۱۲۳ است."
normalized_text, result = normalizer.normalize(text)
print(normalized_text)
print(result.shifts)
print(result.punc_positions)
for pos in result.punc_positions:
    char = normalized_text[pos]
    print(f"Punctuation at position {pos}: '{char}'")
🔢 Position Mapping After Normalization
When normalizing text, characters may be added, removed, or replaced. Piraye tracks these changes and provides utilities
to map positions between normalized and original text.
Methods
| Method | Description |
| --- | --- |
| calc_original_position(shifts, position) | Returns the original position for a single index in the normalized text. |
| calc_original_positions(shifts, positions) | Returns original positions for multiple indices (positions must be sorted). |
Example
from piraye import NormalizerBuilder
normalizer = (NormalizerBuilder()
              .space_normal()
              .remove_extra_spaces()
              .alphabet_en()
              .punctuation_en()
              .build())
text = "Hello , World !"
normalized_text, result = normalizer.normalize(text)
shifts = result.shifts
print(f"Shifts: {shifts}")
original_pos = normalizer.calc_original_position(shifts, 7)
print(f"Position 7 in normalized text was at position {original_pos} in original")
positions = [3, 7, 12]
original_positions = normalizer.calc_original_positions(shifts, positions)
print(f"Positions {positions} map to {original_positions} in original text")
Working with Punctuation Positions
from piraye import NormalizerBuilder
normalizer = (NormalizerBuilder()
              .alphabet_fa()
              .punctuation_fa()
              .build())
text = "سلام، این یک متن است."
normalized_text, result = normalizer.normalize(text)
print(f"Punctuation found at positions: {result.punc_positions}")
punc_chars = [normalized_text[pos] for pos in result.punc_positions]
print(f"Punctuation characters: {punc_chars}")
💡 Tip: Use position mapping to align annotations, highlight text, or track character positions through
normalization.
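For instance, here is a minimal sketch of aligning an annotation span with the original text, reusing the normalizer from the example above. The span boundaries are illustrative, and it assumes calc_original_positions returns the mapped indices in input order:

from piraye import NormalizerBuilder

normalizer = (NormalizerBuilder()
              .space_normal()
              .remove_extra_spaces()
              .alphabet_en()
              .punctuation_en()
              .build())
text = "Hello , World !"
normalized_text, result = normalizer.normalize(text)
# Suppose an annotation covers a span of the normalized text (illustrative values)
start, end = 6, 11
# Map both boundaries back to the original text; the positions must be sorted
orig_start, orig_end = normalizer.calc_original_positions(result.shifts, [start, end])
print(f"'{normalized_text[start:end]}' came from '{text[orig_start:orig_end]}'")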
⚙️ Configurations
Piraye provides various configurations for text normalization:
| Config | Builder Method | Description |
| --- | --- | --- |
| ALPHABET_AR | alphabet_ar | Maps alphabet characters to Arabic |
| ALPHABET_EN | alphabet_en | Maps alphabet characters to English |
| ALPHABET_FA | alphabet_fa | Maps alphabet characters to Persian |
| DIGIT_AR | digit_ar | Converts digits to Arabic digits |
| DIGIT_EN | digit_en | Converts digits to English digits |
| DIGIT_FA | digit_fa | Converts digits to Persian digits |
| DIACRITIC_DELETE | diacritic_delete | Removes all diacritics |
| SPACE_DELETE | space_delete | Removes all spaces |
| SPACE_NORMAL | space_normal | Normalizes space characters (e.g., NO-BREAK SPACE, tab) |
| SPACE_KEEP | space_keep | Maps space characters and keeps them as-is |
| PUNCTUATION_AR | punctuation_ar | Maps punctuation marks to Arabic punctuation |
| PUNCTUATION_FA | punctuation_fa | Maps punctuation marks to Persian punctuation |
| PUNCTUATION_EN | punctuation_en | Maps punctuation marks to English punctuation |
Other attributes:
- remove_extra_spaces: Collapses multiple consecutive spaces into a single space.
- tokenization: Converts punctuation characters into separate tokens.
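As a quick sketch, several configurations can be combined through the builder. The sample text is illustrative, and the exact output depends on the character mappings shipped with the library:

from piraye import NormalizerBuilder

# Strip diacritics, normalize unusual space characters, map digits to
# English, and collapse runs of spaces
normalizer = (NormalizerBuilder()
              .diacritic_delete()
              .space_normal()
              .digit_en()
              .remove_extra_spaces()
              .build())
text = "سَلامٌ   ۱۲۳"
normalized_text, result = normalizer.normalize(text)
print(normalized_text)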
✂️ Tokenization Framework
All tokenizers inherit from the Tokenizer abstract base class and produce Token objects with rich metadata.
Token Structure
| Field | Type | Description |
| --- | --- | --- |
| content | str | The text content of the token. |
| type | str | The type or name of the tokenizer that created it. |
| position | tuple[int, int] | Start and end indices of the token in the original text. |
| sub_tokens | List[Token] | A list of child tokens (for hierarchical tokenization). |
Base Methods
- tokenize(text: str) -> List[Token] – Main tokenization method
- merge(text: str, previous_tokens: List[Token]) -> List[Token] – Merge tokens hierarchically
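A minimal custom tokenizer might look like the sketch below. The import paths are inferred from the project structure further down, and the Token constructor arguments are assumptions based on the field table above; if merge is abstract in your version, it must be implemented as well. Check the package for the exact signatures.

import re
from typing import List

# Assumed import paths, inferred from the project layout
from piraye.tasks.tokenizer.token import Token
from piraye.tasks.tokenizer.tokenizers.base_tokenizer import Tokenizer


class HashtagTokenizer(Tokenizer):
    """Illustrative tokenizer that extracts #hashtags from text."""

    def tokenize(self, text: str) -> List[Token]:
        tokens = []
        for match in re.finditer(r"#\w+", text):
            # Token fields follow the table above: content, type, position
            tokens.append(Token(content=match.group(),
                                type="hashtag",
                                position=(match.start(), match.end())))
        return tokens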
🔤 Built-in Tokenizers
NLTK-based Tokenizers
- NltkWordTokenizer – Word-level tokenization using NLTK
- NltkSentenceTokenizer – Sentence-level tokenization using the Punkt algorithm
spaCy-based Tokenizers
- SpacyWordTokenizer – Word-level tokenization using spaCy
- SpacySentenceTokenizer – Sentence-level tokenization using spaCy
Regex-based Tokenizers
- RegexTokenizer – Generic regex pattern tokenizer
- URLTokenizer – Extracts URLs from text
- EmailTokenizer – Extracts email addresses from text
- HTMLTokenizer – Extracts HTML tags from text
Structural Tokenizers
- ParagraphTokenizer – Splits text into paragraphs
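Each of these can also be used on its own; here is a short sketch with URLTokenizer (the sample text is illustrative):

from piraye.tasks.tokenizer import URLTokenizer

tokenizer = URLTokenizer()
tokens = tokenizer.tokenize("Docs live at https://example.com/docs and https://example.org.")
for token in tokens:
    # Each Token carries its content, type, and (start, end) position
    print(token.content, token.type, token.position)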
🔄 TokenizerPipeline: Hierarchical Tokenization
The TokenizerPipeline class provides a modular and sequential approach to text tokenization. It allows you to chain
multiple tokenizers together, where the output of one tokenizer can be merged or refined by the next. This design makes
it easy to combine tokenizers (e.g., sentences, words, emojis, URLs) into a unified pipeline for flexible and powerful
text preprocessing.
How It Works
The pipeline starts with the first tokenizer, which processes the raw text. Each subsequent tokenizer is applied
sequentially, refining or extending the previous tokens. The final result is a merged list of Token objects representing
a fully tokenized text.
Example Usage
from piraye.tasks.tokenizer import NltkSentenceTokenizer
from piraye.tasks.tokenizer import EmailTokenizer
from piraye.tasks.tokenizer.pipeline import TokenizerPipeline
# Sentence tokens come first; the email tokenizer then refines them
pipeline = TokenizerPipeline([
    NltkSentenceTokenizer(),
    EmailTokenizer()
])
text = "Contact us at support@arusha.dev or info@piraye.ai."
tokens = pipeline(text)
print([t.content for t in tokens])
Paragraph Tokenizer Example
from piraye.tasks.tokenizer import ParagraphTokenizer
text = "First paragraph.\nSecond paragraph.\nThird paragraph."
tokenizer = ParagraphTokenizer()
tokens = tokenizer.tokenize(text)
for token in tokens:
    print(token)
📖 For more examples and usage patterns, see Tokenizing Examples.
📁 Project Structure
piraye/
├── piraye/
│   ├── __init__.py
│   ├── constants.py
│   └── tasks/
│       ├── normalizer/
│       │   ├── __init__.py
│       │   ├── char_config.py
│       │   ├── character_normalizer.py
│       │   ├── mappings.py
│       │   ├── multi_lingual_normalizer.py
│       │   ├── multi_lingual_normalizer_builder.py
│       │   ├── normalizer.py
│       │   ├── normalizer_builder.py
│       │   └── data/
│       │       ├── alphabets/
│       │       ├── digits/
│       │       ├── others/
│       │       └── puncs/
│       └── tokenizer/
│           ├── __init__.py
│           ├── pipeline.py
│           ├── token.py
│           └── tokenizers/
│               ├── __init__.py
│               ├── base_tokenizer.py
│               ├── nltk_tokenizer.py
│               ├── spacy_tokenizer.py
│               ├── regex_tokenizer.py
│               ├── paragraph_tokenizer.py
│               └── regex_tokenizers/
│                   ├── __init__.py
│                   ├── base_regex_tokenizer.py
│                   ├── url_tokenizer.py
│                   ├── email_tokenizer.py
│                   ├── html_tokenizer.py
│                   └── README.md
├── tests/
│   ├── test_normalizer.py
│   ├── test_ml_normalizer.py
│   ├── test_tokenizer.py
│   ├── test_tokenizer_pipeline.py
│   ├── test_html_tokenizer.py
│   └── ...
├── README.md
├── LICENSE
└── pyproject.toml
📄 License
GNU Lesser General Public License v2.1
See LICENSE
❤️ Maintainers
Piraye is maintained by Arusha.
Authors:
- Hamed Khademi Khaledi
- HosseiN Khademi Khaledi
- Majid Asgari Bidhendi
For questions or support, please open an issue on GitHub or contact us at info@arusha.dev.
🌟 Show Your Support
If you find Piraye useful, please consider:
- ⭐ Starring the repository on GitHub
- 📢 Sharing it with others who might benefit
- 🐛 Reporting bugs or suggesting features
- 🤝 Contributing to the codebase
Thank you for using Piraye! 🎉