CJK Text Formatter
A Python CLI tool for polishing text with CJK (Chinese, Japanese, Korean) typography rules. Automatically formats mixed CJK-English text, fixes em-dash spacing, normalizes ellipsis, and more.
Features
Universal Rules (All Languages)
- Ellipsis normalization: Converts . . . or . . . . to ... with proper spacing
Chinese-Specific Rules
Quote & Punctuation:
- Double quote spacing (“”): Smart spacing that excludes CJK punctuation with built-in visual spacing
- Single quote spacing (‘’): Same smart rules as double quotes
- Em-dash conversion: Converts -- (two or more hyphens) to —— between CJK characters only
- Em-dash spacing: Fixes spacing around existing —— with context-aware rules
- Full-width punctuation: Normalizes , → , and . → 。 in CJK context
- Full-width parentheses/brackets: (text) → (text) in CJK context
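A minimal sketch of what "in CJK context" could mean for punctuation width. It is illustrative only: the helper name `to_fullwidth_punct`, the Han range, and the "directly after a Han character" test are assumptions, not the tool's actual implementation.

```python
import re

# Assumption: "CJK context" is approximated as "directly after a Han character".
HAN = r"\u4e00-\u9fff"
FULLWIDTH = {",": ",", ".": "。", "!": "!", "?": "?", ";": ";", ":": ":"}
_HALFWIDTH_AFTER_HAN = re.compile(rf"(?<=[{HAN}])[,.!?;:]")


def to_fullwidth_punct(text: str) -> str:
    """Replace half-width punctuation that directly follows a Han character."""
    return _HALFWIDTH_AFTER_HAN.sub(lambda m: FULLWIDTH[m.group(0)], text)


# The dot inside 3.11 follows a digit, so it is left alone.
print(to_fullwidth_punct("Python是一门语言,有3.11版本."))
```

Pure English text such as "Hello, world." passes through unchanged because no mark follows a Han character.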
Normalization:
- Full-width alphanumeric: Converts 123ABC → 123ABC
- Consecutive punctuation cleanup: !!! → ! (configurable limit; off by default)
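The full-width alphanumeric conversion can be sketched with a translation table, since full-width digits and Latin letters sit exactly 0xFEE0 above their ASCII counterparts in Unicode. The function name `to_halfwidth_alnum` is hypothetical, and the tool may implement this differently.

```python
# Full-width digits (U+FF10-FF19) and letters (U+FF21-FF3A, U+FF41-FF5A)
# are offset from ASCII by 0xFEE0.
_RANGES = [(0xFF10, 0xFF19), (0xFF21, 0xFF3A), (0xFF41, 0xFF5A)]
_TABLE = {cp: cp - 0xFEE0 for lo, hi in _RANGES for cp in range(lo, hi + 1)}


def to_halfwidth_alnum(text: str) -> str:
    """Map full-width digits and Latin letters to ASCII; leave the rest alone."""
    return text.translate(_TABLE)


print(to_halfwidth_alnum("123ABC"))  # 123ABC
```

Only digits and letters are in the table, so full-width punctuation such as , and 。 is left untouched.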
Spacing:
- CJK-English spacing: Auto-adds spaces between Chinese & English/numbers
- Currency spacing: $ 100 → $100
- Slash spacing: A / B → A/B
- Multiple space collapsing: Reduces consecutive spaces (preserves indentation)
- Trailing space removal: Cleans up line endings
- Excessive newline collapsing: Limits to max one blank line
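The three whitespace cleanups above can be sketched together. This is an assumption-laden illustration (the name `tidy_whitespace` and the exact order of steps are not taken from the project): leading indentation is kept, runs of spaces inside a line collapse to one, trailing whitespace is dropped, and three or more newlines collapse to two.

```python
import re


def tidy_whitespace(text: str) -> str:
    """Collapse inner spaces, strip trailing spaces, limit blank lines to one."""
    cleaned = []
    for line in text.split("\n"):
        stripped = line.lstrip(" \t")
        indent = line[: len(line) - len(stripped)]  # preserved as-is
        body = re.sub(r" {2,}", " ", stripped).rstrip()
        # Whitespace-only lines become empty so they count as blank lines.
        cleaned.append(indent + body if body else "")
    return re.sub(r"\n{3,}", "\n\n", "\n".join(cleaned))


print(repr(tidy_whitespace("    a   b  \n\n\n\nc")))  # '    a b\n\nc'
```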
File Type Support
- Plain Text (.txt): Direct formatting
- Markdown (.md): Preserves code blocks (fenced, indented, inline)
- HTML (.html, .htm): Formats text content while preserving tags and <code>/<pre> elements
CJK Language Support
Fully Supported Languages:
- Chinese: All Han characters (汉字) and Chinese punctuation
- Japanese: Kanji (漢字), Hiragana (ひらがな), Katakana (カタカナ)
- Korean: Hangul (한글) with Hanja (漢字)
Language-Specific Behavior:
- Chinese & Japanese: Fullwidth punctuation normalization (,。!?)
- Korean: Preserves Western/halfwidth punctuation (. , ! ?)
- All CJK: CJK-English spacing, quote spacing, alphanumeric normalization
Note: CJK-specific typography rules apply when Han characters (Chinese/Kanji/Hanja) are present in the text. This covers the vast majority of real-world CJK text usage.
Installation
Requirements
Install from PyPI
pip install cjk-text-formatter
pip install cjk-text-formatter[html]
Install from Source
git clone https://github.com/xiaolai/cjk-text-formatter.git
cd cjk-text-formatter
pip install -e .
pip install -e ".[html]"
Verify Installation
ctf --version
ctf --help
ctf "文本English混合"
Usage
Command Line
ctf "文本English混合"
ctf "《书名》--作者"
echo "文本English混合" | ctf
ctf input.txt
ctf input.md --output formatted.md
ctf document.txt --inplace
ctf document.txt --dry-run
ctf ./docs/ --inplace
ctf ./docs/ --recursive --inplace
ctf ./docs/ --inplace -e .md -e .txt
Python API
from pathlib import Path

from cjk_text_formatter.polish import polish_text
from cjk_text_formatter.processors import process_file, find_files

# Format a string of mixed CJK-English text
text = "文本English混合,数字123也包含。"
result = polish_text(text)
print(result)

# Format a string containing a double hyphen
text = "《Python编程》--一本好书"
result = polish_text(text)
print(result)

# Format a single file
result = process_file(Path("document.md"))

# Format every matching file under a directory
files = find_files(Path("./docs"), recursive=True, extensions=['.md', '.txt'])
for file in files:
    result = process_file(file)
Configuration
Config file support requires Python 3.11+ (it uses the built-in tomllib module). On older Python versions, all rules are enabled by default.
Config File Locations
Configuration is loaded with the following priority (highest to lowest):
- Custom path: ctf --config /path/to/config.toml
- Project root: ./cjk-text-formatter.toml
- User config: ~/.config/cjk-text-formatter.toml
- Defaults: All rules enabled
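The lookup order above amounts to "first existing file wins". A rough sketch under stated assumptions: `resolve_config` is a hypothetical helper, and a missing custom path simply falls through here, whereas the real CLI may report an error instead.

```python
from pathlib import Path
from typing import Optional

CONFIG_NAME = "cjk-text-formatter.toml"


def resolve_config(
    custom: Optional[Path], project_root: Path, home: Path
) -> Optional[Path]:
    """Return the highest-priority config file that exists, or None."""
    candidates = [
        custom,                             # --config PATH
        project_root / CONFIG_NAME,         # ./cjk-text-formatter.toml
        home / ".config" / CONFIG_NAME,     # ~/.config/cjk-text-formatter.toml
    ]
    for path in candidates:
        if path is not None and path.is_file():
            return path
    return None  # no file found: fall back to defaults (all rules enabled)
```

A typical call would be `resolve_config(None, Path.cwd(), Path.home())`.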
Quick Start
ctf --init-config
ctf --init-config --global
cp cjk-text-formatter.toml.example cjk-text-formatter.toml
ctf --show-config-example
ctf --list-rules
ctf --where
Configuration Format
[rules]
ellipsis_normalization = true
dash_conversion = true
emdash_spacing = true
quote_spacing = true
cjk_english_spacing = true
space_collapsing = true
[[custom_rules]]
name = "arrow_unicode"
pattern = '->'
replacement = '→'
description = "Use Unicode arrows"
[[custom_rules]]
name = "multiply_sign"
pattern = '(\d+)\s*x\s*(\d+)'
replacement = '\1×\2'
description = "Use proper multiplication sign"
Built-in Rules
| Rule | Default | Description |
| --- | --- | --- |
| Universal Rules | | |
| ellipsis_normalization | ✅ | Convert . . . to ... |
| Quote & Em-dash | | |
| quote_spacing | ✅ | Add spaces around “” with smart CJK handling |
| single_quote_spacing | ✅ | Add spaces around ‘’ with smart CJK handling |
| dash_conversion | ✅ | Convert -- (2+ dashes) to —— between CJK text only |
| emdash_spacing | ✅ | Fix spacing around —— |
| Normalization | | |
| fullwidth_punctuation | ✅ | Normalize , . ! ? ; : width |
| fullwidth_parentheses | ✅ | Normalize () → () in CJK |
| fullwidth_brackets | ❌ | Normalize [] → 【】 in CJK (off by default) |
| fullwidth_alphanumeric | ✅ | Convert 123ABC → 123ABC |
| Spacing & Cleanup | | |
| cjk_english_spacing | ✅ | Space between Chinese & English/numbers |
| currency_spacing | ✅ | Remove space: $ 100 → $100 |
| slash_spacing | ✅ | Remove space: A / B → A/B |
| space_collapsing | ✅ | Collapse multiple spaces (preserve indents) |
| consecutive_punctuation_limit | 0 | Limit repeats: !!! → ! (0=off, 1=single, 2=double) |
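The numeric setting for consecutive_punctuation_limit can be read as "keep at most N copies of a repeated mark". A small sketch of that reading follows; the set of marks and the name `limit_punctuation` are assumptions made for illustration.

```python
import re

# Assumption: only these marks are limited; the tool's real set may differ.
_REPEATS = re.compile(r"([!?!?。])\1+")


def limit_punctuation(text: str, limit: int) -> str:
    """Trim runs of the same punctuation mark down to `limit` copies."""
    if limit <= 0:
        return text  # 0 means the rule is off
    return _REPEATS.sub(lambda m: m.group(1) * min(limit, len(m.group(0))), text)


print(limit_punctuation("太好了!!!", 1))  # 太好了!
print(limit_punctuation("太好了!!!", 2))  # 太好了!!
```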
Custom Rules
Add your own regex-based transformations:
[[custom_rules]]
name = "rule_name"
pattern = 'regex pattern'
replacement = 'replacement'
description = "What it does"
Examples:
[[custom_rules]]
name = "fraction_half"
pattern = '\b1/2\b'
replacement = '½'
[[custom_rules]]
name = "celsius"
pattern = '(\d+)\s*C\b'
replacement = '\1°C'
[[custom_rules]]
name = "double_quotes"
pattern = '"([^"]+)"'
replacement = '“\1”'
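Custom rules like these map naturally onto `re.sub`. The sketch below assumes each rule is applied in file order to the whole text. The rules are written as plain dictionaries shaped like what a TOML parser would return for the [[custom_rules]] tables, and `apply_custom_rules` is a hypothetical name rather than the project's API.

```python
import re

# Shaped like the parsed [[custom_rules]] tables from the config file.
custom_rules = [
    {"name": "arrow_unicode", "pattern": r"->", "replacement": "→"},
    {"name": "multiply_sign", "pattern": r"(\d+)\s*x\s*(\d+)", "replacement": r"\1×\2"},
    {"name": "celsius", "pattern": r"(\d+)\s*C\b", "replacement": r"\1°C"},
]


def apply_custom_rules(text: str, rules: list) -> str:
    """Run each regex rule over the text, in the order given."""
    for rule in rules:
        text = re.sub(rule["pattern"], rule["replacement"], text)
    return text


print(apply_custom_rules("A -> B, 3 x 4, 25 C", custom_rules))  # A → B, 3×4, 25°C
```

Because rules run in sequence, a later rule sees the output of an earlier one, so the order of the tables in the config can matter.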
Config Management Commands
ctf --init-config
ctf --init-config --global
ctf --init-config --force
ctf --list-rules
ctf --show-config-example
ctf --show-config-example > my-config.toml
ctf --where
ctf --validate-config cjk-text-formatter.toml
ctf --show-config
ctf --show-config --config custom.toml
Usage with Config
ctf input.txt
ctf input.txt --config my-rules.toml
ctf input.txt --disable dash_conversion
ctf input.txt --disable quote_spacing --disable emdash_spacing
ctf input.txt --enable fullwidth_brackets
ctf input.txt --verbose
Validating Config Files
ctf --validate-config cjk-text-formatter.toml
What gets validated:
- ✅ File exists and is readable
- ✅ Valid TOML syntax
- ✅ Rule names match known built-in rules
- ✅ Custom rules have required fields (name, pattern, replacement)
- ✅ Regex patterns compile successfully
Showing Effective Config
ctf --show-config
ctf --show-config --config my-rules.toml
Typography Rules
Em-Dash Conversion & Spacing
Dash Conversion (only between CJK characters):
| Input | Output | Notes |
| --- | --- | --- |
| 中文--更多 | 中文 —— 更多 | ✅ Between CJK: converts and adds spaces |
| 中文---更多 | 中文 —— 更多 | ✅ Triple dash (2+): also converts |
| 中文 -- 更多 | 中文 —— 更多 | ✅ With spaces: also converts |
| text--more | text--more | ❌ English only: NOT converted |
| --- | --- | ❌ Markdown horizontal rule: NOT converted |
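The "only between CJK characters" condition can be expressed with regex lookarounds. The sketch below treats only Han characters as CJK, which is an assumption. It covers the cases listed above but not cases where the dash follows CJK punctuation such as 》, and `convert_dashes` is a hypothetical name.

```python
import re

HAN = r"\u4e00-\u9fff"  # assumption: Han characters only
# Two or more hyphens, optional spaces around them, Han on both sides.
_DASHES = re.compile(rf"(?<=[{HAN}])[ \t]*-{{2,}}[ \t]*(?=[{HAN}])")


def convert_dashes(text: str) -> str:
    """Turn -- between Han characters into a spaced Chinese em-dash."""
    return _DASHES.sub(" —— ", text)


for sample in ["中文--更多", "中文---更多", "中文 -- 更多", "text--more", "---"]:
    print(sample, "->", convert_dashes(sample))
```

Because both lookarounds require a Han character, a bare --- line (a Markdown horizontal rule) and English text never match.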
Em-Dash Spacing (existing —— characters):
| Input | Output | Notes |
| --- | --- | --- |
| 文本——内容 | 文本 —— 内容 | Regular CJK text: spaces on both sides |
| 《书名》——作者 | 《书名》—— 作者 | After 》: no space before ——, space after |
| 作者——《书名》 | 作者 ——《书名》 | Before 《: space before ——, no space after |
| (注释)——内容 | (注释)—— 内容 | After ): no space before ——, space after |
| 内容——(注释) | 内容 ——(注释) | Before (: space before ——, no space after |
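These context-aware cases reduce to two small exception sets: no space before —— when it follows a closing mark, and no space after it when it precedes an opening mark. A sketch assuming exactly the marks listed above, with a hypothetical `space_emdash` helper:

```python
import re

NO_SPACE_BEFORE = set("》)")  # closing marks: no space before ——
NO_SPACE_AFTER = set("《(")   # opening marks: no space after ——
_EMDASH = re.compile(r"[ \t]*——[ \t]*")


def space_emdash(text: str) -> str:
    """Normalize spacing around —— based on the neighbouring characters."""

    def fix(match: re.Match) -> str:
        source = match.string
        # Treat the start/end of the text like a line break: no padding there.
        before = source[match.start() - 1] if match.start() > 0 else "\n"
        after = source[match.end()] if match.end() < len(source) else "\n"
        left = "" if before in NO_SPACE_BEFORE or before.isspace() else " "
        right = "" if after in NO_SPACE_AFTER or after.isspace() else " "
        return f"{left}——{right}"

    return _EMDASH.sub(fix, text)


print(space_emdash("《书名》——作者"))  # 《书名》—— 作者
```

Existing spaces around the dash are consumed by the pattern first, so running the function twice gives the same result as running it once.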
CJK-English Spacing
| Input | Output |
| --- | --- |
| 中文English | 中文 English |
| 数字123 | 数字 123 |
| 100个item | 100 个 item |
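A minimal sketch of this rule uses two substitutions, one for each direction of the CJK/Latin boundary. The Unicode ranges (common Han, kana, Hangul syllables) and the name `add_cjk_latin_spacing` are assumptions. The real tool may cover more ranges and more edge cases.

```python
import re

# Assumed ranges: common Han, Hiragana + Katakana, Hangul syllables.
CJK = r"\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af"
ALNUM = r"A-Za-z0-9"


def add_cjk_latin_spacing(text: str) -> str:
    """Insert one space at each boundary between CJK and ASCII letters/digits."""
    text = re.sub(rf"([{CJK}])([{ALNUM}])", r"\1 \2", text)
    text = re.sub(rf"([{ALNUM}])([{CJK}])", r"\1 \2", text)
    return text


print(add_cjk_latin_spacing("100个item"))  # 100 个 item
```

Text that already has a space at the boundary is left unchanged, since neither pattern matches across whitespace.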
Quote Spacing (Smart CJK Punctuation Handling)
The quote spacing rule intelligently avoids adding spaces when quotes are adjacent to CJK punctuation that already has visual spacing built-in:
| Input | Output | Notes |
| --- | --- | --- |
| 文本“引用”文本 | 文本 “引用” 文本 | Regular text: add spaces for readability |
| 文本,“引用”。 | 文本,“引用”。 | Punctuation ,。: NO space (already has visual spacing) |
| 《书名》“引用”(注) | 《书名》“引用”(注) | Brackets 《》(): NO space (already has visual spacing) |
| 前文——“引用”——后文 | 前文 —— “引用” —— 后文 | Em-dash ——: ADD space (curly quotes need spacing) |
| English“中文”123 | English “中文” 123 | Alphanumeric: add spaces for readability |
CJK punctuation excluded from quote spacing:
- Terminal punctuation: ,。!?;:、
- Book title marks & corner brackets: 《》「」『』
- Brackets: 【】()〈〉
Note: Em-dash (——) is NOT excluded - spaces are added between em-dash and quotes because curly quotes lack built-in visual spacing.
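The exclusion list can be turned into a short sketch: pad the outside of a curly double quote with a space unless the neighbour is whitespace or one of the excluded marks. The name `space_quotes` is hypothetical. Single quotes are left out here because ’ also serves as an apostrophe, and spacing around —— itself is assumed to be handled by the separate em-dash rule.

```python
import re

# Marks with built-in visual spacing: no extra space next to a quote.
NO_SPACE = set(",。!?;:、《》「」『』【】()〈〉")


def space_quotes(text: str) -> str:
    """Add a space outside “ and ” except next to excluded CJK punctuation."""
    # Character right before an opening quote.
    text = re.sub(
        r"(\S)(?=“)",
        lambda m: m.group(1) + ("" if m.group(1) in NO_SPACE else " "),
        text,
    )
    # Character right after a closing quote.
    text = re.sub(
        r"(?<=”)(\S)",
        lambda m: ("" if m.group(1) in NO_SPACE else " ") + m.group(1),
        text,
    )
    return text


print(space_quotes("文本“引用”文本"))  # 文本 “引用” 文本
```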
Ellipsis Normalization
| Input | Output |
| --- | --- |
| . . . | ... |
| wait . . . more | wait... more |
| end . . . . | end... |
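The behaviour in these examples can be reproduced with a single regular expression. This sketch is written to match the three examples and is not necessarily how the tool does it. `normalize_ellipsis` is an illustrative name.

```python
import re

# Optional leading space, then three or more dots separated by spaces/tabs.
_SPACED_DOTS = re.compile(r"[ \t]*\.(?:[ \t]+\.){2,}")


def normalize_ellipsis(text: str) -> str:
    """Collapse spaced-out dots into a plain three-dot ellipsis."""
    return _SPACED_DOTS.sub("...", text)


print(normalize_ellipsis("wait . . . more"))  # wait... more
print(normalize_ellipsis("end . . . ."))      # end...
```

Requiring at least two "space plus dot" repeats keeps ordinary sentences such as "a. b. c." from being touched.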
Development
Running Tests
pytest
pytest --cov=cjk_text_formatter
pytest tests/test_polish.py -v
Project Structure
cjk-text-formatter/
├── src/
│ └── cjk_text_formatter/
│ ├── __init__.py
│ ├── polish.py # Core polishing logic
│ ├── processors.py # File type processors
│ ├── config.py # Configuration management
│ └── cli.py # Command-line interface
├── tests/
│ ├── test_polish.py # Polish function tests
│ ├── test_processors.py # File processor tests
│ ├── test_config.py # Configuration tests
│ └── test_config_validation.py
├── pyproject.toml # Package configuration
└── README.md
Adding New Rules
To add a new typography rule:
- Add tests in tests/test_polish.py:
    def test_new_rule(self):
        assert polish_text("input") == "expected_output"
- Implement the rule in src/cjk_text_formatter/polish.py:
    def _new_rule(text: str) -> str:
        # Apply the transformation here and return the result
        return text
- Add to pipeline in polish_text():
    def polish_text(text: str) -> str:
        text = _normalize_ellipsis(text)
        text = _new_rule(text)
        return text.strip()
Options
Processing Options
| Option | Short | Description |
| --- | --- | --- |
| --output PATH | -o | Output file path |
| --inplace | -i | Modify files in place |
| --recursive | -r | Process directories recursively |
| --dry-run | -n | Preview changes without writing |
| --extensions EXT | -e | File extensions to process (e.g., -e .txt -e .md) |
| --verbose | -v | Show summary of changes made |
Configuration Options
| Option | Short | Description |
| --- | --- | --- |
| --config PATH | -c | Use custom config file |
| --disable RULE | | Temporarily disable a specific rule (repeatable) |
| --enable RULE | | Temporarily enable a specific rule (repeatable) |
| --init-config | | Create example config file and exit |
| --global | | Use with --init-config to create global config |
| --force | | Use with --init-config to overwrite existing |
| --list-rules | | List all available rules with descriptions and exit |
| --show-config-example | | Print example config to stdout and exit |
| --where | | Show config file locations and exit |
| --validate-config PATH | | Validate config file syntax and rules, then exit |
| --show-config | | Show effective configuration and exit |
Other Options
| Option | Short | Description |
| --- | --- | --- |
| --version | | Show version and exit |
| --help | | Show help message and exit |
Examples
Format Chinese-English Mixed Content
$ ctf "Python是一门编程语言,有3.11版本。"
Python 是一门编程语言,有 3.11 版本。
Format Book Titles with Em-Dash
$ ctf "《人生》--路遥著"
《人生》—— 路遥著
Format Japanese Text
$ ctf "私は毎日Raycastを使って仕事の効率化を助けてくれます"
私は毎日 Raycast を使って仕事の効率化を助けてくれます
$ ctf "気温は25°Cです"
気温は 25°C です
Format Korean Text
$ ctf "韓國에서test를합니다"
韓國에서 test 를합니다
Batch Process Markdown Files
ctf ./docs/ --recursive --inplace -e .md
ctf ./docs/ --recursive --dry-run -e .md
Process with Preserved Code Blocks
Markdown code blocks are automatically preserved:
$ cat document.md
文本English混合
\`\`\`python
text--more
\`\`\`
$ ctf document.md --inplace
$ cat document.md
文本 English 混合
\`\`\`python
text--more
\`\`\`
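One way to get this behaviour is to split the document on code spans and format only the pieces in between. The sketch below handles fenced blocks and inline code but not indented code blocks. `format_markdown` is a hypothetical helper, and it assumes the formatter passed in preserves the leading and trailing whitespace of each piece.

```python
import re
from typing import Callable

FENCE = "`" * 3  # built this way to avoid a literal fence inside the example
_CODE = re.compile(rf"({FENCE}.*?{FENCE}|`[^`\n]+`)", re.DOTALL)


def format_markdown(text: str, formatter: Callable[[str], str]) -> str:
    """Apply `formatter` to prose only, leaving code spans byte-for-byte intact."""
    parts = _CODE.split(text)
    # With one capturing group, odd indexes hold the matched code spans.
    return "".join(
        part if index % 2 else formatter(part) for index, part in enumerate(parts)
    )
```

Any string-to-string function can serve as the formatter, for example a wrapper around the library's `polish_text` that keeps surrounding whitespace.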
License
MIT License
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass: pytest
- Submit a pull request
Author
Created by Xiaolai. Originally developed as part of TEPUB, a tool for EPUB translation and audiobook generation.