
tcorpus
A powerful, lightweight command-line tool for analyzing text corpora. Extract palindromes, anagrams, word frequencies, pattern matches, emails, and phone numbers from text files or direct input.
# Install from PyPI
pip install tcorpus
# Or install from source
git clone https://github.com/Pegasus717/Text-Corpus-Analyser.git
cd text-corpus-analyser
pip install -e .
Install Python 3.8+ (if not already installed):
# Using Homebrew (recommended)
brew install python3
# Or download from python.org
# Visit https://www.python.org/downloads/
Verify Python installation:
python3 --version
# Should show Python 3.8 or higher
Option 1: Using pip3 (Recommended for macOS)
# Install from PyPI
pip3 install tcorpus
# Or install from source
git clone https://github.com/Pegasus717/Text-Corpus-Analyser.git
cd text-corpus-analyser
pip3 install -e .
Option 2: Using Virtual Environment (Best Practice)
# Create a virtual environment
python3 -m venv venv
# Activate the virtual environment
source venv/bin/activate
# Install tcorpus
pip install tcorpus
Note: If you encounter permission errors, use pip3 install --user tcorpus to install in user space.
tcorpus --help
# Or if command not found:
python3 -m tcorpus --help
text-corpus-analyser/
├── tcorpus/               # Main package directory
│   ├── __init__.py        # Package initialization (version info)
│   ├── __main__.py        # Entry point for `python -m tcorpus`
│   ├── cli.py             # Main CLI entry point and argument processing
│   ├── CLI_handling.py    # CLI parser builder and processing logic
│   ├── main_logic.py      # Core text analysis functions
│   ├── io_utils.py        # File I/O operations (read/write JSON/CSV)
│   └── profiler.py        # Performance profiling utilities
├── tests/                 # Test suite
│   └── test_main_logic.py # Unit tests for core functions
├── config.ini             # Default stopwords configuration
├── demo.txt               # Sample text file for testing
├── pyproject.toml         # Project metadata and build configuration
└── README.md              # This file
tcorpus/__init__.py: Package version definition
tcorpus/__main__.py: Module entry point (python -m tcorpus)
tcorpus/cli.py: Main CLI entry point, parses arguments and routes to processing
tcorpus/CLI_handling.py: Builds argparse parser, handles command processing and orchestration
tcorpus/main_logic.py: Core analysis algorithms (palindromes, anagrams, frequencies, pattern matching, emails, phone numbers)
tcorpus/io_utils.py: File reading/writing (supports .gz files, JSON/CSV output with append support)
tcorpus/profiler.py: Performance timing utilities
tests/test_main_logic.py: Unit tests for core functionality
config.ini: Default stopwords configuration (optional)
pyproject.toml: Project metadata, dependencies, and build configuration
# Get help
tcorpus --help
# Analyze a text file with all features
tcorpus all demo.txt output.json
# Use direct text input
tcorpus palindrome --text "A man a plan a canal Panama" output.json
You can provide input in two ways:
From a file:
tcorpus <command> <input_file> <output_file>
Direct text (using -t or --text):
tcorpus <command> --text "Your text here" <output_file>
All commands support these filtering options:
--stopwords <word1> <word2> ... - Additional stopwords to ignore (combined with config.ini)
--starts-with <letter> - Keep only words starting with this letter
--config <path> - Path to config file for stopwords (default: config.ini)
-pw, --print-words - Print filtered results to terminal
Create a config.ini file in your working directory to define default stopwords:
[stopwords]
words = the,or,but,a,an,is,are,was,were
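For context, this is roughly how a [stopwords] section like the one above can be loaded with Python's standard configparser module; it is a sketch of the file format, not tcorpus's actual loading code.
# Sketch only: reading a [stopwords] section with configparser (not tcorpus's internal code)
import configparser

def load_stopwords(path="config.ini"):
    parser = configparser.ConfigParser()
    parser.read(path)                                  # a missing file is silently ignored
    raw = parser.get("stopwords", "words", fallback="")
    return {w.strip() for w in raw.split(",") if w.strip()}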
Find all palindromes (words that read the same forwards and backwards).
# From file
tcorpus palindrome demo.txt palindromes.json
# Direct text
tcorpus palindrome --text "madam level cat radar" output.json
# With filters
tcorpus palindrome demo.txt output.json --starts-with m --print-words
Output:
{
"palindromes": ["level", "madam", "radar"]
}
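If you are curious how such a check works, here is a minimal Python sketch of palindrome detection; it is illustrative only and not taken from tcorpus's source.
# Sketch only: naive palindrome detection over lowercase word tokens
import re

def find_palindromes(text):
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({w for w in words if len(w) > 1 and w == w[::-1]})

print(find_palindromes("madam level cat radar"))       # ['level', 'madam', 'radar']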
Find groups of words that are anagrams of each other.
# Basic usage
tcorpus anagram demo.txt anagrams.json
# With direct text
tcorpus anagram --text "cat act tac dog god" output.json
Output:
{
"anagrams": [
["act", "cat", "tac"],
["dog", "god"]
]
}
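Anagram grouping is typically done by using each word's sorted letters as a key; the sketch below shows the idea and is not tcorpus's actual implementation.
# Sketch only: group words by their sorted-letter signature
from collections import defaultdict
import re

def find_anagrams(text):
    groups = defaultdict(set)
    for word in re.findall(r"[a-z]+", text.lower()):
        groups["".join(sorted(word))].add(word)        # "cat", "act", "tac" share the key "act"
    return [sorted(g) for g in groups.values() if len(g) > 1]

print(find_anagrams("cat act tac dog god"))            # [['act', 'cat', 'tac'], ['dog', 'god']]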
Count how often each word appears in the text.
# Count all words
tcorpus freq demo.txt frequencies.json
# Count specific words
tcorpus freq demo.txt output.json --words python code program
# Output as CSV
tcorpus freq demo.txt frequencies.csv
Output (JSON):
{
"frequencies": {
"python": 5,
"code": 3,
"program": 2
}
}
Output (CSV):
word,count
code,3
program,2
python,5
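Conceptually, word frequency counting maps onto Python's collections.Counter; the sketch below mirrors the behaviour of the freq command (including the --words filter) but is not the tool's own code.
# Sketch only: word counting with collections.Counter
from collections import Counter
import re

def word_frequencies(text, only=None):
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    if only:                                           # mimics --words: keep only the requested words
        counts = Counter({w: counts[w] for w in only})
    return dict(counts)

print(word_frequencies("python code python", only=["python", "code"]))  # {'python': 2, 'code': 1}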
Find words matching a specific pattern using wildcards and special syntax.
Pattern Syntax:
* - Matches zero or more characters (wildcard)
? - Matches exactly one character
word+ - Words starting with "word" (e.g., ram+ matches "ram", "ramesh", "ramadan")
+word - Words ending with "word" (e.g., +ing matches "running", "coding")
+word+ - Words containing "word" (e.g., +ram+ matches "program", "ramesh")
Options:
--min-length <n> - Minimum word length
--max-length <n> - Maximum word length
--length <n> - Exact word length
--contains <substring> - Word must contain this substring
Examples:
# Wildcard pattern: words starting with 's' and ending with 'e'
tcorpus mask "s*e" demo.txt output.json
# Starts with pattern
tcorpus mask "ram+" demo.txt output.json
# Ends with pattern
tcorpus mask "+ing" demo.txt output.json
# With length filters
tcorpus mask "s*e" demo.txt output.json --min-length 4 --max-length 6
Output:
{
"mask_matches": ["sale", "safe", "same", "sane", "site", "size"]
}
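The following sketch shows one way the documented mask syntax could be interpreted in Python (fnmatch for * and ?, plus the + prefix/suffix rules); it illustrates the syntax above and is not tcorpus's matcher.
# Sketch only: interpreting the documented mask syntax
import fnmatch

def mask_match(pattern, word):
    if pattern.startswith("+") and pattern.endswith("+"):   # +word+ -> contains "word"
        return pattern.strip("+") in word
    if pattern.startswith("+"):                              # +word  -> ends with "word"
        return word.endswith(pattern[1:])
    if pattern.endswith("+"):                                # word+  -> starts with "word"
        return word.startswith(pattern[:-1])
    return fnmatch.fnmatch(word, pattern)                    # * and ? wildcards

words = ["sale", "save", "running", "ramesh", "program"]
print([w for w in words if mask_match("s*e", w)])            # ['sale', 'save']
print([w for w in words if mask_match("+ing", w)])           # ['running']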
Extract email addresses from text.
# From file
tcorpus email demo.txt emails.json
# Direct text
tcorpus email --text "Contact us at info@example.com or support@test.org" output.json
Output:
{
"emails": ["info@example.com", "support@test.org"]
}
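Email extraction is usually regex-based; below is a simple, intentionally permissive pattern for illustration, not the exact expression tcorpus uses.
# Sketch only: simple regex-based email extraction
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def find_emails(text):
    return EMAIL_RE.findall(text)

print(find_emails("Contact us at info@example.com or support@test.org"))
# ['info@example.com', 'support@test.org']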
Extract phone numbers from text in various formats.
# Basic usage (default: 10 digits minimum)
tcorpus phone demo.txt phones.json
# Custom minimum digits
tcorpus phone demo.txt output.json --digits 7
# Direct text
tcorpus phone --text "Call +1 (555) 123-4567 or 07123 456789" output.json
Output:
{
"phone_numbers": ["+1 (555) 123-4567", "07123 456789"]
}
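One common approach, sketched below, is to collect runs of digits and phone punctuation and keep candidates that meet a minimum digit count (the role of --digits); this is an illustration rather than tcorpus's actual extractor.
# Sketch only: digit-count based phone number extraction
import re

def find_phone_numbers(text, min_digits=10):
    candidates = re.findall(r"\+?\d[\d\s()\-]{5,}\d", text)
    return [c.strip() for c in candidates
            if sum(ch.isdigit() for ch in c) >= min_digits]

print(find_phone_numbers("Call +1 (555) 123-4567 or 07123 456789"))
# ['+1 (555) 123-4567', '07123 456789']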
Run all analyses at once: palindromes, anagrams, frequencies, emails, and phone numbers.
# Run all analyses
tcorpus all demo.txt complete_analysis.json
# With word frequency filter
tcorpus all demo.txt output.json --words python code
# Custom phone digit requirement
tcorpus all demo.txt output.json --digits 7
Output:
{
"palindromes": ["level", "madam"],
"anagrams": [["act", "cat", "tac"]],
"frequencies": {
"python": 5,
"code": 3
},
"emails": ["info@example.com"],
"phone_numbers": ["+1 (555) 123-4567"]
}
Run any combination of analyses in a single command (e.g., just palindromes and anagrams, or palindromes + frequencies + emails).
tcorpus multi -o palindrome -o anagram demo.txt output.json
tcorpus multi -o palindrome -o freq demo.txt output.json --words python code
tcorpus multi -o palindrome -o mask -o phone --mask "s*e" demo.txt output.json --digits 8
Notes:
Pass -o/--ops once per analysis you want to run (e.g., -o palindrome -o anagram).
For mask, provide --mask (pattern) and optional length filters (--min-length, --max-length, --length, --contains).
For freq, --words lets you target specific words; omit to count all.
For phone, use --digits to set the minimum digit count (default 10).
--print-words works here too and prints only the analyses you ran.
# Find palindromes starting with 'm', excluding stopwords
tcorpus palindrome demo.txt output.json \
--starts-with m \
--stopwords the a an \
--print-words
The tool automatically handles .gz compressed files:
tcorpus all large_text.txt.gz output.json
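Transparent handling of .gz input is typically a matter of choosing gzip.open versus open based on the file extension, as in this generic sketch (not tcorpus's io_utils code):
# Sketch only: reading plain or gzip-compressed text transparently
import gzip

def read_text(path):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return fh.read()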
# Process multiple files (Unix/Linux/Mac)
for file in *.txt; do
tcorpus all "$file" "${file%.txt}_analysis.json"
done
# Windows PowerShell
Get-ChildItem *.txt | ForEach-Object {
tcorpus all $_.Name ($_.BaseName + "_analysis.json")
}
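If you prefer a single cross-platform script to the shell loops above, a small Python driver can invoke the CLI the same way; the snippet below is a sketch that uses only the documented `tcorpus all` invocation.
# Sketch only: cross-platform batch processing by shelling out to the tcorpus CLI
import pathlib
import subprocess

for path in pathlib.Path(".").glob("*.txt"):
    out = path.with_name(path.stem + "_analysis.json")
    subprocess.run(["tcorpus", "all", str(path), str(out)], check=True)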
All commands output JSON by default. The structure varies by command:
{"palindromes": [...]}{"palindromes": [...], "anagrams": [...], ...}When using freq command with .csv extension, outputs CSV format suitable for spreadsheet applications.
Run the test suite:
python -m unittest discover -s tests -v
If tcorpus is not recognized after installation:
Check installation:
pip show tcorpus
Verify PATH includes Python Scripts directory:
python -m site --user-base
Use module execution:
python -m tcorpus --help
If config.ini is missing, the tool will still work but won't use default stopwords. Create a config.ini file in your working directory:
[stopwords]
words = the,or,but,a,an,is,are
Contributions are welcome! Please feel free to submit a Pull Request.
This project is open source and available under the MIT License.
Current version: 0.1.8
For issues, questions, or contributions, please open an issue on the project repository.
Made with ❤️ for text analysis enthusiasts