
tcorpus
A powerful, lightweight command-line tool for analyzing text corpora. Extract palindromes, anagrams, word frequencies, pattern matches, emails, and phone numbers from text files or direct input.
# Install from PyPI
pip install tcorpus
# Or install from source
git clone https://github.com/Pegasus717/Text-Corpus-Analyser.git
cd text-corpus-analyser
pip install -e .
Install Python 3.8+ (if not already installed):
# Using Homebrew (recommended)
brew install python3
# Or download from python.org
# Visit https://www.python.org/downloads/
Verify Python installation:
python3 --version
# Should show Python 3.8 or higher
Option 1: Using pip3 (Recommended for macOS)
# Install from PyPI
pip3 install tcorpus
# Or install from source
git clone https://github.com/Pegasus717/Text-Corpus-Analyser.git
cd text-corpus-analyser
pip3 install -e .
Option 2: Using Virtual Environment (Best Practice)
# Create a virtual environment
python3 -m venv venv
# Activate the virtual environment
source venv/bin/activate
# Install tcorpus
pip install tcorpus
Note: If you encounter permission errors, use pip3 install --user tcorpus to install in user space.
tcorpus --help
# Or if command not found:
python3 -m tcorpus --help
text-corpus-analyser/
├── tcorpus/               # Main package directory
│   ├── __init__.py        # Package initialization (version info)
│   ├── __main__.py        # Entry point for `python -m tcorpus`
│   ├── cli.py             # Main CLI entry point and argument processing
│   ├── CLI_handling.py    # CLI parser builder and processing logic
│   ├── main_logic.py      # Core text analysis functions
│   ├── io_utils.py        # File I/O operations (read/write JSON/CSV)
│   └── profiler.py        # Performance profiling utilities
├── tests/                 # Test suite
│   └── test_main_logic.py # Unit tests for core functions
├── config.ini             # Default stopwords configuration
├── demo.txt               # Sample text file for testing
├── pyproject.toml         # Project metadata and build configuration
└── README.md              # This file
tcorpus/__init__.py: Package version definition
tcorpus/__main__.py: Module entry point (python -m tcorpus)
tcorpus/cli.py: Main CLI entry point, parses arguments and routes to processing
tcorpus/CLI_handling.py: Builds argparse parser, handles command processing and orchestration
tcorpus/main_logic.py: Core analysis algorithms (palindromes, anagrams, frequencies, pattern matching, emails, phone numbers)
tcorpus/io_utils.py: File reading/writing (supports .gz files, JSON/CSV output with append support)
tcorpus/profiler.py: Performance timing utilities
tests/test_main_logic.py: Unit tests for core functionality
config.ini: Default stopwords configuration (optional)
pyproject.toml: Project metadata, dependencies, and build configuration
# Get help
tcorpus --help
# Analyze a text file with all features
tcorpus all demo.txt output.json
# Use direct text input
tcorpus palindrome --text "A man a plan a canal Panama" output.json
You can provide input in two ways:
From a file:
tcorpus <command> <input_file> <output_file>
Direct text (using -t or --text):
tcorpus <command> --text "Your text here" <output_file>
All commands support these filtering options:
--stopwords <word1> <word2> ... - Additional stopwords to ignore (combined with config.ini)
--starts-with <letter> - Keep only words starting with this letter
--config <path> - Path to config file for stopwords (default: config.ini)
-pw, --print-words - Print filtered results to terminal
Create a config.ini file in your working directory to define default stopwords:
[stopwords]
words = the,or,but,a,an,is,are,was,were
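For context, this is roughly how a [stopwords] section like the one above can be loaded with Python's standard configparser module; it is a sketch of the file format, not tcorpus's actual loading code.
# Sketch only: reading a [stopwords] section with configparser (not tcorpus's internal code)
import configparser

def load_stopwords(path="config.ini"):
    parser = configparser.ConfigParser()
    parser.read(path)                                  # a missing file is silently ignored
    raw = parser.get("stopwords", "words", fallback="")
    return {w.strip() for w in raw.split(",") if w.strip()}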
Find all palindromes (words that read the same forwards and backwards).
# From file
tcorpus palindrome demo.txt palindromes.json
# Direct text
tcorpus palindrome --text "madam level cat radar" output.json
# With filters
tcorpus palindrome demo.txt output.json --starts-with m --print-words
Output:
{
"palindromes": ["level", "madam", "radar"]
}
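If you are curious how such a check works, here is a minimal Python sketch of palindrome detection; it is illustrative only and not taken from tcorpus's source.
# Sketch only: naive palindrome detection over lowercase word tokens
import re

def find_palindromes(text):
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({w for w in words if len(w) > 1 and w == w[::-1]})

print(find_palindromes("madam level cat radar"))       # ['level', 'madam', 'radar']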
Find groups of words that are anagrams of each other.
# Basic usage
tcorpus anagram demo.txt anagrams.json
# With direct text
tcorpus anagram --text "cat act tac dog god" output.json
Output:
{
"anagrams": [
["act", "cat", "tac"],
["dog", "god"]
]
}
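Anagram grouping is typically done by using each word's sorted letters as a key; the sketch below shows the idea and is not tcorpus's actual implementation.
# Sketch only: group words by their sorted-letter signature
from collections import defaultdict
import re

def find_anagrams(text):
    groups = defaultdict(set)
    for word in re.findall(r"[a-z]+", text.lower()):
        groups["".join(sorted(word))].add(word)        # "cat", "act", "tac" share the key "act"
    return [sorted(g) for g in groups.values() if len(g) > 1]

print(find_anagrams("cat act tac dog god"))            # [['act', 'cat', 'tac'], ['dog', 'god']]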
Count how often each word appears in the text.
# Count all words
tcorpus freq demo.txt frequencies.json
# Count specific words
tcorpus freq demo.txt output.json --words python code program
# Output as CSV
tcorpus freq demo.txt frequencies.csv
Output (JSON):
{
"frequencies": {
"python": 5,
"code": 3,
"program": 2
}
}
Output (CSV):
word,count
code,3
program,2
python,5
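Conceptually, word frequency counting maps onto Python's collections.Counter; the sketch below mirrors the behaviour of the freq command (including the --words filter) but is not the tool's own code.
# Sketch only: word counting with collections.Counter
from collections import Counter
import re

def word_frequencies(text, only=None):
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    if only:                                           # mimics --words: keep only the requested words
        counts = Counter({w: counts[w] for w in only})
    return dict(counts)

print(word_frequencies("python code python", only=["python", "code"]))  # {'python': 2, 'code': 1}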
Find words matching a specific pattern using wildcards and special syntax.
Pattern Syntax:
* - Matches zero or more characters (wildcard)
? - Matches exactly one character
word+ - Words starting with "word" (e.g., ram+ matches "ram", "ramesh", "ramadan")
+word - Words ending with "word" (e.g., +ing matches "running", "coding")
+word+ - Words containing "word" (e.g., +ram+ matches "program", "ramesh")
Options:
--min-length <n> - Minimum word length
--max-length <n> - Maximum word length
--length <n> - Exact word length
--contains <substring> - Word must contain this substring
Examples:
# Wildcard pattern: words starting with 's' and ending with 'e'
tcorpus mask "s*e" demo.txt output.json
# Starts with pattern
tcorpus mask "ram+" demo.txt output.json
# Ends with pattern
tcorpus mask "+ing" demo.txt output.json
# With length filters
tcorpus mask "s*e" demo.txt output.json --min-length 4 --max-length 6
Output:
{
"mask_matches": ["sale", "safe", "same", "sane", "site", "size"]
}
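The following sketch shows one way the documented mask syntax could be interpreted in Python (fnmatch for * and ?, plus the + prefix/suffix rules); it illustrates the syntax above and is not tcorpus's matcher.
# Sketch only: interpreting the documented mask syntax
import fnmatch

def mask_match(pattern, word):
    if pattern.startswith("+") and pattern.endswith("+"):   # +word+ -> contains "word"
        return pattern.strip("+") in word
    if pattern.startswith("+"):                              # +word  -> ends with "word"
        return word.endswith(pattern[1:])
    if pattern.endswith("+"):                                # word+  -> starts with "word"
        return word.startswith(pattern[:-1])
    return fnmatch.fnmatch(word, pattern)                    # * and ? wildcards

words = ["sale", "save", "running", "ramesh", "program"]
print([w for w in words if mask_match("s*e", w)])            # ['sale', 'save']
print([w for w in words if mask_match("+ing", w)])           # ['running']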
Extract email addresses from text.
# From file
tcorpus email demo.txt emails.json
# Direct text
tcorpus email --text "Contact us at info@example.com or support@test.org" output.json
Output:
{
"emails": ["info@example.com", "support@test.org"]
}
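Email extraction is usually regex-based; below is a simple, intentionally permissive pattern for illustration, not the exact expression tcorpus uses.
# Sketch only: simple regex-based email extraction
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def find_emails(text):
    return EMAIL_RE.findall(text)

print(find_emails("Contact us at info@example.com or support@test.org"))
# ['info@example.com', 'support@test.org']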
Extract phone numbers from text in various formats.
# Basic usage (default: 10 digits minimum)
tcorpus phone demo.txt phones.json
# Custom minimum digits
tcorpus phone demo.txt output.json --digits 7
# Direct text
tcorpus phone --text "Call +1 (555) 123-4567 or 07123 456789" output.json
Output:
{
"phone_numbers": ["+1 (555) 123-4567", "07123 456789"]
}
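One common approach, sketched below, is to collect runs of digits and phone punctuation and keep candidates that meet a minimum digit count (the role of --digits); this is an illustration rather than tcorpus's actual extractor.
# Sketch only: digit-count based phone number extraction
import re

def find_phone_numbers(text, min_digits=10):
    candidates = re.findall(r"\+?\d[\d\s()\-]{5,}\d", text)
    return [c.strip() for c in candidates
            if sum(ch.isdigit() for ch in c) >= min_digits]

print(find_phone_numbers("Call +1 (555) 123-4567 or 07123 456789"))
# ['+1 (555) 123-4567', '07123 456789']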
Run all analyses at once: palindromes, anagrams, frequencies, emails, and phone numbers.
# Run all analyses
tcorpus all demo.txt complete_analysis.json
# With word frequency filter
tcorpus all demo.txt output.json --words python code
# Custom phone digit requirement
tcorpus all demo.txt output.json --digits 7
Output:
{
"palindromes": ["level", "madam"],
"anagrams": [["act", "cat", "tac"]],
"frequencies": {
"python": 5,
"code": 3
},
"emails": ["info@example.com"],
"phone_numbers": ["+1 (555) 123-4567"]
}
Run any combination of analyses in a single command (e.g., just palindromes and anagrams, or palindromes + frequencies + emails).
tcorpus multi -o palindrome -o anagram demo.txt output.json
tcorpus multi -o palindrome -o freq demo.txt output.json --words python code
tcorpus multi -o palindrome -o mask -o phone --mask "s*e" demo.txt output.json --digits 8
Notes:
Pass -o/--ops once per analysis you want to run (e.g., -o palindrome -o anagram).
For mask, provide --mask (pattern) and optional length filters (--min-length, --max-length, --length, --contains).
For freq, --words lets you target specific words; omit to count all.
For phone, use --digits to set the minimum digit count (default 10).
--print-words works here too and prints only the analyses you ran.
# Find palindromes starting with 'm', excluding stopwords
tcorpus palindrome demo.txt output.json \
--starts-with m \
--stopwords the a an \
--print-words
The tool automatically handles .gz compressed files:
tcorpus all large_text.txt.gz output.json
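Transparent handling of .gz input is typically a matter of choosing gzip.open versus open based on the file extension, as in this generic sketch (not tcorpus's io_utils code):
# Sketch only: reading plain or gzip-compressed text transparently
import gzip

def read_text(path):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return fh.read()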
# Process multiple files (Unix/Linux/Mac)
for file in *.txt; do
tcorpus all "$file" "${file%.txt}_analysis.json"
done
# Windows PowerShell
Get-ChildItem *.txt | ForEach-Object {
tcorpus all $_.Name ($_.BaseName + "_analysis.json")
}
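If you prefer a single cross-platform script to the shell loops above, a small Python driver can invoke the CLI the same way; the snippet below is a sketch that uses only the documented `tcorpus all` invocation.
# Sketch only: cross-platform batch processing by shelling out to the tcorpus CLI
import pathlib
import subprocess

for path in pathlib.Path(".").glob("*.txt"):
    out = path.with_name(path.stem + "_analysis.json")
    subprocess.run(["tcorpus", "all", str(path), str(out)], check=True)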
All commands output JSON by default. The structure varies by command:
{"palindromes": [...]}{"palindromes": [...], "anagrams": [...], ...}When using freq command with .csv extension, outputs CSV format suitable for spreadsheet applications.
Run the test suite:
python -m unittest discover -s tests -v
If tcorpus is not recognized after installation:
Check installation:
pip show tcorpus
Verify PATH includes Python Scripts directory:
python -m site --user-base
Use module execution:
python -m tcorpus --help
If config.ini is missing, the tool will still work but won't use default stopwords. Create a config.ini file in your working directory:
[stopwords]
words = the,or,but,a,an,is,are
Contributions are welcome! Please feel free to submit a Pull Request.
This project is open source and available under the MIT License.
Current version: 0.1.8
For issues, questions, or contributions, please open an issue on the project repository.
Made with ❤️ for text analysis enthusiasts