tcorpus

A powerful CLI-based text corpus analyser for extracting palindromes, anagrams, word frequencies, pattern matches, emails, and phone numbers from text files or direct input.

Registry: PyPI (pip)
Version: 0.1.9
Maintainers: 1

tcorpus - Text Corpus Analyser


A powerful, lightweight command-line tool for analyzing text corpora. Extract palindromes, anagrams, word frequencies, pattern matches, emails, and phone numbers from text files or direct input.

✨ Features

  • 🔍 Word Analysis: Find palindromes, anagrams, and word frequencies
  • 🎯 Pattern Matching: Search words using wildcard patterns and masks
  • 📧 Email Extraction: Extract email addresses from text
  • 📱 Phone Number Detection: Find phone numbers in various formats
  • 🚫 Stopword Filtering: Filter out common words using config files or CLI options
  • 📊 Multiple Output Formats: Save results as JSON or CSV
  • ⚡ Zero Dependencies: Pure Python standard library, no external packages required
  • 🗜️ Compressed File Support: Automatically handles .gz compressed files

📦 Installation

pip install tcorpus

From Source

git clone https://github.com/Pegasus717/Text-Corpus-Analyser.git
cd Text-Corpus-Analyser
pip install -e .

macOS Installation

Prerequisites

  • Install Python 3.8+ (if not already installed):

    # Using Homebrew (recommended)
    brew install python3
    
    # Or download from python.org
    # Visit https://www.python.org/downloads/
    
  • Verify Python installation:

    python3 --version
    # Should show Python 3.8 or higher
    

Installation Steps

Option 1: Using pip3 (Recommended for macOS)

# Install from PyPI
pip3 install tcorpus

# Or install from source
git clone https://github.com/Pegasus717/Text-Corpus-Analyser.git
cd Text-Corpus-Analyser
pip3 install -e .

Option 2: Using Virtual Environment (Best Practice)

# Create a virtual environment
python3 -m venv venv

# Activate the virtual environment
source venv/bin/activate

# Install tcorpus
pip install tcorpus

Note: If you encounter permission errors, use pip3 install --user tcorpus to install in user space.

Verify Installation

tcorpus --help
# Or if command not found:
python3 -m tcorpus --help

📁 Project Structure

text-corpus-analyser/
├── tcorpus/                    # Main package directory
│   ├── __init__.py            # Package initialization (version info)
│   ├── __main__.py            # Entry point for `python -m tcorpus`
│   ├── cli.py                 # Main CLI entry point and argument processing
│   ├── CLI_handling.py        # CLI parser builder and processing logic
│   ├── main_logic.py          # Core text analysis functions
│   ├── io_utils.py            # File I/O operations (read/write JSON/CSV)
│   └── profiler.py            # Performance profiling utilities
├── tests/                      # Test suite
│   └── test_main_logic.py     # Unit tests for core functions
├── config.ini                  # Default stopwords configuration
├── demo.txt                    # Sample text file for testing
├── pyproject.toml             # Project metadata and build configuration
└── README.md                   # This file

File Descriptions

  • tcorpus/__init__.py: Package version definition
  • tcorpus/__main__.py: Module entry point (python -m tcorpus)
  • tcorpus/cli.py: Main CLI entry point, parses arguments and routes to processing
  • tcorpus/CLI_handling.py: Builds argparse parser, handles command processing and orchestration
  • tcorpus/main_logic.py: Core analysis algorithms (palindromes, anagrams, frequencies, pattern matching, emails, phone numbers)
  • tcorpus/io_utils.py: File reading/writing (supports .gz files, JSON/CSV output with append support)
  • tcorpus/profiler.py: Performance timing utilities
  • tests/test_main_logic.py: Unit tests for core functionality
  • config.ini: Default stopwords configuration (optional)
  • pyproject.toml: Project metadata, dependencies, and build configuration

🚀 Quick Start

# Get help
tcorpus --help

# Analyze a text file with all features
tcorpus all demo.txt output.json

# Use direct text input
tcorpus palindrome --text "A man a plan a canal Panama" output.json

📖 Usage

Input Options

You can provide input in two ways:

  • From a file:

    tcorpus <command> <input_file> <output_file>
    
  • Direct text (using -t or --text):

    tcorpus <command> --text "Your text here" <output_file>
    

Common Options

All commands support these filtering options:

  • --stopwords <word1> <word2> ... - Additional stopwords to ignore (combined with config.ini)
  • --starts-with <letter> - Keep only words starting with this letter
  • --config <path> - Path to config file for stopwords (default: config.ini)
  • -pw, --print-words - Print filtered results to terminal

Config File

Create a config.ini file in your working directory to define default stopwords:

[stopwords]
words = the,or,but,a,an,is,are,was,were

📋 Commands

1. Palindrome Detection

Find all palindromes (words that read the same forwards and backwards).

# From file
tcorpus palindrome demo.txt palindromes.json

# Direct text
tcorpus palindrome --text "madam level cat radar" output.json

# With filters
tcorpus palindrome demo.txt output.json --starts-with m --print-words

Output:

{
  "palindromes": ["level", "madam", "radar"]
}
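Word-level palindrome detection comes down to comparing each word with its reverse. The following is a minimal sketch under stated assumptions, not the package's implementation: the tokenisation regex, the min_length parameter, and the sorted, de-duplicated output are all choices made for illustration.

```python
import re


def find_palindromes(text, min_length=2):
    """Return the distinct palindromic words in text, sorted alphabetically.

    min_length is an assumed knob to skip single letters such as "a".
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(w for w in words if len(w) >= min_length and w == w[::-1])


print(find_palindromes("madam level cat radar"))  # ['level', 'madam', 'radar']
```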

2. Anagram Detection

Find groups of words that are anagrams of each other.

# Basic usage
tcorpus anagram demo.txt anagrams.json

# With direct text
tcorpus anagram --text "cat act tac dog god" output.json

Output:

{
  "anagrams": [
    ["act", "cat", "tac"],
    ["dog", "god"]
  ]
}
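A common way to group anagrams is to key each word by its sorted letters, so that all anagrams of one another share a key. The sketch below assumes that approach; the function name find_anagrams and the tokenisation are hypothetical, and the real tool may order or filter groups differently.

```python
import re
from collections import defaultdict


def find_anagrams(text):
    """Group distinct words that share the same multiset of letters."""
    groups = defaultdict(set)
    for word in re.findall(r"[a-z]+", text.lower()):
        signature = "".join(sorted(word))  # "cat", "act", "tac" -> "act"
        groups[signature].add(word)
    # Keep only groups with at least two distinct words.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


print(find_anagrams("cat act tac dog god"))  # [['act', 'cat', 'tac'], ['dog', 'god']]
```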

3. Word Frequency Analysis

Count how often each word appears in the text.

# Count all words
tcorpus freq demo.txt frequencies.json

# Count specific words
tcorpus freq demo.txt output.json --words python code program

# Output as CSV
tcorpus freq demo.txt frequencies.csv

Output (JSON):

{
  "frequencies": {
    "python": 5,
    "code": 3,
    "program": 2
  }
}

Output (CSV):

word,count
code,3
program,2
python,5
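Both output shapes can be produced from a single word count. The sketch below uses collections.Counter and the csv module; the function names and the alphabetical row order (chosen to match the CSV sample above) are assumptions rather than documented behaviour.

```python
import csv
import io
import re
from collections import Counter


def word_frequencies(text, words=None):
    """Count words; if words is given, report only those (0 when absent)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    if words:
        return {w: counts[w.lower()] for w in words}
    return dict(counts)


def to_csv(freqs):
    """Render a word -> count mapping as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["word", "count"])
    for word in sorted(freqs):
        writer.writerow([word, freqs[word]])
    return buf.getvalue()


print(to_csv({"python": 5, "code": 3, "program": 2}))
```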

4. Pattern Matching (Mask)

Find words matching a specific pattern using wildcards and special syntax.

Pattern Syntax:

  • * - Matches zero or more characters (wildcard)
  • ? - Matches exactly one character
  • word+ - Words starting with "word" (e.g., ram+ matches "ram", "ramesh", "ramadan")
  • +word - Words ending with "word" (e.g., +ing matches "running", "coding")
  • +word+ - Words containing "word" (e.g., +ram+ matches "program", "ramesh")

Options:

  • --min-length <n> - Minimum word length
  • --max-length <n> - Maximum word length
  • --length <n> - Exact word length
  • --contains <substring> - Word must contain this substring

Examples:

# Wildcard pattern: words starting with 's' and ending with 'e'
tcorpus mask "s*e" demo.txt output.json

# Starts with pattern
tcorpus mask "ram+" demo.txt output.json

# Ends with pattern
tcorpus mask "+ing" demo.txt output.json

# With length filters
tcorpus mask "s*e" demo.txt output.json --min-length 4 --max-length 6

Output:

{
  "mask_matches": ["sale", "safe", "same", "sane", "site", "size"]
}
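One way to read the pattern syntax above is as a translation into a regular expression that must match the whole word. The sketch below assumes that reading; mask_to_regex and match_mask are hypothetical names, and combining a leading or trailing + with * or ? inside the same mask is handled here in a way the tool may not share.

```python
import re


def mask_to_regex(mask):
    """Translate a mask into a compiled regex intended for fullmatch()."""
    open_start = mask.startswith("+")                # "+word": anything before
    open_end = mask.endswith("+") and len(mask) > 1  # "word+": anything after
    core = mask.strip("+")
    body = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in core
    )
    return re.compile((".*" if open_start else "") + body + (".*" if open_end else ""))


def match_mask(words, mask, min_length=None, max_length=None):
    """Return the sorted distinct words matching mask and the length bounds."""
    pattern = mask_to_regex(mask)
    hits = set()
    for word in words:
        if min_length is not None and len(word) < min_length:
            continue
        if max_length is not None and len(word) > max_length:
            continue
        if pattern.fullmatch(word):
            hits.add(word)
    return sorted(hits)


print(match_mask(["sale", "sales", "size", "cat"], "s*e"))  # ['sale', 'size']
```

Using fullmatch rather than search is what makes "s*e" reject "sales": the pattern has to account for every character of the word.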

5. Email Extraction

Extract email addresses from text.

# From file
tcorpus email demo.txt emails.json

# Direct text
tcorpus email --text "Contact us at info@example.com or support@test.org" output.json

Output:

{
  "emails": ["info@example.com", "support@test.org"]
}
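Email extraction of this kind is usually done with a regular expression. The pattern below is a deliberately simple assumption, not the one tcorpus uses, and it does not attempt to cover every address allowed by the email standards.

```python
import re

# Simplified pattern: local part, "@", domain labels, and a 2+ letter TLD.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the distinct email-like strings in text, sorted."""
    return sorted(set(EMAIL_RE.findall(text)))


print(extract_emails("Contact us at info@example.com or support@test.org"))
# ['info@example.com', 'support@test.org']
```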

6. Phone Number Extraction

Extract phone numbers from text in various formats.

# Basic usage (default: 10 digits minimum)
tcorpus phone demo.txt phones.json

# Custom minimum digits
tcorpus phone demo.txt output.json --digits 7

# Direct text
tcorpus phone --text "Call +1 (555) 123-4567 or 07123 456789" output.json

Output:

{
  "phone_numbers": ["+1 (555) 123-4567", "07123 456789"]
}
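The minimum-digit rule behind --digits can be illustrated by first finding digit runs with common separators and then counting the digits in each candidate. This is a sketch under assumptions: the candidate regex and the name extract_phones are hypothetical, and the real tool may accept other formats.

```python
import re

# A candidate starts with an optional "+", begins and ends with a digit,
# and may contain spaces, parentheses, dots, and hyphens in between.
PHONE_CANDIDATE_RE = re.compile(r"\+?\d[\d ().-]*\d")


def extract_phones(text, min_digits=10):
    """Return candidates that contain at least min_digits digits."""
    results = []
    for match in PHONE_CANDIDATE_RE.finditer(text):
        candidate = match.group()
        if sum(ch.isdigit() for ch in candidate) >= min_digits:
            results.append(candidate)
    return results


print(extract_phones("Call +1 (555) 123-4567 or 07123 456789"))
# ['+1 (555) 123-4567', '07123 456789']
```

Lowering min_digits to 7 would also admit short local numbers such as 555-1234, at the cost of more false positives from dates or IDs.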

7. All Analyses

Run all analyses at once: palindromes, anagrams, frequencies, emails, and phone numbers.

# Run all analyses
tcorpus all demo.txt complete_analysis.json

# With word frequency filter
tcorpus all demo.txt output.json --words python code

# Custom phone digit requirement
tcorpus all demo.txt output.json --digits 7

Output:

{
  "palindromes": ["level", "madam"],
  "anagrams": [["act", "cat", "tac"]],
  "frequencies": {
    "python": 5,
    "code": 3
  },
  "emails": ["info@example.com"],
  "phone_numbers": ["+1 (555) 123-4567"]
}

8. Multi Analyses (Choose Specific Combination)

Run any combination of analyses in a single command (e.g., just palindromes and anagrams, or palindromes + frequencies + emails).

tcorpus multi -o palindrome -o anagram demo.txt output.json

tcorpus multi -o palindrome -o freq demo.txt output.json --words python code

tcorpus multi -o palindrome -o mask -o phone --mask "s*e" demo.txt output.json --digits 8

Notes:

  • Use -o/--ops once per analysis you want to run (e.g., -o palindrome -o anagram).
  • If you include mask, provide --mask (pattern) and optional length filters (--min-length, --max-length, --length, --contains).
  • If you include freq, --words lets you target specific words; omit to count all.
  • If you include phone, use --digits to set the minimum digit count (default 10).
  • --print-words works here too and prints only the analyses you ran.
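A repeatable -o/--ops flag like the one described in these notes maps naturally onto argparse's append action. The parser below is a hypothetical reconstruction for illustration only: the list of allowed operation names and the check that mask requires --mask are assumptions inferred from the notes, not taken from CLI_handling.py.

```python
import argparse

# Assumed operation names, inferred from the commands in this README.
OPS = ["palindrome", "anagram", "freq", "mask", "email", "phone"]


def build_parser():
    parser = argparse.ArgumentParser(prog="multi-sketch")
    # action="append" collects every -o occurrence into one list.
    parser.add_argument("-o", "--ops", action="append", required=True, choices=OPS)
    parser.add_argument("--mask", help="pattern, needed when 'mask' is selected")
    parser.add_argument("--digits", type=int, default=10)
    parser.add_argument("input_file")
    parser.add_argument("output_file")
    return parser


def parse(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    if "mask" in args.ops and not args.mask:
        parser.error("--mask is required when -o mask is selected")
    return args


args = parse(["-o", "palindrome", "-o", "mask", "-o", "phone",
              "--mask", "s*e", "demo.txt", "output.json", "--digits", "8"])
print(args.ops, args.mask, args.digits)
```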

🔧 Advanced Examples

Combining Filters

# Find palindromes starting with 'm', excluding stopwords
tcorpus palindrome demo.txt output.json \
  --starts-with m \
  --stopwords the a an \
  --print-words

Processing Compressed Files

The tool automatically handles .gz compressed files:

tcorpus all large_text.txt.gz output.json
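Transparent .gz handling is typically a matter of choosing the opener by file extension. The sketch below shows that idea with the standard gzip module; the function name read_text, the UTF-8 default, and the extension-based detection are assumptions, not a description of io_utils.py.

```python
import gzip
from pathlib import Path


def read_text(path, encoding="utf-8"):
    """Read a text file, decompressing on the fly when it ends in .gz."""
    path = Path(path)
    if path.suffix == ".gz":
        # "rt" opens the compressed stream in text mode.
        with gzip.open(path, "rt", encoding=encoding) as fh:
            return fh.read()
    return path.read_text(encoding=encoding)
```

With a helper like this, read_text("large_text.txt.gz") and read_text("demo.txt") return the same kind of string, so the analysis code does not need to know whether its input was compressed.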

Batch Processing

# Process multiple files (Unix/Linux/Mac)
for file in *.txt; do
  tcorpus all "$file" "${file%.txt}_analysis.json"
done

# Windows PowerShell
Get-ChildItem *.txt | ForEach-Object {
  tcorpus all $_.Name ($_.BaseName + "_analysis.json")
}

📤 Output Formats

JSON (Default)

All commands output JSON by default. The structure varies by command:

  • Single analysis: {"palindromes": [...]}
  • All analyses: {"palindromes": [...], "anagrams": [...], ...}

CSV (Frequency Only)

When the freq command is given an output file with a .csv extension, it writes CSV suitable for spreadsheet applications.

🛠️ Requirements

  • Python 3.8 or higher
  • No external dependencies (uses only Python standard library)

🧪 Testing

Run the test suite:

python -m unittest discover -s tests -v

❓ Troubleshooting

Command Not Found

If tcorpus is not recognized after installation:

  • Check installation:

    pip show tcorpus
    
  • Print the user base directory and check that its scripts folder (bin on macOS/Linux, Scripts on Windows) is on your PATH:

    python -m site --user-base
    
  • Use module execution:

    python -m tcorpus --help
    

Config File Not Found

If config.ini is missing, the tool will still work but won't use default stopwords. Create a config.ini file in your working directory:

[stopwords]
words = the,or,but,a,an,is,are

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

👥 Authors

👨‍💼 Maintainers

  • Santosh
  • Raghuram

📄 License

This project is open source and available under the MIT License.

📌 Version

Current version: 0.1.9

💬 Support

For issues, questions, or contributions, please open an issue on the project repository.

Made with ❤️ for text analysis enthusiasts

Keywords

text-analysis
