Codebase to Text Converter
A powerful Python tool that converts codebases (folder structures with files) into a single text file or Microsoft Word document (.docx), while preserving folder structure and file contents. Perfect for AI/LLM processing, documentation generation, and code analysis.
✨ Features
- Multi-source input: Local directories and GitHub repositories
- Flexible output: Text files (.txt) and Microsoft Word documents (.docx)
- Smart exclusions: Advanced pattern matching for files and directories
- Performance optimized: Efficient traversal of large codebases
- Comprehensive logging: Detailed verbose mode for transparency
- Encoding support: Handles various file encodings gracefully
🚀 Installation
pip install codebase-to-text
📖 Usage
Command Line Interface (CLI)
Basic Usage
codebase-to-text --input "path_or_github_url" --output "output_path" --output_type "txt"
Advanced Usage with Exclusions
# Comma-separated patterns in a single --exclude flag
codebase-to-text --input "./my_project" --output "output.txt" --output_type "txt" --exclude "*.log,temp/,**/__pycache__/**"
# Repeated --exclude flags
codebase-to-text --input "./my_project" --output "output.txt" --output_type "txt" --exclude "*.pyc" --exclude "build/" --exclude "venv/"
# Exclude hidden files and folders
codebase-to-text --input "./my_project" --output "output.txt" --output_type "txt" --exclude_hidden
# Enable detailed logging
codebase-to-text --input "./my_project" --output "output.txt" --output_type "txt" --verbose
Python API
from codebase_to_text import CodebaseToText

# Basic usage
converter = CodebaseToText(
    input_path="path_or_github_url",
    output_path="output_path",
    output_type="txt"
)
converter.get_file()

# With exclusion patterns, hidden-file filtering, and verbose logging
converter = CodebaseToText(
    input_path="./my_project",
    output_path="./output.txt",
    output_type="txt",
    exclude=["*.log", "temp/", "**/__pycache__/**"],
    exclude_hidden=True,
    verbose=True
)
converter.get_file()

# Retrieve the converted content and print it
text_content = converter.get_text()
print(text_content)
🎯 Exclusion Patterns
The tool supports powerful exclusion patterns to filter out unwanted files and directories:
Pattern Types
- Exact filename: README.md, config.yaml
- Wildcard patterns: *.log, *.tmp, test_*
- Directory patterns: __pycache__/, .git/, node_modules/
- Recursive patterns: **/__pycache__/**, **/node_modules/**
- Path-based patterns: src/temp/, docs/build/
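
As a rough illustration of how the pattern types above could be evaluated, here is a minimal sketch built on the standard library's fnmatch module. The helper name is_excluded is hypothetical and is not part of the codebase-to-text API. The matching rules it uses (directory patterns match at any depth, wildcards are tested against both the filename and the relative path) are assumptions, not the tool's documented behavior.

```python
import fnmatch
from pathlib import PurePosixPath
from typing import Iterable


def is_excluded(rel_path: str, patterns: Iterable[str]) -> bool:
    """Return True if a POSIX-style relative file path matches any pattern.

    Hypothetical helper for illustration only; the real tool may differ.
    """
    path = PurePosixPath(rel_path)
    parent_dirs = path.parts[:-1]  # every component except the filename

    for pattern in patterns:
        # Assumption: "**/name/**" means "the directory `name` at any depth",
        # so it is reduced to the plain directory pattern "name/".
        if pattern.startswith("**/") and pattern.endswith("/**"):
            pattern = pattern[3:-2]

        if pattern.endswith("/"):
            # Directory or path-based pattern: look for the directory
            # sequence anywhere among the file's parent directories.
            wanted = PurePosixPath(pattern.rstrip("/")).parts
            n = len(wanted)
            for i in range(len(parent_dirs) - n + 1):
                if parent_dirs[i:i + n] == wanted:
                    return True
        else:
            # Exact filename or wildcard: test the basename first,
            # then the full relative path (covers "config/secrets.yaml").
            if fnmatch.fnmatchcase(path.name, pattern):
                return True
            if fnmatch.fnmatchcase(rel_path, pattern):
                return True
    return False
```

For instance, under these assumptions is_excluded("src/temp/a.txt", ["src/temp/"]) is true, while is_excluded("temp.py", ["temp/"]) is false because temp.py is a file, not a directory.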
Exclusion Sources
- CLI Arguments: Use the --exclude flag (can be used multiple times)
- .exclude file: Place in your project root (see example below)
- Default patterns: Common files/folders are excluded automatically
Default Exclusions
The tool automatically excludes common development files:
.git/, __pycache__/, *.pyc, *.pyo
node_modules/, .venv/, venv/, env/
*.log, *.tmp, .DS_Store
.pytest_cache/, build/, dist/
📝 .exclude File Example
Create a .exclude file in your project root:
.git/
.gitignore
__pycache__/
*.pyc
venv/
.pytest_cache/
node_modules/
*.log
.vscode/
.idea/
config/secrets.yaml
data/large_files/
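
To make the file format concrete, the following sketch reads a .exclude file into a list of patterns, one per line. The function name load_exclude_patterns is hypothetical, and skipping blank lines and lines starting with # is an assumption borrowed from .gitignore conventions; the tool's own parser may behave differently.

```python
from pathlib import Path
from typing import List


def load_exclude_patterns(project_root: str, filename: str = ".exclude") -> List[str]:
    """Read one exclusion pattern per line from <project_root>/<filename>.

    Hypothetical helper for illustration; returns an empty list when the
    file does not exist.
    """
    exclude_file = Path(project_root) / filename
    if not exclude_file.is_file():
        return []

    patterns: List[str] = []
    for line in exclude_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Assumption: blank lines and "#" comments are ignored.
        if not line or line.startswith("#"):
            continue
        patterns.append(line)
    return patterns
```

The resulting list has the same shape as the exclude argument shown in the Python API examples, so it could be passed along with any patterns supplied by hand.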
🔧 CLI Parameters
| Parameter | Description | Example |
|-----------|-------------|---------|
| --input | Input path (local folder or GitHub URL) | ./my_project or https://github.com/user/repo |
| --output | Output file path | ./output.txt |
| --output_type | Output format (txt or docx) | txt |
| --exclude | Exclusion patterns (repeatable) | --exclude "*.log" --exclude "temp/" |
| --exclude_hidden | Exclude hidden files/folders | Flag (no value) |
| --verbose | Enable detailed logging | Flag (no value) |
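
The parameters above accept exclusions both as repeated --exclude flags and as comma-separated values. A minimal argparse sketch of an equivalent interface is shown here to illustrate how the two forms could be normalized into one list. The functions build_parser and normalize_excludes are hypothetical, and which flags are required and what --output_type defaults to are assumptions, not confirmed behavior of the actual CLI.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="codebase-to-text")
    parser.add_argument("--input", required=True,
                        help="Local folder or GitHub URL")
    parser.add_argument("--output", required=True,
                        help="Output file path")
    # Assumption: "txt" is used when the flag is omitted.
    parser.add_argument("--output_type", choices=["txt", "docx"], default="txt",
                        help="Output format")
    # action="append" collects every occurrence of --exclude into a list.
    parser.add_argument("--exclude", action="append", default=None,
                        help="Exclusion pattern(s); may be repeated")
    parser.add_argument("--exclude_hidden", action="store_true",
                        help="Exclude hidden files and folders")
    parser.add_argument("--verbose", action="store_true",
                        help="Enable detailed logging")
    return parser


def normalize_excludes(values: List[str]) -> List[str]:
    """Split comma-separated values and flatten them into one pattern list."""
    patterns: List[str] = []
    for value in values:
        patterns.extend(p.strip() for p in value.split(",") if p.strip())
    return patterns
```

With this sketch, --exclude "*.log,temp/" --exclude "build/" normalizes to the three patterns *.log, temp/, and build/.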
💡 Examples
Convert Local Project
codebase-to-text --input ~/projects/my_app --output "my_app_code.txt" --output_type "txt"
codebase-to-text --input ~/projects/my_app --output "my_app_code.txt" --output_type "txt" --exclude "*.log,build/,dist/" --verbose
Convert GitHub Repository
codebase-to-text --input "https://github.com/username/repo" --output "repo_analysis.docx" --output_type "docx"
codebase-to-text --input "https://github.com/username/repo" --output "repo_clean.txt" --output_type "txt" --exclude "*.md,docs/,examples/"
Python Integration
from codebase_to_text import CodebaseToText

def analyze_codebase(project_path):
    # Configure the converter, skipping logs, tests, and bytecode caches
    converter = CodebaseToText(
        input_path=project_path,
        output_path="analysis.txt",
        output_type="txt",
        exclude=["*.log", "test/", "**/__pycache__/**"],
        verbose=True
    )
    # Return the converted content for further processing
    content = converter.get_text()
    return content

code_content = analyze_codebase("./my_project")
🎯 Use Cases
- AI/LLM Training: Prepare codebases for language model training
- Code Review: Generate comprehensive code overviews for review
- Documentation: Create single-file documentation from projects
- Analysis: Feed entire codebases to AI tools for analysis
- Migration: Document legacy codebases before migration
- Learning: Study open-source projects more effectively
🔄 Output Format
The generated output includes:
- Folder Structure: Tree-like representation of the directory structure
- File Contents: Full content of each file with metadata
- Clear Separators: Distinct sections for easy navigation
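
To give a feel for this layout, the sketch below produces a folder tree followed by each file's contents under a separator line. It is only an approximation: render_tree and render_codebase are hypothetical names, and the section titles and separator style are assumptions, not the exact format the tool emits.

```python
from pathlib import Path
from typing import List


def render_tree(root: Path, indent: str = "") -> List[str]:
    """Return an indented listing of root, directories first, then files."""
    lines: List[str] = []
    for entry in sorted(root.iterdir(), key=lambda p: (p.is_file(), p.name)):
        if entry.is_dir():
            lines.append(f"{indent}{entry.name}/")
            lines.extend(render_tree(entry, indent + "    "))
        else:
            lines.append(f"{indent}{entry.name}")
    return lines


def render_codebase(root: str) -> str:
    """Combine the folder structure and all file contents into one string."""
    root_path = Path(root)
    out: List[str] = ["Folder Structure", "-" * 40]
    out.extend(render_tree(root_path))
    out.extend(["", "File Contents", "-" * 40])

    files = sorted(p for p in root_path.rglob("*") if p.is_file())
    for path in files:
        rel = path.relative_to(root_path).as_posix()
        out.append("")
        out.append(f"--- {rel} ---")  # assumed separator style
        # errors="replace" keeps the run going on undecodable bytes.
        out.append(path.read_text(encoding="utf-8", errors="replace"))
    return "\n".join(out)
```

Placing the tree ahead of the contents lets a reader (or a language model) see the project's shape before reading any individual file.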
✒️ License
This project is licensed under the MIT License. See the LICENSE file for details.