
documents-to-markdown
A comprehensive Python library for converting various document types to Markdown format with AI-powered image extraction and processing capabilities.
# Install from PyPI
pip install documents-to-markdown
# Or install from source
git clone https://github.com/ChaosAIs/DocumentsToMarkdown.git
cd DocumentsToMarkdown
pip install -e .
from documents_to_markdown import DocumentConverter
# Initialize converter
converter = DocumentConverter()
# Convert a single file
success = converter.convert_file("document.docx", "output.md")
print(f"Conversion successful: {success}")
# Convert all files in a folder
results = converter.convert_all("input_folder", "output_folder")
print(f"Converted {results['successful_conversions']} files")
# Convert all files in input folder
documents-to-markdown
# Convert specific file
documents-to-markdown --file document.docx output.md
# Custom input/output folders
documents-to-markdown --input docs --output markdown
# Show help
documents-to-markdown --help
Word documents: .docx, .doc
PDF documents: .pdf
Excel spreadsheets: .xlsx, .xlsm, .xls
Images: .png, .jpg, .jpeg, .gif, .bmp, .tiff (AI-powered)
Plain text: .txt, .csv, .tsv, .log (AI-enhanced)
from documents_to_markdown import DocumentConverter
# Initialize converter
converter = DocumentConverter(
    add_section_numbers=True,  # Enable section numbering
    verbose=False              # Set to True to enable verbose logging
)
# Convert single file
success = converter.convert_file("input.docx", "output.md")
# Convert all files in folder
results = converter.convert_all("input_folder", "output_folder")
# Check supported formats
formats = converter.get_supported_formats()
print(f"Supported: {formats}")
# Check if file can be converted
if converter.can_convert("document.pdf"):
    print("File can be converted!")
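When you scan a folder yourself, a cheap extension check can filter out files before `can_convert` is called. The following is a minimal standalone sketch: the extension set is copied from the supported formats listed in this README, and `SUPPORTED_EXTENSIONS` and `split_by_support` are hypothetical names, not part of the package API.

```python
from pathlib import Path

# Assumption: extensions copied from the supported-formats list in this README.
SUPPORTED_EXTENSIONS = {
    ".docx", ".doc", ".pdf", ".xlsx", ".xlsm", ".xls",
    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff",
    ".txt", ".csv", ".tsv", ".log",
}


def split_by_support(paths):
    """Split paths into (supported, unsupported) by case-insensitive extension."""
    supported, unsupported = [], []
    for path in map(Path, paths):
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


ok, skipped = split_by_support(["report.DOCX", "notes.txt", "archive.zip"])
print([p.name for p in ok])       # ['report.DOCX', 'notes.txt']
print([p.name for p in skipped])  # ['archive.zip']
```

In real use, `converter.get_supported_formats()` is probably the better source for the set, since it reflects the installed version.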
from documents_to_markdown import DocumentConverter, convert_document, convert_folder
# Quick single file conversion
success = convert_document("report.docx", "report.md")
# Quick folder conversion
results = convert_folder("documents", "markdown_output")
# Advanced converter configuration
converter = DocumentConverter()
converter.set_section_numbering(False) # Disable numbering
converter.set_verbose_logging(True) # Enable debug output
# Get detailed statistics
stats = converter.get_conversion_statistics()
print(f"Available converters: {stats['total_converters']}")
for conv in stats['converters']:
    print(f"- {conv['name']}: {', '.join(conv['supported_extensions'])}")
# Convert folder and handle results
results = converter.convert_all("input", "output")
print(f"Total files: {results['total_files']}")
print(f"Successful: {results['successful_conversions']}")
print(f"Failed: {results['failed_conversions']}")
# Process individual results
for result in results['results']:
    status = "✓" if result['status'] == 'success' else "✗"
    print(f"{status} {result['file']} ({result['converter']})")
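To act on failures rather than just print them, the result dictionary can be reduced to a list of failed files. This sketch works on a plain dictionary with the keys used above; the sample data, the converter names in it, and the `'failed'` status value are illustrative assumptions (the README only shows `'success'`).

```python
def failed_files(results):
    """Return the file names whose status is anything other than 'success'."""
    return [
        entry["file"]
        for entry in results.get("results", [])
        if entry.get("status") != "success"
    ]


# Hypothetical sample shaped like the dictionary returned by convert_all().
sample = {
    "total_files": 2,
    "successful_conversions": 1,
    "failed_conversions": 1,
    "results": [
        {"file": "a.docx", "status": "success", "converter": "ExampleWordConverter"},
        {"file": "b.pdf", "status": "failed", "converter": "ExamplePdfConverter"},
    ],
}

print(failed_files(sample))  # ['b.pdf']
```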
After installing the package, you can use the command-line interface:
# Install the package
pip install documents-to-markdown
# Now you can use the CLI commands
documents-to-markdown --help
doc2md --help # Alternative shorter command
# Convert all files in current input folder
documents-to-markdown
# Convert all files with custom folders
documents-to-markdown --input docs --output markdown
# Convert a single file
documents-to-markdown --file document.docx output.md
# Show converter statistics
documents-to-markdown --stats
# Disable section numbering
documents-to-markdown --no-numbering
# Enable verbose output
documents-to-markdown --verbose
documents-to-markdown [OPTIONS]
Options:
-i, --input FOLDER Input folder (default: input)
-o, --output FOLDER Output folder (default: output)
-f, --file INPUT OUTPUT Convert single file
--no-numbering Disable section numbering
--stats Show converter statistics
-v, --verbose Enable verbose logging
--version Show version
--help Show help message
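If you drive the CLI from a script, building the argument list in one place keeps the flags consistent. The sketch below only uses options listed above; `build_command` is a hypothetical helper, and the defaults mirror the documented `input` and `output` folders.

```python
import shlex


def build_command(input_folder="input", output_folder="output",
                  numbering=True, verbose=False):
    """Build an argument list for the documents-to-markdown CLI."""
    cmd = ["documents-to-markdown", "--input", input_folder, "--output", output_folder]
    if not numbering:
        cmd.append("--no-numbering")
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_command("docs", "markdown", numbering=False)
print(shlex.join(cmd))
# documents-to-markdown --input docs --output markdown --no-numbering

# To actually run it (requires the package to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with folder names that contain spaces.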
For enhanced image processing and text analysis, you can configure AI services:
# Install OLLAMA (see https://ollama.ai)
ollama serve
ollama pull llava:latest
# Create .env file
echo "AI_SERVICE=ollama" > .env
echo "OLLAMA_BASE_URL=http://localhost:11434" >> .env
echo "OLLAMA_MODEL=llava:latest" >> .env
Benefits:
# Get API key from https://platform.openai.com/api-keys
# Create .env file
echo "AI_SERVICE=openai" > .env
echo "OPENAI_API_KEY=your_api_key_here" >> .env
Benefits:
# Configure both services - system will choose best available
cat > .env << EOF
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llava:latest
OPENAI_API_KEY=your_api_key_here
EOF
# Complete .env configuration
# AI_SERVICE accepts "ollama" or "openai"; leave it empty for auto-detection
AI_SERVICE=ollama
# OpenAI Settings
OPENAI_MODEL=gpt-4o
OPENAI_MAX_TOKENS=4096
OPENAI_TEMPERATURE=0.1
# OLLAMA Settings
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llava:latest
OLLAMA_TIMEOUT=120
# Image Processing
IMAGE_MAX_SIZE_MB=20
IMAGE_QUALITY_COMPRESSION=85
IMAGE_MAX_SIZE_PIXELS=2048
# Logging
LOG_LEVEL=INFO
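A quick sanity check of the `.env` file can catch typos before a long conversion run. The sketch below is a simplified standalone parser and validator, not the package's own configuration loader (which may behave differently); `parse_env` and `check_settings` are hypothetical names, and the checks only cover settings listed above.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comment lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment introduced by whitespace + '#'.
        value = value.split(" #", 1)[0].strip()
        settings[key.strip()] = value
    return settings


def check_settings(settings):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    service = settings.get("AI_SERVICE", "")
    if service not in ("", "ollama", "openai"):
        problems.append(f"AI_SERVICE has unexpected value: {service!r}")
    if service == "openai" and not settings.get("OPENAI_API_KEY"):
        problems.append("AI_SERVICE=openai but OPENAI_API_KEY is missing")
    for key in ("OPENAI_MAX_TOKENS", "OLLAMA_TIMEOUT", "IMAGE_MAX_SIZE_MB"):
        if key in settings and not settings[key].isdigit():
            problems.append(f"{key} should be a whole number")
    return problems


config = parse_env("AI_SERVICE=openai\nOPENAI_MAX_TOKENS=4096\n")
print(check_settings(config))
# ['AI_SERVICE=openai but OPENAI_API_KEY is missing']
```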
from documents_to_markdown import DocumentConverter
converter = DocumentConverter()
# Word document
converter.convert_file("report.docx", "report.md")
# PDF document
converter.convert_file("manual.pdf", "manual.md")
# Excel spreadsheet
converter.convert_file("data.xlsx", "data.md")
# Image with text (requires AI setup)
converter.convert_file("screenshot.png", "screenshot.md")
# Plain text/CSV
converter.convert_file("data.csv", "data.md")
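When converting many files one at a time, deriving the output name from the input avoids typing each pair by hand. This sketch assumes only the `convert_file(input, output)` call shown above and accepts any object that provides it; `markdown_path_for` and `convert_many` are hypothetical helpers.

```python
from pathlib import Path


def markdown_path_for(source, output_dir):
    """Map e.g. docs/report.docx to <output_dir>/report.md."""
    return Path(output_dir) / Path(source).with_suffix(".md").name


def convert_many(converter, sources, output_dir):
    """Call converter.convert_file for each source and record success per file."""
    outcomes = {}
    for source in sources:
        target = markdown_path_for(source, output_dir)
        outcomes[str(source)] = bool(converter.convert_file(str(source), str(target)))
    return outcomes


print(markdown_path_for("docs/report.docx", "out").name)  # report.md

# With the package installed:
# from documents_to_markdown import DocumentConverter
# convert_many(DocumentConverter(), ["report.docx", "manual.pdf"], "out")
```

Note that inputs sharing a stem, such as `data.xlsx` and `data.csv`, map to the same `data.md`, so the later conversion overwrites the earlier one unless you rename the outputs.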
from documents_to_markdown import convert_folder
# Convert entire folder
results = convert_folder("documents", "markdown_output")
print(f"✅ Converted: {results['successful_conversions']}")
print(f"❌ Failed: {results['failed_conversions']}")
# Process results
for result in results['results']:
    if result['status'] == 'success':
        print(f"✓ {result['file']} -> Converted with {result['converter']}")
    else:
        print(f"✗ {result['file']} -> Failed")
from documents_to_markdown import DocumentConverter
# Initialize with custom settings
converter = DocumentConverter(
    add_section_numbers=False,  # Disable numbering
    verbose=True                # Enable debug logging
)
# Check what formats are supported
formats = converter.get_supported_formats()
print(f"Supported formats: {', '.join(formats)}")
# Get detailed converter information
stats = converter.get_conversion_statistics()
for conv_info in stats['converters']:
    name = conv_info['name']
    exts = ', '.join(conv_info['supported_extensions'])
    print(f"{name}: {exts}")
documents_to_markdown/
├── __init__.py                        # Main package exports
├── api.py                             # Public API interface
├── cli.py                             # Command-line interface
└── services/                          # Core conversion services
    ├── document_converter_manager.py  # Main orchestrator
    ├── base_converter.py              # Abstract base converter
    ├── word_converter.py              # Word document converter
    ├── pdf_converter.py               # PDF document converter
    ├── excel_converter.py             # Excel spreadsheet converter
    ├── image_converter.py             # Image converter (AI-powered)
    ├── plain_text_converter.py        # Text/CSV converter (AI-enhanced)
    ├── text_chunking_utils.py         # Text processing utilities
    └── ai_services/                   # AI service abstraction
        ├── base_ai_service.py         # AI service interface
        ├── openai_service.py          # OpenAI implementation
        ├── ollama_service.py          # OLLAMA implementation
        └── ai_service_factory.py      # Service factory
The modular design makes it easy to add converters for new formats:
# Example: Custom converter
from documents_to_markdown.services.base_converter import BaseDocumentConverter
class MyCustomConverter(BaseDocumentConverter):
    def get_supported_extensions(self):
        return ['.custom']

    def can_convert(self, file_path):
        return file_path.suffix.lower() == '.custom'

    def _convert_document_to_markdown(self, doc_path):
        # Your conversion logic here
        return "# Converted Content\n\nCustom format converted!"
# Add to converter manager
from documents_to_markdown import DocumentConverter
converter = DocumentConverter()
converter._get_manager().add_converter(MyCustomConverter())
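The extension mechanism above follows a common pattern: each converter declares what it can handle, and a registry picks the first one that accepts a file. The standalone sketch below illustrates that pattern only; `Registry` and `UpperCaseConverter` are hypothetical classes and are not the package's manager or base class.

```python
from pathlib import Path


class UpperCaseConverter:
    """Toy converter for a made-up '.custom' format."""

    def get_supported_extensions(self):
        return [".custom"]

    def can_convert(self, file_path):
        return Path(file_path).suffix.lower() in self.get_supported_extensions()

    def convert(self, text):
        return "# Converted Content\n\n" + text.upper()


class Registry:
    """Keeps converters in registration order and returns the first match."""

    def __init__(self):
        self._converters = []

    def add_converter(self, converter):
        self._converters.append(converter)

    def find(self, file_path):
        for converter in self._converters:
            if converter.can_convert(file_path):
                return converter
        return None


registry = Registry()
registry.add_converter(UpperCaseConverter())
print(registry.find("notes.custom") is not None)  # True
print(registry.find("notes.pdf"))                 # None
```

Because the first match wins in this sketch, registration order decides which converter handles an extension claimed by more than one.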
# Clone the repository
git clone https://github.com/ChaosAIs/DocumentsToMarkdown.git
cd DocumentsToMarkdown
# Install in development mode
pip install -e .
# Install development dependencies
pip install -e ".[dev]"  # Quotes keep shells such as zsh from expanding the brackets
# Run tests
pytest
# Run specific tests
python test_converter.py
python test_ai_services.py
# Test basic conversion
python test_converter.py
# Test AI services
python test_ai_services.py
# Test image conversion
python test_image_converter.py
# Test flowchart conversion
python test_flowchart_conversion.py
# Build the package
python -m build
# Install locally for testing
pip install dist/documents_to_markdown-1.0.0-py3-none-any.whl
# Publish to PyPI (maintainers only)
python -m twine upload dist/*
A Word document with headings, formatting, and a table converts to Markdown like this:
# 1. Project Report
Some **bold text** and *italic text*
## 1.1 Data Summary
| Header 1 | Header 2 | Header 3 |
| --- | --- | --- |
| Data 1 | Data 2 | Data 3 |
| Data 4 | Data 5 | Data 6 |
Input CSV:
Employee ID,Name,Department,Salary
001,Alice Johnson,Engineering,75000
002,Bob Smith,Marketing,65000
Output:
| Employee ID | Name | Department | Salary |
|:-----------:|:-------------|:------------|-------:|
| 001 | Alice Johnson| Engineering | 75000 |
| 002 | Bob Smith | Marketing | 65000 |
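For intuition, the CSV-to-table step can be approximated with the standard library alone. This is a minimal sketch, not the package's converter (which is described as AI-enhanced and also aligns columns); `csv_to_markdown` is a hypothetical name.

```python
import csv
import io


def csv_to_markdown(csv_text):
    """Render CSV text as a Markdown table, using the first row as the header."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        # Escape pipes so cell text cannot break the table layout.
        cells = [cell.replace("|", "\\|") for cell in row]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


print(csv_to_markdown("Employee ID,Name\n001,Alice Johnson\n"))
# | Employee ID | Name |
# | --- | --- |
# | 001 | Alice Johnson |
```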
Images containing flowcharts are automatically converted to ASCII diagrams:
┌─────────────┐
│    Start    │
└──────┬──────┘
       ↓
┌─────────────┐
│  Process A  │
└──────┬──────┘
       ↓
┌─────────────┐
│     End     │
└─────────────┘
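The layout of such a diagram is simple enough to reproduce by hand, which helps when checking or post-editing converted output. The sketch below only illustrates the target format for a linear chain of steps; in the package the diagram comes from the AI service, and `render_flow` is a hypothetical function.

```python
def render_flow(steps):
    """Render step labels as a vertical chain of boxes joined by arrows."""
    if not steps:
        return ""
    width = max(len(step) for step in steps) + 4  # inner width shared by all boxes
    mid = width // 2                              # connector offset inside the box
    lines = []
    for index, label in enumerate(steps):
        lines.append("┌" + "─" * width + "┐")
        lines.append("│" + label.center(width) + "│")
        if index == len(steps) - 1:
            lines.append("└" + "─" * width + "┘")
        else:
            lines.append("└" + "─" * mid + "┬" + "─" * (width - mid - 1) + "┘")
            lines.append(" " * (mid + 1) + "↓")  # arrow sits under the connector
    return "\n".join(lines)


print(render_flow(["Start", "Process A", "End"]))
```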
Installation Problems:
# Missing dependencies
pip install documents-to-markdown
# Development installation
git clone https://github.com/ChaosAIs/DocumentsToMarkdown.git
cd DocumentsToMarkdown
pip install -e .
AI Service Issues:
# Test AI services
python -c "from documents_to_markdown.services.ai_services import ai_service_factory; print('AI services available:', ai_service_factory.get_available_services())"
# OLLAMA not running
ollama serve
ollama pull llava:latest
# OpenAI API key issues
echo "OPENAI_API_KEY=your_key_here" > .env
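Before digging into configuration, it can be worth confirming that the OLLAMA server answers at all. This sketch uses only the standard library and assumes the base URL from the configuration above responds to a plain HTTP GET with status 200 when the server is up; `ollama_reachable` is a hypothetical helper, not part of the package.

```python
import http.client
import urllib.request


def ollama_reachable(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP GET on base_url answers with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and timeouts are OSError subclasses; ValueError covers
        # malformed URLs; HTTPException covers non-HTTP replies on the port.
        return False


if ollama_reachable():
    print("OLLAMA server is responding")
else:
    print("OLLAMA server not reachable; try `ollama serve`")
```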
File Processing Issues:
We welcome contributions! Here's how to get started:
git checkout -b feature/amazing-feature
pytest  # or: python test_converter.py
git commit -m 'Add amazing feature'
git push origin feature/amazing-feature
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ by Felix