documents-to-markdown

A comprehensive Python library for converting various document types to Markdown format

1.0.0

PyPI

Maintainers: 1

Documents to Markdown Converter

A comprehensive Python library for converting various document types to Markdown format with AI-powered image extraction and processing capabilities.

🚀 Quick Start

Installation

# Install from PyPI
pip install documents-to-markdown

# Or install from source
git clone https://github.com/ChaosAIs/DocumentsToMarkdown.git
cd DocumentsToMarkdown
pip install -e .

Library Usage

from documents_to_markdown import DocumentConverter

# Initialize converter
converter = DocumentConverter()

# Convert a single file
success = converter.convert_file("document.docx", "output.md")
print(f"Conversion successful: {success}")

# Convert all files in a folder
results = converter.convert_all("input_folder", "output_folder")
print(f"Converted {results['successful_conversions']} files")

Command Line Usage

# Convert all files in the default input folder (input) into the default output folder (output)
documents-to-markdown

# Convert specific file
documents-to-markdown --file document.docx output.md

# Custom input/output folders
documents-to-markdown --input docs --output markdown

# Show help
documents-to-markdown --help

📋 Supported Formats

Word Documents: .docx, .doc
PDF Documents: .pdf
Excel Spreadsheets: .xlsx, .xlsm, .xls
Images: .png, .jpg, .jpeg, .gif, .bmp, .tiff (AI-powered)
Plain Text: .txt, .csv, .tsv, .log (AI-enhanced)
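
The extension list above can be checked before a file is handed to the converter. A minimal sketch, assuming the list is exhaustive; `FORMAT_GROUPS` and `classify` are hypothetical names for illustration and are not part of the library, whose own check is `converter.can_convert`:

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the Supported Formats list; the group names are illustrative.
FORMAT_GROUPS = {
    "word": {".docx", ".doc"},
    "pdf": {".pdf"},
    "excel": {".xlsx", ".xlsm", ".xls"},
    "image": {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff"},
    "plain_text": {".txt", ".csv", ".tsv", ".log"},
}


def classify(path: str) -> Optional[str]:
    """Return the format group for a path, or None if the extension is not listed."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the extension
    for group, extensions in FORMAT_GROUPS.items():
        if suffix in extensions:
            return group
    return None
```

Only the file name is inspected here; the file content is not read.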

✨ Features

Core Capabilities

Multi-format support: Word, PDF, Excel, Plain Text, and Image documents
AI-powered processing: Choose between OpenAI (cloud) and OLLAMA (local)
Batch processing: Convert multiple documents efficiently
Preserves formatting: Bold, italic, tables, and document structure
Automatic section numbering: Hierarchical numbering (1, 1.1, 1.2, etc.)
Modular architecture: Extensible converter system
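
To make the hierarchical numbering concrete, here is a minimal sketch of how Markdown headings could be numbered 1., 1.1, 1.2 and so on. It illustrates the idea only and is not the library's implementation; `number_headings` is a hypothetical helper, and the trailing dot on top-level numbers is an assumption taken from the Word output example in this README.

```python
import re

# An ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def number_headings(markdown: str) -> str:
    """Prefix each heading with a hierarchical number such as 1., 1.1, 1.2."""
    counters = [0] * 6  # one counter per heading level
    out = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if not match:
            out.append(line)
            continue
        hashes, title = match.groups()
        level = len(hashes)
        counters[level - 1] += 1
        # A new heading restarts the numbering of every deeper level.
        for deeper in range(level, 6):
            counters[deeper] = 0
        number = ".".join(str(n) for n in counters[:level])
        if level == 1:
            number += "."
        out.append(f"{hashes} {number} {title}")
    return "\n".join(out)
```

The sketch does not skip fenced code blocks, so a `#` comment inside one would be numbered too, and a document that jumps from level 1 straight to level 3 would get a `0` in the middle of the number.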

AI-Enhanced Features

Image text extraction: Extract text from images using AI vision
Embedded image processing: Process images within Word/PDF documents
Flowchart conversion: Convert flowcharts to ASCII diagrams
Smart text processing: AI-enhanced plain text formatting
Privacy options: Local AI processing with OLLAMA

📚 Library API

Basic Usage

from documents_to_markdown import DocumentConverter

# Initialize converter
converter = DocumentConverter(
    add_section_numbers=True,  # Enable section numbering
    verbose=False              # Enable verbose logging
)

# Convert single file
success = converter.convert_file("input.docx", "output.md")

# Convert all files in folder
results = converter.convert_all("input_folder", "output_folder")

# Check supported formats
formats = converter.get_supported_formats()
print(f"Supported: {formats}")

# Check if file can be converted
if converter.can_convert("document.pdf"):
    print("File can be converted!")

Advanced Usage

from documents_to_markdown import DocumentConverter, convert_document, convert_folder

# Quick single file conversion
success = convert_document("report.docx", "report.md")

# Quick folder conversion
results = convert_folder("documents", "markdown_output")

# Advanced converter configuration
converter = DocumentConverter()
converter.set_section_numbering(False)  # Disable numbering
converter.set_verbose_logging(True)     # Enable debug output

# Get detailed statistics
stats = converter.get_conversion_statistics()
print(f"Available converters: {stats['total_converters']}")
for conv in stats['converters']:
    print(f"- {conv['name']}: {', '.join(conv['supported_extensions'])}")

Working with Results

# Convert folder and handle results
results = converter.convert_all("input", "output")

print(f"Total files: {results['total_files']}")
print(f"Successful: {results['successful_conversions']}")
print(f"Failed: {results['failed_conversions']}")

# Process individual results
for result in results['results']:
    status = "✓" if result['status'] == 'success' else "✗"
    print(f"{status} {result['file']} ({result['converter']})")
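
When a batch run feeds a script or CI job, it can help to reduce the results dictionary to a list of failures. A minimal sketch that relies only on the keys shown above (`results`, `status`, `file`); `failed_files` is a hypothetical helper, not a library function:

```python
from typing import Any, Dict, List


def failed_files(results: Dict[str, Any]) -> List[str]:
    """Return the names of files whose status is anything other than 'success'."""
    return [
        entry["file"]
        for entry in results.get("results", [])
        if entry.get("status") != "success"
    ]


# Hand-written sample in the shape shown above, for illustration only.
sample = {
    "total_files": 2,
    "successful_conversions": 1,
    "failed_conversions": 1,
    "results": [
        {"file": "a.docx", "status": "success", "converter": "WordConverter"},
        {"file": "b.pdf", "status": "failed", "converter": "PDFConverter"},
    ],
}
print(failed_files(sample))
```

The converter names and the `failed` status string in the sample are placeholders; only `success` is confirmed by the examples in this README.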

🖥️ Command Line Interface

Installation

After installing the package, you can use the command-line interface:

# Install the package
pip install documents-to-markdown

# Now you can use the CLI commands
documents-to-markdown --help
doc2md --help  # Alternative shorter command

Basic Commands

# Convert all files in the default input folder (input) into the default output folder (output)
documents-to-markdown

# Convert all files with custom folders
documents-to-markdown --input docs --output markdown

# Convert a single file
documents-to-markdown --file document.docx output.md

# Show converter statistics
documents-to-markdown --stats

# Disable section numbering
documents-to-markdown --no-numbering

# Enable verbose output
documents-to-markdown --verbose

Command Options

documents-to-markdown [OPTIONS]

Options:
  -i, --input FOLDER       Input folder (default: input)
  -o, --output FOLDER      Output folder (default: output)
  -f, --file INPUT OUTPUT  Convert a single file
  --no-numbering           Disable section numbering
  --stats                  Show converter statistics
  -v, --verbose            Enable verbose logging
  --version                Show version
  --help                   Show help message
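
For readers who want to wrap or mirror this interface, the option table maps naturally onto `argparse`. The following is a sketch reconstructed from the table above, not the package's actual `cli.py`; `build_parser` is a hypothetical name, and the version string is taken from the package version shown on this page.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the options listed in the table above."""
    parser = argparse.ArgumentParser(prog="documents-to-markdown")
    parser.add_argument("-i", "--input", metavar="FOLDER", default="input",
                        help="Input folder (default: input)")
    parser.add_argument("-o", "--output", metavar="FOLDER", default="output",
                        help="Output folder (default: output)")
    # --file takes exactly two values: the source file and the Markdown target.
    parser.add_argument("-f", "--file", nargs=2, metavar=("INPUT", "OUTPUT"),
                        help="Convert a single file")
    parser.add_argument("--no-numbering", action="store_true",
                        help="Disable section numbering")
    parser.add_argument("--stats", action="store_true",
                        help="Show converter statistics")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose logging")
    parser.add_argument("--version", action="version", version="%(prog)s 1.0.0")
    return parser  # --help is added by argparse automatically


args = build_parser().parse_args(["--file", "document.docx", "output.md", "--no-numbering"])
print(args.file, args.no_numbering)
```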

🤖 AI Configuration (Optional)

For enhanced image processing and text analysis, you can configure AI services:

Option 1: OLLAMA (Local AI) - Recommended for Privacy

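# Install OLLAMA (see https://ollama.ai), then start the server.
# `ollama serve` runs in the foreground, so run the pull in a second terminal.
ollama serve
ollama pull llava:latest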

# Create .env file
echo "AI_SERVICE=ollama" > .env
echo "OLLAMA_BASE_URL=http://localhost:11434" >> .env
echo "OLLAMA_MODEL=llava:latest" >> .env

Benefits:

✅ Free: No API costs
✅ Private: Data never leaves your computer
✅ Offline: Works without internet

Option 2: OpenAI (Cloud AI) - Recommended for Ease

# Get API key from https://platform.openai.com/api-keys
# Create .env file
echo "AI_SERVICE=openai" > .env
echo "OPENAI_API_KEY=your_api_key_here" >> .env

Trade-offs:

✅ Easy Setup: Just need API key
✅ High Quality: Consistently good results
❌ Costs Money: Pay per API call

Auto-Detection (Recommended)

# Configure both services - system will choose best available
cat > .env << EOF
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llava:latest
OPENAI_API_KEY=your_api_key_here
EOF
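
How "best available" is decided is not spelled out here. The sketch below shows one plausible rule, under loudly stated assumptions: an explicit `AI_SERVICE` wins, otherwise a configured OLLAMA endpoint is preferred over an OpenAI key. That preference order and the `choose_service` helper are hypothetical, and the sketch only looks at which variables are set, not at whether a service is reachable.

```python
from typing import Mapping, Optional


def choose_service(env: Mapping[str, str]) -> Optional[str]:
    """Pick 'ollama' or 'openai' from environment-style settings, or None."""
    explicit = env.get("AI_SERVICE", "").strip().lower()
    if explicit in {"ollama", "openai"}:
        return explicit  # an explicit choice overrides auto-detection
    # Assumed order: local first, then cloud. The library may decide differently.
    if env.get("OLLAMA_BASE_URL"):
        return "ollama"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return None


print(choose_service({"OLLAMA_BASE_URL": "http://localhost:11434",
                      "OPENAI_API_KEY": "your_api_key_here"}))
```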

Advanced Configuration

# Complete .env configuration
# AI_SERVICE: set to "ollama" or "openai", or leave empty for auto-detection
AI_SERVICE=ollama

# OpenAI Settings
OPENAI_MODEL=gpt-4o
OPENAI_MAX_TOKENS=4096
OPENAI_TEMPERATURE=0.1

# OLLAMA Settings
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llava:latest
OLLAMA_TIMEOUT=120

# Image Processing
IMAGE_MAX_SIZE_MB=20
IMAGE_QUALITY_COMPRESSION=85
IMAGE_MAX_SIZE_PIXELS=2048

# Logging
LOG_LEVEL=INFO
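
These variables are plain strings in the environment, so code that consumes them has to convert types. A minimal sketch of reading them into a typed object; `ConverterSettings` and `load_settings` are hypothetical names, and using the sample values above as defaults is an assumption, not documented library behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ConverterSettings:
    ai_service: str
    openai_model: str
    openai_max_tokens: int
    openai_temperature: float
    ollama_base_url: str
    ollama_model: str
    ollama_timeout: int
    image_max_size_mb: int
    image_quality_compression: int
    image_max_size_pixels: int
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> ConverterSettings:
    """Read the settings from a mapping, falling back to the sample values."""
    return ConverterSettings(
        ai_service=env.get("AI_SERVICE", ""),  # empty means auto-detection
        openai_model=env.get("OPENAI_MODEL", "gpt-4o"),
        openai_max_tokens=int(env.get("OPENAI_MAX_TOKENS", "4096")),
        openai_temperature=float(env.get("OPENAI_TEMPERATURE", "0.1")),
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        ollama_model=env.get("OLLAMA_MODEL", "llava:latest"),
        ollama_timeout=int(env.get("OLLAMA_TIMEOUT", "120")),
        image_max_size_mb=int(env.get("IMAGE_MAX_SIZE_MB", "20")),
        image_quality_compression=int(env.get("IMAGE_QUALITY_COMPRESSION", "85")),
        image_max_size_pixels=int(env.get("IMAGE_MAX_SIZE_PIXELS", "2048")),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test. A non-numeric value such as `OLLAMA_TIMEOUT=abc` raises `ValueError` in this sketch.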

📖 Examples

Converting Different File Types

from documents_to_markdown import DocumentConverter

converter = DocumentConverter()

# Word document
converter.convert_file("report.docx", "report.md")

# PDF document
converter.convert_file("manual.pdf", "manual.md")

# Excel spreadsheet
converter.convert_file("data.xlsx", "data.md")

# Image with text (requires AI setup)
converter.convert_file("screenshot.png", "screenshot.md")

# Plain text/CSV
converter.convert_file("data.csv", "data.md")

Batch Processing

from documents_to_markdown import convert_folder

# Convert entire folder
results = convert_folder("documents", "markdown_output")

print(f"✅ Converted: {results['successful_conversions']}")
print(f"❌ Failed: {results['failed_conversions']}")

# Process results
for result in results['results']:
    if result['status'] == 'success':
        print(f"✓ {result['file']} -> Converted with {result['converter']}")
    else:
        print(f"✗ {result['file']} -> Failed")

Custom Configuration

from documents_to_markdown import DocumentConverter

# Initialize with custom settings
converter = DocumentConverter(
    add_section_numbers=False,  # Disable numbering
    verbose=True               # Enable debug logging
)

# Check what formats are supported
formats = converter.get_supported_formats()
print(f"Supported formats: {', '.join(formats)}")

# Get detailed converter information
stats = converter.get_conversion_statistics()
for conv_info in stats['converters']:
    name = conv_info['name']
    exts = ', '.join(conv_info['supported_extensions'])
    print(f"{name}: {exts}")

🏗️ Architecture

Library Structure

documents_to_markdown/
├── __init__.py              # Main package exports
├── api.py                   # Public API interface
├── cli.py                   # Command-line interface
└── services/                # Core conversion services
    ├── document_converter_manager.py  # Main orchestrator
    ├── base_converter.py             # Abstract base converter
    ├── word_converter.py             # Word document converter
    ├── pdf_converter.py              # PDF document converter
    ├── excel_converter.py            # Excel spreadsheet converter
    ├── image_converter.py            # Image converter (AI-powered)
    ├── plain_text_converter.py       # Text/CSV converter (AI-enhanced)
    ├── text_chunking_utils.py        # Text processing utilities
    └── ai_services/                  # AI service abstraction
        ├── base_ai_service.py        # AI service interface
        ├── openai_service.py         # OpenAI implementation
        ├── ollama_service.py         # OLLAMA implementation
        └── ai_service_factory.py     # Service factory

Converter Architecture

DocumentConverter: Main public API class
DocumentConverterManager: Orchestrates multiple converters
BaseDocumentConverter: Abstract base for all converters
Specialized Converters: Word, PDF, Excel, Image, PlainText
AI Services: Pluggable AI backends (OpenAI, OLLAMA)
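
The relationship between the manager and its converters can be sketched as a simple dispatch: the manager asks each registered converter whether it can handle a file and uses the first one that says yes. The classes below are stand-ins written for illustration (`SketchConverter`, `TextSketchConverter` and `SketchManager` are hypothetical); they mirror the roles listed above but are not the library's classes, and the first-match rule is an assumption.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List, Optional


class SketchConverter(ABC):
    """Stand-in for an abstract base converter."""

    @abstractmethod
    def get_supported_extensions(self) -> List[str]:
        ...

    def can_convert(self, file_path: Path) -> bool:
        return file_path.suffix.lower() in self.get_supported_extensions()

    @abstractmethod
    def convert(self, file_path: Path) -> str:
        ...


class TextSketchConverter(SketchConverter):
    """Turns a .txt file into Markdown with the file name as the heading."""

    def get_supported_extensions(self) -> List[str]:
        return [".txt"]

    def convert(self, file_path: Path) -> str:
        return f"# {file_path.stem}\n\n{file_path.read_text(encoding='utf-8')}"


class SketchManager:
    """Stand-in for the orchestrator: picks the first converter that accepts a file."""

    def __init__(self) -> None:
        self._converters: List[SketchConverter] = []

    def add_converter(self, converter: SketchConverter) -> None:
        self._converters.append(converter)

    def find_converter(self, file_path: Path) -> Optional[SketchConverter]:
        for converter in self._converters:
            if converter.can_convert(file_path):
                return converter
        return None

    def convert(self, file_path: Path) -> str:
        converter = self.find_converter(file_path)
        if converter is None:
            raise ValueError(f"No converter for {file_path.suffix!r}")
        return converter.convert(file_path)
```

With this shape, supporting a new format means registering one more converter; the manager itself does not change.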

Extensibility

The modular design makes it easy to:

Add new document formats
Integrate additional AI services
Customize conversion behavior
Extend processing capabilities

# Example: Custom converter
from pathlib import Path

from documents_to_markdown import DocumentConverter
from documents_to_markdown.services.base_converter import BaseDocumentConverter

class MyCustomConverter(BaseDocumentConverter):
    def get_supported_extensions(self):
        return ['.custom']

    def can_convert(self, file_path):
        # Wrap in Path so the check works for both str and Path arguments
        return Path(file_path).suffix.lower() == '.custom'

    def _convert_document_to_markdown(self, doc_path):
        # Your conversion logic here
        return "# Converted Content\n\nCustom format converted!"

# Add to converter manager
# (_get_manager() is underscore-prefixed, so treat it as an internal accessor)
converter = DocumentConverter()
converter._get_manager().add_converter(MyCustomConverter())

🧪 Development

Setting Up Development Environment

# Clone the repository
git clone https://github.com/ChaosAIs/DocumentsToMarkdown.git
cd DocumentsToMarkdown

# Install in development mode
pip install -e .

# Install development dependencies
# (the quotes stop shells such as zsh from treating the brackets as a glob)
pip install -e ".[dev]"

# Run tests
pytest

# Run specific tests
python test_converter.py
python test_ai_services.py

Running Tests

# Test basic conversion
python test_converter.py

# Test AI services
python test_ai_services.py

# Test image conversion
python test_image_converter.py

# Test flowchart conversion
python test_flowchart_conversion.py

Building and Publishing

# Build the package
python -m build

# Install locally for testing
pip install dist/documents_to_markdown-1.0.0-py3-none-any.whl

# Publish to PyPI (maintainers only)
python -m twine upload dist/*

📋 Output Examples

Word Document Conversion

A Word document with headings, bold and italic text, and a table converts to Markdown such as:

# 1. Project Report

Some **bold text** and *italic text*

## 1.1 Data Summary

| Header 1 | Header 2 | Header 3 |
| --- | --- | --- |
| Data 1 | Data 2 | Data 3 |
| Data 4 | Data 5 | Data 6 |

CSV to Markdown Table

Input CSV:

Employee ID,Name,Department,Salary
001,Alice Johnson,Engineering,75000
002,Bob Smith,Marketing,65000

Output:

| Employee ID | Name         | Department  | Salary |
|:-----------:|:-------------|:------------|-------:|
| 001         | Alice Johnson| Engineering |  75000 |
| 002         | Bob Smith    | Marketing   |  65000 |
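
The structural part of this conversion can be reproduced with the standard library alone. A minimal sketch, assuming a comma-separated input with a header row; it uses no AI processing and does not pad or align columns the way the output above does, and `csv_to_markdown` is a hypothetical helper rather than the library's converter.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Convert CSV text with a header row into a Markdown pipe table."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]  # drop blank lines
    if not rows:
        return ""
    header, *body = rows

    def render(cells):
        # Escape pipes so cell content cannot break the table structure.
        return "| " + " | ".join(cell.replace("|", "\\|") for cell in cells) + " |"

    lines = [render(header), render(["---"] * len(header))]
    for row in body:
        # Pad short rows and trim long ones to match the header width.
        cells = (row + [""] * len(header))[: len(header)]
        lines.append(render(cells))
    return "\n".join(lines)


print(csv_to_markdown("Employee ID,Name,Department,Salary\n001,Alice Johnson,Engineering,75000\n"))
```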

AI-Enhanced Image Processing

Images containing flowcharts are automatically converted to ASCII diagrams:

┌─────────────┐
│    Start    │
└──────┬──────┘
       ↓
┌─────────────┐
│  Process A  │
└──────┬──────┘
       ↓
┌─────────────┐
│     End     │
└─────────────┘
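
Recognising the flowchart in an image is the AI service's job; only the last step, drawing boxes and arrows as text, is simple enough to sketch here. The function below renders a straight chain of step labels in the style shown above. It is an illustration under that assumption (no branches, no image analysis), and `render_flow` is a hypothetical helper, not part of the library.

```python
from typing import List


def render_flow(steps: List[str], inner_width: int = 13) -> str:
    """Draw a top-to-bottom chain of boxes connected by arrows."""
    # Widen the boxes if any label would not fit with a space on each side.
    width = max(inner_width, max((len(step) for step in steps), default=0) + 2)
    half = width // 2
    lines = []
    for index, step in enumerate(steps):
        lines.append("┌" + "─" * width + "┐")
        lines.append("│" + step.center(width) + "│")
        if index == len(steps) - 1:
            lines.append("└" + "─" * width + "┘")
        else:
            # The bottom edge carries a connector, with the arrow directly below it.
            lines.append("└" + "─" * half + "┬" + "─" * (width - half - 1) + "┘")
            lines.append(" " * (half + 1) + "↓")
    return "\n".join(lines)


print(render_flow(["Start", "Process A", "End"]))
```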

🔧 Troubleshooting

Common Issues

Installation Problems:

# Missing dependencies
pip install documents-to-markdown

# Development installation
git clone https://github.com/ChaosAIs/DocumentsToMarkdown.git
cd DocumentsToMarkdown
pip install -e .

AI Service Issues:

# Test AI services
python -c "from documents_to_markdown.services.ai_services import ai_service_factory; print('AI services available:', ai_service_factory.get_available_services())"

# OLLAMA not running
ollama serve
ollama pull llava:latest

# OpenAI API key issues
# (> replaces an existing .env; use >> instead to append and keep other settings)
echo "OPENAI_API_KEY=your_key_here" > .env

File Processing Issues:

Ensure files are in supported formats
Check file permissions and paths
Review logs for detailed error messages
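
The first two checks can be automated before a conversion is attempted. A minimal sketch using only the standard library; `preflight` is a hypothetical helper, and the extension set is copied from the Supported Formats list on the assumption that it is complete.

```python
import os
from pathlib import Path
from typing import List

# Copied from the Supported Formats list in this README.
SUPPORTED_EXTENSIONS = {
    ".docx", ".doc", ".pdf", ".xlsx", ".xlsm", ".xls",
    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff",
    ".txt", ".csv", ".tsv", ".log",
}


def preflight(path: str) -> List[str]:
    """Return a list of problems found with an input file; empty means it looks fine."""
    problems = []
    candidate = Path(path)
    if candidate.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {candidate.suffix or '(none)'}")
    if not candidate.is_file():
        problems.append("file not found (check the path)")
    elif not os.access(candidate, os.R_OK):
        problems.append("file is not readable (check permissions)")
    return problems
```

An empty list does not guarantee a successful conversion; it only rules out the most common setup mistakes.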

🤝 Contributing

We welcome contributions! Here's how to get started:

Fork the repository
Create a feature branch: git checkout -b feature/amazing-feature
Make your changes and add tests
Run tests: pytest or python test_converter.py
Commit changes: git commit -m 'Add amazing feature'
Push to branch: git push origin feature/amazing-feature
Open a Pull Request

Development Guidelines

Follow PEP 8 style guidelines
Add tests for new features
Update documentation as needed
Ensure backward compatibility

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

python-docx for Word document processing
PyMuPDF for PDF handling
OpenAI for AI vision capabilities
OLLAMA for local AI processing
openpyxl for Excel support

📞 Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Documentation: Project Wiki

Made with ❤️ by Felix
