n8n Document Converter Node

n8n community node for converting various document formats to JSON/text with AI-friendly output
Features
Core Features
- ✅ 12+ file formats supported
- ✅ Automatic file type detection
- ✅ Hybrid processing (primary + fallback)
- ✅ Stream processing for large files
- ✅ Promise pooling for concurrency control
- ✅ Comprehensive error handling
Security & Performance
- ✅ Input validation & sanitization
- ✅ XSS protection (sanitize-html)
- ✅ Path traversal protection
- ✅ Memory-efficient streaming
- ✅ Configurable file size limits (up to 100MB)
- ✅ JSON structure normalization
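The "promise pooling for concurrency control" item can be pictured with a small worker pool. This is only a sketch of the general technique, not the node's actual code; the helper name `runWithPool` and its signature are hypothetical.

```typescript
// Hypothetical helper: runs `worker` over `items` with at most `limit` tasks in flight.
async function runWithPool<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  if (limit < 1) throw new RangeError("limit must be >= 1");
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner claims one item at a time until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  const runners: Promise<void>[] = [];
  for (let n = 0; n < Math.min(limit, items.length); n++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results; // same order as `items`
}
```

Results keep the input order even though tasks finish out of order, which matters when each input file must map back to its own output item.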
Supported Formats
| Category | Formats | Status |
|---|---|---|
| Text Documents | DOCX, ODT, TXT, PDF | ✅ Full Support |
| Spreadsheets | XLSX, ODS, CSV | ✅ Multi-sheet support |
| Presentations | PPTX, ODP | ✅ Full Support |
| Web & Data | HTML, HTM, XML, JSON | ✅ Full Support |
| E-commerce | YML (Yandex Market) | ✅ Specialized parsing |
| Legacy | DOC, PPT, XLS | ❌ Not supported* |

*Legacy formats require conversion to modern formats (DOCX, PPTX, XLSX).
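Legacy DOC, PPT, and XLS files are stored in the Compound File Binary (CFB) container, which begins with a fixed 8-byte signature. A minimal sketch of how such files could be recognized up front and rejected with a clear message; the function name `looksLikeCfb` is hypothetical and is not the node's own helper.

```typescript
// Signature at the start of every Compound File Binary (CFB) file.
const CFB_SIGNATURE = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];

// Returns true when the buffer starts with the CFB signature.
function looksLikeCfb(data: Uint8Array): boolean {
  if (data.length < CFB_SIGNATURE.length) return false;
  return CFB_SIGNATURE.every((byte, i) => data[i] === byte);
}
```

Modern DOCX, PPTX, and XLSX files are ZIP archives (they start with the bytes `PK`), so this check does not match them.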
DOCX to HTML Conversion (v1.0.21+)
Latest: Node renamed to "Document Converter" in v1.0.22
Choose Your Output Format
| | Plain Text (Default) | HTML Format |
|---|---|---|
| Best for | Simple text extraction; minimal output size; maximum speed; backward compatibility | Documents with tables; AI/LLM processing; preserving formatting; structured content |
| Output size | ~3,600 chars | ~58,000 chars (+1,591%) |
Usage in n8n
1. Add the "Document Converter" node
2. Select the "Output Format (DOCX)" parameter:
   - Plain Text: simple extraction
   - HTML: tables + formatting preserved
Example Output
Plain Text Output
{
"text": "Situation: Often search by one field\nAction: Create index on that field"
}
HTML Output (with tables)
{
"text": "<table><tr><td><strong>Situation</strong></td><td><strong>Action</strong></td></tr><tr><td>Often search by one field</td><td>Create index on that field</td></tr></table>"
}
HTML Format Features
| Feature | Details |
|---|---|
| Tables | <table>, <tr>, <td> - full structure preserved |
| Formatting | <strong>, <em>, <h1>-<h6> |
| Lists | <ul>, <ol>, <li> |
| Paragraphs | <p> tags for structure |
| AI-Friendly | ✅ Understood by ChatGPT, Claude, Gemini |
XLSX Multi-Sheet Processing
How It Works
{
"sheets": {
"Products": [
{ "A": "ID", "B": "Name", "C": "Price" },
{ "A": 1, "B": "Apple", "C": 100 },
{ "A": 2, "B": "Banana", "C": 50 }
],
"Orders": [
{ "A": "Order", "B": "Quantity" },
{ "A": 101, "B": 5 }
]
}
}
Key Features
| Feature | Details |
|---|---|
| Multiple Sheets | Each sheet = separate array in the sheets object |
| Column Names | A, B, C... Z (Excel-style) |
| Row Format | Array of objects (rows) |
| Empty Cells | Skipped (only filled cells included) |
| Size Limit | 10,000 rows per sheet (configurable) |
| Memory Safe | Large files auto-limited to prevent OOM |
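To make the row format concrete, here is a rough sketch of how one sheet's cells could be turned into the array of `{ "A": ..., "B": ... }` objects shown above. The helper names `columnLetter` and `sheetToRows` are hypothetical, and two details are assumptions rather than documented behavior: columns past Z continue as AA, AB, and so on, and an entirely empty row becomes an empty object.

```typescript
type CellValue = string | number | boolean | null | undefined;
type RowObject = { [column: string]: string | number | boolean };

// Converts a 0-based column index to an Excel-style letter (0 -> "A", 25 -> "Z", 26 -> "AA").
function columnLetter(index: number): string {
  let n = index + 1;
  let letters = "";
  while (n > 0) {
    const rem = (n - 1) % 26;
    letters = String.fromCharCode(65 + rem) + letters;
    n = Math.floor((n - 1) / 26);
  }
  return letters;
}

// Builds the per-sheet array: one object per row, empty cells skipped,
// and at most `maxRows` rows kept (10,000 mirrors the documented default).
function sheetToRows(rows: CellValue[][], maxRows = 10000): RowObject[] {
  const out: RowObject[] = [];
  for (const row of rows.slice(0, maxRows)) {
    const obj: RowObject = {};
    row.forEach((cell, i) => {
      if (cell !== null && cell !== undefined && cell !== "") {
        obj[columnLetter(i)] = cell;
      }
    });
    out.push(obj);
  }
  return out;
}
```

Because empty cells are skipped, a consumer should look values up by column letter rather than by position within the object.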
Installation
Option 1: npm Package (Recommended)
Via n8n web interface:
Settings → Community nodes → Install
Package name: @mazix/n8n-nodes-converter-documents
Or via command line:
npm install @mazix/n8n-nodes-converter-documents
Option 2: Standalone Version
git clone https://github.com/mazixs/n8n-node-converter-documents.git
cd n8n-node-converter-documents
npm install
npm run standalone
cp -r ./standalone ~/.n8n/custom-nodes/n8n-node-converter-documents
cd ~/.n8n/custom-nodes/n8n-node-converter-documents
npm install
Option 3: Manual Installation
mkdir -p ~/.n8n/custom-nodes/n8n-node-converter-documents
cp dist/*.js dist/*.svg ~/.n8n/custom-nodes/n8n-node-converter-documents/
cp package.json ~/.n8n/custom-nodes/n8n-node-converter-documents/
cd ~/.n8n/custom-nodes/n8n-node-converter-documents
npm install --production
Usage Examples
Text Document Output
{
"text": "Extracted text content...",
"metadata": {
"fileName": "document.docx",
"fileSize": 12345,
"fileType": "docx",
"processedAt": "2024-06-01T12:00:00.000Z"
}
}
Excel Spreadsheet Output
{
"sheets": {
"Sheet1": [
{ "A": "Name", "B": "Age", "C": "City" },
{ "A": "Alice", "B": 30, "C": "Moscow" },
{ "A": "Bob", "B": 25, "C": "SPB" }
]
},
"metadata": {
"fileName": "data.xlsx",
"fileSize": 23456,
"fileType": "xlsx"
}
}
JSON Normalization
Input:
{
"user": {
"name": "John",
"address": { "city": "Moscow" }
}
}
Output (flattened):
{
"text": "{\n \"user.name\": \"John\",\n \"user.address.city\": \"Moscow\"\n}",
"warning": "Multi-level JSON structure was converted to flat object"
}
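The flattening step can be sketched as a short recursive function that joins nested keys with dots. The name `flattenJson` is hypothetical, and keeping arrays as single values is an assumption; the node's real handling of arrays may differ.

```typescript
type JsonValue = string | number | boolean | null | JsonValue[] | { [key: string]: JsonValue };
type FlatObject = { [path: string]: JsonValue };

// Flattens nested objects into a single level with dot-separated keys.
function flattenJson(value: { [key: string]: JsonValue }, prefix = ""): FlatObject {
  const flat: FlatObject = {};
  for (const key of Object.keys(value)) {
    const path = prefix ? `${prefix}.${key}` : key;
    const child = value[key];
    if (child !== null && typeof child === "object" && !Array.isArray(child)) {
      // Nested object: recurse and merge its flattened keys.
      const nested = flattenJson(child, path);
      for (const k of Object.keys(nested)) flat[k] = nested[k];
    } else {
      // Primitives, null, and arrays are kept as-is (assumption).
      flat[path] = child;
    }
  }
  return flat;
}
```

Applied to the input above, `flattenJson` yields the keys `user.name` and `user.address.city`, matching the flattened object embedded in the documented `text` field.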
Architecture
Strategy Pattern Implementation
DOCX Processing Flow:
┌─────────────────────────────────────────┐
│ 1. If outputFormat === 'html':          │
│    → mammoth.convertToHtml()            │
│      [Success] Return HTML              │
│      [Fail] Fallback to text            │
│                                         │
│ 2. Text mode (default):                 │
│    → officeparser (primary)             │
│    → mammoth.extractRawText (fallback)  │
│    → XML direct parsing (last resort)   │
└─────────────────────────────────────────┘
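The primary-then-fallback flow above can be modelled as an ordered chain of extractors. This is a sketch of the pattern only: `extractWithFallback` and `NamedExtractor` are hypothetical names, the real parsers are replaced by plain functions, and treating an empty result as a failure is an assumption.

```typescript
type Extractor = (data: Uint8Array) => Promise<string>;

interface NamedExtractor {
  name: string;
  run: Extractor;
}

// Tries each extractor in order and returns the first non-empty result.
async function extractWithFallback(
  data: Uint8Array,
  chain: NamedExtractor[]
): Promise<{ text: string; usedExtractor: string }> {
  const failures: string[] = [];
  for (const step of chain) {
    try {
      const text = await step.run(data);
      if (text.trim().length > 0) return { text, usedExtractor: step.name };
      failures.push(`${step.name}: empty result`);
    } catch (err) {
      // Record the failure and move on to the next extractor.
      failures.push(`${step.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All extractors failed (${failures.join("; ")})`);
}
```

In text mode the chain would hold three entries (primary parser, raw-text fallback, direct XML parsing); collecting the per-step failures keeps the final error message useful for debugging.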
Technology Stack
Core Libraries
- officeparser (v5.1.1) - Primary parser
- mammoth (v1.9.1) - DOCX processor
- exceljs (v4.4.0) - Excel handler
- pdf-parse (v1.1.1) - PDF fallback
- papaparse (v5.5.3) - CSV parser
Build & Quality
- TypeScript 5.8 (strict mode)
- Jest (80 tests passing)
- ESLint (TypeScript rules)
- Webpack bundling
- CommonJS modules
Security Features
| Feature | Details |
|---|---|
| Input Validation | Strict type & structure checks |
| XSS Protection | sanitize-html library |
| Path Traversal | File name sanitization |
| Memory Limits | 10K rows/sheet, 50MB default |
| Dependency Audit | Regular npm audit checks |
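The path-traversal and file-size rows could be enforced along the following lines. This is an illustrative sketch, not the node's implementation: `sanitizeFileName` and `assertSizeAllowed` are hypothetical names, the set of stripped characters is an assumption, and 1MB is taken to be 1024 × 1024 bytes.

```typescript
const DEFAULT_MAX_BYTES = 50 * 1024 * 1024; // documented default limit (50MB)
const HARD_MAX_BYTES = 100 * 1024 * 1024; // documented upper bound (100MB)

// Reduces a user-supplied name to a safe base name with no directory parts.
function sanitizeFileName(name: string): string {
  // Keep only the last path segment, for both "/" and "\" separators.
  const base = name.split(/[\\\/]/).pop() || "";
  // Drop control characters, reserved characters, and leading dots.
  const cleaned = base.replace(/[\u0000-\u001f<>:"|?*]/g, "").replace(/^\.+/, "");
  return cleaned.length > 0 ? cleaned : "unnamed";
}

// Throws when a file exceeds the configured limit (capped at the hard maximum).
function assertSizeAllowed(sizeBytes: number, maxBytes = DEFAULT_MAX_BYTES): void {
  const limit = Math.min(maxBytes, HARD_MAX_BYTES);
  if (sizeBytes > limit) {
    throw new RangeError(`File is ${sizeBytes} bytes; the limit is ${limit} bytes`);
  }
}
```

For example, `sanitizeFileName("../../etc/passwd")` keeps only `passwd`, so the name can never point outside the intended directory.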
Development
Quick Start
npm install
npm run dev
npm run build
npm test
npm run lint
Build Commands
| Command | Description |
|---|---|
| npm run build | TypeScript → JavaScript |
| npm run bundle | Webpack bundling |
| npm run standalone | Standalone with deps |
| npm run test:coverage | Coverage report |
| npm run lint:fix | Auto-fix issues |
Project Structure
├── src/
│   ├── FileToJsonNode.node.ts   # Main node (Strategy Pattern)
│   ├── helpers.ts               # Utilities
│   └── errors.ts                # Custom errors
├── test/
│   ├── unit/                    # Unit tests
│   ├── integration/             # Integration tests
│   └── samples/                 # Test files
├── docs/                        # Documentation
│   ├── SOLUTION.md
│   ├── HTML_CONVERSION_PLAN.md
│   └── MAMMOTH_ANALYSIS.md
└── dist/                        # Compiled output
Latest Updates
v1.0.22 (Current - 2025-10-10)
UI & Quality
- ✅ Node renamed: "Document Converter"
- ✅ Icon fixed: 60×60 (proper size)
- ✅ Code refactored: -78 lines
- ✅ Zero duplication: 100% eliminated
- ✅ Full error handling: PPTX fixed
Docs & Tests
- ✅ README redesign: Badges, TOC, tables
- ✅ 80 tests passing (+7 XLSX)
- ✅ Full JSDoc: All functions documented
- ✅ Better IntelliSense: IDE support improved
- ✅ Professional look: Visual tables & icons
What's New:
+ Node renamed to "Document Converter" (better UX)
+ Icon size fixed: 2048×1853 → 60×60
+ Code quality: eliminated all duplication
+ BaseConverterError class (DRY principle)
+ checkCFBFormat() helper (unified CFB check)
+ processViaOfficeParser() helper (unified error handling)
+ Full JSDoc documentation added
+ README complete visual redesign
+ 7 new XLSX multi-sheet tests
Previous Versions
v1.0.21 - DOCX to HTML Conversion
- DOCX to HTML conversion with table support
- outputFormat parameter (text | html)
- Table preservation in HTML
- AI/LLM friendly output
- 73 tests passing
v1.0.20 - TextBox & Shapes Support
- Extract text from TextBoxes and shapes
- ONLYOFFICE document fix
- 62 tests passing
v1.0.19 - ONLYOFFICE Parser Fix
- Fixed XML namespace extraction
- No more schema URLs in output
- 61 tests passing
Troubleshooting
Common Issues
Error: Cannot find module 'exceljs'
npm run standalone
cd ~/.n8n/custom-nodes/n8n-node-converter-documents
npm list
npm install
Large files causing OOM
- Split files into smaller parts
- Reduce the maxFileSize parameter
- Use streaming for CSV/TXT formats
Limitations
| Limitation | Details | Workaround |
|---|---|---|
| Legacy formats | DOC, PPT, XLS not supported | Convert to DOCX, PPTX, XLSX |
| Memory | Large PDF/XLSX files are loaded into RAM | Split files or increase memory |
| File size | Default 50MB limit | Configurable up to 100MB |
Statistics
- 12+ file formats supported
- 80 tests passing
- 5 specialized parsers
- 10K rows per sheet limit
- 100MB max file size
- 0 critical vulnerabilities
Contributing
Issues and pull requests are welcome!
License
MIT © mazix
Made with ❤️ for the n8n community
If you find this helpful, please ⭐ star the repository!
[1.0.22] - 2025-10-10
UI & Branding
- Node Renamed: "Convert File to JSON" → "Document Converter"
- Better reflects actual functionality (text, HTML, sheets)
- More intuitive for users
- Updated display name and defaults
Code Quality & Refactoring
FileToJsonNode.node.ts (Reduced by 78 lines):
- ✅ Eliminated CFB duplication: Created checkCFBFormat() helper (was duplicated in DOC/PPT)
- ✅ Unified error handling: Created processViaOfficeParser() for ODT/ODP/ODS (eliminated 3× duplication)
- ✅ Fixed PPTX error handling: Added proper error handling (was missing)
- ✅ Cleaner code: -78 lines without losing functionality
errors.ts (Enhanced with base class):
- ✅ DRY principle: Created BaseConverterError class
- ✅ Better stack traces: Added Error.captureStackTrace for debugging
- ✅ Full JSDoc: Documented all error classes
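A base error class of this kind might look roughly like the sketch below. Only the `BaseConverterError` name and the use of `Error.captureStackTrace` come from the changelog; the constructor shape, the `code` field, and the `UnsupportedFormatError` subclass are assumptions made for illustration.

```typescript
// Shared base class so subclasses do not repeat name and stack-trace setup.
class BaseConverterError extends Error {
  readonly code: string; // hypothetical machine-readable error code

  constructor(message: string, code: string) {
    super(message);
    this.code = code;
    this.name = new.target.name;
    // Restore the prototype chain in case the code is compiled to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
    // Error.captureStackTrace is V8-specific, so guard its use.
    const v8Error = Error as unknown as {
      captureStackTrace?: (target: object, ctor?: Function) => void;
    };
    if (typeof v8Error.captureStackTrace === "function") {
      v8Error.captureStackTrace(this, new.target);
    }
  }
}

// Hypothetical subclass: only the message and code differ.
class UnsupportedFormatError extends BaseConverterError {
  constructor(format: string) {
    super(`Unsupported file format: ${format}`, "UNSUPPORTED_FORMAT");
  }
}
```

Passing the constructor to `captureStackTrace` trims the error-class frames from the stack, so the trace starts at the code that raised the error.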
helpers.ts (Enhanced documentation):
- ✅ JSDoc added: Complete documentation with @param, @returns, @throws
- ✅ Usage examples: Added @example tags
- ✅ Better IntelliSense: IDE autocomplete improved
icon.svg (Fixed size):
- ✅ Correct dimensions: 2048×1853 → 60×60 (n8n standard)
- ✅ Better visibility: Icon now displays at proper size in n8n UI
Documentation
README.md (Complete redesign):
- ✅ Badges added: npm version, tests, license, TypeScript
- ✅ Table of Contents: Easy navigation with anchors
- ✅ Visual tables: 12 comparison and feature tables
- ✅ XLSX section: New multi-sheet processing documentation
- ✅ Collapsible details: Better organized examples
- ✅ Updated stats: 80 tests (was 73)
Testing
- 80 tests passing (+7 new XLSX tests)
- New file: test/integration/xlsx-sheets.test.ts
- Multi-sheet handling tests
- Column letter conversion tests
- Size limiting tests (10K rows/sheet)
- Sparse data handling tests
- Output format verification tests
Code Quality Metrics
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Code duplication | 3 places | 0 | ✅ 100% eliminated |
| Lines of code | 920 | 870 | -50 lines |
| Error handling coverage | Incomplete | 100% | ✅ Fixed PPTX |
| Documentation | Basic | Full JSDoc | ✅ Complete |
| Test coverage | 73 tests | 80 tests | +7 tests |
Impact
- Users: Better node naming, proper icon size
- Developers: Cleaner codebase, easier to maintain
- Documentation: Professional README with visual aids
- Quality: Zero code duplication, full error handling
Files Changed
package.json: Version bump to 1.0.22
src/FileToJsonNode.node.ts: -78 lines, +2 helper functions
src/errors.ts: Added BaseConverterError class
src/helpers.ts: Added full JSDoc documentation
src/icon.svg: Fixed dimensions (60×60)
README.md: Complete visual redesign
test/integration/xlsx-sheets.test.ts: +7 new tests
docs/README.md: Updated node name
docs/HTML_CONVERSION_PLAN.md: Updated node name