@ulu/pandoc-adapter
A Node.js package that provides utilities for converting documents using Pandoc, with a focus on processing multiple files and managing extracted assets with transformFiles.
Important: This package acts as a wrapper around the Pandoc command-line tool. Therefore, you must have the Pandoc binary installed and accessible in your system's PATH for this package to function correctly. This package itself does not install Pandoc.
Installation
npm install @ulu/pandoc-adapter
Primary Functions
pandoc(config)
Executes the Pandoc binary for document conversion.
import { pandoc } from '@ulu/pandoc-adapter';
import fs from 'fs';
import path from 'path';
async function convertDocxToHtml(inputFilePath, outputPath) {
try {
const inputFileContent = fs.readFileSync(inputFilePath);
const html = await pandoc({
input: inputFileContent,
args: ['--from=docx', '--to=html', '-o', outputPath]
});
console.log('Successfully converted to HTML:', outputPath);
} catch (error) {
console.error('Pandoc conversion failed:', error);
}
}
const inputFilePath = 'path/to/your/input.docx';
const outputFilePath = 'output.html';
config
Object Properties:
input
(string, optional): The input content to be processed by Pandoc. This can be a string which will be piped to Pandoc's stdin. If arguments to Pandoc specify an input file or URL, this can be an empty string.
args
(string[], optional): An array of command-line arguments to pass to the Pandoc binary (e.g., ['-f', 'markdown', '-t', 'html', 'input.md', '-o', 'output.html']
).
options
(object, optional): Additional options.
allowError
(boolean, optional): If true
, resolves with stdout even on Pandoc error. Defaults to false
.
allowStdoutError
(boolean, optional): If true
, resolves with stdout even if Pandoc writes to stderr. Defaults to false
.
execFile
(object, optional): Options passed directly to the child_process.execFile
function.
maxBuffer
(number, optional): The maximum amount of data (in bytes) allowed on stdout or stderr. Defaults to 5242880
(5MB).
Returns: A Promise
that resolves with the stdout from the Pandoc binary.
transformFiles(userOptions)
Transforms multiple files based on a glob pattern, converting them to both HTML and Markdown. It handles output directory creation, asset extraction, and optional URL updating.
When updateAssetUrls
is enabled, it will replace the extracted asset path (assetDir
) out of the images/etc and replacing it with assetPublicPath
so that it can be used in a website. This assetPublicPath
would be a root-relative path like "/assets/extracted" so that anywhere an image is used (ie. "/tools.html" vs /some/other/page.html") it would work (not relative to page).
import path from "path";
import { getUrlDirname } from "@ulu/utils/node/path.js";
import { transformFiles } from "@ulu/pandoc-adapter";
const __dirname = getUrlDirname(import.meta.url);
const options = {
pattern: '*.docx',
inputDir: path.resolve(__dirname, "docx/"),
outputDir: path.resolve(__dirname, "dist/markup/"),
assetDir: path.resolve(__dirname, "dist/assets/"),
};
(async () => {
try {
await transformFiles(options);
console.log('Successfully processed files!');
} catch (error) {
console.error('File transformation failed:', error);
}
})();
userOptions
Object (extends transformFilesDefaults
):
inputDir
(string, required): Absolute path to the input directory.
outputDir
(string, required): Absolute path to the output directory.
assetDir
(string, required): Absolute path to the directory for extracted assets.
assetPublicPath
(string, optional): Public URL path for assets (used for updating URLs in output). Defaults to "/assets/extracted"
.
pattern
(string, optional): Glob pattern to select input files. Defaults to "[!~]?*.docx"
.
emptyOutputDir
(boolean, optional): If true
, empties the output directory before processing. Defaults to true
.
emptyAssetDir
(boolean, optional): If true
, empties the asset directory before processing. Defaults to true
.
updateAssetUrls
(boolean, optional): If true
, updates asset paths in the output to assetPublicPath
. Defaults to true
.
adapterOptions
(object, optional): Options passed to the underlying pandoc
function.
allowError
(boolean, optional): Allow errors from pandoc.
allowStdoutError
(boolean, optional): Allow errors from pandoc stdout.
execFile
(object, optional): Options passed directly to the child_process.execFile
function.
maxBuffer
(number, optional): Maximum buffer size.
getFileOutputPath(ctx)
(function, optional): Function to determine the output file path.
getFileOutputDir(ctx)
(function, optional): Function to determine the output directory for a file.
getFileAssetDir(ctx)
(function, optional): Function to determine the asset extraction directory for a file.
getHtmlArgs(ctx)
(function, optional): Function to generate Pandoc arguments for HTML conversion.
getMarkdownArgs(ctx)
(function, optional): Function to generate Pandoc arguments for Markdown conversion.
beforeWrite(markup, ctx)
(function, optional): Function to modify the output (HTML or Markdown) before writing to disk.
Returns: A Promise
that resolves when all files are processed.
transformFilesDefaults
An object containing the default options for the transformFiles
function.
import { transformFilesDefaults } from '@ulu/pandoc-adapter';
console.log(transformFilesDefaults);
Exported Utilities (utils
)
htmlRemoveImageStyles(html)
: Removes style attributes from <img>
tags in HTML.
markdownRemoveImageDimensions(markdown)
: Removes width and height attributes from <img>
tags in Markdown.
cleanHtml(html)
: Removes image styles from HTML and then pretty-formats it.
urlize(string)
: Converts a string to a URL-friendly slug.
Exported Presets (presets
)
An object containing predefined Pandoc argument arrays for common conversions.
import { presets } from '@ulu/pandoc-adapter';
console.log(presets.docxToMarkdownPandoc);
Available presets:
docxToMarkdownPandoc
docxToMarkdown
docxToHtml
docxToIcml
Full API Documentation
API Documentation
Change Log
Change Log