node-html-markdown
NHM is a fast HTML-to-Markdown converter, compatible with both Node.js and the browser.
It was built with the following two goals in mind:
1. Speed
We had a need to convert gigabytes of HTML daily very quickly. All libraries we found were too slow with node.
We considered using a low-level language but decided to attempt to write something that would squeeze every bit
of performance out of the JIT that we could. The end result was fast enough to make the cut!
2. Human Readability
The other libraries we tested produced output that broke under numerous conditions and contained many
repeated linefeeds, among other problems. Generally speaking, outside of a markdown viewer, the result was not easy to read.
We took the approach of producing a clean, concise result with consistent spacing rules.
Install
<yarn|npm|pnpm> add node-html-markdown
Benchmarks
-----------------------------------------------------------------------------
Estimated processing times (fastest to slowest):

  [node-html-markdown (reused instance)]
    100 kB:  17ms
    1 MB:    176ms
    50 MB:   8.80sec
    1 GB:    3min, 0sec
    50 GB:   2hr, 30min, 14sec

  [turndown (reused instance)]
    100 kB:  27ms
    1 MB:    280ms
    50 MB:   13.98sec
    1 GB:    4min, 46sec
    50 GB:   3hr, 58min, 35sec

-----------------------------------------------------------------------------
Speed comparison - node-html-markdown (reused instance) is:

  1.02 times as fast as node-html-markdown
  1.57 times as fast as turndown
  1.59 times as fast as turndown (reused instance)

-----------------------------------------------------------------------------
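The figures above look like linear extrapolations from a per-megabyte rate (for example, 176 ms per MB gives 50 × 176 ms = 8.80 s for 50 MB), and the comparison ratios are one time divided by another. The sketch below reproduces that arithmetic. It assumes linear scaling and 1 GB = 1024 MB, neither of which the benchmark output states, and the function names are made up for illustration. Small differences at the largest sizes would be expected from rounding in the published per-MB figures.

```typescript
// Time per 1 MB of input in milliseconds, copied from the benchmark table.
const msPerMb = {
  nhmReused: 176,
  turndownReused: 280,
};

// Linear extrapolation: assumes time grows in proportion to input size.
function estimateMs(perMbMs: number, sizeMb: number): number {
  return perMbMs * sizeMb;
}

// "X times as fast" is the slower time divided by the faster time.
function timesAsFast(fasterMs: number, slowerMs: number): number {
  return slowerMs / fasterMs;
}

// 50 MB and 1 GB (taken as 1024 MB) estimates for the reused NHM instance.
const fiftyMbSeconds = estimateMs(msPerMb.nhmReused, 50) / 1000;
const oneGbSeconds = estimateMs(msPerMb.nhmReused, 1024) / 1000;

// Ratio between the two reused-instance results.
const ratio = timesAsFast(msPerMb.nhmReused, msPerMb.turndownReused);

console.log(fiftyMbSeconds, oneGbSeconds, ratio.toFixed(2));
```

Under these assumptions the 50 MB estimate comes to 8.8 seconds, the 1 GB estimate to roughly 180 seconds (3 minutes), and the ratio to about 1.59, matching the published output.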
Usage
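import { NodeHtmlMarkdown, NodeHtmlMarkdownOptions } from 'node-html-markdown'

// Single use: for a one-off conversion, call the static method.

// Single file
NodeHtmlMarkdown.translate(
  /* html */ `<b>hello</b>`,
  /* options (optional) */ {},
  /* customTranslators (optional) */ undefined,
  /* customCodeBlockTranslators (optional) */ undefined
);

// Multiple files: pass an object that maps file names to HTML
NodeHtmlMarkdown.translate(
  {
    'file1.html': `<b>hello</b>`,
    'file2.html': `<b>goodbye</b>`
  },
  /* options (optional) */ {},
  /* customTranslators (optional) */ undefined,
  /* customCodeBlockTranslators (optional) */ undefined
);

// Re-use: when converting several times, create an instance once and reuse it.
const nhm = new NodeHtmlMarkdown(
  /* options (optional) */ {},
  /* customTranslators (optional) */ undefined,
  /* customCodeBlockTranslators (optional) */ undefined
);

// Single file
nhm.translate(`<b>hello</b>`);

// Multiple files
nhm.translate({
  'file1.html': `<b>hello</b>`,
  'file2.html': `<b>goodbye</b>`
});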
Options
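export interface NodeHtmlMarkdownOptions {
  preferNativeParser: boolean,              // use the native DOMParser when available
  codeFence: string,                        // fence string for fenced code blocks
  bulletMarker: string,                     // marker for unordered list items
  codeBlockStyle: 'indented' | 'fenced',    // how code blocks are written
  emDelimiter: string,                      // delimiter for emphasis
  strongDelimiter: string,                  // delimiter for strong text
  strikeDelimiter: string,                  // delimiter for strikethrough text
  ignore?: string[],                        // elements to ignore
  blockElements?: string[],                 // elements to treat as blocks
  maxConsecutiveNewlines: number,           // maximum consecutive newlines allowed in the output
  lineStartEscape: [ pattern: RegExp, replacement: string ],    // escape applied at line starts
  globalEscape: [ pattern: RegExp, replacement: string ],       // escape applied globally
  textReplace?: [ pattern: RegExp, replacement: string ][],     // user-defined text replacements
  keepDataImages?: boolean,                 // keep images that use data: URIs
  useLinkReferenceDefinitions?: boolean,    // write links as link reference definitions
  useInlineLinks?: boolean                  // write a link as <url> when its text is the URL itself
}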
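As a sketch of how these options might be passed, the fragment below overrides a few of them when creating an instance. It uses only option names from the interface above and the constructor form shown under Usage. The specific values are arbitrary choices for illustration, and it assumes that any option left out keeps its default (as the empty `{}` in the Usage examples suggests).

```typescript
import { NodeHtmlMarkdown } from 'node-html-markdown'

// Values here are illustrative choices, not recommended settings.
const nhm = new NodeHtmlMarkdown({
  bulletMarker: '-',
  strongDelimiter: '__',
  codeBlockStyle: 'fenced',
  maxConsecutiveNewlines: 2,
  textReplace: [
    // Each entry is a [ pattern, replacement ] tuple.
    // This one turns non-breaking spaces into ordinary spaces.
    [ /\u00a0/g, ' ' ],
  ],
});

// The configured instance is then used as in the Usage section.
const markdown = nhm.translate(`<b>hello</b>`);
```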
Custom Translators
Custom translators are an advanced option that lets you control how specific elements are converted.
They can be modified via the NodeHtmlMarkdown#translators property, or supplied when the instance is created.
For details on how to use them, see the translator type definitions and the default translators in the project's source.
The NodeHtmlMarkdown#codeBlockTranslators property is a collection of translators that handle elements within a <pre><code> block.
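To make the idea concrete, here is a minimal, self-contained sketch of a tag-keyed translator table whose entries can be supplied at creation or changed afterwards. It does not use node-html-markdown: the `SketchTranslator` type, the `SketchConverter` class, and the `prefix`/`postfix` fields are hypothetical stand-ins chosen for illustration, so consult the library's own types for the real configuration shape.

```typescript
// Hypothetical stand-in for a translator: text wrapped around an element's content.
interface SketchTranslator {
  prefix: string;
  postfix: string;
}

// Hypothetical converter. It only handles flat, attribute-free elements
// such as <b>text</b>, which is enough to show the lookup-table idea.
class SketchConverter {
  // Tag-keyed table, similar in spirit to a `translators` property.
  readonly translators = new Map<string, SketchTranslator>([
    ['b', { prefix: '**', postfix: '**' }],
    ['i', { prefix: '_', postfix: '_' }],
  ]);

  // Custom translators passed at creation override or extend the defaults.
  constructor(custom: Record<string, SketchTranslator> = {}) {
    for (const [tag, translator] of Object.entries(custom)) {
      this.translators.set(tag.toLowerCase(), translator);
    }
  }

  translate(html: string): string {
    // Match <tag>content</tag> where content contains no further markup.
    return html.replace(
      /<([a-z0-9]+)>([^<]*)<\/\1>/gi,
      (_match: string, tag: string, content: string) => {
        const translator = this.translators.get(tag.toLowerCase());
        // Tags without a translator fall back to their plain text content.
        return translator ? translator.prefix + content + translator.postfix : content;
      }
    );
  }
}

// Add a translator for <mark> at creation time.
const converter = new SketchConverter({ mark: { prefix: '==', postfix: '==' } });

// Replace an existing translator after creation.
converter.translators.set('i', { prefix: '*', postfix: '*' });

const sketchOutput = converter.translate('<b>bold</b> <i>em</i> <mark>note</mark>');
console.log(sketchOutput);
```

The design point is that element handling lives in data (the table) rather than in the conversion loop, so a caller can change how one tag is rendered without touching the rest.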
Further improvements
Being a performance-centric library, we're always interested in further improvements.
There are several probable routes by which we could gain substantial performance increases over the current model.
Such methods include:
- Writing a custom parser
- Integrating an async worker-thread based model for multi-threading
- Fully replacing any remaining regex
These would be fun to implement; however, for the time being, the present library is fast enough for my purposes. That
said, I welcome discussion and any PR toward the effort of further improving performance, and I may ultimately do more
work in that capacity in the future!
Help Wanted!
Looking to contribute? Check out our help wanted list for a good place to start!