Agent Markdown
An accurate, extensible, and fast HTML-to-markdown converter.
Agent Markdown is an HTML user agent that parses HTML, lays out the document according to the CSS stylesheet for HTML, and then "renders" the laid-out document to Markdown. The resulting Markdown looks very similar to how the HTML document looks when parsed and rendered in a browser (itself a user agent).
Usage / Quick Start
import { AgentMarkdown } from "agentmarkdown"
const markdownString = await AgentMarkdown.produce(htmlString)
Install
npm (npm install agentmarkdown)
Features
CLI Example
You can convert any HTML file to Markdown at the command line using the following command, and the markdown output will be printed to stdout:
agentmarkdown <filename.html>
It also reads from stdin if you pipe HTML to it, so you can do things like:
echo "<b>bold</b>" | agentmarkdown > myfile.md
The above commands assume you installed agentmarkdown with npm install --global agentmarkdown, but it also works with npx, so you can run it without installing like:
npx agentmarkdown <filename.html>
Live Example
You can view the live online web example at https://agentmarkdown.now.sh.
You can build and run the web example locally with the following commands:
cd example/
npm install
npm run start
NOTE: If you have trouble starting the example on macOS related to fsevents errors, it may require running xcode-select --install. If that doesn't work, then possibly a sudo rm -rf $(xcode-select -print-path) followed by xcode-select --install will be necessary.
Customize & Extend with Plugins
To customize how the markdown is generated or add support for new elements, implement the LayoutPlugin interface to handle a particular HTML element. The LayoutPlugin interface is defined as follows:
export interface LayoutPlugin {
  elementName: string
  layout: LayoutGenerator
}
The LayoutGenerator is a single function that performs a CSS2 box generation layout algorithm on an HTML element. Essentially, it creates zero or more boxes for the given element that AgentMarkdown will render to text. A box can contain text content and/or other boxes, and each box has a type of inline or block. Inline boxes are laid out horizontally. Block boxes are laid out vertically (i.e. they have new line characters before and after their contents). The LayoutGenerator function definition is as follows:
export interface LayoutGenerator {
  (
    context: LayoutContext,
    manager: LayoutManager,
    element: HtmlNode
  ): CssBox | null
}
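To make the inline/block distinction concrete, here is a minimal, self-contained sketch of a box tree being rendered to text. It is a toy model under stated assumptions, not agentmarkdown's implementation: the Box type and the box, renderBox, and renderDocument functions are hypothetical names invented for this illustration, and collapsing runs of newlines into one is a simplification.

```typescript
// Hypothetical, simplified model of the box concept; not agentmarkdown's real types.
type BoxKind = "inline" | "block"

interface Box {
  kind: BoxKind
  text: string
  children: Box[]
}

// Illustrative helper for building a box (loosely mirrors the idea of createBox).
function box(kind: BoxKind, text = "", children: Box[] = []): Box {
  return { kind, text, children }
}

// Render a box and its children to text. Inline boxes are appended to the
// current line; block boxes get a new line before and after their contents.
function renderBox(b: Box): string {
  const inner = b.text + b.children.map(renderBox).join("")
  return b.kind === "block" ? "\n" + inner + "\n" : inner
}

// Collapse the repeated newlines produced by adjacent or nested blocks into a
// single newline, and trim the ends (a simplification for this sketch).
function renderDocument(root: Box): string {
  return renderBox(root).replace(/\n{2,}/g, "\n").trim()
}

const doc = box("block", "", [
  box("block", "", [
    box("inline", "Hello "),
    box("inline", "", [box("inline", "**"), box("inline", "world"), box("inline", "**")]),
  ]),
  box("block", "", [box("inline", "Second paragraph")]),
])

const rendered = renderDocument(doc)
// rendered === "Hello **world**\nSecond paragraph"
```

The nested inline box in the first block has the same shape that a bold plugin would produce: two "**" delimiter boxes surrounding the element's children.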
For example, the HTML <b> element could be implemented as a plugin like the following:
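class BoldPlugin implements LayoutPlugin {
  // The HTML element this plugin handles.
  elementName = "b"

  layout: LayoutGenerator = (
    context: LayoutContext,
    manager: LayoutManager,
    element: HtmlNode
  ): CssBox | null => {
    // Lay out the element's children first, then wrap them in "**" delimiters.
    const kids = manager.layout(context, element.children)
    kids.unshift(manager.createBox(BoxType.inline, "**"))
    kids.push(manager.createBox(BoxType.inline, "**"))
    // Return a single inline box containing the delimiters and the children.
    return manager.createBox(BoxType.inline, "", kids)
  }
}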
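The same wrap-the-children pattern could be reused for other inline elements. The sketch below shows one possible way to do that with a small factory. Everything declared here is an assumption for illustration: the types are local stand-ins reduced to the members the bold example uses (they are not agentmarkdown's real declarations), and delimiterPlugin is a hypothetical helper, not part of the library.

```typescript
// Local stand-ins for the library's types, reduced to what the bold example
// uses. These are assumptions for illustration only.
const BoxType = { inline: "inline", block: "block" } as const
type BoxTypeValue = (typeof BoxType)[keyof typeof BoxType]

interface CssBox {
  type: BoxTypeValue
  textContent: string
  children: CssBox[]
}
interface HtmlNode {
  children: HtmlNode[]
}
type LayoutContext = object
interface LayoutManager {
  layout(context: LayoutContext, elements: HtmlNode[]): CssBox[]
  createBox(type: BoxTypeValue, textContent: string, children?: CssBox[]): CssBox
}
interface LayoutPlugin {
  elementName: string
  layout(context: LayoutContext, manager: LayoutManager, element: HtmlNode): CssBox | null
}

// Hypothetical factory: builds a plugin that wraps an element's laid-out
// children in the given delimiter, exactly as the bold example does with "**".
function delimiterPlugin(elementName: string, delimiter: string): LayoutPlugin {
  return {
    elementName,
    layout: (context, manager, element) => {
      const kids = manager.layout(context, element.children)
      kids.unshift(manager.createBox(BoxType.inline, delimiter))
      kids.push(manager.createBox(BoxType.inline, delimiter))
      return manager.createBox(BoxType.inline, "", kids)
    },
  }
}

const italicPlugin = delimiterPlugin("i", "_")
const strikePlugin = delimiterPlugin("del", "~~")
```

A factory like this keeps each element's plugin to a single line, at the cost of only covering elements whose Markdown form is a symmetric delimiter.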
To customize the rendering of an element, specify a plugin for its elementName and your plugin will override the built-in plugin. To initialize AgentMarkdown with plugins, pass them in as an array value for the layoutPlugins option as follows:
const result = await AgentMarkdown.render({
  html: myHtmlString,
  layoutPlugins: [
    new BoldPlugin()
  ]
})
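One way to picture the override behavior is a lookup table keyed by element name, where custom plugins are registered after the built-in ones. The sketch below is a guess at the general idea, not a description of agentmarkdown's internals: SketchPlugin and buildRegistry are hypothetical names, and the layout function is reduced to a string transform to keep the example self-contained.

```typescript
// Hypothetical, simplified plugin shape. The real LayoutPlugin's layout takes
// (context, manager, element); here it just maps inner text to output text.
interface SketchPlugin {
  elementName: string
  layout: (innerText: string) => string
}

// Build a lookup keyed by element name. Custom plugins are added last, so a
// custom plugin replaces a built-in plugin that has the same elementName.
function buildRegistry(
  builtIns: SketchPlugin[],
  custom: SketchPlugin[]
): Map<string, SketchPlugin> {
  const registry = new Map<string, SketchPlugin>()
  for (const plugin of builtIns.concat(custom)) {
    registry.set(plugin.elementName, plugin)
  }
  return registry
}

const builtInPlugins: SketchPlugin[] = [
  { elementName: "b", layout: (t) => `**${t}**` },
  { elementName: "i", layout: (t) => `_${t}_` },
]
const customPlugins: SketchPlugin[] = [
  { elementName: "b", layout: (t) => `__${t}__` },
]

const pluginRegistry = buildRegistry(builtInPlugins, customPlugins)
const boldOut = pluginRegistry.get("b")!.layout("bold") // "__bold__" (custom wins)
const italicOut = pluginRegistry.get("i")!.layout("text") // "_text_" (built-in kept)
```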
Show your support
Please give a ⭐️ if this project helped you!
Contributing 🤝
This is a community project. We invite your participation through issues and pull requests! You can peruse the contributing guidelines.
Building
The package is written in TypeScript. To build the package run the following from the root of the repo:
npm run build # It will be built in /dist
Release Process (Deploying to NPM) 🚀
We use semantic-release to consistently release semver-compatible versions. This project deploys to multiple npm distribution tags. Each of the branches below corresponds to an npm distribution tag:
| branch | npm distribution tag |
|---|---|
| master | latest |
| beta | beta |
To trigger a release, use a Conventional Commit that follows the Angular Commit Message Conventions on one of the above branches.
Todo / Roadmap
See /docs/todo.md.
Alternatives
License 📝
Copyright © 2019 Scott Willeke.
This project is licensed via Mozilla Public License 2.0.