Mammoth is an npm package designed to convert .docx documents into HTML and plain text. It focuses on generating clean, simple HTML that is free from inline styles and other unnecessary elements.
Convert .docx to HTML
This feature allows you to convert a .docx file to HTML. The resulting HTML is clean and free from unnecessary inline styles.
const mammoth = require('mammoth');
mammoth.convertToHtml({path: 'path/to/document.docx'})
    .then(function(result) {
        var html = result.value; // The generated HTML
        var messages = result.messages; // Any messages, such as warnings during conversion
        console.log(html);
    })
    .catch(function(err) {
        console.error(err);
    });
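The messages array is easy to overlook, so here is a minimal sketch of one way to summarise it. The helper name groupMessages and the sample data are hypothetical, and the sketch assumes each message is an object with a type string (such as "warning") and a message string. It runs on a stand-in result object, so no .docx file is needed.

```javascript
// Hypothetical helper: groups conversion messages by their `type` field.
// Assumes each message looks like {type: "warning", message: "..."}.
function groupMessages(result) {
    const groups = {};
    (result.messages || []).forEach(function(entry) {
        const type = entry.type || "unknown";
        if (!groups[type]) {
            groups[type] = [];
        }
        groups[type].push(entry.message);
    });
    return groups;
}

// Stand-in for a real conversion result (illustrative data only).
const sampleResult = {
    value: "<p>Hello</p>",
    messages: [
        {type: "warning", message: "first example warning"},
        {type: "warning", message: "second example warning"}
    ]
};

const grouped = groupMessages(sampleResult);
Object.keys(grouped).forEach(function(type) {
    console.log(type + ": " + grouped[type].length);
});
```

In real use, the same helper could be called inside the .then callback with the result that convertToHtml resolves to.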
Convert .docx to plain text
This feature allows you to extract plain text from a .docx file. It is useful for scenarios where you need the text content without any formatting.
const mammoth = require('mammoth');
mammoth.extractRawText({path: 'path/to/document.docx'})
    .then(function(result) {
        var text = result.value; // The extracted raw text
        var messages = result.messages; // Any messages, such as warnings during conversion
        console.log(text);
    })
    .catch(function(err) {
        console.error(err);
    });
Convert .docx to Markdown
This feature allows you to convert a .docx file to Markdown. The resulting Markdown is clean and easy to read.
const mammoth = require('mammoth');
mammoth.convertToMarkdown({path: 'path/to/document.docx'})
    .then(function(result) {
        var markdown = result.value; // The generated Markdown
        var messages = result.messages; // Any messages, such as warnings during conversion
        console.log(markdown);
    })
    .catch(function(err) {
        console.error(err);
    });
Docxtemplater is a library for generating .docx documents from templates. Unlike Mammoth, which focuses on converting .docx files to other formats, Docxtemplater is used for creating .docx files by filling in templates with data.
Officegen is a library for generating .docx, .xlsx, and .pptx files. It is more versatile than Mammoth in terms of the types of documents it can create, but it does not offer conversion features like Mammoth.
Unoconv is a command-line tool that uses LibreOffice to convert between different office document formats, including .docx to HTML. It supports more formats than Mammoth but requires LibreOffice to be installed, whereas Mammoth is a pure JavaScript solution.
Mammoth is designed to convert .docx documents, such as those created by Microsoft Word, to HTML. Mammoth aims to produce simple and clean HTML by using semantic information in the document and ignoring other details. For instance, Mammoth converts any paragraph with the style Heading1 to an h1 element, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading.
There's a large mismatch between the structure used by .docx and the structure of HTML, meaning that the conversion is unlikely to be perfect for more complicated documents. Mammoth works best if you only use styles to semantically mark up your document.
npm install mammoth
To convert an existing .docx file to HTML, use mammoth.convertToHtml:
var mammoth = require("mammoth");
// convertToHtml returns a promise, so read the result inside then()
mammoth.convertToHtml({path: "path/to/document.docx"})
    .then(function(result) {
        var html = result.value; // The generated HTML
        var messages = result.messages; // Any messages, such as warnings during conversion
    });
By default, Mammoth maps some common .docx styles to HTML elements. For instance, a paragraph with the style Heading1 is converted to an h1 element. You can pass in a custom map for styles by passing an options object as a second argument to convertToHtml:
var mammoth = require("mammoth");
var style = mammoth.style;
var options = {
    styleMap: [
        style("p.Heading1 => h1"),
        style("p.Heading2 => h2")
    ]
};
var result = mammoth.convertToHtml({path: "path/to/document.docx"}, options);
To extend the standard style map:
var mammoth = require("mammoth");
var style = mammoth.style;
var customStyles = [
    style("p.AsideHeading => div.aside > h2:fresh"),
    style("p.AsideText => div.aside > p:fresh")
];
var options = {
    // Custom styles come first so that they take precedence over the standard ones
    styleMap: customStyles.concat(mammoth.standardOptions.styleMap)
};
var result = mammoth.convertToHtml({path: "path/to/document.docx"}, options);
A style has two parts: the document element matcher, on the left of the arrow, and the HTML path, on the right.
When converting each paragraph, Mammoth finds the first style where the document element matcher matches the current paragraph. Mammoth then ensures the HTML path is satisfied.
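The first-match rule can be sketched with a toy model. This is not Mammoth's internal code: the object shapes (kind, styleName, path) and the function names matches and findStyle are assumptions made for illustration. It shows why the order of entries in a style map matters, and why custom styles are placed before the standard ones when extending the map.

```javascript
// Toy model of first-match style lookup; not Mammoth's implementation.
// A matcher without a styleName matches any element of that kind.
function matches(matcher, element) {
    if (matcher.kind !== element.kind) {
        return false;
    }
    return matcher.styleName === undefined || matcher.styleName === element.styleName;
}

// Returns the first style whose matcher matches, or null if none does.
function findStyle(styles, element) {
    for (let i = 0; i < styles.length; i++) {
        if (matches(styles[i].matcher, element)) {
            return styles[i];
        }
    }
    return null;
}

// Hypothetical style list: the specific style is listed before the catch-all.
const styles = [
    {matcher: {kind: "p", styleName: "Heading1"}, path: "h1:fresh"},
    {matcher: {kind: "p"}, path: "p:fresh"}
];

console.log(findStyle(styles, {kind: "p", styleName: "Heading1"}).path);
console.log(findStyle(styles, {kind: "p", styleName: "BodyText"}).path);
console.log(findStyle(styles, {kind: "r", styleName: "Strong"}));
```

If the catch-all p entry were listed first, it would shadow the Heading1 entry, since the search stops at the first match.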
When writing styles, it's helpful to understand Mammoth's notion of freshness. When generating HTML, Mammoth will only close an HTML element when necessary. Otherwise, elements are reused.
For instance, suppose one of the specified styles is p.Heading1 => h1. If Mammoth encounters a .docx paragraph with the style Heading1, the .docx paragraph is converted to an h1 element with the same text. If the next .docx paragraph also has the style Heading1, then the text of that paragraph will be appended to the existing h1 element, rather than creating a new h1 element. In most cases, you'll probably want to generate a new h1 element instead. You can specify this by using the :fresh modifier:
p.Heading1 => h1:fresh
The two consecutive Heading1 .docx paragraphs will then be converted to two separate h1 elements.
Reusing elements is useful in generating more complicated HTML structures. For instance, suppose your .docx contains asides. Each aside might have a heading and some body text, which should be contained within a single div.aside element. In this case, styles similar to p.AsideHeading => div.aside > h2:fresh and p.AsideText => div.aside > p:fresh might be helpful.
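The reuse and freshness behaviour described above can be simulated with a small toy renderer. This is a sketch under stated assumptions, not Mammoth's actual generator: the paragraph and path object shapes and the names render, sameElement, and openTag are hypothetical, and text is not HTML-escaped. Each paragraph carries the HTML path its style maps to. Open elements are kept while they match the start of the next path and are not required to be fresh.

```javascript
// Toy model of element reuse and :fresh; not Mammoth's implementation.
function sameElement(a, b) {
    return a.tag === b.tag && a.className === b.className;
}

function openTag(element) {
    return element.className
        ? "<" + element.tag + " class=\"" + element.className + "\">"
        : "<" + element.tag + ">";
}

function render(paragraphs) {
    const open = []; // currently open elements, outermost first
    let html = "";
    function closeTo(depth) {
        while (open.length > depth) {
            html += "</" + open.pop().tag + ">";
        }
    }
    paragraphs.forEach(function(paragraph) {
        const path = paragraph.path;
        // Count how many open elements can be reused for this paragraph.
        let keep = 0;
        while (keep < open.length && keep < path.length &&
                !path[keep].fresh && sameElement(open[keep], path[keep])) {
            keep++;
        }
        closeTo(keep); // close only what cannot be reused
        path.slice(keep).forEach(function(element) {
            html += openTag(element);
            open.push(element);
        });
        html += paragraph.text;
    });
    closeTo(0);
    return html;
}

const aside = {tag: "div", className: "aside", fresh: false};
const asideHtml = render([
    {text: "Note", path: [aside, {tag: "h2", fresh: true}]},
    {text: "Body", path: [aside, {tag: "p", fresh: true}]}
]);
console.log(asideHtml);
```

In this model, two consecutive paragraphs mapped to a non-fresh h1 end up in one h1 element, while marking the h1 as fresh yields two separate elements.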
Match any paragraph:
p
Match any run:
r
To match a paragraph or run with a specific style name, append a dot followed by the style name. For instance, to match a paragraph with the style Heading1:
p.Heading1
The simplest HTML path is to specify a single element. For instance, to specify an h1 element:
h1
To give an element a CSS class, append a dot followed by the name of the class:
h1.section-title
To require that an element is fresh, use :fresh:
h1:fresh
Modifiers must be used in the correct order:
h1.section-title:fresh
Use > to specify nested elements. For instance, to specify h2 within div.aside:
div.aside > h2
You can nest elements to any depth.
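To make the syntax above concrete, here is a minimal sketch of a parser for style strings such as p.AsideHeading => div.aside > h2:fresh. It is an illustration only and does not reflect how Mammoth parses styles. The function names and the returned object shapes are hypothetical, and the grammar is limited to what this section describes: p or r with an optional style name, then elements with optional classes followed by an optional :fresh.

```javascript
// Illustrative parser for the style syntax described above; not Mammoth's own.
function parseMatcher(text) {
    const match = /^(p|r)(?:\.([A-Za-z0-9_-]+))?$/.exec(text.trim());
    if (!match) {
        throw new Error("Cannot parse matcher: " + text.trim());
    }
    return {
        kind: match[1] === "p" ? "paragraph" : "run",
        styleName: match[2] || null
    };
}

// Classes must come before :fresh, matching the required modifier order.
function parsePathElement(text) {
    const part = text.trim();
    const match = /^([A-Za-z][A-Za-z0-9]*)((?:\.[A-Za-z0-9_-]+)*)(:fresh)?$/.exec(part);
    if (!match) {
        throw new Error("Cannot parse path element: " + part);
    }
    return {
        tag: match[1],
        classNames: match[2] ? match[2].slice(1).split(".") : [],
        fresh: match[3] === ":fresh"
    };
}

function parseStyle(text) {
    const sides = text.split("=>");
    if (sides.length !== 2) {
        throw new Error("Expected exactly one '=>' in style: " + text);
    }
    return {
        matcher: parseMatcher(sides[0]),
        path: sides[1].split(">").map(function(part) {
            return parsePathElement(part);
        })
    };
}

const parsed = parseStyle("p.AsideHeading => div.aside > h2:fresh");
console.log(JSON.stringify(parsed, null, 2));
```

A path written in the wrong order, such as h1:fresh.section-title, is rejected by this sketch with an error.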
FAQs
Convert Word documents from docx to simple HTML and Markdown
The npm package mammoth receives a total of 284,384 weekly downloads. As such, its popularity is classified as popular.
We found that mammoth demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.