
html-to-text - npm Package Compare versions

Comparing version 7.1.1 to 8.0.0

test/test-multiple-elements.txt

4

.eslintrc.js

@@ -10,3 +10,3 @@ module.exports = {

],
parserOptions: {},
parserOptions: { ecmaVersion: 2018 },
env: {

@@ -104,3 +104,3 @@ es6: true,

'semi-style': 'error',
'sort-keys': ['error', 'asc', { minKeys: 3 }],
'sort-keys': ['error', 'asc', { minKeys: 4 }],
'space-before-blocks': 'error',

@@ -107,0 +107,0 @@ 'space-before-function-paren': ['error'],

# Changelog
## Version 8.0.0
All commits: [7.1.1...8.0.0](https://github.com/html-to-text/node-html-to-text/compare/7.1.1...8.0.0)
Version 8 roadmap issue: [#228](https://github.com/html-to-text/node-html-to-text/issues/228)
### Selectors
The main focus of this version. Addresses the most demanded user requests ([#159](https://github.com/html-to-text/node-html-to-text/issues/159), [#179](https://github.com/html-to-text/node-html-to-text/issues/179), partially [#143](https://github.com/html-to-text/node-html-to-text/issues/143)).
It is now possible to specify formatting options or assign custom formatters not only by tag names but by almost any selectors.
See the README [Selectors](https://github.com/html-to-text/node-html-to-text#selectors) section for details.
Note: The new `selectors` option is an array, in contrast to the `tags` option introduced in version 6 (and now deprecated). Selectors have to have a well-defined order, and object properties are not the right tool for that.
Two new packages were created to enable this feature - [parseley](https://github.com/mxxii/parseley) and [selderee](https://github.com/mxxii/selderee).
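
Because `selectors` is an array, repeated entries for the same selector can be combined in a predictable order. The following is a simplified sketch of the "last defined wins" merge that the compile step performs. `mergeSelectorsPreferLast` is a hypothetical name, and it only merges one level of `options`, whereas the library merges recursively with `deepmerge`.

```js
// Simplified stand-in for the library's duplicate merging (assumption:
// only top-level properties and one level of `options` are merged here).
function mergeSelectorsPreferLast (selectors) {
  const bySelector = new Map();
  // Walk backwards so that later definitions are seen first and win.
  for (let i = selectors.length - 1; i >= 0; i--) {
    const item = selectors[i];
    const later = bySelector.get(item.selector);
    bySelector.set(
      item.selector,
      later
        ? { ...item, ...later, options: { ...item.options, ...later.options } }
        : item
    );
  }
  // The merged entry takes the position of the last duplicate.
  return [...bySelector.values()].reverse();
}

const merged = mergeSelectorsPreferLast([
  { selector: 'a', format: 'anchor', options: { baseUrl: null, ignoreHref: false } },
  { selector: 'p', format: 'paragraph' },
  { selector: 'a', options: { baseUrl: 'https://example.com' } }
]);
console.log(merged.map(s => s.selector)); // [ 'p', 'a' ]
console.log(merged[1].options.baseUrl);   // https://example.com
```

The entry for `a` keeps its `format` from the first definition and picks up `baseUrl` from the last one.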
### Base elements
The same selectors implementation is used now to narrow down the conversion to specific HTML DOM fragments. Addresses [#96](https://github.com/html-to-text/node-html-to-text/issues/96). (Previous implementation had more limited selectors format.)
BREAKING CHANGE: All outermost elements matching provided selectors will be present in the output (previously it was only the first match for each selector). Addresses [#215](https://github.com/html-to-text/node-html-to-text/issues/215).
`limits.maxBaseElements` can be used when you only need a fixed number of base elements and would like to avoid checking the rest of the source HTML document.
Base elements can be arranged in the output text in the order of matched selectors (default, to keep it closer to the old implementation) or in the order of appearance in the source HTML document.
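
The two orderings can be illustrated with a small sketch on toy data. `orderBases` is a hypothetical helper, and the string "elements" stand in for DOM nodes; `selectorIndex` is assumed to be the 1-based position of the matched selector in `baseElements.selectors`.

```js
// Toy illustration of `baseElements.orderBy` (not the library's code).
// `matches` is assumed to be listed in document order.
function orderBases (matches, orderBy = 'selectors') {
  const results = [...matches];
  if (orderBy !== 'occurrence') {
    // Stable sort: matches of the same selector keep document order.
    results.sort((a, b) => a.selectorIndex - b.selectorIndex);
  }
  return results.map(x => x.element);
}

const matches = [
  { selectorIndex: 2, element: 'header' },
  { selectorIndex: 1, element: 'first article' },
  { selectorIndex: 1, element: 'second article' }
];
console.log(orderBases(matches, 'selectors'));
// [ 'first article', 'second article', 'header' ]
console.log(orderBases(matches, 'occurrence'));
// [ 'header', 'first article', 'second article' ]
```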
BREAKING CHANGE: previous implementation was treating id selectors in the same way as class selectors (could match `<foo id="a b">` with `foo#a` selector). New implementation is closer to the spec and doesn't expect multiple ids on an element. You can achieve the old behavior with `foo[id~=a]` selector in case you rely on it for some poorly formatted documents (note that it has different specificity though).
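
The difference between the three matching rules can be sketched with plain predicates. These helper names are hypothetical and only model the semantics described above; the library itself delegates selector matching to selderee.

```js
// 7.x behavior: the id attribute was split on spaces, like a class list.
const matchesIdOld = (idAttr, id) => (idAttr || '').split(' ').includes(id);

// 8.0 behavior for `foo#a`: the whole id attribute value must equal the id.
const matchesIdNew = (idAttr, id) => idAttr === id;

// `foo[id~=a]`: the id must be one of the whitespace-separated words.
const matchesIdWord = (idAttr, id) =>
  (idAttr || '').split(/\s+/).filter(Boolean).includes(id);

console.log(matchesIdOld('a b', 'a'));  // true
console.log(matchesIdNew('a b', 'a'));  // false
console.log(matchesIdWord('a b', 'a')); // true
```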
### Batch processing
Since options preprocessing is getting more involved with selectors compilation, it seemed reasonable to break the single `htmlToText()` function into compilation and conversion steps. It might provide some performance benefits in client code.
* new function `compile(options)` returns a function of a single argument (html string);
* `htmlToText(html, options)` is now an alias to `convert(html, options)` function and works as before.
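
The split can be pictured with a toy model. `compileToy` and `convertToy` are hypothetical stand-ins that merely strip tags; they are not the library's functions and only show why preprocessing options once can pay off when many documents share the same configuration.

```js
let preprocessRuns = 0;

// Stand-in for compile(options): do the option work once, return a converter.
function compileToy (options = {}) {
  preprocessRuns += 1;
  const settings = { uppercase: false, ...options };
  return function (html) {
    const text = html.replace(/<[^>]*>/g, '').trim();
    return settings.uppercase ? text.toUpperCase() : text;
  };
}

// Stand-in for convert(html, options): compile and apply in one call.
function convertToy (html, options = {}) {
  return compileToy(options)(html);
}

const toText = compileToy({ uppercase: true });
const texts = ['<h1>Hello</h1>', '<p>world</p>'].map(toText);
console.log(texts);          // [ 'HELLO', 'WORLD' ]
console.log(preprocessRuns); // 1
```

Calling `convertToy` for each document instead would repeat the preprocessing on every call.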
### Deprecated options
* `baseElement`;
* `returnDomByDefault`;
* `tables`;
* `tags`.
Refer to README for [migration instructions](https://github.com/html-to-text/node-html-to-text#deprecated-or-removed-options).
No previously deprecated stuff is removed in this version. Significant cleanup is planned for version 9 instead.
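
As a rough migration aid, an old `tags` dictionary can be turned into a `selectors` array by mapping each key to a `selector` property, with the empty-string key becoming `'*'`. The helper name `tagsToSelectors` is hypothetical; the mapping mirrors what the deprecated-options handling in this release appears to do.

```js
// Convert a version 6/7 `tags` object into a version 8 `selectors` array.
function tagsToSelectors (tags) {
  return Object.entries(tags).map(
    ([name, definition]) => ({ ...definition, selector: name || '*' })
  );
}

const selectors = tagsToSelectors({
  '': { format: 'inline' },
  'a': { options: { baseUrl: 'https://example.com' } },
  'figure': { format: 'block' }
});
console.log(selectors.map(s => s.selector)); // [ '*', 'a', 'figure' ]
```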
## Version 7.1.1

@@ -4,0 +52,0 @@

// eslint-disable-next-line no-unused-vars
const { Picker } = require('selderee');
const { trimCharacter } = require('./helper');

@@ -24,5 +27,7 @@ // eslint-disable-next-line no-unused-vars

* @param { Options } options HtmlToText options.
* @param { Picker<DomNode, TagDefinition> } picker Selectors decision tree picker.
*/
constructor (options) {
constructor (options, picker) {
this.options = options;
this.picker = picker;
this.whitepaceProcessor = new WhitespaceProcessor(options);

@@ -29,0 +34,0 @@ /** @type { StackItem } */

@@ -329,3 +329,3 @@ const he = require('he');

const formatHeaderCell = (formatOptions.uppercaseHeaderCells)
const formatHeaderCell = (formatOptions.uppercaseHeaderCells !== false)
? (cellNode) => {

@@ -332,0 +332,0 @@ builder.pushWordTransform(str => str.toUpperCase());

/**
* Split given tag selector into it's components.
* Only element name, class names and ID names are supported.
*
* @param { string } selector Tag selector ("tag.class#id" etc).
* @returns { { classes: string[], element: string, ids: string[] } }
*/
function splitSelector (selector) {
function getParams (re, string) {
const captures = [];
let found;
while ((found = re.exec(string)) !== null) {
captures.push(found[1]);
}
return captures;
}
const merge = require('deepmerge');
return {
classes: getParams(/\.([\d\w-]*)/g, selector),
element: /(^\w*)/g.exec(selector)[1],
ids: getParams(/#([\d\w-]*)/g, selector)
};
}
/**

@@ -163,5 +141,33 @@ * Given a list of class and ID selectors (prefixed with '.' and '#'),

/**
* Deduplicate an array by a given key callback.
* Item properties are merged recursively and with the preference for last defined values.
* Of items with the same key, merged item takes the place of the last item,
* others are omitted.
*
* @param { any[] } items An array to deduplicate.
* @param { (x: any) => string } getKey Callback to get a value that distinguishes unique items.
* @returns { any[] }
*/
function mergeDuplicatesPreferLast (items, getKey) {
const map = new Map();
for (let i = items.length; i-- > 0;) {
const item = items[i];
const key = getKey(item);
map.set(
key,
(map.has(key))
? merge(item, map.get(key), { arrayMerge: overwriteMerge })
: item
);
}
return [...map.values()].reverse();
}
const overwriteMerge = (acc, src, options) => [...src];
module.exports = {
get: get,
limitedDepthRecursive: limitedDepthRecursive,
mergeDuplicatesPreferLast: mergeDuplicatesPreferLast,
numberToLetterSequence: numberToLetterSequence,

@@ -171,4 +177,3 @@ numberToRoman: numberToRoman,

splitClassesAndIds: splitClassesAndIds,
splitSelector: splitSelector,
trimCharacter: trimCharacter
};

@@ -0,8 +1,10 @@

const { hp2Builder } = require('@selderee/plugin-htmlparser2');
const merge = require('deepmerge');
const he = require('he');
const htmlparser = require('htmlparser2');
const selderee = require('selderee');
const { BlockTextBuilder } = require('./block-text-builder');
const defaultFormatters = require('./formatter');
const { limitedDepthRecursive, set, splitSelector } = require('./helper');
const { limitedDepthRecursive, mergeDuplicatesPreferLast, set } = require('./helper');

@@ -22,3 +24,7 @@ // eslint-disable-next-line import/no-unassigned-import

const DEFAULT_OPTIONS = {
baseElement: 'body',
baseElements: {
selectors: [ 'body' ],
orderBy: 'selectors', // 'selectors' | 'occurrence'
returnDomByDefault: true
},
decodeOptions: {

@@ -31,2 +37,3 @@ isAttributeValue: false,

ellipsis: '...',
maxBaseElements: undefined,
maxChildNodes: undefined,

@@ -41,36 +48,51 @@ maxDepth: undefined,

preserveNewlines: false,
returnDomByDefault: true,
tables: [],
tags: {
'': { format: 'inline' }, // defaults for any other tag name
'a': {
selectors: [
{ selector: '*', format: 'inline' },
{
selector: 'a',
format: 'anchor',
options: { baseUrl: null, hideLinkHrefIfSameAsText: false, ignoreHref: false, noAnchorUrl: true, noLinkBrackets: false }
options: {
baseUrl: null,
hideLinkHrefIfSameAsText: false,
ignoreHref: false,
noAnchorUrl: true,
noLinkBrackets: false
}
},
'article': { format: 'block' },
'aside': { format: 'block' },
'blockquote': {
{ selector: 'article', format: 'block' },
{ selector: 'aside', format: 'block' },
{
selector: 'blockquote',
format: 'blockquote',
options: { leadingLineBreaks: 2, trailingLineBreaks: 2, trimEmptyLines: true }
},
'br': { format: 'lineBreak' },
'div': { format: 'block' },
'footer': { format: 'block' },
'form': { format: 'block' },
'h1': { format: 'heading', options: { leadingLineBreaks: 3, trailingLineBreaks: 2, uppercase: true } },
'h2': { format: 'heading', options: { leadingLineBreaks: 3, trailingLineBreaks: 2, uppercase: true } },
'h3': { format: 'heading', options: { leadingLineBreaks: 3, trailingLineBreaks: 2, uppercase: true } },
'h4': { format: 'heading', options: { leadingLineBreaks: 2, trailingLineBreaks: 2, uppercase: true } },
'h5': { format: 'heading', options: { leadingLineBreaks: 2, trailingLineBreaks: 2, uppercase: true } },
'h6': { format: 'heading', options: { leadingLineBreaks: 2, trailingLineBreaks: 2, uppercase: true } },
'header': { format: 'block' },
'hr': { format: 'horizontalLine', options: { leadingLineBreaks: 2, length: undefined, trailingLineBreaks: 2 } },
'img': { format: 'image', options: { baseUrl: null } },
'main': { format: 'block' },
'nav': { format: 'block' },
'ol': { format: 'orderedList', options: { leadingLineBreaks: 2, trailingLineBreaks: 2 } },
'p': { format: 'paragraph', options: { leadingLineBreaks: 2, trailingLineBreaks: 2 } },
'pre': { format: 'pre', options: { leadingLineBreaks: 2, trailingLineBreaks: 2 } },
'section': { format: 'block' },
'table': {
{ selector: 'br', format: 'lineBreak' },
{ selector: 'div', format: 'block' },
{ selector: 'footer', format: 'block' },
{ selector: 'form', format: 'block' },
{ selector: 'h1', format: 'heading', options: { leadingLineBreaks: 3, trailingLineBreaks: 2, uppercase: true } },
{ selector: 'h2', format: 'heading', options: { leadingLineBreaks: 3, trailingLineBreaks: 2, uppercase: true } },
{ selector: 'h3', format: 'heading', options: { leadingLineBreaks: 3, trailingLineBreaks: 2, uppercase: true } },
{ selector: 'h4', format: 'heading', options: { leadingLineBreaks: 2, trailingLineBreaks: 2, uppercase: true } },
{ selector: 'h5', format: 'heading', options: { leadingLineBreaks: 2, trailingLineBreaks: 2, uppercase: true } },
{ selector: 'h6', format: 'heading', options: { leadingLineBreaks: 2, trailingLineBreaks: 2, uppercase: true } },
{ selector: 'header', format: 'block' },
{
selector: 'hr',
format: 'horizontalLine',
options: { leadingLineBreaks: 2, length: undefined, trailingLineBreaks: 2 }
},
{ selector: 'img', format: 'image', options: { baseUrl: null } },
{ selector: 'main', format: 'block' },
{ selector: 'nav', format: 'block' },
{
selector: 'ol',
format: 'orderedList',
options: { leadingLineBreaks: 2, trailingLineBreaks: 2 }
},
{ selector: 'p', format: 'paragraph', options: { leadingLineBreaks: 2, trailingLineBreaks: 2 } },
{ selector: 'pre', format: 'pre', options: { leadingLineBreaks: 2, trailingLineBreaks: 2 } },
{ selector: 'section', format: 'block' },
{
selector: 'table',
format: 'table',

@@ -86,8 +108,10 @@ options: {

},
'ul': {
{
selector: 'ul',
format: 'unorderedList',
options: { itemPrefix: ' * ', leadingLineBreaks: 2, trailingLineBreaks: 2 }
},
'wbr': { format: 'wbr' }
},
{ selector: 'wbr', format: 'wbr' },
],
tables: [], // deprecated
whitespaceCharacters: ' \t\r\n\f\u200b',

@@ -97,22 +121,26 @@ wordwrap: 80

const concatMerge = (acc, src, options) => [...acc, ...src];
const overwriteMerge = (acc, src, options) => [...src];
const selectorsMerge = (acc, src, options) => (
(acc.some(s => typeof s === 'object'))
? concatMerge(acc, src, options) // selectors
: overwriteMerge(acc, src, options) // baseElements.selectors
);
/**
* Convert given HTML content to plain text string.
* Preprocess options, compile selectors into a decision tree,
* return a function intended for batch processing.
*
* @param { string } html HTML content to convert.
* @param { Options } [options = {}] HtmlToText options.
* @returns { string } Plain text string.
* @param { Options } [options = {}] HtmlToText options.
* @returns { (html: string) => string } Pre-configured converter function.
* @static
*
* @example
* const { htmlToText } = require('html-to-text');
* const text = htmlToText('<h1>Hello World</h1>', {
* wordwrap: 130
* });
* console.log(text); // HELLO WORLD
*/
function htmlToText (html, options = {}) {
function compile (options = {}) {
options = merge(
DEFAULT_OPTIONS,
options,
{ arrayMerge: (destinationArray, sourceArray, mergeOptions) => sourceArray }
{
arrayMerge: overwriteMerge,
customMerge: (key) => ((key === 'selectors') ? selectorsMerge : undefined)
}
);

@@ -123,12 +151,20 @@ options.formatters = Object.assign({}, defaultFormatters, options.formatters);

const maxInputLength = options.limits.maxInputLength;
if (maxInputLength && html && html.length > maxInputLength) {
console.warn(
`Input length ${html.length} is above allowed limit of ${maxInputLength}. Truncating without ellipsis.`
const uniqueSelectors = mergeDuplicatesPreferLast(options.selectors, (s => s.selector));
const selectorsWithoutFormat = uniqueSelectors.filter(s => !s.format);
if (selectorsWithoutFormat.length) {
throw new Error(
'Following selectors have no specified format: ' +
selectorsWithoutFormat.map(s => `\`${s.selector}\``).join(', ')
);
html = html.substring(0, maxInputLength);
}
const picker = new selderee.DecisionTree(
uniqueSelectors.map(s => [s.selector, s])
).build(hp2Builder);
const handler = new htmlparser.DefaultHandler();
new htmlparser.Parser(handler, { decodeEntities: false }).parseComplete(html);
const baseSelectorsPicker = new selderee.DecisionTree(
options.baseElements.selectors.map((s, i) => [s, i + 1])
).build(hp2Builder);
function findBaseElements (dom) {
return findBases(dom, options, baseSelectorsPicker);
}

@@ -143,12 +179,35 @@ const limitedWalk = limitedDepthRecursive(

const baseElements = Array.isArray(options.baseElement)
? options.baseElement
: [options.baseElement];
const bases = baseElements
.map(be => findBase(handler.dom, options, be))
.filter(b => b)
.reduce((acc, b) => acc.concat(b), []);
return function (html) {
return process(html, options, picker, findBaseElements, limitedWalk);
};
}
const builder = new BlockTextBuilder(options);
limitedWalk(bases, builder);
/**
* Convert given HTML according to preprocessed options.
*
* @param { string } html HTML content to convert.
* @param { Options } options HtmlToText options (preprocessed).
* @param { Picker<DomNode, TagDefinition> } picker
* Tag definition picker for DOM nodes processing.
* @param { (dom: DomNode[]) => DomNode[] } findBaseElements
* Function to extract elements from HTML DOM
* that will only be present in the output text.
* @param { RecursiveCallback } walk Recursive callback.
* @returns { string }
*/
function process (html, options, picker, findBaseElements, walk) {
const maxInputLength = options.limits.maxInputLength;
if (maxInputLength && html && html.length > maxInputLength) {
console.warn(
`Input length ${html.length} is above allowed limit of ${maxInputLength}. Truncating without ellipsis.`
);
html = html.substring(0, maxInputLength);
}
const handler = new htmlparser.DomHandler();
new htmlparser.Parser(handler, { decodeEntities: false }).parseComplete(html);
const bases = findBaseElements(handler.dom);
const builder = new BlockTextBuilder(options, picker);
walk(bases, builder);
return builder.toString();

@@ -158,2 +217,21 @@ }

/**
* Convert given HTML content to plain text string.
*
* @param { string } html HTML content to convert.
* @param { Options } [options = {}] HtmlToText options.
* @returns { string } Plain text string.
* @static
*
* @example
* const { convert } = require('html-to-text');
* const text = convert('<h1>Hello World</h1>', {
* wordwrap: 130
* });
* console.log(text); // HELLO WORLD
*/
function convert (html, options = {}) {
return compile(options)(html);
}
/**
* Map previously existing and now deprecated options to the new options layout.

@@ -165,9 +243,16 @@ * This is a subject for cleanup in major releases.

function handleDeprecatedOptions (options) {
const tagDefinitions = Object.values(options.tags);
const selectorDefinitions = options.selectors;
if (options.tags) {
const tagDefinitions = Object.entries(options.tags).map(
([selector, definition]) => ({ ...definition, selector: selector || '*' })
);
selectorDefinitions.push(...tagDefinitions);
}
function copyFormatterOption (source, format, target) {
if (options[source] === undefined) { return; }
for (const tagDefinition of tagDefinitions) {
if (tagDefinition.format === format) {
set(tagDefinition, ['options', target], options[source]);
for (const definition of selectorDefinitions) {
if (definition.format === format) {
set(definition, ['options', target], options[source]);
}

@@ -192,5 +277,5 @@ }

if (options['ignoreImage']) {
for (const tagDefinition of tagDefinitions) {
if (tagDefinition.format === 'image') {
tagDefinition.format = 'skip';
for (const definition of selectorDefinitions) {
if (definition.format === 'image') {
definition.format = 'skip';
}

@@ -201,34 +286,41 @@ }

if (options['singleNewLineParagraphs']) {
for (const tagDefinition of tagDefinitions) {
if (tagDefinition.format === 'paragraph' || tagDefinition.format === 'pre') {
set(tagDefinition, ['options', 'leadingLineBreaks'], 1);
set(tagDefinition, ['options', 'trailingLineBreaks'], 1);
for (const definition of selectorDefinitions) {
if (definition.format === 'paragraph' || definition.format === 'pre') {
set(definition, ['options', 'leadingLineBreaks'], 1);
set(definition, ['options', 'trailingLineBreaks'], 1);
}
}
}
if (options['baseElement']) {
const baseElement = options['baseElement'];
set(
options,
['baseElements', 'selectors'],
(Array.isArray(baseElement) ? baseElement : [baseElement])
);
}
if (options['returnDomByDefault'] !== undefined) {
set(options, ['baseElements', 'returnDomByDefault'], options['returnDomByDefault']);
}
}
function findBase (dom, options, baseElement) {
let result = null;
function findBases (dom, options, baseSelectorsPicker) {
const results = [];
const splitTag = splitSelector(baseElement);
function recursiveWalk (walk, /** @type { DomNode[] } */ dom) {
if (result) { return; }
dom = dom.slice(0, options.limits.maxChildNodes);
for (const elem of dom) {
if (result) { return; }
if (elem.name === splitTag.element) {
const documentClasses = elem.attribs && elem.attribs.class ? elem.attribs.class.split(' ') : [];
const documentIds = elem.attribs && elem.attribs.id ? elem.attribs.id.split(' ') : [];
if (
splitTag.classes.every(function (val) { return documentClasses.indexOf(val) >= 0; }) &&
splitTag.ids.every(function (val) { return documentIds.indexOf(val) >= 0; })
) {
result = [elem];
return;
}
if (elem.type !== 'tag') {
continue;
}
if (elem.children) { walk(elem.children); }
const pickedSelectorIndex = baseSelectorsPicker.pick1(elem);
if (pickedSelectorIndex > 0) {
results.push({ selectorIndex: pickedSelectorIndex, element: elem });
} else if (elem.children) {
walk(elem.children);
}
if (results.length >= options.limits.maxBaseElements) {
return;
}
}

@@ -241,5 +333,10 @@ }

);
limitedWalk(dom);
limitedWalk(dom);
return options.returnDomByDefault ? result || dom : result;
if (options.baseElements.orderBy !== 'occurrence') { // 'selectors'
results.sort((a, b) => a.selectorIndex - b.selectorIndex);
}
return (options.baseElements.returnDomByDefault && results.length === 0)
? dom
: results.map(x => x.element);
}

@@ -276,4 +373,3 @@

case 'tag': {
const tags = options.tags;
const tagDefinition = tags[elem.name] || tags[''];
const tagDefinition = builder.picker.pick1(elem);
const format = options.formatters[tagDefinition.format];

@@ -293,4 +389,4 @@ format(elem, walk, builder, tagDefinition.options || {});

/**
* @deprecated Import/require `{ htmlToText }` function instead!
* @see htmlToText
* @deprecated Use `{ convert }` function instead!
* @see convert
*

@@ -302,7 +398,9 @@ * @param { string } html HTML content to convert.

*/
const fromString = (html, options = {}) => htmlToText(html, options);
const fromString = (html, options = {}) => convert(html, options);
module.exports = {
htmlToText: htmlToText,
fromString: fromString
compile: compile,
convert: convert,
fromString: fromString,
htmlToText: convert
};

@@ -6,9 +6,5 @@

*
* @property { string | string[] } [baseElement = body]
* The resulting text output will be composed from the text content of this element
* (or elements if an array of strings is specified).
* @property { BaseElementsOptions } [baseElements]
* Options for narrowing down to informative parts of HTML document.
*
* Each entry is a single tag name with optional css class and id parameters,
* e.g. `['p.class1.class2#id1#id2', 'p.class1.class2#id1#id2']`.
*
* @property { DecodeOptions } [decodeOptions]

@@ -35,22 +31,10 @@ * Text decoding options given to `he.decode`.

*
* @property { boolean } [returnDomByDefault = true]
* Use the entire document if we don't find the tag defined in `Options.baseElement`.
* @property { SelectorDefinition[] } [selectors = []]
* Instructions for how to render HTML elements based on matched selectors.
*
* Use this to (re)define options for new or already supported tags.
*
* @property { string[] | boolean } [tables = []]
* Allows to select and format certain tables by the `class` or `id` attribute from the HTML document.
* Deprecated. Use selectors with `format: 'dataTable'` instead.
*
* This is necessary because the majority of HTML E-Mails uses a table based layout.
*
* Prefix your table selectors with a `.` for the `class` and with a `#` for the `id` attribute.
* All other tables are ignored (processed as layout containers, not tabular data).
*
* You can assign `true` to this property to format all tables.
*
* @property { object.< string, TagDefinition > } [tags = {}]
* A dictionary with custom tag definitions.
*
* Use this to (re)define how to handle new or already supported tags.
*
* Empty string (`''`) as a key used for the default definition for "any other" tags.
*
* @property { string } [whitespaceCharacters = ' \t\r\n\f\u200b']

@@ -67,2 +51,22 @@ * All characters that are considered whitespace.

/**
* @typedef { object } BaseElementsOptions
* Options for narrowing down to informative parts of HTML document.
*
* @property { string[] } [selectors = ['body']]
* The resulting text output will be composed from the text content of elements
* matched with these selectors.
*
* @property { 'selectors' | 'occurrence' } [orderBy = 'selectors']
* When multiple selectors are set, this option specifies
* whether the selectors order has to be reflected in the output text.
*
* `'selectors'` (default) - matches for the first selector will appear first, etc;
*
* `'occurrence'` - all bases will appear in the same order as in input HTML.
*
* @property { boolean } [returnDomByDefault = true]
* Use the entire document if none of provided selectors matched.
*/
/**
* @typedef { object } DecodeOptions

@@ -87,2 +91,9 @@ * Text decoding options given to `he.decode`.

*
* @property { number | undefined } [maxBaseElements = undefined]
* Stop looking for new base elements after this number of matches.
*
* No ellipsis is used when this condition is met.
*
* No limit if undefined.
*
* @property { number | undefined } [maxChildNodes = undefined]

@@ -123,5 +134,8 @@ * Process only this many child nodes of any element.

/**
* @typedef { object } TagDefinition
* Describes how to handle a tag.
* @typedef { object } SelectorDefinition
* Describes how to handle tags matched by a selector.
*
* @property { string } selector
* CSS selector. Refer to README for notes on supported selectors etc.
*
* @property { string } format

@@ -131,3 +145,3 @@ * Identifier of a {@link FormatCallback}, built-in or provided in `Options.formatters` dictionary.

* @property { FormatOptions } options
* Options to customize the formatter for this tag.
* Options to customize the formatter for this element.
*/

@@ -134,0 +148,0 @@

{
"name": "html-to-text",
"version": "7.1.1",
"version": "8.0.0",
"description": "Advanced html to plain text converter",

@@ -41,17 +41,19 @@ "license": "MIT",

"dependencies": {
"@selderee/plugin-htmlparser2": "^0.6.0",
"deepmerge": "^4.2.2",
"he": "^1.2.0",
"htmlparser2": "^6.1.0",
"minimist": "^1.2.5"
"minimist": "^1.2.5",
"selderee": "^0.6.0"
},
"devDependencies": {
"chai": "^4.3.4",
"eslint": "^7.24.0",
"eslint": "^7.28.0",
"eslint-plugin-filenames": "^1.3.2",
"eslint-plugin-import": "^2.22.1",
"eslint-plugin-jsdoc": "^32.3.0",
"eslint-plugin-import": "^2.23.4",
"eslint-plugin-jsdoc": "^33.3.0",
"eslint-plugin-mocha": "^8.1.0",
"mocha": "^8.3.2",
"mocha": "^8.4.0",
"nyc": "^15.1.0"
}
}

@@ -29,2 +29,4 @@ # html-to-text

Version 8 brings the selectors support to greatly increase the flexibility but that also changes some things introduced in version 6. Base element(s) selection also got important changes.
## Installation

@@ -36,15 +38,12 @@

Or when you want to use it as command line interface it is recommended to install it globally via
## Usage
```
npm install html-to-text -g
```
Convert a single document:
## Usage
```js
const { htmlToText } = require('html-to-text');
const { convert } = require('html-to-text');
// There is also an alias to `convert` called `htmlToText`.
const html = '<h1>Hello World</h1>';
const text = htmlToText(html, {
const text = convert(html, {
wordwrap: 130

@@ -55,2 +54,23 @@ });

Configure `html-to-text` once for batch processing:
```js
const { compile } = require('html-to-text');
const convert = compile({
wordwrap: 130
});
const htmls = [
'<h1>Hello World!</h1>',
'<h1>こんにちは世界!</h1>',
'<h1>Привет, мир!</h1>'
];
const texts = htmls.map(convert);
console.log(texts.join('\n'));
// Hello World!
// こんにちは世界!
// Привет, мир!
```
### Options

@@ -62,3 +82,6 @@

----------------------- | ------------ | -----------
`baseElement` | `'body'` | The tag(s) whose text content will be captured from the html and added to the resulting text output.<br/>Single element or an array of elements can be specified, each as a single tag name with optional css class and id parameters e.g. `['p.class1.class2#id1#id2', 'p.class1.class2#id1#id2']`.
`baseElements` | | Describes which parts of the input document have to be converted and present in the output text, and in what order.
`baseElements.selectors` | `['body']` | Elements matching any of provided selectors will be processed and included in the output text, with all inner content.<br/>Refer to [Supported selectors](#supported-selectors) section below.
`baseElements.orderBy` | `selectors` | `'selectors'` - arrange base elements in the same order as `baseElements.selectors` array;<br/>`'occurrence'` - arrange base elements in the order they are found in the input document.
`baseElements.returnDomByDefault` | `true` | Convert the entire document if none of provided selectors match.
`decodeOptions` | `{ isAttributeValue: false, strict: false }` | Text decoding options given to `he.decode`. For more information see the [he](https://github.com/mathiasbynens/he) module.

@@ -68,2 +91,3 @@ `formatters` | `{}` | An object with custom formatting functions for specific elements (see [Override formatting](#override-formatting) section below).

`limits.ellipsis` | `'...'` | A string to insert in place of skipped content.
`limits.maxBaseElements` | `undefined` | Stop looking for more base elements after reaching this amount. Unlimited if undefined.
`limits.maxChildNodes` | `undefined` | Maximum number of child nodes of a single node to be added to the output. Unlimited if undefined.

@@ -76,43 +100,43 @@ `limits.maxDepth` | `undefined` | Stop looking for nodes to add to the output below this depth in the DOM tree. Unlimited if undefined.

`preserveNewlines` | `false` | By default, any newlines `\n` from the input HTML are collapsed into space as any other HTML whitespace characters. If `true`, these newlines will be preserved in the output. This is only useful when input HTML carries some plain text formatting instead of proper tags.
`returnDomByDefault` | `true` | Convert the entire document if we don't find the tag defined in `baseElement`.
`tables` | `[]` | Allows to select certain tables by the `class` or `id` attribute from the HTML document. This is necessary because the majority of HTML E-Mails uses a table based layout. Prefix your table selectors with an `.` for the `class` and with a `#` for the `id` attribute. All other tables are ignored.<br/>You can assign `true` to this attribute to select all tables.
`tags` | | Describes how different tags should be formatted. See [Tags](#tags) section below.
`selectors` | `[]` | Describes how different HTML elements should be formatted. See [Selectors](#selectors) section below.
`whitespaceCharacters` | `' \t\r\n\f\u200b'` | A string of characters that are recognized as HTML whitespace. Default value uses the set of characters defined in [HTML4 standard](https://www.w3.org/TR/html4/struct/text.html#h-9.1). (It includes Zero-width space compared to [living standard](https://infra.spec.whatwg.org#ascii-whitespace).)
`wordwrap` | `80` | After how many chars a line break should follow.<br/>Set to `null` or `false` to disable word-wrapping.
#### Options deprecated in version 6
#### Deprecated or removed options
Old&nbsp;option | Instead&nbsp;use
-------------------------- | -----------
`hideLinkHrefIfSameAsText` | `tags: { 'a': { options: { hideLinkHrefIfSameAsText: true } } }`
`ignoreHref` | `tags: { 'a': { options: { ignoreHref: true } } }`
`ignoreImage` | `tags: { 'img': { format: 'skip' } }`
`linkHrefBaseUrl` | `tags: {`<br/>`'a': { options: { baseUrl: 'https://example.com' } },`<br/>`'img': { options: { baseUrl: 'https://example.com' } }`<br/>`}`
`noAnchorUrl` | `tags: { 'a': { options: { noAnchorUrl: true } } }`
`noLinkBrackets` | `tags: { 'a': { options: { noLinkBrackets: true } } }`
`singleNewLineParagraphs` | `tags: {`<br/>`'p': { options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },`<br/>`'pre': { options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } }`<br/>`}`
`unorderedListItemPrefix` | `tags: { 'ul': { options: { itemPrefix: ' * ' } } }`
`uppercaseHeadings` | `tags: {`<br/>`'h1': { options: { uppercase: false } },`<br/>`...`<br/>`'table': { options: { uppercaseHeaderCells: false } }`<br/>`}`
Old&nbsp;option | Depr. | Rem. | Instead&nbsp;use
-------------------------- | --- | ----- | -----------------
`baseElement` | 8.0 | | `baseElements: { selectors: [ 'body' ] }`
`format` | | 6.0 | The way formatters are written has changed completely. New formatters have to be added to the `formatters` option, old ones can not be reused without rewrite. See [new instructions](#override-formatting) below.
`hideLinkHrefIfSameAsText` | 6.0 | *9.0* | `selectors: [ { selector: 'a', options: { hideLinkHrefIfSameAsText: true } } ]`
`ignoreHref` | 6.0 | *9.0* | `selectors: [ { selector: 'a', options: { ignoreHref: true } } ]`
`ignoreImage` | 6.0 | *9.0* | `selectors: [ { selector: 'img', format: 'skip' } ]`
`linkHrefBaseUrl` | 6.0 | *9.0* | `selectors: [`<br/>`{ selector: 'a', options: { baseUrl: 'https://example.com' } },`<br/>`{ selector: 'img', options: { baseUrl: 'https://example.com' } }`<br/>`]`
`noAnchorUrl` | 6.0 | *9.0* | `selectors: [ { selector: 'a', options: { noAnchorUrl: true } } ]`
`noLinkBrackets` | 6.0 | *9.0* | `selectors: [ { selector: 'a', options: { noLinkBrackets: true } } ]`
`returnDomByDefault` | 8.0 | | `baseElements: { returnDomByDefault: true }`
`singleNewLineParagraphs` | 6.0 | *9.0* | `selectors: [`<br/>`{ selector: 'p', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },`<br/>`{ selector: 'pre', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } }`<br/>`]`
`tables` | 8.0 | | `selectors: [ { selector: 'table.class#id', format: 'dataTable' } ]`
`tags` | 8.0 | | See [Selectors](#selectors) section below.
`unorderedListItemPrefix` | 6.0 | *9.0* | `selectors: [ { selector: 'ul', options: { itemPrefix: ' * ' } } ]`
`uppercaseHeadings` | 6.0 | *9.0* | `selectors: [`<br/>`{ selector: 'h1', options: { uppercase: false } },`<br/>`...`<br/>`{ selector: 'table', options: { uppercaseHeaderCells: false } }`<br/>`]`
Deprecated options will be removed in a future major version update.
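For the `tags` row in particular, the conversion to a `selectors` array is mechanical. The following is a minimal sketch, assuming a hypothetical helper named `tagsToSelectors` (it is not part of `html-to-text`) and assuming that the old `''` catch-all key corresponds to the `*` selector, as the predefined formatters table suggests:

```javascript
// Hypothetical helper, not part of html-to-text: turns a legacy `tags`
// object into the array shape expected by the `selectors` option.
function tagsToSelectors(tags) {
  return Object.entries(tags).map(([tagName, definition]) => ({
    // Assumption: the old '' catch-all key maps to the universal selector.
    selector: tagName === '' ? '*' : tagName,
    ...definition
  }));
}

const legacyTags = {
  'a': { options: { baseUrl: 'https://example.com' } },
  'figure': { format: 'block' }
};

const selectors = tagsToSelectors(legacyTags);
console.log(JSON.stringify(selectors, null, 2));
```

The resulting array can be passed as the `selectors` option. Because arrays have a well-defined order, later entries can then be appended to refine earlier ones.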
Other things deprecated:
#### Options removed in version 6
* `fromString` method;
* positional arguments in `BlockTextBuilder` methods (in case you have written some custom formatters for version 6.0).
Old&nbsp;option | Description
--------------- | -----------
`format` | The way formatters are written has changed completely. New formatters have to be added to the `formatters` option; old ones cannot be reused without a rewrite. See [new instructions](#override-formatting) below.
#### Selectors
#### Tags
An example:
Example for tag-specific options:
```javascript
const { htmlToText } = require('html-to-text');
const { convert } = require('html-to-text');
const html = '<a href="/page.html">Page</a>';
const text = htmlToText(html, {
tags: {
'a': { options: { baseUrl: 'https://example.com' } },
'figure': { format: 'block' }
}
const html = '<a href="/page.html">Page</a><a href="!#" class="button">Action</a>';
const text = convert(html, {
selectors: [
{ selector: 'a', options: { baseUrl: 'https://example.com' } },
{ selector: 'a.button', format: 'skip' }
]
});

@@ -122,9 +146,36 @@ console.log(text); // Page [https://example.com/page.html]

For new tags you have to specify the `format` value. For tags listed below you can skip it and only provide `options`. (Valid options listed in the next table.)
Selectors array is our loose approximation of a stylesheet.
By default there are following tag to format assignments:
* highest [specificity](https://www.w3.org/TR/selectors/#specificity) selector is used when there are multiple matches;
* the last selector is used when there are multiple matches of equal specificity;
* all entries with the same selector value are merged (recursively) at the compile stage, in such a way that the last defined properties are kept and the relative order of unique selectors is kept;
* user-defined entries are appended after [predefined entries](#predefined-formatters);
* every unique selector must have `format` value specified (at least once);
* unlike in CSS, values from different matched selectors are NOT merged at the convert stage. Single best match is used instead (that is the last one of those with highest specificity).
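The rules above can be illustrated with a small self-contained sketch. This is not the parseley/selderee implementation: all function names are hypothetical, only compound selectors built from a tag name, `.class` and `#id` are handled, and the merge is shallow rather than recursive.

```javascript
// Illustrative sketch only; NOT how html-to-text resolves selectors internally.

// Specificity as [ids, classes, tags], compared left to right.
function specificity(selector) {
  const ids = (selector.match(/#[\w-]+/g) || []).length;
  const classes = (selector.match(/\.[\w-]+/g) || []).length;
  const tags = /^[a-zA-Z]/.test(selector) ? 1 : 0;
  return [ids, classes, tags];
}

function compareSpecificity(a, b) {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) { return a[i] - b[i]; }
  }
  return 0;
}

// Element shape used here (an assumption): { tag, id, classes }.
function matches(selector, elem) {
  if (selector === '*') { return true; }
  const tag = (selector.match(/^[a-zA-Z][\w-]*/) || [null])[0];
  const ids = (selector.match(/#[\w-]+/g) || []).map(s => s.slice(1));
  const classes = (selector.match(/\.[\w-]+/g) || []).map(s => s.slice(1));
  return (tag === null || tag === elem.tag)
    && ids.every(id => id === elem.id)
    && classes.every(c => elem.classes.includes(c));
}

// Compile stage: entries sharing a selector are merged, later properties win,
// and unique selectors keep the order of their first appearance.
function mergeEntries(entries) {
  const merged = new Map();
  for (const { selector, format, options } of entries) {
    const prev = merged.get(selector) || { selector, options: {} };
    merged.set(selector, {
      selector,
      format: format !== undefined ? format : prev.format,
      options: { ...prev.options, ...options }
    });
  }
  return [...merged.values()];
}

// Convert stage: a single best match is used, with no merging across selectors.
function pickBest(entries, elem) {
  let best = null;
  for (const entry of mergeEntries(entries)) {
    if (!matches(entry.selector, elem)) { continue; }
    // `>=` lets a later entry win a tie in specificity.
    if (best === null ||
        compareSpecificity(specificity(entry.selector), specificity(best.selector)) >= 0) {
      best = entry;
    }
  }
  return best;
}

const entries = [
  { selector: '*', format: 'inline' },   // stands in for a predefined entry
  { selector: 'a', format: 'anchor' },   // stands in for a predefined entry
  { selector: 'a', options: { baseUrl: 'https://example.com' } },
  { selector: 'a.button', format: 'skip' }
];

const plainLink = pickBest(entries, { tag: 'a', id: null, classes: [] });
const button = pickBest(entries, { tag: 'a', id: null, classes: ['button'] });
console.log(plainLink.format, plainLink.options.baseUrl); // anchor https://example.com
console.log(button.format, button.options.baseUrl);       // skip undefined
```

Note how the `a.button` match does not inherit `baseUrl` from the `a` entry: only entries with the same selector value are merged.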
Tag&nbsp;name | Default&nbsp;format | Notes
To achieve the best performance when checking each DOM element against provided selectors, they are compiled into a decision tree. But it is also important how you choose selectors. For example, `div#id` is much better than `#id` - the former will only check divs for the id while the latter has to check every element in the DOM.
##### Supported selectors
`html-to-text` relies on [parseley](https://github.com/mxxii/parseley) and [selderee](https://github.com/mxxii/selderee) packages for selectors support.
The following selectors can be used in any combination:
* `*` - universal selector;
* `div` - tag name;
* `.foo` - class name;
* `#bar` - id;
* `[baz]` - attribute presence;
* `[baz=buzz]` - attribute value (with any operators and also quotes and case sensitivity modifiers);
* `+` and `>` combinators (other combinators are not supported).
You can match `<p style="...; display:INLINE; ...">...</p>` with `p[style*="display:inline"i]` for example.
##### Predefined formatters
The following selectors have a formatter specified as a part of the default configuration. Everything can be overridden, but you don't have to repeat the `format` or options that you don't want to override. (But keep in mind this is only true for the same selector. There is no connection between different selectors.)
Selector | Default&nbsp;format | Notes
------------- | ------------------- | -----
`''` | `inline` | Catch-all default for unknown tags.
`*` | `inline` | Universal selector.
`a` | `anchor` |

@@ -152,12 +203,15 @@ `article` | `block` |

`pre` | `pre` |
`table` | `table` | there is also `dataTable` format. Using it will be equivalent to setting `tables` to `true`. `tables` option might be deprecated in the future.
`table` | `table` | Equivalent to `block`. Use `dataTable` instead for tabular data.
`ul` | `unorderedList` |
`wbr` | `wbr` |
More formats also available for use:
More formatters are also available for use:
* `dataTable` - for visually-accurate tables. Note that this might not be search-friendly (the output text will look like gibberish to a machine when any cell contents are wrapped), and it is best avoided for tables used as a page layout tool;
* `skip` - as the name implies, it skips the given tag with its contents without printing anything.
Format options are specified for each tag independently:
##### Format options
Following options are available for built-in formatters.
Option | Default | Applies&nbsp;to | Description

@@ -176,6 +230,6 @@ ------------------- | ----------- | ------------------ | -----------

`trimEmptyLines` | `true` | `blockquote` | Trim empty lines from blockquote.<br/>While empty lines should be preserved in HTML, space-saving behavior is chosen as default for convenience.
`uppercaseHeaderCells` | `true` | `table`, `dataTable` | By default, heading cells (`<th>`) are uppercased.<br/>Set this to `false` to leave heading cells as they are.
`maxColumnWidth` | `60` | `table`, `dataTable` | Data table cell content will be wrapped to fit this width instead of global `wordwrap` limit.<br/>Set this to `undefined` in order to fall back to `wordwrap` limit.
`colSpacing` | `3` | `table`, `dataTable` | Number of spaces between data table columns.
`rowSpacing` | `0` | `table`, `dataTable` | Number of empty lines between data table rows.
`uppercaseHeaderCells` | `true` | `dataTable` | By default, heading cells (`<th>`) are uppercased.<br/>Set this to `false` to leave heading cells as they are.
`maxColumnWidth` | `60` | `dataTable` | Data table cell content will be wrapped to fit this width instead of global `wordwrap` limit.<br/>Set this to `undefined` in order to fall back to `wordwrap` limit.
`colSpacing` | `3` | `dataTable` | Number of spaces between data table columns.
`rowSpacing` | `0` | `dataTable` | Number of empty lines between data table rows.

@@ -186,3 +240,3 @@ ### Override formatting

`formatters` option is an object that holds formatting functions. They can be assigned to format different tags by key in the `tags` option.
`formatters` option is an object that holds formatting functions. They can be assigned to format different elements in the `selectors` array.

@@ -199,6 +253,6 @@ Each formatter is a function of four arguments that returns nothing. Arguments are:

```javascript
const { htmlToText } = require('html-to-text');
const { convert } = require('html-to-text');
const html = '<foo>Hello World</foo>';
const text = htmlToText(html, {
const text = convert(html, {
formatters: {

@@ -213,9 +267,10 @@ // Create a formatter.

},
tags: {
selectors: [
// Assign it to `foo` tags.
'foo': {
{
selector: 'foo',
format: 'fooBlockFormatter',
options: { leadingLineBreaks: 1, trailingLineBreaks: 1 }
}
}
]
});

@@ -225,3 +280,3 @@ console.log(text); // Hello World!

Refer to [built-in formatters](https://github.com/html-to-text/node-html-to-text/blob/master/lib/formatter.js) for more examples.
Refer to [built-in formatters](https://github.com/html-to-text/node-html-to-text/blob/master/lib/formatter.js) for more examples. The easiest way to write your own is to pick an existing one and customize.

@@ -228,0 +283,0 @@ Refer to [BlockTextBuilder](https://github.com/html-to-text/node-html-to-text/blob/master/lib/block-text-builder.js) for available functions and arguments.
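
A custom formatter can be exercised in isolation before wiring it into `convert`. The sketch below is only an illustration: `createRecordingBuilder` and `stubWalk` are hypothetical stand-ins written for this example (they are not the real `BlockTextBuilder` or the library's `walk`), and the element object is a hand-made approximation of a DOM node.

```javascript
// A formatter with the four-argument shape described above.
function fooBlockFormatter(elem, walk, builder, formatOptions) {
  builder.openBlock({ leadingLineBreaks: formatOptions.leadingLineBreaks || 1 });
  walk(elem.children, builder);
  builder.addInline('!');
  builder.closeBlock({ trailingLineBreaks: formatOptions.trailingLineBreaks || 1 });
}

// Hypothetical stand-in for BlockTextBuilder: it only records the calls made.
function createRecordingBuilder() {
  const calls = [];
  return {
    calls,
    openBlock: (options) => { calls.push(['openBlock', options]); },
    addInline: (text) => { calls.push(['addInline', text]); },
    closeBlock: (options) => { calls.push(['closeBlock', options]); }
  };
}

// Hypothetical stand-in for `walk`: emits text nodes as inline text.
function stubWalk(nodes, builder) {
  for (const node of nodes || []) {
    if (node.type === 'text') { builder.addInline(node.data); }
  }
}

const recordingBuilder = createRecordingBuilder();
const fooElement = {
  type: 'tag',
  name: 'foo',
  children: [{ type: 'text', data: 'Hello World' }]
};

fooBlockFormatter(
  fooElement,
  stubWalk,
  recordingBuilder,
  { leadingLineBreaks: 1, trailingLineBreaks: 1 }
);

console.log(recordingBuilder.calls.map(([name]) => name).join(' '));
// openBlock addInline addInline closeBlock
```

Recording calls rather than producing text keeps the check independent of how the real builder handles whitespace and wrapping.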

@@ -6,5 +6,7 @@ const fs = require('fs');

const { htmlToText } = require('..');
const { compile, convert } = require('..');
const defaultConvert = compile();
describe('html-to-text', function () {

@@ -15,11 +17,11 @@

it('should return empty input unchanged', function () {
expect(htmlToText('')).to.equal('');
expect(defaultConvert('')).to.equal('');
});
it('should return empty result if input undefined', function () {
expect(htmlToText()).to.equal('');
expect(defaultConvert()).to.equal('');
});
it('should return plain text (no line breaks) unchanged', function () {
expect(htmlToText('Hello world!')).to.equal('Hello world!');
expect(defaultConvert('Hello world!')).to.equal('Hello world!');
});

@@ -37,3 +39,3 @@

`;
expect(htmlToText(html)).to.equal('text');
expect(defaultConvert(html)).to.equal('text');
});

@@ -50,3 +52,3 @@

`;
expect(htmlToText(html)).to.equal('text');
expect(defaultConvert(html)).to.equal('text');
});

@@ -62,3 +64,3 @@

`;
expect(htmlToText(html)).to.equal('text');
expect(defaultConvert(html)).to.equal('text');
});

@@ -77,17 +79,17 @@

it('should wordwrap at 80 characters by default', function () {
expect(htmlToText(longStr)).to.equal('111111111 222222222 333333333 444444444 555555555 666666666 777777777 888888888\n999999999');
expect(defaultConvert(longStr)).to.equal('111111111 222222222 333333333 444444444 555555555 666666666 777777777 888888888\n999999999');
});
it('should wordwrap at given amount of characters when give a number', function () {
expect(htmlToText(longStr, { wordwrap: 20 })).to.equal('111111111 222222222\n333333333 444444444\n555555555 666666666\n777777777 888888888\n999999999');
expect(htmlToText(longStr, { wordwrap: 50 })).to.equal('111111111 222222222 333333333 444444444 555555555\n666666666 777777777 888888888 999999999');
expect(htmlToText(longStr, { wordwrap: 70 })).to.equal('111111111 222222222 333333333 444444444 555555555 666666666 777777777\n888888888 999999999');
it('should wordwrap at given number of characters', function () {
expect(convert(longStr, { wordwrap: 20 })).to.equal('111111111 222222222\n333333333 444444444\n555555555 666666666\n777777777 888888888\n999999999');
expect(convert(longStr, { wordwrap: 50 })).to.equal('111111111 222222222 333333333 444444444 555555555\n666666666 777777777 888888888 999999999');
expect(convert(longStr, { wordwrap: 70 })).to.equal('111111111 222222222 333333333 444444444 555555555 666666666 777777777\n888888888 999999999');
});
it('should not wordwrap when given null', function () {
expect(htmlToText(longStr, { wordwrap: null })).to.equal(longStr);
expect(convert(longStr, { wordwrap: null })).to.equal(longStr);
});
it('should not wordwrap when given false', function () {
expect(htmlToText(longStr, { wordwrap: false })).to.equal(longStr);
expect(convert(longStr, { wordwrap: false })).to.equal(longStr);
});

@@ -98,3 +100,3 @@

const expected = 'This text isn\'t counted when calculating where to break a string for 80\ncharacter line lengths.';
expect(htmlToText(html, {})).to.equal(expected);
expect(convert(html, {})).to.equal(expected);
});

@@ -105,3 +107,3 @@

const expected = 'If a word with a line feed exists over the line feed boundary then you must\nrespect it.';
expect(htmlToText(html, {})).to.equal(expected);
expect(convert(html, {})).to.equal(expected);
});

@@ -112,3 +114,3 @@

const expected = 'This text isn\'t counted when calculating where to break a string for 80\ncharacter line lengths. However it can affect where the next line breaks and\nthis could lead to having an early line break';
expect(htmlToText(html, {})).to.equal(expected);
expect(convert(html, {})).to.equal(expected);
});

@@ -119,3 +121,3 @@

const expected = 'We appreciate your business. And we hope you\'ll check out our new products\n[http://example.com/]!';
expect(htmlToText(html, {})).to.equal(expected);
expect(convert(html, {})).to.equal(expected);
});

@@ -126,3 +128,3 @@

const expected = 'This string is meant to test if a string is split properly across a\nnewlineandlongword with following text.';
expect(htmlToText(html, {})).to.equal(expected);
expect(convert(html, {})).to.equal(expected);
});

@@ -133,3 +135,3 @@

const expected = 'This string is meant to test if a string is split properly across\nanewlineandlong word with following text.';
expect(htmlToText(html, {})).to.equal(expected);
expect(convert(html, {})).to.equal(expected);
});

@@ -144,3 +146,3 @@

const expected = 'One Two Three';
expect(htmlToText(html)).to.equal(expected);
expect(defaultConvert(html)).to.equal(expected);
});

@@ -151,3 +153,3 @@

const expected = 'One\nTwo\nThree';
expect(htmlToText(html, { preserveNewlines: true })).to.equal(expected);
expect(convert(html, { preserveNewlines: true })).to.equal(expected);
});

@@ -158,3 +160,3 @@

const expected = 'If a word with a line feed exists over the line feed boundary then\nyou\nmust\nrespect it.';
expect(htmlToText(html, { preserveNewlines: true })).to.equal(expected);
expect(convert(html, { preserveNewlines: true })).to.equal(expected);
});

@@ -165,3 +167,3 @@

const expected = 'If a word with a line feed exists over the line feed boundary then\nyou must respect it.';
expect(htmlToText(html, { preserveNewlines: true })).to.equal(expected);
expect(convert(html, { preserveNewlines: true })).to.equal(expected);
});

@@ -172,3 +174,3 @@

const expected = 'This string is meant to test if a string is split properly across a\nnewlineandlongword with following text.';
expect(htmlToText(html, { preserveNewlines: true })).to.equal(expected);
expect(convert(html, { preserveNewlines: true })).to.equal(expected);
});

@@ -179,3 +181,3 @@

const expected = 'This string is meant to test if a string is split properly across\nanewlineandlong\nword with following text.';
expect(htmlToText(html, { preserveNewlines: true })).to.equal(expected);
expect(convert(html, { preserveNewlines: true })).to.equal(expected);
});

@@ -186,3 +188,3 @@

const expected = 'If a word with a line feed exists over the line feed boundary then you must\nrespect it.';
expect(htmlToText(html, { preserveNewlines: true })).to.equal(expected);
expect(convert(html, { preserveNewlines: true })).to.equal(expected);
});

@@ -193,3 +195,3 @@

const expected = 'A string of text\nwith\nmultiple\nspaces\nthat\n\ncan be safely removed.';
expect(htmlToText(html, { preserveNewlines: true })).to.equal(expected);
expect(convert(html, { preserveNewlines: true })).to.equal(expected);
});

@@ -200,3 +202,3 @@

const expected = 'multiple\nspaces';
expect(htmlToText(html, { preserveNewlines: true })).to.equal(expected);
expect(convert(html, { preserveNewlines: true })).to.equal(expected);
});

@@ -209,7 +211,7 @@

it('should decode &#128514; to 😂', function () {
expect(htmlToText('&#128514;')).to.equal('😂');
expect(defaultConvert('&#128514;')).to.equal('😂');
});
it('should decode &lt;&gt; to <>', function () {
expect(htmlToText('<span>span</span>, &lt;not a span&gt;')).to.equal('span, <not a span>');
expect(defaultConvert('<span>span</span>, &lt;not a span&gt;')).to.equal('span, <not a span>');
});

@@ -224,4 +226,9 @@

const expected = fs.readFileSync(path.join(__dirname, 'test.txt'), 'utf8');
const options = { tables: ['#invoice', '.address'] };
expect(htmlToText(html, options)).to.equal(expected);
const options = {
selectors: [
{ selector: 'table#invoice', format: 'dataTable' },
{ selector: 'table.address', format: 'dataTable' }
]
};
expect(convert(html, options)).to.equal(expected);
});

@@ -233,38 +240,69 @@

const options = {
tables: ['.address'],
baseElement: 'table.address'
baseElements: { selectors: ['table.address'] },
selectors: [
{ selector: 'table.address', format: 'dataTable' }
]
};
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});
it('should retrieve and convert content under multiple base elements', function () {
it('should not repeat the same base element', function () {
const html = fs.readFileSync(path.join(__dirname, 'test.html'), 'utf8');
const expected = fs.readFileSync(path.join(__dirname, 'test-address-dup.txt'), 'utf8');
const expected = fs.readFileSync(path.join(__dirname, 'test-address.txt'), 'utf8');
const options = {
tables: ['.address'],
baseElement: ['table.address', 'table.address']
baseElements: { selectors: ['table.address', 'table.address'] },
selectors: [
{ selector: 'table.address', format: 'dataTable' }
]
};
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});
it('should retrieve and convert content under multiple base elements in any order', function () {
it('should retrieve base elements in order of occurrence', function () {
const html = fs.readFileSync(path.join(__dirname, 'test.html'), 'utf8');
const expected = fs.readFileSync(path.join(__dirname, 'test-any-order.txt'), 'utf8');
const expected = fs.readFileSync(path.join(__dirname, 'test-orderby-occurrence.txt'), 'utf8');
const options = {
tables: ['.address'],
baseElement: ['table.address', 'p.normal-space', 'table.address']
baseElements: {
selectors: ['p.normal-space.small', 'table.address'],
orderBy: 'occurrence'
},
selectors: [
{ selector: 'table.address', format: 'dataTable' }
]
};
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});
it('should process the first base element found when multiple exist', function () {
it('should retrieve base elements in order of selectors', function () {
const html = fs.readFileSync(path.join(__dirname, 'test.html'), 'utf8');
const expected = fs.readFileSync(path.join(__dirname, 'test-first-element.txt'), 'utf8');
const expected = fs.readFileSync(path.join(__dirname, 'test-orderby-selectors.txt'), 'utf8');
const options = {
tables: ['.address'],
baseElement: 'p.normal-space'
baseElements: {
selectors: ['p.normal-space.small', 'table.address'],
orderBy: 'selectors'
},
selectors: [
{ selector: 'table.address', format: 'dataTable' }
]
};
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});
it('should retrieve all different base elements matched the same selector', function () {
const html = fs.readFileSync(path.join(__dirname, 'test.html'), 'utf8');
const expected = fs.readFileSync(path.join(__dirname, 'test-multiple-elements.txt'), 'utf8');
const options = { baseElements: { selectors: ['p.normal-space'] } };
expect(convert(html, options)).to.equal(expected);
});
it('should respect maxBaseElements limit', function () {
const html = /*html*/`<!DOCTYPE html><html><head></head><body><p>a</p><div>div</div><p>b</p><p>c</p><p>d</p><p>e</p><p>f</p><p>g</p><p>h</p><p>i</p><p>j</p></body></html>`;
const expected = 'a\n\ndiv\n\nb';
const options = {
baseElements: { selectors: ['p', 'div'], orderBy: 'occurrence' },
limits: { maxBaseElements: 3 }
};
expect(convert(html, options)).to.equal(expected);
});
it('should retrieve and convert the entire document by default if no base element is found', function () {

@@ -274,6 +312,9 @@ const html = fs.readFileSync(path.join(__dirname, 'test.html'), 'utf8');

const options = {
tables: ['#invoice', '.address'],
baseElement: 'table.notthere'
baseElements: { selectors: ['table.notthere'] },
selectors: [
{ selector: 'table#invoice', format: 'dataTable' },
{ selector: 'table.address', format: 'dataTable' }
]
};
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -285,7 +326,12 @@

const options = {
baseElement: 'table.notthere',
returnDomByDefault: false,
tables: ['#invoice', '.address']
baseElements: {
selectors: ['table.notthere'],
returnDomByDefault: false
},
selectors: [
{ selector: 'table#invoice', format: 'dataTable' },
{ selector: 'table.address', format: 'dataTable' }
]
};
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -301,3 +347,3 @@

const expected = '_This_string_is_meant_to_test_if_a_string_is_split_properly_across_anewlineandlo\nng word_with_following_text.';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -308,3 +354,3 @@

const expected = '_This_string_is_meant_to_test_if_a_string_is_split_properly_across_anewlineandlongword_with_following_text.';
expect(htmlToText(html, {})).to.equal(expected);
expect(convert(html, {})).to.equal(expected);
});

@@ -316,3 +362,3 @@

const expected = '_This_string_is_meant_to_test_if_a_string_is_split_properly_across_anewlineandlong\nword_with_following_text.';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -324,3 +370,3 @@

const expected = '_This_string_is_meant_to_test_if_a_string_is_split_properly_across_anewlineandlong\nword_with_following_text.';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -332,3 +378,3 @@

const expected = '_This_string_is_meant_to_test_if_a_string_is_split_properly_across_\nanewlineandlong word_with_following_text.';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -340,3 +386,3 @@

const expected = '_This_string_is_meant_to_test_if_a_string_is_split_properly_across_\nanewlineandlong word_with_following_text.';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -348,3 +394,3 @@

const expected = '_This_string_is_meant_to_test_if_a_string_is_split_properly_across_\nanewlineandlong word_with_following_text.';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -356,3 +402,3 @@

const expected = '_This_string_is_meant_to_test_if_a_string_is_split-\nproperly_across_anewlineandlong word_with_following_text.';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -364,3 +410,3 @@

const expected = 'https://github.com/werk85/node-html-to-text/blob/master/lib/html-to-text.js';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -372,3 +418,3 @@

const expected = 'https://github.com/AndrewFinlay/node-html-to-text/commit/\n64836a5bd97294a672b24c26cb8a3ccdace41001';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -380,3 +426,3 @@

const expected = 'https://github.com/werk85/node-html-to-text/blob/master/lib/werk85/\nnode-html-to-text/blob/master/lib/werk85/node-html-to-text/blob/master/lib/\nwerk85/node-html-to-text/blob/master/lib/werk85/node-html-to-text/blob/master/\nlib/html-to-text.js';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -388,3 +434,3 @@

const expected = 'https://github.com/werk85/node-html-to-text/blob/master/lib/werk85/node-html-to-\ntext/blob/master/lib/werk85/node-html-to-text/blob/master/lib/werk85/node-html-t\no-text/blob/master/lib/werk85/node-html-to-text/blob/master/lib/html-to-text.js';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -396,3 +442,3 @@

const expected = '_This_string_is_meant_to_test_if_a_string_is_split_properly_across_\nanewlineandlong\nword_with_following_text.';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -403,3 +449,3 @@

const expected = '_This_string_is_meant_to_test_if_a_string_is_split_properly_across_anewlineandlong\nword_with_following_text.';
expect(htmlToText(html, { preserveNewlines: true })).to.equal(expected);
expect(convert(html, { preserveNewlines: true })).to.equal(expected);
});

@@ -409,5 +455,10 @@

const html = '<a href="http://images.fb.com/2015/12/21/ivete-sangalo-launches-360-music-video-on-facebook/">http://images.fb.com/2015/12/21/ivete-sangalo-launches-360-music-video-on-facebook/</a>';
const options = { longWordSplit: { wrapCharacters: ['/', '_'], forceWrapOnLimit: false }, tags: { 'a': { options: { hideLinkHrefIfSameAsText: true } } } };
const options = {
longWordSplit: { wrapCharacters: ['/', '_'], forceWrapOnLimit: false },
selectors: [
{ selector: 'a', options: { hideLinkHrefIfSameAsText: true } }
]
};
const expected = 'http://images.fb.com/2015/12/21/\nivete-sangalo-launches-360-music-video-on-facebook/';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -419,3 +470,3 @@

const expected = 'http://images.fb.com/2015/12/21/\nivete-sangalo-launches-360-music-video-on-facebook/\n[http://images.fb.com/2015/12/21/\nivete-sangalo-launches-360-music-video-on-facebook/]';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -430,3 +481,3 @@

const expected = 'foo bar';
expect(htmlToText(html)).to.equal(expected);
expect(defaultConvert(html)).to.equal(expected);
});

@@ -436,3 +487,3 @@

const html = /*html*/`a<span>&#x0020;</span>b<span>&Tab;</span>c<span>&NewLine;</span>d<span>&#10;</span>e`;
const result = htmlToText(html);
const result = defaultConvert(html);
const expected = 'a b c d e';

@@ -446,3 +497,3 @@ expect(result).to.equal(expected);

const expected = 'This text contains superscript text.';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -455,7 +506,7 @@

const expectedDefault = 'first span\u00a0\u00a0\u00a0last span';
expect(htmlToText(html)).to.equal(expectedDefault);
expect(defaultConvert(html)).to.equal(expectedDefault);
const options = { whitespaceCharacters: ' \t\r\n\f\u200b\u00a0' };
const expectedCustom = 'first span last span';
expect(htmlToText(html, options)).to.equal(expectedCustom);
expect(convert(html, options)).to.equal(expectedCustom);
});

@@ -467,6 +518,6 @@

const expectedDefault = 'foo bar baz';
expect(htmlToText(html)).to.equal(expectedDefault);
expect(defaultConvert(html)).to.equal(expectedDefault);
const expectedCustom = 'foo\nbar\nbaz';
expect(htmlToText(html, { preserveNewlines: true })).to.equal(expectedCustom);
expect(convert(html, { preserveNewlines: true })).to.equal(expectedCustom);
});

@@ -482,3 +533,3 @@

const expected = 'foo\n\nbar';
expect(htmlToText(html)).to.equal(expected);
expect(defaultConvert(html)).to.equal(expected);
});

@@ -501,3 +552,3 @@

html += '</body></html>';
expect(htmlToText(html)).to.equal(expected);
expect(defaultConvert(html)).to.equal(expected);
});

@@ -518,3 +569,3 @@

const expected = 'nnnnn(...)';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -531,3 +582,3 @@

const expected = 'a(...)g(...)m';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -542,6 +593,8 @@

},
tags: { 'p': { options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } } }
selectors: [
{ selector: 'p', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } }
]
};
const expected = 'a\nb\nc\nd\ne\nf\n(skipped the rest)';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -556,6 +609,8 @@

},
tags: { 'p': { options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } } }
selectors: [
{ selector: 'p', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } }
]
};
const expected = 'a\nb\nc\nd\ne\nf\ng\nh\ni\nj';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -567,6 +622,8 @@

limits: { maxChildNodes: 6 },
tags: { 'p': { options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } } }
selectors: [
{ selector: 'p', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } }
]
};
const expected = 'a\nb\nc\nd\ne\nf\n...';
expect(htmlToText(html, options)).to.equal(expected);
expect(convert(html, options)).to.equal(expected);
});

@@ -594,3 +651,3 @@

const options = { wordwrap: false };
expect(htmlToText(html, options).length).to.equal(1 << 24);
expect(convert(html, options).length).to.equal(1 << 24);
const expectedStderrBuffer = 'Input length 20000000 is above allowed limit of 16777216. Truncating without ellipsis.\n';

@@ -603,3 +660,3 @@ expect(getProcessStderrBuffer()).to.equal(expectedStderrBuffer);

const options = { limits: { maxInputLength: 42 } };
expect(htmlToText(html, options).length).to.equal(42);
expect(convert(html, options).length).to.equal(42);
const expectedStderrBuffer = 'Input length 20000000 is above allowed limit of 42. Truncating without ellipsis.\n';

@@ -606,0 +663,0 @@ expect(getProcessStderrBuffer()).to.equal(expectedStderrBuffer);

@@ -32,3 +32,7 @@

const expected = 'foo\n\n------------------------------\n\nbar';
const options = { tags: { 'hr': { options: { length: 30 } } } };
const options = {
selectors: [
{ selector: 'hr', options: { length: 30 } }
]
};
expect(htmlToText(html, options)).to.equal(expected);

@@ -55,3 +59,7 @@ });

const html = 'text<p>first</p><p>second</p>text';
const options = { tags: { 'p': { options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } } } };
const options = {
selectors: [
{ selector: 'p', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } }
]
};
const expected = 'text\nfirst\nsecond\ntext';

@@ -132,10 +140,10 @@ expect(htmlToText(html, options)).to.equal(expected);

const options = {
tags: {
'h1': { options: { uppercase: false } },
'h2': { options: { uppercase: false } },
'h3': { options: { uppercase: false } },
'h4': { options: { uppercase: false } },
'h5': { options: { uppercase: false } },
'h6': { options: { uppercase: false } }
}
selectors: [
{ selector: 'h1', options: { uppercase: false } },
{ selector: 'h2', options: { uppercase: false } },
{ selector: 'h3', options: { uppercase: false } },
{ selector: 'h4', options: { uppercase: false } },
{ selector: 'h5', options: { uppercase: false } },
{ selector: 'h6', options: { uppercase: false } },
]
};

@@ -167,3 +175,7 @@ expect(htmlToText(html, options)).to.equal(expected);

const options = { tags: { 'blockquote': { options: { trimEmptyLines: false } } } };
const options = {
selectors: [
{ selector: 'blockquote', options: { trimEmptyLines: false } }
]
};
const expectedCustom = '> \n> a\n> \n> \n> ';

@@ -185,3 +197,7 @@ expect(htmlToText(html, options)).to.equal(expectedCustom);

const html = '<img src="/test.png">';
const options = { tags: { 'img': { options: { baseUrl: 'https://example.com' } } } };
const options = {
selectors: [
{ selector: 'img', options: { baseUrl: 'https://example.com' } }
]
};
const expected = '[https://example.com/test.png]';

@@ -209,3 +225,7 @@ expect(htmlToText(html, options)).to.equal(expected);

const html = '<a href="/test.html">test</a>';
const options = { tags: { 'a': { options: { baseUrl: 'https://example.com' } } } };
const options = {
selectors: [
{ selector: 'a', options: { baseUrl: 'https://example.com' } }
]
};
const expected = 'test [https://example.com/test.html]';

@@ -230,3 +250,7 @@ expect(htmlToText(html, options)).to.equal(expected);

const expected = 'test http://my.link';
const options = { tags: { 'a': { options: { noLinkBrackets: true } } } };
const options = {
selectors: [
{ selector: 'a', options: { noLinkBrackets: true } }
]
};
expect(htmlToText(html, options)).to.equal(expected);

@@ -237,3 +261,7 @@ });

const html = '<a href="#link">test</a>';
const options = { tags: { 'a': { options: { noAnchorUrl: true } } } };
const options = {
selectors: [
{ selector: 'a', options: { noAnchorUrl: true } }
]
};
expect(htmlToText(html, options)).to.equal('test');

@@ -245,3 +273,7 @@ });

const expected = 'test [#link]';
const options = { tags: { 'a': { options: { noAnchorUrl: false } } } };
const options = {
selectors: [
{ selector: 'a', options: { noAnchorUrl: false } }
]
};
expect(htmlToText(html, options)).to.equal(expected);

@@ -267,3 +299,8 @@ });

const expected = 'HEADER CELL 1 HEADER CELL 2 [http://example.com] Regular cell [http://example.com]';
expect(htmlToText(html, { tables: true })).to.equal(expected);
const options = {
selectors: [
{ selector: 'table', format: 'dataTable' }
]
};
expect(htmlToText(html, options)).to.equal(expected);
});

@@ -290,3 +327,7 @@

const html = '<ul><li>foo</li><li>bar</li></ul>';
const options = { tags: { 'ul': { options: { itemPrefix: ' test ' } } } };
const options = {
selectors: [
{ selector: 'ul', options: { itemPrefix: ' test ' } }
]
};
const expected = ' test foo\n test bar';

@@ -491,3 +532,8 @@ expect(htmlToText(html, options)).to.equal(expected);

const expected = 'Good morning Jacob,\n\nLorem ipsum dolor sit amet.';
expect(htmlToText(html, { tables: true })).to.equal(expected);
const options = {
selectors: [
{ selector: 'table', format: 'dataTable' }
]
};
expect(htmlToText(html, options)).to.equal(expected);
});

@@ -506,3 +552,8 @@

const expected = 'Good morning Jacob,\n\nLorem ipsum dolor sit amet.';
expect(htmlToText(html, { tables: true })).to.equal(expected);
const options = {
selectors: [
{ selector: 'table', format: 'dataTable' }
]
};
expect(htmlToText(html, options)).to.equal(expected);
});

@@ -555,3 +606,8 @@

'nn qqqq pp';
expect(htmlToText(html, { tables: true })).to.equal(expected);
const options = {
selectors: [
{ selector: 'table', format: 'dataTable' }
]
};
expect(htmlToText(html, options)).to.equal(expected);
});

@@ -581,3 +637,7 @@

'd e f';
const options = { tables: true, tags: { 'table': { options: { colSpacing: 1, rowSpacing: 1 } } } };
const options = {
selectors: [
{ selector: 'table', format: 'dataTable', options: { colSpacing: 1, rowSpacing: 1 } }
]
};
expect(htmlToText(html, options)).to.equal(expected);

@@ -613,3 +673,8 @@ });

'f ggggggggg';
expect(htmlToText(html, { tables: true })).to.equal(expected);
const options = {
selectors: [
{ selector: 'table', format: 'dataTable' }
]
};
expect(htmlToText(html, options)).to.equal(expected);
});

@@ -641,3 +706,8 @@

' 2. list item two';
expect(htmlToText(html, { tables: true })).to.equal(expected);
const options = {
selectors: [
{ selector: 'table', format: 'dataTable' }
]
};
expect(htmlToText(html, options)).to.equal(expected);
});

@@ -677,3 +747,7 @@

' mollit anim id est laborum.';
const options = { tables: true, tags: { 'table': { options: { maxColumnWidth: 30 } } } };
const options = {
selectors: [
{ selector: 'table', format: 'dataTable', options: { maxColumnWidth: 30 } }
]
};
expect(htmlToText(html, options)).to.equal(expected);

@@ -712,3 +786,3 @@ });

const expected = '漢字';
const options = { tags: { 'rt': { format: 'skip' } } };
const options = { selectors: [ { selector: 'rt', format: 'skip' } ] };
expect(htmlToText(html, options)).to.equal(expected);

@@ -720,3 +794,3 @@ });

const expected = 'a b c d e';
const options = { tags: { 'span': { format: 'inline' } } };
const options = { selectors: [ { selector: 'span', format: 'inline' } ] };
expect(htmlToText(html, options)).to.equal(expected);

@@ -729,8 +803,8 @@ });

const options = {
tags: {
'budget': { format: 'block' },
'fidget': { format: 'block' },
'gadget': { format: 'block' },
'widget': { format: 'block' },
}
selectors: [
{ selector: 'budget', format: 'block' },
{ selector: 'fidget', format: 'block' },
{ selector: 'gadget', format: 'block' },
{ selector: 'widget', format: 'block' }
]
};

@@ -758,6 +832,6 @@ expect(htmlToText(html, options)).to.equal(expected);

},
tags: {
'foo': { format: 'formatFoo' },
'bar': { format: 'formatBar' }
}
selectors: [
{ selector: 'foo', format: 'formatFoo' },
{ selector: 'bar', format: 'formatBar' }
]
};

@@ -769,2 +843,49 @@ expect(htmlToText(html, options)).to.equal(expected);

describe('selectors', function () {
it('should merge entries with the same selector', function () {
const html = '<foo></foo><foo></foo><foo></foo>';
const expected = '----------\n\n\n\n----------\n\n\n\n----------';
const options = {
selectors: [
{ selector: 'foo', format: 'somethingElse' },
{ selector: 'foo', options: { length: 20 } },
{ selector: 'foo', options: { leadingLineBreaks: 4 } },
{ selector: 'foo', options: { trailingLineBreaks: 4 } },
{ selector: 'foo', options: { length: 10 } },
{ selector: 'foo', format: 'horizontalLine' }
]
};
expect(htmlToText(html, options)).to.equal(expected);
});
it('should pick the most specific selector', function () {
const html = '<hr/><hr class="foo" id="bar"/>';
const expected = '---\n\n-----';
const options = {
selectors: [
{ selector: 'hr', options: { length: 3 } },
{ selector: 'hr#bar', format: 'horizontalLine', options: { length: 5 } },
{ selector: 'hr.foo', format: 'horizontalLine', options: { length: 7 } },
]
};
expect(htmlToText(html, options)).to.equal(expected);
});
it('should pick the last selector of equal specificity', function () {
const html = '<hr class="bar baz"/><hr class="foo bar"/><hr class="foo baz"/>';
const expected = '-----\n\n-------\n\n-------';
const options = {
selectors: [
{ selector: 'hr.foo', format: 'horizontalLine', options: { length: 7 } },
{ selector: 'hr.baz', format: 'horizontalLine', options: { length: 3 } },
{ selector: 'hr.bar', format: 'horizontalLine', options: { length: 5 } },
{ selector: 'hr.foo' }
]
};
expect(htmlToText(html, options)).to.equal(expected);
});
});
});
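
The test changes above all follow one pattern: each key of the old `tags` object becomes an entry in the new `selectors` array, and `tables: true` becomes a `table` selector with the `dataTable` format. A minimal sketch of that mapping is given below. `tagsToSelectors` is a hypothetical helper written for illustration and is not part of html-to-text. It assumes `tables` is either absent or exactly `true`, and does not handle the older array form of that option.

```javascript
// Hypothetical migration helper (NOT part of html-to-text).
// Converts a legacy options object that uses `tags` and `tables: true`
// into the `selectors` array form used by the updated tests.
function tagsToSelectors (legacyOptions) {
  const {
    tags = {},
    tables,
    selectors: existingSelectors = [],
    ...rest
  } = legacyOptions;

  // One selector entry per tag name, in the object's insertion order.
  const converted = Object.entries(tags).map(
    ([tagName, definition]) => ({ selector: tagName, ...definition })
  );

  // Assumption: only `tables: true` is handled here.
  if (tables === true) {
    const tableEntry = converted.find(entry => entry.selector === 'table');
    if (tableEntry) {
      // Keep an explicitly chosen format, otherwise use 'dataTable'.
      if (!tableEntry.format) { tableEntry.format = 'dataTable'; }
    } else {
      converted.push({ selector: 'table', format: 'dataTable' });
    }
  }

  // Entries already written in the new form are kept after the converted ones.
  return { ...rest, selectors: [...converted, ...existingSelectors] };
}

// Usage: the legacy options from one of the table tests above.
const migrated = tagsToSelectors({
  tables: true,
  tags: { 'table': { options: { colSpacing: 1, rowSpacing: 1 } } }
});
console.log(JSON.stringify(migrated, null, 2));
```

The result can then be passed to `htmlToText(html, migrated)` in place of the old options object. Because the array order matters when selectors have equal specificity (the last one wins in the tests above), the sketch preserves the original key order instead of sorting.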

Sorry, the diff of this file is not supported yet

