html-to-text


html-to-text - npm Package Compare versions

Comparing version 8.2.1 to 9.0.0-preview1

lib/html-to-text.cjs


CHANGELOG.md
# Changelog
## Version 9.0.0-preview1 (WIP)
README location before full release: [v9/packages/html-to-text/README.md](https://github.com/html-to-text/node-html-to-text/blob/v9/packages/html-to-text/README.md)
All commits: [8.2.1...v9](https://github.com/html-to-text/node-html-to-text/compare/8.2.1...v9)
Version 9 roadmap: [#240](https://github.com/html-to-text/node-html-to-text/issues/240)
### Node version
Required Node version is now >=14.
### CommonJS and ES Module
Package now provides `cjs` and `mjs` exports.
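As a rough illustration of what the two exports mean in practice, the sketch below mimics, in a heavily simplified way, how Node picks an entry point from a conditional `exports` map. The map is copied from the `package.json` of this release. `resolveEntry` is a hypothetical helper written for this example. It is not part of Node or of `html-to-text`, and it ignores nested conditions, subpaths, and the `default` key.

```javascript
// The "exports" map from the 9.0.0-preview1 package.json.
const exportsMap = {
  import: './lib/html-to-text.mjs',
  require: './lib/html-to-text.cjs',
};

// Hypothetical helper: walk the map in key order and return the target of
// the first condition that is active for the current kind of import.
function resolveEntry(map, activeConditions) {
  for (const [condition, target] of Object.entries(map)) {
    if (activeConditions.includes(condition)) return target;
  }
  return undefined;
}

console.log(resolveEntry(exportsMap, ['node', 'import']));  // ./lib/html-to-text.mjs
console.log(resolveEntry(exportsMap, ['node', 'require'])); // ./lib/html-to-text.cjs
```

In other words, an `import` of the package should resolve to the `.mjs` build and a `require` to the `.cjs` build.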
### CLI is no longer built in
If you use CLI then install that package instead: _(WIP, to be provided before full release)_
### Dependency updates
* `htmlparser2` updated from 6.1.0 to 8.0.1 ([Release notes](https://github.com/fb55/htmlparser2/releases));
* `he` dependency is removed. It was needed at the time it was introduced, apparently, but at this point `htmlparser2` seems to do a better job itself.
### Removed features
* Options deprecated in version 6 are now removed;
* `decodeOptions` section removed with `he` dependency;
* `fromString` method removed;
* deprecated positional arguments in `BlockTextBuilder` methods are now removed.
Refer to README for [migration instructions](https://github.com/html-to-text/node-html-to-text#deprecated-or-removed-options).
### New options
* `decodeEntities` - controls whether HTML entities found in the input HTML should be decoded or left as is in the output text;
* `encodeCharacters` - a dictionary with characters that should be replaced in the output text and corresponding escape sequences.
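A minimal sketch of how the two options might appear in an options object. The `applyEncodeCharacters` function is a hypothetical stand-in written for this example to show the kind of per-character replacement described above. It is not the library's implementation, and the real behaviour may differ in detail.

```javascript
// Options object as it could be passed to `convert` or `compile`.
const entityOptions = {
  // Leave entities such as `&amp;` as they are instead of decoding them.
  decodeEntities: false,
  // Characters to replace in the output text, with their escape sequences.
  encodeCharacters: {
    '<': '&lt;',
    '>': '&gt;',
  },
};

// Hypothetical illustration only: replace each character that has an entry
// in the dictionary and leave every other character untouched.
function applyEncodeCharacters(text, dictionary) {
  return Array.from(text, (ch) =>
    Object.prototype.hasOwnProperty.call(dictionary, ch) ? dictionary[ch] : ch
  ).join('');
}

console.log(applyEncodeCharacters('1 < 2 > 0', entityOptions.encodeCharacters));
// prints "1 &lt; 2 &gt; 0"
```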
### New built-in formatters
### Changes to existing built-in formatters
* `anchor` and `image` got `pathRewrite` option;
* `dataTable` formatter allows zero `colSpacing`.
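The two changes above could be used along these lines. This is a sketch under assumptions: the `/old/` to `/new/` rule and the `newPrefix` metadata field are invented for illustration, and only the option names (`pathRewrite`, `baseUrl`, `colSpacing`) come from the package documentation.

```javascript
// Hypothetical rewrite rule: move links from a legacy "/old/" prefix to a
// new one. The second argument is the optional metadata object.
function pathRewrite(path, metadata) {
  const prefix = (metadata && metadata.newPrefix) || '/new/';
  return path.startsWith('/old/') ? prefix + path.slice('/old/'.length) : path;
}

const linkOptions = {
  selectors: [
    // pathRewrite is applied before baseUrl.
    { selector: 'a', options: { pathRewrite, baseUrl: 'https://example.com' } },
    { selector: 'img', options: { pathRewrite, baseUrl: 'https://example.com' } },
    // Data tables with no spacing between columns, which version 9 allows.
    { selector: 'table', format: 'dataTable', options: { colSpacing: 0 } },
  ],
};

console.log(pathRewrite('/old/docs/intro'));                    // /new/docs/intro
console.log(pathRewrite('/old/a.png', { newPrefix: '/cdn/' })); // /cdn/a.png
console.log(pathRewrite('/about'));                             // /about
```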
### Improvements for writing custom formatters
* Some logic for making lists is moved to BlockTextBuilder and can be reused for custom lists (`openList`, `openListItem`, `closeListItem`, `closeList`). Addresses [#238](https://github.com/html-to-text/node-html-to-text/issues/238);
* `startNoWrap`, `stopNoWrap` - allow keeping local inline content on a single line regardless of wrapping options;
* `addLiteral` - it's like `addInline` but circumvents most of the text processing logic. This should be preferred when inserting markup elements;
* It is now possible to provide a metadata object along with the HTML string to convert. The metadata object is available to custom formatters via `builder.metadata`. This makes it possible to compile the converter once and still supply per-document data. The metadata object is passed as the last optional argument to the `convert` function and to the function returned by the `compile` function.
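A custom formatter that combines `addLiteral` with `builder.metadata` might look roughly like the sketch below. The `userName` field, the `greeting` formatter name, and the `x-greeting` selector are assumptions made for this example. `createRecordingBuilder` is a hypothetical stand-in for `BlockTextBuilder` that only records calls, so the formatter can be exercised without the library.

```javascript
// Custom formatter using the (elem, walk, builder, formatOptions) signature.
// It ignores the element's children and prints a per-document greeting.
function greetingFormatter(elem, walk, builder, formatOptions) {
  const name = (builder.metadata && builder.metadata.userName) || 'reader';
  builder.openBlock({ leadingLineBreaks: formatOptions.leadingLineBreaks || 1 });
  // addLiteral circumvents most of the text processing logic.
  builder.addLiteral(`Hello, ${name}!`);
  builder.closeBlock({ trailingLineBreaks: formatOptions.trailingLineBreaks || 1 });
}

// Hypothetical stand-in for BlockTextBuilder: records each call in order.
function createRecordingBuilder(metadata) {
  const calls = [];
  return {
    metadata,
    calls,
    openBlock: (opts) => calls.push(['openBlock', opts]),
    addLiteral: (str) => calls.push(['addLiteral', str]),
    closeBlock: (opts) => calls.push(['closeBlock', opts]),
  };
}

const builder = createRecordingBuilder({ userName: 'Ada' });
greetingFormatter({}, () => {}, builder, {});
console.log(builder.calls[1]); // [ 'addLiteral', 'Hello, Ada!' ]

// With the real library this would be wired up roughly as:
//   const convertCompiled = compile({
//     formatters: { greeting: greetingFormatter },
//     selectors: [{ selector: 'x-greeting', format: 'greeting' }],
//   });
//   convertCompiled(html, { userName: 'Ada' }); // metadata as the last argument
```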
----
## Version 8.2.1
No changes in the package. Bumped dev dependencies and regenerated `package-lock.json`.
No changes in published package. Bumped dev dependencies and regenerated `package-lock.json`.

@@ -71,2 +126,4 @@ ## Version 8.2.0

----
## Version ~~7.1.2~~ 7.1.3

@@ -115,2 +172,4 @@

----
## Version 6.0.0

@@ -166,2 +225,4 @@

----
## Version 5.1.1

@@ -212,2 +273,4 @@

----
## Version 4.0.0

@@ -219,2 +282,4 @@

----
## Version 3.3.0

@@ -241,2 +306,4 @@

----
## Version 2.1.1

@@ -262,2 +329,4 @@

----
## Version 1.6.2

@@ -264,0 +333,0 @@


package.json
{
"name": "html-to-text",
"version": "8.2.1",
"version": "9.0.0-preview1",
"description": "Advanced html to plain text converter",
"keywords": [
"html",
"node",
"text",
"mail",
"plain",
"converter"
],
"license": "MIT",
"author": {
"name": "Malte Legenhausen",
"email": "legenhausen@werk85.de"
},
"author": "Malte Legenhausen <legenhausen@werk85.de>",
"contributors": [
"KillyMXI <killy@mxii.eu.org>"
],
"homepage": "https://github.com/html-to-text/node-html-to-text",

@@ -18,42 +26,39 @@ "repository": {

},
"keywords": [
"html",
"node",
"text",
"mail",
"plain",
"converter"
"type": "module",
"main": "./lib/html-to-text.cjs",
"module": "./lib/html-to-text.mjs",
"exports": {
"import": "./lib/html-to-text.mjs",
"require": "./lib/html-to-text.cjs"
},
"files": [
"lib",
"README.md",
"CHANGELOG.md",
"LICENSE"
],
"engines": {
"node": ">=10.23.2"
"node": ">=14"
},
"main": "index.js",
"bin": {
"html-to-text": "./bin/cli.js"
},
"scripts": {
"build:rollup": "rollup -c",
"build": "npm run clean && npm run build:rollup",
"clean": "rimraf lib",
"copy:license": "copyfiles -f ../../LICENSE .",
"cover": "c8 --reporter=lcov --reporter=text-summary mocha -t 20000",
"example": "node ./example/html-to-text.js",
"lint": "eslint .",
"prepublishOnly": "npm run lint && npm test",
"lint": "eslint ../../",
"prepublishOnly": "npm run copy:license && npm run lint && npm test",
"test": "mocha"
},
"dependencies": {
"@selderee/plugin-htmlparser2": "^0.6.0",
"@selderee/plugin-htmlparser2": "^0.9.0",
"deepmerge": "^4.2.2",
"he": "^1.2.0",
"htmlparser2": "^6.1.0",
"minimist": "^1.2.6",
"selderee": "^0.6.0"
"htmlparser2": "^8.0.1",
"selderee": "^0.9.0"
},
"devDependencies": {
"c8": "^7.12.0",
"chai": "^4.3.6",
"eslint": "^7.32.0",
"eslint-plugin-filenames": "^1.3.2",
"eslint-plugin-import": "^2.26.0",
"eslint-plugin-jsdoc": "^33.3.0",
"eslint-plugin-mocha": "^8.2.0",
"mocha": "^8.4.0"
"mocha": {
"node-option": [
"experimental-specifier-resolution=node"
]
}
}

@@ -5,3 +5,2 @@ # html-to-text

[![test status](https://github.com/html-to-text/node-html-to-text/workflows/test/badge.svg)](https://github.com/html-to-text/node-html-to-text/actions/workflows/test.yml)
[![Test Coverage](https://codeclimate.com/github/html-to-text/node-html-to-text/badges/coverage.svg)](https://codeclimate.com/github/html-to-text/node-html-to-text/coverage)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/html-to-text/node-html-to-text/blob/master/LICENSE-MIT)

@@ -24,5 +23,5 @@ [![npm](https://img.shields.io/npm/v/html-to-text?logo=npm)](https://www.npmjs.com/package/html-to-text)

Available here: [CHANGELOG.md](https://github.com/html-to-text/node-html-to-text/blob/master/CHANGELOG.md)
~~Available here: [CHANGELOG.md](https://github.com/html-to-text/node-html-to-text/blob/master/CHANGELOG.md)~~
Version 6 contains a ton of changes, so it is worth taking a look.
Version 6 contains a ton of changes, so it is worth taking a look at the full changelog.

@@ -33,2 +32,6 @@ Version 7 contains an important change for custom formatters.

Version 9 gets a significant internal rework, drops a lot of previously deprecated options, introduces some new formatters and new capabilities for custom formatters.
Version 9 WIP [GitHub branch](https://github.com/html-to-text/node-html-to-text/tree/v9), [CHANGELOG.md](https://github.com/html-to-text/node-html-to-text/blob/v9/packages/html-to-text/CHANGELOG.md).
## Installation

@@ -86,3 +89,4 @@

`baseElements.returnDomByDefault` | `true` | Convert the entire document if none of provided selectors match.
`decodeOptions` | `{ isAttributeValue: false, strict: false }` | Text decoding options given to `he.decode`. For more information see the [he](https://github.com/mathiasbynens/he) module.
`decodeEntities` | `true` | Decode HTML entities found in the input HTML if `true`. Otherwise preserve in output text.
`encodeCharacters` | `{}` | A dictionary with characters that should be replaced in the output text and corresponding escape sequences.
`formatters` | `{}` | An object with custom formatting functions for specific elements (see [Override formatting](#override-formatting) section below).

@@ -108,20 +112,21 @@ `limits` | | Describes how to limit the output text in case of large HTML documents.

`baseElement` | 8.0 | | `baseElements: { selectors: [ 'body' ] }`
`decodeOptions` | | 9.0 | Entity decoding is now handled by [htmlparser2](https://github.com/fb55/htmlparser2) itself and [entities](https://github.com/fb55/entities) internally. No user-configurable parts compared to [he](https://github.com/mathiasbynens/he).
`format` | | 6.0 | The way formatters are written has changed completely. New formatters have to be added to the `formatters` option, old ones can not be reused without rewrite. See [new instructions](#override-formatting) below.
`hideLinkHrefIfSameAsText` | 6.0 | *9.0* | `selectors: [ { selector: 'a', options: { hideLinkHrefIfSameAsText: true } } ]`
`ignoreHref` | 6.0 | *9.0* | `selectors: [ { selector: 'a', options: { ignoreHref: true } } ]`
`ignoreImage` | 6.0 | *9.0* | `selectors: [ { selector: 'img', format: 'skip' } ]`
`linkHrefBaseUrl` | 6.0 | *9.0* | `selectors: [`<br/>`{ selector: 'a', options: { baseUrl: 'https://example.com' } },`<br/>`{ selector: 'img', options: { baseUrl: 'https://example.com' } }`<br/>`]`
`noAnchorUrl` | 6.0 | *9.0* | `selectors: [ { selector: 'a', options: { noAnchorUrl: true } } ]`
`noLinkBrackets` | 6.0 | *9.0* | `selectors: [ { selector: 'a', options: { linkBrackets: false } } ]`
`hideLinkHrefIfSameAsText` | 6.0 | 9.0 | `selectors: [ { selector: 'a', options: { hideLinkHrefIfSameAsText: true } } ]`
`ignoreHref` | 6.0 | 9.0 | `selectors: [ { selector: 'a', options: { ignoreHref: true } } ]`
`ignoreImage` | 6.0 | 9.0 | `selectors: [ { selector: 'img', format: 'skip' } ]`
`linkHrefBaseUrl` | 6.0 | 9.0 | `selectors: [`<br/>`{ selector: 'a', options: { baseUrl: 'https://example.com' } },`<br/>`{ selector: 'img', options: { baseUrl: 'https://example.com' } }`<br/>`]`
`noAnchorUrl` | 6.0 | 9.0 | `selectors: [ { selector: 'a', options: { noAnchorUrl: true } } ]`
`noLinkBrackets` | 6.0 | 9.0 | `selectors: [ { selector: 'a', options: { linkBrackets: false } } ]`
`returnDomByDefault` | 8.0 | | `baseElements: { returnDomByDefault: true }`
`singleNewLineParagraphs` | 6.0 | *9.0* | `selectors: [`<br/>`{ selector: 'p', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },`<br/>`{ selector: 'pre', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } }`<br/>`]`
`singleNewLineParagraphs` | 6.0 | 9.0 | `selectors: [`<br/>`{ selector: 'p', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },`<br/>`{ selector: 'pre', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } }`<br/>`]`
`tables` | 8.0 | | `selectors: [ { selector: 'table.class#id', format: 'dataTable' } ]`
`tags` | 8.0 | | See [Selectors](#selectors) section below.
`unorderedListItemPrefix` | 6.0 | *9.0* | `selectors: [ { selector: 'ul', options: { itemPrefix: ' * ' } } ]`
`uppercaseHeadings` | 6.0 | *9.0* | `selectors: [`<br/>`{ selector: 'h1', options: { uppercase: false } },`<br/>`...`<br/>`{ selector: 'table', options: { uppercaseHeaderCells: false } }`<br/>`]`
`unorderedListItemPrefix` | 6.0 | 9.0 | `selectors: [ { selector: 'ul', options: { itemPrefix: ' * ' } } ]`
`uppercaseHeadings` | 6.0 | 9.0 | `selectors: [`<br/>`{ selector: 'h1', options: { uppercase: false } },`<br/>`...`<br/>`{ selector: 'table', options: { uppercaseHeaderCells: false } }`<br/>`]`
Other things deprecated:
Other things removed:
* `fromString` method;
* positional arguments in `BlockTextBuilder` methods (in case you have written some custom formatters for version 6.0).
* `fromString` method - use `convert` or `htmlToText` instead;
* positional arguments in `BlockTextBuilder` methods - pass option objects instead.
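For code that still passes some of the removed options, the replacements listed in the table above can be applied mechanically. The helper below is a hypothetical sketch written for this example. It covers only a subset of the removed options and is not part of the package.

```javascript
// Hypothetical helper: translate a few options removed in version 9 into
// the equivalent `selectors` entries.
function migrateRemovedOptions(legacy) {
  const anchorOptions = {};
  const selectors = [];

  if (legacy.ignoreHref) anchorOptions.ignoreHref = true;
  if (legacy.noAnchorUrl) anchorOptions.noAnchorUrl = true;
  if (legacy.hideLinkHrefIfSameAsText) anchorOptions.hideLinkHrefIfSameAsText = true;
  if (legacy.noLinkBrackets) anchorOptions.linkBrackets = false;
  if (legacy.linkHrefBaseUrl) anchorOptions.baseUrl = legacy.linkHrefBaseUrl;

  if (Object.keys(anchorOptions).length > 0) {
    selectors.push({ selector: 'a', options: anchorOptions });
  }
  // Skipped images need no base URL, so emit at most one `img` entry.
  if (legacy.ignoreImage) {
    selectors.push({ selector: 'img', format: 'skip' });
  } else if (legacy.linkHrefBaseUrl) {
    selectors.push({ selector: 'img', options: { baseUrl: legacy.linkHrefBaseUrl } });
  }
  if (legacy.unorderedListItemPrefix !== undefined) {
    selectors.push({ selector: 'ul', options: { itemPrefix: legacy.unorderedListItemPrefix } });
  }
  return { selectors };
}

const migrated = migrateRemovedOptions({ ignoreHref: true, ignoreImage: true });
console.log(JSON.stringify(migrated));
// {"selectors":[{"selector":"a","options":{"ignoreHref":true}},{"selector":"img","format":"skip"}]}
```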

@@ -207,4 +212,13 @@ #### Selectors

* `dataTable` - for visually-accurate tables. Note that this might not be search-friendly (output text will look like gibberish to a machine when there are any wrapped cell contents) and is better avoided for tables used as a page layout tool;
* `skip` - as the name implies, it skips the given tag with its contents without printing anything.
Format | Description
---------------- | -----------
`dataTable` | For visually-accurate tables. Note that this might not be search-friendly (output text will look like gibberish to a machine when there are any wrapped cell contents) and is better avoided for tables used as a page layout tool.
`skip` | Skips the given tag with its contents without printing anything.
`blockString` | Insert a block with the given string literal (`formatOptions.string`) instead of the tag.
`blockTag` | Render an element as an HTML block tag, convert its contents to text.
`blockHtml` | Render an element with all its children as an HTML block.
`inlineString` | Insert the given string literal (`formatOptions.string`) inline instead of the tag.
`inlineSurround` | Render inline element wrapped with given strings (`formatOptions.prefix` and `formatOptions.suffix`).
`inlineTag` | Render an element as an inline HTML tag, convert its contents to text.
`inlineHtml` | Render an element with all its children as inline HTML.
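A few of the formats in this table could be assigned through the `selectors` array as sketched below. The `surround` helper and the particular tags and strings are assumptions made for this example. Only the format names and the `string`, `prefix`, and `suffix` option names come from the table.

```javascript
// Hypothetical helper that builds a selector entry for `inlineSurround`.
// When no suffix is given, the prefix is reused on both sides.
function surround(selector, prefix, suffix = prefix) {
  return { selector, format: 'inlineSurround', options: { prefix, suffix } };
}

const formatSelection = {
  selectors: [
    surround('b', '*'),
    surround('q', '«', '»'),
    // Replace every <hr> with a fixed string in its own block.
    { selector: 'hr', format: 'blockString', options: { string: '* * *' } },
    // Keep <code> as an inline HTML tag while converting its contents to text.
    { selector: 'code', format: 'inlineTag' },
  ],
};

console.log(JSON.stringify(formatSelection.selectors[0]));
// {"selector":"b","format":"inlineSurround","options":{"prefix":"*","suffix":"*"}}
```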

@@ -219,4 +233,5 @@ ##### Format options

`trailingLineBreaks` | `1` or `2` | all block-level formatters | Number of line breaks to separate this block from the next one.<br/>Note that N+1 line breaks are needed to make N empty lines.
`baseUrl` | `null` | `anchor`, `image` | Server host for link `href` attributes and image `src` attributes relative to the root (the ones that start with `/`).<br/>For example, with `baseUrl = 'http://asdf.com'` and `<a href='/dir/subdir'>...</a>` the link in the text will be `http://asdf.com/dir/subdir`.<br/>Keep in mind that `baseUrl` should not end with a `/`.
`baseUrl` | `null` | `anchor`, `image` | Server host for link `href` attributes and image `src` attributes relative to the root (the ones that start with `/`).<br/>For example, with `baseUrl = 'http://asdf.com'` and `<a href='/dir/subdir'>...</a>` the link in the text will be `http://asdf.com/dir/subdir`.
`linkBrackets` | `['[', ']']` | `anchor`, `image` | Surround links with these brackets.<br/>Set to `false` or `['', '']` to disable.
`pathRewrite` | `undefined` | `anchor`, `image` | A function to rewrite link `href` attributes and image `src` attributes. Optional second argument is the metadata object.<br/>Applied before `baseUrl`.
`hideLinkHrefIfSameAsText` | `false` | `anchor` | By default links are translated in the following way:<br/>`<a href='link'>text</a>` => becomes => `text [link]`.<br/>If this option is set to `true` and `link` and `text` are the same, `[link]` will be omitted and only `text` will be present.

@@ -242,4 +257,2 @@ `ignoreHref` | `false` | `anchor` | Ignore all links. Only process internal text of anchor tags.

This is significantly changed in version 6.
`formatters` option is an object that holds formatting functions. They can be assigned to format different elements in the `selectors` array.

@@ -282,2 +295,4 @@

New in version 9: metadata object can be provided as the last optional argument of the `convert` function (or the function returned by `compile` function). It can be accessed by formatters as `builder.metadata`.
Refer to [built-in formatters](https://github.com/html-to-text/node-html-to-text/blob/master/lib/formatter.js) for more examples. The easiest way to write your own is to pick an existing one and customize.

@@ -287,22 +302,2 @@

Note: `BlockTextBuilder` got some important [changes](https://github.com/html-to-text/node-html-to-text/commit/f50f10f54cf814efb2f7633d9d377ba7eadeaf1e) in the version 7. Positional arguments are deprecated and formatters written for the version 6 have to be updated accordingly in order to keep working after next major update.
## Command Line Interface
It is possible to use html-to-text as a command line interface. This allows easy validation of your generated text and integration with other systems that do not run on Node.js.
`html-to-text` uses `stdin` and `stdout` for data in and output. So you can use `html-to-text` the following way:
```
cat example/test.html | html-to-text > test.txt
```
All the options described above are also available. You can use them like this:
```
cat example/test.html | html-to-text --tables=#invoice,.address --wordwrap=100 > test.txt
```
The `tables` option has to be declared as a comma-separated list without whitespace.
## Example

@@ -309,0 +304,0 @@
