html-to-text
Comparing version 8.2.1 to 9.0.0-preview1
# Changelog | ||
## Version 9.0.0-preview1 (WIP) | ||
README location before full release: [v9/packages/html-to-text/README.md](https://github.com/html-to-text/node-html-to-text/blob/v9/packages/html-to-text/README.md) | ||
All commits: [8.2.1...v9](https://github.com/html-to-text/node-html-to-text/compare/8.2.1...v9) | ||
Version 9 roadmap: [#240](https://github.com/html-to-text/node-html-to-text/issues/240) | ||
### Node version | ||
Required Node version is now >=14. | ||
### CommonJS and ES Module | ||
Package now provides `cjs` and `mjs` exports. | ||
### CLI is no longer built in | ||
If you use the CLI, install the separate CLI package instead: _(WIP, to be provided before full release)_ | ||
### Dependency updates | ||
* `htmlparser2` updated from 6.1.0 to 8.0.1 ([Release notes](https://github.com/fb55/htmlparser2/releases)); | ||
* `he` dependency is removed. It was needed at the time it was introduced, apparently, but at this point `htmlparser2` seems to do a better job itself. | ||
### Removed features | ||
* Options deprecated in version 6 are now removed; | ||
* `decodeOptions` section removed with `he` dependency; | ||
* `fromString` method removed; | ||
* deprecated positional arguments in `BlockTextBuilder` methods are now removed. | ||
Refer to README for [migration instructions](https://github.com/html-to-text/node-html-to-text#deprecated-or-removed-options). | ||
### New options | ||
* `decodeEntities` - controls whether HTML entities found in the input HTML should be decoded or left as is in the output text; | ||
* `encodeCharacters` - a dictionary with characters that should be replaced in the output text and corresponding escape sequences. | ||
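A minimal sketch of an options object that uses the two new options. The `applyDictionary` helper is hypothetical and is not part of html-to-text; it only illustrates what replacing characters with escape sequences means for such a dictionary. The commented-out `convert` call assumes the package is installed.

```javascript
// Options object using the two options introduced in version 9.
const options = {
  // true is the documented default; set to false to leave entities
  // such as &amp; untouched in the output text.
  decodeEntities: true,
  // Characters to replace in the output text, with their escape sequences.
  encodeCharacters: {
    '<': '&lt;',
    '>': '&gt;',
  },
};

// Hypothetical helper (NOT the library's implementation): replaces every
// character that has an entry in the dictionary, keeps all others as is.
function applyDictionary(text, dictionary) {
  return Array.from(text, (ch) =>
    Object.prototype.hasOwnProperty.call(dictionary, ch) ? dictionary[ch] : ch
  ).join('');
}

// With the package installed, the options would be passed like this:
//   const { convert } = require('html-to-text');
//   const text = convert('<p>a &lt; b</p>', options);
```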
### New built-in formatters | ||
### Changes to existing built-in formatters | ||
* `anchor` and `image` got a `pathRewrite` option; | ||
* `dataTable` formatter allows zero `colSpacing`. | ||
### Improvements for writing custom formatters | ||
* Some logic for making lists has been moved to `BlockTextBuilder` and can be reused for custom lists (`openList`, `openListItem`, `closeListItem`, `closeList`). Addresses [#238](https://github.com/html-to-text/node-html-to-text/issues/238); | ||
* `startNoWrap`, `stopNoWrap` - keep local inline content on a single line regardless of wrapping options; | ||
* `addLiteral` - like `addInline`, but circumvents most of the text processing logic. Prefer it when inserting markup elements; | ||
* A metadata object can now be provided along with the HTML string to convert. It is available to custom formatters via `builder.metadata`. This makes it possible to compile the converter once and still supply per-document data. The metadata object is passed as the last optional argument to the `convert` function and to the function returned by the `compile` function. | ||
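As an illustration of the last two items, here is a sketch of a custom formatter that reads `builder.metadata` and inserts a marker with `addLiteral`. The formatter name `stampedBlock`, the `docId` metadata field and the recording builder are assumptions made for this example; the recording builder is only a stand-in for `BlockTextBuilder` so the call sequence can be inspected without the package.

```javascript
// Custom formatter: emits a literal marker built from per-document metadata,
// then renders the element's children between block boundaries.
function stampedBlock(elem, walk, builder, formatOptions) {
  const meta = builder.metadata || {};
  builder.openBlock({ leadingLineBreaks: formatOptions.leadingLineBreaks || 1 });
  // addLiteral circumvents most text processing, so the marker is kept as is.
  builder.addLiteral(`[doc:${meta.docId || 'unknown'}] `);
  walk(elem.children, builder);
  builder.closeBlock({ trailingLineBreaks: formatOptions.trailingLineBreaks || 1 });
}

// Minimal stand-in for BlockTextBuilder (illustration only):
// it records which methods the formatter calls and with what arguments.
function createRecordingBuilder(metadata) {
  const calls = [];
  return {
    metadata,
    calls,
    openBlock: (opts) => calls.push(['openBlock', opts]),
    addLiteral: (str) => calls.push(['addLiteral', str]),
    closeBlock: (opts) => calls.push(['closeBlock', opts]),
  };
}

const builder = createRecordingBuilder({ docId: 'A-17' });
stampedBlock({ children: [] }, () => {}, builder, {});

// With the package installed, the formatter and metadata would be supplied as:
//   convert(html, {
//     formatters: { stampedBlock },
//     selectors: [{ selector: 'article', format: 'stampedBlock' }],
//   }, { docId: 'A-17' });
```

Because the metadata travels with each call rather than with the options, the same compiled converter can be reused for many documents.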
---- | ||
## Version 8.2.1 | ||
No changes in the package. Bumped dev dependencies and regenerated `package-lock.json`. | ||
No changes in published package. Bumped dev dependencies and regenerated `package-lock.json`. | ||
@@ -71,2 +126,4 @@ ## Version 8.2.0 | ||
---- | ||
## Version ~~7.1.2~~ 7.1.3 | ||
@@ -115,2 +172,4 @@ | ||
---- | ||
## Version 6.0.0 | ||
@@ -166,2 +225,4 @@ | ||
---- | ||
## Version 5.1.1 | ||
@@ -212,2 +273,4 @@ | ||
---- | ||
## Version 4.0.0 | ||
@@ -219,2 +282,4 @@ | ||
---- | ||
## Version 3.3.0 | ||
@@ -241,2 +306,4 @@ | ||
---- | ||
## Version 2.1.1 | ||
@@ -262,2 +329,4 @@ | ||
---- | ||
## Version 1.6.2 | ||
@@ -264,0 +333,0 @@ |
{ | ||
"name": "html-to-text", | ||
"version": "8.2.1", | ||
"version": "9.0.0-preview1", | ||
"description": "Advanced html to plain text converter", | ||
"keywords": [ | ||
"html", | ||
"node", | ||
"text", | ||
"mail", | ||
"plain", | ||
"converter" | ||
], | ||
"license": "MIT", | ||
"author": { | ||
"name": "Malte Legenhausen", | ||
"email": "legenhausen@werk85.de" | ||
}, | ||
"author": "Malte Legenhausen <legenhausen@werk85.de>", | ||
"contributors": [ | ||
"KillyMXI <killy@mxii.eu.org>" | ||
], | ||
"homepage": "https://github.com/html-to-text/node-html-to-text", | ||
@@ -18,42 +26,39 @@ "repository": { | ||
}, | ||
"keywords": [ | ||
"html", | ||
"node", | ||
"text", | ||
"mail", | ||
"plain", | ||
"converter" | ||
"type": "module", | ||
"main": "./lib/html-to-text.cjs", | ||
"module": "./lib/html-to-text.mjs", | ||
"exports": { | ||
"import": "./lib/html-to-text.mjs", | ||
"require": "./lib/html-to-text.cjs" | ||
}, | ||
"files": [ | ||
"lib", | ||
"README.md", | ||
"CHANGELOG.md", | ||
"LICENSE" | ||
], | ||
"engines": { | ||
"node": ">=10.23.2" | ||
"node": ">=14" | ||
}, | ||
"main": "index.js", | ||
"bin": { | ||
"html-to-text": "./bin/cli.js" | ||
}, | ||
"scripts": { | ||
"build:rollup": "rollup -c", | ||
"build": "npm run clean && npm run build:rollup", | ||
"clean": "rimraf lib", | ||
"copy:license": "copyfiles -f ../../LICENSE .", | ||
"cover": "c8 --reporter=lcov --reporter=text-summary mocha -t 20000", | ||
"example": "node ./example/html-to-text.js", | ||
"lint": "eslint .", | ||
"prepublishOnly": "npm run lint && npm test", | ||
"lint": "eslint ../../", | ||
"prepublishOnly": "npm run copy:license && npm run lint && npm test", | ||
"test": "mocha" | ||
}, | ||
"dependencies": { | ||
"@selderee/plugin-htmlparser2": "^0.6.0", | ||
"@selderee/plugin-htmlparser2": "^0.9.0", | ||
"deepmerge": "^4.2.2", | ||
"he": "^1.2.0", | ||
"htmlparser2": "^6.1.0", | ||
"minimist": "^1.2.6", | ||
"selderee": "^0.6.0" | ||
"htmlparser2": "^8.0.1", | ||
"selderee": "^0.9.0" | ||
}, | ||
"devDependencies": { | ||
"c8": "^7.12.0", | ||
"chai": "^4.3.6", | ||
"eslint": "^7.32.0", | ||
"eslint-plugin-filenames": "^1.3.2", | ||
"eslint-plugin-import": "^2.26.0", | ||
"eslint-plugin-jsdoc": "^33.3.0", | ||
"eslint-plugin-mocha": "^8.2.0", | ||
"mocha": "^8.4.0" | ||
"mocha": { | ||
"node-option": [ | ||
"experimental-specifier-resolution=node" | ||
] | ||
} | ||
} |
@@ -5,3 +5,2 @@ # html-to-text | ||
[![test status](https://github.com/html-to-text/node-html-to-text/workflows/test/badge.svg)](https://github.com/html-to-text/node-html-to-text/actions/workflows/test.yml) | ||
[![Test Coverage](https://codeclimate.com/github/html-to-text/node-html-to-text/badges/coverage.svg)](https://codeclimate.com/github/html-to-text/node-html-to-text/coverage) | ||
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/html-to-text/node-html-to-text/blob/master/LICENSE-MIT) | ||
@@ -24,5 +23,5 @@ [![npm](https://img.shields.io/npm/v/html-to-text?logo=npm)](https://www.npmjs.com/package/html-to-text) | ||
Available here: [CHANGELOG.md](https://github.com/html-to-text/node-html-to-text/blob/master/CHANGELOG.md) | ||
~~Available here: [CHANGELOG.md](https://github.com/html-to-text/node-html-to-text/blob/master/CHANGELOG.md)~~ | ||
Version 6 contains a ton of changes, so it is worth taking a look. | ||
Version 6 contains a ton of changes, so it is worth taking a look at the full changelog. | ||
@@ -33,2 +32,6 @@ Version 7 contains an important change for custom formatters. | ||
Version 9 gets a significant internal rework, drops a lot of previously deprecated options, introduces some new formatters and new capabilities for custom formatters. | ||
Version 9 WIP [GitHub branch](https://github.com/html-to-text/node-html-to-text/tree/v9), [CHANGELOG.md](https://github.com/html-to-text/node-html-to-text/blob/v9/packages/html-to-text/CHANGELOG.md). | ||
## Installation | ||
@@ -86,3 +89,4 @@ | ||
`baseElements.returnDomByDefault` | `true` | Convert the entire document if none of provided selectors match. | ||
`decodeOptions` | `{ isAttributeValue: false, strict: false }` | Text decoding options given to `he.decode`. For more information see the [he](https://github.com/mathiasbynens/he) module. | ||
`decodeEntities` | `true` | Decode HTML entities found in the input HTML if `true`. Otherwise preserve in output text. | ||
`encodeCharacters` | `{}` | A dictionary with characters that should be replaced in the output text and corresponding escape sequences. | ||
`formatters` | `{}` | An object with custom formatting functions for specific elements (see [Override formatting](#override-formatting) section below). | ||
@@ -108,20 +112,21 @@ `limits` | | Describes how to limit the output text in case of large HTML documents. | ||
`baseElement` | 8.0 | | `baseElements: { selectors: [ 'body' ] }` | ||
`decodeOptions` | | 9.0 | Entity decoding is now handled by [htmlparser2](https://github.com/fb55/htmlparser2) itself and [entities](https://github.com/fb55/entities) internally. No user-configurable parts compared to [he](https://github.com/mathiasbynens/he). | ||
`format` | | 6.0 | The way formatters are written has changed completely. New formatters have to be added to the `formatters` option, old ones can not be reused without rewrite. See [new instructions](#override-formatting) below. | ||
`hideLinkHrefIfSameAsText` | 6.0 | *9.0* | `selectors: [ { selector: 'a', options: { hideLinkHrefIfSameAsText: true } } ]` | ||
`ignoreHref` | 6.0 | *9.0* | `selectors: [ { selector: 'a', options: { ignoreHref: true } } ]` | ||
`ignoreImage` | 6.0 | *9.0* | `selectors: [ { selector: 'img', format: 'skip' } ]` | ||
`linkHrefBaseUrl` | 6.0 | *9.0* | `selectors: [`<br/>`{ selector: 'a', options: { baseUrl: 'https://example.com' } },`<br/>`{ selector: 'img', options: { baseUrl: 'https://example.com' } }`<br/>`]` | ||
`noAnchorUrl` | 6.0 | *9.0* | `selectors: [ { selector: 'a', options: { noAnchorUrl: true } } ]` | ||
`noLinkBrackets` | 6.0 | *9.0* | `selectors: [ { selector: 'a', options: { linkBrackets: false } } ]` | ||
`hideLinkHrefIfSameAsText` | 6.0 | 9.0 | `selectors: [ { selector: 'a', options: { hideLinkHrefIfSameAsText: true } } ]` | ||
`ignoreHref` | 6.0 | 9.0 | `selectors: [ { selector: 'a', options: { ignoreHref: true } } ]` | ||
`ignoreImage` | 6.0 | 9.0 | `selectors: [ { selector: 'img', format: 'skip' } ]` | ||
`linkHrefBaseUrl` | 6.0 | 9.0 | `selectors: [`<br/>`{ selector: 'a', options: { baseUrl: 'https://example.com' } },`<br/>`{ selector: 'img', options: { baseUrl: 'https://example.com' } }`<br/>`]` | ||
`noAnchorUrl` | 6.0 | 9.0 | `selectors: [ { selector: 'a', options: { noAnchorUrl: true } } ]` | ||
`noLinkBrackets` | 6.0 | 9.0 | `selectors: [ { selector: 'a', options: { linkBrackets: false } } ]` | ||
`returnDomByDefault` | 8.0 | | `baseElements: { returnDomByDefault: true }` | ||
`singleNewLineParagraphs` | 6.0 | *9.0* | `selectors: [`<br/>`{ selector: 'p', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },`<br/>`{ selector: 'pre', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } }`<br/>`]` | ||
`singleNewLineParagraphs` | 6.0 | 9.0 | `selectors: [`<br/>`{ selector: 'p', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },`<br/>`{ selector: 'pre', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } }`<br/>`]` | ||
`tables` | 8.0 | | `selectors: [ { selector: 'table.class#id', format: 'dataTable' } ]` | ||
`tags` | 8.0 | | See [Selectors](#selectors) section below. | ||
`unorderedListItemPrefix` | 6.0 | *9.0* | `selectors: [ { selector: 'ul', options: { itemPrefix: ' * ' } } ]` | ||
`uppercaseHeadings` | 6.0 | *9.0* | `selectors: [`<br/>`{ selector: 'h1', options: { uppercase: false } },`<br/>`...`<br/>`{ selector: 'table', options: { uppercaseHeaderCells: false } }`<br/>`]` | ||
`unorderedListItemPrefix` | 6.0 | 9.0 | `selectors: [ { selector: 'ul', options: { itemPrefix: ' * ' } } ]` | ||
`uppercaseHeadings` | 6.0 | 9.0 | `selectors: [`<br/>`{ selector: 'h1', options: { uppercase: false } },`<br/>`...`<br/>`{ selector: 'table', options: { uppercaseHeaderCells: false } }`<br/>`]` | ||
Other things deprecated: | ||
Other things removed: | ||
* `fromString` method; | ||
* positional arguments in `BlockTextBuilder` methods (in case you have written some custom formatters for version 6.0). | ||
* `fromString` method - use `convert` or `htmlToText` instead; | ||
* positional arguments in `BlockTextBuilder` methods - pass option objects instead. | ||
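To make the migration table concrete, here is a small sketch of how three of the removed top-level options could be expressed through `selectors`, using the replacements listed in the table. The variable name `migratedOptions` is chosen for this example only.

```javascript
// Version 8 style, using options that are removed in 9.0:
//   { ignoreHref: true, ignoreImage: true, unorderedListItemPrefix: ' * ' }

// Equivalent configuration expressed through `selectors`.
const migratedOptions = {
  selectors: [
    // was: ignoreHref: true
    { selector: 'a', options: { ignoreHref: true } },
    // was: ignoreImage: true
    { selector: 'img', format: 'skip' },
    // was: unorderedListItemPrefix: ' * '
    { selector: 'ul', options: { itemPrefix: ' * ' } },
  ],
};
```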
@@ -207,4 +212,13 @@ #### Selectors | ||
* `dataTable` - for visually-accurate tables. Note that this might not be search-friendly (output text will look like gibberish to a machine when any cell contents are wrapped) and is also better avoided for tables used as a page layout tool; | ||
* `skip` - as the name implies, it skips the given tag with its contents without printing anything. | ||
Format | Description | ||
---------------- | ----------- | ||
`dataTable` | For visually-accurate tables. Note that this might not be search-friendly (output text will look like gibberish to a machine when any cell contents are wrapped) and is also better avoided for tables used as a page layout tool. | ||
`skip` | Skips the given tag with its contents without printing anything. | ||
`blockString` | Insert a block with the given string literal (`formatOptions.string`) instead of the tag. | ||
`blockTag` | Render an element as an HTML block tag, convert its contents to text. | ||
`blockHtml` | Render an element with all its children as an HTML block. | ||
`inlineString` | Insert the given string literal (`formatOptions.string`) inline instead of the tag. | ||
`inlineSurround` | Render an inline element wrapped with the given strings (`formatOptions.prefix` and `formatOptions.suffix`). | ||
`inlineTag` | Render an element as an inline HTML tag, convert its contents to text. | ||
`inlineHtml` | Render an element with all its children as inline HTML. | ||
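The formats in the table above are assigned per selector. The following sketch assumes that the `options` key of a selector entry is what the formatter receives as `formatOptions`, as in the migration examples; the chosen tags and strings are arbitrary.

```javascript
// Selector entries that use some of the generic formats from the table.
const formatterOptions = {
  selectors: [
    // Wrap bold text in asterisks: <b>text</b> is rendered as *text*.
    { selector: 'b', format: 'inlineSurround', options: { prefix: '*', suffix: '*' } },
    // Replace horizontal rules with a fixed block-level string.
    { selector: 'hr', format: 'blockString', options: { string: '----' } },
    // Keep <sup> as an inline HTML tag, converting only its contents to text.
    { selector: 'sup', format: 'inlineTag' },
  ],
};
```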
@@ -219,4 +233,5 @@ ##### Format options | ||
`trailingLineBreaks` | `1` or `2` | all block-level formatters | Number of line breaks to separate this block from the next one.<br/>Note that N+1 line breaks are needed to make N empty lines. | ||
`baseUrl` | `null` | `anchor`, `image` | Server host for link `href` attributes and image `src` attributes relative to the root (the ones that start with `/`).<br/>For example, with `baseUrl = 'http://asdf.com'` and `<a href='/dir/subdir'>...</a>` the link in the text will be `http://asdf.com/dir/subdir`.<br/>Keep in mind that `baseUrl` should not end with a `/`. | ||
`baseUrl` | `null` | `anchor`, `image` | Server host for link `href` attributes and image `src` attributes relative to the root (the ones that start with `/`).<br/>For example, with `baseUrl = 'http://asdf.com'` and `<a href='/dir/subdir'>...</a>` the link in the text will be `http://asdf.com/dir/subdir`. | ||
`linkBrackets` | `['[', ']']` | `anchor`, `image` | Surround links with these brackets.<br/>Set to `false` or `['', '']` to disable. | ||
`pathRewrite` | `undefined` | `anchor`, `image` | A function to rewrite link `href` attributes and image `src` attributes. Optional second argument is the metadata object.<br/>Applied before `baseUrl`. | ||
`hideLinkHrefIfSameAsText` | `false` | `anchor` | By default links are translated in the following way:<br/>`<a href='link'>text</a>` => becomes => `text [link]`.<br/>If this option is set to `true` and `link` and `text` are the same, `[link]` will be omitted and only `text` will be present. | ||
@@ -242,4 +257,2 @@ `ignoreHref` | `false` | `anchor` | Ignore all links. Only process internal text of anchor tags. | ||
This is significantly changed in version 6. | ||
`formatters` option is an object that holds formatting functions. They can be assigned to format different elements in the `selectors` array. | ||
@@ -282,2 +295,4 @@ | ||
New in version 9: metadata object can be provided as the last optional argument of the `convert` function (or the function returned by `compile` function). It can be accessed by formatters as `builder.metadata`. | ||
Refer to [built-in formatters](https://github.com/html-to-text/node-html-to-text/blob/master/lib/formatter.js) for more examples. The easiest way to write your own is to pick an existing one and customize. | ||
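The metadata object is also passed to the `pathRewrite` format option as its optional second argument. The sketch below shows one possible rewrite rule; the `/redirect/` prefix and the `locale` metadata field are hypothetical and exist only for this example.

```javascript
// Hypothetical rewrite rule: strip a '/redirect/' prefix and, when the
// metadata carries a locale, prepend it to the path.
function pathRewrite(path, meta) {
  const cleaned = path.replace(/^\/redirect\//, '/');
  const locale = meta && meta.locale;
  return locale ? `/${locale}${cleaned}` : cleaned;
}

// pathRewrite is applied before baseUrl, so the rewritten path is what
// gets the host prepended.
const linkOptions = {
  selectors: [
    { selector: 'a', options: { pathRewrite, baseUrl: 'https://example.com' } },
  ],
};

// With the package installed, per-document metadata would be supplied as:
//   convert(html, linkOptions, { locale: 'en' });
```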
@@ -287,22 +302,2 @@ | ||
Note: `BlockTextBuilder` got some important [changes](https://github.com/html-to-text/node-html-to-text/commit/f50f10f54cf814efb2f7633d9d377ba7eadeaf1e) in version 7. Positional arguments are deprecated, and formatters written for version 6 have to be updated accordingly in order to keep working after the next major update. | ||
## Command Line Interface | ||
It is possible to use html-to-text as a command line interface. This allows easy validation of your generated text and integration with other systems that do not run on node.js. | ||
`html-to-text` reads input from `stdin` and writes output to `stdout`, so you can use it the following way: | ||
``` | ||
cat example/test.html | html-to-text > test.txt | ||
``` | ||
All options described above are also available. You can use them like this: | ||
``` | ||
cat example/test.html | html-to-text --tables=#invoice,.address --wordwrap=100 > test.txt | ||
``` | ||
The `tables` option has to be declared as a comma-separated list without whitespace. | ||
## Example | ||
@@ -309,0 +304,0 @@ |
+ Added @selderee/plugin-htmlparser2@0.9.0 (transitive)
+ Added dom-serializer@2.0.0 (transitive)
+ Added domhandler@5.0.3 (transitive)
+ Added domutils@3.1.0 (transitive)
+ Added entities@4.5.0 (transitive)
+ Added htmlparser2@8.0.2 (transitive)
+ Added leac@0.5.1 (transitive)
+ Added parseley@0.10.0 (transitive)
+ Added peberminta@0.6.0 (transitive)
+ Added selderee@0.9.0 (transitive)
- Removed he@^1.2.0
- Removed minimist@^1.2.6
- Removed @selderee/plugin-htmlparser2@0.6.0 (transitive)
- Removed commander@2.20.3 (transitive)
- Removed discontinuous-range@1.0.0 (transitive)
- Removed dom-serializer@1.4.1 (transitive)
- Removed domutils@2.8.0 (transitive)
- Removed entities@2.2.0 (transitive)
- Removed he@1.2.0 (transitive)
- Removed htmlparser2@6.1.0 (transitive)
- Removed minimist@1.2.8 (transitive)
- Removed moo@0.5.2 (transitive)
- Removed nearley@2.20.1 (transitive)
- Removed parseley@0.7.0 (transitive)
- Removed railroad-diagrams@1.0.0 (transitive)
- Removed randexp@0.4.6 (transitive)
- Removed ret@0.1.15 (transitive)
- Removed selderee@0.6.0 (transitive)
Updated htmlparser2@^8.0.1
Updated selderee@^0.9.0