@mozilla/readability
Comparing version 0.4.1 to 0.4.2
@@ -14,2 +14,12 @@ # Changelog | ||
## [Unreleased] | ||
- Fix [compatibility with DOM implementations where the `childNodes` property is not live](https://github.com/mozilla/readability/pull/694) ([x2](https://github.com/mozilla/readability/pull/677)). | ||
- Lazily-loaded image references [will no longer use the `alt` attribute](https://github.com/mozilla/readability/pull/689) to find images. | ||
- `parse()` [provides the root element's `lang` attribute](https://github.com/mozilla/readability/pull/721). | ||
- `isProbablyReaderable` [includes article tags](https://github.com/mozilla/readability/pull/724). | ||
- Improvements to JSON-LD support | ||
- [Continue parsing other JSON-LD elements until we find one we can support](https://github.com/mozilla/readability/pull/713) | ||
- [Prefer using headline for article title](https://github.com/mozilla/readability/pull/713) | ||
## [0.4.1] - 2021-01-13 | ||
@@ -25,3 +35,3 @@ | ||
- `isProbablyReaderable` can now take an optional options object to configure it, | ||
- `isProbablyReaderable` [can now take an optional options object](https://github.com/mozilla/readability/pull/634) to configure it, | ||
allowing you to specify the minimum content length, minimum score, and how to | ||
@@ -28,0 +38,0 @@ check if nodes are visible. |
{ | ||
"name": "@mozilla/readability", | ||
"version": "0.4.1", | ||
"version": "0.4.2", | ||
"description": "A standalone version of the readability library used for Firefox Reader View.", | ||
@@ -29,10 +29,10 @@ "main": "index.js", | ||
"chai": "^2.1.*", | ||
"eslint": ">=4.2", | ||
"eslint": "^7.26.0", | ||
"htmltidy2": "^0.3.0", | ||
"js-beautify": "^1.13.0", | ||
"js-beautify": "^1.13.13", | ||
"jsdom": "^13.1", | ||
"mocha": "^8.2.0", | ||
"release-it": "^14.2.2", | ||
"mocha": "^8.4.0", | ||
"release-it": "^14.6.2", | ||
"sinon": "^7.3.2" | ||
} | ||
} |
@@ -56,3 +56,3 @@ /* eslint-env es6:false */ | ||
var nodes = doc.querySelectorAll("p, pre"); | ||
var nodes = doc.querySelectorAll("p, pre, article"); | ||
@@ -59,0 +59,0 @@ // Get <div> nodes which have <br> node(s) and append them into the `nodes` variable. |
README.md
@@ -5,6 +5,16 @@ # Readability.js | ||
## Usage on the web | ||
## Installation | ||
To parse a document, you must create a new `Readability` object from a DOM document object, and then call `parse()`. Here's an example: | ||
Readability is available on npm: | ||
```bash | ||
npm install @mozilla/readability | ||
``` | ||
You can then `require()` it, or for web-based projects, load the `Readability.js` script from your webpage. | ||
## Basic usage | ||
To parse a document, you must create a new `Readability` object from a DOM document object, and then call the [`parse()`](#parse) method. Here's an example: | ||
```javascript | ||
@@ -14,48 +24,70 @@ var article = new Readability(document).parse(); | ||
This `article` object will contain the following properties: | ||
If you use Readability in a web browser, you will likely be able to use a `document` reference from elsewhere (e.g. fetched via XMLHttpRequest, in a same-origin `<iframe>` you have access to, etc.). In Node.js, you can [use an external DOM library](#nodejs-usage). | ||
* `title`: article title | ||
* `content`: HTML string of processed article content | ||
* `textContent`: text content of the article (all HTML removed) | ||
* `length`: length of an article, in characters | ||
* `excerpt`: article description, or short excerpt from the content | ||
* `byline`: author metadata | ||
* `dir`: content direction | ||
## API Reference | ||
If you're using Readability on the web, you will likely be able to use a `document` reference | ||
from elsewhere (e.g. fetched via XMLHttpRequest, in a same-origin `<iframe>` you have access to, etc.). | ||
### `new Readability(document, options)` | ||
### Optional | ||
The `options` object accepts a number of properties, all optional: | ||
Readability's `parse()` works by modifying the DOM. This removes some elements in the web page. | ||
You could avoid this by passing the clone of the `document` object while creating a `Readability` object. | ||
* `debug` (boolean, default `false`): whether to enable logging. | ||
* `maxElemsToParse` (number, default `0` i.e. no limit): the maximum number of elements to parse. | ||
* `nbTopCandidates` (number, default `5`): the number of top candidates to consider when analysing how tight the competition is among candidates. | ||
* `charThreshold` (number, default `500`): the number of characters an article must have in order to return a result. | ||
* `classesToPreserve` (array): a set of classes to preserve on HTML elements when the `keepClasses` option is set to `false`. | ||
* `keepClasses` (boolean, default `false`): whether to preserve all classes on HTML elements. When set to `false`, only classes specified in the `classesToPreserve` array are kept. | ||
* `disableJSONLD` (boolean, default `false`): when extracting page metadata, Readability gives precedence to Schema.org fields specified in the JSON-LD format. Set this option to `true` to skip JSON-LD parsing. | ||
* `serializer` (function, default `el => el.innerHTML`): controls how the `content` property returned by the `parse()` method is produced from the root DOM element. It may be useful to specify the `serializer` as the identity function (`el => el`) to obtain a DOM element instead of a string for `content` if you plan to process it further. | ||
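As a rough sketch of how the options listed above might be combined, the hypothetical helper below receives the `Readability` constructor as an argument, so the snippet makes no assumption about how the library was loaded. The name `parseWithOptions` and the option values are illustrative assumptions, not recommendations from the project.

```js
// Hypothetical helper: `ReadabilityCtor` is expected to be the `Readability`
// class, however it was loaded (require() or a <script> tag).
function parseWithOptions(ReadabilityCtor, doc) {
  var options = {
    charThreshold: 200,             // accept shorter articles than the default of 500
    keepClasses: false,             // strip classes from the output...
    classesToPreserve: ["caption"], // ...except the ones listed here
    serializer: function (el) {     // identity serializer: `content` stays a DOM element
      return el;
    }
  };
  // Parse a deep clone so the original document is not modified.
  return new ReadabilityCtor(doc.cloneNode(true), options).parse();
}
```

With the identity serializer, `content` on the returned object would be a DOM element rather than an HTML string, which can be convenient for further processing.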
``` | ||
var documentClone = document.cloneNode(true); | ||
### `parse()` | ||
Returns an object containing the following properties: | ||
* `title`: article title; | ||
* `content`: HTML string of processed article content; | ||
* `textContent`: text content of the article, with all the HTML tags removed; | ||
* `length`: length of an article, in characters; | ||
* `excerpt`: article description, or short excerpt from the content; | ||
* `byline`: author metadata; | ||
* `dir`: content direction; | ||
* `siteName`: name of the site; | ||
* `lang`: content language. | ||
The `parse()` method works by modifying the DOM. This removes some elements in the web page, which may be undesirable. You can avoid this by passing the clone of the `document` object to the `Readability` constructor: | ||
```js | ||
var documentClone = document.cloneNode(true); | ||
var article = new Readability(documentClone).parse(); | ||
``` | ||
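To illustrate how the properties returned by `parse()` might be consumed, here is a minimal sketch. The function name `describeArticle` is hypothetical, and the fallbacks assume that the result, or individual fields on it, may be missing.

```js
// Hypothetical helper: builds a one-line summary from a parse() result.
function describeArticle(article) {
  if (!article) {
    return "No article content found";
  }
  var parts = [article.title || "(untitled)"];
  if (article.byline) parts.push("by " + article.byline);
  if (article.siteName) parts.push("on " + article.siteName);
  if (article.lang) parts.push("[" + article.lang + "]");
  parts.push("(" + article.length + " characters)");
  return parts.join(" ");
}
```

For a result with title `Cats`, byline `A. Author`, site name `Example`, language `en` and length `1200`, this yields `Cats by A. Author on Example [en] (1200 characters)`.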
## Usage from Node.js | ||
### `isProbablyReaderable(document, options)` | ||
Readability is available on npm: | ||
A quick-and-dirty way of figuring out if it's plausible that the contents of a given document are suitable for processing with Readability. It is likely to produce both false positives and false negatives. The reason it exists is to avoid bogging down a time-sensitive process (like loading and showing the user a webpage) with the complex logic in the core of Readability. Improvements to its logic (while not deteriorating its performance) are very welcome. | ||
```bash | ||
npm install @mozilla/readability | ||
The `options` object accepts a number of properties, all optional: | ||
* `minContentLength` (number, default `140`): the minimum node content length used to decide if the document is readerable; | ||
* `minScore` (number, default `20`): the minimum cumulated 'score' used to determine if the document is readerable; | ||
* `visibilityChecker` (function, default `isNodeVisible`): the function used to determine if a node is visible. | ||
The function returns a boolean corresponding to whether or not we suspect `Readability.parse()` will succeed at returning an article object. Here's an example: | ||
```js | ||
/* | ||
Only instantiate Readability if we suspect | ||
the `parse()` method will produce a meaningful result. | ||
*/ | ||
if (isProbablyReaderable(document)) { | ||
let article = new Readability(document).parse(); | ||
} | ||
``` | ||
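The options described above can be passed as a second argument. The sketch below is an assumption-laden illustration: `isNodeShown` and `parseIfReaderable` are hypothetical names, the thresholds are arbitrary, and both library entry points are passed in as arguments so the snippet does not depend on how they were loaded.

```js
// Hypothetical custom visibility check: a node counts as hidden when it has
// inline `display: none` or a `hidden` attribute.
function isNodeShown(node) {
  var displayNone = Boolean(node.style) && node.style.display === "none";
  return !displayNone && !node.hasAttribute("hidden");
}

// `isProbablyReaderableFn` and `ReadabilityCtor` stand in for the library's
// `isProbablyReaderable` function and `Readability` class.
function parseIfReaderable(isProbablyReaderableFn, ReadabilityCtor, doc) {
  var options = {
    minContentLength: 200, // higher than the default of 140
    minScore: 30,          // higher than the default of 20
    visibilityChecker: isNodeShown
  };
  if (!isProbablyReaderableFn(doc, options)) {
    return null; // skip the expensive parse entirely
  }
  return new ReadabilityCtor(doc.cloneNode(true)).parse();
}
```

Raising the thresholds trades more false negatives for fewer wasted parses; whether that trade is worthwhile depends on the pages being processed.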
In Node.js, you won't generally have a DOM document object. To obtain one, you can use external | ||
libraries like [jsdom](https://github.com/jsdom/jsdom). While this repository contains a parser of | ||
its own (`JSDOMParser`), that is restricted to reading XML-compatible markup and therefore we do | ||
not recommend it for general use. | ||
## Node.js usage | ||
If you're using `jsdom` to create a DOM object, you should ensure that the page doesn't run (page) | ||
scripts (avoid fetching remote resources etc.) as well as passing it the page's URI as the `url` | ||
property of the `options` object you pass the `JSDOM` constructor. | ||
Since Node.js does not come with its own DOM implementation, we rely on external libraries like [jsdom](https://github.com/jsdom/jsdom). Here's an example using `jsdom` to obtain a DOM document object: | ||
### Example: | ||
```js | ||
var { Readability } = require('@mozilla/readability'); | ||
var JSDOM = require('jsdom').JSDOM; | ||
var doc = new JSDOM("<body>Here's a bunch of text</body>", { | ||
var { JSDOM } = require('jsdom'); | ||
var doc = new JSDOM("<body>Look at this cat: <img src='./cat.jpg'></body>", { | ||
url: "https://www.example.com/the-page-i-got-the-source-from" | ||
@@ -67,22 +99,12 @@ }); | ||
## What's Readability-readerable? | ||
Remember to pass the page's URI as the `url` option in the `JSDOM` constructor (as shown in the example above), so that Readability can convert relative URLs for images, hyperlinks etc. to their absolute counterparts. | ||
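To see why the base URL matters, the sketch below resolves a relative image path against the page URL using the standard `URL` class available in Node.js and browsers. It illustrates general URL resolution only and is not a description of Readability's internals; `toAbsolute` is a hypothetical helper name.

```js
// Resolve a relative reference against the page URL, as a browser would.
function toAbsolute(relative, pageUrl) {
  return new URL(relative, pageUrl).href;
}

var pageUrl = "https://www.example.com/the-page-i-got-the-source-from";
console.log(toAbsolute("./cat.jpg", pageUrl));
// → https://www.example.com/cat.jpg
```

Without a base URL there is nothing to resolve `./cat.jpg` against, which is why the `url` option should always be supplied.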
It's a quick-and-dirty way of figuring out if it's plausible that the contents of a given | ||
document are suitable for processing with Readability. It is likely to produce both false | ||
positives and false negatives. The reason it exists is to avoid bogging down a time-sensitive | ||
process (like loading and showing the user a webpage) with the complex logic in the core of | ||
Readability. Improvements to its logic (while not deteriorating its performance) are very | ||
welcome. | ||
`jsdom` has the ability to run the scripts included in the HTML and fetch remote resources. For security reasons these are [disabled by default](https://github.com/jsdom/jsdom#executing-scripts), and we **strongly** recommend you keep them that way. | ||
## Security | ||
If you're going to use Readability with untrusted input (whether in HTML or DOM form), we | ||
**strongly** recommend you use a sanitizer library like | ||
[DOMPurify](https://github.com/cure53/DOMPurify) to avoid script injection when you use | ||
the output of Readability. We would also recommend using | ||
[CSP](https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP) to add further defense-in-depth | ||
If you're going to use Readability with untrusted input (whether in HTML or DOM form), we **strongly** recommend you use a sanitizer library like [DOMPurify](https://github.com/cure53/DOMPurify) to avoid script injection when you use | ||
the output of Readability. We would also recommend using [CSP](https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP) to add further defense-in-depth | ||
restrictions to what you allow the resulting content to do. The Firefox integration of | ||
reader mode uses both of these techniques itself. Sanitizing unsafe content out of the input | ||
is explicitly not something we aim to do as part of Readability itself - there are other | ||
good sanitizer libraries out there, use them! | ||
reader mode uses both of these techniques itself. Sanitizing unsafe content out of the input is explicitly not something we aim to do as part of Readability itself - there are other good sanitizer libraries out there, use them! | ||
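A minimal sketch of the sanitizing step described above, assuming `purifier` is a DOMPurify instance exposing `sanitize()` and that `content` is an HTML string (the default serializer). The helper name `sanitizeArticle` is hypothetical.

```js
// Returns a copy of the parse() result whose `content` has been sanitized.
// `purifier` is assumed to be a DOMPurify instance (in Node.js, one is usually
// created by passing a jsdom window to the DOMPurify factory).
function sanitizeArticle(article, purifier) {
  if (!article) {
    return null;
  }
  return Object.assign({}, article, {
    content: purifier.sanitize(article.content)
  });
}
```

Sanitizing the output just before it is inserted into a page, rather than the input, keeps the check close to the point where untrusted markup could do harm.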
@@ -89,0 +111,0 @@ ## Contributing |
Sorry, the diff of this file is too big to display
New author
Supply chain risk: A new npm collaborator published a version of the package for the first time. New collaborators are usually benign additions to a project, but do indicate a change to the security surface area of a package.
Found 1 instance in 1 package