hast-util-to-nlcst
Transform HAST to NLCST.
Note: You probably want to use rehype-retext.
Installation
npm:
npm install hast-util-to-nlcst
Usage
Say we have the following example.html:
<article>
  Implicit.
  <h1>Explicit: <strong>foo</strong>s-ball</h1>
  <pre><code class="language-foo">bar()</code></pre>
</article>
...and next to it, index.js:
var rehype = require('rehype')
var vfile = require('to-vfile')
var English = require('parse-english')
var inspect = require('unist-util-inspect')
var toNLCST = require('hast-util-to-nlcst')

// Read the HTML into a virtual file and parse it into a HAST tree.
var file = vfile.readSync('example.html')
var tree = rehype().parse(file)

// Transform HAST to NLCST with the English parser and print the result.
console.log(inspect(toNLCST(tree, file, English)))
Running index.js yields:
RootNode[2] (1:1-6:1, 0-134)
├─ ParagraphNode[3] (1:10-3:3, 9-24)
│  ├─ WhiteSpaceNode: "\n  " (1:10-2:3, 9-12)
│  ├─ SentenceNode[2] (2:3-2:12, 12-21)
│  │  ├─ WordNode[1] (2:3-2:11, 12-20)
│  │  │  └─ TextNode: "Implicit" (2:3-2:11, 12-20)
│  │  └─ PunctuationNode: "." (2:11-2:12, 20-21)
│  └─ WhiteSpaceNode: "\n  " (2:12-3:3, 21-24)
└─ ParagraphNode[1] (3:7-3:43, 28-64)
   └─ SentenceNode[4] (3:7-3:43, 28-64)
      ├─ WordNode[1] (3:7-3:15, 28-36)
      │  └─ TextNode: "Explicit" (3:7-3:15, 28-36)
      ├─ PunctuationNode: ":" (3:15-3:16, 36-37)
      ├─ WhiteSpaceNode: " " (3:16-3:17, 37-38)
      └─ WordNode[4] (3:25-3:43, 46-64)
         ├─ TextNode: "foo" (3:25-3:28, 46-49)
         ├─ TextNode: "s" (3:37-3:38, 58-59)
         ├─ PunctuationNode: "-" (3:38-3:39, 59-60)
         └─ TextNode: "ball" (3:39-3:43, 60-64)
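A tree like this can be walked recursively to get at the words it contains. The following is a minimal sketch, assuming nodes are plain objects with type, value, and children fields. The tree is written by hand to mirror the second paragraph above (positions omitted), and the helpers nlcstToString and collectWords are hypothetical names for illustration, not part of this package.

```javascript
// Hand-written NLCST-like tree mirroring the second paragraph of the output
// (positional information omitted for brevity).
var paragraph = {
  type: 'ParagraphNode',
  children: [
    {
      type: 'SentenceNode',
      children: [
        {type: 'WordNode', children: [{type: 'TextNode', value: 'Explicit'}]},
        {type: 'PunctuationNode', value: ':'},
        {type: 'WhiteSpaceNode', value: ' '},
        {
          type: 'WordNode',
          children: [
            {type: 'TextNode', value: 'foo'},
            {type: 'TextNode', value: 's'},
            {type: 'PunctuationNode', value: '-'},
            {type: 'TextNode', value: 'ball'}
          ]
        }
      ]
    }
  ]
}

// Hypothetical helper: concatenate the values of all leaf nodes.
function nlcstToString(node) {
  if (typeof node.value === 'string') return node.value
  return (node.children || []).map(nlcstToString).join('')
}

// Hypothetical helper: collect the text of every WordNode in the tree.
function collectWords(node) {
  if (node.type === 'WordNode') return [nlcstToString(node)]
  return (node.children || []).reduce(function (words, child) {
    return words.concat(collectWords(child))
  }, [])
}

console.log(collectWords(paragraph)) // [ 'Explicit', 'foos-ball' ]
console.log(nlcstToString(paragraph)) // Explicit: foos-ball
```

Note that the text of the <strong> element and the text following it end up in a single WordNode, so "foos-ball" is reported as one word.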
API
toNLCST(node, file, Parser)
Transform a HAST syntax tree and corresponding virtual file
into an NLCST tree.
Parameters
node
Syntax tree with positional information (HASTNode).
file
Virtual file (VFile).
Parser
Constructor of an NLCST parser, such as parse-english, parse-dutch, or parse-latin (Function).
Returns
NLCSTNode.
Notes
Implied paragraphs
The algorithm supports implicit and explicit paragraphs, such as:
<article>
  An implicit sentence.
  <h1>An explicit sentence.</h1>
</article>
Overlapping paragraphs are also supported (see the tests or the HTML spec
for more info).
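To illustrate the idea, here is a rough sketch of how a container's children could be split into implicit and explicit paragraphs. It is not the algorithm this package uses: the EXPLICIT list is an assumed, incomplete set of element names, the helpers textContent and paragraphs are hypothetical, and overlapping paragraphs are not handled.

```javascript
// Assumed, incomplete list of elements that form their own (explicit) paragraph.
var EXPLICIT = ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li', 'blockquote']

// Hypothetical helper: concatenate all text inside a HAST-like node.
function textContent(node) {
  if (node.type === 'text') return node.value
  return (node.children || []).map(textContent).join('')
}

// Hypothetical helper: split the children of `container` into paragraphs.
// Runs of loose content become implicit paragraphs; elements named in
// EXPLICIT become explicit ones.
function paragraphs(container) {
  var result = []
  var run = ''

  // Close the current run of loose content, skipping whitespace-only runs.
  function flush() {
    var text = run.trim()
    if (text) result.push({kind: 'implicit', text: text})
    run = ''
  }

  container.children.forEach(function (child) {
    if (child.type === 'element' && EXPLICIT.indexOf(child.tagName) !== -1) {
      flush()
      result.push({kind: 'explicit', text: textContent(child)})
    } else {
      run += textContent(child)
    }
  })

  flush()
  return result
}

// Hand-written HAST-like tree for the <article> example above.
var article = {
  type: 'element',
  tagName: 'article',
  properties: {},
  children: [
    {type: 'text', value: '\n  An implicit sentence.\n  '},
    {
      type: 'element',
      tagName: 'h1',
      properties: {},
      children: [{type: 'text', value: 'An explicit sentence.'}]
    },
    {type: 'text', value: '\n'}
  ]
}

console.log(paragraphs(article))
```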
Ignored nodes
Some elements are ignored and their content will not be present in NLCST: <script>, <style>, <svg>, <math>, <del>.
To ignore other elements, add a data-nlcst attribute with a value of ignore:
<p>This is <span data-nlcst="ignore">hidden</span>.</p>
<p data-nlcst="ignore">Completely hidden.</p>
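The effect of these rules can be sketched on a hand-written HAST-like tree. This is an illustration only, assuming the data-nlcst attribute is exposed as properties.dataNlcst on element nodes; isIgnored and visibleText are hypothetical helpers, not functions exported by this package.

```javascript
// Element names listed above as ignored by default.
var IGNORED = ['script', 'style', 'svg', 'math', 'del']

// Hypothetical helper: decide whether a HAST-like element would be skipped.
// Assumes `data-nlcst` is available as `properties.dataNlcst`.
function isIgnored(node) {
  if (node.type !== 'element') return false
  var props = node.properties || {}
  return IGNORED.indexOf(node.tagName) !== -1 || props.dataNlcst === 'ignore'
}

// Hypothetical helper: collect the text that is not inside an ignored element.
function visibleText(node) {
  if (isIgnored(node)) return ''
  if (node.type === 'text') return node.value
  return (node.children || []).map(visibleText).join('')
}

// Tree for: <p>This is <span data-nlcst="ignore">hidden</span>.</p>
var paragraph = {
  type: 'element',
  tagName: 'p',
  properties: {},
  children: [
    {type: 'text', value: 'This is '},
    {
      type: 'element',
      tagName: 'span',
      properties: {dataNlcst: 'ignore'},
      children: [{type: 'text', value: 'hidden'}]
    },
    {type: 'text', value: '.'}
  ]
}

console.log(visibleText(paragraph)) // This is .
```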
Source nodes
<code> elements are mapped to Source nodes in NLCST.
To mark other elements as source, add a data-nlcst attribute with a value of source:
<p>This is <span data-nlcst="source">marked as source</span>.</p>
<p data-nlcst="source">Completely marked.</p>
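As a rough illustration of this mapping, the sketch below flattens inline content into NLCST-like nodes and turns each source element into a single opaque SourceNode. It assumes data-nlcst is exposed as properties.dataNlcst; isSource, textContent, and toInline are hypothetical helpers, and ordinary text is kept as one TextNode here, whereas a real parser would tokenize it into words, white space, and punctuation.

```javascript
// Hypothetical helper: decide whether a HAST-like element counts as source.
// Assumes `data-nlcst` is available as `properties.dataNlcst`.
function isSource(node) {
  if (node.type !== 'element') return false
  var props = node.properties || {}
  return node.tagName === 'code' || props.dataNlcst === 'source'
}

// Hypothetical helper: concatenate all text inside a HAST-like node.
function textContent(node) {
  if (node.type === 'text') return node.value
  return (node.children || []).map(textContent).join('')
}

// Hypothetical helper: flatten inline content into NLCST-like nodes.
// Source elements become one SourceNode; other text stays a TextNode.
function toInline(node) {
  if (isSource(node)) return [{type: 'SourceNode', value: textContent(node)}]
  if (node.type === 'text') return [{type: 'TextNode', value: node.value}]
  return (node.children || []).reduce(function (all, child) {
    return all.concat(toInline(child))
  }, [])
}

// Tree for: <p>This is <span data-nlcst="source">marked as source</span>.</p>
var paragraph = {
  type: 'element',
  tagName: 'p',
  properties: {},
  children: [
    {type: 'text', value: 'This is '},
    {
      type: 'element',
      tagName: 'span',
      properties: {dataNlcst: 'source'},
      children: [{type: 'text', value: 'marked as source'}]
    },
    {type: 'text', value: '.'}
  ]
}

console.log(toInline(paragraph))
```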
Contribute
See contributing.md in syntax-tree/hast for ways to get started.
This organisation has a Code of Conduct. By interacting with this
repository, organisation, or community you agree to abide by its terms.
License
MIT © Titus Wormer