justext
jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It was originally inspired by the Python version at https://github.com/miso-belica/jusText .
Usage
Basic Usage
const jusText = require("jusText");
const defaultOutput = jusText.rawHtml(htmlDoc);
console.log(defaultOutput);
Specific Usage
const defaultOptions = {
  lengthLow: 70,
  lengthHigh: 200,
  stopwordsLow: 0.3,
  stopwordsHigh: 0.32,
  maxLinkDensity: 0.2,
  maxHeadingDistance: 200,
  noHeadings: false,
};
const output = jusText.rawHtml(
  htmlDoc,
  "english",
  "unformatted",
  defaultOptions
);
console.log(output);
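When only one or two thresholds need to change, it can help to merge overrides into the defaults before calling rawHtml. The following is a minimal sketch, not part of jusText: buildOptions and DEFAULT_OPTIONS are hypothetical names, and the sketch assumes rawHtml expects a complete options object with the keys listed above.

```javascript
// Hypothetical helper (not part of jusText): merge overrides into the defaults.
// The values mirror the defaultOptions object shown in this README.
const DEFAULT_OPTIONS = Object.freeze({
  lengthLow: 70,
  lengthHigh: 200,
  stopwordsLow: 0.3,
  stopwordsHigh: 0.32,
  maxLinkDensity: 0.2,
  maxHeadingDistance: 200,
  noHeadings: false,
});

function buildOptions(overrides = {}) {
  // Reject misspelled keys early instead of silently ignoring them.
  const unknown = Object.keys(overrides).filter(
    (key) => !(key in DEFAULT_OPTIONS)
  );
  if (unknown.length > 0) {
    throw new Error(`Unknown option(s): ${unknown.join(", ")}`);
  }
  // Later spreads win, so overrides replace the matching defaults.
  return { ...DEFAULT_OPTIONS, ...overrides };
}

// Usage (assumes jusText and htmlDoc are defined as in the examples above):
// const output = jusText.rawHtml(
//   htmlDoc,
//   "english",
//   "unformatted",
//   buildOptions({ noHeadings: true })
// );
```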
Pulling out only long text
const output = jusText.rawHtml(htmlDoc, "english", "unformatted");
const paragraphs = output
  .filter(
    (paragraph) =>
      paragraph.cfClass !== "short" && paragraph.classType === "good"
  )
  .map((paragraph) => paragraph.text());
console.log(paragraphs);
Helpers
const jusText = require("jusText");
const languages = jusText.getLanguages();
const stoplist = jusText.getStoplist("english");
Language Detection
For language detection, you can use @smodin/fast-text-language-detection
for best results on Node, or a smaller alternative such as languagedetect
in the browser.
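A detector may report a language that has no bundled stoplist, so it is worth mapping its result onto a supported language before calling rawHtml. The sketch below is an illustration under assumptions: pickLanguage is a hypothetical helper, the candidates are assumed to be [name, score] pairs ordered best first (the shape languagedetect's detect() is understood to return), and the supported list is assumed to be an array of names such as the one from jusText.getLanguages().

```javascript
// Hypothetical helper (not part of jusText): choose a stoplist language.
// candidates: array of [languageName, score] pairs, best match first (assumed).
// supported: array of language names, e.g. from jusText.getLanguages() (assumed).
function pickLanguage(candidates, supported, fallback = "english") {
  // Index supported names by lowercase so matching ignores case,
  // while the returned value keeps the spelling from the supported list.
  const byLowerName = new Map(
    supported.map((name) => [name.toLowerCase(), name])
  );
  for (const [name] of candidates) {
    const match = byLowerName.get(String(name).toLowerCase());
    if (match !== undefined) {
      return match;
    }
  }
  // No detected language has a stoplist, so use the fallback.
  return fallback;
}

const exampleSupported = ["english", "german", "french"];
console.log(pickLanguage([["klingon", 0.6], ["german", 0.4]], exampleSupported)); // german
console.log(pickLanguage([], exampleSupported)); // english
```

The returned name can then be passed as the second argument to jusText.rawHtml in place of the hard-coded "english".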
TODO
Python source updates / functionality to be included
Languages
Bugs
- Short paragraphs are included when they shouldn't be. The short-text logic needs to be updated to match the Python source.
Other Features
- Version without stopwords bundled together
History
- Version 0.0.1 - Convert from Python code
- Version 0.0.2 - Add logger lib
- Version 0.0.3 - Migrate to rollup
- Version 0.1.0 - Minor bug fix, added unformatted format option, refactor