
de·fud·dle /diˈfʌdl/ transitive verb
to remove unnecessary elements from a web page, and make it easily readable.
Beware! Defuddle is very much a work in progress!
Defuddle extracts the main content from web pages. It cleans up web pages by removing clutter like comments, sidebars, headers, footers, and other non-essential elements, leaving only the primary content.
Defuddle aims to output clean and consistent HTML documents. It was written for Obsidian Web Clipper with the goal of creating a more useful input for HTML-to-Markdown converters like Turndown.
Defuddle can be used as a replacement for Mozilla Readability with a few differences:
npm install defuddle
For Node.js usage, you'll also need to install JSDOM:
npm install jsdom
import { Defuddle } from 'defuddle';
// Parse the current document
const defuddle = new Defuddle(document);
const result = defuddle.parse();
// Access the content and metadata
console.log(result.content);
console.log(result.title);
console.log(result.author);
import { JSDOM } from 'jsdom';
import { Defuddle } from 'defuddle/node';
// Parse HTML from a string
const html = '<html><body><article>...</article></body></html>';
const stringResult = await Defuddle(html);
// Parse HTML from a URL
const dom = await JSDOM.fromURL('https://example.com/article');
const urlResult = await Defuddle(dom);
// With options
const result = await Defuddle(dom, {
  debug: true, // Enable debug mode for verbose logging
  markdown: true, // Convert content to markdown
  url: 'https://example.com/article' // Original URL of the page
});
// Access the content and metadata
console.log(result.content);
console.log(result.title);
console.log(result.author);
Note: for defuddle/node to import properly, the module format in your package.json has to be set to { "type": "module" }.
Defuddle returns an object with the following properties:
Property | Type | Description |
---|---|---|
content | string | Cleaned up string of the extracted content |
title | string | Title of the article |
description | string | Description or summary of the article |
domain | string | Domain name of the website |
favicon | string | URL of the website's favicon |
image | string | URL of the article's main image |
parseTime | number | Time taken to parse the page in milliseconds |
published | string | Publication date of the article |
author | string | Author of the article |
site | string | Name of the website |
schemaOrgData | object | Raw schema.org data extracted from the page |
wordCount | number | Total number of words in the extracted content |
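As a rough sketch of how the properties above might be consumed, the helper below turns a result object into a Markdown note with a front matter header. The function name `toMarkdownNote` and the choice of fields are assumptions made for illustration. They are not part of Defuddle, and the sample object is hand-written rather than real parser output.

```javascript
// Hypothetical helper (not part of Defuddle): build a note with front matter
// from an object shaped like the result table above.
function toMarkdownNote(result) {
  const fields = ['title', 'author', 'published', 'site', 'domain', 'description'];
  const lines = ['---'];
  for (const key of fields) {
    const value = result[key];
    // Skip missing or empty metadata instead of emitting blank keys.
    if (typeof value === 'string' && value.trim() !== '') {
      // JSON.stringify yields a double-quoted, escaped scalar.
      lines.push(`${key}: ${JSON.stringify(value)}`);
    }
  }
  if (typeof result.wordCount === 'number') {
    lines.push(`wordCount: ${result.wordCount}`);
  }
  lines.push('---', '', result.content ?? '');
  return lines.join('\n');
}

// Hand-written sample; a real result would come from Defuddle.
const sample = {
  title: 'Hello',
  author: '',
  site: 'Example',
  wordCount: 2,
  content: 'Hello world',
};
console.log(toMarkdownNote(sample));
```

Empty strings are skipped here on the assumption that a property may be empty when the page does not provide it.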
Defuddle is available in three different bundles:
- Core bundle (defuddle): The main bundle for browser usage. No dependencies.
- Full bundle (defuddle/full): Includes additional features for math equation parsing.
- Node.js bundle (defuddle/node): Optimized for Node.js environments using JSDOM. Includes full capabilities for math and Markdown conversion.

The core bundle is recommended for most use cases. It still handles math content, but doesn't include fallbacks for converting between MathML and LaTeX formats. The full bundle adds the ability to create reliable <math> elements using the mathml-to-latex and temml libraries.
Option | Type | Description |
---|---|---|
debug | boolean | Enable debug logging |
url | string | URL of the page being parsed |
markdown | boolean | Convert content to Markdown |
separateMarkdown | boolean | Keep content as HTML and return contentMarkdown as Markdown |
removeExactSelectors | boolean | Whether to remove elements matching exact selectors like ads, social buttons, etc. Defaults to true. |
removePartialSelectors | boolean | Whether to remove elements matching partial selectors like ads, social buttons, etc. Defaults to true. |
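The sketch below shows one way the options above could be assembled and how the Markdown text might then be read from a result. Both helpers (`buildOptions`, `getMarkdown`) and the default values they set are assumptions for illustration only. The keys are the ones listed in the table, and the result object is hand-written.

```javascript
// Hypothetical helper: assemble an options object using only documented keys.
// The defaults chosen here belong to this sketch, not to Defuddle.
function buildOptions(url, overrides = {}) {
  const defaults = {
    debug: false,
    markdown: false,
    separateMarkdown: true, // keep HTML in `content`, add `contentMarkdown`
  };
  return { ...defaults, ...overrides, url };
}

// Hypothetical helper: pick the Markdown text out of a result.
function getMarkdown(result, options) {
  if (typeof result.contentMarkdown === 'string') return result.contentMarkdown;
  // With `markdown: true`, `content` itself is expected to be Markdown.
  return options.markdown ? result.content : null;
}

const options = buildOptions('https://example.com/article');
const fakeResult = { content: '<p>Hi</p>', contentMarkdown: 'Hi' };
console.log(getMarkdown(fakeResult, options)); // Hi
```

Returning `null` when neither Markdown option is set makes it harder to treat HTML as Markdown by accident.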
You can enable debug mode by passing an options object when creating a new Defuddle instance:
const article = new Defuddle(document, { debug: true }).parse();
Defuddle attempts to standardize HTML elements to provide a consistent input for subsequent manipulation such as conversion to Markdown.
Code blocks are standardized. If present, line numbers and syntax highlighting are removed, but the language is retained and added as a data attribute and class.
<pre>
<code data-lang="js" class="language-js">
// code
</code>
</pre>
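Because the language is exposed as a `data-lang` attribute, later processing can read it without knowing how the original site marked up its code. The following is a minimal sketch, assuming the standardized HTML is available as a string. `listCodeLanguages` is a hypothetical name, and the regular expression is only for illustration (with a DOM available, a selector such as `pre > code[data-lang]` would be more robust).

```javascript
// Hypothetical helper: collect the languages of standardized code blocks.
function listCodeLanguages(html) {
  // Matches <code ... data-lang="..."> and captures the attribute value.
  const pattern = /<code\b[^>]*\bdata-lang="([^"]+)"/g;
  const langs = [];
  for (const match of html.matchAll(pattern)) {
    langs.push(match[1]);
  }
  return langs;
}

const codeHtml = '<pre><code data-lang="js" class="language-js">// code</code></pre>';
console.log(listCodeLanguages(codeHtml)); // [ 'js' ]
```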
Inline references and footnotes are converted to a standard format:
Inline reference<sup id="fnref:1"><a href="#fn:1">1</a></sup>.
<div id="footnotes">
<ol>
<li class="footnote" id="fn:1">
<p>
Footnote content. <a href="#fnref:1" class="footnote-backref">↩</a>
</p>
</li>
</ol>
</div>
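With a fixed `fnref:` / `fn:` id scheme, a simple consistency check becomes possible. The sketch below is an illustration rather than anything Defuddle provides: it assumes the standardized HTML is a string and reports inline references that point to a footnote id that is not defined. `findDanglingFootnotes` is a hypothetical name.

```javascript
// Hypothetical helper: list footnote ids that are referenced but not defined.
function findDanglingFootnotes(html) {
  // Inline references link to "#fn:<id>".
  const refs = [...html.matchAll(/href="#fn:([^"]+)"/g)].map((m) => m[1]);
  // Footnote list items carry id="fn:<id>".
  const defs = new Set([...html.matchAll(/id="fn:([^"]+)"/g)].map((m) => m[1]));
  return refs.filter((id) => !defs.has(id));
}

const footnoteHtml =
  '<sup id="fnref:1"><a href="#fn:1">1</a></sup>' +
  '<sup id="fnref:2"><a href="#fn:2">2</a></sup>' +
  '<div id="footnotes"><ol><li class="footnote" id="fn:1"><p>Note.</p></li></ol></div>';
console.log(findDanglingFootnotes(footnoteHtml)); // [ '2' ]
```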
Math elements, including MathJax and KaTeX, are converted to standard MathML:
<math xmlns="http://www.w3.org/1998/Math/MathML" display="inline" data-latex="a \neq 0">
<mi>a</mi>
<mo>≠</mo>
<mn>0</mn>
</math>
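When a `data-latex` attribute is present, as in the example above, the LaTeX source can be read back from it. This is a small sketch under the assumption that the output is handled as a string. `extractLatex` is a hypothetical helper, it does not decode HTML entities, and a DOM query for `math[data-latex]` would be the sturdier choice where a DOM is available.

```javascript
// Hypothetical helper: read the LaTeX source from standardized <math> elements.
function extractLatex(html) {
  // Captures the value of data-latex on each <math> start tag.
  const pattern = /<math\b[^>]*\bdata-latex="([^"]*)"/g;
  return [...html.matchAll(pattern)].map((m) => m[1]);
}

const mathHtml =
  '<math xmlns="http://www.w3.org/1998/Math/MathML" display="inline" data-latex="a \\neq 0">' +
  '<mi>a</mi><mo>≠</mo><mn>0</mn></math>';
console.log(extractLatex(mathHtml)); // [ 'a \\neq 0' ]
```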
To build the package, you'll need Node.js and npm installed. Then run:
# Install dependencies
npm install
# Clean and build
npm run build
FAQs
Extract article content and metadata from web pages.
The npm package defuddle receives a total of 1,495 weekly downloads, which places it in the "popular" category.
We found that defuddle demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.