
@harutakax/html-rag-optimizer
HTML optimization tool for RAG (Retrieval-Augmented Generation) systems
@harutakax/html-rag-optimizer is an HTML optimization library built for RAG (Retrieval-Augmented Generation) systems. It removes unnecessary HTML elements, attributes, and formatting to produce clean, search-optimized content while preserving semantic structure.
npm install @harutakax/html-rag-optimizer
import { optimizeHtml } from '@harutakax/html-rag-optimizer';
const html = `
<div class="container">
<h1 id="title">Welcome</h1>
<p>This is a <strong>sample</strong> paragraph.</p>
<script>console.log('remove me');</script>
<style>.container { margin: 0; }</style>
</div>
`;
// Basic optimization
const optimized = optimizeHtml(html);
console.log(optimized);
// Output: <div><h1>Welcome</h1><p>This is a <strong>sample</strong> paragraph.</p></div>
# Optimize a single file
npx @harutakax/html-rag-optimizer input.html -o output.html
# Optimize an entire directory
npx @harutakax/html-rag-optimizer --input-dir ./docs --output-dir ./optimized
# With custom options
npx @harutakax/html-rag-optimizer input.html -o output.html --keep-attributes --exclude-tags script,style
| Option | Type | Default | Description |
|---|---|---|---|
| keepAttributes | boolean | false | Preserve HTML attributes |
| removeEmpty | boolean | true | Remove empty elements |
| preserveWhitespace | boolean | false | Preserve whitespace formatting |
| excludeTags | string[] | [] | Tags to exclude from removal |
| keepTags | string[] | [] | Only keep specified tags (removes others) |
| removeComments | boolean | true | Remove HTML comments |
| minifyText | boolean | true | Normalize and minify text content |
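The table above can be mirrored as a TypeScript type so that option objects are checked at compile time. This is a minimal sketch: `OptimizeOptions`, `DEFAULT_OPTIONS`, and `withDefaults` are hypothetical names, not exports confirmed by the package. Only the option names and default values come from the table.

```typescript
// Hypothetical type mirroring the options table; the package may ship its own.
interface OptimizeOptions {
  keepAttributes: boolean;
  removeEmpty: boolean;
  preserveWhitespace: boolean;
  excludeTags: string[];
  keepTags: string[];
  removeComments: boolean;
  minifyText: boolean;
}

// Defaults copied from the table above.
const DEFAULT_OPTIONS: OptimizeOptions = {
  keepAttributes: false,
  removeEmpty: true,
  preserveWhitespace: false,
  excludeTags: [],
  keepTags: [],
  removeComments: true,
  minifyText: true,
};

// Fill in any option the caller left out (illustrative helper, not a library API).
function withDefaults(partial: Partial<OptimizeOptions> = {}): OptimizeOptions {
  return { ...DEFAULT_OPTIONS, ...partial };
}

const resolvedOptions = withDefaults({ keepTags: ['h1', 'p'] });
console.log(resolvedOptions.keepTags, resolvedOptions.removeComments);
```

Passing the resolved object to `optimizeHtml` should behave the same as passing the partial one, assuming the library applies the documented defaults itself.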
import { optimizeHtml } from '@harutakax/html-rag-optimizer';
const options = {
keepAttributes: false,
removeEmpty: true,
preserveWhitespace: false,
excludeTags: ['code', 'pre'], // Don't remove code blocks
keepTags: ['h1', 'h2', 'h3', 'p', 'div', 'article'], // Only keep these tags
removeComments: true,
minifyText: true
};
const optimized = optimizeHtml(html, options);
import { optimizeHtmlFile, optimizeHtmlDir } from '@harutakax/html-rag-optimizer';
// Process single file
await optimizeHtmlFile('input.html', 'output.html', options);
// Process entire directory
await optimizeHtmlDir('./docs', './optimized', options);
import { optimizeHtml } from '@harutakax/html-rag-optimizer';
import { promises as fs } from 'fs';
async function processBatch(files: string[]) {
const results = await Promise.all(
files.map(async (file) => {
const html = await fs.readFile(file, 'utf-8');
return optimizeHtml(html, {
keepTags: ['h1', 'h2', 'h3', 'p', 'article'],
removeComments: true
});
})
);
return results;
}
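The `Promise.all` call above starts every file read at once, which can exhaust file handles or memory on large directories. One way to bound that is to split the file list into fixed-size groups and await each group in turn. The `toBatches` and `mapInBatches` helpers below are hypothetical and not part of the package.

```typescript
// Hypothetical helper: split a list into consecutive groups of at most `size` items.
function toBatches<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Run `worker` over `items` with at most `size` calls in flight at a time.
// Results are returned in input order.
async function mapInBatches<T, R>(
  items: readonly T[],
  size: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of toBatches(items, size)) {
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
  }
  return results;
}
```

For example, `mapInBatches(files, 10, async (file) => optimizeHtml(await fs.readFile(file, 'utf-8')))` would process ten files at a time.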
# Help
npx @harutakax/html-rag-optimizer --help
# Version
npx @harutakax/html-rag-optimizer --version
# Single file
npx @harutakax/html-rag-optimizer input.html -o output.html
# Directory processing
npx @harutakax/html-rag-optimizer --input-dir ./src --output-dir ./dist
-o, --output <path> Output file or directory
--input-dir <path> Input directory
--output-dir <path> Output directory
--keep-attributes Keep HTML attributes
--exclude-tags <tags> Exclude tags (comma-separated)
--keep-tags <tags> Keep only specified tags (comma-separated)
--preserve-whitespace Preserve whitespace
--config <path> Configuration file path
-h, --help Show help
-v, --version Show version
Create an html-rag-optimizer.json file:
{
"keepAttributes": false,
"removeEmpty": true,
"excludeTags": ["code", "pre"],
"keepTags": ["h1", "h2", "h3", "p", "div", "article"],
"removeComments": true,
"minifyText": true
}
Use with: npx @harutakax/html-rag-optimizer --config html-rag-optimizer.json input.html -o output.html
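Because the configuration file is plain JSON, a typo in a key name is easy to miss. A small validator can compare the file's keys against the option names listed earlier. This is a sketch under assumptions: `checkConfigKeys` is a hypothetical helper, and whether the CLI itself reports unknown keys is not documented here.

```typescript
// Option names taken from the options table in this README.
const KNOWN_OPTIONS: string[] = [
  'keepAttributes', 'removeEmpty', 'preserveWhitespace', 'excludeTags',
  'keepTags', 'removeComments', 'minifyText',
];

// Hypothetical helper: parse config text and list the keys that are not known options.
function checkConfigKeys(jsonText: string): string[] {
  const parsed: unknown = JSON.parse(jsonText);
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    throw new TypeError('config must be a JSON object');
  }
  return Object.keys(parsed).filter((key) => KNOWN_OPTIONS.indexOf(key) === -1);
}

// "removeComents" is misspelled on purpose, so it is reported as unknown.
console.log(checkConfigKeys('{"keepAttributes": false, "removeComents": true}'));
```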
The optimizer handles:
- <script> tags and content
- <style> tags and content
- <meta> tags
- Comments (<!-- -->)
- Empty elements (<div></div>, <p> </p>)
- Entities (&amp;, &lt;, etc.)

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Sample Page</title>
<style>body { font-family: Arial; }</style>
</head>
<body>
<div class="container" id="main">
<h1 class="title"> Welcome to Our Site </h1>
<!-- Navigation goes here -->
<p class="intro">This is a sample paragraph.</p>
<div></div>
<script>console.log('hello');</script>
</div>
</body>
</html>
<html><head><title>Sample Page</title></head><body><div><h1>Welcome to Our Site</h1><p>This is a sample paragraph.</p></div></body></html>
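To quantify what a transformation like the one above saves, the input and output lengths can be compared. The `sizeReduction` helper below is a hypothetical illustration, not part of the package. It counts UTF-16 code units rather than bytes or tokens.

```typescript
// Hypothetical helper: fraction of the original length removed by optimization.
function sizeReduction(original: string, optimized: string): number {
  if (original.length === 0) return 0; // avoid dividing by zero
  return 1 - optimized.length / original.length;
}

const rawHtml = '<div class="box"><p>Hi</p><!-- note --></div>';
const cleanHtml = '<div><p>Hi</p></div>';
console.log(`${(sizeReduction(rawHtml, cleanHtml) * 100).toFixed(1)}% smaller`);
```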
Perfect for preparing HTML content for vector databases and search systems:
// Optimize content before indexing
// fetchWebPage is a placeholder for your own fetching code
const webContent = await fetchWebPage(url);
const optimizedForRAG = optimizeHtml(webContent, {
keepTags: ['h1', 'h2', 'h3', 'p', 'article', 'section'],
removeComments: true,
minifyText: true
});
// Index optimizedForRAG in your vector database
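RAG pipelines commonly index short passages rather than whole pages, so optimized HTML is often split before embedding. The sketch below shows one possible approach, splitting at h1 to h3 headings. `splitByHeadings` and the `Chunk` shape are hypothetical and not part of the package. The regex-based parsing assumes minified, well-formed output like that shown earlier, not arbitrary HTML.

```typescript
interface Chunk {
  heading: string; // text of the h1-h3 that opens the section ('' if none)
  text: string;    // plain text that follows, up to the next heading
}

// Replace tags with spaces, then collapse runs of whitespace.
function stripTags(fragment: string): string {
  return fragment.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
}

function splitByHeadings(html: string): Chunk[] {
  const headingPattern = /<h[1-3][^>]*>([\s\S]*?)<\/h[1-3]>/gi;
  const chunks: Chunk[] = [];
  let heading = '';
  let cursor = 0;

  // Emit the section that ends at `end`, skipping sections with no content at all.
  const flush = (end: number): void => {
    const text = stripTags(html.slice(cursor, end));
    if (heading !== '' || text !== '') chunks.push({ heading, text });
  };

  let match: RegExpExecArray | null;
  while ((match = headingPattern.exec(html)) !== null) {
    flush(match.index);
    heading = stripTags(match[1] ?? '');
    cursor = match.index + match[0].length;
  }
  flush(html.length);
  return chunks;
}

const sections = splitByHeadings(
  '<div><h1>Welcome</h1><p>Intro text.</p><h2>Details</h2><p>More text.</p></div>',
);
console.log(sections);
```

Each chunk pairs a heading with the text beneath it, which can then be embedded and stored as one record. For real-world HTML, a proper parser would be a safer choice than regular expressions.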
Clean up documentation before feeding to LLMs:
import { promises as fs } from 'fs';
const docs = await fs.readFile('documentation.html', 'utf-8');
const cleanDocs = optimizeHtml(docs, {
excludeTags: ['code', 'pre'], // Keep code examples
keepTags: ['h1', 'h2', 'h3', 'p', 'ul', 'ol', 'li', 'code', 'pre']
});
Clean scraped content for analysis:
// scrapeWebsite is a placeholder for your own scraping code
const scrapedHTML = await scrapeWebsite(url);
const cleanContent = optimizeHtml(scrapedHTML, {
keepTags: ['p', 'h1', 'h2', 'h3', 'article'],
removeComments: true,
minifyText: true
});
# Clone the repository
git clone https://github.com/your-org/html-rag-optimizer.git
cd html-rag-optimizer
# Install dependencies
pnpm install
# Run tests
pnpm test
# Build
pnpm build
# Run examples
pnpm tsx examples/basic-usage.ts