
@harutakax/html-rag-optimizer
HTML optimization tool for RAG (Retrieval-Augmented Generation) systems
A powerful HTML optimization tool designed specifically for RAG (Retrieval-Augmented Generation) systems. This library removes unnecessary HTML elements, attributes, and formatting to create clean, search-optimized content while preserving semantic structure.
```bash
npm install @harutakax/html-rag-optimizer
```

```typescript
import { optimizeHtml } from '@harutakax/html-rag-optimizer';

const html = `
<div class="container">
  <h1 id="title">Welcome</h1>
  <p>This is a <strong>sample</strong> paragraph.</p>
  <script>console.log('remove me');</script>
  <style>.container { margin: 0; }</style>
</div>
`;

// Basic optimization
const optimized = optimizeHtml(html);
console.log(optimized);
// Output: <div><h1>Welcome</h1><p>This is a<strong>sample</strong>paragraph.</p></div>
```
```bash
# Optimize a single file without installing
npx @harutakax/html-rag-optimizer input.html -o output.html

# Install globally to call html-rag-optimizer directly
npm install -g @harutakax/html-rag-optimizer

# Optimize an entire directory
html-rag-optimizer --input-dir ./docs --output-dir ./optimized

# With custom options
html-rag-optimizer input.html -o output.html --keep-attributes --exclude-tags script,style
```
| Option | Type | Default | Description |
|---|---|---|---|
| `keepAttributes` | `boolean` | `false` | Preserve HTML attributes |
| `removeEmpty` | `boolean` | `true` | Remove empty elements |
| `preserveWhitespace` | `boolean` | `false` | Preserve whitespace formatting |
| `excludeTags` | `string[]` | `[]` | Tags to exclude from removal |
| `removeComments` | `boolean` | `true` | Remove HTML comments |
| `minifyText` | `boolean` | `true` | Normalize and minify text content |
```typescript
import { optimizeHtml } from '@harutakax/html-rag-optimizer';

const options = {
  keepAttributes: false,
  removeEmpty: true,
  preserveWhitespace: false,
  excludeTags: ['code', 'pre'], // Don't remove code blocks
  removeComments: true,
  minifyText: true
};

const optimized = optimizeHtml(html, options);
```
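Because every option in the table has a default, a partial options object should be enough in most cases. The sketch below illustrates how such defaults could be merged with caller-supplied values. `OptimizeOptions`, `DEFAULT_OPTIONS`, and `resolveOptions` are hypothetical names chosen for this example, not part of the package's documented API, and the library's own merging logic may differ.

```typescript
// Hypothetical option shape, copied from the options table above.
interface OptimizeOptions {
  keepAttributes: boolean;
  removeEmpty: boolean;
  preserveWhitespace: boolean;
  excludeTags: string[];
  removeComments: boolean;
  minifyText: boolean;
}

// Default values as listed in the table.
const DEFAULT_OPTIONS: OptimizeOptions = {
  keepAttributes: false,
  removeEmpty: true,
  preserveWhitespace: false,
  excludeTags: [],
  removeComments: true,
  minifyText: true,
};

// Assumed behavior: caller-supplied values override the defaults key by key.
// The excludeTags array is copied so the defaults are never mutated.
function resolveOptions(partial: Partial<OptimizeOptions> = {}): OptimizeOptions {
  return {
    ...DEFAULT_OPTIONS,
    ...partial,
    excludeTags: [...(partial.excludeTags ?? DEFAULT_OPTIONS.excludeTags)],
  };
}

// Only excludeTags is overridden; every other option keeps its default.
const resolved = resolveOptions({ excludeTags: ['code', 'pre'] });
console.log(resolved);
```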
```typescript
import { optimizeHtmlFile, optimizeHtmlDir } from '@harutakax/html-rag-optimizer';

// Process single file
await optimizeHtmlFile('input.html', 'output.html', options);

// Process entire directory
await optimizeHtmlDir('./docs', './optimized', options);
```
```typescript
import { optimizeHtml } from '@harutakax/html-rag-optimizer';
import { promises as fs } from 'fs';

// Read and optimize all files concurrently; results keep the input order.
async function processBatch(files: string[]) {
  const results = await Promise.all(
    files.map(async (file) => {
      const html = await fs.readFile(file, 'utf-8');
      return optimizeHtml(html, {
        removeComments: true
      });
    })
  );
  return results;
}
```
The commands below assume the CLI is installed globally.
```bash
# Help
html-rag-optimizer --help

# Version
html-rag-optimizer --version

# Single file
html-rag-optimizer input.html -o output.html

# Directory processing
html-rag-optimizer --input-dir ./src --output-dir ./dist
```
```text
-o, --output <path>       Output file or directory
    --input-dir <path>    Input directory
    --output-dir <path>   Output directory
    --keep-attributes     Keep HTML attributes
    --exclude-tags <tags> Exclude tags (comma-separated)
    --preserve-whitespace Preserve whitespace
    --config <path>       Configuration file path
-h, --help                Show help
-v, --version             Show version
```
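The flag names suggest a direct correspondence with the library options (`--keep-attributes` with `keepAttributes`, `--exclude-tags` with `excludeTags`, and so on). The following sketch shows one way that mapping could look, including how a comma-separated tag list becomes a string array. `CliOptions` and `flagsToOptions` are hypothetical, and the mapping itself is an assumption based on the names; the real CLI may parse its arguments differently.

```typescript
// Hypothetical subset of options that have a matching CLI flag.
interface CliOptions {
  keepAttributes?: boolean;
  preserveWhitespace?: boolean;
  excludeTags?: string[];
}

// Assumed mapping from CLI flags to option names; unrelated arguments are ignored.
function flagsToOptions(argv: string[]): CliOptions {
  const options: CliOptions = {};
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === '--keep-attributes') {
      options.keepAttributes = true;
    } else if (arg === '--preserve-whitespace') {
      options.preserveWhitespace = true;
    } else if (arg === '--exclude-tags') {
      const value = argv[i + 1];
      if (value === undefined) {
        throw new Error('--exclude-tags needs a comma-separated value');
      }
      // "script, style" -> ['script', 'style']
      options.excludeTags = value
        .split(',')
        .map((tag) => tag.trim())
        .filter((tag) => tag.length > 0);
      i++; // skip the value that was just consumed
    }
  }
  return options;
}

console.log(
  flagsToOptions(['input.html', '-o', 'output.html', '--keep-attributes', '--exclude-tags', 'script,style'])
);
```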
Create an `html-rag-optimizer.json` file:

```json
{
  "keepAttributes": false,
  "removeEmpty": true,
  "excludeTags": ["code", "pre"],
  "removeComments": true,
  "minifyText": true
}
```

Use with: `html-rag-optimizer --config html-rag-optimizer.json input.html -o output.html`
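A typo in a config key is easy to miss, so it can help to check the file before passing it to the CLI. The sketch below is a hypothetical pre-flight check, not how the CLI reads its configuration: `OPTION_TYPES` and `parseConfig` are names invented for this example, and the expected types are taken from the options table.

```typescript
// Option names and expected types, taken from the options table.
const OPTION_TYPES: { [name: string]: 'boolean' | 'string[]' } = {
  keepAttributes: 'boolean',
  removeEmpty: 'boolean',
  preserveWhitespace: 'boolean',
  excludeTags: 'string[]',
  removeComments: 'boolean',
  minifyText: 'boolean',
};

// Hypothetical helper: parse config text and reject unknown keys or wrong types.
function parseConfig(jsonText: string): { [name: string]: boolean | string[] } {
  const raw: unknown = JSON.parse(jsonText);
  if (typeof raw !== 'object' || raw === null || Array.isArray(raw)) {
    throw new Error('Config must be a JSON object');
  }
  const record = raw as { [name: string]: unknown };
  const config: { [name: string]: boolean | string[] } = {};
  for (const key of Object.keys(record)) {
    if (!Object.prototype.hasOwnProperty.call(OPTION_TYPES, key)) {
      throw new Error(`Unknown option: ${key}`);
    }
    const expected = OPTION_TYPES[key];
    const value = record[key];
    const valid =
      expected === 'boolean'
        ? typeof value === 'boolean'
        : Array.isArray(value) && value.every((item) => typeof item === 'string');
    if (!valid) {
      throw new Error(`Option ${key} must be of type ${expected}`);
    }
    config[key] = value as boolean | string[];
  }
  return config;
}

// A misspelled key such as "keepAttribute" would throw instead of being ignored.
console.log(parseConfig('{"keepAttributes": false, "excludeTags": ["code", "pre"]}'));
```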
- `<script>` tags and content
- `<style>` tags and content
- `<meta>` tags
- HTML comments (`<!-- -->`)
- Empty elements (`<div></div>`, `<p> </p>`)
- HTML entities (`&amp;`, `&lt;`, etc.)

Input:

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Sample Page</title>
  <style>body { font-family: Arial; }</style>
</head>
<body>
  <div class="container" id="main">
    <h1 class="title"> Welcome to Our Site </h1>
    <!-- Navigation goes here -->
    <p class="intro">This is a sample paragraph.</p>
    <div></div>
    <script>console.log('hello');</script>
  </div>
</body>
</html>
```

Output:

```html
<html><head><title>Sample Page</title></head><body><div><h1>Welcome to Our Site</h1><p>This is a sample paragraph.</p></div></body></html>
```
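To make the transformation above concrete, here is a deliberately simplified sketch of the same kind of cleanup using regular expressions. It is not the library's implementation: `naiveOptimize` is a hypothetical function written for this example, it ignores every option, and regular expressions are not a robust way to process arbitrary HTML (for instance, an attribute value containing `>` would break it).

```typescript
// Simplified illustration only; NOT how the package is implemented.
function naiveOptimize(html: string): string {
  let out = html;
  // 1. Drop the doctype, comments, and script/style elements with their content.
  out = out.replace(/<!DOCTYPE[^>]*>/gi, '');
  out = out.replace(/<!--[\s\S]*?-->/g, '');
  out = out.replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, '');
  // 2. Drop <meta> tags.
  out = out.replace(/<meta\b[^>]*>/gi, '');
  // 3. Strip attributes from the remaining opening tags.
  out = out.replace(/<([a-zA-Z][\w-]*)[^>]*>/g, '<$1>');
  // 4. Collapse whitespace runs, then trim whitespace next to tags.
  out = out.replace(/\s+/g, ' ').replace(/>\s+/g, '>').replace(/\s+</g, '<').trim();
  // 5. Remove empty elements until nothing changes (handles nested empties).
  let previous = '';
  while (previous !== out) {
    previous = out;
    out = out.replace(/<([a-zA-Z][\w-]*)><\/\1>/g, '');
  }
  return out;
}

const quickStart = `
<div class="container">
  <h1 id="title">Welcome</h1>
  <p>This is a <strong>sample</strong> paragraph.</p>
  <script>console.log('remove me');</script>
  <style>.container { margin: 0; }</style>
</div>
`;
console.log(naiveOptimize(quickStart));
// Prints: <div><h1>Welcome</h1><p>This is a<strong>sample</strong>paragraph.</p></div>
```

Step 4 trims whitespace on both sides of every tag, so the spaces around inline elements such as `<strong>` disappear as well. If that matters for your content, a real parser-based approach would need to treat inline elements differently.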
Perfect for preparing HTML content for vector databases and search systems:
```typescript
// Optimize content before indexing.
// fetchWebPage is a placeholder for your own fetching function.
const webContent = await fetchWebPage(url);
const optimizedForRAG = optimizeHtml(webContent, {
  removeComments: true,
  minifyText: true
});
// Index optimizedForRAG in your vector database
```

Clean up documentation before feeding to LLMs:

```typescript
import { promises as fs } from 'fs';

const docs = await fs.readFile('documentation.html', 'utf-8');
const cleanDocs = optimizeHtml(docs, {
  excludeTags: ['code', 'pre'], // Keep code examples
});
```

Clean scraped content for analysis:

```typescript
// scrapeWebsite is a placeholder for your own scraping function.
const scrapedHTML = await scrapeWebsite(url);
const cleanContent = optimizeHtml(scrapedHTML, {
  removeComments: true,
  minifyText: true
});
```
```bash
# Clone the repository
git clone https://github.com/your-org/html-rag-optimizer.git
cd html-rag-optimizer

# Install dependencies
pnpm install

# Run tests
pnpm test

# Build
pnpm build

# Run examples
pnpm dlx tsx examples/basic-usage.ts
```
The npm package @harutakax/html-rag-optimizer receives a total of 500 weekly downloads. As such, @harutakax/html-rag-optimizer popularity was classified as not popular.
We found that @harutakax/html-rag-optimizer demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.