
@harutakax/html-rag-optimizer
HTML optimization tool for RAG (Retrieval-Augmented Generation) systems
A powerful HTML optimization tool designed specifically for RAG (Retrieval-Augmented Generation) systems. This library removes unnecessary HTML elements, attributes, and formatting to create clean, search-optimized content while preserving semantic structure.
npm install @harutakax/html-rag-optimizer
import { optimizeHtml } from '@harutakax/html-rag-optimizer';
const html = `
<div class="container">
<h1 id="title">Welcome</h1>
<p>This is a <strong>sample</strong> paragraph.</p>
<script>console.log('remove me');</script>
<style>.container { margin: 0; }</style>
</div>
`;
// Basic optimization
const optimized = optimizeHtml(html);
console.log(optimized);
// Output: <div><h1>Welcome</h1><p>This is a <strong>sample</strong> paragraph.</p></div>
# Optimize a single file
npx @harutakax/html-rag-optimizer input.html -o output.html
# Optimize an entire directory
html-rag-optimizer --input-dir ./docs --output-dir ./optimized
# With custom options
html-rag-optimizer input.html -o output.html --keep-attributes --exclude-tags script,style
Option | Type | Default | Description |
---|---|---|---|
keepAttributes | boolean | false | Preserve HTML attributes |
removeEmpty | boolean | true | Remove empty elements |
preserveWhitespace | boolean | false | Preserve whitespace formatting |
excludeTags | string[] | [] | Tags to exclude from removal |
keepTags | string[] | [] | Only keep specified tags (removes others) |
removeComments | boolean | true | Remove HTML comments |
minifyText | boolean | true | Normalize and minify text content |
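The options table can also be written down as a TypeScript type with a defaults object, which makes it easier to see what a partial options object resolves to. This is a minimal sketch: the names OptimizeOptions, DEFAULT_OPTIONS, and withDefaults are hypothetical and are not confirmed exports of the package; only the option names, types, and default values come from the table above.

```typescript
// Hypothetical type mirroring the options table; the package may define its own.
interface OptimizeOptions {
  keepAttributes: boolean;
  removeEmpty: boolean;
  preserveWhitespace: boolean;
  excludeTags: string[];
  keepTags: string[];
  removeComments: boolean;
  minifyText: boolean;
}

// Default values copied from the options table.
const DEFAULT_OPTIONS: OptimizeOptions = {
  keepAttributes: false,
  removeEmpty: true,
  preserveWhitespace: false,
  excludeTags: [],
  keepTags: [],
  removeComments: true,
  minifyText: true,
};

// Fill in every option the caller left out (hypothetical helper).
function withDefaults(partial: Partial<OptimizeOptions> = {}): OptimizeOptions {
  return { ...DEFAULT_OPTIONS, ...partial };
}

// Only excludeTags is overridden; everything else keeps its default.
const resolved = withDefaults({ excludeTags: ['code', 'pre'] });
console.log(resolved);
```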
import { optimizeHtml } from '@harutakax/html-rag-optimizer';
const options = {
keepAttributes: false,
removeEmpty: true,
preserveWhitespace: false,
excludeTags: ['code', 'pre'], // Don't remove code blocks
keepTags: ['h1', 'h2', 'h3', 'p', 'div', 'article'], // Only keep these tags
removeComments: true,
minifyText: true
};
const optimized = optimizeHtml(html, options);
import { optimizeHtmlFile, optimizeHtmlDir } from '@harutakax/html-rag-optimizer';
// Process single file
await optimizeHtmlFile('input.html', 'output.html', options);
// Process entire directory
await optimizeHtmlDir('./docs', './optimized', options);
import { optimizeHtml } from '@harutakax/html-rag-optimizer';
import { promises as fs } from 'fs';
async function processBatch(files: string[]) {
const results = await Promise.all(
files.map(async (file) => {
const html = await fs.readFile(file, 'utf-8');
return optimizeHtml(html, {
keepTags: ['h1', 'h2', 'h3', 'p', 'article'],
removeComments: true
});
})
);
return results;
}
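Promise.all starts every file read at once, which may exhaust file handles on very large batches. One possible variation is to cap the number of files in flight. The helper below is a sketch, not part of the package: mapWithLimit is a hypothetical name, and the limit of 8 in the usage comment is an arbitrary choice.

```typescript
// Apply `fn` to every item while running at most `limit` calls at a time.
// Results are returned in the same order as `items`.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next item to claim

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index]);
    }
  }

  const workerCount = Math.min(Math.max(1, limit), items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Possible usage with the batch example above (assumes `fs` and `optimizeHtml`
// are imported as shown there):
//
//   const results = await mapWithLimit(files, 8, async (file) =>
//     optimizeHtml(await fs.readFile(file, 'utf-8'), { removeComments: true }),
//   );
```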
# Help
html-rag-optimizer --help
# Version
html-rag-optimizer --version
# Single file
html-rag-optimizer input.html -o output.html
# Directory processing
html-rag-optimizer --input-dir ./src --output-dir ./dist
-o, --output <path>      Output file or directory
--input-dir <path>       Input directory
--output-dir <path>      Output directory
--keep-attributes        Keep HTML attributes
--exclude-tags <tags>    Exclude tags (comma-separated)
--keep-tags <tags>       Keep only specified tags (comma-separated)
--preserve-whitespace    Preserve whitespace
--config <path>          Configuration file path
-h, --help               Show help
-v, --version            Show version
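The long flags follow the API option names in kebab-case (--keep-attributes corresponds to keepAttributes, --exclude-tags to excludeTags), and tag lists are passed as comma-separated values. The sketch below illustrates that naming convention only; flagToOptionName and parseTagList are hypothetical helpers and do not claim to reproduce the CLI's actual parser.

```typescript
// Convert a long flag such as "--keep-attributes" to "keepAttributes".
function flagToOptionName(flag: string): string {
  return flag
    .replace(/^--/, '')
    .replace(/-([a-z])/g, (_match: string, letter: string) => letter.toUpperCase());
}

// Split a comma-separated value such as "script,style" into a tag list,
// ignoring surrounding spaces and empty entries.
function parseTagList(value: string): string[] {
  return value
    .split(',')
    .map((tag) => tag.trim())
    .filter((tag) => tag.length > 0);
}

console.log(flagToOptionName('--exclude-tags'), parseTagList('script,style'));
```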
Create an html-rag-optimizer.json file:
{
"keepAttributes": false,
"removeEmpty": true,
"excludeTags": ["code", "pre"],
"keepTags": ["h1", "h2", "h3", "p", "div", "article"],
"removeComments": true,
"minifyText": true
}
Use with: html-rag-optimizer --config html-rag-optimizer.json input.html -o output.html
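A typo in the configuration file (for example a string where an array is expected) can be caught before running the tool. The sketch below checks a config's JSON text against the option names and types from the options table. It assumes the config file accepts exactly those options; checkConfig and OPTION_TYPES are hypothetical names, not part of the package.

```typescript
// Expected JSON type for each documented option (taken from the options table).
const OPTION_TYPES = new Map<string, 'boolean' | 'string[]'>([
  ['keepAttributes', 'boolean'],
  ['removeEmpty', 'boolean'],
  ['preserveWhitespace', 'boolean'],
  ['excludeTags', 'string[]'],
  ['keepTags', 'string[]'],
  ['removeComments', 'boolean'],
  ['minifyText', 'boolean'],
]);

// Return a list of human-readable problems; an empty list means no issue was found.
function checkConfig(jsonText: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(jsonText);
  } catch {
    return ['config is not valid JSON'];
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return ['config must be a JSON object'];
  }

  const problems: string[] = [];
  for (const [key, value] of Object.entries(parsed)) {
    const expected = OPTION_TYPES.get(key);
    if (expected === undefined) {
      problems.push(`unknown option: ${key}`);
      continue;
    }
    const ok =
      expected === 'boolean'
        ? typeof value === 'boolean'
        : Array.isArray(value) && value.every((v) => typeof v === 'string');
    if (!ok) problems.push(`${key} should be ${expected}`);
  }
  return problems;
}

// keepTags is a string instead of an array, and "verbose" is not a documented option.
console.log(checkConfig('{"keepTags": "h1,h2", "verbose": true}'));
```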
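Removed during optimization:
- <script> tags and content
- <style> tags and content
- <meta> tags
- HTML comments (<!-- -->)
- Empty elements (<div></div>, <p> </p>)
Also handled:
- HTML entities (&amp;, &lt;, etc.)
Example input:
<!DOCTYPE html>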
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Sample Page</title>
<style>body { font-family: Arial; }</style>
</head>
<body>
<div class="container" id="main">
<h1 class="title"> Welcome to Our Site </h1>
<!-- Navigation goes here -->
<p class="intro">This is a sample paragraph.</p>
<div></div>
<script>console.log('hello');</script>
</div>
</body>
</html>
Optimized output:
<html><head><title>Sample Page</title></head><body><div><h1>Welcome to Our Site</h1><p>This is a sample paragraph.</p></div></body></html>
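To get a rough sense of how much markup an optimization pass removes, the lengths of the input and output strings can be compared. This is a sketch with a hypothetical helper name (sizeReduction); string length is only a proxy and does not correspond exactly to byte size or token count.

```typescript
// Fraction of characters removed, between 0 (nothing removed) and 1 (everything removed).
function sizeReduction(before: string, after: string): number {
  if (before.length === 0) return 0; // avoid dividing by zero on empty input
  return 1 - after.length / before.length;
}

// Small stand-in strings; in practice pass the original HTML and the optimizer's output.
const before = '<div class="x"><p>Hi</p><!-- note --></div>';
const after = '<div><p>Hi</p></div>';
console.log(`${(sizeReduction(before, after) * 100).toFixed(1)}% smaller`);
```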
Perfect for preparing HTML content for vector databases and search systems:
// Optimize content before indexing
// (fetchWebPage is a placeholder for your own fetching code; it is not defined in this snippet)
const webContent = await fetchWebPage(url);
const optimizedForRAG = optimizeHtml(webContent, {
keepTags: ['h1', 'h2', 'h3', 'p', 'article', 'section'],
removeComments: true,
minifyText: true
});
// Index optimizedForRAG in your vector database
Clean up documentation before feeding to LLMs:
import { promises as fs } from 'fs';
const docs = await fs.readFile('documentation.html', 'utf-8');
const cleanDocs = optimizeHtml(docs, {
excludeTags: ['code', 'pre'], // Keep code examples
keepTags: ['h1', 'h2', 'h3', 'p', 'ul', 'ol', 'li', 'code', 'pre']
});
Clean scraped content for analysis:
// scrapeWebsite is a placeholder for your own scraping code; it is not defined in this snippet
const scrapedHTML = await scrapeWebsite(url);
const cleanContent = optimizeHtml(scrapedHTML, {
keepTags: ['p', 'h1', 'h2', 'h3', 'article'],
removeComments: true,
minifyText: true
});
# Clone the repository
git clone https://github.com/your-org/html-rag-optimizer.git
cd html-rag-optimizer
# Install dependencies
pnpm install
# Run tests
pnpm test
# Build
pnpm build
# Run examples
pnpm tsx examples/basic-usage.ts
FAQs
The npm package @harutakax/html-rag-optimizer receives a total of 11 weekly downloads and is therefore classified as not popular.
We found that @harutakax/html-rag-optimizer demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.