@harutakax/html-rag-optimizer

HTML optimization tool for RAG (Retrieval-Augmented Generation) systems

Version 0.1.1 (latest) · Source: npm · Weekly downloads: 500 · Maintainers: 1

HTML RAG Optimizer

A powerful HTML optimization tool designed specifically for RAG (Retrieval-Augmented Generation) systems. This library removes unnecessary HTML elements, attributes, and formatting to create clean, search-optimized content while preserving semantic structure.

Features

  • 🚀 Fast Processing: Optimizes large HTML files (1 MB+) in under 5 seconds
  • 🎯 RAG-Focused: Designed specifically for information retrieval systems
  • ⚙️ Highly Configurable: Extensive options for customizing optimization behavior
  • 📝 TypeScript Support: Full TypeScript support with detailed type definitions
  • 🛠️ CLI & API: Both command-line interface and programmatic API
  • 🔄 Batch Processing: Supports single files and entire directories
  • 📊 Performance Optimized: Efficient memory usage and concurrent processing

Installation

npm install @harutakax/html-rag-optimizer

Quick Start

Programmatic API

import { optimizeHtml } from '@harutakax/html-rag-optimizer';

const html = `
<div class="container">
  <h1 id="title">Welcome</h1>
  <p>This is a <strong>sample</strong> paragraph.</p>
  <script>console.log('remove me');</script>
  <style>.container { margin: 0; }</style>
</div>
`;

// Basic optimization
const optimized = optimizeHtml(html);
console.log(optimized);
// Output: <div><h1>Welcome</h1><p>This is a<strong>sample</strong>paragraph.</p></div>

CLI Usage

# Optimize a single file without installing
npx @harutakax/html-rag-optimizer input.html -o output.html

# Or install globally to use the html-rag-optimizer command directly
npm install -g @harutakax/html-rag-optimizer

# Optimize an entire directory
html-rag-optimizer --input-dir ./docs --output-dir ./optimized

# With custom options
html-rag-optimizer input.html -o output.html --keep-attributes --exclude-tags script,style

Configuration Options

Option              Type      Default  Description
keepAttributes      boolean   false    Preserve HTML attributes
removeEmpty         boolean   true     Remove empty elements
preserveWhitespace  boolean   false    Preserve whitespace formatting
excludeTags         string[]  []       Tags to exclude from removal
removeComments      boolean   true     Remove HTML comments
minifyText          boolean   true     Normalize and minify text content
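
The option table can be read as a TypeScript shape plus a set of defaults. The sketch below is one way such options might be typed and merged. The names `OptimizeOptions`, `DEFAULT_OPTIONS`, and `resolveOptions` are hypothetical and are not taken from the package's published type definitions.

```typescript
// Hypothetical shape mirroring the option table; the package's real type may differ.
interface OptimizeOptions {
  keepAttributes: boolean;
  removeEmpty: boolean;
  preserveWhitespace: boolean;
  excludeTags: string[];
  removeComments: boolean;
  minifyText: boolean;
}

// Defaults copied from the "Default" column of the table.
const DEFAULT_OPTIONS: OptimizeOptions = {
  keepAttributes: false,
  removeEmpty: true,
  preserveWhitespace: false,
  excludeTags: [],
  removeComments: true,
  minifyText: true,
};

// Merge caller overrides over the defaults without mutating either object.
function resolveOptions(overrides: Partial<OptimizeOptions> = {}): OptimizeOptions {
  return {
    ...DEFAULT_OPTIONS,
    ...overrides,
    // Copy the array so later edits to the result do not leak into the defaults.
    excludeTags: [...(overrides.excludeTags ?? DEFAULT_OPTIONS.excludeTags)],
  };
}

const resolved = resolveOptions({ excludeTags: ['code', 'pre'] });
console.log(resolved.keepAttributes, resolved.excludeTags);
```

Passing only the options you want to change keeps call sites short while the remaining values fall back to the documented defaults.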

Advanced Usage

Custom Configuration

import { optimizeHtml } from '@harutakax/html-rag-optimizer';

const options = {
  keepAttributes: false,
  removeEmpty: true,
  preserveWhitespace: false,
  excludeTags: ['code', 'pre'], // Don't remove code blocks
  removeComments: true,
  minifyText: true
};

const optimized = optimizeHtml(html, options);

File Processing

import { optimizeHtmlFile, optimizeHtmlDir } from '@harutakax/html-rag-optimizer';

// Process single file
await optimizeHtmlFile('input.html', 'output.html', options);

// Process entire directory
await optimizeHtmlDir('./docs', './optimized', options);

Batch Processing with Custom Logic

import { optimizeHtml } from '@harutakax/html-rag-optimizer';
import { promises as fs } from 'fs';

async function processBatch(files: string[]) {
  const results = await Promise.all(
    files.map(async (file) => {
      const html = await fs.readFile(file, 'utf-8');
      return optimizeHtml(html, {
        removeComments: true
      });
    })
  );
  return results;
}
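
`Promise.all` over every file starts all reads at once, which can hold many large documents in memory at the same time. The sketch below shows a bounded-concurrency helper that could stand in for the `Promise.all(files.map(...))` call. `mapWithConcurrency` is a hypothetical helper written for this example and is not part of the package.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight,
// returning results in the same order as the input.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner claims items one at a time until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  // Never start more runners than there are items, and always at least one.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Example with a stand-in worker; in processBatch the worker would read and optimize a file.
mapWithConcurrency(['a', 'b', 'c'], 2, async (name) => name.toUpperCase())
  .then((out) => console.log(out));
```

In `processBatch`, the worker could be `async (file) => optimizeHtml(await fs.readFile(file, 'utf-8'))`, with the limit tuned to the size of the files being processed.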

CLI Reference

Basic Commands

The commands below assume the package is installed globally.

# Help
html-rag-optimizer --help

# Version
html-rag-optimizer --version

# Single file
html-rag-optimizer input.html -o output.html

# Directory processing
html-rag-optimizer --input-dir ./src --output-dir ./dist

CLI Options

-o, --output <path>      Output file or directory
--input-dir <path>       Input directory
--output-dir <path>      Output directory
--keep-attributes        Keep HTML attributes
--exclude-tags <tags>    Exclude tags (comma-separated)
--preserve-whitespace    Preserve whitespace
--config <path>          Configuration file path
-h, --help               Show help
-v, --version            Show version
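
The option-related flags appear to line up with the option names in the configuration table (`--keep-attributes` with `keepAttributes`, and so on). The sketch below illustrates that correspondence. The mapping is inferred from the names only, `flagsToOptions` and `CliDerivedOptions` are hypothetical, and this is not the package's actual argument parser.

```typescript
// Option names as listed in the configuration table.
interface CliDerivedOptions {
  keepAttributes?: boolean;
  preserveWhitespace?: boolean;
  excludeTags?: string[];
}

// Translate an argv-style array into an options object.
// Only the three option-related flags from the list above are handled.
function flagsToOptions(argv: string[]): CliDerivedOptions {
  const options: CliDerivedOptions = {};
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === '--keep-attributes') {
      options.keepAttributes = true;
    } else if (arg === '--preserve-whitespace') {
      options.preserveWhitespace = true;
    } else if (arg === '--exclude-tags') {
      // The value is the next argument: a comma-separated tag list.
      const value = argv[i + 1] ?? '';
      options.excludeTags = value
        .split(',')
        .map((tag) => tag.trim().toLowerCase())
        .filter((tag) => tag.length > 0);
      i++; // skip the value that was just consumed
    }
  }
  return options;
}

console.log(
  flagsToOptions(['input.html', '-o', 'output.html', '--keep-attributes', '--exclude-tags', 'script,style']),
);
```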

Configuration File

Create an html-rag-optimizer.json file:

{
  "keepAttributes": false,
  "removeEmpty": true,
  "excludeTags": ["code", "pre"],
  "removeComments": true,
  "minifyText": true
}

Use with: html-rag-optimizer --config html-rag-optimizer.json input.html -o output.html
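
A misspelled key in the configuration file is easy to miss, and this README does not say whether the CLI reports unknown keys. The sketch below is a pre-flight check you could run yourself before invoking the tool. `parseConfig` is a hypothetical helper, and the accepted keys are those in the configuration table.

```typescript
// Boolean options from the configuration table.
const KNOWN_BOOLEAN_OPTIONS: readonly string[] = [
  'keepAttributes',
  'removeEmpty',
  'preserveWhitespace',
  'removeComments',
  'minifyText',
];

type ConfigObject = Record<string, unknown>;

// Parse config text and return it if every key and value type is recognized;
// throw a descriptive error otherwise.
function parseConfig(jsonText: string): ConfigObject {
  const parsed: unknown = JSON.parse(jsonText);
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    throw new Error('Config must be a JSON object');
  }
  const config = parsed as ConfigObject;
  for (const key of Object.keys(config)) {
    const value = config[key];
    if (KNOWN_BOOLEAN_OPTIONS.indexOf(key) !== -1) {
      if (typeof value !== 'boolean') {
        throw new Error(`"${key}" must be a boolean`);
      }
    } else if (key === 'excludeTags') {
      if (!Array.isArray(value) || !value.every((tag) => typeof tag === 'string')) {
        throw new Error('"excludeTags" must be an array of strings');
      }
    } else {
      throw new Error(`Unknown option "${key}"`);
    }
  }
  return config;
}

const config = parseConfig('{"keepAttributes": false, "excludeTags": ["code", "pre"]}');
console.log(config);
```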

What Gets Optimized

Removed by Default

  • <script> tags and content
  • <style> tags and content
  • <meta> tags
  • HTML comments (<!-- -->)
  • All HTML attributes (class, id, style, etc.)
  • Empty elements (<div></div>, <p> </p>)
  • Excess whitespace and formatting

Preserved

  • Semantic HTML structure
  • Text content
  • Essential tags (headings, paragraphs, lists, etc.)
  • HTML entities (&amp;, &lt;, etc.)

Before Optimization

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Sample Page</title>
  <style>body { font-family: Arial; }</style>
</head>
<body>
  <div class="container" id="main">
    <h1 class="title">   Welcome to Our Site   </h1>
    <!-- Navigation goes here -->
    <p class="intro">This is a   sample   paragraph.</p>
    <div></div>
    <script>console.log('hello');</script>
  </div>
</body>
</html>

After Optimization

<html><head><title>Sample Page</title></head><body><div><h1>Welcome to Our Site</h1><p>This is a sample paragraph.</p></div></body></html>
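
To make the default rules concrete, the sketch below applies them with plain regular expressions. It is not the library's implementation: `naiveOptimize` is a hypothetical function, it ignores all configuration options, and regex-based HTML handling breaks on inputs such as attribute values containing `>`. A robust tool would use a real HTML parser.

```typescript
// Simplified illustration of the default rules; assumes reasonably well-formed HTML.
function naiveOptimize(html: string): string {
  let out = html;
  // The sample output above has no doctype, so drop it.
  out = out.replace(/<!DOCTYPE[^>]*>/gi, '');
  // Remove <script> and <style> elements together with their content.
  out = out.replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, '');
  // Remove <meta> tags and HTML comments.
  out = out.replace(/<meta\b[^>]*>/gi, '');
  out = out.replace(/<!--[\s\S]*?-->/g, '');
  // Strip attributes from opening tags, keeping a self-closing slash if present.
  out = out.replace(/<([a-zA-Z][a-zA-Z0-9-]*)[^>]*?(\/?)>/g, '<$1$2>');
  // Collapse whitespace runs, then trim whitespace next to tags.
  out = out.replace(/\s+/g, ' ');
  out = out.replace(/>\s+/g, '>').replace(/\s+</g, '<');
  // Remove empty elements repeatedly, since removing one can empty its parent.
  let previous = '';
  do {
    previous = out;
    out = out.replace(/<([a-zA-Z][a-zA-Z0-9-]*)>\s*<\/\1>/g, '');
  } while (out !== previous);
  return out.trim();
}

console.log(naiveOptimize('<div class="box"><p id="a">  Hello   world  </p><div></div></div>'));
```

Run against the "Before Optimization" sample, this sketch yields the same string as the "After Optimization" output. Because whitespace next to tags is trimmed, it also drops the spaces around inline elements such as `<strong>`, as seen in the Quick Start output.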

Performance

  • Large Files: Processes 1MB+ HTML files in under 5 seconds
  • Memory Efficient: Memory usage stays under 3x input file size
  • Concurrent Processing: Supports parallel processing of multiple files
  • Scalable: Performance scales linearly with input size
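
These figures depend on hardware and input, so it is worth measuring on your own data. The sketch below is a small timing harness. `makeSyntheticHtml` and `timeOptimizer` are hypothetical helpers, and `stripComments` is a stand-in so the example runs without the package; pass `optimizeHtml` instead to measure the real function.

```typescript
// Build a synthetic ASCII HTML document of at least `targetBytes` characters
// (one byte per character in UTF-8, since the content is ASCII).
function makeSyntheticHtml(targetBytes: number): string {
  const row = '<div class="row"><p>Lorem ipsum dolor sit amet.</p><!-- note --></div>\n';
  const repeats = Math.ceil(targetBytes / row.length);
  return `<html><body>${row.repeat(repeats)}</body></html>`;
}

// Time a single call of `fn` on `input`, returning elapsed milliseconds and output size.
function timeOptimizer(
  fn: (html: string) => string,
  input: string,
): { ms: number; outputLength: number } {
  const start = Date.now();
  const output = fn(input);
  return { ms: Date.now() - start, outputLength: output.length };
}

// Stand-in optimizer that only removes comments.
const stripComments = (html: string): string => html.replace(/<!--[\s\S]*?-->/g, '');

const sample = makeSyntheticHtml(1000000);
const result = timeOptimizer(stripComments, sample);
console.log(`${sample.length} chars in, ${result.outputLength} chars out, ${result.ms} ms`);
```

A single run is noisy; repeating the measurement several times and comparing the median gives a more stable number.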

Use Cases

RAG Systems

Perfect for preparing HTML content for vector databases and search systems:

// Optimize content before indexing
// (fetchWebPage and url are placeholders for your own fetching logic)
const webContent = await fetchWebPage(url);
const optimizedForRAG = optimizeHtml(webContent, {
  removeComments: true,
  minifyText: true
});
// Index optimizedForRAG in your vector database
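
Vector databases usually index chunks instead of whole pages, so a common next step is to split the optimized HTML at headings. The sketch below is one possible approach and not a feature of this package. `chunkByHeadings`, `stripTags`, and `Chunk` are hypothetical names, and the code assumes attributes have already been removed (the default), so headings appear as bare `<h1>` to `<h6>` tags.

```typescript
interface Chunk {
  heading: string; // text of the most recent heading, or '' before the first one
  text: string;    // plain text of the content that follows that heading
}

// Replace tags with spaces and normalize whitespace. Entities are left as-is.
function stripTags(fragment: string): string {
  return fragment.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Split attribute-free HTML into chunks, starting a new chunk at every heading.
function chunkByHeadings(optimizedHtml: string): Chunk[] {
  const chunks: Chunk[] = [];
  // The capture group keeps each heading element as its own array entry.
  const parts = optimizedHtml.split(/(<h[1-6]>[\s\S]*?<\/h[1-6]>)/i);
  let heading = '';
  for (const part of parts) {
    const text = stripTags(part);
    if (/^<h[1-6]>/i.test(part)) {
      heading = text;
      continue;
    }
    if (text.length > 0) {
      chunks.push({ heading, text });
    }
  }
  return chunks;
}

console.log(
  chunkByHeadings('<div><h1>Intro</h1><p>Hello world.</p><h2>Setup</h2><p>Run it.</p></div>'),
);
```

Keeping the heading alongside each chunk gives the retriever some context about where the text came from.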

Documentation Processing

Clean up documentation before feeding to LLMs:

import { optimizeHtml } from '@harutakax/html-rag-optimizer';
import { promises as fs } from 'fs';

const docs = await fs.readFile('documentation.html', 'utf-8');
const cleanDocs = optimizeHtml(docs, {
  excludeTags: ['code', 'pre'], // Keep code examples
});

Web Scraping Cleanup

Clean scraped content for analysis:

// (scrapeWebsite and url are placeholders for your own scraping logic)
const scrapedHTML = await scrapeWebsite(url);
const cleanContent = optimizeHtml(scrapedHTML, {
  removeComments: true,
  minifyText: true
});

Requirements

  • Node.js 18 or higher
  • TypeScript 5.0+ (for development)

Development

# Clone the repository
git clone https://github.com/your-org/html-rag-optimizer.git
cd html-rag-optimizer

# Install dependencies
pnpm install

# Run tests
pnpm test

# Build
pnpm build

# Run examples
pnpm dlx tsx examples/basic-usage.ts

Keywords

html

Package last updated on 05 Sep 2025
