UniParser: Universal File Parsing for Node.js
UniParser is a lightweight Node.js library that parses multiple file formats (PDF, DOCX, TXT, HTML, and Markdown) and converts them into plain text.
Say goodbye to file format limitations! UniParser extracts text content from all these formats, providing a consistent text output for your applications.
Features
- PDF Parsing: Extracts plain text from PDF documents.
- DOCX Parsing: Reads and extracts text from Microsoft Word .docx files.
- TXT Parsing: Handles plain text files with no special formatting.
- HTML Parsing: Extracts text from the body of HTML documents.
- Markdown Parsing: Converts Markdown files to plain text, stripping out all formatting syntax.
- Auto-detection: Automatically detects the file format and parses it using the autoParse function.
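This README does not describe how autoParse decides which parser to use. One plausible approach is to dispatch on the file extension. The sketch below illustrates that idea only; FORMAT_BY_EXTENSION and detectFormat are hypothetical names and are not part of UniParser's API.

```javascript
const path = require('node:path');

// Hypothetical mapping from file extension to a format label.
// This is an assumption for illustration, not UniParser's internal table.
const FORMAT_BY_EXTENSION = {
  '.pdf': 'pdf',
  '.docx': 'docx',
  '.txt': 'txt',
  '.html': 'html',
  '.md': 'markdown',
};

// Returns the format label for a path, or throws for unknown extensions.
function detectFormat(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  const format = FORMAT_BY_EXTENSION[ext];
  if (!format) {
    throw new Error(`Unsupported file extension: "${ext}"`);
  }
  return format;
}

console.log(detectFormat('./reports/Q3-summary.PDF')); // prints "pdf"
console.log(detectFormat('./notes/README.md')); // prints "markdown"
```

Lowercasing the extension keeps files such as REPORT.PDF from being rejected, and throwing on unknown extensions surfaces unsupported input early instead of returning empty text.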
Installation
To install UniParser, simply run:
npm install uniparser
Usage
CommonJS (CJS) Example
If you're working in a Node.js environment with CommonJS (CJS), use require() to import UniParser:
const { autoParse, parsePDF, parseDOCX, parseTXT, parseHTML, parseMarkdown } = require('uniparser');
// CommonJS has no top-level await, so every awaited call goes inside an async function.
(async () => {
  // Auto-detect the format and parse the file
  const parsedText = await autoParse('./path/to/sample-file.pdf');
  console.log(parsedText);

  // Or call a format-specific parser directly
  const pdfText = await parsePDF('./path/to/sample-file.pdf');
  const docxText = await parseDOCX('./path/to/sample-file.docx');
  const txtText = parseTXT('./path/to/sample-file.txt');
  const htmlText = parseHTML('./path/to/sample-file.html');
  const markdownText = parseMarkdown('./path/to/sample-file.md');
})();
ES Modules (ESM) Example
If you're working in an ES Module environment (modern JavaScript), use import to load the functions:
import { autoParse, parsePDF, parseDOCX, parseTXT, parseHTML, parseMarkdown } from 'uniparser';
(async () => {
  // Auto-detect the format and parse the file
  const parsedText = await autoParse('./path/to/sample-file.pdf');
  console.log(parsedText);
})();

// Top-level await is valid in ES modules, so these calls need no wrapper.
const pdfText = await parsePDF('./path/to/sample-file.pdf');
const docxText = await parseDOCX('./path/to/sample-file.docx');
const txtText = parseTXT('./path/to/sample-file.txt');
const htmlText = parseHTML('./path/to/sample-file.html');
const markdownText = parseMarkdown('./path/to/sample-file.md');
Synchronous Usage (for small files)
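Some parsers, such as parseTXT and parseMarkdown, return their result directly, so you can call them without await. Synchronous calls block the event loop while the file is read, so reserve them for small, lightweight files.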
CommonJS (CJS):
const { parseTXT, parseMarkdown } = require('uniparser');
const txtContent = parseTXT('./path/to/sample-file.txt');
console.log(txtContent);
const markdownContent = parseMarkdown('./path/to/sample-file.md');
console.log(markdownContent);
ES Modules (ESM):
import { parseTXT, parseMarkdown } from 'uniparser';
const txtContent = parseTXT('./path/to/sample-file.txt');
console.log(txtContent);
const markdownContent = parseMarkdown('./path/to/sample-file.md');
console.log(markdownContent);
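The README does not define what counts as a "small" file. One way to make the rule explicit is to check the file size before choosing a synchronous parser. The sketch below assumes a cutoff of 1 MiB; MAX_SYNC_BYTES and isSmallEnoughForSync are hypothetical names, and the threshold is an arbitrary choice rather than a UniParser recommendation.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical cutoff: treat anything above 1 MiB as too large for sync parsing.
const MAX_SYNC_BYTES = 1024 * 1024;

// Returns true when the file is small enough to parse synchronously.
function isSmallEnoughForSync(filePath, maxBytes = MAX_SYNC_BYTES) {
  return fs.statSync(filePath).size <= maxBytes;
}

// Demo with a temporary 11-byte file so the sketch runs without UniParser installed.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'uniparser-demo-'));
const demoFile = path.join(demoDir, 'note.txt');
fs.writeFileSync(demoFile, 'hello world');

console.log(isSmallEnoughForSync(demoFile)); // prints true
console.log(isSmallEnoughForSync(demoFile, 5)); // prints false
```

In an application, a false result would be the cue to hand the file to an asynchronous code path instead of calling a synchronous parser.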
Supported File Formats
- PDF (.pdf): Converts PDF documents to plain text.
- DOCX (.docx): Extracts text from Microsoft Word .docx files.
- TXT (.txt): Reads plain text from simple text files.
- HTML (.html): Strips HTML tags and returns the text content.
- Markdown (.md): Converts Markdown files to plain text, removing all formatting.
- Auto-detection: Detects file types automatically via autoParse and processes them accordingly.
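To give a sense of what "strips HTML tags and returns the text content" means in practice, here is a deliberately naive, regex-based sketch. It is not UniParser's implementation: stripHtmlTags is a hypothetical helper, it does not decode entities such as &amp;amp;, and a real HTML parser handles malformed markup far more reliably.

```javascript
// Naive illustration of tag stripping; an assumption, not UniParser's parseHTML.
function stripHtmlTags(html) {
  return html
    // Drop script and style elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ')
    // Replace every remaining tag with a space so adjacent words stay separated.
    .replace(/<[^>]+>/g, ' ')
    // Collapse runs of whitespace and trim the ends.
    .replace(/\s+/g, ' ')
    .trim();
}

console.log(stripHtmlTags('<h1>Title</h1><p>Hello <b>world</b></p>'));
// prints "Title Hello world"
```

Replacing tags with a space rather than an empty string avoids gluing together text from neighbouring block elements, such as a heading followed by a paragraph.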
Example
Here's a quick example to get you started with DOCX parsing:
CommonJS (CJS):
const { parseDOCX } = require('uniparser');
(async () => {
  const docxText = await parseDOCX('./path/to/sample-file.docx');
  console.log(docxText);
})();
ES Modules (ESM):
import { parseDOCX } from 'uniparser';
(async () => {
  const docxText = await parseDOCX('./path/to/sample-file.docx');
  console.log(docxText);
})();
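The examples above do not handle failures such as a missing or corrupt file. How UniParser reports errors is not documented here, so the sketch below assumes an asynchronous parser rejects its promise on failure. safeParse is a hypothetical wrapper, and fakeParseDOCX is a stand-in for parseDOCX so the snippet runs without the library installed.

```javascript
// Wraps any promise-returning parser so a failure yields a result object
// instead of an unhandled rejection.
async function safeParse(parser, filePath) {
  try {
    const text = await parser(filePath);
    return { ok: true, text };
  } catch (error) {
    return { ok: false, error: error.message };
  }
}

// Stand-in for parseDOCX (assumption: the real parser rejects on bad input).
async function fakeParseDOCX(filePath) {
  if (!filePath.endsWith('.docx')) {
    throw new Error(`Not a .docx file: ${filePath}`);
  }
  return 'Extracted text';
}

safeParse(fakeParseDOCX, './report.docx').then((result) => console.log(result));
safeParse(fakeParseDOCX, './report.pdf').then((result) => console.log(result));
```

With the real library, you would pass parseDOCX (or any other asynchronous parser) as the first argument and branch on result.ok.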
License
This project is licensed under the MIT License. See the LICENSE file for more information.
Contributing
Contributions are welcome! If you'd like to improve UniParser, feel free to fork the repository and submit a pull request. We appreciate your feedback and contributions!
UniParser makes it easier than ever to extract content from a wide range of file formats. Try it now and streamline your file processing tasks!