office-text-extractor

Yet another library to extract text from MS Office and PDF files

3.0.0

Version published: 10 months ago

Maintainers: 1

Created: 3 years ago

`office-text-extractor`

Yet another library to extract text from MS Office (docx, pptx, xlsx) and PDF (pdf) files.

Similar projects

There are other great projects that do the same job and have inspired this project, such as:

How is this project different?

Parses each file based on its MIME type, not its file extension.
Does not spawn a child process to run a tool installed on the device.
Reads and returns the text as-is if the file contains plain text.
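
To illustrate why detecting by content is more robust than trusting the extension, here is a minimal sketch of signature-based detection. It is not this library's implementation (which relies on file-type); `sniffMimeType` is a hypothetical helper, and it only checks two well-known signatures.

```javascript
// Illustrative only: a simplified stand-in for what a library like `file-type` does.
// `sniffMimeType` is a hypothetical helper, not part of office-text-extractor.
function sniffMimeType(bytes) {
	const startsWith = (signature) =>
		signature.every((byte, index) => bytes[index] === byte)

	// PDF files begin with the ASCII characters "%PDF".
	if (startsWith([0x25, 0x50, 0x44, 0x46])) return 'application/pdf'

	// docx, pptx and xlsx files are ZIP archives, which begin with "PK\x03\x04".
	// Telling the three apart requires looking inside the archive, omitted here.
	if (startsWith([0x50, 0x4b, 0x03, 0x04])) return 'application/zip'

	// Anything else is treated as plain text in this sketch.
	return 'text/plain'
}

// A PDF saved under a misleading name such as "report.txt" is still detected as a PDF.
const misnamedPdf = new TextEncoder().encode('%PDF-1.7 ...')
console.log(sniffMimeType(misnamedPdf)) // prints "application/pdf"
```

A real detector covers many more formats and edge cases, which is why the library delegates this step to a dedicated package.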

Libraries used

This module uses some amazing existing libraries that perform better than the ones that originally existed in this module, and are therefore used instead:

pdf-parse, for parsing PDF files
xlsx, for parsing MS Excel files
mammoth, for parsing MS Word files

This module also uses:

xml2js - to convert the MS Office XML files into JSON
js-yaml - to convert JSON into YAML
file-type - to detect the MIME type of files
fflate - to unzip files

A big thank you to the contributors of these projects!

Installation

Node.js

Note

Since version 2.0.0, this package is pure ESM. Please read this article for a guide on how to ensure your project can import this library.
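
If your project is still CommonJS, one common workaround is a dynamic `import()`, which is available in CommonJS modules. The sketch below assumes the package is already installed; `loadExtractor` is a hypothetical helper name, not something the library exports.

```javascript
// CommonJS file (for example, index.cjs).
// `require('office-text-extractor')` may fail with ERR_REQUIRE_ESM on Node.js
// versions that cannot `require` ES modules; dynamic import() avoids that.
// `loadExtractor` is a hypothetical helper, not part of the library.
async function loadExtractor() {
	const { getTextExtractor } = await import('office-text-extractor')
	return getTextExtractor()
}
```

Call it as `loadExtractor().then((extractor) => ...)`, or `await` it inside another async function.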

To use this in a Node.js project, install it using npm, pnpm or yarn:

# Using npm
> npm install office-text-extractor

# Using pnpm
> pnpm add office-text-extractor

# Using yarn
> yarn add office-text-extractor

Browser

To use this package in the browser, fetch it using your preferred CDN:

<script src="https://unpkg.com/office-text-extractor@latest/build/index.js"></script>

Usage

import { getTextExtractor } from 'office-text-extractor'

// Create a new instance of the extractor.
const extractor = getTextExtractor()

// Extract text from a URL, file or buffer.
const location =
	'https://raw.githubusercontent.com/gamemaker1/office-text-extractor/rewrite/test/fixtures/docs/pptx.pptx'
const text = await extractor.extractText({
	input: location, // this can be a URL, a file path or a buffer
	type: 'url', // this can be 'url', 'file' or 'buffer'
})

console.log(text)
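
Because `type` must match the kind of `input`, a small wrapper can pick it for you. This is a sketch under assumptions: `toExtractorOptions` is a hypothetical helper (not part of the library), it treats any string starting with `http://` or `https://` as a URL and any other string as a file path, and it assumes the 'buffer' type corresponds to a Node.js `Buffer`.

```javascript
// `toExtractorOptions` is a hypothetical helper, not part of office-text-extractor.
// It builds the options object in the shape used in the usage example above.
function toExtractorOptions(input) {
	// Assumption: the 'buffer' type corresponds to a Node.js Buffer.
	if (Buffer.isBuffer(input)) return { input, type: 'buffer' }

	// Assumption: strings that start with http:// or https:// are URLs.
	if (typeof input === 'string' && /^https?:\/\//i.test(input)) {
		return { input, type: 'url' }
	}

	// Any other string is treated as a path on the local file system.
	if (typeof input === 'string') return { input, type: 'file' }

	throw new TypeError('Expected a URL, a file path or a buffer')
}

console.log(toExtractorOptions('./docs/report.docx').type) // prints "file"
console.log(toExtractorOptions('https://example.com/slides.pptx').type) // prints "url"
console.log(toExtractorOptions(Buffer.from('hello')).type) // prints "buffer"
```

The returned object can then be passed along, as in `await extractor.extractText(toExtractorOptions(location))`.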

License

This project is licensed under the ISC license. Please see license.md for more details.

Last updated on 10 Jul 2023
