Yet another library to extract text from MS Office (docx, pptx, xlsx) and PDF (pdf) files.
Similar projects
There are other great projects that do the same job and have inspired this
project, such as:
How is this project different?
- Parses a file based on its MIME type, not its file extension.
- Does not spawn a child process to use a tool installed on the device.
- Reads and returns text from the file if it contains plain text.
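To illustrate the first point, here is a minimal sketch of content-based type detection. It is not this library's implementation (the library relies on file-type for detection); `sniffType` is a hypothetical helper that only checks two well-known file signatures, and it assumes the first few bytes of the file are available as a `Uint8Array`.

```javascript
// Hypothetical helper, not part of office-text-extractor: guess a coarse
// file type from the leading bytes instead of trusting the file name.
function sniffType(bytes) {
  // PDF files begin with the ASCII signature "%PDF".
  const isPdf =
    bytes[0] === 0x25 && bytes[1] === 0x50 && bytes[2] === 0x44 && bytes[3] === 0x46
  if (isPdf) return 'application/pdf'

  // docx, pptx and xlsx are ZIP containers, which begin with "PK\x03\x04".
  const isZip =
    bytes[0] === 0x50 && bytes[1] === 0x4b && bytes[2] === 0x03 && bytes[3] === 0x04
  if (isZip) return 'application/zip'

  // Anything else is left undecided in this sketch.
  return undefined
}

// A PDF header is recognised regardless of what the file is called.
const header = new TextEncoder().encode('%PDF-1.7')
console.log(sniffType(header)) // prints application/pdf
```

A real detector has to go further, for example by looking inside the ZIP container to tell docx, pptx and xlsx apart, which is why a dedicated library is the better choice in practice.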
Libraries used
This module uses some amazing existing libraries that perform better than the
ones that originally existed in this module, and are therefore used instead:
This module also uses:
- xml2js - to convert the MS Office XML files into JSON
- js-yaml - to convert JSON into YAML
- file-type - to detect the MIME type of files
- fflate - to unzip files
A big thank you to the contributors of these projects!
Installation
Node.js
Note
This package is now pure ESM (from version 2.0.0 onwards). Please read
this article
for a guide on how to ensure your project can import this library.
To use this in a Node project, install it using npm/pnpm/yarn:
> npm install office-text-extractor
> pnpm add office-text-extractor
> yarn add office-text-extractor
Browser
To use this package in the browser, fetch it using your preferred CDN:
<script src="https://unpkg.com/office-text-extractor@latest/build/index.js"></script>
Usage
import { getTextExtractor } from 'office-text-extractor'
const extractor = getTextExtractor()
const location =
  'https://raw.githubusercontent.com/gamemaker1/office-text-extractor/rewrite/test/fixtures/docs/pptx.pptx'
const text = await extractor.extractText({
  input: location,
  type: 'url',
})
console.log(text)
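When several documents need to be processed, it can help to collect a result per input rather than stopping at the first failure. The sketch below is one possible approach, not part of the library: `extractAll` is a hypothetical helper, and it assumes only that the extractor exposes `extractText({ input, type })` as in the usage example above.

```javascript
// Hypothetical helper: run an extractor over several inputs and record the
// outcome for each one, so a single bad input does not abort the batch.
// `extractor` is any object exposing extractText({ input, type }).
async function extractAll(extractor, inputs, type) {
  const results = []
  for (const input of inputs) {
    try {
      // Inputs are processed one at a time to keep memory use predictable.
      const text = await extractor.extractText({ input, type })
      results.push({ input, ok: true, text })
    } catch (error) {
      // Keep the failure alongside its input instead of rethrowing.
      results.push({ input, ok: false, error: String(error) })
    }
  }
  return results
}
```

With the extractor from the usage example, this could be called as `await extractAll(extractor, [location], 'url')`, and each entry's `ok` flag then says whether `text` or `error` is set.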
License
This project is licensed under the ISC license. Please see license.md for more details.