yet another library to extract text from docx, pptx, xlsx, and pdf files.
similar libraries
there are other great libraries that do the same job and have inspired this
project, such as:
however, office-text-extractor has the following differences:
- parses the file based on its mime type, not its file extension.
- does not spawn a child process to use a tool installed on the device.
- reads and returns text from the file if it contains plain text.
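to illustrate what detecting a type from content rather than from the extension can look like, here is a minimal sketch that checks the leading "magic bytes" of a file. it is not taken from this library's source: the helper name sniffMime is hypothetical, and the sketch only tells pdf apart from zip containers. docx, pptx, and xlsx files are all zip archives, so telling them apart needs a look inside the archive, which is omitted here.

```typescript
// magic-byte signatures for the container formats involved
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d] // "%PDF-"
const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04] // "PK\x03\x04", the zip local file header

// true when `bytes` begins with every byte listed in `magic`
function startsWith(bytes: Uint8Array, magic: number[]): boolean {
	if (bytes.length < magic.length) return false
	return magic.every((value, index) => bytes[index] === value)
}

// hypothetical helper: returns a coarse mime type guess, or undefined
function sniffMime(bytes: Uint8Array): string | undefined {
	if (startsWith(bytes, PDF_MAGIC)) return 'application/pdf'
	if (startsWith(bytes, ZIP_MAGIC)) return 'application/zip'
	return undefined
}
```

a file named report.txt that actually holds a pdf would still be reported as application/pdf by this check, which is the point of ignoring the extension.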
libraries used
this package uses some amazing existing libraries that perform better than the
ones that originally existed in this module, and are therefore used instead:
a big thank you to the contributors of these projects!
installation
node
from version 2.0.0 onwards, this package is pure esm. please read this article for a guide on how to ensure your project can import this library.
to use office-text-extractor in a Node project, install it using npm/pnpm/yarn:
> npm install office-text-extractor
> pnpm add office-text-extractor
> yarn add office-text-extractor
browser
the library currently cannot be used in the browser because it depends on the node:buffer
module. pull requests that replace node:buffer with a browser-compatible alternative are welcome!
usage
an example of using the library to extract text is as follows:
import { readFile } from 'node:fs/promises'
import { getTextExtractor } from 'office-text-extractor'

// create a new instance of the extractor
const extractor = getTextExtractor()

// extract text from a url
const url = 'https://raw.githubusercontent.com/gamemaker1/office-text-extractor/rewrite/test/fixtures/docs/pptx.pptx'
const textFromUrl = await extractor.extractText({ input: url, type: 'url' })

// extract text from a file on disk
const path = 'stuff/boring.pdf'
const textFromFile = await extractor.extractText({ input: path, type: 'file' })

// extract text from a buffer that is already in memory
const buffer = await readFile(path)
const textFromBuffer = await extractor.extractText({ input: buffer, type: 'buffer' })

console.log(textFromUrl, textFromFile, textFromBuffer)
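the three calls above differ only in the type field. if your inputs arrive as a mix of urls, paths, and buffers, a small helper can pick the field for you. the following is a sketch, not part of the library: toExtractionInput and the ExtractionInput type are hypothetical names, and it assumes that any string starting with http:// or https:// is a url and that every other string is a file path.

```typescript
import { Buffer } from 'node:buffer'

// local type mirroring the three argument shapes used above (an assumption,
// not a type exported by the library)
type ExtractionInput =
	| { input: string; type: 'url' | 'file' }
	| { input: Buffer; type: 'buffer' }

// hypothetical helper: choose the `type` field based on what was passed in
function toExtractionInput(source: string | Buffer): ExtractionInput {
	if (Buffer.isBuffer(source)) return { input: source, type: 'buffer' }
	// only http(s) strings are treated as urls; anything else is assumed to be a path
	if (/^https?:\/\//i.test(source)) return { input: source, type: 'url' }
	return { input: source, type: 'file' }
}
```

the result can then be handed to extractText, for example extractor.extractText(toExtractionInput('stuff/boring.pdf')), provided the library's parameter type accepts this shape.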
the following is an example of how to create and use your own text extraction method:
import { type Buffer } from 'node:buffer'
import { TextExtractor, type TextExtractionMethod } from 'office-text-extractor'

class ImageExtractor implements TextExtractionMethod {
	// the mime types of the files this method can handle
	mimes = ['image/png', 'image/jpeg']

	// processImage is a placeholder for your own image-to-text routine
	apply = async (input: Buffer): Promise<string> => {
		const text = await processImage(input)
		return text
	}
}

const extractor = new TextExtractor()
extractor.addMethod(new ImageExtractor())

const text = await extractor.extractText({ input: '...', type: '...' })
console.log(text)
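since processImage above is left undefined, here is a second, self-contained method that can be run as-is. it is only a sketch: JsonExtractor and collectStrings are hypothetical names, and the TextExtractionMethod interface is redeclared locally, with the shape assumed from the example above, so that the block does not need the package installed. in a real project you would import the type from office-text-extractor instead. whether the library ever routes a json file to this method depends on how it detects mime types, which is not covered here.

```typescript
import { Buffer } from 'node:buffer'

// local stand-in for the package's TextExtractionMethod type (assumed shape)
interface TextExtractionMethod {
	mimes: string[]
	apply: (input: Buffer) => Promise<string>
}

// recursively collect every string value found in a parsed json document
function collectStrings(value: unknown, found: string[] = []): string[] {
	if (typeof value === 'string') {
		found.push(value)
	} else if (Array.isArray(value)) {
		for (const item of value) collectStrings(item, found)
	} else if (value !== null && typeof value === 'object') {
		for (const item of Object.values(value)) collectStrings(item, found)
	}
	return found
}

class JsonExtractor implements TextExtractionMethod {
	mimes = ['application/json']

	// decode the buffer as utf-8, parse it, and join all string values with newlines
	apply = async (input: Buffer): Promise<string> => {
		const parsed: unknown = JSON.parse(input.toString('utf8'))
		return collectStrings(parsed).join('\n')
	}
}
```

an instance would be registered the same way as in the example above, with extractor.addMethod(new JsonExtractor()).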
license
this project is licensed under the ISC license. please see license.md for more details.