
@arbs.io/extract-text-content
Package to parse and retrieve text from documents
This npm package, @arbs.io/extract-text-content, offers a straightforward method to extract text content from various binary and text file formats. The package comes with a pre-built configuration that works out-of-the-box, requiring no additional setup. It is designed for use in Node.js environments, including Visual Studio Code extensions.
The current version of the package supports extraction from the following MIME types:
application/pdf
application/vnd.openxmlformats-officedocument.wordprocessingml.document
text/markdown
text/csv
text/html
text/plain
If you would like to request support for additional file formats, please submit an enhancement issue on the project's repository. We appreciate your feedback and contributions to improve this package for developers.
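Because only the MIME types listed above are supported, it can help to check a type before attempting extraction. The following is a minimal sketch, not part of the package: the constant `SUPPORTED_MIME_TYPES` and the helper `isSupportedMimeType` are hypothetical names introduced here for illustration.

```typescript
// MIME types listed as supported in this README.
const SUPPORTED_MIME_TYPES = [
  'application/pdf',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  'text/markdown',
  'text/csv',
  'text/html',
  'text/plain',
] as const

type SupportedMimeType = (typeof SUPPORTED_MIME_TYPES)[number]

// Hypothetical helper (not exported by the package): narrows an arbitrary
// string to one of the supported MIME types.
function isSupportedMimeType(value: string): value is SupportedMimeType {
  return SUPPORTED_MIME_TYPES.some((mimeType) => mimeType === value)
}
```

A caller could use this guard to skip unsupported files early instead of waiting for extraction to fail.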
Feel free to explore the documentation for more details on how to use this package effectively in your projects. Happy coding!
npm install @arbs.io/extract-text-content
If you use the package with Webpack, you need the latest Webpack version, and you must configure it correctly for ESM.
Extract text from a file in a binary format. If the file type is binary, the MIME type is verified using file-type.
import { extractTextFromFile } from '@arbs.io/extract-text-content'

const pdfPath = './data/microservices.pdf'
extractTextFromFile({
  filepath: pdfPath,
}).then((results) => {
  console.log(`pdf (${pdfPath})`)
  console.log(`\t- mime-type: ${results.mimeType}`)
  console.log(`\t- char-count: ${results.content.length}`)
  console.log(`\t- random-read: ${results.content.substring(2500, 2540)}`)
})
Extract text from a file in a text format, specifying the MIME type to be used.
const htmlType = 'text/html'
const htmlPath = './data/microservices.htm'
extractTextFromFile({
  filepath: htmlPath,
  filetype: htmlType,
}).then((results) => {
  console.log(`html (${htmlPath})`)
  console.log(`\t- mime-type: ${results.mimeType}`)
  console.log(`\t- char-count: ${results.content.length}`)
  console.log(`\t- random-read: ${results.content.substring(2500, 2540)}`)
})
The TextExtract object provides the following properties:
interface TextExtract {
  mimeType: string
  content: string
}
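The usage examples print a few lines of information from a TextExtract result. As a sketch of how those two properties can be consumed, the hypothetical helper below (`summarizeExtract` is not part of the package) builds the same kind of summary, taking the preview from the start of the content so that short documents still produce output.

```typescript
// Mirrors the TextExtract interface documented above.
interface TextExtract {
  mimeType: string
  content: string
}

// Hypothetical helper (not exported by the package): builds summary lines
// similar to those printed in the usage examples. The preview is taken from
// the start of the content and is never longer than the content itself.
function summarizeExtract(
  label: string,
  result: TextExtract,
  previewLength = 40,
): string[] {
  const preview = result.content.slice(0, previewLength)
  return [
    label,
    `\t- mime-type: ${result.mimeType}`,
    `\t- char-count: ${result.content.length}`,
    `\t- preview: ${preview}`,
  ]
}
```

The returned lines can be joined with newlines and logged, or reused in tests without capturing console output.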
This package also offers a convenient function, extractTextFromFile, which extracts text content from various file formats using the provided file path or URL. Below is a detailed explanation of the parameters accepted by this function:
extractTextFromFile({ filepath: string, filetype?: string }): Promise<TextExtract>
Parameters
filepath (Required): A string representing the path or URL to the file from which you want to extract text content. This parameter must be provided for the function to locate and process the input file.
filetype (Optional): A string that serves as a hint for the file format being loaded. For binary formats, this hint will be validated based on the binary format's magic number. If not provided, the function will attempt to determine the file type automatically.
function extractTextFromFile({
  filepath,
  filetype,
}: {
  filepath: string
  filetype?: string
}): Promise<TextExtract>
By using these parameters with the extractTextFromFile function, you can easily extract text content from supported file formats in your projects by providing a file path or URL.
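Since filetype is only a hint, one possible approach is to set it for known text extensions and omit it otherwise, leaving binary formats to automatic detection. The sketch below assumes an extension-to-MIME mapping that is not defined by the package; `TEXT_MIME_BY_EXTENSION`, `FileExtractOptions`, and `buildFileOptions` are hypothetical names used for illustration.

```typescript
// Extension-to-MIME mapping for the text formats listed in this README.
// The mapping itself is an assumption made for this example.
const TEXT_MIME_BY_EXTENSION: Record<string, string> = {
  '.md': 'text/markdown',
  '.csv': 'text/csv',
  '.html': 'text/html',
  '.htm': 'text/html',
  '.txt': 'text/plain',
}

// Same shape as the options object documented for extractTextFromFile.
interface FileExtractOptions {
  filepath: string
  filetype?: string
}

// Hypothetical helper: adds a filetype hint for known text extensions and
// leaves it unset otherwise, so other files rely on automatic detection.
function buildFileOptions(filepath: string): FileExtractOptions {
  const dot = filepath.lastIndexOf('.')
  const extension = dot >= 0 ? filepath.slice(dot).toLowerCase() : ''
  const filetype = TEXT_MIME_BY_EXTENSION[extension]
  return filetype ? { filepath, filetype } : { filepath }
}
```

The resulting object has the documented parameter shape, so it could be passed directly as the argument to extractTextFromFile.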
This package offers a primary function, extractTextFromBuffer, which is used to extract text content from various file formats. Below is a detailed explanation of the parameters accepted by this function:
extractTextFromBuffer({ bufferArray: Uint8Array, filetype?: string }): Promise<TextExtract>
Parameters
bufferArray (Required): A Uint8Array representation of the data blob. This parameter must be provided for the function to process and extract text content from the input file.
filetype (Optional): A string that serves as a hint for the file format being loaded. For binary formats, this hint will be validated based on the binary format's magic number. If not provided, the function will attempt to determine the file type automatically.
function extractTextFromBuffer({
  bufferArray,
  filetype,
}: {
  bufferArray: Uint8Array
  filetype?: string
}): Promise<TextExtract>
By using these parameters with the extractTextFromBuffer function, you can easily extract text content from supported file formats in your projects.
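To illustrate what validating a hint against a magic number means, here is a minimal sketch that compares the leading bytes of a Uint8Array with the well-known signatures for PDF ("%PDF") and ZIP-based formats such as DOCX ("PK" followed by 0x03 0x04). It is not the package's implementation, which relies on file-type; `MAGIC_NUMBERS` and `matchesMagicNumber` are hypothetical names.

```typescript
// Leading bytes ("magic numbers") for the two binary formats in this README.
// A DOCX file is a ZIP archive, so this signature only shows that the data
// is a ZIP container, not that it is specifically a Word document.
const MAGIC_NUMBERS: Record<string, number[]> = {
  'application/pdf': [0x25, 0x50, 0x44, 0x46],
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document': [
    0x50, 0x4b, 0x03, 0x04,
  ],
}

// Hypothetical helper: returns true when the buffer starts with the
// signature expected for the hinted type, and false for unknown types
// or buffers shorter than the signature.
function matchesMagicNumber(bufferArray: Uint8Array, filetype: string): boolean {
  const signature = MAGIC_NUMBERS[filetype]
  if (!signature || bufferArray.length < signature.length) return false
  return signature.every((byte, index) => bufferArray[index] === byte)
}
```

A check like this explains why a wrong hint for a binary file can be rejected, while text formats, which have no signature, depend on the hint or on other detection.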
The library uses the following packages (many thanks to the authors)