
DocumentChunker
Document Chunker SDK for splitting large text content from documents such as DOCX, PDF, and HTML into smaller chunks.
This library provides utility classes to break down large files, such as PDF, DOCX, and HTML, into smaller text chunks for creating corpora for RAG prototyping.
The primary goal of this library is to assist in creating a corpus for prototyping or testing Retrieval-Augmented Generation (RAG) systems. It is not limited to that purpose, however: it can be used in any application that needs to split large text files into manageable pieces.
This library is provided under the Apache License. Refer to the repository's NOTICE file for information on the open-source projects leveraged by this library, which are distributed under various permissive open source licenses.
You can include this library in your .NET project using your preferred method (e.g., NuGet, project reference, etc.).
Here's a basic example of how to use the library:
// Example usage of the Document Chunker Library
var config = new ChunkerConfig(maxWordsPerChunk: 11, chunkType: ChunkType.Sentence);
var chunker = new PdfDocumentChunker(config);
var filePath = "example.pdf";

// Stream the chunks as they are extracted and print each one
await foreach (var chunk in chunker.ExtractChunksAsync(filePath))
{
    Console.WriteLine(chunk);
}
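To give a feel for what a sentence-based chunk type with a word cap might do, here is a minimal, self-contained sketch. It is not the library's implementation: the `SentenceChunkSketch` class and its `Chunk` method are hypothetical names introduced for illustration, and the sentence splitter is a deliberately naive regular expression. It packs whole sentences into chunks until adding the next one would exceed the cap.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the DocumentChunker API.
public static class SentenceChunkSketch
{
    // Splits text into sentences, then packs whole sentences into chunks of
    // at most maxWordsPerChunk words. A single sentence that is longer than
    // the cap is emitted as its own chunk rather than being cut.
    public static IEnumerable<string> Chunk(string text, int maxWordsPerChunk)
    {
        // Naive sentence boundary: whitespace that follows '.', '!' or '?'.
        var sentences = Regex.Split(text.Trim(), @"(?<=[.!?])\s+");

        var current = new List<string>();
        var wordCount = 0;

        foreach (var sentence in sentences)
        {
            if (sentence.Length == 0) continue;

            // Count words as runs of non-whitespace characters.
            var words = Regex.Matches(sentence, @"\S+").Count;

            // Close the current chunk if this sentence would push it over the cap.
            if (wordCount > 0 && wordCount + words > maxWordsPerChunk)
            {
                yield return string.Join(" ", current);
                current.Clear();
                wordCount = 0;
            }

            current.Add(sentence);
            wordCount += words;
        }

        // Emit whatever is left over.
        if (wordCount > 0)
        {
            yield return string.Join(" ", current);
        }
    }
}
```

Calling `SentenceChunkSketch.Chunk("One two three. Four five six seven. Eight nine.", 7)` yields two chunks: the first two sentences together (seven words), then "Eight nine." on its own. A real chunker would need a more robust sentence splitter, for example one that handles abbreviations and decimal numbers.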
Supported target frameworks: .NET Framework 4.6.2 and above; .NET 6.0 or above for modern .NET platforms; and .NET Standard 2.0 for broader compatibility.
Contributions are welcome! Please feel free to submit issues or pull requests to improve the library.
For more details about its features and implementation, check out the NOTICE file and LICENSE file included in the project.
FAQs
We found that documentchunker has an unhealthy version release cadence and level of project activity, because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.