
DocumentAtom provides a light, fast library for breaking input PDF documents into constituent parts (atoms), useful for AI, machine learning, processing, analytics, and general analysis.
DocumentAtom provides a light, fast library for breaking input documents into constituent parts (atoms), useful for text processing, analysis, and artificial intelligence.
DocumentAtom requires that Tesseract v5.0 be installed on the host. This is required as certain document types can have embedded images which are parsed using OCR via Tesseract.
Parsing documents and extracting constituent parts is one part science and one part black magic. If you find ways to improve processing and extraction that are horizontally useful, I'd love your feedback on how to make this library more accurate, more useful, faster, and overall better. My goal in building this library is to make it easier to analyze input data assets and make them more consumable by other systems, including analytics and artificial intelligence.
Please feel free to file issues, enhancement requests, or start discussions about use of the library, improvements, or fixes.
DocumentAtom supports the following input file types:
Refer to the various Test projects for working examples.
The following example shows processing a markdown (.md) file.
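using System;
using DocumentAtom.Core.Atoms;
using DocumentAtom.Markdown;
// Placeholder path; point this at the markdown file to process
string filename = "document.md";
MarkdownProcessorSettings settings = new MarkdownProcessorSettings();
// Pass the settings instance to the processor
MarkdownProcessor processor = new MarkdownProcessor(settings);
// Print each Atom extracted from the file
foreach (Atom atom in processor.Extract(filename))
    Console.WriteLine(atom.ToString());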
DocumentAtom parses input data assets into a variety of Atom objects. Each Atom carries top-level metadata, including:
- GUID
- Type - including Text, Image, Binary, Table, and List
- PageNumber - where available; some document types do not explicitly indicate page numbers, and page numbers are inferred when rendered
- Position - the ordinal position of the Atom, relative to others
- Length - the length of the Atom's content
- MD5Hash - the MD5 hash of the Atom content
- SHA1Hash - the SHA1 hash of the Atom content
- SHA256Hash - the SHA256 hash of the Atom content
- Quarks - sub-atomic particles created from the Atom content, for instance, when chunking text

The AtomBase class provides the aforementioned metadata, and several type-specific Atoms are returned from the various processors, including:
- BinaryAtom - includes a Bytes property
- DocxAtom - includes Text, HeaderLevel, UnorderedList, OrderedList, Table, and Binary properties
- ImageAtom - includes BoundingBox, Text, UnorderedList, OrderedList, Table, and Binary properties
- MarkdownAtom - includes Formatting, Text, UnorderedList, OrderedList, and Table properties
- PdfAtom - includes BoundingBox, Text, UnorderedList, OrderedList, Table, and Binary properties
- PptxAtom - includes Title, Subtitle, Text, UnorderedList, OrderedList, Table, and Binary properties
- TableAtom - includes Rows, Columns, Irregular, and Table properties
- TextAtom - includes Text
- XlsxAtom - includes SheetName, CellIdentifier, Text, Table, and Binary properties

Table objects inside of Atom objects are always presented as SerializableDataTable objects (see SerializableDataTable for more information) to provide simple serialization and conversion to native System.Data.DataTable objects.
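The Atom metadata includes MD5Hash, SHA1Hash, and SHA256Hash values for the Atom content. The README does not say how DocumentAtom computes or encodes them, so the following is only a minimal sketch of producing comparable digests with the standard .NET cryptography classes, for example to check extracted content against a stored hash. The AtomHashSketch class, the HashContent method, and the uppercase hex encoding are assumptions made for illustration and are not part of the library.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not part of DocumentAtom.
public static class AtomHashSketch
{
    // Hex-encode a digest as uppercase without separators (encoding is an assumption).
    private static string ToHex(byte[] digest)
    {
        return BitConverter.ToString(digest).Replace("-", "");
    }

    // Compute MD5, SHA1, and SHA256 digests over the same content bytes.
    public static (string Md5, string Sha1, string Sha256) HashContent(byte[] content)
    {
        using (MD5 md5 = MD5.Create())
        using (SHA1 sha1 = SHA1.Create())
        using (SHA256 sha256 = SHA256.Create())
        {
            return (
                ToHex(md5.ComputeHash(content)),
                ToHex(sha1.ComputeHash(content)),
                ToHex(sha256.ComputeHash(content)));
        }
    }

    public static void Main()
    {
        // Sample content standing in for an Atom's text.
        byte[] content = Encoding.UTF8.GetBytes("Hello, atoms!");
        var hashes = HashContent(content);

        Console.WriteLine("MD5:    " + hashes.Md5);
        Console.WriteLine("SHA1:   " + hashes.Sha1);
        Console.WriteLine("SHA256: " + hashes.Sha256);
    }
}
```

If the library stores its hashes in a different encoding (lowercase hex or Base64, for instance), compare case-insensitively or convert before matching.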
DocumentAtom is built on the shoulders of several libraries, without which, this work would not be possible.
Each of these libraries was integrated as a NuGet package, and no source was included or modified from these packages.
My libraries used within DocumentAtom:
Run the DocumentAtom.Server project to start a RESTful server listening on localhost:8000. Modify the documentatom.json file to change the webserver, logging, or Tesseract settings. Alternatively, you can pull jchristn/documentatom from Docker Hub.
Refer to the Postman collection for examples exercising the APIs.
Please refer to CHANGELOG.md for version history.
Special thanks to iconduck.com and the content authors for producing this icon.