
DocumentAtom provides a light, fast library for breaking input images into constituent text parts (atoms), useful for AI, machine learning, processing, analytics, and general analysis.
DocumentAtom provides a light, fast library for breaking input documents into constituent parts (atoms), useful for text processing, analysis, and artificial intelligence.
DocumentAtom requires that Tesseract v5.0 be installed on the host. This is required as certain document types can have embedded images which are parsed using OCR via Tesseract.
Package | Version | Downloads |
---|---|---|
DocumentAtom.Excel | ||
DocumentAtom.Image | ||
DocumentAtom.Markdown | ||
DocumentAtom.Pdf | ||
DocumentAtom.PowerPoint | ||
DocumentAtom.Ocr | ||
DocumentAtom.Text | ||
DocumentAtom.TypeDetection | ||
DocumentAtom.Word | ||
Parsing documents and extracting constituent parts is one part science and one part black magic. If you find ways to improve processing and extraction that are horizontally useful, I would love your feedback on how to make this library more accurate, more useful, faster, and overall better. My goal in building this library is to make it easier to analyze input data assets and make them more consumable by other systems, including analytics and artificial intelligence.
Please feel free to file issues, enhancement requests, or start discussions about use of the library, improvements, or fixes.
DocumentAtom supports the following input file types:
Refer to the various `Test` projects for working examples.
The following example shows processing a markdown (`.md`) file.
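using System;
using DocumentAtom.Core.Atoms;
using DocumentAtom.Markdown;

string filename = "sample.md"; // placeholder path to the markdown file to process
MarkdownProcessorSettings settings = new MarkdownProcessorSettings();
// Pass the settings instance created above to the processor
MarkdownProcessor processor = new MarkdownProcessor(settings);
// Print each atom extracted from the file
foreach (Atom atom in processor.Extract(filename))
    Console.WriteLine(atom.ToString());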
DocumentAtom parses input data assets into a variety of `Atom` objects. Each `Atom` includes top-level metadata including:
- `GUID`
- `Type` - including `Text`, `Image`, `Binary`, `Table`, and `List`
- `PageNumber` - where available; some document types do not explicitly indicate page numbers, and page numbers are inferred when rendered
- `Position` - the ordinal position of the `Atom`, relative to others
- `Length` - the length of the `Atom`'s content
- `MD5Hash` - the MD5 hash of the `Atom` content
- `SHA1Hash` - the SHA1 hash of the `Atom` content
- `SHA256Hash` - the SHA256 hash of the `Atom` content
- `Quarks` - sub-atomic particles created from the `Atom` content, for instance, when chunking text

The `AtomBase` class provides the aforementioned metadata, and several type-specific `Atom`s are returned from the various processors, including:
- `BinaryAtom` - includes a `Bytes` property
- `DocxAtom` - includes `Text`, `HeaderLevel`, `UnorderedList`, `OrderedList`, `Table`, and `Binary` properties
- `ImageAtom` - includes `BoundingBox`, `Text`, `UnorderedList`, `OrderedList`, `Table`, and `Binary` properties
- `MarkdownAtom` - includes `Formatting`, `Text`, `UnorderedList`, `OrderedList`, and `Table` properties
- `PdfAtom` - includes `BoundingBox`, `Text`, `UnorderedList`, `OrderedList`, `Table`, and `Binary` properties
- `PptxAtom` - includes `Title`, `Subtitle`, `Text`, `UnorderedList`, `OrderedList`, `Table`, and `Binary` properties
- `TableAtom` - includes `Rows`, `Columns`, `Irregular`, and `Table` properties
- `TextAtom` - includes `Text`
- `XlsxAtom` - includes `SheetName`, `CellIdentifier`, `Text`, `Table`, and `Binary` properties

`Table` objects inside of `Atom` objects are always presented as `SerializableDataTable` objects (see SerializableDataTable for more information) to provide simple serialization and conversion to native `System.Data.DataTable` objects.
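To make the hash and quark metadata described above more concrete, here is a minimal, self-contained sketch that uses only the .NET base class library. `SketchAtom`, `FromText`, and `maxQuarkLength` are hypothetical names invented for illustration; they are not part of DocumentAtom, and the library's own hashing input, digest formatting, and chunking strategy may well differ. The sketch assumes hashes are taken over the UTF-8 bytes of the text and that quarks are fixed-size character chunks.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical stand-in, NOT DocumentAtom's AtomBase. It mirrors a few of the
// metadata names listed above so the idea can be shown without the library.
public sealed class SketchAtom
{
    public Guid GUID { get; } = Guid.NewGuid();
    public int Position { get; }
    public string Text { get; }
    public int Length => Text.Length;
    public string MD5Hash { get; }
    public string SHA1Hash { get; }
    public string SHA256Hash { get; }
    public List<SketchAtom> Quarks { get; } = new List<SketchAtom>();

    public SketchAtom(string text, int position)
    {
        Text = text ?? throw new ArgumentNullException(nameof(text));
        Position = position;

        // Assumption: digests are computed over the UTF-8 encoding of the text.
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        MD5Hash = HexDigest(MD5.Create(), bytes);
        SHA1Hash = HexDigest(SHA1.Create(), bytes);
        SHA256Hash = HexDigest(SHA256.Create(), bytes);
    }

    // Builds an atom and splits its text into fixed-size chunks ("quarks").
    // Assumption: naive character-count chunking, which can split a surrogate pair.
    public static SketchAtom FromText(string text, int position, int maxQuarkLength)
    {
        if (maxQuarkLength < 1)
            throw new ArgumentOutOfRangeException(nameof(maxQuarkLength));

        SketchAtom atom = new SketchAtom(text, position);
        for (int offset = 0, index = 0; offset < text.Length; offset += maxQuarkLength, index++)
        {
            int size = Math.Min(maxQuarkLength, text.Length - offset);
            atom.Quarks.Add(new SketchAtom(text.Substring(offset, size), index));
        }
        return atom;
    }

    // Hashes the data, disposes the algorithm, and returns a lowercase hex string.
    private static string HexDigest(HashAlgorithm algorithm, byte[] data)
    {
        using (algorithm)
        {
            byte[] digest = algorithm.ComputeHash(data);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

For example, `SketchAtom.FromText("abc", 0, 2)` produces an atom of length 3 with two quarks, `ab` and `c`, each carrying its own position and digests. Storing digests as lowercase hex strings is a presentation choice made for this sketch only.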
DocumentAtom is built on the shoulders of several libraries, without which this work would not be possible.
Each of these libraries was integrated as a NuGet package, and no source from these packages was included or modified.
My libraries used within DocumentAtom:
Run the `DocumentAtom.Server` project to start a RESTful server listening on `localhost:8000`. Modify the `documentatom.json` file to change the webserver, logging, or Tesseract settings. Alternatively, you can pull `jchristn/documentatom` from Docker Hub.
Refer to the Postman collection for examples exercising the APIs.
Please refer to `CHANGELOG.md` for version history.
Special thanks to iconduck.com and the content authors for producing this icon.