DocumentAtom
DocumentAtom provides a light, fast library for breaking input documents into constituent parts (atoms), useful for text processing, analysis, and artificial intelligence.
DocumentAtom requires that Tesseract v5.0 be installed on the host. Tesseract is needed because certain document types can contain embedded images, which are parsed using OCR.
- DocumentAtom.Excel
- DocumentAtom.Image
- DocumentAtom.Markdown
- DocumentAtom.Pdf
- DocumentAtom.PowerPoint
- DocumentAtom.Ocr
- DocumentAtom.Text
- DocumentAtom.TypeDetection
- DocumentAtom.Word
New in v1.0.x
Motivation
Parsing documents and extracting constituent parts is one part science and one part black magic. If you find ways to improve processing and extraction that are horizontally useful, I'd love your feedback on how to make this library more accurate, more useful, faster, and overall better. My goal in building this library is to make it easier to analyze input data assets and make them more consumable by other systems, including analytics and artificial intelligence.
Bugs, Quality, Feedback, or Enhancement Requests
Please feel free to file issues, enhancement requests, or start discussions about use of the library, improvements, or fixes.
Types Supported
DocumentAtom supports the following input file types:
- Text
- Markdown
- Microsoft Word (.docx)
- Microsoft Excel (.xlsx)
- Microsoft PowerPoint (.pptx)
- PNG images (requires Tesseract on the host)
- PDF
Simple Example
Refer to the various Test projects for working examples.
The following example shows processing a markdown (.md) file.
using DocumentAtom.Core.Atoms;
using DocumentAtom.Markdown;

// Create a processor with default settings
MarkdownProcessorSettings settings = new MarkdownProcessorSettings();
MarkdownProcessor processor = new MarkdownProcessor(settings);

// filename is the path to the .md file; print each extracted atom
foreach (Atom atom in processor.Extract(filename))
    Console.WriteLine(atom.ToString());
Atom Types
DocumentAtom parses input data assets into a variety of Atom objects. Each Atom includes top-level metadata including:
- GUID
- Type - including Text, Image, Binary, Table, and List
- PageNumber - where available; some document types do not explicitly indicate page numbers, and page numbers are inferred when rendered
- Position - the ordinal position of the Atom, relative to others
- Length - the length of the Atom's content
- MD5Hash - the MD5 hash of the Atom content
- SHA1Hash - the SHA1 hash of the Atom content
- SHA256Hash - the SHA256 hash of the Atom content
- Quarks - sub-atomic particles created from the Atom content, for instance, when chunking text
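To make the Length, hash, and Quarks fields more concrete, here is a minimal sketch of how such values could be derived from a piece of text using only the .NET standard library. This is not DocumentAtom's implementation: the helper names ToHex and Chunk, the fixed-size chunking strategy, the hex string representation of the hashes, and hashing the UTF-8 bytes are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: lowercase hex encoding of a hash digest.
static string ToHex(byte[] digest) => Convert.ToHexString(digest).ToLowerInvariant();

// Hypothetical helper: split text into fixed-size chunks (the last may be shorter).
// The library's own chunking strategy may differ.
static List<string> Chunk(string text, int size)
{
    if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
    List<string> chunks = new List<string>();
    for (int i = 0; i < text.Length; i += size)
        chunks.Add(text.Substring(i, Math.Min(size, text.Length - i)));
    return chunks;
}

// Stand-in for an Atom's text content.
string text = "hello";
byte[] content = Encoding.UTF8.GetBytes(text);

// Values analogous to the Length, MD5Hash, SHA1Hash, and SHA256Hash metadata.
int length = content.Length;
string md5 = ToHex(MD5.HashData(content));
string sha1 = ToHex(SHA1.HashData(content));
string sha256 = ToHex(SHA256.HashData(content));

// Values analogous to Quarks: smaller pieces derived from the same content.
List<string> quarks = Chunk(text, 2);

Console.WriteLine($"Length : {length}");
Console.WriteLine($"MD5    : {md5}");
Console.WriteLine($"SHA1   : {sha1}");
Console.WriteLine($"SHA256 : {sha256}");
Console.WriteLine($"Quarks : {string.Join(" | ", quarks)}");
```

Because the hashes are computed over the content alone, two atoms with identical content produce identical hashes under this scheme, which is handy for deduplication.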
The AtomBase class provides the aforementioned metadata, and several type-specific Atoms are returned from the various processors, including:
- BinaryAtom - includes a Bytes property
- DocxAtom - includes Text, HeaderLevel, UnorderedList, OrderedList, Table, and Binary properties
- ImageAtom - includes BoundingBox, Text, UnorderedList, OrderedList, Table, and Binary properties
- MarkdownAtom - includes Formatting, Text, UnorderedList, OrderedList, and Table properties
- PdfAtom - includes BoundingBox, Text, UnorderedList, OrderedList, Table, and Binary properties
- PptxAtom - includes Title, Subtitle, Text, UnorderedList, OrderedList, Table, and Binary properties
- TableAtom - includes Rows, Columns, Irregular, and Table properties
- TextAtom - includes Text
- XlsxAtom - includes SheetName, CellIdentifier, Text, Table, and Binary properties
Table objects inside of Atom objects are always presented as SerializableDataTable objects (see SerializableDataTable for more information) to provide simple serialization and conversion to native System.Data.DataTable objects.
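Once a table has been converted to a native System.Data.DataTable, it can be consumed like any other DataTable. The sketch below builds a small stand-in table by hand rather than calling the library, so the conversion step itself is not shown; the table name, column names, and values are made up for illustration.

```csharp
using System;
using System.Data;

// Stand-in for a table obtained from an atom after conversion to System.Data.DataTable.
DataTable table = new DataTable("Example");
table.Columns.Add("Name", typeof(string));
table.Columns.Add("Quantity", typeof(int));
table.Rows.Add("Widget", 3);
table.Rows.Add("Gadget", 5);

// Walk the table row by row and accumulate a column total.
int total = 0;
foreach (DataRow row in table.Rows)
{
    Console.WriteLine($"{row["Name"]}: {row["Quantity"]}");
    total += (int)row["Quantity"];
}

Console.WriteLine($"Rows: {table.Rows.Count}, Columns: {table.Columns.Count}, Total: {total}");
```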
Underlying Libraries
DocumentAtom is built on the shoulders of several libraries, without which this work would not be possible.
Each of these libraries was integrated as a NuGet package, and no source was included or modified from these packages.
My libraries used within DocumentAtom:
RESTful API and Docker
Run the DocumentAtom.Server project to start a RESTful server listening on localhost:8000. Modify the documentatom.json file to change the webserver, logging, or Tesseract settings. Alternatively, you can pull jchristn/documentatom from Docker Hub.
Refer to the Postman collection for examples exercising the APIs.
Version History
Please refer to CHANGELOG.md for version history.
Thanks
Special thanks to iconduck.com and the content authors for producing this icon.