groupdocs-parser-net

GroupDocs.Parser for Python via .NET is a powerful API designed for advanced document parsing, offering extensive features like text extraction, metadata retrieval, and image extraction across various document formats, including PDFs, Word, Excel, and PowerPoint.

Package index: PyPI (pip)
Version: 25.12
Maintainers: 1

Advanced Document Parsing API for Python via .NET

Product Page | Docs | Demos | API Reference | Blog | Search | Free Support | Temporary License

GroupDocs.Parser for Python via .NET is a powerful on-premise document parsing library that lets you extract text, metadata, images, attachments, barcodes and structured content from dozens of popular formats – including PDF, Word, Excel, PowerPoint, emails, archives, images and more.

You can embed GroupDocs.Parser into your own Python applications without installing any 3rd-party office suites. GroupDocs also provides free online apps built on top of the same APIs that allow users to parse PDF, Office and other documents right in the browser.

Document Parser API Features

GroupDocs.Parser for Python via .NET provides a single, unified API for advanced document parsing and data extraction:

  • Text extraction

    • Extract text from PDF, Word, Excel, PowerPoint, e-books, emails and many other formats.
    • Work in accurate or raw text modes depending on your scenario.
    • Keep track of pages and logical blocks of text.
  • Preserve structure & formatting

    • Retrieve formatted text with font styles, sizes and basic layout information.
    • Analyze document structure – paragraphs, lists, headings, table cells, etc.
  • Text search

    • Search for specific words or phrases in documents.
    • Use advanced search options such as case sensitivity, whole-word matching or regular expressions.
  • OCR text extraction

    • Extract text from scanned PDFs and raster images using OCR options.
    • Combine OCR with spell-checking in supported environments for better recognition quality.
  • Metadata extraction

    • Read common metadata properties like author, title, subject and keywords.
    • Extract creation / modification dates and other technical properties.
    • Retrieve custom fields such as invoice numbers or business IDs.
  • Image & attachment extraction

    • Extract embedded images from Office documents, PDFs, e-books and more.
    • Pull file attachments from PDFs and email messages.
    • Extract barcodes from supported document and image formats.
  • Document structure analysis

    • Parse tables, including rows, columns and individual cells.
    • Detect text areas and content blocks for fine-grained extraction.
    • Extract hyperlinks, bookmarks and table of contents (TOC) where supported.
  • PDF-specific parsing

    • Extract text, images, metadata and attachments from PDFs.
    • Get PDF page count and PDF-specific document information.
    • Work with bookmarks, forms and PDF portfolios.
  • Email parsing

    • Extract sender, recipients, subject and body from emails.
    • Get email metadata and embedded attachments.
    • Work with formats like MSG, EML, EMLX, PST and OST.
  • Spreadsheet parsing

    • Extract text and data from Excel and other spreadsheet formats.
    • Work with specific sheets, ranges or individual cells.
    • Extract spreadsheet metadata and images.
  • Presentation parsing

    • Extract text, notes, images and metadata from PowerPoint files.
    • Work with slide-by-slide content, including shapes and notes.
  • Template-based data extraction

    • Define parsing templates to extract structured fields (e.g. invoices, receipts).
    • Use templates to describe positions of fields, tables and patterns.
    • Apply your own parsing rules for domain-specific scenarios.
  • Advanced & batch features

    • High-performance processing for large documents and document batches.
    • Cross-platform support (Windows, Linux, macOS) via .NET runtime.
    • Build scalable, secure parsing workflows in your Python applications.
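
The search options listed above (case sensitivity, whole-word matching, regular expressions) can be illustrated on text that has already been extracted. The following is a minimal sketch in plain Python using only the standard `re` module. It does not call the GroupDocs.Parser search API, and the helper name `find_matches` and its parameters are hypothetical.

```python
import re


def find_matches(text, query, match_case=False, whole_word=False, use_regex=False):
    """Return (start_offset, matched_text) pairs for `query` in `text`.

    Hypothetical helper for illustration; not part of GroupDocs.Parser.
    """
    # Treat the query as a literal string unless regex mode is requested.
    pattern = query if use_regex else re.escape(query)
    if whole_word:
        # \b anchors restrict matches to whole words.
        pattern = rf"\b(?:{pattern})\b"
    flags = 0 if match_case else re.IGNORECASE
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, text, flags)]


sample = "Invoice INV-1001 was paid. The invoice total is 250 USD."
print(find_matches(sample, "invoice"))                   # case-insensitive literal
print(find_matches(sample, "invoice", match_case=True))  # case-sensitive literal
print(find_matches(sample, r"INV-\d+", use_regex=True))  # regular expression
```

Returning character offsets alongside the matched text makes it possible to show surrounding context or highlight hits later.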

Supported Document Formats

GroupDocs.Parser for Python via .NET supports a wide range of document families. Below is an overview of the most important ones.

Word Processing

  • DOC, DOT – Microsoft Word binary documents & templates
  • DOCX, DOCM, DOTX, DOTM – Office Open XML documents & templates
  • RTF – Rich Text Format
  • TXT – Plain text
  • ODT, OTT – OpenDocument text documents & templates

Typical operations: text extraction (accurate & raw), structured text parsing, text areas, metadata, images, attachments, TOC, barcode scanning.

PDF

  • PDF – Portable Document Format

Operations: template-based parsing, accurate & raw text extraction, text areas, metadata, images, attachments/containers, forms, TOC, barcode scanning.

Markup

  • XHTML – Extensible Hypertext Markup Language
  • MHTML – MIME HTML
  • MD – Markdown
  • XML – XML files

Operations: text extraction (including formatted text for supported types) and metadata extraction.

eBook

  • CHM – Compiled HTML Help
  • EPUB – Digital e-book format
  • FB2 – FictionBook 2.0
  • MOBI, AZW3 – Mobile/Kindle formats

Operations: text extraction, structured text, metadata, containers, TOC support for selected formats, barcode scanning for supported types.

Spreadsheets

  • XLS, XLT, XLSX, XLSM, XLSB
  • XLTX, XLTM
  • ODS, OTS – OpenDocument spreadsheets
  • CSV – Comma-Separated Values
  • XLA, XLAM – add-ins
  • NUMBERS – Apple iWork Numbers

Operations: text & data extraction, structured content, text areas, metadata, images, containers/attachments.

Presentations

  • PPT, PPS, POT – binary PowerPoint
  • PPTX, PPTM, PPSX, PPSM, POTX, POTM – Office Open XML
  • ODP, OTP – OpenDocument presentations

Operations: slide text and notes, structured text, text areas, metadata, images, attachments, TOC, barcode scanning.

Email

  • PST, OST – Outlook data files
  • EML, EMLX, MSG – email messages

Operations: email body text, metadata (from/to/subject), attachments, images and containers.

Notes

  • ONE – Microsoft OneNote documents

Operations: text extraction and basic metadata support.

Archives

  • 7Z, ZIP, RAR, TAR, GZ, BZ2

Operations: work with containers – extract inner documents and attachments, including images.

Encrypted 7Z archives are not supported.

Images

  • BMP, GIF, JP2, JPG/JPEG, PNG, TIF/TIFF
  • DICOM, DJVU, EMF, J2K, PS, PSD, SVG, SVGZ, WEBP, WMF

Operations: text extraction (for some formats via OCR), metadata, barcode scanning (where supported).

Databases

  • ADO.NET-based data sources and supported database formats

Operations: text and structured data extraction using database-specific options.
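
When routing files in a batch, it can help to know which of the families above a file belongs to before parsing it. The sketch below is a plain-Python lookup table built from the format lists in this section. It is not part of GroupDocs.Parser; the names `FORMAT_FAMILIES` and `format_family` are hypothetical, and it assumes each listed format name doubles as the file extension, which may not hold for every format (DICOM, for instance). Databases are omitted because they are not identified by a file extension here.

```python
from pathlib import Path

# Extensions taken from the format overview above, grouped by family.
FORMAT_FAMILIES = {
    "Word Processing": {"doc", "dot", "docx", "docm", "dotx", "dotm",
                        "rtf", "txt", "odt", "ott"},
    "PDF": {"pdf"},
    "Markup": {"xhtml", "mhtml", "md", "xml"},
    "eBook": {"chm", "epub", "fb2", "mobi", "azw3"},
    "Spreadsheets": {"xls", "xlt", "xlsx", "xlsm", "xlsb", "xltx", "xltm",
                     "ods", "ots", "csv", "xla", "xlam", "numbers"},
    "Presentations": {"ppt", "pps", "pot", "pptx", "pptm", "ppsx", "ppsm",
                      "potx", "potm", "odp", "otp"},
    "Email": {"pst", "ost", "eml", "emlx", "msg"},
    "Notes": {"one"},
    "Archives": {"7z", "zip", "rar", "tar", "gz", "bz2"},
    "Images": {"bmp", "gif", "jp2", "jpg", "jpeg", "png", "tif", "tiff",
               "dicom", "djvu", "emf", "j2k", "ps", "psd", "svg", "svgz",
               "webp", "wmf"},
}


def format_family(file_name):
    """Return the family name for a file, or None if the extension is not listed."""
    # Only the last suffix is considered, so "backup.tar.gz" is looked up as "gz".
    extension = Path(file_name).suffix.lower().lstrip(".")
    for family, extensions in FORMAT_FAMILIES.items():
        if extension in extensions:
            return family
    return None


for name in ("report.PDF", "inbox.pst", "backup.tar.gz", "notes.unknown"):
    print(name, "->", format_family(name))
```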

Platform Independence

GroupDocs.Parser for Python via .NET can be used to build 32-bit and 64-bit applications for different operating systems, such as Windows, Linux and macOS, where a supported Python 3.x version is installed.

The parsing engine is powered by the same core technology as the GroupDocs.Parser .NET library, giving you production-ready performance and compatibility in Python environments.

Get Started

Ready to try GroupDocs.Parser for Python via .NET?

You can install the Python package from PyPI and reference it in your project. The exact package name and version may depend on the final distribution, but the flow will be similar to other GroupDocs Python via .NET libraries:

Install GroupDocs.Parser for Python via .NET from PyPI

pip install groupdocs.parser-net

Upgrade to the latest version

pip install --upgrade groupdocs.parser-net

Or

Download Package from Official Website

To download the GroupDocs.Parser package for your operating system, please visit the official GroupDocs Releases website. Currently, four OS-specific packages are available:

  • Windows 64-bit: Package name ends with amd64.whl
  • Windows 32-bit: Package name ends with win32.whl
  • Linux 64-bit: Package name ends with linux1_x86_64.whl
  • macOS (Intel, x86_64): Package name ends with macosx_10_14_x86_64.whl

Choose the appropriate package based on your system's architecture.
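
If you are unsure which package matches your interpreter, the check can be sketched with the standard library. The snippet below is an illustration only: the function names `wheel_suffix` and `current_wheel_suffix` are hypothetical, the suffixes are copied from the list above, and the mapping assumes that only x86-family machines are covered by these four packages.

```python
import platform
import struct


def wheel_suffix(system, machine, bits):
    """Map OS name, CPU name and interpreter bitness to a wheel name ending.

    Returns None when none of the four listed packages applies.
    """
    # Assumption: the listed packages target x86-family CPUs only.
    if machine.lower() not in {"amd64", "x86_64", "x86", "i386", "i686"}:
        return None
    if system == "Windows":
        return "amd64.whl" if bits == 64 else "win32.whl"
    if system == "Linux" and bits == 64:
        return "linux1_x86_64.whl"
    if system == "Darwin" and bits == 64:
        return "macosx_10_14_x86_64.whl"
    return None


def current_wheel_suffix():
    """Apply wheel_suffix() to the running interpreter."""
    # Pointer size reflects the interpreter's bitness, not the OS's.
    bits = struct.calcsize("P") * 8
    return wheel_suffix(platform.system(), platform.machine(), bits)


print(current_wheel_suffix())
```

The bitness is taken from the interpreter rather than the operating system because a 32-bit Python on 64-bit Windows needs the win32 package.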

Quick Text Extraction Example

The snippet below demonstrates how a typical usage scenario for extracting text from a PDF document might look in Python.

import groupdocs.parser as gp

def run():
    # Load the PDF document; the with-block releases it when done
    with gp.Parser("sample.pdf") as parser:
        # Extract text from the document
        text = parser.GetText()

        # Output the extracted text
        print(text)

if __name__ == "__main__":
    run()

Extract Images from a Word Document

This example shows how to iterate over images embedded in a Word document and save them to disk.

import groupdocs.parser as gp

def run():
    # Load the Word document; the with-block releases it when done
    with gp.Parser("sample.docx") as parser:
        # Get images from the document
        images = parser.GetImages()

        # Save each image to a numbered PNG file (image1.png, image2.png, ...)
        for index, image in enumerate(images, start=1):
            image.Save(f"image{index}.png")

if __name__ == "__main__":
    run()

GroupDocs.Parser for Python via .NET is intended for use from Python. For Node.js, Java and .NET, we recommend GroupDocs.Parser for Node.js, GroupDocs.Parser for Java and GroupDocs.Parser for .NET, respectively.

Keywords

parser
