Socket
Socket
Sign inDemoInstall

github.com/CerviTalotekniikka/pdf

Package Overview
Dependencies
0
Alerts
File Explorer

Install Socket

Detect and block malicious and high-risk dependencies

Install

    github.com/CerviTalotekniikka/pdf

Package pdf implements reading of PDF files. PDF is Adobe's Portable Document Format, ubiquitous on the internet. A PDF document is a complex data format built on a fairly simple structure. This package exposes the simple structure along with some wrappers to extract basic information. If more complex information is needed, it is possible to extract that information by interpreting the structure exposed by this package. Specifically, a PDF is a data structure built from Values, each of which has one of the following Kinds: The accessors on Value—Int64, Float64, Bool, Name, and so on—return a view of the data as the given type. When there is no appropriate view, the accessor returns a zero result. For example, the Name accessor returns the empty string if called on a Value v for which v.Kind() != Name. Returning zero values this way, especially from the Dict and Array accessors, which themselves return Values, makes it possible to traverse a PDF quickly without writing any error checking. On the other hand, it means that mistakes can go unreported. The basic structure of the PDF file is exposed as the graph of Values. Most richer data structures in a PDF file are dictionaries with specific interpretations of the name-value pairs. The Font and Page wrappers make the interpretation of a specific Value as the corresponding type easier. They are only helpers, though: they are implemented only in terms of the Value API and could be moved outside the package. Equally important, traversal of other PDF data structures can be implemented in other packages as needed.


Version published

Readme

Source

Purpose of the fork

This fork of rsc.io/pdf extends the package API with:

  • Implement the method GetPlainText() from object Page. Use to get plain text content (without format)

How to read all text from PDF:

  1. Get the library with command go get -u github.com/ledongthuc/pdf

  2. I write an example function to read file from PATH and return the content of PDF

package main

import (
	"bytes"
	"fmt"

	"github.com/ledongthuc/pdf"
)

func main() {
	content, err := readPdf("test.pdf") // Read local pdf file
	if err != nil {
		panic(err)
	}
	fmt.Println(content)
	return
}

func readPdf(path string) (string, error) {
	r, err := pdf.Open(path)
	if err != nil {
		return "", err
	}
	totalPage := r.NumPage()

	var textBuilder bytes.Buffer
	for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {
		p := r.Page(pageIndex)
		if p.V.IsNull() {
			continue
		}
		textBuilder.WriteString(p.GetPlainText("\n"))
	}
	return textBuilder.String(), nil
}

How to read all text with styles from PDF

func readPdf2(path string) (string, error) {
	r, err := pdf.Open(path)
	if err != nil {
		return "", err
	}
	totalPage := r.NumPage()

	for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {
		p := r.Page(pageIndex)
		if p.V.IsNull() {
			continue
		}
		var lastTextStyle pdf.Text
		texts := p.Content().Text
		for _, text := range texts {
			if isSameSentence(text, lastTextStyle) {
				lastTextStyle.S = lastTextStyle.S + text.S
			} else {
				fmt.Printf("Font: %s, Font-size: %f, x: %f, y: %f, content: %s \n", lastTextStyle.Font, lastTextStyle.FontSize, lastTextStyle.X, lastTextStyle.Y, lastTextStyle.S)
				lastTextStyle = text
			}
		}
	}
	return "", nil
}

Demo

Run example

FAQs

Last updated on 02 Jul 2017

Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc