
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 TABLE DETECTION IN IMAGES AND OCR TO CSV
 Eric Ihli
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1 Overview
══════════
This Python package contains modules that help find and extract tabular data from a PDF or image and write it out as CSV.
Given an image that contains a table…
file:resources/examples/example-page.png
Extract the text into a CSV format…
╭────
│ PRIZE,ODDS 1 IN:,# OF WINNERS*
│ $3,9.09,"282,447"
│ $5,16.66,"154,097"
│ $7,40.01,"64,169"
│ $10,26.67,"96,283"
│ $20,100.00,"25,677"
│ $30,290.83,"8,829"
│ $50,239.66,"10,714"
│ $100,919.66,"2,792"
│ $500,"6,652.07",386
│ "$40,000","855,899.99",3
│ 1,i223,
│ Toa,,
│ ,,
│ ,,"* Based upon 2,567,700"
╰────
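Values such as "282,447" are quoted in the output because they contain commas themselves, so the result is best read with a real CSV parser rather than by splitting on commas. The following is a minimal sketch using only Python's standard library; the `to_number` helper and the shortened `csv_text` sample are illustrative assumptions and not part of this package.

```python
import csv
import io

# A few rows copied from the example output above.
csv_text = '''PRIZE,ODDS 1 IN:,# OF WINNERS*
$3,9.09,"282,447"
$500,"6,652.07",386
"$40,000","855,899.99",3
'''


def to_number(field):
    """Strip the '$' sign and thousands separators, then parse as a float."""
    return float(field.replace("$", "").replace(",", ""))


reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # first row holds the column names
rows = [[to_number(field) for field in row] for row in reader]

print(header)
for row in rows:
    print(row)
```

Rows that contain OCR noise (such as the `1,i223,` line above) would make `to_number` raise a `ValueError`, so real output probably needs some filtering before a conversion like this.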
2 Requirements
══════════════
Along with the Python requirements that are listed in setup.py and installed automatically when you install this package through pip, some of the modules have a few external requirements.
I haven't looked into the minimum required versions of these dependencies, but I'll list the versions that I'm using.
• `pdfimages' 20.09.0 of [Poppler]
• `tesseract' 5.0.0 of [Tesseract]
• `mogrify' 7.0.10 of [ImageMagick]
[Poppler] https://poppler.freedesktop.org/
[Tesseract] https://github.com/tesseract-ocr/tesseract
[ImageMagick] https://imagemagick.org/index.php
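Since these tools are not installed by pip, it can help to check that they are on your PATH before running the modules that need them. This is a small sketch using the standard library's `shutil.which`; the `find_tools` and `missing_tools` helpers are hypothetical names and are not provided by this package.

```python
import shutil

# Executable names taken from the requirements list above.
REQUIRED_TOOLS = ("pdfimages", "tesseract", "mogrify")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=REQUIRED_TOOLS):
    """Return the names of the executables that could not be found."""
    return [name for name, path in find_tools(names).items() if path is None]


for name, path in find_tools().items():
    print(f"{name}: {path or 'NOT FOUND'}")
```

This only confirms that the executables exist. It does not check their versions.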
3 Demo
══════
There is a demo module that will download an image given a URL and try to extract tables from the image and process the cells into a CSV. You can try it out with one of the images included in this repo.
The demo will run against the following image:
file:resources/test_data/simple.png
The following should be printed to your terminal after running the demo.
╭────
│ Running extract_tables.main([/tmp/demo_p9on6m8o/simple.png]).
│ Extracted the following tables from the image:
│ [('/tmp/demo_p9on6m8o/simple.png', ['/tmp/demo_p9on6m8o/simple/table-000.png'])]
│ Processing tables for /tmp/demo_p9on6m8o/simple.png.
│ Processing table /tmp/demo_p9on6m8o/simple/table-000.png.
│ Extracted 18 cells from /tmp/demo_p9on6m8o/simple/table-000.png
│ Cells:
│ /tmp/demo_p9on6m8o/simple/cells/000-000.png: Cell
│ /tmp/demo_p9on6m8o/simple/cells/000-001.png: Format
│ /tmp/demo_p9on6m8o/simple/cells/000-002.png: Formula
│ ...
│
│ Here is the entire CSV output:
│
│ Cell,Format,Formula
│ B4,Percentage,None
│ C4,General,None
│ D4,Accounting,None
│ E4,Currency,"=PMT(B4/12,C4,D4)"
│ F4,Currency,=E4*C4
╰────
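The cell file names in the output above (000-000, 000-001, 000-002) appear to encode a row index followed by a column index. Under that assumption, the final step of turning per-cell OCR text into CSV rows could look roughly like the sketch below. `cells_to_csv` is a hypothetical helper written for illustration; it is not the package's `ocr_to_csv` implementation.

```python
import csv
import io
import re
from collections import defaultdict


def cells_to_csv(cell_text):
    """Assemble CSV text from a {filename: ocr_text} mapping.

    Assumes names like '000-001.png' mean row 0, column 1.
    """
    rows = defaultdict(dict)
    for name, text in cell_text.items():
        match = re.search(r"(\d+)-(\d+)\.\w+$", name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        row, col = int(match.group(1)), int(match.group(2))
        rows[row][col] = text.strip()

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in sorted(rows):
        cols = rows[row]
        writer.writerow([cols[c] for c in sorted(cols)])
    return buffer.getvalue()


# Cells given out of order on purpose; the indices restore the layout.
example = {
    "cells/001-002.png": "=PMT(B4/12,C4,D4)",
    "cells/000-000.png": "Cell",
    "cells/000-001.png": "Format",
    "cells/000-002.png": "Formula",
    "cells/001-000.png": "E4",
    "cells/001-001.png": "Currency",
}
print(cells_to_csv(example), end="")
```

The `csv` writer quotes the formula because it contains commas, which matches the quoting seen in the demo output.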
4 Modules
═════════
The package is split into modules with narrow focuses.
• `pdf_to_images' uses Poppler and ImageMagick to extract images from a PDF.
• `extract_tables' finds and extracts table-looking things from an image.
• `extract_cells' extracts and orders cells from a table.
• `ocr_image' uses Tesseract to OCR the text from an image of a cell.
• `ocr_to_csv' converts into a CSV the directory structure that `ocr_image' outputs.
The outputs of a previous module can be used by a subsequent module so that they can be chained together to create the entire workflow, as demonstrated by the following shell script.
╭────
│ #!/bin/sh
│
│ PDF=$1
│
│ python -m table_ocr.pdf_to_images $PDF | grep .png > /tmp/pdf-images.txt
│ cat /tmp/pdf-images.txt | xargs -I{} python -m table_ocr.extract_tables {} | grep table > /tmp/extracted-tables.txt
│ cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
│ cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {}
│
│ for image in $(cat /tmp/extracted-tables.txt); do
│     dir=$(dirname $image)
│     python -m table_ocr.ocr_to_csv $(find $dir/cells -name "*.txt")
│ done
╰────
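The same chain can be driven from Python by invoking each module as a subprocess and passing the printed paths to the next stage. This is only a sketch under a few assumptions: the helper names (`stage_command`, `keep_lines`, `run_stage`, `pipeline`) are hypothetical, `keep_lines` uses a plain substring match where the script uses `grep`, and the text files are sorted before being passed to `ocr_to_csv`, which the shell script does not do.

```python
import subprocess
import sys
from pathlib import Path


def stage_command(module, path):
    """Build the argv for one stage, e.g. python -m table_ocr.extract_tables page.png."""
    return [sys.executable, "-m", f"table_ocr.{module}", path]


def keep_lines(output, needle):
    """Rough stand-in for `grep needle`: keep the lines that contain needle."""
    return [line.strip() for line in output.splitlines() if needle in line]


def run_stage(module, paths, needle):
    """Run one module once per input path and collect the matching output lines."""
    results = []
    for path in paths:
        proc = subprocess.run(
            stage_command(module, path),
            capture_output=True, text=True, check=True,
        )
        results.extend(keep_lines(proc.stdout, needle))
    return results


def pipeline(pdf_path):
    """Chain the modules in the same order as the shell script above."""
    images = run_stage("pdf_to_images", [pdf_path], ".png")
    tables = run_stage("extract_tables", images, "table")
    cells = run_stage("extract_cells", tables, "cells")
    run_stage("ocr_image", cells, "")  # output is not needed by later stages

    csv_outputs = []
    for table in tables:
        cells_dir = Path(table).parent / "cells"
        txt_files = sorted(str(p) for p in cells_dir.rglob("*.txt"))
        proc = subprocess.run(
            [sys.executable, "-m", "table_ocr.ocr_to_csv", *txt_files],
            capture_output=True, text=True, check=True,
        )
        csv_outputs.append(proc.stdout)
    return csv_outputs
```

Calling `pipeline("some-document.pdf")` would require the package and its external tools to be installed. Unlike the shell version, no intermediate files are written to /tmp.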
The package was written in a [literate programming] style. The source code at https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html is meant to act as the documentation and reference material.
[literate programming] https://en.wikipedia.org/wiki/Literate_programming