Tabledetector
Tabledetector is a Python package that takes PDFs or images as input, checks their alignment, re-aligns them if required, detects the table structure, extracts the data, and returns it as a pandas DataFrame for further use. The current implementation supports bordered, semibordered, and unbordered table structures.
Features
- PDF/Image Input: Accepts PDF or image files as input for table detection.
- Alignment Check: Verifies and adjusts alignment of input.
- Table Detection: Identifies bordered, semibordered and unbordered tables in the PDF/Image File.
- Table Extraction: Extracts the tabular data as a pandas DataFrame.
Libraries Used
- Python 3.x
- OpenCV
- NumPy
- pdf2image
- Pillow
- scipy
- jinja2
- easyocr
- pandas
Create and Activate Environment
conda create -n <env_name> python=3.7
conda activate <env_name>
Installation of package using pip
pip install tabledetector
Clone the repository for the latest development release
git clone https://github.com/rajban94/TableDetector.git
Dependency
To utilize this library on Windows, ensure that Poppler is installed and its path is added to the environment variables.
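Before running detection, it can help to confirm that Poppler is actually reachable. The snippet below is a minimal sketch, not part of tabledetector: it assumes that pdf2image relies on Poppler's `pdftoppm` or `pdftocairo` command-line tools, and the helper name `poppler_available` is hypothetical.

```python
import shutil


def poppler_available(which=shutil.which):
    """Return True if a Poppler command-line tool is found on PATH.

    Hypothetical helper: it only looks for the executables that
    pdf2image is assumed to call (pdftoppm or pdftocairo).
    """
    return any(which(tool) is not None for tool in ("pdftoppm", "pdftocairo"))


if __name__ == "__main__":
    if poppler_available():
        print("Poppler found on PATH.")
    else:
        print("Poppler not found; add its bin directory to PATH.")
```

If the check fails on Windows, the usual cause is that the Poppler `bin` directory has not been added to the environment variables, or that the terminal was opened before the change was made.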
Usage
Detection
For bordered table detection, when rotation is not required:
import tabledetector as td
result = td.detect(pdf_path="pdf_path", type="bordered", rotation=False, method='detect')
For semibordered table detection, when rotation is not required:
import tabledetector as td
result = td.detect(pdf_path="pdf_path", type="semibordered", rotation=False, method='detect')
For unbordered table detection, when rotation is not required:
import tabledetector as td
result = td.detect(pdf_path="pdf_path", type="unbordered", rotation=False, method='detect')
For bordered table detection and extraction, when rotation is not required:
import tabledetector as td
result = td.detect(pdf_path="pdf_path", type="bordered", rotation=False, method='extract')
For semibordered table detection and extraction, when rotation is not required:
import tabledetector as td
result = td.detect(pdf_path="pdf_path", type="semibordered", rotation=False, method='extract')
For unbordered table detection and extraction, when rotation is not required:
import tabledetector as td
result = td.detect(pdf_path="pdf_path", type="unbordered", rotation=False, method='extract')
If no table type (bordered, semibordered, or unbordered) is specified, all of them are checked and the result is returned accordingly. If rotation is required, set rotation=True.
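The calls above differ only in a few keyword arguments, so they can be assembled in one place. The following is a rough sketch under stated assumptions: it uses the keyword names that appear in the examples above (`pdf_path`, `type`, `rotation`, `method`), and the helpers `build_detect_kwargs` and `run_detection` are hypothetical and not part of tabledetector.

```python
VALID_TYPES = {"bordered", "semibordered", "unbordered"}


def build_detect_kwargs(pdf_path, table_type=None, rotation=False, extract=False):
    """Assemble keyword arguments for td.detect (hypothetical helper).

    The table type is left out when it is None, so that detection can
    fall back to checking every table type.
    """
    if table_type is not None and table_type not in VALID_TYPES:
        raise ValueError(f"unknown table type: {table_type!r}")
    kwargs = {
        "pdf_path": pdf_path,
        "rotation": rotation,
        # 'extract' also returns the data; 'detect' only locates tables.
        "method": "extract" if extract else "detect",
    }
    if table_type is not None:
        kwargs["type"] = table_type
    return kwargs


def run_detection(pdf_path, **options):
    """Call td.detect with the assembled arguments (assumed signature)."""
    import tabledetector as td  # imported lazily; requires the package and Poppler

    return td.detect(**build_detect_kwargs(pdf_path, **options))
```

For example, `run_detection("pdf_path", table_type="bordered", rotation=True, extract=True)` would request extraction of bordered tables with the alignment correction enabled.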