pdfplumber


pdfplumber - pypi package: comparing version 0.11.2 to 0.11.3

CHANGELOG.md (+21 -0)

@@ -5,2 +5,23 @@ # Changelog

## [0.11.3] - 2024-08-07
### Added
- Add `Table.columns`, analogous to `Table.rows` (h/t @Pk13055). ([#1050](https://github.com/jsvine/pdfplumber/issues/1050) + [d39302f](https://github.com/jsvine/pdfplumber/commit/d39302f))
- Add `Page.extract_words(return_chars=True)`, mirroring `Page.search(..., return_chars=True)`; if this argument is passed, each word dictionary will include an additional key-value pair: `"chars": [char_object, ...]` (h/t @cmdlineluser). ([#1173](https://github.com/jsvine/pdfplumber/issues/1173) + [1496cbd](https://github.com/jsvine/pdfplumber/commit/1496cbd))
- Add `pdfplumber.open(unicode_norm="NFC"/"NFD"/"NFKC"/"NFKD")`, where the values are the [four options for Unicode normalization](https://unicode.org/reports/tr15/#Normalization_Forms_Table) (h/t @petermr + @agusluques). ([#905](https://github.com/jsvine/pdfplumber/issues/905) + [03a477f](https://github.com/jsvine/pdfplumber/commit/03a477f))
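As a quick, self-contained illustration of what those four normalization forms do (plain `unicodedata`, no pdfplumber required; the sample string is ours, not from the changelog):

```python
from unicodedata import normalize

# "ﬁné": a ligature (U+FB01) plus a precomposed accented character (U+00E9)
s = "\ufb01n\u00e9"

# NFC/NFD only compose/decompose accents; the ligature survives.
assert normalize("NFC", s) == "\ufb01n\u00e9"
assert normalize("NFD", s) == "\ufb01ne\u0301"

# The compatibility forms (NFKC/NFKD) also expand the ligature.
assert normalize("NFKC", s) == "fin\u00e9"
assert normalize("NFKD", s) == "fine\u0301"
```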
### Changed
- Change the default setting that `pdfplumber.repair(...)` passes to Ghostscript's `-dPDFSETTINGS` parameter from `prepress` to `default`, and make that setting modifiable via `.repair(setting=...)`, where the value is one of `"default"`, `"prepress"`, `"printer"`, or `"ebook"` (h/t @Laubeee). ([#874](https://github.com/jsvine/pdfplumber/issues/874) + [48cab3f](https://github.com/jsvine/pdfplumber/commit/48cab3f))
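The `setting` value is interpolated directly into the Ghostscript flag (the `repair.py` hunk further down in this diff also accepts `"screen"`); a minimal sketch of that mapping, using a helper name of our own:

```python
# Hypothetical helper mirroring how the Ghostscript argument is built;
# the valid values come from T_repair_setting in pdfplumber/repair.py.
VALID_SETTINGS = ("default", "prepress", "printer", "ebook", "screen")

def gs_pdfsettings_flag(setting: str = "default") -> str:
    if setting not in VALID_SETTINGS:
        raise ValueError(f"unknown -dPDFSETTINGS value: {setting!r}")
    return f"-dPDFSETTINGS=/{setting}"

assert gs_pdfsettings_flag() == "-dPDFSETTINGS=/default"
assert gs_pdfsettings_flag("ebook") == "-dPDFSETTINGS=/ebook"
```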
### Fixed
- Fix handling of object coordinates when `mediabox` does not begin at `(0,0)` (h/t @wodny). ([#1181](https://github.com/jsvine/pdfplumber/issues/1181) + [9025c3f](https://github.com/jsvine/pdfplumber/commit/9025c3f) + [046bd87](https://github.com/jsvine/pdfplumber/commit/046bd87))
- Fix error on getting `.annots`/`.hyperlinks` from `CroppedPage` (due to missing `.rotation` and `.initial_doctop` attributes) (h/t @Safrone). ([#1171](https://github.com/jsvine/pdfplumber/issues/1171) + [e5737d2](https://github.com/jsvine/pdfplumber/commit/e5737d2))
- Fix problem where `Page.crop(...)` was not cropping `.annots/.hyperlinks` (h/t @Safrone). ([#1171](https://github.com/jsvine/pdfplumber/issues/1171) + [22494e8](https://github.com/jsvine/pdfplumber/commit/22494e8))
- Fix calculation of coordinates for `.annots` on `CroppedPage`s. ([0bbb340](https://github.com/jsvine/pdfplumber/commit/0bbb340) + [b16acc3](https://github.com/jsvine/pdfplumber/commit/b16acc3))
- Dereference structure element attributes (h/t @dhdaines). ([#1169](https://github.com/jsvine/pdfplumber/pull/1169) + [3f16180](https://github.com/jsvine/pdfplumber/commit/3f16180))
- Fix `Page.get_attr(...)` so that it fully resolves references before determining whether the attribute's value is `None` (h/t @zzhangyun + @mkl-public). ([#1176](https://github.com/jsvine/pdfplumber/issues/1176) + [c20cd3b](https://github.com/jsvine/pdfplumber/commit/c20cd3b))
## [0.11.2] - 2024-07-06

@@ -7,0 +28,0 @@

+29
-6
Metadata-Version: 2.1
Name: pdfplumber
Version: 0.11.2
Version: 0.11.3
Summary: Plumb a PDF for detailed information about each char, rectangle, and line.

@@ -36,4 +36,2 @@ Home-page: https://github.com/jsvine/pdfplumber

> 👋 This repository’s maintainers are available to hire for PDF data-extraction consulting projects. To get a cost estimate, contact [Jeremy](https://www.jsvine.com/consulting/pdf-data-extraction/) (for projects of any size or complexity) and/or [Samkit](https://www.linkedin.com/in/samkit-jain/) (specifically for table extraction).
## Table of Contents

@@ -106,2 +104,4 @@

To [pre-normalize Unicode text](https://unicode.org/reports/tr15/), pass `unicode_norm=...`, where `...` is one of the [four Unicode normalization forms](https://unicode.org/reports/tr15/#Normalization_Forms_Table): `"NFC"`, `"NFD"`, `"NFKC"`, or `"NFKD"`.
Invalid metadata values are treated as a warning by default. If that is not intended, pass `strict_metadata=True` to the `open` method and `pdfplumber.open` will raise an exception if it is unable to parse the metadata.

@@ -281,4 +281,25 @@

[To be completed.]
*Note: Although the positioning and characteristics of `image` objects are available via `pdfplumber`, this library does not provide direct support for reconstructing image content. For that, please see [this suggestion](https://github.com/jsvine/pdfplumber/discussions/496#discussioncomment-1259772).*
| Property | Description |
|----------|-------------|
|`page_number`| Page number on which the image was found.|
|`height`| Height of the image.|
|`width`| Width of the image.|
|`x0`| Distance of left side of the image from left side of page.|
|`x1`| Distance of right side of the image from left side of page.|
|`y0`| Distance of bottom of the image from bottom of page.|
|`y1`| Distance of top of the image from bottom of page.|
|`top`| Distance of top of the image from top of page.|
|`bottom`| Distance of bottom of the image from top of page.|
|`doctop`| Distance of top of rectangle from top of document.|
|`srcsize`| The image's original dimensions, as a `(width, height)` tuple.|
|`colorspace`| Color domain of the image (e.g., RGB).|
|`bits`| The number of bits per color component; e.g., 8 corresponds to 255 possible values for each color component (R, G, and B in an RGB color space).|
|`stream`| Pixel values of the image, as a `pdfminer.pdftypes.PDFStream` object.|
|`imagemask`| A nullable boolean; if `True`, "specifies that the image data is to be used as a stencil mask for painting in the current color."|
|`mcid`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section ID for this image if any (otherwise `None`). *Experimental attribute.*|
|`tag`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section tag for this image if any (otherwise `None`). *Experimental attribute.*|
|`object_type`| "image"|
### Obtaining higher-level layout objects via `pdfminer.six`

@@ -353,3 +374,3 @@

|`.extract_text_simple(x_tolerance=3, y_tolerance=3)`| A slightly faster but less flexible version of `.extract_text(...)`, using a simpler logic.|
|`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are "ttb" (top-to-bottom), "btt" (bottom-to-top), "ltr" (left-to-right), and "rtl" (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes.
Setting `split_at_punctuation` to `True` will enforce breaking tokens at the punctuation specified by `string.punctuation`; or you can specify the list of separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`).|
|`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True, return_chars=False)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are "ttb" (top-to-bottom), "btt" (bottom-to-top), "ltr" (left-to-right), and "rtl" (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes.
Setting `split_at_punctuation` to `True` will enforce breaking tokens at the punctuation specified by `string.punctuation`; or you can specify the list of separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`). Passing `return_chars=True` will add each word's constituent characters to its dictionary, as a list under the `"chars"` key.|
|`.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs)`|*Experimental feature* that returns a list of dictionaries representing the lines of text on the page. The `strip` parameter works analogously to Python's `str.strip()` method, and returns `text` attributes without their surrounding whitespace. (Only relevant when `layout = True`.) Setting `return_chars` to `False` will exclude the individual character objects from the returned text-line dicts. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.|

@@ -375,3 +396,3 @@ |`.search(pattern, regex=True, case=True, main_group=0, return_groups=True, return_chars=True, layout=False, **kwargs)`|*Experimental feature* that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the response dictionary object contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner. Setting `main_group` restricts the results to a specific regex group within the `pattern` (default of `0` means the entire match). Setting `return_groups` and/or `return_chars` to `False` will exclude the lists of the matched regex groups and/or characters from being added to the returned dicts (as `"groups"` and `"chars"`). The `layout` parameter operates as it does for `.extract_text(...)`. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`. __Note__: Zero-width and all-whitespace matches are discarded, because they (generally) have no explicit position on the page. |

|--------|-------------|
|`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.|
|`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, `.columns`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.|
|`.find_table(table_settings={})`|Similar to `.find_tables(...)`, but returns the *largest* table on the page, as a `Table` object. If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.|
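That selection rule (most cells wins, ties go to the table nearest the top of the page) can be sketched with stand-in tables; the `SimpleTable` type and `largest_table` helper here are ours, not pdfplumber's:

```python
from typing import NamedTuple, List, Tuple

class SimpleTable(NamedTuple):
    bbox: Tuple[float, float, float, float]  # (x0, top, x1, bottom)
    cells: List[Tuple[float, float, float, float]]

def largest_table(tables: List[SimpleTable]) -> SimpleTable:
    # Most cells wins; ties go to the table whose top edge is highest
    # on the page (smallest `top`).
    return max(tables, key=lambda t: (len(t.cells), -t.bbox[1]))

a = SimpleTable((0, 100, 50, 200), [(0, 0, 1, 1)] * 4)
b = SimpleTable((0, 50, 50, 90), [(0, 0, 1, 1)] * 4)   # same size, higher up
c = SimpleTable((0, 300, 50, 400), [(0, 0, 1, 1)] * 2)

assert largest_table([a, b, c]) is b
```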

@@ -573,2 +594,4 @@ |`.extract_tables(table_settings={})`|Returns the text extracted from *all* tables found on the page, represented as a list of lists of lists, with the structure `table -> row -> cell`.|

- [Quentin André](https://github.com/QuentinAndre11)
- [Léo Roux](https://github.com/leorouxx)
- [@wodny](https://github.com/wodny)

@@ -575,0 +598,0 @@ ## Contributing

+1
-1

@@ -1,2 +0,2 @@

version_info = (0, 11, 2)
version_info = (0, 11, 3)
__version__ = ".".join(map(str, version_info))

@@ -16,11 +16,13 @@ import csv

@property
def pages(self) -> Optional[List[Any]]:
... # pragma: nocover
def pages(self) -> Optional[List[Any]]: # pragma: nocover
raise NotImplementedError
@property
def objects(self) -> Dict[str, T_obj_list]:
... # pragma: nocover
def objects(self) -> Dict[str, T_obj_list]: # pragma: nocover
raise NotImplementedError
def to_dict(self, object_types: Optional[List[str]] = None) -> Dict[str, Any]:
... # pragma: nocover
def to_dict(
self, object_types: Optional[List[str]] = None
) -> Dict[str, Any]: # pragma: nocover
raise NotImplementedError

@@ -27,0 +29,0 @@ def flush_cache(self, properties: Optional[List[str]] = None) -> None:

@@ -15,2 +15,3 @@ import re

)
from unicodedata import normalize as normalize_unicode

@@ -220,4 +221,4 @@ from pdfminer.converter import PDFPageAggregator

def get_attr(key: str, default: Any = None) -> Any:
ref = page_obj.attrs.get(key)
return default if ref is None else resolve_all(ref)
value = resolve_all(page_obj.attrs.get(key))
return default if value is None else value
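A toy illustration of why resolving before the `None` check matters, with a stand-in resolver (the `Ref` class and this `resolve_all` are ours, not pdfminer's):

```python
class Ref:
    """Stand-in for a PDF indirect reference."""
    def __init__(self, target):
        self.target = target

def resolve_all(value):
    while isinstance(value, Ref):
        value = value.target
    return value

attrs = {"Rotate": Ref(Ref(None))}  # a reference chain ending in null

# Old logic: the raw value is a Ref, so `ref is None` is False, and the
# resolved-to-null value leaks through instead of the default.
ref = attrs.get("Rotate")
old = 0 if ref is None else resolve_all(ref)
assert old is None  # the default was skipped

# Fixed logic: resolve first, then test for None.
value = resolve_all(attrs.get("Rotate"))
new = 0 if value is None else value
assert new == 0
```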

@@ -296,3 +297,4 @@ # Per PDF Reference Table 3.27: "The number of degrees by which the

pt1 = rotate_point((_c, _d), self.rotation)
x0, top, x1, bottom = _invert_box(_normalize_box((*pt0, *pt1)), self.height)
rh = self.root_page.height
x0, top, x1, bottom = _invert_box(_normalize_box((*pt0, *pt1)), rh)

@@ -316,5 +318,5 @@ a = annot.get("A", {})

"x0": x0,
"y0": self.height - bottom,
"y0": rh - bottom,
"x1": x1,
"y1": self.height - top,
"y1": rh - top,
"doctop": self.initial_doctop + top,

@@ -335,3 +337,7 @@ "top": top,

raw = resolve_all(self.page_obj.annots) or []
return list(map(parse, raw))
parsed = list(map(parse, raw))
if isinstance(self, CroppedPage):
return self._crop_fn(parsed)
else:
return parsed

@@ -350,3 +356,4 @@ @property

def point2coord(self, pt: Tuple[T_num, T_num]) -> Tuple[T_num, T_num]:
return (pt[0], self.height - pt[1])
# See note below re. #1181 and mediabox-adjustment reversions
return (self.mediabox[0] + pt[0], self.mediabox[1] + self.height - pt[1])

@@ -386,3 +393,8 @@ def process_object(self, obj: LTItem) -> T_obj:

if isinstance(obj, (LTChar, LTTextContainer)):
attr["text"] = obj.get_text()
text = obj.get_text()
attr["text"] = (
normalize_unicode(self.pdf.unicode_norm, text)
if self.pdf.unicode_norm is not None
else text
)

@@ -414,7 +426,16 @@ if isinstance(obj, LTChar):

# As noted in #1181, `pdfminer.six` adjusts objects'
# coordinates relative to the MediaBox:
# https://github.com/pdfminer/pdfminer.six/blob/1a8bd2f730295b31d6165e4d95fcb5a03793c978/pdfminer/converter.py#L79-L84
mb_x0, mb_top = self.mediabox[:2]
if "y0" in attr:
attr["top"] = self.height - attr["y1"]
attr["bottom"] = self.height - attr["y0"]
attr["top"] = (self.height - attr["y1"]) + mb_top
attr["bottom"] = (self.height - attr["y0"]) + mb_top
attr["doctop"] = self.initial_doctop + attr["top"]
if "x0" in attr and mb_x0 != 0:
attr["x0"] = attr["x0"] + mb_x0
attr["x1"] = attr["x1"] + mb_x0
return attr
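The coordinate arithmetic above can be checked in isolation; a sketch with a made-up page whose MediaBox does not start at `(0, 0)` (all values are ours, the formula mirrors the hunk):

```python
# A page with MediaBox (20, 30, 620, 830).
mb_x0, mb_y0, mb_x1, mb_y1 = 20, 30, 620, 830
height = mb_y1 - mb_y0  # 800
mb_top = mb_y0          # pdfminer shifts y-coordinates by the MediaBox origin

# A character pdfminer reports at y0=100, y1=110 (MediaBox-adjusted space).
y0, y1 = 100, 110
top = (height - y1) + mb_top     # distance from the top of the page
bottom = (height - y0) + mb_top

assert (top, bottom) == (720, 730)
```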

@@ -644,2 +665,4 @@

self.page_number = parent_page.page_number
self.initial_doctop = parent_page.initial_doctop
self.rotation = parent_page.rotation
self.mediabox = parent_page.mediabox

@@ -646,0 +669,0 @@ self.cropbox = parent_page.cropbox

@@ -6,3 +6,3 @@ import itertools

from types import TracebackType
from typing import Any, Dict, List, Optional, Tuple, Type, Union
from typing import Any, Dict, List, Literal, Optional, Tuple, Type, Union

@@ -19,3 +19,3 @@ from pdfminer.layout import LAParams

from .page import Page
from .repair import _repair
from .repair import T_repair_setting, _repair
from .structure import PDFStructTree, StructTreeMissing

@@ -39,2 +39,3 @@ from .utils import resolve_and_decode

strict_metadata: bool = False,
unicode_norm: Optional[Literal["NFC", "NFKC", "NFD", "NFKD"]] = None,
):

@@ -47,2 +48,3 @@ self.stream = stream

self.password = password
self.unicode_norm = unicode_norm

@@ -77,4 +79,6 @@ self.doc = PDFDocument(PDFParser(stream), password=password or "")

strict_metadata: bool = False,
unicode_norm: Optional[Literal["NFC", "NFKC", "NFD", "NFKD"]] = None,
repair: bool = False,
gs_path: Optional[Union[str, pathlib.Path]] = None,
repair_setting: T_repair_setting = "default",
) -> "PDF":

@@ -85,3 +89,5 @@

if repair:
stream = _repair(path_or_fp, password=password, gs_path=gs_path)
stream = _repair(
path_or_fp, password=password, gs_path=gs_path, setting=repair_setting
)
stream_is_external = False

@@ -108,2 +114,3 @@ # Although the original file has a path,

strict_metadata=strict_metadata,
unicode_norm=unicode_norm,
stream_is_external=stream_is_external,

@@ -110,0 +117,0 @@ )

@@ -5,5 +5,7 @@ import pathlib

from io import BufferedReader, BytesIO
from typing import Optional, Union
from typing import Literal, Optional, Union
T_repair_setting = Literal["default", "prepress", "printer", "ebook", "screen"]
def _repair(

@@ -13,2 +15,3 @@ path_or_fp: Union[str, pathlib.Path, BufferedReader, BytesIO],

gs_path: Optional[Union[str, pathlib.Path]] = None,
setting: T_repair_setting = "default",
) -> BytesIO:

@@ -34,3 +37,3 @@

"-sDEVICE=pdfwrite",
"-dPDFSETTINGS=/prepress",
f"-dPDFSETTINGS=/{setting}",
]

@@ -68,4 +71,5 @@

gs_path: Optional[Union[str, pathlib.Path]] = None,
setting: T_repair_setting = "default",
) -> Optional[BytesIO]:
repaired = _repair(path_or_fp, password, gs_path=gs_path)
repaired = _repair(path_or_fp, password, gs_path=gs_path, setting=setting)
if outfile:

@@ -72,0 +76,0 @@ with open(outfile, "wb") as f:

@@ -288,7 +288,9 @@ import itertools

attributes = self._make_attributes(obj, revision)
element_id = decode_text(obj["ID"]) if "ID" in obj else None
title = decode_text(obj["T"]) if "T" in obj else None
lang = decode_text(obj["Lang"]) if "Lang" in obj else None
alt_text = decode_text(obj["Alt"]) if "Alt" in obj else None
actual_text = decode_text(obj["ActualText"]) if "ActualText" in obj else None
element_id = decode_text(resolve1(obj["ID"])) if "ID" in obj else None
title = decode_text(resolve1(obj["T"])) if "T" in obj else None
lang = decode_text(resolve1(obj["Lang"])) if "Lang" in obj else None
alt_text = decode_text(resolve1(obj["Alt"])) if "Alt" in obj else None
actual_text = (
decode_text(resolve1(obj["ActualText"])) if "ActualText" in obj else None
)
element = PDFStructElement(

@@ -295,0 +297,0 @@ type=obj_tag,

@@ -373,2 +373,6 @@ import itertools

class Column(CellGroup):
pass
class Table(object):

@@ -389,13 +393,31 @@ def __init__(self, page: "Page", cells: List[T_bbox]):

@property
def rows(self) -> List[Row]:
_sorted = sorted(self.cells, key=itemgetter(1, 0))
xs = list(sorted(set(map(itemgetter(0), self.cells))))
def _get_rows_or_cols(self, kind: type[CellGroup]) -> List[CellGroup]:
axis = 0 if kind is Row else 1
antiaxis = int(not axis)
# Sort first by top/x0, then by x0/top
_sorted = sorted(self.cells, key=itemgetter(antiaxis, axis))
# Get all x0s/tops
xs = list(sorted(set(map(itemgetter(axis), self.cells))))
# Group by top/x0
grouped = itertools.groupby(_sorted, itemgetter(antiaxis))
rows = []
for y, row_cells in itertools.groupby(_sorted, itemgetter(1)):
xdict = {cell[0]: cell for cell in row_cells}
row = Row([xdict.get(x) for x in xs])
# for y/x, row/column-cells ...
for y, row_cells in grouped:
xdict = {cell[axis]: cell for cell in row_cells}
row = kind([xdict.get(x) for x in xs])
rows.append(row)
return rows
@property
def rows(self) -> List[CellGroup]:
return self._get_rows_or_cols(Row)
@property
def columns(self) -> List[CellGroup]:
return self._get_rows_or_cols(Column)
def extract(self, **kwargs: Any) -> List[List[Optional[str]]]:
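The shared row/column logic in that hunk reduces to a `groupby` over one bounding-box axis; a self-contained sketch on toy cells (tuples are `(x0, top, x1, bottom)`, as in pdfplumber, but the `groups` helper is ours):

```python
import itertools
from operator import itemgetter

# A 2x2 grid of cells: (x0, top, x1, bottom)
cells = [(0, 0, 10, 5), (10, 0, 20, 5), (0, 5, 10, 10), (10, 5, 20, 10)]

def groups(cells, axis):
    # axis=0 yields rows (grouped by top), axis=1 yields columns (grouped by x0)
    antiaxis = 1 - axis
    ordered = sorted(cells, key=itemgetter(antiaxis, axis))
    keys = sorted(set(map(itemgetter(axis), cells)))
    out = []
    for _, group_cells in itertools.groupby(ordered, itemgetter(antiaxis)):
        by_key = {c[axis]: c for c in group_cells}
        out.append([by_key.get(k) for k in keys])  # None fills missing cells
    return out

rows = groups(cells, axis=0)
cols = groups(cells, axis=1)
assert rows == [[(0, 0, 10, 5), (10, 0, 20, 5)], [(0, 5, 10, 10), (10, 5, 20, 10)]]
assert cols == [[(0, 0, 10, 5), (0, 5, 10, 10)], [(10, 0, 20, 5), (10, 5, 20, 10)]]
```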

@@ -484,3 +506,3 @@

def __post_init__(self) -> "TableSettings":
def __post_init__(self) -> None:
"""Clean up user-provided table settings.

@@ -536,4 +558,2 @@

return self
@classmethod

@@ -540,0 +560,0 @@ def resolve(cls, settings: Optional[T_table_settings]) -> "TableSettings":

import itertools
from collections.abc import Hashable
from operator import itemgetter
from typing import Callable, Dict, Iterable, List, TypeVar, Union
from typing import Any, Callable, Dict, Iterable, List, Tuple, TypeVar, Union
from .._typing import T_num
from .._typing import T_num, T_obj

@@ -39,11 +39,11 @@

R = TypeVar("R")
Clusterable = TypeVar("Clusterable", T_obj, Tuple[Any, ...])
def cluster_objects(
xs: List[R],
key_fn: Union[Hashable, Callable[[R], T_num]],
xs: List[Clusterable],
key_fn: Union[Hashable, Callable[[Clusterable], T_num]],
tolerance: T_num,
preserve_order: bool = False,
) -> List[List[R]]:
) -> List[List[Clusterable]]:

@@ -50,0 +50,0 @@ if not callable(key_fn):

@@ -33,3 +33,4 @@ import itertools

"""
return bbox_getter(obj)
bbox: T_bbox = bbox_getter(obj)
return bbox

@@ -36,0 +37,0 @@

@@ -683,8 +683,18 @@ import inspect

def extract_words(self, chars: T_obj_list) -> T_obj_list:
return list(word for word, word_chars in self.iter_extract_tuples(chars))
def extract_words(
self, chars: T_obj_list, return_chars: bool = False
) -> T_obj_list:
if return_chars:
return list(
{**word, "chars": word_chars}
for word, word_chars in self.iter_extract_tuples(chars)
)
else:
return list(word for word, word_chars in self.iter_extract_tuples(chars))
def extract_words(chars: T_obj_list, **kwargs: Any) -> T_obj_list:
return WordExtractor(**kwargs).extract_words(chars)
def extract_words(
chars: T_obj_list, return_chars: bool = False, **kwargs: Any
) -> T_obj_list:
return WordExtractor(**kwargs).extract_words(chars, return_chars)
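The `return_chars` wiring above is just a dict merge per word; a stand-alone sketch (the sample word/char dicts and this simplified `extract_words` are ours):

```python
# (word, constituent chars) pairs, as iter_extract_tuples would yield them.
pairs = [
    ({"text": "Hello", "x0": 0, "x1": 30}, [{"text": "H"}, {"text": "e"}]),
    ({"text": "world", "x0": 35, "x1": 65}, [{"text": "w"}]),
]

def extract_words(pairs, return_chars=False):
    if return_chars:
        # Merge each word dict with its constituent chars under "chars".
        return [{**word, "chars": chars} for word, chars in pairs]
    return [word for word, _ in pairs]

assert "chars" not in extract_words(pairs)[0]
assert extract_words(pairs, return_chars=True)[0]["chars"][0]["text"] == "H"
```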

@@ -691,0 +701,0 @@

Metadata-Version: 2.1
Name: pdfplumber
Version: 0.11.2
Version: 0.11.3
Summary: Plumb a PDF for detailed information about each char, rectangle, and line.

@@ -36,4 +36,2 @@ Home-page: https://github.com/jsvine/pdfplumber

> 👋 This repository’s maintainers are available to hire for PDF data-extraction consulting projects. To get a cost estimate, contact [Jeremy](https://www.jsvine.com/consulting/pdf-data-extraction/) (for projects of any size or complexity) and/or [Samkit](https://www.linkedin.com/in/samkit-jain/) (specifically for table extraction).
## Table of Contents

@@ -106,2 +104,4 @@

To [pre-normalize Unicode text](https://unicode.org/reports/tr15/), pass `unicode_norm=...`, where `...` is one of the [four Unicode normalization forms](https://unicode.org/reports/tr15/#Normalization_Forms_Table): `"NFC"`, `"NFD"`, `"NFKC"`, or `"NFKD"`.
Invalid metadata values are treated as a warning by default. If that is not intended, pass `strict_metadata=True` to the `open` method and `pdfplumber.open` will raise an exception if it is unable to parse the metadata.

@@ -281,4 +281,25 @@

[To be completed.]
*Note: Although the positioning and characteristics of `image` objects are available via `pdfplumber`, this library does not provide direct support for reconstructing image content. For that, please see [this suggestion](https://github.com/jsvine/pdfplumber/discussions/496#discussioncomment-1259772).*
| Property | Description |
|----------|-------------|
|`page_number`| Page number on which the image was found.|
|`height`| Height of the image.|
|`width`| Width of the image.|
|`x0`| Distance of left side of the image from left side of page.|
|`x1`| Distance of right side of the image from left side of page.|
|`y0`| Distance of bottom of the image from bottom of page.|
|`y1`| Distance of top of the image from bottom of page.|
|`top`| Distance of top of the image from top of page.|
|`bottom`| Distance of bottom of the image from top of page.|
|`doctop`| Distance of top of rectangle from top of document.|
|`srcsize`| The image original dimensions, as a `(width, height)` tuple.|
|`colorspace`| Color domain of the image (e.g., RGB).|
|`bits`| The number of bits per color component; e.g., 8 corresponds to 255 possible values for each color component (R, G, and B in an RGB color space).|
|`stream`| Pixel values of the image, as a `pdfminer.pdftypes.PDFStream` object.|
|`imagemask`| A nullable boolean; if `True`, "specifies that the image data is to be used as a stencil mask for painting in the current color."|
|`mcid`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section ID for this image if any (otherwise `None`). *Experimental attribute.*|
|`tag`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section tag for this image if any (otherwise `None`). *Experimental attribute.*|
|`object_type`| "image"|
### Obtaining higher-level layout objects via `pdfminer.six`

@@ -353,3 +374,3 @@

|`.extract_text_simple(x_tolerance=3, y_tolerance=3)`| A slightly faster but less flexible version of `.extract_text(...)`, using a simpler logic.|
|`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are "ttb" (top-to-bottom), "btt" (bottom-to-top), "ltr" (left-to-right), and "rtl" (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `["fontname", "size"]` will restrict each words to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. 
Setting `split_at_punctuation` to `True` will enforce breaking tokens at punctuations specified by `string.punctuation`; or you can specify the list of separating punctuation by pass a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `fi` will be expanded into their constituent letters (e.g., `fi`).|
|`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True, return_chars=False)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are "ttb" (top-to-bottom), "btt" (bottom-to-top), "ltr" (left-to-right), and "rtl" (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `["fontname", "size"]` will restrict each words to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. 
Setting `split_at_punctuation` to `True` will enforce breaking tokens at punctuations specified by `string.punctuation`; or you can specify the list of separating punctuation by pass a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `fi` will be expanded into their constituent letters (e.g., `fi`). Passing `return_chars=True` will add, to each word dictionary, a list of its constituent characters, as a list in the `"chars"` field.|
|`.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs)`|*Experimental feature* that returns a list of dictionaries representing the lines of text on the page. The `strip` parameter works analogously to Python's `str.strip()` method, and returns `text` attributes without their surrounding whitespace. (Only relevant when `layout = True`.) Setting `return_chars` to `False` will exclude the individual character objects from the returned text-line dicts. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.|

@@ -375,3 +396,3 @@ |`.search(pattern, regex=True, case=True, main_group=0, return_groups=True, return_chars=True, layout=False, **kwargs)`|*Experimental feature* that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the response dictionary object contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner. Setting `main_group` restricts the results to a specific regex group within the `pattern` (default of `0` means the entire match). Setting `return_groups` and/or `return_chars` to `False` will exclude the lists of the matched regex groups and/or characters from being added (as `"groups"` and `"chars"` to the return dicts). The `layout` parameter operates as it does for `.extract_text(...)`. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`. __Note__: Zero-width and all-whitespace matches are discarded, because they (generally) have no explicit position on the page. |
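Conceptually, `.search(...)` amounts to running the pattern over the page's extracted text and mapping the matched span back onto char objects to recover a bounding box. A toy sketch with hand-made char dicts (not the library's implementation):

```python
import re

# Hand-made char objects for the text "Total: 42", five units wide each.
chars = [
    {"text": t, "x0": i * 5, "x1": i * 5 + 5, "top": 0, "bottom": 10}
    for i, t in enumerate("Total: 42")
]
text = "".join(c["text"] for c in chars)

m = re.search(r"\d+", text)       # find the number
span = chars[m.start():m.end()]   # chars behind the matched span
bbox = (
    min(c["x0"] for c in span),
    min(c["top"] for c in span),
    max(c["x1"] for c in span),
    max(c["bottom"] for c in span),
)
print(m.group(0), bbox)  # → 42 (35, 0, 45, 10)
```

This also illustrates why zero-width matches are discarded: an empty span selects no chars, so no bounding box can be computed.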

|`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, `.columns`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.|
|`.find_table(table_settings={})`|Similar to `.find_tables(...)`, but returns the *largest* table on the page, as a `Table` object. If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.|
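On extracted cell text, the new `Table.columns` relates to `Table.rows` as a simple transpose. A minimal illustration with invented cell values:

```python
# Rows of extracted cell text (invented values); the column view of the
# same grid is the transpose.
rows = [
    ["Line no", "UPC code", "Location"],
    ["1", "0085648100305", "A1"],
    ["2", "0085648100380", "B2"],
]
columns = [list(col) for col in zip(*rows)]
print(columns[1])  # → ['UPC code', '0085648100305', '0085648100380']
```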

@@ -573,2 +594,4 @@ |`.extract_tables(table_settings={})`|Returns the text extracted from *all* tables found on the page, represented as a list of lists of lists, with the structure `table -> row -> cell`.|

- [Quentin André](https://github.com/QuentinAndre11)
- [Léo Roux](https://github.com/leorouxx)
- [@wodny](https://github.com/wodny)

@@ -575,0 +598,0 @@ ## Contributing

@@ -15,4 +15,2 @@ # pdfplumber

> 👋 This repository’s maintainers are available to hire for PDF data-extraction consulting projects. To get a cost estimate, contact [Jeremy](https://www.jsvine.com/consulting/pdf-data-extraction/) (for projects of any size or complexity) and/or [Samkit](https://www.linkedin.com/in/samkit-jain/) (specifically for table extraction).
## Table of Contents

@@ -85,2 +83,4 @@

To [pre-normalize Unicode text](https://unicode.org/reports/tr15/), pass `unicode_norm=...`, where `...` is one of the [four Unicode normalization forms](https://unicode.org/reports/tr15/#Normalization_Forms_Table): `"NFC"`, `"NFD"`, `"NFKC"`, or `"NFKD"`.
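These are the same four forms exposed by Python's standard-library `unicodedata.normalize` (presumably what gets applied to each extracted character). For example, the compatibility forms expand the `fi` ligature, while the canonical forms leave it alone:

```python
import unicodedata

s = "\ufb01le"  # begins with the single-character "fi" ligature, U+FB01

print(unicodedata.normalize("NFC", s))   # canonical form: ligature kept
print(unicodedata.normalize("NFKC", s))  # compatibility form: expands to "file"
```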
By default, invalid metadata values raise only a warning. If that is not desired, pass `strict_metadata=True` to the `open` method and `pdfplumber.open` will raise an exception if it is unable to parse the metadata.

@@ -260,4 +260,25 @@

[To be completed.]
*Note: Although the positioning and characteristics of `image` objects are available via `pdfplumber`, this library does not provide direct support for reconstructing image content. For that, please see [this suggestion](https://github.com/jsvine/pdfplumber/discussions/496#discussioncomment-1259772).*
| Property | Description |
|----------|-------------|
|`page_number`| Page number on which the image was found.|
|`height`| Height of the image.|
|`width`| Width of the image.|
|`x0`| Distance of left side of the image from left side of page.|
|`x1`| Distance of right side of the image from left side of page.|
|`y0`| Distance of bottom of the image from bottom of page.|
|`y1`| Distance of top of the image from bottom of page.|
|`top`| Distance of top of the image from top of page.|
|`bottom`| Distance of bottom of the image from top of page.|
|`doctop`| Distance of top of the image from top of document.|
|`srcsize`| The image's original dimensions, as a `(width, height)` tuple.|
|`colorspace`| Color domain of the image (e.g., RGB).|
|`bits`| The number of bits per color component; e.g., 8 corresponds to 255 possible values for each color component (R, G, and B in an RGB color space).|
|`stream`| Pixel values of the image, as a `pdfminer.pdftypes.PDFStream` object.|
|`imagemask`| A nullable boolean; if `True`, "specifies that the image data is to be used as a stencil mask for painting in the current color."|
|`mcid`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section ID for this image if any (otherwise `None`). *Experimental attribute.*|
|`tag`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section tag for this image if any (otherwise `None`). *Experimental attribute.*|
|`object_type`| "image"|
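As a small worked example of how these properties relate, comparing `srcsize` against the placed `width`/`height` yields the image's effective scale on the page. The dict below is hand-made, with illustrative values shaped like the table above:

```python
# Hypothetical image object: a 400x200-pixel source placed into a
# 200x100-point box on the page.
image = {
    "x0": 72, "x1": 272, "top": 72, "bottom": 172,
    "width": 200, "height": 100,
    "srcsize": (400, 200),
}
scale_x = image["srcsize"][0] / image["width"]
scale_y = image["srcsize"][1] / image["height"]
print(scale_x, scale_y)  # → 2.0 2.0
```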
### Obtaining higher-level layout objects via `pdfminer.six`

@@ -332,3 +353,3 @@

|`.extract_text_simple(x_tolerance=3, y_tolerance=3)`| A slightly faster but less flexible version of `.extract_text(...)`, using a simpler logic.|
|`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True, return_chars=False)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are `"ttb"` (top-to-bottom), `"btt"` (bottom-to-top), `"ltr"` (left-to-right), and `"rtl"` (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. Setting `split_at_punctuation` to `True` will enforce breaking tokens at the punctuation specified by `string.punctuation`; or you can specify the separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`). Passing `return_chars=True` will add to each word dictionary its constituent characters, as a list under the `"chars"` key.|
|`.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs)`|*Experimental feature* that returns a list of dictionaries representing the lines of text on the page. The `strip` parameter works analogously to Python's `str.strip()` method, and returns `text` attributes without their surrounding whitespace. (Only relevant when `layout = True`.) Setting `return_chars` to `False` will exclude the individual character objects from the returned text-line dicts. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.|

@@ -354,3 +375,3 @@ |`.search(pattern, regex=True, case=True, main_group=0, return_groups=True, return_chars=True, layout=False, **kwargs)`|*Experimental feature* that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the response dictionary object contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner. Setting `main_group` restricts the results to a specific regex group within the `pattern` (default of `0` means the entire match). Setting `return_groups` and/or `return_chars` to `False` will exclude the lists of the matched regex groups and/or characters from being added (as `"groups"` and `"chars"` to the return dicts). The `layout` parameter operates as it does for `.extract_text(...)`. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`. __Note__: Zero-width and all-whitespace matches are discarded, because they (generally) have no explicit position on the page. |

|`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, `.columns`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.|
|`.find_table(table_settings={})`|Similar to `.find_tables(...)`, but returns the *largest* table on the page, as a `Table` object. If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.|

@@ -552,2 +573,4 @@ |`.extract_tables(table_settings={})`|Returns the text extracted from *all* tables found on the page, represented as a list of lists of lists, with the structure `table -> row -> cell`.|

- [Quentin André](https://github.com/QuentinAndre11)
- [Léo Roux](https://github.com/leorouxx)
- [@wodny](https://github.com/wodny)

@@ -554,0 +577,0 @@ ## Contributing

@@ -1,8 +0,8 @@

black==24.8.0
flake8==7.1.1
isort==5.13.2
jupyterlab==3.6.7
mypy==1.11.1
nbexec==0.2.0
pandas-stubs==2.2.2.240805
pandas==2.2.2

@@ -12,4 +12,4 @@ py==1.11.0

pytest-parallel==0.1.1
pytest==8.3.2
setuptools==68.2.2
types-Pillow==10.2.0.20240520

@@ -6,2 +6,3 @@ [flake8]

W503
E704

@@ -8,0 +9,0 @@ [tool:pytest]

@@ -62,2 +62,16 @@ #!/usr/bin/env python

def test_annots_cropped(self):
    pdf = self.pdf_2
    page = pdf.pages[0]
    assert len(page.annots) == 13
    assert len(page.hyperlinks) == 1

    cropped = page.crop(page.bbox)
    assert len(cropped.annots) == 13
    assert len(cropped.hyperlinks) == 1

    h0_bbox = pdfplumber.utils.obj_to_bbox(page.hyperlinks[0])
    cropped = page.crop(h0_bbox)
    assert len(cropped.annots) == len(cropped.hyperlinks) == 1
def test_annots_rotated(self):

@@ -182,2 +196,15 @@ def get_annot(filename, n=0):

def test_unicode_normalization(self):
    path = os.path.join(HERE, "pdfs/issue-905.pdf")
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[0]
        print(page.extract_text())
        assert ord(page.chars[0]["text"]) == 894
    with pdfplumber.open(path, unicode_norm="NFC") as pdf:
        page = pdf.pages[0]
        assert ord(page.chars[0]["text"]) == 59
        assert page.extract_text() == ";;"
def test_colors(self):

@@ -184,0 +211,0 @@ rect = self.pdf.pages[0].rects[0]

@@ -30,3 +30,6 @@ #!/usr/bin/env python

assert (
    last_line_without_drop
    == "微微软软 培培训训课课程程:: 名名模模意意义义一一些些有有意意义义一一些些"
)
assert last_line_with_drop == "微软 培训课程: 名模意义一些有意义一些"

@@ -50,3 +53,6 @@

assert last_words_without_drop["upright"] == 1
assert (
    last_words_without_drop["text"]
    == "名名模模意意义义一一些些有有意意义义一一些些"
)

@@ -65,3 +71,6 @@ assert round(last_words_with_drop["x0"], 3) == x0

assert (
    last_line_without_drop
    == "微微软软 培培训训课课程程:: 名名模模意意义义一一些些有有意意义义一一些些"
)
assert last_line_with_drop == "微软 培训课程: 名模意义一些有意义一些"

@@ -68,0 +77,0 @@

@@ -313,1 +313,23 @@ #!/usr/bin/env python

assert page.extract_text()
def test_issue_1181(self):
    """
    Correctly re-calculate coordinates when MediaBox does not start at (0,0)
    """
    path = os.path.join(HERE, "pdfs/issue-1181.pdf")
    with pdfplumber.open(path) as pdf:
        p0, p1 = pdf.pages
        assert p0.crop(p0.bbox).extract_table() == [
            ["FooCol1", "FooCol2", "FooCol3"],
            ["Foo4", "Foo5", "Foo6"],
            ["Foo7", "Foo8", "Foo9"],
            ["Foo10", "Foo11", "Foo12"],
            ["", "", ""],
        ]
        assert p1.crop(p1.bbox).extract_table() == [
            ["BarCol1", "BarCol2", "BarCol3"],
            ["Bar4", "Bar5", "Bar6"],
            ["Bar7", "Bar8", "Bar9"],
            ["Bar10", "Bar11", "Bar12"],
            ["", "", ""],
        ]

@@ -56,2 +56,14 @@ #!/usr/bin/env python

def test_repair_setting(self):
    path = os.path.join(HERE, "pdfs/malformed-from-issue-932.pdf")
    with tempfile.NamedTemporaryFile("wb") as out:
        pdfplumber.repair(path, outfile=out.name)
        size_default = os.stat(out.name).st_size
    with tempfile.NamedTemporaryFile("wb") as out:
        pdfplumber.repair(path, outfile=out.name, setting="prepress")
        size_prepress = os.stat(out.name).st_size
    assert size_default > size_prepress
def test_repair_password(self):

@@ -58,0 +70,0 @@ path = os.path.join(HERE, "pdfs/password-example.pdf")

@@ -76,2 +76,28 @@ #!/usr/bin/env python

def test_rows_and_columns(self):
    path = os.path.join(HERE, "pdfs/issue-140-example.pdf")
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[0]
        table = page.find_table()
        row = [page.crop(bbox).extract_text() for bbox in table.rows[0].cells]
        assert row == [
            "Line no",
            "UPC code",
            "Location",
            "Item Description",
            "Item Quantity",
            "Bill Amount",
            "Accrued Amount",
            "Handling Rate",
            "PO number",
        ]
        col = [page.crop(bbox).extract_text() for bbox in table.columns[1].cells]
        assert col == [
            "UPC code",
            "0085648100305",
            "0085648100380",
            "0085648100303",
            "0085648100300",
        ]
def test_explicit_desc_decimalization(self):

@@ -78,0 +104,0 @@ """

@@ -102,2 +102,14 @@ #!/usr/bin/env python

def test_extract_words_return_chars(self):
    path = os.path.join(HERE, "pdfs/extra-attrs-example.pdf")
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[0]
        words = page.extract_words()
        assert "chars" not in words[0]
        words = page.extract_words(return_chars=True)
        assert "chars" in words[0]
        assert "".join(c["text"] for c in words[0]["chars"]) == words[0]["text"]
def test_text_rotation(self):

@@ -104,0 +116,0 @@ rotations = {