pdfplumber
@@ -5,2 +5,23 @@ # Changelog
## [0.11.3] - 2024-08-07
### Added
- Add `Table.columns`, analogous to `Table.rows` (h/t @Pk13055). ([#1050](https://github.com/jsvine/pdfplumber/issues/1050) + [d39302f](https://github.com/jsvine/pdfplumber/commit/d39302f))
- Add `Page.extract_words(return_chars=True)`, mirroring `Page.search(..., return_chars=True)`; if this argument is passed, each word dictionary will include an additional key-value pair: `"chars": [char_object, ...]` (h/t @cmdlineluser). ([#1173](https://github.com/jsvine/pdfplumber/issues/1173) + [1496cbd](https://github.com/jsvine/pdfplumber/commit/1496cbd))
- Add `pdfplumber.open(unicode_norm="NFC"/"NFD"/"NFKC"/"NFKD")`, where the values are the [four options for Unicode normalization](https://unicode.org/reports/tr15/#Normalization_Forms_Table) (h/t @petermr + @agusluques). ([#905](https://github.com/jsvine/pdfplumber/issues/905) + [03a477f](https://github.com/jsvine/pdfplumber/commit/03a477f))
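These normalization forms come straight from Python's `unicodedata` module; a stdlib-only illustration of what the four options do to text (the sample strings below are made up, not pdfplumber output):

```python
from unicodedata import normalize

decomposed = "e\u0301"   # "e" + combining acute accent (two code points)
ligature = "\ufb01"      # the single-glyph ligature "fi" (U+FB01)

# NFC composes; NFD decomposes
assert normalize("NFC", decomposed) == "\u00e9"  # one precomposed "é"
assert len(normalize("NFD", "\u00e9")) == 2      # back to two code points

# The "K" (compatibility) forms additionally expand ligatures and similar glyphs
assert normalize("NFKC", ligature) == "fi"
```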
### Changed
- Change the default setting that `pdfplumber.repair(...)` passes to Ghostscript's `-dPDFSETTINGS` parameter from `prepress` to `default`, and make that setting modifiable via `.repair(setting=...)`, where the value is one of `"default"`, `"prepress"`, `"printer"`, or `"ebook"` (h/t @Laubeee). ([#874](https://github.com/jsvine/pdfplumber/issues/874) + [48cab3f](https://github.com/jsvine/pdfplumber/commit/48cab3f))
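The new setting ends up interpolated into Ghostscript's `-dPDFSETTINGS` flag; a minimal sketch of that flag construction (the `gs_args` helper is ours, and pdfplumber's real invocation includes more arguments):

```python
# Values accepted by Ghostscript's -dPDFSETTINGS distiller parameter
VALID_SETTINGS = {"default", "prepress", "printer", "ebook", "screen"}

def gs_args(setting: str = "default") -> list:
    # Build a minimal Ghostscript command line; the actual repair call
    # adds input/output paths and other flags.
    if setting not in VALID_SETTINGS:
        raise ValueError(f"unsupported -dPDFSETTINGS value: {setting}")
    return ["gs", "-sDEVICE=pdfwrite", f"-dPDFSETTINGS=/{setting}"]
```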
### Fixed
- Fix handling of object coordinates when `mediabox` does not begin at `(0,0)` (h/t @wodny). ([#1181](https://github.com/jsvine/pdfplumber/issues/1181) + [9025c3f](https://github.com/jsvine/pdfplumber/commit/9025c3f) + [046bd87](https://github.com/jsvine/pdfplumber/commit/046bd87))
- Fix error on getting `.annots`/`.hyperlinks` from `CroppedPage` (due to missing `.rotation` and `.initial_doctop` attributes) (h/t @Safrone). ([#1171](https://github.com/jsvine/pdfplumber/issues/1171) + [e5737d2](https://github.com/jsvine/pdfplumber/commit/e5737d2))
- Fix problem where `Page.crop(...)` was not cropping `.annots`/`.hyperlinks` (h/t @Safrone). ([#1171](https://github.com/jsvine/pdfplumber/issues/1171) + [22494e8](https://github.com/jsvine/pdfplumber/commit/22494e8))
- Fix calculation of coordinates for `.annots` on `CroppedPage`s. ([0bbb340](https://github.com/jsvine/pdfplumber/commit/0bbb340) + [b16acc3](https://github.com/jsvine/pdfplumber/commit/b16acc3))
- Dereference structure element attributes (h/t @dhdaines). ([#1169](https://github.com/jsvine/pdfplumber/pull/1169) + [3f16180](https://github.com/jsvine/pdfplumber/commit/3f16180))
- Fix `Page.get_attr(...)` so that it fully resolves references before determining whether the attribute's value is `None` (h/t @zzhangyun + @mkl-public). ([#1176](https://github.com/jsvine/pdfplumber/issues/1176) + [c20cd3b](https://github.com/jsvine/pdfplumber/commit/c20cd3b))

## [0.11.2] - 2024-07-06
@@ -7,0 +28,0 @@
Metadata-Version: 2.1
Name: pdfplumber
Version: 0.11.2
Version: 0.11.3
Summary: Plumb a PDF for detailed information about each char, rectangle, and line.
@@ -36,4 +36,2 @@ Home-page: https://github.com/jsvine/pdfplumber
> 👋 This repository’s maintainers are available to hire for PDF data-extraction consulting projects. To get a cost estimate, contact [Jeremy](https://www.jsvine.com/consulting/pdf-data-extraction/) (for projects of any size or complexity) and/or [Samkit](https://www.linkedin.com/in/samkit-jain/) (specifically for table extraction).

## Table of Contents
@@ -106,2 +104,4 @@
To [pre-normalize Unicode text](https://unicode.org/reports/tr15/), pass `unicode_norm=...`, where `...` is one of the [four Unicode normalization forms](https://unicode.org/reports/tr15/#Normalization_Forms_Table): `"NFC"`, `"NFD"`, `"NFKC"`, or `"NFKD"`.

Invalid metadata values are treated as a warning by default. If that is not intended, pass `strict_metadata=True` to the `open` method and `pdfplumber.open` will raise an exception if it is unable to parse the metadata.
@@ -281,4 +281,25 @@
[To be completed.]
*Note: Although the positioning and characteristics of `image` objects are available via `pdfplumber`, this library does not provide direct support for reconstructing image content. For that, please see [this suggestion](https://github.com/jsvine/pdfplumber/discussions/496#discussioncomment-1259772).*

| Property | Description |
|----------|-------------|
|`page_number`| Page number on which the image was found.|
|`height`| Height of the image.|
|`width`| Width of the image.|
|`x0`| Distance of left side of the image from left side of page.|
|`x1`| Distance of right side of the image from left side of page.|
|`y0`| Distance of bottom of the image from bottom of page.|
|`y1`| Distance of top of the image from bottom of page.|
|`top`| Distance of top of the image from top of page.|
|`bottom`| Distance of bottom of the image from top of page.|
|`doctop`| Distance of top of the image from top of document.|
|`srcsize`| The image's original dimensions, as a `(width, height)` tuple.|
|`colorspace`| Color domain of the image (e.g., RGB).|
|`bits`| The number of bits per color component; e.g., 8 corresponds to 256 possible values for each color component (R, G, and B in an RGB color space).|
|`stream`| Pixel values of the image, as a `pdfminer.pdftypes.PDFStream` object.|
|`imagemask`| A nullable boolean; if `True`, "specifies that the image data is to be used as a stencil mask for painting in the current color."|
|`mcid`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section ID for this image, if any (otherwise `None`). *Experimental attribute.*|
|`tag`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section tag for this image, if any (otherwise `None`). *Experimental attribute.*|
|`object_type`| "image"|
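Since `y0`/`y1` are measured from the page *bottom* while `top`/`bottom` are measured from the page *top*, the two sets of properties are related through the page height; a small worked example with made-up numbers:

```python
page_height = 792    # e.g., a US-letter page in points (assumed value)
y0, y1 = 500, 600    # image bottom and top, measured up from the page bottom

top = page_height - y1       # distance from the page top: 192
bottom = page_height - y0    # 292

# Both differences equal the image height
assert bottom - top == y1 - y0  # 100
```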
### Obtaining higher-level layout objects via `pdfminer.six`
@@ -353,3 +374,3 @@
|`.extract_text_simple(x_tolerance=3, y_tolerance=3)`| A slightly faster but less flexible version of `.extract_text(...)`, using simpler logic.|
|`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are "ttb" (top-to-bottom), "btt" (bottom-to-top), "ltr" (left-to-right), and "rtl" (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. Setting `split_at_punctuation` to `True` will enforce breaking tokens at the punctuation specified by `string.punctuation`; or you can specify the separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`).|
|`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True, return_chars=False)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are "ttb" (top-to-bottom), "btt" (bottom-to-top), "ltr" (left-to-right), and "rtl" (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. Setting `split_at_punctuation` to `True` will enforce breaking tokens at the punctuation specified by `string.punctuation`; or you can specify the separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`). Passing `return_chars=True` will add each word's constituent characters to its dictionary, as a list under the `"chars"` key.|
|`.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs)`|*Experimental feature* that returns a list of dictionaries representing the lines of text on the page. The `strip` parameter works analogously to Python's `str.strip()` method, and returns `text` attributes without their surrounding whitespace. (Only relevant when `layout=True`.) Setting `return_chars` to `False` will exclude the individual character objects from the returned text-line dicts. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.|
@@ -375,3 +396,3 @@
|`.search(pattern, regex=True, case=True, main_group=0, return_groups=True, return_chars=True, layout=False, **kwargs)`|*Experimental feature* that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the response dictionary contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner. Setting `main_group` restricts the results to a specific regex group within the `pattern` (the default of `0` means the entire match). Setting `return_groups` and/or `return_chars` to `False` will exclude the lists of the matched regex groups and/or characters from being added (as `"groups"` and `"chars"`) to the return dicts. The `layout` parameter operates as it does for `.extract_text(...)`. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`. __Note__: Zero-width and all-whitespace matches are discarded, because they (generally) have no explicit position on the page.|
|--------|-------------|
|`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.|
|`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, `.columns`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.|
|`.find_table(table_settings={})`|Similar to `.find_tables(...)`, but returns the *largest* table on the page, as a `Table` object. If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.|
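The cell-grouping behind `Table.rows`/`Table.columns` can be sketched in isolation: group cell bounding boxes by one axis and key them by the other, padding missing cells with `None`. The `group_cells` helper below is our simplified stand-in, not pdfplumber's API (cells are `(x0, top, x1, bottom)` tuples):

```python
from itertools import groupby
from operator import itemgetter

def group_cells(cells, axis):
    """axis=0 groups into rows (keyed by x0, grouped by top);
    axis=1 groups into columns (keyed by top, grouped by x0).
    Missing cells are padded with None."""
    antiaxis = 1 - axis
    ordered = sorted(cells, key=itemgetter(antiaxis, axis))
    keys = sorted(set(map(itemgetter(axis), cells)))
    groups = []
    for _, group in groupby(ordered, itemgetter(antiaxis)):
        by_key = {cell[axis]: cell for cell in group}
        groups.append([by_key.get(k) for k in keys])
    return groups

# A 2x2 grid with one missing cell
cells = [(0, 0, 50, 20), (50, 0, 100, 20), (0, 20, 50, 40)]
rows = group_cells(cells, axis=0)  # second row is padded with None
cols = group_cells(cells, axis=1)  # second column is padded with None
```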
@@ -573,2 +594,4 @@
|`.extract_tables(table_settings={})`|Returns the text extracted from *all* tables found on the page, represented as a list of lists of lists, with the structure `table -> row -> cell`.|
- [Quentin André](https://github.com/QuentinAndre11)
- [Léo Roux](https://github.com/leorouxx)
- [@wodny](https://github.com/wodny)
@@ -575,0 +598,0 @@ ## Contributing
@@ -1,2 +0,2 @@
version_info = (0, 11, 2)
version_info = (0, 11, 3)
__version__ = ".".join(map(str, version_info))
@@ -16,11 +16,13 @@ import csv
    @property
    def pages(self) -> Optional[List[Any]]:
        ...  # pragma: nocover
    def pages(self) -> Optional[List[Any]]:  # pragma: nocover
        raise NotImplementedError

    @property
    def objects(self) -> Dict[str, T_obj_list]:
        ...  # pragma: nocover
    def objects(self) -> Dict[str, T_obj_list]:  # pragma: nocover
        raise NotImplementedError

    def to_dict(self, object_types: Optional[List[str]] = None) -> Dict[str, Any]:
        ...  # pragma: nocover
    def to_dict(
        self, object_types: Optional[List[str]] = None
    ) -> Dict[str, Any]:  # pragma: nocover
        raise NotImplementedError
@@ -27,0 +29,0 @@ def flush_cache(self, properties: Optional[List[str]] = None) -> None:
@@ -15,2 +15,3 @@ import re
)
from unicodedata import normalize as normalize_unicode
@@ -220,4 +221,4 @@ from pdfminer.converter import PDFPageAggregator
        def get_attr(key: str, default: Any = None) -> Any:
            ref = page_obj.attrs.get(key)
            return default if ref is None else resolve_all(ref)
            value = resolve_all(page_obj.attrs.get(key))
            return default if value is None else value
@@ -296,3 +297,4 @@ # Per PDF Reference Table 3.27: "The number of degrees by which the
            pt1 = rotate_point((_c, _d), self.rotation)
            x0, top, x1, bottom = _invert_box(_normalize_box((*pt0, *pt1)), self.height)
            rh = self.root_page.height
            x0, top, x1, bottom = _invert_box(_normalize_box((*pt0, *pt1)), rh)
@@ -316,5 +318,5 @@ a = annot.get("A", {})
                "x0": x0,
                "y0": self.height - bottom,
                "y0": rh - bottom,
                "x1": x1,
                "y1": self.height - top,
                "y1": rh - top,
                "doctop": self.initial_doctop + top,
@@ -335,3 +337,7 @@ "top": top,
        raw = resolve_all(self.page_obj.annots) or []
        return list(map(parse, raw))
        parsed = list(map(parse, raw))
        if isinstance(self, CroppedPage):
            return self._crop_fn(parsed)
        else:
            return parsed
@@ -350,3 +356,4 @@ @property
    def point2coord(self, pt: Tuple[T_num, T_num]) -> Tuple[T_num, T_num]:
        return (pt[0], self.height - pt[1])
        # See note below re. #1181 and mediabox-adjustment reversions
        return (self.mediabox[0] + pt[0], self.mediabox[1] + self.height - pt[1])
@@ -386,3 +393,8 @@ def process_object(self, obj: LTItem) -> T_obj:
        if isinstance(obj, (LTChar, LTTextContainer)):
            attr["text"] = obj.get_text()
            text = obj.get_text()
            attr["text"] = (
                normalize_unicode(self.pdf.unicode_norm, text)
                if self.pdf.unicode_norm is not None
                else text
            )
@@ -414,7 +426,16 @@ if isinstance(obj, LTChar):
        # As noted in #1181, `pdfminer.six` adjusts objects'
        # coordinates relative to the MediaBox:
        # https://github.com/pdfminer/pdfminer.six/blob/1a8bd2f730295b31d6165e4d95fcb5a03793c978/pdfminer/converter.py#L79-L84
        mb_x0, mb_top = self.mediabox[:2]
        if "y0" in attr:
            attr["top"] = self.height - attr["y1"]
            attr["bottom"] = self.height - attr["y0"]
            attr["top"] = (self.height - attr["y1"]) + mb_top
            attr["bottom"] = (self.height - attr["y0"]) + mb_top
            attr["doctop"] = self.initial_doctop + attr["top"]
        if "x0" in attr and mb_x0 != 0:
            attr["x0"] = attr["x0"] + mb_x0
            attr["x1"] = attr["x1"] + mb_x0
        return attr
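The mediabox fix above amounts to re-offsetting coordinates by the MediaBox origin after converting from bottom-up `y0`/`y1` values; the arithmetic in isolation (all numbers here are made up for illustration):

```python
mb_x0, mb_top = 20, 30    # a MediaBox that does not start at (0, 0)
page_height = 700
y0, y1 = 100, 150         # pdfminer-adjusted object coordinates
x0, x1 = 5, 45

# Convert to top-down coordinates, then re-apply the MediaBox offset
top = (page_height - y1) + mb_top       # 580
bottom = (page_height - y0) + mb_top    # 630
if mb_x0 != 0:
    x0, x1 = x0 + mb_x0, x1 + mb_x0     # 25, 65
```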
@@ -644,2 +665,4 @@
        self.page_number = parent_page.page_number
        self.initial_doctop = parent_page.initial_doctop
        self.rotation = parent_page.rotation
        self.mediabox = parent_page.mediabox
@@ -646,0 +669,0 @@ self.cropbox = parent_page.cropbox
@@ -6,3 +6,3 @@ import itertools
from types import TracebackType
from typing import Any, Dict, List, Optional, Tuple, Type, Union
from typing import Any, Dict, List, Literal, Optional, Tuple, Type, Union
@@ -19,3 +19,3 @@ from pdfminer.layout import LAParams
from .page import Page
from .repair import _repair
from .repair import T_repair_setting, _repair
from .structure import PDFStructTree, StructTreeMissing
@@ -39,2 +39,3 @@ from .utils import resolve_and_decode
        strict_metadata: bool = False,
        unicode_norm: Optional[Literal["NFC", "NFKC", "NFD", "NFKD"]] = None,
    ):
@@ -47,2 +48,3 @@ self.stream = stream
        self.password = password
        self.unicode_norm = unicode_norm
@@ -77,4 +79,6 @@ self.doc = PDFDocument(PDFParser(stream), password=password or "")
        strict_metadata: bool = False,
        unicode_norm: Optional[Literal["NFC", "NFKC", "NFD", "NFKD"]] = None,
        repair: bool = False,
        gs_path: Optional[Union[str, pathlib.Path]] = None,
        repair_setting: T_repair_setting = "default",
    ) -> "PDF":
@@ -85,3 +89,5 @@
        if repair:
            stream = _repair(path_or_fp, password=password, gs_path=gs_path)
            stream = _repair(
                path_or_fp, password=password, gs_path=gs_path, setting=repair_setting
            )
            stream_is_external = False
@@ -108,2 +114,3 @@ # Although the original file has a path,
            strict_metadata=strict_metadata,
            unicode_norm=unicode_norm,
            stream_is_external=stream_is_external,
@@ -110,0 +117,0 @@ )
@@ -5,5 +5,7 @@ import pathlib
from io import BufferedReader, BytesIO
from typing import Optional, Union
from typing import Literal, Optional, Union

T_repair_setting = Literal["default", "prepress", "printer", "ebook", "screen"]

def _repair(
@@ -13,2 +15,3 @@ path_or_fp: Union[str, pathlib.Path, BufferedReader, BytesIO],
    gs_path: Optional[Union[str, pathlib.Path]] = None,
    setting: T_repair_setting = "default",
) -> BytesIO:
@@ -34,3 +37,3 @@
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=/prepress",
        f"-dPDFSETTINGS=/{setting}",
    ]
@@ -68,4 +71,5 @@
    gs_path: Optional[Union[str, pathlib.Path]] = None,
    setting: T_repair_setting = "default",
) -> Optional[BytesIO]:
    repaired = _repair(path_or_fp, password, gs_path=gs_path)
    repaired = _repair(path_or_fp, password, gs_path=gs_path, setting=setting)
    if outfile:
@@ -72,0 +76,0 @@ with open(outfile, "wb") as f:
@@ -288,7 +288,9 @@ import itertools
        attributes = self._make_attributes(obj, revision)
        element_id = decode_text(obj["ID"]) if "ID" in obj else None
        title = decode_text(obj["T"]) if "T" in obj else None
        lang = decode_text(obj["Lang"]) if "Lang" in obj else None
        alt_text = decode_text(obj["Alt"]) if "Alt" in obj else None
        actual_text = decode_text(obj["ActualText"]) if "ActualText" in obj else None
        element_id = decode_text(resolve1(obj["ID"])) if "ID" in obj else None
        title = decode_text(resolve1(obj["T"])) if "T" in obj else None
        lang = decode_text(resolve1(obj["Lang"])) if "Lang" in obj else None
        alt_text = decode_text(resolve1(obj["Alt"])) if "Alt" in obj else None
        actual_text = (
            decode_text(resolve1(obj["ActualText"])) if "ActualText" in obj else None
        )
        element = PDFStructElement(
@@ -295,0 +297,0 @@ type=obj_tag,
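The point of this change is that structure-element attribute values may be *indirect references* that must be resolved before they are decoded; a self-contained sketch of the bug class (the `Ref` class and `resolve1` function below are stand-ins, not pdfminer's actual implementations, which live in `pdfminer.pdftypes`):

```python
class Ref:
    """Stand-in for a PDF indirect object reference."""
    def __init__(self, target):
        self.target = target

def resolve1(x):
    # Follow one level of indirection, if any
    return x.target if isinstance(x, Ref) else x

# One attribute stored indirectly, one stored directly
obj = {"T": Ref(b"Title"), "Lang": b"en"}

# Resolving *before* decoding handles both cases; calling .decode()
# directly on obj["T"] would fail on the reference
title = resolve1(obj["T"]).decode() if "T" in obj else None
lang = resolve1(obj["Lang"]).decode() if "Lang" in obj else None
```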
@@ -373,2 +373,6 @@ import itertools
class Column(CellGroup):
    pass


class Table(object):
@@ -389,13 +393,31 @@ def __init__(self, page: "Page", cells: List[T_bbox]):
    @property
    def rows(self) -> List[Row]:
        _sorted = sorted(self.cells, key=itemgetter(1, 0))
        xs = list(sorted(set(map(itemgetter(0), self.cells))))
    def _get_rows_or_cols(self, kind: type[CellGroup]) -> List[CellGroup]:
        axis = 0 if kind is Row else 1
        antiaxis = int(not axis)

        # Sort first by top/x0, then by x0/top
        _sorted = sorted(self.cells, key=itemgetter(antiaxis, axis))

        # Get all x0s/tops
        xs = list(sorted(set(map(itemgetter(axis), self.cells))))

        # Group by top/x0
        grouped = itertools.groupby(_sorted, itemgetter(antiaxis))

        rows = []
        for y, row_cells in itertools.groupby(_sorted, itemgetter(1)):
            xdict = {cell[0]: cell for cell in row_cells}
            row = Row([xdict.get(x) for x in xs])
        # For each top/x0 value, gather that row's/column's cells
        for y, row_cells in grouped:
            xdict = {cell[axis]: cell for cell in row_cells}
            row = kind([xdict.get(x) for x in xs])
            rows.append(row)
        return rows

    @property
    def rows(self) -> List[CellGroup]:
        return self._get_rows_or_cols(Row)

    @property
    def columns(self) -> List[CellGroup]:
        return self._get_rows_or_cols(Column)

    def extract(self, **kwargs: Any) -> List[List[Optional[str]]]:
@@ -484,3 +506,3 @@
    def __post_init__(self) -> "TableSettings":
    def __post_init__(self) -> None:
        """Clean up user-provided table settings.
@@ -536,4 +558,2 @@
        return self

    @classmethod
@@ -540,0 +560,0 @@ def resolve(cls, settings: Optional[T_table_settings]) -> "TableSettings":
import itertools
from collections.abc import Hashable
from operator import itemgetter
from typing import Callable, Dict, Iterable, List, TypeVar, Union
from typing import Any, Callable, Dict, Iterable, List, Tuple, TypeVar, Union

from .._typing import T_num
from .._typing import T_num, T_obj
@@ -39,11 +39,11 @@
R = TypeVar("R")
Clusterable = TypeVar("Clusterable", T_obj, Tuple[Any, ...])


def cluster_objects(
    xs: List[R],
    key_fn: Union[Hashable, Callable[[R], T_num]],
    xs: List[Clusterable],
    key_fn: Union[Hashable, Callable[[Clusterable], T_num]],
    tolerance: T_num,
    preserve_order: bool = False,
) -> List[List[R]]:
) -> List[List[Clusterable]]:
@@ -50,0 +50,0 @@ if not callable(key_fn):
@@ -33,3 +33,4 @@ import itertools
    """
    return bbox_getter(obj)
    bbox: T_bbox = bbox_getter(obj)
    return bbox
@@ -36,0 +37,0 @@
@@ -683,8 +683,18 @@ import inspect
    def extract_words(self, chars: T_obj_list) -> T_obj_list:
        return list(word for word, word_chars in self.iter_extract_tuples(chars))
    def extract_words(
        self, chars: T_obj_list, return_chars: bool = False
    ) -> T_obj_list:
        if return_chars:
            return list(
                {**word, "chars": word_chars}
                for word, word_chars in self.iter_extract_tuples(chars)
            )
        else:
            return list(word for word, word_chars in self.iter_extract_tuples(chars))


def extract_words(chars: T_obj_list, **kwargs: Any) -> T_obj_list:
    return WordExtractor(**kwargs).extract_words(chars)
def extract_words(
    chars: T_obj_list, return_chars: bool = False, **kwargs: Any
) -> T_obj_list:
    return WordExtractor(**kwargs).extract_words(chars, return_chars)
@@ -691,0 +701,0 @@
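The `return_chars=True` path boils down to merging each word's character list into the word dict via `{**word, "chars": word_chars}`; the same pattern in isolation, with made-up word data:

```python
word = {"text": "plumb", "x0": 10, "x1": 52}
word_chars = [{"text": c} for c in word["text"]]

# {**word, "chars": word_chars} copies the word dict and adds one key,
# leaving the original dict untouched
with_chars = {**word, "chars": word_chars}
```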
| |--------|-------------| | ||
| |`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.| | ||
| |`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, `.columns`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.| | ||
| |`.find_table(table_settings={})`|Similar to `.find_tables(...)`, but returns the *largest* table on the page, as a `Table` object. If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.| | ||
@@ -573,2 +594,4 @@ |`.extract_tables(table_settings={})`|Returns the text extracted from *all* tables found on the page, represented as a list of lists of lists, with the structure `table -> row -> cell`.| | ||
| - [Quentin André](https://github.com/QuentinAndre11) | ||
| - [Léo Roux](https://github.com/leorouxx) | ||
| - [@wodny](https://github.com/wodny) | ||
@@ -575,0 +598,0 @@ ## Contributing |
+28
-5
@@ -15,4 +15,2 @@ # pdfplumber | ||
| > 👋 This repository’s maintainers are available to hire for PDF data-extraction consulting projects. To get a cost estimate, contact [Jeremy](https://www.jsvine.com/consulting/pdf-data-extraction/) (for projects of any size or complexity) and/or [Samkit](https://www.linkedin.com/in/samkit-jain/) (specifically for table extraction). | ||
| ## Table of Contents | ||
@@ -85,2 +83,4 @@ | ||
| To [pre-normalize Unicode text](https://unicode.org/reports/tr15/), pass `unicode_norm=...`, where `...` is one of the [four Unicode normalization forms](https://unicode.org/reports/tr15/#Normalization_Forms_Table): `"NFC"`, `"NFD"`, `"NFKC"`, or `"NFKD"`. | ||
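The practical difference between the four forms can be seen with the standard library's `unicodedata.normalize`, which implements the same normalization: the compatibility forms (NFKC/NFKD) expand compatibility characters such as the "fi" ligature (U+FB01), while the canonical forms (NFC/NFD) leave them intact. A minimal illustration:

```python
import unicodedata

ligature = "\ufb01le"  # "file" spelled with the U+FB01 "fi" ligature

# Canonical composition: the ligature has no canonical decomposition,
# so NFC leaves the string unchanged.
print(unicodedata.normalize("NFC", ligature))   # "ﬁle"

# Compatibility composition: NFKC expands the ligature to plain "fi".
print(unicodedata.normalize("NFKC", ligature))  # "file"
```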
| Invalid metadata values are treated as a warning by default. If that is not intended, pass `strict_metadata=True` to the `open` method and `pdfplumber.open` will raise an exception if it is unable to parse the metadata. | ||
@@ -260,4 +260,25 @@ | ||
| [To be completed.] | ||
| *Note: Although the positioning and characteristics of `image` objects are available via `pdfplumber`, this library does not provide direct support for reconstructing image content. For that, please see [this suggestion](https://github.com/jsvine/pdfplumber/discussions/496#discussioncomment-1259772).* | ||
| | Property | Description | | ||
| |----------|-------------| | ||
| |`page_number`| Page number on which the image was found.| | ||
| |`height`| Height of the image.| | ||
| |`width`| Width of the image.| | ||
| |`x0`| Distance of left side of the image from left side of page.| | ||
| |`x1`| Distance of right side of the image from left side of page.| | ||
| |`y0`| Distance of bottom of the image from bottom of page.| | ||
| |`y1`| Distance of top of the image from bottom of page.| | ||
| |`top`| Distance of top of the image from top of page.| | ||
| |`bottom`| Distance of bottom of the image from top of page.| | ||
| |`doctop`| Distance of top of rectangle from top of document.| | ||
| |`srcsize`| The image's original dimensions, as a `(width, height)` tuple.| | ||
| |`colorspace`| Color domain of the image (e.g., RGB).| | ||
| |`bits`| The number of bits per color component; e.g., 8 corresponds to 256 possible values (0 to 255) for each color component (R, G, and B in an RGB color space).| | ||
| |`stream`| Pixel values of the image, as a `pdfminer.pdftypes.PDFStream` object.| | ||
| |`imagemask`| A nullable boolean; if `True`, "specifies that the image data is to be used as a stencil mask for painting in the current color."| | ||
| |`mcid`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section ID for this image if any (otherwise `None`). *Experimental attribute.*| | ||
| |`tag`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section tag for this image if any (otherwise `None`). *Experimental attribute.*| | ||
| |`object_type`| "image"| | ||
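As with other objects, the table above exposes both PDF-style coordinates measured from the page bottom (`y0`/`y1`) and top-based coordinates (`top`/`bottom`). For a page of height `H` they relate as `top = H - y1` and `bottom = H - y0`; a quick sketch of that conversion (the helper name is illustrative, not part of pdfplumber's API):

```python
def y_to_top_coords(y0, y1, page_height):
    """Convert bottom-based (y0, y1) into top-based (top, bottom)."""
    return page_height - y1, page_height - y0

# e.g. on a US-letter page (792 pt tall), an object spanning y0=10..y1=30
# sits at top=762, bottom=782.
```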
| ### Obtaining higher-level layout objects via `pdfminer.six` | ||
@@ -332,3 +353,3 @@ | ||
| |`.extract_text_simple(x_tolerance=3, y_tolerance=3)`| A slightly faster but less flexible version of `.extract_text(...)`, using a simpler logic.| | ||
| |`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are "ttb" (top-to-bottom), "btt" (bottom-to-top), "ltr" (left-to-right), and "rtl" (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. 
Setting `split_at_punctuation` to `True` will force tokens to break at the punctuation characters specified by `string.punctuation`; alternatively, you can specify the separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`).| | ||
| |`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True, return_chars=False)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are "ttb" (top-to-bottom), "btt" (bottom-to-top), "ltr" (left-to-right), and "rtl" (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. 
Setting `split_at_punctuation` to `True` will force tokens to break at the punctuation characters specified by `string.punctuation`; alternatively, you can specify the separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`). Passing `return_chars=True` will add to each word dictionary a list of its constituent characters, under the `"chars"` key.| | ||
| |`.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs)`|*Experimental feature* that returns a list of dictionaries representing the lines of text on the page. The `strip` parameter works analogously to Python's `str.strip()` method, and returns `text` attributes without their surrounding whitespace. (Only relevant when `layout=True`.) Setting `return_chars` to `False` will exclude the individual character objects from the returned text-line dicts. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.| | ||
@@ -354,3 +375,3 @@ |`.search(pattern, regex=True, case=True, main_group=0, return_groups=True, return_chars=True, layout=False, **kwargs)`|*Experimental feature* that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the response dictionary object contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner. Setting `main_group` restricts the results to a specific regex group within the `pattern` (default of `0` means the entire match). Setting `return_groups` and/or `return_chars` to `False` will exclude the lists of the matched regex groups and/or characters (as `"groups"` and `"chars"`) from the return dicts. The `layout` parameter operates as it does for `.extract_text(...)`. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`. __Note__: Zero-width and all-whitespace matches are discarded, because they (generally) have no explicit position on the page. | | ||
| |--------|-------------| | ||
| |`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.| | ||
| |`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, `.columns`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.| | ||
| |`.find_table(table_settings={})`|Similar to `.find_tables(...)`, but returns the *largest* table on the page, as a `Table` object. If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.| | ||
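The `x_tolerance` word-grouping rule described for `.extract_words(...)` above can be sketched in a few lines. This is a simplified illustration of the rule, not pdfplumber's actual implementation (which also handles tolerance ratios, text direction, and non-upright text):

```python
def group_words(chars, x_tolerance=3):
    """Group char dicts into words: a new word starts whenever the gap
    between the previous char's x1 and the next char's x0 exceeds
    x_tolerance."""
    words, current = [], []
    for c in chars:
        if current and c["x0"] - current[-1]["x1"] > x_tolerance:
            words.append("".join(ch["text"] for ch in current))
            current = []
        current.append(c)
    if current:
        words.append("".join(ch["text"] for ch in current))
    return words

chars = [
    {"text": "H", "x0": 0.0, "x1": 5.0},
    {"text": "i", "x0": 5.5, "x1": 8.0},   # gap 0.5 <= 3: same word
    {"text": "y", "x0": 20.0, "x1": 25.0}, # gap 12 > 3: new word
    {"text": "o", "x0": 25.5, "x1": 30.0},
]
# group_words(chars) -> ["Hi", "yo"]
```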
@@ -552,2 +573,4 @@ |`.extract_tables(table_settings={})`|Returns the text extracted from *all* tables found on the page, represented as a list of lists of lists, with the structure `table -> row -> cell`.| | ||
| - [Quentin André](https://github.com/QuentinAndre11) | ||
| - [Léo Roux](https://github.com/leorouxx) | ||
| - [@wodny](https://github.com/wodny) | ||
@@ -554,0 +577,0 @@ ## Contributing |
@@ -1,8 +0,8 @@ | ||
| black==22.3.0 | ||
| flake8==7.1.0 | ||
| isort==5.10.1 | ||
| black==24.8.0 | ||
| flake8==7.1.1 | ||
| isort==5.13.2 | ||
| jupyterlab==3.6.7 | ||
| mypy==0.981 | ||
| mypy==1.11.1 | ||
| nbexec==0.2.0 | ||
| pandas-stubs==2.2.2.240603 | ||
| pandas-stubs==2.2.2.240805 | ||
| pandas==2.2.2 | ||
@@ -12,4 +12,4 @@ py==1.11.0 | ||
| pytest-parallel==0.1.1 | ||
| pytest==8.2.2 | ||
| pytest==8.3.2 | ||
| setuptools==68.2.2 | ||
| types-Pillow==10.2.0.20240520 |
+1
-0
@@ -6,2 +6,3 @@ [flake8] | ||
| W503 | ||
| E704 | ||
@@ -8,0 +9,0 @@ [tool:pytest] |
+27
-0
@@ -62,2 +62,16 @@ #!/usr/bin/env python | ||
| def test_annots_cropped(self): | ||
| pdf = self.pdf_2 | ||
| page = pdf.pages[0] | ||
| assert len(page.annots) == 13 | ||
| assert len(page.hyperlinks) == 1 | ||
| cropped = page.crop(page.bbox) | ||
| assert len(cropped.annots) == 13 | ||
| assert len(cropped.hyperlinks) == 1 | ||
| h0_bbox = pdfplumber.utils.obj_to_bbox(page.hyperlinks[0]) | ||
| cropped = page.crop(h0_bbox) | ||
| assert len(cropped.annots) == len(cropped.hyperlinks) == 1 | ||
| def test_annots_rotated(self): | ||
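The cropped-page behavior exercised above comes down to a bounding-box overlap test: an annotation or hyperlink survives a crop only if its bbox intersects the crop bbox. A minimal sketch of such a check, using pdfplumber's `(x0, top, x1, bottom)` bbox convention (this is an illustration, not pdfplumber's internal code):

```python
def boxes_overlap(a, b):
    """True if two (x0, top, x1, bottom) bboxes strictly overlap."""
    ax0, atop, ax1, abottom = a
    bx0, btop, bx1, bbottom = b
    return ax0 < bx1 and bx0 < ax1 and atop < bbottom and btop < abottom

# A box overlapping the crop region is kept; one merely touching an
# edge is not.
```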
@@ -182,2 +196,15 @@ def get_annot(filename, n=0): | ||
| def test_unicode_normalization(self): | ||
| path = os.path.join(HERE, "pdfs/issue-905.pdf") | ||
| with pdfplumber.open(path) as pdf: | ||
| page = pdf.pages[0] | ||
| print(page.extract_text()) | ||
| assert ord(page.chars[0]["text"]) == 894 | ||
| with pdfplumber.open(path, unicode_norm="NFC") as pdf: | ||
| page = pdf.pages[0] | ||
| assert ord(page.chars[0]["text"]) == 59 | ||
| assert page.extract_text() == ";;" | ||
| def test_colors(self): | ||
@@ -184,0 +211,0 @@ rect = self.pdf.pages[0].rects[0] |
@@ -30,3 +30,6 @@ #!/usr/bin/env python | ||
| assert last_line_without_drop == "微微软软 培培训训课课程程:: 名名模模意意义义一一些些有有意意义义一一些些" | ||
| assert ( | ||
| last_line_without_drop | ||
| == "微微软软 培培训训课课程程:: 名名模模意意义义一一些些有有意意义义一一些些" | ||
| ) | ||
| assert last_line_with_drop == "微软 培训课程: 名模意义一些有意义一些" | ||
@@ -50,3 +53,6 @@ | ||
| assert last_words_without_drop["upright"] == 1 | ||
| assert last_words_without_drop["text"] == "名名模模意意义义一一些些有有意意义义一一些些" | ||
| assert ( | ||
| last_words_without_drop["text"] | ||
| == "名名模模意意义义一一些些有有意意义义一一些些" | ||
| ) | ||
@@ -65,3 +71,6 @@ assert round(last_words_with_drop["x0"], 3) == x0 | ||
| assert last_line_without_drop == "微微软软 培培训训课课程程:: 名名模模意意义义一一些些有有意意义义一一些些" | ||
| assert ( | ||
| last_line_without_drop | ||
| == "微微软软 培培训训课课程程:: 名名模模意意义义一一些些有有意意义义一一些些" | ||
| ) | ||
| assert last_line_with_drop == "微软 培训课程: 名模意义一些有意义一些" | ||
@@ -68,0 +77,0 @@ |
+22
-0
@@ -313,1 +313,23 @@ #!/usr/bin/env python | ||
| assert page.extract_text() | ||
| def test_issue_1181(self): | ||
| """ | ||
| Correctly re-calculate coordinates when MediaBox does not start at (0,0) | ||
| """ | ||
| path = os.path.join(HERE, "pdfs/issue-1181.pdf") | ||
| with pdfplumber.open(path) as pdf: | ||
| p0, p1 = pdf.pages | ||
| assert p0.crop(p0.bbox).extract_table() == [ | ||
| ["FooCol1", "FooCol2", "FooCol3"], | ||
| ["Foo4", "Foo5", "Foo6"], | ||
| ["Foo7", "Foo8", "Foo9"], | ||
| ["Foo10", "Foo11", "Foo12"], | ||
| ["", "", ""], | ||
| ] | ||
| assert p1.crop(p1.bbox).extract_table() == [ | ||
| ["BarCol1", "BarCol2", "BarCol3"], | ||
| ["Bar4", "Bar5", "Bar6"], | ||
| ["Bar7", "Bar8", "Bar9"], | ||
| ["Bar10", "Bar11", "Bar12"], | ||
| ["", "", ""], | ||
| ] |
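The fix being tested above concerns pages whose MediaBox origin is not `(0, 0)`: raw object coordinates must be shifted by the MediaBox origin before they are used as page-relative positions. A minimal sketch of that adjustment (pdfplumber's internals differ, but the arithmetic is the same idea):

```python
def shift_to_origin(x, y, mediabox):
    """Translate a raw (x, y) point so the page origin is (0, 0),
    given a MediaBox of (mx0, my0, mx1, my1)."""
    mx0, my0, _, _ = mediabox
    return (x - mx0, y - my0)

# e.g. with MediaBox = (20, 30, 620, 830), a point at raw (25, 40)
# becomes (5, 10) relative to the page origin.
```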
+12
-0
@@ -56,2 +56,14 @@ #!/usr/bin/env python | ||
| def test_repair_setting(self): | ||
| path = os.path.join(HERE, "pdfs/malformed-from-issue-932.pdf") | ||
| with tempfile.NamedTemporaryFile("wb") as out: | ||
| pdfplumber.repair(path, outfile=out.name) | ||
| size_default = os.stat(out.name).st_size | ||
| with tempfile.NamedTemporaryFile("wb") as out: | ||
| pdfplumber.repair(path, outfile=out.name, setting="prepress") | ||
| size_prepress = os.stat(out.name).st_size | ||
| assert size_default > size_prepress | ||
| def test_repair_password(self): | ||
@@ -58,0 +70,0 @@ path = os.path.join(HERE, "pdfs/password-example.pdf") |
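The `setting=...` argument tested above maps onto Ghostscript's `-dPDFSETTINGS` flag (the flag and its values are real Ghostscript options; the helper function here is hypothetical, sketched only to show the mapping and the validation one might apply):

```python
def gs_pdfsettings(setting="default"):
    """Build the -dPDFSETTINGS argument for a Ghostscript invocation,
    restricted to the values pdfplumber.repair() documents."""
    allowed = {"default", "prepress", "printer", "ebook"}
    if setting not in allowed:
        raise ValueError(f"unknown setting: {setting!r}")
    return f"-dPDFSETTINGS=/{setting}"

# gs_pdfsettings("prepress") -> "-dPDFSETTINGS=/prepress"
```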
+26
-0
@@ -76,2 +76,28 @@ #!/usr/bin/env python | ||
| def test_rows_and_columns(self): | ||
| path = os.path.join(HERE, "pdfs/issue-140-example.pdf") | ||
| with pdfplumber.open(path) as pdf: | ||
| page = pdf.pages[0] | ||
| table = page.find_table() | ||
| row = [page.crop(bbox).extract_text() for bbox in table.rows[0].cells] | ||
| assert row == [ | ||
| "Line no", | ||
| "UPC code", | ||
| "Location", | ||
| "Item Description", | ||
| "Item Quantity", | ||
| "Bill Amount", | ||
| "Accrued Amount", | ||
| "Handling Rate", | ||
| "PO number", | ||
| ] | ||
| col = [page.crop(bbox).extract_text() for bbox in table.columns[1].cells] | ||
| assert col == [ | ||
| "UPC code", | ||
| "0085648100305", | ||
| "0085648100380", | ||
| "0085648100303", | ||
| "0085648100300", | ||
| ] | ||
| def test_explicit_desc_decimalization(self): | ||
@@ -78,0 +104,0 @@ """ |
+12
-0
@@ -102,2 +102,14 @@ #!/usr/bin/env python | ||
| def test_extract_words_return_chars(self): | ||
| path = os.path.join(HERE, "pdfs/extra-attrs-example.pdf") | ||
| with pdfplumber.open(path) as pdf: | ||
| page = pdf.pages[0] | ||
| words = page.extract_words() | ||
| assert "chars" not in words[0] | ||
| words = page.extract_words(return_chars=True) | ||
| assert "chars" in words[0] | ||
| assert "".join(c["text"] for c in words[0]["chars"]) == words[0]["text"] | ||
| def test_text_rotation(self): | ||
@@ -104,0 +116,0 @@ rotations = { |