pdfplumber


pdfplumber - pypi package: comparing version 0.11.2 to 0.11.3

CHANGELOG.md (+21 -0)

@@ -5,2 +5,23 @@ # Changelog

## [0.11.3] - 2024-08-07
### Added
- Add `Table.columns`, analogous to `Table.rows` (h/t @Pk13055). ([#1050](https://github.com/jsvine/pdfplumber/issues/1050) + [d39302f](https://github.com/jsvine/pdfplumber/commit/d39302f))
- Add `Page.extract_words(return_chars=True)`, mirroring `Page.search(..., return_chars=True)`; if this argument is passed, each word dictionary will include an additional key-value pair: `"chars": [char_object, ...]` (h/t @cmdlineluser). ([#1173](https://github.com/jsvine/pdfplumber/issues/1173) + [1496cbd](https://github.com/jsvine/pdfplumber/commit/1496cbd))
- Add `pdfplumber.open(unicode_norm="NFC"/"NFD"/"NFKC"/"NFKD")`, where the values are the [four options for Unicode normalization](https://unicode.org/reports/tr15/#Normalization_Forms_Table) (h/t @petermr + @agusluques). ([#905](https://github.com/jsvine/pdfplumber/issues/905) + [03a477f](https://github.com/jsvine/pdfplumber/commit/03a477f))
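As a quick, self-contained illustration of what those four normalization forms do (plain `unicodedata`, no pdfplumber required; the sample string is ours, not from the changelog):

```python
from unicodedata import normalize

# "ﬁné": a ligature (U+FB01) plus a precomposed accented character (U+00E9)
s = "\ufb01n\u00e9"

# NFC/NFD only compose/decompose accents; the ligature survives.
assert normalize("NFC", s) == "\ufb01n\u00e9"
assert normalize("NFD", s) == "\ufb01ne\u0301"

# The compatibility forms (NFKC/NFKD) also expand the ligature.
assert normalize("NFKC", s) == "fin\u00e9"
assert normalize("NFKD", s) == "fine\u0301"
```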
### Changed
- Change the default setting that `pdfplumber.repair(...)` passes to Ghostscript's `-dPDFSETTINGS` parameter from `prepress` to `default`, and make that setting modifiable via `.repair(setting=...)`, where the value is one of `"default"`, `"prepress"`, `"printer"`, or `"ebook"` (h/t @Laubeee). ([#874](https://github.com/jsvine/pdfplumber/issues/874) + [48cab3f](https://github.com/jsvine/pdfplumber/commit/48cab3f))
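The `setting` value is interpolated directly into the Ghostscript flag (the `repair.py` hunk further down in this diff also accepts `"screen"`); a minimal sketch of that mapping, using a helper name of our own:

```python
# Hypothetical helper mirroring how the Ghostscript argument is built;
# the valid values come from T_repair_setting in pdfplumber/repair.py.
VALID_SETTINGS = ("default", "prepress", "printer", "ebook", "screen")

def gs_pdfsettings_flag(setting: str = "default") -> str:
    if setting not in VALID_SETTINGS:
        raise ValueError(f"unknown -dPDFSETTINGS value: {setting!r}")
    return f"-dPDFSETTINGS=/{setting}"

assert gs_pdfsettings_flag() == "-dPDFSETTINGS=/default"
assert gs_pdfsettings_flag("ebook") == "-dPDFSETTINGS=/ebook"
```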
### Fixed
- Fix handling of object coordinates when `mediabox` does not begin at `(0,0)` (h/t @wodny). ([#1181](https://github.com/jsvine/pdfplumber/issues/1181) + [9025c3f](https://github.com/jsvine/pdfplumber/commit/9025c3f) + [046bd87](https://github.com/jsvine/pdfplumber/commit/046bd87))
- Fix error on getting `.annots`/`.hyperlinks` from `CroppedPage` (due to missing `.rotation` and `.initial_doctop` attributes) (h/t @Safrone). ([#1171](https://github.com/jsvine/pdfplumber/issues/1171) + [e5737d2](https://github.com/jsvine/pdfplumber/commit/e5737d2))
- Fix problem where `Page.crop(...)` was not cropping `.annots/.hyperlinks` (h/t @Safrone). ([#1171](https://github.com/jsvine/pdfplumber/issues/1171) + [22494e8](https://github.com/jsvine/pdfplumber/commit/22494e8))
- Fix calculation of coordinates for `.annots` on `CroppedPage`s. ([0bbb340](https://github.com/jsvine/pdfplumber/commit/0bbb340) + [b16acc3](https://github.com/jsvine/pdfplumber/commit/b16acc3))
- Dereference structure element attributes (h/t @dhdaines). ([#1169](https://github.com/jsvine/pdfplumber/pull/1169) + [3f16180](https://github.com/jsvine/pdfplumber/commit/3f16180))
- Fix `Page.get_attr(...)` so that it fully resolves references before determining whether the attribute's value is `None` (h/t @zzhangyun + @mkl-public). ([#1176](https://github.com/jsvine/pdfplumber/issues/1176) + [c20cd3b](https://github.com/jsvine/pdfplumber/commit/c20cd3b))
## [0.11.2] - 2024-07-06

@@ -7,0 +28,0 @@

+29
-6
Metadata-Version: 2.1
Name: pdfplumber
Version: 0.11.2
Version: 0.11.3
Summary: Plumb a PDF for detailed information about each char, rectangle, and line.

@@ -36,4 +36,2 @@ Home-page: https://github.com/jsvine/pdfplumber

> 👋 This repository’s maintainers are available to hire for PDF data-extraction consulting projects. To get a cost estimate, contact [Jeremy](https://www.jsvine.com/consulting/pdf-data-extraction/) (for projects of any size or complexity) and/or [Samkit](https://www.linkedin.com/in/samkit-jain/) (specifically for table extraction).
## Table of Contents

@@ -106,2 +104,4 @@

To [pre-normalize Unicode text](https://unicode.org/reports/tr15/), pass `unicode_norm=...`, where `...` is one of the [four Unicode normalization forms](https://unicode.org/reports/tr15/#Normalization_Forms_Table): `"NFC"`, `"NFD"`, `"NFKC"`, or `"NFKD"`.
Invalid metadata values are treated as a warning by default. If that is not intended, pass `strict_metadata=True` to the `open` method and `pdfplumber.open` will raise an exception if it is unable to parse the metadata.

@@ -281,4 +281,25 @@

[To be completed.]
*Note: Although the positioning and characteristics of `image` objects are available via `pdfplumber`, this library does not provide direct support for reconstructing image content. For that, please see [this suggestion](https://github.com/jsvine/pdfplumber/discussions/496#discussioncomment-1259772).*
| Property | Description |
|----------|-------------|
|`page_number`| Page number on which the image was found.|
|`height`| Height of the image.|
|`width`| Width of the image.|
|`x0`| Distance of left side of the image from left side of page.|
|`x1`| Distance of right side of the image from left side of page.|
|`y0`| Distance of bottom of the image from bottom of page.|
|`y1`| Distance of top of the image from bottom of page.|
|`top`| Distance of top of the image from top of page.|
|`bottom`| Distance of bottom of the image from top of page.|
|`doctop`| Distance of top of rectangle from top of document.|
|`srcsize`| The image's original dimensions, as a `(width, height)` tuple.|
|`colorspace`| Color domain of the image (e.g., RGB).|
|`bits`| The number of bits per color component; e.g., 8 corresponds to 255 possible values for each color component (R, G, and B in an RGB color space).|
|`stream`| Pixel values of the image, as a `pdfminer.pdftypes.PDFStream` object.|
|`imagemask`| A nullable boolean; if `True`, "specifies that the image data is to be used as a stencil mask for painting in the current color."|
|`mcid`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section ID for this image if any (otherwise `None`). *Experimental attribute.*|
|`tag`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section tag for this image if any (otherwise `None`). *Experimental attribute.*|
|`object_type`| "image"|
### Obtaining higher-level layout objects via `pdfminer.six`

@@ -353,3 +374,3 @@

|`.extract_text_simple(x_tolerance=3, y_tolerance=3)`| A slightly faster but less flexible version of `.extract_text(...)`, using a simpler logic.|
|`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are "ttb" (top-to-bottom), "btt" (bottom-to-top), "ltr" (left-to-right), and "rtl" (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes.
Setting `split_at_punctuation` to `True` will enforce breaking tokens at the punctuation specified by `string.punctuation`; or you can specify the list of separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`).|
|`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True, return_chars=False)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are "ttb" (top-to-bottom), "btt" (bottom-to-top), "ltr" (left-to-right), and "rtl" (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes.
Setting `split_at_punctuation` to `True` will enforce breaking tokens at the punctuation specified by `string.punctuation`; or you can specify the list of separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`). Passing `return_chars=True` will add each word's constituent characters to its dictionary, as a list under the `"chars"` key.|
|`.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs)`|*Experimental feature* that returns a list of dictionaries representing the lines of text on the page. The `strip` parameter works analogously to Python's `str.strip()` method, and returns `text` attributes without their surrounding whitespace. (Only relevant when `layout = True`.) Setting `return_chars` to `False` will exclude the individual character objects from the returned text-line dicts. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.|

@@ -375,3 +396,3 @@ |`.search(pattern, regex=True, case=True, main_group=0, return_groups=True, return_chars=True, layout=False, **kwargs)`|*Experimental feature* that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the response dictionary object contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner. Setting `main_group` restricts the results to a specific regex group within the `pattern` (default of `0` means the entire match). Setting `return_groups` and/or `return_chars` to `False` will exclude the lists of the matched regex groups and/or characters from being added to the returned dicts (as `"groups"` and `"chars"`). The `layout` parameter operates as it does for `.extract_text(...)`. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`. __Note__: Zero-width and all-whitespace matches are discarded, because they (generally) have no explicit position on the page. |

|--------|-------------|
|`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.|
|`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, `.columns`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.|
|`.find_table(table_settings={})`|Similar to `.find_tables(...)`, but returns the *largest* table on the page, as a `Table` object. If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.|
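That selection rule (most cells wins, ties go to the table nearest the top of the page) can be sketched with stand-in tables; the `SimpleTable` type and `largest_table` helper here are ours, not pdfplumber's:

```python
from typing import NamedTuple, List, Tuple

class SimpleTable(NamedTuple):
    bbox: Tuple[float, float, float, float]  # (x0, top, x1, bottom)
    cells: List[Tuple[float, float, float, float]]

def largest_table(tables: List[SimpleTable]) -> SimpleTable:
    # Most cells wins; ties go to the table whose top edge is highest
    # on the page (smallest `top`).
    return max(tables, key=lambda t: (len(t.cells), -t.bbox[1]))

a = SimpleTable((0, 100, 50, 200), [(0, 0, 1, 1)] * 4)
b = SimpleTable((0, 50, 50, 90), [(0, 0, 1, 1)] * 4)   # same size, higher up
c = SimpleTable((0, 300, 50, 400), [(0, 0, 1, 1)] * 2)

assert largest_table([a, b, c]) is b
```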

@@ -573,2 +594,4 @@ |`.extract_tables(table_settings={})`|Returns the text extracted from *all* tables found on the page, represented as a list of lists of lists, with the structure `table -> row -> cell`.|

- [Quentin André](https://github.com/QuentinAndre11)
- [Léo Roux](https://github.com/leorouxx)
- [@wodny](https://github.com/wodny)

@@ -575,0 +598,0 @@ ## Contributing

+1
-1

@@ -1,2 +0,2 @@

version_info = (0, 11, 2)
version_info = (0, 11, 3)
__version__ = ".".join(map(str, version_info))

@@ -16,11 +16,13 @@ import csv

@property
def pages(self) -> Optional[List[Any]]:
... # pragma: nocover
def pages(self) -> Optional[List[Any]]: # pragma: nocover
raise NotImplementedError
@property
def objects(self) -> Dict[str, T_obj_list]:
... # pragma: nocover
def objects(self) -> Dict[str, T_obj_list]: # pragma: nocover
raise NotImplementedError
def to_dict(self, object_types: Optional[List[str]] = None) -> Dict[str, Any]:
... # pragma: nocover
def to_dict(
self, object_types: Optional[List[str]] = None
) -> Dict[str, Any]: # pragma: nocover
raise NotImplementedError

@@ -27,0 +29,0 @@ def flush_cache(self, properties: Optional[List[str]] = None) -> None:

@@ -15,2 +15,3 @@ import re

)
from unicodedata import normalize as normalize_unicode

@@ -220,4 +221,4 @@ from pdfminer.converter import PDFPageAggregator

def get_attr(key: str, default: Any = None) -> Any:
ref = page_obj.attrs.get(key)
return default if ref is None else resolve_all(ref)
value = resolve_all(page_obj.attrs.get(key))
return default if value is None else value
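A toy illustration of why resolving before the `None` check matters, with a stand-in resolver (the `Ref` class and this `resolve_all` are ours, not pdfminer's):

```python
class Ref:
    """Stand-in for a PDF indirect reference."""
    def __init__(self, target):
        self.target = target

def resolve_all(value):
    while isinstance(value, Ref):
        value = value.target
    return value

attrs = {"Rotate": Ref(Ref(None))}  # a reference chain ending in null

# Old logic: the raw value is a Ref, so `ref is None` is False, and the
# resolved-to-null value leaks through instead of the default.
ref = attrs.get("Rotate")
old = 0 if ref is None else resolve_all(ref)
assert old is None  # the default was skipped

# Fixed logic: resolve first, then test for None.
value = resolve_all(attrs.get("Rotate"))
new = 0 if value is None else value
assert new == 0
```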

@@ -296,3 +297,4 @@ # Per PDF Reference Table 3.27: "The number of degrees by which the

pt1 = rotate_point((_c, _d), self.rotation)
x0, top, x1, bottom = _invert_box(_normalize_box((*pt0, *pt1)), self.height)
rh = self.root_page.height
x0, top, x1, bottom = _invert_box(_normalize_box((*pt0, *pt1)), rh)

@@ -316,5 +318,5 @@ a = annot.get("A", {})

"x0": x0,
"y0": self.height - bottom,
"y0": rh - bottom,
"x1": x1,
"y1": self.height - top,
"y1": rh - top,
"doctop": self.initial_doctop + top,

@@ -335,3 +337,7 @@ "top": top,

raw = resolve_all(self.page_obj.annots) or []
return list(map(parse, raw))
parsed = list(map(parse, raw))
if isinstance(self, CroppedPage):
return self._crop_fn(parsed)
else:
return parsed

@@ -350,3 +356,4 @@ @property

def point2coord(self, pt: Tuple[T_num, T_num]) -> Tuple[T_num, T_num]:
return (pt[0], self.height - pt[1])
# See note below re. #1181 and mediabox-adjustment reversions
return (self.mediabox[0] + pt[0], self.mediabox[1] + self.height - pt[1])

@@ -386,3 +393,8 @@ def process_object(self, obj: LTItem) -> T_obj:

if isinstance(obj, (LTChar, LTTextContainer)):
attr["text"] = obj.get_text()
text = obj.get_text()
attr["text"] = (
normalize_unicode(self.pdf.unicode_norm, text)
if self.pdf.unicode_norm is not None
else text
)

@@ -414,7 +426,16 @@ if isinstance(obj, LTChar):

# As noted in #1181, `pdfminer.six` adjusts objects'
# coordinates relative to the MediaBox:
# https://github.com/pdfminer/pdfminer.six/blob/1a8bd2f730295b31d6165e4d95fcb5a03793c978/pdfminer/converter.py#L79-L84
mb_x0, mb_top = self.mediabox[:2]
if "y0" in attr:
attr["top"] = self.height - attr["y1"]
attr["bottom"] = self.height - attr["y0"]
attr["top"] = (self.height - attr["y1"]) + mb_top
attr["bottom"] = (self.height - attr["y0"]) + mb_top
attr["doctop"] = self.initial_doctop + attr["top"]
if "x0" in attr and mb_x0 != 0:
attr["x0"] = attr["x0"] + mb_x0
attr["x1"] = attr["x1"] + mb_x0
return attr
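The coordinate arithmetic above can be checked in isolation; a sketch with a made-up page whose MediaBox does not start at `(0, 0)` (all values are ours, the formula mirrors the hunk):

```python
# A page with MediaBox (20, 30, 620, 830).
mb_x0, mb_y0, mb_x1, mb_y1 = 20, 30, 620, 830
height = mb_y1 - mb_y0  # 800
mb_top = mb_y0          # pdfminer shifts y-coordinates by the MediaBox origin

# A character pdfminer reports at y0=100, y1=110 (MediaBox-adjusted space).
y0, y1 = 100, 110
top = (height - y1) + mb_top     # distance from the top of the page
bottom = (height - y0) + mb_top

assert (top, bottom) == (720, 730)
```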

@@ -644,2 +665,4 @@

self.page_number = parent_page.page_number
self.initial_doctop = parent_page.initial_doctop
self.rotation = parent_page.rotation
self.mediabox = parent_page.mediabox

@@ -646,0 +669,0 @@ self.cropbox = parent_page.cropbox

@@ -6,3 +6,3 @@ import itertools

from types import TracebackType
from typing import Any, Dict, List, Optional, Tuple, Type, Union
from typing import Any, Dict, List, Literal, Optional, Tuple, Type, Union

@@ -19,3 +19,3 @@ from pdfminer.layout import LAParams

from .page import Page
from .repair import _repair
from .repair import T_repair_setting, _repair
from .structure import PDFStructTree, StructTreeMissing

@@ -39,2 +39,3 @@ from .utils import resolve_and_decode

strict_metadata: bool = False,
unicode_norm: Optional[Literal["NFC", "NFKC", "NFD", "NFKD"]] = None,
):

@@ -47,2 +48,3 @@ self.stream = stream

self.password = password
self.unicode_norm = unicode_norm

@@ -77,4 +79,6 @@ self.doc = PDFDocument(PDFParser(stream), password=password or "")

strict_metadata: bool = False,
unicode_norm: Optional[Literal["NFC", "NFKC", "NFD", "NFKD"]] = None,
repair: bool = False,
gs_path: Optional[Union[str, pathlib.Path]] = None,
repair_setting: T_repair_setting = "default",
) -> "PDF":

@@ -85,3 +89,5 @@

if repair:
stream = _repair(path_or_fp, password=password, gs_path=gs_path)
stream = _repair(
path_or_fp, password=password, gs_path=gs_path, setting=repair_setting
)
stream_is_external = False

@@ -108,2 +114,3 @@ # Although the original file has a path,

strict_metadata=strict_metadata,
unicode_norm=unicode_norm,
stream_is_external=stream_is_external,

@@ -110,0 +117,0 @@ )

@@ -5,5 +5,7 @@ import pathlib

from io import BufferedReader, BytesIO
from typing import Optional, Union
from typing import Literal, Optional, Union
T_repair_setting = Literal["default", "prepress", "printer", "ebook", "screen"]
def _repair(

@@ -13,2 +15,3 @@ path_or_fp: Union[str, pathlib.Path, BufferedReader, BytesIO],

gs_path: Optional[Union[str, pathlib.Path]] = None,
setting: T_repair_setting = "default",
) -> BytesIO:

@@ -34,3 +37,3 @@

"-sDEVICE=pdfwrite",
"-dPDFSETTINGS=/prepress",
f"-dPDFSETTINGS=/{setting}",
]

@@ -68,4 +71,5 @@

gs_path: Optional[Union[str, pathlib.Path]] = None,
setting: T_repair_setting = "default",
) -> Optional[BytesIO]:
repaired = _repair(path_or_fp, password, gs_path=gs_path)
repaired = _repair(path_or_fp, password, gs_path=gs_path, setting=setting)
if outfile:

@@ -72,0 +76,0 @@ with open(outfile, "wb") as f:

@@ -288,7 +288,9 @@ import itertools

attributes = self._make_attributes(obj, revision)
element_id = decode_text(obj["ID"]) if "ID" in obj else None
title = decode_text(obj["T"]) if "T" in obj else None
lang = decode_text(obj["Lang"]) if "Lang" in obj else None
alt_text = decode_text(obj["Alt"]) if "Alt" in obj else None
actual_text = decode_text(obj["ActualText"]) if "ActualText" in obj else None
element_id = decode_text(resolve1(obj["ID"])) if "ID" in obj else None
title = decode_text(resolve1(obj["T"])) if "T" in obj else None
lang = decode_text(resolve1(obj["Lang"])) if "Lang" in obj else None
alt_text = decode_text(resolve1(obj["Alt"])) if "Alt" in obj else None
actual_text = (
decode_text(resolve1(obj["ActualText"])) if "ActualText" in obj else None
)
element = PDFStructElement(

@@ -295,0 +297,0 @@ type=obj_tag,

@@ -373,2 +373,6 @@ import itertools

class Column(CellGroup):
pass
class Table(object):

@@ -389,13 +393,31 @@ def __init__(self, page: "Page", cells: List[T_bbox]):

@property
def rows(self) -> List[Row]:
_sorted = sorted(self.cells, key=itemgetter(1, 0))
xs = list(sorted(set(map(itemgetter(0), self.cells))))
def _get_rows_or_cols(self, kind: type[CellGroup]) -> List[CellGroup]:
axis = 0 if kind is Row else 1
antiaxis = int(not axis)
# Sort first by top/x0, then by x0/top
_sorted = sorted(self.cells, key=itemgetter(antiaxis, axis))
# Get all x0s/tops
xs = list(sorted(set(map(itemgetter(axis), self.cells))))
# Group by top/x0
grouped = itertools.groupby(_sorted, itemgetter(antiaxis))
rows = []
for y, row_cells in itertools.groupby(_sorted, itemgetter(1)):
xdict = {cell[0]: cell for cell in row_cells}
row = Row([xdict.get(x) for x in xs])
# for y/x, row/column-cells ...
for y, row_cells in grouped:
xdict = {cell[axis]: cell for cell in row_cells}
row = kind([xdict.get(x) for x in xs])
rows.append(row)
return rows
@property
def rows(self) -> List[CellGroup]:
return self._get_rows_or_cols(Row)
@property
def columns(self) -> List[CellGroup]:
return self._get_rows_or_cols(Column)
def extract(self, **kwargs: Any) -> List[List[Optional[str]]]:
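The shared row/column logic in that hunk reduces to a `groupby` over one bounding-box axis; a self-contained sketch on toy cells (tuples are `(x0, top, x1, bottom)`, as in pdfplumber, but the `groups` helper is ours):

```python
import itertools
from operator import itemgetter

# A 2x2 grid of cells: (x0, top, x1, bottom)
cells = [(0, 0, 10, 5), (10, 0, 20, 5), (0, 5, 10, 10), (10, 5, 20, 10)]

def groups(cells, axis):
    # axis=0 yields rows (grouped by top), axis=1 yields columns (grouped by x0)
    antiaxis = 1 - axis
    ordered = sorted(cells, key=itemgetter(antiaxis, axis))
    keys = sorted(set(map(itemgetter(axis), cells)))
    out = []
    for _, group_cells in itertools.groupby(ordered, itemgetter(antiaxis)):
        by_key = {c[axis]: c for c in group_cells}
        out.append([by_key.get(k) for k in keys])  # None fills missing cells
    return out

rows = groups(cells, axis=0)
cols = groups(cells, axis=1)
assert rows == [[(0, 0, 10, 5), (10, 0, 20, 5)], [(0, 5, 10, 10), (10, 5, 20, 10)]]
assert cols == [[(0, 0, 10, 5), (0, 5, 10, 10)], [(10, 0, 20, 5), (10, 5, 20, 10)]]
```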

@@ -484,3 +506,3 @@

def __post_init__(self) -> "TableSettings":
def __post_init__(self) -> None:
"""Clean up user-provided table settings.

@@ -536,4 +558,2 @@

return self
@classmethod

@@ -540,0 +560,0 @@ def resolve(cls, settings: Optional[T_table_settings]) -> "TableSettings":

import itertools
from collections.abc import Hashable
from operator import itemgetter
from typing import Callable, Dict, Iterable, List, TypeVar, Union
from typing import Any, Callable, Dict, Iterable, List, Tuple, TypeVar, Union
from .._typing import T_num
from .._typing import T_num, T_obj

@@ -39,11 +39,11 @@

R = TypeVar("R")
Clusterable = TypeVar("Clusterable", T_obj, Tuple[Any, ...])
def cluster_objects(
xs: List[R],
key_fn: Union[Hashable, Callable[[R], T_num]],
xs: List[Clusterable],
key_fn: Union[Hashable, Callable[[Clusterable], T_num]],
tolerance: T_num,
preserve_order: bool = False,
) -> List[List[R]]:
) -> List[List[Clusterable]]:

@@ -50,0 +50,0 @@ if not callable(key_fn):

@@ -33,3 +33,4 @@ import itertools

"""
return bbox_getter(obj)
bbox: T_bbox = bbox_getter(obj)
return bbox

@@ -36,0 +37,0 @@

@@ -683,8 +683,18 @@ import inspect

def extract_words(self, chars: T_obj_list) -> T_obj_list:
return list(word for word, word_chars in self.iter_extract_tuples(chars))
def extract_words(
self, chars: T_obj_list, return_chars: bool = False
) -> T_obj_list:
if return_chars:
return list(
{**word, "chars": word_chars}
for word, word_chars in self.iter_extract_tuples(chars)
)
else:
return list(word for word, word_chars in self.iter_extract_tuples(chars))
def extract_words(chars: T_obj_list, **kwargs: Any) -> T_obj_list:
return WordExtractor(**kwargs).extract_words(chars)
def extract_words(
chars: T_obj_list, return_chars: bool = False, **kwargs: Any
) -> T_obj_list:
return WordExtractor(**kwargs).extract_words(chars, return_chars)
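The `return_chars` wiring above is just a dict merge per word; a stand-alone sketch (the sample word/char dicts and this simplified `extract_words` are ours):

```python
# (word, constituent chars) pairs, as iter_extract_tuples would yield them.
pairs = [
    ({"text": "Hello", "x0": 0, "x1": 30}, [{"text": "H"}, {"text": "e"}]),
    ({"text": "world", "x0": 35, "x1": 65}, [{"text": "w"}]),
]

def extract_words(pairs, return_chars=False):
    if return_chars:
        # Merge each word dict with its constituent chars under "chars".
        return [{**word, "chars": chars} for word, chars in pairs]
    return [word for word, _ in pairs]

assert "chars" not in extract_words(pairs)[0]
assert extract_words(pairs, return_chars=True)[0]["chars"][0]["text"] == "H"
```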

@@ -691,0 +701,0 @@

Metadata-Version: 2.1
Name: pdfplumber
Version: 0.11.2
Version: 0.11.3
Summary: Plumb a PDF for detailed information about each char, rectangle, and line.

@@ -36,4 +36,2 @@ Home-page: https://github.com/jsvine/pdfplumber

> 👋 This repository’s maintainers are available to hire for PDF data-extraction consulting projects. To get a cost estimate, contact [Jeremy](https://www.jsvine.com/consulting/pdf-data-extraction/) (for projects of any size or complexity) and/or [Samkit](https://www.linkedin.com/in/samkit-jain/) (specifically for table extraction).
## Table of Contents

@@ -106,2 +104,4 @@

To [pre-normalize Unicode text](https://unicode.org/reports/tr15/), pass `unicode_norm=...`, where `...` is one of the [four Unicode normalization forms](https://unicode.org/reports/tr15/#Normalization_Forms_Table): `"NFC"`, `"NFD"`, `"NFKC"`, or `"NFKD"`.
Invalid metadata values are treated as a warning by default. If that is not intended, pass `strict_metadata=True` to the `open` method and `pdfplumber.open` will raise an exception if it is unable to parse the metadata.

@@ -281,4 +281,25 @@

[To be completed.]
*Note: Although the positioning and characteristics of `image` objects are available via `pdfplumber`, this library does not provide direct support for reconstructing image content. For that, please see [this suggestion](https://github.com/jsvine/pdfplumber/discussions/496#discussioncomment-1259772).*
| Property | Description |
|----------|-------------|
|`page_number`| Page number on which the image was found.|
|`height`| Height of the image.|
|`width`| Width of the image.|
|`x0`| Distance of left side of the image from left side of page.|
|`x1`| Distance of right side of the image from left side of page.|
|`y0`| Distance of bottom of the image from bottom of page.|
|`y1`| Distance of top of the image from bottom of page.|
|`top`| Distance of top of the image from top of page.|
|`bottom`| Distance of bottom of the image from top of page.|
|`doctop`| Distance of top of rectangle from top of document.|
|`srcsize`| The image original dimensions, as a `(width, height)` tuple.|
|`colorspace`| Color domain of the image (e.g., RGB).|
|`bits`| The number of bits per color component; e.g., 8 corresponds to 255 possible values for each color component (R, G, and B in an RGB color space).|
|`stream`| Pixel values of the image, as a `pdfminer.pdftypes.PDFStream` object.|
|`imagemask`| A nullable boolean; if `True`, "specifies that the image data is to be used as a stencil mask for painting in the current color."|
|`mcid`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section ID for this image if any (otherwise `None`). *Experimental attribute.*|
|`tag`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section tag for this image if any (otherwise `None`). *Experimental attribute.*|
|`object_type`| "image"|
### Obtaining higher-level layout objects via `pdfminer.six`

@@ -353,3 +374,3 @@

|`.extract_text_simple(x_tolerance=3, y_tolerance=3)`| A slightly faster but less flexible version of `.extract_text(...)`, using a simpler logic.|
|`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are "ttb" (top-to-bottom), "btt" (bottom-to-top), "ltr" (left-to-right), and "rtl" (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `["fontname", "size"]` will restrict each words to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. 
Setting `split_at_punctuation` to `True` will enforce breaking tokens at punctuations specified by `string.punctuation`; or you can specify the list of separating punctuation by pass a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `fi` will be expanded into their constituent letters (e.g., `fi`).|
|`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True, return_chars=False)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are "ttb" (top-to-bottom), "btt" (bottom-to-top), "ltr" (left-to-right), and "rtl" (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `["fontname", "size"]` will restrict each words to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. 
Setting `split_at_punctuation` to `True` will enforce breaking tokens at punctuations specified by `string.punctuation`; or you can specify the list of separating punctuation by pass a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `fi` will be expanded into their constituent letters (e.g., `fi`). Passing `return_chars=True` will add, to each word dictionary, a list of its constituent characters, as a list in the `"chars"` field.|
|`.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs)`|*Experimental feature* that returns a list of dictionaries representing the lines of text on the page. The `strip` parameter works analogously to Python's `str.strip()` method, and returns `text` attributes without their surrounding whitespace. (Only relevant when `layout = True`.) Setting `return_chars` to `False` will exclude the individual character objects from the returned text-line dicts. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.|

@@ -375,3 +396,3 @@ |`.search(pattern, regex=True, case=True, main_group=0, return_groups=True, return_chars=True, layout=False, **kwargs)`|*Experimental feature* that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the response dictionary object contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner. Setting `main_group` restricts the results to a specific regex group within the `pattern` (default of `0` means the entire match). Setting `return_groups` and/or `return_chars` to `False` will exclude the lists of the matched regex groups and/or characters from being added (as `"groups"` and `"chars"` to the return dicts). The `layout` parameter operates as it does for `.extract_text(...)`. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`. __Note__: Zero-width and all-whitespace matches are discarded, because they (generally) have no explicit position on the page. |
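Conceptually, `.search(...)` amounts to running the pattern over the page's extracted text and mapping the matched span back onto char objects to recover a bounding box. A toy sketch with hand-made char dicts (not the library's implementation):

```python
import re

# Hand-made char objects for the text "Total: 42", five units wide each.
chars = [
    {"text": t, "x0": i * 5, "x1": i * 5 + 5, "top": 0, "bottom": 10}
    for i, t in enumerate("Total: 42")
]
text = "".join(c["text"] for c in chars)

m = re.search(r"\d+", text)       # find the number
span = chars[m.start():m.end()]   # chars behind the matched span
bbox = (
    min(c["x0"] for c in span),
    min(c["top"] for c in span),
    max(c["x1"] for c in span),
    max(c["bottom"] for c in span),
)
print(m.group(0), bbox)  # → 42 (35, 0, 45, 10)
```

This also illustrates why zero-width matches are discarded: an empty span selects no chars, so no bounding box can be computed.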

|`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, `.columns`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.|
|`.find_table(table_settings={})`|Similar to `.find_tables(...)`, but returns the *largest* table on the page, as a `Table` object. If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.|
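On extracted cell text, the new `Table.columns` relates to `Table.rows` as a simple transpose. A minimal illustration with invented cell values:

```python
# Rows of extracted cell text (invented values); the column view of the
# same grid is the transpose.
rows = [
    ["Line no", "UPC code", "Location"],
    ["1", "0085648100305", "A1"],
    ["2", "0085648100380", "B2"],
]
columns = [list(col) for col in zip(*rows)]
print(columns[1])  # → ['UPC code', '0085648100305', '0085648100380']
```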

@@ -573,2 +594,4 @@ |`.extract_tables(table_settings={})`|Returns the text extracted from *all* tables found on the page, represented as a list of lists of lists, with the structure `table -> row -> cell`.|

- [Quentin André](https://github.com/QuentinAndre11)
- [Léo Roux](https://github.com/leorouxx)
- [@wodny](https://github.com/wodny)

@@ -575,0 +598,0 @@ ## Contributing

@@ -15,4 +15,2 @@ # pdfplumber

> 👋 This repository’s maintainers are available to hire for PDF data-extraction consulting projects. To get a cost estimate, contact [Jeremy](https://www.jsvine.com/consulting/pdf-data-extraction/) (for projects of any size or complexity) and/or [Samkit](https://www.linkedin.com/in/samkit-jain/) (specifically for table extraction).
## Table of Contents

@@ -85,2 +83,4 @@

To [pre-normalize Unicode text](https://unicode.org/reports/tr15/), pass `unicode_norm=...`, where `...` is one of the [four Unicode normalization forms](https://unicode.org/reports/tr15/#Normalization_Forms_Table): `"NFC"`, `"NFD"`, `"NFKC"`, or `"NFKD"`.
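These are the same four forms exposed by Python's standard-library `unicodedata.normalize` (presumably what gets applied to each extracted character). For example, the compatibility forms expand the `fi` ligature, while the canonical forms leave it alone:

```python
import unicodedata

s = "\ufb01le"  # begins with the single-character "fi" ligature, U+FB01

print(unicodedata.normalize("NFC", s))   # canonical form: ligature kept
print(unicodedata.normalize("NFKC", s))  # compatibility form: expands to "file"
```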
By default, invalid metadata values raise only a warning. If that is not desired, pass `strict_metadata=True` to the `open` method and `pdfplumber.open` will raise an exception if it is unable to parse the metadata.

@@ -260,4 +260,25 @@

[To be completed.]
*Note: Although the positioning and characteristics of `image` objects are available via `pdfplumber`, this library does not provide direct support for reconstructing image content. For that, please see [this suggestion](https://github.com/jsvine/pdfplumber/discussions/496#discussioncomment-1259772).*
| Property | Description |
|----------|-------------|
|`page_number`| Page number on which the image was found.|
|`height`| Height of the image.|
|`width`| Width of the image.|
|`x0`| Distance of left side of the image from left side of page.|
|`x1`| Distance of right side of the image from left side of page.|
|`y0`| Distance of bottom of the image from bottom of page.|
|`y1`| Distance of top of the image from bottom of page.|
|`top`| Distance of top of the image from top of page.|
|`bottom`| Distance of bottom of the image from top of page.|
|`doctop`| Distance of top of the image from top of document.|
|`srcsize`| The image's original dimensions, as a `(width, height)` tuple.|
|`colorspace`| Color domain of the image (e.g., RGB).|
|`bits`| The number of bits per color component; e.g., 8 corresponds to 255 possible values for each color component (R, G, and B in an RGB color space).|
|`stream`| Pixel values of the image, as a `pdfminer.pdftypes.PDFStream` object.|
|`imagemask`| A nullable boolean; if `True`, "specifies that the image data is to be used as a stencil mask for painting in the current color."|
|`mcid`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section ID for this image if any (otherwise `None`). *Experimental attribute.*|
|`tag`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section tag for this image if any (otherwise `None`). *Experimental attribute.*|
|`object_type`| "image"|
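As a small worked example of how these properties relate, comparing `srcsize` against the placed `width`/`height` yields the image's effective scale on the page. The dict below is hand-made, with illustrative values shaped like the table above:

```python
# Hypothetical image object: a 400x200-pixel source placed into a
# 200x100-point box on the page.
image = {
    "x0": 72, "x1": 272, "top": 72, "bottom": 172,
    "width": 200, "height": 100,
    "srcsize": (400, 200),
}
scale_x = image["srcsize"][0] / image["width"]
scale_y = image["srcsize"][1] / image["height"]
print(scale_x, scale_y)  # → 2.0 2.0
```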
### Obtaining higher-level layout objects via `pdfminer.six`

@@ -332,3 +353,3 @@

|`.extract_text_simple(x_tolerance=3, y_tolerance=3)`| A slightly faster but less flexible version of `.extract_text(...)`, using a simpler logic.|
|`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True, return_chars=False)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are `"ttb"` (top-to-bottom), `"btt"` (bottom-to-top), `"ltr"` (left-to-right), and `"rtl"` (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. Setting `split_at_punctuation` to `True` will enforce breaking tokens at the punctuation specified by `string.punctuation`; or you can specify the separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`). Passing `return_chars=True` will add to each word dictionary its constituent characters, as a list under the `"chars"` key.|
|`.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs)`|*Experimental feature* that returns a list of dictionaries representing the lines of text on the page. The `strip` parameter works analogously to Python's `str.strip()` method, and returns `text` attributes without their surrounding whitespace. (Only relevant when `layout = True`.) Setting `return_chars` to `False` will exclude the individual character objects from the returned text-line dicts. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.|

@@ -354,3 +375,3 @@ |`.search(pattern, regex=True, case=True, main_group=0, return_groups=True, return_chars=True, layout=False, **kwargs)`|*Experimental feature* that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the response dictionary object contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner. Setting `main_group` restricts the results to a specific regex group within the `pattern` (default of `0` means the entire match). Setting `return_groups` and/or `return_chars` to `False` will exclude the lists of the matched regex groups and/or characters from being added (as `"groups"` and `"chars"` to the return dicts). The `layout` parameter operates as it does for `.extract_text(...)`. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`. __Note__: Zero-width and all-whitespace matches are discarded, because they (generally) have no explicit position on the page. |

|`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, `.columns`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.|
|`.find_table(table_settings={})`|Similar to `.find_tables(...)`, but returns the *largest* table on the page, as a `Table` object. If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.|

@@ -552,2 +573,4 @@ |`.extract_tables(table_settings={})`|Returns the text extracted from *all* tables found on the page, represented as a list of lists of lists, with the structure `table -> row -> cell`.|

- [Quentin André](https://github.com/QuentinAndre11)
- [Léo Roux](https://github.com/leorouxx)
- [@wodny](https://github.com/wodny)

@@ -554,0 +577,0 @@ ## Contributing

@@ -1,8 +0,8 @@

black==24.8.0
flake8==7.1.1
isort==5.13.2
jupyterlab==3.6.7
mypy==1.11.1
nbexec==0.2.0
pandas-stubs==2.2.2.240805
pandas==2.2.2

@@ -12,4 +12,4 @@ py==1.11.0

pytest-parallel==0.1.1
pytest==8.3.2
setuptools==68.2.2
types-Pillow==10.2.0.20240520

@@ -6,2 +6,3 @@ [flake8]

W503
E704

@@ -8,0 +9,0 @@ [tool:pytest]

@@ -62,2 +62,16 @@ #!/usr/bin/env python

def test_annots_cropped(self):
    pdf = self.pdf_2
    page = pdf.pages[0]
    assert len(page.annots) == 13
    assert len(page.hyperlinks) == 1

    cropped = page.crop(page.bbox)
    assert len(cropped.annots) == 13
    assert len(cropped.hyperlinks) == 1

    h0_bbox = pdfplumber.utils.obj_to_bbox(page.hyperlinks[0])
    cropped = page.crop(h0_bbox)
    assert len(cropped.annots) == len(cropped.hyperlinks) == 1
def test_annots_rotated(self):

@@ -182,2 +196,15 @@ def get_annot(filename, n=0):

def test_unicode_normalization(self):
    path = os.path.join(HERE, "pdfs/issue-905.pdf")
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[0]
        print(page.extract_text())
        assert ord(page.chars[0]["text"]) == 894
    with pdfplumber.open(path, unicode_norm="NFC") as pdf:
        page = pdf.pages[0]
        assert ord(page.chars[0]["text"]) == 59
        assert page.extract_text() == ";;"
def test_colors(self):

@@ -184,0 +211,0 @@ rect = self.pdf.pages[0].rects[0]

@@ -30,3 +30,6 @@ #!/usr/bin/env python

assert (
    last_line_without_drop
    == "微微软软 培培训训课课程程:: 名名模模意意义义一一些些有有意意义义一一些些"
)
assert last_line_with_drop == "微软 培训课程: 名模意义一些有意义一些"

@@ -50,3 +53,6 @@

assert last_words_without_drop["upright"] == 1
assert (
    last_words_without_drop["text"]
    == "名名模模意意义义一一些些有有意意义义一一些些"
)

@@ -65,3 +71,6 @@ assert round(last_words_with_drop["x0"], 3) == x0

assert (
    last_line_without_drop
    == "微微软软 培培训训课课程程:: 名名模模意意义义一一些些有有意意义义一一些些"
)
assert last_line_with_drop == "微软 培训课程: 名模意义一些有意义一些"

@@ -68,0 +77,0 @@

@@ -313,1 +313,23 @@ #!/usr/bin/env python

assert page.extract_text()
def test_issue_1181(self):
    """
    Correctly re-calculate coordinates when MediaBox does not start at (0,0)
    """
    path = os.path.join(HERE, "pdfs/issue-1181.pdf")
    with pdfplumber.open(path) as pdf:
        p0, p1 = pdf.pages
        assert p0.crop(p0.bbox).extract_table() == [
            ["FooCol1", "FooCol2", "FooCol3"],
            ["Foo4", "Foo5", "Foo6"],
            ["Foo7", "Foo8", "Foo9"],
            ["Foo10", "Foo11", "Foo12"],
            ["", "", ""],
        ]
        assert p1.crop(p1.bbox).extract_table() == [
            ["BarCol1", "BarCol2", "BarCol3"],
            ["Bar4", "Bar5", "Bar6"],
            ["Bar7", "Bar8", "Bar9"],
            ["Bar10", "Bar11", "Bar12"],
            ["", "", ""],
        ]

@@ -56,2 +56,14 @@ #!/usr/bin/env python

def test_repair_setting(self):
    path = os.path.join(HERE, "pdfs/malformed-from-issue-932.pdf")
    with tempfile.NamedTemporaryFile("wb") as out:
        pdfplumber.repair(path, outfile=out.name)
        size_default = os.stat(out.name).st_size
    with tempfile.NamedTemporaryFile("wb") as out:
        pdfplumber.repair(path, outfile=out.name, setting="prepress")
        size_prepress = os.stat(out.name).st_size
    assert size_default > size_prepress
def test_repair_password(self):

@@ -58,0 +70,0 @@ path = os.path.join(HERE, "pdfs/password-example.pdf")

@@ -76,2 +76,28 @@ #!/usr/bin/env python

def test_rows_and_columns(self):
    path = os.path.join(HERE, "pdfs/issue-140-example.pdf")
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[0]
        table = page.find_table()
        row = [page.crop(bbox).extract_text() for bbox in table.rows[0].cells]
        assert row == [
            "Line no",
            "UPC code",
            "Location",
            "Item Description",
            "Item Quantity",
            "Bill Amount",
            "Accrued Amount",
            "Handling Rate",
            "PO number",
        ]
        col = [page.crop(bbox).extract_text() for bbox in table.columns[1].cells]
        assert col == [
            "UPC code",
            "0085648100305",
            "0085648100380",
            "0085648100303",
            "0085648100300",
        ]
def test_explicit_desc_decimalization(self):

@@ -78,0 +104,0 @@ """

@@ -102,2 +102,14 @@ #!/usr/bin/env python

def test_extract_words_return_chars(self):
    path = os.path.join(HERE, "pdfs/extra-attrs-example.pdf")
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[0]
        words = page.extract_words()
        assert "chars" not in words[0]
        words = page.extract_words(return_chars=True)
        assert "chars" in words[0]
        assert "".join(c["text"] for c in words[0]["chars"]) == words[0]["text"]
def test_text_rotation(self):

@@ -104,0 +116,0 @@ rotations = {