spaczz
Metadata-Version: 2.1
Name: spaczz
-Version: 0.5.2
+Version: 0.5.3
Summary: Adds fuzzy matching and additional regex matching support to spaCy.
@@ -40,9 +40,8 @@ Home-page: https://github.com/gandersen101/spaczz
*v0.5.3 Release Notes:*
- *Fixed a "bug" in the `TokenMatcher`. Spaczz expects token matches returned in order of ascending match start, then descending match length. However, spaCy's `Matcher` does not return matches in this order by default. Added a sort in the `TokenMatcher` to ensure this.*
*v0.5.2 Release Notes:*
- *Minor updates to pre-commits and noxfile.*
*v0.5.1 Release Notes:*
- *Minor updates to allowed dependency versions and CI.*
- *Switched back to using typing types instead of generic types because spaCy v3 uses Pydantic and Pydantic does not support generic types in Python < 3.9. I don't know if this would actually cause any issues but I am playing it safe. Potentially more changes for spaczz to play nicely with Pydantic to follow.*
Please see the [changelog](https://github.com/gandersen101/spaczz/blob/master/CHANGELOG.md) for previous release notes. This will eventually be moved to the [Read the Docs](https://spaczz.readthedocs.io/en/latest/) page.
@@ -49,0 +48,0 @@
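The v0.5.3 ordering fix noted above (ascending match start, then descending match length) amounts to a simple compound sort key. A minimal sketch, using hypothetical `(match_id, start, end)` triples rather than spaczz's actual internals:

```python
# Hypothetical (match_id, start, end) triples, out of the order spaczz
# expects, as spaCy's Matcher might return them.
matches = [("RULE", 5, 7), ("RULE", 2, 3), ("RULE", 2, 5)]

# Sort by ascending start, then descending length (start - end is more
# negative for longer matches), mirroring the ordering the v0.5.3 note
# says the TokenMatcher now enforces.
matches.sort(key=lambda m: (m[1], m[1] - m[2]))

print(matches)  # longest match at each start position comes first
```

With this key, the longer of two matches starting at the same token sorts first, which is the ordering the release note describes.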
[tool.poetry]
name = "spaczz"
-version = "0.5.2"
+version = "0.5.3"
description = "Adds fuzzy matching and additional regex matching support to spaCy."
@@ -39,3 +39,3 @@ license = "MIT"
sphinx-autodoc-typehints = ">=1.11.0"
-sphinx-autobuild = "0.*"
+sphinx-autobuild = ">=0.7.1"
codecov = ">=2.1.7"
@@ -46,2 +46,6 @@
+[tool.pytest.ini_options]
+filterwarnings = ["ignore::DeprecationWarning"]
+testpaths = ["tests"]
[tool.coverage.paths]
@@ -48,0 +52,0 @@ source = ["src", "*/site-packages"]
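The new `[tool.pytest.ini_options]` table uses pytest's `filterwarnings` setting, whose entries follow the same syntax as Python's built-in warning filters. A rough standard-library sketch of what `ignore::DeprecationWarning` does during a test run:

```python
import warnings

# Roughly what filterwarnings = ["ignore::DeprecationWarning"] arranges:
# DeprecationWarning is silenced while other warnings still surface.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    warnings.warn("old API", DeprecationWarning)   # suppressed
    warnings.warn("something else", UserWarning)   # still recorded

print([str(w.message) for w in caught])  # ['something else']
```

`testpaths = ["tests"]` simply tells pytest where to collect tests when no paths are given on the command line.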
@@ -17,9 +17,8 @@ [](https://github.com/gandersen101/spaczz/actions?workflow=Tests)
*v0.5.3 Release Notes:*
- *Fixed a "bug" in the `TokenMatcher`. Spaczz expects token matches returned in order of ascending match start, then descending match length. However, spaCy's `Matcher` does not return matches in this order by default. Added a sort in the `TokenMatcher` to ensure this.*
*v0.5.2 Release Notes:*
- *Minor updates to pre-commits and noxfile.*
*v0.5.1 Release Notes:*
- *Minor updates to allowed dependency versions and CI.*
- *Switched back to using typing types instead of generic types because spaCy v3 uses Pydantic and Pydantic does not support generic types in Python < 3.9. I don't know if this would actually cause any issues but I am playing it safe. Potentially more changes for spaczz to play nicely with Pydantic to follow.*
Please see the [changelog](https://github.com/gandersen101/spaczz/blob/master/CHANGELOG.md) for previous release notes. This will eventually be moved to the [Read the Docs](https://spaczz.readthedocs.io/en/latest/) page.
@@ -26,0 +25,0 @@
@@ -24,5 +24,5 @@ # -*- coding: utf-8 -*-
'name': 'spaczz',
-'version': '0.5.2',
+'version': '0.5.3',
'description': 'Adds fuzzy matching and additional regex matching support to spaCy.',
'long_description': '[](https://github.com/gandersen101/spaczz/actions?workflow=Tests)\n[](https://codecov.io/gh/gandersen101/spaczz)\n[](https://pypi.org/project/spaczz/)\n[](https://spaczz.readthedocs.io/)\n\n# spaczz: Fuzzy matching and more for spaCy\n\nSpaczz provides fuzzy matching and additional regex matching functionality for [spaCy](https://spacy.io/).\nSpaczz\'s components have similar APIs to their spaCy counterparts and spaczz pipeline components can integrate into spaCy pipelines where they can be saved/loaded as models.\n\nFuzzy matching is currently performed with matchers from [RapidFuzz](https://github.com/maxbachmann/rapidfuzz)\'s fuzz module and regex matching currently relies on the [regex](https://pypi.org/project/regex/) library. Spaczz certainly takes additional influence from other libraries and resources. For additional details see the references section.\n\n**Supports spaCy v3 and v2 (>= 2.2)!**\n\nSpaczz has been tested on Ubuntu 18.04, MacOS 10.15, and Windows Server 2019.\n\n*v0.5.2 Release Notes:*\n- *Minor updates to pre-commits and noxfile.*\n\n*v0.5.1 Release Notes:*\n- *Minor updates to allowed dependency versions and CI.*\n- *Switched back to using typing types instead of generic types because spaCy v3 uses Pydantic and Pydantic does not support generic types in Python < 3.9. I don\'t know if this would actually cause any issues but I am playing it safe. Potentially more changes for spaczz to play nicely with Pydantic to follow.*\n\nPlease see the [changelog](https://github.com/gandersen101/spaczz/blob/master/CHANGELOG.md) for previous release notes. 
This will eventually be moved to the [Read the Docs](https://spaczz.readthedocs.io/en/latest/) page.\n\n<h1>Table of Contents<span class="tocSkip"></span></h1>\n<div class="toc"><ul class="toc-item"><li><span><a href="#Installation" data-toc-modified-id="Installation-1">Installation</a></span></li><li><span><a href="#Basic-Usage" data-toc-modified-id="Basic-Usage-2">Basic Usage</a></span><ul class="toc-item"><li><span><a href="#FuzzyMatcher" data-toc-modified-id="FuzzyMatcher-2.1">FuzzyMatcher</a></span></li><li><span><a href="#RegexMatcher" data-toc-modified-id="RegexMatcher-2.2">RegexMatcher</a></span></li><li><span><a href="#SimilarityMatcher" data-toc-modified-id="SimilarityMatcher-2.3">SimilarityMatcher</a></span></li><li><span><a href="#TokenMatcher" data-toc-modified-id="TokenMatcher-2.4">TokenMatcher</a></span></li><li><span><a href="#SpaczzRuler" data-toc-modified-id="SpaczzRuler-2.5">SpaczzRuler</a></span></li><li><span><a href="#Custom-Attributes" data-toc-modified-id="Custom-Attributes-2.6">Custom Attributes</a></span></li><li><span><a href="#Saving/Loading" data-toc-modified-id="Saving/Loading-2.7">Saving/Loading</a></span></li></ul></li><li><span><a href="#Known-Issues" data-toc-modified-id="Known-Issues-3">Known Issues</a></span><ul class="toc-item"><li><span><a href="#Performance" data-toc-modified-id="Performance-3.1">Performance</a></span></li><li><span><a href="#SpaczzRuler-Inconsistencies" data-toc-modified-id="SpaczzRuler-Inconsistencies-3.2">SpaczzRuler Inconsistencies</a></span></li></ul></li><li><span><a href="#Roadmap" data-toc-modified-id="Roadmap-4">Roadmap</a></span></li><li><span><a href="#Development" data-toc-modified-id="Development-5">Development</a></span></li><li><span><a href="#References" data-toc-modified-id="References-6">References</a></span></li></ul></div>\n\n## Installation\n\nSpaczz can be installed using pip.\n\n\n```python\npip install spaczz\n```\n\n## Basic Usage\n\nSpaczz\'s primary features are the `FuzzyMatcher`, 
`RegexMatcher`, and "fuzzy" `TokenMatcher` that function similarly to spaCy\'s `Matcher` and `PhraseMatcher`, and the `SpaczzRuler` which integrates the spaczz matchers into a spaCy pipeline component similar to spaCy\'s `EntityRuler`.\n\n### FuzzyMatcher\n\nThe basic usage of the fuzzy matcher is similar to spaCy\'s `PhraseMatcher` except it returns the fuzzy ratio along with match id, start and end information, so make sure to include a variable for the ratio when unpacking results.\n\n\n```python\nimport spacy\nfrom spaczz.matcher import FuzzyMatcher\n\nnlp = spacy.blank("en")\ntext = """Grint Anderson created spaczz in his home at 555 Fake St,\nApt 5 in Nashv1le, TN 55555-1234 in the US.""" # Spelling errors intentional.\ndoc = nlp(text)\n\nmatcher = FuzzyMatcher(nlp.vocab)\nmatcher.add("NAME", [nlp("Grant Andersen")])\nmatcher.add("GPE", [nlp("Nashville")])\nmatches = matcher(doc)\n\nfor match_id, start, end, ratio in matches:\n print(match_id, doc[start:end], ratio)\n```\n\n NAME Grint Anderson 86\n GPE Nashv1le 82\n\n\nUnlike spaCy matchers, spaczz matchers are written in pure Python. While they are required to have a spaCy vocab passed to them during initialization, this is purely for consistency as the spaczz matchers do not currently use the spaCy vocab. This is why the `match_id` above is simply a string instead of an integer value like in spaCy matchers.\n\nSpaczz matchers can also make use of on-match rules via callback functions. These on-match callbacks need to accept the matcher itself, the doc the matcher was called on, the match index and the matches produced by the matcher.\n\n\n```python\nimport spacy\nfrom spacy.tokens import Span\nfrom spaczz.matcher import FuzzyMatcher\n\nnlp = spacy.blank("en")\ntext = """Grint Anderson created spaczz in his home at 555 Fake St,\nApt 5 in Nashv1le, TN 55555-1234 in the US.""" # Spelling errors intentional.\ndoc = nlp(text)\n\n\ndef add_name_ent(matcher, doc, i, matches):\n """Callback on match function. 
Adds "NAME" entities to doc."""\n # Get the current match and create tuple of entity label, start and end.\n # Append entity to the doc\'s entities. (Don\'t overwrite doc.ents!)\n _match_id, start, end, _ratio = matches[i]\n entity = Span(doc, start, end, label="NAME")\n doc.ents += (entity,)\n\n\nmatcher = FuzzyMatcher(nlp.vocab)\nmatcher.add("NAME", [nlp("Grant Andersen")], on_match=add_name_ent)\nmatches = matcher(doc)\n\nfor ent in doc.ents:\n print((ent.text, ent.start, ent.end, ent.label_))\n```\n\n (\'Grint Anderson\', 0, 2, \'NAME\')\n\n\nEntity updating logic very similar to spaCy\'s `EntityRuler` has been implemented in the `SpaczzRuler`. The `SpaczzRuler` also takes care of handling overlapping matches. It is discussed in a later section.\n\nUnlike spaCy\'s matchers, rules added to spaczz matchers have optional keyword arguments that can modify the matching behavior. Take the below fuzzy matching examples:\n\n\n```python\nimport spacy\nfrom spaczz.matcher import FuzzyMatcher\n\nnlp = spacy.blank("en")\n# Let\'s modify the order of the name in the text.\ntext = """Anderson, Grint created spaczz in his home at 555 Fake St,\nApt 5 in Nashv1le, TN 55555-1234 in the US.""" # Spelling errors intentional.\ndoc = nlp(text)\n\nmatcher = FuzzyMatcher(nlp.vocab)\nmatcher.add("NAME", [nlp("Grant Andersen")])\nmatches = matcher(doc)\n\n# The default fuzzy matching settings will not find a match.\nfor match_id, start, end, ratio in matches:\n print(match_id, doc[start:end], ratio)\n```\n\nNext we change the fuzzy matching behavior for the "NAME" rule.\n\n\n```python\nimport spacy\nfrom spaczz.matcher import FuzzyMatcher\n\nnlp = spacy.blank("en")\n# Let\'s modify the order of the name in the text.\ntext = """Anderson, Grint created spaczz in his home at 555 Fake St,\nApt 5 in Nashv1le, TN 55555-1234 in the US.""" # Spelling errors intentional.\ndoc = nlp(text)\n\nmatcher = FuzzyMatcher(nlp.vocab)\nmatcher.add("NAME", [nlp("Grant Andersen")], kwargs=[{"fuzzy_func": 
"token_sort"}])\nmatches = matcher(doc)\n\n# The new fuzzy matching settings will find a match.\nfor match_id, start, end, ratio in matches:\n print(match_id, doc[start:end], ratio)\n```\n\n NAME Anderson, Grint 86\n\n\nThe full list of keyword arguments available for fuzzy matching rules includes:\n\n- `fuzzy_func`: Key name of fuzzy matching function to use. All rapidfuzz matching functions with default settings are available. Default is `"simple"`:\n - "simple" = `ratio`\n - "partial" = `partial_ratio`\n - "token_set" = `token_set_ratio`\n - "token_sort" = `token_sort_ratio`\n - "partial_token_set" = `partial_token_set_ratio`\n - "partial_token_sort" = `partial_token_sort_ratio`\n - "quick" = `QRatio`\n - "weighted" = `WRatio`\n - "token" = `token_ratio`\n - "partial_token" = `partial_token_ratio`\n- `ignore_case`: If strings should be lower-cased before comparison or not. Default is `True`.\n- `flex`: Number of tokens to move match boundaries left and right during optimization. Can be an integer value with a max of `len(query)` and a min of `0` (will warn and change if higher or lower), or the strings "max", "min", or "default". Default is `"default"`: `len(query) // 2`.\n- `min_r1`: Minimum match ratio required for selection during the initial search over doc. If `flex == 0`, `min_r1` will be overwritten by `min_r2`. If `flex > 0`, `min_r1` must be lower than `min_r2` and "low" in general because match boundaries are not flexed initially. Default is `50`.\n- `min_r2`: Minimum match ratio required for selection during match optimization. Needs to be higher than `min_r1` and "high" in general to ensure only quality matches are returned. Default is `75`.\n- `thresh`: If this ratio is exceeded in the initial scan, and `flex > 0`, no optimization will be attempted. If `flex == 0`, `thresh` has no effect. Default is `100`.\n\n### RegexMatcher\n\nThe basic usage of the regex matcher is also fairly similar to spaCy\'s `PhraseMatcher`. 
It accepts regex patterns as strings so flags must be inline. Regexes are compiled with the [regex](https://pypi.org/project/regex/) package so approximate "fuzzy" matching is supported. To provide access to these "fuzzy" match results the matcher returns the fuzzy count values along with match id, start and end information, so make sure to include a variable for the counts when unpacking results.\n\n\n```python\nimport spacy\nfrom spaczz.matcher import RegexMatcher\n\nnlp = spacy.blank("en")\ntext = """Anderson, Grint created spaczz in his home at 555 Fake St,\nApt 5 in Nashv1le, TN 55555-1234 in the US.""" # Spelling errors intentional.\ndoc = nlp(text)\n\nmatcher = RegexMatcher(nlp.vocab)\n# Use inline flags for regex strings as needed\nmatcher.add(\n "APT",\n [\n r"""(?ix)((?:apartment|apt|building|bldg|floor|fl|suite|ste|unit\n|room|rm|department|dept|row|rw)\\.?\\s?)#?\\d{1,4}[a-z]?"""\n ],\n) # Not the most robust regex.\nmatcher.add("GPE", [r"(USA){d<=1}"]) # Fuzzy regex.\nmatches = matcher(doc)\n\nfor match_id, start, end, counts in matches:\n print(match_id, doc[start:end], counts)\n```\n\n APT Apt 5 (0, 0, 0)\n GPE US (0, 0, 1)\n\n\nSpaczz matchers can also make use of on-match rules via callback functions. These on-match callbacks need to accept the matcher itself, the doc the matcher was called on, the match index and the matches produced by the matcher. See the fuzzy matcher usage example above for details.\n\nLike the fuzzy matcher, the regex matcher has optional keyword arguments that can modify matching behavior. Take the below regex matching example.\n\n\n```python\nimport spacy\nfrom spaczz.matcher import RegexMatcher\n\nnlp = spacy.blank("en")\ntext = """Anderson, Grint created spaczz in his home at 555 Fake St,\nApt 5 in Nashv1le, TN 55555-1234 in the USA.""" # Spelling errors intentional. 
Notice \'USA\' here.\ndoc = nlp(text)\n\nmatcher = RegexMatcher(nlp.vocab)\n# Use inline flags for regex strings as needed\nmatcher.add(\n "STREET", ["street_addresses"], kwargs=[{"predef": True}]\n) # Use predefined regex by key name.\n# Below will not expand partial matches to span boundaries.\nmatcher.add("GPE", [r"(?i)[U](nited|\\.?) ?[S](tates|\\.?)"], kwargs=[{"partial": False}])\nmatches = matcher(doc)\n\nfor match_id, start, end, counts in matches:\n print(\n match_id, doc[start:end], counts\n ) # comma in result isn\'t ideal - see "Roadmap"\n```\n\n STREET 555 Fake St, (0, 0, 0)\n\n\nThe full list of keyword arguments available for regex matching rules includes:\n\n- `partial`: Whether partial matches should be extended to existing span boundaries in doc or not, i.e. the regex only matches part of a token or span. Default is True.\n- `predef`: Whether the regex string should be interpreted as a key to a predefined regex pattern or not. Default is False. The included regexes are:\n - `"dates"`\n - `"times"`\n - `"phones"`\n - `"phones_with_exts"`\n - `"links"`\n - `"emails"`\n - `"ips"`\n - `"ipv6s"`\n - `"prices"`\n - `"hex_colors"`\n - `"credit_cards"`\n - `"btc_addresses"`\n - `"street_addresses"`\n - `"zip_codes"`\n - `"po_boxes"`\n - `"ssn_number"`\n\nThe above patterns are the same that the [commonregex](https://github.com/madisonmay/CommonRegex) package provides.\n\n### SimilarityMatcher\n\nThe basic usage of the similarity matcher is similar to spaCy\'s `PhraseMatcher` except it returns the vector similarity ratio along with match id, start and end information, so make sure to include a variable for the ratio when unpacking results.\n\nIn order to produce meaningful results from the similarity matcher, a spaCy model with word vectors (ex. 
medium or large English models) must be used to initialize the matcher, process the target document, and process any patterns added.\n\n\n```python\nimport spacy\nfrom spaczz.matcher import SimilarityMatcher\n\nnlp = spacy.load("en_core_web_md")\ntext = "I like apples, grapes and bananas."\ndoc = nlp(text)\n\n# lowering min_r2 from default of 75 to produce matches in this example\nmatcher = SimilarityMatcher(nlp.vocab, min_r2=65)\nmatcher.add("FRUIT", [nlp("fruit")])\nmatches = matcher(doc)\n\nfor match_id, start, end, ratio in matches:\n print(match_id, doc[start:end], ratio)\n```\n\n FRUIT apples 72\n FRUIT grapes 72\n FRUIT bananas 68\n\n\nPlease note that even for the mostly pure-Python spaczz, this process is currently extremely slow so be mindful of the scope in which it is applied. Enabling GPU support in spaCy ([see here](https://spacy.io/usage#gpu)) should improve the speed somewhat, but I believe the process will still be bottlenecked in the pure-Python search algorithm until I develop a better search algorithm and/or drop the search to lower-level code (ex C).\n\nAlso as a somewhat experimental feature, the similarity matcher is not currently part of the `SpaczzRuler` nor does it have a separate ruler. If you need to add similarity matches to a doc\'s entities you will need to use an on-match callback for the time being. Please see the fuzzy matcher on-match callback example above for ideas. If there is enough interest in integrating/creating a ruler for the similarity matcher this can be done.\n\nThe full list of keyword arguments available for similarity matching rules includes:\n\n- `flex`: Number of tokens to move match span boundaries left and right during match optimization. Can be an integer value with a max of `len(query)` and a min of `0` (will warn and change if higher or lower), `"max"`, `"min"`, or `"default"`. 
Default is `"default"`: `len(query) // 2`.\n- `min_r1`: Minimum similarity match ratio required for selection during the initial search over doc. This should be lower than `min_r2` and "low" in general because match span boundaries are not flexed initially. `0` means all spans of query length in doc will have their boundaries flexed and will be re-compared during match optimization. Lower `min_r1` will result in more fine-grained matching but will run slower. Default is `50`.\n- `min_r2`: Minimum similarity match ratio required for selection during match optimization. Should be higher than `min_r1` and "high" in general to ensure only quality matches are returned. Default is `75`.\n- `thresh`: If this ratio is exceeded in the initial scan, no optimization will be attempted. Default is `100`.\n\n### TokenMatcher\n\nThe basic usage of the token matcher is similar to spaCy\'s `Matcher`. It accepts labeled patterns in the form of lists of dictionaries where each list describes an individual pattern and each dictionary describes an individual token.\n\nThe token matcher accepts all the same token attributes and pattern syntax as its spaCy counterpart but adds fuzzy and fuzzy-regex support.\n\n`"FUZZY"` and `"FREGEX"` are the two additional spaCy token pattern options.\n\nFor example:\n `{"TEXT": {"FREGEX": "(database){e<=1}"}},`\n `{"LOWER": {"FUZZY": "access", "MIN_R": 85, "FUZZY_FUNC": "quick_lev"}}`\n\n**Make sure to use uppercase dictionary keys in patterns.**\n\n\n```python\nimport spacy\nfrom spaczz.matcher import TokenMatcher\n\n# Using model results like POS tagging in token patterns requires a model that provides these.\nnlp = spacy.load("en_core_web_md")\ntext = """The manager gave me SQL databesE acess so now I can acces the Sequal DB.\nMy manager\'s name is Grfield"""\ndoc = nlp(text)\n\nmatcher = TokenMatcher(vocab=nlp.vocab)\nmatcher.add(\n "DATA",\n [\n [\n {"TEXT": "SQL"},\n {"LOWER": {"FREGEX": "(database){s<=1}"}},\n {"LOWER": {"FUZZY": "access"}},\n ],\n 
[{"TEXT": {"FUZZY": "Sequel"}, "POS": "PROPN"}, {"LOWER": "db"}],\n ],\n)\nmatcher.add("NAME", [[{"TEXT": {"FUZZY": "Garfield"}}]])\nmatches = matcher(doc)\n\nfor match_id, start, end, _ in matches: # Note the _ here. Explained below.\n print(match_id, doc[start:end])\n```\n\n DATA SQL databesE acess\n DATA Sequal DB\n NAME Grfield\n\n\nPlease note that the token matcher, as currently implemented, does not have a way to return fuzzy ratios or fuzzy-regex counts like the fuzzy matcher and regex matcher provide. To keep the API consistent, the token matcher returns a placeholder of `None` as the fourth element of the tuples it returns, so be sure to account for this like we did with `_` in unpacking above.\n\nAlso, even though the token matcher can be a drop-in replacement for spaCy\'s `Matcher`, it is still recommended to use spaCy\'s `Matcher` if you do not need the spaczz token matcher\'s fuzzy capabilities - it will slow processing down unnecessarily.\n\n### SpaczzRuler\n\nThe spaczz ruler combines the fuzzy and regex phrase matchers, and the "fuzzy" token matcher, into one pipeline component that can update a doc\'s entities similarly to spaCy\'s `EntityRuler`.\n\nPatterns must be added as an iterable of dictionaries in the format of *{label (str), pattern (str or list), type (str), optional kwargs (dict), and optional id (str)}*.\n\nFor example, a fuzzy phrase pattern:\n\n`{\'label\': \'ORG\', \'pattern\': \'Apple\', \'type\': \'fuzzy\', \'kwargs\': {\'min_r2\': 90}}`\n\nOr, a token pattern:\n\n`{\'label\': \'ORG\', \'pattern\': [{\'TEXT\': {\'FUZZY\': \'Apple\'}}], \'type\': \'token\'}`\n\n\n```python\nimport spacy\nfrom spaczz.pipeline import SpaczzRuler\n\nnlp = spacy.blank("en")\ntext = """Anderson, Grint created spaczz in his home at 555 Fake St,\nApt 5 in Nashv1le, TN 55555-1234 in the USA.\nSome of his favorite bands are Converg and Protet the Zero.""" # Spelling errors intentional.\ndoc = nlp(text)\n\npatterns = [\n {\n "label": "NAME",\n "pattern": 
"Grant Andersen",\n "type": "fuzzy",\n "kwargs": {"fuzzy_func": "token_sort"},\n },\n {\n "label": "STREET",\n "pattern": "street_addresses",\n "type": "regex",\n "kwargs": {"predef": True},\n },\n {"label": "GPE", "pattern": "Nashville", "type": "fuzzy"},\n {\n "label": "ZIP",\n "pattern": r"\\b(?:55554){s<=1}(?:(?:[-\\s])?\\d{4}\\b)",\n "type": "regex",\n }, # fuzzy regex\n {"label": "GPE", "pattern": "(?i)[U](nited|\\.?) ?[S](tates|\\.?)", "type": "regex"},\n {\n "label": "BAND",\n "pattern": [{"LOWER": {"FREGEX": "(converge){e<=1}"}}],\n "type": "token",\n },\n {\n "label": "BAND",\n "pattern": [\n {"TEXT": {"FUZZY": "Protest"}},\n {"IS_STOP": True},\n {"TEXT": {"FUZZY": "Hero"}},\n ],\n "type": "token",\n },\n]\n\nruler = SpaczzRuler(nlp)\nruler.add_patterns(patterns)\ndoc = ruler(doc)\n\nprint("Fuzzy Matches:")\nfor ent in doc.ents:\n if ent._.spaczz_type == "fuzzy":\n print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_ratio))\n\nprint("\\n", "Regex Matches:", sep="")\nfor ent in doc.ents:\n if ent._.spaczz_type == "regex":\n print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_counts))\n\nprint("\\n", "Token Matches:", sep="")\nfor ent in doc.ents:\n if ent._.spaczz_type == "token":\n # ._.spaczz_details is currently just placeholder value of 1\n print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_details))\n```\n\n Fuzzy Matches:\n (\'Anderson, Grint\', 0, 3, \'NAME\', 86)\n (\'Nashv1le\', 17, 18, \'GPE\', 82)\n\n Regex Matches:\n (\'555 Fake St,\', 9, 13, \'STREET\', (0, 0, 0))\n (\'55555-1234\', 20, 23, \'ZIP\', (1, 0, 0))\n (\'USA\', 25, 26, \'GPE\', (0, 0, 0))\n\n Token Matches:\n (\'Converg\', 34, 35, \'BAND\', 1)\n (\'Protet the Zero\', 36, 39, \'BAND\', 1)\n\n\nWe see in the example above that we are referencing some custom attributes, which are explained below.\n\nFor more `SpaczzRuler` examples see [here](https://github.com/gandersen101/spaczz/blob/master/examples/fuzzy_matching_tweaks.md). 
In particular, this provides details about the ruler\'s sorting process and fuzzy matching parameters.\n\n### Custom Attributes\n\nSpaczz initializes some custom attributes upon importing. These are under spaCy\'s `._.` attribute and are further prefixed with `spaczz_` so there should be no conflicts with your own custom attributes. If there are, spaczz will force-overwrite them.\n\nThese custom attributes are only set via the spaczz ruler at the token level. Span and doc versions of these attributes are getters that reference the token level attributes.\n\nThe following `Token` attributes are available. All are mutable:\n\n- `spaczz_token`: default = `False`. Boolean that denotes if the token is part of an ent set by the spaczz ruler.\n- `spaczz_type`: default = `None`. String that shows which matcher produced an ent using the token.\n- `spaczz_ratio`: default = `None`. If the token is part of a fuzzy-phrase-matched ent, will return the fuzzy ratio.\n- `spaczz_counts`: default = `None`. If the token is part of a regex-phrase-matched ent, will return the fuzzy counts.\n- `spaczz_details`: default = `None`. Placeholder for token matcher fuzzy ratio/counts. To be developed. Will return 1 if the token is part of a "fuzzy"-token-matched ent.\n\nThe following `Span` attributes reference the token attributes included in the span. All are immutable:\n\n- `spaczz_ent`: default = `False`. Boolean that denotes if all tokens in the span are part of an ent set by the spaczz ruler.\n- `spaczz_type`: default = `None`. String that denotes which matcher produced an ent using the included tokens.\n- `spaczz_types`: default = `set()`. Set that shows which matchers produced ents using the included tokens. An entity span should only have one type, but this allows you to see the types included in any arbitrary span.\n- `spaczz_ratio`: default = `None`. If all the tokens in the span are part of a fuzzy-phrase-matched ent, will return the fuzzy ratio.\n- `spaczz_counts`: default = `None`. 
If all the tokens in the span are part of a regex-phrase-matched ent, will return the fuzzy counts.\n- `spaczz_details`: default = `None`. Placeholder for token matcher fuzzy ratio/counts. To be developed. Will return 1 if all the tokens in the span are part of a "fuzzy"-token-matched ent.\n\nThe following `Doc` attributes reference the token attributes included in the doc. All are immutable:\n\n- `spaczz_doc`: default = `False`. Boolean that denotes if any tokens in the doc are part of an ent set by the spaczz ruler.\n- `spaczz_types`: default = `set()`. Set that shows which matchers produced ents in the doc.\n\n### Saving/Loading\n\nThe `SpaczzRuler` has its own to/from disk/bytes methods and will accept `config` parameters passed to `spacy.load()`. It also has its own spaCy factory entry point so spaCy is aware of the `SpaczzRuler`. Below is an example of saving and loading a spaCy pipeline with the small English model, the `EntityRuler`, and the `SpaczzRuler`.\n\n\n```python\nimport spacy\nfrom spaczz.pipeline import SpaczzRuler\n\nnlp = spacy.load("en_core_web_sm")\ntext = """Anderson, Grint created spaczz in his home at 555 Fake St,\nApt 5 in Nashv1le, TN 55555-1234 in the USA.\nSome of his favorite bands are Converg and Protet the Zero.""" # Spelling errors intentional.\ndoc = nlp(text)\n\nfor ent in doc.ents:\n print((ent.text, ent.start, ent.end, ent.label_))\n```\n\n (\'Anderson\', 0, 1, \'PERSON\')\n (\'Grint\', 2, 3, \'PERSON\')\n (\'555\', 9, 10, \'CARDINAL\')\n (\'5\', 15, 16, \'CARDINAL\')\n (\'TN 55555-1234\', 19, 23, \'DATE\')\n (\'USA\', 25, 26, \'GPE\')\n (\'Converg\', 34, 35, \'PERSON\')\n (\'Protet\', 36, 37, \'PERSON\')\n\n\nWhile spaCy does a decent job of identifying that named entities are present in this example, we can definitely improve the matches - particularly with the types of labels applied.\n\nLet\'s add an entity ruler for some rules-based matches.\n\n\n```python\nfrom spacy.pipeline import EntityRuler\n\nentity_ruler = 
nlp.add_pipe("entity_ruler", before="ner") #spaCy v3 syntax\nentity_ruler.add_patterns(\n [{"label": "GPE", "pattern": "Nashville"}, {"label": "GPE", "pattern": "TN"}]\n)\n\ndoc = nlp(text)\n\nfor ent in doc.ents:\n print((ent.text, ent.start, ent.end, ent.label_))\n```\n\n (\'Anderson\', 0, 1, \'PERSON\')\n (\'Grint\', 2, 3, \'PERSON\')\n (\'555\', 9, 10, \'CARDINAL\')\n (\'5\', 15, 16, \'CARDINAL\')\n (\'TN\', 19, 20, \'GPE\')\n (\'USA\', 25, 26, \'GPE\')\n (\'Converg\', 34, 35, \'PERSON\')\n (\'Protet\', 36, 37, \'PERSON\')\n\n\nWe\'re making progress, but Nashville is spelled wrong in the text so the entity ruler does not find it, and we still have other entities to fix/find.\n\nLet\'s add a spaczz ruler to round this pipeline out. We will also include the `spaczz_span` custom attribute in the results to denote which entities were set via spaczz.\n\n\n```python\nspaczz_ruler = nlp.add_pipe("spaczz_ruler", before="ner") #spaCy v3 syntax\nspaczz_ruler.add_patterns(\n [\n {\n "label": "NAME",\n "pattern": "Grant Andersen",\n "type": "fuzzy",\n "kwargs": {"fuzzy_func": "token_sort"},\n },\n {\n "label": "STREET",\n "pattern": "street_addresses",\n "type": "regex",\n "kwargs": {"predef": True},\n },\n {"label": "GPE", "pattern": "Nashville", "type": "fuzzy"},\n {\n "label": "ZIP",\n "pattern": r"\\b(?:55554){s<=1}(?:[-\\s]\\d{4})?\\b",\n "type": "regex",\n }, # fuzzy regex\n {\n "label": "BAND",\n "pattern": [{"LOWER": {"FREGEX": "(converge){e<=1}"}}],\n "type": "token",\n },\n {\n "label": "BAND",\n "pattern": [\n {"TEXT": {"FUZZY": "Protest"}},\n {"IS_STOP": True},\n {"TEXT": {"FUZZY": "Hero"}},\n ],\n "type": "token",\n },\n ]\n)\n\ndoc = nlp(text)\n\nfor ent in doc.ents:\n print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_ent))\n```\n\n (\'Anderson, Grint\', 0, 3, \'NAME\', True)\n (\'555 Fake St,\', 9, 13, \'STREET\', True)\n (\'5\', 15, 16, \'CARDINAL\', False)\n (\'Nashv1le\', 17, 18, \'GPE\', True)\n (\'TN\', 19, 20, \'GPE\', False)\n 
(\'55555-1234\', 20, 23, \'ZIP\', True)\n (\'USA\', 25, 26, \'GPE\', False)\n (\'Converg\', 34, 35, \'BAND\', True)\n (\'Protet the Zero\', 36, 39, \'BAND\', True)\n\n\nAwesome! The small English model still makes a named entity recognition mistake ("5" in "Apt 5" as `CARDINAL`), but we\'re satisfied overall.\n\nLet\'s save this pipeline to disk and make sure we can load it back correctly.\n\n\n```python\nnlp.to_disk("./example")\nnlp = spacy.load("./example")\nnlp.pipe_names\n```\n\n\n\n\n [\'tok2vec\',\n \'tagger\',\n \'parser\',\n \'entity_ruler\',\n \'spaczz_ruler\',\n \'ner\',\n \'attribute_ruler\',\n \'lemmatizer\']\n\n\n\nWe can even ensure all the spaczz ruler patterns are still present.\n\n\n```python\nspaczz_ruler = nlp.get_pipe("spaczz_ruler")\nspaczz_ruler.patterns\n```\n\n\n\n\n [{\'label\': \'NAME\',\n \'pattern\': \'Grant Andersen\',\n \'type\': \'fuzzy\',\n \'kwargs\': {\'fuzzy_func\': \'token_sort\'}},\n {\'label\': \'GPE\', \'pattern\': \'Nashville\', \'type\': \'fuzzy\'},\n {\'label\': \'STREET\',\n \'pattern\': \'street_addresses\',\n \'type\': \'regex\',\n \'kwargs\': {\'predef\': True}},\n {\'label\': \'ZIP\',\n \'pattern\': \'\\\\b(?:55554){s<=1}(?:[-\\\\s]\\\\d{4})?\\\\b\',\n \'type\': \'regex\'},\n {\'label\': \'BAND\',\n \'pattern\': [{\'LOWER\': {\'FREGEX\': \'(converge){e<=1}\'}}],\n \'type\': \'token\'},\n {\'label\': \'BAND\',\n \'pattern\': [{\'TEXT\': {\'FUZZY\': \'Protest\'}},\n {\'IS_STOP\': True},\n {\'TEXT\': {\'FUZZY\': \'Hero\'}}],\n \'type\': \'token\'}]\n\n\n\n## Known Issues\n\n### Performance\n\nThe main reason for spaczz\'s slower speed is that the *c* in its name is not capitalized like it is in spa*C*y.\nSpaczz is written in pure Python and its matchers do not currently utilize spaCy language vocabularies, which means following its logic should be easy for those familiar with Python. 
However, this means spaczz components will run slower and likely consume more memory than their spaCy counterparts, especially as more patterns are added and documents get longer. It is therefore recommended to use spaCy components like the EntityRuler for entities with little uncertainty, like consistent spelling errors. Use spaczz components when there are not viable spaCy alternatives.\n\nI am actively working on performance optimizations to spaczz but it is a gradual process. Algorithmic and optimization suggestions are welcome. I am working on learning C but currently C-based work is outside of my skill set.\n\nThe `FuzzyMatcher`, and even more so the `SimilarityMatcher`, are the slowest spaczz components (although allowing for enough "fuzzy" matches in the `RegexMatcher` can get really slow as well). The primary methods for speeding these components up are decreasing the `flex` parameter towards `0`, or if `flex > 0`, increasing the `min_r1` parameter towards the value of `min_r2` and/or lowering the `thresh` parameter towards `min_r2`. Be aware that all of these "speed-ups" come at the opportunity cost of potentially improved matches.\n\nAs mentioned in the `SimilarityMatcher` description, utilizing a GPU will also help speed up its matching process.\n\nI will likely try to develop some automated and/or heuristic-based API options (while retaining all the current options) in the future to simplify this "tuning" process.\n\n### SpaczzRuler Inconsistencies\n\nThis one is particularly annoying for me because I built myself into this hole trying to support too much too fast. That being said, I have addressed much of this as of spaczz 0.4.2 and will continue to improve these issues.\n\nSpaczz, like spaCy, has undefined behavior for multiple labels (or label/ent_id combos) sharing the same pattern. For example, if you add the pattern `"Ireland"` as both `"GPE"` and `"NAME"`, the resulting label is unpredictable. 
For the most part this isn\'t an issue, but spaczz also has to deal with the additional wrinkle of fuzzy matches.\n\nFor example, say we are looking for the string `"Ireland"` and have the patterns `["Ireland", "Iceland"]`. Even with a required match ratio of `85`, these will both match at `100` and `86` respectively. When just dealing with fuzzy matches this isn\'t an issue, as we can sort by descending match ratio. However, what if the `"Iceland"` pattern was a regex pattern and it returned a tuple of fuzzy regex counts? Or what if the `"Iceland"` pattern was a token pattern and the `TokenMatcher` does not even currently provide match details?!\n\nThe above problem is twofold. First and foremost, I need to develop a way or ways to compare apples to oranges - fuzzy ratios and fuzzy regex counts. Then I need to figure out how to include match details from the `TokenMatcher`, which supports both fuzzy and "fuzzy" regex matches.\n\nFor a short-term solution, I am having the entity ruler first go through sorted fuzzy matches, then sorted regex matches, and lastly token matches. Token matches will only be sorted by length of match, not quality, so they may provide inconsistent results. Try to be mindful of your token patterns.\n\nThere is additional logic in place to filter overlapping matches, preserving earlier matches over later ones. This order of priority (fuzzy, regex, token) may not be ideal for everyone, but adding a way to change the order (say regex patterns first) would be a temporary solution to a temporary problem.\n\nPlease bear with me through these growing pains.\n\n## Roadmap\n\nI am always open and receptive to feature requests, but just be aware that, as a solo-dev with a lot left to learn, development can move pretty slowly. The following is my roadmap for spaczz so you can see where issues raised might fit into my current priorities.\n\n**High Priority**\n\n1. Bug fixes - both breaking and behavioral. Hopefully these will be minimal.\n1. 
Ease of use and error/warning handling and messaging enhancements.\n1. Building out Read the Docs.\n1. A method for comparing fuzzy ratios and fuzzy regex counts.\n1. A way to return match details from the `TokenMatcher`.\n1. Option to prioritize match quality over length and/or weighting options.\n1. Profiling - hopefully to find "easy" performance optimizations.\n\n**Enhancements**\n\n1. API support for adding user-defined regexes to the predefined regex.\n 1. Saving these additional predefined regexes as part of the SpaczzRuler will also be supported.\n1. Entity start/end trimming on the token level to prevent fuzzy and regex phrase matches from starting/ending with unwanted tokens, i.e. spaces/punctuation.\n\n**Long-Horizon Performance Enhancements**\n\n1. Having spaczz matchers utilize spaCy vocabularies.\n1. Rewrite the phrase and token searching algorithms in Cython to utilize C speed.\n 1. Try to integrate closely with spaCy.\n\n## Development\n\nPull requests and contributors are welcome.\n\nspaczz is linted with [Flake8](https://flake8.pycqa.org/en/latest/), formatted with [Black](https://black.readthedocs.io/en/stable/), type-checked with [MyPy](http://mypy-lang.org/) (although this could benefit from improved specificity), tested with [Pytest](https://docs.pytest.org/en/stable/), automated with [Nox](https://nox.thea.codes/en/stable/), and built/packaged with [Poetry](https://python-poetry.org/). There are a few other development tools detailed in the noxfile.py, along with Git pre-commit hooks.\n\nTo contribute to spaczz\'s development, fork the repository, then install spaczz and its dev dependencies with Poetry. If you\'re interested in being a regular contributor, please contact me directly.\n\n\n```shell\npoetry install # Within spaczz\'s root directory.\n```\n\nI keep Nox and pre-commit outside of my poetry environment as part of my Python toolchain environments. 
With pre-commit installed, you may also need to run the below to commit changes.\n\n\n```shell\npre-commit install\n```\n\nThe only other package that will not be installed via Poetry but is used for testing and in-documentation examples is the spaCy medium English model (`en-core-web-md`). This will need to be installed separately. The command below should do the trick:\n\n\n```shell\npoetry run python -m spacy download "en_core_web_md"\n```\n\n## References\n\n- Spaczz tries to stay as close to [spaCy](https://spacy.io/)\'s API as possible. Whenever it made sense to use existing spaCy code within spaczz, this was done.\n- Fuzzy matching is performed using [RapidFuzz](https://github.com/maxbachmann/rapidfuzz).\n- Regexes are performed using the [regex](https://pypi.org/project/regex/) library.\n- The search algorithm for phrase-based fuzzy and similarity matching was heavily influenced by Stack Overflow user Ulf Aslak\'s answer in this [thread](https://stackoverflow.com/questions/36013295/find-best-substring-match).\n- Spaczz\'s predefined regex patterns were borrowed from the [commonregex](https://github.com/madisonmay/CommonRegex) package.\n- Spaczz\'s development and CI/CD patterns were inspired by Claudio Jolowicz\'s [*Hypermodern Python*](https://cjolowicz.github.io/posts/hypermodern-python-01-setup/) article series.\n',
| 'long_description': '[](https://github.com/gandersen101/spaczz/actions?workflow=Tests)\n[](https://codecov.io/gh/gandersen101/spaczz)\n[](https://pypi.org/project/spaczz/)\n[](https://spaczz.readthedocs.io/)\n\n# spaczz: Fuzzy matching and more for spaCy\n\nSpaczz provides fuzzy matching and additional regex matching functionality for [spaCy](https://spacy.io/).\nSpaczz\'s components have similar APIs to their spaCy counterparts and spaczz pipeline components can integrate into spaCy pipelines where they can be saved/loaded as models.\n\nFuzzy matching is currently performed with matchers from [RapidFuzz](https://github.com/maxbachmann/rapidfuzz)\'s fuzz module and regex matching currently relies on the [regex](https://pypi.org/project/regex/) library. Spaczz certainly takes additional influence from other libraries and resources. For additional details see the references section.\n\n**Supports spaCy v3 and v2 (>= 2.2)!**\n\nSpaczz has been tested on Ubuntu 18.04, MacOS 10.15, and Windows Server 2019.\n\n*v0.5.3 Release Notes:*\n- *Fixed a "bug" in the `TokenMatcher`. Spaczz expects token matches returned in order of ascending match start, then descending match length. However, spaCy\'s `Matcher` does not return matches in this order by default. Added a sort in the `TokenMatcher` to ensure this.*\n\n*v0.5.2 Release Notes:*\n- *Minor updates to pre-commits and noxfile.*\n\nPlease see the [changelog](https://github.com/gandersen101/spaczz/blob/master/CHANGELOG.md) for previous release notes. 
This will eventually be moved to the [Read the Docs](https://spaczz.readthedocs.io/en/latest/) page.\n\n<h1>Table of Contents<span class="tocSkip"></span></h1>\n<div class="toc"><ul class="toc-item"><li><span><a href="#Installation" data-toc-modified-id="Installation-1">Installation</a></span></li><li><span><a href="#Basic-Usage" data-toc-modified-id="Basic-Usage-2">Basic Usage</a></span><ul class="toc-item"><li><span><a href="#FuzzyMatcher" data-toc-modified-id="FuzzyMatcher-2.1">FuzzyMatcher</a></span></li><li><span><a href="#RegexMatcher" data-toc-modified-id="RegexMatcher-2.2">RegexMatcher</a></span></li><li><span><a href="#SimilarityMatcher" data-toc-modified-id="SimilarityMatcher-2.3">SimilarityMatcher</a></span></li><li><span><a href="#TokenMatcher" data-toc-modified-id="TokenMatcher-2.4">TokenMatcher</a></span></li><li><span><a href="#SpaczzRuler" data-toc-modified-id="SpaczzRuler-2.5">SpaczzRuler</a></span></li><li><span><a href="#Custom-Attributes" data-toc-modified-id="Custom-Attributes-2.6">Custom Attributes</a></span></li><li><span><a href="#Saving/Loading" data-toc-modified-id="Saving/Loading-2.7">Saving/Loading</a></span></li></ul></li><li><span><a href="#Known-Issues" data-toc-modified-id="Known-Issues-3">Known Issues</a></span><ul class="toc-item"><li><span><a href="#Performance" data-toc-modified-id="Performance-3.1">Performance</a></span></li><li><span><a href="#SpaczzRuler-Inconsistencies" data-toc-modified-id="SpaczzRuler-Inconsistencies-3.2">SpaczzRuler Inconsistencies</a></span></li></ul></li><li><span><a href="#Roadmap" data-toc-modified-id="Roadmap-4">Roadmap</a></span></li><li><span><a href="#Development" data-toc-modified-id="Development-5">Development</a></span></li><li><span><a href="#References" data-toc-modified-id="References-6">References</a></span></li></ul></div>\n\n## Installation\n\nSpaczz can be installed using pip.\n\n\n```python\npip install spaczz\n```\n\n## Basic Usage\n\nSpaczz\'s primary features are the `FuzzyMatcher`, 
`RegexMatcher`, and "fuzzy" `TokenMatcher` that function similarly to spaCy\'s `Matcher` and `PhraseMatcher`, and the `SpaczzRuler` which integrates the spaczz matchers into a spaCy pipeline component similar to spaCy\'s `EntityRuler`.\n\n### FuzzyMatcher\n\nThe basic usage of the fuzzy matcher is similar to spaCy\'s `PhraseMatcher` except it returns the fuzzy ratio along with match id, start and end information, so make sure to include a variable for the ratio when unpacking results.\n\n\n```python\nimport spacy\nfrom spaczz.matcher import FuzzyMatcher\n\nnlp = spacy.blank("en")\ntext = """Grint Anderson created spaczz in his home at 555 Fake St,\nApt 5 in Nashv1le, TN 55555-1234 in the US.""" # Spelling errors intentional.\ndoc = nlp(text)\n\nmatcher = FuzzyMatcher(nlp.vocab)\nmatcher.add("NAME", [nlp("Grant Andersen")])\nmatcher.add("GPE", [nlp("Nashville")])\nmatches = matcher(doc)\n\nfor match_id, start, end, ratio in matches:\n print(match_id, doc[start:end], ratio)\n```\n\n NAME Grint Anderson 86\n GPE Nashv1le 82\n\n\nUnlike spaCy matchers, spaczz matchers are written in pure Python. While they are required to have a spaCy vocab passed to them during initialization, this is purely for consistency as the spaczz matchers do not currently use the spaCy vocab. This is why the `match_id` above is simply a string instead of an integer value like in spaCy matchers.\n\nSpaczz matchers can also make use of on-match rules via callback functions. These on-match callbacks need to accept the matcher itself, the doc the matcher was called on, the match index and the matches produced by the matcher.\n\n\n```python\nimport spacy\nfrom spacy.tokens import Span\nfrom spaczz.matcher import FuzzyMatcher\n\nnlp = spacy.blank("en")\ntext = """Grint Anderson created spaczz in his home at 555 Fake St,\nApt 5 in Nashv1le, TN 55555-1234 in the US.""" # Spelling errors intentional.\ndoc = nlp(text)\n\n\ndef add_name_ent(matcher, doc, i, matches):\n """Callback on match function. 
Adds "NAME" entities to doc."""\n # Get the current match and create tuple of entity label, start and end.\n # Append entity to the doc\'s entity. (Don\'t overwrite doc.ents!)\n _match_id, start, end, _ratio = matches[i]\n entity = Span(doc, start, end, label="NAME")\n doc.ents += (entity,)\n\n\nmatcher = FuzzyMatcher(nlp.vocab)\nmatcher.add("NAME", [nlp("Grant Andersen")], on_match=add_name_ent)\nmatches = matcher(doc)\n\nfor ent in doc.ents:\n print((ent.text, ent.start, ent.end, ent.label_))\n```\n\n (\'Grint Anderson\', 0, 2, \'NAME\')\n\n\nLike spaCy\'s `EntityRuler`, a very similar entity updating logic has been implemented in the `SpaczzRuler`. The `SpaczzRuler` also takes care of handling overlapping matches. It is discussed in a later section.\n\nUnlike spaCy\'s matchers, rules added to spaczz matchers have optional keyword arguments that can modify the matching behavior. Take the below fuzzy matching examples:\n\n\n```python\nimport spacy\nfrom spaczz.matcher import FuzzyMatcher\n\nnlp = spacy.blank("en")\n# Let\'s modify the order of the name in the text.\ntext = """Anderson, Grint created spaczz in his home at 555 Fake St,\nApt 5 in Nashv1le, TN 55555-1234 in the US.""" # Spelling errors intentional.\ndoc = nlp(text)\n\nmatcher = FuzzyMatcher(nlp.vocab)\nmatcher.add("NAME", [nlp("Grant Andersen")])\nmatches = matcher(doc)\n\n# The default fuzzy matching settings will not find a match.\nfor match_id, start, end, ratio in matches:\n print(match_id, doc[start:end], ratio)\n```\n\nNext we change the fuzzy matching behavior for the "NAME" rule.\n\n\n```python\nimport spacy\nfrom spaczz.matcher import FuzzyMatcher\n\nnlp = spacy.blank("en")\n# Let\'s modify the order of the name in the text.\ntext = """Anderson, Grint created spaczz in his home at 555 Fake St,\nApt 5 in Nashv1le, TN 55555-1234 in the US.""" # Spelling errors intentional.\ndoc = nlp(text)\n\nmatcher = FuzzyMatcher(nlp.vocab)\nmatcher.add("NAME", [nlp("Grant Andersen")], kwargs=[{"fuzzy_func": 
"token_sort"}])\nmatches = matcher(doc)\n\n# The modified fuzzy matching settings will now find a match.\nfor match_id, start, end, ratio in matches:\n print(match_id, doc[start:end], ratio)\n```\n\n NAME Anderson, Grint 86\n\n\nThe full list of keyword arguments available for fuzzy matching rules includes:\n\n- `fuzzy_func`: Key name of fuzzy matching function to use. All rapidfuzz matching functions with default settings are available:\n - "simple" = `ratio`\n - "partial" = `partial_ratio`\n - "token_set" = `token_set_ratio`\n - "token_sort" = `token_sort_ratio`\n - "partial_token_set" = `partial_token_set_ratio`\n - "partial_token_sort" = `partial_token_sort_ratio`\n - "quick" = `QRatio`\n - "weighted" = `WRatio`\n - "token" = `token_ratio`\n - "partial_token" = `partial_token_ratio`\n Default is `"simple"`.\n- `ignore_case`: If strings should be lower-cased before comparison or not. Default is `True`.\n- `flex`: Number of tokens to move match boundaries left and right during optimization. Can be an integer value with a max of `len(query)` and a min of `0` (will warn and change if higher or lower), or the strings "max", "min", or "default". Default is `"default"`: `len(query) // 2`.\n- `min_r1`: Minimum match ratio required for selection during the initial search over doc. If `flex == 0`, `min_r1` will be overwritten by `min_r2`. If `flex > 0`, `min_r1` must be lower than `min_r2` and "low" in general because match boundaries are not flexed initially. Default is `50`.\n- `min_r2`: Minimum match ratio required for selection during match optimization. Needs to be higher than `min_r1` and "high" in general to ensure only quality matches are returned. Default is `75`.\n- `thresh`: If this ratio is exceeded in initial scan, and `flex > 0`, no optimization will be attempted. If `flex == 0`, `thresh` has no effect. Default is `100`.\n\n### RegexMatcher\n\nThe basic usage of the regex matcher is also fairly similar to spaCy\'s `PhraseMatcher`. 
It accepts regex patterns as strings so flags must be inline. Regexes are compiled with the [regex](https://pypi.org/project/regex/) package so approximate "fuzzy" matching is supported. To provide access to these "fuzzy" match results the matcher returns the fuzzy count values along with match id, start and end information, so make sure to include a variable for the counts when unpacking results.\n\n\n```python\nimport spacy\nfrom spaczz.matcher import RegexMatcher\n\nnlp = spacy.blank("en")\ntext = """Anderson, Grint created spaczz in his home at 555 Fake St,\nApt 5 in Nashv1le, TN 55555-1234 in the US.""" # Spelling errors intentional.\ndoc = nlp(text)\n\nmatcher = RegexMatcher(nlp.vocab)\n# Use inline flags for regex strings as needed\nmatcher.add(\n "APT",\n [\n r"""(?ix)((?:apartment|apt|building|bldg|floor|fl|suite|ste|unit\n|room|rm|department|dept|row|rw)\\.?\\s?)#?\\d{1,4}[a-z]?"""\n ],\n) # Not the most robust regex.\nmatcher.add("GPE", [r"(USA){d<=1}"]) # Fuzzy regex.\nmatches = matcher(doc)\n\nfor match_id, start, end, counts in matches:\n print(match_id, doc[start:end], counts)\n```\n\n APT Apt 5 (0, 0, 0)\n GPE US (0, 0, 1)\n\n\nSpaczz matchers can also make use of on-match rules via callback functions. These on-match callbacks need to accept the matcher itself, the doc the matcher was called on, the match index and the matches produced by the matcher. See the fuzzy matcher usage example above for details.\n\nLike the fuzzy matcher, the regex matcher has optional keyword arguments that can modify matching behavior. Take the below regex matching example.\n\n\n```python\nimport spacy\nfrom spaczz.matcher import RegexMatcher\n\nnlp = spacy.blank("en")\ntext = """Anderson, Grint created spaczz in his home at 555 Fake St,\nApt 5 in Nashv1le, TN 55555-1234 in the USA.""" # Spelling errors intentional. 
Notice \'USA\' here.\ndoc = nlp(text)\n\nmatcher = RegexMatcher(nlp.vocab)\n# Use inline flags for regex strings as needed\nmatcher.add(\n "STREET", ["street_addresses"], kwargs=[{"predef": True}]\n) # Use predefined regex by key name.\n# Below will not expand partial matches to span boundaries.\nmatcher.add("GPE", [r"(?i)[U](nited|\\.?) ?[S](tates|\\.?)"], kwargs=[{"partial": False}])\nmatches = matcher(doc)\n\nfor match_id, start, end, counts in matches:\n print(\n match_id, doc[start:end], counts\n ) # comma in result isn\'t ideal - see "Roadmap"\n```\n\n STREET 555 Fake St, (0, 0, 0)\n\n\nThe full list of keyword arguments available for regex matching rules includes:\n\n- `partial`: Whether partial matches should be extended to existing span boundaries in doc or not, i.e. the regex only matches part of a token or span. Default is True.\n- `predef`: Whether the regex string should be interpreted as a key to a predefined regex pattern or not. Default is False. The included regexes are:\n - `"dates"`\n - `"times"`\n - `"phones"`\n - `"phones_with_exts"`\n - `"links"`\n - `"emails"`\n - `"ips"`\n - `"ipv6s"`\n - `"prices"`\n - `"hex_colors"`\n - `"credit_cards"`\n - `"btc_addresses"`\n - `"street_addresses"`\n - `"zip_codes"`\n - `"po_boxes"`\n - `"ssn_number"`\n\nThe above patterns are the same that the [commonregex](https://github.com/madisonmay/CommonRegex) package provides.\n\n### SimilarityMatcher\n\nThe basic usage of the similarity matcher is similar to spaCy\'s `PhraseMatcher` except it returns the vector similarity ratio along with match id, start and end information, so make sure to include a variable for the ratio when unpacking results.\n\nIn order to produce meaningful results from the similarity matcher, a spaCy model with word vectors (ex. 
medium or large English models) must be used to initialize the matcher, process the target document, and process any patterns added.\n\n\n```python\nimport spacy\nfrom spaczz.matcher import SimilarityMatcher\n\nnlp = spacy.load("en_core_web_md")\ntext = "I like apples, grapes and bananas."\ndoc = nlp(text)\n\n# lowering min_r2 from default of 75 to produce matches in this example\nmatcher = SimilarityMatcher(nlp.vocab, min_r2=65)\nmatcher.add("FRUIT", [nlp("fruit")])\nmatches = matcher(doc)\n\nfor match_id, start, end, ratio in matches:\n print(match_id, doc[start:end], ratio)\n```\n\n FRUIT apples 72\n FRUIT grapes 72\n FRUIT bananas 68\n\n\nPlease note that even for the mostly pure-Python spaczz, this process is currently extremely slow so be mindful of the scope in which it is applied. Enabling GPU support in spaCy ([see here](https://spacy.io/usage#gpu)) should improve the speed somewhat, but I believe the process will still be bottlenecked in the pure-Python search algorithm until I develop a better search algorithm and/or drop the search to lower-level code (ex C).\n\nAlso as a somewhat experimental feature, the similarity matcher is not currently part of the `SpaczzRuler` nor does it have a separate ruler. If you need to add similarity matches to a doc\'s entities you will need to use an on-match callback for the time being. Please see the fuzzy matcher on-match callback example above for ideas. If there is enough interest in integrating/creating a ruler for the similarity matcher this can be done.\n\nThe full list of keyword arguments available for similarity matching rules includes:\n\n- `flex`: Number of tokens to move match span boundaries left and right during match optimization. Can be an integer value with a max of `len(query)` and a min of `0` (will warn and change if higher or lower), `"max"`, `"min"`, or `"default"`. 
Default is `"default"`: `len(query) // 2`.\n- `min_r1`: Minimum similarity match ratio required for selection during the initial search over doc. This should be lower than `min_r2` and "low" in general because match span boundaries are not flexed initially. `0` means all spans of query length in doc will have their boundaries flexed and will be re-compared during match optimization. Lower `min_r1` will result in more fine-grained matching but will run slower. Default is `50`.\n- `min_r2`: Minimum similarity match ratio required for selection during match optimization. Should be higher than `min_r1` and "high" in general to ensure only quality matches are returned. Default is `75`.\n- `thresh`: If this ratio is exceeded in the initial scan, no optimization will be attempted. Default is `100`.\n\n### TokenMatcher\n\nThe basic usage of the token matcher is similar to spaCy\'s `Matcher`. It accepts labeled patterns in the form of lists of dictionaries where each list describes an individual pattern and each dictionary describes an individual token.\n\nThe token matcher accepts all the same token attributes and pattern syntax as its spaCy counterpart but adds fuzzy and fuzzy-regex support.\n\n`"FUZZY"` and `"FREGEX"` are the two additional token pattern options spaczz adds.\n\nFor example:\n `{"TEXT": {"FREGEX": "(database){e<=1}"}},`\n `{"LOWER": {"FUZZY": "access", "MIN_R": 85, "FUZZY_FUNC": "quick_lev"}}`\n\n**Make sure to use uppercase dictionary keys in patterns.**\n\n\n```python\nimport spacy\nfrom spaczz.matcher import TokenMatcher\n\n# Using model results like POS tagging in token patterns requires a model that provides these.\nnlp = spacy.load("en_core_web_md")\ntext = """The manager gave me SQL databesE acess so now I can acces the Sequal DB.\nMy manager\'s name is Grfield"""\ndoc = nlp(text)\n\nmatcher = TokenMatcher(vocab=nlp.vocab)\nmatcher.add(\n "DATA",\n [\n [\n {"TEXT": "SQL"},\n {"LOWER": {"FREGEX": "(database){s<=1}"}},\n {"LOWER": {"FUZZY": "access"}},\n ],\n 
[{"TEXT": {"FUZZY": "Sequel"}, "POS": "PROPN"}, {"LOWER": "db"}],\n ],\n)\nmatcher.add("NAME", [[{"TEXT": {"FUZZY": "Garfield"}}]])\nmatches = matcher(doc)\n\nfor match_id, start, end, _ in matches: # Note the _ here. Explained below.\n print(match_id, doc[start:end])\n```\n\n DATA SQL databesE acess\n DATA Sequal DB\n NAME Grfield\n\n\nPlease note that the token matcher, as currently implemented, does not have a way to return fuzzy ratios or fuzzy-regex counts like the fuzzy matcher and regex matcher provide. To keep the API consistent, the token matcher returns a placeholder of `None` as the fourth element of the tuples it returns, so be sure to account for this like we did with `_` in unpacking above.\n\nAlso, even though the token matcher can be a drop-in replacement for spaCy\'s `Matcher`, it is still recommended to use spaCy\'s `Matcher` if you do not need the spaczz token matcher\'s fuzzy capabilities - it will slow processing down unnecessarily.\n\n### SpaczzRuler\n\nThe spaczz ruler combines the fuzzy and regex phrase matchers, and the "fuzzy" token matcher, into one pipeline component that can update a doc\'s entities, similar to spaCy\'s `EntityRuler`.\n\nPatterns must be added as an iterable of dictionaries in the format of *{label (str), pattern (str or list), type (str), optional kwargs (dict), and optional id (str)}*.\n\nFor example, a fuzzy phrase pattern:\n\n`{\'label\': \'ORG\', \'pattern\': \'Apple\', \'type\': \'fuzzy\', \'kwargs\': {\'min_r2\': 90}}`\n\nOr, a token pattern:\n\n`{\'label\': \'ORG\', \'pattern\': [{\'TEXT\': {\'FUZZY\': \'Apple\'}}], \'type\': \'token\'}`\n\n\n```python\nimport spacy\nfrom spaczz.pipeline import SpaczzRuler\n\nnlp = spacy.blank("en")\ntext = """Anderson, Grint created spaczz in his home at 555 Fake St,\nApt 5 in Nashv1le, TN 55555-1234 in the USA.\nSome of his favorite bands are Converg and Protet the Zero.""" # Spelling errors intentional.\ndoc = nlp(text)\n\npatterns = [\n {\n "label": "NAME",\n "pattern": 
"Grant Andersen",\n "type": "fuzzy",\n "kwargs": {"fuzzy_func": "token_sort"},\n },\n {\n "label": "STREET",\n "pattern": "street_addresses",\n "type": "regex",\n "kwargs": {"predef": True},\n },\n {"label": "GPE", "pattern": "Nashville", "type": "fuzzy"},\n {\n "label": "ZIP",\n "pattern": r"\\b(?:55554){s<=1}(?:(?:[-\\s])?\\d{4}\\b)",\n "type": "regex",\n }, # fuzzy regex\n {"label": "GPE", "pattern": "(?i)[U](nited|\\.?) ?[S](tates|\\.?)", "type": "regex"},\n {\n "label": "BAND",\n "pattern": [{"LOWER": {"FREGEX": "(converge){e<=1}"}}],\n "type": "token",\n },\n {\n "label": "BAND",\n "pattern": [\n {"TEXT": {"FUZZY": "Protest"}},\n {"IS_STOP": True},\n {"TEXT": {"FUZZY": "Hero"}},\n ],\n "type": "token",\n },\n]\n\nruler = SpaczzRuler(nlp)\nruler.add_patterns(patterns)\ndoc = ruler(doc)\n\nprint("Fuzzy Matches:")\nfor ent in doc.ents:\n if ent._.spaczz_type == "fuzzy":\n print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_ratio))\n\nprint("\\n", "Regex Matches:", sep="")\nfor ent in doc.ents:\n if ent._.spaczz_type == "regex":\n print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_counts))\n\nprint("\\n", "Token Matches:", sep="")\nfor ent in doc.ents:\n if ent._.spaczz_type == "token":\n # ._.spaczz_details is currently just placeholder value of 1\n print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_details))\n```\n\n Fuzzy Matches:\n (\'Anderson, Grint\', 0, 3, \'NAME\', 86)\n (\'Nashv1le\', 17, 18, \'GPE\', 82)\n\n Regex Matches:\n (\'555 Fake St,\', 9, 13, \'STREET\', (0, 0, 0))\n (\'55555-1234\', 20, 23, \'ZIP\', (1, 0, 0))\n (\'USA\', 25, 26, \'GPE\', (0, 0, 0))\n\n Token Matches:\n (\'Converg\', 34, 35, \'BAND\', 1)\n (\'Protet the Zero\', 36, 39, \'BAND\', 1)\n\n\nWe see in the example above that we are referencing some custom attributes, which are explained below.\n\nFor more `SpaczzRuler` examples see [here](https://github.com/gandersen101/spaczz/blob/master/examples/fuzzy_matching_tweaks.md). 
In particular, this provides details about the ruler\'s sorting process and fuzzy matching parameters.\n\n### Custom Attributes\n\nSpaczz initializes some custom attributes upon importing. These are under spaCy\'s `._.` attribute and are further prefixed with `spaczz_` so there should be no conflicts with your own custom attributes. If there are, spaczz will forcibly overwrite them.\n\nThese custom attributes are only set via the spaczz ruler at the token level. Span and doc versions of these attributes are getters that reference the token-level attributes.\n\nThe following `Token` attributes are available. All are mutable:\n\n- `spaczz_token`: default = `False`. Boolean that denotes if the token is part of an ent set by the spaczz ruler.\n- `spaczz_type`: default = `None`. String that shows which matcher produced an ent using the token.\n- `spaczz_ratio`: default = `None`. If the token is part of a fuzzy-phrase-matched ent, will return the fuzzy ratio.\n- `spaczz_counts`: default = `None`. If the token is part of a regex-phrase-matched ent, will return the fuzzy counts.\n- `spaczz_details`: default = `None`. Placeholder for token matcher fuzzy ratio/counts. To be developed. Will return 1 if the token is part of a "fuzzy"-token-matched ent.\n\nThe following `Span` attributes reference the token attributes included in the span. All are immutable:\n\n- `spaczz_ent`: default = `False`. Boolean that denotes if all tokens in the span are part of an ent set by the spaczz ruler.\n- `spaczz_type`: default = `None`. String that denotes which matcher produced an ent using the included tokens.\n- `spaczz_types`: default = `set()`. Set that shows which matchers produced ents using the included tokens. An entity span should only have one type, but this allows you to see the types included in any arbitrary span.\n- `spaczz_ratio`: default = `None`. If all the tokens in the span are part of a fuzzy-phrase-matched ent, will return the fuzzy ratio.\n- `spaczz_counts`: default = `None`. 
If all the tokens in the span are part of a regex-phrase-matched ent, will return the fuzzy counts.\n- `spaczz_details`: default = `None`. Placeholder for token matcher fuzzy ratio/counts. To be developed. Will return 1 if all the tokens in the span are part of a "fuzzy"-token-matched ent.\n\nThe following `Doc` attributes reference the token attributes included in the doc. All are immutable:\n\n- `spaczz_doc`: default = `False`. Boolean that denotes if any tokens in the doc are part of an ent set by the spaczz ruler.\n- `spaczz_types`: default = `set()`. Set that shows which matchers produced ents in the doc.\n\n### Saving/Loading\n\nThe `SpaczzRuler` has its own to/from disk/bytes methods and will accept `config` parameters passed to `spacy.load()`. It also has its own spaCy factory entry point so spaCy is aware of the `SpaczzRuler`. Below is an example of saving and loading a spaCy pipeline with the small English model, the `EntityRuler`, and the `SpaczzRuler`.\n\n\n```python\nimport spacy\nfrom spaczz.pipeline import SpaczzRuler\n\nnlp = spacy.load("en_core_web_sm")\ntext = """Anderson, Grint created spaczz in his home at 555 Fake St,\nApt 5 in Nashv1le, TN 55555-1234 in the USA.\nSome of his favorite bands are Converg and Protet the Zero.""" # Spelling errors intentional.\ndoc = nlp(text)\n\nfor ent in doc.ents:\n print((ent.text, ent.start, ent.end, ent.label_))\n```\n\n (\'Anderson\', 0, 1, \'PERSON\')\n (\'Grint\', 2, 3, \'PERSON\')\n (\'555\', 9, 10, \'CARDINAL\')\n (\'5\', 15, 16, \'CARDINAL\')\n (\'TN 55555-1234\', 19, 23, \'DATE\')\n (\'USA\', 25, 26, \'GPE\')\n (\'Converg\', 34, 35, \'PERSON\')\n (\'Protet\', 36, 37, \'PERSON\')\n\n\nWhile spaCy does a decent job of identifying that named entities are present in this example, we can definitely improve the matches - particularly with the types of labels applied.\n\nLet\'s add an entity ruler for some rules-based matches.\n\n\n```python\nfrom spacy.pipeline import EntityRuler\n\nentity_ruler = 
nlp.add_pipe("entity_ruler", before="ner") #spaCy v3 syntax\nentity_ruler.add_patterns(\n [{"label": "GPE", "pattern": "Nashville"}, {"label": "GPE", "pattern": "TN"}]\n)\n\ndoc = nlp(text)\n\nfor ent in doc.ents:\n print((ent.text, ent.start, ent.end, ent.label_))\n```\n\n (\'Anderson\', 0, 1, \'PERSON\')\n (\'Grint\', 2, 3, \'PERSON\')\n (\'555\', 9, 10, \'CARDINAL\')\n (\'5\', 15, 16, \'CARDINAL\')\n (\'TN\', 19, 20, \'GPE\')\n (\'USA\', 25, 26, \'GPE\')\n (\'Converg\', 34, 35, \'PERSON\')\n (\'Protet\', 36, 37, \'PERSON\')\n\n\nWe\'re making progress, but Nashville is spelled wrong in the text so the entity ruler does not find it, and we still have other entities to fix/find.\n\nLet\'s add a spaczz ruler to round this pipeline out. We will also include the `spaczz_span` custom attribute in the results to denote which entities were set via spaczz.\n\n\n```python\nspaczz_ruler = nlp.add_pipe("spaczz_ruler", before="ner") #spaCy v3 syntax\nspaczz_ruler.add_patterns(\n [\n {\n "label": "NAME",\n "pattern": "Grant Andersen",\n "type": "fuzzy",\n "kwargs": {"fuzzy_func": "token_sort"},\n },\n {\n "label": "STREET",\n "pattern": "street_addresses",\n "type": "regex",\n "kwargs": {"predef": True},\n },\n {"label": "GPE", "pattern": "Nashville", "type": "fuzzy"},\n {\n "label": "ZIP",\n "pattern": r"\\b(?:55554){s<=1}(?:[-\\s]\\d{4})?\\b",\n "type": "regex",\n }, # fuzzy regex\n {\n "label": "BAND",\n "pattern": [{"LOWER": {"FREGEX": "(converge){e<=1}"}}],\n "type": "token",\n },\n {\n "label": "BAND",\n "pattern": [\n {"TEXT": {"FUZZY": "Protest"}},\n {"IS_STOP": True},\n {"TEXT": {"FUZZY": "Hero"}},\n ],\n "type": "token",\n },\n ]\n)\n\ndoc = nlp(text)\n\nfor ent in doc.ents:\n print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_ent))\n```\n\n (\'Anderson, Grint\', 0, 3, \'NAME\', True)\n (\'555 Fake St,\', 9, 13, \'STREET\', True)\n (\'5\', 15, 16, \'CARDINAL\', False)\n (\'Nashv1le\', 17, 18, \'GPE\', True)\n (\'TN\', 19, 20, \'GPE\', False)\n 
(\'55555-1234\', 20, 23, \'ZIP\', True)\n (\'USA\', 25, 26, \'GPE\', False)\n (\'Converg\', 34, 35, \'BAND\', True)\n (\'Protet the Zero\', 36, 39, \'BAND\', True)\n\n\nAwesome! The small English model still makes a named entity recognition mistake ("5" in "Apt 5" as `CARDINAL`), but we\'re satisfied overall.\n\nLet\'s save this pipeline to disk and make sure we can load it back correctly.\n\n\n```python\nnlp.to_disk("./example")\nnlp = spacy.load("./example")\nnlp.pipe_names\n```\n\n\n\n\n [\'tok2vec\',\n \'tagger\',\n \'parser\',\n \'entity_ruler\',\n \'spaczz_ruler\',\n \'ner\',\n \'attribute_ruler\',\n \'lemmatizer\']\n\n\n\nWe can even ensure all the spaczz ruler patterns are still present.\n\n\n```python\nspaczz_ruler = nlp.get_pipe("spaczz_ruler")\nspaczz_ruler.patterns\n```\n\n\n\n\n [{\'label\': \'NAME\',\n \'pattern\': \'Grant Andersen\',\n \'type\': \'fuzzy\',\n \'kwargs\': {\'fuzzy_func\': \'token_sort\'}},\n {\'label\': \'GPE\', \'pattern\': \'Nashville\', \'type\': \'fuzzy\'},\n {\'label\': \'STREET\',\n \'pattern\': \'street_addresses\',\n \'type\': \'regex\',\n \'kwargs\': {\'predef\': True}},\n {\'label\': \'ZIP\',\n \'pattern\': \'\\\\b(?:55554){s<=1}(?:[-\\\\s]\\\\d{4})?\\\\b\',\n \'type\': \'regex\'},\n {\'label\': \'BAND\',\n \'pattern\': [{\'LOWER\': {\'FREGEX\': \'(converge){e<=1}\'}}],\n \'type\': \'token\'},\n {\'label\': \'BAND\',\n \'pattern\': [{\'TEXT\': {\'FUZZY\': \'Protest\'}},\n {\'IS_STOP\': True},\n {\'TEXT\': {\'FUZZY\': \'Hero\'}}],\n \'type\': \'token\'}]\n\n\n\n## Known Issues\n\n### Performance\n\nThe main reason for spaczz\'s slower speed is that the *c* in its name is not capitalized like it is in spa*C*y.\nSpaczz is written in pure Python and its matchers do not currently utilize spaCy language vocabularies, which means following its logic should be easy for those familiar with Python. 
However, this means spaczz components will run slower and likely consume more memory than their spaCy counterparts, especially as more patterns are added and documents get longer. It is therefore recommended to use spaCy components like the `EntityRuler` for entities with little uncertainty, like consistent spelling errors. Use spaczz components when there are no viable spaCy alternatives.\n\nI am actively working on performance optimizations for spaczz, but it is a gradual process. Algorithmic and optimization suggestions are welcome. I am working on learning C, but C-based work is currently outside of my skill set.\n\nThe `FuzzyMatcher` and, even more so, the `SimilarityMatcher` are the slowest spaczz components (although allowing for enough "fuzzy" matches in the `RegexMatcher` can get really slow as well). The primary methods for speeding these components up are decreasing the `flex` parameter towards `0`, or, if `flex > 0`, increasing the `min_r1` parameter towards the value of `min_r2` and/or lowering the `thresh` parameter towards `min_r2`. Be aware that all of these "speed-ups" come at the opportunity cost of potentially improved matches.\n\nAs mentioned in the `SimilarityMatcher` description, utilizing a GPU will also help speed up its matching process.\n\nI will likely try to develop some automated and/or heuristic-based API options (while retaining all the current options) in the future to simplify this "tuning" process.\n\n### SpaczzRuler Inconsistencies\n\nThis one is particularly annoying for me because I built myself into this hole trying to support too much too fast. That being said, I have addressed much of this as of spaczz 0.4.2 and will continue to improve these issues.\n\nSpaczz, like spaCy, has undefined behavior for multiple labels (or label/ent_id combos) sharing the same pattern. For example, if you add the pattern `"Ireland"` as both `"GPE"` and `"NAME"`, the resulting label is unpredictable. 
For the most part this isn\'t an issue, but spaczz also has to deal with the additional wrinkle of fuzzy matches.\n\nFor example, suppose we are looking for the string `"Ireland"` and have the patterns `["Ireland", "Iceland"]`. Even with a required match ratio of `85`, these will both match, at `100` and `86` respectively. When dealing only with fuzzy matches this isn\'t an issue, as we can sort by descending match ratio. However, what if the `"Iceland"` pattern was a regex pattern and it returned a tuple of fuzzy regex counts? Or what if the `"Iceland"` pattern was a token pattern, when the `TokenMatcher` does not even currently provide match details?!\n\nThe above problem is twofold. First and foremost, I need to develop a way or ways to compare apples to oranges - fuzzy ratios and fuzzy regex counts. Then I need to figure out how to include match details from the `TokenMatcher`, which supports both fuzzy and "fuzzy" regex matches.\n\nAs a short-term solution, I am having the entity ruler first go through sorted fuzzy matches, then sorted regex matches, and lastly token matches. Token matches will only be sorted by length of match, not quality, so they may provide inconsistent results. Try to be mindful of your token patterns.\n\nThere is additional logic in place to filter overlapping matches, preserving earlier matches over later ones. This order of priority (fuzzy, regex, token) may not be ideal for everyone, but adding a way to change the order (say, regex patterns first) would be a temporary solution to a temporary problem.\n\nPlease bear with me through these growing pains.\n\n## Roadmap\n\nI am always open and receptive to feature requests, but just be aware that, as a solo dev with a lot left to learn, development can move pretty slowly. The following is my roadmap for spaczz so you can see where issues raised might fit into my current priorities.\n\n**High Priority**\n\n1. Bug fixes - both breaking and behavioral. Hopefully these will be minimal.\n1. 
Ease of use and error/warning handling and messaging enhancements.\n1. Building out Read the Docs.\n1. A method for comparing fuzzy ratios and fuzzy regex counts.\n1. A way to return match details from the `TokenMatcher`.\n1. Option to prioritize match quality over length and/or weighting options.\n1. Profiling - hopefully to find "easy" performance optimizations.\n\n**Enhancements**\n\n1. API support for adding user-defined regexes to the predefined regexes.\n 1. Saving these additional predefined regexes as part of the SpaczzRuler will also be supported.\n1. Entity start/end trimming on the token level to prevent fuzzy and regex phrase matches from starting/ending with unwanted tokens, i.e. spaces/punctuation.\n\n**Long-Horizon Performance Enhancements**\n\n1. Having spaczz matchers utilize spaCy vocabularies.\n1. Rewriting the phrase and token searching algorithms in Cython to utilize C speed.\n 1. Trying to integrate closely with spaCy.\n\n## Development\n\nPull requests and contributors are welcome.\n\nspaczz is linted with [Flake8](https://flake8.pycqa.org/en/latest/), formatted with [Black](https://black.readthedocs.io/en/stable/), type-checked with [MyPy](http://mypy-lang.org/) (although this could benefit from improved specificity), tested with [Pytest](https://docs.pytest.org/en/stable/), automated with [Nox](https://nox.thea.codes/en/stable/), and built/packaged with [Poetry](https://python-poetry.org/). There are a few other development tools detailed in the noxfile.py, along with Git pre-commit hooks.\n\nTo contribute to spaczz\'s development, fork the repository, then install spaczz and its dev dependencies with Poetry. If you\'re interested in being a regular contributor, please contact me directly.\n\n\n```shell\npoetry install # Within spaczz\'s root directory.\n```\n\nI keep Nox and pre-commit outside of my Poetry environment as part of my Python toolchain environments. 
With pre-commit installed, you may also need to run the below to commit changes.\n\n\n```shell\npre-commit install\n```\n\nThe only other package that will not be installed via Poetry but is used for testing and in-documentation examples is the spaCy medium English model (`en-core-web-md`). This will need to be installed separately. The command below should do the trick:\n\n\n```shell\npoetry run python -m spacy download "en_core_web_md"\n```\n\n## References\n\n- Spaczz tries to stay as close to [spaCy](https://spacy.io/)\'s API as possible. Whenever it made sense to use existing spaCy code within spaczz, this was done.\n- Fuzzy matching is performed using [RapidFuzz](https://github.com/maxbachmann/rapidfuzz).\n- Regexes are performed using the [regex](https://pypi.org/project/regex/) library.\n- The search algorithm for phrase-based fuzzy and similarity matching was heavily influenced by Stack Overflow user Ulf Aslak\'s answer in this [thread](https://stackoverflow.com/questions/36013295/find-best-substring-match).\n- Spaczz\'s predefined regex patterns were borrowed from the [commonregex](https://github.com/madisonmay/CommonRegex) package.\n- Spaczz\'s development and CI/CD patterns were inspired by Claudio Jolowicz\'s [*Hypermodern Python*](https://cjolowicz.github.io/posts/hypermodern-python-01-setup/) article series.\n',
| 'author': 'Grant Andersen', | ||
@@ -29,0 +29,0 @@ 'author_email': 'gandersen.codes@gmail.com', |
@@ -121,2 +121,3 @@ """Module for TokenMatcher with an API semi-analogous to spaCy's Matcher.""" | ||
| ] | ||
| extended_matches.sort(key=lambda x: (x[1], -x[2] - x[1])) | ||
| for i, (label, _start, _end, _details) in enumerate(extended_matches): | ||
@@ -123,0 +124,0 @@ on_match = self._callbacks.get(label) |
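The sort added in the `TokenMatcher` hunk above enforces the ordering the v0.5.3 release notes describe: ascending match start, then descending match length. A minimal sketch of how that key behaves, assuming `(label, start, end, details)` match tuples (the tuple layout is inferred from the surrounding diff, not confirmed by it):

```python
# Hypothetical (label, start, end, details) tuples standing in for
# spaCy Matcher output extended with details.
extended_matches = [
    ("BAND", 2, 4, None),
    ("NAME", 0, 2, None),
    ("NAME", 0, 5, None),
]

# Same key as the diff: ascending start (x[1]) first; for ties on
# start, -x[2] - x[1] shrinks as the end grows, so longer matches
# sort first.
extended_matches.sort(key=lambda x: (x[1], -x[2] - x[1]))

print(extended_matches)
# [('NAME', 0, 5, None), ('NAME', 0, 2, None), ('BAND', 2, 4, None)]
```

For ties on start, `-x[2] - x[1]` is equivalent to sorting by descending end (and therefore descending length), since the start term is constant within the tie.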
@@ -158,4 +158,4 @@ """Module for _PhraseSearcher: flexible phrase searching in spaCy `Doc` objects.""" | ||
| if matches: | ||
| sorted_matches = sorted(matches, key=lambda x: (-x[2], x[0])) | ||
| filtered_matches = self._filter_overlapping_matches(sorted_matches) | ||
| matches.sort(key=lambda x: (-x[2], x[0])) | ||
| filtered_matches = self._filter_overlapping_matches(matches) | ||
| return filtered_matches | ||
@@ -162,0 +162,0 @@ else: |
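The `_PhraseSearcher` hunk above is behavior-preserving: it swaps `sorted()` for an in-place `matches.sort(...)` with the same key, descending match ratio (`-x[2]`) then ascending start (`x[0]`), before overlapping matches are filtered. A rough sketch with hypothetical `(start, end, ratio)` tuples and a simplified stand-in for `_filter_overlapping_matches` (whose real implementation is not shown in this diff):

```python
# Hypothetical (start, end, ratio) matches; the tuple layout is
# assumed from the sort key in the diff above.
matches = [(0, 2, 86), (3, 5, 100), (1, 4, 90)]

# Descending ratio, then ascending start -- same key as the diff.
matches.sort(key=lambda x: (-x[2], x[0]))

# Simplified stand-in for _filter_overlapping_matches: keep a match
# only if it shares no token positions with an already-kept
# (higher-ratio) match.
def filter_overlapping(sorted_matches):
    kept = []
    seen = set()
    for start, end, ratio in sorted_matches:
        span = set(range(start, end))
        if not span & seen:
            kept.append((start, end, ratio))
            seen |= span
    return kept

print(filter_overlapping(matches))
# [(3, 5, 100), (0, 2, 86)]
```

Because the list is sorted by ratio first, the overlap filter naturally keeps the best-scoring match in any contested region: here the `90`-ratio match at `(1, 4)` is dropped for overlapping the `100`-ratio match at `(3, 5)`.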