Library designed as a Python wrapper to unleash Rust text processing power combined with Python.
The library aims to make high performance on regex and text processing easy to achieve inside Python. It is built as a direct wrapper of Rust regex and text crates.
On short texts, sparsity of found elements is the common denominator. This library focuses on algorithms that acknowledge this sparsity and achieve good performance efficiently, from simple Python API calls to Rust-optimized logic.
This library has special treatment for texts that only contain [a-zA-Z0-9ñç ] plus accented vowels, which allows non-Unicode matching over those texts. This is particularly convenient for some Automatic Speech Recognition outputs.
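A minimal sketch of how a caller might check whether a text falls inside this restricted character set before opting into non-Unicode matching. The helper name is_restricted_text and the exact list of accented vowels are assumptions made for illustration; they are not taken from the library.

```python
import re

# Assumption: "accented vowels" is taken here to mean the Spanish acute
# vowels plus ü. The character set the library really uses may differ.
ACCENTED_VOWELS = "áéíóúÁÉÍÓÚüÜ"

# Base class [a-zA-Z0-9ñç ] comes from the text above.
RESTRICTED_TEXT = re.compile(rf"[a-zA-Z0-9ñç {ACCENTED_VOWELS}]*")


def is_restricted_text(text: str) -> bool:
    """Hypothetical helper: True when every char is in the restricted set."""
    return RESTRICTED_TEXT.fullmatch(text) is not None
```

Texts that pass such a check are the ones where disabling Unicode matching should not change what can be matched.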
Wherever it is possible to provide them, the following options are available:
- unicode: False -> removes Unicode chars from matching, making matching much more efficient (x6 to x12 is easily achieved).
- substitute_bound: True -> substitutes r"\b" in patterns with r"(?-u:\b)", as recommended.
- substitute_latin_char: True -> substitutes pkg::constants::LATIN_CHARS_TO_REPLACE in patterns with pkg::constants::LATIN_CHARS_REPLACEMENT, to allow the use of the non-Unicode variant without losing the ability to match texts and patterns that contain those Latin chars (take care: the projection is applied both in patterns and in texts).

Find patterns in texts, possibly parallelizing by chunks of either patterns or texts.
It uses the efficient regex::RegexSet, which reduces the cardinality of the patterns in the matching phase.
The structure of the finding function is:
1. Try to compile a regex::Regex for each pattern in the list. Get the list of valid ones and invalid ones.
2. Build a regex::RegexSet with the valid patterns and apply it over the list of texts. This gives which patterns have a match in which texts.
3. Use regex::Regex to find the matches over the texts, only for the subset of pattern-text pairs that matched in the regex::RegexSet.
4. Try to compile the invalid patterns with fancy_regex::Regex and find matches over the texts. This reduces the final list of invalid patterns that is given back to Python.
5. The remaining invalid patterns can be handled in Python with the regex package, which has expanded pattern support over the built-in re package.

This is a very concrete function to perform high performance literal replacement using the Rust aho_corasick implementation. It accepts parallelization by chunks of text.
It uses Rust aho_corasick to perform replacements, adding a layer of bounding around the literals to replace through the is_bounded parameter.
- If is_bounded is True, then before replacing a found literal it is checked that none of [A-Za-z0-9_] (expanded with accents and special word chars, which can be checked in pkg::unicode::check_if_word_bytes) is adjacent to the literal.
- The match kind is set through aho_corasick::MatchKind, the default being aho_corasick::MatchKind::LeftmostLongest.

More at doc/notebook/doc/literal_replacer.ipynb in the repository.
from pytextrust.replacer import replace_literal_patterns, MatchKind

replace_literal_patterns(
    literal_patterns=["uno", "dos"],
    replacements=["1", "2"],
    text_to_replace=["es el numero uno o el Dos yo soy el veintiuno"],
    is_bounded=True,
    case_insensitive=True,
    match_kind=MatchKind.LeftmostLongest,
)

returns the replaced text and the number of replacements:

(['es el numero 1 o el 2 yo soy el veintiuno'], 2)
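To illustrate what is_bounded=True means in the example above, here is a pure-Python sketch that reproduces the same result with the standard re module. It is not how the library works internally (that is Rust aho_corasick). The function name replace_bounded_literals is hypothetical, it handles a single text, and its boundary check only covers ASCII word characters.

```python
import re


def replace_bounded_literals(literals, replacements, text, case_insensitive=True):
    """Hypothetical stand-in: replace literals only when no word char touches them."""
    mapping = {
        (lit.lower() if case_insensitive else lit): rep
        for lit, rep in zip(literals, replacements)
    }
    # Longest literals first, so the alternation prefers the longest candidate.
    alternation = "|".join(
        re.escape(lit) for lit in sorted(literals, key=len, reverse=True)
    )
    # Assumption: only ASCII word chars count as "word" here; the library
    # also treats accented and special word chars as such.
    pattern = re.compile(
        rf"(?<![A-Za-z0-9_])(?:{alternation})(?![A-Za-z0-9_])",
        re.IGNORECASE if case_insensitive else 0,
    )

    def substitute(match):
        found = match.group(0)
        key = found.lower() if case_insensitive else found
        # Fall back to the original text if case folding produced an unknown key.
        return mapping.get(key, found)

    # subn returns (new_text, number_of_replacements).
    return pattern.subn(substitute, text)
```

With the inputs from the example, the "uno" inside "veintiuno" is left alone because the preceding "i" is a word character.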
Entities are found by overlapping and have a hierarchical folder structure.
- Literal entities are matched bounded by \b, just as if the literal matching were \b(lit_1|...|lit_N)\b. Tech note: positions reported by Aho-Corasick should be mapped from byte to char positions.
- Regex entities can be built on top of literal entities. If month is a literal entity, then \d+ of \d+ of {{month}} is a possible entity. Regex entities that depend positively on a literal entity (no negative lookbehind or lookahead) are searched only on the texts where the literal entity has been found, minimizing computational weight.

Feeding of entity matches:
- Each match carries a kind with one of two values: re or lit.

Steps of entity recognition:
1. Build the LiteralEntityPool. There are public and private literal entities.
2. Build the RegexEntityPool using literals from the LiteralEntityPool. There are then two kinds of regex entities: those that require literal entities, and those searched directly through regex::RegexSet.

A pattern in a regex entity falls into one of these categories:
- Patterns valid for the regex crate:
  - Patterns that depend on literal entities: entities::extract_required_template_structure returns a non-empty vector.
  - Patterns searched through RegexSet: entities::extract_required_template_structure returns an empty vector.
- Patterns not valid for the regex crate receive a direct find from the fancy_regex crate. Such a pattern receives an Error from entities::extract_required_template_structure.

Entity files follow a naming convention.
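The template idea described in this section ({{month}} inside a regex entity, searched only where the literal entity appears) can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: the month literals, the helper names (required_entities, expand_template, find_regex_entity) and the use of the standard re module are all hypothetical stand-ins.

```python
import re

# Hypothetical data: the entity name comes from the text, the literals do not.
LITERAL_ENTITIES = {"month": ["january", "february", "march"]}

# Matches a {{name}} slot inside a regex entity template.
TEMPLATE_SLOT = re.compile(r"\{\{(\w+)\}\}")


def bounded_alternation(literals):
    """Build \\b(lit_1|...|lit_N)\\b for a list of literals."""
    return r"\b(?:" + "|".join(re.escape(lit) for lit in literals) + r")\b"


def required_entities(template):
    """Names of the literal entities a template depends on (empty if none)."""
    return TEMPLATE_SLOT.findall(template)


def expand_template(template, literal_entities):
    """Replace every {{name}} slot with the bounded alternation of its literals."""
    return TEMPLATE_SLOT.sub(
        lambda slot: bounded_alternation(literal_entities[slot.group(1)]),
        template,
    )


def find_regex_entity(template, texts, literal_entities):
    """Search the expanded template only in texts that contain its literals."""
    pattern = re.compile(expand_template(template, literal_entities))
    gates = [
        re.compile(bounded_alternation(literal_entities[name]))
        for name in required_entities(template)
    ]
    found = {}
    for index, text in enumerate(texts):
        # Skip texts where a required literal entity is absent.
        if not all(gate.search(text) for gate in gates):
            continue
        spans = [match.span() for match in pattern.finditer(text)]
        if spans:
            found[index] = spans
    return found
```

A template without slots yields an empty list from required_entities, so it is searched on every text, which loosely mirrors the split between patterns that depend on literals and patterns that do not.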
This repository aims to be a perfect CI/CD example for a Python+Rust lib based on pyo3. Any suggestions (caching, badges, anything...) are welcome; just let me know by opening an issue :)