This is an updated fork of pyre2. It has built wheels for newer Python versions.
All docs below are taken from the pyre2 package.
pyre2 is a Python extension that wraps
`Google's RE2 regular expression library <https://github.com/google/re2>`_.
The RE2 engine compiles (strictly) regular expressions to
deterministic finite automata, which guarantees linear-time behavior.
Intended as a drop-in replacement for ``re``. Unicode is supported by encoding
to UTF-8, and bytes strings are treated as UTF-8 when the UNICODE flag is given.
For best performance, work with UTF-8 encoded bytes strings.
Normal usage for Linux/Mac/Windows::

    $ pip install pyre2-updated

Requirements for building the C++ extension from the repo source:

* A build environment with ``gcc`` or ``clang`` (e.g. ``sudo apt-get install build-essential``)
* Build tools and libraries: RE2, pybind11, and cmake installed in the build environment::

      $ sudo apt-get install build-essential cmake ninja-build python3-dev cython3 pybind11-dev libre2-dev

On MacOS, use the ``brew`` package manager::

    $ brew install -s re2 pybind11

On Windows, use the ``vcpkg`` package manager::

    $ vcpkg install re2:x64-windows pybind11:x64-windows

You can pass some cmake environment variables to alter the build type, to pass a toolchain file (required on Windows), or to specify the cmake generator. For example::

    $ CMAKE_GENERATOR="Unix Makefiles" CMAKE_TOOLCHAIN_FILE=clang_toolchain.cmake tox -e deploy

For development, get the source::

    $ git clone git://github.com/tyteen4a03/pyre2.git
    $ cd pyre2
    $ make install

An alternative to the above is provided via the `conda`_ recipe (use the
`miniconda installer`_ if you don't have ``conda`` installed already).

.. _conda: https://anaconda.org/conda-forge/pyre2
.. _miniconda installer: https://docs.conda.io/en/latest/miniconda.html

The stated goal of this module is to be a drop-in replacement for ``re``, i.e.::

    try:
        import re2 as re
    except ImportError:
        import re

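As a small illustration of what "drop-in" means in practice, the sketch below uses the aliased module exactly as it would use the standard library. The pattern and the ``tokenize`` helper are hypothetical examples, not part of ``pyre2``; the code runs with either engine.

```python
try:
    import re2 as re  # RE2-backed engine, if the package is installed
except ImportError:
    import re  # standard library fallback

# Hypothetical example pattern: runs of lowercase ASCII letters.
WORD = re.compile(r"[a-z]+")


def tokenize(text):
    """Return the lowercase ASCII words found in text."""
    return WORD.findall(text.lower())


tokens = tokenize("Drop-in Replacement")
print(tokens)
```

The calling code never needs to know which engine was imported, which is the point of the alias.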
That being said, there are features of the ``re`` module that this module may
never have; these will be handled through fallback to the original ``re`` module:

* lookahead assertions ``(?!...)``
* backreferences (``\\n`` in search pattern)

On the other hand, unicode character classes are supported (e.g., ``\p{Greek}``).
Syntax reference: https://github.com/google/re2/wiki/Syntax

However, there are times when you may want to be notified of a failover. The
function ``set_fallback_notification`` determines the behavior in these cases::

    try:
        import re2 as re
    except ImportError:
        import re
    else:
        re.set_fallback_notification(re.FALLBACK_WARNING)

``set_fallback_notification`` takes one of three values:
``re.FALLBACK_QUIETLY`` (default), ``re.FALLBACK_WARNING`` (raise a warning),
and ``re.FALLBACK_EXCEPTION`` (raise an exception).
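To see when a fallback actually happens, one option is to record warnings around a call. The sketch below is a hedged illustration: the helper name ``search_reporting_fallback`` is hypothetical, and it assumes the ``FALLBACK_WARNING`` notification is delivered through Python's ``warnings`` module. Without ``re2`` installed it runs against the standard ``re`` module and records nothing.

```python
import warnings

try:
    import re2 as re
except ImportError:
    import re  # standard library; no fallback notifications exist here
else:
    re.set_fallback_notification(re.FALLBACK_WARNING)


def search_reporting_fallback(pattern, text):
    """Run re.search and return (match, list of warning messages).

    Hypothetical helper: any warning emitted during the search is
    recorded, on the assumption that fallback notices use `warnings`.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        match = re.search(pattern, text)
    return match, [str(w.message) for w in caught]


# A lookahead assertion is one of the features handled through fallback.
match, notes = search_reporting_fallback(r"foo(?!bar)", "foobaz")
print(match.group(0), notes)
```

The match result is the same either way; only the recorded notes differ depending on which engine handled the pattern.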
Consult the docstrings in the source code or interactively
through ipython or ``pydoc re2`` etc.
Python ``bytes`` and ``unicode`` strings are fully supported, but note that
``RE2`` works with UTF-8 encoded strings under the hood, which means that
``unicode`` strings need to be encoded and decoded back and forth.
There are two important factors:

* whether a ``unicode`` pattern and search string is used (will be encoded to UTF-8 internally)
* the ``UNICODE`` flag: whether operators such as ``\w`` recognize Unicode characters.

To avoid the overhead of encoding and decoding to UTF-8, it is possible to pass
UTF-8 encoded bytes strings directly but still treat them as ``unicode``::

    In [18]: re2.findall(u'\w'.encode('utf8'), u'Mötley Crüe'.encode('utf8'), flags=re2.UNICODE)
    Out[18]: ['M', '\xc3\xb6', 't', 'l', 'e', 'y', 'C', 'r', '\xc3\xbc', 'e']
    In [19]: re2.findall(u'\w'.encode('utf8'), u'Mötley Crüe'.encode('utf8'))
    Out[19]: ['M', 't', 'l', 'e', 'y', 'C', 'r', 'e']

However, note that the indices in ``Match`` objects will refer to the bytes string.
The indices of the match in the ``unicode`` string could be computed by
decoding/encoding, but this is done automatically and more efficiently if you
pass the ``unicode`` string::

    >>> re2.search(u'ü'.encode('utf8'), u'Mötley Crüe'.encode('utf8'), flags=re2.UNICODE)
    <re2.Match object; span=(10, 12), match='\xc3\xbc'>
    >>> re2.search(u'ü', u'Mötley Crüe', flags=re2.UNICODE)
    <re2.Match object; span=(9, 10), match=u'\xfc'>

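The manual index conversion mentioned above can be sketched with the standard library alone. This is a minimal illustration, not part of ``pyre2``: the helper name ``byte_span_to_char_span`` is hypothetical, and the standard ``re`` module stands in for the search on the bytes string.

```python
import re  # standard library, used only to obtain a span on a bytes string


def byte_span_to_char_span(data, span):
    """Convert a (start, end) span in UTF-8 bytes to character indices.

    Assumes both offsets fall on character boundaries, which holds for
    spans produced by matching whole UTF-8 encoded characters.
    """
    start, end = span
    char_start = len(data[:start].decode("utf-8"))
    char_end = char_start + len(data[start:end].decode("utf-8"))
    return char_start, char_end


text = "Mötley Crüe"
data = text.encode("utf-8")

byte_span = re.search("ü".encode("utf-8"), data).span()
char_span = byte_span_to_char_span(data, byte_span)
print(byte_span, char_span)
```

Decoding the prefix takes time proportional to its length, which is one reason to pass the ``unicode`` string directly when character indices are needed.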
Finally, if you want to match bytes without regard for Unicode characters,
pass bytes strings and leave out the ``UNICODE`` flag (this will cause Latin 1
encoding to be used with ``RE2`` under the hood)::

    >>> re2.findall(br'.', b'\x80\x81\x82')
    ['\x80', '\x81', '\x82']

Performance is of course the point of this module, so it had better perform well.
Regular expressions vary widely in complexity, and the salient feature of ``RE2`` is
that it behaves well asymptotically. That said, for very simple substitutions,
I've found that occasionally Python's regular ``re`` module is actually slightly faster.
However, when the ``re`` module gets slow, it gets really slow, while this module
buzzes along.
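The "really slow" case can be reproduced with a small, self-contained timing sketch. It uses only the standard ``re`` module, and the pattern ``(a+)+$`` is a textbook backtracking example chosen for illustration, not one taken from this project's benchmarks. Given the linear-time behavior described above, the same pattern would be expected to stay fast under RE2, but that is not measured here.

```python
import re  # standard library backtracking engine
import timeit

# Nested quantifiers force exponential backtracking when the match fails.
PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n, repeats=3):
    """Best-of-`repeats` time in seconds for a failing match on n 'a's plus 'b'."""
    subject = "a" * n + "b"
    return min(timeit.repeat(lambda: PATTERN.match(subject), number=1, repeat=repeats))


timings = {n: time_failed_match(n) for n in (10, 14, 18)}
for n, seconds in timings.items():
    print(f"n={n:2d}: {seconds:.6f}s")
```

Each extra character roughly doubles the running time, so keep ``n`` small when trying this out.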
In the example below, I'm running the tests against 8MB of text from the colossal Wikipedia
XML file. I'm running them multiple times, being careful to use the ``timeit`` module.
To see more details, please see the `performance script <http://github.com/andreasvc/pyre2/tree/master/tests/performance.py>`_.

+-----------------+---------------------------------------------------------------------------+------------+--------------+---------------+-------------+-----------------+----------------+
|Test             |Description                                                                |# total runs|``re`` time(s)|``re2`` time(s)|% ``re`` time|``regex`` time(s)|% ``regex`` time|
+=================+===========================================================================+============+==============+===============+=============+=================+================+
|Findall URI|Email|Find list of '([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?|([^ @]+)@([^ @]+)'|2           |6.262         |0.131          |2.08%        |5.119            |2.55%           |
+-----------------+---------------------------------------------------------------------------+------------+--------------+---------------+-------------+-----------------+----------------+
|Replace WikiLinks|This test replaces links of the form [[Obama|Barack_Obama]] to Obama.      |100         |4.374         |0.815          |18.63%       |1.176            |69.33%          |
+-----------------+---------------------------------------------------------------------------+------------+--------------+---------------+-------------+-----------------+----------------+
|Remove WikiLinks |This test splits the data by the tag.                                      |100         |4.153         |0.225          |5.43%        |0.537            |42.01%          |
+-----------------+---------------------------------------------------------------------------+------------+--------------+---------------+-------------+-----------------+----------------+

Feel free to add more speed tests to the bottom of the script and send a pull request my way!
The tests show the following differences with Python's ``re`` module:

* The ``$`` operator in Python's ``re`` matches twice if the string ends
  with ``\n``. This can be simulated using ``\n?$``, except when doing
  substitutions.
* The ``pyre2`` module and Python's ``re`` may behave differently with nested groups.
  See ``tests/test_emptygroups.txt`` for the examples.

Please report any further issues with ``pyre2``.

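The first difference is easy to observe with the standard library alone. The sketch below shows only the Python ``re`` side of the comparison; it does not import ``re2``, so the RE2 behavior is as stated in the list above rather than demonstrated here.

```python
import re  # standard library module, to show the behavior described above

text = "foo\n"

# `$` matches just before the trailing newline and again at the very end.
dollar_positions = [m.start() for m in re.finditer(r"$", text)]

# `\Z` matches only at the very end of the string.
end_only_positions = [m.start() for m in re.finditer(r"\Z", text)]

# In a substitution, the double match inserts the replacement twice.
replaced = re.sub(r"$", "!", text)

print(dollar_positions, end_only_positions, repr(replaced))
```

The substitution case is why the two engines can produce different output even when a search looks equivalent.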
If you would like to help, one thing that would be very useful is writing comprehensive tests for this. It's actually really easy:

* Write a test session following this `Example <http://github.com/andreasvc/pyre2/blob/master/tests/test_search.txt>`_.
* Replace ``import re`` with ``import re2 as re``.
* Save it as ``test_<name>.txt`` in the tests directory. You can comment on it however you like and indent the code with 4 spaces.

This code builds on the following projects (in chronological order):