esaxx-py

Python wrapper of the Enhanced Suffix Array (ESA).
This is a version adapted for Python, originating from the C++ implementation found here.
Installation
pip install esaxx-py
Usage
from esa import esaxx
def print_snippet(T, beg, length):
for i in range(length):
c = T[beg + i]
print("_" if c.isspace() else c, end="")
T = 'abracadabra'
SA = [0]*len(T)
L = [0]*len(T)
R = [0]*len(T)
D = [0]*len(T)
k = 0x100
node_num = 0
node_num = esaxx(T, SA, L, R, D, len(T), k, node_num)
if node_num == -1:
exit()
print(f"node:{node_num}")
for i in range(node_num):
print(f"{i}\t{R[i] - L[i]}\t{D[i]}\t", end="")
print_snippet(T, SA[L[i]], D[i])
print()
node:5
0 2 4 abra
1 5 1 a
2 2 3 bra
3 2 2 ra
4 11 0
Alternatively, you can use enumSubstring.py:
ehco abracadabra | python enumSubstring.py
For the original implementation:
g++ enumSubstring.cpp
echo abracadabra | ./a.out
Note
In the original implementation, the return value of esaxx was an error code, not node_num.
However, due to the constraints of Python and the difficulty in passing by reference, I've chosen to return node_num.
Maximal Substrings
To obtain Maximal Substrings:
from esa import esaxx
def print_snippet(T, beg, length):
for i in range(length):
c = T[beg + i]
print("_" if c.isspace() else c, end="")
T = 'abracadabra'
SA = [0]*len(T)
L = [0]*len(T)
R = [0]*len(T)
D = [0]*len(T)
k = 0x100
node_num = 0
node_num = esaxx(T, SA, L, R, D, len(T), k, node_num)
if node_num == -1:
exit()
size = len(T)
rank = [0] * size
r = 0
for i in range(size):
if i == 0 or T[(SA[i] + size - 1) % size] != T[(SA[i - 1] + size - 1) % size]:
r += 1
rank[i] = r
print("count\tlength\tstring")
for i in range(node_num):
if D[i] == 0 or (rank[R[i] - 1] - rank[L[i]] == 0):
continue
print(f"{R[i] - L[i]}\t{D[i]}\t", end="")
print_snippet(T, SA[L[i]], D[i])
print()
The first column represents the frequency of occurrence, and the second column represents the length of the string.
count length string
2 4 abra
5 1 a
Here, even strings that appear more than once are listed, even if they are just one character. If you want to skip those, you can use if len < 2: continue
.
enumMaxSubstring.py:
ehco abracadabra | python enumMaxSubstring.py
C++ (enumMaxSubstring.cpp):
g++ enumMaxSubstring.cpp
echo abracadabra | ./a.out
UPDATE in 0.2.0
Introduce a new function: get_maximal_substrings(str)
.
This function allows for easier extraction of maximal substrings from a given string.
Usage Example:
from esa import get_maximal_substrings
T = 'abracadabra'
substrings = get_maximal_substrings(T)
print("count\tlength\tstring")
for substring in substrings:
print(f'{substring.count}\t{substring.length}\t{substring.string})
count length string
2 4 abra
5 1 a
UPDATE in 2.7.0
Available unicode character
Usage Example:
from esa import get_maximal_substrings_unicode
T = '松島やああ松島や松島や'
substrings = get_maximal_substrings_unicode(T)
print(T)
print("count\tlength\tstring")
for substring in substrings:
print(f'{substring.count}\t{substring.length}\t{substring.string})
松島やああ松島や松島や
count length string
3 3 松島や
3 2 島や
3 1 や
2 1 あ
Additional Information
C++ Implementation:
Rust Version:
Software using esaxx:
List of papers using esaxx:
Articles about esaxx
: