Basics of strtree
strtree is a Python package for binary classification of strings, based on regular expressions arranged in a decision tree.
GitHub repo: strtree
With strtree you can:
- Do a binary classification of your strings using automatically extracted regular expressions
- Find the shortest regular expressions that cover positively labeled strings as accurately as possible
Look at a quick example.
Example
First, let's build a tree from strings and their labels.
import strtree
strings = ['Samsung X-500', 'Samsung SM-10', 'Samsung X-1100', 'Samsung F-10', 'Samsung X-2200',
           'AB Nokia 1', 'DG Nokia 2', 'THGF Nokia 3', 'SFSD Nokia 4', 'Nokia XG', 'Nokia YO']
# One label per string: 1 marks a positive example, 0 a negative one
labels = [1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0]
tree = strtree.StringTree()
tree.build(strings, labels, min_precision=0.75, min_token_length=1)
Let's see what regular expressions were extracted.
for leaf in tree.leaves:
    print(leaf)
You may need to check the precision and recall of the whole tree for a given set of strings and true labels.
print('Precision: {}'.format(tree.precision_score(strings, labels)))
print('Recall: {}'.format(tree.recall_score(strings, labels)))
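To make the two scores concrete, here is a minimal sketch that computes them by hand with the standard `re` module. It does not use strtree and does not show how strtree computes its scores. The helper `precision_recall` and the patterns passed to it are hypothetical, chosen only for illustration; they are not the expressions the tree extracts. A string counts as predicted positive if any pattern matches it.

```python
import re

strings = ['Samsung X-500', 'Samsung SM-10', 'Samsung X-1100', 'Samsung F-10', 'Samsung X-2200',
           'AB Nokia 1', 'DG Nokia 2', 'THGF Nokia 3', 'SFSD Nokia 4', 'Nokia XG', 'Nokia YO']
labels = [1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0]


def precision_recall(patterns, strings, labels):
    """Hypothetical helper: score 'matches any pattern' as a positive prediction."""
    compiled = [re.compile(p) for p in patterns]
    predicted = [int(any(c.search(s) for c in compiled)) for s in strings]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    # Guard against division by zero when nothing matches or nothing is positive
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# A broad, illustrative pattern: it also matches the negative 'Nokia XG' and 'Nokia YO'
p, r = precision_recall([r'Nokia'], strings, labels)
print(round(p, 3), round(r, 3))  # 0.667 0.571

# Two narrower, illustrative patterns that together cover every positive string
p, r = precision_recall([r'^Samsung X-\d+$', r'^[A-Z]+ Nokia \d$'], strings, labels)
print(p, r)  # 1.0 1.0
```

The broad pattern shows why a precision threshold such as `min_precision` matters: a short expression can match many strings and still mislabel a large share of them.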
Finally, you can pass any strings you want and see whether they match the extracted regular expressions.
other_strings = ['Samsung X-800', 'Nokia AB']  # any list of new strings
matches = tree.match(other_strings)
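As a rough picture of what matching new strings against a set of regular expressions involves, the sketch below checks each string against a list of patterns using only the standard `re` module. The `PATTERNS` list and the `first_match` helper are assumptions made for illustration; they are not part of strtree, and the return value of `tree.match` may look different.

```python
import re

# Hypothetical patterns standing in for the expressions a tree might extract
PATTERNS = [r'Samsung X-\d+', r'[A-Z]+ Nokia \d']


def first_match(string, patterns=PATTERNS):
    """Return the first pattern that matches the string, or None if none does."""
    for pattern in patterns:
        if re.search(pattern, string):
            return pattern
    return None


other_strings = ['Samsung X-800', 'Samsung F-20', 'QW Nokia 7', 'Nokia ZZ']
for s in other_strings:
    # Strings with no matching pattern are reported as None
    print(s, '->', first_match(s))
```

Returning the matching pattern rather than a plain boolean makes it easy to see which expression is responsible for each positive prediction.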
Installing
- Use PyPI:
pip install strtree
- Use a distribution file located in the dist folder:
pip install strtree-0.1.0-py3-none-any.whl
Contribution
You are very welcome to participate in the project. You may solve the current issues or add new functionality; it is up to you.