Snowball stemming library collection for Python
Python 3 (>= 3.3) is supported. We no longer actively support Python 2 as
the Python developers stopped supporting it at the start of 2020. Snowball
2.1.0 was the last release to officially support Python 2.
What is Stemming?
Stemming maps different forms of the same word to a common "stem" - for
example, the English stemmer maps connection, connections, connective,
connected, and connecting to connect. So a searching for connected
would also find documents which only have the other forms.
This stem form is often a word itself, but this is not always the case as this
is not a requirement for text search systems, which are the intended field of
use. We also aim to conflate words with the same meaning, rather than all
words with a common linguistic root (so awe and awful don't have the same
stem), and over-stemming is more problematic than under-stemming so we tend not
to stem in cases that are hard to resolve. If you want to always reduce words
to a root form and/or get a root form which is itself a word then Snowball's
stemming algorithms likely aren't the right answer.
How to use library
The snowballstemmer
module has two functions.
The snowballstemmer.algorithms
function returns a list of available
algorithm names.
The snowballstemmer.stemmer
function takes an algorithm name and returns a
Stemmer
object.
Stemmer
objects have a Stemmer.stemWord(word)
method and a
Stemmer.stemWords(word[])
method.
.. code-block:: python
import snowballstemmer
stemmer = snowballstemmer.stemmer('english');
print(stemmer.stemWords("We are the world".split()));
Automatic Acceleration
PyStemmer <https://pypi.org/project/PyStemmer/>
_ is a wrapper module for
Snowball's libstemmer_c
and should provide results 100% compatible to
snowballstemmer.
PyStemmer is faster because it wraps generated C versions of the stemmers;
snowballstemmer uses generate Python code and is slower but offers a pure
Python solution.
If PyStemmer is installed, snowballstemmer.stemmer
returns a PyStemmer
Stemmer
object which provides the same Stemmer.stemWord()
and
Stemmer.stemWords()
methods.
Benchmark
This is a crude benchmark which measures the time for running each stemmer on
every word in its sample vocabulary (10,787,583 words over 26 languages). It's
not a realistic test of normal use as a real application would do much more
than just stemming. It's also skewed towards the stemmers which do more work
per word and towards those with larger sample vocabularies.
* Python 2.7 + **snowballstemmer** : 13m00s (15.0 * PyStemmer)
* Python 3.7 + **snowballstemmer** : 12m19s (14.2 * PyStemmer)
* PyPy 7.1.1 (Python 2.7.13) + **snowballstemmer** : 2m14s (2.6 * PyStemmer)
* PyPy 7.1.1 (Python 3.6.1) + **snowballstemmer** : 1m46s (2.0 * PyStemmer)
* Python 2.7 + **PyStemmer** : 52s
For reference the equivalent test for C runs in 9 seconds.
These results are for Snowball 2.0.0. They're likely to evolve over time as
the code Snowball generates for both Python and C continues to improve (for
a much older test over a different set of stemmers using Python 2.7,
**snowballstemmer** was 30 times slower than **PyStemmer**, or 9 times slower
with **PyPy**).
The message to take away is that if you're stemming a lot of words you should
either install **PyStemmer** (which **snowballstemmer** will then automatically
use for you as described above) or use PyPy.
The TestApp example
-------------------
The ``testapp.py`` example program allows you to run any of the stemmers
on a sample vocabulary.
Usage::
testapp.py <algorithm> "sentences ... "
.. code-block:: bash
$ python testapp.py English "sentences... "