# Word embedding: generic iterative stemmer

A generic helper for training gensim and fasttext word embedding models.
Specifically, this repository was created to implement stemming on a Wikipedia-based corpus in Hebrew, but it should work for other corpus sources and languages as well.

Note that while there are sophisticated and efficient approaches to the stemming task, this repository implements a naive approach with no strict time or memory considerations (more about that in the explanation section).

Based on https://github.com/liorshk/wordembedding-hebrew.
## Setup

- Create a `python3` virtual environment.
- Install dependencies using `make install` (this will run tests too).
## Usage

The general flow is as follows:

1. Get a text corpus (for example, from Wikipedia).
2. Create a training program.
3. Run a `StemmingTrainer`.

The output of the training process is a `generic_iterative_stemmer.models.StemmedKeyedVectors` object (in the form of a `.kv` file). It has the same interface as the standard `gensim.models.KeyedVectors`, so the two can be used interchangeably.
### 0. (Optional) Set up a language data cache

`generic_iterative_stemmer` uses a language data cache to store its output and intermediate results. The language data directory is useful if you want to train multiple models on the same corpus, or if you want to retrain on a corpus you've already used in the past, with different parameters.

To set up the language data cache, run `mkdir -p ~/.cache/language_data`.

Tip: soft-link the language data cache to your project's root directory, e.g. `ln -s ~/.cache/language_data language_data`.
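Both steps together (run the soft link from your project's root directory):

```shell
mkdir -p ~/.cache/language_data
ln -s ~/.cache/language_data language_data
```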
### 1. Get a text corpus

If you don't have a specific corpus in mind, you can use Wikipedia. Here's how:

- Under the `~/.cache/language_data` folder, create a directory for your corpus (for example, `wiki-he`).
- Download a Hebrew (or any other language) dataset from Wikipedia:
  - Go to the wikimedia dumps (in the URL, replace `he` with your language code).
  - Download the matching `wiki-latest-pages-articles.xml.bz2` file, and place it in your corpus directory.
- Create the initial text corpus: run the script in `notebooks/create_corpus.py` (change parameters as needed). This will create a `corpus.txt` file in your corpus directory. It takes roughly 15 minutes to run (depending on the corpus size and your computer).
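Conceptually, corpus creation boils down to streaming parsed articles out of the dump and writing one article per line. Here is a minimal sketch of the writing step (the helper name and the token-list format are assumptions, not the repository's actual script; parsing the dump itself can be done with gensim's `WikiCorpus`):

```python
import io
from typing import Iterable, List, TextIO


def write_corpus(articles: Iterable[List[str]], out: TextIO) -> None:
    """Write each article (a list of tokens) as one space-separated line.

    `articles` could come from e.g. gensim's WikiCorpus(...).get_texts(),
    which streams tokenized articles out of a pages-articles dump.
    """
    for tokens in articles:
        out.write(" ".join(tokens) + "\n")


# Toy demonstration with made-up articles:
buf = io.StringIO()
write_corpus([["hello", "world"], ["stemming", "is", "fun"]], buf)
print(buf.getvalue())
```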
### 2. Create a training program

TODO

### 3. Run a StemmingTrainer

TODO

### 4. Play with your trained model

Play with your trained model using `playground.ipynb`.
## Generic iterative stemming

TODO: Explain the algorithm.