# Word embedding: generic iterative stemmer

A generic helper for training gensim and fasttext word embedding models.
Specifically, this repository was created to implement stemming on a Wikipedia-based corpus in Hebrew, but it should work for other corpus sources and languages as well.

Note that while there are sophisticated and efficient approaches to the stemming task, this repository implements a naive approach with no strict time or memory considerations (more about that in the explanation section).

Based on https://github.com/liorshk/wordembedding-hebrew.
## Setup

- Create a `python3` virtual environment.
- Install dependencies using `make install` (this will run tests too).
## Usage

The general flow is as follows:

1. Get a text corpus (for example, from Wikipedia).
2. Create a training program.
3. Run a `StemmingTrainer`.

The output of the training process is a `generic_iterative_stemmer.models.StemmedKeyedVectors` object (in the form of a `.kv` file). It has the same interface as the standard `gensim.models.KeyedVectors`, so the two can be used interchangeably.
### 0. (Optional) Set up a language data cache

`generic_iterative_stemmer` uses a language data cache to store its output and intermediate results. The language data directory is useful if you want to train multiple models on the same corpus, or if you want to retrain on a corpus you've already used in the past, with different parameters.

To set up the language data cache, run `mkdir -p ~/.cache/language_data`.

Tip: soft-link the language data cache to your project's root directory, e.g. `ln -s ~/.cache/language_data language_data`.
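Both steps together (run the soft link from your project's root directory):

```shell
mkdir -p ~/.cache/language_data
ln -s ~/.cache/language_data language_data
```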
### 1. Get a text corpus

If you don't have a specific corpus in mind, you can use Wikipedia. Here's how:

- Under the `~/.cache/language_data` folder, create a directory for your corpus (for example, `wiki-he`).
- Download a Hebrew (or any other language) dataset from Wikipedia:
  - Go to the wikimedia dumps (in the URL, replace `he` with your language code).
  - Download the matching `wiki-latest-pages-articles.xml.bz2` file, and place it in your corpus directory.
- Create the initial text corpus: run the script in `notebooks/create_corpus.py` (change parameters as needed). This will create a `corpus.txt` file in your corpus directory. It takes roughly 15 minutes to run (depending on the corpus size and your computer).
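Conceptually, corpus creation boils down to streaming parsed articles out of the dump and writing one article per line. Here is a minimal sketch of the writing step (the helper name and the token-list format are assumptions, not the repository's actual script; parsing the dump itself can be done with gensim's `WikiCorpus`):

```python
import io
from typing import Iterable, List, TextIO


def write_corpus(articles: Iterable[List[str]], out: TextIO) -> None:
    """Write each article (a list of tokens) as one space-separated line.

    `articles` could come from e.g. gensim's WikiCorpus(...).get_texts(),
    which streams tokenized articles out of a pages-articles dump.
    """
    for tokens in articles:
        out.write(" ".join(tokens) + "\n")


# Toy demonstration with made-up articles:
buf = io.StringIO()
write_corpus([["hello", "world"], ["stemming", "is", "fun"]], buf)
print(buf.getvalue())
```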
### 2. Create a training program

TODO

### 3. Run a StemmingTrainer

TODO

### 4. Play with your trained model

Play with your trained model using `playground.ipynb`.
## Generic iterative stemming

TODO: Explain the algorithm.