lemmagen3


A Python2/3 wrapper for Lemmagen lemmatizer supporting 19 languages.

  • 3.5.1
  • PyPI

About

lemmagen3 is a Python 2/3 wrapper for the Lemmagen lemmatizer (version 2.2).

It differs from other Lemmagen wrappers on PyPI in that it offers a clean, fast object-oriented interface built with the excellent pybind11 library and supports an additional language (Croatian).

The models for Slovene, Croatian and Serbian have been significantly updated and use frequency data to prefer the most frequent lemma, e.g., for Slovene: je->biti instead of je->jesti, mene->jaz instead of mene->mena, od->od instead of od->oda, etc.

In total, 19 languages are supported:

  1. Bulgarian: bg
  2. Croatian: hr
  3. Czech: cs
  4. English: en
  5. Estonian: et
  6. Farsi/Persian: fa
  7. French: fr
  8. German: de
  9. Hungarian: hu
  10. Italian: it
  11. Macedonian: mk
  12. Polish: pl
  13. Romanian: ro
  14. Russian: ru
  15. Serbian: sr
  16. Slovak: sk
  17. Slovene: sl
  18. Spanish: es
  19. Ukrainian: uk
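
A small lookup built from the list above can catch an unsupported language before a lemmatizer is created. This is only a minimal sketch: `SUPPORTED_LANGUAGES` and `resolve_language` are hypothetical names that are not part of lemmagen3, and the table is simply copied from the list.

```python
# Hypothetical helper, not part of lemmagen3: codes copied from the list above.
SUPPORTED_LANGUAGES = {
    "bulgarian": "bg", "croatian": "hr", "czech": "cs", "english": "en",
    "estonian": "et", "farsi": "fa", "persian": "fa", "french": "fr",
    "german": "de", "hungarian": "hu", "italian": "it", "macedonian": "mk",
    "polish": "pl", "romanian": "ro", "russian": "ru", "serbian": "sr",
    "slovak": "sk", "slovene": "sl", "spanish": "es", "ukrainian": "uk",
}


def resolve_language(name_or_code):
    """Return the two-letter code for a language name or code, or raise ValueError."""
    key = name_or_code.strip().lower()
    if key in SUPPORTED_LANGUAGES.values():
        return key  # already a supported code
    if key in SUPPORTED_LANGUAGES:
        return SUPPORTED_LANGUAGES[key]  # language name -> code
    raise ValueError("unsupported language: %r" % name_or_code)
```

The returned code could then be passed to the `Lemmatizer` constructor, for example `Lemmatizer(resolve_language('Slovene'))`.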

Installation and requirements

pip install lemmagen3

will install the module and the language model files. Please note that on Python 2.7 and Python <= 3.5 the package is built from source, so you will need a C++ compiler.

Note: If you use Python 3.5.0 or 3.5.1, you will likely get the error shown below. This is a known bug in these two versions, so please consider upgrading your Python.

ImportError: ..._lemmagen.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _PyThreadState_UncheckedGet

How to use

The following snippet illustrates how to use lemmagen3.

from lemmagen3 import Lemmatizer

# first, we can list all supported languages
print(Lemmatizer.list_supported_languages())

# then, create a few lemmatizer objects using ISO 639-1 language codes
# (English, Slovene and Russian)

lem_en = Lemmatizer('en')
lem_sl = Lemmatizer('sl')
lem_ru = Lemmatizer('ru')

# now lemmatize the word "cats" in all three languages
print(lem_en.lemmatize('cats'))
print(lem_sl.lemmatize('mačke'))
print(lem_ru.lemmatize('коты'))

# you can also change the language for an existing Lemmatizer object
# lem_en will now become a French lemmatizer:
lem_en.load_language('fr')

# finally, you can also load your own Lemmagen model
my_lem = Lemmatizer()
my_lem.load_model('/path/to/my/model')

Note that the function lemmatize accepts single string tokens and does not split the input string! If you want to lemmatize a chunk of text you will have to tokenize it first, e.g.:

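# lem_en was switched to French above, so create a fresh English lemmatizer
lem_en = Lemmatizer('en')

sentence = 'cats hate dogs'
tokens = sentence.split()
sentence_lemmatized = ' '.join(lem_en.lemmatize(token) for token in tokens)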
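
Splitting on whitespace leaves punctuation attached to words (e.g. 'dogs.'), so real text usually needs a slightly more careful tokenizer. The sketch below uses a simple regular expression and takes the lemmatizing function as a parameter; `TOKEN_RE` and `lemmatize_text` are hypothetical names, not part of lemmagen3, and the regex is a rough approximation (it splits contractions such as "don't" into three tokens).

```python
import re

# A word (Unicode letters, digits, underscore) or a single punctuation mark.
TOKEN_RE = re.compile(r"(\w+)|([^\w\s])", re.UNICODE)


def lemmatize_text(text, lemmatize):
    """Lemmatize every word token with `lemmatize`; keep punctuation unchanged."""
    out = []
    for match in TOKEN_RE.finditer(text):
        word, punct = match.groups()
        out.append(lemmatize(word) if word is not None else punct)
    return " ".join(out)
```

With a lemmatizer object created as in the snippet above, this would be called as `lemmatize_text('cats hate dogs.', lem_en.lemmatize)`.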

Note also that lemmagen3 operates on Unicode strings, so if you use Python 2, make sure that your input is a unicode string rather than a byte string.
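
One way to normalize input before lemmatizing is to decode byte strings up front. This is a minimal sketch that assumes the bytes are UTF-8 encoded; `to_text` is a hypothetical helper, not part of lemmagen3. It behaves the same on Python 2 (where `bytes` is an alias of `str`) and Python 3.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a text (unicode) string, decoding byte strings.

    Assumption: byte input is encoded as `encoding` (UTF-8 by default).
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

A token read from a file or socket could then be passed as `lem_sl.lemmatize(to_text(raw_token))`.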

License

Please note that this repository contains code and binary models compiled and built from different sources which are under different licenses:

  1. C++ files and headers come from Lemmagen and are modified and adapted to work as a Python module (LGPL)
  2. Binary models are built from Multext and Multext-east sources:
    • Language resources used to build Farsi/Persian, Macedonian, Polish, and Russian models are for non-commercial use only.
    • Language resources for the other supported languages are released under CC BY-SA 4.0.

The rest of the code in this repository was created by the author and is licensed under the MIT license.
