
mordl
MorDL is a tool to organize the pipeline for complete morphological sentence parsing (POS-tagging, lemmatization, morphological feature tagging) and Named-entity recognition.
Scores (accuracy) on the SynTagRus test dataset: UPOS: 99.35%; FEATS: 98.87%
(tokens), 99.31% (tags); LEMMA: 99.50%. In all experiments, we used
seed=42. Other seed values may help to achieve better results. The models'
hyperparameters may also be tuned.
Validation with the official evaluation script of the CoNLL 2018 Shared Task gives: UPOS:
99.35%; UFeats: 98.36%; AllTags: 98.21%; Lemmas: 98.88%. For completeness, we included that script in our distribution, so you can use
it for your own model evaluation, too. To simplify its use, we also made a wrapper
for it, mordl.conll18_ud_eval.
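Assuming the wrapper keeps the command-line interface of the official script (gold file first, then the system output; this is an assumption, not verified here), it could be invoked like:
$ python -m mordl.conll18_ud_eval gold.conllu predicted.conllu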
MorDL supports Python 3.6 and Transformers 4.3.3 or later. To install via pip, run:
$ pip install mordl
If you currently have a previous version of MorDL installed, run:
$ pip install mordl -U
Alternatively, you can install MorDL from the source of this git repository:
$ git clone https://github.com/fostroll/mordl.git
$ cd mordl
$ pip install -e .
This gives you access to examples that are not included in the PyPI package.
Our taggers use separate models, so they can be used independently. However, to achieve the best results, the FEATS tagger uses UPOS tags during training, and the LEMMA and NER taggers use both UPOS and FEATS tags. Thus, for a fully untagged corpus, the tagging pipeline applies the taggers serially, as shown below (assuming that the goal is NER and we already have trained taggers of all types):
from mordl import UposTagger, FeatsTagger, NeTagger

# Load the pre-trained models for each tagger.
tagger_u, tagger_f, tagger_n = UposTagger(), FeatsTagger(), NeTagger()
tagger_u.load('upos_model')
tagger_f.load('feats_model')
tagger_n.load('misc-ne_model')

# Tag UPOS first, then FEATS, then NER, saving the final result.
tagger_n.predict(
    tagger_f.predict(
        tagger_u.predict('untagged.conllu')
    ),
    save_to='result.conllu'
)
Any tagger in the pipeline may be replaced with a better one if you have it. The drawback of separate taggers is that they take more space. If all models were created with BERT embeddings and you load them into memory simultaneously, they may consume up to 9 GB of GPU memory. If that does not fit on your GPU, you can use the device and dataset_device parameters during loading to distribute the models across several GPUs. Alternatively, if you only need to tag a corpus once, you can load the models one at a time:
tagger = UposTagger()
tagger.load('upos_model')
tagger.predict('untagged.conllu', save_to='result_upos.conllu')
del tagger  # free the memory before loading the next model

tagger = FeatsTagger()
tagger.load('feats_model')
tagger.predict('result_upos.conllu', save_to='result_feats.conllu')
del tagger

tagger = NeTagger()
tagger.load('misc-ne_model')
tagger.predict('result_feats.conllu', save_to='result.conllu')
del tagger
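If you do need all three models in memory at once, the device and dataset_device parameters mentioned above can spread them across GPUs. A minimal sketch, assuming both parameters accept standard PyTorch device strings and reusing the illustrative model names from above:

from mordl import UposTagger, FeatsTagger, NeTagger

# Put the UPOS model on the first GPU, the other two on the second.
tagger_u = UposTagger()
tagger_u.load('upos_model', device='cuda:0', dataset_device='cuda:0')
tagger_f = FeatsTagger()
tagger_f.load('feats_model', device='cuda:1', dataset_device='cuda:1')
tagger_n = NeTagger()
tagger_n.load('misc-ne_model', device='cuda:1', dataset_device='cuda:1')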
Don't use identical names for the input and output files when you call the
.predict() methods. Normally this is not a problem, because by default the
methods load the whole input file into memory before tagging. But if the input
file is large, you may want to use the split parameter to make the methods
process the file in parts. In that case, the first part of the tagged data is
saved before the next part is loaded, so identical names will cause data loss.
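For example, a sketch of chunked tagging, assuming split takes the number of sentences to process per part (the exact semantics of the parameter are an assumption here; see the docs):

tagger = UposTagger()
tagger.load('upos_model')
# Process the corpus in parts; the input and output names must differ.
tagger.predict('huge_untagged.conllu', split=1000,
               save_to='huge_tagged.conllu')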
The training process is also simple. If you have training corpora and don't need any experiments, just run:
from mordl import UposTagger

tagger = UposTagger()
# train_corpus and dev_corpus may be CoNLL-U files or parsed corpus objects
tagger.load_train_corpus(train_corpus)
tagger.load_test_corpus(dev_corpus)
stat = tagger.train('upos_model', device='cuda:0',
                    stage3_params={'save_as': 'upos_bert_model'})
This is the training pipeline for the UPOS tagger; the pipelines for the other taggers are identical.
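For example, a FEATS tagger can be trained the same way (the model and checkpoint names here are illustrative; per the note above, its training corpus must already contain UPOS tags):

from mordl import FeatsTagger

tagger = FeatsTagger()
tagger.load_train_corpus(train_corpus)  # must already contain UPOS tags
tagger.load_test_corpus(dev_corpus)
stat = tagger.train('feats_model', device='cuda:0',
                    stage3_params={'save_as': 'feats_bert_model'})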
For a more complete understanding of the MorDL toolkit, refer to the
Python notebook with the pipeline example in the examples directory of the
MorDL GitHub repository; it also contains training pipelines for the
different taggers. Detailed descriptions are available in the docs.
This project was developed with a focus on the Russian language, but the few Russian-specific nuances we use are unlikely to worsen the quality of processing for other languages.
MorDL supports CoNLL-U (if the input/output is a file) or Parsed CoNLL-U (if the input/output is an object). MorDL also accepts Corpuscula's corpora wrappers as input.
MorDL is released under the BSD License. See the LICENSE file for more details.