natasha
Metadata-Version: 2.1
Name: natasha
Version: 1.1.0
Summary: Named-entity recognition for the Russian language
Home-page: https://github.com/natasha/natasha
Author: Natasha contributors
Author-email: d.a.veselov@yandex.ru, alex@alexkuk.ru
License: MIT
Description:
<img src="https://github.com/natasha/natasha-logos/blob/master/natasha.svg">

[![CI](https://github.com/natasha/natasha/actions/workflows/test.yml/badge.svg)](https://github.com/natasha/natasha/actions/workflows/test.yml) [codecov](https://codecov.io/gh/natasha/natasha)
Natasha solves basic NLP tasks for the Russian language: tokenization, sentence segmentation, word embedding, morphology tagging, lemmatization, phrase normalization, syntax parsing, NER tagging and fact extraction. Quality on every task is similar to or better than the current SOTA for Russian-language news articles; see the <a href="https://github.com/natasha/natasha#evaluation">evaluation section</a>. Natasha is not a research project: the underlying technologies are built for production. We pay attention to model size, RAM usage and performance. Models run on CPU and use NumPy for inference.

Natasha integrates libraries from the <a href="https://github.com/natasha">Natasha project</a> under one convenient API:

* <a href="https://github.com/natasha/razdel">Razdel</a> — token and sentence segmentation for Russian
* <a href="https://github.com/natasha/navec">Navec</a> — compact Russian embeddings
* <a href="https://github.com/natasha/slovnet">Slovnet</a> — modern deep-learning techniques for Russian NLP; compact models for Russian morphology, syntax and NER
* <a href="https://github.com/natasha/yargy">Yargy</a> — rule-based fact extraction similar to the Tomita parser
* <a href="https://github.com/natasha/ipymarkup">Ipymarkup</a> — NLP visualizations for NER and syntax markups

> ⚠ The API may change; for real-world tasks consider using the lower-level libraries from the Natasha project directly.
>
> Models are optimized for news articles; quality on other domains may be lower.
>
> To use the old `NamesExtractor` and `AddressExtractor`, downgrade: `pip install "natasha<1" "yargy<0.13"`
## Install

Natasha supports Python 3.5+ and PyPy3:

```bash
$ pip install natasha
```

## Usage

For more examples and explanation see the [Natasha documentation](http://nbviewer.jupyter.org/github/natasha/natasha/blob/master/docs.ipynb).
```python
>>> from natasha import (
        Segmenter,
        MorphVocab,
        NewsEmbedding,
        NewsMorphTagger,
        NewsSyntaxParser,
        NewsNERTagger,
        PER,
        NamesExtractor,
        Doc
    )


#######
#
# INIT
#
#######

>>> segmenter = Segmenter()
>>> morph_vocab = MorphVocab()

>>> emb = NewsEmbedding()
>>> morph_tagger = NewsMorphTagger(emb)
>>> syntax_parser = NewsSyntaxParser(emb)
>>> ner_tagger = NewsNERTagger(emb)

>>> names_extractor = NamesExtractor(morph_vocab)

>>> text = 'Посол Израиля на Украине Йоэль Лион признался, что пришел в шок, узнав о решении властей Львовской области объявить 2019 год годом лидера запрещенной в России Организации украинских националистов (ОУН) Степана Бандеры. Свое заявление он разместил в Twitter. «Я не могу понять, как прославление тех, кто непосредственно принимал участие в ужасных антисемитских преступлениях, помогает бороться с антисемитизмом и ксенофобией. Украина не должна забывать о преступлениях, совершенных против украинских евреев, и никоим образом не отмечать их через почитание их исполнителей», — написал дипломат. 11 декабря Львовский областной совет принял решение провозгласить 2019 год в регионе годом Степана Бандеры в связи с празднованием 110-летия со дня рождения лидера ОУН (Бандера родился 1 января 1909 года). В июле аналогичное решение принял Житомирский областной совет. В начале месяца с предложением к президенту страны Петру Порошенко вернуть Бандере звание Героя Украины обратились депутаты Верховной Рады. Парламентарии уверены, что признание Бандеры национальным героем поможет в борьбе с подрывной деятельностью против Украины в информационном поле, а также остановит «распространение мифов, созданных российской пропагандой». Степан Бандера (1909-1959) был одним из лидеров Организации украинских националистов, выступающей за создание независимого государства на территориях с украиноязычным населением. В 2010 году в период президентства Виктора Ющенко Бандера был посмертно признан Героем Украины, однако впоследствии это решение было отменено судом. '
>>> doc = Doc(text)
#######
#
# SEGMENT
#
#######

>>> doc.segment(segmenter)
>>> display(doc.tokens[:5])
>>> display(doc.sents[:5])
[DocToken(stop=5, text='Посол'),
 DocToken(start=6, stop=13, text='Израиля'),
 DocToken(start=14, stop=16, text='на'),
 DocToken(start=17, stop=24, text='Украине'),
 DocToken(start=25, stop=30, text='Йоэль')]
[DocSent(stop=218, text='Посол Израиля на Украине Йоэль Лион признался, чт..., tokens=[...]),
 DocSent(start=219, stop=257, text='Свое заявление он разместил в Twitter.', tokens=[...]),
 DocSent(start=258, stop=424, text='«Я не могу понять, как прославление тех, кто непо..., tokens=[...]),
 DocSent(start=425, stop=592, text='Украина не должна забывать о преступлениях, совер..., tokens=[...]),
 DocSent(start=593, stop=798, text='11 декабря Львовский областной совет принял решен..., tokens=[...])]


#######
#
# MORPH
#
#######

>>> doc.tag_morph(morph_tagger)
>>> display(doc.tokens[:5])
>>> doc.sents[0].morph.print()
[DocToken(stop=5, text='Посол', pos='NOUN', feats=<Anim,Nom,Masc,Sing>),
 DocToken(start=6, stop=13, text='Израиля', pos='PROPN', feats=<Inan,Gen,Masc,Sing>),
 DocToken(start=14, stop=16, text='на', pos='ADP'),
 DocToken(start=17, stop=24, text='Украине', pos='PROPN', feats=<Inan,Loc,Fem,Sing>),
 DocToken(start=25, stop=30, text='Йоэль', pos='PROPN', feats=<Anim,Nom,Masc,Sing>)]
Посол      NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing
Израиля    PROPN|Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing
на         ADP
Украине    PROPN|Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing
Йоэль      PROPN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing
Лион       PROPN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing
признался  VERB|Aspect=Perf|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Mid
,          PUNCT
что        SCONJ
...
#######
#
# LEMMA
#
#######

>>> for token in doc.tokens:
>>>     token.lemmatize(morph_vocab)

>>> display(doc.tokens[:5])
>>> {_.text: _.lemma for _ in doc.tokens}
[DocToken(stop=5, text='Посол', pos='NOUN', feats=<Anim,Nom,Masc,Sing>, lemma='посол'),
 DocToken(start=6, stop=13, text='Израиля', pos='PROPN', feats=<Inan,Gen,Masc,Sing>, lemma='израиль'),
 DocToken(start=14, stop=16, text='на', pos='ADP', lemma='на'),
 DocToken(start=17, stop=24, text='Украине', pos='PROPN', feats=<Inan,Loc,Fem,Sing>, lemma='украина'),
 DocToken(start=25, stop=30, text='Йоэль', pos='PROPN', feats=<Anim,Nom,Masc,Sing>, lemma='йоэль')]
{'Посол': 'посол',
 'Израиля': 'израиль',
 'на': 'на',
 'Украине': 'украина',
 'Йоэль': 'йоэль',
 'Лион': 'лион',
 'признался': 'признаться',
 ',': ',',
 'что': 'что',
 'пришел': 'прийти',
 'в': 'в',
 'шок': 'шок',
 'узнав': 'узнать',
 'о': 'о',
 ...
#######
#
# SYNTAX
#
#######

>>> doc.parse_syntax(syntax_parser)
>>> display(doc.tokens[:5])
>>> doc.sents[0].syntax.print()
[DocToken(stop=5, text='Посол', id='1_1', head_id='1_7', rel='nsubj', pos='NOUN', feats=<Anim,Nom,Masc,Sing>),
 DocToken(start=6, stop=13, text='Израиля', id='1_2', head_id='1_1', rel='nmod', pos='PROPN', feats=<Inan,Gen,Masc,Sing>),
 DocToken(start=14, stop=16, text='на', id='1_3', head_id='1_4', rel='case', pos='ADP'),
 DocToken(start=17, stop=24, text='Украине', id='1_4', head_id='1_1', rel='nmod', pos='PROPN', feats=<Inan,Loc,Fem,Sing>),
 DocToken(start=25, stop=30, text='Йоэль', id='1_5', head_id='1_1', rel='appos', pos='PROPN', feats=<Anim,Nom,Masc,Sing>)]
┌──► Посол nsubj
│ Израиля
│ ┌► на case
│ └─ Украине
│ ┌─ Йоэль
│ └► Лион flat:name
┌─────┌─└─── признался
│ │ ┌──► , punct
│ │ │ ┌► что mark
│ └►└─└─ пришел ccomp
│ │ ┌► в case
│ └──►└─ шок obl
│ ┌► , punct
│ ┌────►┌─└─ узнав advcl
│ │ │ ┌► о case
│ │ ┌───└►└─ решении obl
│ │ │ ┌─└──► властей nmod
│ │ │ │ ┌► Львовской amod
│ │ │ └──►└─ области nmod
│ └─└►┌─┌─── объявить nmod
│ │ │ ┌► 2019 amod
│ │ └►└─ год obj
│ └──►┌─ годом obl
│ ┌─────└► лидера nmod
│ │ ┌►┌─── запрещенной acl
│ │ │ │ ┌► в case
│ │ │ └►└─ России obl
│ ┌─└►└─┌─── Организации nmod
│ │ │ ┌► украинских amod
│ │ ┌─└►└─ националистов nmod
│ │ │ ┌► ( punct
│ │ └►┌─└─ ОУН parataxis
│ │ └──► ) punct
│ └──────►┌─ Степана appos
│ └► Бандеры flat:name
└──────────► . punct
...
#######
#
# NER
#
#######

>>> doc.tag_ner(ner_tagger)
>>> display(doc.spans[:5])
>>> doc.ner.print()
[DocSpan(start=6, stop=13, type='LOC', text='Израиля', tokens=[...]),
 DocSpan(start=17, stop=24, type='LOC', text='Украине', tokens=[...]),
 DocSpan(start=25, stop=35, type='PER', text='Йоэль Лион', tokens=[...]),
 DocSpan(start=89, stop=106, type='LOC', text='Львовской области', tokens=[...]),
 DocSpan(start=152, stop=158, type='LOC', text='России', tokens=[...])]
Посол Израиля на Украине Йоэль Лион признался, что пришел в шок, узнав
      LOC────    LOC──── PER───────
о решении властей Львовской области объявить 2019 год годом лидера
                  LOC──────────────
запрещенной в России Организации украинских националистов (ОУН)
              LOC─── ORG───────────────────────────────────────
Степана Бандеры. Свое заявление он разместил в Twitter. «Я не могу
PER────────────                                ORG────
понять, как прославление тех, кто непосредственно принимал участие в
ужасных антисемитских преступлениях, помогает бороться с
антисемитизмом и ксенофобией. Украина не должна забывать о
                              LOC────
преступлениях, совершенных против украинских евреев, и никоим образом
не отмечать их через почитание их исполнителей», — написал дипломат.
11 декабря Львовский областной совет принял решение провозгласить 2019
           ORG──────────────────────
год в регионе годом Степана Бандеры в связи с празднованием 110-летия
                    PER────────────
со дня рождения лидера ОУН (Бандера родился 1 января 1909 года). В
                       ORG
июле аналогичное решение принял Житомирский областной совет. В начале
                                ORG────────────────────────
месяца с предложением к президенту страны Петру Порошенко вернуть
                                          PER────────────
Бандере звание Героя Украины обратились депутаты Верховной Рады.
PER────              LOC────                     ORG───────────
Парламентарии уверены, что признание Бандеры национальным героем
                                     PER────
поможет в борьбе с подрывной деятельностью против Украины в
                                                  LOC────
информационном поле, а также остановит «распространение мифов,
созданных российской пропагандой». Степан Бандера (1909-1959) был
                                   PER───────────
одним из лидеров Организации украинских националистов, выступающей за
                 ORG─────────────────────────────────
создание независимого государства на территориях с украиноязычным
населением. В 2010 году в период президентства Виктора Ющенко Бандера
                                               PER─────────── PER────
был посмертно признан Героем Украины, однако впоследствии это решение
                      LOC────
было отменено судом.
#######
#
# PHRASE NORM
#
#######

>>> for span in doc.spans:
>>>     span.normalize(morph_vocab)

>>> display(doc.spans[:5])
>>> {_.text: _.normal for _ in doc.spans if _.text != _.normal}
[DocSpan(start=6, stop=13, type='LOC', text='Израиля', tokens=[...], normal='Израиль'),
 DocSpan(start=17, stop=24, type='LOC', text='Украине', tokens=[...], normal='Украина'),
 DocSpan(start=25, stop=35, type='PER', text='Йоэль Лион', tokens=[...], normal='Йоэль Лион'),
 DocSpan(start=89, stop=106, type='LOC', text='Львовской области', tokens=[...], normal='Львовская область'),
 DocSpan(start=152, stop=158, type='LOC', text='России', tokens=[...], normal='Россия')]
{'Израиля': 'Израиль',
 'Украине': 'Украина',
 'Львовской области': 'Львовская область',
 'России': 'Россия',
 'Организации украинских националистов (ОУН)': 'Организация украинских националистов (ОУН)',
 'Степана Бандеры': 'Степан Бандера',
 'Петру Порошенко': 'Петр Порошенко',
 'Бандере': 'Бандера',
 'Украины': 'Украина',
 'Верховной Рады': 'Верховная Рада',
 'Бандеры': 'Бандера',
 'Организации украинских националистов': 'Организация украинских националистов',
 'Виктора Ющенко': 'Виктор Ющенко'}
#######
#
# FACT
#
#######

>>> for span in doc.spans:
>>>     if span.type == PER:
>>>         span.extract_fact(names_extractor)

>>> {_.normal: _.fact for _ in doc.spans if _.type == PER}
{'Йоэль Лион': Name(
     first='Йоэль',
     last='Лион'
 ),
 'Степан Бандера': Name(
     first='Степан',
     last='Бандера'
 ),
 'Петр Порошенко': Name(
     first='Петр',
     last='Порошенко'
 ),
 'Бандера': Name(
     last='Бандера'
 ),
 'Виктор Ющенко': Name(
     first='Виктор',
     last='Ющенко'
 )}
```
## Evaluation

### Segmentation

Natasha uses <a href="https://github.com/natasha/razdel">Razdel</a> for text segmentation.

`errors` — the number of errors aggregated over 4 datasets; see the <a href="https://github.com/natasha/razdel#quality-performance">Razdel evaluation section</a> for more info.
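Razdel is rule-based. As a toy illustration of the approach (Razdel itself handles abbreviations, initials, quotes and many other edge cases far more carefully), a naive sentence splitter might look like this; the function and regex below are for illustration only, not Razdel's actual code:

```python
import re

# Naive rule-based sentence segmentation: split after . ! ?
# when followed by whitespace and an uppercase Cyrillic/Latin
# letter or an opening quote. Razdel's rules are far richer.
def naive_sentenize(text):
    parts = re.split(r'(?<=[.!?])\s+(?=[А-ЯA-Z«])', text.strip())
    return [p for p in parts if p]

sents = naive_sentenize('Привет! Как дела? Всё хорошо.')
# → ['Привет!', 'Как дела?', 'Всё хорошо.']
```

A splitter like this fails on "т. е.", initials and quoted speech, which is exactly the gap the aggregated `errors` column above measures.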
| <!--- token ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>errors</th> | ||
| <th>time</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>razdel.tokenize</th> | ||
| <td>5439</td> | ||
| <td>9.898350</td> | ||
| </tr> | ||
| <tr> | ||
| <th>mystem</th> | ||
| <td>12192</td> | ||
| <td>17.210470</td> | ||
| </tr> | ||
| <tr> | ||
| <th>spacy</th> | ||
| <td>12288</td> | ||
| <td>19.920618</td> | ||
| </tr> | ||
| <tr> | ||
| <th>nltk.word_tokenize</th> | ||
| <td>130119</td> | ||
| <td>12.405366</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- token ---> | ||
| <!--- sent ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>errors</th> | ||
| <th>time</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>razdel.sentenize</th> | ||
| <td>32106</td> | ||
| <td>21.989045</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov/rusenttokenize</th> | ||
| <td>41722</td> | ||
| <td>32.535322</td> | ||
| </tr> | ||
| <tr> | ||
| <th>nltk.sent_tokenize</th> | ||
| <td>60378</td> | ||
| <td>29.916063</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- sent ---> | ||
### Embedding

Natasha uses <a href="https://github.com/natasha/navec">Navec pretrained embeddings</a>.

`precision` — average precision over 4 datasets; see the <a href="https://github.com/natasha/navec#evaluation">Navec evaluation section</a> for more info.
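The `_100q` suffix in the Navec model names refers to quantization: instead of raw float vectors, words are stored as indices into small per-subspace codebooks (product quantization), which is what keeps the models tens of megabytes instead of gigabytes. A toy decode step, with made-up 2-subspace codebooks (not Navec's actual file format or API):

```python
# Toy product quantization: each word vector is stored as one
# codebook index per subspace instead of raw floats.
codebooks = [
    [[0.1, 0.2], [0.3, 0.4]],  # codebook for dims 0-1
    [[0.5, 0.6], [0.7, 0.8]],  # codebook for dims 2-3
]
codes = {'привет': (0, 1)}     # word -> one index per subspace

def decode(word):
    vec = []
    for subspace, idx in enumerate(codes[word]):
        vec.extend(codebooks[subspace][idx])
    return vec

vec = decode('привет')
# → [0.1, 0.2, 0.7, 0.8]
```

Storing two small indices instead of four floats is what the `disk, mb` column reflects: the quantized `navec` rows are an order of magnitude smaller than the raw `w2v` and `fasttext` baselines.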
| <!--- emb1 ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>type</th> | ||
| <th>precision</th> | ||
| <th>init, s</th> | ||
| <th>disk, mb</th> | ||
| <th>ram, mb</th> | ||
| <th>vocab</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>hudlit_12B_500K_300d_100q</th> | ||
| <td>navec</td> | ||
| <td>0.825</td> | ||
| <td>1.0</td> | ||
| <td>50.6</td> | ||
| <td>95.3</td> | ||
| <td>500K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>news_1B_250K_300d_100q</th> | ||
| <td>navec</td> | ||
| <td>0.775</td> | ||
| <td>0.5</td> | ||
| <td>25.4</td> | ||
| <td>47.7</td> | ||
| <td>250K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>ruscorpora_upos_cbow_300_20_2019</th> | ||
| <td>w2v</td> | ||
| <td>0.777</td> | ||
| <td>12.1</td> | ||
| <td>220.6</td> | ||
| <td>236.1</td> | ||
| <td>189K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>ruwikiruscorpora_upos_skipgram_300_2_2019</th> | ||
| <td>w2v</td> | ||
| <td>0.776</td> | ||
| <td>15.7</td> | ||
| <td>290.0</td> | ||
| <td>309.4</td> | ||
| <td>248K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>tayga_upos_skipgram_300_2_2019</th> | ||
| <td>w2v</td> | ||
| <td>0.795</td> | ||
| <td>15.7</td> | ||
| <td>290.7</td> | ||
| <td>310.9</td> | ||
| <td>249K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>tayga_none_fasttextcbow_300_10_2019</th> | ||
| <td>fasttext</td> | ||
| <td>0.706</td> | ||
| <td>11.3</td> | ||
| <td>2741.9</td> | ||
| <td>2746.9</td> | ||
| <td>192K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>araneum_none_fasttextcbow_300_5_2018</th> | ||
| <td>fasttext</td> | ||
| <td>0.720</td> | ||
| <td>7.8</td> | ||
| <td>2752.1</td> | ||
| <td>2754.7</td> | ||
| <td>195K</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- emb1 ---> | ||
### Morphology

Natasha uses the <a href="https://github.com/natasha/slovnet#morphology">Slovnet morphology tagger</a>.

`accuracy` — accuracy on a news dataset; see the <a href="https://github.com/natasha/slovnet#morphology-1">Slovnet evaluation section</a> for more.
| <!--- morph1 ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>accuracy</th> | ||
| <th>init, s</th> | ||
| <th>disk, mb</th> | ||
| <th>ram, mb</th> | ||
| <th>speed, sents/s</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>slovnet</th> | ||
| <td>0.961</td> | ||
| <td>1.0</td> | ||
| <td>27</td> | ||
| <td>115</td> | ||
| <td>532.0</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov_bert</th> | ||
| <td>0.951</td> | ||
| <td>20.0</td> | ||
| <td>1393</td> | ||
| <td>8704</td> | ||
| <td>85.0 (gpu)</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov</th> | ||
| <td>0.940</td> | ||
| <td>4.0</td> | ||
| <td>32</td> | ||
| <td>10240</td> | ||
| <td>90.0 (gpu)</td> | ||
| </tr> | ||
| <tr> | ||
| <th>spacy</th> | ||
| <td>0.919</td> | ||
| <td>10.9</td> | ||
| <td>89</td> | ||
| <td>579</td> | ||
| <td>30.6</td> | ||
| </tr> | ||
| <tr> | ||
| <th>udpipe</th> | ||
| <td>0.918</td> | ||
| <td>6.9</td> | ||
| <td>45</td> | ||
| <td>242</td> | ||
| <td>56.2</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- morph1 ---> | ||
### Syntax

Natasha uses the <a href="https://github.com/natasha/slovnet#syntax">Slovnet syntax parser</a>.

`uas`, `las` — accuracy on a news dataset; see the <a href="https://github.com/natasha/slovnet#syntax-1">Slovnet evaluation section</a> for more.
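`uas` (unlabeled attachment score) is the share of tokens whose predicted head is correct; `las` (labeled attachment score) additionally requires the correct dependency relation. A minimal sketch of the two metrics (the linked evaluation uses its own tooling):

```python
def attachment_scores(pred, gold):
    # pred/gold: aligned lists of (head_id, rel) per token.
    uas = sum(p[0] == g[0] for p, g in zip(pred, gold)) / len(gold)
    las = sum(p == g for p, g in zip(pred, gold)) / len(gold)
    return uas, las

gold = [(2, 'nsubj'), (0, 'root'), (2, 'obj'), (2, 'punct')]
pred = [(2, 'nsubj'), (0, 'root'), (2, 'nmod'), (2, 'punct')]  # wrong rel on token 3
scores = attachment_scores(pred, gold)
# → (1.0, 0.75): all heads correct, one relation label wrong
```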
| <!--- syntax1 ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>uas</th> | ||
| <th>las</th> | ||
| <th>init, s</th> | ||
| <th>disk, mb</th> | ||
| <th>ram, mb</th> | ||
| <th>speed, sents/s</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>slovnet</th> | ||
| <td>0.907</td> | ||
| <td>0.880</td> | ||
| <td>1.0</td> | ||
| <td>27</td> | ||
| <td>125</td> | ||
| <td>450.0</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov_bert</th> | ||
| <td>0.962</td> | ||
| <td>0.910</td> | ||
| <td>34.0</td> | ||
| <td>1427</td> | ||
| <td>8704</td> | ||
| <td>75.0 (gpu)</td> | ||
| </tr> | ||
| <tr> | ||
| <th>spacy</th> | ||
| <td>0.876</td> | ||
| <td>0.818</td> | ||
| <td>10.9</td> | ||
| <td>89</td> | ||
| <td>579</td> | ||
| <td>31.6</td> | ||
| </tr> | ||
| <tr> | ||
| <th>udpipe</th> | ||
| <td>0.873</td> | ||
| <td>0.823</td> | ||
| <td>6.9</td> | ||
| <td>45</td> | ||
| <td>242</td> | ||
| <td>56.2</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- syntax1 ---> | ||
### NER

Natasha uses the <a href="https://github.com/natasha/slovnet#ner">Slovnet NER tagger</a>.

`f1` — F1 score aggregated over 4 datasets; see the <a href="https://github.com/natasha/slovnet#ner-1">Slovnet evaluation section</a> for more.
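The score is span-level: a predicted entity counts as correct only if both its boundaries and its type match a gold annotation. A minimal sketch of the metric (the linked evaluation uses its own tooling and datasets):

```python
def span_f1(pred, gold):
    # pred/gold: sets of (start, stop, type) triples.
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)                     # exact-match true positives
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(6, 13, 'LOC'), (25, 35, 'PER')}
pred = {(6, 13, 'LOC'), (25, 30, 'PER')}      # second span boundary is wrong
f1 = span_f1(pred, gold)
# → 0.5: the mis-bounded PER span counts as a miss
```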
| <!--- ner1 ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>PER/LOC/ORG f1</th> | ||
| <th>init, s</th> | ||
| <th>disk, mb</th> | ||
| <th>ram, mb</th> | ||
| <th>speed, articles/s</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>slovnet</th> | ||
| <td>0.97/0.91/0.85</td> | ||
| <td>1.0</td> | ||
| <td>27</td> | ||
| <td>205</td> | ||
| <td>25.3</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov_bert</th> | ||
| <td>0.98/0.92/0.86</td> | ||
| <td>34.5</td> | ||
| <td>2048</td> | ||
| <td>6144</td> | ||
| <td>13.1 (gpu)</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov</th> | ||
| <td>0.92/0.86/0.76</td> | ||
| <td>5.9</td> | ||
| <td>1024</td> | ||
| <td>3072</td> | ||
| <td>24.3 (gpu)</td> | ||
| </tr> | ||
| <tr> | ||
| <th>pullenti</th> | ||
| <td>0.92/0.82/0.64</td> | ||
| <td>2.9</td> | ||
| <td>16</td> | ||
| <td>253</td> | ||
| <td>6.0</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- ner1 ---> | ||
## Support

- Chat — https://telegram.me/natural_language_processing
- Issues — https://github.com/natasha/natasha/issues
- Commercial support — http://lab.alexkuk.ru/natasha

## Development

Tests:

```bash
make test
```

Package:

```bash
make version
git push
git push --tags
make clean package publish
```
Keywords: natural language processing,russian
Platform: UNKNOWN
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Description-Content-Type: text/markdown
pymorphy2
intervaltree>=3
razdel>=0.5.0
navec>=0.9.0
slovnet>=0.3.0
yargy>=0.14.0
ipymarkup>=0.8.0
README.md
setup.cfg
setup.py
natasha/__init__.py
natasha/const.py
natasha/doc.py
natasha/emb.py
natasha/extractors.py
natasha/ner.py
natasha/norm.py
natasha/record.py
natasha/segment.py
natasha/shape.py
natasha/span.py
natasha/syntax.py
natasha.egg-info/PKG-INFO
natasha.egg-info/SOURCES.txt
natasha.egg-info/dependency_links.txt
natasha.egg-info/requires.txt
natasha.egg-info/top_level.txt
natasha/data/__init__.py
natasha/data/dict/first.txt
natasha/data/dict/last.txt
natasha/data/dict/maybe_first.txt
natasha/data/emb/navec_news_v1_1B_250K_300d_100q.tar
natasha/data/model/slovnet_morph_news_v1.tar
natasha/data/model/slovnet_ner_news_v1.tar
natasha/data/model/slovnet_syntax_news_v1.tar
natasha/grammars/__init__.py
natasha/grammars/addr.py
natasha/grammars/date.py
natasha/grammars/money.py
natasha/grammars/name.py
natasha/morph/__init__.py
natasha/morph/lemma.py
natasha/morph/tagger.py
natasha/morph/vocab.py
natasha/tests/__init__.py
natasha/tests/conftest.py
natasha/tests/test_addr.py
natasha/tests/test_date.py
natasha/tests/test_doc.py
natasha/tests/test_money.py
natasha/tests/test_name.py
natasha
PER = 'PER'
LOC = 'LOC'
ORG = 'ORG'
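These constants are the span type labels used throughout the API, e.g. to filter `doc.spans` by entity type. A sketch, using a hypothetical minimal stand-in for natasha's `DocSpan`:

```python
from collections import namedtuple

# Hypothetical stand-in for natasha's DocSpan, for illustration only.
Span = namedtuple('Span', 'text type')

PER, LOC, ORG = 'PER', 'LOC', 'ORG'

spans = [Span('Йоэль Лион', PER), Span('Израиля', LOC), Span('ОУН', ORG)]
persons = [s.text for s in spans if s.type == PER]
# → ['Йоэль Лион']
```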
| август | ||
| августа | ||
| агата # агаты | ||
| ада | ||
| аи | ||
| алмаз | ||
| алтай | ||
| альф # Альфа-банку | ||
| аля | ||
| ангел | ||
| ангела | ||
| арен # арену O2 | ||
| ария | ||
| афина | ||
| ая # 5-ая | ||
| баги | ||
| барак | ||
| бела # белой | ||
| бета | ||
| боян | ||
| буся | ||
| валентинка | ||
| валентиночка | ||
| вели | ||
| вера | ||
| вики # wiki | ||
| викторина | ||
| галька | ||
| гера | ||
| гор | ||
| дада | ||
| дан # дана | ||
| дана | ||
| данна # данной | ||
| даня # дань | ||
| дельфин | ||
| джем | ||
| джин | ||
| дрон | ||
| женя # жени | ||
| ида # иду | ||
| иза # из | ||
| ислам | ||
| калия # нитрата калия | ||
| канат | ||
| капа | ||
| кися | ||
| кола | ||
| коля | ||
| костя | ||
| куба | ||
| лада | ||
| ландыш | ||
| лев | ||
| лен | ||
| лена | ||
| лиана | ||
| ливия | ||
| лил | ||
| лила | ||
| лилия | ||
| лука | ||
| люба # любой | ||
| любов | ||
| любовь | ||
| лёва # левой | ||
| лёня | ||
| майк | ||
| майя | ||
| макс | ||
| макса | ||
| маргаритка | ||
| марк # марки | ||
| марта | ||
| мая | ||
| мила | ||
| надежда | ||
| нано | ||
| нарик | ||
| ник | ||
| ника | ||
| никон | ||
| нил | ||
| ной # 50%-ной | ||
| нора | ||
| олимпиада | ||
| ося # ось | ||
| павлина | ||
| петрушка | ||
| петя | ||
| питер | ||
| пол | ||
| поль | ||
| поля | ||
| рада | ||
| раш # раша | ||
| рая | ||
| рим | ||
| рима | ||
| родя # в своём роде | ||
| роза | ||
| розочка | ||
| рой | ||
| ром | ||
| рома | ||
| роман | ||
| романа | ||
| рулон | ||
| сами | ||
| света | ||
| семён # семена | ||
| серёжка | ||
| сирена | ||
| слава | ||
| спартак | ||
| султан | ||
| султана | ||
| талия | ||
| таньчик | ||
| тарана | ||
| тиша | ||
| толя # толи | ||
| том # в том | ||
| тома | ||
| томик | ||
| тёма # тем | ||
| урал | ||
| феня | ||
| фиалка | ||
| флора | ||
| хамит | ||
| ширин | ||
| ширина | ||
| эльбрус | ||
| ёлка |
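Each line of this dictionary holds one lowercase lemma; the optional `#` comment appears to note the confusable non-name form the entry guards against (e.g. `толя # толи`). A hypothetical loader for the format (not natasha's actual code):

```python
def load_dict(lines):
    # One lemma per line; '#' starts a comment flagging the
    # ambiguous form the entry disambiguates against.
    words = set()
    for line in lines:
        word = line.split('#', 1)[0].strip()
        if word:
            words.add(word)
    return words

words = load_dict(['август', 'толя # толи', '', 'ёлка'])
# → {'август', 'толя', 'ёлка'}
```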
| from .const import ORG | ||
| from .record import Record | ||
| from .span import index_spans, query_spans_index | ||
| from .norm import normalize, syntax_normalize | ||
| class Record(Record): | ||
| def __repr__(self): | ||
| return compact_repr(self) | ||
| def _repr_pretty_(self, printer, cycle): | ||
| printer.text(repr(self)) | ||
| class DocToken(Record): | ||
| __attributes__ = ['start', 'stop', 'text', | ||
| 'id', 'head_id', 'rel', | ||
| 'pos', 'feats', 'lemma'] | ||
| def __init__(self, start, stop, text, | ||
| id=None, head_id=None, rel=None, | ||
| pos=None, feats=None, lemma=None): | ||
| self.start = start | ||
| self.stop = stop | ||
| self.text = text | ||
| self.id = id | ||
| self.head_id = head_id | ||
| self.rel = rel | ||
| self.pos = pos | ||
| self.feats = feats | ||
| self.lemma = lemma | ||
| def lemmatize(self, vocab): | ||
| self.lemma = vocab.lemmatize(self.text, self.pos, self.feats) | ||
| class DocSpan(Record): | ||
| __attributes__ = ['start', 'stop', 'type', 'text', | ||
| 'tokens', 'normal', 'fact'] | ||
| def __init__(self, start, stop, type, text, | ||
| tokens=None, normal=None, fact=None): | ||
| self.start = start | ||
| self.stop = stop | ||
| self.type = type | ||
| self.text = text | ||
| self.tokens = tokens | ||
| self.normal = normal | ||
| self.fact = fact | ||
| def normalize(self, vocab): | ||
| method = ( | ||
| syntax_normalize | ||
| if self.type == ORG | ||
| else normalize | ||
| ) | ||
| self.normal = method(vocab, self.tokens) | ||
| def extract_fact(self, extractor): | ||
| match = extractor.find(self.normal) | ||
| if match: | ||
| self.fact = match.fact | ||
| class DocSent(Record): | ||
| __attributes__ = ['start', 'stop', 'text', 'tokens', 'spans'] | ||
| def __init__(self, start, stop, text, | ||
| tokens=None, spans=None): | ||
| self.start = start | ||
| self.stop = stop | ||
| self.text = text | ||
| self.tokens = tokens | ||
| self.spans = spans | ||
| @property | ||
| def morph(self): | ||
| return morph_markup(self.tokens) | ||
| @property | ||
| def syntax(self): | ||
| return syntax_markup(self.tokens) | ||
| @property | ||
| def ner(self): | ||
| return ner_markup(self.text, self.spans, -self.start) | ||
| class Doc(Record): | ||
| __attributes__ = ['text', 'tokens', 'spans', 'sents'] | ||
| def __init__(self, text, tokens=None, spans=None, sents=None): | ||
| self.text = text | ||
| self.tokens = tokens | ||
| self.spans = spans | ||
| self.sents = sents | ||
| def segment(self, segmenter): | ||
| segment_doc(self, segmenter) | ||
| def tag_morph(self, tagger): | ||
| tag_morph_doc(self, tagger) | ||
| def parse_syntax(self, parser): | ||
| parse_syntax_doc(self, parser) | ||
| def tag_ner(self, tagger): | ||
| tag_ner_doc(self, tagger) | ||
| @property | ||
| def morph(self): | ||
| return morph_markup(self.tokens) | ||
| @property | ||
| def syntax(self): | ||
| return syntax_markup(self.tokens) | ||
| @property | ||
| def ner(self): | ||
| return ner_markup(self.text, self.spans) | ||
| ####### | ||
| # | ||
| # SEGMENT | ||
| # | ||
| ####### | ||
| def adapt_token(token, offset): | ||
| start, stop, text = token | ||
| return DocToken( | ||
| offset + start, | ||
| offset + stop, | ||
| text | ||
| ) | ||
| def adapt_sent(sent): | ||
| start, stop, text = sent | ||
| return DocSent(start, stop, text) | ||
| def segment_doc(doc, segmenter): | ||
| doc.tokens, doc.sents = [], [] | ||
| for sent in segmenter.sentenize(doc.text): | ||
| sent = adapt_sent(sent) | ||
| doc.sents.append(sent) | ||
| sent.tokens = [] | ||
| for token in segmenter.tokenize(sent.text): | ||
| token = adapt_token(token, sent.start) | ||
| doc.tokens.append(token) | ||
| sent.tokens.append(token) | ||
| ####### | ||
| # | ||
| # MORPH | ||
| # | ||
| #### | ||
| def sent_words(sent): | ||
| return [_.text for _ in sent.tokens] | ||
| def inject_morph(targets, sources): | ||
| for target, source in zip(targets, sources): | ||
| target.pos = source.pos | ||
| target.feats = source.feats | ||
| def tag_morph_doc(doc, tagger): | ||
| chunk = [sent_words(_) for _ in doc.sents] | ||
| markups = tagger.map(chunk) | ||
| for sent, markup in zip(doc.sents, markups): | ||
| inject_morph(sent.tokens, markup.tokens) | ||
| ####### | ||
| # | ||
| # SYNTAX | ||
| # | ||
| ###### | ||
| def offset_syntax(sent_id, tokens): | ||
| for token in tokens: | ||
| token.id = '%s_%s' % (sent_id, token.id) | ||
| token.head_id = '%s_%s' % (sent_id, token.head_id) | ||
| def inject_syntax(targets, sources): | ||
| for target, source in zip(targets, sources): | ||
| target.id = source.id | ||
| target.head_id = source.head_id | ||
| target.rel = source.rel | ||
| def parse_syntax_doc(doc, parser): | ||
| chunk = [sent_words(_) for _ in doc.sents] | ||
| markups = parser.map(chunk) | ||
| for sent_id, (sent, markup) in enumerate(zip(doc.sents, markups), 1): | ||
| inject_syntax(sent.tokens, markup.tokens) | ||
| offset_syntax(sent_id, sent.tokens) | ||
| ######### | ||
| # | ||
| # NER | ||
| # | ||
| ###### | ||
| def adapt_spans(doc, spans): | ||
| for start, stop, type in spans: | ||
| text = doc.text[start:stop] | ||
| yield DocSpan(start, stop, type, text) | ||
| def tag_ner_doc(doc, tagger): | ||
| markup = tagger(doc.text) | ||
| doc.spans = list(adapt_spans(doc, markup.spans)) | ||
| # envelope tokens < spans < sents | ||
| index = index_spans(doc.tokens) | ||
| for span in doc.spans: | ||
| span.tokens = query_spans_index(index, span) | ||
| index = index_spans(doc.spans) | ||
| for sent in doc.sents: | ||
| sent.spans = query_spans_index(index, sent) | ||
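The "envelope tokens &lt; spans &lt; sents" step attaches each child interval to the parent interval that contains it: tokens into NER spans, spans into sentences. A naive containment sketch (the real `index_spans`/`query_spans_index` presumably build an index to avoid this O(n·m) scan):

```python
# Attach child (start, stop, payload) triples to the containing parent.
def envelope(parents, children):
    for start, stop, payload in parents:
        inside = [c for c in children if c[0] >= start and c[1] <= stop]
        yield payload, [c[2] for c in inside]

tokens = [(0, 4, 'Анна'), (5, 10, 'живёт'), (13, 19, 'Москве')]
spans = [(0, 4, 'PER'), (13, 19, 'LOC')]
print(dict(envelope(spans, tokens)))
# {'PER': ['Анна'], 'LOC': ['Москве']}
```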
| ##### | ||
| # | ||
| # MARKUP | ||
| # | ||
| ###### | ||
| def morph_markup(tokens): | ||
| from .morph.tagger import adapt_tokens, MorphMarkup | ||
| tokens = list(adapt_tokens(tokens)) | ||
| return MorphMarkup(tokens) | ||
| def syntax_markup(tokens): | ||
| from .syntax import adapt_tokens, SyntaxMarkup | ||
| tokens = list(adapt_tokens(tokens)) | ||
| return SyntaxMarkup(tokens) | ||
| def ner_markup(text, spans, offset=0): | ||
| from .span import adapt_spans, offset_spans | ||
| from .ner import NERMarkup | ||
| spans = adapt_spans(spans) | ||
| spans = list(offset_spans(spans, offset)) | ||
| return NERMarkup(text, spans) | ||
| ####### | ||
| # | ||
| # REPR | ||
| # | ||
| ###### | ||
| FEATS = 'feats' | ||
| def format_feats(feats): | ||
| values = (feats[_] for _ in sorted(feats)) | ||
| return '<%s>' % ','.join(values) | ||
| def capped_str(value, cap=50): | ||
| if len(value) <= cap: | ||
| return value | ||
| return value[:cap] + '...' | ||
| HIDE_LIST = '[...]' | ||
| def compact_repr(record): | ||
| parts = [] | ||
| for key in record.__attributes__: | ||
| value = getattr(record, key) | ||
| if not value: | ||
| continue | ||
| if isinstance(value, list): | ||
| value = HIDE_LIST | ||
| elif key == FEATS: | ||
| value = format_feats(value) | ||
| else: | ||
| value = repr(value) | ||
| value = capped_str(value) | ||
| parts.append('%s=%s' % (key, value)) | ||
| return '%s(%s)' % (record.__class__.__name__, ', '.join(parts)) |
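The REPR helpers above produce a compact one-line record repr: empty fields are skipped, lists collapse to `[...]`, feats render as `<...>`. A self-contained sketch (simplified: `capped_str` truncation is omitted, and `__attributes__` mimics the record base class used in natasha):

```python
class Rec:
    __attributes__ = ['text', 'feats', 'tokens']
    def __init__(self, text, feats, tokens):
        self.text, self.feats, self.tokens = text, feats, tokens

def compact_repr(record):
    parts = []
    for key in record.__attributes__:
        value = getattr(record, key)
        if not value:
            continue  # skip empty/None fields
        if isinstance(value, list):
            value = '[...]'  # hide list contents
        elif key == 'feats':
            value = '<%s>' % ','.join(value[_] for _ in sorted(value))
        else:
            value = repr(value)
        parts.append('%s=%s' % (key, value))
    return '%s(%s)' % (record.__class__.__name__, ', '.join(parts))

rec = Rec('мама', {'Case': 'Nom', 'Number': 'Sing'}, [1, 2])
print(compact_repr(rec))
# Rec(text='мама', feats=<Nom,Sing>, tokens=[...])
```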
| from navec import Navec | ||
| from .data import NEWS_EMBEDDING | ||
| class Embedding(Navec): | ||
| def __init__(self, path): | ||
| meta, vocab, pq = Navec.load(path) | ||
| Navec.__init__(self, meta, vocab, pq) | ||
| class NewsEmbedding(Embedding): | ||
| def __init__(self, path=NEWS_EMBEDDING): | ||
| Embedding.__init__(self, path) |
| from yargy import ( | ||
| rule, | ||
| or_, and_ | ||
| ) | ||
| from yargy.interpretation import fact | ||
| from yargy.predicates import ( | ||
| eq, lte, gte, gram, type, tag, | ||
| length_eq, | ||
| in_, in_caseless, dictionary, | ||
| normalized, caseless, | ||
| is_title | ||
| ) | ||
| from yargy.pipelines import morph_pipeline | ||
| from yargy.tokenizer import QUOTES | ||
| Index = fact( | ||
| 'Index', | ||
| ['value'] | ||
| ) | ||
| Country = fact( | ||
| 'Country', | ||
| ['name'] | ||
| ) | ||
| Region = fact( | ||
| 'Region', | ||
| ['name', 'type'] | ||
| ) | ||
| Settlement = fact( | ||
| 'Settlement', | ||
| ['name', 'type'] | ||
| ) | ||
| Street = fact( | ||
| 'Street', | ||
| ['name', 'type'] | ||
| ) | ||
| Building = fact( | ||
| 'Building', | ||
| ['number', 'type'] | ||
| ) | ||
| Room = fact( | ||
| 'Room', | ||
| ['number', 'type'] | ||
| ) | ||
| AddrPart = fact( | ||
| 'AddrPart', | ||
| ['value'] | ||
| ) | ||
| def value(key): | ||
| @property | ||
| def field(self): | ||
| return getattr(self, key) | ||
| return field | ||
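`value(key)` is a small property factory: it aliases a record-specific field (`name` for regions and streets, `number` for buildings and rooms) to a uniform `.value` attribute, which `AddrPart.obj` then reads without caring about the concrete part type. Stand-alone demonstration with toy classes:

```python
def value(key):
    @property
    def field(self):
        return getattr(self, key)  # alias the named field as .value
    return field

class Street:
    value = value('name')
    def __init__(self, name):
        self.name = name

class Building:
    value = value('number')
    def __init__(self, number):
        self.number = number

print(Street('Ленина').value, Building('33а').value)
# Ленина 33а
```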
| class Index(Index): | ||
| type = 'индекс' | ||
| class Country(Country): | ||
| type = 'страна' | ||
| value = value('name') | ||
| class Region(Region): | ||
| value = value('name') | ||
| class Settlement(Settlement): | ||
| value = value('name') | ||
| class Street(Street): | ||
| value = value('name') | ||
| class Building(Building): | ||
| value = value('number') | ||
| class Room(Room): | ||
| value = value('number') | ||
| class AddrPart(AddrPart): | ||
| @property | ||
| def obj(self): | ||
| from natasha.extractors import AddrPart | ||
| part = self.value | ||
| return AddrPart(part.value, part.type) | ||
| DASH = eq('-') | ||
| DOT = eq('.') | ||
| ADJF = gram('ADJF') | ||
| NOUN = gram('NOUN') | ||
| INT = type('INT') | ||
| TITLE = is_title() | ||
| ANUM = rule( | ||
| INT, | ||
| DASH.optional(), | ||
| in_caseless({ | ||
| 'я', 'й', 'е', | ||
| 'ое', 'ая', 'ий', 'ой' | ||
| }) | ||
| ) | ||
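`ANUM` matches ordinal number forms used in names: an integer, an optional dash, and a short ordinal ending. A regex approximation (an assumption for illustration, not the yargy rule itself, which works on separate tokens):

```python
import re

# integer + optional dash + ordinal ending, as in '3-я', '1-й', '10-ое'
ANUM_RE = re.compile(r'^\d+\s*-?\s*(я|й|е|ое|ая|ий|ой)$', re.IGNORECASE)

for text in ['3-я', '1-й', '2е', '10-ое', '5']:
    print(text, bool(ANUM_RE.match(text)))
# 3-я True, 1-й True, 2е True, 10-ое True, 5 False
```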
| ######### | ||
| # | ||
| # STRANA | ||
| # | ||
| ########## | ||
| # TODO | ||
| COUNTRY_VALUE = dictionary({ | ||
| 'россия', | ||
| 'украина' | ||
| }) | ||
| ABBR_COUNTRY_VALUE = in_caseless({ | ||
| 'рф' | ||
| }) | ||
| COUNTRY = or_( | ||
| COUNTRY_VALUE, | ||
| ABBR_COUNTRY_VALUE | ||
| ).interpretation( | ||
| Country.name | ||
| ).interpretation( | ||
| Country | ||
| ) | ||
| ############# | ||
| # | ||
| # FED OKRUGA | ||
| # | ||
| ############ | ||
| FED_OKRUG_NAME = or_( | ||
| rule( | ||
| dictionary({ | ||
| 'дальневосточный', | ||
| 'приволжский', | ||
| 'сибирский', | ||
| 'уральский', | ||
| 'центральный', | ||
| 'южный', | ||
| }) | ||
| ), | ||
| rule( | ||
| caseless('северо'), | ||
| DASH.optional(), | ||
| dictionary({ | ||
| 'западный', | ||
| 'кавказский' | ||
| }) | ||
| ) | ||
| ).interpretation( | ||
| Region.name | ||
| ) | ||
| FED_OKRUG_WORDS = or_( | ||
| rule( | ||
| normalized('федеральный'), | ||
| normalized('округ') | ||
| ), | ||
| rule(caseless('фо')) | ||
| ).interpretation( | ||
| Region.type.const('федеральный округ') | ||
| ) | ||
| FED_OKRUG = rule( | ||
| FED_OKRUG_WORDS, | ||
| FED_OKRUG_NAME | ||
| ).interpretation( | ||
| Region | ||
| ) | ||
| ######### | ||
| # | ||
| # RESPUBLIKA | ||
| # | ||
| ############ | ||
| RESPUBLIKA_WORDS = or_( | ||
| rule(caseless('респ'), DOT.optional()), | ||
| rule(normalized('республика')) | ||
| ).interpretation( | ||
| Region.type.const('республика') | ||
| ) | ||
| RESPUBLIKA_ADJF = or_( | ||
| rule( | ||
| dictionary({ | ||
| 'удмуртский', | ||
| 'чеченский', | ||
| 'чувашский', | ||
| }) | ||
| ), | ||
| rule( | ||
| caseless('карачаево'), | ||
| DASH.optional(), | ||
| normalized('черкесский') | ||
| ), | ||
| rule( | ||
| caseless('кабардино'), | ||
| DASH.optional(), | ||
| normalized('балкарский') | ||
| ) | ||
| ).interpretation( | ||
| Region.name | ||
| ) | ||
| RESPUBLIKA_NAME = or_( | ||
| rule( | ||
| dictionary({ | ||
| 'адыгея', | ||
| 'алтай', | ||
| 'башкортостан', | ||
| 'бурятия', | ||
| 'дагестан', | ||
| 'ингушетия', | ||
| 'калмыкия', | ||
| 'карелия', | ||
| 'коми', | ||
| 'крым', | ||
| 'мордовия', | ||
| 'татарстан', | ||
| 'тыва', | ||
| 'удмуртия', | ||
| 'хакасия', | ||
| 'саха', | ||
| 'якутия', | ||
| }) | ||
| ), | ||
| rule(caseless('марий'), caseless('эл')), | ||
| rule( | ||
| normalized('северный'), normalized('осетия'), | ||
| rule('-', normalized('алания')).optional() | ||
| ) | ||
| ).interpretation( | ||
| Region.name | ||
| ) | ||
| RESPUBLIKA_ABBR = in_caseless({ | ||
| 'кбр', | ||
| 'кчр', | ||
| 'рт', # Татарстан | ||
| }).interpretation( | ||
| Region.name # TODO type | ||
| ) | ||
| RESPUBLIKA = or_( | ||
| rule(RESPUBLIKA_ADJF, RESPUBLIKA_WORDS), | ||
| rule(RESPUBLIKA_WORDS, RESPUBLIKA_NAME), | ||
| rule(RESPUBLIKA_ABBR) | ||
| ).interpretation( | ||
| Region | ||
| ) | ||
| ########## | ||
| # | ||
| # KRAI | ||
| # | ||
| ######## | ||
| KRAI_WORDS = normalized('край').interpretation( | ||
| Region.type.const('край') | ||
| ) | ||
| KRAI_NAME = dictionary({ | ||
| 'алтайский', | ||
| 'забайкальский', | ||
| 'камчатский', | ||
| 'краснодарский', | ||
| 'красноярский', | ||
| 'пермский', | ||
| 'приморский', | ||
| 'ставропольский', | ||
| 'хабаровский', | ||
| }).interpretation( | ||
| Region.name | ||
| ) | ||
| KRAI = rule( | ||
| KRAI_NAME, KRAI_WORDS | ||
| ).interpretation( | ||
| Region | ||
| ) | ||
| ############ | ||
| # | ||
| # OBLAST | ||
| # | ||
| ############ | ||
| OBLAST_WORDS = or_( | ||
| rule(normalized('область')), | ||
| rule( | ||
| caseless('обл'), | ||
| DOT.optional() | ||
| ) | ||
| ).interpretation( | ||
| Region.type.const('область') | ||
| ) | ||
| OBLAST_NAME = dictionary({ | ||
| 'амурский', | ||
| 'архангельский', | ||
| 'астраханский', | ||
| 'белгородский', | ||
| 'брянский', | ||
| 'владимирский', | ||
| 'волгоградский', | ||
| 'вологодский', | ||
| 'воронежский', | ||
| 'горьковский', | ||
| 'ивановский', | ||
| 'иркутский', | ||
| 'калининградский', | ||
| 'калужский', | ||
| 'камчатский', | ||
| 'кемеровский', | ||
| 'кировский', | ||
| 'костромской', | ||
| 'курганский', | ||
| 'курский', | ||
| 'ленинградский', | ||
| 'липецкий', | ||
| 'магаданский', | ||
| 'московский', | ||
| 'мурманский', | ||
| 'нижегородский', | ||
| 'новгородский', | ||
| 'новосибирский', | ||
| 'омский', | ||
| 'оренбургский', | ||
| 'орловский', | ||
| 'пензенский', | ||
| 'пермский', | ||
| 'псковский', | ||
| 'ростовский', | ||
| 'рязанский', | ||
| 'самарский', | ||
| 'саратовский', | ||
| 'сахалинский', | ||
| 'свердловский', | ||
| 'смоленский', | ||
| 'тамбовский', | ||
| 'тверской', | ||
| 'томский', | ||
| 'тульский', | ||
| 'тюменский', | ||
| 'ульяновский', | ||
| 'челябинский', | ||
| 'читинский', | ||
| 'ярославский', | ||
| }).interpretation( | ||
| Region.name | ||
| ) | ||
| OBLAST = rule( | ||
| OBLAST_NAME, | ||
| OBLAST_WORDS | ||
| ).interpretation( | ||
| Region | ||
| ) | ||
| ########## | ||
| # | ||
| # AUTO OKRUG | ||
| # | ||
| ############# | ||
| AUTO_OKRUG_NAME = or_( | ||
| rule( | ||
| dictionary({ | ||
| 'чукотский', | ||
| 'эвенкийский', | ||
| 'корякский', | ||
| 'ненецкий', | ||
| 'таймырский', | ||
| 'агинский', | ||
| 'бурятский', | ||
| }) | ||
| ), | ||
| rule(caseless('коми'), '-', normalized('пермяцкий')), | ||
| rule(caseless('долгано'), '-', normalized('ненецкий')), | ||
| rule(caseless('ямало'), '-', normalized('ненецкий')), | ||
| ).interpretation( | ||
| Region.name | ||
| ) | ||
| AUTO_OKRUG_WORDS = or_( | ||
| rule( | ||
| normalized('автономный'), | ||
| normalized('округ') | ||
| ), | ||
| rule(caseless('ао')) | ||
| ).interpretation( | ||
| Region.type.const('автономный округ') | ||
| ) | ||
| HANTI = rule( | ||
| caseless('ханты'), '-', normalized('мансийский') | ||
| ).interpretation( | ||
| Region.name | ||
| ) | ||
| BURAT = rule( | ||
| caseless('усть'), '-', normalized('ордынский'), | ||
| normalized('бурятский') | ||
| ).interpretation( | ||
| Region.name | ||
| ) | ||
| AUTO_OKRUG = or_( | ||
| rule(AUTO_OKRUG_NAME, AUTO_OKRUG_WORDS), | ||
| or_( | ||
| rule( | ||
| HANTI, | ||
| AUTO_OKRUG_WORDS, | ||
| '-', normalized('югра') | ||
| ), | ||
| rule( | ||
| caseless('хмао'), | ||
| ).interpretation(Region.name), | ||
| rule( | ||
| caseless('хмао'), | ||
| '-', caseless('югра') | ||
| ).interpretation(Region.name), | ||
| ), | ||
| rule( | ||
| BURAT, | ||
| AUTO_OKRUG_WORDS | ||
| ) | ||
| ).interpretation( | ||
| Region | ||
| ) | ||
| ########## | ||
| # | ||
| # RAION | ||
| # | ||
| ########### | ||
| RAION_WORDS = or_( | ||
| rule(caseless('р'), '-', in_caseless({'он', 'н'})), | ||
| rule(normalized('район')) | ||
| ).interpretation( | ||
| Region.type.const('район') | ||
| ) | ||
| RAION_SIMPLE_NAME = and_( | ||
| ADJF, | ||
| TITLE | ||
| ) | ||
| RAION_MODIFIERS = rule( | ||
| in_caseless({ | ||
| 'усть', | ||
| 'северо', | ||
| 'александрово', | ||
| 'гаврилово', | ||
| }), | ||
| DASH.optional(), | ||
| TITLE | ||
| ) | ||
| RAION_COMPLEX_NAME = rule( | ||
| RAION_MODIFIERS, | ||
| RAION_SIMPLE_NAME | ||
| ) | ||
| RAION_NAME = or_( | ||
| rule(RAION_SIMPLE_NAME), | ||
| RAION_COMPLEX_NAME | ||
| ).interpretation( | ||
| Region.name | ||
| ) | ||
| RAION = rule( | ||
| RAION_NAME, | ||
| RAION_WORDS | ||
| ).interpretation( | ||
| Region | ||
| ) | ||
| ########### | ||
| # | ||
| # GOROD | ||
| # | ||
| ########### | ||
| # Top 200 Russian cities, covering ~75% of the population | ||
| COMPLEX = morph_pipeline([ | ||
| 'санкт-петербург', | ||
| 'нижний новгород', | ||
| 'н.новгород', | ||
| 'ростов-на-дону', | ||
| 'набережные челны', | ||
| 'улан-удэ', | ||
| 'нижний тагил', | ||
| 'комсомольск-на-амуре', | ||
| 'йошкар-ола', | ||
| 'старый оскол', | ||
| 'великий новгород', | ||
| 'южно-сахалинск', | ||
| 'петропавловск-камчатский', | ||
| 'каменск-уральский', | ||
| 'орехово-зуево', | ||
| 'сергиев посад', | ||
| 'новый уренгой', | ||
| 'ленинск-кузнецкий', | ||
| 'великие луки', | ||
| 'каменск-шахтинский', | ||
| 'усть-илимск', | ||
| 'усолье-сибирский', | ||
| 'кирово-чепецк', | ||
| ]) | ||
| SIMPLE = dictionary({ | ||
| 'москва', | ||
| 'новосибирск', | ||
| 'екатеринбург', | ||
| 'казань', | ||
| 'самар', | ||
| 'омск', | ||
| 'челябинск', | ||
| 'уфа', | ||
| 'волгоград', | ||
| 'пермь', | ||
| 'красноярск', | ||
| 'воронеж', | ||
| 'саратов', | ||
| 'краснодар', | ||
| 'тольятти', | ||
| 'барнаул', | ||
| 'ижевск', | ||
| 'ульяновск', | ||
| 'владивосток', | ||
| 'ярославль', | ||
| 'иркутск', | ||
| 'тюмень', | ||
| 'махачкала', | ||
| 'хабаровск', | ||
| 'оренбург', | ||
| 'новокузнецк', | ||
| 'кемерово', | ||
| 'рязань', | ||
| 'томск', | ||
| 'астрахань', | ||
| 'пенза', | ||
| 'липецк', | ||
| 'тула', | ||
| 'киров', | ||
| 'чебоксары', | ||
| 'калининград', | ||
| 'брянск', | ||
| 'курск', | ||
| 'иваново', | ||
| 'магнитогорск', | ||
| 'тверь', | ||
| 'ставрополь', | ||
| 'симферополь', | ||
| 'белгород', | ||
| 'архангельск', | ||
| 'владимир', | ||
| 'севастополь', | ||
| 'сочи', | ||
| 'курган', | ||
| 'смоленск', | ||
| 'калуга', | ||
| 'чита', | ||
| 'орёл', | ||
| 'волжский', | ||
| 'череповец', | ||
| 'владикавказ', | ||
| 'мурманск', | ||
| 'сургут', | ||
| 'вологда', | ||
| 'саранск', | ||
| 'тамбов', | ||
| 'стерлитамак', | ||
| 'грозный', | ||
| 'якутск', | ||
| 'кострома', | ||
| 'петрозаводск', | ||
| 'таганрог', | ||
| 'нижневартовск', | ||
| 'братск', | ||
| 'новороссийск', | ||
| 'дзержинск', | ||
| 'шахта', | ||
| 'нальчик', | ||
| 'орск', | ||
| 'сыктывкар', | ||
| 'нижнекамск', | ||
| 'ангарск', | ||
| 'балашиха', | ||
| 'благовещенск', | ||
| 'прокопьевск', | ||
| 'химки', | ||
| 'псков', | ||
| 'бийск', | ||
| 'энгельс', | ||
| 'рыбинск', | ||
| 'балаково', | ||
| 'северодвинск', | ||
| 'армавир', | ||
| 'подольск', | ||
| 'королёв', | ||
| 'сызрань', | ||
| 'норильск', | ||
| 'златоуст', | ||
| 'мытищи', | ||
| 'люберцы', | ||
| 'волгодонск', | ||
| 'новочеркасск', | ||
| 'абакан', | ||
| 'находка', | ||
| 'уссурийск', | ||
| 'березники', | ||
| 'салават', | ||
| 'электросталь', | ||
| 'миасс', | ||
| 'первоуральск', | ||
| 'рубцовск', | ||
| 'альметьевск', | ||
| 'ковровый', | ||
| 'коломна', | ||
| 'керчь', | ||
| 'майкоп', | ||
| 'пятигорск', | ||
| 'одинцово', | ||
| 'копейск', | ||
| 'хасавюрт', | ||
| 'новомосковск', | ||
| 'кисловодск', | ||
| 'серпухов', | ||
| 'новочебоксарск', | ||
| 'нефтеюганск', | ||
| 'димитровград', | ||
| 'нефтекамск', | ||
| 'черкесск', | ||
| 'дербент', | ||
| 'камышин', | ||
| 'невинномысск', | ||
| 'красногорск', | ||
| 'мур', | ||
| 'батайск', | ||
| 'новошахтинск', | ||
| 'ноябрьск', | ||
| 'кызыл', | ||
| 'октябрьский', | ||
| 'ачинск', | ||
| 'северск', | ||
| 'новокуйбышевск', | ||
| 'елец', | ||
| 'евпатория', | ||
| 'арзамас', | ||
| 'обнинск', | ||
| 'каспийск', | ||
| 'элиста', | ||
| 'пушкино', | ||
| 'жуковский', | ||
| 'междуреченск', | ||
| 'сарапул', | ||
| 'ессентуки', | ||
| 'воткинск', | ||
| 'ногинск', | ||
| 'тобольск', | ||
| 'ухта', | ||
| 'серов', | ||
| 'бердск', | ||
| 'мичуринск', | ||
| 'киселёвск', | ||
| 'новотроицк', | ||
| 'зеленодольск', | ||
| 'соликамск', | ||
| 'раменский', | ||
| 'домодедово', | ||
| 'магадан', | ||
| 'глазов', | ||
| 'железногорск', | ||
| 'канск', | ||
| 'назрань', | ||
| 'гатчина', | ||
| 'саров', | ||
| 'новоуральск', | ||
| 'воскресенск', | ||
| 'долгопрудный', | ||
| 'бугульма', | ||
| 'кузнецк', | ||
| 'губкин', | ||
| 'кинешма', | ||
| 'ейск', | ||
| 'реутов', | ||
| 'чайковский', | ||
| 'азов', | ||
| 'бузулук', | ||
| 'озёрск', | ||
| 'балашов', | ||
| 'юрга', | ||
| 'кропоткин', | ||
| 'клин' | ||
| }) | ||
| GOROD_ABBR = in_caseless({ | ||
| 'спб', | ||
| 'мск', | ||
| 'нск' # Новосибирск | ||
| }) | ||
| GOROD_NAME = or_( | ||
| rule(SIMPLE), | ||
| COMPLEX, | ||
| rule(GOROD_ABBR) | ||
| ).interpretation( | ||
| Settlement.name | ||
| ) | ||
| SIMPLE = and_( | ||
| TITLE, | ||
| or_( | ||
| NOUN, | ||
| ADJF # Железнодорожный, Юбилейный | ||
| ) | ||
| ) | ||
| COMPLEX = or_( | ||
| rule( | ||
| SIMPLE, | ||
| DASH.optional(), | ||
| SIMPLE | ||
| ), | ||
| rule( | ||
| TITLE, | ||
| DASH.optional(), | ||
| caseless('на'), | ||
| DASH.optional(), | ||
| TITLE | ||
| ) | ||
| ) | ||
| NAME = or_( | ||
| rule(SIMPLE), | ||
| COMPLEX | ||
| ) | ||
| MAYBE_GOROD_NAME = or_( | ||
| NAME, | ||
| rule(NAME, '-', INT) | ||
| ).interpretation( | ||
| Settlement.name | ||
| ) | ||
| GOROD_WORDS = or_( | ||
| rule(normalized('город')), | ||
| rule( | ||
| caseless('г'), | ||
| DOT.optional() | ||
| ) | ||
| ).interpretation( | ||
| Settlement.type.const('город') | ||
| ) | ||
| GOROD = or_( | ||
| rule(GOROD_WORDS, MAYBE_GOROD_NAME), | ||
| rule( | ||
| GOROD_WORDS.optional(), | ||
| GOROD_NAME | ||
| ) | ||
| ).interpretation( | ||
| Settlement | ||
| ) | ||
| ########## | ||
| # | ||
| # SETTLEMENT NAME | ||
| # | ||
| ########## | ||
| ADJS = gram('ADJS') | ||
| SIMPLE = and_( | ||
| or_( | ||
| NOUN, # Александровка, Заречье, Горки | ||
| ADJS, # Кузнецово | ||
| ADJF, # Никольское, Новая, Марьино | ||
| ), | ||
| TITLE | ||
| ) | ||
| COMPLEX = rule( | ||
| SIMPLE, | ||
| DASH.optional(), | ||
| SIMPLE | ||
| ) | ||
| NAME = or_( | ||
| rule(SIMPLE), | ||
| COMPLEX | ||
| ) | ||
| SETTLEMENT_NAME = or_( | ||
| NAME, | ||
| rule(NAME, '-', INT), | ||
| rule(NAME, ANUM) | ||
| ) | ||
| ########### | ||
| # | ||
| # SELO | ||
| # | ||
| ############# | ||
| SELO_WORDS = or_( | ||
| rule( | ||
| caseless('с'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('село')) | ||
| ).interpretation( | ||
| Settlement.type.const('село') | ||
| ) | ||
| SELO_NAME = SETTLEMENT_NAME.interpretation( | ||
| Settlement.name | ||
| ) | ||
| SELO = rule( | ||
| SELO_WORDS, | ||
| SELO_NAME | ||
| ).interpretation( | ||
| Settlement | ||
| ) | ||
| ########### | ||
| # | ||
| # DEREVNYA | ||
| # | ||
| ############# | ||
| DEREVNYA_WORDS = or_( | ||
| rule( | ||
| caseless('д'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('деревня')) | ||
| ).interpretation( | ||
| Settlement.type.const('деревня') | ||
| ) | ||
| DEREVNYA_NAME = SETTLEMENT_NAME.interpretation( | ||
| Settlement.name | ||
| ) | ||
| DEREVNYA = rule( | ||
| DEREVNYA_WORDS, | ||
| DEREVNYA_NAME | ||
| ).interpretation( | ||
| Settlement | ||
| ) | ||
| ########### | ||
| # | ||
| # POSELOK | ||
| # | ||
| ############# | ||
| POSELOK_WORDS = or_( | ||
| rule( | ||
| in_caseless({'п', 'пос'}), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('посёлок')), | ||
| rule( | ||
| caseless('р'), | ||
| DOT.optional(), | ||
| caseless('п'), | ||
| DOT.optional() | ||
| ), | ||
| rule( | ||
| normalized('рабочий'), | ||
| normalized('посёлок') | ||
| ), | ||
| rule( | ||
| caseless('пгт'), | ||
| DOT.optional() | ||
| ), | ||
| rule( | ||
| caseless('п'), DOT, caseless('г'), DOT, caseless('т'), | ||
| DOT.optional() | ||
| ), | ||
| rule( | ||
| normalized('посёлок'), | ||
| normalized('городского'), | ||
| normalized('типа'), | ||
| ), | ||
| ).interpretation( | ||
| Settlement.type.const('посёлок') | ||
| ) | ||
| POSELOK_NAME = SETTLEMENT_NAME.interpretation( | ||
| Settlement.name | ||
| ) | ||
| POSELOK = rule( | ||
| POSELOK_WORDS, | ||
| POSELOK_NAME | ||
| ).interpretation( | ||
| Settlement | ||
| ) | ||
| ############## | ||
| # | ||
| # ADDR PERSON | ||
| # | ||
| ############ | ||
| ABBR = and_( | ||
| length_eq(1), | ||
| is_title() | ||
| ) | ||
| PART = and_( | ||
| TITLE, | ||
| or_( | ||
| gram('Name'), | ||
| gram('Surn') | ||
| ) | ||
| ) | ||
| MAYBE_FIO = or_( | ||
| rule(TITLE, PART), | ||
| rule(PART, TITLE), | ||
| rule(ABBR, '.', TITLE), | ||
| rule(ABBR, '.', ABBR, '.', TITLE), | ||
| rule(TITLE, ABBR, '.', ABBR, '.') | ||
| ) | ||
| POSITION_WORDS_ = or_( | ||
| rule( | ||
| dictionary({ | ||
| 'мичман', | ||
| 'геолог', | ||
| 'подводник', | ||
| 'краевед', | ||
| 'снайпер', | ||
| 'штурман', | ||
| 'бригадир', | ||
| 'учитель', | ||
| 'политрук', | ||
| 'военком', | ||
| 'ветеран', | ||
| 'историк', | ||
| 'пулемётчик', | ||
| 'авиаконструктор', | ||
| 'адмирал', | ||
| 'академик', | ||
| 'актер', | ||
| 'актриса', | ||
| 'архитектор', | ||
| 'атаман', | ||
| 'врач', | ||
| 'воевода', | ||
| 'генерал', | ||
| 'губернатор', | ||
| 'хирург', | ||
| 'декабрист', | ||
| 'разведчик', | ||
| 'граф', | ||
| 'десантник', | ||
| 'конструктор', | ||
| 'скульптор', | ||
| 'писатель', | ||
| 'поэт', | ||
| 'капитан', | ||
| 'князь', | ||
| 'комиссар', | ||
| 'композитор', | ||
| 'космонавт', | ||
| 'купец', | ||
| 'лейтенант', | ||
| 'лётчик', | ||
| 'майор', | ||
| 'маршал', | ||
| 'матрос', | ||
| 'подполковник', | ||
| 'полковник', | ||
| 'профессор', | ||
| 'сержант', | ||
| 'старшина', | ||
| 'танкист', | ||
| 'художник', | ||
| 'герой', | ||
| 'княгиня', | ||
| 'строитель', | ||
| 'дружинник', | ||
| 'диктор', | ||
| 'прапорщик', | ||
| 'артиллерист', | ||
| 'графиня', | ||
| 'большевик', | ||
| 'патриарх', | ||
| 'сварщик', | ||
| 'офицер', | ||
| 'рыбак', | ||
| 'брат', | ||
| }) | ||
| ), | ||
| rule(normalized('генерал'), normalized('армия')), | ||
| rule(normalized('герой'), normalized('россия')), | ||
| rule( | ||
| normalized('герой'), | ||
| normalized('российский'), normalized('федерация')), | ||
| rule( | ||
| normalized('герой'), | ||
| normalized('советский'), normalized('союз') | ||
| ), | ||
| ) | ||
| ABBR_POSITION_WORDS = rule( | ||
| in_caseless({ | ||
| 'адм', | ||
| 'ак', | ||
| 'акад', | ||
| }), | ||
| DOT.optional() | ||
| ) | ||
| POSITION_WORDS = or_( | ||
| POSITION_WORDS_, | ||
| ABBR_POSITION_WORDS | ||
| ) | ||
| MAYBE_PERSON = or_( | ||
| MAYBE_FIO, | ||
| rule(POSITION_WORDS, MAYBE_FIO), | ||
| rule(POSITION_WORDS, TITLE) | ||
| ) | ||
| ########### | ||
| # | ||
| # IMENI | ||
| # | ||
| ########## | ||
| IMENI_WORDS = or_( | ||
| rule( | ||
| caseless('им'), | ||
| DOT.optional() | ||
| ), | ||
| rule(caseless('имени')) | ||
| ) | ||
| IMENI = or_( | ||
| rule( | ||
| IMENI_WORDS.optional(), | ||
| MAYBE_PERSON | ||
| ), | ||
| rule( | ||
| IMENI_WORDS, | ||
| TITLE | ||
| ) | ||
| ) | ||
| ########## | ||
| # | ||
| # LET | ||
| # | ||
| ########## | ||
| LET_WORDS = or_( | ||
| rule(caseless('лет')), | ||
| rule( | ||
| DASH.optional(), | ||
| caseless('летия') | ||
| ) | ||
| ) | ||
| LET_NAME = in_caseless({ | ||
| 'влксм', | ||
| 'ссср', | ||
| 'алтая', | ||
| 'башкирии', | ||
| 'бурятии', | ||
| 'дагестана', | ||
| 'калмыкии', | ||
| 'колхоза', | ||
| 'комсомола', | ||
| 'космонавтики', | ||
| 'москвы', | ||
| 'октября', | ||
| 'пионерии', | ||
| 'победы', | ||
| 'приморья', | ||
| 'района', | ||
| 'совхоза', | ||
| 'совхозу', | ||
| 'татарстана', | ||
| 'тувы', | ||
| 'удмуртии', | ||
| 'улуса', | ||
| 'хакасии', | ||
| 'целины', | ||
| 'чувашии', | ||
| 'якутии', | ||
| }) | ||
| LET = rule( | ||
| INT, | ||
| LET_WORDS, | ||
| LET_NAME | ||
| ) | ||
| ########## | ||
| # | ||
| # ADDR DATE | ||
| # | ||
| ############# | ||
| MONTH_WORDS = dictionary({ | ||
| 'январь', | ||
| 'февраль', | ||
| 'март', | ||
| 'апрель', | ||
| 'май', | ||
| 'июнь', | ||
| 'июль', | ||
| 'август', | ||
| 'сентябрь', | ||
| 'октябрь', | ||
| 'ноябрь', | ||
| 'декабрь', | ||
| }) | ||
| DAY = and_( | ||
| INT, | ||
| gte(1), | ||
| lte(31) | ||
| ) | ||
| YEAR = and_( | ||
| INT, | ||
| gte(1), | ||
| lte(2100) | ||
| ) | ||
| YEAR_WORDS = normalized('год') | ||
| DATE = or_( | ||
| rule(DAY, MONTH_WORDS), | ||
| rule(YEAR, YEAR_WORDS) | ||
| ) | ||
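The `DATE` rule accepts either "day + month word" with the day in 1..31, or "year + год" with the year capped at 2100 — these are the street-name dates like "9 Мая" or "1905 года". The same numeric bounds in plain Python (a sketch mirroring the `gte`/`lte` predicates above):

```python
def valid_day(n):
    return 1 <= n <= 31    # gte(1), lte(31)

def valid_year(n):
    return 1 <= n <= 2100  # gte(1), lte(2100)

print(valid_day(9), valid_day(40), valid_year(1905), valid_year(2200))
# True False True False
```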
| ######### | ||
| # | ||
| # MODIFIER | ||
| # | ||
| ############ | ||
| MODIFIER_WORDS_ = rule( | ||
| dictionary({ | ||
| 'большой', | ||
| 'малый', | ||
| 'средний', | ||
| 'верхний', | ||
| 'центральный', | ||
| 'нижний', | ||
| 'северный', | ||
| 'дальний', | ||
| 'первый', | ||
| 'второй', | ||
| 'старый', | ||
| 'новый', | ||
| 'красный', | ||
| 'лесной', | ||
| 'тихий', | ||
| }), | ||
| DASH.optional() | ||
| ) | ||
| ABBR_MODIFIER_WORDS = rule( | ||
| in_caseless({ | ||
| 'б', 'м', 'н' | ||
| }), | ||
| DOT.optional() | ||
| ) | ||
| SHORT_MODIFIER_WORDS = rule( | ||
| in_caseless({ | ||
| 'больше', | ||
| 'мало', | ||
| 'средне', | ||
| 'верх', | ||
| 'верхне', | ||
| 'центрально', | ||
| 'нижне', | ||
| 'северо', | ||
| 'дальне', | ||
| 'восточно', | ||
| 'западно', | ||
| 'перво', | ||
| 'второ', | ||
| 'старо', | ||
| 'ново', | ||
| 'красно', | ||
| 'тихо', | ||
| 'горно', | ||
| }), | ||
| DASH.optional() | ||
| ) | ||
| MODIFIER_WORDS = or_( | ||
| MODIFIER_WORDS_, | ||
| ABBR_MODIFIER_WORDS, | ||
| SHORT_MODIFIER_WORDS, | ||
| ) | ||
| ########## | ||
| # | ||
| # ADDR NAME | ||
| # | ||
| ########## | ||
| ROD = gram('gent') | ||
| SIMPLE = and_( | ||
| or_( | ||
| ADJF, # Школьная | ||
| and_(NOUN, ROD), # Ленина, Победы | ||
| ), | ||
| TITLE | ||
| ) | ||
| COMPLEX = or_( | ||
| rule( | ||
| and_(ADJF, TITLE), | ||
| NOUN | ||
| ), | ||
| rule( | ||
| TITLE, | ||
| DASH.optional(), | ||
| TITLE | ||
| ), | ||
| ) | ||
| # TODO | ||
| EXCEPTION = dictionary({ | ||
| 'арбат', | ||
| 'варварка' | ||
| }) | ||
| MAYBE_NAME = or_( | ||
| rule(SIMPLE), | ||
| COMPLEX, | ||
| rule(EXCEPTION) | ||
| ) | ||
| NAME = or_( | ||
| MAYBE_NAME, | ||
| LET, | ||
| DATE, | ||
| IMENI | ||
| ) | ||
| NAME = rule( | ||
| MODIFIER_WORDS.optional(), | ||
| NAME | ||
| ) | ||
| ADDR_CRF = tag('I').repeatable() | ||
| NAME = or_( | ||
| NAME, | ||
| ANUM, | ||
| rule(NAME, ANUM), | ||
| rule(ANUM, NAME), | ||
| rule(INT, DASH.optional(), NAME), | ||
| rule(NAME, DASH, INT), | ||
| ADDR_CRF | ||
| ) | ||
| ADDR_NAME = NAME | ||
| ######## | ||
| # | ||
| # STREET | ||
| # | ||
| ######### | ||
| STREET_WORDS = or_( | ||
| rule(normalized('улица')), | ||
| rule( | ||
| caseless('ул'), | ||
| DOT.optional() | ||
| ) | ||
| ).interpretation( | ||
| Street.type.const('улица') | ||
| ) | ||
| STREET_NAME = ADDR_NAME.interpretation( | ||
| Street.name | ||
| ) | ||
| STREET = or_( | ||
| rule(STREET_WORDS, STREET_NAME), | ||
| rule(STREET_NAME, STREET_WORDS) | ||
| ).interpretation( | ||
| Street | ||
| ) | ||
| ########## | ||
| # | ||
| # PROSPEKT | ||
| # | ||
| ########## | ||
| PROSPEKT_WORDS = or_( | ||
| rule( | ||
| in_caseless({'пр', 'просп'}), | ||
| DOT.optional() | ||
| ), | ||
| rule( | ||
| caseless('пр'), | ||
| '-', | ||
| in_caseless({'кт', 'т'}), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('проспект')) | ||
| ).interpretation( | ||
| Street.type.const('проспект') | ||
| ) | ||
| PROSPEKT_NAME = ADDR_NAME.interpretation( | ||
| Street.name | ||
| ) | ||
| PROSPEKT = or_( | ||
| rule(PROSPEKT_WORDS, PROSPEKT_NAME), | ||
| rule(PROSPEKT_NAME, PROSPEKT_WORDS) | ||
| ).interpretation( | ||
| Street | ||
| ) | ||
| ############ | ||
| # | ||
| # PROEZD | ||
| # | ||
| ############# | ||
| PROEZD_WORDS = or_( | ||
| rule(caseless('пр'), DOT.optional()), | ||
| rule( | ||
| caseless('пр'), | ||
| '-', | ||
| in_caseless({'зд', 'д'}), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('проезд')) | ||
| ).interpretation( | ||
| Street.type.const('проезд') | ||
| ) | ||
| PROEZD_NAME = ADDR_NAME.interpretation( | ||
| Street.name | ||
| ) | ||
| PROEZD = or_( | ||
| rule(PROEZD_WORDS, PROEZD_NAME), | ||
| rule(PROEZD_NAME, PROEZD_WORDS) | ||
| ).interpretation( | ||
| Street | ||
| ) | ||
| ########### | ||
| # | ||
| # PEREULOK | ||
| # | ||
| ############## | ||
| PEREULOK_WORDS = or_( | ||
| rule( | ||
| caseless('п'), | ||
| DOT | ||
| ), | ||
| rule( | ||
| caseless('пер'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('переулок')) | ||
| ).interpretation( | ||
| Street.type.const('переулок') | ||
| ) | ||
| PEREULOK_NAME = ADDR_NAME.interpretation( | ||
| Street.name | ||
| ) | ||
| PEREULOK = or_( | ||
| rule(PEREULOK_WORDS, PEREULOK_NAME), | ||
| rule(PEREULOK_NAME, PEREULOK_WORDS) | ||
| ).interpretation( | ||
| Street | ||
| ) | ||
| ######## | ||
| # | ||
| # PLOSHAD | ||
| # | ||
| ########## | ||
| PLOSHAD_WORDS = or_( | ||
| rule( | ||
| caseless('пл'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('площадь')) | ||
| ).interpretation( | ||
| Street.type.const('площадь') | ||
| ) | ||
| PLOSHAD_NAME = ADDR_NAME.interpretation( | ||
| Street.name | ||
| ) | ||
| PLOSHAD = or_( | ||
| rule(PLOSHAD_WORDS, PLOSHAD_NAME), | ||
| rule(PLOSHAD_NAME, PLOSHAD_WORDS) | ||
| ).interpretation( | ||
| Street | ||
| ) | ||
| ############ | ||
| # | ||
| # SHOSSE | ||
| # | ||
| ########### | ||
| # TODO | ||
| # Покровское 17 км. | ||
| # Сергеляхское 13 км | ||
| # Сергеляхское 14 км. | ||
| SHOSSE_WORDS = or_( | ||
| rule( | ||
| caseless('ш'), | ||
| DOT | ||
| ), | ||
| rule(normalized('шоссе')) | ||
| ).interpretation( | ||
| Street.type.const('шоссе') | ||
| ) | ||
| SHOSSE_NAME = ADDR_NAME.interpretation( | ||
| Street.name | ||
| ) | ||
| SHOSSE = or_( | ||
| rule(SHOSSE_WORDS, SHOSSE_NAME), | ||
| rule(SHOSSE_NAME, SHOSSE_WORDS) | ||
| ).interpretation( | ||
| Street | ||
| ) | ||
| ######## | ||
| # | ||
| # NABEREG | ||
| # | ||
| ########## | ||
| NABEREG_WORDS = or_( | ||
| rule( | ||
| caseless('наб'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('набережная')) | ||
| ).interpretation( | ||
| Street.type.const('набережная') | ||
| ) | ||
| NABEREG_NAME = ADDR_NAME.interpretation( | ||
| Street.name | ||
| ) | ||
| NABEREG = or_( | ||
| rule(NABEREG_WORDS, NABEREG_NAME), | ||
| rule(NABEREG_NAME, NABEREG_WORDS) | ||
| ).interpretation( | ||
| Street | ||
| ) | ||
| ######## | ||
| # | ||
| # BULVAR | ||
| # | ||
| ########## | ||
| BULVAR_WORDS = or_( | ||
| rule( | ||
| caseless('б'), | ||
| '-', | ||
| caseless('р') | ||
| ), | ||
| rule( | ||
| caseless('б'), | ||
| DOT | ||
| ), | ||
| rule( | ||
| caseless('бул'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('бульвар')) | ||
| ).interpretation( | ||
| Street.type.const('бульвар') | ||
| ) | ||
| BULVAR_NAME = ADDR_NAME.interpretation( | ||
| Street.name | ||
| ) | ||
| BULVAR = or_( | ||
| rule(BULVAR_WORDS, BULVAR_NAME), | ||
| rule(BULVAR_NAME, BULVAR_WORDS) | ||
| ).interpretation( | ||
| Street | ||
| ) | ||
| ############## | ||
| # | ||
| # ADDR VALUE | ||
| # | ||
| ############# | ||
| LETTER = in_caseless(set('абвгдежзиклмнопрстуфхшщэюя')) | ||
| QUOTE = in_(QUOTES) | ||
| LETTER = or_( | ||
| rule(LETTER), | ||
| rule(QUOTE, LETTER, QUOTE) | ||
| ) | ||
| VALUE = rule( | ||
| INT, | ||
| LETTER.optional() | ||
| ) | ||
| SEP = in_(r'/\-') | ||
| VALUE = or_( | ||
| rule(VALUE), | ||
| rule(VALUE, SEP, VALUE), | ||
| rule(VALUE, SEP, LETTER) | ||
| ) | ||
| ADDR_VALUE = rule( | ||
| eq('№').optional(), | ||
| VALUE | ||
| ) | ||
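`ADDR_VALUE` covers the common Russian house-number shapes: an optional `№`, an integer with an optional (possibly quoted) letter, then optionally a `/`, `\` or `-` and a second value or bare letter. A regex approximation for illustration (an assumption, not the grammar itself):

```python
import re

LETTER = r'[абвгдежзиклмнопрстуфхшщэюя]'
VALUE = r'\d+\s*(?:"?%s"?)?' % LETTER  # e.g. 33, 5а, 7"в"
ADDR_VALUE_RE = re.compile(
    r'^№?\s*%s(?:\s*[/\\-]\s*(?:%s|%s))?$' % (VALUE, VALUE, LETTER),
    re.IGNORECASE
)

for text in ['33', '5а', '10/2', '4-б', '№ 7в', 'дом']:
    print(text, bool(ADDR_VALUE_RE.match(text)))
# 33 True, 5а True, 10/2 True, 4-б True, № 7в True, дом False
```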
| ############ | ||
| # | ||
| # DOM | ||
| # | ||
| ############# | ||
| DOM_WORDS = or_( | ||
| rule(normalized('дом')), | ||
| rule( | ||
| caseless('д'), | ||
| DOT | ||
| ) | ||
| ).interpretation( | ||
| Building.type.const('дом') | ||
| ) | ||
| DOM_VALUE = ADDR_VALUE.interpretation( | ||
| Building.number | ||
| ) | ||
| DOM = rule( | ||
| DOM_WORDS, | ||
| DOM_VALUE | ||
| ).interpretation( | ||
| Building | ||
| ) | ||
| ########### | ||
| # | ||
| # KORPUS | ||
| # | ||
| ########## | ||
| KORPUS_WORDS = or_( | ||
| rule( | ||
| in_caseless({'корп', 'кор'}), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('корпус')) | ||
| ).interpretation( | ||
| Building.type.const('корпус') | ||
| ) | ||
| KORPUS_VALUE = ADDR_VALUE.interpretation( | ||
| Building.number | ||
| ) | ||
| KORPUS = or_( | ||
| rule( | ||
| KORPUS_WORDS, | ||
| KORPUS_VALUE | ||
| ), | ||
| rule( | ||
| KORPUS_VALUE, | ||
| KORPUS_WORDS | ||
| ) | ||
| ).interpretation( | ||
| Building | ||
| ) | ||
| ########### | ||
| # | ||
| # STROENIE | ||
| # | ||
| ########## | ||
| STROENIE_WORDS = or_( | ||
| rule( | ||
| caseless('стр'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('строение')) | ||
| ).interpretation( | ||
| Building.type.const('строение') | ||
| ) | ||
| STROENIE_VALUE = ADDR_VALUE.interpretation( | ||
| Building.number | ||
| ) | ||
| STROENIE = rule( | ||
| STROENIE_WORDS, | ||
| STROENIE_VALUE | ||
| ).interpretation( | ||
| Building | ||
| ) | ||
| ########### | ||
| # | ||
| # OFIS | ||
| # | ||
| ############# | ||
| OFIS_WORDS = or_( | ||
| rule( | ||
| caseless('оф'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('офис')) | ||
| ).interpretation( | ||
| Room.type.const('офис') | ||
| ) | ||
| OFIS_VALUE = ADDR_VALUE.interpretation( | ||
| Room.number | ||
| ) | ||
| OFIS = rule( | ||
| OFIS_WORDS, | ||
| OFIS_VALUE | ||
| ).interpretation( | ||
| Room | ||
| ) | ||
| ########### | ||
| # | ||
| # KVARTIRA | ||
| # | ||
| ############# | ||
| KVARTIRA_WORDS = or_( | ||
| rule( | ||
| caseless('кв'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('квартира')) | ||
| ).interpretation( | ||
| Room.type.const('квартира') | ||
| ) | ||
| KVARTIRA_VALUE = ADDR_VALUE.interpretation( | ||
| Room.number | ||
| ) | ||
| KVARTIRA = rule( | ||
| KVARTIRA_WORDS, | ||
| KVARTIRA_VALUE | ||
| ).interpretation( | ||
| Room | ||
| ) | ||
| ########### | ||
| # | ||
| # INDEX | ||
| # | ||
| ############# | ||
| INDEX = and_( | ||
| INT, | ||
| gte(100000), | ||
| lte(999999) | ||
| ).interpretation( | ||
| Index.value | ||
| ).interpretation( | ||
| Index | ||
| ) | ||
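The INDEX rule keeps only 6-digit integers in the Russian postal-code range. The `and_(INT, gte(100000), lte(999999))` combination boils down to a simple predicate:

```python
# Plain-Python equivalent of the INDEX predicate above: an integer
# token between 100000 and 999999 (a 6-digit Russian postal code).
def is_index(token):
    return token.isdigit() and 100000 <= int(token) <= 999999

assert is_index('692909')
assert not is_index('2019')
assert not is_index('1000000')
```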
| ############# | ||
| # | ||
| # ADDR PART | ||
| # | ||
| ############ | ||
| ADDR_PART = or_( | ||
| INDEX, | ||
| COUNTRY, | ||
| FED_OKRUG, | ||
| RESPUBLIKA, | ||
| KRAI, | ||
| OBLAST, | ||
| AUTO_OKRUG, | ||
| GOROD, | ||
| DEREVNYA, | ||
| SELO, | ||
| POSELOK, | ||
| STREET, | ||
| PROSPEKT, | ||
| PROEZD, | ||
| PEREULOK, | ||
| PLOSHAD, | ||
| SHOSSE, | ||
| NABEREG, | ||
| BULVAR, | ||
| DOM, | ||
| KORPUS, | ||
| STROENIE, | ||
| OFIS, | ||
| KVARTIRA | ||
| ).interpretation( | ||
| AddrPart.value | ||
| ).interpretation( | ||
| AddrPart | ||
| ) |
| def normal_word(word): | ||
| word = word.lower() | ||
| return word.replace('ё', 'е') | ||
| NORMAL_POSES = { | ||
| 'PROPN': 'NOUN', | ||
| 'AUX': 'VERB', # был | ||
| 'SCONJ': 'CCONJ' | ||
| } | ||
| def normal_pos(pos): | ||
| return NORMAL_POSES.get(pos, pos) | ||
| def pos_sim(a, b): | ||
| return normal_pos(a) == normal_pos(b) | ||
| # ignore Animacy, Voice, ... | ||
| FEATS = { | ||
| 'Case', 'Gender', 'Number', | ||
| 'Aspect', 'NumForm', 'Person', 'Tense', 'Variant' | ||
| } | ||
| def feats_sim(a, b): | ||
| return sum( | ||
| a[_] == b[_] | ||
| for _ in a.keys() & b.keys() | ||
| if _ in FEATS | ||
| ) | ||
| def grams_sim(a_pos, a_feats, b_pos, b_feats): | ||
| return pos_sim(a_pos, b_pos) + feats_sim(a_feats, b_feats) | ||
| def best_form(forms, pos, feats): | ||
| best_sim, best = 0, None | ||
| for form in forms: | ||
| if form.pos == pos and form.feats == feats: | ||
| return form | ||
| sim = grams_sim(form.pos, form.feats, pos, feats) | ||
| if sim > best_sim: | ||
| best = form | ||
| best_sim = sim | ||
| return best | ||
| def lemmatize(vocab, word, pos, feats): | ||
| word = normal_word(word) | ||
| forms = vocab(word) | ||
| form = best_form(forms, pos, feats) | ||
| if form: | ||
| return normal_word(form.normal) | ||
| return word |
| from slovnet import Morph as SlovnetMorph | ||
| from natasha.data import NEWS_MORPH | ||
| from natasha.record import Record | ||
| ########### | ||
| # | ||
| # MARKUP | ||
| # | ||
| ####### | ||
| class MorphToken(Record): | ||
| __attributes__ = ['text', 'pos', 'feats'] | ||
| class MorphMarkup(Record): | ||
| __attributes__ = ['tokens'] | ||
| def print(self): | ||
| print_markup(self) | ||
| def adapt_tokens(tokens): | ||
| for token in tokens: | ||
| yield MorphToken(token.text, token.pos, token.feats) | ||
| def adapt_markup(markup): | ||
| return MorphMarkup( | ||
| list(adapt_tokens(markup.tokens)) | ||
| ) | ||
| def format_tag(pos, feats): | ||
| if not feats: | ||
| return pos | ||
| feats = '|'.join( | ||
| '%s=%s' % (_, feats[_]) | ||
| for _ in sorted(feats) | ||
| ) | ||
| return '%s|%s' % (pos, feats) | ||
| def format_markup(markup, size=20): | ||
| for token in markup.tokens: | ||
| word = token.text.rjust(size) | ||
| tag = format_tag(token.pos, token.feats) | ||
| yield '%s %s' % (word, tag) | ||
| def print_markup(markup): | ||
| for line in format_markup(markup): | ||
| print(line) | ||
| ########## | ||
| # | ||
| # TAGGER | ||
| # | ||
| ####### | ||
| class MorphTagger(SlovnetMorph): | ||
| def __init__(self, emb, path): | ||
| infer, *args = SlovnetMorph.load(path) | ||
| SlovnetMorph.__init__(self, infer, *args) | ||
| self.navec(emb) | ||
| def map(self, items): | ||
| markups = SlovnetMorph.map(self, items) | ||
| for markup in markups: | ||
| yield adapt_markup(markup) | ||
| class NewsMorphTagger(MorphTagger): | ||
| def __init__(self, emb, path=NEWS_MORPH): | ||
| MorphTagger.__init__(self, emb, path) |
| from functools import lru_cache | ||
| from pymorphy2.analyzer import ( | ||
| Parse as PymorphyParse, | ||
| MorphAnalyzer as PymorphyAnalyzer | ||
| ) | ||
| # https://github.com/kmike/russian-tagsets/blob/master/russian_tagsets/ud.py | ||
| OC_UD_POS = { | ||
| 'ADJF': 'ADJ', | ||
| 'ADJS': 'ADJ', | ||
| 'ADVB': 'ADV', | ||
| 'COMP': 'ADV', | ||
| 'VERB': 'VERB', | ||
| 'GRND': 'VERB', | ||
| 'INFN': 'VERB', | ||
| 'PRTF': 'VERB', | ||
| 'PRTS': 'VERB', | ||
| 'NOUN': 'NOUN', | ||
| 'NPRO': 'PRON', | ||
| 'NUMR': 'NUM', | ||
| 'NUMB': 'NUM', | ||
| 'Apro': 'DET', | ||
| 'CONJ': 'CCONJ', | ||
| 'INTJ': 'INTJ', | ||
| 'PART': 'PRCL', | ||
| 'PNCT': 'PUNCT', | ||
| 'PRCL': 'PART', | ||
| 'PREP': 'ADP', | ||
| } | ||
| # ordering is important, dups in OC_UD_INDEX, UD_OC_FEATS | ||
| OC_UD_FEATS = [ | ||
| ['Animacy', 'anim', 'Anim'], | ||
| ['Animacy', 'inan', 'Inan'], | ||
| ['Aspect', 'impf', 'Imp'], | ||
| ['Aspect', 'perf', 'Perf'], | ||
| ['Case', 'ablt', 'Ins'], | ||
| ['Case', 'accs', 'Acc'], | ||
| ['Case', 'datv', 'Dat'], | ||
| ['Case', 'gent', 'Gen'], | ||
| ['Case', 'gen1', 'Gen'], | ||
| ['Case', 'gen2', 'Gen'], | ||
| ['Case', 'loct', 'Loc'], | ||
| ['Case', 'loc2', 'Loc'], | ||
| ['Case', 'nomn', 'Nom'], | ||
| ['Case', 'voct', 'Nom'], | ||
| ['Degree', 'COMP', 'Cmp'], | ||
| ['Degree', 'Supr', 'Sup'], | ||
| ['Gender', 'femn', 'Fem'], | ||
| ['Gender', 'masc', 'Masc'], | ||
| ['Gender', 'neut', 'Neut'], | ||
| ['Mood', 'impr', 'Imp'], | ||
| ['Mood', 'indc', 'Ind'], | ||
| ['Number', 'plur', 'Plur'], | ||
| ['Number', 'sing', 'Sing'], | ||
| ['NumForm', 'NUMB', 'Digit'], | ||
| ['Person', '1per', '1'], | ||
| ['Person', '2per', '2'], | ||
| ['Person', '3per', '3'], | ||
| ['Person', 'excl', '2'], | ||
| ['Person', 'incl', '1'], | ||
| ['Tense', 'futr', 'Fut'], | ||
| ['Tense', 'past', 'Past'], | ||
| ['Tense', 'pres', 'Pres'], | ||
| ['Variant', 'ADJS', 'Brev'], | ||
| ['Variant', 'PRTS', 'Brev'], | ||
| ['VerbForm', 'GRND', 'Conv'], | ||
| ['VerbForm', 'INFN', 'Inf'], | ||
| ['VerbForm', 'PRTF', 'Part'], | ||
| ['VerbForm', 'PRTS', 'Part'], | ||
| ['VerbForm', 'VERB', 'Fin'], | ||
| ['Voice', 'actv', 'Act'], | ||
| ['Voice', 'pssv', 'Pass'], | ||
| ['Abbr', 'Abbr', 'Yes'], | ||
| ] | ||
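A few rows of the table are enough to see the conversion direction: each OpenCorpora grammeme indexes a (UD key, UD value) pair, and a feats dict is accumulated per token. A stand-in with hand-picked rows (not the full table):

```python
# Hand-picked rows of the OC→UD table above: OpenCorpora grammemes
# are looked up one by one and accumulated into a UD feats dict.
OC_UD = {
    'nomn': ('Case', 'Nom'),
    'gent': ('Case', 'Gen'),
    'sing': ('Number', 'Sing'),
    'plur': ('Number', 'Plur'),
    'femn': ('Gender', 'Fem'),
}

def ud_feats(grammemes):
    feats = {}
    for gram in grammemes:
        if gram in OC_UD:
            key, value = OC_UD[gram]
            feats[key] = value
    return feats

assert ud_feats(['femn', 'sing', 'gent']) == {'Case': 'Gen', 'Number': 'Sing', 'Gender': 'Fem'}
```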
| OC_UD_INDEX = {} | ||
| UD_OC_FEATS = {} | ||
| for key, oc, ud in OC_UD_FEATS: | ||
| # PRTS is duplicated under Variant and VerbForm; the last assignment wins | ||
| OC_UD_INDEX[oc] = (key, ud) | ||
| # many duplicates, use first occurrence: 1 -> 1per, ADJ -> ADJF | ||
| if ud not in UD_OC_FEATS: | ||
| UD_OC_FEATS[ud] = oc | ||
| def ud_pos(tag): | ||
| # super rare pos are missing: PRED, ROMN | ||
| return OC_UD_POS.get(tag._POS, 'X') | ||
| def ud_feats(tag): | ||
| # a number of OC grams are missing: | ||
| # ANim Adjx Af-p Anph Anum Apro Arch Cmp2 Coll Coun Dist Dmns Fimp | ||
| # Fixd GNdr Geox Impe Impx Infr Inmx Litr Ms-f Name Orgn Patr Pltm | ||
| # Poss Prdx Prnt Qual Ques Sgtm Slng Subx Surn V-be V-bi V-ej V-ey | ||
| # V-oy V-sh Vpre intg intr real tran | ||
| for gram in tag._grammemes_tuple: | ||
| item = OC_UD_INDEX.get(gram) | ||
| if item: | ||
| yield item | ||
| def oc_grams(grams): | ||
| for gram in grams: | ||
| yield UD_OC_FEATS[gram] | ||
| class MorphForm(PymorphyParse): | ||
| def __new__(cls, *args): # PymorphyParse is namedtuple | ||
| self = PymorphyParse.__new__(cls, *args) | ||
| self.normal = self.normal_form | ||
| self.pos = ud_pos(self.tag) | ||
| self.feats = dict(ud_feats(self.tag)) | ||
| return self | ||
| def inflect(self, grams): | ||
| grams = set(oc_grams(grams)) | ||
| return PymorphyParse.inflect(self, grams) | ||
| def __repr__(self): | ||
| return ( | ||
| '{name}(normal={self.normal!r}, ' | ||
| 'pos={self.pos!r}, feats={self.feats!r})' | ||
| ).format( | ||
| name=self.__class__.__name__, | ||
| self=self | ||
| ) | ||
| CACHE_SIZE = 10000 | ||
| class MorphVocab(PymorphyAnalyzer): | ||
| def __init__(self): | ||
| PymorphyAnalyzer.__init__(self, result_type=MorphForm) | ||
| parse = lru_cache(CACHE_SIZE)(PymorphyAnalyzer.parse) | ||
| __call__ = parse | ||
| def __repr__(self): | ||
| return '%s()' % self.__class__.__name__ | ||
| def lemmatize(self, word, pos, feats): | ||
| from .lemma import lemmatize | ||
| return lemmatize(self, word, pos, feats) |
| from slovnet import NER as SlovnetNER | ||
| from ipymarkup import show_span_ascii_markup | ||
| from .data import NEWS_NER | ||
| from .record import Record | ||
| from .span import Span | ||
| ##### | ||
| # | ||
| # MARKUP | ||
| # | ||
| ###### | ||
| class NERMarkup(Record): | ||
| __attributes__ = ['text', 'spans'] | ||
| def print(self): | ||
| show_span_ascii_markup(self.text, self.spans) | ||
| def adapt_spans(spans): | ||
| for span in spans: | ||
| yield Span(span.start, span.stop, span.type) | ||
| def adapt_markup(markup): | ||
| return NERMarkup( | ||
| markup.text, | ||
| list(adapt_spans(markup.spans)) | ||
| ) | ||
| ###### | ||
| # | ||
| # TAGGER | ||
| # | ||
| ######### | ||
| class NERTagger(SlovnetNER): | ||
| def __init__(self, emb, path): | ||
| infer, *args = SlovnetNER.load(path) | ||
| SlovnetNER.__init__(self, infer, *args) | ||
| self.navec(emb) | ||
| def map(self, items): | ||
| markups = SlovnetNER.map(self, items) | ||
| for markup in markups: | ||
| yield adapt_markup(markup) | ||
| class NewsNERTagger(NERTagger): | ||
| def __init__(self, emb, path=NEWS_NER): | ||
| NERTagger.__init__(self, emb, path) |
| from collections import defaultdict, deque | ||
| from .shape import recover_shape | ||
| PROPN = 'PROPN' | ||
| NOUN = 'NOUN' | ||
| ADJ = 'ADJ' | ||
| VERB = 'VERB' | ||
| GENDER = 'Gender' | ||
| NUMBER = 'Number' | ||
| CASE = 'Case' | ||
| NOM = 'Nom' | ||
| def recover_shapes(words, tokens): | ||
| for word, token in zip(words, tokens): | ||
| yield recover_shape(word, token.text) | ||
| def recover_spaces(words, tokens): | ||
| offset = None | ||
| parts = [] | ||
| for index, (word, token) in enumerate(zip(words, tokens)): | ||
| if index > 0: | ||
| parts.append(' ' * (token.start - offset)) | ||
| parts.append(word) | ||
| offset = token.stop | ||
| return ''.join(parts) | ||
| def normal_pos(pos): | ||
| if pos == PROPN: | ||
| pos = NOUN | ||
| return pos | ||
| def pos_match(a, b): | ||
| return normal_pos(a) == normal_pos(b) | ||
| def feats_match(a, b): | ||
| return ( | ||
| a.get(GENDER) == b.get(GENDER) | ||
| and a.get(NUMBER) == b.get(NUMBER) | ||
| and a.get(CASE) == b.get(CASE) | ||
| ) | ||
| def form_match(form, pos, feats): | ||
| return pos_match(form.pos, pos) and feats_match(form.feats, feats) | ||
| def select_form(forms, pos, feats): | ||
| for form in forms: | ||
| if form_match(form, pos, feats): | ||
| return form | ||
| def normal_word(word): | ||
| word = word.lower() | ||
| return word.replace('ё', 'е') | ||
| def inflect_word(vocab, token): | ||
| word, pos, feats = token.text, token.pos, token.feats | ||
| word = normal_word(word) | ||
| if pos not in (PROPN, NOUN, ADJ, VERB): | ||
| return word | ||
| if feats.get(CASE) == NOM: | ||
| return word | ||
| forms = vocab(word) | ||
| form = select_form(forms, pos, feats) | ||
| if form: | ||
| form = form.inflect({NOM}) | ||
| if form: | ||
| return normal_word(form.word) | ||
| return word | ||
| def inflect_words(vocab, tokens, ids=None): | ||
| for token in tokens: | ||
| if not ids or token.id in ids: | ||
| yield inflect_word(vocab, token) | ||
| else: | ||
| yield token.text | ||
| def select_inflectable(tokens): | ||
| index = {} | ||
| for token in tokens: | ||
| index[token.id] = token | ||
| roots = set() | ||
| children = defaultdict(list) | ||
| for token in tokens: | ||
| if token.head_id not in index: | ||
| roots.add(token.id) | ||
| else: | ||
| children[token.head_id].append(token.id) | ||
| stack = deque(roots) | ||
| while stack: | ||
| id = stack.popleft() | ||
| yield id | ||
| for child in children[id]: | ||
| token = index[child] | ||
| if token.pos in (ADJ, VERB): | ||
| stack.append(child) | ||
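`select_inflectable` above does a breadth-first walk from the syntax root, descending only into ADJ/VERB dependents, so an agreeing adjective like «Львовской» is inflected together with its noun head while other dependents keep their case. A toy rerun (`Tok` is a stand-in record):

```python
from collections import namedtuple, defaultdict, deque

# Stand-in rerun of select_inflectable: start at the roots, then follow
# only ADJ/VERB children; a NOUN dependent (e.g. a genitive) is skipped.
Tok = namedtuple('Tok', 'id head_id pos')
tokens = [Tok('1', '2', 'ADJ'), Tok('2', '0', 'NOUN'), Tok('3', '2', 'NOUN')]

def select_inflectable(tokens):
    index = {token.id: token for token in tokens}
    roots, children = set(), defaultdict(list)
    for token in tokens:
        if token.head_id not in index:
            roots.add(token.id)
        else:
            children[token.head_id].append(token.id)
    stack = deque(roots)
    while stack:
        id = stack.popleft()
        yield id
        for child in children[id]:
            if index[child].pos in ('ADJ', 'VERB'):
                stack.append(child)

assert set(select_inflectable(tokens)) == {'1', '2'}
```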
| def syntax_normalize(vocab, tokens): | ||
| ids = set(select_inflectable(tokens)) | ||
| words = inflect_words(vocab, tokens, ids) | ||
| words = recover_shapes(words, tokens) | ||
| return recover_spaces(words, tokens) | ||
| def normalize(vocab, tokens): | ||
| words = inflect_words(vocab, tokens) | ||
| words = recover_shapes(words, tokens) | ||
| return recover_spaces(words, tokens) |
| from collections import OrderedDict | ||
| def parse_annotation(annotation): | ||
| type = annotation or str | ||
| repeatable = False | ||
| if isinstance(annotation, list): # [Fact] | ||
| repeatable = True | ||
| type = annotation[0] | ||
| is_record = issubclass(type, Record) | ||
| return type, repeatable, is_record | ||
| class Record(object): | ||
| __attributes__ = [] | ||
| __annotations__ = {} | ||
| def __init__(self, *args, **kwargs): | ||
| for key, value in zip(self.__attributes__, args): | ||
| self.__dict__[key] = value | ||
| self.__dict__.update(kwargs) | ||
| def __eq__(self, other): | ||
| return ( | ||
| type(self) == type(other) | ||
| and all( | ||
| (getattr(self, _) == getattr(other, _)) | ||
| for _ in self.__attributes__ | ||
| ) | ||
| ) | ||
| def __ne__(self, other): | ||
| return not self == other | ||
| def __iter__(self): | ||
| return (getattr(self, _) for _ in self.__attributes__) | ||
| def __hash__(self): | ||
| return hash(tuple(self)) | ||
| def __repr__(self): | ||
| name = self.__class__.__name__ | ||
| args = ', '.join( | ||
| '{key}={value!r}'.format( | ||
| key=_, | ||
| value=getattr(self, _) | ||
| ) | ||
| for _ in self.__attributes__ | ||
| ) | ||
| return '{name}({args})'.format( | ||
| name=name, | ||
| args=args | ||
| ) | ||
| def _repr_pretty_(self, printer, cycle): | ||
| name = self.__class__.__name__ | ||
| if cycle: | ||
| printer.text('{name}(...)'.format(name=name)) | ||
| else: | ||
| printer.text('{name}('.format(name=name)) | ||
| keys = self.__attributes__ | ||
| size = len(keys) | ||
| if size: | ||
| with printer.indent(4): | ||
| printer.break_() | ||
| for index, key in enumerate(keys): | ||
| printer.text(key + '=') | ||
| value = getattr(self, key) | ||
| printer.pretty(value) | ||
| if index < size - 1: | ||
| printer.text(',') | ||
| printer.break_() | ||
| printer.break_() | ||
| printer.text(')') | ||
| @property | ||
| def as_json(self): | ||
| data = OrderedDict() | ||
| for key in self.__attributes__: | ||
| annotation = self.__annotations__.get(key) | ||
| _, repeatable, is_record = parse_annotation(annotation) | ||
| value = getattr(self, key) | ||
| if value is None: | ||
| continue | ||
| if repeatable and is_record: | ||
| value = [_.as_json for _ in value] | ||
| elif is_record: | ||
| value = value.as_json | ||
| data[key] = value | ||
| return data | ||
| @classmethod | ||
| def from_json(cls, data): | ||
| args = [] | ||
| for key in cls.__attributes__: | ||
| annotation = cls.__annotations__.get(key) | ||
| type, repeatable, is_record = parse_annotation(annotation) | ||
| value = data.get(key) | ||
| if value is None and repeatable: | ||
| value = [] | ||
| elif value is not None: | ||
| if repeatable and is_record: | ||
| value = [type.from_json(_) for _ in value] | ||
| elif is_record: | ||
| value = type.from_json(value) | ||
| args.append(value) | ||
| return cls(*args) |
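The Record base class above drives `__init__`, equality and JSON serialization from a single `__attributes__` list. A minimal sketch of the same pattern (flat attributes only; the nested/repeatable handling done via `parse_annotation` is omitted):

```python
from collections import OrderedDict

# Minimal sketch of the Record pattern above: __attributes__ drives
# __init__, __eq__ and a flat JSON round-trip (no nested records).
class Record:
    __attributes__ = []

    def __init__(self, *args):
        for key, value in zip(self.__attributes__, args):
            setattr(self, key, value)

    def __eq__(self, other):
        return type(self) == type(other) and all(
            getattr(self, _) == getattr(other, _)
            for _ in self.__attributes__
        )

    @property
    def as_json(self):
        return OrderedDict((key, getattr(self, key)) for key in self.__attributes__)

    @classmethod
    def from_json(cls, data):
        return cls(*(data.get(_) for _ in cls.__attributes__))

class Span(Record):
    __attributes__ = ['start', 'stop', 'type']

span = Span(0, 5, 'PER')
assert Span.from_json(span.as_json) == span
```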
| from razdel import sentenize, tokenize | ||
| from .record import Record | ||
| class Sent(Record): | ||
| __attributes__ = ['start', 'stop', 'text'] | ||
| class Token(Record): | ||
| __attributes__ = ['start', 'stop', 'text'] | ||
| def adapt_token(token): | ||
| start, stop, text = token | ||
| return Token(start, stop, text) | ||
| def adapt_sent(sent): | ||
| start, stop, text = sent | ||
| return Sent(start, stop, text) | ||
| class Segmenter(Record): | ||
| def tokenize(self, text): | ||
| for token in tokenize(text): | ||
| yield adapt_token(token) | ||
| def sentenize(self, text): | ||
| for sent in sentenize(text): | ||
| yield adapt_sent(sent) |
| import re | ||
| ALPHA = 'alpha' | ||
| NONALPHA = 'nonalpha' | ||
| SHAPE_PART = re.compile(r''' | ||
| (?P<alpha>[а-яёa-z]+) | ||
| |(?P<nonalpha>[^а-яёa-z]+) | ||
| ''', re.I | re.X) | ||
| class ShapeRecoverError(Exception): | ||
| pass | ||
| def shape_parts(word): | ||
| matches = SHAPE_PART.finditer(word) | ||
| for match in matches: | ||
| yield match.lastgroup, match.group(0) | ||
| def recover_part_shape(part, ref): | ||
| if ref.islower(): | ||
| return part.lower() | ||
| elif ref.isupper(): | ||
| return part.upper() | ||
| elif ref.istitle(): | ||
| return part.capitalize() | ||
| else: | ||
| raise ShapeRecoverError | ||
| def recover_shape_(word, ref): | ||
| if word.lower() == ref.lower(): | ||
| yield ref | ||
| return | ||
| parts = list(shape_parts(word)) | ||
| ref_parts = list(shape_parts(ref)) | ||
| if len(parts) != len(ref_parts): | ||
| raise ShapeRecoverError | ||
| for (type, part), (ref_type, ref_part) in zip(parts, ref_parts): | ||
| if type != ref_type: | ||
| raise ShapeRecoverError | ||
| if type == ALPHA: | ||
| yield recover_part_shape(part, ref_part) | ||
| else: | ||
| yield part | ||
| def recover_shape(word, ref): | ||
| try: | ||
| parts = recover_shape_(word, ref) | ||
| return ''.join(parts) | ||
| except ShapeRecoverError: | ||
| return word |
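The core of `recover_shape` is the per-part case recovery: copy the reference word's casing pattern (lower, UPPER, Title) onto the normalized word, and give up on mixed case. A condensed stand-alone sketch of that step:

```python
# Condensed sketch of the case-recovery step in recover_shape above:
# mirror the reference word's casing onto the normalized word.
def recover_case(word, ref):
    if ref.islower():
        return word.lower()
    if ref.isupper():
        return word.upper()
    if ref.istitle():
        return word.capitalize()
    return word  # mixed case: give up, keep the word as-is

assert recover_case('львовская', 'Львовской') == 'Львовская'
assert recover_case('оун', 'ОУН') == 'ОУН'
```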
| from intervaltree import IntervalTree | ||
| from .record import Record | ||
| class Span(Record): | ||
| __attributes__ = ['start', 'stop', 'type'] | ||
| def adapt_spans(spans): | ||
| for span in spans: | ||
| yield Span(span.start, span.stop, span.type) | ||
| def offset_spans(spans, offset): | ||
| for span in spans: | ||
| yield Span( | ||
| offset + span.start, | ||
| offset + span.stop, | ||
| span.type | ||
| ) | ||
| def index_spans(spans): | ||
| index = IntervalTree() | ||
| for span in spans: | ||
| index.addi(span.start, span.stop, span) | ||
| return index | ||
| def query_spans_index(index, span): | ||
| return [ | ||
| _.data | ||
| for _ in sorted(index.envelop(span.start, span.stop)) | ||
| ] |
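`query_spans_index` relies on `IntervalTree.envelop`, which returns the intervals lying entirely inside the query window. For small inputs the same query is a one-line scan; a stdlib-only stand-in (O(n) per query instead of O(log n)):

```python
from collections import namedtuple

# Stdlib stand-in for the envelop query above: keep spans that lie
# entirely inside the [start, stop) window, sorted by position.
Span = namedtuple('Span', 'start stop type')
spans = [Span(20, 25, 'ORG'), Span(0, 5, 'PER'), Span(6, 12, 'LOC')]

def query_spans(spans, start, stop):
    return sorted(
        span for span in spans
        if start <= span.start and span.stop <= stop
    )

assert query_spans(spans, 0, 15) == [Span(0, 5, 'PER'), Span(6, 12, 'LOC')]
```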
| from slovnet import Syntax as SlovnetSyntax | ||
| from ipymarkup import show_dep_ascii_markup | ||
| from .data import NEWS_SYNTAX | ||
| from .record import Record | ||
| ######## | ||
| # | ||
| # MARKUP | ||
| # | ||
| ###### | ||
| class SyntaxToken(Record): | ||
| __attributes__ = ['id', 'text', 'head_id', 'rel'] | ||
| class SyntaxMarkup(Record): | ||
| __attributes__ = ['tokens'] | ||
| def print(self): | ||
| print_markup(self) | ||
| def adapt_tokens(tokens): | ||
| for token in tokens: | ||
| yield SyntaxToken( | ||
| token.id, token.text, | ||
| token.head_id, token.rel | ||
| ) | ||
| def adapt_markup(markup): | ||
| return SyntaxMarkup( | ||
| list(adapt_tokens(markup.tokens)) | ||
| ) | ||
| def token_deps(tokens): | ||
| ids = {} | ||
| for index, token in enumerate(tokens): | ||
| ids[token.id] = index | ||
| for token in tokens: | ||
| source = ids.get(token.head_id) | ||
| target = ids[token.id] | ||
| if source is not None and source != target: # skip root, loop | ||
| yield source, target, token.rel | ||
| def markup_words(markup): | ||
| return [_.text for _ in markup.tokens] | ||
| def print_markup(markup): | ||
| words = markup_words(markup) | ||
| deps = token_deps(markup.tokens) | ||
| show_dep_ascii_markup(words, deps) | ||
| ###### | ||
| # | ||
| # PARSER | ||
| # | ||
| ####### | ||
| class SyntaxParser(SlovnetSyntax): | ||
| def __init__(self, emb, path): | ||
| infer, *args = SlovnetSyntax.load(path) | ||
| SlovnetSyntax.__init__(self, infer, *args) | ||
| self.navec(emb) | ||
| def map(self, items): | ||
| markups = SlovnetSyntax.map(self, items) | ||
| for markup in markups: | ||
| yield adapt_markup(markup) | ||
| class NewsSyntaxParser(SyntaxParser): | ||
| def __init__(self, emb, path=NEWS_SYNTAX): | ||
| SyntaxParser.__init__(self, emb, path) |
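`token_deps` maps sentence-scoped token ids to list positions and drops edges to the synthetic root (whose id never appears among the tokens). A stand-in rerun on two tokens (`Tok` is a hypothetical record mirroring SyntaxToken):

```python
from collections import namedtuple

# Stand-in rerun of token_deps above: ids like '1_2' map to positions;
# the edge to the synthetic root ('1_0') is dropped.
Tok = namedtuple('Tok', 'id head_id rel text')
tokens = [
    Tok('1_1', '1_2', 'nsubj', 'Посол'),
    Tok('1_2', '1_0', 'root', 'признался'),
]

def token_deps(tokens):
    ids = {token.id: index for index, token in enumerate(tokens)}
    for token in tokens:
        source = ids.get(token.head_id)
        target = ids[token.id]
        if source is not None and source != target:  # skip root and self loops
            yield source, target, token.rel

assert list(token_deps(tokens)) == [(1, 0, 'nsubj')]
```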
| import pytest | ||
| from natasha import ( | ||
| Segmenter, | ||
| MorphVocab, | ||
| NewsEmbedding, | ||
| NewsMorphTagger, | ||
| NewsSyntaxParser, | ||
| NewsNERTagger, | ||
| NamesExtractor, | ||
| DatesExtractor, | ||
| MoneyExtractor, | ||
| AddrExtractor, | ||
| ) | ||
| @pytest.fixture(scope='session') | ||
| def segmenter(): | ||
| return Segmenter() | ||
| @pytest.fixture(scope='session') | ||
| def morph_vocab(): | ||
| return MorphVocab() | ||
| @pytest.fixture(scope='session') | ||
| def embedding(): | ||
| return NewsEmbedding() | ||
| @pytest.fixture(scope='session') | ||
| def morph_tagger(embedding): | ||
| return NewsMorphTagger(embedding) | ||
| @pytest.fixture(scope='session') | ||
| def syntax_parser(embedding): | ||
| return NewsSyntaxParser(embedding) | ||
| @pytest.fixture(scope='session') | ||
| def ner_tagger(embedding): | ||
| return NewsNERTagger(embedding) | ||
| @pytest.fixture(scope='session') | ||
| def names_extractor(morph_vocab): | ||
| return NamesExtractor(morph_vocab) | ||
| @pytest.fixture(scope='session') | ||
| def dates_extractor(morph_vocab): | ||
| return DatesExtractor(morph_vocab) | ||
| @pytest.fixture(scope='session') | ||
| def money_extractor(morph_vocab): | ||
| return MoneyExtractor(morph_vocab) | ||
| @pytest.fixture(scope='session') | ||
| def addr_extractor(morph_vocab): | ||
| return AddrExtractor(morph_vocab) |
| import pytest | ||
| from natasha.extractors import ( | ||
| AddrPart as Part, | ||
| Addr | ||
| ) | ||
| tests = [ | ||
| [ | ||
| 'Россия, Вологодская обл. г. Череповец, пр.Победы 93 б', | ||
| Addr([ | ||
| Part('Россия', 'страна'), | ||
| Part('Вологодская', 'область'), | ||
| Part('Череповец', 'город'), | ||
| Part('Победы', 'проспект'), | ||
| ]) | ||
| ], | ||
| [ | ||
| '692909, РФ, Приморский край, г. Находка, ул. Добролюбова, 18', | ||
| Addr([ | ||
| Part('692909', 'индекс'), | ||
| Part('РФ', 'страна'), | ||
| Part('Приморский', 'край'), | ||
| Part('Находка', 'город'), | ||
| Part('Добролюбова', 'улица'), | ||
| ]) | ||
| ], | ||
| [ | ||
| 'д. Федоровка, ул. Дружбы, 13', | ||
| Addr([ | ||
| Part('Федоровка', 'деревня'), | ||
| Part('Дружбы', 'улица'), | ||
| ]) | ||
| ], | ||
| [ | ||
| 'Россия, 129110, г.Москва, Олимпийский проспект, 22', | ||
| Addr([ | ||
| Part('Россия', 'страна'), | ||
| Part('129110', 'индекс'), | ||
| Part('Москва', 'город'), | ||
| Part('Олимпийский', 'проспект'), | ||
| ]) | ||
| ], | ||
| [ | ||
| 'г. Санкт-Петербург, Красногвардейский пер., д. 15', | ||
| Addr([ | ||
| Part('Санкт-Петербург', 'город'), | ||
| Part('Красногвардейский', 'переулок'), | ||
| Part('15', 'дом') | ||
| ]) | ||
| ], | ||
| [ | ||
| 'Республика Карелия,г.Петрозаводск,ул.Маршала Мерецкова, д.8 Б,офис 4', | ||
| Addr([ | ||
| Part('Карелия', 'республика'), | ||
| Part('Петрозаводск', 'город'), | ||
| Part('Маршала Мерецкова', 'улица'), | ||
| Part('8 Б', 'дом'), | ||
| Part('4', 'офис') | ||
| ]) | ||
| ], | ||
| [ | ||
| '628000, ХМАО-Югра, г.Ханты-Мансийск, ул. Ледовая , д.19', | ||
| Addr([ | ||
| Part('628000', 'индекс'), | ||
| Part('ХМАО-Югра'), | ||
| Part('Ханты-Мансийск', 'город'), | ||
| Part('Ледовая', 'улица'), | ||
| Part('19', 'дом') | ||
| ]) | ||
| ], | ||
| [ | ||
| 'ХМАО г.Нижневартовск пер.Ягельный 17', | ||
| Addr([ | ||
| Part('ХМАО'), | ||
| Part('Нижневартовск', 'город'), | ||
| Part('Ягельный', 'переулок'), | ||
| ]) | ||
| ], | ||
| [ | ||
| 'Белгородская обл, пгт Борисовка,ул. Рудого д.160', | ||
| Addr([ | ||
| Part('Белгородская', 'область'), | ||
| Part('Борисовка', 'посёлок'), | ||
| Part('Рудого', 'улица'), | ||
| Part('160', 'дом') | ||
| ]) | ||
| ], | ||
| [ | ||
| 'Самарская область, п.г.т. Алексеевка, ул. Ульяновская д. 21', | ||
| Addr([ | ||
| Part('Самарская', 'область'), | ||
| Part('Алексеевка', 'посёлок'), | ||
| Part('Ульяновская', 'улица'), | ||
| Part('21', 'дом') | ||
| ]) | ||
| ], | ||
| [ | ||
| 'Мурманская обл поселок городского типа Молочный, ул.Гальченко д.11', | ||
| Addr([ | ||
| Part('Мурманская', 'область'), | ||
| Part('Молочный', 'посёлок'), | ||
| Part('Гальченко', 'улица'), | ||
| Part('11', 'дом') | ||
| ]) | ||
| ], | ||
| [ | ||
| 'ул. Народного Ополчения д. 9к.3', | ||
| Addr([ | ||
| Part('Народного Ополчения', 'улица'), | ||
| Part('9к', 'дом'), | ||
| ]) | ||
| ], | ||
| [ | ||
| 'ул. Б. Пироговская, д.37/430', | ||
| Addr([ | ||
| Part('Б. Пироговская', 'улица'), | ||
| Part('37/430', 'дом') | ||
| ]) | ||
| ], | ||
| ] | ||
| @pytest.mark.parametrize('test', tests) | ||
| def test_extractor(addr_extractor, test): | ||
| text, target = test | ||
| pred = addr_extractor.find(text).fact | ||
| assert pred == target |
| import re | ||
| from natasha import PER, Doc | ||
| from natasha.extractors import Name | ||
| TEXT = 'Посол Израиля на Украине Йоэль Лион признался, что пришел в шок, узнав о решении властей Львовской области объявить 2019 год годом лидера запрещенной в России Организации украинских националистов (ОУН) Степана Бандеры. Свое заявление он разместил в Twitter. 11 декабря Львовский областной совет принял решение провозгласить 2019 год в регионе годом Степана Бандеры в связи с празднованием 110-летия со дня рождения лидера ОУН (Бандера родился 1 января 1909 года).' | ||
| def strip(markup): | ||
| markup = markup.lstrip('\n') | ||
| return re.sub(r'\s+\n', '\n', markup) | ||
| NER = strip(''' | ||
| Посол Израиля на Украине Йоэль Лион признался, что пришел в шок, узнав | ||
| LOC──── LOC──── PER─────── | ||
| о решении властей Львовской области объявить 2019 год годом лидера | ||
| LOC────────────── | ||
| запрещенной в России Организации украинских националистов (ОУН) | ||
| LOC─── ORG─────────────────────────────────────── | ||
| Степана Бандеры. Свое заявление он разместил в Twitter. 11 декабря | ||
| PER──────────── ORG──── | ||
| Львовский областной совет принял решение провозгласить 2019 год в | ||
| ORG────────────────────── | ||
| регионе годом Степана Бандеры в связи с празднованием 110-летия со дня | ||
| PER──────────── | ||
| рождения лидера ОУН (Бандера родился 1 января 1909 года). | ||
| ORG | ||
| ''') | ||
| MORPH = strip(''' | ||
| Посол NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing | ||
| Израиля PROPN|Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing | ||
| на ADP | ||
| Украине PROPN|Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing | ||
| Йоэль PROPN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing | ||
| Лион PROPN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing | ||
| признался VERB|Aspect=Perf|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Mid | ||
| , PUNCT | ||
| что SCONJ | ||
| пришел VERB|Aspect=Perf|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act | ||
| в ADP | ||
| шок NOUN|Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing | ||
| , PUNCT | ||
| узнав VERB|Aspect=Perf|Tense=Past|VerbForm=Conv|Voice=Act | ||
| о ADP | ||
| решении NOUN|Animacy=Inan|Case=Loc|Gender=Neut|Number=Sing | ||
| властей NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur | ||
| Львовской ADJ|Case=Gen|Degree=Pos|Gender=Fem|Number=Sing | ||
| области NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing | ||
| объявить VERB|Aspect=Perf|VerbForm=Inf|Voice=Act | ||
| 2019 ADJ | ||
| год NOUN|Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing | ||
| годом NOUN|Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing | ||
| лидера NOUN|Animacy=Anim|Case=Gen|Gender=Masc|Number=Sing | ||
| запрещенной VERB|Aspect=Perf|Case=Gen|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass | ||
| в ADP | ||
| России PROPN|Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing | ||
| Организации PROPN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing | ||
| украинских ADJ|Case=Gen|Degree=Pos|Number=Plur | ||
| националистов NOUN|Animacy=Anim|Case=Gen|Gender=Masc|Number=Plur | ||
| ( PUNCT | ||
| ОУН PROPN|Animacy=Inan|Case=Nom|Gender=Fem|Number=Sing | ||
| ) PUNCT | ||
| Степана PROPN|Animacy=Anim|Case=Gen|Gender=Masc|Number=Sing | ||
| Бандеры PROPN|Animacy=Anim|Case=Gen|Gender=Masc|Number=Sing | ||
| . PUNCT | ||
| ''') | ||
| SYNTAX = strip(''' | ||
| ┌──► Посол nsubj | ||
| │ Израиля | ||
| │ ┌► на case | ||
| │ └─ Украине | ||
| │ ┌─ Йоэль | ||
| │ └► Лион flat:name | ||
| ┌─────┌─└─── признался | ||
| │ │ ┌──► , punct | ||
| │ │ │ ┌► что mark | ||
| │ └►└─└─ пришел ccomp | ||
| │ │ ┌► в case | ||
| │ └──►└─ шок obl | ||
| │ ┌► , punct | ||
| │ ┌────►┌─└─ узнав advcl | ||
| │ │ │ ┌► о case | ||
| │ │ ┌───└►└─ решении obl | ||
| │ │ │ ┌─└──► властей nmod | ||
| │ │ │ │ ┌► Львовской amod | ||
| │ │ │ └──►└─ области nmod | ||
| │ └─└►┌─┌─── объявить nmod | ||
| │ │ │ ┌► 2019 amod | ||
| │ │ └►└─ год obj | ||
| │ └──►┌─ годом obl | ||
| │ ┌─────└► лидера nmod | ||
| │ │ ┌►┌─── запрещенной acl | ||
| │ │ │ │ ┌► в case | ||
| │ │ │ └►└─ России obl | ||
| │ ┌─└►└─┌─── Организации nmod | ||
| │ │ │ ┌► украинских amod | ||
| │ │ ┌─└►└─ националистов nmod | ||
| │ │ │ ┌► ( punct | ||
| │ │ └►┌─└─ ОУН parataxis | ||
| │ │ └──► ) punct | ||
| │ └──────►┌─ Степана appos | ||
| │ └► Бандеры flat:name | ||
| └──────────► . punct | ||
| ''') | ||
| LEMMAS = { | ||
| '110-летия': '110-летие', | ||
| 'Бандеры': 'бандера', | ||
| 'Израиля': 'израиль', | ||
| 'Львовской': 'львовский', | ||
| 'Организации': 'организация', | ||
| 'России': 'россия', | ||
| 'Свое': 'свой', | ||
| 'Степана': 'степан', | ||
| 'Украине': 'украина', | ||
| 'властей': 'власть', | ||
| 'года': 'год', | ||
| 'годом': 'год', | ||
| 'декабря': 'декабрь', | ||
| 'дня': 'день', | ||
| 'запрещенной': 'запретить', | ||
| 'лидера': 'лидер', | ||
| 'националистов': 'националист', | ||
| 'области': 'область', | ||
| 'празднованием': 'празднование', | ||
| 'признался': 'признаться', | ||
| 'принял': 'принять', | ||
| 'пришел': 'прийти', | ||
| 'разместил': 'разместить', | ||
| 'регионе': 'регион', | ||
| 'решении': 'решение', | ||
| 'родился': 'родиться', | ||
| 'рождения': 'рождение', | ||
| 'связи': 'связь', | ||
| 'со': 'с', | ||
| 'узнав': 'узнать', | ||
| 'украинских': 'украинский', | ||
| 'января': 'январь' | ||
| } | ||
| NORMALS = { | ||
| 'Twitter': 'Twitter', | ||
| 'Израиля': 'Израиль', | ||
| 'Йоэль Лион': 'Йоэль Лион', | ||
| 'Львовский областной совет': 'Львовский областной совет', | ||
| 'Львовской области': 'Львовская область', | ||
| 'ОУН': 'ОУН', | ||
| 'Организации украинских националистов (ОУН)': 'Организация украинских ' | ||
| 'националистов (ОУН)', | ||
| 'России': 'Россия', | ||
| 'Степана Бандеры': 'Степан Бандера', | ||
| 'Украине': 'Украина' | ||
| } | ||
| FACTS = { | ||
| 'Йоэль Лион': Name(first='Йоэль', last='Лион'), | ||
| 'Степан Бандера': Name(first='Степан', last='Бандера') | ||
| } | ||
| def test_doc(segmenter, morph_vocab, | ||
|              morph_tagger, syntax_parser, ner_tagger, | ||
|              names_extractor, capsys): | ||
|     doc = Doc(TEXT) | ||
|     doc.segment(segmenter) | ||
|     doc.tag_morph(morph_tagger) | ||
|     doc.parse_syntax(syntax_parser) | ||
|     doc.tag_ner(ner_tagger) | ||
|     for span in doc.spans: | ||
|         span.normalize(morph_vocab) | ||
|         if span.type == PER: | ||
|             span.extract_fact(names_extractor) | ||
|     for token in doc.tokens: | ||
|         token.lemmatize(morph_vocab) | ||
|     doc.ner.print() | ||
|     assert strip(capsys.readouterr().out) == NER | ||
|     sent = doc.sents[0] | ||
|     sent.morph.print() | ||
|     assert strip(capsys.readouterr().out) == MORPH | ||
|     sent.syntax.print() | ||
|     assert strip(capsys.readouterr().out) == SYNTAX | ||
|     lemmas = { | ||
|         _.text: _.lemma | ||
|         for _ in doc.tokens | ||
|         if _.text.lower() != _.lemma | ||
|     } | ||
|     assert lemmas == LEMMAS | ||
|     normals = { | ||
|         _.text: _.normal | ||
|         for _ in doc.spans | ||
|     } | ||
|     assert normals == NORMALS | ||
|     facts = { | ||
|         _.normal: _.fact | ||
|         for _ in doc.spans | ||
|         if _.fact | ||
|     } | ||
|     assert facts == FACTS | ||
| Metadata-Version: 2.1 | ||
| Name: natasha | ||
| Version: 1.1.0 | ||
| Summary: Named-entity recognition for the Russian language | ||
| Home-page: https://github.com/natasha/natasha | ||
| Author: Natasha contributors | ||
| Author-email: d.a.veselov@yandex.ru, alex@alexkuk.ru | ||
| License: MIT | ||
| Description: | ||
| <img src="https://github.com/natasha/natasha-logos/blob/master/natasha.svg"> | ||
|  [](https://codecov.io/gh/natasha/natasha) | ||
| Natasha solves basic NLP tasks for Russian: tokenization, sentence segmentation, word embedding, morphology tagging, lemmatization, phrase normalization, syntax parsing, NER tagging and fact extraction. Quality on every task is similar to or better than the current SOTA for Russian on news articles, see the <a href="https://github.com/natasha/natasha#evaluation">evaluation section</a>. Natasha is not a research project; the underlying technologies are built for production. We pay attention to model size, RAM usage and performance. Models run on CPU and use NumPy for inference. | ||
| Natasha integrates libraries from <a href="https://github.com/natasha">Natasha project</a> under one convenient API: | ||
| * <a href="https://github.com/natasha/razdel">Razdel</a> — token and sentence segmentation for Russian | ||
| * <a href="https://github.com/natasha/navec">Navec</a> — compact Russian embeddings | ||
| * <a href="https://github.com/natasha/slovnet">Slovnet</a> — modern deep-learning techniques for Russian NLP; compact models for Russian morphology, syntax and NER | ||
| * <a href="https://github.com/natasha/yargy">Yargy</a> — rule-based fact extraction similar to the Tomita parser | ||
| * <a href="https://github.com/natasha/ipymarkup">Ipymarkup</a> — NLP visualizations for NER and syntax markup | ||
| > ⚠ The API may change; for real-world tasks, consider using the lower-level libraries from the Natasha project. | ||
| > Models are optimized for news articles; quality on other domains may be lower. | ||
| > To use the old `NamesExtractor` and `AddressExtractor`, downgrade: `pip install "natasha<1" "yargy<0.13"` | ||
| ## Install | ||
| Natasha supports Python 3.5+ and PyPy3: | ||
| ```bash | ||
| $ pip install natasha | ||
| ``` | ||
| ## Usage | ||
| For more examples and explanations, see the [Natasha documentation](http://nbviewer.jupyter.org/github/natasha/natasha/blob/master/docs.ipynb). | ||
| ```python | ||
| >>> from natasha import ( | ||
| Segmenter, | ||
| MorphVocab, | ||
| NewsEmbedding, | ||
| NewsMorphTagger, | ||
| NewsSyntaxParser, | ||
| NewsNERTagger, | ||
| PER, | ||
| NamesExtractor, | ||
| Doc | ||
| ) | ||
| ####### | ||
| # | ||
| # INIT | ||
| # | ||
| ##### | ||
| >>> segmenter = Segmenter() | ||
| >>> morph_vocab = MorphVocab() | ||
| >>> emb = NewsEmbedding() | ||
| >>> morph_tagger = NewsMorphTagger(emb) | ||
| >>> syntax_parser = NewsSyntaxParser(emb) | ||
| >>> ner_tagger = NewsNERTagger(emb) | ||
| >>> names_extractor = NamesExtractor(morph_vocab) | ||
| >>> text = 'Посол Израиля на Украине Йоэль Лион признался, что пришел в шок, узнав о решении властей Львовской области объявить 2019 год годом лидера запрещенной в России Организации украинских националистов (ОУН) Степана Бандеры. Свое заявление он разместил в Twitter. «Я не могу понять, как прославление тех, кто непосредственно принимал участие в ужасных антисемитских преступлениях, помогает бороться с антисемитизмом и ксенофобией. Украина не должна забывать о преступлениях, совершенных против украинских евреев, и никоим образом не отмечать их через почитание их исполнителей», — написал дипломат. 11 декабря Львовский областной совет принял решение провозгласить 2019 год в регионе годом Степана Бандеры в связи с празднованием 110-летия со дня рождения лидера ОУН (Бандера родился 1 января 1909 года). В июле аналогичное решение принял Житомирский областной совет. В начале месяца с предложением к президенту страны Петру Порошенко вернуть Бандере звание Героя Украины обратились депутаты Верховной Рады. Парламентарии уверены, что признание Бандеры национальным героем поможет в борьбе с подрывной деятельностью против Украины в информационном поле, а также остановит «распространение мифов, созданных российской пропагандой». Степан Бандера (1909-1959) был одним из лидеров Организации украинских националистов, выступающей за создание независимого государства на территориях с украиноязычным населением. В 2010 году в период президентства Виктора Ющенко Бандера был посмертно признан Героем Украины, однако впоследствии это решение было отменено судом. ' | ||
| >>> doc = Doc(text) | ||
| ####### | ||
| # | ||
| # SEGMENT | ||
| # | ||
| ##### | ||
| >>> doc.segment(segmenter) | ||
| >>> display(doc.tokens[:5]) | ||
| >>> display(doc.sents[:5]) | ||
| [DocToken(stop=5, text='Посол'), | ||
| DocToken(start=6, stop=13, text='Израиля'), | ||
| DocToken(start=14, stop=16, text='на'), | ||
| DocToken(start=17, stop=24, text='Украине'), | ||
| DocToken(start=25, stop=30, text='Йоэль')] | ||
| [DocSent(stop=218, text='Посол Израиля на Украине Йоэль Лион признался, чт..., tokens=[...]), | ||
| DocSent(start=219, stop=257, text='Свое заявление он разместил в Twitter.', tokens=[...]), | ||
| DocSent(start=258, stop=424, text='«Я не могу понять, как прославление тех, кто непо..., tokens=[...]), | ||
| DocSent(start=425, stop=592, text='Украина не должна забывать о преступлениях, совер..., tokens=[...]), | ||
| DocSent(start=593, stop=798, text='11 декабря Львовский областной совет принял решен..., tokens=[...])] | ||
| ####### | ||
| # | ||
| # MORPH | ||
| # | ||
| ##### | ||
| >>> doc.tag_morph(morph_tagger) | ||
| >>> display(doc.tokens[:5]) | ||
| >>> doc.sents[0].morph.print() | ||
| [DocToken(stop=5, text='Посол', pos='NOUN', feats=<Anim,Nom,Masc,Sing>), | ||
| DocToken(start=6, stop=13, text='Израиля', pos='PROPN', feats=<Inan,Gen,Masc,Sing>), | ||
| DocToken(start=14, stop=16, text='на', pos='ADP'), | ||
| DocToken(start=17, stop=24, text='Украине', pos='PROPN', feats=<Inan,Loc,Fem,Sing>), | ||
| DocToken(start=25, stop=30, text='Йоэль', pos='PROPN', feats=<Anim,Nom,Masc,Sing>)] | ||
| Посол NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing | ||
| Израиля PROPN|Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing | ||
| на ADP | ||
| Украине PROPN|Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing | ||
| Йоэль PROPN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing | ||
| Лион PROPN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing | ||
| признался VERB|Aspect=Perf|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Mid | ||
| , PUNCT | ||
| что SCONJ | ||
| ... | ||
| ###### | ||
| # | ||
| # LEMMA | ||
| # | ||
| ####### | ||
| >>> for token in doc.tokens: | ||
| >>>     token.lemmatize(morph_vocab) | ||
| >>> display(doc.tokens[:5]) | ||
| >>> {_.text: _.lemma for _ in doc.tokens} | ||
| [DocToken(stop=5, text='Посол', pos='NOUN', feats=<Anim,Nom,Masc,Sing>, lemma='посол'), | ||
| DocToken(start=6, stop=13, text='Израиля', pos='PROPN', feats=<Inan,Gen,Masc,Sing>, lemma='израиль'), | ||
| DocToken(start=14, stop=16, text='на', pos='ADP', lemma='на'), | ||
| DocToken(start=17, stop=24, text='Украине', pos='PROPN', feats=<Inan,Loc,Fem,Sing>, lemma='украина'), | ||
| DocToken(start=25, stop=30, text='Йоэль', pos='PROPN', feats=<Anim,Nom,Masc,Sing>, lemma='йоэль')] | ||
| {'Посол': 'посол', | ||
| 'Израиля': 'израиль', | ||
| 'на': 'на', | ||
| 'Украине': 'украина', | ||
| 'Йоэль': 'йоэль', | ||
| 'Лион': 'лион', | ||
| 'признался': 'признаться', | ||
| ',': ',', | ||
| 'что': 'что', | ||
| 'пришел': 'прийти', | ||
| 'в': 'в', | ||
| 'шок': 'шок', | ||
| 'узнав': 'узнать', | ||
| 'о': 'о', | ||
| ... | ||
| ####### | ||
| # | ||
| # SYNTAX | ||
| # | ||
| ###### | ||
| >>> doc.parse_syntax(syntax_parser) | ||
| >>> display(doc.tokens[:5]) | ||
| >>> doc.sents[0].syntax.print() | ||
| [DocToken(stop=5, text='Посол', id='1_1', head_id='1_7', rel='nsubj', pos='NOUN', feats=<Anim,Nom,Masc,Sing>), | ||
| DocToken(start=6, stop=13, text='Израиля', id='1_2', head_id='1_1', rel='nmod', pos='PROPN', feats=<Inan,Gen,Masc,Sing>), | ||
| DocToken(start=14, stop=16, text='на', id='1_3', head_id='1_4', rel='case', pos='ADP'), | ||
| DocToken(start=17, stop=24, text='Украине', id='1_4', head_id='1_1', rel='nmod', pos='PROPN', feats=<Inan,Loc,Fem,Sing>), | ||
| DocToken(start=25, stop=30, text='Йоэль', id='1_5', head_id='1_1', rel='appos', pos='PROPN', feats=<Anim,Nom,Masc,Sing>)] | ||
| ┌──► Посол nsubj | ||
| │ Израиля | ||
| │ ┌► на case | ||
| │ └─ Украине | ||
| │ ┌─ Йоэль | ||
| │ └► Лион flat:name | ||
| ┌─────┌─└─── признался | ||
| │ │ ┌──► , punct | ||
| │ │ │ ┌► что mark | ||
| │ └►└─└─ пришел ccomp | ||
| │ │ ┌► в case | ||
| │ └──►└─ шок obl | ||
| │ ┌► , punct | ||
| │ ┌────►┌─└─ узнав advcl | ||
| │ │ │ ┌► о case | ||
| │ │ ┌───└►└─ решении obl | ||
| │ │ │ ┌─└──► властей nmod | ||
| │ │ │ │ ┌► Львовской amod | ||
| │ │ │ └──►└─ области nmod | ||
| │ └─└►┌─┌─── объявить nmod | ||
| │ │ │ ┌► 2019 amod | ||
| │ │ └►└─ год obj | ||
| │ └──►┌─ годом obl | ||
| │ ┌─────└► лидера nmod | ||
| │ │ ┌►┌─── запрещенной acl | ||
| │ │ │ │ ┌► в case | ||
| │ │ │ └►└─ России obl | ||
| │ ┌─└►└─┌─── Организации nmod | ||
| │ │ │ ┌► украинских amod | ||
| │ │ ┌─└►└─ националистов nmod | ||
| │ │ │ ┌► ( punct | ||
| │ │ └►┌─└─ ОУН parataxis | ||
| │ │ └──► ) punct | ||
| │ └──────►┌─ Степана appos | ||
| │ └► Бандеры flat:name | ||
| └──────────► . punct | ||
| ... | ||
| ####### | ||
| # | ||
| # NER | ||
| # | ||
| ###### | ||
| >>> doc.tag_ner(ner_tagger) | ||
| >>> display(doc.spans[:5]) | ||
| >>> doc.ner.print() | ||
| [DocSpan(start=6, stop=13, type='LOC', text='Израиля', tokens=[...]), | ||
| DocSpan(start=17, stop=24, type='LOC', text='Украине', tokens=[...]), | ||
| DocSpan(start=25, stop=35, type='PER', text='Йоэль Лион', tokens=[...]), | ||
| DocSpan(start=89, stop=106, type='LOC', text='Львовской области', tokens=[...]), | ||
| DocSpan(start=152, stop=158, type='LOC', text='России', tokens=[...])] | ||
| Посол Израиля на Украине Йоэль Лион признался, что пришел в шок, узнав | ||
| LOC──── LOC──── PER─────── | ||
| о решении властей Львовской области объявить 2019 год годом лидера | ||
| LOC────────────── | ||
| запрещенной в России Организации украинских националистов (ОУН) | ||
| LOC─── ORG─────────────────────────────────────── | ||
| Степана Бандеры. Свое заявление он разместил в Twitter. «Я не могу | ||
| PER──────────── ORG──── | ||
| понять, как прославление тех, кто непосредственно принимал участие в | ||
| ужасных антисемитских преступлениях, помогает бороться с | ||
| антисемитизмом и ксенофобией. Украина не должна забывать о | ||
| LOC──── | ||
| преступлениях, совершенных против украинских евреев, и никоим образом | ||
| не отмечать их через почитание их исполнителей», — написал дипломат. | ||
| 11 декабря Львовский областной совет принял решение провозгласить 2019 | ||
| ORG────────────────────── | ||
| год в регионе годом Степана Бандеры в связи с празднованием 110-летия | ||
| PER──────────── | ||
| со дня рождения лидера ОУН (Бандера родился 1 января 1909 года). В | ||
| ORG | ||
| июле аналогичное решение принял Житомирский областной совет. В начале | ||
| ORG──────────────────────── | ||
| месяца с предложением к президенту страны Петру Порошенко вернуть | ||
| PER──────────── | ||
| Бандере звание Героя Украины обратились депутаты Верховной Рады. | ||
| PER──── LOC──── ORG─────────── | ||
| Парламентарии уверены, что признание Бандеры национальным героем | ||
| PER──── | ||
| поможет в борьбе с подрывной деятельностью против Украины в | ||
| LOC──── | ||
| информационном поле, а также остановит «распространение мифов, | ||
| созданных российской пропагандой». Степан Бандера (1909-1959) был | ||
| PER─────────── | ||
| одним из лидеров Организации украинских националистов, выступающей за | ||
| ORG───────────────────────────────── | ||
| создание независимого государства на территориях с украиноязычным | ||
| населением. В 2010 году в период президентства Виктора Ющенко Бандера | ||
| PER─────────── PER──── | ||
| был посмертно признан Героем Украины, однако впоследствии это решение | ||
| LOC──── | ||
| было отменено судом. | ||
| ####### | ||
| # | ||
| # PHRASE NORM | ||
| # | ||
| ####### | ||
| >>> for span in doc.spans: | ||
| >>>     span.normalize(morph_vocab) | ||
| >>> display(doc.spans[:5]) | ||
| >>> {_.text: _.normal for _ in doc.spans if _.text != _.normal} | ||
| [DocSpan(start=6, stop=13, type='LOC', text='Израиля', tokens=[...], normal='Израиль'), | ||
| DocSpan(start=17, stop=24, type='LOC', text='Украине', tokens=[...], normal='Украина'), | ||
| DocSpan(start=25, stop=35, type='PER', text='Йоэль Лион', tokens=[...], normal='Йоэль Лион'), | ||
| DocSpan(start=89, stop=106, type='LOC', text='Львовской области', tokens=[...], normal='Львовская область'), | ||
| DocSpan(start=152, stop=158, type='LOC', text='России', tokens=[...], normal='Россия')] | ||
| {'Израиля': 'Израиль', | ||
| 'Украине': 'Украина', | ||
| 'Львовской области': 'Львовская область', | ||
| 'России': 'Россия', | ||
| 'Организации украинских националистов (ОУН)': 'Организация украинских националистов (ОУН)', | ||
| 'Степана Бандеры': 'Степан Бандера', | ||
| 'Петру Порошенко': 'Петр Порошенко', | ||
| 'Бандере': 'Бандера', | ||
| 'Украины': 'Украина', | ||
| 'Верховной Рады': 'Верховная Рада', | ||
| 'Бандеры': 'Бандера', | ||
| 'Организации украинских националистов': 'Организация украинских националистов', | ||
| 'Виктора Ющенко': 'Виктор Ющенко'} | ||
| ####### | ||
| # | ||
| # FACT | ||
| # | ||
| ###### | ||
| >>> for span in doc.spans: | ||
| >>>     if span.type == PER: | ||
| >>>         span.extract_fact(names_extractor) | ||
| >>> {_.normal: _.fact for _ in doc.spans if _.type == PER} | ||
| {'Йоэль Лион': Name( | ||
| first='Йоэль', | ||
| last='Лион' | ||
| ), | ||
| 'Степан Бандера': Name( | ||
| first='Степан', | ||
| last='Бандера' | ||
| ), | ||
| 'Петр Порошенко': Name( | ||
| first='Петр', | ||
| last='Порошенко' | ||
| ), | ||
| 'Бандера': Name( | ||
| last='Бандера' | ||
| ), | ||
| 'Виктор Ющенко': Name( | ||
| first='Виктор', | ||
| last='Ющенко' | ||
| )} | ||
| ``` | ||
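The ASCII underline markup printed by `doc.ner.print()` above can be approximated in a few lines of plain Python. This is a simplified illustrative sketch of the idea only; Natasha's actual renderer lives in Ipymarkup and handles line wrapping and overlapping spans:

```python
def render_spans(text, spans):
    """Render NER spans as an underline row beneath the text.

    spans: iterable of (start, stop, type) character offsets,
    in the same shape as DocSpan(start=..., stop=..., type=...).
    """
    line = [' '] * len(text)
    for start, stop, type_ in spans:
        # Write the type label, padded with '─' to the span width
        label = type_[:stop - start].ljust(stop - start, '─')
        line[start:stop] = label
    return text + '\n' + ''.join(line).rstrip()


text = 'Посол Израиля на Украине'
spans = [(6, 13, 'LOC'), (17, 24, 'LOC')]
print(render_spans(text, spans))
```

Running this prints the text with `LOC────` underlines beneath the two location spans, matching the style of the output shown above.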
| ## Evaluation | ||
| ### Segmentation | ||
| Natasha uses <a href="https://github.com/natasha/razdel">Razdel</a> for text segmentation. | ||
| `errors` — number of errors aggregated over 4 datasets, see the <a href="https://github.com/natasha/razdel#quality-performance">Razdel evaluation section</a> for more info. | ||
| <!--- token ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>errors</th> | ||
| <th>time, s</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>razdel.tokenize</th> | ||
| <td>5439</td> | ||
| <td>9.898350</td> | ||
| </tr> | ||
| <tr> | ||
| <th>mystem</th> | ||
| <td>12192</td> | ||
| <td>17.210470</td> | ||
| </tr> | ||
| <tr> | ||
| <th>spacy</th> | ||
| <td>12288</td> | ||
| <td>19.920618</td> | ||
| </tr> | ||
| <tr> | ||
| <th>nltk.word_tokenize</th> | ||
| <td>130119</td> | ||
| <td>12.405366</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- token ---> | ||
| <!--- sent ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>errors</th> | ||
| <th>time, s</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>razdel.sentenize</th> | ||
| <td>32106</td> | ||
| <td>21.989045</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov/rusenttokenize</th> | ||
| <td>41722</td> | ||
| <td>32.535322</td> | ||
| </tr> | ||
| <tr> | ||
| <th>nltk.sent_tokenize</th> | ||
| <td>60378</td> | ||
| <td>29.916063</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- sent ---> | ||
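The `errors`/`time` numbers above come from Razdel's benchmark suite: count mismatches against a gold segmentation and time the run. The general shape of such a harness can be sketched in plain Python (illustrative only, not Razdel's actual benchmark code; the whitespace tokenizer is a toy stand-in):

```python
import time


def benchmark(segment, texts, gold):
    """Count segmentation errors and measure wall-clock time.

    segment: callable mapping a text to a list of substrings.
    gold: expected segmentations, one per text.
    """
    start = time.perf_counter()
    predictions = [segment(text) for text in texts]
    elapsed = time.perf_counter() - start
    # An "error" here is any text whose segmentation differs from gold
    errors = sum(pred != expected for pred, expected in zip(predictions, gold))
    return errors, elapsed


# Naive whitespace splitting as a stand-in for a real tokenizer
texts = ['Посол Израиля', 'пришел в шок']
gold = [['Посол', 'Израиля'], ['пришел', 'в', 'шок']]
errors, elapsed = benchmark(str.split, texts, gold)
print(errors)  # 0 errors on this toy data
```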
| ### Embedding | ||
| Natasha uses <a href="https://github.com/natasha/navec">Navec pretrained embeddings</a>. | ||
| `precision` — average precision over 4 datasets, see the <a href="https://github.com/natasha/navec#evaluation">Navec evaluation section</a> for more info. | ||
| <!--- emb1 ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>type</th> | ||
| <th>precision</th> | ||
| <th>init, s</th> | ||
| <th>disk, mb</th> | ||
| <th>ram, mb</th> | ||
| <th>vocab</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>hudlit_12B_500K_300d_100q</th> | ||
| <td>navec</td> | ||
| <td>0.825</td> | ||
| <td>1.0</td> | ||
| <td>50.6</td> | ||
| <td>95.3</td> | ||
| <td>500K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>news_1B_250K_300d_100q</th> | ||
| <td>navec</td> | ||
| <td>0.775</td> | ||
| <td>0.5</td> | ||
| <td>25.4</td> | ||
| <td>47.7</td> | ||
| <td>250K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>ruscorpora_upos_cbow_300_20_2019</th> | ||
| <td>w2v</td> | ||
| <td>0.777</td> | ||
| <td>12.1</td> | ||
| <td>220.6</td> | ||
| <td>236.1</td> | ||
| <td>189K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>ruwikiruscorpora_upos_skipgram_300_2_2019</th> | ||
| <td>w2v</td> | ||
| <td>0.776</td> | ||
| <td>15.7</td> | ||
| <td>290.0</td> | ||
| <td>309.4</td> | ||
| <td>248K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>tayga_upos_skipgram_300_2_2019</th> | ||
| <td>w2v</td> | ||
| <td>0.795</td> | ||
| <td>15.7</td> | ||
| <td>290.7</td> | ||
| <td>310.9</td> | ||
| <td>249K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>tayga_none_fasttextcbow_300_10_2019</th> | ||
| <td>fasttext</td> | ||
| <td>0.706</td> | ||
| <td>11.3</td> | ||
| <td>2741.9</td> | ||
| <td>2746.9</td> | ||
| <td>192K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>araneum_none_fasttextcbow_300_5_2018</th> | ||
| <td>fasttext</td> | ||
| <td>0.720</td> | ||
| <td>7.8</td> | ||
| <td>2752.1</td> | ||
| <td>2754.7</td> | ||
| <td>195K</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- emb1 ---> | ||
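Precision scores like those above are typically measured on word-similarity benchmarks: cosine similarity between embedding pairs is compared against human judgments. A toy sketch of the core computation, using made-up 3-dimensional vectors (real Navec vectors are 300-dimensional):

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings for illustration only
emb = {
    'январь': [0.9, 0.1, 0.0],
    'декабрь': [0.8, 0.2, 0.1],
    'посол': [0.1, 0.9, 0.3],
}
print(cosine(emb['январь'], emb['декабрь']))  # related words score high
print(cosine(emb['январь'], emb['посол']))    # unrelated words score low
```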
| ### Morphology | ||
| Natasha uses <a href="https://github.com/natasha/slovnet#morphology">Slovnet morphology tagger</a>. | ||
| `accuracy` — accuracy on news dataset, see <a href="https://github.com/natasha/slovnet#morphology-1">Slovnet evaluation section</a> for more. | ||
| <!--- morph1 ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>accuracy</th> | ||
| <th>init, s</th> | ||
| <th>disk, mb</th> | ||
| <th>ram, mb</th> | ||
| <th>speed, sents/s</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>slovnet</th> | ||
| <td>0.961</td> | ||
| <td>1.0</td> | ||
| <td>27</td> | ||
| <td>115</td> | ||
| <td>532.0</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov_bert</th> | ||
| <td>0.951</td> | ||
| <td>20.0</td> | ||
| <td>1393</td> | ||
| <td>8704</td> | ||
| <td>85.0 (gpu)</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov</th> | ||
| <td>0.940</td> | ||
| <td>4.0</td> | ||
| <td>32</td> | ||
| <td>10240</td> | ||
| <td>90.0 (gpu)</td> | ||
| </tr> | ||
| <tr> | ||
| <th>spacy</th> | ||
| <td>0.919</td> | ||
| <td>10.9</td> | ||
| <td>89</td> | ||
| <td>579</td> | ||
| <td>30.6</td> | ||
| </tr> | ||
| <tr> | ||
| <th>udpipe</th> | ||
| <td>0.918</td> | ||
| <td>6.9</td> | ||
| <td>45</td> | ||
| <td>242</td> | ||
| <td>56.2</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- morph1 ---> | ||
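`accuracy` for morphology taggers is plain per-token agreement with gold tags. A minimal sketch of the metric (illustrative, not Slovnet's evaluation code):

```python
def tag_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    assert len(gold) == len(pred)
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


# Toy example: one tag wrong out of five
gold = ['NOUN', 'PROPN', 'ADP', 'PROPN', 'PROPN']
pred = ['NOUN', 'PROPN', 'ADP', 'NOUN', 'PROPN']
print(tag_accuracy(gold, pred))  # 0.8
```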
| ### Syntax | ||
| Natasha uses <a href="https://github.com/natasha/slovnet#syntax">Slovnet syntax parser</a>. | ||
| `uas`, `las` — accuracy on news dataset, see <a href="https://github.com/natasha/slovnet#syntax-1">Slovnet evaluation section</a> for more. | ||
| <!--- syntax1 ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>uas</th> | ||
| <th>las</th> | ||
| <th>init, s</th> | ||
| <th>disk, mb</th> | ||
| <th>ram, mb</th> | ||
| <th>speed, sents/s</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>slovnet</th> | ||
| <td>0.907</td> | ||
| <td>0.880</td> | ||
| <td>1.0</td> | ||
| <td>27</td> | ||
| <td>125</td> | ||
| <td>450.0</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov_bert</th> | ||
| <td>0.962</td> | ||
| <td>0.910</td> | ||
| <td>34.0</td> | ||
| <td>1427</td> | ||
| <td>8704</td> | ||
| <td>75.0 (gpu)</td> | ||
| </tr> | ||
| <tr> | ||
| <th>spacy</th> | ||
| <td>0.876</td> | ||
| <td>0.818</td> | ||
| <td>10.9</td> | ||
| <td>89</td> | ||
| <td>579</td> | ||
| <td>31.6</td> | ||
| </tr> | ||
| <tr> | ||
| <th>udpipe</th> | ||
| <td>0.873</td> | ||
| <td>0.823</td> | ||
| <td>6.9</td> | ||
| <td>45</td> | ||
| <td>242</td> | ||
| <td>56.2</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- syntax1 ---> | ||
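`uas` counts tokens attached to the correct head; `las` additionally requires the correct relation label. A minimal sketch of both metrics for one sentence, using (head_id, rel) pairs like those on the `DocToken` objects above (illustrative, not the actual evaluation code):

```python
def uas_las(gold, pred):
    """Unlabeled and labeled attachment scores.

    gold, pred: lists of (head_id, rel) pairs, one per token.
    """
    total = len(gold)
    # UAS: head must match; LAS: head and relation must both match
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / total
    las = sum(g == p for g, p in zip(gold, pred)) / total
    return uas, las


gold = [('1_7', 'nsubj'), ('1_1', 'nmod'), ('1_4', 'case'), ('1_1', 'nmod')]
pred = [('1_7', 'nsubj'), ('1_1', 'nmod'), ('1_4', 'case'), ('1_1', 'appos')]
uas, las = uas_las(gold, pred)
print(uas, las)  # 1.0 0.75
```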
| ### NER | ||
| Natasha uses <a href="https://github.com/natasha/slovnet#ner">Slovnet NER tagger</a>. | ||
| `f1` — score aggregated over 4 datasets, see <a href="https://github.com/natasha/slovnet#ner-1">Slovnet evaluation section</a> for more. | ||
| <!--- ner1 ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>PER/LOC/ORG f1</th> | ||
| <th>init, s</th> | ||
| <th>disk, mb</th> | ||
| <th>ram, mb</th> | ||
| <th>speed, articles/s</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>slovnet</th> | ||
| <td>0.97/0.91/0.85</td> | ||
| <td>1.0</td> | ||
| <td>27</td> | ||
| <td>205</td> | ||
| <td>25.3</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov_bert</th> | ||
| <td>0.98/0.92/0.86</td> | ||
| <td>34.5</td> | ||
| <td>2048</td> | ||
| <td>6144</td> | ||
| <td>13.1 (gpu)</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov</th> | ||
| <td>0.92/0.86/0.76</td> | ||
| <td>5.9</td> | ||
| <td>1024</td> | ||
| <td>3072</td> | ||
| <td>24.3 (gpu)</td> | ||
| </tr> | ||
| <tr> | ||
| <th>pullenti</th> | ||
| <td>0.92/0.82/0.64</td> | ||
| <td>2.9</td> | ||
| <td>16</td> | ||
| <td>253</td> | ||
| <td>6.0</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- ner1 ---> | ||
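NER `f1` is computed over exact span matches: a predicted span counts as correct only when both its boundaries and its type match a gold span. A minimal sketch (illustrative, not the actual Slovnet evaluation code):

```python
def span_f1(gold, pred):
    """F1 over exact (start, stop, type) span matches."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # true positives: exact boundary and type match
    if not tp:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: one span has the wrong type, so it counts as an error
gold = {(6, 13, 'LOC'), (17, 24, 'LOC'), (25, 35, 'PER')}
pred = {(6, 13, 'LOC'), (17, 24, 'ORG'), (25, 35, 'PER')}
print(round(span_f1(gold, pred), 3))
```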
| ## Support | ||
| - Chat — https://telegram.me/natural_language_processing | ||
| - Issues — https://github.com/natasha/natasha/issues | ||
| - Commercial support — http://lab.alexkuk.ru/natasha | ||
| ## Development | ||
| Tests: | ||
| ```bash | ||
| make test | ||
| ``` | ||
| Package: | ||
| ```bash | ||
| make version | ||
| git push | ||
| git push --tags | ||
| make clean package publish | ||
| ``` | ||
| Keywords: natural language processing,russian | ||
| Platform: UNKNOWN | ||
| Classifier: License :: OSI Approved :: MIT License | ||
| Classifier: Programming Language :: Python :: 3 | ||
| Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence | ||
| Classifier: Topic :: Text Processing :: Linguistic | ||
| Description-Content-Type: text/markdown |
| PER─────────── | ||
| одним из лидеров Организации украинских националистов, выступающей за | ||
| ORG───────────────────────────────── | ||
| создание независимого государства на территориях с украиноязычным | ||
| населением. В 2010 году в период президентства Виктора Ющенко Бандера | ||
| PER─────────── PER──── | ||
| был посмертно признан Героем Украины, однако впоследствии это решение | ||
| LOC──── | ||
| было отменено судом. | ||
| ####### | ||
| # | ||
| # PHRASE NORM | ||
| # | ||
| ####### | ||
| >>> for span in doc.spans: | ||
| >>> span.normalize(morph_vocab) | ||
| >>> display(doc.spans[:5]) | ||
| >>> {_.text: _.normal for _ in doc.spans if _.text != _.normal} | ||
| [DocSpan(start=6, stop=13, type='LOC', text='Израиля', tokens=[...], normal='Израиль'), | ||
| DocSpan(start=17, stop=24, type='LOC', text='Украине', tokens=[...], normal='Украина'), | ||
| DocSpan(start=25, stop=35, type='PER', text='Йоэль Лион', tokens=[...], normal='Йоэль Лион'), | ||
| DocSpan(start=89, stop=106, type='LOC', text='Львовской области', tokens=[...], normal='Львовская область'), | ||
| DocSpan(start=152, stop=158, type='LOC', text='России', tokens=[...], normal='Россия')] | ||
| {'Израиля': 'Израиль', | ||
| 'Украине': 'Украина', | ||
| 'Львовской области': 'Львовская область', | ||
| 'России': 'Россия', | ||
| 'Организации украинских националистов (ОУН)': 'Организация украинских националистов (ОУН)', | ||
| 'Степана Бандеры': 'Степан Бандера', | ||
| 'Петру Порошенко': 'Петр Порошенко', | ||
| 'Бандере': 'Бандера', | ||
| 'Украины': 'Украина', | ||
| 'Верховной Рады': 'Верховная Рада', | ||
| 'Бандеры': 'Бандера', | ||
| 'Организации украинских националистов': 'Организация украинских националистов', | ||
| 'Виктора Ющенко': 'Виктор Ющенко'} | ||
| ####### | ||
| # | ||
| # FACT | ||
| # | ||
| ###### | ||
| >>> for span in doc.spans: | ||
| >>> if span.type == PER: | ||
| >>> span.extract_fact(names_extractor) | ||
| >>> {_.normal: _.fact for _ in doc.spans if _.type == PER} | ||
| {'Йоэль Лион': Name( | ||
| first='Йоэль', | ||
| last='Лион' | ||
| ), | ||
| 'Степан Бандера': Name( | ||
| first='Степан', | ||
| last='Бандера' | ||
| ), | ||
| 'Петр Порошенко': Name( | ||
| first='Петр', | ||
| last='Порошенко' | ||
| ), | ||
| 'Бандера': Name( | ||
| last='Бандера' | ||
| ), | ||
| 'Виктор Ющенко': Name( | ||
| first='Виктор', | ||
| last='Ющенко' | ||
| )} | ||
| ``` | ||
| ## Evaluation | ||
| ### Segmentation | ||
| Natasha uses <a href="https://github.com/natasha/razdel">Razdel</a> for text segmentation. | ||
| `errors` — number of errors aggregated over 4 datasets, see <a href="https://github.com/natasha/razdel#quality-performance">Razdel evaluation section</a> for more info. | ||
| <!--- token ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>errors</th> | ||
| <th>time</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>razdel.tokenize</th> | ||
| <td>5439</td> | ||
| <td>9.898350</td> | ||
| </tr> | ||
| <tr> | ||
| <th>mystem</th> | ||
| <td>12192</td> | ||
| <td>17.210470</td> | ||
| </tr> | ||
| <tr> | ||
| <th>spacy</th> | ||
| <td>12288</td> | ||
| <td>19.920618</td> | ||
| </tr> | ||
| <tr> | ||
| <th>nltk.word_tokenize</th> | ||
| <td>130119</td> | ||
| <td>12.405366</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- token ---> | ||
| <!--- sent ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>errors</th> | ||
| <th>time</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>razdel.sentenize</th> | ||
| <td>32106</td> | ||
| <td>21.989045</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov/rusenttokenize</th> | ||
| <td>41722</td> | ||
| <td>32.535322</td> | ||
| </tr> | ||
| <tr> | ||
| <th>nltk.sent_tokenize</th> | ||
| <td>60378</td> | ||
| <td>29.916063</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- sent ---> | ||
| ### Embedding | ||
| Natasha uses <a href="https://github.com/natasha/navec">Navec pretrained embeddings</a>. | ||
| `precision` — average precision over 4 datasets, see <a href="https://github.com/natasha/navec#evaluation">Navec evaluation section</a> for more info. | ||
| <!--- emb1 ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>type</th> | ||
| <th>precision</th> | ||
| <th>init, s</th> | ||
| <th>disk, mb</th> | ||
| <th>ram, mb</th> | ||
| <th>vocab</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>hudlit_12B_500K_300d_100q</th> | ||
| <td>navec</td> | ||
| <td>0.825</td> | ||
| <td>1.0</td> | ||
| <td>50.6</td> | ||
| <td>95.3</td> | ||
| <td>500K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>news_1B_250K_300d_100q</th> | ||
| <td>navec</td> | ||
| <td>0.775</td> | ||
| <td>0.5</td> | ||
| <td>25.4</td> | ||
| <td>47.7</td> | ||
| <td>250K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>ruscorpora_upos_cbow_300_20_2019</th> | ||
| <td>w2v</td> | ||
| <td>0.777</td> | ||
| <td>12.1</td> | ||
| <td>220.6</td> | ||
| <td>236.1</td> | ||
| <td>189K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>ruwikiruscorpora_upos_skipgram_300_2_2019</th> | ||
| <td>w2v</td> | ||
| <td>0.776</td> | ||
| <td>15.7</td> | ||
| <td>290.0</td> | ||
| <td>309.4</td> | ||
| <td>248K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>tayga_upos_skipgram_300_2_2019</th> | ||
| <td>w2v</td> | ||
| <td>0.795</td> | ||
| <td>15.7</td> | ||
| <td>290.7</td> | ||
| <td>310.9</td> | ||
| <td>249K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>tayga_none_fasttextcbow_300_10_2019</th> | ||
| <td>fasttext</td> | ||
| <td>0.706</td> | ||
| <td>11.3</td> | ||
| <td>2741.9</td> | ||
| <td>2746.9</td> | ||
| <td>192K</td> | ||
| </tr> | ||
| <tr> | ||
| <th>araneum_none_fasttextcbow_300_5_2018</th> | ||
| <td>fasttext</td> | ||
| <td>0.720</td> | ||
| <td>7.8</td> | ||
| <td>2752.1</td> | ||
| <td>2754.7</td> | ||
| <td>195K</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- emb1 ---> | ||
| ### Morphology | ||
| Natasha uses <a href="https://github.com/natasha/slovnet#morphology">Slovnet morphology tagger</a>. | ||
| `accuracy` — accuracy on news dataset, see <a href="https://github.com/natasha/slovnet#morphology-1">Slovnet evaluation section</a> for more. | ||
| <!--- morph1 ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>accuracy</th> | ||
| <th>init, s</th> | ||
| <th>disk, mb</th> | ||
| <th>ram, mb</th> | ||
| <th>speed, sents/s</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>slovnet</th> | ||
| <td>0.961</td> | ||
| <td>1.0</td> | ||
| <td>27</td> | ||
| <td>115</td> | ||
| <td>532.0</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov_bert</th> | ||
| <td>0.951</td> | ||
| <td>20.0</td> | ||
| <td>1393</td> | ||
| <td>8704</td> | ||
| <td>85.0 (gpu)</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov</th> | ||
| <td>0.940</td> | ||
| <td>4.0</td> | ||
| <td>32</td> | ||
| <td>10240</td> | ||
| <td>90.0 (gpu)</td> | ||
| </tr> | ||
| <tr> | ||
| <th>spacy</th> | ||
| <td>0.919</td> | ||
| <td>10.9</td> | ||
| <td>89</td> | ||
| <td>579</td> | ||
| <td>30.6</td> | ||
| </tr> | ||
| <tr> | ||
| <th>udpipe</th> | ||
| <td>0.918</td> | ||
| <td>6.9</td> | ||
| <td>45</td> | ||
| <td>242</td> | ||
| <td>56.2</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- morph1 ---> | ||
| ### Syntax | ||
| Natasha uses <a href="https://github.com/natasha/slovnet#syntax">Slovnet syntax parser</a>. | ||
| `uas`, `las` — unlabeled and labeled attachment scores on news dataset, see <a href="https://github.com/natasha/slovnet#syntax-1">Slovnet evaluation section</a> for more. | ||
| <!--- syntax1 ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>uas</th> | ||
| <th>las</th> | ||
| <th>init, s</th> | ||
| <th>disk, mb</th> | ||
| <th>ram, mb</th> | ||
| <th>speed, sents/s</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>slovnet</th> | ||
| <td>0.907</td> | ||
| <td>0.880</td> | ||
| <td>1.0</td> | ||
| <td>27</td> | ||
| <td>125</td> | ||
| <td>450.0</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov_bert</th> | ||
| <td>0.962</td> | ||
| <td>0.910</td> | ||
| <td>34.0</td> | ||
| <td>1427</td> | ||
| <td>8704</td> | ||
| <td>75.0 (gpu)</td> | ||
| </tr> | ||
| <tr> | ||
| <th>spacy</th> | ||
| <td>0.876</td> | ||
| <td>0.818</td> | ||
| <td>10.9</td> | ||
| <td>89</td> | ||
| <td>579</td> | ||
| <td>31.6</td> | ||
| </tr> | ||
| <tr> | ||
| <th>udpipe</th> | ||
| <td>0.873</td> | ||
| <td>0.823</td> | ||
| <td>6.9</td> | ||
| <td>45</td> | ||
| <td>242</td> | ||
| <td>56.2</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- syntax1 ---> | ||
| ### NER | ||
| Natasha uses <a href="https://github.com/natasha/slovnet#ner">Slovnet NER tagger</a>. | ||
| `f1` — score aggregated over 4 datasets, see <a href="https://github.com/natasha/slovnet#ner-1">Slovnet evaluation section</a> for more. | ||
| <!--- ner1 ---> | ||
| <table border="0" class="dataframe"> | ||
| <thead> | ||
| <tr style="text-align: right;"> | ||
| <th></th> | ||
| <th>PER/LOC/ORG f1</th> | ||
| <th>init, s</th> | ||
| <th>disk, mb</th> | ||
| <th>ram, mb</th> | ||
| <th>speed, articles/s</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <th>slovnet</th> | ||
| <td>0.97/0.91/0.85</td> | ||
| <td>1.0</td> | ||
| <td>27</td> | ||
| <td>205</td> | ||
| <td>25.3</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov_bert</th> | ||
| <td>0.98/0.92/0.86</td> | ||
| <td>34.5</td> | ||
| <td>2048</td> | ||
| <td>6144</td> | ||
| <td>13.1 (gpu)</td> | ||
| </tr> | ||
| <tr> | ||
| <th>deeppavlov</th> | ||
| <td>0.92/0.86/0.76</td> | ||
| <td>5.9</td> | ||
| <td>1024</td> | ||
| <td>3072</td> | ||
| <td>24.3 (gpu)</td> | ||
| </tr> | ||
| <tr> | ||
| <th>pullenti</th> | ||
| <td>0.92/0.82/0.64</td> | ||
| <td>2.9</td> | ||
| <td>16</td> | ||
| <td>253</td> | ||
| <td>6.0</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
| <!--- ner1 ---> | ||
| ## Support | ||
| - Chat — https://telegram.me/natural_language_processing | ||
| - Issues — https://github.com/natasha/natasha/issues | ||
| - Commercial support — http://lab.alexkuk.ru/natasha | ||
| ## Development | ||
| Tests: | ||
| ```bash | ||
| make test | ||
| ``` | ||
| Package: | ||
| ```bash | ||
| make version | ||
| git push | ||
| git push --tags | ||
| make clean package publish | ||
| ``` |
+21
| [bumpversion] | ||
| current_version = 1.1.0 | ||
| files = setup.py natasha/__init__.py | ||
| commit = True | ||
| message = Up version | ||
| tag = True | ||
| [tool:pytest] | ||
| python_files = test_*.py test.py | ||
| pep8ignore = | ||
| E501 # E501 line too long | ||
| W503 # W503 and all( | ||
| markers = | ||
| pep8 | ||
| filterwarnings = | ||
| ignore::DeprecationWarning | ||
| [egg_info] | ||
| tag_build = | ||
| tag_date = 0 | ||
+43
| from setuptools import setup, find_packages | ||
| with open('README.md') as file: | ||
| description = file.read() | ||
| with open('requirements/main.txt') as file: | ||
| requirements = [_.strip() for _ in file] | ||
| setup( | ||
| name='natasha', | ||
| version='1.1.0', | ||
| description='Named-entity recognition for russian language', | ||
| long_description=description, | ||
| long_description_content_type='text/markdown', | ||
| url='https://github.com/natasha/natasha', | ||
| author='Natasha contributors', | ||
| author_email='d.a.veselov@yandex.ru, alex@alexkuk.ru', | ||
| license='MIT', | ||
| classifiers=[ | ||
| 'License :: OSI Approved :: MIT License', | ||
| 'Programming Language :: Python :: 3', | ||
| 'Topic :: Scientific/Engineering :: Artificial Intelligence', | ||
| 'Topic :: Text Processing :: Linguistic', | ||
| ], | ||
| keywords='natural language processing, russian', | ||
| packages=find_packages(), | ||
| package_data={ | ||
| 'natasha': [ | ||
| 'data/dict/*.txt', | ||
| 'data/emb/*.tar', | ||
| 'data/model/*.tar', | ||
| ] | ||
| }, | ||
| install_requires=requirements | ||
| ) |
+17
-13
| from .extractors import ( | ||
| NamesExtractor, | ||
| SimpleNamesExtractor, | ||
| DatesExtractor, | ||
| MoneyExtractor, | ||
| MoneyRateExtractor, | ||
| MoneyRangeExtractor, | ||
| LocationExtractor, | ||
| AddressExtractor, | ||
| OrganisationExtractor, | ||
| PersonExtractor | ||
| ) | ||
| from .const import PER, LOC, ORG # noqa | ||
| from .segment import Segmenter # noqa | ||
| from .morph.vocab import MorphVocab # noqa | ||
| __version__ = '0.10.0' | ||
| from .emb import NewsEmbedding # noqa | ||
| from .morph.tagger import NewsMorphTagger # noqa | ||
| from .syntax import NewsSyntaxParser # noqa | ||
| from .ner import NewsNERTagger # noqa | ||
| from .extractors import NamesExtractor # noqa | ||
| from .extractors import DatesExtractor # noqa | ||
| from .extractors import MoneyExtractor # noqa | ||
| from .extractors import AddrExtractor # noqa | ||
| from .doc import Doc # noqa | ||
| __version__ = '1.1.0' |
+18
-34
@@ -1,43 +0,27 @@ | ||
| # coding: utf-8 | ||
| from __future__ import unicode_literals | ||
| import os | ||
| import json | ||
| from os.path import join, dirname | ||
| from io import open | ||
| DATA_DIR = dirname(__file__) | ||
| DICT_DIR = join(DATA_DIR, 'dict') | ||
| MODEL_DIR = join(DATA_DIR, 'model') | ||
| EMB_DIR = join(DATA_DIR, 'emb') | ||
| def get_path(dir, filename): | ||
| return os.path.join( | ||
| os.path.dirname(__file__), | ||
| dir, filename | ||
| ) | ||
| FIRST = join(DICT_DIR, 'first.txt') | ||
| LAST = join(DICT_DIR, 'last.txt') | ||
| MAYBE_FIRST = join(DICT_DIR, 'maybe_first.txt') | ||
| NEWS_EMBEDDING = join(EMB_DIR, 'navec_news_v1_1B_250K_300d_100q.tar') | ||
| def get_dict_path(filename): | ||
| return get_path('dicts', filename) | ||
| NEWS_MORPH = join(MODEL_DIR, 'slovnet_morph_news_v1.tar') | ||
| NEWS_NER = join(MODEL_DIR, 'slovnet_ner_news_v1.tar') | ||
| NEWS_SYNTAX = join(MODEL_DIR, 'slovnet_syntax_news_v1.tar') | ||
| def get_model_path(filename): | ||
| return get_path('models', filename) | ||
| def maybe_strip_comment(line): | ||
| if '#' in line: | ||
| line = line[:line.index('#')] | ||
| line = line.rstrip() | ||
| return line | ||
| def load_dict(filename): | ||
| path = get_dict_path(filename) | ||
| with open(path, encoding='utf-8') as file: | ||
| def load_dict(path): | ||
| with open(path, encoding='utf8') as file: | ||
| for line in file: | ||
| line = line.rstrip('\n') | ||
| line = maybe_strip_comment(line) | ||
| yield line | ||
| def load_json(path): | ||
| with open(path, encoding='utf-8') as file: | ||
| return json.load(file) | ||
| index = line.find('#') | ||
| if index > 0: | ||
| line = line[:index] | ||
| yield line.rstrip() |
+119
-118
@@ -1,161 +0,162 @@ | ||
| # coding: utf-8 | ||
| from __future__ import unicode_literals | ||
| from collections import OrderedDict | ||
| from itertools import zip_longest | ||
| from yargy import Parser | ||
| from yargy import Parser as YargyParser | ||
| from yargy.morph import MorphAnalyzer | ||
| from yargy.tokenizer import MorphTokenizer | ||
| from .utils import Record | ||
| from .preprocess import normalize_text | ||
| from .markup import format_markup_css | ||
| from .record import Record | ||
| from .tokenizer import TOKENIZER | ||
| from .crf import ( | ||
| CrfTagger, | ||
| ##### | ||
| # | ||
| # OBJS | ||
| # | ||
| ##### | ||
| STREET_MODEL, | ||
| get_street_features, | ||
| NAME_MODEL, | ||
| get_name_features | ||
| ) | ||
| class Obj(Record): | ||
| # default none values | ||
| def __init__(self, *args, **kwargs): | ||
| for key, value in zip_longest(self.__attributes__, args): | ||
| self.__dict__[key] = value | ||
| self.__dict__.update(kwargs) | ||
| from .grammars.name import ( | ||
| SIMPLE_NAME, | ||
| NAME | ||
| ) | ||
| from .grammars.date import DATE | ||
| from .grammars.money import ( | ||
| MONEY, | ||
| RATE as MONEY_RATE, | ||
| RANGE as MONEY_RANGE | ||
| ) | ||
| from .grammars.location import LOCATION | ||
| from .grammars.address import ADDRESS | ||
| from .grammars.organisation import ORGANISATION | ||
| from .grammars.person import PERSON | ||
| # but skip undef values in repr | ||
| def __repr__(self): | ||
| name = self.__class__.__name__ | ||
| args = ', '.join( | ||
| '{key}={value!r}'.format( | ||
| key=_, | ||
| value=getattr(self, _) | ||
| ) | ||
| for _ in self.__attributes__ | ||
| if getattr(self, _) | ||
| ) | ||
| return '{name}({args})'.format( | ||
| name=name, | ||
| args=args | ||
| ) | ||
| from .dsl import can_be_normalized | ||
| def _repr_pretty_(self, printer, cycle): | ||
| name = self.__class__.__name__ | ||
| if cycle: | ||
| printer.text('{name}(...)'.format(name=name)) | ||
| else: | ||
| printer.text('{name}('.format(name=name)) | ||
| pairs = [] | ||
| for key in self.__attributes__: | ||
| value = getattr(self, key) | ||
| if value: | ||
| pairs.append([key, value]) | ||
| def serialize(match): | ||
| span = match.span | ||
| fact = match.fact | ||
| if can_be_normalized(fact): | ||
| fact = fact.normalized | ||
| type = fact.__class__.__name__ | ||
| return OrderedDict([ | ||
| ('type', type), | ||
| ('fact', fact.as_json), | ||
| ('span', span), | ||
| ]) | ||
| size = len(pairs) | ||
| if size: | ||
| with printer.indent(4): | ||
| printer.break_() | ||
| for index, (key, value) in enumerate(pairs): | ||
| printer.text(key + '=') | ||
| printer.pretty(value) | ||
| if index < size - 1: | ||
| printer.text(',') | ||
| printer.break_() | ||
| printer.break_() | ||
| printer.text(')') | ||
| class Matches(Record): | ||
| __attributes__ = ['text', 'matches'] | ||
| class Name(Obj): | ||
| __attributes__ = ['first', 'last', 'middle'] | ||
| def __init__(self, text, matches): | ||
| self.text = text | ||
| self.matches = sorted(matches, key=lambda _: _.span) | ||
| def __iter__(self): | ||
| return iter(self.matches) | ||
| class Date(Obj): | ||
| __attributes__ = ['year', 'month', 'day'] | ||
| def __getitem__(self, index): | ||
| return self.matches[index] | ||
| def __len__(self): | ||
| return len(self.matches) | ||
| class Money(Obj): | ||
| __attributes__ = ['amount', 'currency'] | ||
| def __bool__(self): | ||
| return bool(self.matches) | ||
| @property | ||
| def as_json(self): | ||
| return [serialize(_) for _ in self.matches] | ||
| class AddrPart(Obj): | ||
| __attributes__ = ['value', 'type'] | ||
| def _repr_html_(self): | ||
| spans = [_.span for _ in self.matches] | ||
| return ''.join(format_markup_css(self.text, spans)) | ||
| class Addr(Obj): | ||
| __attributes__ = ['parts'] | ||
| class Extractor(object): | ||
| def __init__(self, rule, tokenizer=TOKENIZER, tagger=None): | ||
| self.parser = Parser(rule, tokenizer=tokenizer, tagger=tagger) | ||
| def __call__(self, text): | ||
| text = normalize_text(text) | ||
| matches = self.parser.findall(text) | ||
| return Matches(text, matches) | ||
| ####### | ||
| # | ||
| # EXTRACTOR | ||
| # | ||
| ###### | ||
| class NamesExtractor(Extractor): | ||
| def __init__(self): | ||
| tagger = CrfTagger( | ||
| NAME_MODEL, | ||
| get_name_features | ||
| ) | ||
| super(NamesExtractor, self).__init__( | ||
| NAME, | ||
| tagger=tagger | ||
| ) | ||
| class Parser(YargyParser): | ||
| def __init__(self, rule, morph): | ||
| # wraps pymorphy subclass | ||
| # add methods check_gram, normalized | ||
| # uses parse method that is cached | ||
| morph = MorphAnalyzer(morph) | ||
| tokenizer = MorphTokenizer(morph=morph) | ||
| YargyParser.__init__(self, rule, tokenizer=tokenizer) | ||
| class SimpleNamesExtractor(Extractor): | ||
| def __init__(self): | ||
| super(SimpleNamesExtractor, self).__init__(SIMPLE_NAME) | ||
| class Match(Record): | ||
| __attributes__ = ['start', 'stop', 'fact'] | ||
| class PersonExtractor(Extractor): | ||
| def __init__(self): | ||
| tagger = CrfTagger( | ||
| NAME_MODEL, | ||
| get_name_features | ||
| ) | ||
| super(PersonExtractor, self).__init__( | ||
| PERSON, | ||
| tagger=tagger | ||
| ) | ||
| def adapt_match(match): | ||
| start, stop = match.span | ||
| fact = match.fact.obj | ||
| return Match(start, stop, fact) | ||
| class DatesExtractor(Extractor): | ||
| def __init__(self): | ||
| super(DatesExtractor, self).__init__(DATE) | ||
| class Extractor: | ||
| def __init__(self, rule, morph): | ||
| self.parser = Parser(rule, morph) | ||
| class MoneyExtractor(Extractor): | ||
| def __init__(self): | ||
| super(MoneyExtractor, self).__init__(MONEY) | ||
| def __call__(self, text): | ||
| for match in self.parser.findall(text): | ||
| yield adapt_match(match) | ||
| def find(self, text): | ||
| match = self.parser.find(text) | ||
| if match: | ||
| return adapt_match(match) | ||
| class MoneyRateExtractor(Extractor): | ||
| def __init__(self): | ||
| super(MoneyRateExtractor, self).__init__(MONEY_RATE) | ||
| class NamesExtractor(Extractor): | ||
| def __init__(self, morph): | ||
| from .grammars.name import NAME | ||
| Extractor.__init__(self, NAME, morph) | ||
| class MoneyRangeExtractor(Extractor): | ||
| def __init__(self): | ||
| super(MoneyRangeExtractor, self).__init__(MONEY_RANGE) | ||
| class DatesExtractor(Extractor): | ||
| def __init__(self, morph): | ||
| from .grammars.date import DATE | ||
| Extractor.__init__(self, DATE, morph) | ||
| class AddressExtractor(Extractor): | ||
| def __init__(self): | ||
| tagger = CrfTagger( | ||
| STREET_MODEL, | ||
| get_street_features | ||
| ) | ||
| super(AddressExtractor, self).__init__( | ||
| ADDRESS, | ||
| tagger=tagger | ||
| ) | ||
| class MoneyExtractor(Extractor): | ||
| def __init__(self, morph): | ||
| from .grammars.money import MONEY | ||
| Extractor.__init__(self, MONEY, morph) | ||
| class LocationExtractor(Extractor): | ||
| def __init__(self): | ||
| super(LocationExtractor, self).__init__(LOCATION) | ||
| class AddrExtractor(Extractor): | ||
| def __init__(self, morph): | ||
| from .grammars.addr import ADDR_PART | ||
| Extractor.__init__(self, ADDR_PART, morph) | ||
| class OrganisationExtractor(Extractor): | ||
| def __init__(self): | ||
| super(OrganisationExtractor, self).__init__(ORGANISATION) | ||
| def find(self, text): | ||
| matches = list(self(text)) | ||
| if not matches: | ||
| return | ||
| matches = sorted(matches, key=lambda _: _.start) | ||
| start = matches[0].start | ||
| stop = matches[-1].stop | ||
| parts = [_.fact for _ in matches] | ||
| return Match(start, stop, Addr(parts)) |
@@ -1,3 +0,1 @@ | ||
| # coding: utf-8 | ||
| from __future__ import unicode_literals | ||
@@ -8,3 +6,3 @@ from yargy import ( | ||
| ) | ||
| from yargy.interpretation import fact, attribute | ||
| from yargy.interpretation import fact | ||
| from yargy.predicates import ( | ||
@@ -18,6 +16,13 @@ eq, gte, lte, length_eq, | ||
| 'Date', | ||
| ['year', 'month', 'day', attribute('current_era', True)] | ||
| ['year', 'month', 'day'] | ||
| ) | ||
| class Date(Date): | ||
| @property | ||
| def obj(self): | ||
| from natasha.extractors import Date | ||
| return Date(self.year, self.month, self.day) | ||
| MONTHS = { | ||
@@ -77,19 +82,2 @@ 'январь': 1, | ||
| ERA_YEAR = and_( | ||
| gte(1), | ||
| lte(100000) | ||
| ).interpretation( | ||
| Date.year.custom(int) | ||
| ) | ||
| ERA_WORD = rule( | ||
| eq('до'), | ||
| or_( | ||
| rule('н', eq('.'), 'э', eq('.').optional()), | ||
| rule(normalized('наша'), normalized('эра')) | ||
| ) | ||
| ).interpretation( | ||
| Date.current_era.const(False) | ||
| ) | ||
| DATE = or_( | ||
@@ -126,9 +114,4 @@ rule( | ||
| ), | ||
| rule( | ||
| ERA_YEAR, | ||
| YEAR_WORD, | ||
| ERA_WORD, | ||
| ) | ||
| ).interpretation( | ||
| Date | ||
| ) |
+14
-120
@@ -1,3 +0,1 @@ | ||
| # coding: utf-8 | ||
| from __future__ import unicode_literals, division | ||
@@ -21,8 +19,7 @@ import re | ||
| from natasha.utils import Record | ||
| from natasha.dsl import ( | ||
| Normalizable, | ||
| money as dsl | ||
| ) | ||
| class Currency: | ||
| RUBLES = 'RUB' | ||
| DOLLARS = 'USD' | ||
| EURO = 'EUR' | ||
@@ -36,5 +33,5 @@ | ||
| class Money(Money, Normalizable): | ||
| class Money(Money): | ||
| @property | ||
| def normalized(self): | ||
| def amount(self): | ||
| amount = self.integer | ||
@@ -47,36 +44,10 @@ if self.fraction: | ||
| amount += self.coins / 100 | ||
| return dsl.Money(amount, self.currency) | ||
| return amount | ||
| Rate = fact( | ||
| 'Rate', | ||
| ['money', 'period'] | ||
| ) | ||
| class Rate(Rate, Normalizable): | ||
| @property | ||
| def normalized(self): | ||
| return dsl.Rate( | ||
| self.money.normalized, | ||
| self.period | ||
| ) | ||
| def obj(self): | ||
| from natasha.extractors import Money | ||
| return Money(self.amount, self.currency) | ||
| Range = fact( | ||
| 'Range', | ||
| ['min', 'max'] | ||
| ) | ||
| class Range(Range, Normalizable): | ||
| @property | ||
| def normalized(self): | ||
| min = self.min.normalized | ||
| max = self.max.normalized | ||
| if not min.currency: | ||
| min.currency = max.currency | ||
| return dsl.Range(min, max) | ||
| DOT = eq('.') | ||
@@ -97,3 +68,3 @@ INT = type('INT') | ||
| ).interpretation( | ||
| const(dsl.EURO) | ||
| const(Currency.EURO) | ||
| ) | ||
@@ -105,3 +76,3 @@ | ||
| ).interpretation( | ||
| const(dsl.DOLLARS) | ||
| const(Currency.DOLLARS) | ||
| ) | ||
@@ -120,3 +91,3 @@ | ||
| ).interpretation( | ||
| const(dsl.RUBLES) | ||
| const(Currency.RUBLES) | ||
| ) | ||
@@ -249,3 +220,3 @@ | ||
| def normalize_integer(value): | ||
| integer = re.sub('[\s.,]+', '', value) | ||
| integer = re.sub(r'[\s.,]+', '', value) | ||
| return int(integer) | ||
@@ -327,78 +298,1 @@ | ||
| ) | ||
| ########### | ||
| # | ||
| # RATE | ||
| # | ||
| ########## | ||
| RATE_MONEY = MONEY.interpretation( | ||
| Rate.money | ||
| ) | ||
| PERIODS = { | ||
| 'день': dsl.DAY, | ||
| 'сутки': dsl.DAY, | ||
| 'час': dsl.HOUR, | ||
| 'смена': dsl.SHIFT | ||
| } | ||
| PERIOD = dictionary( | ||
| PERIODS | ||
| ).interpretation( | ||
| Rate.period.normalized().custom(PERIODS.__getitem__) | ||
| ) | ||
| PER = or_( | ||
| eq('/'), | ||
| in_caseless({'в', 'за'}) | ||
| ) | ||
| RATE = rule( | ||
| RATE_MONEY, | ||
| PER, | ||
| PERIOD | ||
| ).interpretation( | ||
| Rate | ||
| ) | ||
| ####### | ||
| # | ||
| # RANGE | ||
| # | ||
| ######## | ||
| DASH = eq('-') | ||
| RANGE_MONEY = rule( | ||
| AMOUNT, | ||
| CURRENCY.optional() | ||
| ).interpretation( | ||
| Money | ||
| ) | ||
| RANGE_MIN = rule( | ||
| eq('от').optional(), | ||
| RANGE_MONEY.interpretation( | ||
| Range.min | ||
| ) | ||
| ) | ||
| RANGE_MAX = rule( | ||
| eq('до').optional(), | ||
| RANGE_MONEY.interpretation( | ||
| Range.max | ||
| ) | ||
| ) | ||
| RANGE = rule( | ||
| RANGE_MIN, | ||
| DASH.optional(), | ||
| RANGE_MAX | ||
| ).interpretation( | ||
| Range | ||
| ) |
+48
-74
@@ -1,3 +0,1 @@ | ||
| # coding: utf-8 | ||
| from __future__ import unicode_literals | ||
@@ -10,39 +8,37 @@ from yargy import ( | ||
| from yargy.predicates import ( | ||
| eq, length_eq, | ||
| gram, tag, | ||
| is_single, is_capitalized | ||
| gram, | ||
| is_capitalized | ||
| ) | ||
| from yargy.predicates.bank import DictionaryPredicate as dictionary | ||
| from yargy.relations import gnc_relation | ||
| from natasha.data import load_dict | ||
| from natasha.data import ( | ||
| FIRST, MAYBE_FIRST, LAST, | ||
| load_dict | ||
| ) | ||
| from yargy.rule.transformators import RuleTransformator | ||
| from yargy.rule.constructors import Rule | ||
| from yargy.predicates.constructors import AndPredicate | ||
| Name = fact( | ||
| 'Name', | ||
| ['first', 'middle', 'last', 'nick'] | ||
| ['first', 'last', 'middle'] | ||
| ) | ||
| FIRST_DICT = set(load_dict('first.txt')) | ||
| MAYBE_FIRST_DICT = set(load_dict('maybe_first.txt')) | ||
| LAST_DICT = set(load_dict('last.txt')) | ||
| class Name(Name): | ||
| @property | ||
| def obj(self): | ||
| from natasha.extractors import Name | ||
| return Name(self.first, self.last, self.middle) | ||
| ########## | ||
| # | ||
| # COMPONENTS | ||
| # | ||
| ########### | ||
| FIRST_DICT = { | ||
| item | ||
| for path in (FIRST, MAYBE_FIRST) | ||
| for item in load_dict(path) | ||
| } | ||
| LAST_DICT = set(load_dict(LAST)) | ||
| IN_FIRST = dictionary(FIRST_DICT) | ||
| IN_MAYBE_FIRST = dictionary(MAYBE_FIRST_DICT) | ||
| IN_LAST = dictionary(LAST_DICT) | ||
| gnc = gnc_relation() | ||
| TITLE = is_capitalized() | ||
@@ -57,9 +53,3 @@ | ||
| TITLE = is_capitalized() | ||
| NOUN = gram('NOUN') | ||
| NAME_CRF = tag('I') | ||
| ABBR = gram('Abbr') | ||
| SURN = gram('Surn') | ||
| NAME = and_( | ||
@@ -74,12 +64,8 @@ gram('Name'), | ||
| FIRST = and_( | ||
| NAME_CRF, | ||
| or_( | ||
| NAME, | ||
| IN_MAYBE_FIRST, | ||
| IN_FIRST | ||
| ) | ||
| FIRST = or_( | ||
| NAME, | ||
| IN_FIRST | ||
| ).interpretation( | ||
| Name.first.inflected() | ||
| ).match(gnc) | ||
| Name.first | ||
| ) | ||
@@ -91,3 +77,3 @@ FIRST_ABBR = and_( | ||
| Name.first | ||
| ).match(gnc) | ||
| ) | ||
@@ -102,13 +88,17 @@ | ||
| LAST = and_( | ||
| NAME_CRF, | ||
| or_( | ||
| SURN, | ||
| IN_LAST | ||
| ) | ||
| SURN = gram('Surn') | ||
| LAST = or_( | ||
| SURN, | ||
| IN_LAST | ||
| ).interpretation( | ||
| Name.last.inflected() | ||
| ).match(gnc) | ||
| Name.last | ||
| ) | ||
| MAYBE_LAST = and_( | ||
| TITLE, | ||
| not_(ABBR) | ||
| ).interpretation(Name.last) | ||
| ######## | ||
@@ -122,4 +112,4 @@ # | ||
| MIDDLE = PATR.interpretation( | ||
| Name.middle.inflected() | ||
| ).match(gnc) | ||
| Name.middle | ||
| ) | ||
@@ -131,3 +121,3 @@ MIDDLE_ABBR = and_( | ||
| Name.middle | ||
| ).match(gnc) | ||
| ) | ||
@@ -144,7 +134,7 @@ | ||
| FIRST, | ||
| LAST | ||
| MAYBE_LAST | ||
| ) | ||
| LAST_FIRST = rule( | ||
| LAST, | ||
| MAYBE_LAST, | ||
| FIRST | ||
@@ -164,7 +154,7 @@ ) | ||
| '.', | ||
| LAST | ||
| MAYBE_LAST | ||
| ) | ||
| LAST_ABBR_FIRST = rule( | ||
| LAST, | ||
| MAYBE_LAST, | ||
| FIRST_ABBR, | ||
@@ -179,7 +169,7 @@ '.', | ||
| '.', | ||
| LAST | ||
| MAYBE_LAST | ||
| ) | ||
| LAST_ABBR_FIRST_MIDDLE = rule( | ||
| LAST, | ||
| MAYBE_LAST, | ||
| FIRST_ABBR, | ||
@@ -207,7 +197,7 @@ '.', | ||
| MIDDLE, | ||
| LAST | ||
| MAYBE_LAST | ||
| ) | ||
| LAST_FIRST_MIDDLE = rule( | ||
| LAST, | ||
| MAYBE_LAST, | ||
| FIRST, | ||
@@ -255,17 +245,1 @@ MIDDLE | ||
| ) | ||
| class StripCrfTransformator(RuleTransformator): | ||
| def visit_term(self, item): | ||
| if isinstance(item, Rule): | ||
| return self.visit(item) | ||
| elif isinstance(item, AndPredicate): | ||
| predicates = [_ for _ in item.predicates if _ != NAME_CRF] | ||
| return AndPredicate(predicates) | ||
| else: | ||
| return item | ||
| SIMPLE_NAME = NAME.transform( | ||
| StripCrfTransformator | ||
| ) |
@@ -1,71 +0,36 @@ | ||
| # coding: utf-8 | ||
| from __future__ import unicode_literals | ||
| import pytest | ||
| from natasha import DatesExtractor | ||
| from natasha.grammars.date import Date | ||
| from natasha.extractors import Date | ||
| @pytest.fixture(scope='module') | ||
| def extractor(): | ||
| return DatesExtractor() | ||
| tests = [ | ||
| [ | ||
| '24.01.2017', | ||
| Date( | ||
| year=2017, | ||
| month=1, | ||
| day=24 | ||
| ) | ||
| Date(2017, 1, 24) | ||
| ], | ||
| [ | ||
| '27. 05.99', | ||
| Date( | ||
| year=1999, | ||
| month=5, | ||
| day=27 | ||
| ) | ||
| Date(1999, 5, 27) | ||
| ], | ||
| [ | ||
| '2015 год', | ||
| Date(year=2015) | ||
| Date(2015) | ||
| ], | ||
| [ | ||
| '2014 г', | ||
| Date(year=2014) | ||
| Date(2014) | ||
| ], | ||
| [ | ||
| '1 апреля', | ||
| Date( | ||
| month=4, | ||
| day=1 | ||
| ) | ||
| Date(None, 4, 1) | ||
| ], | ||
| [ | ||
| 'май 2017 г.', | ||
| Date( | ||
| year=2017, | ||
| month=5 | ||
| ) | ||
| Date(2017, 5) | ||
| ], | ||
| [ | ||
| '9 мая 2017 года', | ||
| Date( | ||
| year=2017, | ||
| month=5, | ||
| day=9, | ||
| current_era=True, | ||
| ) | ||
| Date(2017, 5, 9) | ||
| ], | ||
| [ | ||
| '5000 год до н.э.', | ||
| Date(year=5000, current_era=False) | ||
| ], | ||
| [ | ||
| '100 г. до нашей эры', | ||
| Date(year=100, current_era=False) | ||
| ] | ||
| ] | ||
@@ -75,7 +40,5 @@ | ||
| @pytest.mark.parametrize('test', tests) | ||
| def test_extractor(extractor, test): | ||
| line, etalon = test | ||
| matches = list(extractor(line)) | ||
| assert len(matches) == 1 | ||
| guess = matches[0].fact | ||
| assert guess == etalon | ||
| def test_extractor(dates_extractor, test): | ||
| text, target = test | ||
| pred = dates_extractor.find(text).fact | ||
| assert pred == target |
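The rewritten test constructs `Date` positionally with `None` placeholders (e.g. `Date(None, 4, 1)` for '1 апреля'). A minimal standalone sketch of why that matches keyword construction, using a hypothetical stand-in class rather than the real record imported from `natasha.extractors`:

```python
# Hypothetical stand-in for natasha's Date record; shown only to illustrate
# how positional construction with None defaults compares equal to keyword
# construction in the parametrized assertions above.
class Date:
    def __init__(self, year=None, month=None, day=None):
        self.year, self.month, self.day = year, month, day

    def __eq__(self, other):
        return (self.year, self.month, self.day) == \
               (other.year, other.month, other.day)

partial = Date(None, 4, 1)   # day and month known, year missing
full = Date(2017, 5, 9)      # all three components present
```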
@@ -1,14 +0,5 @@ | ||
| # coding: utf-8 | ||
| from __future__ import unicode_literals | ||
| import pytest | ||
| from natasha import MoneyExtractor | ||
| @pytest.fixture(scope='module') | ||
| def extractor(): | ||
| return MoneyExtractor() | ||
| tests = [ | ||
@@ -25,54 +16,53 @@ [ | ||
| '420 долларов', | ||
| '420 USD' | ||
| '420.00 USD' | ||
| ], | ||
| [ | ||
| '20 млн руб', | ||
| '20000000 RUB'], | ||
| '20000000.00 RUB'], | ||
| [ | ||
| '20 000 долларов', | ||
| '20000 USD' | ||
| '20000.00 USD' | ||
| ], | ||
| [ | ||
| '2,2 млн.руб.', | ||
| '2200000.0 RUB' | ||
| '2200000.00 RUB' | ||
| ], | ||
| [ | ||
| '2,20 млн.руб.', | ||
| '2200000.0 RUB' | ||
| '2200000.00 RUB' | ||
| ], | ||
| [ | ||
| '2,02 млн.руб.', | ||
| '2020000.0 RUB' | ||
| '2020000.00 RUB' | ||
| ], | ||
| [ | ||
| '20 тыс руб', | ||
| '20000 RUB' | ||
| '20000.00 RUB' | ||
| ], | ||
| [ | ||
| '20 т. р.', | ||
| '20000 RUB' | ||
| '20000.00 RUB' | ||
| ], | ||
| [ | ||
| '2 200 000 руб.', | ||
| '2200000 RUB' | ||
| '2200000.00 RUB' | ||
| ], | ||
| [ | ||
| '20.000 руб.', | ||
| '20000 RUB' | ||
| '20000.00 RUB' | ||
| ], | ||
| [ | ||
| '20,000 руб', | ||
| '20000 RUB' | ||
| '20000.00 RUB' | ||
| ], | ||
| [ | ||
| '20,00 руб', | ||
| '20 RUB' | ||
| '20.00 RUB' | ||
| ], | ||
| [ | ||
| '124 451 рубль 50 копеек', | ||
| '124451.5 RUB', | ||
| '124451.50 RUB', | ||
| ], | ||
| [ | ||
| ('881 913 (Восемьсот восемьдесят одна ' | ||
| 'тысяча девятьсот тринадцать) руб. 98 коп.'), | ||
| '881 913 (Восемьсот восемьдесят одна тысяча девятьсот тринадцать) руб. 98 коп.', | ||
| '881913.98 RUB' | ||
@@ -84,8 +74,6 @@ ] | ||
| @pytest.mark.parametrize('test', tests) | ||
| def test_extractor(extractor, test): | ||
| line, etalon = test | ||
| matches = list(extractor(line)) | ||
| assert len(matches) == 1 | ||
| fact = matches[0].fact | ||
| guess = str(fact.normalized) | ||
| assert guess == etalon | ||
| def test_extractor(money_extractor, test): | ||
| text, target = test | ||
| fact = money_extractor.find(text).fact | ||
| pred = '{fact.amount:0.2f} {fact.currency}'.format(fact=fact) | ||
| assert pred == target |
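The updated assertion builds the expected string with a fixed two-decimal format spec, which is why the targets above all carry trailing `.00`. A quick standalone check of that formatting, with a hypothetical stand-in for the extracted fact:

```python
# Hypothetical stand-in for the fact returned by natasha's MoneyExtractor;
# only the '{:0.2f}' format spec used in the test is exercised here.
class Money:
    def __init__(self, amount, currency):
        self.amount = amount
        self.currency = currency

fact = Money(124451.5, 'RUB')
# 124451.5 is rendered with exactly two decimal places
pred = '{fact.amount:0.2f} {fact.currency}'.format(fact=fact)
```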
@@ -1,59 +0,72 @@ | ||
| # coding: utf-8 | ||
| from __future__ import unicode_literals | ||
| import pytest | ||
| from natasha import SimpleNamesExtractor | ||
| from natasha.grammars.name import Name | ||
| from natasha.extractors import Name | ||
| @pytest.fixture(scope='module') | ||
| def extractor(): | ||
| return SimpleNamesExtractor() | ||
| tests = [ | ||
| ['Мустафа Джемилев', Name(first='мустафа', last='джемилев')], | ||
| ['Егору Свиридову', Name(first='егор', last='свиридов')], | ||
| ['Стрыжак Алеся', Name(first='алеся', last='стрыжак')], | ||
| ['владимир путин', Name(first='владимир', last='путин')], | ||
| ['плаксюк саша', Name(first='саша', last='плаксюк')], | ||
| ['О. Дерипаска', Name(first='О', last='дерипаск')], | ||
| ['Ищенко Е.П.', Name(first='Е', last='ищенко', middle='П')], | ||
| ['Фёдора Ивановича Шаляпина', | ||
| Name(first='фёдор', last='шаляпин', middle='иванович')], | ||
| ['Ипполит Матвеевич', Name(first='ипполит', middle='матвеевич')], | ||
| ['Януковичем', Name(last='янукович')], | ||
| ['Авраама', Name(first='авраам')], | ||
| ['Гоша Куценко', Name(first='гоша', last='куценко')], | ||
| [ | ||
| 'Мустафа Джемилев', | ||
| Name('Мустафа', 'Джемилев') | ||
| ], | ||
| [ | ||
| 'Егор Свиридов', | ||
| Name('Егор', 'Свиридов') | ||
| ], | ||
| [ | ||
| 'Владимир Путин', | ||
| Name('Владимир', 'Путин') | ||
| ], | ||
| [ | ||
| 'Плаксюк Саша', | ||
| Name('Саша', 'Плаксюк') | ||
| ], | ||
| [ | ||
| 'О. Дерипаска', | ||
| Name('О', 'Дерипаска') | ||
| ], | ||
| [ | ||
| 'Ищенко Е.П.', | ||
| Name('Е', 'Ищенко', 'П') | ||
| ], | ||
| [ | ||
| 'Фёдор Иванович Шаляпин', | ||
| Name('Фёдор', 'Шаляпин', 'Иванович') | ||
| ], | ||
| [ | ||
| 'Ипполит Матвеевич', | ||
| Name('Ипполит', 'Матвеевич') | ||
| ], | ||
| [ | ||
| 'Янукович', | ||
| Name(None, 'Янукович') | ||
| ], | ||
| [ | ||
| 'Авраам', | ||
| Name('Авраам') | ||
| ], | ||
| [ | ||
| 'Гоша Куценко', | ||
| Name('Гоша', 'Куценко') | ||
| ], | ||
| [ | ||
| 'Юрий Георгиевич Куценко', | ||
| Name(first='юрий', last='куценко', middle='георгиевич'), | ||
| Name('Юрий', 'Куценко', 'Георгиевич') | ||
| ], | ||
| ['Наталья Ищенко', Name(first='наталья', last='ищенко')], | ||
| [ | ||
| 'Наталья Сергеевна Ищенко', | ||
| Name(first='наталья', last='ищенко', middle='сергеевна'), | ||
| 'Наталья Ищенко', | ||
| Name('Наталья', 'Ищенко') | ||
| ], | ||
| [ | ||
| 'МОНИНОЙ Нине Гафуровне', | ||
| Name(first='нина', last='монина', middle='гафуровна') | ||
| 'Наталья Сергеевна Ищенко', | ||
| Name('Наталья', 'Ищенко', 'Сергеевна') | ||
| ], | ||
| [ | ||
| 'АЗЫЕВОЙ ГАЛИНЕ АЛЕКСАНДРОВНЕ', | ||
| Name(first='галина', last='азыева', middle='александровна') | ||
| 'Монина Нина Гафуровна', | ||
| Name('Нина', 'Монина', 'Гафуровна') | ||
| ], | ||
| [ | ||
| 'В. И. Ленин', | ||
| Name(first='В', last='ленин', middle='И', nick=None), | ||
| Name('В', 'Ленин', 'И'), | ||
| ], | ||
| # TODO | ||
| # With one version of the dictionaries the result is горбачёв, with another it is горбачев | ||
| # ['М.С. Горбачевым', Name(first='М', last='горбачёв', middle='С')], | ||
| # ['Ахмат-Хаджи Кадырова', Name(first='ахмат-хаджи', last='кадыров')], | ||
| ] | ||
@@ -63,6 +76,5 @@ | ||
| @pytest.mark.parametrize('test', tests) | ||
| def test_extractor(extractor, test): | ||
| text = test[0] | ||
| etalon = test[1:] | ||
| guess = [_.fact for _ in extractor(text)] | ||
| assert guess == etalon | ||
| def test_extractor(names_extractor, test): | ||
| text, target = test | ||
| pred = names_extractor.find(text).fact | ||
| assert pred == target |
| Metadata-Version: 2.0 | ||
| Name: natasha | ||
| Version: 0.10.0 | ||
| Summary: Named-entity recognition for russian language | ||
| Home-page: https://github.com/natasha/natasha | ||
| Author: Natasha contributors | ||
| Author-email: d.a.veselov@yandex.ru | ||
| License: MIT | ||
| Description-Content-Type: UNKNOWN | ||
| Keywords: natural language processing,russian morphology,named entity recognition,tomita | ||
| Platform: UNKNOWN | ||
| Classifier: Development Status :: 3 - Alpha | ||
| Classifier: Intended Audience :: Developers | ||
| Classifier: Intended Audience :: Science/Research | ||
| Classifier: License :: OSI Approved :: MIT License | ||
| Classifier: Programming Language :: Python | ||
| Classifier: Programming Language :: Python :: 2 | ||
| Classifier: Programming Language :: Python :: 2.7 | ||
| Classifier: Programming Language :: Python :: 3 | ||
| Classifier: Programming Language :: Python :: 3.3 | ||
| Classifier: Programming Language :: Python :: 3.4 | ||
| Classifier: Programming Language :: Python :: 3.5 | ||
| Classifier: Programming Language :: Python :: 3.6 | ||
| Classifier: Programming Language :: Python :: Implementation :: CPython | ||
| Classifier: Programming Language :: Python :: Implementation :: PyPy | ||
| Classifier: Topic :: Software Development :: Libraries :: Python Modules | ||
| Classifier: Topic :: Scientific/Engineering :: Information Analysis | ||
| Classifier: Topic :: Text Processing :: Linguistic | ||
| Requires-Dist: yargy | ||
| UNKNOWN | ||
| {"classifiers": ["Development Status :: 3 - Alpha", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: Implementation :: CPython", "Programming Language :: Python :: Implementation :: PyPy", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Text Processing :: Linguistic"], "description_content_type": "UNKNOWN", "extensions": {"python.details": {"contacts": [{"email": "d.a.veselov@yandex.ru", "name": "Natasha contributors", "role": "author"}], "document_names": {"description": "DESCRIPTION.rst"}, "project_urls": {"Home": "https://github.com/natasha/natasha"}}}, "extras": [], "generator": "bdist_wheel (0.30.0)", "keywords": ["natural", "language", "processing", "russian", "morphology", "named", "entity", "recognition", "tomita"], "license": "MIT", "metadata_version": "2.0", "name": "natasha", "run_requires": [{"requires": ["yargy"]}], "summary": "Named-entity recognition for russian language", "version": "0.10.0"} |
| # coding: utf-8 | ||
| from __future__ import unicode_literals | ||
| from collections import Counter | ||
| from yargy.tagger import Tagger | ||
| from .data import ( | ||
| load_json, | ||
| get_model_path | ||
| ) | ||
| NAME_MODEL = get_model_path('name.crf.json') | ||
| STREET_MODEL = get_model_path('street.crf.json') | ||
| ########## | ||
| # | ||
| # MODEL | ||
| # | ||
| ########### | ||
| OUTSIDE = 'O' | ||
| INSIDE = 'I' | ||
| LABELS = [OUTSIDE, INSIDE] | ||
| class Model(object): | ||
| def __init__(self, transitions, state_features): | ||
| self.transitions = transitions | ||
| self.state_features = state_features | ||
| def parse_model(data): | ||
| transitions = {} | ||
| for a, b, weight in data[0]: | ||
| transitions[LABELS.index(b), LABELS.index(a)] = weight | ||
| state_features = Counter() | ||
| for feature, label, weight in data[1]: | ||
| state_features[feature, LABELS.index(label)] = weight | ||
| return Model(transitions, state_features) | ||
| def load_model(path): | ||
| data = load_json(path) | ||
| return parse_model(data) | ||
| def argmax(items): | ||
| position = None | ||
| value = None | ||
| for index, item in enumerate(items): | ||
| if value is None or item > value: | ||
| value = item | ||
| position = index | ||
| return position, value | ||
| def viterbi(features, model): | ||
| if not features: | ||
| return [] | ||
| assert len(LABELS) == 2 | ||
| labels = range(len(LABELS)) | ||
| state = [] | ||
| weights = model.state_features | ||
| for attributes in features: | ||
| scores = [ | ||
| sum(weights[__, _] for __ in attributes) | ||
| for _ in labels | ||
| ] | ||
| state.append(scores) | ||
| previous = state[0] | ||
| path = [] | ||
| weights = model.transitions | ||
| for scores in state[1:]: | ||
| step = [] | ||
| current = [] | ||
| for target in labels: | ||
| options = [ | ||
| previous[_] + weights[_, target] | ||
| for _ in labels | ||
| ] | ||
| index, value = argmax(options) | ||
| current.append(scores[target] + value) | ||
| step.append(index) | ||
| previous = current | ||
| path.append(step) | ||
| index, _ = argmax(previous) | ||
| labels = [index] | ||
| for step in reversed(path): | ||
| index = step[index] | ||
| labels.append(index) | ||
| return [LABELS[_] for _ in reversed(labels)] | ||
| class CrfTagger(Tagger): | ||
| def __init__(self, path, get_features): | ||
| self.model = load_model(path) | ||
| self.get_features = get_features | ||
| def check_tag(self, tag): | ||
| return tag in LABELS | ||
| def __call__(self, tokens): | ||
| tokens = list(tokens) | ||
| features = list(self.get_features(tokens)) | ||
| labels = viterbi(features, self.model) | ||
| assert len(tokens) == len(labels) | ||
| for token, label in zip(tokens, labels): | ||
| yield token.tagged(label) | ||
| ############ | ||
| # | ||
| # STREET | ||
| # | ||
| ########### | ||
| def get_shape(token): | ||
| item = token.value | ||
| if item.isdigit(): | ||
| return 'DIGIT' | ||
| elif item.isalpha(): | ||
| if item.islower(): | ||
| return 'oo' | ||
| elif item.isupper(): | ||
| return 'OO' | ||
| elif item.istitle(): | ||
| return 'Oo' | ||
| else: | ||
| return 'OTHER' | ||
| else: | ||
| return 'PUNCT' | ||
| def get_normalized(token): | ||
| if token.type == 'RU': | ||
| return token.normalized | ||
| elif token.type == 'PUNCT': | ||
| return token.value | ||
| else: | ||
| return 'OTHER' | ||
| def get_street_token_features(tokens, index): | ||
| token = tokens[index] | ||
| yield 'bias' | ||
| yield 'shape=' + get_shape(token) | ||
| yield 'norm=' + get_normalized(token) | ||
| if index > 1: | ||
| token = tokens[index - 2] | ||
| yield '-2:shape=' + get_shape(token) | ||
| yield '-2:norm=' + get_normalized(token) | ||
| else: | ||
| yield 'BOS' | ||
| if index > 0: | ||
| token = tokens[index - 1] | ||
| yield '-1:shape=' + get_shape(token) | ||
| yield '-1:norm=' + get_normalized(token) | ||
| else: | ||
| yield 'BOS' | ||
| if index < len(tokens) - 1: | ||
| token = tokens[index + 1] | ||
| yield '+1:shape=' + get_shape(token) | ||
| yield '+1:norm=' + get_normalized(token) | ||
| else: | ||
| yield 'EOS' | ||
| def get_street_features(tokens): | ||
| for index, _ in enumerate(tokens): | ||
| yield list(get_street_token_features(tokens, index)) | ||
| ######## | ||
| # | ||
| # NAME | ||
| # | ||
| ########### | ||
| def get_pos(token): | ||
| if token.type == 'RU': | ||
| form = token.forms[0].raw | ||
| return form.tag.POS or 'UNKNOWN' | ||
| else: | ||
| return 'UNKNOWN' | ||
| def get_name(token): | ||
| if token.type == 'RU': | ||
| form = token.forms[0] | ||
| grams = form.grams | ||
| if 'Name' in grams: | ||
| return 'Name' | ||
| elif 'Surn' in grams: | ||
| return 'Surn' | ||
| return 'UNKNOWN' | ||
| def get_name_token_features(base, size, **features): | ||
| yield 'bias' | ||
| for offset in (-3, -2, -1, 0, 1, 2, 3): | ||
| index = base + offset | ||
| if index < 0: | ||
| yield 'BOS' | ||
| elif index >= size: | ||
| yield 'EOS' | ||
| else: | ||
| for key, values in features.items(): | ||
| value = values[index] | ||
| yield '{offset}:{key}={value}'.format( | ||
| offset=offset, | ||
| key=key, | ||
| value=value | ||
| ) | ||
| def get_name_features(tokens): | ||
| shapes = [get_shape(_) for _ in tokens] | ||
| poses = [get_pos(_) for _ in tokens] | ||
| names = [get_name(_) for _ in tokens] | ||
| size = len(tokens) | ||
| for index in range(len(tokens)): | ||
| yield list(get_name_token_features( | ||
| index, size, | ||
| shape=shapes, pos=poses, name=names | ||
| )) |
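The `CrfTagger` above scores each token under the two labels `O`/`I` and decodes the best label sequence with `viterbi`. A minimal, self-contained sketch of that two-label decoding, using toy weights for illustration rather than the real `name.crf.json` / `street.crf.json` models:

```python
# Toy two-label Viterbi decoding in the style of the CrfTagger above.
# Weights are illustrative, not the shipped CRF model weights.
from collections import Counter

LABELS = ['O', 'I']

def argmax(items):
    position, value = None, None
    for index, item in enumerate(items):
        if value is None or item > value:
            position, value = index, item
    return position, value

def viterbi(features, state_features, transitions):
    labels = range(len(LABELS))
    # per-token emission scores: sum of weights of the active features
    state = [
        [sum(state_features[attr, label] for attr in attrs) for label in labels]
        for attrs in features
    ]
    previous = state[0]
    path = []
    for scores in state[1:]:
        step, current = [], []
        for target in labels:
            index, value = argmax(
                [previous[source] + transitions[source, target]
                 for source in labels]
            )
            current.append(scores[target] + value)
            step.append(index)
        previous = current
        path.append(step)
    index, _ = argmax(previous)
    best = [index]
    for step in reversed(path):  # backtrack the best path
        index = step[index]
        best.append(index)
    return [LABELS[label] for label in reversed(best)]

# Toy model: titlecase shape votes for I (inside a name), lowercase for O
state_features = Counter({('0:shape=Oo', 1): 1.0, ('0:shape=oo', 0): 1.0})
transitions = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
decoded = viterbi(
    [['0:shape=Oo'], ['0:shape=oo'], ['0:shape=Oo']],
    state_features, transitions
)
```

Using a `Counter` for `state_features` mirrors the module above: unseen (feature, label) pairs score zero without key checks.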
| август | ||
| августа | ||
| агата # агаты | ||
| ада | ||
| аи | ||
| алмаз | ||
| алтай | ||
| альф # Альфа-банку | ||
| аля | ||
| ангел | ||
| ангела | ||
| арен # арену O2 | ||
| ария | ||
| афина | ||
| ая # 5-ая | ||
| баги | ||
| барак | ||
| бела # белой | ||
| бета | ||
| боян | ||
| буся | ||
| валентинка | ||
| валентиночка | ||
| вели | ||
| вера | ||
| вики # wiki | ||
| викторина | ||
| галька | ||
| гера | ||
| гор | ||
| дада | ||
| дан # дана | ||
| дана | ||
| данна # данной | ||
| даня # дань | ||
| дельфин | ||
| джем | ||
| джин | ||
| дрон | ||
| женя # жени | ||
| ида # иду | ||
| иза # из | ||
| ислам | ||
| калия # нитрата калия | ||
| канат | ||
| капа | ||
| кися | ||
| кола | ||
| коля | ||
| костя | ||
| куба | ||
| лада | ||
| ландыш | ||
| лев | ||
| лен | ||
| лена | ||
| лиана | ||
| ливия | ||
| лил | ||
| лила | ||
| лилия | ||
| лука | ||
| люба # любой | ||
| любов | ||
| любовь | ||
| лёва # левой | ||
| лёня | ||
| майк | ||
| майя | ||
| макс | ||
| макса | ||
| маргаритка | ||
| марк # марки | ||
| марта | ||
| мая | ||
| мила | ||
| надежда | ||
| нано | ||
| нарик | ||
| ник | ||
| ника | ||
| никон | ||
| нил | ||
| ной # 50%-ной | ||
| нора | ||
| олимпиада | ||
| ося # ось | ||
| павлина | ||
| петрушка | ||
| петя | ||
| питер | ||
| пол | ||
| поль | ||
| поля | ||
| рада | ||
| раш # раша | ||
| рая | ||
| рим | ||
| рима | ||
| родя # в своём роде | ||
| роза | ||
| розочка | ||
| рой | ||
| ром | ||
| рома | ||
| роман | ||
| романа | ||
| рулон | ||
| сами | ||
| света | ||
| семён # семена | ||
| серёжка | ||
| сирена | ||
| слава | ||
| спартак | ||
| султан | ||
| султана | ||
| талия | ||
| таньчик | ||
| тарана | ||
| тиша | ||
| толя # толи | ||
| том # в том | ||
| тома | ||
| томик | ||
| тёма # тем | ||
| урал | ||
| феня | ||
| фиалка | ||
| флора | ||
| хамит | ||
| ширин | ||
| ширина | ||
| эльбрус | ||
| ёлка |
| [ | ||
| [ | ||
| [ | ||
| "O", | ||
| "I", | ||
| -1.604116 | ||
| ], | ||
| [ | ||
| "I", | ||
| "I", | ||
| 2.740212 | ||
| ], | ||
| [ | ||
| "O", | ||
| "O", | ||
| 0.270131 | ||
| ], | ||
| [ | ||
| "I", | ||
| "O", | ||
| -1.406226 | ||
| ] | ||
| ], | ||
| [ | ||
| [ | ||
| "0:shape=oo", | ||
| "O", | ||
| 2.545642 | ||
| ], | ||
| [ | ||
| "0:shape=Oo", | ||
| "I", | ||
| 2.063586 | ||
| ], | ||
| [ | ||
| "0:name=UNKNOWN", | ||
| "O", | ||
| 1.482655 | ||
| ], | ||
| [ | ||
| "0:pos=NPRO", | ||
| "O", | ||
| 1.479144 | ||
| ], | ||
| [ | ||
| "0:pos=PRTF", | ||
| "O", | ||
| 1.3569 | ||
| ], | ||
| [ | ||
| "0:pos=ADJS", | ||
| "I", | ||
| 1.352732 | ||
| ], | ||
| [ | ||
| "0:pos=PRTS", | ||
| "I", | ||
| 1.018069 | ||
| ], | ||
| [ | ||
| "0:shape=DIGIT", | ||
| "O", | ||
| 0.987927 | ||
| ], | ||
| [ | ||
| "-1:name=Name", | ||
| "I", | ||
| 0.974886 | ||
| ], | ||
| [ | ||
| "0:pos=NUMR", | ||
| "I", | ||
| 0.934883 | ||
| ], | ||
| [ | ||
| "0:name=Surn", | ||
| "I", | ||
| 0.922177 | ||
| ], | ||
| [ | ||
| "0:pos=GRND", | ||
| "I", | ||
| 0.816026 | ||
| ], | ||
| [ | ||
| "2:name=Surn", | ||
| "O", | ||
| 0.798898 | ||
| ], | ||
| [ | ||
| "0:pos=UNKNOWN", | ||
| "O", | ||
| 0.777475 | ||
| ], | ||
| [ | ||
| "1:name=UNKNOWN", | ||
| "I", | ||
| 0.770902 | ||
| ], | ||
| [ | ||
| "BOS", | ||
| "O", | ||
| 0.685842 | ||
| ], | ||
| [ | ||
| "0:shape=OTHER", | ||
| "I", | ||
| 0.678433 | ||
| ], | ||
| [ | ||
| "0:pos=INFN", | ||
| "O", | ||
| 0.668857 | ||
| ], | ||
| [ | ||
| "-1:name=UNKNOWN", | ||
| "I", | ||
| 0.624275 | ||
| ], | ||
| [ | ||
| "-2:name=Name", | ||
| "O", | ||
| 0.624101 | ||
| ], | ||
| [ | ||
| "0:pos=COMP", | ||
| "O", | ||
| 0.585776 | ||
| ], | ||
| [ | ||
| "1:shape=oo", | ||
| "I", | ||
| 0.569471 | ||
| ], | ||
| [ | ||
| "1:name=Surn", | ||
| "I", | ||
| 0.564479 | ||
| ], | ||
| [ | ||
| "1:pos=GRND", | ||
| "O", | ||
| 0.539257 | ||
| ], | ||
| [ | ||
| "1:pos=VERB", | ||
| "I", | ||
| 0.536149 | ||
| ], | ||
| [ | ||
| "-1:pos=PRCL", | ||
| "I", | ||
| 0.53309 | ||
| ], | ||
| [ | ||
| "0:shape=OO", | ||
| "I", | ||
| 0.504874 | ||
| ], | ||
| [ | ||
| "-1:pos=VERB", | ||
| "I", | ||
| 0.501695 | ||
| ], | ||
| [ | ||
| "0:pos=ADJF", | ||
| "I", | ||
| 0.49816 | ||
| ], | ||
| [ | ||
| "-1:pos=NUMR", | ||
| "O", | ||
| 0.489015 | ||
| ], | ||
| [ | ||
| "-1:shape=oo", | ||
| "I", | ||
| 0.487965 | ||
| ], | ||
| [ | ||
| "1:pos=PRCL", | ||
| "I", | ||
| 0.484468 | ||
| ], | ||
| [ | ||
| "-1:name=Surn", | ||
| "O", | ||
| 0.463065 | ||
| ], | ||
| [ | ||
| "EOS", | ||
| "O", | ||
| 0.451363 | ||
| ], | ||
| [ | ||
| "0:pos=VERB", | ||
| "I", | ||
| 0.445968 | ||
| ], | ||
| [ | ||
| "0:pos=CONJ", | ||
| "O", | ||
| 0.442416 | ||
| ], | ||
| [ | ||
| "bias", | ||
| "O", | ||
| 0.441029 | ||
| ], | ||
| [ | ||
| "-2:name=UNKNOWN", | ||
| "O", | ||
| 0.433369 | ||
| ], | ||
| [ | ||
| "0:pos=PREP", | ||
| "O", | ||
| 0.42425 | ||
| ], | ||
| [ | ||
| "3:shape=PUNCT", | ||
| "O", | ||
| 0.412711 | ||
| ], | ||
| [ | ||
| "0:pos=PRED", | ||
| "O", | ||
| 0.403999 | ||
| ], | ||
| [ | ||
| "0:pos=NOUN", | ||
| "I", | ||
| 0.397826 | ||
| ], | ||
| [ | ||
| "3:shape=DIGIT", | ||
| "O", | ||
| 0.396768 | ||
| ], | ||
| [ | ||
| "1:shape=OO", | ||
| "I", | ||
| 0.394134 | ||
| ], | ||
| [ | ||
| "1:pos=CONJ", | ||
| "I", | ||
| 0.384873 | ||
| ], | ||
| [ | ||
| "1:pos=NPRO", | ||
| "O", | ||
| 0.371859 | ||
| ], | ||
| [ | ||
| "-1:pos=PREP", | ||
| "O", | ||
| 0.370365 | ||
| ], | ||
| [ | ||
| "2:shape=OTHER", | ||
| "O", | ||
| 0.367487 | ||
| ], | ||
| [ | ||
| "3:name=UNKNOWN", | ||
| "O", | ||
| 0.365273 | ||
| ], | ||
| [ | ||
| "-1:shape=OO", | ||
| "I", | ||
| 0.349842 | ||
| ], | ||
| [ | ||
| "2:shape=Oo", | ||
| "O", | ||
| 0.349626 | ||
| ], | ||
| [ | ||
| "1:pos=ADVB", | ||
| "I", | ||
| 0.344719 | ||
| ], | ||
| [ | ||
| "1:shape=PUNCT", | ||
| "I", | ||
| 0.342614 | ||
| ], | ||
| [ | ||
| "1:shape=Oo", | ||
| "O", | ||
| 0.337235 | ||
| ], | ||
| [ | ||
| "-1:pos=ADJF", | ||
| "O", | ||
| 0.331917 | ||
| ], | ||
| [ | ||
| "-1:pos=ADVB", | ||
| "I", | ||
| 0.329949 | ||
| ], | ||
| [ | ||
| "2:pos=GRND", | ||
| "I", | ||
| 0.31951 | ||
| ], | ||
| [ | ||
| "-1:pos=INFN", | ||
| "I", | ||
| 0.318389 | ||
| ], | ||
| [ | ||
| "1:pos=PRED", | ||
| "I", | ||
| 0.316141 | ||
| ], | ||
| [ | ||
| "1:shape=DIGIT", | ||
| "I", | ||
| 0.311652 | ||
| ], | ||
| [ | ||
| "2:pos=PRED", | ||
| "O", | ||
| 0.310205 | ||
| ], | ||
| [ | ||
| "1:pos=PRTF", | ||
| "O", | ||
| 0.307525 | ||
| ], | ||
| [ | ||
| "2:name=Name", | ||
| "O", | ||
| 0.30217 | ||
| ], | ||
| [ | ||
| "-2:shape=oo", | ||
| "O", | ||
| 0.301559 | ||
| ], | ||
| [ | ||
| "1:pos=COMP", | ||
| "I", | ||
| 0.286817 | ||
| ], | ||
| [ | ||
| "2:name=UNKNOWN", | ||
| "O", | ||
| 0.285496 | ||
| ], | ||
| [ | ||
| "3:name=Surn", | ||
| "O", | ||
| 0.272739 | ||
| ], | ||
| [ | ||
| "3:pos=UNKNOWN", | ||
| "I", | ||
| 0.266997 | ||
| ], | ||
| [ | ||
| "2:shape=OO", | ||
| "O", | ||
| 0.263057 | ||
| ], | ||
| [ | ||
| "-1:shape=PUNCT", | ||
| "O", | ||
| 0.256568 | ||
| ], | ||
| [ | ||
| "-3:name=UNKNOWN", | ||
| "O", | ||
| 0.25586 | ||
| ], | ||
| [ | ||
| "2:pos=UNKNOWN", | ||
| "O", | ||
| 0.252993 | ||
| ], | ||
| [ | ||
| "-1:shape=OTHER", | ||
| "I", | ||
| 0.2505 | ||
| ], | ||
| [ | ||
| "-2:shape=Oo", | ||
| "O", | ||
| 0.249383 | ||
| ], | ||
| [ | ||
| "-2:shape=PUNCT", | ||
| "O", | ||
| 0.242514 | ||
| ], | ||
| [ | ||
| "-2:name=Surn", | ||
| "O", | ||
| 0.239682 | ||
| ], | ||
| [ | ||
| "1:pos=NOUN", | ||
| "O", | ||
| 0.236719 | ||
| ], | ||
| [ | ||
| "2:shape=oo", | ||
| "O", | ||
| 0.23224 | ||
| ], | ||
| [ | ||
| "-1:shape=DIGIT", | ||
| "I", | ||
| 0.230359 | ||
| ], | ||
| [ | ||
| "-1:pos=COMP", | ||
| "I", | ||
| 0.229363 | ||
| ], | ||
| [ | ||
| "-2:pos=CONJ", | ||
| "O", | ||
| 0.22522 | ||
| ], | ||
| [ | ||
| "-2:shape=DIGIT", | ||
| "O", | ||
| 0.214741 | ||
| ], | ||
| [ | ||
| "-1:pos=INTJ", | ||
| "I", | ||
| 0.209495 | ||
| ], | ||
| [ | ||
| "-1:pos=NPRO", | ||
| "I", | ||
| 0.206034 | ||
| ], | ||
| [ | ||
| "2:pos=NOUN", | ||
| "O", | ||
| 0.20095 | ||
| ], | ||
| [ | ||
| "1:pos=ADJF", | ||
| "O", | ||
| 0.199187 | ||
| ], | ||
| [ | ||
| "3:pos=INTJ", | ||
| "O", | ||
| 0.195788 | ||
| ], | ||
| [ | ||
| "-3:name=Surn", | ||
| "O", | ||
| 0.191436 | ||
| ], | ||
| [ | ||
| "2:shape=DIGIT", | ||
| "O", | ||
| 0.190611 | ||
| ], | ||
| [ | ||
| "-1:pos=ADJS", | ||
| "O", | ||
| 0.184221 | ||
| ], | ||
| [ | ||
| "3:name=Name", | ||
| "O", | ||
| 0.181134 | ||
| ], | ||
| [ | ||
| "-3:pos=COMP", | ||
| "O", | ||
| 0.178985 | ||
| ], | ||
| [ | ||
| "-2:pos=NUMR", | ||
| "O", | ||
| 0.175882 | ||
| ], | ||
| [ | ||
| "3:pos=NUMR", | ||
| "O", | ||
| 0.172133 | ||
| ], | ||
| [ | ||
| "1:pos=INTJ", | ||
| "I", | ||
| 0.168811 | ||
| ], | ||
| [ | ||
| "-2:shape=OO", | ||
| "O", | ||
| 0.168566 | ||
| ], | ||
| [ | ||
| "0:pos=INTJ", | ||
| "I", | ||
| 0.161937 | ||
| ], | ||
| [ | ||
| "-2:pos=ADJF", | ||
| "O", | ||
| 0.161821 | ||
| ], | ||
| [ | ||
| "2:pos=ADJS", | ||
| "O", | ||
| 0.159444 | ||
| ], | ||
| [ | ||
| "0:pos=PRCL", | ||
| "I", | ||
| 0.158935 | ||
| ], | ||
| [ | ||
| "1:pos=UNKNOWN", | ||
| "I", | ||
| 0.154923 | ||
| ], | ||
| [ | ||
| "0:shape=PUNCT", | ||
| "O", | ||
| 0.154353 | ||
| ], | ||
| [ | ||
| "-2:pos=PRTF", | ||
| "O", | ||
| 0.153652 | ||
| ], | ||
| [ | ||
| "2:pos=INFN", | ||
| "O", | ||
| 0.153512 | ||
| ], | ||
| [ | ||
| "-2:pos=VERB", | ||
| "O", | ||
| 0.1477 | ||
| ], | ||
| [ | ||
| "-3:pos=GRND", | ||
| "I", | ||
| 0.147024 | ||
| ], | ||
| [ | ||
| "2:pos=NUMR", | ||
| "O", | ||
| 0.146466 | ||
| ], | ||
| [ | ||
| "-3:shape=oo", | ||
| "O", | ||
| 0.1422 | ||
| ], | ||
| [ | ||
| "2:pos=ADVB", | ||
| "O", | ||
| 0.139637 | ||
| ], | ||
| [ | ||
| "-3:shape=DIGIT", | ||
| "O", | ||
| 0.131758 | ||
| ], | ||
| [ | ||
| "3:pos=PRTS", | ||
| "O", | ||
| 0.128962 | ||
| ], | ||
| [ | ||
| "2:pos=ADJF", | ||
| "O", | ||
| 0.128302 | ||
| ], | ||
| [ | ||
| "1:pos=ADJS", | ||
| "I", | ||
| 0.126624 | ||
| ], | ||
| [ | ||
| "2:pos=PRTS", | ||
| "O", | ||
| 0.124956 | ||
| ], | ||
| [ | ||
| "-2:pos=PRCL", | ||
| "O", | ||
| 0.124747 | ||
| ], | ||
| [ | ||
| "1:pos=PREP", | ||
| "I", | ||
| 0.123561 | ||
| ], | ||
| [ | ||
| "-2:pos=ADVB", | ||
| "O", | ||
| 0.123356 | ||
| ], | ||
| [ | ||
| "-1:pos=PRTS", | ||
| "I", | ||
| 0.122716 | ||
| ], | ||
| [ | ||
| "-2:shape=OTHER", | ||
| "O", | ||
| 0.120389 | ||
| ], | ||
| [ | ||
| "0:name=Name", | ||
| "I", | ||
| 0.119448 | ||
| ], | ||
| [ | ||
| "3:pos=INFN", | ||
| "O", | ||
| 0.118709 | ||
| ], | ||
| [ | ||
| "1:pos=NUMR", | ||
| "O", | ||
| 0.118617 | ||
| ], | ||
| [ | ||
| "2:pos=PRTF", | ||
| "I", | ||
| 0.113287 | ||
| ], | ||
| [ | ||
| "1:pos=PRTS", | ||
| "I", | ||
| 0.111509 | ||
| ], | ||
| [ | ||
| "-3:pos=PREP", | ||
| "O", | ||
| 0.104848 | ||
| ], | ||
| [ | ||
| "2:pos=NPRO", | ||
| "I", | ||
| 0.103037 | ||
| ], | ||
| [ | ||
| "-3:shape=OTHER", | ||
| "O", | ||
| 0.100032 | ||
| ], | ||
| [ | ||
| "2:pos=VERB", | ||
| "O", | ||
| 0.090745 | ||
| ], | ||
| [ | ||
| "-2:pos=GRND", | ||
| "I", | ||
| 0.090559 | ||
| ], | ||
| [ | ||
| "-3:pos=PRED", | ||
| "O", | ||
| 0.088779 | ||
| ], | ||
| [ | ||
| "-3:pos=PRTF", | ||
| "O", | ||
| 0.088737 | ||
| ], | ||
| [ | ||
| "0:pos=ADVB", | ||
| "O", | ||
| 0.086747 | ||
| ], | ||
| [ | ||
| "-3:pos=PRTS", | ||
| "O", | ||
| 0.083656 | ||
| ], | ||
| [ | ||
| "3:pos=PREP", | ||
| "O", | ||
| 0.083061 | ||
| ], | ||
| [ | ||
| "-1:pos=PRTF", | ||
| "I", | ||
| 0.082751 | ||
| ], | ||
| [ | ||
| "-1:pos=UNKNOWN", | ||
| "O", | ||
| 0.080542 | ||
| ], | ||
| [ | ||
| "-1:pos=CONJ", | ||
| "I", | ||
| 0.079216 | ||
| ], | ||
| [ | ||
| "2:pos=CONJ", | ||
| "O", | ||
| 0.079087 | ||
| ], | ||
| [ | ||
| "3:pos=NPRO", | ||
| "I", | ||
| 0.076143 | ||
| ], | ||
| [ | ||
| "-3:pos=NOUN", | ||
| "O", | ||
| 0.075921 | ||
| ], | ||
| [ | ||
| "-2:pos=ADJS", | ||
| "O", | ||
| 0.074495 | ||
| ], | ||
| [ | ||
| "-1:shape=Oo", | ||
| "I", | ||
| 0.073997 | ||
| ], | ||
| [ | ||
| "-2:pos=PRED", | ||
| "O", | ||
| 0.073023 | ||
| ], | ||
| [ | ||
| "-3:shape=Oo", | ||
| "O", | ||
| 0.072301 | ||
| ], | ||
| [ | ||
| "-2:pos=PRTS", | ||
| "O", | ||
| 0.07179 | ||
| ], | ||
| [ | ||
| "-3:pos=INTJ", | ||
| "I", | ||
| 0.07163 | ||
| ], | ||
| [ | ||
| "3:pos=ADJF", | ||
| "O", | ||
| 0.07152 | ||
| ], | ||
| [ | ||
| "3:pos=VERB", | ||
| "O", | ||
| 0.071371 | ||
| ], | ||
| [ | ||
| "-3:pos=PRCL", | ||
| "I", | ||
| 0.071075 | ||
| ], | ||
| [ | ||
| "3:pos=PRTF", | ||
| "O", | ||
| 0.069526 | ||
| ], | ||
| [ | ||
| "3:pos=ADVB", | ||
| "O", | ||
| 0.068716 | ||
| ], | ||
| [ | ||
| "3:pos=COMP", | ||
| "O", | ||
| 0.068668 | ||
| ], | ||
| [ | ||
| "1:pos=INFN", | ||
| "I", | ||
| 0.068557 | ||
| ], | ||
| [ | ||
| "-2:pos=NOUN", | ||
| "O", | ||
| 0.068547 | ||
| ], | ||
| [ | ||
| "2:pos=PREP", | ||
| "O", | ||
| 0.067738 | ||
| ], | ||
| [ | ||
| "3:shape=OO", | ||
| "O", | ||
| 0.064825 | ||
| ], | ||
| [ | ||
| "-2:pos=INTJ", | ||
| "I", | ||
| 0.06307 | ||
| ], | ||
| [ | ||
| "3:pos=NOUN", | ||
| "O", | ||
| 0.063023 | ||
| ], | ||
| [ | ||
| "3:shape=Oo", | ||
| "I", | ||
| 0.061156 | ||
| ], | ||
| [ | ||
| "-3:pos=NPRO", | ||
| "I", | ||
| 0.054906 | ||
| ], | ||
| [ | ||
| "1:shape=OTHER", | ||
| "I", | ||
| 0.05335 | ||
| ], | ||
| [ | ||
| "3:shape=OTHER", | ||
| "O", | ||
| 0.053096 | ||
| ], | ||
| [ | ||
| "-2:pos=UNKNOWN", | ||
| "O", | ||
| 0.048581 | ||
| ], | ||
| [ | ||
| "3:shape=oo", | ||
| "I", | ||
| 0.047099 | ||
| ], | ||
| [ | ||
| "2:pos=PRCL", | ||
| "O", | ||
| 0.044289 | ||
| ], | ||
| [ | ||
| "3:pos=GRND", | ||
| "O", | ||
| 0.043114 | ||
| ], | ||
| [ | ||
| "-3:pos=ADJS", | ||
| "O", | ||
| 0.042959 | ||
| ], | ||
| [ | ||
| "-1:pos=NOUN", | ||
| "O", | ||
| 0.04167 | ||
| ], | ||
| [ | ||
| "-3:pos=NUMR", | ||
| "O", | ||
| 0.041314 | ||
| ], | ||
| [ | ||
| "-3:pos=CONJ", | ||
| "O", | ||
| 0.037775 | ||
| ], | ||
| [ | ||
| "-2:pos=INFN", | ||
| "I", | ||
| 0.03374 | ||
| ], | ||
| [ | ||
| "-3:name=Name", | ||
| "O", | ||
| 0.028893 | ||
| ], | ||
| [ | ||
| "2:pos=COMP", | ||
| "O", | ||
| 0.028676 | ||
| ], | ||
| [ | ||
| "-3:pos=UNKNOWN", | ||
| "O", | ||
| 0.028462 | ||
| ], | ||
| [ | ||
| "3:pos=PRED", | ||
| "I", | ||
| 0.02632 | ||
| ], | ||
| [ | ||
| "3:pos=PRCL", | ||
| "O", | ||
| 0.025225 | ||
| ], | ||
| [ | ||
| "-3:shape=PUNCT", | ||
| "O", | ||
| 0.024241 | ||
| ], | ||
| [ | ||
| "-2:pos=PREP", | ||
| "O", | ||
| 0.024196 | ||
| ], | ||
| [ | ||
| "-2:pos=NPRO", | ||
| "O", | ||
| 0.024137 | ||
| ], | ||
| [ | ||
| "-3:pos=VERB", | ||
| "O", | ||
| 0.020688 | ||
| ], | ||
| [ | ||
| "3:pos=CONJ", | ||
| "O", | ||
| 0.018175 | ||
| ], | ||
| [ | ||
| "-3:pos=ADVB", | ||
| "O", | ||
| 0.016958 | ||
| ], | ||
| [ | ||
| "2:shape=PUNCT", | ||
| "I", | ||
| 0.016457 | ||
| ], | ||
| [ | ||
| "-1:pos=PRED", | ||
| "I", | ||
| 0.013972 | ||
| ], | ||
| [ | ||
| "-2:pos=COMP", | ||
| "I", | ||
| 0.012627 | ||
| ], | ||
| [ | ||
| "-3:pos=INFN", | ||
| "O", | ||
| 0.009667 | ||
| ], | ||
| [ | ||
| "3:pos=ADJS", | ||
| "I", | ||
| 0.009383 | ||
| ], | ||
| [ | ||
| "-1:pos=GRND", | ||
| "I", | ||
| 0.007155 | ||
| ], | ||
| [ | ||
| "-3:shape=OO", | ||
| "O", | ||
| 0.005659 | ||
| ], | ||
| [ | ||
| "2:pos=INTJ", | ||
| "I", | ||
| 0.004604 | ||
| ], | ||
| [ | ||
| "-3:pos=ADJF", | ||
| "O", | ||
| 0.002077 | ||
| ], | ||
| [ | ||
| "1:name=Name", | ||
| "O", | ||
| 0.001396 | ||
| ], | ||
| [ | ||
| "1:name=Name", | ||
| "I", | ||
| -0.001396 | ||
| ], | ||
| [ | ||
| "-3:pos=ADJF", | ||
| "I", | ||
| -0.002077 | ||
| ], | ||
| [ | ||
| "2:pos=INTJ", | ||
| "O", | ||
| -0.004604 | ||
| ], | ||
| [ | ||
| "-3:shape=OO", | ||
| "I", | ||
| -0.005659 | ||
| ], | ||
| [ | ||
| "-1:pos=GRND", | ||
| "O", | ||
| -0.007155 | ||
| ], | ||
| [ | ||
| "3:pos=ADJS", | ||
| "O", | ||
| -0.009383 | ||
| ], | ||
| [ | ||
| "-3:pos=INFN", | ||
| "I", | ||
| -0.009667 | ||
| ], | ||
| [ | ||
| "-2:pos=COMP", | ||
| "O", | ||
| -0.012627 | ||
| ], | ||
| [ | ||
| "-1:pos=PRED", | ||
| "O", | ||
| -0.013972 | ||
| ], | ||
| [ | ||
| "2:shape=PUNCT", | ||
| "O", | ||
| -0.016457 | ||
| ], | ||
| [ | ||
| "-3:pos=ADVB", | ||
| "I", | ||
| -0.016958 | ||
| ], | ||
| [ | ||
| "3:pos=CONJ", | ||
| "I", | ||
| -0.018175 | ||
| ], | ||
| [ | ||
| "-3:pos=VERB", | ||
| "I", | ||
| -0.020688 | ||
| ], | ||
| [ | ||
| "-2:pos=NPRO", | ||
| "I", | ||
| -0.024137 | ||
| ], | ||
| [ | ||
| "-2:pos=PREP", | ||
| "I", | ||
| -0.024196 | ||
| ], | ||
| [ | ||
| "-3:shape=PUNCT", | ||
| "I", | ||
| -0.024241 | ||
| ], | ||
| [ | ||
| "3:pos=PRCL", | ||
| "I", | ||
| -0.025225 | ||
| ], | ||
| [ | ||
| "3:pos=PRED", | ||
| "O", | ||
| -0.02632 | ||
| ], | ||
| [ | ||
| "-3:pos=UNKNOWN", | ||
| "I", | ||
| -0.028462 | ||
| ], | ||
| [ | ||
| "2:pos=COMP", | ||
| "I", | ||
| -0.028676 | ||
| ], | ||
| [ | ||
| "-3:name=Name", | ||
| "I", | ||
| -0.028893 | ||
| ], | ||
| [ | ||
| "-2:pos=INFN", | ||
| "O", | ||
| -0.03374 | ||
| ], | ||
| [ | ||
| "-3:pos=CONJ", | ||
| "I", | ||
| -0.037775 | ||
| ], | ||
| [ | ||
| "-3:pos=NUMR", | ||
| "I", | ||
| -0.041314 | ||
| ], | ||
| [ | ||
| "-1:pos=NOUN", | ||
| "I", | ||
| -0.04167 | ||
| ], | ||
| [ | ||
| "-3:pos=ADJS", | ||
| "I", | ||
| -0.042959 | ||
| ], | ||
| [ | ||
| "3:pos=GRND", | ||
| "I", | ||
| -0.043114 | ||
| ], | ||
| [ | ||
| "2:pos=PRCL", | ||
| "I", | ||
| -0.044289 | ||
| ], | ||
| [ | ||
| "3:shape=oo", | ||
| "O", | ||
| -0.047099 | ||
| ], | ||
| [ | ||
| "-2:pos=UNKNOWN", | ||
| "I", | ||
| -0.048581 | ||
| ], | ||
| [ | ||
| "3:shape=OTHER", | ||
| "I", | ||
| -0.053096 | ||
| ], | ||
| [ | ||
| "1:shape=OTHER", | ||
| "O", | ||
| -0.05335 | ||
| ], | ||
| [ | ||
| "-3:pos=NPRO", | ||
| "O", | ||
| -0.054906 | ||
| ], | ||
| [ | ||
| "3:shape=Oo", | ||
| "O", | ||
| -0.061156 | ||
| ], | ||
| [ | ||
| "3:pos=NOUN", | ||
| "I", | ||
| -0.063023 | ||
| ], | ||
| [ | ||
| "-2:pos=INTJ", | ||
| "O", | ||
| -0.06307 | ||
| ], | ||
| [ | ||
| "3:shape=OO", | ||
| "I", | ||
| -0.064825 | ||
| ], | ||
| [ | ||
| "2:pos=PREP", | ||
| "I", | ||
| -0.067738 | ||
| ], | ||
| [ | ||
| "-2:pos=NOUN", | ||
| "I", | ||
| -0.068547 | ||
| ], | ||
| [ | ||
| "1:pos=INFN", | ||
| "O", | ||
| -0.068557 | ||
| ], | ||
| [ | ||
| "3:pos=COMP", | ||
| "I", | ||
| -0.068668 | ||
| ], | ||
| [ | ||
| "3:pos=ADVB", | ||
| "I", | ||
| -0.068716 | ||
| ], | ||
| [ | ||
| "3:pos=PRTF", | ||
| "I", | ||
| -0.069526 | ||
| ], | ||
| [ | ||
| "-3:pos=PRCL", | ||
| "O", | ||
| -0.071075 | ||
| ], | ||
| [ | ||
| "3:pos=VERB", | ||
| "I", | ||
| -0.071371 | ||
| ], | ||
| [ | ||
| "3:pos=ADJF", | ||
| "I", | ||
| -0.07152 | ||
| ], | ||
| [ | ||
| "-3:pos=INTJ", | ||
| "O", | ||
| -0.07163 | ||
| ], | ||
| [ | ||
| "-2:pos=PRTS", | ||
| "I", | ||
| -0.07179 | ||
| ], | ||
| [ | ||
| "-3:shape=Oo", | ||
| "I", | ||
| -0.072301 | ||
| ], | ||
| [ | ||
| "-2:pos=PRED", | ||
| "I", | ||
| -0.073023 | ||
| ], | ||
| [ | ||
| "-1:shape=Oo", | ||
| "O", | ||
| -0.073997 | ||
| ], | ||
| [ | ||
| "-2:pos=ADJS", | ||
| "I", | ||
| -0.074495 | ||
| ], | ||
| [ | ||
| "-3:pos=NOUN", | ||
| "I", | ||
| -0.075921 | ||
| ], | ||
| [ | ||
| "3:pos=NPRO", | ||
| "O", | ||
| -0.076143 | ||
| ], | ||
| [ | ||
| "2:pos=CONJ", | ||
| "I", | ||
| -0.079087 | ||
| ], | ||
| [ | ||
| "-1:pos=CONJ", | ||
| "O", | ||
| -0.079216 | ||
| ], | ||
| [ | ||
| "-1:pos=UNKNOWN", | ||
| "I", | ||
| -0.080542 | ||
| ], | ||
| [ | ||
| "-1:pos=PRTF", | ||
| "O", | ||
| -0.082751 | ||
| ], | ||
| [ | ||
| "3:pos=PREP", | ||
| "I", | ||
| -0.083061 | ||
| ], | ||
| [ | ||
| "-3:pos=PRTS", | ||
| "I", | ||
| -0.083656 | ||
| ], | ||
| [ | ||
| "0:pos=ADVB", | ||
| "I", | ||
| -0.086747 | ||
| ], | ||
| [ | ||
| "-3:pos=PRTF", | ||
| "I", | ||
| -0.088737 | ||
| ], | ||
| [ | ||
| "-3:pos=PRED", | ||
| "I", | ||
| -0.088779 | ||
| ], | ||
| [ | ||
| "-2:pos=GRND", | ||
| "O", | ||
| -0.090559 | ||
| ], | ||
| [ | ||
| "2:pos=VERB", | ||
| "I", | ||
| -0.090745 | ||
| ], | ||
| [ | ||
| "-3:shape=OTHER", | ||
| "I", | ||
| -0.100032 | ||
| ], | ||
| [ | ||
| "2:pos=NPRO", | ||
| "O", | ||
| -0.103037 | ||
| ], | ||
| [ | ||
| "-3:pos=PREP", | ||
| "I", | ||
| -0.104848 | ||
| ], | ||
| [ | ||
| "1:pos=PRTS", | ||
| "O", | ||
| -0.111509 | ||
| ], | ||
| [ | ||
| "2:pos=PRTF", | ||
| "O", | ||
| -0.113287 | ||
| ], | ||
| [ | ||
| "1:pos=NUMR", | ||
| "I", | ||
| -0.118617 | ||
| ], | ||
| [ | ||
| "3:pos=INFN", | ||
| "I", | ||
| -0.118709 | ||
| ], | ||
| [ | ||
| "0:name=Name", | ||
| "O", | ||
| -0.119448 | ||
| ], | ||
| [ | ||
| "-2:shape=OTHER", | ||
| "I", | ||
| -0.120389 | ||
| ], | ||
| [ | ||
| "-1:pos=PRTS", | ||
| "O", | ||
| -0.122716 | ||
| ], | ||
| [ | ||
| "-2:pos=ADVB", | ||
| "I", | ||
| -0.123356 | ||
| ], | ||
| [ | ||
| "1:pos=PREP", | ||
| "O", | ||
| -0.123561 | ||
| ], | ||
| [ | ||
| "-2:pos=PRCL", | ||
| "I", | ||
| -0.124747 | ||
| ], | ||
| [ | ||
| "2:pos=PRTS", | ||
| "I", | ||
| -0.124956 | ||
| ], | ||
| [ | ||
| "1:pos=ADJS", | ||
| "O", | ||
| -0.126624 | ||
| ], | ||
| [ | ||
| "2:pos=ADJF", | ||
| "I", | ||
| -0.128302 | ||
| ], | ||
| [ | ||
| "3:pos=PRTS", | ||
| "I", | ||
| -0.128962 | ||
| ], | ||
| [ | ||
| "-3:shape=DIGIT", | ||
| "I", | ||
| -0.131758 | ||
| ], | ||
| [ | ||
| "2:pos=ADVB", | ||
| "I", | ||
| -0.139637 | ||
| ], | ||
| [ | ||
| "-3:shape=oo", | ||
| "I", | ||
| -0.1422 | ||
| ], | ||
| [ | ||
| "2:pos=NUMR", | ||
| "I", | ||
| -0.146466 | ||
| ], | ||
| [ | ||
| "-3:pos=GRND", | ||
| "O", | ||
| -0.147024 | ||
| ], | ||
| [ | ||
| "-2:pos=VERB", | ||
| "I", | ||
| -0.1477 | ||
| ], | ||
| [ | ||
| "2:pos=INFN", | ||
| "I", | ||
| -0.153512 | ||
| ], | ||
| [ | ||
| "-2:pos=PRTF", | ||
| "I", | ||
| -0.153652 | ||
| ], | ||
| [ | ||
| "0:shape=PUNCT", | ||
| "I", | ||
| -0.154353 | ||
| ], | ||
| [ | ||
| "1:pos=UNKNOWN", | ||
| "O", | ||
| -0.154923 | ||
| ], | ||
| [ | ||
| "0:pos=PRCL", | ||
| "O", | ||
| -0.158935 | ||
| ], | ||
| [ | ||
| "2:pos=ADJS", | ||
| "I", | ||
| -0.159444 | ||
| ], | ||
| [ | ||
| "-2:pos=ADJF", | ||
| "I", | ||
| -0.161821 | ||
| ], | ||
| [ | ||
| "0:pos=INTJ", | ||
| "O", | ||
| -0.161937 | ||
| ], | ||
| [ | ||
| "-2:shape=OO", | ||
| "I", | ||
| -0.168566 | ||
| ], | ||
| [ | ||
| "1:pos=INTJ", | ||
| "O", | ||
| -0.168811 | ||
| ], | ||
| [ | ||
| "3:pos=NUMR", | ||
| "I", | ||
| -0.172133 | ||
| ], | ||
| [ | ||
| "-2:pos=NUMR", | ||
| "I", | ||
| -0.175882 | ||
| ], | ||
| [ | ||
| "-3:pos=COMP", | ||
| "I", | ||
| -0.178985 | ||
| ], | ||
| [ | ||
| "3:name=Name", | ||
| "I", | ||
| -0.181134 | ||
| ], | ||
| [ | ||
| "-1:pos=ADJS", | ||
| "I", | ||
| -0.184221 | ||
| ], | ||
| [ | ||
| "2:shape=DIGIT", | ||
| "I", | ||
| -0.190611 | ||
| ], | ||
| [ | ||
| "-3:name=Surn", | ||
| "I", | ||
| -0.191436 | ||
| ], | ||
| [ | ||
| "3:pos=INTJ", | ||
| "I", | ||
| -0.195788 | ||
| ], | ||
| [ | ||
| "1:pos=ADJF", | ||
| "I", | ||
| -0.199187 | ||
| ], | ||
| [ | ||
| "2:pos=NOUN", | ||
| "I", | ||
| -0.20095 | ||
| ], | ||
| [ | ||
| "-1:pos=NPRO", | ||
| "O", | ||
| -0.206034 | ||
| ], | ||
| [ | ||
| "-1:pos=INTJ", | ||
| "O", | ||
| -0.209495 | ||
| ], | ||
| [ | ||
| "-2:shape=DIGIT", | ||
| "I", | ||
| -0.214741 | ||
| ], | ||
| [ | ||
| "-2:pos=CONJ", | ||
| "I", | ||
| -0.22522 | ||
| ], | ||
| [ | ||
| "-1:pos=COMP", | ||
| "O", | ||
| -0.229363 | ||
| ], | ||
| [ | ||
| "-1:shape=DIGIT", | ||
| "O", | ||
| -0.230359 | ||
| ], | ||
| [ | ||
| "2:shape=oo", | ||
| "I", | ||
| -0.23224 | ||
| ], | ||
| [ | ||
| "1:pos=NOUN", | ||
| "I", | ||
| -0.236719 | ||
| ], | ||
| [ | ||
| "-2:name=Surn", | ||
| "I", | ||
| -0.239682 | ||
| ], | ||
| [ | ||
| "-2:shape=PUNCT", | ||
| "I", | ||
| -0.242514 | ||
| ], | ||
| [ | ||
| "-2:shape=Oo", | ||
| "I", | ||
| -0.249383 | ||
| ], | ||
| [ | ||
| "-1:shape=OTHER", | ||
| "O", | ||
| -0.2505 | ||
| ], | ||
| [ | ||
| "2:pos=UNKNOWN", | ||
| "I", | ||
| -0.252993 | ||
| ], | ||
| [ | ||
| "-3:name=UNKNOWN", | ||
| "I", | ||
| -0.25586 | ||
| ], | ||
| [ | ||
| "-1:shape=PUNCT", | ||
| "I", | ||
| -0.256568 | ||
| ], | ||
| [ | ||
| "2:shape=OO", | ||
| "I", | ||
| -0.263057 | ||
| ], | ||
| [ | ||
| "3:pos=UNKNOWN", | ||
| "O", | ||
| -0.266997 | ||
| ], | ||
| [ | ||
| "3:name=Surn", | ||
| "I", | ||
| -0.272739 | ||
| ], | ||
| [ | ||
| "2:name=UNKNOWN", | ||
| "I", | ||
| -0.285496 | ||
| ], | ||
| [ | ||
| "1:pos=COMP", | ||
| "O", | ||
| -0.286817 | ||
| ], | ||
| [ | ||
| "-2:shape=oo", | ||
| "I", | ||
| -0.301559 | ||
| ], | ||
| [ | ||
| "2:name=Name", | ||
| "I", | ||
| -0.30217 | ||
| ], | ||
| [ | ||
| "1:pos=PRTF", | ||
| "I", | ||
| -0.307525 | ||
| ], | ||
| [ | ||
| "2:pos=PRED", | ||
| "I", | ||
| -0.310205 | ||
| ], | ||
| [ | ||
| "1:shape=DIGIT", | ||
| "O", | ||
| -0.311652 | ||
| ], | ||
| [ | ||
| "1:pos=PRED", | ||
| "O", | ||
| -0.316141 | ||
| ], | ||
| [ | ||
| "-1:pos=INFN", | ||
| "O", | ||
| -0.318389 | ||
| ], | ||
| [ | ||
| "2:pos=GRND", | ||
| "O", | ||
| -0.31951 | ||
| ], | ||
| [ | ||
| "-1:pos=ADVB", | ||
| "O", | ||
| -0.329949 | ||
| ], | ||
| [ | ||
| "-1:pos=ADJF", | ||
| "I", | ||
| -0.331917 | ||
| ], | ||
| [ | ||
| "1:shape=Oo", | ||
| "I", | ||
| -0.337235 | ||
| ], | ||
| [ | ||
| "1:shape=PUNCT", | ||
| "O", | ||
| -0.342614 | ||
| ], | ||
| [ | ||
| "1:pos=ADVB", | ||
| "O", | ||
| -0.344719 | ||
| ], | ||
| [ | ||
| "2:shape=Oo", | ||
| "I", | ||
| -0.349626 | ||
| ], | ||
| [ | ||
| "-1:shape=OO", | ||
| "O", | ||
| -0.349842 | ||
| ], | ||
| [ | ||
| "3:name=UNKNOWN", | ||
| "I", | ||
| -0.365273 | ||
| ], | ||
| [ | ||
| "2:shape=OTHER", | ||
| "I", | ||
| -0.367487 | ||
| ], | ||
| [ | ||
| "-1:pos=PREP", | ||
| "I", | ||
| -0.370365 | ||
| ], | ||
| [ | ||
| "1:pos=NPRO", | ||
| "I", | ||
| -0.371859 | ||
| ], | ||
| [ | ||
| "1:pos=CONJ", | ||
| "O", | ||
| -0.384873 | ||
| ], | ||
| [ | ||
| "1:shape=OO", | ||
| "O", | ||
| -0.394134 | ||
| ], | ||
| [ | ||
| "3:shape=DIGIT", | ||
| "I", | ||
| -0.396768 | ||
| ], | ||
| [ | ||
| "0:pos=NOUN", | ||
| "O", | ||
| -0.397826 | ||
| ], | ||
| [ | ||
| "0:pos=PRED", | ||
| "I", | ||
| -0.403999 | ||
| ], | ||
| [ | ||
| "3:shape=PUNCT", | ||
| "I", | ||
| -0.412711 | ||
| ], | ||
| [ | ||
| "0:pos=PREP", | ||
| "I", | ||
| -0.42425 | ||
| ], | ||
| [ | ||
| "-2:name=UNKNOWN", | ||
| "I", | ||
| -0.433369 | ||
| ], | ||
| [ | ||
| "bias", | ||
| "I", | ||
| -0.441029 | ||
| ], | ||
| [ | ||
| "0:pos=CONJ", | ||
| "I", | ||
| -0.442416 | ||
| ], | ||
| [ | ||
| "0:pos=VERB", | ||
| "O", | ||
| -0.445968 | ||
| ], | ||
| [ | ||
| "EOS", | ||
| "I", | ||
| -0.451363 | ||
| ], | ||
| [ | ||
| "-1:name=Surn", | ||
| "I", | ||
| -0.463065 | ||
| ], | ||
| [ | ||
| "1:pos=PRCL", | ||
| "O", | ||
| -0.484468 | ||
| ], | ||
| [ | ||
| "-1:shape=oo", | ||
| "O", | ||
| -0.487965 | ||
| ], | ||
| [ | ||
| "-1:pos=NUMR", | ||
| "I", | ||
| -0.489015 | ||
| ], | ||
| [ | ||
| "0:pos=ADJF", | ||
| "O", | ||
| -0.49816 | ||
| ], | ||
| [ | ||
| "-1:pos=VERB", | ||
| "O", | ||
| -0.501695 | ||
| ], | ||
| [ | ||
| "0:shape=OO", | ||
| "O", | ||
| -0.504874 | ||
| ], | ||
| [ | ||
| "-1:pos=PRCL", | ||
| "O", | ||
| -0.53309 | ||
| ], | ||
| [ | ||
| "1:pos=VERB", | ||
| "O", | ||
| -0.536149 | ||
| ], | ||
| [ | ||
| "1:pos=GRND", | ||
| "I", | ||
| -0.539257 | ||
| ], | ||
| [ | ||
| "1:name=Surn", | ||
| "O", | ||
| -0.564479 | ||
| ], | ||
| [ | ||
| "1:shape=oo", | ||
| "O", | ||
| -0.569471 | ||
| ], | ||
| [ | ||
| "0:pos=COMP", | ||
| "I", | ||
| -0.585776 | ||
| ], | ||
| [ | ||
| "-2:name=Name", | ||
| "I", | ||
| -0.624101 | ||
| ], | ||
| [ | ||
| "-1:name=UNKNOWN", | ||
| "O", | ||
| -0.624275 | ||
| ], | ||
| [ | ||
| "0:pos=INFN", | ||
| "I", | ||
| -0.668857 | ||
| ], | ||
| [ | ||
| "0:shape=OTHER", | ||
| "O", | ||
| -0.678433 | ||
| ], | ||
| [ | ||
| "BOS", | ||
| "I", | ||
| -0.685842 | ||
| ], | ||
| [ | ||
| "1:name=UNKNOWN", | ||
| "O", | ||
| -0.770902 | ||
| ], | ||
| [ | ||
| "0:pos=UNKNOWN", | ||
| "I", | ||
| -0.777475 | ||
| ], | ||
| [ | ||
| "2:name=Surn", | ||
| "I", | ||
| -0.798898 | ||
| ], | ||
| [ | ||
| "0:pos=GRND", | ||
| "O", | ||
| -0.816026 | ||
| ], | ||
| [ | ||
| "0:name=Surn", | ||
| "O", | ||
| -0.922177 | ||
| ], | ||
| [ | ||
| "0:pos=NUMR", | ||
| "O", | ||
| -0.934883 | ||
| ], | ||
| [ | ||
| "-1:name=Name", | ||
| "O", | ||
| -0.974886 | ||
| ], | ||
| [ | ||
| "0:shape=DIGIT", | ||
| "I", | ||
| -0.987927 | ||
| ], | ||
| [ | ||
| "0:pos=PRTS", | ||
| "O", | ||
| -1.018069 | ||
| ], | ||
| [ | ||
| "0:pos=ADJS", | ||
| "O", | ||
| -1.352732 | ||
| ], | ||
| [ | ||
| "0:pos=PRTF", | ||
| "I", | ||
| -1.3569 | ||
| ], | ||
| [ | ||
| "0:pos=NPRO", | ||
| "I", | ||
| -1.479144 | ||
| ], | ||
| [ | ||
| "0:name=UNKNOWN", | ||
| "I", | ||
| -1.482655 | ||
| ], | ||
| [ | ||
| "0:shape=Oo", | ||
| "O", | ||
| -2.063586 | ||
| ], | ||
| [ | ||
| "0:shape=oo", | ||
| "I", | ||
| -2.545642 | ||
| ] | ||
| ] | ||
| ] |
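The long listing above is a dump of `[feature, label, weight]` triples from a CRF-style model. A minimal sketch of the assumed semantics: each feature describes the token window (`"-1:pos=NOUN"` reads as "part of speech one token to the left is NOUN"), `"I"`/`"O"` are inside/outside span labels, and a label is scored by summing the weights of the features active at a token. The handful of weights below are copied from the dump; everything else is omitted.

```python
# Assumed interpretation of the [feature, label, weight] triples above.
# Only a few weights from the dump are kept for illustration.
WEIGHTS = {
    ("0:pos=NOUN", "O"): -0.397826,
    ("-1:pos=NOUN", "O"): 0.04167,
    ("-1:pos=NOUN", "I"): -0.04167,
    ("bias", "I"): -0.441029,
}

def score(features, label):
    # Sum the learned weights of the active features for one label;
    # unknown (feature, label) pairs contribute nothing.
    return sum(WEIGHTS.get((feature, label), 0.0) for feature in features)

def best_label(features, labels=("I", "O")):
    # The label with the larger summed weight wins.
    return max(labels, key=lambda label: score(features, label))

best_label(["0:pos=NOUN", "-1:pos=NOUN", "bias"])  # "O"
```

Note the mirrored entries in the dump (the same feature with `"I"` and `"O"` and opposite signs): pushing one label up pushes the other down by the same amount.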
| class Normalizable(object): | ||
| pass | ||
| def can_be_normalized(item): | ||
| return isinstance(item, Normalizable) |
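The marker-class idiom above can be exercised with a hypothetical subclass (`Phrase` is an illustration, not part of the package): opting a type into normalization is just a matter of inheriting the marker.

```python
class Normalizable(object):
    pass

def can_be_normalized(item):
    return isinstance(item, Normalizable)

class Phrase(Normalizable):
    # Hypothetical subclass for illustration only.
    pass

can_be_normalized(Phrase())    # True
can_be_normalized('raw text')  # False
```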
| # coding: utf-8 | ||
| from __future__ import unicode_literals | ||
| from collections import OrderedDict | ||
| from natasha.utils import Record | ||
| EURO = 'EUR' | ||
| DOLLARS = 'USD' | ||
| RUBLES = 'RUB' | ||
| DAY = 'DAY' | ||
| HOUR = 'HOUR' | ||
| SHIFT = 'SHIFT' | ||
| class Money(Record): | ||
| __attributes__ = ['amount', 'currency'] | ||
| def __init__(self, amount, currency): | ||
| self.amount = amount | ||
| self.currency = currency | ||
| @property | ||
| def as_json(self): | ||
| return OrderedDict([ | ||
| ('amount', self.amount), | ||
| ('currency', self.currency) | ||
| ]) | ||
| def __str__(self): | ||
| return '{self.amount} {self.currency}'.format( | ||
| self=self | ||
| ) | ||
| class Rate(Record): | ||
| __attributes__ = ['money', 'period'] | ||
| def __init__(self, money, period): | ||
| self.money = money | ||
| self.period = period | ||
| @property | ||
| def as_json(self): | ||
| return OrderedDict([ | ||
| ('money', self.money.as_json), | ||
| ('period', self.period) | ||
| ]) | ||
| def __str__(self): | ||
| return '{self.money}/{self.period}'.format( | ||
| self=self | ||
| ) | ||
| class Range(Record): | ||
| __attributes__ = ['min', 'max'] | ||
| def __init__(self, min, max): | ||
| self.min = min | ||
| self.max = max | ||
| @property | ||
| def as_json(self): | ||
| return OrderedDict([ | ||
| ('min', self.min.as_json), | ||
| ('max', self.max.as_json) | ||
| ]) | ||
| def __str__(self): | ||
| return '{self.min}-{self.max}'.format( | ||
| self=self | ||
| ) |
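A standalone sketch of how these records compose and serialize. `Record` here is a trivial stand-in for `natasha.utils.Record` (which the file above imports), kept just minimal enough to run; `Money` and `Rate` are condensed from the definitions above.

```python
from collections import OrderedDict

class Record(object):
    # Stand-in for natasha.utils.Record, just enough to run this sketch.
    pass

class Money(Record):
    def __init__(self, amount, currency):
        self.amount = amount
        self.currency = currency

    @property
    def as_json(self):
        return OrderedDict([('amount', self.amount),
                            ('currency', self.currency)])

    def __str__(self):
        return '{self.amount} {self.currency}'.format(self=self)

class Rate(Record):
    def __init__(self, money, period):
        self.money = money
        self.period = period

    def __str__(self):
        return '{self.money}/{self.period}'.format(self=self)

rate = Rate(Money(100000, 'RUB'), 'DAY')
str(rate)  # '100000 RUB/DAY'
```

`Range` follows the same pattern, nesting two `Money` values and delegating to their `as_json` properties.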
| # coding: utf-8 | ||
| from __future__ import unicode_literals | ||
| from yargy import ( | ||
| rule, | ||
| or_, and_ | ||
| ) | ||
| from yargy.interpretation import fact, attribute | ||
| from yargy.predicates import ( | ||
| eq, lte, gte, gram, type, tag, | ||
| length_eq, | ||
| in_, in_caseless, dictionary, | ||
| normalized, caseless, | ||
| is_title | ||
| ) | ||
| from yargy.pipelines import morph_pipeline | ||
| from yargy.tokenizer import QUOTES | ||
| Address = fact( | ||
| 'Address', | ||
| [attribute('parts').repeatable()] | ||
| ) | ||
| Index = fact( | ||
| 'Index', | ||
| ['value'] | ||
| ) | ||
| Country = fact( | ||
| 'Country', | ||
| ['name'] | ||
| ) | ||
| Region = fact( | ||
| 'Region', | ||
| ['name', 'type'] | ||
| ) | ||
| Settlement = fact( | ||
| 'Settlement', | ||
| ['name', 'type'] | ||
| ) | ||
| Street = fact( | ||
| 'Street', | ||
| ['name', 'type'] | ||
| ) | ||
| Building = fact( | ||
| 'Building', | ||
| ['number', 'type'] | ||
| ) | ||
| Room = fact( | ||
| 'Room', | ||
| ['number', 'type'] | ||
| ) | ||
| DASH = eq('-') | ||
| DOT = eq('.') | ||
| ADJF = gram('ADJF') | ||
| NOUN = gram('NOUN') | ||
| INT = type('INT') | ||
| TITLE = is_title() | ||
| ANUM = rule( | ||
| INT, | ||
| DASH.optional(), | ||
| in_caseless({ | ||
| 'я', 'й', 'е', | ||
| 'ое', 'ая', 'ий', 'ой' | ||
| }) | ||
| ) | ||
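The `ANUM` rule above matches ordinal tokens such as «3-я» or «10й». A rough regex analogue of it on a joined string (a simplification: the real yargy rule works on separate INT, dash, and ending tokens, not on raw text):

```python
import re

# Integer, optional dash, then one of the adjectival endings listed
# in the ANUM rule; case-insensitive, mirroring in_caseless.
ANUM_RE = re.compile(r'^\d+-?(я|й|е|ое|ая|ий|ой)$', re.IGNORECASE)

def is_anum(token):
    return ANUM_RE.match(token) is not None

is_anum('3-я')  # True
is_anum('дом')  # False
```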
| ######### | ||
| # | ||
| # STRANA | ||
| # | ||
| ########## | ||
| # TODO | ||
| COUNTRY_VALUE = dictionary({ | ||
| 'россия', | ||
| 'украина' | ||
| }) | ||
| ABBR_COUNTRY_VALUE = in_caseless({ | ||
| 'рф' | ||
| }) | ||
| COUNTRY = or_( | ||
| COUNTRY_VALUE, | ||
| ABBR_COUNTRY_VALUE | ||
| ).interpretation( | ||
| Country.name | ||
| ).interpretation( | ||
| Country | ||
| ) | ||
| ############# | ||
| # | ||
| # FED OKRUGA | ||
| # | ||
| ############ | ||
| FED_OKRUG_NAME = or_( | ||
| rule( | ||
| dictionary({ | ||
| 'дальневосточный', | ||
| 'приволжский', | ||
| 'сибирский', | ||
| 'уральский', | ||
| 'центральный', | ||
| 'южный', | ||
| }) | ||
| ), | ||
| rule( | ||
| caseless('северо'), | ||
| DASH.optional(), | ||
| dictionary({ | ||
| 'западный', | ||
| 'кавказский' | ||
| }) | ||
| ) | ||
| ).interpretation( | ||
| Region.name | ||
| ) | ||
| FED_OKRUG_WORDS = or_( | ||
| rule( | ||
| normalized('федеральный'), | ||
| normalized('округ') | ||
| ), | ||
| rule(caseless('фо')) | ||
| ).interpretation( | ||
| Region.type.const('федеральный округ') | ||
| ) | ||
| FED_OKRUG = rule( | ||
| FED_OKRUG_WORDS, | ||
| FED_OKRUG_NAME | ||
| ).interpretation( | ||
| Region | ||
| ) | ||
| ######### | ||
| # | ||
| # RESPUBLIKA | ||
| # | ||
| ############ | ||
| RESPUBLIKA_WORDS = or_( | ||
| rule(caseless('респ'), DOT.optional()), | ||
| rule(normalized('республика')) | ||
| ).interpretation( | ||
| Region.type.const('республика') | ||
| ) | ||
| RESPUBLIKA_ADJF = or_( | ||
| rule( | ||
| dictionary({ | ||
| 'удмуртский', | ||
| 'чеченский', | ||
| 'чувашский', | ||
| }) | ||
| ), | ||
| rule( | ||
| caseless('карачаево'), | ||
| DASH.optional(), | ||
| normalized('черкесский') | ||
| ), | ||
| rule( | ||
| caseless('кабардино'), | ||
| DASH.optional(), | ||
| normalized('балкарский') | ||
| ) | ||
| ).interpretation( | ||
| Region.name | ||
| ) | ||
| RESPUBLIKA_NAME = or_( | ||
| rule( | ||
| dictionary({ | ||
| 'адыгея', | ||
| 'алтай', | ||
| 'башкортостан', | ||
| 'бурятия', | ||
| 'дагестан', | ||
| 'ингушетия', | ||
| 'калмыкия', | ||
| 'карелия', | ||
| 'коми', | ||
| 'крым', | ||
| 'мордовия', | ||
| 'татарстан', | ||
| 'тыва', | ||
| 'удмуртия', | ||
| 'хакасия', | ||
| 'саха', | ||
| 'якутия', | ||
| }) | ||
| ), | ||
| rule(caseless('марий'), caseless('эл')), | ||
| rule( | ||
| normalized('северный'), normalized('осетия'), | ||
| rule('-', normalized('алания')).optional() | ||
| ) | ||
| ).interpretation( | ||
| Region.name | ||
| ) | ||
| RESPUBLIKA_ABBR = in_caseless({ | ||
| 'кбр', | ||
| 'кчр', | ||
| 'рт', # Татарстан | ||
| }).interpretation( | ||
| Region.name # TODO type | ||
| ) | ||
| RESPUBLIKA = or_( | ||
| rule(RESPUBLIKA_ADJF, RESPUBLIKA_WORDS), | ||
| rule(RESPUBLIKA_WORDS, RESPUBLIKA_NAME), | ||
| rule(RESPUBLIKA_ABBR) | ||
| ).interpretation( | ||
| Region | ||
| ) | ||
| ########## | ||
| # | ||
| # KRAI | ||
| # | ||
| ######## | ||
| KRAI_WORDS = normalized('край').interpretation( | ||
| Region.type.const('край') | ||
| ) | ||
| KRAI_NAME = dictionary({ | ||
| 'алтайский', | ||
| 'забайкальский', | ||
| 'камчатский', | ||
| 'краснодарский', | ||
| 'красноярский', | ||
| 'пермский', | ||
| 'приморский', | ||
| 'ставропольский', | ||
| 'хабаровский', | ||
| }).interpretation( | ||
| Region.name | ||
| ) | ||
| KRAI = rule( | ||
| KRAI_NAME, KRAI_WORDS | ||
| ).interpretation( | ||
| Region | ||
| ) | ||
| ############ | ||
| # | ||
| # OBLAST | ||
| # | ||
| ############ | ||
| OBLAST_WORDS = or_( | ||
| rule(normalized('область')), | ||
| rule( | ||
| caseless('обл'), | ||
| DOT.optional() | ||
| ) | ||
| ).interpretation( | ||
| Region.type.const('область') | ||
| ) | ||
| OBLAST_NAME = dictionary({ | ||
| 'амурский', | ||
| 'архангельский', | ||
| 'астраханский', | ||
| 'белгородский', | ||
| 'брянский', | ||
| 'владимирский', | ||
| 'волгоградский', | ||
| 'вологодский', | ||
| 'воронежский', | ||
| 'горьковский', | ||
| 'ивановский', | ||
| 'ивановский', | ||
| 'иркутский', | ||
| 'калининградский', | ||
| 'калужский', | ||
| 'камчатский', | ||
| 'кемеровский', | ||
| 'кировский', | ||
| 'костромской', | ||
| 'курганский', | ||
| 'курский', | ||
| 'ленинградский', | ||
| 'липецкий', | ||
| 'магаданский', | ||
| 'московский', | ||
| 'мурманский', | ||
| 'нижегородский', | ||
| 'новгородский', | ||
| 'новосибирский', | ||
| 'омский', | ||
| 'оренбургский', | ||
| 'орловский', | ||
| 'пензенский', | ||
| 'пермский', | ||
| 'псковский', | ||
| 'ростовский', | ||
| 'рязанский', | ||
| 'самарский', | ||
| 'саратовский', | ||
| 'сахалинский', | ||
| 'свердловский', | ||
| 'смоленский', | ||
| 'тамбовский', | ||
| 'тверской', | ||
| 'томский', | ||
| 'тульский', | ||
| 'тюменский', | ||
| 'ульяновский', | ||
| 'челябинский', | ||
| 'читинский', | ||
| 'ярославский', | ||
| }).interpretation( | ||
| Region.name | ||
| ) | ||
| OBLAST = rule( | ||
| OBLAST_NAME, | ||
| OBLAST_WORDS | ||
| ).interpretation( | ||
| Region | ||
| ) | ||
| ########## | ||
| # | ||
| # AUTO OKRUG | ||
| # | ||
| ############# | ||
| AUTO_OKRUG_NAME = or_( | ||
| rule( | ||
| dictionary({ | ||
| 'чукотский', | ||
| 'эвенкийский', | ||
| 'корякский', | ||
| 'ненецкий', | ||
| 'таймырский', | ||
| 'агинский', | ||
| 'бурятский', | ||
| }) | ||
| ), | ||
| rule(caseless('коми'), '-', normalized('пермяцкий')), | ||
| rule(caseless('долгано'), '-', normalized('ненецкий')), | ||
| rule(caseless('ямало'), '-', normalized('ненецкий')), | ||
| ).interpretation( | ||
| Region.name | ||
| ) | ||
| AUTO_OKRUG_WORDS = or_( | ||
| rule( | ||
| normalized('автономный'), | ||
| normalized('округ') | ||
| ), | ||
| rule(caseless('ао')) | ||
| ).interpretation( | ||
| Region.type.const('автономный округ') | ||
| ) | ||
| HANTI = rule( | ||
| caseless('ханты'), '-', normalized('мансийский') | ||
| ).interpretation( | ||
| Region.name | ||
| ) | ||
| BURAT = rule( | ||
| caseless('усть'), '-', normalized('ордынский'), | ||
| normalized('бурятский') | ||
| ).interpretation( | ||
| Region.name | ||
| ) | ||
| AUTO_OKRUG = or_( | ||
| rule(AUTO_OKRUG_NAME, AUTO_OKRUG_WORDS), | ||
| or_( | ||
| rule( | ||
| HANTI, | ||
| AUTO_OKRUG_WORDS, | ||
| '-', normalized('югра') | ||
| ), | ||
| rule( | ||
| caseless('хмао'), | ||
| ).interpretation(Region.name), | ||
| rule( | ||
| caseless('хмао'), | ||
| '-', caseless('югра') | ||
| ).interpretation(Region.name), | ||
| ), | ||
| rule( | ||
| BURAT, | ||
| AUTO_OKRUG_WORDS | ||
| ) | ||
| ).interpretation( | ||
| Region | ||
| ) | ||
| ########## | ||
| # | ||
| # RAION | ||
| # | ||
| ########### | ||
| RAION_WORDS = or_( | ||
| rule(caseless('р'), '-', in_caseless({'он', 'н'})), | ||
| rule(normalized('район')) | ||
| ).interpretation( | ||
| Region.type.const('район') | ||
| ) | ||
| RAION_SIMPLE_NAME = and_( | ||
| ADJF, | ||
| TITLE | ||
| ) | ||
| RAION_MODIFIERS = rule( | ||
| in_caseless({ | ||
| 'усть', | ||
| 'северо', | ||
| 'александрово', | ||
| 'гаврилово', | ||
| }), | ||
| DASH.optional(), | ||
| TITLE | ||
| ) | ||
| RAION_COMPLEX_NAME = rule( | ||
| RAION_MODIFIERS, | ||
| RAION_SIMPLE_NAME | ||
| ) | ||
| RAION_NAME = or_( | ||
| rule(RAION_SIMPLE_NAME), | ||
| RAION_COMPLEX_NAME | ||
| ).interpretation( | ||
| Region.name | ||
| ) | ||
| RAION = rule( | ||
| RAION_NAME, | ||
| RAION_WORDS | ||
| ).interpretation( | ||
| Region | ||
| ) | ||
| ########### | ||
| # | ||
| # GOROD | ||
| # | ||
| ########### | ||
| # Top 200 Russian cities, covering 75% of the population | ||
| COMPLEX = morph_pipeline([ | ||
| 'санкт-петербург', | ||
| 'нижний новгород', | ||
| 'н.новгород', | ||
| 'ростов-на-дону', | ||
| 'набережные челны', | ||
| 'улан-удэ', | ||
| 'нижний тагил', | ||
| 'комсомольск-на-амуре', | ||
| 'йошкар-ола', | ||
| 'старый оскол', | ||
| 'великий новгород', | ||
| 'южно-сахалинск', | ||
| 'петропавловск-камчатский', | ||
| 'каменск-уральский', | ||
| 'орехово-зуево', | ||
| 'сергиев посад', | ||
| 'новый уренгой', | ||
| 'ленинск-кузнецкий', | ||
| 'великие луки', | ||
| 'каменск-шахтинский', | ||
| 'усть-илимск', | ||
| 'усолье-сибирский', | ||
| 'кирово-чепецк', | ||
| ]) | ||
| SIMPLE = dictionary({ | ||
| 'москва', | ||
| 'новосибирск', | ||
| 'екатеринбург', | ||
| 'казань', | ||
| 'самар', | ||
| 'омск', | ||
| 'челябинск', | ||
| 'уфа', | ||
| 'волгоград', | ||
| 'пермь', | ||
| 'красноярск', | ||
| 'воронеж', | ||
| 'саратов', | ||
| 'краснодар', | ||
| 'тольятти', | ||
| 'барнаул', | ||
| 'ижевск', | ||
| 'ульяновск', | ||
| 'владивосток', | ||
| 'ярославль', | ||
| 'иркутск', | ||
| 'тюмень', | ||
| 'махачкала', | ||
| 'хабаровск', | ||
| 'оренбург', | ||
| 'новокузнецк', | ||
| 'кемерово', | ||
| 'рязань', | ||
| 'томск', | ||
| 'астрахань', | ||
| 'пенза', | ||
| 'липецк', | ||
| 'тула', | ||
| 'киров', | ||
| 'чебоксары', | ||
| 'калининград', | ||
| 'брянск', | ||
| 'курск', | ||
| 'иваново', | ||
| 'магнитогорск', | ||
| 'тверь', | ||
| 'ставрополь', | ||
| 'симферополь', | ||
| 'белгород', | ||
| 'архангельск', | ||
| # 'владимир', | ||
| 'севастополь', | ||
| 'сочи', | ||
| 'курган', | ||
| 'смоленск', | ||
| 'калуга', | ||
| 'чита', | ||
| 'орёл', | ||
| # 'волжский', | ||
| 'череповец', | ||
| 'владикавказ', | ||
| 'мурманск', | ||
| 'сургут', | ||
| 'вологда', | ||
| 'саранск', | ||
| 'тамбов', | ||
| 'стерлитамак', | ||
| 'грозный', | ||
| 'якутск', | ||
| 'кострома', | ||
| 'петрозаводск', | ||
| 'таганрог', | ||
| 'нижневартовск', | ||
| 'братск', | ||
| 'новороссийск', | ||
| 'дзержинск', | ||
| 'шахта', | ||
| 'нальчик', | ||
| 'орск', | ||
| 'сыктывкар', | ||
| 'нижнекамск', | ||
| 'ангарск', | ||
| 'балашиха', | ||
| 'благовещенск', | ||
| 'прокопьевск', | ||
| 'химки', | ||
| 'псков', | ||
| 'бийск', | ||
| 'энгельс', | ||
| 'рыбинск', | ||
| 'балаково', | ||
| 'северодвинск', | ||
| 'армавир', | ||
| 'подольск', | ||
| # 'королёв', | ||
| 'сызрань', | ||
| 'норильск', | ||
| 'златоуст', | ||
| 'мытищи', | ||
| 'люберцы', | ||
| 'волгодонск', | ||
| 'новочеркасск', | ||
| 'абакан', | ||
| 'находка', | ||
| 'уссурийск', | ||
| 'березники', | ||
| 'салават', | ||
| 'электросталь', | ||
| 'миасс', | ||
| 'первоуральск', | ||
| 'рубцовск', | ||
| 'альметьевск', | ||
| 'ковровый', | ||
| 'коломна', | ||
| 'керчь', | ||
| 'майкоп', | ||
| 'пятигорск', | ||
| 'одинцово', | ||
| 'копейск', | ||
| 'хасавюрт', | ||
| 'новомосковск', | ||
| 'кисловодск', | ||
| 'серпухов', | ||
| 'новочебоксарск', | ||
| 'нефтеюганск', | ||
| 'димитровград', | ||
| 'нефтекамск', | ||
| 'черкесск', | ||
| 'дербент', | ||
| 'камышин', | ||
| 'невинномысск', | ||
| 'красногорск', | ||
| 'мур', | ||
| 'батайск', | ||
| 'новошахтинск', | ||
| 'ноябрьск', | ||
| 'кызыл', | ||
| # 'октябрьский', | ||
| 'ачинск', | ||
| 'северск', | ||
| 'новокуйбышевск', | ||
| 'елец', | ||
| 'евпатория', | ||
| 'арзамас', | ||
| 'обнинск', | ||
| 'каспийск', | ||
| 'элиста', | ||
| 'пушкино', | ||
| # 'жуковский', | ||
| 'междуреченск', | ||
| 'сарапул', | ||
| 'ессентуки', | ||
| 'воткинск', | ||
| 'ногинск', | ||
| 'тобольск', | ||
| 'ухта', | ||
| 'серов', | ||
| 'бердск', | ||
| 'мичуринск', | ||
| 'киселёвск', | ||
| 'новотроицк', | ||
| 'зеленодольск', | ||
| 'соликамск', | ||
| 'раменский', | ||
| 'домодедово', | ||
| 'магадан', | ||
| 'глазов', | ||
| 'железногорск', | ||
| 'канск', | ||
| 'назрань', | ||
| 'гатчина', | ||
| 'саров', | ||
| 'новоуральск', | ||
| 'воскресенск', | ||
| 'долгопрудный', | ||
| 'бугульма', | ||
| 'кузнецк', | ||
| 'губкин', | ||
| 'кинешма', | ||
| 'ейск', | ||
| 'реутов', | ||
| 'железногорск', | ||
| 'чайковский', | ||
| 'азов', | ||
| 'бузулук', | ||
| 'озёрск', | ||
| 'балашов', | ||
| 'юрга', | ||
| 'кропоткин', | ||
| 'клин' | ||
| }) | ||
| GOROD_ABBR = in_caseless({ | ||
| 'спб', | ||
| 'мск', | ||
| 'нск' # Новосибирск | ||
| }) | ||
| GOROD_NAME = or_( | ||
| rule(SIMPLE), | ||
| COMPLEX, | ||
| rule(GOROD_ABBR) | ||
| ).interpretation( | ||
| Settlement.name | ||
| ) | ||
| SIMPLE = and_( | ||
| TITLE, | ||
| or_( | ||
| NOUN, | ||
| ADJF # Железнодорожный, Юбилейный | ||
| ) | ||
| ) | ||
| COMPLEX = or_( | ||
| rule( | ||
| SIMPLE, | ||
| DASH.optional(), | ||
| SIMPLE | ||
| ), | ||
| rule( | ||
| TITLE, | ||
| DASH.optional(), | ||
| caseless('на'), | ||
| DASH.optional(), | ||
| TITLE | ||
| ) | ||
| ) | ||
| NAME = or_( | ||
| rule(SIMPLE), | ||
| COMPLEX | ||
| ) | ||
| MAYBE_GOROD_NAME = or_( | ||
| NAME, | ||
| rule(NAME, '-', INT) | ||
| ).interpretation( | ||
| Settlement.name | ||
| ) | ||
| GOROD_WORDS = or_( | ||
| rule(normalized('город')), | ||
| rule( | ||
| caseless('г'), | ||
| DOT.optional() | ||
| ) | ||
| ).interpretation( | ||
| Settlement.type.const('город') | ||
| ) | ||
| GOROD = or_( | ||
| rule(GOROD_WORDS, MAYBE_GOROD_NAME), | ||
| rule( | ||
| GOROD_WORDS.optional(), | ||
| GOROD_NAME | ||
| ) | ||
| ).interpretation( | ||
| Settlement | ||
| ) | ||
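As a toy illustration of the `GOROD_ABBR` branch above: the grammar only recognizes the abbreviations, so the full-name expansions below are an assumption added for this sketch, not something the rule itself produces.

```python
# The three abbreviations from GOROD_ABBR above; the expansions are
# assumptions for illustration.
GOROD_ABBR = {
    'спб': 'Санкт-Петербург',
    'мск': 'Москва',
    'нск': 'Новосибирск',
}

def expand_city(token):
    # Case-insensitive lookup, mirroring in_caseless; unknown tokens
    # pass through unchanged.
    return GOROD_ABBR.get(token.lower(), token)

expand_city('СПб')  # 'Санкт-Петербург'
```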
| ########## | ||
| # | ||
| # SETTLEMENT NAME | ||
| # | ||
| ########## | ||
| ADJS = gram('ADJS') | ||
| SIMPLE = and_( | ||
| or_( | ||
| NOUN, # Александровка, Заречье, Горки | ||
| ADJS, # Кузнецово | ||
| ADJF, # Никольское, Новая, Марьино | ||
| ), | ||
| TITLE | ||
| ) | ||
| COMPLEX = rule( | ||
| SIMPLE, | ||
| DASH.optional(), | ||
| SIMPLE | ||
| ) | ||
| NAME = or_( | ||
| rule(SIMPLE), | ||
| COMPLEX | ||
| ) | ||
| SETTLEMENT_NAME = or_( | ||
| NAME, | ||
| rule(NAME, '-', INT), | ||
| rule(NAME, ANUM) | ||
| ) | ||
| ########### | ||
| # | ||
| # SELO | ||
| # | ||
| ############# | ||
| SELO_WORDS = or_( | ||
| rule( | ||
| caseless('с'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('село')) | ||
| ).interpretation( | ||
| Settlement.type.const('село') | ||
| ) | ||
| SELO_NAME = SETTLEMENT_NAME.interpretation( | ||
| Settlement.name | ||
| ) | ||
| SELO = rule( | ||
| SELO_WORDS, | ||
| SELO_NAME | ||
| ).interpretation( | ||
| Settlement | ||
| ) | ||
| ########### | ||
| # | ||
| # DEREVNYA | ||
| # | ||
| ############# | ||
| DEREVNYA_WORDS = or_( | ||
| rule( | ||
| caseless('д'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('деревня')) | ||
| ).interpretation( | ||
| Settlement.type.const('деревня') | ||
| ) | ||
| DEREVNYA_NAME = SETTLEMENT_NAME.interpretation( | ||
| Settlement.name | ||
| ) | ||
| DEREVNYA = rule( | ||
| DEREVNYA_WORDS, | ||
| DEREVNYA_NAME | ||
| ).interpretation( | ||
| Settlement | ||
| ) | ||
| ########### | ||
| # | ||
| # POSELOK | ||
| # | ||
| ############# | ||
| POSELOK_WORDS = or_( | ||
| rule( | ||
| in_caseless({'п', 'пос'}), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('посёлок')), | ||
| rule( | ||
| caseless('р'), | ||
| DOT.optional(), | ||
| caseless('п'), | ||
| DOT.optional() | ||
| ), | ||
| rule( | ||
| normalized('рабочий'), | ||
| normalized('посёлок') | ||
| ), | ||
| rule( | ||
| caseless('пгт'), | ||
| DOT.optional() | ||
| ), | ||
| rule( | ||
| caseless('п'), DOT, caseless('г'), DOT, caseless('т'), | ||
| DOT.optional() | ||
| ), | ||
| rule( | ||
| normalized('посёлок'), | ||
| normalized('городского'), | ||
| normalized('типа'), | ||
| ), | ||
| ).interpretation( | ||
| Settlement.type.const('посёлок') | ||
| ) | ||
| POSELOK_NAME = SETTLEMENT_NAME.interpretation( | ||
| Settlement.name | ||
| ) | ||
| POSELOK = rule( | ||
| POSELOK_WORDS, | ||
| POSELOK_NAME | ||
| ).interpretation( | ||
| Settlement | ||
| ) | ||
| ############## | ||
| # | ||
| # ADDRESS PERSON | ||
| # | ||
| ############ | ||
| ABBR = and_( | ||
| length_eq(1), | ||
| is_title() | ||
| ) | ||
| PART = and_( | ||
| TITLE, | ||
| or_( | ||
| gram('Name'), | ||
| gram('Surn') | ||
| ) | ||
| ) | ||
| MAYBE_FIO = or_( | ||
| rule(TITLE, PART), | ||
| rule(PART, TITLE), | ||
| rule(ABBR, '.', TITLE), | ||
| rule(ABBR, '.', ABBR, '.', TITLE), | ||
| rule(TITLE, ABBR, '.', ABBR, '.') | ||
| ) | ||
| POSITION_WORDS_ = or_( | ||
| rule( | ||
| dictionary({ | ||
| 'мичман', | ||
| 'геолог', | ||
| 'подводник', | ||
| 'краевед', | ||
| 'снайпер', | ||
| 'штурман', | ||
| 'бригадир', | ||
| 'учитель', | ||
| 'политрук', | ||
| 'военком', | ||
| 'ветеран', | ||
| 'историк', | ||
| 'пулемётчик', | ||
| 'авиаконструктор', | ||
| 'адмирал', | ||
| 'академик', | ||
| 'актер', | ||
| 'актриса', | ||
| 'архитектор', | ||
| 'атаман', | ||
| 'врач', | ||
| 'воевода', | ||
| 'генерал', | ||
| 'губернатор', | ||
| 'хирург', | ||
| 'декабрист', | ||
| 'разведчик', | ||
| 'граф', | ||
| 'десантник', | ||
| 'конструктор', | ||
| 'скульптор', | ||
| 'писатель', | ||
| 'поэт', | ||
| 'капитан', | ||
| 'князь', | ||
| 'комиссар', | ||
| 'композитор', | ||
| 'космонавт', | ||
| 'купец', | ||
| 'лейтенант', | ||
| 'лётчик', | ||
| 'майор', | ||
| 'маршал', | ||
| 'матрос', | ||
| 'подполковник', | ||
| 'полковник', | ||
| 'профессор', | ||
| 'сержант', | ||
| 'старшина', | ||
| 'танкист', | ||
| 'художник', | ||
| 'герой', | ||
| 'княгиня', | ||
| 'строитель', | ||
| 'дружинник', | ||
| 'диктор', | ||
| 'прапорщик', | ||
| 'артиллерист', | ||
| 'графиня', | ||
| 'большевик', | ||
| 'патриарх', | ||
| 'сварщик', | ||
| 'офицер', | ||
| 'рыбак', | ||
| 'брат', | ||
| }) | ||
| ), | ||
| rule(normalized('генерал'), normalized('армия')), | ||
| rule(normalized('герой'), normalized('россия')), | ||
| rule( | ||
| normalized('герой'), | ||
| normalized('российский'), normalized('федерация')), | ||
| rule( | ||
| normalized('герой'), | ||
| normalized('советский'), normalized('союз') | ||
| ), | ||
| ) | ||
| ABBR_POSITION_WORDS = rule( | ||
| in_caseless({ | ||
| 'адм', | ||
| 'ак', | ||
| 'акад', | ||
| }), | ||
| DOT.optional() | ||
| ) | ||
| POSITION_WORDS = or_( | ||
| POSITION_WORDS_, | ||
| ABBR_POSITION_WORDS | ||
| ) | ||
| MAYBE_PERSON = or_( | ||
| MAYBE_FIO, | ||
| rule(POSITION_WORDS, MAYBE_FIO), | ||
| rule(POSITION_WORDS, TITLE) | ||
| ) | ||
| ########### | ||
| # | ||
| # IMENI | ||
| # | ||
| ########## | ||
| IMENI_WORDS = or_( | ||
| rule( | ||
| caseless('им'), | ||
| DOT.optional() | ||
| ), | ||
| rule(caseless('имени')) | ||
| ) | ||
| IMENI = or_( | ||
| rule( | ||
| IMENI_WORDS.optional(), | ||
| MAYBE_PERSON | ||
| ), | ||
| rule( | ||
| IMENI_WORDS, | ||
| TITLE | ||
| ) | ||
| ) | ||
| ########## | ||
| # | ||
| # LET | ||
| # | ||
| ########## | ||
| LET_WORDS = or_( | ||
| rule(caseless('лет')), | ||
| rule( | ||
| DASH.optional(), | ||
| caseless('летия') | ||
| ) | ||
| ) | ||
| LET_NAME = in_caseless({ | ||
| 'влксм', | ||
| 'ссср', | ||
| 'алтая', | ||
| 'башкирии', | ||
| 'бурятии', | ||
| 'дагестана', | ||
| 'калмыкии', | ||
| 'колхоза', | ||
| 'комсомола', | ||
| 'космонавтики', | ||
| 'москвы', | ||
| 'октября', | ||
| 'пионерии', | ||
| 'победы', | ||
| 'приморья', | ||
| 'района', | ||
| 'совхоза', | ||
| 'совхозу', | ||
| 'татарстана', | ||
| 'тувы', | ||
| 'удмуртии', | ||
| 'улуса', | ||
| 'хакасии', | ||
| 'целины', | ||
| 'чувашии', | ||
| 'якутии', | ||
| }) | ||
| LET = rule( | ||
| INT, | ||
| LET_WORDS, | ||
| LET_NAME | ||
| ) | ||
| ########## | ||
| # | ||
| # ADDRESS DATE | ||
| # | ||
| ############# | ||
| MONTH_WORDS = dictionary({ | ||
| 'январь', | ||
| 'февраль', | ||
| 'март', | ||
| 'апрель', | ||
| 'май', | ||
| 'июнь', | ||
| 'июль', | ||
| 'август', | ||
| 'сентябрь', | ||
| 'октябрь', | ||
| 'ноябрь', | ||
| 'декабрь', | ||
| }) | ||
| DAY = and_( | ||
| INT, | ||
| gte(1), | ||
| lte(31) | ||
| ) | ||
| YEAR = and_( | ||
| INT, | ||
| gte(1), | ||
| lte(2100) | ||
| ) | ||
| YEAR_WORDS = normalized('год') | ||
| DATE = or_( | ||
| rule(DAY, MONTH_WORDS), | ||
| rule(YEAR, YEAR_WORDS) | ||
| ) | ||
| ######### | ||
| # | ||
| # MODIFIER | ||
| # | ||
| ############ | ||
| MODIFIER_WORDS_ = rule( | ||
| dictionary({ | ||
| 'большой', | ||
| 'малый', | ||
| 'средний', | ||
| 'верхний', | ||
| 'центральный', | ||
| 'нижний', | ||
| 'северный', | ||
| 'дальний', | ||
| 'первый', | ||
| 'второй', | ||
| 'старый', | ||
| 'новый', | ||
| 'красный', | ||
| 'лесной', | ||
| 'тихий', | ||
| }), | ||
| DASH.optional() | ||
| ) | ||
| ABBR_MODIFIER_WORDS = rule( | ||
| in_caseless({ | ||
| 'б', 'м', 'н' | ||
| }), | ||
| DOT.optional() | ||
| ) | ||
| SHORT_MODIFIER_WORDS = rule( | ||
| in_caseless({ | ||
| 'больше', | ||
| 'мало', | ||
| 'средне', | ||
| 'верх', | ||
| 'верхне', | ||
| 'центрально', | ||
| 'нижне', | ||
| 'северо', | ||
| 'дальне', | ||
| 'восточно', | ||
| 'западно', | ||
| 'перво', | ||
| 'второ', | ||
| 'старо', | ||
| 'ново', | ||
| 'красно', | ||
| 'тихо', | ||
| 'горно', | ||
| }), | ||
| DASH.optional() | ||
| ) | ||
| MODIFIER_WORDS = or_( | ||
| MODIFIER_WORDS_, | ||
| ABBR_MODIFIER_WORDS, | ||
| SHORT_MODIFIER_WORDS, | ||
| ) | ||
| ########## | ||
| # | ||
| # ADDRESS NAME | ||
| # | ||
| ########## | ||
| ROD = gram('gent') | ||
| SIMPLE = and_( | ||
| or_( | ||
| ADJF, # Школьная | ||
| and_(NOUN, ROD), # Ленина, Победы | ||
| ), | ||
| TITLE | ||
| ) | ||
| COMPLEX = or_( | ||
| rule( | ||
| and_(ADJF, TITLE), | ||
| NOUN | ||
| ), | ||
| rule( | ||
| TITLE, | ||
| DASH.optional(), | ||
| TITLE | ||
| ), | ||
| ) | ||
| # TODO | ||
| EXCEPTION = dictionary({ | ||
| 'арбат', | ||
| 'варварка' | ||
| }) | ||
| MAYBE_NAME = or_( | ||
| rule(SIMPLE), | ||
| COMPLEX, | ||
| rule(EXCEPTION) | ||
| ) | ||
| NAME = or_( | ||
| MAYBE_NAME, | ||
| LET, | ||
| DATE, | ||
| IMENI | ||
| ) | ||
| NAME = rule( | ||
| MODIFIER_WORDS.optional(), | ||
| NAME | ||
| ) | ||
| ADDRESS_CRF = tag('I').repeatable() | ||
| NAME = or_( | ||
| NAME, | ||
| ANUM, | ||
| rule(NAME, ANUM), | ||
| rule(ANUM, NAME), | ||
| rule(INT, DASH.optional(), NAME), | ||
| rule(NAME, DASH, INT), | ||
| ADDRESS_CRF | ||
| ) | ||
| ADDRESS_NAME = NAME | ||
| ######## | ||
| # | ||
| # STREET | ||
| # | ||
| ######### | ||
| STREET_WORDS = or_( | ||
| rule(normalized('улица')), | ||
| rule( | ||
| caseless('ул'), | ||
| DOT.optional() | ||
| ) | ||
| ).interpretation( | ||
| Street.type.const('улица') | ||
| ) | ||
| STREET_NAME = ADDRESS_NAME.interpretation( | ||
| Street.name | ||
| ) | ||
| STREET = or_( | ||
| rule(STREET_WORDS, STREET_NAME), | ||
| rule(STREET_NAME, STREET_WORDS) | ||
| ).interpretation( | ||
| Street | ||
| ) | ||
| ########## | ||
| # | ||
| # PROSPEKT | ||
| # | ||
| ########## | ||
| PROSPEKT_WORDS = or_( | ||
| rule( | ||
| in_caseless({'пр', 'просп'}), | ||
| DOT.optional() | ||
| ), | ||
| rule( | ||
| caseless('пр'), | ||
| '-', | ||
| in_caseless({'кт', 'т'}), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('проспект')) | ||
| ).interpretation( | ||
| Street.type.const('проспект') | ||
| ) | ||
| PROSPEKT_NAME = ADDRESS_NAME.interpretation( | ||
| Street.name | ||
| ) | ||
| PROSPEKT = or_( | ||
| rule(PROSPEKT_WORDS, PROSPEKT_NAME), | ||
| rule(PROSPEKT_NAME, PROSPEKT_WORDS) | ||
| ).interpretation( | ||
| Street | ||
| ) | ||
| ############ | ||
| # | ||
| # PROEZD | ||
| # | ||
| ############# | ||
| PROEZD_WORDS = or_( | ||
| rule(caseless('пр'), DOT.optional()), | ||
| rule( | ||
| caseless('пр'), | ||
| '-', | ||
| in_caseless({'зд', 'д'}), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('проезд')) | ||
| ).interpretation( | ||
| Street.type.const('проезд') | ||
| ) | ||
| PROEZD_NAME = ADDRESS_NAME.interpretation( | ||
| Street.name | ||
| ) | ||
| PROEZD = or_( | ||
| rule(PROEZD_WORDS, PROEZD_NAME), | ||
| rule(PROEZD_NAME, PROEZD_WORDS) | ||
| ).interpretation( | ||
| Street | ||
| ) | ||
| ########### | ||
| # | ||
| # PEREULOK | ||
| # | ||
| ############## | ||
| PEREULOK_WORDS = or_( | ||
| rule( | ||
| caseless('п'), | ||
| DOT | ||
| ), | ||
| rule( | ||
| caseless('пер'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('переулок')) | ||
| ).interpretation( | ||
| Street.type.const('переулок') | ||
| ) | ||
| PEREULOK_NAME = ADDRESS_NAME.interpretation( | ||
| Street.name | ||
| ) | ||
| PEREULOK = or_( | ||
| rule(PEREULOK_WORDS, PEREULOK_NAME), | ||
| rule(PEREULOK_NAME, PEREULOK_WORDS) | ||
| ).interpretation( | ||
| Street | ||
| ) | ||
| ######## | ||
| # | ||
| # PLOSHAD | ||
| # | ||
| ########## | ||
| PLOSHAD_WORDS = or_( | ||
| rule( | ||
| caseless('пл'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('площадь')) | ||
| ).interpretation( | ||
| Street.type.const('площадь') | ||
| ) | ||
| PLOSHAD_NAME = ADDRESS_NAME.interpretation( | ||
| Street.name | ||
| ) | ||
| PLOSHAD = or_( | ||
| rule(PLOSHAD_WORDS, PLOSHAD_NAME), | ||
| rule(PLOSHAD_NAME, PLOSHAD_WORDS) | ||
| ).interpretation( | ||
| Street | ||
| ) | ||
| ############ | ||
| # | ||
| # SHOSSE | ||
| # | ||
| ########### | ||
| # TODO | ||
| # Покровское 17 км. | ||
| # Сергеляхское 13 км | ||
| # Сергеляхское 14 км. | ||
| SHOSSE_WORDS = or_( | ||
| rule( | ||
| caseless('ш'), | ||
| DOT | ||
| ), | ||
| rule(normalized('шоссе')) | ||
| ).interpretation( | ||
| Street.type.const('шоссе') | ||
| ) | ||
| SHOSSE_NAME = ADDRESS_NAME.interpretation( | ||
| Street.name | ||
| ) | ||
| SHOSSE = or_( | ||
| rule(SHOSSE_WORDS, SHOSSE_NAME), | ||
| rule(SHOSSE_NAME, SHOSSE_WORDS) | ||
| ).interpretation( | ||
| Street | ||
| ) | ||
| ######## | ||
| # | ||
| # NABEREG | ||
| # | ||
| ########## | ||
| NABEREG_WORDS = or_( | ||
| rule( | ||
| caseless('наб'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('набережная')) | ||
| ).interpretation( | ||
| Street.type.const('набережная') | ||
| ) | ||
| NABEREG_NAME = ADDRESS_NAME.interpretation( | ||
| Street.name | ||
| ) | ||
| NABEREG = or_( | ||
| rule(NABEREG_WORDS, NABEREG_NAME), | ||
| rule(NABEREG_NAME, NABEREG_WORDS) | ||
| ).interpretation( | ||
| Street | ||
| ) | ||
| ######## | ||
| # | ||
| # BULVAR | ||
| # | ||
| ########## | ||
| BULVAR_WORDS = or_( | ||
| rule( | ||
| caseless('б'), | ||
| '-', | ||
| caseless('р') | ||
| ), | ||
| rule( | ||
| caseless('б'), | ||
| DOT | ||
| ), | ||
| rule( | ||
| caseless('бул'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('бульвар')) | ||
| ).interpretation( | ||
| Street.type.const('бульвар') | ||
| ) | ||
| BULVAR_NAME = ADDRESS_NAME.interpretation( | ||
| Street.name | ||
| ) | ||
| BULVAR = or_( | ||
| rule(BULVAR_WORDS, BULVAR_NAME), | ||
| rule(BULVAR_NAME, BULVAR_WORDS) | ||
| ).interpretation( | ||
| Street | ||
| ) | ||
| ############## | ||
| # | ||
| # ADDRESS VALUE | ||
| # | ||
| ############# | ||
| LETTER = in_caseless(set('абвгдежзиклмнопрстуфхшщэюя')) | ||
| QUOTE = in_(QUOTES) | ||
| LETTER = or_( | ||
| rule(LETTER), | ||
| rule(QUOTE, LETTER, QUOTE) | ||
| ) | ||
| VALUE = rule( | ||
| INT, | ||
| LETTER.optional() | ||
| ) | ||
| SEP = in_(r'/\-') | ||
| VALUE = or_( | ||
| rule(VALUE), | ||
| rule(VALUE, SEP, VALUE), | ||
| rule(VALUE, SEP, LETTER) | ||
| ) | ||
| ADDRESS_VALUE = rule( | ||
| eq('№').optional(), | ||
| VALUE | ||
| ) | ||
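As an illustration only (not part of the library), the `ADDRESS_VALUE` rule above can be approximated by a regular expression: an optional `№`, an integer with an optional (possibly quoted) Cyrillic letter, and an optional `/`-, `\`- or `-`-separated second part. The names `LETTER`, `VALUE` and `ADDRESS_VALUE` here are hypothetical stand-ins for the yargy rules:

```python
import re

# Rough regex sketch of the ADDRESS_VALUE yargy rule (illustration only).
LETTER = r'[абвгдежзиклмнопрстуфхшщэюя]'
VALUE = rf'\d+\s*(?:[«"]?{LETTER}[»"]?)?'
ADDRESS_VALUE = re.compile(
    rf'№?\s*{VALUE}(?:[/\\-]\s*(?:{VALUE}|{LETTER}))?',
    re.IGNORECASE
)

# Matches house numbers like "93 б", "№ 18", "8/2", "15-а".
for text in ['93 б', '№ 18', '8/2', '15-а']:
    assert ADDRESS_VALUE.fullmatch(text), text
```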
| ############ | ||
| # | ||
| # DOM | ||
| # | ||
| ############# | ||
| DOM_WORDS = or_( | ||
| rule(normalized('дом')), | ||
| rule( | ||
| caseless('д'), | ||
| DOT | ||
| ) | ||
| ).interpretation( | ||
| Building.type.const('дом') | ||
| ) | ||
| DOM_VALUE = ADDRESS_VALUE.interpretation( | ||
| Building.number | ||
| ) | ||
| DOM = rule( | ||
| DOM_WORDS.optional(), | ||
| DOM_VALUE | ||
| ).interpretation( | ||
| Building | ||
| ) | ||
| ########### | ||
| # | ||
| # KORPUS | ||
| # | ||
| ########## | ||
| KORPUS_WORDS = or_( | ||
| rule( | ||
| in_caseless({'корп', 'кор'}), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('корпус')) | ||
| ).interpretation( | ||
| Building.type.const('корпус') | ||
| ) | ||
| KORPUS_VALUE = ADDRESS_VALUE.interpretation( | ||
| Building.number | ||
| ) | ||
| KORPUS = or_( | ||
| rule( | ||
| KORPUS_WORDS, | ||
| KORPUS_VALUE | ||
| ), | ||
| rule( | ||
| KORPUS_VALUE, | ||
| KORPUS_WORDS | ||
| ) | ||
| ).interpretation( | ||
| Building | ||
| ) | ||
| ########### | ||
| # | ||
| # STROENIE | ||
| # | ||
| ########## | ||
| STROENIE_WORDS = or_( | ||
| rule( | ||
| caseless('стр'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('строение')) | ||
| ).interpretation( | ||
| Building.type.const('строение') | ||
| ) | ||
| STROENIE_VALUE = ADDRESS_VALUE.interpretation( | ||
| Building.number | ||
| ) | ||
| STROENIE = rule( | ||
| STROENIE_WORDS, | ||
| STROENIE_VALUE | ||
| ).interpretation( | ||
| Building | ||
| ) | ||
| ########### | ||
| # | ||
| # OFIS | ||
| # | ||
| ############# | ||
| OFIS_WORDS = or_( | ||
| rule( | ||
| caseless('оф'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('офис')) | ||
| ).interpretation( | ||
| Room.type.const('офис') | ||
| ) | ||
| OFIS_VALUE = ADDRESS_VALUE.interpretation( | ||
| Room.number | ||
| ) | ||
| OFIS = rule( | ||
| OFIS_WORDS, | ||
| OFIS_VALUE | ||
| ).interpretation( | ||
| Room | ||
| ) | ||
| ########### | ||
| # | ||
| # KVARTIRA | ||
| # | ||
| ############# | ||
| KVARTIRA_WORDS = or_( | ||
| rule( | ||
| caseless('кв'), | ||
| DOT.optional() | ||
| ), | ||
| rule(normalized('квартира')) | ||
| ).interpretation( | ||
| Room.type.const('квартира') | ||
| ) | ||
| KVARTIRA_VALUE = ADDRESS_VALUE.interpretation( | ||
| Room.number | ||
| ) | ||
| KVARTIRA = rule( | ||
| KVARTIRA_WORDS, | ||
| KVARTIRA_VALUE | ||
| ).interpretation( | ||
| Room | ||
| ) | ||
| ########### | ||
| # | ||
| # INDEX | ||
| # | ||
| ############# | ||
| INDEX = and_( | ||
| INT, | ||
| gte(100000), | ||
| lte(999999) | ||
| ).interpretation( | ||
| Index.value | ||
| ).interpretation( | ||
| Index | ||
| ) | ||
| ############# | ||
| # | ||
| # FULL ADDRESS | ||
| # | ||
| ############ | ||
| OBLAST_LEVEL = or_( | ||
| RESPUBLIKA, | ||
| KRAI, | ||
| OBLAST, | ||
| AUTO_OKRUG | ||
| ) | ||
| GOROD_LEVEL = or_( | ||
| GOROD, | ||
| DEREVNYA, | ||
| SELO, | ||
| POSELOK | ||
| ) | ||
| STREET_LEVEL = or_( | ||
| STREET, | ||
| PROSPEKT, | ||
| PROEZD, | ||
| PEREULOK, | ||
| PLOSHAD, | ||
| SHOSSE, | ||
| NABEREG, | ||
| BULVAR | ||
| ) | ||
| OFIS_LEVEL = or_( | ||
| OFIS, | ||
| KVARTIRA | ||
| ) | ||
| PRE_STREET_LEVEL = or_( | ||
| INDEX, | ||
| COUNTRY, | ||
| OBLAST_LEVEL, | ||
| RAION, | ||
| GOROD_LEVEL | ||
| ) | ||
| POST_STREET_LEVEL = or_( | ||
| KORPUS, | ||
| STROENIE, | ||
| OFIS_LEVEL | ||
| ) | ||
| SEP = in_(',;') | ||
| ADDRESS = rule( | ||
| rule( | ||
| PRE_STREET_LEVEL.interpretation( | ||
| Address.parts | ||
| ), | ||
| SEP.optional() | ||
| ).optional().repeatable(), | ||
| STREET_LEVEL.interpretation( | ||
| Address.parts | ||
| ), | ||
| SEP.optional(), | ||
| DOM.interpretation( | ||
| Address.parts | ||
| ), | ||
| rule( | ||
| SEP.optional(), | ||
| POST_STREET_LEVEL.interpretation( | ||
| Address.parts | ||
| ), | ||
| ).optional().repeatable(), | ||
| ).interpretation( | ||
| Address | ||
| ) |
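The `ADDRESS` rule above expects comma-separated parts ordered as pre-street levels (index, country, region, settlement), then a street-level part, then the house, then optional post-street parts (building, room). A toy keyword classifier (illustration only; `LEVELS` and `classify` are hypothetical helpers, not yargy code) sketches that ordering:

```python
# Toy illustration of the level structure in the ADDRESS rule above.
# Maps an abbreviation to the level its part belongs to.
LEVELS = {
    'обл.': 'region', 'г.': 'settlement',
    'ул.': 'street', 'пр.': 'street',
    'д.': 'building', 'кв.': 'room',
}

def classify(part):
    # Hypothetical helper: pick the level of one comma-separated chunk.
    for marker, level in LEVELS.items():
        if marker in part:
            return level
    return 'unknown'

parts = 'г. Находка, ул. Добролюбова, д. 18'.split(', ')
assert [classify(p) for p in parts] == ['settlement', 'street', 'building']
```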
| # coding: utf-8 | ||
| from __future__ import unicode_literals | ||
| from yargy import ( | ||
| rule, | ||
| and_, or_, not_ | ||
| ) | ||
| from yargy.interpretation import fact | ||
| from yargy.predicates import ( | ||
| caseless, normalized, | ||
| eq, length_eq, | ||
| gram, dictionary, | ||
| is_single, is_title | ||
| ) | ||
| from yargy.relations import gnc_relation | ||
| Location = fact( | ||
| 'Location', | ||
| ['name'], | ||
| ) | ||
| gnc = gnc_relation() | ||
| REGION = rule( | ||
| gram('ADJF').match(gnc), | ||
| dictionary({ | ||
| 'край', | ||
| 'район', | ||
| 'область', | ||
| 'губерния', | ||
| 'уезд', | ||
| }).match(gnc), | ||
| ).interpretation(Location.name.inflected()) | ||
| gnc = gnc_relation() | ||
| FEDERAL_DISTRICT = rule( | ||
| rule(caseless('северо'), '-').optional(), | ||
| dictionary({ | ||
| 'центральный', | ||
| 'западный', | ||
| 'южный', | ||
| 'кавказский', | ||
| 'приволжский', | ||
| 'уральский', | ||
| 'сибирский', | ||
| 'дальневосточный', | ||
| }).match(gnc), | ||
| or_( | ||
| rule( | ||
| dictionary({'федеральный'}).match(gnc), | ||
| dictionary({'округ'}).match(gnc), | ||
| ), | ||
| rule('ФО'), | ||
| ), | ||
| ).interpretation(Location.name.inflected()) | ||
| gnc = gnc_relation() | ||
| AUTONOMOUS_DISTRICT = rule( | ||
| gram('ADJF').match(gnc).repeatable(), | ||
| or_( | ||
| rule( | ||
| dictionary({'автономный'}).match(gnc), | ||
| dictionary({'округ'}).match(gnc), | ||
| ), | ||
| rule('АО'), | ||
| ), | ||
| ).interpretation(Location.name.inflected()) | ||
| gnc = gnc_relation() | ||
| FEDERATION = rule( | ||
| gram('ADJF').match(gnc).repeatable(), | ||
| dictionary({ | ||
| 'республика', | ||
| 'федерация', | ||
| }).match(gnc) | ||
| ).interpretation(Location.name.inflected()) | ||
| gnc = gnc_relation() | ||
| ADJX_FEDERATION = rule( | ||
| or_( | ||
| gram('Adjx'), | ||
| gram('ADJF'), | ||
| ).match(gnc).repeatable(), | ||
| dictionary({ | ||
| 'штат', | ||
| 'эмират', | ||
| }).match(gnc), | ||
| gram('gent').optional().repeatable() | ||
| ).interpretation(Location.name.inflected()) | ||
| gnc = gnc_relation() | ||
| STATE = rule( | ||
| dictionary({ | ||
| 'графство', | ||
| 'штат', | ||
| }), | ||
| gram('ADJF').match(gnc).optional(), | ||
| gram('NOUN').match(gnc), | ||
| ).interpretation(Location.name.inflected()) | ||
| gnc = gnc_relation() | ||
| LOCALITY = rule( | ||
| and_( | ||
| dictionary({ | ||
| 'город', | ||
| 'деревня', | ||
| 'село', | ||
| }), | ||
| not_( | ||
| or_( | ||
| gram('Abbr'), | ||
| gram('PREP'), | ||
| gram('CONJ'), | ||
| gram('PRCL'), | ||
| ), | ||
| ), | ||
| ).optional(), | ||
| and_( | ||
| gram('ADJF'), | ||
| ).match(gnc).optional(), | ||
| and_( | ||
| gram('Geox'), | ||
| not_( | ||
| or_( | ||
| gram('Abbr'), | ||
| gram('PREP'), | ||
| gram('CONJ'), | ||
| gram('PRCL'), | ||
| ), | ||
| ), | ||
| ).match(gnc) | ||
| ).interpretation(Location.name.inflected()) | ||
| LOCATION = or_( | ||
| REGION, | ||
| FEDERAL_DISTRICT, | ||
| AUTONOMOUS_DISTRICT, | ||
| FEDERATION, | ||
| ADJX_FEDERATION, | ||
| STATE, | ||
| LOCALITY, | ||
| ).interpretation(Location) |
| # coding: utf-8 | ||
| from __future__ import unicode_literals | ||
| from yargy import ( | ||
| rule, | ||
| not_, | ||
| and_, | ||
| or_, | ||
| ) | ||
| from yargy.interpretation import attribute, fact | ||
| from yargy.predicates import ( | ||
| eq, | ||
| in_, | ||
| true, | ||
| gram, | ||
| type, | ||
| caseless, | ||
| normalized, | ||
| is_capitalized, | ||
| ) | ||
| from yargy.relations import ( | ||
| gnc_relation, | ||
| case_relation, | ||
| ) | ||
| from yargy.pipelines import morph_pipeline | ||
| from yargy.tokenizer import QUOTES | ||
| from .name import SIMPLE_NAME | ||
| from .person import POSITION_NAME | ||
| from yargy.rule.transformators import RuleTransformator | ||
| class StripInterpretationTransformator(RuleTransformator): | ||
| def visit_InterpretationRule(self, item): | ||
| return self.visit(item.rule) | ||
| NAME = SIMPLE_NAME.transform(StripInterpretationTransformator) | ||
| PERSON = POSITION_NAME.transform(StripInterpretationTransformator) | ||
| Organisation = fact('Organisation', ['name']) | ||
| TYPE = morph_pipeline([ | ||
| 'АО', | ||
| 'ОАО', | ||
| 'ООО', | ||
| 'ЗАО', | ||
| 'ПАО', | ||
| # TODO Check abbrs | ||
| # 'ик', | ||
| # 'нк', | ||
| # 'хк', | ||
| # 'ип', | ||
| # 'чп', | ||
| # 'ичп', | ||
| # 'гпф', | ||
| # 'нпф', | ||
| # 'бф', | ||
| # 'спао', | ||
| # 'сро', | ||
| 'общество', | ||
| 'акционерное общество', | ||
| 'открытое акционерное общество', | ||
| 'общество с ограниченной ответственностью', | ||
| 'закрытое акционерное общество', | ||
| 'публичное акционерное общество', | ||
| 'агентство', | ||
| 'компания', | ||
| 'организация', | ||
| 'издательство', | ||
| 'газета', | ||
| 'концерн', | ||
| 'фирма', | ||
| 'завод', | ||
| 'предприятие', | ||
| 'корпорация', | ||
| 'группа', | ||
| 'группа компаний', | ||
| 'санаторий', | ||
| 'объединение', | ||
| 'бюро', | ||
| 'подразделение', | ||
| 'филиал', | ||
| 'представительство', | ||
| 'фонд', | ||
| 'центр', | ||
| 'нии', | ||
| 'академия', | ||
| 'академия наук', | ||
| 'обсерватория', | ||
| 'университет', | ||
| 'институт', | ||
| 'политех', | ||
| 'колледж', | ||
| 'техникум', | ||
| 'училище', | ||
| 'школа', | ||
| 'музей', | ||
| 'библиотека', | ||
| 'авиакомпания', | ||
| 'госкомпания', | ||
| 'инвесткомпания', | ||
| 'медиакомпания', | ||
| 'оффшор-компания', | ||
| 'радиокомпания', | ||
| 'телекомпания', | ||
| 'телерадиокомпания', | ||
| 'траст-компания', | ||
| 'фактор-компания', | ||
| 'холдинг-компания', | ||
| 'энергокомпания', | ||
| 'компания-производитель', | ||
| 'компания-изготовитель', | ||
| 'компания-заказчик', | ||
| 'компания-исполнитель', | ||
| 'компания-посредник', | ||
| 'группа управляющих компаний', | ||
| 'агрофирма', | ||
| 'турфирма', | ||
| 'юрфирма', | ||
| 'фирма-производитель', | ||
| 'фирма-изготовитель', | ||
| 'фирма-заказчик', | ||
| 'фирма-исполнитель', | ||
| 'фирма-посредник', | ||
| 'авиапредприятие', | ||
| 'агропредприятие', | ||
| 'госпредприятие', | ||
| 'нацпредприятие', | ||
| 'промпредприятие', | ||
| 'энергопредприятие', | ||
| 'авиакорпорация', | ||
| 'госкорпорация', | ||
| 'профорганизация', | ||
| 'стартап', | ||
| 'нотариальная контора', | ||
| 'букмекерская контора', | ||
| 'авиазавод', | ||
| 'автозавод', | ||
| 'винзавод', | ||
| 'подстанция', | ||
| 'гидроэлектростанция', | ||
| ]) | ||
| gnc = gnc_relation() | ||
| ADJF_PREFIX = rule( | ||
| or_( | ||
| rule(gram('ADJF').match(gnc)), # международное | ||
| rule( # историко-просветительское | ||
| true(), | ||
| eq('-'), | ||
| gram('ADJF').match(gnc), | ||
| ), | ||
| ), | ||
| or_(caseless('и'), eq(',')).optional(), | ||
| ).repeatable() | ||
| case = case_relation() | ||
| GENT_GROUP = rule( | ||
| gram('gent').match(case) | ||
| ).repeatable().optional() | ||
| QUOTED = rule( | ||
| TYPE, | ||
| in_(QUOTES), | ||
| not_(in_(QUOTES)).repeatable(), | ||
| in_(QUOTES), | ||
| ) | ||
| QUOTED_WITH_ADJF_PREFIX = rule( | ||
| ADJF_PREFIX, | ||
| QUOTED, | ||
| ) | ||
| BASIC = rule( | ||
| ADJF_PREFIX, | ||
| TYPE, | ||
| ) | ||
| NAMED = rule( | ||
| or_( | ||
| QUOTED, | ||
| QUOTED_WITH_ADJF_PREFIX, | ||
| BASIC, | ||
| ), | ||
| GENT_GROUP, | ||
| or_( | ||
| rule(normalized('имя')), | ||
| rule(caseless('им'), eq('.').optional()), | ||
| ), | ||
| or_( | ||
| NAME, | ||
| PERSON, | ||
| ), | ||
| ) | ||
| LATIN = rule( | ||
| TYPE, | ||
| or_( | ||
| rule( | ||
| and_( | ||
| type('LATIN'), | ||
| is_capitalized(), | ||
| ) | ||
| ), | ||
| rule( | ||
| type('LATIN'), | ||
| in_({'&', '/', '.'}), | ||
| type('LATIN'), | ||
| ) | ||
| ).repeatable() | ||
| ) | ||
| KNOWN = rule( | ||
| gram('Orgn'), | ||
| GENT_GROUP, | ||
| ) | ||
| ORGANISATION_ = or_( | ||
| QUOTED, | ||
| QUOTED_WITH_ADJF_PREFIX, | ||
| BASIC, | ||
| NAMED, | ||
| LATIN, | ||
| KNOWN, | ||
| ) | ||
| ORGANISATION = ORGANISATION_.interpretation( | ||
| Organisation.name | ||
| ).interpretation( | ||
| Organisation | ||
| ) |
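The `QUOTED` pattern above (a legal-form type followed by a quoted name) can be mimicked, for illustration only, with a regex over a few of the abbreviations from `TYPE`. `QUOTED_ORG` is a hypothetical name, not part of the library:

```python
import re

# Rough regex analogue of the QUOTED organisation pattern (illustration only):
# a legal-form abbreviation, then a name in «» or "" quotes.
QUOTED_ORG = re.compile(r'(АО|ОАО|ООО|ЗАО|ПАО)\s*[«"]([^»"]+)[»"]')

m = QUOTED_ORG.search('Договор с ООО «Ромашка» подписан.')
assert m.group(1) == 'ООО' and m.group(2) == 'Ромашка'
```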
| # coding: utf-8 | ||
| from __future__ import unicode_literals | ||
| from yargy import ( | ||
| rule, | ||
| or_ | ||
| ) | ||
| from yargy.interpretation import fact | ||
| from yargy.predicates import gram | ||
| from yargy.pipelines import morph_pipeline | ||
| from .name import ( | ||
| NAME, | ||
| SIMPLE_NAME | ||
| ) | ||
| Person = fact( | ||
| 'Person', | ||
| ['position', 'name'] | ||
| ) | ||
| POSITION = morph_pipeline([ | ||
| 'святой', | ||
| 'патриарх', | ||
| 'митрополит', | ||
| 'царь', | ||
| 'король', | ||
| 'царица', | ||
| 'император', | ||
| 'императрица', | ||
| 'принц', | ||
| 'принцесса', | ||
| 'князь', | ||
| 'граф', | ||
| 'графиня', | ||
| 'барон', | ||
| 'баронесса', | ||
| 'княгиня', | ||
| 'президент', | ||
| 'премьер-министр', | ||
| 'экс-премьер', | ||
| 'пресс-секретарь', | ||
| 'министр', | ||
| 'замминистр', | ||
| 'заместитель', | ||
| 'глава', | ||
| 'канцлер', | ||
| 'помощник', | ||
| 'посол', | ||
| 'губернатор', | ||
| 'председатель', | ||
| 'спикер', | ||
| 'диктатор', | ||
| 'лидер', | ||
| 'генсек', | ||
| 'премьер', | ||
| 'депутат', | ||
| 'вице-премьер', | ||
| 'сенатор', | ||
| 'полпред', | ||
| 'госсекретарь', | ||
| 'вице-президент', | ||
| 'сопредседатель', | ||
| 'зам', | ||
| 'мэр', | ||
| 'вице-спикер', | ||
| 'замруководителя', | ||
| 'зампред', | ||
| 'муфтий', | ||
| 'спецпредставитель', | ||
| 'руководитель', | ||
| 'статс-секретарь', | ||
| 'зампредседатель', | ||
| 'представитель', | ||
| 'ставленник', | ||
| 'мадеро', | ||
| 'вице-губернатор', | ||
| 'зампредсовмин', | ||
| 'наркоминдела', | ||
| 'генпрокурор', | ||
| 'комиссар', | ||
| 'рейхсканцлер', | ||
| 'советник', | ||
| 'замглавы', | ||
| 'секретарь', | ||
| 'парламентарий', | ||
| 'замгендиректор', | ||
| 'вице-председатель', | ||
| 'постпред', | ||
| 'госкомтруд', | ||
| 'предсовмин', | ||
| 'преемник', | ||
| 'делегат', | ||
| 'шеф', | ||
| 'консул', | ||
| 'замминистра', | ||
| 'главкомпис', | ||
| 'чиновник', | ||
| 'врио', | ||
| 'управделами', | ||
| 'нарком', | ||
| 'донпродкомиссар', | ||
| 'председ', | ||
| 'гендиректор', | ||
| 'генерал-губернатор', | ||
| 'обревком', | ||
| 'правитель', | ||
| 'замсекретарь', | ||
| 'главнокомандующий', | ||
| 'вице-мэр', | ||
| 'наместник', | ||
| 'спичрайтер', | ||
| 'вице-консул', | ||
| 'мвэс', | ||
| 'облревком', | ||
| 'главковерх', | ||
| 'пресс-атташе', | ||
| 'торгпред', | ||
| 'член', | ||
| 'назначенец', | ||
| 'эмиссар', | ||
| 'обрядоначальник', | ||
| 'главком', | ||
| 'единоросс', | ||
| 'политик', | ||
| 'генерал', | ||
| 'замгенпрокурор', | ||
| 'дипломат', | ||
| 'главноуполномоченный', | ||
| 'генерал-фельдцейхмейстер', | ||
| 'комендант', | ||
| 'казначей', | ||
| 'уполномоченный', | ||
| 'обер-прокурор', | ||
| 'наркомзем', | ||
| 'соправитель', | ||
| 'основатель', | ||
| 'сооснователь', | ||
| 'управляющий директор', | ||
| 'управляющий партнер', | ||
| 'партнер', | ||
| 'руководитель', | ||
| 'аналитик', | ||
| 'зампред', | ||
| 'миллиардер', | ||
| 'миллионер', | ||
| 'автор', | ||
| 'актер', | ||
| 'актриса', | ||
| 'певец', | ||
| 'певица', | ||
| 'исполнитель', | ||
| 'солист', | ||
| 'режиссер', | ||
| 'сценарист', | ||
| 'писатель', | ||
| 'музыкант', | ||
| 'композитор', | ||
| 'корреспондент', | ||
| 'журналист', | ||
| 'редактор', | ||
| 'дирижер', | ||
| 'кинорежиссер', | ||
| 'звукорежиссер', | ||
| 'детектив', | ||
| 'пианист', | ||
| 'драматург', | ||
| 'артист', | ||
| 'балетмейстер', | ||
| 'скрипач', | ||
| 'хореограф', | ||
| 'танцовщик', | ||
| 'документалист', | ||
| 'поэт', | ||
| 'литератор', | ||
| 'киноактер', | ||
| 'вокалист', | ||
| 'бард', | ||
| 'комик', | ||
| 'продюсер', | ||
| 'кинодраматург', | ||
| 'киноактриса', | ||
| 'балерина', | ||
| 'пианистка', | ||
| 'критик', | ||
| 'танцор', | ||
| 'концертмейстер', | ||
| 'симфонист', | ||
| 'сатирик', | ||
| 'аранжировщик', | ||
| 'саксофонист', | ||
| 'юморист', | ||
| 'шансонье', | ||
| 'гастролер', | ||
| 'виолончелист', | ||
| 'постановщик', | ||
| 'кинематографист', | ||
| 'сценограф', | ||
| 'джазмен', | ||
| 'музыковед', | ||
| 'киноартист', | ||
| 'педагог', | ||
| 'хормейстер', | ||
| 'беллетрист', | ||
| 'примадонна', | ||
| 'инструменталист', | ||
| 'альтист', | ||
| 'шоумен', | ||
| 'виртуоз', | ||
| 'пародист', | ||
| 'декоратор', | ||
| 'модельер', | ||
| 'очеркист', | ||
| 'оперетта', | ||
| 'контрабасист', | ||
| 'карикатурист', | ||
| 'дуэт', | ||
| 'монтажер', | ||
| 'живописец', | ||
| 'скульптор', | ||
| 'режиссура', | ||
| 'архитектор', | ||
| 'антрепренер', | ||
| 'импрессарио', | ||
| 'прозаик', | ||
| 'труппа', | ||
| 'трагик', | ||
| 'клоун', | ||
| 'солистка', | ||
| 'либреттист', | ||
| 'литературовед', | ||
| 'портретист', | ||
| 'гример', | ||
| 'художник', | ||
| 'импровизатор', | ||
| 'декламаторша', | ||
| 'телеведущий', | ||
| 'импресарио', | ||
| 'мастер', | ||
| 'аккомпаниатор', | ||
| 'шахматист', | ||
| 'иллюзионист', | ||
| 'эстрадник', | ||
| 'эстрада', | ||
| 'спортсмен', | ||
| 'дизайнер', | ||
| 'кинокритик', | ||
| 'публицист', | ||
| 'чтец', | ||
| 'конферансье', | ||
| 'студиец', | ||
| 'коверный', | ||
| 'куплетист', | ||
| 'знаменитость', | ||
| 'ученый', | ||
| 'балет', | ||
| 'искусствовед', | ||
| 'гитарист', | ||
| 'доктор', | ||
| 'академик', | ||
| 'судья', | ||
| 'юрист', | ||
| 'представитель', | ||
| 'директор', | ||
| 'прокурор', | ||
| 'отец', | ||
| 'мать', | ||
| 'мама', | ||
| 'папа', | ||
| 'брат', | ||
| 'сестра', | ||
| 'тёща', | ||
| 'тесть', | ||
| 'дедушка', | ||
| 'бабушка', | ||
| 'жена', | ||
| 'муж', | ||
| 'дочь', | ||
| 'сын', | ||
| 'мистер', | ||
| 'миссис', | ||
| 'господин', | ||
| 'госпожа', | ||
| 'пан', | ||
| 'пани', | ||
| 'сэр', | ||
| 'мисс', | ||
| 'боксер', | ||
| 'боец', | ||
| 'атлет', | ||
| 'футболист', | ||
| 'баскетболист', | ||
| 'агроном', | ||
| 'сопрезидент', | ||
| 'экс-президент', | ||
| ]) | ||
| GENT = gram('gent') | ||
| WHERE = or_( | ||
| rule(GENT), | ||
| rule(GENT, GENT), | ||
| rule(GENT, GENT, GENT), | ||
| rule(GENT, GENT, GENT, GENT), | ||
| rule(GENT, GENT, GENT, GENT, GENT), | ||
| ) | ||
| POSITION = or_( | ||
| POSITION, | ||
| rule(POSITION, WHERE) | ||
| ).interpretation( | ||
| Person.position | ||
| ) | ||
| NAME = NAME.interpretation( | ||
| Person.name | ||
| ) | ||
| SIMPLE_NAME = SIMPLE_NAME.interpretation( | ||
| Person.name | ||
| ) | ||
| POSITION_NAME = rule( | ||
| POSITION, | ||
| SIMPLE_NAME | ||
| ) | ||
| PERSON = or_( | ||
| POSITION_NAME, | ||
| NAME | ||
| ).interpretation( | ||
| Person | ||
| ) |
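The `POSITION_NAME` rule above pairs a position word with a capitalized name. A crude regex version, for illustration only, using a handful of positions from the pipeline (`POSITIONS` and `PERSON_RE` are hypothetical names):

```python
import re

# Illustration only: a regex caricature of the POSITION_NAME rule above.
POSITIONS = '|'.join(['президент', 'министр', 'губернатор', 'мэр'])
PERSON_RE = re.compile(rf'\b({POSITIONS})\s+([А-ЯЁ][а-яё]+)')

m = PERSON_RE.search('мэр Собянин выступил')
assert m.groups() == ('мэр', 'Собянин')
```

Unlike the yargy grammar, this sketch has no morphology: it will not match inflected forms such as «мэра Собянина», which is exactly what `morph_pipeline` and `gram` predicates handle.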
| # coding: utf-8 | ||
| from __future__ import unicode_literals, print_function | ||
| TABLE = [ | ||
| ('<', '&lt;'), | ||
| ('>', '&gt;'), | ||
| ] | ||
| def escape(text): | ||
| for char, code in TABLE: | ||
| text = text.replace(char, code) | ||
| return text | ||
| def format_markup_html(text, spans): | ||
| spans = sorted(spans) | ||
| previous = 0 | ||
| for span in spans: | ||
| start, stop = span | ||
| yield escape(text[previous:start]) | ||
| yield '<mark>' | ||
| yield escape(text[start:stop]) | ||
| yield '</mark>' | ||
| previous = stop | ||
| yield escape(text[previous:]) | ||
| def format_markup_css(text, spans): | ||
| yield '<style>' | ||
| yield """ | ||
| .markup { | ||
| white-space: pre-wrap; | ||
| } | ||
| .markup > mark { | ||
| padding: 0.15em; | ||
| border-radius: 0.25em; | ||
| border: 1px solid #fdf07c; | ||
| background: #ffffc2; | ||
| } | ||
| """ | ||
| yield '</style>' | ||
| yield '<div class="markup tex2jax_ignore">' | ||
| yield ''.join(format_markup_html(text, spans)) | ||
| yield '</div>' | ||
| def show_markup_notebook(text, spans): | ||
| from IPython.display import HTML, display | ||
| html = ''.join(format_markup_css(text, spans)) | ||
| display(HTML(html)) | ||
| def format_markup(text, spans): | ||
| spans = sorted(spans) | ||
| previous = 0 | ||
| for span in spans: | ||
| start, stop = span | ||
| yield text[previous:start] | ||
| yield '[[' | ||
| yield text[start:stop] | ||
| yield ']]' | ||
| previous = stop | ||
| yield text[previous:] | ||
| def show_markup(text, spans): | ||
| print(''.join(format_markup(text, spans))) | ||
| def format_json(data): | ||
| import json | ||
| return json.dumps(data, indent=2, ensure_ascii=False) | ||
| def show_json(data): | ||
| print(format_json(data)) |
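A self-contained rerun of the `format_markup` logic above shows what `show_markup` prints: spans are `(start, stop)` pairs, and matched fragments get wrapped in `[[ ]]`:

```python
# Same logic as format_markup above, repeated here so the demo is standalone.
def format_markup(text, spans):
    previous = 0
    for start, stop in sorted(spans):
        yield text[previous:start]
        yield '[['
        yield text[start:stop]
        yield ']]'
        previous = stop
    yield text[previous:]

marked = ''.join(format_markup('Москва и Питер', [(0, 6), (9, 14)]))
assert marked == '[[Москва]] и [[Питер]]'
```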
```python
# natasha/preprocess.py
# coding: utf-8
from __future__ import unicode_literals


def make_translation_table(source, target):
    assert len(source) == len(target)
    return {
        ord(a): ord(b)
        for a, b in zip(source, target)
    }


DASHES = make_translation_table(
    '‑–—−',
    '----'
)


def normalize_text(text):
    return text.translate(DASHES)
```
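`normalize_text` folds several Unicode dash variants (non-breaking hyphen, en dash, em dash, minus sign) into a plain ASCII hyphen via `str.translate`, which expects a dict of codepoint-to-codepoint mappings. A quick self-contained check (the sample strings are my own):

```python
def make_translation_table(source, target):
    assert len(source) == len(target)
    return {ord(a): ord(b) for a, b in zip(source, target)}

# U+2011, U+2013, U+2014, U+2212 -> ASCII '-'
DASHES = make_translation_table('‑–—−', '----')

def normalize_text(text):
    return text.translate(DASHES)

print(normalize_text('1941–1945, Москва — столица'))
# 1941-1945, Москва - столица
```

Building the table with `ord()` once up front keeps normalization a single C-level pass over the string, rather than chained `str.replace` calls.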
```python
# natasha/tests/test_address.py
# coding: utf-8
from __future__ import unicode_literals

import pytest

from natasha import AddressExtractor
from natasha.grammars.address import (
    Address, Index, Country,
    Region, Settlement,
    Street, Building, Room
)


@pytest.fixture(scope='module')
def extractor():
    return AddressExtractor()


tests = [
    [
        'Россия, Вологодская обл. г. Череповец, пр.Победы 93 б',
        Address(parts=[
            Country(name='Россия'),
            Region(name='Вологодская', type='область'),
            Settlement(name='Череповец', type='город'),
            Street(name='Победы', type='проспект'),
            Building(number='93 б')
        ])
    ],
    [
        '692909, РФ, Приморский край, г. Находка, ул. Добролюбова, 18',
        Address(parts=[
            Index(value='692909'),
            Country(name='РФ'),
            Region(name='Приморский', type='край'),
            Settlement(name='Находка', type='город'),
            Street(name='Добролюбова', type='улица'),
            Building(number='18', type=None)
        ])
    ],
    [
        'д. Федоровка, ул. Дружбы, 13',
        Address(parts=[
            Settlement(name='Федоровка', type='деревня'),
            Street(name='Дружбы', type='улица'),
            Building(number='13', type=None)
        ])
    ],
    [
        'Россия, 129110, г.Москва, Олимпийский проспект, 22',
        Address(parts=[
            Country(name='Россия'),
            Index(value='129110'),
            Settlement(name='Москва', type='город'),
            Street(name='Олимпийский', type='проспект'),
            Building(number='22', type=None)
        ])
    ],
    [
        'г. Санкт-Петербург, Красногвардейский пер., д. 15',
        Address(parts=[
            Settlement(name='Санкт-Петербург', type='город'),
            Street(name='Красногвардейский', type='переулок'),
            Building(number='15', type='дом')
        ])
    ],
    [
        'Республика Карелия,г.Петрозаводск,ул.Маршала Мерецкова, д.8 Б,офис 4',
        Address(parts=[
            Region(name='Карелия', type='республика'),
            Settlement(name='Петрозаводск', type='город'),
            Street(name='Маршала Мерецкова', type='улица'),
            Building(number='8 Б', type='дом'),
            Room(number='4', type='офис')
        ])
    ],
    [
        '628000, ХМАО-Югра, г.Ханты-Мансийск, ул. Ледовая , д.19',
        Address(parts=[
            Index(value='628000'),
            Region(name='ХМАО-Югра', type=None),
            Settlement(name='Ханты-Мансийск', type='город'),
            Street(name='Ледовая', type='улица'),
            Building(number='19', type='дом')
        ])
    ],
    [
        'ХМАО г.Нижневартовск пер.Ягельный 17',
        Address(parts=[
            Region(name='ХМАО', type=None),
            Settlement(name='Нижневартовск', type='город'),
            Street(name='Ягельный', type='переулок'),
            Building(number='17', type=None)
        ])
    ],
    [
        'Белгородская обл, пгт Борисовка,ул. Рудого д.160',
        Address(parts=[
            Region(name='Белгородская', type='область'),
            Settlement(name='Борисовка', type='посёлок'),
            Street(name='Рудого', type='улица'),
            Building(number='160', type='дом')
        ])
    ],
    [
        'Самарская область, п.г.т. Алексеевка, ул. Ульяновская д. 21',
        Address(parts=[
            Region(name='Самарская', type='область'),
            Settlement(name='Алексеевка', type='посёлок'),
            Street(name='Ульяновская', type='улица'),
            Building(number='21', type='дом')
        ])
    ],
    [
        'Мурманская обл поселок городского типа Молочный, ул.Гальченко д.11',
        Address(parts=[
            Region(name='Мурманская', type='область'),
            Settlement(name='Молочный', type='посёлок'),
            Street(name='Гальченко', type='улица'),
            Building(number='11', type='дом')
        ])
    ],
    [
        'ул. Народного Ополчения д. 9к.3',
        Address(parts=[
            Street(name='Народного Ополчения', type='улица'),
            Building(number='9к', type='дом')
        ])
    ],
    [
        'ул. Б. Пироговская, д.37/430',
        Address(parts=[
            Street(name='Б. Пироговская', type='улица'),
            Building(number='37/430', type='дом')
        ])
    ],
    [
        'г. Таганрог, ул. Шило, 247/1',
        Address(parts=[
            Settlement(name='Таганрог', type='город'),
            Street(name='Шило', type='улица'),
            Building(number='247/1', type=None)
        ])
    ]
]


@pytest.mark.parametrize('test', tests)
def test_extractor(extractor, test):
    text = test[0]
    etalon = test[1:]
    guess = [_.fact for _ in extractor(text)]
    assert guess == etalon
```
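Every extractor test module here follows the same convention: each entry in `tests` is a list whose first element is the input text and whose remaining elements are the expected facts, so `test[0]` / `test[1:]` unpacks it and the comparison handles zero, one, or many matches uniformly. A minimal sketch of that pattern with a stand-in extractor (the extractor and data are illustrative, not from natasha):

```python
tests = [
    ['a b', 'A', 'B'],   # input text, then one expected fact per match
    ['c', 'C'],
]


def fake_extractor(text):
    # Stand-in for AddressExtractor: "extracts" the uppercased tokens.
    return [token.upper() for token in text.split()]


for test in tests:
    text = test[0]      # the input line
    etalon = test[1:]   # everything after it is the expected output
    guess = fake_extractor(text)
    assert guess == etalon
```

In the real modules the loop is replaced by `@pytest.mark.parametrize('test', tests)`, which turns each entry into an independently reported test case.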
```python
# natasha/tests/test_location.py
# coding: utf-8
from __future__ import unicode_literals

import pytest

from natasha import LocationExtractor
from natasha.grammars.location import Location


@pytest.fixture(scope='module')
def extractor():
    return LocationExtractor()


tests = [
    [
        'в Ярославской области',
        Location(name='ярославская область')
    ],
    [
        'около красноярского края',
        Location(name='красноярский край')
    ],
    [
        'события в северо-кавказском федеральном округе',
        Location(name='северо-кавказский федеральный округ')
    ],
    [
        'Северо-западный ФО',
        Location(name='северо-западный фо')
    ],
    # TODO: fix problems with hyphens
    # [
    #     'Ямало-Ненецкий автономный округ',
    #     Location(name='ямало-ненецкий автономный округ'),
    # ],
    [
        'В Чеченской республике на день рождения ...',
        Location(name='чеченская республика'),
    ],
    [
        'Донецкая народная республика провозгласила ...',
        Location(name='донецкая народная республика'),
    ],
    [
        'Российская Федерация',
        Location(name='российская федерация'),
    ],
    [
        'в Соединенных Штатах Америки',
        Location(name='соединённый штат америка'),
    ],
    [
        'речь шла о Обьединенных Арабских Эмиратах',
        Location(name='обьединённый арабский эмират'),
    ],
    [
        'Соединённые Штаты',
        Location(name='соединённый штат'),
    ],
    [
        'в штате Вашингтон',
        Location(name='штат вашингтон'),
    ],
    [
        'возле штата Южная Каролина',
        Location(name='штат южная каролина'),
    ],
    [
        'графство Корнуолл',
        Location(name='графство корнуолл'),
    ],
    [
        'город Москва',
        Location(name='город москва'),
    ],
    [
        'Москва',
        Location(name='москва'),
    ],
    [
        'город Ленинград',
        Location(name='город ленинград'),
    ],
    [
        'деревня Верхние Петушки',
        Location(name='деревня верхний петушок'),
    ],
    [
        'село Новое Кукуево',
        Location(name='севшее новое кукуево'),
    ],
]


@pytest.mark.parametrize('test', tests)
def test_extractor(extractor, test):
    text = test[0]
    etalon = test[1:]
    guess = [_.fact for _ in extractor(text)]
    assert guess == etalon
```
```python
# natasha/tests/test_money_range.py
# coding: utf-8
from __future__ import unicode_literals

import pytest

from natasha import MoneyRangeExtractor


@pytest.fixture(scope='module')
def extractor():
    return MoneyRangeExtractor()


tests = [
    [
        '20000-30000 рублей',
        '20000 RUB-30000 RUB'
    ],
    [
        'от 80 тысяч до 2 миллионов рублей',
        '80000 RUB-2000000 RUB'
    ],
]


@pytest.mark.parametrize('test', tests)
def test_extractor(extractor, test):
    line, etalon = test
    matches = list(extractor(line))
    assert len(matches) == 1
    fact = matches[0].fact
    guess = str(fact.normalized)
    assert guess == etalon
```
```python
# natasha/tests/test_money_rate.py
# coding: utf-8
from __future__ import unicode_literals

import pytest

from natasha import MoneyRateExtractor


@pytest.fixture(scope='module')
def extractor():
    return MoneyRateExtractor()


tests = [
    ['2000 руб. / сутки', '2000 RUB/DAY'],
    ['2000 руб./смена', '2000 RUB/SHIFT'],
    ['2000 руб. в час', '2000 RUB/HOUR'],
]


@pytest.mark.parametrize('test', tests)
def test_extractor(extractor, test):
    line, etalon = test
    matches = list(extractor(line))
    assert len(matches) == 1
    fact = matches[0].fact
    guess = str(fact.normalized)
    assert guess == etalon
```
```python
# natasha/tests/test_organisation.py
# coding: utf-8
from __future__ import unicode_literals

import pytest

from natasha import OrganisationExtractor
from natasha.grammars.organisation import Organisation


@pytest.fixture(scope='module')
def extractor():
    return OrganisationExtractor()


tests = [
    [
        'ПАО «Газпром»',
        Organisation(name='ПАО «Газпром»'),
    ],
    [
        'публичное акционерное общество "Газпром"',
        Organisation(name='публичное акционерное общество "Газпром"'),
    ],
    [
        'историческое общество "Мемориал"',
        Organisation(name='историческое общество "Мемориал"'),
    ],
    [
        'коммерческое производственное объединение "Вектор"',
        Organisation(name='коммерческое производственное объединение "Вектор"')
    ],
    [
        'Международное историко-просветительское, правозащитное'
        ' и благотворительное общество «Мемориал»',
        Organisation(
            name='Международное историко-просветительское, '
            'правозащитное и благотворительное общество «Мемориал»'
        ),
    ],
    [
        'правозащитный центр «Мемориал»',
        Organisation(name='правозащитный центр «Мемориал»'),
    ],
    [
        'Кировский завод',
        Organisation(name='Кировский завод'),
    ],
    [
        # TODO: normalization
        'Кировский Механический Завод имени Ленина',
        Organisation(name='Кировский Механический Завод имени Ленина'),
    ],
    [
        'Московский государственный университет имени М.В.Ломоносова',
        Organisation(
            name='Московский государственный университет имени М.В.Ломоносова'
        ),
    ],
    [
        # TODO: genitive group ("петра великого" in genitive case)
        'Санкт-Петербургский политехнический университет Петра Великого',
        Organisation(name='Санкт-Петербургский политехнический университет'),
    ],
    [
        'Научно-исследовательский институт онкологии им. Н.Н. Петрова',
        Organisation(
            name='Научно-исследовательский институт онкологии им. Н.Н. Петрова'
        ),
    ],
    # # TODO:
    # [
    #     'НАЦИОНАЛЬНЫЙ МЕДИЦИНСКИЙ ИССЛЕДОВАТЕЛЬСКИЙ ЦЕНТР '
    #     'имени академика Н.Н. Петрова',
    #     Organisation(),
    # ],
    # [
    #     'Ленинградский институт методов и техники управления',
    #     Organisation(
    #         name='Ленинградский институт методов и техники управления'
    #     ),
    # ],
    [
        'агентство Reuters',
        Organisation(name='агентство Reuters'),
    ],
    [
        'компания Rambler&Co',
        Organisation(name='компания Rambler&Co')
    ],
    [
        'компания Standard Oil Co. Inc.',
        Organisation(name='компания Standard Oil Co. Inc')
    ],
    [
        'ООН',
        Organisation(name='ООН'),
    ],
    [
        'МИД России',
        Organisation(name='МИД России'),
    ],
    [
        'МВД Приморского района Петербурга',
        Organisation(name='МВД Приморского района Петербурга'),
    ],
    [
        'Авиакомпания "Аэрофлот"',
        Organisation(name='Авиакомпания "Аэрофлот"'),
    ],
]


@pytest.mark.parametrize('test', tests)
def test_extractor(extractor, test):
    text = test[0]
    etalon = test[1:]
    guess = [_.fact for _ in extractor(text)]
    assert guess == etalon
```
```python
# natasha/tests/test_person.py
# coding: utf-8
from __future__ import unicode_literals

import pytest

from natasha.grammars.person import Person
from natasha.grammars.name import Name
from natasha import PersonExtractor


@pytest.fixture(scope='module')
def extractor():
    return PersonExtractor()


tests = [
    [
        'президент Николя Саркози',
        Person(
            position='президент',
            name=Name(
                first='николя', last='саркози',
                middle=None, nick=None
            )
        )
    ],
    [
        'Вице-премьер правительства РФ Дмитрий Козак',
        Person(
            position='Вице-премьер правительства РФ',
            name=Name(
                first='дмитрий',
                last='козак',
                middle=None,
                nick=None
            )
        )
    ],
    # TODO: for some reason "петров" normalizes to "пётр"
    # [
    #     'академик Н.Н. Петров',
    #     Person(
    #         position='академик',
    #         name=Name(
    #             first='Н',
    #             middle='Н',
    #             last='петров',
    #         )
    #     ),
    # ],
    [
        'Вице-президент Генадий Рушайло',
        Person(
            position='Вице-президент',
            name=Name(
                first='генадий',
                last='рушайло',
            )
        ),
    ],
]


@pytest.mark.parametrize('test', tests)
def test_extractor(extractor, test):
    text = test[0]
    etalon = test[1:]
    guess = [_.fact for _ in extractor(text)]
    assert guess == etalon
```
```python
# natasha/tokenizer.py
# coding: utf-8
from __future__ import unicode_literals

from yargy.tokenizer import MorphTokenizer


TOKENIZER = MorphTokenizer().remove_types('EOL')
```
```python
# natasha/utils.py
from yargy.utils import Record
```
natasha-0.10.0.dist-info/pbr.json:

```json
{"is_release": false, "git_version": "14c24ac"}
```
natasha-0.10.0.dist-info/RECORD:

```
natasha/__init__.py,sha256=cfjk_EbcrDP6608vw3ylRmaWk3MHhZoRcMVgw1CJQZs,281
natasha/crf.py,sha256=W1cHk7bTxh51-uqXa7QAUKfA9H0fDixpdhjPC_1hJ_c,5385
natasha/extractors.py,sha256=LaydnaImswtZcbLzKJnRl1bgNcfprUdPgrajvSy-GaA,3629
natasha/markup.py,sha256=CSGFXQkqPlSMdglwUu0snLxjWGk05RWaCpDawf42lvQ,1606
natasha/preprocess.py,sha256=OHg2pEmnpCZbR4gatJH5QEOJn68en1Kvime1M0pZHEk,352
natasha/tokenizer.py,sha256=6pCwYq3zDcQGfcs9UMhr_RKy2JDW6eUeMOg3lGor5Xc,151
natasha/utils.py,sha256=GI_aoriY8zXmdb9GElICgzmpnZeNTljK8jzIz2sproc,33
natasha/data/__init__.py,sha256=CR0FSSp_ffPjrwD7Vx2yvymv3t5wvOzl96OlGMvzDbA,838
natasha/data/__init__.pyc,sha256=z3QpCjdVQ-KZGUGOlUKS6YCTOObr0_fSZdy8eU0JhnM,927
natasha/data/dicts/first.txt,sha256=9c9y2AS-KtkCiN3CenzaE-cw0UNpHKXMvFhqORpTkeA,99734
natasha/data/dicts/last.txt,sha256=cuaeP1ERb1p_U_9UU3e-GK8hatFO0wMtENl6guAT81M,2954411
natasha/data/dicts/maybe_first.txt,sha256=pzMoONertxFZ5zIBpvZPfdJxtRw_0YYE2Dez07d4YWM,1739
natasha/data/models/name.crf.json,sha256=9TdqzdkFcLTGxgEkB5dd2D9qDnh8gBJ29pL9LL3UkVo,23392
natasha/data/models/street.crf.json,sha256=IZo6fxuXuuAUHlawBhWp9X8miXULuTiM5KheUgW-O1E,873649
natasha/dsl/__init__.py,sha256=LdBmdoFTuUld2Scs_myg-h5eGS9AmPlf9jWbI9CI7WA,112
natasha/dsl/money.py,sha256=RzIua6mdfEFNmoEieATJCy3mxepH708-WP61jjGFcBQ,1476
natasha/grammars/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
natasha/grammars/address.py,sha256=6M1H2v48o4SJlmxJQNYcS6DGlj36gxHg4jHW3Y03fck,31461
natasha/grammars/date.py,sha256=7semSOIBnKSXxBAyHS4MYXSjzflwjYmzR04IoRunqUA,2055
natasha/grammars/location.py,sha256=mSDXEAPe4n0ZEqnYwqDECQgXdgHyo3yr8DO4X8iYR4I,3146
natasha/grammars/money.py,sha256=ciCqJ0LVe186DZtXkLAyrPaSoiqo6YQjwoAT6wvWbUg,5518
natasha/grammars/name.py,sha256=hWom6lwVfAjg7llChsoNwP5hWyB1nrDCOCkNfvOK5QA,3256
natasha/grammars/organisation.py,sha256=Muqn8h6ejdwo9CC1kXd3befuZ5CsZm4Qyz8pXRZojRY,5402
natasha/grammars/person.py,sha256=5JARzMf4hy1dXVWxiQ5eKS941gSSYRxJnLbt05P0KYs,8034
natasha/tests/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
natasha/tests/test_address.py,sha256=qGFAmx0IWHTSb8-EFURD6CweNNzaTUfn5gJ-Ik_svfY,5811
natasha/tests/test_date.py,sha256=28AQUu0n4TCflln5IO4DAdnRBdZSRMU_RXb-2IaF8qE,1381
natasha/tests/test_location.py,sha256=j0urzxxuKDwF7bKgIkyZnP-IzZKBD0FYYavWQnfdZyU,3120
natasha/tests/test_money.py,sha256=rHrUCm8_E4luoMxuYNfWQoyVAFm2tlzXg6q467YYgzY,1621
natasha/tests/test_money_range.py,sha256=xpzboC7HkocAEvKYxdMM3mbnnNs3JlaGDDyKH3-a7h0,658
natasha/tests/test_money_rate.py,sha256=cCMg9xUBuTjaTTo-VoWa9pEMT4hZYeqQ-W2KNohzgGk,614
natasha/tests/test_name.py,sha256=rlk00tebEg4Q0Zm4a1TAvyUrNXWYbE2ePcOIBBmb4II,2649
natasha/tests/test_organisation.py,sha256=EwSWTdDauNxV16YZZotuvO-ELyiUM31WD4mp5LKhbFU,4502
natasha/tests/test_person.py,sha256=AJIBf78VTRVunhnCRWbg9OF_cR3JhbTiun0bRPlFSKg,1746
natasha-0.10.0.dist-info/DESCRIPTION.rst,sha256=OCTuuN6LcWulhHS3d5rfjdsQtW22n7HENFRh6jC6ego,10
natasha-0.10.0.dist-info/METADATA,sha256=zGenEgULqw1hCLDFoqe6I1BegQ20JicfPK59bm_ClF8,1315
natasha-0.10.0.dist-info/RECORD,,
natasha-0.10.0.dist-info/WHEEL,sha256=kdsN-5OJAZIiHN-iO4Rhl82KyS0bDWf4uBwMbkNafr8,110
natasha-0.10.0.dist-info/metadata.json,sha256=GCnLnqaW9mEoQEIcnsQJDefbZYJAHhILnkEAZc0_8GM,1427
natasha-0.10.0.dist-info/pbr.json,sha256=8NOa1NJbhUgXAzYOO1oueKUQ3olImM5ubQmbhQmtckM,47
natasha-0.10.0.dist-info/top_level.txt,sha256=j9DF13DQvig_I4R3PUqYWx2o11-m4gbThoWKDbkkkNQ,8
```
natasha-0.10.0.dist-info/top_level.txt:

```
natasha
```

natasha-0.10.0.dist-info/WHEEL:

```
Wheel-Version: 1.0
Generator: bdist_wheel (0.30.0)
Root-Is-Purelib: true
Tag: py2-none-any
Tag: py3-none-any
```