
Module name | |
---|---|
DependencyParser | 85.6% |
POSTagger | 98.8% |
Chunker | 93.4% |
Lemmatizer | 89.9% |
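Module name | Metric | Value |
---|---|---|
SpacyPOSTagger | Precision | 0.99250 |
SpacyPOSTagger | Recall | 0.99249 |
SpacyPOSTagger | F1-Score | 0.99249 |
EZ Detection in SpacyPOSTagger | Precision | 0.99301 |
EZ Detection in SpacyPOSTagger | Recall | 0.99297 |
EZ Detection in SpacyPOSTagger | F1-Score | 0.99298 |
SpacyChunker | Accuracy | 96.53% |
SpacyChunker | F-Measure | 95.00% |
SpacyChunker | Recall | 95.17% |
SpacyChunker | Precision | 94.83% |
SpacyDependencyParser | TOK Accuracy | 99.06 |
SpacyDependencyParser | UAS | 92.30 |
SpacyDependencyParser | LAS | 89.15 |
SpacyDependencyParser | SENT Precision | 98.84 |
SpacyDependencyParser | SENT Recall | 99.38 |
SpacyDependencyParser | SENT F-Measure | 99.11 |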
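The F-Measure rows in the table above are consistent with the balanced F1 score, the harmonic mean of precision and recall. The source does not say how the scores were computed, so the following is only a sanity-check sketch under that assumption; `f_measure` is a helper name introduced here for illustration and is not part of Hazm.

```python
def f_measure(precision, recall):
    """Balanced F1: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid dividing by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall values copied from the table (in percent).
chunker_f = f_measure(94.83, 95.17)   # SpacyChunker
sentence_f = f_measure(98.84, 99.38)  # SpacyDependencyParser, SENT rows

print(round(chunker_f, 2), round(sentence_f, 2))
```

Rounded to two decimals, both values agree with the reported 95.00 and 99.11.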
Hazm is a Python library for natural language processing of Persian text. You can use it to normalize text, tokenize sentences and words, lemmatize words, assign part-of-speech tags, identify dependency relations, create word and sentence embeddings, and read popular Persian corpora.
To install the latest version of Hazm, run the following command in your terminal:
pip install hazm
Alternatively, you can install the latest update from GitHub (this version may be unstable and buggy):
pip install git+https://github.com/roshan-research/hazm.git
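To confirm which version ended up installed, one option is to query the package metadata from the Python standard library. This is a minimal sketch; `installed_version` is a hypothetical helper name, not something Hazm provides.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed Hazm version, or None if the install did not succeed.
print(installed_version("hazm"))
```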
Finally, if you want to use our pretrained models, you can download them from the links below:
Module name | Size |
---|---|
Download WordEmbedding | ~ 5 GB |
Download SentEmbedding | ~ 1 GB |
Download POSTagger | ~ 18 MB |
Download DependencyParser | ~ 15 MB |
Download Chunker | ~ 4 MB |
Download spacy_pos_tagger_parsbertpostagger | ~ 630 MB |
Download spacy_pos_tagger_parsbertpostagger95 | ~ 630 MB |
Download spacy_chunker_uncased_bert | ~ 650 MB |
Download spacy_chunker_parsbert | ~ 630 MB |
Download spacy_dependency_parser | ~ 630 MB |
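The usage examples load models by file name (for instance `pos_tagger.model` and `chunker.model`), so it can help to check that the downloaded files are where you expect before constructing a tagger or chunker. The sketch below assumes the files were extracted into a local directory called `resources`; that directory name and the `missing_models` helper are illustrative assumptions, not part of Hazm.

```python
from pathlib import Path


def missing_models(model_dir, filenames):
    """Return the model file names that are not present in model_dir."""
    base = Path(model_dir)
    return [name for name in filenames if not (base / name).is_file()]


# File names as used in the usage examples; "resources" is a hypothetical location.
needed = ["pos_tagger.model", "chunker.model"]
missing = missing_models("resources", needed)
if missing:
    print("Download these models first:", ", ".join(missing))
```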
>>> from hazm import *
>>> normalizer = Normalizer()
>>> normalizer.normalize('اصلاح نويسه ها و استفاده از نیم‌فاصله پردازش را آسان مي كند')
'اصلاح نویسه‌ها و استفاده از نیم‌فاصله پردازش را آسان می‌کند'
>>> sent_tokenize('ما هم برای وصل کردن آمدیم! ولی برای پردازش، جدا بهتر نیست؟')
['ما هم برای وصل کردن آمدیم!', 'ولی برای پردازش، جدا بهتر نیست؟']
>>> word_tokenize('ولی برای پردازش، جدا بهتر نیست؟')
['ولی', 'برای', 'پردازش', '،', 'جدا', 'بهتر', 'نیست', '؟']
>>> stemmer = Stemmer()
>>> stemmer.stem('کتاب‌ها')
'کتاب'
>>> lemmatizer = Lemmatizer()
>>> lemmatizer.lemmatize('می‌روم')
'رفت#رو'
>>> tagger = POSTagger(model='pos_tagger.model')
>>> tagger.tag(word_tokenize('ما بسیار کتاب می‌خوانیم'))
[('ما', 'PRO'), ('بسیار', 'ADV'), ('کتاب', 'N'), ('می‌خوانیم', 'V')]
>>> spacy_posTagger = SpacyPOSTagger(model_path = 'MODELPATH')
>>> spacy_posTagger.tag(tokens = ['من', 'به', 'مدرسه', 'ایران', 'رفته_بودم', '.'])
[('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN,EZ'), ('ایران', 'NOUN'), ('رفته_بودم', 'VERB'), ('.', 'PUNCT')]
>>> posTagger = POSTagger(model = 'pos_tagger.model', universal_tag = False)
>>> posTagger.tag(tokens = ['من', 'به', 'مدرسه', 'ایران', 'رفته_بودم', '.'])
[('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN'), ('ایران', 'NOUN'), ('رفته_بودم', 'VERB'), ('.', 'PUNCT')]
>>> chunker = Chunker(model='chunker.model')
>>> tagged = tagger.tag(word_tokenize('کتاب خواندن را دوست داریم'))
>>> tree2brackets(chunker.parse(tagged))
'[کتاب خواندن NP] [را POSTP] [دوست داریم VP]'
>>> spacy_chunker = SpacyChunker(model_path = 'model_path')
>>> tree = spacy_chunker.parse(sentence = [('نامه', 'NOUN,EZ'), ('ایشان', 'PRON'), ('را', 'ADP'), ('دریافت', 'NOUN'), ('داشتم', 'VERB'), ('.', 'PUNCT')])
>>> print(tree)
(S
(NP نامه/NOUN,EZ ایشان/PRON)
(POSTP را/ADP)
(VP دریافت/NOUN داشتم/VERB)
./PUNCT)
>>> word_embedding = WordEmbedding(model_type = 'fasttext', model_path = 'word2vec.bin')
>>> word_embedding.doesnt_match(['سلام' ,'درود' ,'خداحافظ' ,'پنجره'])
'پنجره'
>>> word_embedding.doesnt_match(['ساعت' ,'پلنگ' ,'شیر'])
'ساعت'
>>> parser = DependencyParser(tagger=tagger, lemmatizer=lemmatizer)
>>> parser.parse(word_tokenize('زنگ‌ها برای که به صدا درمی‌آید؟'))
<DependencyGraph with 8 nodes>
>>> spacy_parser = SpacyDependencyParser(tagger=tagger, lemmatizer=lemmatizer)
>>> spacy_parser.parse_sents([word_tokenize('زنگ‌ها برای که به صدا درمی‌آید؟')])
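The outputs shown above are plain Python strings and lists of tuples, so they can be post-processed without Hazm. The sketch below works on values copied from the examples. It assumes that a verb lemma such as `'رفت#رو'` joins the past and present stems with `#`; `split_lemma` and `tag_counts` are hypothetical helper names introduced here.

```python
from collections import Counter


def split_lemma(lemma):
    """Split a 'past#present' verb lemma into two stems; other lemmas pass through."""
    past, sep, present = lemma.partition('#')
    return (past, present) if sep else (lemma, None)


def tag_counts(tagged):
    """Count how often each POS tag occurs in a list of (token, tag) pairs."""
    return Counter(tag for _, tag in tagged)


# Values copied from the lemmatizer and POS tagger examples.
print(split_lemma('رفت#رو'))
tagged = [('ما', 'PRO'), ('بسیار', 'ADV'), ('کتاب', 'N'), ('می‌خوانیم', 'V')]
print(tag_counts(tagged))
```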
Visit https://roshan-ai.ir/hazm/docs to view the full documentation.
Disclaimer: These ports are not developed or maintained by Roshan. They may not have the same functionality or quality as the original Hazm.
We welcome and appreciate any contributions to this repo, such as bug reports, feature requests, code improvements, documentation updates, etc. Please follow the Contribution guideline when contributing. You can open an issue, fork the repo, write your code, create a pull request and wait for a review and feedback. Thank you for your interest and support in this repo!