
NLPashto is a Python suite for Pashto Natural Language Processing. It provides tools for fundamental text processing tasks, such as text cleaning, tokenization, and chunking (word segmentation). Additionally, it includes state-of-the-art models for POS tagging and sentiment analysis (specifically offensive language detection).
To use NLPashto, you will need:
Install NLPashto from PyPI:
pip install nlpashto
This module contains basic text cleaning utilities:
from nlpashto import Cleaner
cleaner = Cleaner()
noisy_txt = "په ژوند کی علم 📚🖖 , 🖊 او پيسي 💵. 💸💲 دواړه حاصل کړه پوهان به دی علم ته درناوی ولري اوناپوهان به دي پیسو ته... https://t.co/xIiEXFg"
cleaned_text = cleaner.clean(noisy_txt)
print(cleaned_text)
# Output: په ژوند کی علم , او پيسي دواړه حاصل کړه پوهان به دی علم ته درناوی ولري او ناپوهان به دي پیسو ته
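One of the options of the clean method, normalize_nums, converts Arabic numerals (1, 2, 3, ...) to Pashto numerals. As a rough illustration of what such a mapping involves, the sketch below uses a plain translation table. It is not NLPashto's own implementation; the helper name to_pashto_digits is hypothetical, and it assumes Pashto numerals are the Extended Arabic-Indic digits U+06F0 to U+06F9.

```python
# Illustrative only: this is not NLPashto's internal code.
# Assumption: Pashto numerals are the Extended Arabic-Indic digits U+06F0..U+06F9.
WESTERN_DIGITS = "0123456789"
PASHTO_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))

# str.maketrans builds a character-to-character translation table.
DIGIT_TABLE = str.maketrans(WESTERN_DIGITS, PASHTO_DIGITS)


def to_pashto_digits(text: str) -> str:
    """Replace every Western digit in text with its Pashto counterpart."""
    return text.translate(DIGIT_TABLE)


print(to_pashto_digits("2023"))
```

Characters that are not digits pass through unchanged, so the helper can be applied to whole sentences.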
Parameters of the clean method:
text (str or list): Input noisy text to clean.
split_into_sentences (bool): Split text into sentences.
remove_emojis (bool): Remove emojis.
normalize_nums (bool): Normalize Arabic numerals (1, 2, 3, ...) to Pashto numerals (۱، ۲، ۳، ...).
remove_puncs (bool): Remove punctuation.
remove_special_chars (bool): Remove special characters.
special_chars (list): List of special characters to keep.
This module corrects space omission and insertion errors. It removes extra spaces and inserts necessary ones:
from nlpashto import Tokenizer
tokenizer = Tokenizer()
noisy_txt = 'جلال اباد ښار کې هره ورځ لس ګونه کسانپهډلهییزهتوګهدنشهيي توکو کارولو ته ا د ا م ه و رک وي'
tokenized_text = tokenizer.tokenize(noisy_txt)
print(tokenized_text)
# Output: [['جلال', 'اباد', 'ښار', 'کې', 'هره', 'ورځ', 'لسګونه', 'کسان', 'په', 'ډله', 'ییزه', 'توګه', 'د', 'نشه', 'يي', 'توکو', 'کارولو', 'ته', 'ادامه', 'ورکوي']]
To retrieve full compound words instead of space-delimited tokens, use the Segmenter:
from nlpashto import Segmenter
segmenter = Segmenter()
segmented_text = segmenter.segment(tokenized_text)
print(segmented_text)
# Output: [['جلال اباد', 'ښار', 'کې', 'هره', 'ورځ', 'لسګونه', 'کسان', 'په', 'ډله ییزه', 'توګه', 'د', 'نشه يي', 'توکو', 'کارولو', 'ته', 'ادامه', 'ورکوي']]
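To see which tokens the Segmenter merged, one option is to pick out the segments that contain an internal space. The sketch below works on the output shown above, copied in as a literal so it runs without the library; compound_words is a hypothetical helper, not part of NLPashto.

```python
# Segmenter output copied from the example above (one inner list per sentence).
segmented_text = [['جلال اباد', 'ښار', 'کې', 'هره', 'ورځ', 'لسګونه', 'کسان', 'په', 'ډله ییزه', 'توګه', 'د', 'نشه يي', 'توکو', 'کارولو', 'ته', 'ادامه', 'ورکوي']]


def compound_words(sentences):
    """Return the segments that contain a space, one list per sentence."""
    return [[seg for seg in sentence if " " in seg] for sentence in sentences]


compounds = compound_words(segmented_text)
print(compounds)
```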
When processing multiple sentences, specify the batch size:
segmenter = Segmenter(batch_size=32) # Default is 16
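The README does not say how the Segmenter batches its input internally. As a general illustration of what a batch size controls, the sketch below splits a list of sentences into consecutive chunks; the batches helper and the placeholder sentences are hypothetical.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], batch_size: int = 16) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 40 placeholder sentences split with a batch size of 16.
sentences = [f"sentence {i}" for i in range(40)]
sizes = [len(batch) for batch in batches(sentences, batch_size=16)]
print(sizes)  # [16, 16, 8]
```

A larger batch size means fewer model calls at the cost of more memory per call, which is the usual trade-off when tuning this kind of parameter.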
For a detailed explanation of the POS tagger, refer to the POS tagging paper cited below. To tag segmented text:
from nlpashto import POSTagger
pos_tagger = POSTagger()
pos_tagged = pos_tagger.tag(segmented_text)
print(pos_tagged)
# Output: [[('جلال اباد', 'NNP'), ('ښار', 'NNM'), ('کې', 'PT'), ('هره', 'JJ'), ('ورځ', 'NNF'), ...]]
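The tagger returns one list of (word, tag) pairs per sentence, so ordinary list handling is enough for post-processing. The sketch below uses only the five pairs visible in the output above. Grouping tags by the prefix "NN" is an assumption made for illustration (the tagset itself is defined in the POS tagging paper), and words_with_tag_prefix is a hypothetical helper.

```python
from collections import Counter

# The first five (word, tag) pairs from the example output above.
pos_tagged = [[('جلال اباد', 'NNP'), ('ښار', 'NNM'), ('کې', 'PT'), ('هره', 'JJ'), ('ورځ', 'NNF')]]


def words_with_tag_prefix(tagged_sentences, prefix):
    """Keep the words whose tag starts with prefix, one list per sentence."""
    return [
        [word for word, tag in sentence if tag.startswith(prefix)]
        for sentence in tagged_sentences
    ]


# Count how often each tag occurs across all sentences.
tag_counts = Counter(tag for sentence in pos_tagged for _, tag in sentence)

# Assumption: tags beginning with "NN" belong to one family.
nn_words = words_with_tag_prefix(pos_tagged, "NN")

print(tag_counts)
print(nn_words)
```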
Detect offensive language using a fine-tuned PsBERT model:
from nlpashto import POLD
sentiment_analysis = POLD()
# Offensive example
offensive_text = 'مړه یو کس وی صرف ځان شرموی او یو ستا غوندے جاهل وی چې قوم او ملت شرموی'
sentiment = sentiment_analysis.predict(offensive_text)
print(sentiment)
# Output: 1
# Normal example
normal_text = 'تاسو رښتیا وایئ خور 🙏'
sentiment = sentiment_analysis.predict(normal_text)
print(sentiment)
# Output: 0
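The examples above suggest that predict returns 1 for offensive text and 0 for normal text. A small wrapper can turn those numbers into readable labels. In the sketch below, label_texts and fake_predict are hypothetical names; fake_predict is a stand-in so the code runs without downloading the model.

```python
from typing import Callable, Dict, Iterable, List

# Label meanings as shown in the examples above.
LABELS: Dict[int, str] = {0: "normal", 1: "offensive"}


def label_texts(predict: Callable[[str], int], texts: Iterable[str]) -> List[str]:
    """Run predict on each text and map the numeric result to a label name."""
    return [LABELS[predict(text)] for text in texts]


def fake_predict(text: str) -> int:
    """Stand-in for the real model: flags any text containing 'bad'."""
    return 1 if "bad" in text else 0


print(label_texts(fake_predict, ["a bad remark", "a kind remark"]))
# ['offensive', 'normal']
```

With the real model, sentiment_analysis.predict could be passed in place of fake_predict, assuming it returns a plain integer for a single string as the outputs above indicate.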
NLPashto: NLP Toolkit for Low-resource Pashto Language
I. Haq, W. Qiu, J. Guo, and P. Tang, "NLPashto: NLP Toolkit for Low-resource Pashto Language," International Journal of Advanced Computer Science and Applications, vol. 14, no. 6, pp. 1345-1352, 2023.
BibTeX
@article{haq2023nlpashto,
title={NLPashto: NLP Toolkit for Low-resource Pashto Language},
author={Ijazul Haq and Weidong Qiu and Jie Guo and Peng Tang},
journal={International Journal of Advanced Computer Science and Applications},
issn={2156-5570},
volume={14},
number={6},
pages={1345-1352},
year={2023},
doi={https://dx.doi.org/10.14569/IJACSA.2023.01406142}
}
Correction of Whitespace and Word Segmentation in Noisy Pashto Text using CRF
I. Haq, W. Qiu, J. Guo, and P. Tang, "Correction of whitespace and word segmentation in noisy Pashto text using CRF," Speech Communication, vol. 153, p. 102970, 2023.
BibTeX
@article{HAQ2023102970,
title={Correction of whitespace and word segmentation in noisy Pashto text using CRF},
journal={Speech Communication},
issn={1872-7182},
volume={153},
pages={102970},
year={2023},
doi={https://doi.org/10.1016/j.specom.2023.102970},
author={Ijazul Haq and Weidong Qiu and Jie Guo and Peng Tang}
}
POS Tagging of Low-resource Pashto Language: Annotated Corpus and Bert-based Model
I. Haq, W. Qiu, J. Guo, and P. Tang, "POS Tagging of Low-resource Pashto Language: Annotated Corpus and BERT-based Model," Preprint, 2023.
BibTeX
@article{haq2023pashto,
title={POS Tagging of Low-resource Pashto Language: Annotated Corpus and Bert-based Model},
author={Ijazul Haq and Weidong Qiu and Jie Guo and Peng Tang},
journal={Preprint},
year={2023},
doi={https://doi.org/10.21203/rs.3.rs-2712906/v1}
}
Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT
I. Haq, W. Qiu, J. Guo, and P. Tang, "Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT," PeerJ Computer Science, vol. 9, p. e1617, 2023.
BibTeX
@article{haq2023pold,
title={Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT},
author={Ijazul Haq and Weidong Qiu and Jie Guo and Peng Tang},
journal={PeerJ Computer Science},
issn={2376-5992},
volume={9},
pages={e1617},
year={2023},
doi={10.7717/peerj-cs.1617}
}
K. Ahmed, M. A. Khan, I. Haq, A. Al Mazroa, M. Syam, N. Innab, et al., "Social media’s dark secrets: A propagation, lexical and psycholinguistic oriented deep learning approach for fake news proliferation," Expert Systems with Applications, vol. 255, p. 124650, 2024.
BibTeX
@article{AHMED2024124650,
title={Social media’s dark secrets: A propagation, lexical and psycholinguistic oriented deep learning approach for fake news proliferation},
author={Kanwal Ahmed and Muhammad Asghar Khan and Ijazul Haq and Alanoud Al Mazroa and Syam M.S. and Nisreen Innab and Masoud Alajmi and Hend Khalid Alkahtani},
journal={Expert Systems with Applications},
volume={255},
pages={124650},
year={2024},
issn={0957-4174},
doi={https://doi.org/10.1016/j.eswa.2024.124650}
}