Hezar: The all-in-one AI library for Persian, supporting a wide variety of tasks and modalities!
Hezar (meaning thousand in Persian) is a multipurpose AI library built to make AI easy for the Persian community!
Hezar is a library that:
Hezar is available on PyPI and can be installed with pip (Python 3.10 and later):
pip install hezar
Hezar is a collection of models and tools, so it comes in several installation variants:
pip install hezar[all] # For a full installation
pip install hezar[nlp] # For NLP
pip install hezar[vision] # For computer vision models
pip install hezar[audio] # For audio and speech
pip install hezar[embeddings] # For word embedding models
You can also install the latest version from the source:
git clone https://github.com/hezarai/hezar.git
pip install ./hezar
Explore Hezar to learn more on the docs page or explore the key concepts:
There are many ready-to-use trained models for different tasks on the Hub!
🤗Hugging Face Hub Page: https://huggingface.co/hezarai
Let's walk you through some examples!
from hezar.models import Model
example = ["هزار، کتابخانهای کامل برای به کارگیری آسان هوش مصنوعی"]
model = Model.load("hezarai/bert-fa-sentiment-dksf")
outputs = model.predict(example)
print(outputs)
[[{'label': 'positive', 'score': 0.812910258769989}]]
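The printed result is a list with one entry per input text, and each entry is a list of label/score dictionaries. A minimal sketch of picking the best label per input, using the output above as literal data so it runs without downloading a model. The helper name `top_label` is hypothetical and not part of Hezar:

```python
# Output copied from the sentiment example: one inner list per input text.
outputs = [[{"label": "positive", "score": 0.812910258769989}]]


def top_label(prediction):
    """Return the (label, score) pair with the highest score for one input."""
    best = max(prediction, key=lambda item: item["score"])
    return best["label"], best["score"]


# One (label, score) pair per input text.
results = [top_label(prediction) for prediction in outputs]
print(results)
```

If a model returns several candidate labels per input, the same helper still returns only the highest-scoring one.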
from hezar.models import Model
pos_model = Model.load("hezarai/bert-fa-pos-lscp-500k") # Part-of-speech
ner_model = Model.load("hezarai/bert-fa-ner-arman") # Named entity recognition
inputs = ["شرکت هوش مصنوعی هزار"]
pos_outputs = pos_model.predict(inputs)
ner_outputs = ner_model.predict(inputs)
print(f"POS: {pos_outputs}")
print(f"NER: {ner_outputs}")
POS: [[{'token': 'شرکت', 'label': 'Ne'}, {'token': 'هوش', 'label': 'Ne'}, {'token': 'مصنوعی', 'label': 'AJe'}, {'token': 'هزار', 'label': 'NUM'}]]
NER: [[{'token': 'شرکت', 'label': 'B-org'}, {'token': 'هوش', 'label': 'I-org'}, {'token': 'مصنوعی', 'label': 'I-org'}, {'token': 'هزار', 'label': 'I-org'}]]
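The NER labels above look like BIO tags (`B-org` opens an entity, `I-org` continues it). A rough sketch of merging such token-level tags into entity spans, using the printed output as literal data. The helper `group_entities` is hypothetical, and treating any label without a `B-`/`I-` prefix (such as `O`) as "outside an entity" is an assumption about the tag set:

```python
def group_entities(tagged_tokens):
    """Merge B-/I- tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Close the entity being built, if there is one.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for item in tagged_tokens:
        prefix, _, entity_type = item["label"].partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A new entity starts (a stray I- tag is treated like B-).
            flush()
            current_type, current_tokens = entity_type, [item["token"]]
        elif prefix == "I":
            # Same entity type continues.
            current_tokens.append(item["token"])
        else:
            # Assumed "outside" label: close any open entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities


# NER output copied from the example above (first and only input).
ner_outputs = [[
    {"token": "شرکت", "label": "B-org"},
    {"token": "هوش", "label": "I-org"},
    {"token": "مصنوعی", "label": "I-org"},
    {"token": "هزار", "label": "I-org"},
]]
print(group_entities(ner_outputs[0]))
```

For the sample sentence, all four tokens merge into a single `org` span.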
from hezar.models import Model
model = Model.load("hezarai/roberta-fa-mask-filling")
inputs = ["سلام بچه ها حالتون <mask>"]
outputs = model.predict(inputs, top_k=1)
print(outputs)
[[{'token': 'چطوره', 'sequence': 'سلام بچه ها حالتون چطوره', 'token_id': 34505, 'score': 0.2230483442544937}]]
from hezar.models import Model
model = Model.load("hezarai/whisper-small-fa")
transcripts = model.predict("examples/assets/speech_example.mp3")
print(transcripts)
[{'text': 'و این تنها محدود به محیط کار نیست'}]
from hezar.models import Model
from hezar.utils import load_image, draw_boxes, show_image
model = Model.load("hezarai/CRAFT")
image = load_image("../assets/text_detection_example.png")
outputs = model.predict(image)
result_image = draw_boxes(image, outputs[0]["boxes"])
show_image(result_image, "result")
from hezar.models import Model
# OCR with CRNN
model = Model.load("hezarai/crnn-fa-printed-96-long")
texts = model.predict("examples/assets/ocr_example.jpg")
print(f"CRNN Output: {texts}")
CRNN Output: [{'text': 'چه میشه کرد، باید صبر کنیم'}]
from hezar.models import Model
model = Model.load("hezarai/crnn-fa-license-plate-recognition-v2")
plate_text = model.predict("assets/license_plate_ocr_example.jpg")
print(plate_text) # Persian text of mixed numbers and characters might not show correctly in the console
[{'text': '۵۷س۷۷۹۷۷'}]
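Because mixed Persian digits and letters can render out of order in a console, it can help to convert the digits to ASCII before logging or comparing plates. A small sketch under the assumption that the model emits Persian digits in the U+06F0 to U+06F9 range, as the sample output appears to. The helper `ascii_digits` is hypothetical and not part of Hezar:

```python
# Map Persian digits (U+06F0..U+06F9) to ASCII digits 0..9.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
TO_ASCII = str.maketrans(PERSIAN_DIGITS, "0123456789")


def ascii_digits(text):
    """Replace Persian digits with ASCII digits, leaving letters untouched."""
    return text.translate(TO_ASCII)


# Output copied from the license plate example above.
plate_text = [{"text": "۵۷س۷۷۹۷۷"}]
print(ascii_digits(plate_text[0]["text"]))
```

The Persian letter in the plate is left as-is; only the digits change.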
from hezar.models import Model
model = Model.load("hezarai/vit-roberta-fa-image-captioning-flickr30k")
texts = model.predict("examples/assets/image_captioning_example.jpg")
print(texts)
[{'text': 'سگی با توپ تنیس در دهانش می دود.'}]
We are constantly adding and training new models, so this section should grow over time ;)
from hezar.embeddings import Embedding
fasttext = Embedding.load("hezarai/fasttext-fa-300")
most_similar = fasttext.most_similar("هزار")
print(most_similar)
[{'score': 0.7579, 'word': 'میلیون'},
{'score': 0.6943, 'word': '21هزار'},
{'score': 0.6861, 'word': 'میلیارد'},
{'score': 0.6825, 'word': '26هزار'},
{'score': 0.6803, 'word': '٣هزار'}]
from hezar.embeddings import Embedding
word2vec = Embedding.load("hezarai/word2vec-skipgram-fa-wikipedia")
most_similar = word2vec.most_similar("هزار")
print(most_similar)
[{'score': 0.7885, 'word': 'چهارهزار'},
{'score': 0.7788, 'word': '۱۰هزار'},
{'score': 0.7727, 'word': 'دویست'},
{'score': 0.7679, 'word': 'میلیون'},
{'score': 0.7602, 'word': 'پانصد'}]
from hezar.embeddings import Embedding
word2vec = Embedding.load("hezarai/word2vec-cbow-fa-wikipedia")
most_similar = word2vec.most_similar("هزار")
print(most_similar)
[{'score': 0.7407, 'word': 'دویست'},
{'score': 0.7400, 'word': 'میلیون'},
{'score': 0.7326, 'word': 'صد'},
{'score': 0.7276, 'word': 'پانصد'},
{'score': 0.7011, 'word': 'سیصد'}]
For a full guide on the embeddings module, see the embeddings tutorial.
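The scores in the outputs above look like cosine similarities, which is the usual way word-embedding libraries rank neighbours. The following is a toy sketch of that idea in plain Python, not Hezar's implementation: the vectors, the English stand-in words, and the `most_similar` function below are all made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, top_n=5):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        {"score": round(cosine(query, vector), 4), "word": other}
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda item: item["score"], reverse=True)[:top_n]


# Toy 3-dimensional vectors; real embeddings have hundreds of dimensions.
toy_vectors = {
    "thousand": [1.0, 0.9, 0.1],
    "million": [0.9, 1.0, 0.2],
    "hundred": [1.0, 0.7, 0.0],
    "cat": [0.0, 0.1, 1.0],
}
print(most_similar("thousand", toy_vectors, top_n=2))
```

With these toy vectors the two number words rank above the unrelated word, mirroring how the real models return other numerals for "هزار".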
You can load any of the datasets on the Hub like below:
from hezar.data import Dataset
# The `preprocessor` depends on what you want to do exactly later on. Below are just examples.
sentiment_dataset = Dataset.load("hezarai/sentiment-dksf", preprocessor="hezarai/bert-base-fa") # A TextClassificationDataset instance
lscp_dataset = Dataset.load("hezarai/lscp-pos-500k", preprocessor="hezarai/bert-base-fa") # A SequenceLabelingDataset instance
xlsum_dataset = Dataset.load("hezarai/xlsum-fa", preprocessor="hezarai/t5-base-fa") # A TextSummarizationDataset instance
alpr_ocr_dataset = Dataset.load("hezarai/persian-license-plate-v1", preprocessor="hezarai/crnn-fa-printed-96-long") # An OCRDataset instance
flickr30k_dataset = Dataset.load("hezarai/flickr30k-fa", preprocessor="hezarai/vit-roberta-fa-base") # An ImageCaptioningDataset instance
commonvoice_dataset = Dataset.load("hezarai/common-voice-13-fa", preprocessor="hezarai/whisper-small-fa") # A SpeechRecognitionDataset instance
...
The returned dataset objects from load() are PyTorch Dataset wrappers for specific tasks and can be used by a data loader out-of-the-box!
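To illustrate what "usable by a data loader" means, here is a dependency-free sketch of the map-style dataset protocol (`__len__` plus `__getitem__`) and a simple batching loop over it. `ToyDataset` and `iter_batches` are hypothetical stand-ins; a real setup would pass a Hezar dataset to a PyTorch data loader instead.

```python
class ToyDataset:
    """Map-style dataset: supports len(dataset) and dataset[index]."""

    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        return {"text": self.texts[index], "label": self.labels[index]}


def iter_batches(dataset, batch_size):
    """Yield lists of samples by indexing the dataset in fixed-size steps."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


dataset = ToyDataset(["a", "b", "c"], [0, 1, 0])
batches = list(iter_batches(dataset, batch_size=2))
print([len(batch) for batch in batches])
```

A real data loader additionally shuffles, collates samples into tensors, and can load in parallel; the sketch only shows the indexing contract the dataset has to satisfy.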
You can also load Hezar's datasets using 🤗Datasets:
from datasets import load_dataset
dataset = load_dataset("hezarai/sentiment-dksf")
For a full guide on Hezar's datasets, see the datasets tutorial.
Hezar makes it super easy to train models using out-of-the-box models and datasets provided in the library.
from hezar.models import BertSequenceLabeling, BertSequenceLabelingConfig
from hezar.data import Dataset
from hezar.trainer import Trainer, TrainerConfig
from hezar.preprocessors import Preprocessor
base_model_path = "hezarai/bert-base-fa"
dataset_path = "hezarai/lscp-pos-500k"
train_dataset = Dataset.load(dataset_path, split="train", tokenizer_path=base_model_path)
eval_dataset = Dataset.load(dataset_path, split="test", tokenizer_path=base_model_path)
model = BertSequenceLabeling(BertSequenceLabelingConfig(id2label=train_dataset.config.id2label))
preprocessor = Preprocessor.load(base_model_path)
train_config = TrainerConfig(
output_dir="bert-fa-pos-lscp-500k",
task="sequence_labeling",
device="cuda",
init_weights_from=base_model_path,
batch_size=8,
num_epochs=5,
metrics=["seqeval"],
)
trainer = Trainer(
config=train_config,
model=model,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=train_dataset.data_collator,
preprocessor=preprocessor,
)
trainer.train()
trainer.push_to_hub("bert-fa-pos-lscp-500k") # push model, config, preprocessor, trainer files and configs
You can actually go way deeper with the Trainer. See more details here.
Hezar hosts everything on the Hugging Face Hub. When you use the .load() method for a model, dataset, etc., it's downloaded and saved in the cache (at ~/.cache/hezar), so the next time you load the same asset, the cached version is used, which works even when offline. If you want to export assets more explicitly, you can use the .save() method to save anything to any local path.
from hezar.models import Model
# Load the online model
model = Model.load("hezarai/bert-fa-ner-arman")
# Save the model locally
save_path = "./weights/bert-fa-ner-arman"
model.save(save_path) # The weights, config, preprocessors, etc. are saved at `./weights/bert-fa-ner-arman`
# Now you can load the saved model
local_model = Model.load(save_path)
Moreover, any class that has .load() and .save() can be treated the same way.
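The cache-first behaviour described in this section can be sketched generically as follows. This is not Hezar's code: `load_with_cache`, the `fetch` callback, and the on-disk naming scheme are assumptions made for illustration, and a temporary directory stands in for the real cache folder.

```python
import tempfile
from pathlib import Path


def load_with_cache(asset_id, cache_dir, fetch):
    """Return (content, source), calling fetch() only on a cache miss."""
    # Hypothetical naming scheme: flatten "org/name" into one file name.
    cached = Path(cache_dir) / asset_id.replace("/", "--")
    if cached.exists():
        return cached.read_text(encoding="utf-8"), "cache"
    content = fetch(asset_id)
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_text(content, encoding="utf-8")
    return content, "download"


def fake_fetch(asset_id):
    """Stand-in for a network download."""
    return f"weights for {asset_id}"


with tempfile.TemporaryDirectory() as cache_dir:
    first = load_with_cache("hezarai/bert-fa-ner-arman", cache_dir, fake_fetch)
    second = load_with_cache("hezarai/bert-fa-ner-arman", cache_dir, fake_fetch)

print(first[1], second[1])
```

The first call goes through the download path and fills the cache; the second is served from disk, which is why a previously loaded asset keeps working offline.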
Hezar's primary focus is on providing ready-to-use models (implementations and pretrained weights) for common tasks, not by reinventing the wheel, but by building on top of PyTorch, 🤗Transformers, 🤗Tokenizers, 🤗Datasets, Scikit-learn, Gensim, etc. It is also deeply integrated with the 🤗Hugging Face Hub, and almost any module (e.g., models, datasets, preprocessors, trainers) can be uploaded to or downloaded from the Hub!
More specifically, here's a simple summary of the core modules in Hezar:
Models: each model is a hezar.models.Model instance, which is in fact a PyTorch nn.Module wrapper with extra features for saving, loading, exporting, etc.
Datasets: each dataset is a hezar.data.Dataset instance, which is a PyTorch Dataset implemented specifically for each task that can load the data files from the Hugging Face Hub.
For more info, check the tutorials.
Maintaining Hezar is no cakewalk with just a few of us on board. The concept might not be groundbreaking, but putting it into action was a real challenge and that's why Hezar stands as the biggest Persian open source project of its kind!
Any contribution, big or small, would mean a lot to us. So, if you're interested, let's team up and make Hezar even better together! ❤️
Don't forget to check out our contribution guidelines in CONTRIBUTING.md before diving in. Your support is much appreciated!
We highly recommend submitting any issues or questions in the issues or discussions section, but in case you need direct contact, here it is:
If you found this project useful in your work or research, please cite it using this BibTeX entry:
@misc{hezar2023,
title = {Hezar: The all-in-one AI library for Persian},
author = {Aryan Shekarlaban and Pooya Mohammadi Kazaj},
publisher = {GitHub},
howpublished = {\url{https://github.com/hezarai/hezar}},
year = {2023}
}