HuSpaCy is a spaCy library providing industrial-strength Hungarian language processing facilities through spaCy models. The released pipelines consist of a tokenizer, sentence splitter, lemmatizer, tagger (which also predicts morphological features), dependency parser and a named entity recognition module. Word and phrase embeddings are also available through spaCy's API. All models have high throughput, decent memory usage and close to state-of-the-art accuracy. A live demo is available here, and model releases are published to Hugging Face Hub.
This repository contains material to build HuSpaCy and all of its models in a reproducible way.
To get started using the tool, we first need to download one of the models. The easiest way to achieve this is to install huspacy (from PyPI) and then fetch a model through its API.
pip install huspacy
import huspacy
# Download the latest CPU optimized model
huspacy.download()
You can install the latest models directly from 🤗 Hugging Face Hub:
pip install hu_core_news_lg@https://huggingface.co/huspacy/hu_core_news_lg/resolve/main/hu_core_news_lg-any-py3-none-any.whl
pip install hu_core_news_trf@https://huggingface.co/huspacy/hu_core_news_trf/resolve/main/hu_core_news_trf-any-py3-none-any.whl
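Both commands above follow the same URL pattern, so the requirement string can be generated from the model name. The following is a minimal sketch: `HUB_URL_TEMPLATE` and `pip_requirement` are hypothetical names that are not part of HuSpaCy, and the assumption that every model is published under the same pattern is inferred only from the two commands shown.

```python
# Hypothetical helper, not part of HuSpaCy: rebuilds the "name@url" strings
# used in the pip commands above. It is an assumption that other models are
# published under the same URL pattern.
HUB_URL_TEMPLATE = (
    "https://huggingface.co/huspacy/{name}/resolve/main/"
    "{name}-any-py3-none-any.whl"
)


def pip_requirement(model_name: str) -> str:
    """Return a direct-reference requirement string for the given model."""
    url = HUB_URL_TEMPLATE.format(name=model_name)
    return f"{model_name}@{url}"


# Print the install commands for the two models mentioned above.
for name in ("hu_core_news_lg", "hu_core_news_trf"):
    print("pip install", pip_requirement(name))
```

The printed lines can be pasted into a shell or written to a requirements file.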
To speed up inference on GPU, CUDA must be installed as described in https://spacy.io/usage.
Read more on the models here
HuSpaCy is fully compatible with spaCy's API, so newcomers can easily get started with the spaCy 101 guide.
Although HuSpaCy models can be loaded with spacy.load(...), the tool provides convenience methods to easily access downloaded models.
# Load the model using spacy.load(...)
import spacy
nlp = spacy.load("hu_core_news_lg")
# Load the default large model (if downloaded)
import huspacy
nlp = huspacy.load()
# Load the model directly as a module
import hu_core_news_lg
nlp = hu_core_news_lg.load()
To process texts, you can simply call the loaded model (i.e. the nlp callable object):
doc = nlp("Csiribiri csiribiri zabszalma - négy csillag közt alszom ma.")
As HuSpaCy is built on spaCy, the returned doc object contains all the annotations produced by the pipeline components.
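As an illustration of reading those annotations, here is a minimal sketch that collects the most commonly used token and entity attributes. The helper names `token_rows` and `entity_rows` are hypothetical; the attributes they read (`text`, `lemma_`, `pos_`, `dep_`, `doc.ents`, `label_`) are standard spaCy attributes, and which of them are filled in depends on the components of the loaded pipeline.

```python
from typing import Iterable, List, Tuple


def token_rows(doc: Iterable) -> List[Tuple[str, str, str, str]]:
    """Collect (text, lemma, UD part-of-speech, dependency label) per token."""
    return [(tok.text, tok.lemma_, tok.pos_, tok.dep_) for tok in doc]


def entity_rows(doc) -> List[Tuple[str, str]]:
    """Collect (text, label) for every named entity found in the document."""
    return [(ent.text, ent.label_) for ent in doc.ents]
```

With a model loaded as shown earlier, `token_rows(nlp("..."))` returns one tuple per token and `entity_rows(nlp("..."))` one tuple per recognized entity.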
API documentation is available on our website.
We provide several pretrained models:
- hu_core_news_lg is a CNN-based large model that achieves a good balance between accuracy and processing speed. This default model provides tokenization, sentence splitting, part-of-speech tagging (UD labels with detailed morphosyntactic features), lemmatization, dependency parsing and named entity recognition, and ships with pretrained word vectors.
- hu_core_news_trf is built on huBERT and provides the same functionality as the large model except the word vectors. It offers much higher accuracy at the price of increased computational resource usage. We suggest using it with GPU support.
- hu_core_news_md greatly improves on hu_core_news_lg's throughput by losing some accuracy. This model could be a good choice when processing speed is crucial.
- hu_core_news_trf_xl is an experimental model built on XLM-RoBERTa-large. It provides the same functionality as the hu_core_news_trf model, but offers slightly higher accuracy at the price of significantly increased computational resource usage. We suggest using it with GPU support.

HuSpaCy's model versions follow spaCy's versioning scheme.
A demo of the models is available at Hugging Face Spaces.
To learn more about the models' architecture, we suggest reading the relevant sections of spaCy's documentation.
Models | md | lg | trf | trf_xl |
---|---|---|---|---|
Embeddings | 100d floret | 300d floret | transformer:huBERT | transformer:XLM-RoBERTa-large |
Target hardware | CPU | CPU | GPU | GPU |
Accuracy | ⭑⭑⭑⭒ | ⭑⭑⭑⭑ | ⭑⭑⭑⭑⭒ | ⭑⭑⭑⭑⭑ |
Resource usage | ⭑⭑⭑⭑⭑ | ⭑⭑⭑⭑ | ⭑⭑ | ⭒ |
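The trade-offs in the table can be turned into a simple selection rule. The sketch below is only an illustration: `MODELS` and `choose_model` are hypothetical names that are not part of HuSpaCy, and the rule itself (prefer the transformer model on GPU, fall back to the CNN-based models on CPU, never pick the experimental trf_xl automatically) is an assumption layered on top of the table.

```python
# Facts copied from the comparison table; the dictionary itself is a
# hypothetical structure used only for this example.
MODELS = {
    "hu_core_news_md": {"embeddings": "100d floret", "hardware": "CPU"},
    "hu_core_news_lg": {"embeddings": "300d floret", "hardware": "CPU"},
    "hu_core_news_trf": {"embeddings": "huBERT", "hardware": "GPU"},
    "hu_core_news_trf_xl": {"embeddings": "XLM-RoBERTa-large", "hardware": "GPU"},
}


def choose_model(has_gpu: bool, prefer_speed: bool = False) -> str:
    """Pick a model name from the available hardware and the speed preference."""
    if has_gpu:
        # trf_xl is experimental and much heavier, so it is left as an
        # explicit opt-in rather than being selected automatically.
        return "hu_core_news_trf"
    # On CPU, md trades some accuracy for throughput; lg is the default.
    return "hu_core_news_md" if prefer_speed else "hu_core_news_lg"


print(choose_model(has_gpu=False))
print(choose_model(has_gpu=False, prefer_speed=True))
print(choose_model(has_gpu=True))
```

The returned name can be passed to spacy.load(...) once the corresponding model is installed.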
If you use HuSpaCy or any of its models, please cite it as:
@InProceedings{HuSpaCy:2023,
author = {Orosz, Gy{\"o}rgy and Szab{\'o}, Gerg{\H{o}} and Berkecz, P{\'e}ter and Sz{\'a}nt{\'o}, Zsolt and Farkas, Rich{\'a}rd},
editor = {Ek{\v{s}}tein, Kamil and P{\'a}rtl, Franti{\v{s}}ek and Konop{\'i}k, Miloslav},
title = {{Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines}},
booktitle = {{Text, Speech, and Dialogue}},
year = "2023",
publisher = {{Springer Nature Switzerland}},
address = {{Cham}},
pages = "58--69",
isbn = "978-3-031-40498-6"
}
@InProceedings{HuSpaCy:2021,
title = {{HuSpaCy: an industrial-strength Hungarian natural language processing toolkit}},
booktitle = {{XVIII. Magyar Sz{\'a}m{\'\i}t{\'o}g{\'e}pes Nyelv{\'e}szeti Konferencia}},
author = {Orosz, Gy{\"o}rgy and Sz{\' a}nt{\' o}, Zsolt and Berkecz, P{\' e}ter and Szab{\' o}, Gerg{\H o} and Farkas, Rich{\' a}rd},
location = {{Szeged}},
pages = "59--73",
year = {2022},
}
For feature requests, issues and bugs please use the GitHub Issue Tracker. Otherwise, reach out to us in the Discussion Forum.
HuSpaCy is implemented by the SzegedAI team, coordinated by Orosz György in the Hungarian AI National Laboratory, MILAB program.
This library is released under the Apache 2.0 License.
Trained models have their own license (CC BY-SA 4.0) as described on the models page.
FAQs
HuSpaCy: industrial strength Hungarian natural language processing
We found that huspacy demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.