This package wraps the fast and efficient UDPipe language-agnostic NLP pipeline (via its Python bindings), so you can use UDPipe pre-trained models as a spaCy pipeline for 50+ languages out of the box. Inspired by spacy-stanza, this package offers slightly less accurate models that are, in turn, much faster (see benchmarks for UDPipe and Stanza).
Use the package manager pip to install spacy-udpipe.
pip install spacy-udpipe
After installation, use spacy_udpipe.download() to download the pre-trained model for the desired language.
A full list of pre-trained UDPipe models for supported languages can be found in languages.json.
The loaded UDPipeLanguage class returns a spaCy Language object, i.e., the object you can use to process text and create a Doc object.
import spacy_udpipe
spacy_udpipe.download("en") # download English model
text = "Wikipedia is a free online encyclopedia, created and edited by volunteers around the world."
nlp = spacy_udpipe.load("en")
doc = nlp(text)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)
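For anything beyond a quick look, it may be handier to collect these attributes into rows instead of printing them. The sketch below assumes only that nlp is a callable returning an iterable of tokens with text, lemma_, pos_ and dep_ attributes, as in the snippet above. The names token_rows and format_rows are hypothetical helpers for illustration, not part of spacy-udpipe.

```python
from typing import Callable, Iterable, List, Tuple

# One row per token: (text, lemma, POS tag, dependency label).
Row = Tuple[str, str, str, str]


def token_rows(nlp: Callable[[str], Iterable], text: str) -> List[Row]:
    """Run nlp on text and return one attribute row per token."""
    return [(t.text, t.lemma_, t.pos_, t.dep_) for t in nlp(text)]


def format_rows(rows: List[Row]) -> str:
    """Join rows into tab-separated lines, one token per line."""
    return "\n".join("\t".join(row) for row in rows)
```

With a loaded model, this would be called as format_rows(token_rows(nlp, text)).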
As all attributes are computed once and set in the custom Tokenizer, the Language.pipeline is empty.
The type of text can be one of the following: str, List[str], List[List[str]].
The following code snippet demonstrates how to load a custom UDPipe model (for the Croatian language):
import spacy_udpipe
nlp = spacy_udpipe.load_from_path(lang="hr",
                                  path="./custom_croatian.udpipe",
                                  meta={"description": "Custom 'hr' model"})
text = "Wikipedija je enciklopedija slobodnog sadržaja."
doc = nlp(text)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)
This can be done for any of the languages supported by spaCy. For an exhaustive list, see spaCy languages.
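Because a custom model is read from a local file, it can help to check the path before loading. This is a minimal sketch that assumes the model is a single file on disk. The helper resolve_model_path is hypothetical and not part of spacy-udpipe.

```python
from pathlib import Path


def resolve_model_path(path: str) -> str:
    """Return the absolute path to a model file, or fail early if it is missing."""
    p = Path(path).expanduser()
    if not p.is_file():
        raise FileNotFoundError(f"UDPipe model not found: {p}")
    return str(p.resolve())
```

The returned string could then be passed as the path argument in the load_from_path call shown above.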
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update the tests as appropriate. Tests are run automatically for each pull request on the master branch.
To run the tests locally, first install the package with pip install -e . and then run pytest in the root source directory.
Maintained by Text Analysis and Knowledge Engineering Lab (TakeLab).
Tag map
Token.tag_ is a CoNLL XPOS tag (language-specific part-of-speech tag), defined for each language separately by the corresponding Universal Dependencies treebank. Mappings between XPOS and Universal Dependencies POS tags should be defined in a TAG_MAP dictionary (located in language-specific tag_map.py files), along with optional morphological features. See spaCy tag map for more details.
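As a rough illustration of the shape such a mapping takes, the sketch below uses a plain dictionary from XPOS tags to a Universal Dependencies POS tag plus optional morphological features. The entries use Penn Treebank style tags and are illustrative only. The string key "pos" and the helper lookup_tag are assumptions made for this sketch, so check the spaCy tag map documentation for the exact keys and symbols expected.

```python
from typing import Dict

# Illustrative XPOS -> UPOS (+ morphological features) mapping.
# The "pos" key is an assumption of this sketch, not a confirmed spaCy convention.
TAG_MAP: Dict[str, Dict[str, str]] = {
    "NN": {"pos": "NOUN", "Number": "Sing"},
    "NNS": {"pos": "NOUN", "Number": "Plur"},
    "VBD": {"pos": "VERB", "Tense": "Past", "VerbForm": "Fin"},
    "JJ": {"pos": "ADJ", "Degree": "Pos"},
}


def lookup_tag(xpos: str) -> Dict[str, str]:
    """Return the UPOS tag and features for an XPOS tag, or UPOS 'X' if unmapped."""
    return TAG_MAP.get(xpos, {"pos": "X"})
```

Tags missing from the dictionary fall back to the Universal Dependencies tag X (other), which makes gaps in the mapping easy to spot.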
Syntax iterators
In order to extract Doc.noun_chunks, a proper syntax iterator implementation for the language of interest is required. For more details, please see spaCy syntax iterators.
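To give a feel for what a syntax iterator does, here is a simplified, library-free sketch: it walks over tokens and yields (start, end, label) spans for nominal heads, with the end index exclusive. The Tok class, the NP_DEPS label set and the sample sentence are assumptions made for illustration. A real spaCy syntax iterator operates on Doc or Span objects and is tuned per language.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Tok:
    text: str
    pos: str        # Universal Dependencies POS tag
    dep: str        # dependency label
    left_edge: int  # index of the leftmost token in this token's subtree


# Dependency labels treated as noun-phrase heads (an illustrative choice).
NP_DEPS = {"nsubj", "obj", "iobj", "obl", "nmod"}


def noun_chunks(tokens: List[Tok]) -> Iterator[Tuple[int, int, str]]:
    """Yield non-overlapping (start, end, "NP") spans, end exclusive."""
    prev_end = 0
    for i, tok in enumerate(tokens):
        if tok.pos not in {"NOUN", "PROPN", "PRON"} or tok.dep not in NP_DEPS:
            continue
        if tok.left_edge < prev_end:
            continue  # would overlap the previous chunk, so skip it
        yield tok.left_edge, i + 1, "NP"
        prev_end = i + 1


sentence = [
    Tok("The", "DET", "det", 0),
    Tok("quick", "ADJ", "amod", 1),
    Tok("fox", "NOUN", "nsubj", 0),
    Tok("chased", "VERB", "root", 0),
    Tok("a", "DET", "det", 4),
    Tok("cat", "NOUN", "obj", 4),
]
chunks = list(noun_chunks(sentence))
```

For the sample sentence, the two spans cover "The quick fox" and "a cat".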
Other language-specific issues
A quick way to check language-specific defaults in spaCy is to visit spaCy language support. Also, please see spaCy language data for details regarding other language-specific data.
FAQs
Use fast UDPipe models directly in spaCy
We found that spacy-udpipe demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 2 open source maintainers collaborating on the project.