
spacy-huggingface-pipelines
This package provides spaCy components to use pretrained Hugging Face Transformers pipelines for inference only.
It can apply pretrained transformers models such as dslim/bert-base-NER and distilbert-base-uncased-finetuned-sst-2-english.
Installing the package from pip will automatically install all dependencies, including PyTorch and spaCy.
pip install -U pip setuptools wheel
pip install spacy-huggingface-pipelines
For GPU installation, follow the spaCy installation quickstart with GPU, e.g.
pip install -U spacy[cuda-autodetect]
If you are having trouble installing PyTorch, follow the instructions on the official website for your specific operating system and requirements.
This module provides spaCy wrappers for the inference-only transformers TokenClassificationPipeline and TextClassificationPipeline pipelines.
The models are downloaded on initialization from the Hugging Face Hub if they're not already in your local cache, or alternatively they can be loaded from a local path.
Note that the transformer model data is not saved with the pipeline when you call nlp.to_disk, so if you are loading pipelines in an environment with limited internet access, make sure the model is available in your transformers cache directory and enable offline mode if needed.
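One way to enable offline mode from Python is sketched below. It assumes the environment variables HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE, which the Hugging Face libraries read, and that they are set before transformers is imported. The helper name enable_hf_offline_mode is hypothetical and not part of this package.

```python
import os


def enable_hf_offline_mode() -> dict:
    """Set offline-mode environment variables for the current process.

    Assumption: HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE are the variables
    read by huggingface_hub and transformers. Call this before importing
    transformers or creating the spaCy pipeline.
    """
    settings = {"HF_HUB_OFFLINE": "1", "TRANSFORMERS_OFFLINE": "1"}
    os.environ.update(settings)
    return settings


enable_hf_offline_mode()
# From here on, models should only be resolved from the local cache
# or from a local path passed as the "model" setting.
```

Setting the same variables in the shell before starting Python should have the same effect.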
Config settings for hf_token_pipe:
[components.hf_token_pipe]
factory = "hf_token_pipe"
model = "dslim/bert-base-NER" # Model name or path
revision = "main" # Model revision
aggregation_strategy = "average" # "simple", "first", "average", "max"
stride = 16 # If stride >= 0, process long texts in
# overlapping windows of the model max
# length. The value is the length of the
# window overlap in transformer tokenizer
# tokens, NOT the length of the stride.
kwargs = {} # Any additional arguments for
# TokenClassificationPipeline
alignment_mode = "strict" # "strict", "contract", "expand"
annotate = "ents" # "ents", "pos", "spans", "tag"
annotate_spans_key = null # Doc.spans key for annotate = "spans"
scorer = null # Optional scorer
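To make the meaning of stride concrete, here is a small illustration of how a long token sequence could be split into windows when stride is the overlap between windows rather than the step size. This is a simplified sketch, not the transformers implementation: the function name overlapping_windows is hypothetical, and it ignores special tokens that a real tokenizer adds to each window.

```python
from typing import List, Tuple


def overlapping_windows(
    n_tokens: int, max_length: int, overlap: int
) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets of windows of at most max_length
    tokens, where consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    step = max_length - overlap  # the actual step between window starts
    windows = []
    start = 0
    while True:
        end = min(start + max_length, n_tokens)
        windows.append((start, end))
        if end >= n_tokens:
            break
        start += step
    return windows


print(overlapping_windows(n_tokens=40, max_length=16, overlap=4))
# [(0, 16), (12, 28), (24, 40)]
```

With an overlap of 4 and a window length of 16, each window starts 12 tokens after the previous one, so tokens 12-15 and 24-27 are seen twice.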
TokenClassificationPipeline settings
model: The model name or path.
revision: The model revision. For production use, a specific git commit is recommended instead of the default main.
stride: For stride >= 0, the text is processed in overlapping windows where the stride setting specifies the number of overlapping tokens between windows (NOT the stride length). If stride is None, then the text may be truncated. stride is only supported for fast tokenizers.
aggregation_strategy: The aggregation strategy determines the word-level tags for cases where subwords within one word do not receive the same predicted tag. See: https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline.aggregation_strategy
kwargs: Any additional arguments to TokenClassificationPipeline.
alignment_mode determines how transformer predictions are aligned to spaCy token boundaries as described for Doc.char_span.
annotate and annotate_spans_key configure how the annotation is saved to the spaCy doc. You can save the output as token.tag_, token.pos_ (only for UPOS tags), doc.ents or doc.spans.
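The three alignment modes can be illustrated without spaCy. The sketch below mimics the behavior documented for Doc.char_span on a list of token character offsets: "strict" requires the predicted span to match token boundaries exactly, "contract" keeps only tokens fully inside the span, and "expand" keeps every token the span touches. The function align_span and the hand-written offsets are hypothetical and only for illustration.

```python
from typing import List, Optional, Tuple


def align_span(
    token_offsets: List[Tuple[int, int]], start: int, end: int, mode: str = "strict"
) -> Optional[Tuple[int, int]]:
    """Map a character span to a (first_token, end_token_exclusive) pair,
    or None if the span cannot be aligned in the given mode."""
    if mode == "strict":
        starts = {s: i for i, (s, _) in enumerate(token_offsets)}
        ends = {e: i for i, (_, e) in enumerate(token_offsets)}
        if start in starts and end in ends and starts[start] <= ends[end]:
            return (starts[start], ends[end] + 1)
        return None
    if mode == "contract":
        # Tokens completely inside the character span.
        covered = [i for i, (s, e) in enumerate(token_offsets) if s >= start and e <= end]
    elif mode == "expand":
        # Tokens at least partially covered by the character span.
        covered = [i for i, (s, e) in enumerate(token_offsets) if s < end and e > start]
    else:
        raise ValueError(f"unknown alignment mode: {mode}")
    if not covered:
        return None
    return (covered[0], covered[-1] + 1)


# Character offsets for the tokens of "I live in London".
offsets = [(0, 1), (2, 6), (7, 9), (10, 16)]
# A prediction covering characters 8-16 ("n London") starts mid-token.
for mode in ("strict", "contract", "expand"):
    print(mode, align_span(offsets, 8, 16, mode))
# strict None
# contract (3, 4)
# expand (2, 4)
```

In this example "strict" drops the misaligned prediction, "contract" shrinks it to "London", and "expand" grows it to "in London".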
Save the annotation as Doc.ents:
import spacy
nlp = spacy.blank("en")
nlp.add_pipe("hf_token_pipe", config={"model": "dslim/bert-base-NER"})
doc = nlp("My name is Sarah and I live in London")
print(doc.ents)
# (Sarah, London)
Save the annotation as Doc.spans[spans_key] and scores as Doc.spans[spans_key].attrs["scores"]:
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "dslim/bert-base-NER",
        "annotate": "spans",
        "annotate_spans_key": "bert-base-ner",
    },
)
doc = nlp("My name is Sarah and I live in London")
print(doc.spans["bert-base-ner"])
# [Sarah, London]
print(doc.spans["bert-base-ner"].attrs["scores"])
# [0.99854773, 0.9996215]
Save the annotation as Token.tag:
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={
        "model": "QCRI/bert-base-multilingual-cased-pos-english",
        "annotate": "tag",
    },
)
doc = nlp("My name is Sarah and I live in London")
print([t.tag_ for t in doc])
# ['PRP$', 'NN', 'VBZ', 'NNP', 'CC', 'PRP', 'VBP', 'IN', 'NNP']
Save the annotation as Token.pos:
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_token_pipe",
    config={"model": "vblagoje/bert-english-uncased-finetuned-pos", "annotate": "pos"},
)
doc = nlp("My name is Sarah and I live in London")
print([t.pos_ for t in doc])
# ['PRON', 'NOUN', 'AUX', 'PROPN', 'CCONJ', 'PRON', 'VERB', 'ADP', 'PROPN']
Config settings for hf_text_pipe:
[components.hf_text_pipe]
factory = "hf_text_pipe"
model = "distilbert-base-uncased-finetuned-sst-2-english" # Model name or path
revision = "main" # Model revision
kwargs = {} # Any additional arguments for
# TextClassificationPipeline
scorer = null # Optional scorer
The input texts are truncated according to the transformers model max length.
TextClassificationPipeline settings
model: The model name or path.
revision: The model revision. For production use, a specific git commit is recommended instead of the default main.
kwargs: Any additional arguments to TextClassificationPipeline.
import spacy
nlp = spacy.blank("en")
nlp.add_pipe(
    "hf_text_pipe",
    config={"model": "distilbert-base-uncased-finetuned-sst-2-english"},
)
doc = nlp("This is great!")
print(doc.cats)
# {'POSITIVE': 0.9998694658279419, 'NEGATIVE': 0.00013048505934420973}
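Since doc.cats is a plain dictionary mapping labels to scores, picking the predicted label is a matter of taking the highest-scoring key. The sketch below uses the scores from the output above; the helper top_category and its threshold parameter are hypothetical conveniences, not part of the package.

```python
from typing import Dict, Optional


def top_category(cats: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the highest-scoring label, or None if there are no scores
    or the best score is below the (assumed) threshold."""
    if not cats:
        return None
    label = max(cats, key=cats.get)
    return label if cats[label] >= threshold else None


# Scores copied from the example output above.
cats = {"POSITIVE": 0.9998694658279419, "NEGATIVE": 0.00013048505934420973}
print(top_category(cats))
# POSITIVE
```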
Both token and text classification support batching with nlp.pipe:
for doc in nlp.pipe(texts, batch_size=256):
    do_something(doc)
If the component runs into an error processing a batch (e.g. on an empty text), nlp.pipe will back off to processing each text individually. If it runs into an error on an individual text, a warning is shown and the doc is returned without additional annotation.
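The back-off behavior described above follows a common pattern, sketched here in plain Python. This is not the component's actual code: process_with_backoff and fake_batch are hypothetical names, and None stands in for "the doc is returned without additional annotation".

```python
import warnings
from typing import Callable, List, Optional, Sequence


def process_with_backoff(
    texts: Sequence[str],
    process_batch: Callable[[Sequence[str]], List[str]],
) -> List[Optional[str]]:
    """Try the whole batch first; on failure, retry each text on its own
    and warn about the ones that still fail."""
    try:
        return list(process_batch(texts))
    except Exception:
        pass  # fall through to per-text processing
    results: List[Optional[str]] = []
    for text in texts:
        try:
            results.extend(process_batch([text]))
        except Exception as err:
            warnings.warn(f"Unable to process text {text!r}: {err}")
            results.append(None)  # placeholder for an unannotated doc
    return results


def fake_batch(batch: Sequence[str]) -> List[str]:
    """Stand-in for a model call that fails on empty texts."""
    if any(not text for text in batch):
        raise ValueError("empty text")
    return [text.upper() for text in batch]


with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(process_with_backoff(["ok", "", "fine"], fake_batch))
# ['OK', None, 'FINE']
```

One failing text therefore costs an extra pass over its batch but does not discard the results for the other texts.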
Switch to GPU:
import spacy
spacy.require_gpu()
for doc in nlp.pipe(texts):
    do_something(doc)
Please report bugs in the spaCy issue tracker or open a new thread on the discussion board for other issues.