Security News
38% of CISOs Fear They’re Not Moving Fast Enough on AI
CISOs are racing to adopt AI for cybersecurity, but hurdles in budgets and governance may leave some falling behind in the fight against cyber threats.
crosslingual-coreference
Advanced tools
A multi-lingual approach to AllenNLP CoReference Resolution, along with a wrapper for spaCy.
Coreference is amazing but the data required for training a model is very scarce. In our case, the available training for non-English languages also proved to be poorly annotated. Crosslingual Coreference, therefore, uses the assumption a trained model with English data and cross-lingual embeddings should work for languages with similar sentence structures.
pip install crosslingual-coreference
from crosslingual_coreference import Predictor
text = (
"Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
" that location, Nissin was founded. Many students survived by eating these"
" noodles, but they don't even know him."
)
# choose minilm for speed/memory and info_xlm for accuracy
predictor = Predictor(
language="en_core_web_sm", device=-1, model_name="minilm"
)
print(predictor.predict(text)["resolved_text"])
print(predictor.pipe([text])[0]["resolved_text"])
# Note you can also get 'cluster_heads' and 'clusters'
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
As of now, there are two models available "spanbert", "info_xlm", "xlm_roberta", "minilm", which scored 83, 77, 74 and 74 on OntoNotes Release 5.0 English data, respectively.
from crosslingual_coreference import Predictor
predictor = Predictor(
language="en_core_web_sm",
device=0,
model_name="minilm",
chunk_size=2500,
chunk_overlap=2,
)
import spacy
text = (
"Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
" that location, Nissin was founded. Many students survived by eating these"
" noodles, but they don't even know him."
)
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
"xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)
doc = nlp(text)
print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]
print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
print(doc._.cluster_heads)
# Output
#
# {Momofuku Ando: [5, 6],
# instant noodles: [11, 12],
# Osaka: [14, 14],
# Nissin: [21, 21],
# Many students: [26, 27]}
This only works with spacy >= 3.3.
import spacy
from spacy.tokens import Span
from spacy import displacy
text = (
"Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
" that location, Nissin was founded. Many students survived by eating these"
" noodles, but they don't even know him."
)
nlp = spacy.load("nl_core_news_sm")
nlp.add_pipe("xx_coref", config={"model_name": "minilm"})
doc = nlp(text)
spans = []
for idx, cluster in enumerate(doc._.coref_clusters):
for span in cluster:
spans.append(
Span(doc, span[0], span[1]+1, str(idx).upper())
)
doc.spans["custom"] = spans
displacy.render(doc, style="span", options={"spans_key": "custom"})
FAQs
A multi-lingual approach to AllenNLP CoReference Resolution, along with a wrapper for spaCy.
We found that crosslingual-coreference demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 2 open source maintainers collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
CISOs are racing to adopt AI for cybersecurity, but hurdles in budgets and governance may leave some falling behind in the fight against cyber threats.
Research
Security News
Socket researchers uncovered a backdoored typosquat of BoltDB in the Go ecosystem, exploiting Go Module Proxy caching to persist undetected for years.
Security News
Company News
Socket is joining TC54 to help develop standards for software supply chain security, contributing to the evolution of SBOMs, CycloneDX, and Package URL specifications.