pysbd (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection module that works out-of-the-box across many languages.
Readme
pySBD - Python Sentence Boundary Disambiguation (SBD) - is a rule-based sentence boundary detection module that works out-of-the-box.
This project is a direct port of the Ruby gem Pragmatic Segmenter, which provides rule-based sentence boundary detection.
The short research paper 'PySBD: Pragmatic Sentence Boundary Disambiguation' was accepted at the 2nd Workshop for Natural Language Processing Open Source Software (NLP-OSS) at EMNLP 2020.
Research Paper:
https://arxiv.org/abs/2010.09657
Poster:
Python
pip install pysbd
import pysbd
text = "My name is Jonas E. Smith. Please turn to p. 55."
seg = pysbd.Segmenter(language="en", clean=False)
print(seg.segment(text))
# ['My name is Jonas E. Smith.', 'Please turn to p. 55.']
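To see why a rule-based segmenter is useful here, the sketch below runs a naive baseline on the same text. The `naive_split` helper is hypothetical (it is not part of pysbd) and simply breaks after every period, question mark, or exclamation mark followed by whitespace, so it trips over the initial "E." and the abbreviation "p.". The pysbd call is the one shown above and runs only if the package is installed.

```python
import re

text = "My name is Jonas E. Smith. Please turn to p. 55."


def naive_split(s):
    # Hypothetical baseline, not part of pysbd: split on whitespace
    # that follows '.', '!' or '?', with no knowledge of abbreviations.
    return re.split(r"(?<=[.!?])\s+", s.strip())


naive = naive_split(text)
print(naive)
# ['My name is Jonas E.', 'Smith.', 'Please turn to p.', '55.']

try:
    import pysbd
except ImportError:
    # pysbd is optional for this sketch; the baseline above still runs.
    pysbd = None

if pysbd is not None:
    seg = pysbd.Segmenter(language="en", clean=False)
    # Expected to yield two sentences rather than four fragments.
    print(seg.segment(text))
```

The baseline returns four fragments where the text has two sentences, which is the kind of ambiguity the rule set is meant to resolve.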
Use pysbd as a spaCy pipeline component (recommended):
import spacy
from pysbd.utils import PySBDFactory
nlp = spacy.blank('en')
# explicitly add the component to the pipeline
# (recommended: makes it easier to tell what is going on)
# note: passing a component object to add_pipe is the spaCy 2.x-style API
nlp.add_pipe(PySBDFactory(nlp))
# or you can use it implicitly with keyword
# pysbd = nlp.create_pipe('pysbd')
# nlp.add_pipe(pysbd)
doc = nlp('My name is Jonas E. Smith. Please turn to p. 55.')
print(list(doc.sents))
# [My name is Jonas E. Smith., Please turn to p. 55.]
If you want to contribute a new feature or language support, or you have found text that pySBD segments incorrectly, please read CONTRIBUTING.md and then follow these steps:
Create your feature branch: git checkout -b my-new-feature
Commit your changes: git commit -am 'Add some feature'
Push to the branch: git push origin my-new-feature
If you use the pysbd package in your projects or research, please cite PySBD: Pragmatic Sentence Boundary Disambiguation.
@inproceedings{sadvilkar-neumann-2020-pysbd,
title = "{P}y{SBD}: Pragmatic Sentence Boundary Disambiguation",
author = "Sadvilkar, Nipun and
Neumann, Mark",
booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.nlposs-1.15",
pages = "110--114",
abstract = "We present a rule-based sentence boundary disambiguation Python package that works out-of-the-box for 22 languages. We aim to provide a realistic segmenter which can provide logical sentences even when the format and domain of the input text is unknown. In our work, we adapt the Golden Rules Set (a language specific set of sentence boundary exemplars) originally implemented as a ruby gem pragmatic segmenter which we ported to Python with additional improvements and functionality. PySBD passes 97.92{\%} of the Golden Rule Set examplars for English, an improvement of 25{\%} over the next best open source Python tool.",
}
This project wouldn't be possible without the great work done by the Pragmatic Segmenter team.