
A generic pre escape and post recover text tags for NLP/ML pipelines - Use at your own risk !
NLP (natural language processing) by definition deals with human language as it is spoken and written. We are considering any written language here.
At the same time, the text to be processed may be tagged in many ways: HTML, XML, POS tagging, etc.
Since you are here, you are most likely using external NLP libraries that have their own logic and considerations; that is, external libraries don't know about your data, and you should remove tags before processing the natural language.
But why do you need to remove tags before processing? Simply because tags are very likely to confuse the machine learning models or lower their effectiveness.
Again, there might be a solution for a particular case and library, but I think there is no general way to strip tags from text, do the processing, and then recover the initial structure. But why?
Consider this text:
Hello world <tag>blablablabla hellooooo</tag> and so on. That was a vEⓡ𝔂 𝔽𝕌Ňℕy ţ乇𝕏𝓣. Not only, I am leaving my credit card for You: <red>4929 9425 8354 2322 - Visa</red> here you have it !
You might think of indexing the tags, processing the text, and then restoring the tags at their saved positions. But NLP is more than capitalizing text. Imagine you are running the text through several processing libraries.
These nice libraries will change the text, and some of them will shrink or grow its size, so the saved tag positions no longer match!
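To make the problem concrete, here is a minimal sketch of the index-based approach and how it breaks. The helpers `stripAndIndexTags` and `restoreTags` are hypothetical names written for this illustration, the tag pattern only covers simple tags without attributes, and the `replace` call is just a stand-in for a real NLP step that shortens the text.

```javascript
// Hypothetical helper: strip tags and remember where each one was.
function stripAndIndexTags(text) {
  const tags = [];
  let removed = 0;
  const stripped = text.replace(/<\/?[a-z]+>/gi, (tag, offset) => {
    // Position of the tag in the stripped text.
    tags.push({ tag, index: offset - removed });
    removed += tag.length;
    return "";
  });
  return { stripped, tags };
}

// Hypothetical helper: put the tags back at their recorded positions.
function restoreTags(text, tags) {
  let out = text;
  // Insert from the end so that earlier indexes stay valid.
  for (let i = tags.length - 1; i >= 0; i--) {
    const { tag, index } = tags[i];
    out = out.slice(0, index) + tag + out.slice(index);
  }
  return out;
}

const input = "Hello world <tag>blablablabla hellooooo</tag> and so on.";
const { stripped, tags } = stripAndIndexTags(input);

// Stand-in for an NLP step that shortens the text by four characters.
const processed = stripped.replace("hellooooo", "hello");

const restored = restoreTags(processed, tags);
console.log(restored);
// → Hello world <tag>blablablabla hello and</tag> so on.
```

The closing tag lands four characters too late because its saved index still refers to the text as it was before processing.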
NLP-Escape is a generic solution that will make your life easier :)
NLP-Escape simply maps each tag to a unique code built from the null character (written \0 in JavaScript) and replaces the tag with that code.
My first version encodes text by replacing tags with a succession of \0 (this might change in future versions).
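The idea can be sketched as follows. This is not the nlp-escape API or its actual encoding, which the text above does not specify; `escapeTags`, `recoverTags`, and the rule "the n-th distinct tag group becomes n null characters" are assumptions made for this illustration. Adjacent tags are treated as a single group so that two codes never touch each other, and only simple tags without attributes are matched.

```javascript
const NUL = "\0";
// Assumed tag shape: <name> or </name>, no attributes.
// A run of adjacent tags is matched as one group.
const TAG_GROUP = /(?:<\/?[a-z]+>)+/gi;

// Hypothetical helper: replace each tag group with a run of null characters.
function escapeTags(text) {
  if (text.includes(NUL)) {
    throw new Error("input already contains a null character");
  }
  const table = []; // table[n - 1] is the group encoded as n null characters
  const escaped = text.replace(TAG_GROUP, (group) => {
    let n = table.indexOf(group) + 1;
    if (n === 0) {
      table.push(group);
      n = table.length;
    }
    return NUL.repeat(n);
  });
  return { escaped, table };
}

// Hypothetical helper: map each run of null characters back to its tag group.
// Runs with no table entry are left untouched.
function recoverTags(text, table) {
  return text.replace(/\0+/g, (run) => table[run.length - 1] ?? run);
}

const input = "Hello <tag>hellooooo</tag> world";
const { escaped, table } = escapeTags(input);

// Stand-in for an NLP step that changes the length of the text.
const processed = escaped.replace("hellooooo", "hello");

console.log(recoverTags(processed, table));
// → Hello <tag>hello</tag> world
```

Because the codes travel inside the text, recovery no longer depends on character positions. It does depend on the processing step leaving every null character in place.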
As you have understood, this assumes there is no \0 already in the initial text, and it comes with some obvious costs:
- Replacing tags with \0 could shrink or grow the text to be processed.
- \0 might confuse the NLP libraries as well (just like the tags themselves). I think this is unlikely, since the null character is rarely given any special treatment, but YOU MUST DO TESTS TO VALIDATE THIS.
- A text with many tags, such as <a><b><c>...<i>...<n>, would grow considerably.
FAQs
The npm package nlp-escape receives a total of 0 weekly downloads. As such, nlp-escape popularity was classified as not popular.
We found that nlp-escape demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.