github.com/makcedward/nlpaug


v0.0.5

nlpaug

This Python library helps you augment NLP data for your machine learning projects. Visit this introduction to learn about data augmentation in NLP. An Augmenter is the basic element of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters together.

Starter Guides

Augmenter

| Target | Augmenter | Action | Description |
|---|---|---|---|
| Character | RandomAug | Insert | Insert character randomly |
| Character | RandomAug | Substitute | Substitute character randomly |
| Character | RandomAug | Swap | Swap character randomly |
| Character | RandomAug | Delete | Delete character randomly |
| Character | OcrAug | Substitute | Simulate OCR engine error |
| Character | QwertyAug | Substitute | Simulate keyboard distance error |
| Word | RandomWordAug | Swap | Swap word randomly |
| Word | RandomWordAug | Delete | Delete word randomly |
| Word | WordNetAug | Substitute | Substitute word according to WordNet's synonyms |
| Word | Word2vecAug | Insert | Insert word randomly from word2vec dictionary |
| Word | Word2vecAug | Substitute | Substitute word based on word2vec embeddings |
| Word | GloVeAug | Insert | Insert word randomly from GloVe dictionary |
| Word | GloVeAug | Substitute | Substitute word based on GloVe embeddings |
| Word | FasttextAug | Insert | Insert word randomly from fasttext dictionary |
| Word | FasttextAug | Substitute | Substitute word based on fasttext embeddings |
| Word | BertAug | Insert | Insert word by feeding surrounding words to BERT language model |
| Word | BertAug | Substitute | Substitute word by feeding surrounding words to BERT language model |
| Spectrogram | FrequencyMaskingAug | Substitute | Set block of values to zero along the frequency dimension |
| Spectrogram | TimeMaskingAug | Substitute | Set block of values to zero along the time dimension |
| Audio | NoiseAug | Substitute | Inject noise |
| Audio | PitchAug | Substitute | Adjust pitch |
| Audio | ShiftAug | Substitute | Shift time dimension forward/backward |
| Audio | SpeedAug | Substitute | Adjust speed of audio |
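
To make the four character-level actions in the table concrete, here is a minimal standard-library sketch of what "insert", "substitute", "swap" and "delete" can mean for a single string. The function `random_char_augment` and its parameters are hypothetical names chosen for illustration; this is not nlpaug's implementation or API, and details such as how many characters are edited per call are assumptions.

```python
import random
import string

ACTIONS = ("insert", "substitute", "swap", "delete")


def random_char_augment(text, action, rng=None):
    """Apply one random character-level edit to `text`.

    Hypothetical helper for illustration only; not part of nlpaug.
    """
    if action not in ACTIONS:
        raise ValueError("unknown action: %r" % (action,))
    rng = rng or random.Random()
    chars = list(text)
    if not chars:
        return text  # nothing to edit in an empty string

    if action == "insert":
        # Insert a random lowercase letter at any position, including the end.
        pos = rng.randrange(len(chars) + 1)
        chars.insert(pos, rng.choice(string.ascii_lowercase))
    elif action == "substitute":
        # Replace one character (the new letter may equal the old one).
        pos = rng.randrange(len(chars))
        chars[pos] = rng.choice(string.ascii_lowercase)
    elif action == "swap":
        # Swap two adjacent characters; needs at least two characters.
        if len(chars) < 2:
            return text
        pos = rng.randrange(len(chars) - 1)
        chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    else:  # "delete"
        pos = rng.randrange(len(chars))
        del chars[pos]
    return "".join(chars)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same edits
    for action in ACTIONS:
        print(action, "->", random_char_augment("augmentation", action, rng))
```

Passing an explicit `random.Random` instance keeps the edits reproducible, which is useful when comparing models trained on augmented data.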

Flow

| Pipeline | Description |
|---|---|
| Sequential | Apply list of augmentation functions sequentially |
| Sometimes | Apply some augmentation functions randomly |
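
The two pipeline types can be pictured with a small standard-library sketch. The class names mirror the table, but everything else here is an assumption made for illustration: the constructor arguments, the `augment` method, the probability `p`, and the use of plain callables as augmenters are not taken from nlpaug's actual API. In particular, "randomly" is interpreted as applying each augmenter independently with probability `p`.

```python
import random


class Sequential:
    """Apply every augmenter in order, feeding each output to the next."""

    def __init__(self, augmenters):
        self.augmenters = list(augmenters)

    def augment(self, data):
        for aug in self.augmenters:
            data = aug(data)
        return data


class Sometimes:
    """Apply each augmenter independently with probability `p` (assumed)."""

    def __init__(self, augmenters, p=0.5, rng=None):
        self.augmenters = list(augmenters)
        self.p = p
        self.rng = rng or random.Random()

    def augment(self, data):
        for aug in self.augmenters:
            # rng.random() is in [0, 1), so p=1.0 always applies, p=0.0 never.
            if self.rng.random() < self.p:
                data = aug(data)
        return data


if __name__ == "__main__":
    steps = [str.upper, lambda s: s + "!"]
    print(Sequential(steps).augment("hello"))
    print(Sometimes(steps, p=0.5, rng=random.Random(1)).augment("hello"))
```

Because both classes expose the same `augment` method, a flow can itself be passed as a step of another flow by handing over its bound method, for example `Sequential([Sometimes(steps).augment, str.strip])`.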

Installation

The library supports Python 3.5+ on Linux and Windows.

To install the library:

pip install nlpaug

Or install the latest version (including beta features) directly from GitHub:

pip install git+https://github.com/makcedward/nlpaug.git

Download word2vec or GloVe files if you use Word2vecAug, GloVeAug or FasttextAug.

Recent Changes

0.0.5 Jul 2, 2019:

0.0.4 Jun 7, 2019:

  • Added stopwords feature to character and word augmenters.
  • Added character swap augmenter.
  • Added word swap augmenter.
  • Added validation rule for #1.
  • Fixed BERT reverse tokenization for #2.

0.0.3 May 23, 2019:

  • Added Speed, Noise, Shift and Pitch augmenters for Audio

0.0.2 Apr 30, 2019:

  • Added Frequency Masking and Time Masking for Speech Recognition (Spectrogram).
  • Added librosa library dependency for converting wav to spectrogram.

0.0.1 Mar 20, 2019: Project initialization

Test

Word2vec and GloVe models are used for word insertion and substitution. These model files are required to run the test cases. Add a ".env" file to the root directory with the following content:
    - MODEL_DIR={MODEL FILE PATH}
The model folder should be structured as follows:
    -- root directory
        - glove.6B.50d.txt
        - GoogleNews-vectors-negative300.bin
        - wiki-news-300d-1M.vec
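
Before running the tests, it can help to confirm that the ".env" file and the model folder match the layout above. The following is a minimal standard-library sketch, not the way the project's test suite loads its configuration. The helper names `read_model_dir` and `missing_model_files` are hypothetical, and the parser assumes a plain `KEY=VALUE` line format.

```python
import os

# File names taken from the folder structure listed above.
EXPECTED_FILES = (
    "glove.6B.50d.txt",
    "GoogleNews-vectors-negative300.bin",
    "wiki-news-300d-1M.vec",
)


def read_model_dir(env_path=".env"):
    """Return the MODEL_DIR value from a simple KEY=VALUE .env file."""
    with open(env_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            if key.strip() == "MODEL_DIR":
                return value.strip()
    raise KeyError("MODEL_DIR not found in %s" % env_path)


def missing_model_files(model_dir):
    """List the expected model files that are absent from `model_dir`."""
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


if __name__ == "__main__":
    model_dir = read_model_dir()
    missing = missing_model_files(model_dir)
    if missing:
        print("Missing model files in %s: %s" % (model_dir, ", ".join(missing)))
    else:
        print("All model files found in %s" % model_dir)
```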

Research Reference

| Augmenter | Research |
|---|---|
| RandomAug, QwertyAug | D. Pruthi, B. Dhingra and Z. C. Lipton. Combating Adversarial Misspellings with Robust Word Recognition. 2019 |
| WordNetAug | X. Zhang, J. Zhao and Y. LeCun. Character-level Convolutional Networks for Text Classification. 2015 |
| WordNetAug | S. Kobayashi. C. Coulombe. Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs. 2018 |
| Word2vecAug, GloVeAug, FasttextAug | W. Y. Wang and D. Yang. That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets. 2015 |
| BertAug | S. Kobayashi. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relation. 2018 |
| FrequencyMaskingAug, TimeMaskingAug | D. S. Park, W. Chan, Y. Zhang, C. C. Chiu, B. Zoph, E. D. Cubuk and Q. V. Le. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. 2019 |


Package last updated on 03 Jul 2019
