TakeBlipSentimentAnalysis
Data & Analytics Research
Sentiment analysis is the process of detecting the sentiment of a sentence, which can be negative, positive, or neutral.
This implementation solves the task with an LSTM model built on the PyTorch framework, using Gensim FastText embeddings as input.
Training requires a CSV file with the labeled dataset and an embedding file. Prediction requires the embedding model, the trained model, and the label vocabulary (an output of training).
The implementation can predict the sentiment of a single sentence or of a batch of sentences (supplied as a file or as a list of dictionaries).
The LSTM architecture used in this implementation has 4 layers:
Embedding layer: holds the embedding representation of each word.
LSTM layer: receives the embedding representation of each word in a sentence and generates, for each word, an output of a predefined size.
Linear output layer: receives the hidden output of the LSTM for the last word and applies a linear function to produce a vector whose size is the number of possible labels.
Softmax layer: receives the output of the linear layer and applies the softmax operation to obtain the probability of each label.
For the bidirectional LSTM, the linear output layer receives the hidden outputs from the first and the last word.
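The last two layers can be illustrated without any framework. The sketch below, in plain Python, applies a linear map to a hidden vector and then softmax to get one probability per label. The label order, the toy weights, the hidden vector, and the helper names `linear` and `softmax` are assumptions made for illustration and are not taken from the package.

```python
import math

LABELS = ["Negative", "Neutral", "Positive"]  # illustrative label order


def linear(hidden, weights, bias):
    """Return one raw score per label: weights @ hidden + bias."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical hidden output for the last word (size 2) and toy parameters.
hidden = [0.5, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -2.0]]
bias = [0.0, 0.0, 0.0]

scores = linear(hidden, weights, bias)
probs = softmax(scores)
predicted = LABELS[probs.index(max(probs))]
```

In the bidirectional case described above, `hidden` would instead combine the hidden outputs of the first and the last word before the linear step.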
To train your own Sentiment Analysis model you will need a csv file with the following structure:
Message                        Sentiment
achei pessimo o atendimento    Negative
otimo trabalho                 Positive
bom dia                        Neutral
...                            ...
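As a minimal sketch, a file with this structure can be produced with Python's standard `csv` module. It assumes the `|` separator and the `Message` and `Sentiment` column names used in the training example; the in-memory buffer stands in for a real file on disk.

```python
import csv
import io

SEPARATOR = "|"  # assumed to match the separator passed to the trainer

rows = [
    {"Message": "achei pessimo o atendimento", "Sentiment": "Negative"},
    {"Message": "otimo trabalho", "Sentiment": "Positive"},
    {"Message": "bom dia", "Sentiment": "Neutral"},
]

# Write to an in-memory buffer; for a real file use
# open(path, "w", encoding="utf-8", newline="") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Message", "Sentiment"],
                        delimiter=SEPARATOR, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the data back to confirm that both columns survive the round trip.
buffer.seek(0)
loaded = list(csv.DictReader(buffer, delimiter=SEPARATOR))
```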
A few steps should be followed to train the model.
An example with the steps
import torch
import os
import pickle
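# vocab is used in the steps below; it is assumed to be importable
# from the same package as model and utils
from TakeSentimentAnalysis import model, utils, vocab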
from TakeSentimentAnalysis.train import LSTMTrainer
File variables
input_path = '*.csv'
sentence_column = 'Message'
label_column = 'Sentiment'
encoding = 'utf-8'
separator = '|'
use_pre_processing = True
save_dir = 'path_to_save_folder'
wordembed_path = '*.kv'
The file variables are:
Validation variables
val = True
val_path = '*.csv'
val_period = 1
Model variables
word_dim = 300
lstm_dim = 300
lstm_layers = 1
dropout_prob = 0.05
bidirectional = False
epochs = 5
batch_size = 32
shuffle = False
learning_rate = 0.001
learning_rate_decay = 0.1
max_patience = 2
max_decay_num = 2
patience_threshold = 0.98
Generate the sentence vocabulary. This step is necessary to assign an index to each word in the sentences (in the training and validation datasets) so that information can be retrieved after the PyTorch operations.
pad_string = '<pad>'
unk_string = '<unk>'
sentence_vocab = vocab.create_vocabulary(
    input_path=input_path,
    column_name=sentence_column,
    pad_string=pad_string,
    unk_string=unk_string,
    encoding=encoding,
    separator=separator,
    use_pre_processing=use_pre_processing)
if val:
    sentences = vocab.read_sentences(
        path=val_path,
        column=sentence_column,
        encoding=encoding,
        separator=separator,
        use_pre_processing=use_pre_processing)
    vocab.populate_vocab(sentences, sentence_vocab)
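To make the idea of a word index concrete, here is a plain-Python sketch of such a vocabulary. It is not the package's `vocab` implementation: `build_vocab` and `encode` are hypothetical helpers, and whitespace tokenization is an assumption.

```python
def build_vocab(sentences, pad_string="<pad>", unk_string="<unk>"):
    """Map each distinct token to an integer, reserving 0 and 1 for pad/unk."""
    word_to_index = {pad_string: 0, unk_string: 1}
    for sentence in sentences:
        for token in sentence.split():
            # New tokens get the next free index; known tokens are unchanged.
            word_to_index.setdefault(token, len(word_to_index))
    return word_to_index


def encode(sentence, word_to_index, unk_string="<unk>"):
    """Replace each token with its index, using the unk index for unseen words."""
    unk = word_to_index[unk_string]
    return [word_to_index.get(token, unk) for token in sentence.split()]


train_sentences = ["otimo trabalho", "bom dia"]
word_to_index = build_vocab(train_sentences)
# The reverse map recovers words from indices after the tensor operations.
index_to_word = {index: word for word, index in word_to_index.items()}
ids = encode("bom trabalho hoje", word_to_index)
```

Here "hoje" was never seen while building the vocabulary, so it is encoded with the index of the unknown token.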
Generate the label vocabulary, which assigns an index to each label. This object is required for prediction, so it must be saved.
label_vocab = vocab.create_vocabulary(
    input_path=input_path,
    column_name=label_column,
    pad_string=pad_string,
    unk_string=unk_string,
    encoding=encoding,
    separator=separator,
    is_label=True)
vocab_label_path = os.path.join(save_dir, 'vocab-label.pkl')
# Use a context manager so the file is closed after the dump
with open(vocab_label_path, 'wb') as label_file:
    pickle.dump(label_vocab, label_file)
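The reason for saving the label vocabulary is that prediction yields label indices, which must be mapped back to label names. The sketch below shows the round trip with a plain dictionary standing in for the package's vocabulary object (an assumption) and a temporary directory standing in for `save_dir`.

```python
import os
import pickle
import tempfile

labels = ["Negative", "Positive", "Neutral", "Positive"]

# Assign an index to each distinct label in order of first appearance.
label_to_index = {}
for label in labels:
    label_to_index.setdefault(label, len(label_to_index))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab-label.pkl")
    with open(path, "wb") as handle:
        pickle.dump(label_to_index, handle)
    # At prediction time the same file is loaded again.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

# Reverse lookup used to turn a predicted index into a label name.
index_to_label = {index: label for label, index in restored.items()}
```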
Initialize the LSTM model.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
lstm_model = model.LSTM(
    vocab_size=len(sentence_vocab),
    word_dim=word_dim,
    n_labels=len(label_vocab),
    hidden_dim=lstm_dim,
    layers=lstm_layers,
    dropout_prob=dropout_prob,
    device=device,
    bidirectional=bidirectional
).to(device)
lstm_model.reset_parameters()
Fill the embedding layer with the representation of each word in the vocabulary.
fasttext = utils.load_fasttext_embeddings(wordembed_path, pad_string)
lstm_model.embeddings[0].weight.data = torch.from_numpy(
    fasttext[sentence_vocab.i2f.values()])
# Freeze the embedding layer so the pretrained vectors are not updated
lstm_model.embeddings[0].weight.requires_grad = False
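Conceptually, filling the embedding layer means building a matrix whose row i holds the vector of the word with index i. The plain-Python sketch below illustrates that ordering with a toy vector table. The names, the two-dimensional vectors, and the zero-vector fallback for missing words are assumptions made for illustration, not the package's behaviour.

```python
# Index-to-word map, as a sentence vocabulary would provide it (toy data).
index_to_word = {0: "<pad>", 1: "<unk>", 2: "otimo", 3: "trabalho"}

# Toy pretrained vectors; a real FastText model would supply these.
toy_vectors = {"otimo": [0.1, 0.2], "trabalho": [0.3, 0.4]}
DIM = 2


def lookup(word):
    """Return the vector for a word, or zeros if the toy table lacks it."""
    return toy_vectors.get(word, [0.0] * DIM)


# Row i of the matrix is the vector for the word whose index is i.
embedding_matrix = [lookup(index_to_word[i]) for i in range(len(index_to_word))]
```

Keeping the rows in index order is what lets the model look up a word's vector directly from the integer produced by the vocabulary.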
trainer = LSTMTrainer(
    lstm_model=lstm_model,
    epochs=epochs,
    input_vocab=sentence_vocab,
    input_path=input_path,
    label_vocab=label_vocab,
    save_dir=save_dir,
    val=val,
    val_period=val_period,
    pad_string=pad_string,
    unk_string=unk_string,
    batch_size=batch_size,
    shuffle=shuffle,
    label_column=label_column,
    encoding=encoding,
    separator=separator,
    use_pre_processing=use_pre_processing,
    learning_rate=learning_rate,
    learning_rate_decay=learning_rate_decay,
    max_patience=max_patience,
    max_decay_num=max_decay_num,
    patience_threshold=patience_threshold,
    val_path=val_path)
trainer.train()
The prediction can be made for a single sentence or for a batch of sentences.
In both cases a few steps should be followed.
import sys
import os
import torch
from TakeSentimentAnalysis import utils
from TakeSentimentAnalysis.predict import SentimentPredict
model_path = '*.pkl'
label_vocab = '*.pkl'
wordembed_path = '*.kv'  # embedding file, as in the training example
save_dir = '*.csv'
encoding = 'utf-8'
separator = '|'
sys.path.insert(0, os.path.dirname(model_path))
lstm_model = torch.load(model_path)
pad_string = '<pad>'
unk_string = '<unk>'
embedding = utils.load_fasttext_embeddings(wordembed_path, pad_string)
SentimentPredicter = SentimentPredict(
    model=lstm_model,
    label_path=label_vocab,
    embedding=embedding,
    save_dir=save_dir,
    encoding=encoding,
    separator=separator)
To predict a single sentence
SentimentPredicter.predict_line(line=sentence)
To predict in a batch, a few more variables are needed: batch_size, shuffle, and use_pre_processing.
To predict a batch using a list of dictionaries:
SentimentPredicter.predict_batch(
    filepath='',
    sentence_column='',
    pad_string=pad_string,
    unk_string=unk_string,
    batch_size=batch_size,
    shuffle=shuffle,
    use_pre_processing=use_pre_processing,
    sentences=[{'id': 1, 'sentence': sentence_1},
               {'id': 2, 'sentence': sentence_2}])
To predict a batch using a csv file:
SentimentPredicter.predict_batch(
    filepath=input_path,
    sentence_column=sentence_column,
    pad_string=pad_string,
    unk_string=unk_string,
    batch_size=batch_size,
    shuffle=shuffle,
    use_pre_processing=use_pre_processing)
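To illustrate what `batch_size` controls, the sketch below splits a list of sentence dictionaries into consecutive groups. `make_batches` is a hypothetical helper written for this example; how the package batches its input internally is not shown in this document.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[start:start + batch_size]
            for start in range(0, len(items), batch_size)]


# Same dictionary shape as the `sentences` argument above, with toy text.
sentences = [{"id": i, "sentence": f"sentence {i}"} for i in range(1, 6)]
batches = make_batches(sentences, batch_size=2)
batch_ids = [[item["id"] for item in batch] for batch in batches]
```

With five sentences and a batch size of 2, the last batch holds a single item.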