
This module helps you extract key terms and topics from a corpus using a comparative approach.
from compExtract import ComparativeExtraction
import pandas as pd
import numpy as np
PATH = "/Users/xiaoma/Desktop/gitrepo/associate-term-search/data/switch_reviews.csv"
data = pd.read_csv(PATH)
label = [x <= 3 for x in data['stars']]
data
| | stars | titles | reviews | dates |
---|---|---|---|---|
0 | 5.0 | Worth It\n | Definitely worth the money!\n | September 21, 2019 |
1 | 2.0 | Nintendo Swich gris joy con\n | Con este producto no he sentido mucha satisfac... | September 20, 2019 |
2 | 5.0 | My kid wont put it down\n | Couldnt of been happier, came early. I was th... | September 20, 2019 |
3 | 3.0 | Happy\n | Happy\n | September 20, 2019 |
4 | 5.0 | Great\n | Great product\n | September 19, 2019 |
... | ... | ... | ... | ... |
4995 | 1.0 | One Star\n | it is no good, it suck, no work, plz hlp amazon\n | December 12, 2017 |
4996 | 5.0 | A must have gaming system\n | The Nintendo Switch is a versatile hybrid game... | December 12, 2017 |
4997 | 5.0 | Switch\n | This purchase save me from looking for one.\n | December 11, 2017 |
4998 | 5.0 | Five Stars\n | Best babysitter ever!\n | December 11, 2017 |
4999 | 5.0 | Five Stars\n | Its a great game console.\n | December 11, 2017 |
5000 rows × 4 columns
data.columns
Index(['stars', 'titles', 'reviews', 'dates'], dtype='object')
Here we use online Amazon reviews of the Nintendo Switch to illustrate how the module is used.
The module requires a corpus and a set of binary labels as inputs. How the labels are created depends on the question we are trying to answer. The set of labels must be the same length as the corpus.
Here, let's assume that we want to know why people dislike this product and find the relevant keywords. To answer this question, we create a binary label indicating whether a reviewer gave 3 stars or fewer.
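The labeling rule described above can be sketched on a toy example. The helper names `make_low_rating_labels` and `check_inputs` are hypothetical and not part of compExtract; they only show a binary label built from star ratings and a check that corpus and labels line up one-to-one.

```python
from typing import Iterable, List, Sequence


def make_low_rating_labels(stars: Iterable[float], threshold: float = 3.0) -> List[bool]:
    """Return True for each rating at or below the threshold (hypothetical helper)."""
    return [rating <= threshold for rating in stars]


def check_inputs(corpus: Sequence[str], labels: Sequence[bool]) -> None:
    """Raise ValueError if labels are not binary or do not match the corpus length."""
    if len(corpus) != len(labels):
        raise ValueError(
            f"corpus has {len(corpus)} documents but labels has {len(labels)} entries"
        )
    if not all(isinstance(label, bool) for label in labels):
        raise ValueError("labels must be binary (True/False)")


# Toy data standing in for data['reviews'] and data['stars'].
toy_reviews = ["Worth it", "Stopped working after a month", "Happy", "Great product"]
toy_stars = [5.0, 1.0, 3.0, 5.0]

toy_labels = make_low_rating_labels(toy_stars)
check_inputs(toy_reviews, toy_labels)
print(toy_labels)  # [False, True, True, False]
```

Changing the threshold changes the question being asked: a threshold of 1 would contrast only one-star reviews against everything else.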
ce = ComparativeExtraction(corpus = data['reviews'], labels = label)
ce.get_distinguish_terms(ngram_range = (1,3),top_n = 10)
<compExtract.ComparativeExtraction at 0x7ff96f84b588>
# Get the keywords that are mentioned significantly more often in reviews with 3 stars or fewer
ce.increased_terms_df
| | feature | diff | pos_prop | pos_count | neg_prop | neg_count |
---|---|---|---|---|---|---|
0 | work | 0.194976 | 0.278426 | 191 | 0.083449 | 360 |
1 | switch | 0.176764 | 0.351312 | 241 | 0.174548 | 753 |
2 | buy | 0.174520 | 0.297376 | 204 | 0.122856 | 530 |
3 | month | 0.143129 | 0.158892 | 109 | 0.015763 | 68 |
4 | nintendo | 0.134316 | 0.290087 | 199 | 0.155772 | 672 |
5 | charge | 0.122855 | 0.141399 | 97 | 0.018544 | 80 |
6 | use | 0.118448 | 0.206997 | 142 | 0.088549 | 382 |
7 | new | 0.113989 | 0.160350 | 110 | 0.046361 | 200 |
8 | would | 0.106540 | 0.164723 | 113 | 0.058183 | 251 |
9 | get | 0.104055 | 0.231778 | 159 | 0.127724 | 551 |
# Get the keywords that are mentioned significantly less often in reviews with 3 stars or fewer
ce.decreased_terms_df
| | feature | diff | pos_prop | pos_count | neg_prop | neg_count |
---|---|---|---|---|---|---|
0 | love | -0.216997 | 0.080175 | 55 | 0.297172 | 1282 |
1 | great | -0.122247 | 0.099125 | 68 | 0.221372 | 955 |
2 | fun | -0.048160 | 0.046647 | 32 | 0.094808 | 409 |
3 | best | -0.042638 | 0.030612 | 21 | 0.073250 | 316 |
4 | amaze | -0.038011 | 0.010204 | 7 | 0.048215 | 208 |
5 | awesome | -0.035827 | 0.007289 | 5 | 0.043115 | 186 |
6 | son love | -0.035564 | 0.002915 | 2 | 0.038479 | 166 |
7 | perfect | -0.032515 | 0.008746 | 6 | 0.041261 | 178 |
8 | easy | -0.026282 | 0.023324 | 16 | 0.049606 | 214 |
9 | kid love | -0.024370 | 0.004373 | 3 | 0.028744 | 124 |
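Reading the tables above, diff appears to equal pos_prop minus neg_prop. For "work", 0.278426 - 0.083449 = 0.194977, which matches the reported 0.194976 up to rounding. The "pos" columns appear to describe the documents whose label is True (here, reviews with 3 stars or fewer). The sketch below reproduces only that column arithmetic on toy data. The function `term_proportions` is hypothetical, it uses naive whitespace tokenization rather than compExtract's own preprocessing, and it does not perform any significance test.

```python
from typing import Dict, Sequence


def term_proportions(docs: Sequence[str], labels: Sequence[bool], term: str) -> Dict[str, float]:
    """Count and share of documents containing `term` in each label group."""

    def contains(doc: str) -> bool:
        # Naive lowercase whitespace tokenization (an assumption for this sketch).
        return term in doc.lower().split()

    pos_docs = [doc for doc, flag in zip(docs, labels) if flag]
    neg_docs = [doc for doc, flag in zip(docs, labels) if not flag]

    pos_count = sum(contains(doc) for doc in pos_docs)
    neg_count = sum(contains(doc) for doc in neg_docs)

    # Guard against an empty group to avoid dividing by zero.
    pos_prop = pos_count / len(pos_docs) if pos_docs else 0.0
    neg_prop = neg_count / len(neg_docs) if neg_docs else 0.0

    return {
        "pos_count": pos_count,
        "pos_prop": pos_prop,
        "neg_count": neg_count,
        "neg_prop": neg_prop,
        "diff": pos_prop - neg_prop,
    }


toy_docs = ["it does not work", "will not charge", "love it", "work great love it", "great fun"]
toy_labels = [True, True, False, False, False]

stats = term_proportions(toy_docs, toy_labels, "work")
print(stats["pos_count"], stats["neg_count"])  # 1 1
print(round(stats["diff"], 4))                 # 0.1667
```

A positive diff means the term is relatively more common in the True-labeled group, which is why "work" and "charge" surface for low-star reviews while "love" and "great" have negative values.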
If we need more context on a given word, or more interpretable topics, we can either output the reviews that contain the term or rerun the extraction with longer n-grams.
Suppose we want to know more about the significant term "work". We can directly output all the reviews containing the term.
The output object "ce" contains a binary (one-hot encoded) document-term matrix covering all the terms found in the corpus. We can use it to find the reviews that correspond to each term.
# The binary_dtm provides a convenient way to extract reviews with specific terms
print(ce.binary_dtm[['work']])
work
0 0
1 0
2 0
3 0
4 0
... ...
4995 1
4996 0
4997 0
4998 0
4999 0
[5000 rows x 1 columns]
reviews_contain_term_work = data['reviews'][[x == 1 for x in ce.binary_dtm['work']]]
len(reviews_contain_term_work)
551
for x in pd.Series(reviews_contain_term_work).sample(1):
    print(x)
I bought this as a Christmas present for my son. After about a month and half of using it. The switch stopped working. It wont charge. The product is an expensive piece of junk.
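To make the idea of a binary document-term matrix concrete, here is a minimal stdlib-only sketch on toy data. `build_binary_dtm` and `docs_with_term` are hypothetical helpers, and the whitespace tokenization is an assumption; compExtract's own `binary_dtm` is a DataFrame built with its own preprocessing.

```python
from typing import Dict, List, Sequence


def build_binary_dtm(docs: Sequence[str]) -> Dict[str, List[int]]:
    """Map each term to a 0/1 column with one entry per document."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    vocabulary = sorted(set().union(*tokenized))
    return {term: [int(term in tokens) for tokens in tokenized] for term in vocabulary}


def docs_with_term(docs: Sequence[str], dtm: Dict[str, List[int]], term: str) -> List[str]:
    """Return the documents whose entry in the term's column is 1."""
    return [doc for doc, flag in zip(docs, dtm[term]) if flag == 1]


toy_docs = [
    "it wont charge",
    "the switch stopped working",
    "does not work",
    "work fine",
]

dtm = build_binary_dtm(toy_docs)
print(dtm["work"])                            # [0, 0, 1, 1]
print(docs_with_term(toy_docs, dtm, "work"))  # ['does not work', 'work fine']
```

Note that this naive tokenizer treats "working" and "work" as different terms. The real output above lists lemma-like features such as "amaze", which suggests the library normalizes word forms before building its matrix.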
ce_ngram = ComparativeExtraction(corpus = data['reviews'], labels = label).get_distinguish_terms(ngram_range=(2,4), top_n=10)
/Users/xiaoma/envs/compExtract/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:489: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
warnings.warn("The parameter 'token_pattern' will not be used"
<compExtract.ComparativeExtraction at 0x7ff955f23cf8>
ce_ngram.increased_terms_df
| | feature | diff | pos_prop | pos_count | neg_prop | neg_count |
---|---|---|---|---|---|---|
0 | joy con | 0.040857 | 0.056851 | 39 | 0.015994 | 69 |
1 | brand new | 0.020511 | 0.027697 | 19 | 0.007186 | 31 |
2 | nintendo switch | 0.019638 | 0.074344 | 51 | 0.054706 | 236 |
3 | buy switch | 0.018888 | 0.027697 | 19 | 0.008809 | 38 |
4 | play game | 0.014092 | 0.039359 | 27 | 0.025267 | 109 |
5 | game play | 0.009812 | 0.021866 | 15 | 0.012054 | 52 |
6 | year old | 0.005243 | 0.023324 | 16 | 0.018081 | 78 |
7 | christmas gift | 0.003682 | 0.014577 | 10 | 0.010895 | 47 |
8 | battery life | 0.001833 | 0.024781 | 17 | 0.022949 | 99 |
9 | wii u | 0.000504 | 0.016035 | 11 | 0.015531 | 67 |
ce_ngram.decreased_terms_df
| | feature | diff | pos_prop | pos_count | neg_prop | neg_count |
---|---|---|---|---|---|---|
0 | son love | -0.035564 | 0.002915 | 2 | 0.038479 | 166 |
1 | kid love | -0.024370 | 0.004373 | 3 | 0.028744 | 124 |
2 | great game | -0.018442 | 0.007289 | 5 | 0.025730 | 111 |
3 | great product | -0.014171 | 0.004373 | 3 | 0.018544 | 80 |
4 | great console | -0.013641 | 0.005831 | 4 | 0.019471 | 84 |
5 | best console | -0.013609 | 0.001458 | 1 | 0.015067 | 65 |
6 | highly recommend | -0.012615 | 0.002915 | 2 | 0.015531 | 67 |
7 | absolutely love | -0.011987 | 0.001458 | 1 | 0.013445 | 58 |
8 | game system | -0.011746 | 0.021866 | 15 | 0.033611 | 145 |
9 | love switch | -0.011452 | 0.013120 | 9 | 0.024571 | 106 |
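The n-gram run above passes ngram_range=(2,4), and its features are space-joined word sequences such as "joy con". As a rough illustration of what such a range produces, the sketch below enumerates word n-grams from an already tokenized review. `word_ngrams` is a hypothetical helper written for this example; it assumes the range is inclusive at both ends and does not reproduce compExtract's tokenization or filtering.

```python
from typing import List, Tuple


def word_ngrams(tokens: List[str], ngram_range: Tuple[int, int]) -> List[str]:
    """Return all contiguous word n-grams with min_n <= n <= max_n, joined by spaces."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the document.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


toy_tokens = ["joy", "con", "stop", "work"]
print(word_ngrams(toy_tokens, (2, 4)))
# ['joy con', 'con stop', 'stop work', 'joy con stop', 'con stop work', 'joy con stop work']
```

Longer n-grams are more interpretable but rarer, which is consistent with the much smaller proportions in the bigram tables compared with the single-word results.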