
compExtract
This module helps you extract key terms and topics from a corpus using a comparative approach.
from compExtract import ComparativeExtraction
import pandas as pd
import numpy as np
PATH = "/Users/xiaoma/Desktop/gitrepo/associate-term-search/data/switch_reviews.csv"
data = pd.read_csv(PATH)
label = [x <= 3 for x in data['stars']]
data
| | stars | titles | reviews | dates |
|---|---|---|---|---|
| 0 | 5.0 | Worth It\n | Definitely worth the money!\n | September 21, 2019 |
| 1 | 2.0 | Nintendo Swich gris joy con\n | Con este producto no he sentido mucha satisfac... | September 20, 2019 |
| 2 | 5.0 | My kid wont put it down\n | Couldnt of been happier, came early. I was th... | September 20, 2019 |
| 3 | 3.0 | Happy\n | Happy\n | September 20, 2019 |
| 4 | 5.0 | Great\n | Great product\n | September 19, 2019 |
| ... | ... | ... | ... | ... |
| 4995 | 1.0 | One Star\n | it is no good, it suck, no work, plz hlp amazon\n | December 12, 2017 |
| 4996 | 5.0 | A must have gaming system\n | The Nintendo Switch is a versatile hybrid game... | December 12, 2017 |
| 4997 | 5.0 | Switch\n | This purchase save me from looking for one.\n | December 11, 2017 |
| 4998 | 5.0 | Five Stars\n | Best babysitter ever!\n | December 11, 2017 |
| 4999 | 5.0 | Five Stars\n | Its a great game console.\n | December 11, 2017 |
5000 rows × 4 columns
data.columns
Index(['stars', 'titles', 'reviews', 'dates'], dtype='object')
Here we use online Amazon reviews of the Nintendo Switch to illustrate how the module is used.
The module requires a corpus and a set of binary labels as inputs. Create the labels according to the question you are trying to answer. There must be exactly one label per document in the corpus.
Here, let's assume that we want to know why people dislike this product and find the relevant keywords. To answer this question, we create a binary label indicating whether a reviewer gave 3 stars or fewer.
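As a minimal sketch of how the question drives the labels, the snippet below builds two alternative label lists from the same ratings. The helper name make_labels and the toy ratings are hypothetical and are not part of compExtract.

```python
def make_labels(values, predicate):
    """Map each value to a boolean label, one label per document."""
    return [bool(predicate(v)) for v in values]


# Toy ratings standing in for data['stars'] (hypothetical values).
stars = [5.0, 2.0, 5.0, 3.0, 1.0]

# Question 1: why do people dislike the product? True = 3 stars or fewer.
dislike = make_labels(stars, lambda s: s <= 3)

# Question 2: what do the most satisfied reviewers mention? True = 5 stars.
top_rated = make_labels(stars, lambda s: s == 5)

# The labels must line up one-to-one with the documents.
assert len(dislike) == len(stars)
assert len(top_rated) == len(stars)
```

Either list could then be passed as the labels argument, depending on which group of reviews you want to characterize.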
ce = ComparativeExtraction(corpus = data['reviews'], labels = label)
ce.get_distinguish_terms(ngram_range = (1,3),top_n = 10)
<compExtract.ComparativeExtraction at 0x7ff96f84b588>
# Get the keywords that are mentioned significantly more often in reviews with 3 stars or fewer
ce.increased_terms_df
| | feature | diff | pos_prop | pos_count | neg_prop | neg_count |
|---|---|---|---|---|---|---|
| 0 | work | 0.194976 | 0.278426 | 191 | 0.083449 | 360 |
| 1 | switch | 0.176764 | 0.351312 | 241 | 0.174548 | 753 |
| 2 | buy | 0.174520 | 0.297376 | 204 | 0.122856 | 530 |
| 3 | month | 0.143129 | 0.158892 | 109 | 0.015763 | 68 |
| 4 | nintendo | 0.134316 | 0.290087 | 199 | 0.155772 | 672 |
| 5 | charge | 0.122855 | 0.141399 | 97 | 0.018544 | 80 |
| 6 | use | 0.118448 | 0.206997 | 142 | 0.088549 | 382 |
| 7 | new | 0.113989 | 0.160350 | 110 | 0.046361 | 200 |
| 8 | would | 0.106540 | 0.164723 | 113 | 0.058183 | 251 |
| 9 | get | 0.104055 | 0.231778 | 159 | 0.127724 | 551 |
# Get the keywords that are mentioned significantly less often in reviews with 3 stars or fewer
ce.decreased_terms_df
| | feature | diff | pos_prop | pos_count | neg_prop | neg_count |
|---|---|---|---|---|---|---|
| 0 | love | -0.216997 | 0.080175 | 55 | 0.297172 | 1282 |
| 1 | great | -0.122247 | 0.099125 | 68 | 0.221372 | 955 |
| 2 | fun | -0.048160 | 0.046647 | 32 | 0.094808 | 409 |
| 3 | best | -0.042638 | 0.030612 | 21 | 0.073250 | 316 |
| 4 | amaze | -0.038011 | 0.010204 | 7 | 0.048215 | 208 |
| 5 | awesome | -0.035827 | 0.007289 | 5 | 0.043115 | 186 |
| 6 | son love | -0.035564 | 0.002915 | 2 | 0.038479 | 166 |
| 7 | perfect | -0.032515 | 0.008746 | 6 | 0.041261 | 178 |
| 8 | easy | -0.026282 | 0.023324 | 16 | 0.049606 | 214 |
| 9 | kid love | -0.024370 | 0.004373 | 3 | 0.028744 | 124 |
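Judging from the output above, the columns appear to relate as follows: pos_prop is pos_count divided by the number of documents labeled True (about 686 here, since 191 / 0.278426 ≈ 686), neg_prop is neg_count divided by the number labeled False (about 4314), and diff is pos_prop minus neg_prop. The sketch below reproduces that arithmetic on a toy corpus. It is inferred from the tables rather than taken from the library's source, the function name proportion_diff is hypothetical, and it does not attempt the module's text preprocessing or any significance filtering.

```python
from collections import Counter


def proportion_diff(corpus, labels):
    """Return {term: (diff, pos_prop, pos_count, neg_prop, neg_count)}."""
    n_pos = sum(1 for flag in labels if flag)
    n_neg = len(labels) - n_pos
    pos_counts, neg_counts = Counter(), Counter()
    for text, flag in zip(corpus, labels):
        # Count each term at most once per document (presence, not frequency).
        terms = set(text.lower().split())
        (pos_counts if flag else neg_counts).update(terms)
    rows = {}
    for term in set(pos_counts) | set(neg_counts):
        pos_prop = pos_counts[term] / n_pos if n_pos else 0.0
        neg_prop = neg_counts[term] / n_neg if n_neg else 0.0
        rows[term] = (pos_prop - neg_prop, pos_prop, pos_counts[term],
                      neg_prop, neg_counts[term])
    return rows


# Toy data (hypothetical): True marks the group of interest.
toy_corpus = ["it stopped working", "love it", "does not work", "love this console"]
toy_labels = [True, False, True, False]
rows = proportion_diff(toy_corpus, toy_labels)

# Terms with the largest positive diff are over-represented in the True group.
ranked = sorted(rows.items(), key=lambda item: item[1][0], reverse=True)
```

A positive diff means the term appears in a larger share of the True-labeled documents, and a negative diff means it appears in a larger share of the False-labeled ones.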
If we need more context on a given word, or more interpretable topics, we can either read the reviews that contain the term or extract longer n-grams instead of single words.
Say we want to know more about the significant term "work". We can directly output all the reviews containing the term.
The fitted object "ce" contains a one-hot encoded (binary) document-term matrix, binary_dtm, that covers all the terms found in the corpus. We can use it to find the reviews that correspond to each term.
# The binary_dtm provides a convenient way to extract reviews with specific terms
print(ce.binary_dtm[['work']])
work
0 0
1 0
2 0
3 0
4 0
... ...
4995 1
4996 0
4997 0
4998 0
4999 0
[5000 rows x 1 columns]
reviews_contain_term_work = data['reviews'][[x == 1 for x in ce.binary_dtm['work']]]
len(reviews_contain_term_work)
551
# Print one randomly sampled review that contains the term
for x in reviews_contain_term_work.sample(1):
    print(x)
I bought this as a Christmas present for my son. After about a month and half of using it. The switch stopped working. It wont charge. The product is an expensive piece of junk.
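The lookup above works because each column of a binary document-term matrix holds a 1 for every document that contains the term and a 0 otherwise. A minimal, dependency-free sketch of that idea follows. The function names are hypothetical, and the whitespace tokenizer is a simplifying assumption that does not match the module's own preprocessing.

```python
def build_binary_dtm(corpus):
    """Build {term: [0/1 per document]} using simple whitespace tokens."""
    tokenized = [set(doc.lower().split()) for doc in corpus]
    vocab = sorted(set().union(*tokenized)) if tokenized else []
    return {term: [int(term in toks) for toks in tokenized] for term in vocab}


def docs_with_term(corpus, dtm, term):
    """Return the documents whose indicator for `term` is 1."""
    return [doc for doc, flag in zip(corpus, dtm[term]) if flag == 1]


# Toy reviews (hypothetical).
toy_reviews = ["The switch stopped work", "Great product", "It does not work"]
toy_dtm = build_binary_dtm(toy_reviews)
matches = docs_with_term(toy_reviews, toy_dtm, "work")
```

Because the indicator column and the corpus share the same row order, the column can be used directly as a filter over the documents.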
# Use longer n-grams (2 to 4 words) to get more interpretable phrases
ce_ngram = ComparativeExtraction(corpus = data['reviews'], labels = label).get_distinguish_terms(ngram_range=(2,4), top_n=10)
/Users/xiaoma/envs/compExtract/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:489: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
warnings.warn("The parameter 'token_pattern' will not be used"
<compExtract.ComparativeExtraction at 0x7ff955f23cf8>
ce_ngram.increased_terms_df
| | feature | diff | pos_prop | pos_count | neg_prop | neg_count |
|---|---|---|---|---|---|---|
| 0 | joy con | 0.040857 | 0.056851 | 39 | 0.015994 | 69 |
| 1 | brand new | 0.020511 | 0.027697 | 19 | 0.007186 | 31 |
| 2 | nintendo switch | 0.019638 | 0.074344 | 51 | 0.054706 | 236 |
| 3 | buy switch | 0.018888 | 0.027697 | 19 | 0.008809 | 38 |
| 4 | play game | 0.014092 | 0.039359 | 27 | 0.025267 | 109 |
| 5 | game play | 0.009812 | 0.021866 | 15 | 0.012054 | 52 |
| 6 | year old | 0.005243 | 0.023324 | 16 | 0.018081 | 78 |
| 7 | christmas gift | 0.003682 | 0.014577 | 10 | 0.010895 | 47 |
| 8 | battery life | 0.001833 | 0.024781 | 17 | 0.022949 | 99 |
| 9 | wii u | 0.000504 | 0.016035 | 11 | 0.015531 | 67 |
ce_ngram.decreased_terms_df
| | feature | diff | pos_prop | pos_count | neg_prop | neg_count |
|---|---|---|---|---|---|---|
| 0 | son love | -0.035564 | 0.002915 | 2 | 0.038479 | 166 |
| 1 | kid love | -0.024370 | 0.004373 | 3 | 0.028744 | 124 |
| 2 | great game | -0.018442 | 0.007289 | 5 | 0.025730 | 111 |
| 3 | great product | -0.014171 | 0.004373 | 3 | 0.018544 | 80 |
| 4 | great console | -0.013641 | 0.005831 | 4 | 0.019471 | 84 |
| 5 | best console | -0.013609 | 0.001458 | 1 | 0.015067 | 65 |
| 6 | highly recommend | -0.012615 | 0.002915 | 2 | 0.015531 | 67 |
| 7 | absolutely love | -0.011987 | 0.001458 | 1 | 0.013445 | 58 |
| 8 | game system | -0.011746 | 0.021866 | 15 | 0.033611 | 145 |
| 9 | love switch | -0.011452 | 0.013120 | 9 | 0.024571 | 106 |
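For readers unfamiliar with the ngram_range parameter, the sketch below shows what a range such as (2, 4) means: every run of 2, 3, or 4 consecutive tokens becomes a candidate feature. The function ngrams is hypothetical and only illustrates the concept. Features such as "kid love" and "amaze" suggest that the module also normalizes words and drops stop words before building n-grams, which this sketch does not do.

```python
def ngrams(tokens, ngram_range=(2, 4)):
    """Return all n-grams with lengths in the inclusive range (lo, hi)."""
    lo, hi = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(tokens) - n + 1)
    ]


# Toy token list (hypothetical), already lowercased and cleaned.
tokens = ["joy", "con", "stop", "work"]
bigrams_and_trigrams = ngrams(tokens, ngram_range=(2, 3))
```

Longer n-grams are rarer than single words, which is why the proportions in the n-gram tables are much smaller than those in the unigram tables.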