compExtract

Extract keywords via comparison of corpus

0.1.2

Introduction

This module helps you extract key terms and topics from a corpus using a comparative approach.

Installation

Usage

Import packages

from compExtract import ComparativeExtraction

Load sample data

import pandas as pd
import numpy as np
PATH = "/Users/xiaoma/Desktop/gitrepo/associate-term-search/data/switch_reviews.csv"
data = pd.read_csv(PATH)
label = [x <= 3 for x in data['stars']]
data
      stars  titles                         reviews                                            dates
0       5.0  Worth It\n                     Definitely worth the money!\n                      September 21, 2019
1       2.0  Nintendo Swich gris joy con\n  Con este producto no he sentido mucha satisfac...  September 20, 2019
2       5.0  My kid wont put it down\n      Couldnt of been happier, came early. I was th...   September 20, 2019
3       3.0  Happy\n                        Happy\n                                            September 20, 2019
4       5.0  Great\n                        Great product\n                                    September 19, 2019
...     ...  ...                            ...                                                ...
4995    1.0  One Star\n                     it is no good, it suck, no work, plz hlp amazon\n  December 12, 2017
4996    5.0  A must have gaming system\n    The Nintendo Switch is a versatile hybrid game...  December 12, 2017
4997    5.0  Switch\n                       This purchase save me from looking for one.\n      December 11, 2017
4998    5.0  Five Stars\n                   Best babysitter ever!\n                            December 11, 2017
4999    5.0  Five Stars\n                   Its a great game console.\n                        December 11, 2017

5000 rows × 4 columns

data.columns
Index(['stars', 'titles', 'reviews', 'dates'], dtype='object')

Here we use Amazon reviews of the Nintendo Switch to illustrate how the module is used.

The module requires a corpus and a set of binary labels as inputs. The labels should be built around the question we are trying to answer, and there must be exactly one label per document in the corpus.

Suppose we want to know why people dislike this product and to find the relevant keywords. To answer this question, we define the label as a binary variable indicating whether a reviewer gave 3 stars or fewer.
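
As a minimal sketch of this labeling step, the snippet below builds the binary labels from a short, hypothetical list of star ratings (not the real dataset) and checks that there is one label per document. The names stars and labels are illustrative only.

```python
# Hypothetical star ratings, one per review (not taken from the real dataset).
stars = [5.0, 2.0, 5.0, 3.0, 5.0, 1.0]

# True marks a review with 3 stars or fewer, the group we want to explain.
labels = [s <= 3 for s in stars]

# The labels must line up one-to-one with the documents in the corpus.
assert len(labels) == len(stars)

# Size of each group: low-star reviews versus the rest.
n_low = sum(labels)
n_high = len(labels) - n_low
print(labels)
print(n_low, n_high)
```

Checking the group sizes up front is worthwhile because a very small group makes any proportion computed on it noisy.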

Initialize the module with the review corpus and labels

ce = ComparativeExtraction(corpus = data['reviews'], labels = label)

Extract the keywords

ce.get_distinguish_terms(ngram_range = (1,3),top_n = 10)
<compExtract.ComparativeExtraction at 0x7ff96f84b588>
# Terms mentioned significantly more often in reviews with 3 stars or fewer
ce.increased_terms_df
   feature       diff  pos_prop  pos_count  neg_prop  neg_count
0  work      0.194976  0.278426        191  0.083449        360
1  switch    0.176764  0.351312        241  0.174548        753
2  buy       0.174520  0.297376        204  0.122856        530
3  month     0.143129  0.158892        109  0.015763         68
4  nintendo  0.134316  0.290087        199  0.155772        672
5  charge    0.122855  0.141399         97  0.018544         80
6  use       0.118448  0.206997        142  0.088549        382
7  new       0.113989  0.160350        110  0.046361        200
8  would     0.106540  0.164723        113  0.058183        251
9  get       0.104055  0.231778        159  0.127724        551
# Terms mentioned significantly less often in reviews with 3 stars or fewer
ce.decreased_terms_df
   feature        diff  pos_prop  pos_count  neg_prop  neg_count
0  love      -0.216997  0.080175         55  0.297172       1282
1  great     -0.122247  0.099125         68  0.221372        955
2  fun       -0.048160  0.046647         32  0.094808        409
3  best      -0.042638  0.030612         21  0.073250        316
4  amaze     -0.038011  0.010204          7  0.048215        208
5  awesome   -0.035827  0.007289          5  0.043115        186
6  son love  -0.035564  0.002915          2  0.038479        166
7  perfect   -0.032515  0.008746          6  0.041261        178
8  easy      -0.026282  0.023324         16  0.049606        214
9  kid love  -0.024370  0.004373          3  0.028744        124
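
The columns in these tables are consistent with a simple comparison of document proportions: pos_prop and neg_prop look like the share of reviews in each group that mention the term, and diff looks like their difference (for "work", 0.278426 - 0.083449 matches the listed 0.194976 up to rounding). The sketch below reproduces that calculation on a toy corpus. It illustrates the idea under that assumption and is not the package's actual implementation: it uses plain whitespace tokenization, does no significance testing, and assumes both groups are non-empty. The function name term_proportions and the toy data are hypothetical.

```python
def term_proportions(docs, labels, term):
    """Return (diff, pos_prop, pos_count, neg_prop, neg_count) for one term.

    A document counts once if the term appears in it at all (binary presence).
    """
    pos_docs = [d for d, is_pos in zip(docs, labels) if is_pos]
    neg_docs = [d for d, is_pos in zip(docs, labels) if not is_pos]

    # Number of documents in each group that contain the term as a token.
    pos_count = sum(term in d.lower().split() for d in pos_docs)
    neg_count = sum(term in d.lower().split() for d in neg_docs)

    # Share of each group that mentions the term.
    pos_prop = pos_count / len(pos_docs)
    neg_prop = neg_count / len(neg_docs)
    return pos_prop - neg_prop, pos_prop, pos_count, neg_prop, neg_count


# Toy corpus: True marks a low-star review.
docs = [
    "it will not work",
    "stopped work after a month",
    "love it",
    "great fun love it",
    "does not work but love the games",
    "love love love",
]
labels = [True, True, False, False, False, False]

diff, pos_prop, pos_count, neg_prop, neg_count = term_proportions(docs, labels, "work")
print(diff, pos_prop, pos_count, neg_prop, neg_count)
```

A positive diff means the term is relatively more common in the labeled (low-star) group, and a negative diff means it is more common in the other group.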

If we need more context on a given word, or we need more interpretable topics, we can:

  • Output the reviews that contain the term
  • Change the ngram_range

Output the reviews

Say we want to know more about the significant term "work". We can directly output all the reviews containing the term.

The fitted object (ce in this example) contains a one-hot encoded document-term matrix, binary_dtm, that covers all the terms found in the corpus. We can use it to find the reviews that correspond to each term.

# The binary_dtm provides a convenient way to extract reviews with specific terms
print(ce.binary_dtm[['work']])
      work
0        0
1        0
2        0
3        0
4        0
...    ...
4995     1
4996     0
4997     0
4998     0
4999     0

[5000 rows x 1 columns]
reviews_contain_term_work = data['reviews'][[x == 1 for x in ce.binary_dtm['work']]]
len(reviews_contain_term_work)
551
for x in pd.Series(reviews_contain_term_work).sample(1):
    print(x)
I bought this as a Christmas present for my son.  After about a month and half of using it.  The switch stopped working.  It wont charge.  The product is an expensive piece of junk.
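
To make the idea of a binary document-term matrix concrete, here is a small standard-library sketch. It assumes whitespace tokenization and a toy corpus; binary_dtm below is a plain dict built for illustration, not the package's attribute. The package's own tokenizer appears to normalize words as well (the sampled review above matches "work" through "working"), which this sketch does not attempt.

```python
# Toy corpus (hypothetical reviews).
docs = [
    "the switch stopped working",
    "it does not work",
    "great console",
]

# Vocabulary: every distinct token in the corpus, in sorted order.
tokenized = [d.lower().split() for d in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})

# One column per term; each entry is 1 if the document contains the term, else 0.
binary_dtm = {term: [int(term in toks) for toks in tokenized] for term in vocab}

# Select the documents whose "work" entry is 1.
docs_with_work = [d for d, flag in zip(docs, binary_dtm["work"]) if flag == 1]
print(binary_dtm["work"])
print(docs_with_work)
```

Without stemming or lemmatization, "working" and "work" are separate columns here, so only the second document is selected.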

Change the n-gram range to exclude uni-grams

ce_ngram = ComparativeExtraction(corpus = data['reviews'], labels = label).get_distinguish_terms(ngram_range=(2,4), top_n=10)
/Users/xiaoma/envs/compExtract/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:489: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn("The parameter 'token_pattern' will not be used"





<compExtract.ComparativeExtraction at 0x7ff955f23cf8>
ce_ngram.increased_terms_df
   feature              diff  pos_prop  pos_count  neg_prop  neg_count
0  joy con          0.040857  0.056851         39  0.015994         69
1  brand new        0.020511  0.027697         19  0.007186         31
2  nintendo switch  0.019638  0.074344         51  0.054706        236
3  buy switch       0.018888  0.027697         19  0.008809         38
4  play game        0.014092  0.039359         27  0.025267        109
5  game play        0.009812  0.021866         15  0.012054         52
6  year old         0.005243  0.023324         16  0.018081         78
7  christmas gift   0.003682  0.014577         10  0.010895         47
8  battery life     0.001833  0.024781         17  0.022949         99
9  wii u            0.000504  0.016035         11  0.015531         67
ce_ngram.decreased_terms_df
   feature                diff  pos_prop  pos_count  neg_prop  neg_count
0  son love          -0.035564  0.002915          2  0.038479        166
1  kid love          -0.024370  0.004373          3  0.028744        124
2  great game        -0.018442  0.007289          5  0.025730        111
3  great product     -0.014171  0.004373          3  0.018544         80
4  great console     -0.013641  0.005831          4  0.019471         84
5  best console      -0.013609  0.001458          1  0.015067         65
6  highly recommend  -0.012615  0.002915          2  0.015531         67
7  absolutely love   -0.011987  0.001458          1  0.013445         58
8  game system       -0.011746  0.021866         15  0.033611        145
9  love switch       -0.011452  0.013120          9  0.024571        106
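
For readers unfamiliar with ngram_range, the sketch below shows which terms a range of (2, 4) generates from a single tokenized review: every run of 2, 3, or 4 consecutive tokens, and no single words. This follows the inclusive (min_n, max_n) convention of scikit-learn's text vectorizers, which the warning above shows the package uses. The helper ngrams and the token list are hypothetical and written only for illustration.

```python
def ngrams(tokens, ngram_range):
    """Return all n-grams with min_n <= n <= max_n as space-joined strings."""
    min_n, max_n = ngram_range
    out = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sequence.
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


# Hypothetical tokens from one already-normalized review.
tokens = ["joy", "con", "stop", "work"]

bigrams_and_up = ngrams(tokens, (2, 4))
with_unigrams = ngrams(tokens, (1, 3))
print(bigrams_and_up)
print(len(with_unigrams))
```

Raising the lower bound to 2 removes ambiguous single words such as "switch" and leaves phrases such as "joy con" that are easier to interpret, at the cost of much lower counts per term.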

Keywords

NLP
