compExtract

Extract keywords via comparison of corpus

0.1.2

PyPI

Maintainers: 1

Introduction

This module helps you extract key terms and topics from a corpus using a comparative approach.

Installation

Usage

Import packages

from compExtract import ComparativeExtraction

Load sample data

import pandas as pd
import numpy as np
PATH = "/Users/xiaoma/Desktop/gitrepo/associate-term-search/data/switch_reviews.csv"
data = pd.read_csv(PATH)
label = [x <= 3 for x in data['stars']]

data

	stars	titles	reviews	dates
0	5.0	Worth It\n	Definitely worth the money!\n	September 21, 2019
1	2.0	Nintendo Swich gris joy con\n	Con este producto no he sentido mucha satisfac...	September 20, 2019
2	5.0	My kid wont put it down\n	Couldnt of been happier, came early. I was th...	September 20, 2019
3	3.0	Happy\n	Happy\n	September 20, 2019
4	5.0	Great\n	Great product\n	September 19, 2019
...	...	...	...	...
4995	1.0	One Star\n	it is no good, it suck, no work, plz hlp amazon\n	December 12, 2017
4996	5.0	A must have gaming system\n	The Nintendo Switch is a versatile hybrid game...	December 12, 2017
4997	5.0	Switch\n	This purchase save me from looking for one.\n	December 11, 2017
4998	5.0	Five Stars\n	Best babysitter ever!\n	December 11, 2017
4999	5.0	Five Stars\n	Its a great game console.\n	December 11, 2017

5000 rows × 4 columns

data.columns

Index(['stars', 'titles', 'reviews', 'dates'], dtype='object')

Here we use online Amazon reviews of the Nintendo Switch to illustrate how the module is used.

The module takes a corpus and a set of binary labels as inputs. How the labels are defined depends on the question you want to answer, and there must be exactly one label per document in the corpus.

Suppose we want to know why people dislike this product and find the relevant keywords. To answer that question, we define the label as a binary variable indicating whether a reviewer gave 3 stars or fewer.
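As a minimal sketch of the labelling step described above, the snippet below builds one boolean per review and checks that the labels line up with the corpus. The helper make_labels and the toy ratings are hypothetical stand-ins for data['stars'] and are not part of compExtract.

```python
def make_labels(stars, threshold=3):
    """Return one boolean per review: True when the rating is <= threshold.

    Hypothetical helper for illustration; not part of compExtract.
    """
    return [s <= threshold for s in stars]


# Toy ratings and reviews standing in for data['stars'] and data['reviews']
stars = [5.0, 2.0, 5.0, 3.0, 1.0]
reviews = ["Worth it", "Not satisfied", "Great", "Happy", "No good"]

labels = make_labels(stars)

# One label per document, and only two distinct values
assert len(labels) == len(reviews)
assert set(labels) <= {True, False}

print(labels)  # [False, True, False, True, True]
```

Note that a missing rating stored as NaN compares as False under this rule, so it may be worth dropping or inspecting such rows before building the labels.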

Initialize the module with the review corpus and labels

ce = ComparativeExtraction(corpus = data['reviews'], labels = label)

Extract the keywords

ce.get_distinguish_terms(ngram_range = (1,3),top_n = 10)

<compExtract.ComparativeExtraction at 0x7ff96f84b588>

# Keywords mentioned significantly more often in reviews with 3 stars or fewer
ce.increased_terms_df

	feature	diff	pos_prop	pos_count	neg_prop	neg_count
0	work	0.194976	0.278426	191	0.083449	360
1	switch	0.176764	0.351312	241	0.174548	753
2	buy	0.174520	0.297376	204	0.122856	530
3	month	0.143129	0.158892	109	0.015763	68
4	nintendo	0.134316	0.290087	199	0.155772	672
5	charge	0.122855	0.141399	97	0.018544	80
6	use	0.118448	0.206997	142	0.088549	382
7	new	0.113989	0.160350	110	0.046361	200
8	would	0.106540	0.164723	113	0.058183	251
9	get	0.104055	0.231778	159	0.127724	551
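The columns are not documented on this page, but the numbers are consistent with a simple reading: diff is pos_prop minus neg_prop, and each proportion is a count divided by the size of its group, where "pos" is the group labelled True (3 stars or fewer). The check below uses the "work" row copied by hand from the table; the interpretation itself is an assumption inferred from the values.

```python
# First row of increased_terms_df ("work"), copied by hand from the table
pos_prop, pos_count = 0.278426, 191
neg_prop, neg_count = 0.083449, 360
reported_diff = 0.194976

# diff appears to be the gap between the two proportions
diff = pos_prop - neg_prop
assert abs(diff - reported_diff) < 1e-5

# Dividing a count by its proportion recovers the size of each group
n_pos = round(pos_count / pos_prop)
n_neg = round(neg_count / neg_prop)
print(n_pos, n_neg, n_pos + n_neg)  # 686 4314 5000
```

The two group sizes add up to the 5000 rows of the sample data, which supports reading pos_count and neg_count as the number of reviews in each group that mention the term.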

# Keywords mentioned significantly less often in reviews with 3 stars or fewer
ce.decreased_terms_df

	feature	diff	pos_prop	pos_count	neg_prop	neg_count
0	love	-0.216997	0.080175	55	0.297172	1282
1	great	-0.122247	0.099125	68	0.221372	955
2	fun	-0.048160	0.046647	32	0.094808	409
3	best	-0.042638	0.030612	21	0.073250	316
4	amaze	-0.038011	0.010204	7	0.048215	208
5	awesome	-0.035827	0.007289	5	0.043115	186
6	son love	-0.035564	0.002915	2	0.038479	166
7	perfect	-0.032515	0.008746	6	0.041261	178
8	easy	-0.026282	0.023324	16	0.049606	214
9	kid love	-0.024370	0.004373	3	0.028744	124
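To make the comparative idea concrete, here is a rough plain-Python sketch that ranks terms by the difference in the share of documents mentioning them in each group. It is not the module's implementation: the function names are hypothetical, tokenization is a simple whitespace split, and it omits the lemmatization and significance filtering that the output above suggests the module applies.

```python
from collections import Counter


def doc_counts(docs):
    """Count, for each token, how many documents contain it at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))  # set(): each doc counts once
    return counts


def compare_terms(corpus, labels):
    """Rank terms by (share of True-labelled docs) - (share of False-labelled docs)."""
    pos = [d for d, flag in zip(corpus, labels) if flag]
    neg = [d for d, flag in zip(corpus, labels) if not flag]
    if not pos or not neg:
        raise ValueError("both label groups need at least one document")
    pos_counts, neg_counts = doc_counts(pos), doc_counts(neg)
    rows = []
    for term in set(pos_counts) | set(neg_counts):
        pos_prop = pos_counts[term] / len(pos)
        neg_prop = neg_counts[term] / len(neg)
        rows.append((term, pos_prop - neg_prop, pos_prop, neg_prop))
    # Largest positive difference first; ties broken alphabetically
    return sorted(rows, key=lambda r: (-r[1], r[0]))


# Toy corpus: the first two reviews play the role of low-star reviews
corpus = [
    "stopped working after a month",
    "will not charge and stopped working",
    "love it great fun",
    "great console my kid love it",
]
labels = [True, True, False, False]

ranked = compare_terms(corpus, labels)
print(ranked[0])   # ('stopped', 1.0, 1.0, 0.0)
print(ranked[-1])  # ('love', -1.0, 0.0, 1.0)
```

The head of the ranking plays the role of increased_terms_df and the tail that of decreased_terms_df.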

If we need more context on a given word, or we need more interpretable topics, we can:

Output the reviews that contain the term
Switch the ngram_range

Output the reviews

Suppose we want to know more about the significant term "work". We can directly output all the reviews that contain it.

The fitted object (ce in this example) exposes binary_dtm, a one-hot encoded document-term matrix covering all the terms found in the corpus. We can use it to look up the reviews that correspond to each term.

# The binary_dtm provides a convenient way to extract reviews with specific terms
print(ce.binary_dtm[['work']])

      work
0        0
1        0
2        0
3        0
4        0
...    ...
4995     1
4996     0
4997     0
4998     0
4999     0

[5000 rows x 1 columns]

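# Boolean mask: True for reviews whose binary_dtm entry for "work" is 1
contains_work = (ce.binary_dtm['work'] == 1).to_numpy()
reviews_contain_term_work = data['reviews'][contains_work]
len(reviews_contain_term_work)

# Print one randomly chosen review that contains the term
for x in reviews_contain_term_work.sample(1):
    print(x)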

I bought this as a Christmas present for my son.  After about a month and half of using it.  The switch stopped working.  It wont charge.  The product is an expensive piece of junk.
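The lookup above relies on a one-hot (binary) document-term matrix. The following plain-Python sketch shows what such a matrix looks like and how it selects documents. The helper names are hypothetical, and the tokenization is a plain whitespace split, whereas the sampled review above matches "work" through "working", which suggests the module normalizes word forms first.

```python
def build_binary_dtm(corpus):
    """Return (vocab, rows); rows[i][j] is 1 if document i contains vocab[j]."""
    tokenized = [set(doc.lower().split()) for doc in corpus]
    vocab = sorted(set().union(*tokenized))
    rows = [[1 if term in tokens else 0 for term in vocab] for tokens in tokenized]
    return vocab, rows


def docs_with_term(corpus, vocab, rows, term):
    """Select the documents whose column for `term` is 1."""
    col = vocab.index(term)  # raises ValueError if the term is not in the vocabulary
    return [doc for doc, row in zip(corpus, rows) if row[col] == 1]


corpus = [
    "it does not work",
    "great product",
    "work work work",
]
vocab, rows = build_binary_dtm(corpus)

# The third document mentions "work" three times but is still encoded as 1
print(rows[2][vocab.index("work")])  # 1

hits = docs_with_term(corpus, vocab, rows, "work")
print(hits)  # ['it does not work', 'work work work']
```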

Change the n-gram range to exclude uni-grams

ce_ngram = ComparativeExtraction(corpus = data['reviews'], labels = label).get_distinguish_terms(ngram_range=(2,4), top_n=10)

/Users/xiaoma/envs/compExtract/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:489: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn("The parameter 'token_pattern' will not be used"





<compExtract.ComparativeExtraction at 0x7ff955f23cf8>

ce_ngram.increased_terms_df

	feature	diff	pos_prop	pos_count	neg_prop	neg_count
0	joy con	0.040857	0.056851	39	0.015994	69
1	brand new	0.020511	0.027697	19	0.007186	31
2	nintendo switch	0.019638	0.074344	51	0.054706	236
3	buy switch	0.018888	0.027697	19	0.008809	38
4	play game	0.014092	0.039359	27	0.025267	109
5	game play	0.009812	0.021866	15	0.012054	52
6	year old	0.005243	0.023324	16	0.018081	78
7	christmas gift	0.003682	0.014577	10	0.010895	47
8	battery life	0.001833	0.024781	17	0.022949	99
9	wii u	0.000504	0.016035	11	0.015531	67

ce_ngram.decreased_terms_df

	feature	diff	pos_prop	pos_count	neg_prop	neg_count
0	son love	-0.035564	0.002915	2	0.038479	166
1	kid love	-0.024370	0.004373	3	0.028744	124
2	great game	-0.018442	0.007289	5	0.025730	111
3	great product	-0.014171	0.004373	3	0.018544	80
4	great console	-0.013641	0.005831	4	0.019471	84
5	best console	-0.013609	0.001458	1	0.015067	65
6	highly recommend	-0.012615	0.002915	2	0.015531	67
7	absolutely love	-0.011987	0.001458	1	0.013445	58
8	game system	-0.011746	0.021866	15	0.033611	145
9	love switch	-0.011452	0.013120	9	0.024571	106
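To illustrate why ngram_range=(2,4) removes single words from the results, the sketch below enumerates n-grams for an inclusive (min_n, max_n) range. This is the convention scikit-learn's vectorizers use, and the warning above indicates the module relies on scikit-learn, but the function here is a hypothetical illustration rather than the module's own code.

```python
def ngrams(tokens, ngram_range=(1, 1)):
    """List all n-grams of `tokens` for every n in the inclusive range."""
    min_n, max_n = ngram_range
    out = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


tokens = ["joy", "con", "stop", "work"]

# (1, 3): uni-grams, bi-grams and tri-grams
print(ngrams(tokens, (1, 3)))

# (2, 4): no uni-grams, so single words such as "joy" cannot appear as features
print(ngrams(tokens, (2, 4)))
# ['joy con', 'con stop', 'stop work', 'joy con stop', 'con stop work', 'joy con stop work']
```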

Keywords

NLP
