CausalNLP is a practical toolkit for causal inference with text as treatment, outcome, or "controlled-for" variable.
pip install -U pip
pip install causalnlp
NOTE: On Python 3.6.x, if you get a RuntimeError: Python version >= 3.7 required, try ensuring NumPy is installed before CausalNLP (e.g., pip install numpy==1.18.5).
To try out the examples yourself:
import pandas as pd
# on_bad_lines='skip' requires pandas >= 1.3; on older pandas, use error_bad_lines=False instead
df = pd.read_csv('sample_data/music_seed50.tsv', sep='\t', on_bad_lines='skip')
The file music_seed50.tsv is a semi-simulated dataset from here. Columns of relevance include:
Y_sim: outcome, where 1 means product was clicked and 0 means not.
text: raw text of review
rating: rating associated with review (1 through 5)
T_true: 0 means rating less than 3, 1 means rating of 5, where T_true affects the outcome Y_sim.
T_ac: an approximation of true review sentiment (T_true) created with Autocoder from raw review text
C_true: confounding categorical variable (1=audio CD, 0=other)
We'll pretend the true sentiment (i.e., review rating and T_true) is hidden and only use T_ac as the treatment variable.
Using the text_col parameter, we include the raw review text as another "controlled-for" variable.
from causalnlp import CausalInferenceModel
from lightgbm import LGBMClassifier
cm = CausalInferenceModel(df,
                          metalearner_type='t-learner', learner=LGBMClassifier(num_leaves=500),
                          treatment_col='T_ac', outcome_col='Y_sim', text_col='text',
                          include_cols=['C_true'])
cm.fit()
outcome column (categorical): Y_sim
treatment column: T_ac
numerical/categorical covariates: ['C_true']
text covariate: text
preprocess time: 1.1179866790771484 sec
start fitting causal inference model
time to fit causal inference model: 10.361494302749634 sec
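To make metalearner_type='t-learner' more concrete, here is a minimal sketch of the general T-learner idea: fit one outcome model on treated rows and one on control rows, then take the difference of their predictions for each observation. This is not CausalNLP's implementation. The toy rows, the fit_arm helper, and the use of per-stratum means as the "outcome model" are all assumptions made for illustration; the library call above uses LGBMClassifier as the base learner instead.

```python
from collections import defaultdict

# Hypothetical toy data (not from music_seed50.tsv): (confounder C, treatment T, outcome Y).
# Rows with C=1 are both more likely to be treated and more likely to click.
rows = (
    [(1, 1, 1)] * 6 + [(1, 1, 0)] * 2    # C=1, treated: 6 of 8 click
    + [(1, 0, 1)] * 1 + [(1, 0, 0)] * 1  # C=1, control: 1 of 2 click
    + [(0, 1, 1)] * 1 + [(0, 1, 0)] * 1  # C=0, treated: 1 of 2 click
    + [(0, 0, 1)] * 2 + [(0, 0, 0)] * 6  # C=0, control: 2 of 8 click
)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def fit_arm(rows, arm):
    """Trivial outcome model for one treatment arm: mean outcome per value of C."""
    outcomes = defaultdict(list)
    for c, t, y in rows:
        if t == arm:
            outcomes[c].append(y)
    return {c: mean(ys) for c, ys in outcomes.items()}


# Naive comparison: ignores the confounder entirely.
naive = mean(y for c, t, y in rows if t == 1) - mean(y for c, t, y in rows if t == 0)

# T-learner: one model per arm, then the per-row difference in predictions.
mu1 = fit_arm(rows, arm=1)
mu0 = fit_arm(rows, arm=0)
ite = [mu1[c] - mu0[c] for c, t, y in rows]  # individual effect estimates
ate = mean(ite)                              # average over all rows

print("naive difference in means:", round(naive, 3))
print("T-learner ATE:", round(ate, 3))
```

In this toy data the naive difference overstates the effect because C drives both treatment and outcome; conditioning on C inside each arm's model removes that gap.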
CausalNLP supports estimation of heterogeneous treatment effects (i.e., how causal impacts vary across observations, which could be documents, emails, posts, individuals, or organizations).
We will first calculate the overall average treatment effect (or ATE), which shows that a positive review increases the probability of a click by 13 percentage points in this dataset.
Average Treatment Effect (or ATE):
print( cm.estimate_ate() )
{'ate': 0.1309311542209525}
Conditional Average Treatment Effect (or CATE): reviews that mention the word "toddler":
print( cm.estimate_ate(df['text'].str.contains('toddler')) )
{'ate': 0.15559234254638685}
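As a rough mental model, a conditional estimate like the one above can be read as the average of per-observation effect estimates restricted to the rows a boolean mask selects. The sketch below assumes that reading and does not reproduce CausalNLP's internals. The reviews, the effect values, and the average_effect helper are hypothetical.

```python
# Hypothetical (review text, estimated individual effect) pairs; values are made up.
reviews = [
    ("my toddler loves this album", 0.20),
    ("great cd for a long drive", 0.10),
    ("bought it for our toddler", 0.10),
    ("not my favorite record", 0.00),
]


def average_effect(reviews, keep=lambda text: True):
    """Average the effect estimates of the reviews selected by `keep`."""
    effects = [effect for text, effect in reviews if keep(text)]
    return sum(effects) / len(effects)


overall = average_effect(reviews)                                       # all rows
toddler = average_effect(reviews, keep=lambda text: "toddler" in text)  # subset only

print("overall:", round(overall, 3))
print("toddler subset:", round(toddler, 3))
```

The subset average can differ from the overall average whenever effects vary across observations, which is what "heterogeneous treatment effects" refers to.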
Individualized Treatment Effects (or ITE):
test_df = pd.DataFrame({'T_ac' : [1], 'C_true' : [1],
'text' : ['I never bought this album, but I love his music and will soon!']})
effect = cm.predict(test_df)
print(effect)
[[0.80538201]]
Model Interpretability:
print( cm.interpret(plot=False)[1][:10] )
v_music 0.079042
v_cd 0.066838
v_album 0.055168
v_like 0.040784
v_love 0.040635
C_true 0.039949
v_just 0.035671
v_song 0.035362
v_great 0.029918
v_heard 0.028373
dtype: float64
Features with the v_ prefix are word features. C_true is the categorical variable indicating whether or not the product is a CD.
Despite the "NLP" in CausalNLP, the library can be used for causal inference on data without text (e.g., only numerical and categorical variables). See the examples for more info.
API documentation and additional usage examples are available at: https://amaiya.github.io/causalnlp/
Please cite the following paper when using CausalNLP in your work:
@article{maiya2021causalnlp,
title={CausalNLP: A Practical Toolkit for Causal Inference with Text},
author={Arun S. Maiya},
year={2021},
eprint={2106.08043},
archivePrefix={arXiv},
primaryClass={cs.CL},
journal={arXiv preprint arXiv:2106.08043},
}