# Hisia
:blush: :smiley: :relaxed: :yum: :wink: :smirk: :flushed: :worried: :frowning: :triumph: :disappointed: :angry: :persevere: :confounded: :shit:

A Danish sentiment analyzer using scikit-learn's LogisticRegression.
```python
from hisia import Hisia

negative_joe = Hisia('Det er simpelthen ikke okay :(')
negative_joe.sentiment
negative_joe.explain

positive_gro = Hisia('Det var ikke dårligt')
positive_gro
positive_gro.explain
```
## Hisia (Emotions)
Hisia is a Swahili word for emotion/feeling. My initial thought was to call it Følelser, the Danish word for feeling, but it just did not feel right. As a Tanzanian, I went with Swahili, as it felt much more like a package name I would like to install from PyPI. :)
```bash
pip install -U hisia
```
## Data and Models Used in Hisia

Data: 2016 TrustPilot's 254,464 Danish reviews' body and stars, plus [8 fake reviews] duplicated 20 times (see notes for the explanation).

Update 2021-10-02: Political data from Sentiment Analysis on Comments from Danish Political Articles on Social Media
### Models
- Hisia: LogisticRegression with SAGA, a variant of Stochastic Average Gradient (SAG), as the solver, and an L2 penalty, selected for the base model. Test accuracy is ca. 93% with a recall of 93%. SAGA was selected because it is a faster solver for large datasets (both row- and column-wise). As a stochastic gradient method, the gradient memory of the previous epoch is incorporated into/fed forward to the current epoch, which allows a faster convergence rate. Seeds: 42 in the 80% training / 20% test data split, and 42 in the model, for reproducibility. Check the notebooks for other parameters.
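The base-model setup described above can be sketched as follows. This is a minimal illustration, not the real training script: the toy documents and labels below are invented stand-ins for the TrustPilot data, and the actual parameters live in the notebooks folder.

```python
# Sketch of the base-model recipe: LogisticRegression with the SAGA solver
# and L2 penalty, an 80/20 train/test split, and seed 42 throughout.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data; the real model trains on 254,464 TrustPilot reviews.
docs = ['god service', 'elsker det', 'dårlig oplevelse', 'hader det'] * 10
labels = [1, 1, 0, 0] * 10  # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.20, random_state=42  # 80/20 split, seed 42
)

model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(solver='saga', penalty='l2', random_state=42),
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```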
- HisiaTrain: SGDClassifier, a Stochastic Gradient Descent learner, with the smooth 'modified_huber' loss function and an L2 penalty. Test accuracy is 94% with a recall of 94%. SGDClassifier was selected because of partial_fit, which allows batch/online training.
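A hedged sketch of the partial_fit idea behind HisiaTrain: an SGDClassifier with the 'modified_huber' loss and L2 penalty updated one batch at a time. The HashingVectorizer and the toy batches are my own illustrative choices, not the project's actual pipeline; the real trainer is in hisia/models/train_model.py.

```python
# Online/batch training via partial_fit: each call updates the model with
# one batch instead of refitting on the full dataset.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**10)  # stateless, so no refit needed
clf = SGDClassifier(loss='modified_huber', penalty='l2', random_state=42)

batches = [
    (['god service', 'dårlig oplevelse'], [1, 0]),
    (['elsker det', 'hader det'], [1, 0]),
]
for docs, labels in batches:
    # classes must be declared on every call so unseen labels are handled
    clf.partial_fit(vectorizer.transform(docs), labels, classes=[0, 1])

prediction = clf.predict(vectorizer.transform(['god service']))[0]
```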
Note: These scores reflect the models' performance on TrustPilot's style of review writing.

8x20 fake reviews: TrustPilot reviews are directed towards products and services, so words like 'elsker' (love) or 'hader' (hate) were rare. To make sure the model learns such relationships, I added 8 reviews and duplicated them 20 times. These new sentences did not increase or decrease the model accuracy, but they correctly adjusted the coefficients of the words love, hate, and not bad ('ikke dårligt').
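The augmentation step above amounts to a simple list duplication. The two reviews below are invented stand-ins for the 8 hand-written ones, which are not listed in this README:

```python
# Duplicate a handful of hand-written reviews 20 times and append them to
# the training data, so rare words like 'elsker'/'hader' get stable weights.
fake_reviews = [
    ('jeg elsker det', 1),  # stand-in positive example
    ('jeg hader det', 0),   # stand-in negative example
]
augmented = fake_reviews * 20  # each fake review now appears 20 times

training_data = [('rigtig god service', 1)]  # placeholder for the real reviews
training_data.extend(augmented)
```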
The notebooks folder contains a playground model_training notebook to reproduce the model scores and explore what the model has learned, using the same parameters and data used to train Hisia.
## News & Updates

- Hisia is part of the sprogteknologi.dk tools
- Comparing Afinn (Lexicon) and Hisia (Logistic Regression) scoring models
## Features
- Sentiment analysis
- Sentiment explainer
- Sentiment reinforcement learning (Coming Soon)
- Sentiment retrainer (Coming Soon)
## Project Organization

```
├── LICENSE
├── README.md
│
├── notebooks          <- Jupyter notebooks. Reproduce the results, show model explanations, and compare with afinn
│   ├── model_training.ipynb
│   ├── afinn_hisia_comparison.ipynb
│   └── helpers.py
│
├── hisia              <- Source code for use in this project.
│   ├── __init__.py    <- Makes hisia a Python module
│   ├── hisia.py       <- hisia, a sentiment predictor and explainer
│   │
│   ├── data           <- Training and validation datasets and stopwords; the data folder is inside hisia for retraining
│   │   ├── data.json
│   │   ├── data_custom.json
│   │   └── stops.pkl
│   │
│   ├── models         <- Helpers, frozen model, model trainer
│   │   ├── base_model.pkl
│   │   ├── helpers.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Results-oriented visualizations
│       ├── ROC.png
│       └── ROC_test.png
│
├── tests              <- Tests that check model accuracy, datatypes, and the scikit-learn version
│   ├── __init__.py
│   ├── conftest.py
│   ├── test_basemodel_results.py
│   ├── test_data.py
│   ├── test_scikit_version.py
│   └── test_tokenizer.py
│
└── tox.ini            <- tox file to train base models and run pytest
```
## Bugs and Errors: 6% Expected Error

"All models are wrong, but some are useful" (George Box). There is no magic: expect the model to make very basic mistakes. To help train a better model, post an issue with the sentence, the expected result, and the model's result. Because of data limitations, this model performs very well on product and company reviews, but is limited outside those domains.
## TODO

### Retrain and Test: For Developers
Coming Soon