hisia

A Danish sentiment analysis using scikit-learn



Hisia

:blush: :smiley: :relaxed: :yum: :wink: :smirk: :flushed: :worried: :frowning: :triumph: :disappointed: :angry: :persevere: :confounded: :shit:

A Danish sentiment analyzer using scikit-learn LogisticRegression


from hisia import Hisia

negative_joe = Hisia('Det er simpelthen ikke okay :(')
negative_joe.sentiment
# Sentiment(sentiment='negative', positive_probability=0.008, negative_probability=0.992)
negative_joe.explain
# {'decision': -4.8335798389992055,
#  'intercept': 0.809727254639209,
#  'features': {(':(', -4.36432604514099),
#               ('ikke', -3.273671001915033),
#               ('simpelthen', -2.450742871314483),
#               ('simpelthen ikke', -1.9214388345665114)}
# }

positive_gro = Hisia('Det var ikke dårligt')
positive_gro
# Sentiment(sentiment=positive, positive_probability=0.684, negative_probability=0.316)
positive_gro.explain
# {'decision': 0.7739625583753332,
#  'intercept': 0.809727254639209,
#  'features': {('dårlig', -8.910130726393785),
#               ('ikke', -3.273671001915033),
#               ('ikke dårlig', 5.126914312204595)}
# }
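The probabilities in the two outputs above are consistent with the standard logistic (sigmoid) transform of the `decision` score. The sketch below assumes Hisia reports probabilities the way a binary scikit-learn LogisticRegression does; `positive_probability` is a hypothetical helper written for this illustration, not part of the hisia API.

```python
import math


def positive_probability(decision: float) -> float:
    """Map a logistic-regression decision score to P(positive) with the sigmoid."""
    return 1.0 / (1.0 + math.exp(-decision))


# Decision scores copied from the two `explain` outputs above.
examples = [
    ("negative_joe", -4.8335798389992055),
    ("positive_gro", 0.7739625583753332),
]

for name, decision in examples:
    p_pos = positive_probability(decision)
    # Prints the name, P(positive), and P(negative), rounded to three decimals.
    print(name, round(p_pos, 3), round(1.0 - p_pos, 3))
# negative_joe 0.008 0.992
# positive_gro 0.684 0.316
```

A negative decision score gives a positive probability below 0.5, and a positive score gives one above 0.5, which is why 'Det var ikke dårligt' is labelled positive.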

Hisia (Emotions)

Hisia is a Swahili word for emotion/feeling. My initial thought was to call it Følelser, a Danish word for feelings, but it just did not feel right. As a Tanzanian, I went with Swahili, as it was much more like a package name I would want to install from PyPI. :)

pip install -U hisia

Data and Models Used in Hisia

Data: 254,464 Danish TrustPilot reviews from 2016 (review body and star rating), plus 8 fake reviews duplicated 20 times each. See the notes below for the explanation.
  Update 2021-10-02: political data from "Sentiment Analysis on Comments from Danish Political Articles on Social Media".

Models
Hisia: LogisticRegression with SAGA, a variant of Stochastic Average Gradient (SAG), as the solver. An L2 penalty was selected for the base model. Test accuracy is ca. 93%, with a recall of 93%. SAGA was selected because it is a faster solver for large datasets (in both rows and columns). As a stochastic gradient method, it carries the memory of the previous epoch's gradient forward into the current epoch, which allows a faster convergence rate. Seeds: 42 for the data split (80% training, 20% test) and 42 in the model, for reproducibility. Check the notebooks for other parameters.
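As a rough illustration of the setup described above, here is a minimal sketch with scikit-learn. The toy sentences and labels, the TfidfVectorizer, its `ngram_range`, and `max_iter` are assumptions made for this example only. The actual preprocessing and parameters are in the notebooks.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the real model is trained on the TrustPilot reviews.
texts = [
    "rigtig god service", "meget dårlig service",
    "god og hurtig levering", "dårlig og langsom levering",
    "god oplevelse", "dårlig oplevelse",
    "hurtig levering", "langsom levering",
    "god pris", "dårlig kvalitet",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# 80% training / 20% test split with seed 42, as stated in the text.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

model = make_pipeline(
    # Assumption: unigram + bigram TF-IDF features (the explainer output
    # shows bigrams such as 'simpelthen ikke').
    TfidfVectorizer(ngram_range=(1, 2)),
    # SAGA solver with seed 42; L2 is scikit-learn's default penalty.
    LogisticRegression(solver="saga", random_state=42, max_iter=1000),
)
model.fit(X_train, y_train)

# Accuracy on the held-out toy sentences (not comparable to the 93% above).
print(model.score(X_test, y_test))
```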

HisiaTrain: SGDClassifier, a Stochastic Gradient Descent learner with the smooth loss 'modified_huber' as the loss function and an L2 penalty. Test accuracy is 94%, with a recall of 94%. SGDClassifier was selected because of partial_fit, which allows batch/online training.
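The batch/online training that `partial_fit` enables can be sketched as follows. This is an illustration under assumptions, not the HisiaTrain implementation: the HashingVectorizer, the toy batches, and the label encoding are made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Assumption: a stateless vectorizer, so new batches need no refitting.
vectorizer = HashingVectorizer(ngram_range=(1, 2))

# Loss, penalty, and seed follow the description above.
clf = SGDClassifier(loss="modified_huber", penalty="l2", random_state=42)

# All class labels must be declared on the first partial_fit call.
classes = np.array([0, 1])  # 0 = negative, 1 = positive

# Hypothetical mini-batches arriving one after another.
batches = [
    (["god service", "dårlig service"], [1, 0]),
    (["hurtig levering", "langsom levering"], [1, 0]),
]
for texts, labels in batches:
    # Each call updates the existing weights instead of retraining from scratch.
    clf.partial_fit(vectorizer.transform(texts), labels, classes=classes)

prediction = clf.predict(vectorizer.transform(["god levering"]))
print(prediction)
```

The design point is that the model can keep learning from newly labelled sentences without access to the full original training set.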

Note: These scores reflect how the models perform on the TrustPilot review style of writing.

8x20 fake reviews: TrustPilot reviews are directed towards products and services, so words like 'elsker' (love) or 'hader' (hate) were rare. To make sure the model learns such relationships, I added 8 reviews and duplicated them 20 times. These new sentences did not increase or decrease the model accuracy, but they gave the expected coefficients to the words love, hate, and not bad (ikke dårligt).

The notebooks folder contains a playground model-training notebook to reproduce the model scores and explore what the model has learned. It uses the same parameters and data that were used to train Hisia.

News & Updates

  • Hisia is part of sprogteknologi.dk tools
  • Comparing Afinn (Lexicon) and Hisia (Logistic Regression) scoring models

Features

  • Sentiment analysis
  • Sentiment explainer
  • Sentiment reinforcement learning (Coming Soon)
  • Sentiment retrainer (Coming Soon)

Project Organization

├── LICENSE
├── README.md
│
├── notebooks          <-  Jupyter notebooks. Reproduce the results, show model explanations, and compare with afinn
│   ├── model_training.ipynb
│   ├── afinn_hisia_comparison.ipynb
│   └── helpers.py
│
├── hisia              <-  Source code for use in this project.
│   ├── __init__.py    <-  Makes hisia a Python module
│   ├── hisia.py       <-  hisia, a sentiment predictor and explainer
│   │
│   ├── data           <-  Path to training and validation datasets and stopwords: data folder is inside hisia for retraining
│   │   ├── data.json
│   │   ├── data_custom.json
│   │   └── stops.pkl
│   │
│   ├── models         <-  Helpers, frozen model, model trainer
│   │   ├── base_model.pkl
│   │   ├── helpers.py
│   │   └── train_model.py
│   │
│   └── visualization  <-  Results-oriented visualizations
│       ├── ROC.png
│       └── ROC_test.png
│
├── tests              <-  Path to tests that check model accuracy, datatypes, scikit-learn version
│   ├── __init__.py
│   ├── conftest.py
│   ├── test_basemodel_results.py
│   ├── test_data.py
│   ├── test_scikit_version.py
│   └── test_tokenizer.py
│
└── tox.ini            <-  tox file to train base models and run pytest

Bugs and Errors: 6% Expected Error

"All models are wrong, but some are useful." There is no magic. Expect the model to make very basic mistakes. To help train a better model, post an issue with the sentence, the expected result, and the model's result. Because of data limitations, this model performs very well on product and company reviews but is limited outside that domain.

TODO

  • Benchmark AFINN and Hisia on Non-Trustpilot data: comparison results
  • Use Danish BERT for feature extraction inside of Scikit-Learn Transformers
  • Fix path to the model issue
  • Remove more useless words (stop_words)
  • Finish HisiaTrainer

Retrain and Test: For Developers

Coming Soon
