:blush: :smiley: :relaxed: :yum: :wink: :smirk: :flushed: :worried: :frowning: :triumph: :disappointed: :angry: :persevere: :confounded: :shit:
A Danish sentiment analyzer using scikit-learn LogisticRegression
from hisia import Hisia
negative_joe = Hisia('Det er simpelthen ikke okay :(')
negative_joe.sentiment
# Sentiment(sentiment='negative', positive_probability=0.008, negative_probability=0.992)
negative_joe.explain
# {'decision': -4.8335798389992055,
# 'intercept': 0.809727254639209,
# 'features': {(':(', -4.36432604514099),
# ('ikke', -3.273671001915033),
# ('simpelthen', -2.450742871314483),
# ('simpelthen ikke', -1.9214388345665114)}
# }
positive_gro = Hisia('Det var ikke dårligt')
positive_gro
# Sentiment(sentiment=positive, positive_probability=0.684, negative_probability=0.316)
positive_gro.explain
# {'decision': 0.7739625583753332,
# 'intercept': 0.809727254639209,
# 'features': {('dårlig', -8.910130726393785),
# ('ikke', -3.273671001915033),
# ('ikke dårlig', 5.126914312204595)}
# }
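The probabilities in the sentiment output line up with the decision value reported by explain: applying the logistic (sigmoid) function to the decision reproduces positive_probability for the two examples above. A minimal sketch of that relationship, using only the standard library. The helper name probabilities is made up for illustration and is not part of hisia.

```python
import math

def probabilities(decision):
    """Map a logistic-regression decision value to (positive, negative) probabilities."""
    positive = 1.0 / (1.0 + math.exp(-decision))
    return round(positive, 3), round(1.0 - positive, 3)

# Decision values copied from the two explain outputs above.
print(probabilities(-4.8335798389992055))  # negative_joe -> (0.008, 0.992)
print(probabilities(0.7739625583753332))   # positive_gro -> (0.684, 0.316)
```

A decision below zero therefore means a negative prediction, and the further it is from zero, the more confident the model is.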
Hisia is a Swahili word for emotion or feeling. My initial thought was to call it Følelser, the Danish word for feelings, but it was just not right. As a Tanzanian, I went with Swahili, as it made for a package name I would much rather install from PyPI. :)
pip install -U hisia
Data: 254,464 Danish TrustPilot reviews from 2016 (review body and star rating), plus [8 fake reviews]*20; see the notes for the explanation.
Update (2021-10-02): Political data from Sentiment Analysis on Comments from Danish Political Articles on Social Media.
Models
Hisia: LogisticRegression with SAGA, a variant of Stochastic Average Gradient (SAG), as the solver; an L2 penalty was selected for the base model. Test accuracy is ca. 93%, with a recall of 93%. SAGA was selected because it is a faster solver for large datasets (in both rows and columns). As a stochastic average gradient method, it keeps a memory of previously computed gradients and incorporates it into the current update, which allows a faster convergence rate. Seeds: 42 for the data split (80% training, 20% test) and 42 in the model, used for reproducibility. Check the notebooks for other parameters.
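The gradient "memory" idea can be sketched in plain Python. This is the basic SAG update on made-up toy data, not scikit-learn's SAGA solver (SAGA weights the fresh gradient differently so that its step is unbiased). The function name sag_logistic, the learning rate, the L2 strength, and the data are all illustrative assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sag_logistic(X, y, lr=0.1, l2=0.01, steps=300, seed=42):
    """Fit logistic-regression weights with a SAG-style update (no intercept)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    memory = [0.0] * n      # last residual seen for each sample
    grad_sum = [0.0] * d    # sum of the remembered per-sample gradients
    for _ in range(steps):
        i = rng.randrange(n)
        residual = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i]))) - y[i]
        # Swap sample i's old gradient for its fresh one inside the running sum.
        for j in range(d):
            grad_sum[j] += (residual - memory[i]) * X[i][j]
        memory[i] = residual
        # Step along the average of all remembered gradients, plus the L2 term.
        for j in range(d):
            w[j] -= lr * (grad_sum[j] / n + l2 * w[j])
    return w

# Made-up, linearly separable toy data (label 1 = positive, 0 = negative).
X = [[1.0, 2.0], [2.0, 1.5], [1.5, 1.8], [-1.0, -2.0], [-2.0, -1.0], [-1.5, -1.7]]
y = [1, 1, 1, 0, 0, 0]
w = sag_logistic(X, y)
predictions = [int(sigmoid(sum(wj * xj for wj, xj in zip(w, x))) > 0.5) for x in X]
print(predictions)  # [1, 1, 1, 0, 0, 0]
```

Each step touches only one sample, yet the update direction averages over every sample's most recent gradient, which is where the faster convergence compared with plain stochastic gradient descent comes from.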
HisiaTrain: SGDClassifier, a Stochastic Gradient Descent learner with the smooth 'modified_huber' loss function and an L2 penalty. Test accuracy is 94%, with a recall of 94%. SGDClassifier was selected because of partial_fit, which allows batch/online training.
Note: These scores reflect how the models perform on the TrustPilot review style of writing.
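To make the batch/online training behind HisiaTrain more concrete, here is a rough, standard-library-only sketch of a linear classifier trained one mini-batch at a time with the modified Huber loss and L2 shrinkage. The method name partial_fit only mirrors scikit-learn's naming. The class OnlineBagOfWords, its parameters, and the toy batches are hypothetical stand-ins for SGDClassifier, which does considerably more.

```python
def modified_huber_grad(margin):
    """Derivative of the modified Huber loss with respect to the margin y * f(x)."""
    if margin >= 1.0:
        return 0.0                      # confidently correct: no update
    if margin >= -1.0:
        return -2.0 * (1.0 - margin)    # quadratic zone
    return -4.0                         # linear zone for badly wrong predictions

class OnlineBagOfWords:
    """Tiny linear classifier over word tokens, updated one mini-batch at a time."""

    def __init__(self, lr=0.1, l2=0.001):
        self.lr, self.l2 = lr, l2
        self.weights = {}

    def score(self, text):
        return sum(self.weights.get(tok, 0.0) for tok in text.lower().split())

    def partial_fit(self, texts, labels):
        # Labels are +1 (positive) or -1 (negative).
        for text, label in zip(texts, labels):
            g = modified_huber_grad(label * self.score(text))
            for tok in text.lower().split():
                w = self.weights.get(tok, 0.0)
                # SGD step: loss gradient for this token plus L2 shrinkage.
                self.weights[tok] = w - self.lr * (g * label + self.l2 * w)
        return self

    def predict(self, text):
        return 1 if self.score(text) > 0 else -1

model = OnlineBagOfWords()
# Two made-up mini-batches arriving at different times.
model.partial_fit(["elsker", "hader"], [1, -1])
model.partial_fit(["simpelthen ikke okay", "elsker"], [-1, 1])
print(model.predict("elsker"), model.predict("simpelthen ikke okay"))  # 1 -1
```

The point of the design is that the model never needs the full dataset in memory: each call refines the existing weights with whatever batch is available.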
8x20 fake reviews: TrustPilot reviews are directed towards products and services, so words like 'elsker' (love) or 'hader' (hate) were rare. To make sure the model learns such relationships, I added 8 reviews and duplicated them 20 times. These new sentences did not increase or decrease the model accuracy, but they correctly set the coefficients for the words love, hate and not bad (ikke dårligt).
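The [8 fake reviews]*20 notation reads like Python list multiplication, and the augmentation can be sketched that way. The texts below are placeholders only (the actual 8 sentences and the real corpus are not reproduced here), and the variable names are assumptions.

```python
# Stand-in corpus; the real TrustPilot reviews are not reproduced here.
reviews = ["placeholder review %d" % i for i in range(100)]

# Placeholders for the 8 hand-written reviews.
fake_reviews = ["fake review %d" % i for i in range(8)]

# [8 fake reviews] * 20: list multiplication repeats the 8 items 20 times.
augmented = reviews + fake_reviews * 20

print(len(fake_reviews * 20))  # 160
print(len(augmented))          # 260
```

Repeating the sentences gives otherwise rare words enough weight in the training loss to receive a meaningful coefficient, while 160 extra rows remain negligible next to the full review set.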
The notebooks folder contains a playground model_training notebook to reproduce the model scores and to explore what the model has learned. It uses the same parameters and data that were used to train Hisia.
Hisia is part of the sprogteknologi.dk tools.
Comparing Afinn (Lexicon) and Hisia (Logistic Regression) scoring models
├── LICENSE
├── README.md
│
├── notebooks          <- Jupyter notebooks. Reproduce the results, show model explanations, and compare with afinn
│   ├── model_training.ipynb
│   ├── afinn_hisia_comparison.ipynb
│   └── helpers.py
│
├── hisia              <- Source code for use in this project.
│   ├── __init__.py    <- Makes hisia a Python module
│   ├── hisia.py       <- hisia, a sentiment predictor and explainer
│   │
│   ├── data           <- Path to training and validation datasets and stopwords: data folder is inside hisia for retraining
│   │   ├── data.json
│   │   ├── data_custom.json
│   │   └── stops.pkl
│   │
│   ├── models         <- Helpers, frozen model, model trainer
│   │   ├── base_model.pkl
│   │   ├── helpers.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Results oriented visualizations
│       ├── ROC.png
│       └── ROC_test.png
│
├── tests              <- Path to tests to check model accuracy, datatypes, scikit-learn version
│   ├── __init__.py
│   ├── conftest.py
│   ├── test_basemodel_results.py
│   ├── test_data.py
│   ├── test_scikit_version.py
│   └── test_tokenizer.py
│
└── tox.ini            <- tox file to train base models and run pytests
"All models are wrong, but some are useful." There is no magic. Expect the model to make very basic mistakes. To help train a better model, post an issue with the sentence, the expected result, and the model's result. Because of data limitations, this model performs very well on product or company reviews, but its performance is limited outside those domains.