TextMood - Simple, powerful sentiment analyzer
TextMood is a simple and powerful sentiment analyzer, provided as a Ruby gem with
a command-line tool for easy interoperability with other processes. It takes text
as input and returns a sentiment score.
The sentiment analysis is relatively simple: the text is split into tokens, and
each token is looked up in a sentiment file that assigns it a predefined score.
The combined score for all tokens is then returned.
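The lookup-and-sum idea described above can be sketched in a few lines of Ruby. This is a minimal illustration, not TextMood's actual implementation: the `SCORES` table, the `simple_score` method and the tokenizing regular expression are all hypothetical, and the real gem loads its scores from sentiment files.

```ruby
# Hypothetical in-memory sentiment table (the gem reads these from files).
SCORES = {
  "good"      => 1.0,
  "wonderful" => 0.875,
  "dishonest" => -0.3,
  "tragedy"   => -0.5
}.freeze

# Split the text into lowercase word tokens and sum the scores of the
# tokens that appear in the table. Unknown tokens contribute nothing.
def simple_score(text, scores = SCORES)
  tokens = text.downcase.scan(/[[:alnum:]'-]+/)
  tokens.sum(0.0) { |token| scores.fetch(token, 0.0) }
end

puts simple_score("A wonderful day ended in tragedy") # prints 0.375
```

Here only "wonderful" (0.875) and "tragedy" (-0.5) are found, so the combined score is 0.375.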
However, TextMood also supports doing multiple passes over the text, splitting
it into tokens of N words (N-grams) for each pass. By adding multi-word tokens to
the sentiment file and using this feature, you can achieve much greater accuracy
than with just single-word analysis.
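To show why multi-word tokens help, here is a rough sketch of multi-pass N-gram scoring. The method names (`ngrams`, `multi_pass_score`) and the score table are made up for illustration, and adding up the passes is an assumption about how the passes are combined, not a description of the gem's internals.

```ruby
# Hypothetical scores, including one two-word token.
NGRAM_SCORES = {
  "good"        => 1.0,
  "well suited" => 0.6
}.freeze

# Build all N-word tokens from a list of words, e.g. n = 2 gives bigrams.
def ngrams(words, n)
  words.each_cons(n).map { |group| group.join(" ") }
end

# Run one pass per N in start_ngram..end_ngram and add up the scores
# of all tokens found in the table (assumption: passes are summed).
def multi_pass_score(text, scores, start_ngram: 1, end_ngram: 1)
  words = text.downcase.scan(/[[:alnum:]'-]+/)
  (start_ngram..end_ngram).sum(0.0) do |n|
    ngrams(words, n).sum(0.0) { |token| scores.fetch(token, 0.0) }
  end
end

text = "This tool is well suited"
puts multi_pass_score(text, NGRAM_SCORES)               # prints 0.0
puts multi_pass_score(text, NGRAM_SCORES, end_ngram: 2) # prints 0.6
```

With single words only, nothing in the text matches. The bigram pass finds "well suited", which is why the sentiment file needs multi-word entries for this feature to pay off.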
Summary of features
- Bundles baseline sentiment scores for many languages, making it easy to get started
- CLI tool that makes it extremely simple to get sentiment scores for any text
- Supports multiple passes for any range of N-grams
- Has a flexible API that’s easy to use and understand
Bundled languages
- English ("en") - decent quality, copied from cmaclell/Basic-Tweet-Sentiment-Analyzer
- Russian ("ru") - low quality, raw Google Translate of the English file
- Spanish ("es") - low quality, raw Google Translate of the English file
- German ("de") - low quality, raw Google Translate of the English file
- French ("fr") - low quality, raw Google Translate of the English file
- Norwegian Bokmål ("no_NB") - low quality, slightly improved Google Translate of the English file
- Swedish ("se") - low quality, raw Google Translate of the English file
- Danish ("da") - low quality, raw Google Translate of the English file
Please see the Contribute section for more info on how to improve the quality of these
files or add new ones.
Installation
The easiest way to get the latest stable version is to install the gem:
gem install textmood
If you’d like to get the bleeding-edge version:
git clone https://github.com/stiang/textmood
The master branch will normally be in sync with the gem, but there may be
newer code in branches.
Usage
TextMood can be used as a Ruby library or as a standalone CLI tool.
Ruby library
You can use it in a Ruby program like this:
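require "textmood"

# :language selects one of the bundled sentiment files
tm = TextMood.new(language: "en")
score = tm.analyze("some text")

# :files loads only the specified sentiment files instead of the bundled ones
tm = TextMood.new(files: ["en_US-mod1.txt", "emoticons.txt"])

# :alias_file points to a JSON file that maps language codes to sentiment files
tm = TextMood.new(language: "zw", alias_file: "my-custom-languages.json")

# :normalize_score tries to normalize the score to an integer between +/- 100
tm = TextMood.new(language: "en", normalize_score: true)
score = tm.analyze("some text")

# :ternary_output returns 1 (positive), -1 (negative) or 0 (neutral)
tm = TextMood.new(language: "en", ternary_output: true)
score = tm.analyze("some text")

# :min_threshold and :max_threshold set the cut-offs used by :ternary_output;
# they are compared to the normalized score when :normalize_score is set
tm = TextMood.new(language: "en",
                  ternary_output: true,
                  normalize_score: true,
                  min_threshold: 10,
                  max_threshold: 20)
score = tm.analyze("some text")

# :start_ngram and :end_ngram set the range of word N-grams to split the text into
tm = TextMood.new(language: "en", debug: true, start_ngram: 2, end_ngram: 3)
score = tm.analyze("some long text with many words")

# :debug prints the score for each token, or nil if the token was not found
tm = TextMood.new(language: "en", debug: true)
score = tm.analyze("some text")

# :verbose prints out statistics about the analysis
tm = TextMood.new(language: "en", verbose: true)
score = tm.analyze("some slightly longer text that contains a few more tokens")
#(stdout): Combined score: 1.0 (5 tokens, 0.2 avg.)
#(stdout): Negative score: -0.5 (1 tokens, -0.5 avg.)
#(stdout): Positive score: 1.5 (4 tokens, 0.375 avg.)
#(stdout): Neutral score: 0.0 (0 tokens)
#(stdout): Not found: 5 tokens
#=> '1.0'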
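The figures in the verbose output above are simple sums, counts and averages. The sketch below reproduces them from a list of per-token lookups. The individual token scores are invented so that the totals match the sample output, and `summarize` is a hypothetical helper, not part of the TextMood API.

```ruby
# One entry per token: a Float if the token was found, nil if it was not.
# These values are made up so the totals match the sample output.
lookups = [0.5, nil, 0.25, nil, -0.5, 0.5, nil, nil, 0.25, nil]

# Group the found scores by sign and report sum, count and average.
def summarize(lookups)
  found = lookups.compact
  stats = lambda do |list|
    avg = list.empty? ? nil : list.sum(0.0) / list.size
    { score: list.sum(0.0), tokens: list.size, avg: avg }
  end
  {
    combined:  stats.call(found),
    negative:  stats.call(found.select(&:negative?)),
    positive:  stats.call(found.select(&:positive?)),
    neutral:   stats.call(found.select(&:zero?)),
    not_found: lookups.count(nil)
  }
end

summarize(lookups).each { |name, value| puts "#{name}: #{value}" }
```

The combined score is 1.5 + (-0.5) = 1.0 over 5 scored tokens, giving the 0.2 average, while the 5 tokens that were not found do not affect the result.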
CLI tool
You can also pass some UTF-8-encoded text to the CLI tool and get a score back, like so:
```bash
textmood -l en "<some text>"
-0.4375
```
Alternatively, you can pipe text to textmood on stdin:
echo "<some text>" | textmood -l en
-0.4375
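Because the tool reads from stdin, it is easy to call from another process. The following sketch shows one way to do that from Ruby with the standard `open3` library. The helper names `textmood_args` and `cli_score` are hypothetical. It assumes the `textmood` executable is on your PATH and prints a single number, and it only uses the `-l`, `-n` and `-t` flags.

```ruby
require "open3"

# Build the argument list for the CLI from a few keyword options.
def textmood_args(language:, normalize: false, ternary: false)
  args = ["textmood", "-l", language]
  args << "-n" if normalize
  args << "-t" if ternary
  args
end

# Pipe the text to the CLI on stdin and parse the printed score.
# Assumes textmood is installed and prints one number on stdout.
def cli_score(text, **options)
  output, status = Open3.capture2(*textmood_args(**options), stdin_data: text)
  raise "textmood exited with status #{status.exitstatus}" unless status.success?
  Float(output.strip)
end

p textmood_args(language: "en", normalize: true)
```

Passing the command as an argument list rather than a single string avoids shell quoting problems when the text or options contain special characters.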
The CLI tool has many useful options, mostly mirroring those of the library. Here’s the
output from textmood -h:
Usage: textmood [options] "<text>"
OR
echo "<text>" | textmood [options]
Returns a sentiment score of the provided text. Above 0 is usually
considered positive, below is considered negative.
MANDATORY options:
-l, --language LANGUAGE The IETF language tag for the provided text.
Examples: en_US, no_NB
OR
-f, --file PATH TO FILE Use the specified sentiment file. May be used
multiple times to load several files. No other
files will be loaded if this option is used.
OPTIONAL options:
-a, --alias-file PATH TO FILE JSON file containing a hash that maps language codes to
sentiment score files. This lets you use the convenience of
language codes with custom sentiment score files.
-n, --normalize-score Tries to normalize the score to an integer between +/- 100
according to the number of tokens that were scored, making
it more feasible to compare scores for texts of different
length
-t, --ternary-output Return 1 (positive), -1 (negative) or 0 (neutral)
instead of the actual score. See also --min-threshold
and --max-threshold.
-i, --min-threshold FLOAT Scores lower than this are considered negative when
using --ternary-output (default 0.5). Note that the
threshold is compared to the normalized score, if applicable
-x, --max-threshold FLOAT Scores higher than this are considered positive when
using --ternary-output (default 0.5). Note that the
threshold is compared to the normalized score, if applicable
-s, --start-ngram INTEGER The lowest word N-gram number to split the text into
(default 1). Note that this only makes sense if the
sentiment file has tokens of similar N-gram length
-e, --end-ngram INTEGER The highest word N-gram number to split the text into
(default 1). Note that this only makes sense if the
sentiment file has tokens of similar N-gram length
-k, --skip-symbols Do not include symbols file (emoticons etc.). Only applies
when using -l/--language.
-c, --config PATH TO FILE Use the specified config file. If not specified, textmood will
look for /etc/textmood.cfg and ~/.textmood. Settings in the user
config will override settings from the global file.
-d, --debug Prints out the score for each token in the provided text
or 'nil' if the token was not found in the sentiment file
-v, --verbose Prints out some useful statistics about the analysis
(counts, averages etc).
-h, --help Show this message
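As a rough illustration of how --ternary-output relates to the two thresholds, the sketch below maps a score to 1, -1 or 0. The `ternary` method is hypothetical and the order of the checks is an assumption. The threshold values 10 and 20 are taken from the library example earlier in this document, where they are applied to a normalized score.

```ruby
# Map a score to 1 (positive), -1 (negative) or 0 (neutral).
# Scores above max_threshold are positive, scores below min_threshold
# are negative, and everything in between is neutral.
def ternary(score, min_threshold:, max_threshold:)
  return 1 if score > max_threshold
  return -1 if score < min_threshold
  0
end

[25, 15, 5].each do |score|
  puts "#{score} => #{ternary(score, min_threshold: 10, max_threshold: 20)}"
end
```

With these thresholds, 25 is reported as positive, 5 as negative, and 15 falls in the neutral band between the two.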
Configuration files for the CLI tool
The CLI tool will look for /etc/textmood.cfg and ~/.textmood unless the -c/--config option
is used, in which case only that file is used. The configuration files are basic, flat
YAML files that use the same keys that the library understands:
language: en
files: [/path/to/file1, /path/to/file2]
alias_file: /home/john/textmood/aliases.json
normalize_score: true
ternary_output: true
max_threshold: 10
min_threshold: 5
start_ngram: 1
end_ngram: 3
skip_symbols: true
debug: true
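The override behaviour, where user settings win over global ones, can be sketched as a plain hash merge. This is only an illustration under stated assumptions: the two YAML strings stand in for the global and user config files, and `merged_settings` is a hypothetical helper, not the CLI's actual loading code.

```ruby
require "yaml"

# Stand-ins for the global and the user config file contents.
GLOBAL_CFG = <<~CFG
  language: en
  normalize_score: true
  end_ngram: 1
CFG

USER_CFG = <<~CFG
  end_ngram: 3
  debug: true
CFG

# Parse each flat YAML document in order and merge the results.
# Keys from later documents replace keys from earlier ones.
def merged_settings(*documents)
  merged = documents.reduce({}) do |settings, doc|
    settings.merge(YAML.safe_load(doc) || {})
  end
  merged.transform_keys(&:to_sym)
end

p merged_settings(GLOBAL_CFG, USER_CFG)
```

The merged hash keeps `language` and `normalize_score` from the global settings, while `end_ngram` takes the user's value of 3.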
Sentiment files
The included sentiment files reside in the lang directory. I hope to add many
more baseline sentiment files in the future.
Sentiment files should be named according to the IETF language tag, like en or
no_NB, and contain one colon-separated line per token, like so:
1.0: epic
1.0: good
1.0: upright
0.958: fortunate
0.875: wonderfulness
0.875: wonderful
0.875: wide-eyed
0.875: wholesomeness
0.875: well-to-do
0.875: well-situated
0.6: well suited
-0.3: dishonest
-0.5: tragedy
The score, which must be between -1.0 and 1.0, is to the left of the first ':',
and everything to the right is the (potentially multi-word) token.
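If you write your own sentiment files, a small checker can catch malformed lines early. The sketch below parses lines in the format described above by splitting on the first ':' only and validating the score range. `parse_sentiment_lines` is a hypothetical helper rather than the gem's loader, and the emoticon entry is an invented example.

```ruby
# Parse "<score>: <token>" lines into a Hash of token => score.
# Only the first ':' separates score and token, so tokens may contain
# spaces or further colons (for example emoticons).
def parse_sentiment_lines(lines)
  lines.each_with_object({}) do |line, scores|
    line = line.strip
    next if line.empty?
    raw_score, token = line.split(":", 2)
    raise ArgumentError, "missing ':' in #{line.inspect}" if token.nil?
    score = Float(raw_score.strip)
    unless score.between?(-1.0, 1.0)
      raise ArgumentError, "score out of range in #{line.inspect}"
    end
    scores[token.strip] = score
  end
end

sample = ["1.0: epic", "0.6: well suited", "-0.5: tragedy", "0.5: :)"]
p parse_sentiment_lines(sample)
```

Limiting the split to two parts is what keeps a token such as ":)" intact instead of breaking it at its own colon.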
TODO
- Add more sentiment language files
- Improve sentiment files, adding bigrams and trigrams
- Improve test coverage
Contribute
Including baseline word/N-gram scores for many different languages is one
of the stated goals of this project. If you are able to contribute scores
for a missing language or improve an existing one, it would be much appreciated!
The process is the usual:
- Fork
- Add/improve
- Pull request
Credits
Loosely based on https://github.com/cmaclell/Basic-Tweet-Sentiment-Analyzer
Author
Stian Grytøyr