ReaderBench Python
Install
We recommend using virtual environments, as some packages require an exact version.
If you only want to use the package, do the following:
sudo apt-get install python3-pip python3-venv python3-dev
python3 -m venv rbenv (create a virtual environment named rbenv)
source rbenv/bin/activate (activate the virtual environment)
pip3 uninstall setuptools && pip3 install setuptools && pip3 install --upgrade pip && pip3 install --no-cache-dir rbpy-rb
- Use it as in: https://github.com/readerbench/ReaderBench/blob/master/usage.py
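To check that the installation worked, you can try importing the package; the top-level module name rb is taken from the imports used later in this README:
python3 -c "import rb"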
If you want to contribute to the code base of the package:
sudo apt-get install python3-pip python3-venv python3-dev
git clone git@git.readerbench.com:ReaderBench/readerbenchpy.git && cd readerbenchpy/
python3 -m venv rbenv (create a virtual environment named rbenv)
source rbenv/bin/activate (activate the virtual environment)
pip3 uninstall setuptools && pip3 install setuptools && pip3 install --upgrade pip
pip3 install -r requirements.txt
python3 nltk_download.py
Optional: pre-install the spaCy model for English (otherwise most English processing will fail and ask you to run this command):
python3 -m spacy download en_core_web_lg
If you also want to install spellchecking (hunspell), you need these non-Python libraries:
sudo apt-get install libhunspell-1.6-0 libhunspell-dev hunspell-ro
pip3 install hunspell
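As a quick sanity check that hunspell and the Romanian dictionary work, you can try the snippet below; the dictionary paths are an assumption based on the default Debian/Ubuntu install location of hunspell-ro and may differ on your system:
import hunspell

# Paths assume the default location of the hunspell-ro dictionary (adjust if needed).
checker = hunspell.HunSpell("/usr/share/hunspell/ro_RO.dic", "/usr/share/hunspell/ro_RO.aff")
print(checker.spell("propoziție"))    # True if the word is in the dictionary
print(checker.suggest("propozitie"))  # suggestions for a misspelled word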
Usage
For usage (parsing, lemmatization, NER, WordNet, content words, indices, etc.) see the file usage.py from
https://github.com/readerbench/ReaderBench
Tips
You may also need some spaCy models, which are downloaded through spaCy.
You have to download these models yourself, using the command:
python3 -m spacy download name_of_the_model
The logger will also write instructions on which models you need, and how to download them.
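For example, after downloading the large English model mentioned above, you can verify that spaCy loads it (a standard spaCy check, not ReaderBench-specific):
import spacy

# spacy.load raises an OSError if the model has not been downloaded yet.
nlp = spacy.load("en_core_web_lg")
doc = nlp("ReaderBench uses spaCy for parsing.")
print([(token.text, token.pos_) for token in doc])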
Developer instructions
How to use BERT
Our models are also available on the HuggingFace platform: https://huggingface.co/readerbench
You can use them directly from HuggingFace:
# TensorFlow
from transformers import AutoTokenizer, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
model = TFAutoModel.from_pretrained("readerbench/RoBERT-base")
inputs = tokenizer("exemplu de propoziție", return_tensors="tf")
outputs = model(inputs)
# PyTorch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
model = AutoModel.from_pretrained("readerbench/RoBERT-base")
inputs = tokenizer("exemplu de propoziție", return_tensors="pt")
outputs = model(**inputs)
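In both cases a sentence-level representation can then be taken from the model output, for example the final hidden state of the [CLS] token (a common convention, shown here for the PyTorch variant):
# outputs.last_hidden_state has shape (batch_size, sequence_length, hidden_size)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # hidden state of the [CLS] token
print(cls_embedding.shape)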
or from ReaderBench:
from rb.core.lang import Lang
from rb.processings.encoders.bert import BertWrapper
from tensorflow import keras
bert_wrapper = BertWrapper(Lang.RO, max_seq_len=128)
inputs, bert_layer = bert_wrapper.create_inputs_and_model()
cls_output = bert_wrapper.get_output(bert_layer, "cls") # or "pool"
# Add a decision layer and compile the model, e.g.:
# hidden = keras.layers.Dense(..)(cls_output)
# output = keras.layers.Dense(..)(hidden)
# model = keras.Model(inputs=inputs, outputs=[output])
# model.compile(..)
bert_wrapper.load_weights()  # must be called after compile
# Process inputs for model
feed_inputs = bert_wrapper.process_input(["text1", "text2", "text3"])
# feed_output = ...
# model.fit(feed_inputs, feed_output, ...)
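Putting the commented steps together, a complete classifier on top of the wrapper might look as follows; the layer sizes, loss, and labels are illustrative placeholders, not part of the ReaderBench API:
import numpy as np
from rb.core.lang import Lang
from rb.processings.encoders.bert import BertWrapper
from tensorflow import keras

bert_wrapper = BertWrapper(Lang.RO, max_seq_len=128)
inputs, bert_layer = bert_wrapper.create_inputs_and_model()
cls_output = bert_wrapper.get_output(bert_layer, "cls")

# Illustrative decision head: sizes, activations and loss are placeholders.
hidden = keras.layers.Dense(64, activation="relu")(cls_output)
output = keras.layers.Dense(1, activation="sigmoid")(hidden)
model = keras.Model(inputs=inputs, outputs=[output])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
bert_wrapper.load_weights()  # must be called after compile

feed_inputs = bert_wrapper.process_input(["text1", "text2", "text3"])
feed_output = np.array([0, 1, 0])  # placeholder labels
model.fit(feed_inputs, feed_output, epochs=1)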
How to use the logger
In each file you have to initialize the logger:
from rb.utils.rblogger import Logger
logger = Logger.get_logger()
logger.info("info msg")
logger.warning("warning msg")
logger.error("error msg")
How to push the wheel to PyPI
rm -r dist/
pip3 install twine wheel
./upload_to_pypi.sh
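The script is expected to build the distribution and upload it with twine; if you ever need to run the equivalent steps manually, the usual sequence is roughly the following (a generic sketch, not the actual contents of upload_to_pypi.sh):
rm -r dist/
python3 setup.py sdist bdist_wheel
twine upload dist/*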
How to run rb/core/cscl/csv_parser.py
- Follow the installation steps for contributors above
- Run:
pip3 install xmltodict
- Run:
export PYTHONPATH=/add/path/to/repo/readerbenchpy/
- Add the JSON resources to a jsons directory in readerbenchpy/rb/core/cscl/
- Run:
cd rb/core/cscl/ && python3 csv_parser.py
Supported Date Formats
ReaderBench is able to perform conversation analysis from chats and communities. Each utterance must have the time expressed in one of the following formats:
- %Y-%m-%d %H:%M:%S.%f %Z
- %Y-%m-%d %H:%M:%S %Z
- %Y-%m-%d %H:%M %Z
- %Y-%m-%d %H:%M:%S.%f
- %Y-%m-%d %H:%M:%S
- %Y-%m-%d %H:%M
where the format codes follow Python's strftime/strptime date format codes.
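For example, a timestamp in the "%Y-%m-%d %H:%M:%S" format parses with Python's standard strptime:
from datetime import datetime

# Matches the "%Y-%m-%d %H:%M:%S" entry from the list above.
parsed = datetime.strptime("2021-03-05 14:30:00", "%Y-%m-%d %H:%M:%S")
print(parsed)  # 2021-03-05 14:30:00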