Scio v2 is a reimplementation of Scio in Python3.
Scio uses Apache Tika to extract text from documents (PDF, HTML, DOC, etc.).
The result is sent to the Scio Analyzer that extracts information using a combination of NLP
(Natural Language Processing) and pattern matching.
Scio now supports setting a TLP (Traffic Light Protocol) level on data upload, which annotates documents with a tlp
tag. Documents downloaded by feeds default to TLP white, but this can be changed in the feed configuration.
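As a minimal sketch of how a client might check a TLP level before upload: the set of accepted values below is an assumption based on the common TLP colours (only white is confirmed by this document), and `normalize_tlp` is a hypothetical helper, not part of act-scio.

```python
# Assumed set of TLP levels; Scio itself may accept a different set.
ASSUMED_TLP_LEVELS = ("white", "green", "amber", "red")


def normalize_tlp(value: str = "white") -> str:
    """Return a lower-case TLP level, or raise ValueError if it is not recognised."""
    level = value.strip().lower()
    if level not in ASSUMED_TLP_LEVELS:
        raise ValueError(f"unknown TLP level: {value!r}")
    return level
```

Defaulting to white mirrors the default that feeds apply to downloaded documents.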
Source code
The source code for the workers is available on GitHub.
To set up, first install from PyPI:
sudo pip3 install act-scio
You will also need to install beanstalkd. On Debian/Ubuntu you can run:
sudo apt install beanstalkd
Configure beanstalkd to accept larger payloads with the -z
option. On Red Hat-derived systems this can be configured in /etc/sysconfig/beanstalkd:
MAX_JOB_SIZE=-z 524288
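The -z value is a size in bytes, so 524288 corresponds to 512 KiB. A minimal sketch of checking whether a document would fit under that limit, assuming the job carries roughly the raw file bytes (Scio's actual job format may add encoding overhead); `fits_in_job` is a hypothetical helper, not part of act-scio.

```python
import os

# Bytes, matching the -z value configured above (512 * 1024).
MAX_JOB_SIZE = 524288


def fits_in_job(path: str, limit: int = MAX_JOB_SIZE) -> bool:
    """Return True if the file at `path` is no larger than `limit` bytes."""
    return os.path.getsize(path) <= limit
```

If larger documents are expected, raise the -z value rather than relying on a client-side check like this one.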
You then need to install NLTK data files. A helper utility to do this is included:
You will also need to create a default configuration:
scio-config user
To run the API, execute:
This will set up the API. Use --port <PORT> and --host <IP>
to listen on another port and/or another interface.
For documentation of the API endpoint see
You can create a default configuration using this command (should be run as the user running scio):
scio-config user
Common configuration can be found under ~/.config/scio/etc/scio.ini
Running Manually
Scio Tika Server
The Scio Tika server reads jobs from the beanstalk tube scio_doc
and sends the extracted text to the tube scio_analyze.
The first time the server runs, it will download Tika using Maven. It will use a proxy if $https_proxy
is set.
The Scio Tika server uses tika-python, which depends on tika-server.jar. If your server has internet access, this will be downloaded automatically. If not, or if you need a proxy to connect to the internet, follow the instructions on "Airgap Environment Setup" here:
Currently only tested with tika-server version 2.7.0.
Scio Analyze Server
By default, the Scio Analyze Server reads jobs from the beanstalk tube scio_analyze.
You can also read directly from stdin like this:
echo "The companies in the Bus; Finanical, Aviation and Automobile industry are large." | scio-analyze --beanstalk= --elasticsearch=
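The same stdin invocation can be driven from Python. This is a sketch under assumptions: it reuses the empty --beanstalk= and --elasticsearch= flags from the command above, assumes scio-analyze is on the PATH and writes its result to stdout, and the function names are hypothetical.

```python
import subprocess
from typing import List


def build_analyze_command() -> List[str]:
    """Return the scio-analyze command line used above, with empty flag values."""
    return ["scio-analyze", "--beanstalk=", "--elasticsearch="]


def analyze_text(text: str) -> str:
    """Pipe `text` to scio-analyze on stdin and return whatever it prints to stdout."""
    result = subprocess.run(
        build_analyze_command(),
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the command as a list avoids shell quoting issues when the input text contains characters such as semicolons.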
Scio Submit
Submits a document (from a file or URI) to the Scio API:
scio-submit \
--uri \
--scio-baseuri http://localhost:3000/submit \
--tlp white
Running as a service
Systemd compatible service scripts can be found under examples/systemd.
To install:
sudo cp examples/systemd/*.service /usr/lib/systemd/system
sudo systemctl enable scio-tika-server
sudo systemctl enable scio-analyze
sudo systemctl start scio-tika-server
sudo systemctl start scio-analyze
scio-feed cron job
To continuously fetch new content from feeds, you can add scio-feed to cron like this (make sure the directory $HOME/logs exists):
# Fetch scio feeds every hour
0 * * * * /usr/local/bin/scio-feeds >> $HOME/logs/scio-feed.log.$(date +\%s) 2>&1
# Delete logs from scio-feeds older than 7 days
0 * * * * find $HOME/logs/ -name 'scio-feed.log.*' -mmin +10080 -exec rm {} \;
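For environments where the find-based cleanup is not convenient, the same retention rule (remove logs older than 7 days, i.e. 10080 minutes) can be sketched in Python. `delete_old_logs` is a hypothetical helper, not part of act-scio, and unlike find it only looks at the top level of the log directory.

```python
import time
from pathlib import Path
from typing import List


def delete_old_logs(
    log_dir: str,
    max_age_days: float = 7,
    pattern: str = "scio-feed.log.*",
) -> List[str]:
    """Delete files in `log_dir` matching `pattern` that were last modified
    more than `max_age_days` ago. Returns the sorted names of removed files."""
    cutoff = time.time() - max_age_days * 24 * 60 * 60
    removed = []
    for path in Path(log_dir).glob(pattern):  # non-recursive match
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Files that do not match the pattern are left untouched, even if they are older than the cutoff.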
Local development
Use pip to install in local development mode. act-scio uses namespacing, so it is not compatible with using setup.py install
or setup.py develop.
In the repository, run:
pip3 install --user -e .