abstraction

machine learning framework

2017.1.16.1534

PyPI

Maintainers: 1

|project abstraction|

NOTE

Please note that abstraction is a project, not a finished product.

setup
=====

The following Bash commands, which have been tested on Ubuntu 15.10, should install prerequisites and check out abstraction.

.. code:: bash

    # Install ROOT.
    sudo apt-get -y install festival
    sudo apt-get -y install pylint
    sudo apt-get -y install snakefood
    sudo apt-get -y install sqlite
    sudo apt-get -y install python-nltk
    sudo python -m nltk.downloader all
    sudo easy_install -U gensim
    sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
    # GitHub no longer serves the unauthenticated git:// protocol, so use HTTPS.
    sudo pip install git+https://github.com/google/skflow.git
    sudo pip install abstraction
    git clone https://github.com/wdbm/abstraction.git

The function ``abstraction.setup()`` should then be run.

upcoming
========

Under consideration is a requirement for arcodex to ensure the existence of a response to an utterance before saving to the database.

logging
-------

Updating logging procedures is under consideration because of possible logging conflicts. For now, it can be beneficial to filter out low-priority log messages using Bash process substitution (anonymous pipes), in a way like the following:

.. code:: bash

    python script.py 2> >(grep -E -v "INFO|DEBUG")

data
====

feature scaling
---------------

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally-distributed data: Gaussian with zero mean and unit variance (values scaled in this way are often called standard scores). Many machine learning algorithms assume that all features are centered around zero and have variance of the same order. A feature with a variance that is orders of magnitude larger than that of the others might dominate the objective function and make the estimator unable to learn from the other features. The scikit-learn function ``scale`` provides a quick way to perform this operation on a single array-like dataset.
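As a minimal sketch of what standardization does, the following pure-Python example rescales each feature (column) to zero mean and unit variance. The function name ``standardise`` and the sample values are illustrative assumptions, not part of abstraction; the sketch uses the population standard deviation, which is what ``sklearn.preprocessing.scale`` is understood to use by default.

.. code:: python

    from statistics import mean, pstdev


    def standardise(column):
        """Rescale one feature to zero mean and unit variance (standard scores)."""
        mu = mean(column)
        sigma = pstdev(column)  # population standard deviation
        if sigma == 0:
            # A constant feature carries no information; map it to zeros.
            return [0.0 for _ in column]
        return [(value - mu) / sigma for value in column]


    # Two hypothetical features on very different scales.
    feature_small = [0.1, 0.2, 0.3, 0.4]
    feature_large = [1000.0, 2000.0, 3000.0, 4000.0]

    scaled_small = standardise(feature_small)
    scaled_large = standardise(feature_large)

    print(scaled_small)
    print(scaled_large)

After scaling, the two features occupy the same range, so neither dominates an objective function purely because of its units.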

SUSY Data Set
-------------

The SUSY Data Set is a classification problem to distinguish between a signal process which produces supersymmetric particles and a background process which does not. In the data, the first column is the class label (1 for signal, 0 for background), followed by 18 features (8 low-level features and 10 high-level features):

-  lepton 1 pT
-  lepton 1 eta
-  lepton 1 phi
-  lepton 2 pT
-  lepton 2 eta
-  lepton 2 phi
-  missing energy magnitude
-  missing energy phi
-  MET_rel
-  axial MET
-  M_R
-  M_TR_2
-  R
-  MT2
-  S_R
-  M_Delta_R
-  dPhi_r_b
-  cos(theta_r1)

This data has been produced by MadGraph5 Monte Carlo simulations of 8 TeV proton collisions, with showering and hadronisation performed by Pythia 6 and detector response simulated by Delphes. The first 8 features are kinematic properties measured by simulated particle detectors. The next 10 features are functions of the first 8 features; they are high-level features derived by physicists to help discriminate between the two classes. Of the examples in the SUSY data set, 46% are positive. The features were standardised over the entire training/testing sets to mean zero and standard deviation one, except for those features with values strictly greater than zero; these were scaled such that the mean value was one.
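The scaling rule described above can be sketched as follows. This is an illustration of the rule as stated, not the code used to prepare the data set; the function name ``scale_feature`` and the sample values are assumptions made for the example.

.. code:: python

    from statistics import mean, pstdev


    def scale_feature(column):
        """Scale one feature following the rule described for the SUSY data set."""
        mu = mean(column)
        if all(value > 0 for value in column):
            # Strictly positive feature: rescale so that the mean is one.
            return [value / mu for value in column]
        # Otherwise standardise to mean zero and standard deviation one.
        sigma = pstdev(column) or 1.0  # avoid dividing by zero for a constant feature
        return [(value - mu) / sigma for value in column]


    # Hypothetical sample values, not taken from the data set.
    lepton_pt = [0.5, 1.0, 1.5, 3.0]    # strictly positive, like a transverse momentum
    lepton_eta = [-1.2, 0.3, 0.9, 0.0]  # signed, like a pseudorapidity

    scaled_pt = scale_feature(lepton_pt)
    scaled_eta = scale_feature(lepton_eta)

    print(scaled_pt)
    print(scaled_eta)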

Caffe
=====

introduction
------------

Caffe is a deep learning framework developed by the Berkeley Vision and Learning Center (BVLC) with cleanliness, readability and speed in mind. It has a clean architecture which enables rapid deployment. It is readable and modifiable, encouraging active development. It is a fast CNN implementation. It has command line, Python and MATLAB interfaces for day-to-day usage, interfacing with research code and rapid prototyping. While Caffe is essentially a C++ library, it has a modular interface for development with cmdcaffe, pycaffe and matcaffe.

The Caffe core software packages are as follows:

-  Caffe
-  CUDA
-  cuDNN
-  OpenBLAS
-  OpenCV
-  Boost

Other Caffe dependencies are as follows:

-  protobuf
-  google-glog
-  gflags
-  snappy
-  leveldb
-  lmdb
-  hdf5

The Caffe build tools are CMake and make.

command line
------------

The command line interface ``cmdcaffe`` is a Caffe tool for model
training, scoring and diagnostics. Run it without arguments for help. It
is in the directory ``caffe/build/tools``.

train
~~~~~

``caffe train`` learns models from scratch, resumes learning from saved
snapshots and fine-tunes models to new data and tasks. All training
requires a solver configuration through the option
``-solver solver.prototxt``. Resuming requires the option
``-snapshot model_iter_1000.solverstate`` to load the solver
snapshot.

.. code:: bash

    # train LeNet
    caffe train -solver examples/mnist/lenet_solver.prototxt
    # train on GPU 2
    caffe train -solver examples/mnist/lenet_solver.prototxt -gpu 2

test
~~~~

``caffe test`` scores models by running them in the test phase and
reports the network output as the score. The network architecture must
be defined properly to output an accuracy measure or loss as its output.
The per-batch score is reported first and the grand average is reported
last.

.. code:: bash

    # score the learned LeNet model on the validation set
    # as defined in the model architecture lenet_train_test.prototxt
    caffe test -model examples/mnist/lenet_train_test.prototxt -weights examples/mnist/lenet_iter_10000.caffemodel -gpu 0 -iterations 100

benchmark
~~~~~~~~~

``caffe time`` benchmarks model execution layer-by-layer through timing
and synchronisation. This is useful to check system performance and
measure relative execution times for models.

.. code:: bash

    # time LeNet training on CPU for 10 iterations
    caffe time -model examples/mnist/lenet_train_test.prototxt -iterations 10
    # time LeNet training on GPU for the default 50 iterations
    caffe time -model examples/mnist/lenet_train_test.prototxt -gpu 0

diagnose
~~~~~~~~

``caffe device_query`` reports GPU details for reference and checking
device ordinals for running on a device in multi-GPU machines.

.. code:: bash

    # query the first device
    caffe device_query -gpu 0

pycaffe
-------

The Python interface ``pycaffe`` is the ``caffe`` module and its scripts
are in the directory ``caffe/python``. Run ``import caffe`` to load
models, run forward and backward passes, handle IO, visualise networks
and instrument model-solving. All model data, derivatives and parameters
are exposed for reading and writing.

``caffe.Net`` is the central interface for loading, configuring and
running models. ``caffe.Classifier`` and ``caffe.Detector`` provide
convenience interfaces for common tasks. ``caffe.SGDSolver`` exposes the
solving interface. ``caffe.io`` handles input and output with
preprocessing and protocol buffers. ``caffe.draw`` visualises network
architectures. Caffe blobs are exposed as numpy ndarrays for ease-of-use
and efficiency.

MATLAB
------

The MATLAB interface ``matcaffe`` is the Caffe MATLAB MEX file and its
helper m-files are in the directory ``caffe/matlab``. Example code is at
``caffe/matlab/caffe/matcaffe_demo.m``.

models
------

The directory structure of models is as follows:

.. code:: bash

    .
    ├── bvlc_alexnet
    │   ├── deploy.prototxt
    │   ├── readme.md
    │   ├── solver.prototxt
    │   └── train_val.prototxt
    ├── bvlc_googlenet
    │   ├── bvlc_googlenet.caffemodel
    │   ├── deploy.prototxt
    │   ├── quick_solver.prototxt
    │   ├── readme.md
    │   ├── solver.prototxt
    │   └── train_val.prototxt
    ├── bvlc_reference_caffenet
    │   ├── deploy.prototxt
    │   ├── readme.md
    │   ├── solver.prototxt
    │   └── train_val.prototxt
    ├── bvlc_reference_rcnn_ilsvrc13
    │   ├── deploy.prototxt
    │   └── readme.md
    └── finetune_flickr_style
        ├── deploy.prototxt
        ├── readme.md
        ├── solver.prototxt
        └── train_val.prototxt

draw a graph of network architecture
------------------------------------

.. code:: bash

    "${CAFFE}"/python/draw_net.py "${CAFFE}"/models/bvlc_googlenet/deploy.prototxt bvlc_googlenet_deploy.png

setup
-----

.. code:: bash

    sudo apt-get -y install libprotobuf-dev
    sudo apt-get -y install libleveldb-dev
    sudo apt-get -y install libsnappy-dev
    sudo apt-get -y install libopencv-dev
    sudo apt-get -y install libhdf5-dev
    sudo apt-get -y install libhdf5-serial-dev
    sudo apt-get -y install protobuf-compiler
    sudo apt-get -y install --no-install-recommends libboost-all-dev
    sudo apt-get -y install libatlas-base-dev
    sudo apt-get -y install python-dev
    sudo apt-get -y install libgflags-dev
    sudo apt-get -y install libgoogle-glog-dev
    sudo apt-get -y install liblmdb-dev
    sudo apt-get -y install python-pydot

.. code:: bash

    sudo pip install protobuf
    sudo pip install scikit-image

.. code:: bash

    cd
    git clone https://github.com/BVLC/caffe.git
    cd caffe
    cp Makefile.config.example Makefile.config

Edit ``Makefile.config``. Uncomment ``CPU_ONLY := 1`` for a non-GPU
compilation (without CUDA). It may be necessary to include the following
lines:

::

    INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /usr/include/hdf5/serial/
    LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /usr/lib/x86_64-linux-gnu/hdf5/serial

.. code:: bash

    time make all
    time make test
    time make runtest
    time make pycaffe

.. code:: bash

    # export PYTHONPATH so that Python processes started from this shell can import caffe
    export PYTHONPATH="/home/${USER}/caffe/python:${PYTHONPATH}"
    CAFFE="/home/${USER}/caffe"

Download Caffe models from the Model Zoo.

-  http://caffe.berkeleyvision.org/model_zoo.html
-  https://github.com/BVLC/caffe/wiki/Model-Zoo

.. code:: bash

    ~/caffe/scripts/download_model_binary.py models/bvlc_googlenet

Torch
=====

setup
-----

.. code:: bash

    curl -s https://raw.githubusercontent.com/torch/ezinstall/master/install-deps | bash
    git clone https://github.com/torch/distro.git ~/torch --recursive
    cd ~/torch; ./install.sh

CPU versus GPU for deep learning
================================

Roelof Pieters set some benchmarks in 2015-07 for deep dreaming video
processing using CPU and GPU hardware. The CPU hardware was Amazon EC2
g2.2xlarge Intel Xeon E5-2670 (Sandy Bridge) 8 cores 2.6 GHz/3.3 GHz
turbo and the GPU hardware was Amazon EC2 g2.2xlarge 2 x 4 GB GPU.

+--------------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| **input image resolution | **CPU processing time | **GPU processing time | **CPU processing time | **GPU processing time |
| (pixels)**               | for 1 image**         | for 1 image**         | for 2 minute video**  | for 2 minute video**  |
+==========================+=======================+=======================+=======================+=======================+
| 540 x 360                | 45 s                  | 1 s                   | 1 d 21 h              | 1 h                   |
+--------------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| 1024 x 768               | 144 s                 | 3 s                   | 6 d                   | 3 h                   |
+--------------------------+-----------------------+-----------------------+-----------------------+-----------------------+

So, the GPU hardware was ~45 -- ~48 times faster than the CPU hardware.
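The quoted speed-up range can be checked directly from the table. The short calculation below converts every timing to seconds and divides CPU time by GPU time; the dictionary keys are labels chosen for this example.

.. code:: python

    SECONDS_PER_HOUR = 3600
    SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

    # (CPU time, GPU time) in seconds, transcribed from the table above.
    timings = {
        "540 x 360, 1 image": (45, 1),
        "1024 x 768, 1 image": (144, 3),
        "540 x 360, 2 minute video": (
            SECONDS_PER_DAY + 21 * SECONDS_PER_HOUR,
            SECONDS_PER_HOUR,
        ),
        "1024 x 768, 2 minute video": (6 * SECONDS_PER_DAY, 3 * SECONDS_PER_HOUR),
    }

    speedups = {case: cpu / gpu for case, (cpu, gpu) in timings.items()}

    for case, factor in speedups.items():
        print("{}: GPU ~{:.0f} times faster".format(case, factor))

The lower resolution gives a factor of 45 and the higher resolution a factor of 48, for both single images and video.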

introduction
============

Project abstraction is a natural language processing project utilising
curated conversation data as neural network training data.

bags of words, skip-grams and word vectors
==========================================

The tool word2vec provides an efficient implementation of the continuous
bag-of-words and skip-gram architectures for computing vector
representations of words. These representations can be used in natural
language processing applications and research.

An n-gram is a contiguous sequence of n items from a sequence of text or
speech. The items can be phonemes, syllables, letters, words or base
pairs depending on the application. Skip-grams are a generalisation of
n-grams in which the components (typically words) need not be
consecutive in the text under consideration, but may have gaps that are
skipped. They are one way of overcoming the data sparsity problem found
in conventional n-gram analysis.

Formally, an n-gram is a consecutive subsequence of length n of some
sequence of tokens w\_1 ... w\_n. A k-skip-n-gram is a length-n subsequence in
which components occur at a distance of at most k from each other. For
example, in the text

::

    the rain in Spain falls mainly on the plain

the set of 1-skip-2-grams includes all of the 2-grams and, in addition,
the following sequences:

::

    the in,
    rain Spain,
    in falls,
    Spain mainly,
    falls on,
    mainly the,
    on plain
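The enumeration above can be reproduced with a short sketch. The function name ``skip_grams`` is an assumption made for this example, and "distance of at most k" is read here as allowing at most ``k`` skipped tokens between adjacent components; for 2-grams this coincides with the definition given above.

.. code:: python

    def skip_grams(tokens, n, k):
        """Return all k-skip-n-grams, allowing up to k skipped tokens per gap."""
        results = []

        def extend(indices):
            if len(indices) == n:
                results.append(tuple(tokens[i] for i in indices))
                return
            last = indices[-1]
            # The next component may follow directly or after up to k skipped tokens.
            for following in range(last + 1, min(last + k + 2, len(tokens))):
                extend(indices + [following])

        for start in range(len(tokens)):
            extend([start])
        return results


    tokens = "the rain in Spain falls mainly on the plain".split()

    bigrams = skip_grams(tokens, n=2, k=0)
    one_skip_bigrams = skip_grams(tokens, n=2, k=1)

    # Sequences that appear only when one token may be skipped.
    additional = [gram for gram in one_skip_bigrams if gram not in bigrams]

    for gram in additional:
        print(" ".join(gram))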

It has been demonstrated that skip-gram language models can be trained
such that it is possible to perform 'word arithmetic'. For example, with
an appropriate model, the expression ``king - man + woman`` evaluates to
very close to ``queen``.

-  "Efficient Estimation of Word Representations in Vector Space", Tomas
   Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean
   http://arxiv.org/abs/1301.3781
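The idea of word arithmetic can be illustrated with a toy example. The two-dimensional vectors below are hand-made for illustration (one hypothetical "royalty" axis and one hypothetical "gender" axis) and do not come from a trained model; real word vectors have hundreds of dimensions and the result is only approximately equal to the expected word.

.. code:: python

    import math

    # Hand-made toy vectors: (royalty, gender). Not taken from any trained model.
    vectors = {
        "king": (1.0, 1.0),
        "queen": (1.0, -1.0),
        "man": (0.0, 1.0),
        "woman": (0.0, -1.0),
        "prince": (0.7, 1.0),
        "princess": (0.7, -1.0),
    }


    def cosine(a, b):
        """Cosine similarity between two vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))


    def analogy(positive, negative):
        """Return the word closest to sum(positive) - sum(negative)."""
        dimensions = len(next(iter(vectors.values())))
        target = [0.0] * dimensions
        for word in positive:
            target = [t + v for t, v in zip(target, vectors[word])]
        for word in negative:
            target = [t - v for t, v in zip(target, vectors[word])]
        # As is conventional, the query words themselves are excluded.
        candidates = [w for w in vectors if w not in positive and w not in negative]
        return max(candidates, key=lambda w: cosine(vectors[w], target))


    result = analogy(positive=["king", "woman"], negative=["man"])
    print(result)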

The bag-of-words model is a simplifying representation used in natural
language processing. In this model, a text is represented as a bag
(multiset -- a set in which members can appear more than once) of its
words, disregarding grammar and word order but keeping multiplicity. The
bag-of-words model is used commonly in methods of document
classification, for which the frequency of occurrence of each word is
used as a feature for training a classifier.
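A bag of words maps naturally onto a multiset type. The sketch below uses ``collections.Counter`` from the Python standard library; the function name ``bag_of_words`` and the simple lower-casing tokenisation are assumptions made for this example.

.. code:: python

    import re
    from collections import Counter


    def bag_of_words(text):
        """Count word occurrences, ignoring case, grammar and word order."""
        words = re.findall(r"[a-z']+", text.lower())
        return Counter(words)


    bag = bag_of_words("The rain in Spain falls mainly on the plain")

    # Word order is lost, but multiplicity is kept: "the" occurs twice.
    print(bag.most_common(3))

The resulting counts can be used as per-word features when training a document classifier.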

Word vectors are continuous distributed representations of words. The
tool word2vec takes a text corpus as input and produces word vectors as
output. It constructs a vocabulary from the training text data and then
learns vector representations of words. A word2vec model is formed by
training on raw text. It records the context, or usage, of each word
encoded as word vectors. The significance of a word vector is defined as
its usefulness as an indicator of certain larger meanings or labels.

curated conversation data
=========================

Curated conversation data sourced from Reddit is used for the
conversation analysis and modelling. Specifically, conversational
exchanges on Reddit are recorded. An exchange consists of an utterance
and a response to the utterance, together with associated data, such as
references and timestamps. A submission to Reddit is considered as an
utterance and a comment on the submission is considered as a response to
the utterance. The utterance is assumed to be of good quality and the
response is assumed to be appropriate to the utterance based on the
crowd-curated quality assessment inherent in Reddit.

translation with word vectors
=============================

In the paper `"Exploiting Similarities among Languages for Machine
Translation" <http://arxiv.org/abs/1309.4168>`__, Tomas Mikolov
describes how, after training two monolingual models, a translation
matrix is generated on the most frequently occurring 5000 words. Using
this translation matrix, the accuracy of the translations was tested on
1000 words. A description Mikolov gave of the general procedure is as
follows:

-  Create matrix ``M`` with dimensionality ``I`` times ``O``, where
   ``I`` is the size of input vectors and ``O`` is the size of the
   output vectors.
-  Iterate over the training set several times with decreasing learning
   rate and update ``M``.

   -  For each training sample, compute outputs by multiplying the input
      vector by ``M``.
   -  Compute the gradient of the error (target vector - output vector).
   -  Update the weights in ``M`` (with reference to how the weights are
      updated between the hidden layer and the output layer in word2vec
      code).
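The procedure above can be sketched in pure Python as stochastic gradient descent on a linear map. This is an illustrative reading of the steps, not the code from the paper or from word2vec; the function name, the learning-rate schedule and the synthetic two-dimensional data are all assumptions made for the example.

.. code:: python

    import random


    def train_translation_matrix(pairs, epochs=200, initial_rate=0.1):
        """Learn M (I x O) such that input_vector . M approximates target_vector."""
        input_size = len(pairs[0][0])
        output_size = len(pairs[0][1])
        M = [[0.0] * output_size for _ in range(input_size)]
        for epoch in range(epochs):
            # Linearly decreasing learning rate (an assumed schedule).
            rate = initial_rate * (1.0 - epoch / epochs)
            for x, target in pairs:
                # Compute outputs by multiplying the input vector by M.
                output = [
                    sum(x[i] * M[i][o] for i in range(input_size))
                    for o in range(output_size)
                ]
                # Error: target vector - output vector.
                error = [target[o] - output[o] for o in range(output_size)]
                # Update the weights in M along the gradient of the squared error.
                for i in range(input_size):
                    for o in range(output_size):
                        M[i][o] += rate * x[i] * error[o]
        return M


    # Synthetic data: the "true" translation is a 90 degree rotation.
    rng = random.Random(0)
    true_matrix = [[0.0, 1.0], [-1.0, 0.0]]
    sources = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(20)]
    pairs = [
        (x, [sum(x[i] * true_matrix[i][o] for i in range(2)) for o in range(2)])
        for x in sources
    ]

    learned = train_translation_matrix(pairs)
    for row in learned:
        print([round(value, 3) for value in row])

With real word vectors, the training pairs would be the vectors of words and their known translations, and ``I`` and ``O`` would be the dimensionalities of the two monolingual models.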

abstraction code picture
========================

.. figure:: packages_abstraction.png
   :alt: 

module abstraction
==================

The module abstraction contains functions used generally for project
abstraction. Many of the programs of the project use its functions.

arcodex: archive collated exchanges
===================================

The program arcodex is a data collation and archiving program
specialised to conversational exchanges. It can be used to archive
exchanges on Reddit to a database.

The following example accesses 2 utterances from the subreddit
"worldnews" with verbosity:

.. code:: bash

    arcodex.py --numberOfUtterances 2 --subreddits=worldnews --verbose

The following example accesses 2 utterances from each of the subreddits
"changemyview" and "worldnews" with verbosity:

.. code:: bash

    arcodex.py --numberOfUtterances 2 --subreddits=changemyview,worldnews --verbose

The following example accesses 30 utterances from all of the listed
subreddits with verbosity:

.. code:: bash

    arcodex.py --numberOfUtterances 30 --subreddits=askreddit,changemyview,lgbt,machinelearning,particlephysics,technology,worldnews --verbose

The standard run 2014-10-28T202832Z is as follows:

.. code:: bash

    arcodex.py --numberOfUtterances 200 --subreddits=askreddit,changemyview,lgbt,machinelearning,particlephysics,technology,worldnews --verbose

vicodex, vicodex\_2: view collated exchanges
============================================

The program vicodex\_2 (and vicodex) is a viewing program specialised to
conversational exchanges. It can be used to access and view a database
of exchanges.

The following example accesses database "database.db" and displays its
exchanges data:

.. code:: bash

    vicodex_2.py --database="database.db"

inspect-database: quick printout of database
============================================

The program inspect-database provides a simple, comprehensive printout
of the contents of a database. Specifically, for every table in the
database it prints all of the column contents for every entry.

.. code:: bash

    inspect-database.py --database="database.db"

The program Sqliteman can be used to provide a view of database
information:

.. code:: bash

    sqliteman database.db

::

    SELECT * FROM exchanges;
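A printout of this kind can also be approximated with the ``sqlite3`` module from the Python standard library. The sketch below is not the implementation of inspect-database; the function name ``dump_database`` and the two-column schema of the in-memory example table are assumptions made for illustration.

.. code:: python

    import sqlite3


    def dump_database(connection):
        """Return, for every table, its column names and all of its rows."""
        cursor = connection.cursor()
        cursor.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        contents = {}
        for (table,) in cursor.fetchall():
            rows = cursor.execute('SELECT * FROM "{}"'.format(table)).fetchall()
            columns = [description[0] for description in cursor.description]
            contents[table] = (columns, rows)
        return contents


    # Build a small in-memory database with a hypothetical schema.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE exchanges (utterance TEXT, response TEXT)")
    connection.execute("INSERT INTO exchanges VALUES (?, ?)", ("hello", "hi there"))
    connection.commit()

    for table, (columns, rows) in dump_database(connection).items():
        print(table, columns)
        for row in rows:
            print("   ", row)

To inspect a file on disk, ``sqlite3.connect("database.db")`` would be used in place of the in-memory connection.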

vcodex: word vectors
====================

The program vcodex converts conversational exchanges in an abstraction
database to word vector representations and adds or updates an
abstraction database with these vectors.

.. code:: bash

    vcodex.py --database="database.db" --wordvectormodel=Brown_corpus.wvm

The program vcodex increases the file size of abstraction database
version 2015-01-06T172242Z by a factor of ~5.49. On an i7-5500U CPU
running at 2.40 GHz, the conversion rate is ~25 exchanges per second.

reducodex: remove duplicate collated exchanges
==============================================

The program reducodex inspects an existing database of conversational
exchanges, removes duplicate entries, creates simplified identifiers for
entries and then writes a new database of these entries.

The following examples access database "database.db", remove duplicate
entries, create simplified identifiers for entries and output database
"database\_1.db":

.. code:: bash

    reducodex.py --inputdatabase="database.db"

.. code:: bash

    reducodex.py --inputdatabase="database.db" --outputdatabase="database_1.db"

fix\_database: fix the data structures of database entries
==========================================================

.. code:: bash

    fix_database.py --verbose 2> >(grep -E -v "INFO|DEBUG")

abstraction development testing
===============================

.. code:: bash

    ./arcodex.py --numberOfUtterances 10 --subreddits=askreddit,changemyview,lgbt,machinelearning,particlephysics,technology,worldnews --database=2015-10-12T1612Z.db --verbose

.. code:: bash

    ./vicodex.py --database=2015-10-12T1612Z.db

saving models
=============

Note that the file ``checkpoint`` in the saved model directory contains
full paths.

.. |project abstraction| image:: http://img.youtube.com/vi/v9zJ9noLeok/0.jpg
   :target: https://www.youtube.com/watch?v=v9zJ9noLeok
