omterms
=======

A customizable keyword extraction package.


.. role:: math(raw)
   :format: html latex

+------------------------+
| Bulent Ozel, UZH       |
+------------------------+
| bulent.ozel@gmail.com  |
+------------------------+

Support for this work is partly covered by the OpenMaker Project: http://openmaker.eu/

Collaborator(s):

-  Hamza Zeytinoglu

The first objective of this module is to provide customizable and standardized text preprocessing prior to further analyses in which more advanced machine learning and/or statistical techniques can be applied and compared with each other. To that end, it provides a pipelined set of functionalities (i) to inspect, organize, prune and merge texts around one or a few specific themes or topics, (ii) to remove unwanted terms or literals from the texts, (iii) to tokenize the texts, (iv) to count the terms in the texts, and (v) when desired, to stem the tokenized terms.

The second objective of this module is to be able to compare or score a foreground (specific) corpus against a background (reference) corpus. Example use cases include exploring the language of a subculture, a community, or a movement by examining to what extent the group's specific use of the language differentiates itself from the common language.

In cases where there are more than a few themes or topics, and where each topic is represented by a set of documents large enough to justify standardized matrix-decomposition-based methodologies, the scoring option of this module can be skipped entirely. More specifically, in use cases where the objective is to classify and differentiate a number of topics or issues from each other, and where there are sufficient data to fulfill the underlying assumptions of NMF, LDA or LSI based approaches, tools from, for instance, Python's `sklearn.decomposition <http://scikit-learn.org/stable/modules/decomposition.html#non-negative-matrix-factorization-nmf-or-nnmf>`__ package are suggested.

Nevertheless, the outputs of this module, such as its normalized term frequencies or the specificity scores it assigns to them with respect to a reference background corpus, can be used as input to such matrix decomposition techniques.
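
Where such a decomposition route is taken, a minimal sketch is given below; the toy documents, the ``TfidfVectorizer`` choice and the number of components are illustrative assumptions, and omterms' normalized weights (e.g. the *wTF* column shown in the quick-use example below) could be substituted for the tf-idf weights used here.

.. code:: python

    # Illustrative sketch only (not part of omterms): decompose a small
    # document-term matrix into latent topics with scikit-learn's NMF.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    docs = [
        "open source licenses and the free software movement",
        "green finance and sustainable investment policies",
        "privacy and data protection in democratic societies",
    ]

    # Build a document-term matrix; normalized term frequencies produced by
    # omterms could serve as the weights instead of tf-idf.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    # Decompose into two latent topics.
    nmf = NMF(n_components=2, init="nndsvd", random_state=0)
    W = nmf.fit_transform(X)   # document-topic weights
    H = nmf.components_        # topic-term weights

    terms = vectorizer.get_feature_names_out()
    for k, topic in enumerate(H):
        top_terms = [terms[i] for i in topic.argsort()[::-1][:3]]
        print(f"topic {k}:", top_terms)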

Install
-------

A. Via Python's standard distribution channel PyPI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    pip install omterms

B. From its GitHub source
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    git clone https://github.com/bulentozel/omterms.git
    cd omterms
    pip install .

A quick use
-----------

.. code:: python

    >>> from omterms.interface import *
    >>> extract_terms("Some input X text to process less then 3 seconds.").head()
    Configuring the text cleaner ...
    A single text is provided.
    Extracting the terms ...
    Tokenizing the input text ..
    Done. Number of terms: 10
    Cleaning process: Initial size of tokens = 10
    Reduction due to punctuations and stopwords = 3.
    Reduction due to all numeral terms = 1
    Reduction due to short terms = 1
    Reduction due to rare terms = 0
    Reduction due to partially numeral terms = 0
    Reduction due to terms with not allowed symbols = 0
    The total term count reduction during this cleaning process = 5
    Percentage = 50%
    COMPLETED.
       TF     Term  wTF
    0   1    input  0.2
    1   1     text  0.2
    2   1  process  0.2
    3   1     less  0.2
    4   1  seconds  0.2
    >>> 

More on usage
-------------

`Please see the
tutorial <https://github.com/bulentozel/omterms/blob/master/tutorial.ipynb>`__.

--------------

Roadmap on Keyword and Keyphrase Extraction
===========================================

The method outlined here aims to set up a baseline for future
improvements.

-  It uses a statistical approach combined with standardized procedures
   that are widely applied in common NLP workflows.
-  Within this baseline, it aims to present a workflow that can be
   applied to

   -  different languages
   -  different problem domains
   -  analysis of a single theme with a limited training set

1. Overall work flow
--------------------

In short, the workflow presented in this notebook is the second stage of
a larger workflow whose objective is to measure the relevance of a given
external input to a specific theme, issue or topic. The steps of the
workflow are as follows.

1. Forming a specific corpus, where the corpus consists of a set of
   documents around a topic. The corpus could be

   -  a set of blog articles around an issue, say green finance,
   -  or a set of Wikipedia articles around the same subject,
   -  or a collection of news articles around green finance,
   -  or a collection of tweets around the same issue.

   At the moment we have another module in which, given a set of seed
   Wikipedia articles around an issue, a crawler scrapes textual data
   from the articles. For the details of the module please `see the
   scraper
   module <https://github.com/bulentozel/OpenMaker/tree/master/Scraping>`__.
   The output of that module is a set of input texts stored in a
   collection in JSON format.

2. Given an input set of texts on a theme, concept or topic, identify
   the set of terms that are more or less likely to occur within a
   discussion on the topic. This module presents one of the simpler
   methods for this purpose.

3. Given a list of weighted terms that are more likely to occur in or
   represent a theme, concept or topic, and an input query text, measure
   the relevance of the input text to the topic/theme/concept. `The
   notebook in this
   link <https://github.com/bulentozel/OpenMaker/blob/master/Semantics/Score%20Text.ipynb>`__
   demonstrates one way of scoring a given text against the curated set
   of terms produced by this module.

2. Suggested future work
------------------------

-  Comparing and combining this comparison-based scoring with matrix
   decomposition based topic modelling approaches such as NMF, LDA, LSI.

-  Using language-specific term frequency counts of Wikipedia itself for
   comparisons. In NLP terminology, the *foreground* corpus around a
   topic needs to be compared and contrasted with a *background* corpus.

-  Improving the semantic crawler of the previous stage in order to
   increase the quality of the specific corpora.

Methodological Improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  Instead of tokenizing all terms, examine possibilities of key-phrase
   extraction combined with *tf-idf*, and (as sketched below)

   -  experiment with extracting noun phrases and words; for this, use
      NLTK's regular expression module for POS (part of speech)
      analysis,
   -  extract n-grams where n = 1, 2, 3.

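A minimal sketch of these two ideas, assuming NLTK is installed together with its tokenizer and POS-tagger data; the chunk grammar and the sample sentence are illustrative, not part of omterms.

.. code:: python

    # Illustrative sketch: noun-phrase chunking with NLTK's regular expression
    # parser, plus 1- to 3-gram extraction over the raw token stream.
    from nltk import word_tokenize, pos_tag, RegexpParser
    from nltk.util import ngrams

    text = "Open source communities value transparency and collaborative governance."

    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)

    # Noun phrase: optional determiner, any adjectives, one or more nouns.
    chunker = RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
    tree = chunker.parse(tagged)
    noun_phrases = [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
    ]

    # Unigrams, bigrams and trigrams.
    grams = [g for n in (1, 2, 3) for g in ngrams(tokens, n)]

    print(noun_phrases)
    print(grams[:5])
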
3. Definitions and assumptions
------------------------------

Assumptions
~~~~~~~~~~~

-  At the comparison stage, it is assumed that a document's terms tend
   to be relatively frequent within the document as compared to an
   external reference corpus. However, it should be noted that this
   assumption is contested in the field. See the paper by Chuang et al.

-  Considering the fact that the crawler is used to aggregate
   semantically related sets of documents into a single document, *tf x
   idf* is equivalent to *tf*. As can be seen below, we use a normalized
   version of *tf*: *ntS / NS*.

-  A smaller but relatively more relevant training set (input corpus) is
   preferred in order to reduce term extraction problems due to the
   length of documents. However, it should be noted that the crawling
   depth of an identified wiki article from stage 1 of this document can
   be used as an additional weight on the relevance/representativeness
   of keywords.

-  We have limited ourselves to terms, instead of n-grams, phrases or
   the use of POS, in order to develop a base model that can work on
   different languages.

Term
~~~~

Given, for instance, a set of texts around the open source software
movement, an identified term can be a word such as *openness*, a person
such as *Stallman*, a license type such as *GNU*, an acronym for an
organization such as *FSF* (the Free Software Foundation), or a
technology such as *Emacs*.

Likelihood ratio
~~~~~~~~~~~~~~~~

It is a simple measure computed by comparing the frequency count of a
term in a specific corpus against its frequency count in the reference
corpus. The assumption here is that the reference corpus is a large
enough sample of the language for observing the occurrence of a term. A
higher/lower observation frequency of a term in the specific corpus is
then a proxy indicator for the term being chosen while debating the
topic.

The likelihood ratio for a term :math:`P_t` is calculated as:

:math:`P_t = log ( (ntS/NS) / (ntR/NR) )`

where

-  *ntS* is the raw frequency count of the term in the entire specific
   corpus
-  *ntR* is the raw frequency count of the term in the reference corpus
-  *NS* is the total number of terms in the specific corpus
-  *NR* is the total number of terms in the reference corpus

It should be noted that frequency counts are calculated after having
applied the same tokenization and post-processing, such as excluding
stop-words, punctuation, rare terms, etc., to both the reference corpus
and the specific corpus.
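
A minimal sketch of this score, assuming both corpora have already been tokenized and cleaned in the same way; the function name and the toy token lists are illustrative, not the omterms API.

.. code:: python

    # Illustrative sketch: P_t = log((ntS/NS) / (ntR/NR)) for every term that
    # occurs in both the specific and the reference corpus.
    from collections import Counter
    from math import log

    def likelihood_ratios(specific_tokens, reference_tokens):
        spec, ref = Counter(specific_tokens), Counter(reference_tokens)
        NS, NR = sum(spec.values()), sum(ref.values())
        return {
            term: log((count / NS) / (ref[term] / NR))
            for term, count in spec.items()
            if ref[term] > 0  # skip terms unseen in the reference corpus
        }

    specific = ["openness", "license", "freedom", "license"]
    reference = ["license", "market", "policy", "freedom", "growth", "license"]
    print(likelihood_ratios(specific, reference))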

4. Some thoughts on a conceptual approach to using the extracted keywords or phrases to predict topical relevance of a new text
-------------------------------------------------------------------------------------------------------------------------------

Using the outcome of this technique to score arbitrary input texts
against a single issue, such as financial sustainability, or against a
set of issues, such as the 10 basic human values, requires a set of
normalizations of the raw scores and their rescaling/transformation.

The factors that need to be considered are:

-  **Differing document lengths:** The likelihood of repetition of a key
   phrase increases as the size of the input text gets larger. In more
   concrete terms, a scoring that simply sums up detections of weighted
   keyphrases or words within a given input text would be very sensitive
   to the document length. For instance, an executive summary of an
   article would very likely get a much lower score than the full
   article on any issue.

   *Among other methods, this can simply be resolved by computing
   per-word scores, where the word set to be considered is the tokenized
   and cleaned set of words that represent the input text (see the
   sketch following this list).*

-  **Topical relevance:** This factor becomes important when the subject
   matter of the input texts varies among them. In other words, this
   factor would matter to a very high degree when, let's say, one wants
   to compare the perceptions of individuals on the role of privacy in
   democracies and when this question is not asked of them in a uniform
   manner, that is, under the same social, cultural, environmental and
   physical conditions.

   Let's assume that the issue under investigation is again privacy in
   democracies. It is possible that the same individual, as a blogger
   who has a strong pro-privacy opinion, (i) may not touch the issue
   while talking about data science, (ii) would slightly touch the issue
   while talking about their preferences in mobile devices, and (iii)
   dives into the subject using all the keywords and phrases when
   talking about the impact of privacy on democratic life. In brief, it
   is necessary to offset the variability of the topical relevance of an
   input text to the issue under investigation when arbitrary text
   samples are used for scoring.

   *An offsetting scheme can be devised when the opinion or perception
   of an actor is to be measured with respect to more than one factor
   that defines the issue under investigation. For instance, when we
   want to measure the position of a political leader on individual
   liberties vs social security, or when we want to profile the
   discourse of a political leader in terms of a number of basic human
   values, we could employ some simple statistical methods in order to
   offset the topical relevance of the discourses or the speeches of the
   political figure to what we would like to measure.*

   *A simple method could be rescaling the scores on each sub-factor,
   such as the scores of liberty and security that are measured from the
   same speech, into a range of -1 to 1. This can simply be done by
   taking the mean of the two, deducting the mean from each score, and
   scaling the results into a scale of -1 to 1. This way it may be
   possible to use multiple speeches of the same political figure on
   different topics to evaluate his or her position on the liberty vs
   security matter.*

In statistical terms this problem corresponds to adjusting or
normalizing ratings or scores measured on different scales to a
notionally common scale. Given the fact that in most cases a normal
distribution of the underlying factors may not be assumed, a
quantile-normalization technique is suggested. The quantile
normalization sorts and ranks the variables with non-negative
amplitudes. These rankings can then be scaled, for instance, to a 0-1
interval.
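
A minimal sketch of such a ranking-based rescaling, assuming non-negative per-word scores and the 0-1 target interval mentioned above; the helper name is illustrative.

.. code:: python

    # Illustrative sketch: rank the scores (ties averaged) and scale the ranks
    # into the [0, 1] interval.
    import numpy as np
    from scipy.stats import rankdata

    def rank_scale(scores):
        ranks = rankdata(scores)          # ranks 1..n, ties get the mean rank
        span = ranks.max() - ranks.min()
        return np.zeros_like(ranks) if span == 0 else (ranks - ranks.min()) / span

    print(rank_scale(np.array([0.02, 0.10, 0.05, 0.40])))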

-  **Level of subjectivity**. This is the variability in terms of the
   relative importance attributed to each issue out of a given set of
   issues. For instance, it is possible that a great many individuals or
   political leaders would attach a higher importance to individual
   liberties than to security, or the other way around. But the question
   might rather be to understand to what extent one attaches more
   importance to an issue than to the others. So when the objective of
   the scoring is not simply to produce an order of importance, then a
   comparative importance with respect to the overall observations needs
   to be tackled.

   *The observed variance in each query text can be considered. That is,
   simple statistical methods can be used, for instance, to compare two
   or more query texts with respect to each other. A suggested method
   would be to (1) estimate the coefficient of variation for each input
   text using per-word scores, and (2) rescale the quantile-normalized
   scores suggested above using the estimated coefficient of variation
   in each case (see the sketch below).*

   *When this rescaling is applied to, for instance, liberty vs
   security, the coefficient of variation would act as a polarization
   measure.*
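
A minimal sketch of the per-word scoring and the coefficient-of-variation rescaling suggested above; the term weights and helper names are illustrative assumptions, not the omterms API.

.. code:: python

    # Illustrative sketch: per-word scores offset document length, and the
    # coefficient of variation of the word-level scores acts as a polarization
    # measure that can rescale quantile-normalized factor scores.
    import numpy as np

    def word_scores(tokens, weights):
        # One weight per cleaned token of the query text (0 if not a keyword).
        return [weights.get(t, 0.0) for t in tokens]

    def per_word_score(tokens, weights):
        # Aggregate score normalized by text length, offsetting document length.
        return sum(word_scores(tokens, weights)) / len(tokens) if tokens else 0.0

    def coefficient_of_variation(values):
        values = np.asarray(values, dtype=float)
        return 0.0 if values.mean() == 0 else values.std() / values.mean()

    weights = {"privacy": 0.4, "democracy": 0.3, "surveillance": 0.3}
    tokens = ["privacy", "matters", "in", "a", "democracy"]

    cv = coefficient_of_variation(word_scores(tokens, weights))

    # Rescale two quantile-normalized sub-factor scores (e.g. liberty vs
    # security measured from the same speech) by the polarization measure.
    print(per_word_score(tokens, weights), cv, np.array([0.3, 0.8]) * cv)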

Scoring a group of variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When one attempts to use the scores generated by this package, based on
specific vs reference corpus comparisons, on a group of variables, then
both the ranking of the scores and the relative importance of each score
across a number of texts from the same source should be taken into
consideration.

5. State of the art
-------------------

-  Survey Paper: Kazi Saidul Hasan and Vincent Ng, 2014. “Automatic
   Keyphrase Extraction: A Survey of the State of the Art” Proceedings
   of the 52nd Annual Meeting of the Association for Computational
   Linguistics, pages 1262–1273.

-  Survey Paper: Sifatullah Siddiqi and Aditi Sharan. Article: Keyword
   and Keyphrase Extraction Techniques: A Literature Review.
   International Journal of Computer Applications 109(2):18-23, January
   2015

-  Survey Paper: Z. A. Merrouni, B. Frikh, and B. Ouhbi. Automatic
   keyphrase extraction: An overview of the state of the art. In 2016
   4th IEEE Colloquium on Information Science and Technology (CiSt),
   pages 306–313, Oct 2016

-  PageRank - Topical: Zhiyuan Liu, Wenyi Huang, Yabin Zheng and Maosong
   Sun, 2010. “Automatic Keyphrase Extraction via Topic Decomposition”.
   Proceeding EMNLP '10 Proceedings of the 2010 Conference on Empirical
   Methods in Natural Language Processing Pages 366-376

-  RAKE (Rapid Automatic Keyword Extraction): Stuart Rose, Dave Engel,
   Nick Cramer, and Wendy Cowley. Automatic keyword extraction from
   individual documents. Text Mining, pages 1–20, 2010.

-  TextRank - Graph Based : Rada Mihalcea and Paul Tarau. Textrank:
   Bringing order into texts. Association for Computational Linguistics,
   2004.

-  STOPWORDS: S. Popova, L. Kovriguina, D. Mouromtsev, and I. Khodyrev.
   Stopwords in keyphrase extraction problem. In 14th Conference

-  Corpus Similarity - Keyword frequency based: Adam Kilgarriff. Using
   word frequency lists to measure corpus homogeneity and similarity
   between corpora. In Proceedings of ACLSIGDAT Workshop on very large
   corpora, pages 231–245, 1997.

-  Recommendation - Keyphrase Based: F. Ferrara, N. Pudota and C. Tasso.
   A keyphrase-based paper recommender system. In: Digital Libraries and
   Archives. Springer Berlin Heidelberg, 2011. p. 14-25.

-  Jason Chuang, Christopher D. Manning, Jeffrey Heer, 2012. "Without
   the Clutter of Unimportant Words": Descriptive Keyphrases for Text
   Visualization. ACM Trans. on Computer-Human Interaction, 19(3), 1–29.

+--------------------------------------------------------------+
| Learn more about the OpenMaker project: http://openmaker.eu/ |
+--------------------------------------------------------------+

