Kex is a Python library for unsupervised keyword extraction, supporting the features described below.
Our paper was accepted to the EMNLP 2021 main conference 🎉 (the camera-ready version is here). In the paper, we conduct an extensive comparison and analysis of existing keyword extraction algorithms and propose two new algorithms, LexRank and LexSpec, which achieve a very competitive baseline with very low complexity. Both of our proposed algorithms are based on lexical specificity, and we have written a short introduction to lexical specificity here.
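Very roughly, lexical specificity scores how improbably often a term occurs in a document, given its frequency in a reference corpus, via the hypergeometric tail probability. The following is a minimal stdlib sketch for intuition only, not kex's actual implementation:

```python
from math import comb, log10

def lexical_specificity(T: int, t: int, F: int, f: int) -> float:
    """-log10 of P(X >= f) for X ~ Hypergeometric(T, F, t).

    T: tokens in the reference corpus, F: corpus frequency of the word,
    t: tokens in the document,        f: document frequency of the word.
    Higher scores mean the word occurs in the document more often than
    chance draws from the corpus would predict.
    """
    # Tail of the hypergeometric distribution: probability of seeing
    # at least f occurrences in a sample of t tokens.
    tail = sum(comb(F, k) * comb(T - F, t - k) for k in range(f, min(F, t) + 1))
    return -log10(tail / comb(T, t))
```

A word with 5 of its 10 corpus occurrences packed into a 100-token document scores far higher than one with a single occurrence there.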
To reproduce all the results in the paper, please follow these instructions.
Install via pip
pip install kex
The built-in algorithms in kex are listed below:
- FirstN: a heuristic baseline that picks the first n phrases as keywords
- TF: scoring by term frequency
- TFIDF: scoring by TFIDF
- LexSpec: Ushio et al., 21
- TextRank: Mihalcea et al., 04
- SingleRank: Wan et al., 08
- TopicalPageRank: Liu et al., 10
- SingleTPR: Sterckx et al., 15
- TopicRank: Bougouin et al., 13
- PositionRank: Florescu et al., 18
- TFIDFRank: SingleRank with a TFIDF-based word distribution prior
- LexRank: Ushio et al., 21

Basic usage:
>>> import kex
>>> model = kex.SingleRank() # any algorithm listed above
>>> sample = '''
We propose a novel unsupervised keyphrase extraction approach that filters candidate keywords using outlier detection.
It starts by training word embeddings on the target document to capture semantic regularities among the words. It then
uses the minimum covariance determinant estimator to model the distribution of non-keyphrase word vectors, under the
assumption that these vectors come from the same distribution, indicative of their irrelevance to the semantics
expressed by the dimensions of the learned vector representation. Candidate keyphrases only consist of words that are
detected as outliers of this dominant distribution. Empirical results show that our approach outperforms state
of-the-art and recent unsupervised keyphrase extraction methods.
'''
>>> model.get_keywords(sample, n_keywords=2)
[{'stemmed': 'non-keyphras word vector',
'pos': 'ADJ NOUN NOUN',
'raw': ['non-keyphrase word vectors'],
'offset': [[47, 49]],
'count': 1,
'score': 0.06874471825637762,
'n_source_tokens': 112},
{'stemmed': 'semant regular word',
'pos': 'ADJ NOUN NOUN',
'raw': ['semantic regularities words'],
'offset': [[28, 32]],
'count': 1,
'score': 0.06001468574146248,
'n_source_tokens': 112}]
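For intuition, the purely statistical scorers boil down to counting: a word is weighted by how often it appears in the document, optionally discounted by how many documents contain it. A minimal TF-IDF word-scoring sketch over a toy corpus (our simplification, not kex's implementation):

```python
from collections import Counter
from math import log

def tfidf_scores(document: list[str], corpus: list[list[str]]) -> dict[str, float]:
    """Score each word in `document` by term frequency * inverse document frequency.

    Assumes `document` is one of the documents in `corpus`, so every word
    has a non-zero document frequency.
    """
    tf = Counter(document)
    n_docs = len(corpus)
    return {
        w: (tf[w] / len(document)) * log(n_docs / sum(w in doc for doc in corpus))
        for w in tf
    }

doc = ['keyword', 'extraction', 'keyword']
corpus = [doc, ['extraction', 'method'], ['graph', 'ranking']]
scores = tfidf_scores(doc, corpus)
# 'keyword' is frequent here and rare elsewhere, so it outranks 'extraction'
```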
Algorithms such as TF, TFIDF, TFIDFRank, LexSpec, LexRank, TopicalPageRank, and SingleTPR need to compute a prior distribution beforehand:
>>> import kex
>>> model = kex.SingleTPR()
>>> test_sentences = ['documentA', 'documentB', 'documentC']
>>> model.train(test_sentences, export_directory='./tmp')
Priors are cached and can be loaded on the fly:
>>> import kex
>>> model = kex.SingleTPR()
>>> model.load('./tmp')
Currently the algorithms are available only in English, but we will soon relax this constraint to support other languages.
Users can fetch 15 public keyword extraction datasets via kex.get_benchmark_dataset.
>>> import kex
>>> json_line, language = kex.get_benchmark_dataset('Inspec')
>>> json_line[0]
{
'keywords': ['kind infer', 'type check', 'overload', 'nonstrict pure function program languag', ...],
'source': 'A static semantics for Haskell\nThis paper gives a static semantics for Haskell 98, a non-strict ...',
'id': '1053.txt'
}
Please take a look at the example script to run a benchmark on those datasets.
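When scoring predictions against a dataset's gold keywords, a common choice is exact match over the stemmed phrases. A minimal micro-F1 sketch (set-based exact matching is our assumption here, not necessarily the paper's exact protocol):

```python
def keyword_f1(predicted: list[str], gold: list[str]) -> float:
    """Exact-match F1 between predicted and gold (stemmed) keyword sets."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)  # phrases found in both sets
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```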
We provide an API to run a basic preprocessing pipeline, with which one can implement a custom keyword extractor.
import kex

class CustomExtractor:
    """ Custom keyword extractor example: First N keywords extractor """

    def __init__(self, maximum_word_number: int = 3):
        """ First N keywords extractor """
        self.phrase_constructor = kex.PhraseConstructor(maximum_word_number=maximum_word_number)

    def get_keywords(self, document: str, n_keywords: int = 10):
        """ Get keywords

         Parameter
        ------------------
        document: str
        n_keywords: int

         Return
        ------------------
        a list of dictionaries consisting of 'stemmed', 'pos', 'raw', 'offset', 'count'.
        eg) {'stemmed': 'grid comput', 'pos': 'ADJ NOUN', 'raw': ['grid computing'], 'offset': [[11, 12]], 'count': 1}
        """
        phrase_instance, stemmed_tokens = self.phrase_constructor.tokenize_and_stem_and_phrase(document)
        sorted_phrases = sorted(phrase_instance.values(), key=lambda x: x['offset'][0][0])
        return sorted_phrases[:min(len(sorted_phrases), n_keywords)]
If you use any of these resources, please cite the following paper:
@inproceedings{ushio-etal-2021-kex,
title={{B}ack to the {B}asics: {A} {Q}uantitative {A}nalysis of {S}tatistical and {G}raph-{B}ased {T}erm {W}eighting {S}chemes for {K}eyword {E}xtraction},
author={Ushio, Asahi and Liberatore, Federico and Camacho-Collados, Jose},
booktitle={Proceedings of the {EMNLP} 2021 Main Conference},
year = {2021},
publisher={Association for Computational Linguistics}
}