
langscikw is a Python package and command-line tool for bigram keyterm extraction. It is optimized for long, English-language linguistic publications and can also be applied to TeX code.
Keyword extraction is done in three steps; no preprocessing is needed. Step 3 compares candidates against a gold keywords list (keywordslist.txt by default). The number of steps can be controlled by (not) providing the relevant corpora during training. The result needs some manual correction and supplementation with relevant unigrams or trigrams.
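To make the notion of bigram keyterms concrete, here is a toy sketch of bigram candidate generation using only the standard library. This is an illustration of the general idea, not langscikw's actual scoring; the function name bigram_candidates and the tiny stopword list are ours.

```python
from collections import Counter
import re

def bigram_candidates(text, stopwords=frozenset({"the", "of", "a", "in", "is", "and"})):
    """Toy bigram candidate extraction: count adjacent word pairs,
    skipping any pair that contains a stopword."""
    words = re.findall(r"[a-z]+", text.lower())
    pairs = zip(words, words[1:])
    counts = Counter(" ".join(p) for p in pairs
                     if not (set(p) & stopwords))
    return counts.most_common()

text = ("Relative clauses in Bantu languages show agreement. "
        "Relative clauses may follow the head noun.")
print(bigram_candidates(text)[0])  # -> ('relative clauses', 2)
```

langscikw goes well beyond raw frequency counts (YAKE and TF-IDF scoring against reference corpora), but the candidate space it ranks is of this shape.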
pip3 install langscikw
Developed in Python 3.7.3 (32-bit). Requires at least Python 3.7 and the following packages: jellyfish, joblib, networkx, scikit-learn, segtok, regex.
Download the langsci corpus files from here or use your own.
The command line tool provides only a simple interface. If you'd like to customize the model parameters or the number of steps, please see below.
```
langscikw inputfile [n] [corpus1] [corpus2] [keywordslist] [--silent]
```
The keywords are printed to the console and can be redirected to a text file.
inputfile: Path to a .txt or .tex file or directory from which to extract keywords.
n: Optional. Number of keywords to extract. Defaults to 300.
corpus1: Optional. Path to the corpus for step 2, usually raw TeX files or a joblib-compressed file. If not provided, looks for corpus_tex.gz in the current directory.
corpus2: Optional. Path to the corpus for step 3, usually detexed files or a joblib-compressed file. If not provided, looks for corpus_detexed.gz in the current directory.
keywordslist: Optional. Path to a list of gold keywords for step 3. A default list based on langsci publications is installed with the package.
--silent: Optional. Only print the result to the console, no progress updates.

To customize the model parameters or the number of steps, use the Python API directly:

```python
import langscikw

input_path = "my_book"          # file/directory to extract keywords from
keywords_path = "keywords.txt"  # file to save keywords to

kwe = langscikw.KWE()
kwe.train("corpus_tex.gz", "corpus_detexed.gz")
kws = kwe.extract_keywords(input_path, n=300, dedup_lim=0.85)

with open(keywords_path, "w", encoding="utf-8") as f:
    for kw in kws:              # keywords are alphabetically sorted strings
        print(kw)
        f.write(kw + "\n")
```
n: Optional. Number of keywords to extract. Defaults to 300.
dedup_lim: Optional. Deduplication limit. Keywords that have a Jaro-Winkler similarity of more than dedup_lim to an already-accepted keyword are not added to the final list. Defaults to 0.85.

The YAKE and TF-IDF models may also be used on their own; please consult the docstrings for more information.
```python
import langscikw

yake = langscikw.yakemodel.YakeExtractor()    # -> extract_keywords()
tfidf = langscikw.tfidfmodel.TfidfExtractor() # -> train() -> extract_keywords()
```
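The dedup_lim filter described above can be sketched as follows. This is a hand-rolled approximation that uses difflib.SequenceMatcher from the standard library as a stand-in for the Jaro-Winkler similarity langscikw computes via jellyfish; the function name dedup is ours, not the package's.

```python
from difflib import SequenceMatcher

def dedup(candidates, dedup_lim=0.85):
    """Keep a candidate only if its similarity to every
    already-kept keyword is at most dedup_lim."""
    kept = []
    for cand in candidates:
        if all(SequenceMatcher(None, cand, k).ratio() <= dedup_lim
               for k in kept):
            kept.append(cand)
    return kept

print(dedup(["noun phrase", "noun phrases", "verb stem"]))
# "noun phrases" is dropped as near-duplicate of "noun phrase"
```

Raising dedup_lim keeps more near-duplicates (e.g. singular/plural variants); lowering it prunes more aggressively.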