A library that tries to help you to understand (note the pun).
"What lies in word embeddings?"
This small library offers tools that make it easier to visualise both word embeddings and operations on them.
This project was initiated at Rasa as a by-product of our efforts in the developer advocacy and research teams. The project is maintained by koaning in order to support more use-cases.
This library has tools to help you understand what lies in word embeddings. This includes:
You can install the package via pip:
pip install whatlies
This will install the base dependencies. Depending on the transformers and language backends that you'll be using, you may want to install more. Here are some of the possible installation options.
pip install whatlies[spacy]
pip install whatlies[tfhub]
pip install whatlies[transformers]
If you want everything, you can install all optional dependencies via:
pip install whatlies[all]
Note that this will install dependencies but it will not install all the language models you might want to visualise. For example, you might still need to manually download spaCy models if you intend to use that backend.
More in depth getting started guides can be found on the documentation page.
The idea is that you can load embeddings from a language backend and apply mathematical operations to them.
from whatlies import EmbeddingSet
from whatlies.language import SpacyLanguage
lang = SpacyLanguage("en_core_web_md")
words = ["cat", "dog", "fish", "kitten", "man", "woman",
         "king", "queen", "doctor", "nurse"]
emb = EmbeddingSet(*[lang[w] for w in words])
emb.plot_interactive(x_axis=emb["man"], y_axis=emb["woman"])
You can even do fancy operations, like projecting onto and away from vector embeddings! You can perform these on single embeddings as well as on sets of embeddings. In the example below we attempt to filter away gender bias using linear algebra operations.
orig_chart = emb.plot_interactive('man', 'woman')
new_ts = emb | (emb['king'] - emb['queen'])
new_chart = new_ts.plot_interactive('man', 'woman')
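The linear algebra behind "projecting onto" and "projecting away from" a vector can be sketched with plain NumPy. This is only an illustration of the idea, not the library's own implementation: the helper names `project_onto` and `project_away` and the toy vectors are hypothetical and chosen for readability.

```python
import numpy as np

def project_onto(x, v):
    """Return the component of x that lies along v (hypothetical helper)."""
    return (x @ v) / (v @ v) * v

def project_away(x, v):
    """Return x with its component along v removed (hypothetical helper)."""
    return x - project_onto(x, v)

# Toy 3-d vectors standing in for real embeddings (made-up numbers).
x = np.array([2.0, 1.0, 0.0])
v = np.array([1.0, 0.0, 0.0])

along = project_onto(x, v)  # the part of x that points in the direction of v
rest = project_away(x, v)   # what is left once that part is subtracted
```

After projecting away, `rest` is orthogonal to `v`, which is why the operation can be used to try to strip a single direction (such as a gender axis) out of a set of embeddings.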
There are also transformations such as PCA and UMAP.
from whatlies.transformers import Pca, Umap
orig_chart = emb.plot_interactive('man', 'woman')
pca_plot = emb.transform(Pca(2)).plot_interactive()
umap_plot = emb.transform(Umap(2)).plot_interactive()
pca_plot | umap_plot
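To give a feel for what a two-component PCA reduction does to a set of vectors, here is a minimal NumPy sketch. It is an assumption-laden illustration rather than what `Pca(2)` runs internally: the function name `pca_reduce` is hypothetical and the input is random data standing in for real embeddings.

```python
import numpy as np

def pca_reduce(X, n_components=2):
    """Project the rows of X onto their top principal components (hypothetical helper)."""
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))  # 10 made-up "embeddings" of dimension 50
X_2d = pca_reduce(X, 2)        # one 2-d point per embedding, ready to plot
```

Each row of `X_2d` is a point that could be placed on a scatter plot, with the first column capturing the most variance.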
Every language backend in this library is available as a scikit-learn featurizer as well.
import numpy as np
from whatlies.language import BytePairLanguage
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ("embed", BytePairLanguage("en")),
    ("model", LogisticRegression())
])
X = [
    "i really like this post",
    "thanks for that comment",
    "i enjoy this friendly forum",
    "this is a bad post",
    "i dislike this article",
    "this is not well written"
]
y = np.array([1, 1, 1, 0, 0, 0])
pipe.fit(X, y)
To learn more and for a getting started guide, check out the documentation.
There are some similar projects out there, and we figured it fair to mention and compare them here.
The original inspiration for this project came from this web app and this PyData talk. It is a web app that takes a while to load, but it is really fun to play with. The goal of this project is to make it easier to make similar charts from Jupyter using different language backends.
From Google there's the TensorFlow projector project. It offers highly interactive 3D visualisations as well as some transformations via TensorBoard.
From Uber AI Labs there's parallax, which is described in a paper here. There's a common mindset in the two tools; the goal is to use arbitrary user-defined projections to understand embedding spaces better. That said, there are some differences worth mentioning.
If you want to develop locally you can start by running this command.
make develop
The documentation is generated via
make docs
Please use the following citation if you found whatlies helpful for any of your work (find the whatlies paper here):
@inproceedings{warmerdam-etal-2020-going,
title = "Going Beyond {T}-{SNE}: Exposing whatlies in Text Embeddings",
author = "Warmerdam, Vincent and
Kober, Thomas and
Tatman, Rachael",
booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.nlposs-1.8",
doi = "10.18653/v1/2020.nlposs-1.8",
pages = "52--60",
abstract = "We introduce whatlies, an open source toolkit for visually inspecting word and sentence embeddings. The project offers a unified and extensible API with current support for a range of popular embedding backends including spaCy, tfhub, huggingface transformers, gensim, fastText and BytePair embeddings. The package combines a domain specific language for vector arithmetic with visualisation tools that make exploring word embeddings more intuitive and concise. It offers support for many popular dimensionality reduction techniques as well as many interactive visualisations that can either be statically exported or shared via Jupyter notebooks. The project documentation is available from https://koaning.github.io/whatlies/.",
}