Utilities for managing nlp models and for processing text-related data at the Wellcome Trust
This package contains utility functions for common tasks at the Wellcome Trust, in particular for processing, embedding and classifying text data.
For more information read the official docs.
Installing from PyPI
pip install wellcomeml
This installs the "vanilla" package, which has very limited functionality (for example io and dataset download).
If space is not a problem, you can install the full package (around 2.2GB):
pip install wellcomeml[all]
The full package is relatively big, so we also provide fine-grained installations if you only wish to use one specific module.
Those are core, transformers, tensorflow, torch and spacy. You can install one or more of them, e.g.:
pip install "wellcomeml[tensorflow,core]"
To check that your installation allows you to use a specific module, try (for example):
python -c "import wellcomeml.ml.bert_vectorizer"
If you don't have the correct dependencies installed for a module, an error will appear and point you to the right dependencies.
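The same check can be run for several modules at once. The following is a minimal sketch using only the standard library; the helper name can_import is hypothetical and is not part of wellcomeml.

```python
import importlib


def can_import(module_name: str) -> bool:
    """Return True if `module_name` imports cleanly, False otherwise.

    Hypothetical helper for illustration; not part of wellcomeml.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Raised when the module, or one of its dependencies, is missing.
        return False
    return True


if __name__ == "__main__":
    # Module names taken from the module table in this README.
    for name in ["wellcomeml.ml.bert_vectorizer", "wellcomeml.ml.spacy_ner"]:
        status = "ok" if can_import(name) else "missing dependencies"
        print(f"{name}: {status}")
```

A module reported as "missing dependencies" can still be imported directly to see the error message that points to the right extras.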
Torch has a different installation on Windows, so it will not be installed automatically with wellcomeml[all]. It needs to be installed first. The command below is for machines without the CUDA parallel computing platform; for machines with CUDA, see https://pytorch.org/ for the correct installation:
pip install torch==1.5.1+cpu torchvision==0.6.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
pip install wellcomeml[all]
make
Make changes to the .rst files in /docs (please do not change the ones starting with wellcomeml, as those are generated automatically)
Navigate to the root repository and run
make update-docs
Verify that _build/html/index.html has been generated correctly and submit a PR.
First create a GitHub token with artifact write access, if you don't have one, and export it as an environment variable:
export GITHUB_TOKEN=...
The checklist for a new release is:
wellcomeml/__version__.py
make dist
pip3 install <relative path to this folder>
On OSX, if you get a message complaining about the rust compiler, install and initialise it with:
brew install rustup
rustup-init
Examples can be found in the subfolder examples.
If you experience a problem installing or using WellcomeML, please open an issue. It might be worth setting the logging level to DEBUG (export LOGGING_LEVEL=DEBUG), which will often expose more information that may help resolve the issue.
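For readers who want the same behaviour in their own scripts, the sketch below shows one way an environment variable such as LOGGING_LEVEL could be mapped onto Python's standard logging module. It is an illustration under assumptions, not WellcomeML's actual implementation; the function name configure_logging and the fallback to INFO are hypothetical.

```python
import logging
import os


def configure_logging(default: str = "INFO") -> int:
    """Configure root logging from the LOGGING_LEVEL environment variable.

    Hypothetical helper: unknown level names fall back to INFO.
    """
    level_name = os.environ.get("LOGGING_LEVEL", default).upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        # e.g. LOGGING_LEVEL=verbose is not a standard level name
        level = logging.INFO
    # force=True (Python 3.8+) replaces any handlers configured earlier.
    logging.basicConfig(level=level, force=True)
    return level


if __name__ == "__main__":
    configure_logging()
    logging.getLogger(__name__).debug("Only shown when LOGGING_LEVEL=DEBUG")
```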
Module | Description | Extras needed |
---|---|---|
wellcomeml.ml.attention | Classes that implement keras layers for attention/self-attention | tensorflow |
wellcomeml.ml.bert_classifier | Classifier to facilitate fine-tuning bert/scibert | tensorflow |
wellcomeml.ml.bert_semantic_equivalence | Classifier to learn semantic equivalence between pairs of documents | tensorflow |
wellcomeml.ml.bert_vectorizer | Text vectorizer based on bert/scibert | torch |
wellcomeml.ml.bilstm | BiLSTM text classifier | tensorflow |
wellcomeml.ml.clustering | Text clustering pipeline | NA |
wellcomeml.ml.cnn | CNN Text Classifier | tensorflow |
wellcomeml.ml.doc2vec_vectorizer | Text vectorizer based on doc2vec | NA |
wellcomeml.ml.frequency_vectorizer | Text vectorizer based on TF-IDF | NA |
wellcomeml.ml.keras_utils | Utils for computing metrics during training | tensorflow |
wellcomeml.ml.keras_vectorizer | Text vectorizer based on Keras | tensorflow |
wellcomeml.ml.sent2vec_vectorizer | Text vectorizer based on Sent2Vec | (Requires sent2vec, a non-pypi package) |
wellcomeml.ml.similarity_entity_linking | A class to find the most similar documents to a sentence in a corpus | tensorflow |
wellcomeml.ml.spacy_classifier | A text classifier based on spacy | spacy, torch |
wellcomeml.ml.spacy_entity_linking | Similar to similarity_entity_linking, but uses spacy | spacy |
wellcomeml.ml.spacy_knowledge_base | Creates a knowledge base of entities, based on spacy | spacy |
wellcomeml.ml.spacy_ner | Named entity recognition classifier based on spacy | spacy |
wellcomeml.ml.transformers_tokenizer | Bespoke tokenizer based on transformers | transformers |
wellcomeml.ml.vectorizer | Abstract class for vectorizers | NA |
wellcomeml.ml.voting_classifier | Meta-classifier based on majority voting | NA |
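The table can be used to work out which extras to install for a given set of modules. The sketch below hard-codes a few rows from the table and builds the matching pip command; the names EXTRAS_BY_MODULE and pip_command are hypothetical, and modules listed as NA are deliberately left out because the table does not say which installation they need.

```python
# A few rows copied from the table above (module -> extras needed).
EXTRAS_BY_MODULE = {
    "wellcomeml.ml.bert_classifier": {"tensorflow"},
    "wellcomeml.ml.bert_vectorizer": {"torch"},
    "wellcomeml.ml.spacy_classifier": {"spacy", "torch"},
    "wellcomeml.ml.spacy_ner": {"spacy"},
    "wellcomeml.ml.transformers_tokenizer": {"transformers"},
}


def pip_command(modules):
    """Build a pip command covering the extras of all `modules`.

    Hypothetical helper; raises KeyError for modules not in the mapping.
    """
    extras = set()
    for module in modules:
        extras |= EXTRAS_BY_MODULE[module]
    # Quote the requirement so the shell does not interpret the brackets.
    return 'pip install "wellcomeml[{}]"'.format(",".join(sorted(extras)))


if __name__ == "__main__":
    print(pip_command(["wellcomeml.ml.bert_vectorizer", "wellcomeml.ml.spacy_ner"]))
    # prints: pip install "wellcomeml[spacy,torch]"
```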
FAQs
We found that wellcomeml demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.