wellcomeml

Utilities for managing nlp models and for processing text-related data at the Wellcome Trust




WellcomeML utils

This package contains utility functions for common tasks at the Wellcome Trust, in particular for processing, embedding and classifying text data. It includes:

  • An intuitive sklearn-like API wrapping text vectorizers, such as Doc2Vec, BERT and SciBERT
  • A common API for off-the-shelf classifiers to allow quick iteration (e.g. Frequency Vectorizer, BERT, SciBERT, basic CNN, BiLSTM, SemanticSimilarity)
  • Utils to download and convert academic text datasets for benchmarking
  • Utils to download data from the EPMC API

For more information read the official docs.
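
As a rough illustration of what "sklearn-like" means in the list above, the sketch below shows the fit/transform/fit_transform convention on a toy bag-of-words vectorizer. ToyFrequencyVectorizer is a hypothetical stand-in written for this example; it is not a wellcomeml class, and the real vectorizers may take different arguments.

```python
from collections import Counter
from typing import Dict, List


class ToyFrequencyVectorizer:
    """Hypothetical stand-in (NOT part of wellcomeml) showing the fit/transform convention."""

    def fit(self, X: List[str]) -> "ToyFrequencyVectorizer":
        # Learn the vocabulary: one column per distinct lowercase token, in sorted order.
        tokens = sorted({token for doc in X for token in doc.lower().split()})
        self.vocabulary_: Dict[str, int] = {token: i for i, token in enumerate(tokens)}
        return self

    def transform(self, X: List[str]) -> List[List[int]]:
        # Turn each document into a row of token counts; unseen tokens are ignored.
        rows = []
        for doc in X:
            counts = Counter(doc.lower().split())
            rows.append([counts.get(token, 0) for token in self.vocabulary_])
        return rows

    def fit_transform(self, X: List[str]) -> List[List[int]]:
        # Convenience method: learn the vocabulary and vectorize in one call.
        return self.fit(X).transform(X)


if __name__ == "__main__":
    docs = ["Malaria kills", "Malaria research"]
    print(ToyFrequencyVectorizer().fit_transform(docs))
```

Because every vectorizer exposes the same three methods, one can in principle be swapped for another without changing the surrounding code.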

1. Quickstart

Installing from PyPI

pip install wellcomeml

This installs the "vanilla" package, which has only basic functionality such as io and dataset download.

If space is not a problem, you can install the full package (around 2.2GB):

pip install wellcomeml[all]

The full package is relatively big, so there are also fine-grained installations if you only need specific modules: core, transformers, tensorflow, torch and spacy. You can install one or more of them, e.g.:

pip install "wellcomeml[tensorflow,core]"

To check that your installation allows you to use a specific module, try (for example):

python -c "import wellcomeml.ml.bert_vectorizer"

If you don't have the correct dependencies installed for a module, an error will appear and point you to the right dependencies.
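
To check several modules at once, the one-liner above can be generalised with the standard library's importlib. This is a minimal sketch: check_modules is a hypothetical helper written for this example, and it assumes a missing dependency surfaces as an ImportError (or a subclass of it).

```python
import importlib
from typing import Dict, Iterable, Optional


def check_modules(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Try to import each module; map its name to None on success or to the error text."""
    results: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = None
        except ImportError as error:  # also covers ModuleNotFoundError
            results[name] = str(error)
    return results


if __name__ == "__main__":
    report = check_modules(["wellcomeml.ml.bert_vectorizer", "wellcomeml.ml.cnn"])
    for name, error in report.items():
        print(f"{name}: {'OK' if error is None else error}")
```

The error text is kept rather than discarded because, as noted above, it is what points to the missing dependencies.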

1.1 Installing wellcomeml[all] on Windows

Torch has a different installation on Windows, so it is not installed automatically with wellcomeml[all] and must be installed first. The commands below are for machines without CUDA; for machines with CUDA, see https://pytorch.org/ for the correct installation:

pip install torch==1.5.1+cpu torchvision==0.6.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
pip install wellcomeml[all]

2. Development

2.1 Build local virtualenv

make

2.2 Contributing to the docs

Make changes to the .rst files in /docs (please do not change the ones starting with wellcomeml, as those are generated automatically).

Navigate to the root of the repository and run:

make update-docs

Verify that _build/html/index.html has generated correctly and submit a PR.

2.3 Release a new version (and upload to AWS S3/PyPI/GitHub)

First, create a GitHub token with artifact write access, if you don't have one, and export it as an environment variable:

export GITHUB_TOKEN=...

The checklist for a new release is:

  • Update the version in wellcomeml/__version__.py
  • Add a changelog entry
  • Run make dist
  • Verify that the new package was generated correctly on PyPI and GitHub releases
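
Two of these steps can be sanity-checked from Python. The sketch below is not part of the project's tooling: token_is_set and installed_version are hypothetical helpers. The first confirms GITHUB_TOKEN is exported before running make dist. The second reports which version of a package is installed in the current environment, for example after a pip install of the new release. It needs Python 3.8+ for importlib.metadata.

```python
import os
from importlib import metadata
from typing import Mapping, Optional


def token_is_set(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when GITHUB_TOKEN is present and not blank."""
    return bool(env.get("GITHUB_TOKEN", "").strip())


def installed_version(package: str = "wellcomeml") -> Optional[str]:
    """Return the installed version of the package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("GITHUB_TOKEN set:", token_is_set())
    print("wellcomeml installed version:", installed_version())
```

Comparing the reported version with the value in wellcomeml/__version__.py is one way to confirm that the registry serves the release you just built.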

2.4 (Optional) Installing from other locations

pip3 install <relative path to this folder>

2.5 Transformers

On macOS, if you get a message complaining about the Rust compiler, install and initialise it with:

brew install rustup
rustup-init

3. Example usage of some modules

Examples can be found in the examples subfolder.

4. Troubleshooting

If you experience a problem installing or using WellcomeML, please open an issue. It may help to set the logging level to DEBUG with export LOGGING_LEVEL=DEBUG, which often exposes more information useful for resolving the issue.
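
When working in a notebook or another setting without a shell, the same variable can be set from Python. This is a sketch under an assumption: that WellcomeML reads LOGGING_LEVEL when it is imported, so the helper should be called before the first wellcomeml import. enable_debug_logging is a hypothetical name, not a WellcomeML function.

```python
import logging
import os


def enable_debug_logging() -> None:
    """Set LOGGING_LEVEL=DEBUG for this process and raise the root logger to DEBUG."""
    # Assumption: wellcomeml reads this variable at import time.
    os.environ["LOGGING_LEVEL"] = "DEBUG"
    # basicConfig only attaches a handler if none is configured yet,
    # so the root level is also set explicitly.
    logging.basicConfig(level=logging.DEBUG)
    logging.getLogger().setLevel(logging.DEBUG)


if __name__ == "__main__":
    enable_debug_logging()
    logging.getLogger(__name__).debug("debug logging enabled")
```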

5. Extras

| Module | Description | Extras needed |
| --- | --- | --- |
| wellcomeml.ml.attention | Classes that implement keras layers for attention/self-attention | tensorflow |
| wellcomeml.ml.bert_classifier | Classifier to facilitate fine-tuning bert/scibert | tensorflow |
| wellcomeml.ml.bert_semantic_equivalence | Classifier to learn semantic equivalence between pairs of documents | tensorflow |
| wellcomeml.ml.bert_vectorizer | Text vectorizer based on bert/scibert | torch |
| wellcomeml.ml.bilstm | BiLSTM text classifier | tensorflow |
| wellcomeml.ml.clustering | Text clustering pipeline | NA |
| wellcomeml.ml.cnn | CNN text classifier | tensorflow |
| wellcomeml.ml.doc2vec_vectorizer | Text vectorizer based on doc2vec | NA |
| wellcomeml.ml.frequency_vectorizer | Text vectorizer based on TF-IDF | NA |
| wellcomeml.ml.keras_utils | Utils for computing metrics during training | tensorflow |
| wellcomeml.ml.keras_vectorizer | Text vectorizer based on Keras | tensorflow |
| wellcomeml.ml.sent2vec_vectorizer | Text vectorizer based on Sent2Vec | (Requires sent2vec, a non-PyPI package) |
| wellcomeml.ml.similarity_entity_linking | A class to find most similar documents to a sentence in a corpus | tensorflow |
| wellcomeml.ml.spacy_classifier | A text classifier based on spacy | spacy, torch |
| wellcomeml.ml.spacy_entity_linking | Similar to similarity_entity_linking, but uses spacy | spacy |
| wellcomeml.ml.spacy_knowledge_base | Creates a knowledge base of entities, based on spacy | spacy |
| wellcomeml.ml.spacy_ner | Named entity recognition classifier based on spacy | spacy |
| wellcomeml.ml.transformers_tokenizer | Bespoke tokenizer based on transformers | transformers |
| wellcomeml.ml.vectorizer | Abstract class for vectorizers | NA |
| wellcomeml.ml.voting_classifier | Meta-classifier based on majority voting | NA |
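
The table above can be turned into a small lookup that builds the pip command for the modules you plan to use. This is an illustrative sketch only: EXTRAS_BY_MODULE copies a few rows from the table (modules marked NA are left out), and pip_command is a hypothetical helper, not something shipped with wellcomeml.

```python
from typing import Dict, Iterable, Tuple

# A few rows copied from the table above: module -> extras it needs.
EXTRAS_BY_MODULE: Dict[str, Tuple[str, ...]] = {
    "wellcomeml.ml.bert_classifier": ("tensorflow",),
    "wellcomeml.ml.bert_vectorizer": ("torch",),
    "wellcomeml.ml.cnn": ("tensorflow",),
    "wellcomeml.ml.spacy_classifier": ("spacy", "torch"),
    "wellcomeml.ml.spacy_ner": ("spacy",),
}


def pip_command(modules: Iterable[str]) -> str:
    """Build a pip install command covering the extras of the given modules."""
    # A module missing from the lookup raises KeyError rather than being silently ignored.
    extras = sorted({extra for module in modules for extra in EXTRAS_BY_MODULE[module]})
    if not extras:
        return "pip install wellcomeml"
    # Quote the requirement so shells do not interpret the square brackets.
    return 'pip install "wellcomeml[{}]"'.format(",".join(extras))


if __name__ == "__main__":
    print(pip_command(["wellcomeml.ml.spacy_classifier", "wellcomeml.ml.cnn"]))
```

Extras are deduplicated and sorted, so the same set of modules always yields the same command.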
