Colab Notebook · Pre-trained Models · Report Bug
OpenAI recently released the paper Learning Transferable Visual Models From Natural Language Supervision, in which they present the CLIP (Contrastive Language–Image Pre-training) model. This model is trained to connect text and images by matching their corresponding vector representations using a contrastive learning objective. CLIP consists of two separate models, a visual encoder and a text encoder, which were trained on a whopping 400 million images and corresponding captions. OpenAI has since released a set of their smaller CLIP models, which can be found on the official CLIP GitHub.
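As a minimal sketch of that contrastive objective (using random stand-in embeddings and an assumed temperature of 0.07, not OpenAI's actual training code), the idea is that each image in a batch should be most similar to its own caption:

import torch
import torch.nn.functional as F

# Toy batch of N matching (image, text) pairs; the encoders are replaced
# by random embeddings purely to illustrate the objective.
N, dim = 8, 512
image_embs = F.normalize(torch.randn(N, dim), dim=-1)  # stands in for the visual encoder output
text_embs = F.normalize(torch.randn(N, dim), dim=-1)   # stands in for the text encoder output

logits = image_embs @ text_embs.T / 0.07  # temperature-scaled cosine similarities
labels = torch.arange(N)                  # matching pairs lie on the diagonal
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
print(loss.item())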
While it is possible that other versions work equally well, we have worked with the following:
pip install multilingual-clip torch
You can also choose to pip install tensorflow instead of torch. Inference code for TensorFlow is also available in inference_example.py.
from multilingual_clip import pt_multilingual_clip
import transformers
texts = [
'Three blind horses listening to Mozart.',
'Älgen är skogens konung!',
'Wie leben Eisbären in der Antarktis?',
'Вы знали, что все белые медведи левши?'
]
model_name = 'M-CLIP/XLM-Roberta-Large-Vit-L-14'
# Load Model & Tokenizer
model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
embeddings = model.forward(texts, tokenizer)
print(embeddings.shape)
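As a rough sketch of how these text embeddings can be compared against CLIP image embeddings (the Colab notebook linked above is the canonical reference), assuming the official clip package is installed and example.jpg is any local image; XLM-Roberta-Large-Vit-L-14 is intended to be paired with OpenAI's ViT-L/14 vision model:

import clip
import torch
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'
clip_model, preprocess = clip.load('ViT-L/14', device=device)

# Embed an image with the matching CLIP vision model
image = preprocess(Image.open('example.jpg')).unsqueeze(0).to(device)
with torch.no_grad():
    image_emb = clip_model.encode_image(image).float()

# Cosine similarity between the image and each of the texts above
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_embs = embeddings.to(device).float()
text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
print((image_emb @ text_embs.T).softmax(dim=-1))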
Set up a virtualenv:
python3 -m venv .env
source .env/bin/activate
pip install -e .
Every text encoder is a transformer available on Hugging Face, with an additional linear layer on top (a rough sketch of this setup follows the table below). For more information about a specific model, click the model name to see its model card.
Name | Model Base | Vision Model | Vision Dimensions | Pre-trained Languages | #Parameters |
---|---|---|---|---|---|
LABSE Vit-L/14 | LaBSE | OpenAI ViT-L/14 | 768 | 109 Languages | 110 M |
XLM-R Large Vit-B/32 | XLM-Roberta-Large | OpenAI ViT-B/32 | 512 | 100 Languages | 344 M |
XLM-R Large Vit-L/14 | XLM-Roberta-Large | OpenAI ViT-L/14 | 768 | 100 Languages | 344 M |
XLM-R Large Vit-B/16+ | XLM-Roberta-Large | Open CLIP ViT-B-16-plus-240 | 640 | 100 Languages | 344 M |
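To make the "transformer plus linear layer" structure above concrete, here is a simplified sketch (the masked mean pooling and the layer names are assumptions; the released models' exact forward pass may differ):

import torch
import transformers

class TextEncoderSketch(torch.nn.Module):
    # A Hugging Face transformer followed by a linear projection into the CLIP embedding space.
    def __init__(self, base='xlm-roberta-large', clip_dim=768):
        super().__init__()
        self.transformer = transformers.AutoModel.from_pretrained(base)
        self.projection = torch.nn.Linear(self.transformer.config.hidden_size, clip_dim)

    def forward(self, texts, tokenizer):
        tok = tokenizer(texts, padding=True, return_tensors='pt')
        hidden = self.transformer(**tok)[0]            # [batch, tokens, hidden]
        mask = tok['attention_mask'].unsqueeze(-1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)  # masked mean pooling (assumed)
        return self.projection(pooled)                 # [batch, clip_dim]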
The following table shows the Txt2Img Recall@10 for the human-translated MS-COCO test set.
Name | En | De | Es | Fr | Zh | It | Pl | Ko | Ru | Tr | Jp |
---|---|---|---|---|---|---|---|---|---|---|---|
OpenAI CLIP Vit-B/32 | 90.3 | - | - | - | - | - | - | - | - | - | - |
OpenAI CLIP Vit-L/14 | 91.8 | - | - | - | - | - | - | - | - | - | - |
OpenCLIP ViT-B-16+ | 94.3 | - | - | - | - | - | - | - | - | - | - |
LABSE Vit-L/14 | 91.6 | 89.6 | 89.5 | 89.9 | 88.9 | 90.1 | 89.8 | 80.8 | 85.5 | 89.8 | 73.9 |
XLM-R Large Vit-B/32 | 91.8 | 88.7 | 89.1 | 89.4 | 89.3 | 89.8 | 91.4 | 82.1 | 86.1 | 88.8 | 81.0 |
XLM-R Large Vit-L/14 | 92.4 | 90.6 | 91.0 | 90.0 | 89.7 | 91.1 | 91.3 | 85.2 | 85.8 | 90.3 | 81.9 |
XLM-R Large Vit-B/16+ | 95.0 | 93.0 | 93.6 | 93.1 | 94.0 | 93.1 | 94.4 | 89.0 | 90.0 | 93.0 | 84.2 |
The training curves for these models are available on this Weights & Biases page.
Older versions of M-CLIP stored the linear weights separately from Hugging Face, whereas the new models have them incorporated directly into the Hugging Face repository. More information about these older models can be found in this section.
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git
Replace cudatoolkit=11.0 above with the appropriate CUDA version on your machine, or cpuonly when installing on a machine without a GPU.
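For example, on a machine without a GPU the first line would instead be:

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cpuonly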
For more information, please see the official CLIP repository.
# Linear Model Weights
$ bash legacy_get-weights.sh
from multilingual_clip import multilingual_clip
print(multilingual_clip.AVAILABLE_MODELS.keys())
model = multilingual_clip.load_model('M-BERT-Distil-40')
embeddings = model(['Älgen är skogens konung!', 'Wie leben Eisbären in der Antarktis?', 'Вы знали, что все белые медведи левши?'])
print(embeddings.shape)
# Yields: torch.Size([3, 640])
For a more elaborate example comparing the textual embeddings to the CLIP image embeddings, see this Colab notebook.
Every text encoder is a transformer available on Hugging Face, with an additional linear layer on top. None of the models have been extensively tested, but for more information and qualitative test results for a specific model, click the model name to see its model card.
*** Make sure to update to the most recent version of the repository when downloading a new model, and re-run the shell script to download the linear weights. ***
Name | Model Base | Vision Model | Pre-trained Languages | Target Languages | #Parameters |
---|---|---|---|---|---|
Multilingual | |||||
M-BERT Distil 40 | M-BERT Distil | RN50x4 | 101 Languages | 40 Languages | 66 M |
M-BERT Base 69 | M-BERT Base | RN50x4 | 101 Languages | 68 Languages | 110 M |
M-BERT Base ViT-B | M-BERT Base | ViT-B/32 | 101 Languages | 68 Languages | 110 M |
Monolingual | |||||
Swe-CLIP 500k | KB-BERT | RN50x4 | Swedish | Swedish | 110 M |
Swe-CLIP 2M | KB-BERT | RN50x4 | Swedish | Swedish | 110 M |
This folder contains the code used for training the above models. If you wish to train your own model, you must do the following:
This Google Drive folder contains pre-computed CLIP-Text embeddings for a large portion of the image captions of GCC + MSCOCO + VizWiz.
The Google Drive folder also contains the translation data used to train the currently available models. Good luck!
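As a minimal sketch of how these assets can be used, one can regress a multilingual student encoder onto the pre-computed CLIP text embeddings of the corresponding captions; the exact loss and data pipeline of the released models may differ, and all names below are placeholders:

import torch
import torch.nn.functional as F

def training_step(student, tokenizer, captions, clip_targets, optimizer):
    # `captions` are translated sentences, `clip_targets` the pre-computed CLIP
    # text embeddings of the corresponding English captions (placeholder names).
    optimizer.zero_grad()
    predicted = student(captions, tokenizer)   # e.g. the TextEncoderSketch above
    loss = F.mse_loss(predicted, clip_targets)
    loss.backward()
    optimizer.step()
    return loss.item()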
If you have trained a CLIP text encoder specific to your language, or another model covering a language not supported here, please feel free to contact us and we will either upload your model and credit you, or simply link to your already uploaded model.
If you have questions regarding the code or otherwise related to this GitHub page, please open an issue.
For other purposes, feel free to contact me directly at: Fredrik.Carlsson@ri.se
Distributed under the MIT License. See LICENSE for more information.