sonar-space

SONAR provides a set of speech and text encoders for multilingual, multimodal semantic embedding.

Source: pip (PyPI)
Version: 0.3.2
Maintainers: 4

SONAR

[Paper] [Demo]

We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. It substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks.

Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. We also provide a single text decoder, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations.

SONAR stands for Sentence-level multimOdal and laNguage-Agnostic Representations.

The full list of supported languages (along with download links) can be found below.

SONAR Architecture:


Text results


Speech results


Installing

You can install SONAR with pip install sonar-space. Note that there is another sonar package on pip that IS NOT this project; make sure to use sonar-space in your dependencies.

Note that SONAR depends on fairseq2, whose build must precisely match your versions of PyTorch and CUDA (here are the possible variants). You can check which version of PyTorch you have with pip show torch. For example, if it is 2.6.0+cu124, you should install fairseq2 from the following source:

pip install fairseq2 --extra-index-url https://fair.pkg.atmeta.com/fairseq2/whl/pt2.6.0/cu124

If fairseq2 does not provide a build for your machine, check the readme of that project to build it locally.
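
The index URL above follows a pt<torch version>/<variant> pattern. As a minimal sketch, assuming that this pattern also holds for other builds (not every combination is published, so check the list of variants first), a hypothetical helper fairseq2_index_url can derive the URL from the version string reported by pip show torch:

```python
def fairseq2_index_url(torch_version: str) -> str:
    """Build the fairseq2 wheel index URL for a torch version string.

    Hypothetical helper: it only mirrors the URL pattern shown above and
    does not check that a build for this combination actually exists.
    """
    base = "https://fair.pkg.atmeta.com/fairseq2/whl"
    # "2.6.0+cu124" -> ("2.6.0", "cu124"); without "+" there is no variant tag
    version, _, variant = torch_version.partition("+")
    if not version or not variant:
        raise ValueError(f"expected something like '2.6.0+cu124', got {torch_version!r}")
    return f"{base}/pt{version}/{variant}"


print(fairseq2_index_url("2.6.0+cu124"))
# https://fair.pkg.atmeta.com/fairseq2/whl/pt2.6.0/cu124
```

The resulting string is what would be passed to pip install fairseq2 --extra-index-url.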

We recommend installing SONAR only after you have a correct version of fairseq2 installed. Note that SONAR currently relies on the stable version of fairseq2>=0.5.2 (with minor variations possible).

If you want to install SONAR manually, you can install it locally from a checkout of the source code:

pip install --upgrade pip
pip install -e .

Versions

Unfortunately, SONAR code is tightly coupled to fairseq2 code, so only specific versions are compatible with each other:

  • sonar-space~=0.5.0 (the current version) requires fairseq2>=0.5.2
  • sonar-space~=0.4.0 required fairseq2~=0.4.0
  • sonar-space~=0.2.0 required fairseq2~=0.2.0

In the future, when the fairseq2 interface stabilizes, we hope to make the version dependencies more loosely coupled.
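
The pairings listed above can be written down as a small lookup. This is only a sketch: REQUIRED_FAIRSEQ2 and required_fairseq2 are hypothetical names, and the table contains nothing beyond the three documented pairings.

```python
# Compatibility table copied from the list above:
# (major, minor) of sonar-space -> fairseq2 requirement.
REQUIRED_FAIRSEQ2 = {
    (0, 5): "fairseq2>=0.5.2",
    (0, 4): "fairseq2~=0.4.0",
    (0, 2): "fairseq2~=0.2.0",
}


def required_fairseq2(sonar_version: str) -> str:
    """Return the documented fairseq2 requirement for a sonar-space version."""
    parts = sonar_version.split(".")
    key = (int(parts[0]), int(parts[1]))
    if key not in REQUIRED_FAIRSEQ2:
        raise ValueError(f"no documented fairseq2 pairing for sonar-space {sonar_version}")
    return REQUIRED_FAIRSEQ2[key]


print(required_fairseq2("0.4.3"))  # fairseq2~=0.4.0
```

Versions outside the documented list raise an error rather than guessing a pairing.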

Usage

fairseq2 will automatically download models into your $TORCH_HOME/hub directory upon using the commands below.

Compute text sentence embeddings with SONAR:

from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
t2vec_model = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder",
                                           tokenizer="text_sonar_basic_encoder")
sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
embeddings = t2vec_model.predict(sentences, source_lang="eng_Latn")
print(embeddings.shape)
# torch.Size([2, 1024])
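
Because every sentence is mapped to a vector of the same size, embeddings can be compared with a vector similarity. The sketch below is an illustration and not part of the SONAR API: it assumes cosine similarity as the measure, works on plain Python lists (for instance embeddings.tolist()), and uses toy 3-dimensional vectors in place of real 1024-dimensional embeddings. The helper names cosine_similarity and nearest are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, candidates):
    """Index of the candidate vector most similar to the query."""
    return max(range(len(candidates)),
               key=lambda i: cosine_similarity(query, candidates[i]))


# Toy vectors standing in for rows of `embeddings.tolist()`
query = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.1, 2.0]]
print(nearest(query, candidates))  # 1
```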

Note that by default, all SONAR models are loaded to a CPU device, which is relatively slow. If you want to use a GPU instead, you should provide the device argument when initializing the model (this applies to every model). Similarly, you can pass a dtype argument. For example:

import torch
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

embedder = TextToEmbeddingModelPipeline(
  encoder="text_sonar_basic_encoder", 
  tokenizer="text_sonar_basic_encoder", 
  device=torch.device("cuda"),
  dtype=torch.float16,
)

Reconstruct text from SONAR embeddings

from sonar.inference_pipelines.text import EmbeddingToTextModelPipeline
vec2text_model = EmbeddingToTextModelPipeline(decoder="text_sonar_basic_decoder",
                                              tokenizer="text_sonar_basic_encoder")
reconstructed = vec2text_model.predict(embeddings, target_lang="eng_Latn", max_seq_len=512)
# max_seq_len is a keyword argument passed to the fairseq2 BeamSearchSeq2SeqGenerator.
print(reconstructed)
# ['My name is SONAR.', 'I can embed the sentences into vector space.']

By default, text generation in SONAR is based on beam search (BeamSearchSeq2SeqGenerator from fairseq2) with the default setting of beam_size=5. If one passes a sampler argument, we will use a SamplingSeq2SeqGenerator instead. All additional arguments are passed to the generator constructor. For example:

from fairseq2.generation import TopPSampler, TopKSampler
embeddings = t2vec_model.predict(["Bonjour le monde!"] * 10, source_lang="fra_Latn")
vec2text_model.predict(embeddings, target_lang="eng_Latn", sampler=TopPSampler(0.99), max_seq_len=128)
# ['Hello, the world!',
#  'Hey, everybody!',
#  'Good day to you, world!',
#  'Hello, the world!',
#  'Hello, people.',
#  'Hello, everybody, around the world.',
#  'Hello, world. How are you?',
#  "Hey, what's up?",
#  'Good afternoon, everyone.',
#  'Hello to the world!']
# the outputs are now random, so they will be different every time

Note that the sampler argument was a signal to use a SamplingSeq2SeqGenerator instead of a BeamSearchSeq2SeqGenerator, and the max_seq_len argument was passed to the SamplingSeq2SeqGenerator constructor.
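
The dispatch rule described above can be pictured with the following sketch. It is not the SONAR source code: BeamSearchStub, SamplingStub, and build_generator are hypothetical stand-ins that only mimic the rule that a sampler argument selects sampling and every other keyword argument goes to the generator constructor.

```python
class BeamSearchStub:
    """Stand-in for BeamSearchSeq2SeqGenerator (hypothetical, illustration only)."""

    def __init__(self, beam_size=5, **kwargs):
        self.beam_size = beam_size
        self.kwargs = kwargs


class SamplingStub:
    """Stand-in for SamplingSeq2SeqGenerator (hypothetical, illustration only)."""

    def __init__(self, sampler, **kwargs):
        self.sampler = sampler
        self.kwargs = kwargs


def build_generator(sampler=None, **generator_kwargs):
    """Pick the generator class based on whether `sampler` was given."""
    if sampler is not None:
        return SamplingStub(sampler, **generator_kwargs)
    return BeamSearchStub(**generator_kwargs)


beam = build_generator(max_seq_len=512)
sampled = build_generator(sampler="top-p", max_seq_len=128)
print(type(beam).__name__, beam.beam_size, beam.kwargs)
# BeamSearchStub 5 {'max_seq_len': 512}
print(type(sampled).__name__, sampled.kwargs)
# SamplingStub {'max_seq_len': 128}
```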

Translate text with SONAR

from sonar.inference_pipelines.text import TextToTextModelPipeline
t2t_model = TextToTextModelPipeline(encoder="text_sonar_basic_encoder",
                                    decoder="text_sonar_basic_decoder",
                                    tokenizer="text_sonar_basic_encoder")  # tokenizer is attached to both encoder and decoder cards

sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
t2t_model.predict(sentences, source_lang="eng_Latn", target_lang="fra_Latn")
# ['Mon nom est SONAR.', "Je peux intégrer les phrases dans l'espace vectoriel."]

Compute speech sentence embeddings with SONAR

from sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline
s2vec_model = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_eng")

s2vec_model.predict(["./tests/integration_tests/data/audio_files/audio_1.wav",
                     "./tests/integration_tests/data/audio_files/audio_2.wav"]).shape
# torch.Size([2, 1024])
import torchaudio
inp, sr = torchaudio.load("./tests/integration_tests/data/audio_files/audio_1.wav")
assert sr == 16000, "Sample rate should be 16kHz"

s2vec_model.predict([inp]).shape
# torch.Size([1, 1024])

Speech-to-text translation with SONAR

from sonar.inference_pipelines.speech import SpeechToTextModelPipeline

s2t_model = SpeechToTextModelPipeline(encoder="sonar_speech_encoder_eng",
                                      decoder="text_sonar_basic_decoder",
                                      tokenizer="text_sonar_basic_decoder")

import torchaudio
inp, sr = torchaudio.load("./tests/integration_tests/data/audio_files/audio_1.wav")
assert sr == 16000, "Sample rate should be 16kHz"

# passing loaded audio files
s2t_model.predict([inp], target_lang="eng_Latn")
# ['Television reports show white smoke coming from the plant.']

# passing multiple wav files
s2t_model.predict(["./tests/integration_tests/data/audio_files/audio_1.wav",
                   "./tests/integration_tests/data/audio_files/audio_2.wav"], target_lang="eng_Latn")
# ['Television reports show white smoke coming from the plant.',
# 'These couples may choose to make an adoption plan for their baby.']

Predicting sentence similarity with BLASER 2.0 models

BLASER 2.0 is a family of models for automatic evaluation of machine translation quality based on SONAR embeddings. They predict cross-lingual semantic similarity between the translation and the source (optionally, also using a reference translation).

import torch
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
from sonar.models.blaser.loader import load_blaser_model

blaser_ref = load_blaser_model("blaser_2_0_ref").eval()
blaser_qe = load_blaser_model("blaser_2_0_qe").eval()
text_embedder = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder")

src_embs = text_embedder.predict(["Le chat s'assit sur le tapis."], source_lang="fra_Latn")
ref_embs = text_embedder.predict(["The cat sat on the mat."], source_lang="eng_Latn")
mt_embs = text_embedder.predict(["The cat sat down on the carpet."], source_lang="eng_Latn")

with torch.inference_mode():
    print(blaser_ref(src=src_embs, ref=ref_embs, mt=mt_embs).item())  # 4.688
    print(blaser_qe(src=src_embs, mt=mt_embs).item())  # 4.708

Detailed model cards with more examples: facebook/blaser-2.0-ref, facebook/blaser-2.0-qe.

Classifying the toxicity of sentences with MuTox

MuTox is the first highly multilingual audio-based (binary) toxicity classifier, released together with a dataset with toxicity labels. The dataset consists of 20k audio utterances for English and Spanish, and 4k for the other 19 languages, and the classifier uses the multimodal and multilingual encoders from SONAR. The output of the MuTox classifier is a logit of the evaluated input being "toxic", according to the definition adopted in the corresponding dataset.

from sonar.models.mutox.loader import load_mutox_model
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
import torch

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    dtype = torch.float16
else:
    device = torch.device("cpu")
    dtype = torch.float32

t2vec_model = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
    device=device,
)
text_column='lang_txt'
classifier = load_mutox_model(
    "sonar_mutox",
    device=device,
    dtype=dtype,
).eval()

with torch.inference_mode():
    emb = t2vec_model.predict(["De peur que le pays ne se prostitue et ne se remplisse de crimes."], source_lang='fra_Latn')
    x = classifier(emb.to(device).to(dtype)) 
    print(x) # tensor([[-19.7812]], device='cuda:0', dtype=torch.float16)

with torch.inference_mode():
    emb = t2vec_model.predict(["She worked hard and made a significant contribution to the team."], source_lang='eng_Latn')
    x = classifier(emb.to(device).to(dtype))
    print(x) # tensor([[-53.5938]], device='cuda:0', dtype=torch.float16)

with torch.inference_mode():
    emb = t2vec_model.predict(["El no tiene ni el más mínimo talento, todo lo que ha logrado ha sido gracias a sobornos y manipulaciones."], source_lang='spa_Latn')
    x = classifier(emb.to(device).to(dtype))
    print(x) # tensor([[-21.4062]], device='cuda:0', dtype=torch.float16)
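
Since the classifier returns a logit, a probability-like score can be obtained with the logistic sigmoid. The following is a minimal sketch under that assumption (the usual reading of a binary logit); logit_to_probability is a hypothetical helper, not a SONAR function, and it operates on plain floats such as x.item().

```python
import math


def logit_to_probability(logit: float) -> float:
    """Numerically stable logistic sigmoid."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


# Logits printed by the three examples above
for logit in (-19.7812, -53.5938, -21.4062):
    print(f"{logit:9.4f} -> {logit_to_probability(logit):.3e}")
```

A logit of 0 corresponds to a score of 0.5, so the strongly negative logits above all map to scores very close to 0.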

For a CLI way of running the MuTox pipeline, go to Seamless Communication/.../MuTox.

Demo notebooks

See more complete demo notebooks:

Troubleshooting

  • In case of errors like fairseq2.assets.card.AssetCardError: Model checkpoint of the blaser_2_0_qe asset card cannot be loaded, try removing the fairseq2 assets cache (located in ~/.cache/fairseq2); it might be that some of the downloaded model checkpoints are invalid.
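
Removing the cache can also be done from Python. The sketch below assumes the cache location mentioned above (~/.cache/fairseq2); clear_fairseq2_cache is a hypothetical helper, and the demonstration runs on a throwaway directory so that nothing real is deleted.

```python
import shutil
import tempfile
from pathlib import Path


def clear_fairseq2_cache(cache_dir=None) -> bool:
    """Delete the assets cache directory; return True if something was removed."""
    if cache_dir is None:
        # Location named in the troubleshooting note above
        cache_dir = Path.home() / ".cache" / "fairseq2"
    cache_dir = Path(cache_dir).expanduser()
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True


# Demonstration on a temporary directory instead of the real cache
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "fairseq2"
    demo.mkdir()
    print(clear_fairseq2_cache(demo))  # True
    print(clear_fairseq2_cache(demo))  # False
```

Calling clear_fairseq2_cache() without arguments targets the real cache, after which the checkpoints are downloaded again on next use.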

The SONAR text encoder and decoder support 200 languages. SONAR speech encoders support 37 languages.

Available text encoders/decoders
model | link
encoder | download
decoder | download
finetuned decoder | download
tokenizer | download

The languages supported by SONAR text encoders/decoders are all the 202 languages from the NLLB-200 models. They comprise all 204 FLORES-200 languages except arb_Latn and min_Arab (note that sat_Olck is supported under the name sat_Beng, although Olck is the correct script).
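
The exceptions named above can be captured in a small mapping from FLORES-200 codes to the codes SONAR expects. This is a sketch: to_sonar_lang_code is a hypothetical helper, and it does not check that its input is a valid FLORES-200 code in the first place.

```python
# FLORES-200 codes without SONAR support, and the one renamed code,
# both taken from the paragraph above.
UNSUPPORTED = {"arb_Latn", "min_Arab"}
RENAMED = {"sat_Olck": "sat_Beng"}


def to_sonar_lang_code(flores_code: str) -> str:
    """Map a FLORES-200 language code to the code SONAR expects."""
    if flores_code in UNSUPPORTED:
        raise ValueError(f"{flores_code} is not supported by the SONAR text models")
    return RENAMED.get(flores_code, flores_code)


print(to_sonar_lang_code("sat_Olck"))  # sat_Beng
print(to_sonar_lang_code("fra_Latn"))  # fra_Latn
```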

See more details on the languages list in the No Language Left Behind paper (the table below is based on Table 1 in this paper):

flores_lang_code | sonar_lang_code | lang_name | script | family | subgrouping | resource_level | variety
ace_Arab | ace_Arab | Acehnese | Arabic | Austronesian | Malayo-Polynesian | Low | North Acehnese
ace_Latn | ace_Latn | Acehnese | Latin | Austronesian | Malayo-Polynesian | Low | North Acehnese
acm_Arab | acm_Arab | Mesopotamian Arabic | Arabic | Afro-Asiatic | Semitic | Low | Baghdadi
acq_Arab | acq_Arab | Taʽizzi-Adeni Arabic | Arabic | Afro-Asiatic | Semitic | Low |
aeb_Arab | aeb_Arab | Tunisian Arabic | Arabic | Afro-Asiatic | Semitic | Low | Derja
afr_Latn | afr_Latn | Afrikaans | Latin | Indo-European | Germanic | High |
ajp_Arab | ajp_Arab | South Levantine Arabic | Arabic | Afro-Asiatic | Semitic | Low | Ammani
aka_Latn | aka_Latn | Akan | Latin | Atlantic-Congo | Kwa Volta-Congo | Low | Asante
amh_Ethi | amh_Ethi | Amharic | Geʽez | Afro-Asiatic | Semitic | Low | Addis Ababa
apc_Arab | apc_Arab | North Levantine Arabic | Arabic | Afro-Asiatic | Semitic | Low |
arb_Arab | arb_Arab | Modern Standard Arabic | Arabic | Afro-Asiatic | Semitic | High |
arb_Latn | - | Modern Standard Arabic | Latin | Afro-Asiatic | Semitic | Low |
ars_Arab | ars_Arab | Najdi Arabic | Arabic | Afro-Asiatic | Semitic | Low |
ary_Arab | ary_Arab | Moroccan Arabic | Arabic | Afro-Asiatic | Semitic | Low |
arz_Arab | arz_Arab | Egyptian Arabic | Arabic | Afro-Asiatic | Semitic | Low |
asm_Beng | asm_Beng | Assamese | Bengali | Indo-European | Indo-Aryan | Low | Eastern
ast_Latn | ast_Latn | Asturian | Latin | Indo-European | Italic | Low | Central
awa_Deva | awa_Deva | Awadhi | Devanagari | Indo-European | Indo-Aryan | Low | Ayodhya
ayr_Latn | ayr_Latn | Central Aymara | Latin | Aymaran | Central Southern Aymara | Low | Aymara La Paz jilata
azb_Arab | azb_Arab | South Azerbaijani | Arabic | Turkic | Common Turkic | Low | Tabrizi
azj_Latn | azj_Latn | North Azerbaijani | Latin | Turkic | Common Turkic | Low | Shirvan
bak_Cyrl | bak_Cyrl | Bashkir | Cyrillic | Turkic | Common Turkic | Low | Literary
bam_Latn | bam_Latn | Bambara | Latin | Mande | Western Mande | Low |
ban_Latn | ban_Latn | Balinese | Latin | Austronesian | Malayo-Polynesian | Low |
bel_Cyrl | bel_Cyrl | Belarusian | Cyrillic | Indo-European | Balto-Slavic | Low | Central
bem_Latn | bem_Latn | Bemba | Latin | Atlantic-Congo | Benue-Congo | Low | Central
ben_Beng | ben_Beng | Bengali | Bengali | Indo-European | Indo-Aryan | High | Rarhi
bho_Deva | bho_Deva | Bhojpuri | Devanagari | Indo-European | Indo-Aryan | Low |
bjn_Arab | bjn_Arab | Banjar | Arabic | Austronesian | Malayo-Polynesian | Low | Banjar Kuala
bjn_Latn | bjn_Latn | Banjar | Latin | Austronesian | Malayo-Polynesian | Low | Banjar Kuala
bod_Tibt | bod_Tibt | Standard Tibetan | Tibetan | Sino-Tibetan | Bodic | Low | Lhasa
bos_Latn | bos_Latn | Bosnian | Latin | Indo-European | Balto-Slavic | High |
bug_Latn | bug_Latn | Buginese | Latin | Austronesian | Malayo-Polynesian | Low | Bone
bul_Cyrl | bul_Cyrl | Bulgarian | Cyrillic | Indo-European | Balto-Slavic | High |
cat_Latn | cat_Latn | Catalan | Latin | Indo-European | Italic | High |
ceb_Latn | ceb_Latn | Cebuano | Latin | Austronesian | Malayo-Polynesian | Low |
ces_Latn | ces_Latn | Czech | Latin | Indo-European | Balto-Slavic | High |
cjk_Latn | cjk_Latn | Chokwe | Latin | Atlantic-Congo | Benue-Congo | Low |
ckb_Arab | ckb_Arab | Central Kurdish | Arabic | Indo-European | Iranian | Low |
crh_Latn | crh_Latn | Crimean Tatar | Latin | Turkic | Common Turkic | Low |
cym_Latn | cym_Latn | Welsh | Latin | Indo-European | Celtic | Low | Y Wyndodeg
dan_Latn | dan_Latn | Danish | Latin | Indo-European | Germanic | High |
deu_Latn | deu_Latn | German | Latin | Indo-European | Germanic | High |
dik_Latn | dik_Latn | Southwestern Dinka | Latin | Nilotic | Western Nilotic | Low | Rek
dyu_Latn | dyu_Latn | Dyula | Latin | Mande | Western Mande | Low |
dzo_Tibt | dzo_Tibt | Dzongkha | Tibetan | Sino-Tibetan | Bodic | Low |
ell_Grek | ell_Grek | Greek | Greek | Indo-European | Graeco-Phrygian | High |
eng_Latn | eng_Latn | English | Latin | Indo-European | Germanic | High |
epo_Latn | epo_Latn | Esperanto | Latin | Constructed | Esperantic | Low |
est_Latn | est_Latn | Estonian | Latin | Uralic | Finnic | High |
eus_Latn | eus_Latn | Basque | Latin | Basque |  | High |
ewe_Latn | ewe_Latn | Ewe | Latin | Atlantic-Congo | Kwa Volta-Congo | Low | Aŋlo
fao_Latn | fao_Latn | Faroese | Latin | Indo-European | Germanic | Low |
fij_Latn | fij_Latn | Fijian | Latin | Austronesian | Malayo-Polynesian | Low | Bau
fin_Latn | fin_Latn | Finnish | Latin | Uralic | Finnic | High |
fon_Latn | fon_Latn | Fon | Latin | Atlantic-Congo | Kwa Volta-Congo | Low |
fra_Latn | fra_Latn | French | Latin | Indo-European | Italic | High |
fur_Latn | fur_Latn | Friulian | Latin | Indo-European | Italic | Low | Central
fuv_Latn | fuv_Latn | Nigerian Fulfulde | Latin | Atlantic-Congo | North-Central Atlantic | Low | Sokoto
gla_Latn | gla_Latn | Scottish Gaelic | Latin | Indo-European | Celtic | Low | Northern Hebrides
gle_Latn | gle_Latn | Irish | Latin | Indo-European | Celtic | Low |
glg_Latn | glg_Latn | Galician | Latin | Indo-European | Italic | Low |
grn_Latn | grn_Latn | Guarani | Latin | Tupian | Maweti-Guarani | Low |
guj_Gujr | guj_Gujr | Gujarati | Gujarati | Indo-European | Indo-Aryan | Low | Amdavadi/Surti
hat_Latn | hat_Latn | Haitian Creole | Latin | Indo-European | Italic | Low |
hau_Latn | hau_Latn | Hausa | Latin | Afro-Asiatic | Chadic | Low |
heb_Hebr | heb_Hebr | Hebrew | Hebrew | Afro-Asiatic | Semitic | High |
hin_Deva | hin_Deva | Hindi | Devanagari | Indo-European | Indo-Aryan | High |
hne_Deva | hne_Deva | Chhattisgarhi | Devanagari | Indo-European | Indo-Aryan | Low |
hrv_Latn | hrv_Latn | Croatian | Latin | Indo-European | Balto-Slavic | High |
hun_Latn | hun_Latn | Hungarian | Latin | Uralic |  | High |
hye_Armn | hye_Armn | Armenian | Armenian | Indo-European | Armenic | Low | Yerevan
ibo_Latn | ibo_Latn | Igbo | Latin | Atlantic-Congo | Benue-Congo | Low | Central
ilo_Latn | ilo_Latn | Ilocano | Latin | Austronesian | Malayo-Polynesian | Low |
ind_Latn | ind_Latn | Indonesian | Latin | Austronesian | Malayo-Polynesian | High |
isl_Latn | isl_Latn | Icelandic | Latin | Indo-European | Germanic | High |
ita_Latn | ita_Latn | Italian | Latin | Indo-European | Italic | High |
jav_Latn | jav_Latn | Javanese | Latin | Austronesian | Malayo-Polynesian | Low |
jpn_Jpan | jpn_Jpan | Japanese | Japanese | Japonic | Japanesic | High |
kab_Latn | kab_Latn | Kabyle | Latin | Afro-Asiatic | Berber | Low | North Eastern
kac_Latn | kac_Latn | Jingpho | Latin | Sino-Tibetan | Brahmaputran | Low |
kam_Latn | kam_Latn | Kamba | Latin | Atlantic-Congo | Benue-Congo | Low | Machakos
kan_Knda | kan_Knda | Kannada | Kannada | Dravidian | South Dravidian | Low | Central
kas_Arab | kas_Arab | Kashmiri | Arabic | Indo-European | Indo-Aryan | Low | Kishtwari
kas_Deva | kas_Deva | Kashmiri | Devanagari | Indo-European | Indo-Aryan | Low | Kishtwari
kat_Geor | kat_Geor | Georgian | Georgian | Kartvelian | Georgian-Zan | Low | Kartlian
knc_Arab | knc_Arab | Central Kanuri | Arabic | Saharan | Western Saharan | Low | Yerwa
knc_Latn | knc_Latn | Central Kanuri | Latin | Saharan | Western Saharan | Low | Yerwa
kaz_Cyrl | kaz_Cyrl | Kazakh | Cyrillic | Turkic | Common Turkic | High |
kbp_Latn | kbp_Latn | Kabiyè | Latin | Atlantic-Congo | North Volta-Congo | Low | Kɛ̀̀wɛ
kea_Latn | kea_Latn | Kabuverdianu | Latin | Indo-European | Italic | Low | Sotavento
khm_Khmr | khm_Khmr | Khmer | Khmer | Austroasiatic | Khmeric | Low | Central
kik_Latn | kik_Latn | Kikuyu | Latin | Atlantic-Congo | Benue-Congo | Low | Southern
kin_Latn | kin_Latn | Kinyarwanda | Latin | Atlantic-Congo | Benue-Congo | Low |
kir_Cyrl | kir_Cyrl | Kyrgyz | Cyrillic | Turkic | Common Turkic | Low | Northern
kmb_Latn | kmb_Latn | Kimbundu | Latin | Atlantic-Congo | Benue-Congo | Low |
kmr_Latn | kmr_Latn | Northern Kurdish | Latin | Indo-European | Iranian | Low |
kon_Latn | kon_Latn | Kikongo | Latin | Atlantic-Congo | Benue-Congo | Low |
kor_Hang | kor_Hang | Korean | Hangul | Koreanic | Korean | High |
lao_Laoo | lao_Laoo | Lao | Lao | Tai-Kadai | Kam-Tai | Low | Vientiane
lij_Latn | lij_Latn | Ligurian | Latin | Indo-European | Italic | Low | Zeneise
lim_Latn | lim_Latn | Limburgish | Latin | Indo-European | Germanic | Low | Maastrichtian
lin_Latn | lin_Latn | Lingala | Latin | Atlantic-Congo | Benue-Congo | Low |
lit_Latn | lit_Latn | Lithuanian | Latin | Indo-European | Balto-Slavic | High |
lmo_Latn | lmo_Latn | Lombard | Latin | Indo-European | Italic | Low | Western
ltg_Latn | ltg_Latn | Latgalian | Latin | Indo-European | Balto-Slavic | Low | Central
ltz_Latn | ltz_Latn | Luxembourgish | Latin | Indo-European | Germanic | Low |
lua_Latn | lua_Latn | Luba-Kasai | Latin | Atlantic-Congo | Benue-Congo | Low |
lug_Latn | lug_Latn | Ganda | Latin | Atlantic-Congo | Benue-Congo | Low |
luo_Latn | luo_Latn | Luo | Latin | Nilotic | Western Nilotic | Low |
lus_Latn | lus_Latn | Mizo | Latin | Sino-Tibetan | Kuki-Chin-Naga | Low | Aizawl
lvs_Latn | lvs_Latn | Standard Latvian | Latin | Indo-European | Balto-Slavic | High |
mag_Deva | mag_Deva | Magahi | Devanagari | Indo-European | Indo-Aryan | Low | Gaya
mai_Deva | mai_Deva | Maithili | Devanagari | Indo-European | Indo-Aryan | Low |
mal_Mlym | mal_Mlym | Malayalam | Malayalam | Dravidian | South Dravidian | Low |
mar_Deva | mar_Deva | Marathi | Devanagari | Indo-European | Indo-Aryan | Low | Varhadi
min_Arab | - | Minangkabau | Arabic | Austronesian | Malayo-Polynesian | Low | Agam-Tanah Datar
min_Latn | min_Latn | Minangkabau | Latin | Austronesian | Malayo-Polynesian | Low | Agam-Tanah Datar
mkd_Cyrl | mkd_Cyrl | Macedonian | Cyrillic | Indo-European | Balto-Slavic | High |
plt_Latn | plt_Latn | Plateau Malagasy | Latin | Austronesian | Malayo-Polynesian | Low | Merina
mlt_Latn | mlt_Latn | Maltese | Latin | Afro-Asiatic | Semitic | High |
mni_Beng | mni_Beng | Meitei | Bengali | Sino-Tibetan | Kuki-Chin-Naga | Low |
khk_Cyrl | khk_Cyrl | Halh Mongolian | Cyrillic | Mongolic-Khitan | Mongolic | Low |
mos_Latn | mos_Latn | Mossi | Latin | Atlantic-Congo | North Volta-Congo | Low | Ouagadougou
mri_Latn | mri_Latn | Maori | Latin | Austronesian | Malayo-Polynesian | Low | Waikato-Ngapuhi
mya_Mymr | mya_Mymr | Burmese | Myanmar | Sino-Tibetan | Burmo-Qiangic | Low | Mandalay-Yangon
nld_Latn | nld_Latn | Dutch | Latin | Indo-European | Germanic | High |
nno_Latn | nno_Latn | Norwegian Nynorsk | Latin | Indo-European | Germanic | Low |
nob_Latn | nob_Latn | Norwegian Bokmål | Latin | Indo-European | Germanic | Low |
npi_Deva | npi_Deva | Nepali | Devanagari | Indo-European | Indo-Aryan | Low | Eastern
nso_Latn | nso_Latn | Northern Sotho | Latin | Atlantic-Congo | Benue-Congo | Low |
nus_Latn | nus_Latn | Nuer | Latin | Nilotic | Western Nilotic | Low |
nya_Latn | nya_Latn | Nyanja | Latin | Atlantic-Congo | Benue-Congo | Low |
oci_Latn | oci_Latn | Occitan | Latin | Indo-European | Italic | Low |
gaz_Latn | gaz_Latn | West Central Oromo | Latin | Afro-Asiatic | Cushitic | Low |
ory_Orya | ory_Orya | Odia | Oriya | Indo-European | Indo-Aryan | Low | Baleswari (Northern)
pag_Latn | pag_Latn | Pangasinan | Latin | Austronesian | Malayo-Polynesian | Low |
pan_Guru | pan_Guru | Eastern Panjabi | Gurmukhi | Indo-European | Indo-Aryan | Low | Majhi
pap_Latn | pap_Latn | Papiamento | Latin | Indo-European | Italic | Low | Römer-Maduro-Jonis
pes_Arab | pes_Arab | Western Persian | Arabic | Indo-European | Iranian | High |
pol_Latn | pol_Latn | Polish | Latin | Indo-European | Balto-Slavic | High |
por_Latn | por_Latn | Portuguese | Latin | Indo-European | Italic | High | Brazil
prs_Arab | prs_Arab | Dari | Arabic | Indo-European | Iranian | Low | Kabuli
pbt_Arab | pbt_Arab | Southern Pashto | Arabic | Indo-European | Iranian | Low | Literary
quy_Latn | quy_Latn | Ayacucho Quechua | Latin | Quechuan | Chinchay | Low | Southern Quechua
ron_Latn | ron_Latn | Romanian | Latin | Indo-European | Italic | High |
run_Latn | run_Latn | Rundi | Latin | Atlantic-Congo | Benue-Congo | Low |
rus_Cyrl | rus_Cyrl | Russian | Cyrillic | Indo-European | Balto-Slavic | High |
sag_Latn | sag_Latn | Sango | Latin | Atlantic-Congo | North Volta-Congo | Low |
san_Deva | san_Deva | Sanskrit | Devanagari | Indo-European | Indo-Aryan | Low |
sat_Olck | sat_Beng | Santali | Ol Chiki | Austroasiatic | Mundaic | Low |
scn_Latn | scn_Latn | Sicilian | Latin | Indo-European | Italic | Low | Literary Sicilian
shn_Mymr | shn_Mymr | Shan | Myanmar | Tai-Kadai | Kam-Tai | Low |
sin_Sinh | sin_Sinh | Sinhala | Sinhala | Indo-European | Indo-Aryan | Low |
slk_Latn | slk_Latn | Slovak | Latin | Indo-European | Balto-Slavic | High |
slv_Latn | slv_Latn | Slovenian | Latin | Indo-European | Balto-Slavic | High |
smo_Latn | smo_Latn | Samoan | Latin | Austronesian | Malayo-Polynesian | Low |
sna_Latn | sna_Latn | Shona | Latin | Atlantic-Congo | Benue-Congo | Low |
snd_Arab | snd_Arab | Sindhi | Arabic | Indo-European | Indo-Aryan | Low | Vicholi
som_Latn | som_Latn | Somali | Latin | Afro-Asiatic | Cushitic | Low | Nsom
sot_Latn | sot_Latn | Southern Sotho | Latin | Atlantic-Congo | Benue-Congo | High |
spa_Latn | spa_Latn | Spanish | Latin | Indo-European | Italic | High | Latin American
als_Latn | als_Latn | Tosk Albanian | Latin | Indo-European | Albanian | High |
srd_Latn | srd_Latn | Sardinian | Latin | Indo-European | Italic | Low | Logudorese and Campidanese
srp_Cyrl | srp_Cyrl | Serbian | Cyrillic | Indo-European | Balto-Slavic | Low |
ssw_Latn | ssw_Latn | Swati | Latin | Atlantic-Congo | Benue-Congo | Low |
sun_Latn | sun_Latn | Sundanese | Latin | Austronesian | Malayo-Polynesian | Low |
swe_Latn | swe_Latn | Swedish | Latin | Indo-European | Germanic | High |
swh_Latn | swh_Latn | Swahili | Latin | Atlantic-Congo | Benue-Congo | High | Kiunguja
szl_Latn | szl_Latn | Silesian | Latin | Indo-European | Balto-Slavic | Low |
tam_Taml | tam_Taml | Tamil | Tamil | Dravidian | South Dravidian | Low | Chennai
tat_Cyrl | tat_Cyrl | Tatar | Cyrillic | Turkic | Common Turkic | Low | Central and Middle
tel_Telu | tel_Telu | Telugu | Telugu | Dravidian | South Dravidian | Low | Coastal
tgk_Cyrl | tgk_Cyrl | Tajik | Cyrillic | Indo-European | Iranian | Low |
tgl_Latn | tgl_Latn | Tagalog | Latin | Austronesian | Malayo-Polynesian | High |
tha_Thai | tha_Thai | Thai | Thai | Tai-Kadai | Kam-Tai | High |
tir_Ethi | tir_Ethi | Tigrinya | Geʽez | Afro-Asiatic | Semitic | Low |
taq_Latn | taq_Latn | Tamasheq | Latin | Afro-Asiatic | Berber | Low | Kal Ansar
taq_Tfng | taq_Tfng | Tamasheq | Tifinagh | Afro-Asiatic | Berber | Low | Kal Ansar
tpi_Latn | tpi_Latn | Tok Pisin | Latin | Indo-European | Germanic | Low |
tsn_Latn | tsn_Latn | Tswana | Latin | Atlantic-Congo | Benue-Congo | High | Sehurutshe
tso_Latn | tso_Latn | Tsonga | Latin | Atlantic-Congo | Benue-Congo | Low |
tuk_Latn | tuk_Latn | Turkmen | Latin | Turkic | Common Turkic | Low | Teke
tum_Latn | tum_Latn | Tumbuka | Latin | Atlantic-Congo | Benue-Congo | Low | Rumphi
tur_Latn | tur_Latn | Turkish | Latin | Turkic | Common Turkic | High |
twi_Latn | twi_Latn | Twi | Latin | Atlantic-Congo | Kwa Volta-Congo | Low | Akuapem
tzm_Tfng | tzm_Tfng | Central Atlas Tamazight | Tifinagh | Afro-Asiatic | Berber | Low |
uig_Arab | uig_Arab | Uyghur | Arabic | Turkic | Common Turkic | Low |
ukr_Cyrl | ukr_Cyrl | Ukrainian | Cyrillic | Indo-European | Balto-Slavic | High |
umb_Latn | umb_Latn | Umbundu | Latin | Atlantic-Congo | Benue-Congo | Low |
urd_Arab | urd_Arab | Urdu | Arabic | Indo-European | Indo-Aryan | Low | Lashkari
uzn_Latn | uzn_Latn | Northern Uzbek | Latin | Turkic | Common Turkic | High |
vec_Latn | vec_Latn | Venetian | Latin | Indo-European | Italic | Low | Venice
vie_Latn | vie_Latn | Vietnamese | Latin | Austroasiatic | Vietic | High |
war_Latn | war_Latn | Waray | Latin | Austronesian | Malayo-Polynesian | Low | Tacloban
wol_Latn | wol_Latn | Wolof | Latin | Atlantic-Congo | North-Central Atlantic | Low | Dakkar
xho_Latn | xho_Latn | Xhosa | Latin | Atlantic-Congo | Benue-Congo | High | Ngqika
ydd_Hebr | ydd_Hebr | Eastern Yiddish | Hebrew | Indo-European | Germanic | Low | Hasidic
yor_Latn | yor_Latn | Yoruba | Latin | Atlantic-Congo | Benue-Congo | Low | Ọyọ and Ibadan
yue_Hant | yue_Hant | Yue Chinese | Han (Traditional) | Sino-Tibetan | Sinitic | Low |
zho_Hans | zho_Hans | Chinese | Han (Simplified) | Sino-Tibetan | Sinitic | High |
zho_Hant | zho_Hant | Chinese | Han (Traditional) | Sino-Tibetan | Sinitic | High |
zsm_Latn | zsm_Latn | Standard Malay | Latin | Austronesian | Malayo-Polynesian | High |
zul_Latn | zul_Latn | Zulu | Latin | Atlantic-Congo | Benue-Congo | High |

Available speech encoders
lang_code | language | link
arb | ms arabic | download
asm | assamese | download
bel | belarusian | download
ben | bengali | download
bos | bosnian | download
bul | bulgarian | download
cat | catalan | download
ces | czech | download
cmn | mandarin chinese | download
cym | welsh | download
dan | danish | download
deu | german | download
est | estonian | download
fin | finnish | download
fra | french | download
guj | gujarati | download
heb | hebrew | download
hin | hindi | download
hrv | croatian | download
ind | indonesian | download
ita | italian | download
jpn | japanese | download
kan | kannada | download
kor | korean | download
lao | lao | download
lit | lithuanian | download
lvs | standard latvian | download
mal | malayalam | download
mar | marathi | download
mkd | macedonian | download
mlt | maltese | download
npi | nepali | download
nld | dutch | download
ory | odia | download
pan | punjabi | download
pes | western persian | download
pol | polish | download
por | portuguese | download
ron | romanian | download
rus | russian | download
slk | slovak | download
slv | slovenian | download
snd | sindhi | download
srp | serbian | download
spa | spanish | download
swe | swedish | download
swh | swahili | download
tam | tamil | download
tel | telugu | download
tgl | tagalog | download
tha | thai | download
tur | turkish | download
ukr | ukrainian | download
urd | urdu | download
uzn | northern uzbek | download
vie | vietnamese | download
yue | yue | download

Citation Information

Please cite the paper when referencing the SONAR embedding space, encoders and decoders as:

@misc{Duquenne:2023:sonar_arxiv,
  author = {Paul-Ambroise Duquenne and Holger Schwenk and Benoit Sagot},
  title = {{SONAR:} Sentence-Level Multimodal and Language-Agnostic Representations},
  publisher = {arXiv},
  year = {2023},
  url = {https://arxiv.org/abs/2308.11466},
}

Contributing

See the CONTRIBUTING file for how to help out.

License

SONAR code is released under the MIT license (see CODE_LICENSE).

Some of the SONAR models are released under the same MIT license, BUT BEWARE: others are released under a non-commercial license (see NC_MODEL_LICENSE). Please refer to LICENSE for the details.

Keywords

sentence embeddings
