StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. Original authors: Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani.
This package makes StyleTTS2, an approach to human-level text-to-speech, accessible with an inference module that uses strictly MIT licensed libraries. See Conditions and Terms of Use, Common Issues, and Notes below.
pip install styletts2
from styletts2 import tts
# No paths provided means default checkpoints/configs will be downloaded/cached.
my_tts = tts.StyleTTS2()
# Optionally create/write an output WAV file.
out = my_tts.inference("Hello there, I am now a python package.", output_wav_file="test.wav")
# Specific paths to a checkpoint and config can also be provided.
other_tts = tts.StyleTTS2(model_checkpoint_path='/PATH/TO/epochs_2nd_00020.pth', config_path='/PATH/TO/config.yml')
# Specify target voice to clone. When no target voice is provided, a default voice will be used.
other_tts.inference("Hello there, I am now a python package.", target_voice_path="/PATH/TO/some_voice.wav", output_wav_file="another_test.wav")
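The calls above let the package write the WAV file for you. If you would rather handle the returned samples yourself, the following is a minimal sketch using only the standard library. It assumes the samples are mono floats in the range [-1.0, 1.0]; that range is an assumption, not something this README states, and `write_wav` is a hypothetical helper, not part of `styletts2`.

```python
import struct
import wave


def write_wav(samples, path, sample_rate=24000):
    """Write mono float samples as 16-bit PCM and return the frame count.

    Assumption: samples lie in [-1.0, 1.0]; values outside are clipped.
    """
    pcm = bytearray()
    for x in samples:
        x = max(-1.0, min(1.0, float(x)))  # clip to the assumed range
        pcm += struct.pack("<h", int(round(x * 32767)))  # little-endian int16
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample (16-bit)
        wav.setframerate(sample_rate)  # 24000 matches the documented default
        wav.writeframes(bytes(pcm))
    return len(pcm) // 2
```

With the package installed, something like `write_wav(out, "manual.wav")` should work if `out` is a one-dimensional array of floats; check the dtype and range of the returned array first.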
def inference(self,
              text: str,
              target_voice_path=None,
              output_wav_file=None,
              output_sample_rate=24000,
              alpha=0.3,
              beta=0.7,
              diffusion_steps=5,
              embedding_scale=1,
              ref_s=None)
text: Input text to turn into speech.
target_voice_path: Path to audio file of target voice to clone.
output_wav_file: Name of output audio file (if output WAV file is desired).
output_sample_rate: Output sample rate (default 24000).
alpha: Determines timbre of speech, higher means style is more suitable to text than to the target voice.
beta: Determines prosody of speech, higher means style is more suitable to text than to the target voice.
diffusion_steps: More steps produce more diverse samples, at the cost of speed.
embedding_scale: A higher scale conditions the style more strongly on the input text, which makes the speech more emotional.
ref_s: Pre-computed style vector to pass directly.
return: Audio data as a NumPy array (the WAV file is also created if output_wav_file was set).
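One way to read alpha and beta is as interpolation weights between a style predicted from the text and a style taken from the target voice. The sketch below illustrates that reading with plain Python lists. It is an assumption based on the parameter descriptions above, not a copy of the package's code; `blend_style` and the toy vectors are hypothetical.

```python
def blend_style(predicted, reference, weight):
    """Linearly interpolate two style vectors.

    weight=1.0 keeps the text-predicted style; weight=0.0 keeps the
    reference-voice style. This mirrors the documented meaning of
    alpha and beta ("higher means more suitable to text").
    """
    if len(predicted) != len(reference):
        raise ValueError("style vectors must have the same length")
    return [weight * p + (1.0 - weight) * r
            for p, r in zip(predicted, reference)]


# Hypothetical two-dimensional style vectors, for illustration only.
text_style = [1.0, 0.0]
voice_style = [0.0, 1.0]

timbre = blend_style(text_style, voice_style, 0.3)   # alpha: mostly target voice
prosody = blend_style(text_style, voice_style, 0.7)  # beta: mostly text-driven
```

Under this reading, the defaults (alpha=0.3, beta=0.7) keep the timbre close to the target voice while letting the text drive most of the prosody.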
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS synthesis on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.
Paper: https://arxiv.org/abs/2306.07691
Audio samples: https://styletts2.github.io/
Online demo: Hugging Face (thanks to @fakerybakery for the wonderful online demo)
Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have the permission to use the voice you synthesize. That is, you agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices public, or you have to publicly announce that these voices are synthesized if you do not have the permission to use these voices.
[macOS] ImportError due to incompatible architecture for pycrfsuite: This is caused by the gruut phoneme converter's dependency on python-crfsuite. If you are working in a conda environment, try the following:
conda install -c conda-forge python-crfsuite
Another option is to add a different phoneme converter, using the abstraction detailed in phoneme.py, and use that instead.
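As a rough illustration of what plugging in another converter could look like, here is a minimal sketch. The class and method names (`PhonemeConverter`, `phonemize`, `LookupPhonemeConverter`) are assumptions made for this example; consult phoneme.py for the actual abstraction before relying on them. The toy lexicon stands in for a real grapheme-to-phoneme backend.

```python
from abc import ABC, abstractmethod


class PhonemeConverter(ABC):
    """Hypothetical interface: turn text into a phoneme string."""

    @abstractmethod
    def phonemize(self, text: str) -> str:
        ...


class LookupPhonemeConverter(PhonemeConverter):
    """Toy converter that maps whole words to IPA via a dictionary."""

    def __init__(self, lexicon):
        # Lowercase keys so lookups are case-insensitive.
        self.lexicon = {word.lower(): ipa for word, ipa in lexicon.items()}

    def phonemize(self, text: str) -> str:
        # Words missing from the lexicon pass through unchanged.
        return " ".join(self.lexicon.get(word, word)
                        for word in text.lower().split())


converter = LookupPhonemeConverter({"hello": "həˈloʊ", "there": "ðɛɹ"})
phonemes = converter.phonemize("Hello there")
```

A real replacement would wrap an actual phonemizer library behind the same method, so the rest of the inference code does not need to know which backend is in use.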
Voice quality: This is a catch-all for voice-quality issues. In most cases, strange enunciations are the result of the phoneme converter. The hope is that MIT-licensed phoneme converters (e.g., gruut, DeepPhonemizer) will eventually become competitive with legacy converters such as espeak. In the meantime, here are some potential avenues for quality improvement:
High-pitched background noise: This is caused by numerical float differences in older GPUs. For more details, please refer to issue #13. In short, you will need to use a more modern GPU or run inference on the CPU.
Pre-trained model license: You only need to abide by the above rules if you use the pre-trained models and the voices are NOT in the training set, i.e., your reference speakers are not from any open access dataset. For more details of rules to use the pre-trained models, please see #37.
If specific checkpoint paths are not provided, default checkpoints and sub-module checkpoints are downloaded from the HuggingFace repo and the original GitHub repo, respectively, and then cached (similar behavior to HuggingFace Transformers API).
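The download-then-cache behavior described above follows a common pattern: derive a stable local filename from the URL and fetch only on a cache miss. The sketch below shows that pattern in general terms. It is not the package's actual implementation; `cached_download`, the `fetch` callable, and the example URL are all hypothetical.

```python
import hashlib
from pathlib import Path


def cached_download(url, cache_dir, fetch):
    """Return a local path for `url`, downloading only on a cache miss.

    `fetch` is any callable that takes a URL and returns the file's bytes.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so each remote file maps to one stable local filename.
    target = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if not target.exists():
        target.write_bytes(fetch(url))
    return target
```

Passing the downloader in as `fetch` keeps the caching logic independent of any particular HTTP client and makes it easy to test without network access.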
This package currently only supports inference capabilities. Dependencies and scripts related to training and fine-tuning have been pruned out. Check the original repository for training/fine-tuning needs.
The package currently uses the MIT-licensed gruut as the IPA phoneme converter. It was found to be the best alternative to phoneme converters based on espeak.