
Tortoise is a text-to-speech program built with the following priorities:
This repo contains all the code needed to run Tortoise TTS in inference mode.
Manuscript: https://arxiv.org/abs/2305.07243
Please duplicate space if you don't want to wait in a queue. https://huggingface.co/spaces/Manmay/tortoise-tts
I'm naming my speech-related repos after Mojave desert flora and fauna. Tortoise is a bit tongue in cheek: this model is insanely slow. It leverages both an autoregressive decoder and a diffusion decoder, both known for their low sampling rates. On a K80, expect to generate a medium-sized sentence every 2 minutes.
See this page for a large list of example outputs.
Cool application of Tortoise+GPT-3 (not by me): https://twitter.com/lexman_ai
If you want to use this on your own computer, you must have an NVIDIA GPU.
On Windows, I highly recommend using the Conda installation path. I have been told that if you do not do this, you will spend a lot of time chasing dependency problems.
First, install miniconda: https://docs.conda.io/en/latest/miniconda.html
Then run the following commands, using anaconda prompt as the terminal (or any other terminal configured to work with conda)
This will:
conda create --name tortoise python=3.9 numba inflect
conda activate tortoise
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install transformers=4.29.2
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python setup.py install
Optionally, PyTorch can be installed in the base environment, so that other conda environments can use it too. To do this, simply run the conda install pytorch... line before activating the tortoise environment.
Note: When you want to use tortoise-tts, you will always have to ensure the tortoise conda environment is activated.
If you are on Windows, you may also need to install pysoundfile: conda install -c conda-forge pysoundfile
An easy way to hit the ground running and a good jumping off point depending on your use case.
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
docker build . -t tts
docker run --gpus all \
-e TORTOISE_MODELS_DIR=/models \
-v /mnt/user/data/tortoise_tts/models:/models \
-v /mnt/user/data/tortoise_tts/results:/results \
-v /mnt/user/data/.cache/huggingface:/root/.cache/huggingface \
-v /root:/work \
-it tts
This gives you an interactive terminal in an environment that's ready to do some tts. Now you can explore the different interfaces that tortoise exposes for tts.
For example:
cd app
conda activate tortoise
time python tortoise/do_tts.py \
--output_path /results \
--preset ultra_fast \
--voice geralt \
--text "Time flies like an arrow; fruit flies like a banana."
On macOS 13+ with M1/M2 chips, you need to install the nightly version of PyTorch. As stated on the official page, you can do:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
Be sure to do that after you activate the environment. If you don't use conda, the commands would look like this:
python3.10 -m venv .venv
source .venv/bin/activate
pip install numba inflect psutil
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
pip install transformers
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
pip install .
Be aware that DeepSpeed is disabled on Apple Silicon since it does not work. The flag --use_deepspeed is ignored.
You may need to prepend PYTORCH_ENABLE_MPS_FALLBACK=1 to the commands below to make them work, since MPS does not support all the operations in PyTorch.
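If you launch the scripts from Python rather than from a shell, the same variable can be set on the child process environment instead of being prepended to the command. The following is a minimal sketch; the helper names mps_fallback_env and run_with_mps_fallback are illustrative assumptions and are not part of Tortoise.

```python
import os
import subprocess


def mps_fallback_env(base=None):
    """Return a copy of the environment with the MPS CPU fallback enabled."""
    env = dict(os.environ if base is None else base)
    env["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
    return env


def run_with_mps_fallback(args):
    """Run a command (a list of strings) with the fallback variable set.

    Hypothetical helper: `args` would be one of the script invocations
    from this README, e.g. ["python", "tortoise/do_tts.py", ...].
    """
    return subprocess.run(args, env=mps_fallback_env(), check=True)
```

Copying the environment rather than mutating os.environ keeps the setting scoped to the child process.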
This script allows you to speak a single phrase with one or more voices.
python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast
This script is a faster variant of read.py for reading large amounts of text.
python tortoise/read_fast.py --textfile <your text to be read> --voice random
This script provides tools for reading large amounts of text.
python tortoise/read.py --textfile <your text to be read> --voice random
This will break up the textfile into sentences, and then convert them to speech one at a time. It will output a series of spoken clips as they are generated. Once all the clips are generated, it will combine them into a single file and output that as well.
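The split, synthesize, and combine flow described above can be sketched as follows. This is an illustration under stated assumptions, not the code read.py uses: the regex sentence splitter is a simple heuristic, and synthesize is a hypothetical callable that maps a sentence to a list of audio samples (in practice it would wrap a Tortoise call).

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (simple heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def read_text(text, synthesize):
    """Synthesize each sentence in turn, then concatenate the clips.

    `synthesize` is a hypothetical stand-in for the TTS call; it takes a
    sentence and returns a list of audio samples.
    """
    clips = [synthesize(sentence) for sentence in split_sentences(text)]
    combined = [sample for clip in clips for sample in clip]
    return clips, combined
```

Keeping the per-sentence clips alongside the combined result mirrors the behavior described above, where individual clips are written out before the final file.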
Sometimes Tortoise screws up an output. You can re-generate any bad clips by re-running read.py with the --regenerate argument.
Tortoise can be used programmatically, like so:
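from tortoise import api, utils
import tortoise.utils.audio  # makes utils.audio available
# clips_paths: a list of paths to your reference clips
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech()
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')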
To use deepspeed:
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(use_deepspeed=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
To use kv cache:
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(kv_cache=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
To run model in float16:
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(half=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
For faster runs, use all three:
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(use_deepspeed=True, kv_cache=True, half=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
Tortoise was specifically trained to be a multi-speaker model. It accomplishes this by consulting reference clips.
These reference clips are recordings of a speaker that you provide to guide speech generation. These clips are used to determine many properties of the output, such as the pitch and tone of the voice, speaking speed, and even speaking defects like a lisp or stuttering. The reference clip is also used to determine non-voice related aspects of the audio output like volume, background noise, recording quality and reverb.
This repo comes with several pre-packaged voices. Voices prepended with "train_" came from the training set and perform far better than the others. If your goal is high quality speech, I recommend you pick one of them. If you want to see what Tortoise can do for zero-shot mimicking, take a look at the others.
To add new voices to Tortoise, you will need to do the following:
As mentioned above, your reference clips have a profound impact on the output of Tortoise. Following are some tips for picking good clips:
Tortoise is primarily an autoregressive decoder model combined with a diffusion model. Both of these have a lot of knobs that can be turned that I've abstracted away for the sake of ease of use. I did this by generating thousands of clips using various permutations of the settings and using a metric for voice realism and intelligibility to measure their effects. I've set the defaults to the best overall settings I was able to find. For specific use-cases, it might be effective to play with these settings (and it's very likely that I missed something!)
These settings are not available in the normal scripts packaged with Tortoise. They are available, however, in the API. See api.tts for a full list.
Some people have discovered that it is possible to do prompt engineering with Tortoise! For example, you can evoke emotion by including things like "I am really sad," before your text. I've built an automated redaction system that you can use to take advantage of this. It works by attempting to redact any text in the prompt surrounded by brackets. For example, the prompt "[I am really sad,] Please feed me." will only speak the words "Please feed me" (with a sad tonality).
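To make the bracket convention concrete, here is a minimal sketch that separates a prompt into the bracketed conditioning text and the text that should remain audible. It only illustrates which words end up spoken; it is not Tortoise's redaction mechanism, and split_prompt is a hypothetical helper name.

```python
import re

# Matches one bracketed span plus any whitespace that follows it.
_BRACKETED = re.compile(r"\[[^\]]*\]\s*")


def split_prompt(prompt):
    """Return (conditioning_text, spoken_text) for a bracketed prompt."""
    conditioning = " ".join(re.findall(r"\[([^\]]*)\]", prompt))
    spoken = _BRACKETED.sub("", prompt).strip()
    return conditioning, spoken
```

For the example prompt above, split_prompt("[I am really sad,] Please feed me.") returns ("I am really sad,", "Please feed me.").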
Tortoise ingests reference clips by feeding them individually through a small submodel that produces a point latent, then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents are quite expressive, affecting everything from tone to speaking rate to speech abnormalities.
This lends itself to some neat tricks. For example, you can feed two different voices to Tortoise and it will output what it thinks the "average" of those two voices sounds like.
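The averaging step can be illustrated with plain Python lists standing in for the latents. This is a sketch under that assumption: the real latents are tensors produced by the submodel, and mean_latent is a hypothetical helper, not a Tortoise function.

```python
def mean_latent(latents):
    """Element-wise mean of equally sized latent vectors (lists of floats)."""
    if not latents:
        raise ValueError("need at least one latent")
    size = len(latents[0])
    if any(len(v) != size for v in latents):
        raise ValueError("all latents must have the same length")
    return [sum(values) / len(latents) for values in zip(*latents)]
```

Because the result is a mean over every clip supplied, passing clips from two speakers yields a latent that sits between the two voices, which is the "average" voice effect described above.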
Use the script get_conditioning_latents.py to extract conditioning latents for a voice you have installed. This script will dump the latents to a .pth pickle file. The file will contain a single tuple, (autoregressive_latent, diffusion_latent).
Alternatively, use api.TextToSpeech.get_conditioning_latents() to fetch the latents.
After you've played with them, you can use them to generate speech by creating a subdirectory in voices/ with a single ".pth" file containing the pickled conditioning latents as a tuple (autoregressive_latent, diffusion_latent).
Out of concerns that this model might be misused, I've built a classifier that tells the likelihood that an audio clip came from Tortoise.
This classifier can be run on any computer. Usage is as follows:
python tortoise/is_this_from_tortoise.py --clip=<path_to_suspicious_audio_file>
This model has 100% accuracy on the contents of the results/ and voices/ folders in this repo. Still, treat this classifier as a "strong signal". Classifiers can be fooled and it is likewise not impossible for this classifier to exhibit false positives.
Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data and using a better decoder. It is made up of 5 separate models that work together. I've assembled a write-up of the system architecture here: https://nonint.com/2022/04/25/tortoise-architectural-design-doc/
These models were trained on my "homelab" server with 8 RTX 3090s over the course of several months. They were trained on a dataset consisting of ~50k hours of speech data, most of which was transcribed by ocotillo. Training was done on my own DLAS trainer.
I currently do not have plans to release the training configurations or methodology. See the next section.
Tortoise v2 works considerably better than I had planned. When I began hearing some of the outputs of the last few versions, I began wondering whether or not I had an ethically unsound project on my hands. The ways in which a voice-cloning text-to-speech system could be misused are many. It doesn't take much creativity to think up how.
After some thought, I have decided to go forward with releasing this. Following are the reasons for this choice:
See tortoise-detect above.
The diversity expressed by ML models is strongly tied to the datasets they were trained on.
Tortoise was trained primarily on a dataset consisting of audiobooks. I made no effort to balance diversity in this dataset. For this reason, Tortoise will be particularly poor at generating the voices of minorities or of people who speak with strong accents.
Tortoise v2 is about as good as I think I can do in the TTS world with the resources I have access to. A phenomenon that happens when training very large models is that as parameter count increases, the communication bandwidth needed to support distributed training of the model increases multiplicatively. On enterprise-grade hardware, this is not an issue: GPUs are attached together with exceptionally wide buses that can accommodate this bandwidth. I cannot afford enterprise hardware, though, so I am stuck.
I want to mention here that I think Tortoise could be a lot better. The three major components of Tortoise are either vanilla Transformer Encoder stacks or Decoder stacks. Both of these types of models have a rich experimental history with scaling in the NLP realm. I see no reason to believe that the same is not true of TTS.
This project has garnered more praise than I expected. I am standing on the shoulders of giants, though, and I want to credit a few of the amazing folks in the community that have helped make this happen:
Tortoise was built entirely by me using my own hardware. My employer was not involved in any facet of Tortoise's development.
If you use this repo or the ideas therein for your research, please cite it! A BibTeX entry can be found in the right pane on GitHub.