
Stream text into audio with an easy-to-use, highly configurable library delivering voice output with minimal latency.
To install realtimetts, you need to specify the TTS engine(s) you wish to use.
For example, to install all supported engines:
pip install realtimetts[all]
To install with the Coqui TTS engine:
pip install realtimetts[coqui]
Available engine options include, among others: system (pyttsx3), coqui, azure, elevenlabs, openai, and kokoro.
You can install multiple engines by separating them with commas. For example:
pip install realtimetts[azure,elevenlabs,openai]
Easy to use, low-latency text-to-speech library for realtime applications
RealtimeTTS is a state-of-the-art text-to-speech (TTS) library designed for real-time applications. It stands out for its ability to quickly convert text streams into high-quality audio output with minimal latency.
Important: Installation has changed to allow more customization. Please use
pip install realtimetts[all]
instead of pip install realtimetts
now. More info here.
Hint: Check out Linguflex, the original project from which RealtimeTTS is spun off. It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available.
https://github.com/KoljaB/RealtimeTTS/assets/7604638/87dcd9a5-3a4e-4f57-be45-837fc63237e7
Hint: Check out RealtimeSTT, the input counterpart of this library, for speech-to-text capabilities. Together, they form a powerful realtime audio wrapper around large language models.
Check the FAQ page for answers to a lot of questions around the usage of RealtimeTTS.
The documentation for RealtimeTTS is available in multiple languages.
Latest Version: v0.5.5
New Engine: OrpheusEngine
New Engine: KokoroEngine
Support for more Kokoro languages. For a full installation that also includes Japanese and Chinese (see the updated test file):
pip install "RealtimeTTS[kokoro,jp,zh]"
If you run into problems with Japanese (error: "module 'jieba' has no attribute 'lcut'"), try:
pip uninstall jieba jieba3k
pip install jieba
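To try the new engine, here is a minimal sketch (an assumption: KokoroEngine constructs with defaults; consult the updated test file for the actual voice and language options):
from RealtimeTTS import TextToAudioStream, KokoroEngine

# assumption: default construction; see the test file for language-specific options
engine = KokoroEngine()
stream = TextToAudioStream(engine)
stream.feed("Testing the Kokoro engine.")
stream.play()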
StyleTTS2 engine:
https://github.com/user-attachments/assets/d1634012-ba53-4445-a43a-7042826eedd7
EdgeTTS engine:
https://github.com/user-attachments/assets/73ec6258-23ba-4bc6-acc7-7351a13c5509
See release history.
Added ParlerEngine. It needs flash attention, and even then it barely runs fast enough for real-time inference on an RTX 4090.
Parler Installation for Windows (after installing RealtimeTTS):
pip install git+https://github.com/huggingface/parler-tts.git
pip install torch==2.3.1+cu121 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
pip install https://github.com/oobabooga/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu122torch2.3.1cxx11abiFALSE-cp310-cp310-win_amd64.whl
pip install "numpy<2"
This library uses a mix of engines: some perform local processing (no internet required), while others require an internet connection.
By using "industry standard" components RealtimeTTS offers a reliable, high-end technological foundation for developing advanced voice solutions.
Note: Basic installation with
pip install realtimetts
is no longer recommended; use
pip install realtimetts[all]
instead.
Note: Set
output_device_index
in TextToAudioStream if needed. Linux users: install PortAudio via apt-get install -y portaudio19-dev
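For example, a minimal sketch routing playback to a specific device (the index value is illustrative; enumerate your audio devices with PyAudio to find the right one):
from RealtimeTTS import TextToAudioStream, SystemEngine

# output_device_index selects the audio output device; None uses the system default
stream = TextToAudioStream(SystemEngine(), output_device_index=1)
stream.feed("Playing on a specific output device.")
stream.play()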
The RealtimeTTS library provides installation options for various dependencies. Here are the different ways you can install RealtimeTTS depending on your needs:
To install RealtimeTTS with support for all TTS engines:
pip install -U realtimetts[all]
Install only the dependencies you need using these options.
Examples: pip install realtimetts[all], pip install realtimetts[azure], pip install realtimetts[azure,elevenlabs,openai]
For those who want to perform a full installation within a virtual environment, follow these steps:
python -m venv env_realtimetts
env_realtimetts\Scripts\activate.bat
python.exe -m pip install --upgrade pip
pip install -U realtimetts[all]
More information about CUDA installation.
Different engines supported by RealtimeTTS have unique requirements. Ensure you fulfill these requirements based on the engine you choose.
The SystemEngine works out of the box with your system's built-in TTS capabilities. No additional setup is needed.
The GTTSEngine works out of the box using Google Translate's text-to-speech API. No additional setup is needed.
To use the OpenAIEngine, you need an OpenAI API key (typically provided via the OPENAI_API_KEY environment variable).
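A minimal sketch (assumptions: the key is exported in your environment and OpenAIEngine constructs with defaults):
import os
from RealtimeTTS import TextToAudioStream, OpenAIEngine

# assumption: OPENAI_API_KEY is set in the environment before the engine is created
assert "OPENAI_API_KEY" in os.environ
stream = TextToAudioStream(OpenAIEngine())
stream.feed("Hello from the OpenAI TTS engine.")
stream.play()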
To use the AzureEngine, you will need an Azure Cognitive Services Speech API key and its associated service region. Make sure you have these credentials available and correctly configured when initializing the AzureEngine.
For the ElevenlabsEngine, you need:
an Elevenlabs API key (provided via the ElevenlabsEngine constructor parameter "api_key" or in the environment variable ELEVENLABS_API_KEY)
mpv installed on your system (essential for streaming mpeg audio; Elevenlabs only delivers mpeg).
Installing mpv:
macOS:
brew install mpv
Linux and Windows: Visit mpv.io for installation instructions.
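With mpv in place, a minimal sketch (the "api_key" constructor parameter is documented above; the rest assumes defaults):
import os
from RealtimeTTS import TextToAudioStream, ElevenlabsEngine

# pass the key explicitly, or rely on the ELEVENLABS_API_KEY environment variable
engine = ElevenlabsEngine(api_key=os.environ["ELEVENLABS_API_KEY"])
stream = TextToAudioStream(engine)
stream.feed("Streaming mpeg audio through mpv.")
stream.play()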
PiperEngine offers high-quality, real-time text-to-speech synthesis using the Piper model.
Separate installation: Piper itself is installed separately from RealtimeTTS.
Configuration: ensure that PiperVoice is correctly set up with the model and configuration files before initializing PiperEngine.
CoquiEngine delivers high-quality, local, neural TTS with voice cloning.
It downloads a neural TTS model on first use. In most cases it should be fast enough for real-time use with GPU synthesis; it needs around 4-5 GB of VRAM.
On most systems, GPU support is needed to run fast enough for real-time use; otherwise you will experience stuttering.
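A minimal sketch (an assumption: CoquiEngine constructs with defaults; the first run downloads the model):
from RealtimeTTS import TextToAudioStream, CoquiEngine

# first use downloads the neural TTS model; GPU synthesis needs roughly 4-5 GB VRAM
engine = CoquiEngine()
stream = TextToAudioStream(engine)
stream.feed("Local neural synthesis with voice cloning support.")
stream.play()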
Here's a basic usage example:
from RealtimeTTS import TextToAudioStream, SystemEngine, AzureEngine, ElevenlabsEngine
engine = SystemEngine() # replace with your TTS engine
stream = TextToAudioStream(engine)
stream.feed("Hello world! How are you today?")
stream.play_async()
You can feed individual strings:
stream.feed("Hello, this is a sentence.")
Or you can feed generators and character iterators for real-time streaming:
import openai  # the pre-1.0 OpenAI API; see the note in the test section below

def write(prompt: str):
    for chunk in openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    ):
        if (text_chunk := chunk["choices"][0]["delta"].get("content")) is not None:
            yield text_chunk
text_stream = write("A three-sentence relaxing speech.")
stream.feed(text_stream)
char_iterator = iter("Streaming this character by character.")
stream.feed(char_iterator)
Asynchronously:
import time

stream.play_async()
while stream.is_playing():
    time.sleep(0.1)
Synchronously:
stream.play()
The test subdirectory contains a set of scripts to help you evaluate and understand the capabilities of the RealtimeTTS library.
Note that most of the tests still rely on the "old" OpenAI API (<1.0.0). Usage of the new OpenAI API is demonstrated in openai_1.0_test.py.
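For orientation, here is a sketch of what the streaming generator from the usage example above looks like against the newer API (openai>=1.0); openai_1.0_test.py is the authoritative reference:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def write(prompt: str):
    # stream chat completion deltas and yield only the text content
    for chunk in client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    ):
        if chunk.choices[0].delta.content is not None:
            yield chunk.choices[0].delta.content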
simple_test.py
complex_test.py
coqui_test.py
translator.py (dependencies: pip install openai realtimestt)
openai_voice_interface.py (dependencies: pip install openai realtimestt)
advanced_talk.py (dependencies: pip install openai keyboard realtimestt)
minimalistic_talkbot.py (dependencies: pip install openai realtimestt)
simple_llm_test.py (dependencies: pip install openai)
test_callbacks.py (dependencies: pip install openai)
Pause the audio stream:
stream.pause()
Resume a paused stream:
stream.resume()
Stop the stream immediately:
stream.stop()
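Putting these controls together, a small sketch (SystemEngine used purely for illustration):
import time
from RealtimeTTS import TextToAudioStream, SystemEngine

stream = TextToAudioStream(SystemEngine())
stream.feed("A longer text whose playback we want to control.")
stream.play_async()

time.sleep(1.0)
stream.pause()   # playback halts
time.sleep(0.5)
stream.resume()  # playback continues where it left off

while stream.is_playing():
    time.sleep(0.1)
stream.stop()    # stops immediately; here playback has already finished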
Python Version:
PyAudio: to create an output audio stream
stream2sentence: to split the incoming text stream into sentences
pyttsx3: System text-to-speech conversion engine
pydub: to convert audio chunk formats
azure-cognitiveservices-speech: Azure text-to-speech conversion engine
elevenlabs: Elevenlabs text-to-speech conversion engine
coqui-TTS: Coqui's XTTS text-to-speech library for high-quality local neural TTS
Shoutout to the Idiap Research Institute for maintaining a fork of Coqui TTS.
openai: to interact with OpenAI's TTS API
gtts: Google translate text-to-speech conversion
TextToAudioStream
When you initialize the TextToAudioStream class, you have various options to customize its behavior. Here are the available parameters:
engine (Union[BaseEngine, List[BaseEngine]]): the engine (or list of engines) responsible for the text-to-audio synthesis.
on_text_stream_start (Callable): invoked when the text stream starts. Signature: on_text_stream_start() -> None.
on_text_stream_stop (Callable): invoked when the text stream stops. Signature: on_text_stream_stop() -> None.
on_audio_stream_start (Callable): invoked when the audio stream starts. Signature: on_audio_stream_start() -> None.
on_audio_stream_stop (Callable): invoked when the audio stream stops. Signature: on_audio_stream_stop() -> None.
on_character (Callable): invoked for each processed character. Signature: on_character(character: str) -> None.
on_word (Callable, optional): default None. Invoked per spoken word with a timing object (TimingInfo) that includes word timing details.
output_device_index (int): default None. NOT SUPPORTED for ElevenlabsEngine and EdgeEngine (MPV playout). If None, the system's default audio output device is used.
tokenizer (str): default "nltk". Currently supported: "nltk" (default) and "stanza". To plug in a custom tokenizer, use the tokenize_sentences parameter instead.
language (str): default "en". For example "en" for English, "de" for German, "fr" for French.
muted (bool): default False. If True, audio playback is disabled and no audio stream will be opened, allowing the synthesis to generate audio data without playing it.
frames_per_buffer (int): default pa.paFramesPerBufferUnspecified. When left at this value, PyAudio selects a default based on the platform and hardware.
comma_silence_duration (float): default 0.0. Silence duration (in seconds) inserted after commas.
sentence_silence_duration (float): default 0.0. Silence duration (in seconds) inserted after sentences.
default_silence_duration (float): default 0.0. Fallback silence duration (in seconds) for other delimiters.
playout_chunk_size (int): default -1. When -1, the chunk size is determined dynamically based on frames_per_buffer or a default internal value.
level (int): default logging.WARNING. Sets the logging level: logging.DEBUG (detailed information for debugging), logging.INFO (general runtime information), logging.WARNING (warnings about potential issues), logging.ERROR (serious errors requiring attention).
engine = YourEngine()  # substitute with your engine
stream = TextToAudioStream(
    engine=engine,
    on_text_stream_start=my_text_start_func,
    on_text_stream_stop=my_text_stop_func,
    on_audio_stream_start=my_audio_start_func,
    on_audio_stream_stop=my_audio_stop_func,
    level=logging.INFO
)
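Callbacks fire as synthesis progresses. For example, the on_character callback documented above can echo each character as it is processed:
def print_character(character: str):
    # matches the documented signature: on_character(character: str) -> None
    print(character, end="", flush=True)

stream = TextToAudioStream(engine, on_character=print_character)
stream.feed("Watch the characters appear.")
stream.play()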
play and play_async
These methods execute the text-to-audio synthesis and play the audio stream. The difference is that play is a blocking function, while play_async runs in a separate thread, allowing other operations to proceed.
fast_sentence_fragment (bool): default True. When True, the method prioritizes speed, generating and playing sentence fragments faster. This is useful for applications where latency matters.
fast_sentence_fragment_allsentences (bool): default False. When True, applies the fast sentence-fragment processing to all sentences, not just the first one.
fast_sentence_fragment_allsentences_multiple (bool): default False. When True, allows yielding multiple sentence fragments instead of just a single one.
buffer_threshold_seconds (float): default 0.0. Specifies the buffering threshold in seconds, which impacts the smoothness and continuity of audio playback. How it works: before synthesizing a new sentence, the system checks whether more audio material remains in the buffer than the time specified by buffer_threshold_seconds. If so, it retrieves another sentence from the text generator, assuming that it can fetch and synthesize this new sentence within the time window provided by the remaining audio in the buffer. This process allows the text-to-speech engine to have more context for better synthesis, enhancing the user experience. A higher value ensures that there is more pre-buffered audio, reducing the likelihood of silence or gaps during playback. If you experience breaks or pauses, consider increasing this value.
minimum_sentence_length (int): default 10
minimum_first_fragment_length (int): default 10
log_synthesized_text (bool): default False
reset_generated_text (bool): default True
output_wavfile (str): default None
on_sentence_synthesized (callable): default None
before_sentence_synthesized (callable): default None
on_audio_chunk (callable): default None
tokenizer (str): default "nltk"
tokenize_sentences (callable): default None
language (str): default "en"
context_size (int): default 12
context_size_look_overhead (int): default 12
muted (bool): default False
sentence_fragment_delimiters (str): default ".?!;:,\n…)]}。-"
force_first_fragment_after_words (int): default 15
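A sketch combining several of these play parameters (the names are as documented above; the values shown are illustrative, not recommendations):
def log_sentence(sentence: str):
    # on_sentence_synthesized is called once the audio for a sentence is ready
    print("synthesized:", sentence)

stream.feed("One sentence. Another sentence. A third one.")
stream.play(
    fast_sentence_fragment=True,       # prioritize low latency for the first audio
    buffer_threshold_seconds=2.0,      # keep ~2 s pre-buffered to avoid gaps
    log_synthesized_text=True,         # log each sentence as it is synthesized
    output_wavfile="output.wav",       # additionally write the audio to a file
    on_sentence_synthesized=log_sentence,
)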
These steps are recommended for those who require better performance and have a compatible NVIDIA GPU.
Note: to check if your NVIDIA GPU supports CUDA, visit the official CUDA GPUs list.
To use PyTorch with CUDA support, please follow these steps:
Note: newer PyTorch installations may (unverified) no longer need the Toolkit (and possibly cuDNN) installation.
Install the NVIDIA CUDA Toolkit (for example Toolkit 12.X or Toolkit 11.8) from NVIDIA's download pages.
Install NVIDIA cuDNN (for example cuDNN 8.7.0 for CUDA 11.x) from NVIDIA's download page.
Install ffmpeg:
You can download an installer for your OS from the ffmpeg Website.
Or use a package manager:
On Ubuntu or Debian:
sudo apt update && sudo apt install ffmpeg
On Arch Linux:
sudo pacman -S ffmpeg
On macOS using Homebrew (https://brew.sh/):
brew install ffmpeg
On Windows using Chocolatey (https://chocolatey.org/):
choco install ffmpeg
On Windows using Scoop (https://scoop.sh/):
scoop install ffmpeg
Install PyTorch with CUDA support:
To upgrade your PyTorch installation to enable GPU support with CUDA, follow these instructions based on your specific CUDA version. This is useful if you wish to enhance the performance of RealtimeTTS with CUDA capabilities.
For CUDA 11.8:
To update PyTorch and Torchaudio to support CUDA 11.8, use the following commands:
pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
For CUDA 12.X:
To update PyTorch and Torchaudio to support CUDA 12.X, execute the following:
pip install torch==2.5.1+cu121 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
Replace 2.5.1 with the version of PyTorch that matches your system and requirements.
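After installing, you can quickly verify that PyTorch actually sees your GPU:
import torch

# True means a CUDA-enabled build is installed and a compatible GPU/driver was found
print(torch.cuda.is_available())
# the CUDA version this PyTorch build was compiled against (e.g. "11.8" or "12.1")
print(torch.version.cuda)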
Fix to resolve compatibility issues: If you run into library compatibility issues, try pinning these libraries to fixed versions:
pip install networkx==2.8.8
pip install typing_extensions==4.8.0
pip install fsspec==2023.6.0
pip install imageio==2.31.6
pip install numpy==1.24.3
pip install requests==2.31.0
Huge shoutout to the team behind Coqui AI - especially the brilliant Eren Gölge - for being the first to give us local high-quality synthesis with real-time speed and even a clonable voice!
Thank you, Pierre Nicolas Durette, for giving us a free TTS to use without a GPU via Google Translate, with his gtts Python library.
Contributions are always welcome (e.g. PR to add a new engine).
While the source of this library is open-source, the usage of many of the engines it depends on is not: External engine providers often restrict commercial use in their free plans. This means the engines can be used for noncommercial projects, but commercial usage requires a paid plan.
Disclaimer: This is a summarization of the licenses as understood at the time of writing. It is not legal advice. Please read and respect the licenses of the different engine providers if you plan to use them in a project.
Kolja Beigel Email: kolja.beigel@web.de