Echogarden
Echogarden is an easy-to-use toolset that includes a variety of speech processing tools.
- Easy to install, run, and update
- Runs on Windows (x64), macOS (x64, ARM64) and Linux (x64, ARM64)
- Written in TypeScript, for the Node.js runtime
- Doesn't require Python, Docker, or other system-level dependencies
- Doesn't rely on essential platform-specific binaries. Engines are either ported via WebAssembly, imported using the ONNX runtime, or written in pure JavaScript
Features
- Text-to-speech using the VITS neural architecture, and 15 other offline and online engines, including cloud services by Google, Microsoft, Amazon, OpenAI and ElevenLabs
- Speech-to-text using OpenAI Whisper, and several other engines, including cloud services by Google, Microsoft, Amazon and OpenAI
- Speech-to-transcript alignment using several variants of dynamic time warping (DTW, DTW-RA), including support for multi-pass (hierarchical) processing, or via guided decoding using Whisper recognition models. Supports 100+ languages
- Speech-to-text translation translates speech in any of the 98 languages supported by Whisper to English, with near word-level timing for the translated transcript
- Speech-to-translated-transcript alignment attempts to synchronize spoken audio in one language, to a provided English-translated transcript, using the Whisper engine
- Language detection identifies the language of a given audio or text. Provides Whisper or Silero engines for audio, and TinyLD or FastText for text
- Voice activity detection attempts to identify segments of audio where voice is active or inactive. Includes WebRTC VAD, Silero VAD, RNNoise-based VAD and a custom Adaptive Gate
- Speech denoising attenuates background noise from spoken audio. Includes the RNNoise engine
- Source separation isolates voice from any music or background ambience. Supports the MDX-NET deep learning architecture
- Word-level timestamps for all recognition, synthesis, alignment and translation outputs
- Advanced subtitle generation, accounting for sentence and phrase boundaries
- For the VITS and eSpeak-NG synthesis engines, includes enhancements to improve TTS pronunciation accuracy: adds text normalization (e.g. idiomatic date and currency pronunciation), heteronym disambiguation (based on a rule-based model) and user-customizable pronunciation lexicons
- Internal package system that auto-downloads and installs voices, models and other resources, as needed
Installation
Ensure you have Node.js v18.16.0 or later installed, then run:

npm install echogarden -g
Additional required tools:

- ffmpeg: used for codec conversions
- sox: used for the CLI's audio playback

Both tools are auto-downloaded as internal packages on Windows and Linux. On macOS, only ffmpeg is currently auto-downloaded. It is recommended to install sox via a system package manager like Homebrew (brew install sox) to ensure it is available on the system path.
Updating to latest version
npm update echogarden -g
Using the toolset
Tools are accessible via a command-line interface, which enables powerful customization and is especially useful for long-running bulk operations.
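For example, a typical invocation could look like the following. The operation names speak and transcribe follow the CLI documentation, but the exact arguments shown here are illustrative assumptions rather than a definitive reference:

echogarden speak "Hello world!" hello.mp3
echogarden transcribe hello.mp3 transcript.txt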
Development of more graphical and interactive tooling is planned. A text-to-speech browser extension is currently under development (but not released yet).
If you are a developer, you can also import the package as a module or interface with it via a local WebSocket service (currently experimental).
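A minimal sketch of the module-import route is shown below. The exported synthesize function, its option fields and its result shape follow the package's API documentation, but treat them as assumptions and consult the documentation for the exact signatures:

// Minimal sketch of using Echogarden as a Node.js module (TypeScript, ESM).
// The 'synthesize' export, its options and its result fields are assumptions
// based on the package's API documentation - check the docs before relying on them.
import * as Echogarden from 'echogarden'

const result = await Echogarden.synthesize('Hello world!', {
	engine: 'vits',
	language: 'en',
})

// The result is expected to include the synthesized audio and a word-level timeline
console.log(result.timeline)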
Documentation
Credits
This project consolidates and builds upon the work of many different individuals and companies, and also contributes a number of original works.
Developed by Rotem Dan (IPA: /ˈʁɒːtem ˈdän/).
License
GNU General Public License v3
Licenses for components, models and other dependencies are detailed on a separate page of the documentation.