DSpeech: A Command-line Speech Processing Toolkit
DSpeech is an advanced command-line toolkit designed for speech processing tasks such as transcription, voice activity detection (VAD), punctuation addition, and emotion classification. It is built on top of state-of-the-art models and provides an easy-to-use interface for handling various speech processing jobs.
1. Installation
1.1 Prerequisites
- Python 3.6 or later
- PyTorch 1.7 or later
- torchaudio
- rich
- soundfile
- funasr (a lightweight AutoModel library for speech processing)
1.2 Installation Steps
- Clone the repository:
  git clone https://gitee.com/iint/dspeech.git
  cd dspeech
- Install the required packages:
  pip install -r requirements.txt
- Alternatively, install dspeech directly via pip:
  pip install dspeech
- Set the DSPEECH_HOME environment variable to the directory where your models are stored:
  export DSPEECH_HOME=/path/to/dspeech/models
- Download the necessary models and place them in the DSPEECH_HOME directory. You can download the models using the following commands (replace <model_id> with the actual model ID):
  export HF_ENDPOINT=https://hf-mirror.com
  huggingface-cli download --resume-download <model_id> --local-dir $DSPEECH_HOME/<model_name>
- (Optional) You can also install Dguard if you want to perform speaker diarization:
  pip install dguard==0.1.20
  export DGUARD_MODEL_PATH=<path to dguard model home>
  dguard_info
- Print the help message to see the available commands:
  dspeech help
  You should see a list of available commands and options:
DSpeech: A Command-line Speech Processing Toolkit
Usage: dspeech
Commands:
help Show this help message
transcribe Transcribe an audio file
vad Perform VAD on an audio file
punc Add punctuation to a text
emo Perform emotion classification on an audio file
clone Clone speaker's voice and generate audio
clone_with_emo Clone speaker's voice with emotion and generate audio
Options (for ASR and emotion classification):
--model Model name (default: sensevoicesmall)
--vad-model VAD model name (default: fsmn-vad)
--punc-model Punctuation model name (default: ct-punc)
--emo-model Emotion model name (default: emotion2vec_plus_large)
--device Device to run the models on (default: cuda)
--file Audio file path for transcribing, VAD, or emotion classification
--text Text to process with punctuation model
--start Start time in seconds for processing audio files (default: 0)
--end End time in seconds for processing audio files (default: end of file)
--sample-rate Sample rate of the audio file (default: 16000)
Options (for TTS):
--ref_audio Reference audio file path for voice cloning
--ref_text Reference text for voice cloning
--speaker_folder Speaker folder path for emotional voice cloning
--text Text to generate audio
--audio_save_path Path to save the audio
--spectrogram_save_path [Optional] Path to save the spectrogram
--speed Speed of the audio
--sample_rate Sample rate of the audio file (default: 16000)
Example: dspeech transcribe --file audio.wav
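After the downloads finish, DSPEECH_HOME should contain one sub-directory per model. A hypothetical layout using the default model names shown above (the exact directory names depend on what you pass as <model_name>):
$DSPEECH_HOME/
├── sensevoicesmall/
├── fsmn-vad/
├── ct-punc/
└── emotion2vec_plus_large/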
2. Features
DSpeech offers the following functionalities:
- Transcription: Convert audio files to text using state-of-the-art speech recognition models.
- Voice Activity Detection (VAD): Detect and segment speech regions in an audio file.
- Punctuation Addition: Add punctuation to raw text transcriptions to improve readability.
- Emotion Classification: Classify the emotional content of an audio file into various categories.
- Voice Cloning: Clone a voice from a given audio file using a text-to-speech (TTS) model.
- Emotion TTS: Generate emotional speech using a text-to-speech (TTS) model.
3. Introduction to the dspeech.STT
To use DSpeech in a Python script, you can import the STT class and create an instance with the desired models:
from dspeech.stt import STT

# Initialize the STT handler with ASR, VAD, punctuation, and emotion models
handler = STT(model_name="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc", emo_model="emotion2vec_plus_large")

# Transcribe an audio file to text
transcription = handler.transcribe_file("audio.wav")
print(transcription)

# Detect speech regions (voice activity detection)
vad_result = handler.vad_file("audio.wav")
print(vad_result)

# Add punctuation to raw text
punctuation_text = handler.punc_result("this is a test")
print(punctuation_text)

# Classify the emotional content of an audio file
emotion_result = handler.emo_classify_file("audio.wav")
print(emotion_result)
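A common pattern is to chain these calls, for example transcribing a file and then restoring punctuation on the raw transcription. A minimal sketch, assuming transcribe_file returns plain text as in the example above:
from dspeech.stt import STT

# Initialize with the same models as above
handler = STT(model_name="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc", emo_model="emotion2vec_plus_large")

# Raw transcription (no punctuation), then punctuation restoration for readability
raw_text = handler.transcribe_file("audio.wav")
readable_text = handler.punc_result(raw_text)
print(readable_text)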
4. Introduction to the dspeech.TTS
4.1 Initialization
To initialize the TTS module, create a TTS handler object specifying the target device (CPU or GPU) and the sample rate for generated audio.
from dspeech import TTS
import torch
tts_handler = TTS(
device="cuda",
target_sample_rate=24000
)
The device parameter can be set to "cuda" for GPU usage or "cpu" for running on the CPU.
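If you are unsure whether a GPU is available, you can choose the device at runtime. A minimal sketch using the same constructor arguments as above:
import torch
from dspeech import TTS

# Fall back to the CPU when no CUDA device is visible
device = "cuda" if torch.cuda.is_available() else "cpu"
tts_handler = TTS(device=device, target_sample_rate=24000)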
4.2 Basic Voice Cloning
In basic voice cloning, you provide a reference audio clip and its transcription, and the system generates new speech that mimics the voice in the reference audio while speaking the text you want to synthesize.
import torchaudio
ref_audio, sample_rate = torchaudio.load("tests/a.wav")
r = tts_handler.clone(
ref_audio=(ref_audio, sample_rate),
ref_text="Reference text",
gen_text_batches=["Hello, my name is Xiao Ming", "I am an AI", "I can speak Chinese"],
speed=1,
channel=-1,
remove_silence=True,
wave_path="tests/tts_output.wav",
spectrogram_path="tests/tts_output.png",
concat=True
)
Parameters:
- ref_audio: The reference audio, either in the format (Tensor, int) or as a file path.
- ref_text: The transcription of the reference audio.
- gen_text_batches: A list of text strings that you want to convert into speech.
- speed: Adjusts the speed of the generated speech (default is 1, normal speed).
- remove_silence: Whether to remove silence from the reference audio (boolean).
- wave_path: Path to save the generated audio file.
- spectrogram_path: Path to save the spectrogram image of the generated audio.
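Because ref_audio also accepts a file path, the reference audio does not have to be loaded manually. A minimal sketch of the same call using the path-based form (parameters as described above):
# Sketch: pass the reference audio as a file path instead of a (Tensor, sample_rate) tuple
r = tts_handler.clone(
    ref_audio="tests/a.wav",
    ref_text="Reference text",
    gen_text_batches=["Hello, my name is Xiao Ming"],
    speed=1,
    channel=-1,
    remove_silence=True,
    wave_path="tests/tts_output.wav",
    spectrogram_path="tests/tts_output.png",
    concat=True
)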
4.3 Extracting Speaker Information
For complex voice cloning with multiple speakers and emotions, you need to extract speaker information from a directory containing audio files for the different speakers and emotions.
The directory should have the following structure:
<path>/
├── speaker1/
│   ├── happy.wav
│   ├── happy.txt
│   ├── neutral.wav
│   └── neutral.txt
└── speaker2/
    └── ...
Each subdirectory represents a speaker, and each audio file should have an accompanying text file.
spk_info = tts_handler.get_speaker_info("tests/speaker")
print(spk_info)
This function returns a dictionary of speaker information, which will be used for advanced cloning tasks.
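For example, the emotion markers used in the following sections assume a layout like the one below, where each speaker directory holds one .wav/.txt pair per emotion (speaker and emotion names here match the markers used in sections 4.4 and 4.5):
tests/speaker/
├── zhaosheng/
│   ├── angry.wav
│   ├── angry.txt
│   ├── sad.wav
│   ├── sad.txt
│   ├── whisper.wav
│   └── whisper.txt
└── duanyibo/
    ├── whisper.wav
    └── whisper.txt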
4.4 Voice Cloning with Emotions
To clone a voice with emotional expressions, you can use the clone_with_emo method. The generated text should contain emotional markers, e.g., [[zhaosheng_angry]], where zhaosheng is the speaker and angry is the emotion.
r = tts_handler.clone_with_emo(
gen_text_batches=[
"[[zhaosheng_angry]] How could you talk to me like that? It's too much!",
"[[zhaosheng_whisper]] Be careful, don't let anyone hear, it's a secret.",
"[[zhaosheng_sad]] I'm really sad, things are out of my control."
],
speaker_info=spk_info,
speed=1,
channel=-1,
remove_silence=True,
wave_path="tests/tts_output_emo.wav",
spectrogram_path="tests/tts_output_emo.png"
)
4.5 Multi-Speaker and Multi-Emotion Dialogues
For generating dialogues between multiple speakers with different emotions, make sure the directory tests/speaker contains subdirectories for each speaker, and that the corresponding audio and text files exist for each emotion.
spk_info = tts_handler.get_speaker_info("tests/speaker")
r = tts_handler.clone_with_emo(
gen_text_batches=[
"[[zhaosheng_angry]] How could you talk to me like that? It's too much!",
"[[duanyibo_whisper]] Be careful, don't let anyone hear, it's a secret.",
"[[zhaosheng_sad]] I'm really sad, things are out of my control."
],
speaker_info=spk_info,
speed=1,
channel=-1,
remove_silence=True,
wave_path="tests/tts_output_emo.wav",
spectrogram_path="tests/tts_output_emo.png"
)
This method will generate a single audio file containing speech from multiple speakers with different emotional expressions.
4.6 Output Files
- Wave Path (wave_path): Specifies where to save the generated audio output. If concat=True, all gen_text_batches will be concatenated into one audio file.
- Spectrogram Path (spectrogram_path): Specifies where to save the spectrogram image of the generated speech. This is useful for visual analysis of the audio.
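To check a generated file programmatically, you can load it back and confirm the sample rate matches the target_sample_rate passed to the TTS handler. A small sketch using torchaudio (already used above):
import torchaudio

# Load the generated audio and inspect its shape and sample rate (24000 in the example above)
wave, sr = torchaudio.load("tests/tts_output_emo.wav")
print(wave.shape, sr)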
5. Command-line Interface
DSpeech provides a command-line interface for quick and easy access to its functionalities. To see the available commands, run:
dspeech help
DSpeech: A Command-line Speech Processing Toolkit
Usage: dspeech
Commands:
transcribe Transcribe an audio file
vad Perform VAD on an audio file
punc Add punctuation to a text
emo Perform emotion classification on an audio file
Options:
--model Model name (default: sensevoicesmall)
--vad-model VAD model name (default: fsmn-vad)
--punc-model Punctuation model name (default: ct-punc)
--emo-model Emotion model name (default: emotion2vec_plus_large)
--device Device to run the models on (default: cuda)
--file Audio file path for transcribing, VAD, or emotion classification
--text Text to process with punctuation model
--start Start time in seconds for processing audio files (default: 0)
--end End time in seconds for processing audio files (default: end of file)
--sample-rate Sample rate of the audio file (default: 16000)
Example: dspeech transcribe --file audio.wav
6. Usage Examples
- Transcribe an audio file:
  dspeech transcribe --file audio.wav
- Perform VAD on an audio file:
  dspeech vad --file audio.wav
- Add punctuation to a text:
  dspeech punc --text "this is a test"
- Perform emotion classification on an audio file:
  dspeech emo --file audio.wav
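The options documented above can be combined. For example, to transcribe only the 10-30 second segment of a file on the CPU:
dspeech transcribe --file audio.wav --start 10 --end 30 --device cpu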
7. License
DSpeech is licensed under the MIT License. See the LICENSE file for more details.