DSpeech is an advanced command-line toolkit designed for speech processing tasks such as transcription, voice activity detection (VAD), punctuation addition, and emotion classification. It is built on top of state-of-the-art models and provides an easy-to-use interface for handling various speech processing jobs.
Clone the repository:
git clone https://gitee.com/iint/dspeech.git
cd dspeech
Install the required packages:
pip install -r requirements.txt
Directly install dspeech via pip:
pip install dspeech
Set the DSPEECH_HOME environment variable to the directory where your models are stored:
export DSPEECH_HOME=/path/to/dspeech/models
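DSpeech looks up models under this directory at runtime. A trivial check (not a DSpeech API, just standard Python) to confirm the variable is visible to your interpreter:

import os

# Should print the directory exported above, e.g. /path/to/dspeech/models
print(os.environ.get("DSPEECH_HOME"))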
Download the necessary models and place them in the DSPEECH_HOME directory. You can download the models using the following commands (replace <model_id> with the actual model ID):
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download <model_id> --local-dir $DSPEECH_HOME/<model_name>
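If you prefer to download models from Python instead of the CLI, the huggingface_hub library (which huggingface-cli is built on, and which also honors HF_ENDPOINT) offers an equivalent call. This is a minimal sketch; the repo ID and folder name are placeholders you would replace with the actual model ID:

from huggingface_hub import snapshot_download
import os

# Placeholder repo ID: substitute the actual model ID, as with the CLI command above.
snapshot_download(
    repo_id="<model_id>",
    local_dir=os.path.join(os.environ["DSPEECH_HOME"], "<model_name>"),
)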
(Optional) You can also install Dguard if you want to perform speaker diarization:
pip install dguard==0.1.20
export DGUARD_MODEL_PATH=<path to dguard model home>
dguard_info
Print the help message to see the available commands:
dspeech help
You should see a list of available commands and options.
DSpeech: A Command-line Speech Processing Toolkit
Usage: dspeech
Commands:
help Show this help message
transcribe Transcribe an audio file
vad Perform VAD on an audio file
punc Add punctuation to a text
emo Perform emotion classification on an audio file
clone Clone speaker's voice and generate audio
clone_with_emo Clone speaker's voice with emotion and generate audio
Options (for ASR and emotion classification):
--model Model name (default: sensevoicesmall)
--vad-model VAD model name (default: fsmn-vad)
--punc-model Punctuation model name (default: ct-punc)
--emo-model Emotion model name (default: emotion2vec_plus_large)
--device Device to run the models on (default: cuda)
--file Audio file path for transcribing, VAD, or emotion classification
--text Text to process with punctuation model
--start Start time in seconds for processing audio files (default: 0)
--end End time in seconds for processing audio files (default: end of file)
--sample-rate Sample rate of the audio file (default: 16000)
Options (for TTS):
--ref_audio Reference audio file path for voice cloning
--ref_text Reference text for voice cloning
--speaker_folder Speaker folder path for emotional voice cloning
--text Text to generate audio
--audio_save_path Path to save the audio
--spectrogram_save_path [Optional] Path to save the spectrogram
--speed Speed of the audio
--sample_rate Sample rate of the audio file (default: 16000)
Example: dspeech transcribe --file audio.wav
DSpeech offers transcription, voice activity detection (VAD), punctuation addition, emotion classification, and voice cloning (including emotional voice cloning). To use DSpeech in a Python script, import the STT class and create an instance with the desired models:
from dspeech.stt import STT
# Initialize the STT handler with the specified models
handler = STT(model_name="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc", emo_model="emotion2vec_plus_large")
# Transcribe an audio file
transcription = handler.transcribe_file("audio.wav")
print(transcription)
# Perform VAD on an audio file
vad_result = handler.vad_file("audio.wav")
print(vad_result)
# Add punctuation to a text
punctuation_text = handler.punc_result("this is a test")
print(punctuation_text)
# Perform emotion classification on an audio file
emotion_result = handler.emo_classify_file("audio.wav")
print(emotion_result)
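The calls above can also be combined into one pipeline. The sketch below is a minimal example that assumes transcribe_file returns plain text which punc_result accepts, as in the snippets shown here:

from dspeech.stt import STT

# Reuse a single handler for both steps.
handler = STT(model_name="paraformer-zh", vad_model="fsmn-vad",
              punc_model="ct-punc", emo_model="emotion2vec_plus_large")

# Transcribe first, then add punctuation to the raw transcript.
raw_text = handler.transcribe_file("audio.wav")
print(handler.punc_result(raw_text))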
To initialize the TTS module, create a TTS handler object, specifying the target device (CPU or GPU) and the sample rate for the generated audio.
from dspeech import TTS
import torch
# Initialize TTS handler
tts_handler = TTS(
device="cuda", # Use "cpu" if no GPU is available
target_sample_rate=24000 # Define target sample rate for output audio
)
The device parameter can be set to "cuda" for GPU usage or "cpu" for running on the CPU.
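If you are unsure whether a GPU is present, you can pick the device at runtime. A small convenience sketch using torch (already imported in the example above):

import torch
from dspeech import TTS

# Fall back to the CPU when CUDA is not available.
device = "cuda" if torch.cuda.is_available() else "cpu"
tts_handler = TTS(device=device, target_sample_rate=24000)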
In basic voice cloning, you provide a reference audio and text, and the system generates a new speech that mimics the voice from the reference audio with the content of the provided text.
import torchaudio
# Load reference audio using torchaudio
ref_audio, sample_rate = torchaudio.load("tests/a.wav")
# Clone voice based on reference audio and text
r = tts_handler.clone(
ref_audio=(ref_audio, sample_rate), # Reference audio in (Tensor, int) format or file path
ref_text="Reference text", # The transcription of the reference audio
gen_text_batches=["Hello, my name is Xiao Ming", "I am an AI", "I can speak Chinese"], # Text to generate speech for
speed=1, # Speech speed (1 is normal speed)
channel=-1, # Merge all channels (-1) or specify one channel
remove_silence=True, # Option to remove silence from the reference audio
wave_path="tests/tts_output.wav", # Path to save generated audio
spectrogram_path="tests/tts_output.png", # Path to save spectrogram of generated audio
concat=True # Whether to merge all generated audio into a single output file
)
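As the comment on ref_audio notes, a file path can be passed instead of a (Tensor, int) pair, in which case the handler loads the reference audio itself. A minimal variant of the same call, assuming identical keyword arguments:

# Same call, but passing the reference audio as a file path.
r = tts_handler.clone(
    ref_audio="tests/a.wav",                      # file path instead of (Tensor, int)
    ref_text="Reference text",
    gen_text_batches=["Hello, my name is Xiao Ming"],
    speed=1,
    channel=-1,
    remove_silence=True,
    wave_path="tests/tts_output.wav",
    spectrogram_path="tests/tts_output.png",
    concat=True
)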
Parameters:
ref_audio: The reference audio, either as a (Tensor, int) pair or as a file path.
ref_text: The transcription of the reference audio.
gen_text_batches: A list of text strings to convert into speech.
speed: Adjusts the speed of the generated speech (default is 1, normal speed).
remove_silence: Whether to remove silence from the reference audio (boolean).
wave_path: Path to save the generated audio file.
spectrogram_path: Path to save the spectrogram image of the generated audio.

For complex voice cloning with multiple speakers and emotions, you need to extract speaker information from a directory containing multiple audio files for different speakers and emotions.
The directory should have the following structure:
<path>/
├── speaker1/
│ ├── happy.wav
│ ├── happy.txt
│ ├── neutral.wav
│ ├── neutral.txt
Each subdirectory represents a speaker, and each audio file should have an accompanying text file.
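Before extracting speaker information, it can help to confirm the layout is complete. The helper below is an illustrative sketch (not part of DSpeech) that reports any .wav file missing its matching .txt transcript:

from pathlib import Path

def check_speaker_dir(root):
    """Report .wav clips that lack a matching .txt transcript."""
    for speaker_dir in Path(root).iterdir():
        if not speaker_dir.is_dir():
            continue
        for wav in speaker_dir.glob("*.wav"):
            if not wav.with_suffix(".txt").exists():
                print(f"Missing transcript for {wav}")

check_speaker_dir("tests/speaker")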
# Extract speaker information from the folder
spk_info = tts_handler.get_speaker_info("tests/speaker")
print(spk_info)
This function returns a dictionary of speaker information, which will be used for advanced cloning tasks.
To clone a voice with emotional expressions, use the clone_with_emo method. The generated text should contain emotional markers, e.g., [[zhaosheng_angry]], where zhaosheng is the speaker and angry is the emotion.
r = tts_handler.clone_with_emo(
gen_text_batches=[
"[[zhaosheng_angry]] How could you talk to me like that? It's too much!",
"[[zhaosheng_whisper]] Be careful, don't let anyone hear, it's a secret.",
"[[zhaosheng_sad]] I'm really sad, things are out of my control."
],
speaker_info=spk_info, # Dictionary of speaker information
speed=1, # Speech speed
channel=-1, # Merge all channels
remove_silence=True, # Remove silence in the generated output
wave_path="tests/tts_output_emo.wav", # Path to save output audio with emotions
spectrogram_path="tests/tts_output_emo.png" # Path to save spectrogram with emotions
)
For generating dialogues between multiple speakers with different emotions, make sure the directory tests/speaker contains subdirectories for each speaker, and that the corresponding audio and text files exist for each emotion.
# Extract speaker information for multiple speakers
spk_info = tts_handler.get_speaker_info("tests/speaker")
# Generate multi-speaker and multi-emotion dialogue
r = tts_handler.clone_with_emo(
gen_text_batches=[
"[[zhaosheng_angry]] How could you talk to me like that? It's too much!",
"[[duanyibo_whisper]] Be careful, don't let anyone hear, it's a secret.",
"[[zhaosheng_sad]] I'm really sad, things are out of my control."
],
speaker_info=spk_info, # Speaker information extracted from directory
speed=1, # Speech speed
channel=-1, # Merge all channels
remove_silence=True, # Remove silence from the reference audio
wave_path="tests/tts_output_emo.wav", # Path to save generated audio
spectrogram_path="tests/tts_output_emo.png" # Path to save generated spectrogram
)
This method will generate a single audio file containing speech from multiple speakers with different emotional expressions.
wave_path: Specifies where to save the generated audio output. If concat=True, all gen_text_batches will be concatenated into one audio file.
spectrogram_path: Specifies where to save the spectrogram image of the generated speech. This is useful for visual analysis of the audio.

DSpeech also provides a command-line interface for quick and easy access to its functionalities. To see the available commands, run:
dspeech help
Example invocations:
dspeech transcribe --file audio.wav
dspeech vad --file audio.wav
dspeech punc --text "this is a test"
dspeech emo --file audio.wav
DSpeech is licensed under the MIT License. See the LICENSE file for more details.