An Optimized Speech-to-Text Pipeline for the Whisper Model Supporting Multiple Inference Engines!
WhisperS2T is an optimized, lightning-fast, open-source Speech-to-Text (ASR) pipeline. It is tailored for the Whisper model to provide faster transcription. It is designed to be exceptionally fast compared to other implementations, boasting a 2.3X speed improvement over WhisperX and a 3X speed boost compared to the HuggingFace Pipeline with FlashAttention 2 (Insanely Fast Whisper). Moreover, it includes several heuristics to enhance transcription accuracy.
Whisper is a general-purpose speech recognition model developed by OpenAI (not me). It is trained on a large dataset of diverse audio and is a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
Supported output formats: txt, json, tsv, srt, vtt
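A minimal sketch of saving transcripts in one of these formats; the write_outputs call and its argument names (format, ip_files, save_dir) are assumptions based on common usage of this pipeline, so verify them against your installed version:
import whisper_s2t

model = whisper_s2t.load_model(model_identifier="large-v2", backend='CTranslate2')
files = ['data/KINCAID46/audio/1.wav']
out = model.transcribe_with_vad(files, lang_codes=['en'], tasks=['transcribe'],
                                initial_prompts=[None], batch_size=32)

# Assumed helper: writes one output file per input, in the chosen format.
whisper_s2t.write_outputs(out, format='vtt', ip_files=files, save_dir='./save_dir')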
(Check the complete release notes.) Check out the Google Colab notebooks provided here: notebooks
Stay tuned for a technical report comparing WhisperS2T against other Whisper pipelines. Meanwhile, check some quick benchmarks on an A30 GPU. See the scripts/ directory for the benchmarking scripts that I used.
NOTE: I conducted all the benchmarks with the without_timestamps parameter set to True. Setting this parameter to False may improve the Word Error Rate (WER) of the HuggingFace pipeline, but at the expense of increased inference time. Notably, the improvements in inference speed were achieved solely through a superior pipeline design, without any specific optimizations made to the backend inference engines (such as CTranslate2, FlashAttention2, etc.). For instance, WhisperS2T (utilizing FlashAttention2) demonstrates significantly faster inference than the HuggingFace pipeline (also using FlashAttention2), despite both leveraging the same inference engine: the HuggingFace Whisper model with FlashAttention2. Additionally, there is a noticeable difference in WER as well.
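For reference, a hedged sketch of toggling this flag at load time; it assumes asr_options accepts a without_timestamps key, which you should confirm for your version:
import whisper_s2t

# Assumption: without_timestamps is a supported asr_options key.
model = whisper_s2t.load_model("large-v2", asr_options={'without_timestamps': True})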
Use the pre-built Docker image:
docker pull shashikg/whisper_s2t:dev-trtllm
Dockerhub repo: https://hub.docker.com/r/shashikg/whisper_s2t/tags
Build from the main branch:
docker build --build-arg WHISPER_S2T_VER=main --build-arg SKIP_TENSORRT_LLM=1 -t whisper_s2t:main .
Build from a specific release, e.g. v1.3.0:
git checkout v1.3.0
docker build --build-arg WHISPER_S2T_VER=v1.3.0 --build-arg SKIP_TENSORRT_LLM=1 -t whisper_s2t:1.3.0 .
To build the container with TensorRT-LLM support:
docker build --build-arg WHISPER_S2T_VER=main -t whisper_s2t:main-trtllm .
Install audio packages required for resampling and loading audio files.
apt-get install -y libsndfile1 ffmpeg
To install or update to the latest released version of WhisperS2T, use the following command:
pip install -U whisper-s2t
Or, to install from the latest commit in this repo:
pip install -U git+https://github.com/shashikg/WhisperS2T.git
NOTE: If your cuDNN and cuBLAS installation was done via pip wheels, you can run the following to add the cuDNN path to LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
To use the TensorRT-LLM backend
For the TensorRT-LLM backend, you will need to install TensorRT and TensorRT-LLM.
bash <repo_dir>/install_tensorrt.sh
For most systems the given bash script should work; if it doesn't, please follow the official TensorRT-LLM installation instructions here.
Example using the CTranslate2 backend:
import whisper_s2t
model = whisper_s2t.load_model(model_identifier="large-v2", backend='CTranslate2')
files = ['data/KINCAID46/audio/1.wav']
lang_codes = ['en']
tasks = ['transcribe']
initial_prompts = [None]
out = model.transcribe_with_vad(files,
                                lang_codes=lang_codes,
                                tasks=tasks,
                                initial_prompts=initial_prompts,
                                batch_size=32)
print(out[0][0]) # Print first utterance for first file
"""
[Console Output]
{'text': "Let's bring in Phil Mackie who is there at the palace. We're looking at Teresa and Philip May. Philip, can you see how he's being transferred from the helicopters? It looks like, as you said, the beast. It's got its headlights on because the sun is beginning to set now, certainly sinking behind some clouds. It's about a quarter of a mile away down the Grand Drive",
'avg_logprob': -0.25426941679184695,
'no_speech_prob': 8.147954940795898e-05,
'start_time': 0.0,
'end_time': 24.8}
"""
To use word alignment, load the model like this:
model = whisper_s2t.load_model("large-v2", asr_options={'word_timestamps': True})
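A hedged sketch of reading the resulting word-level timings; the per-utterance key is assumed here to be word_timestamps, so inspect the output dict on your version to confirm:
out = model.transcribe_with_vad(files, lang_codes=['en'], tasks=['transcribe'],
                                initial_prompts=[None], batch_size=32)

for utt in out[0]:
    # Assumption: each utterance gains a 'word_timestamps' list when
    # asr_options={'word_timestamps': True} is set at load time.
    for word in utt.get('word_timestamps', []):
        print(word)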
Example using the TensorRT-LLM backend:
import whisper_s2t
model = whisper_s2t.load_model(model_identifier="large-v2", backend='TensorRT-LLM')
files = ['data/KINCAID46/audio/1.wav']
lang_codes = ['en']
tasks = ['transcribe']
initial_prompts = [None]
out = model.transcribe_with_vad(files,
                                lang_codes=lang_codes,
                                tasks=tasks,
                                initial_prompts=initial_prompts,
                                batch_size=24)
print(out[0][0]) # Print first utterance for first file
"""
[Console Output]
{'text': "Let's bring in Phil Mackie who is there at the palace. We're looking at Teresa and Philip May. Philip, can you see how he's being transferred from the helicopters? It looks like, as you said, the beast. It's got its headlights on because the sun is beginning to set now, certainly sinking behind some clouds. It's about a quarter of a mile away down the Grand Drive",
'start_time': 0.0,
'end_time': 24.8}
"""
Check this documentation for more details.
NOTE: On the first run, the model may show slightly slower inference speed. After 1-2 runs, it will reach full speed. This is due to JIT tracing of the VAD model.
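A simple warm-up pattern, under the assumption that one short pass is enough to trigger the VAD model's JIT trace before you measure anything:
# Warm-up: run one file through the pipeline once so the VAD model's
# JIT trace is compiled; subsequent calls reflect steady-state speed.
_ = model.transcribe_with_vad(['data/KINCAID46/audio/1.wav'],
                              lang_codes=['en'], tasks=['transcribe'],
                              initial_prompts=[None], batch_size=8)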
This project is licensed under MIT License - see the LICENSE file for details.