African Whisper: ASR for African Languages
Enhancing Automatic Speech Recognition (ASR) for African languages: translation and transcription capabilities through seamless fine-tuning and deployment pipelines for the Whisper model.
Features
- 🔧 Fine-tune the Whisper model on any audio dataset from Hugging Face, e.g., Mozilla's Common Voice datasets.
- 📊 View training run metrics on Weights & Biases (Wandb).
- 🎙️ Test your fine-tuned model using a Gradio UI or directly on an audio file (.mp3 or .wav).
- 🚀 Deploy an API endpoint for audio file transcription or translation.
- 🐳 Containerize your API endpoint application and push it to DockerHub.
Why Whisper? 🤔
- 🌐 Extensive Training Data: Trained on 680,000 hours of multilingual and multitask (translation and transcription) supervised data collected from the web.
- 🗣️ Sequence-based Understanding: Whisper considers the full sequence of spoken words, ensuring accurate context recognition, unlike word-level models such as Word2Vec.
- 💻 Simplification for Applications: Deploy a single model for transcribing and translating a multitude of languages, without sacrificing quality or context.
For more details, you can refer to the Whisper ASR model paper.
Want proof? Check this repo.
🚀 Getting Started
Prerequisites
Step 1: Installation
!pip install africanwhisper
Step 2: Set Parameters
huggingface_read_token = " "    # Hugging Face token with read access
huggingface_write_token = " "   # Hugging Face token with write access
dataset_name = "mozilla-foundation/common_voice_16_1"   # any audio dataset on the Hugging Face Hub
language_abbr = []              # dataset language abbreviation(s), e.g. ["sw"] for Swahili
model_id = "model-id"           # Whisper checkpoint to fine-tune, e.g. "openai/whisper-small"
processing_task = "automatic-speech-recognition"
wandb_api_key = " "             # Weights & Biases API key for logging training metrics
use_peft = True                 # fine-tune with parameter-efficient fine-tuning (PEFT)
Step 3: Prepare the Model
from training.data_prep import DataPrep

process = DataPrep(
    huggingface_read_token,
    dataset_name,
    language_abbr,
    model_id,
    processing_task,
    use_peft
)
tokenizer, feature_extractor, feature_processor, model = process.prepare_model()
Step 4: Preprocess the Dataset
processed_dataset = process.load_dataset(
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
    processor=feature_processor,
    num_samples=None   # optionally limit the number of samples loaded
)
Step 5: Train the Model
from training.model_trainer import Trainer

trainer = Trainer(
    huggingface_write_token,
    model_id,
    processed_dataset,
    model,
    feature_processor,
    feature_extractor,
    tokenizer,
    wandb_api_key,
    use_peft
)
trainer.train(
    max_steps=100,
    learning_rate=1e-3,
    per_device_train_batch_size=96,
    per_device_eval_batch_size=64,
    optim="adamw_bnb_8bit"   # 8-bit AdamW from bitsandbytes to reduce optimizer memory
)
Step 6: Test the Model Using an Audio File
If you fine-tuned with PEFT (use_peft=True), use the PEFT inference module:
from deployment.peft_speech_inference import SpeechInference
model_name = "your-finetuned-model-name-on-huggingface-hub"
huggingface_read_token = " "
task = "desired-task"   # "transcribe" or "translate"
audiofile_dir = "location-of-audio-file"   # path to a .mp3 or .wav file
inference = SpeechInference(model_name, huggingface_read_token)
pipeline = inference.pipe_initialization()
transcription = inference.output(pipeline, audiofile_dir, task)
print(transcription.text)
print(transcription.chunks)
print(transcription.timestamps)
print(transcription.chunk_texts)
Alternatively, the speech_inference module converts the model to an optimized format for faster inference and supports transcription alignment, speaker diarization, and subtitle generation:
from deployment.speech_inference import SpeechTranscriptionPipeline, ModelOptimization
model_name = "your-finetuned-model-name-on-huggingface-hub"
huggingface_read_token = " "
task = "desired-task"
audiofile_dir = "location-of-audio-file"
model_optimizer = ModelOptimization(model_name=model_name)
model_optimizer.convert_model_to_optimized_format()
model = model_optimizer.load_transcription_model()
inference = SpeechTranscriptionPipeline(
    audio_file_path=audiofile_dir,
    task=task,
    huggingface_read_token=huggingface_read_token
)
transcription = inference.transcribe_audio(model=model)
print(transcription)
# Align the transcription with the audio
alignment_result = inference.align_transcription(transcription)
# Perform speaker diarization on the aligned transcription
diarization_result = inference.diarize_audio(alignment_result)
print(diarization_result)
# Generate subtitles from the transcription, alignment, and diarization results
inference.generate_subtitles(transcription, alignment_result, diarization_result)
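The features above also mention testing a fine-tuned model through a Gradio UI. As a rough sketch only, the classes from this step could be wrapped in a small Gradio demo; the transcribe helper, the task="transcribe" value, and the interface wiring below are assumptions for illustration, not the project's own UI module:
# A minimal sketch of a Gradio demo (assumed wiring, not the project's own UI module)
import gradio as gr
from deployment.speech_inference import SpeechTranscriptionPipeline, ModelOptimization

model_name = "your-finetuned-model-name-on-huggingface-hub"
huggingface_read_token = " "

# Convert and load the model once, then reuse it for every uploaded file
model_optimizer = ModelOptimization(model_name=model_name)
model_optimizer.convert_model_to_optimized_format()
model = model_optimizer.load_transcription_model()

def transcribe(audio_path: str) -> str:
    pipeline = SpeechTranscriptionPipeline(
        audio_file_path=audio_path,
        task="transcribe",   # assumed task value; "translate" should also work
        huggingface_read_token=huggingface_read_token
    )
    return str(pipeline.transcribe_audio(model=model))

gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),   # the uploaded or recorded audio is passed as a file path
    outputs="text",
    title="African Whisper demo"
).launch()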
🖥️ Using the CLI
Step 1: Clone and Install Dependencies
- Clone the repository or download the application code to your local machine:
git clone https://github.com/KevKibe/African-Whisper.git
- Create a virtual environment for the project and activate it:
python3 -m venv venv
source venv/bin/activate
- Install the dependencies and navigate to the src directory:
pip install -r requirements.txt
cd src
Step 2: Finetune the Model
- To start training, use the following command:
python -m training.main --huggingface_read_token YOUR_HUGGING_FACE_READ_TOKEN_HERE --huggingface_write_token YOUR_HUGGING_FACE_WRITE_TOKEN_HERE --dataset_name AUDIO_DATASET_NAME --language_abbr LANGUAGE_ABBREVIATION LANGUAGE_ABBREVIATION --model_id MODEL_ID --processing_task PROCESSING_TASK --wandb_api_key YOUR_WANDB_API_KEY_HERE --use_peft
Flags:
- Find a description of each flag here.
Step 3: Get Inference
Install ffmpeg
- To get inference from your fine-tuned model, first ensure that ffmpeg is installed by running the command for your platform:
sudo apt update && sudo apt install ffmpeg   # Debian/Ubuntu
sudo pacman -S ffmpeg                        # Arch Linux
brew install ffmpeg                          # macOS (Homebrew)
choco install ffmpeg                         # Windows (Chocolatey)
scoop install ffmpeg                         # Windows (Scoop)
To get inference on the CLI locally
cd src/deployment
- Create a .env file (e.g., using nano .env), add these keys, and save the file:
MODEL_NAME = "your-finetuned-model"
HUGGINGFACE_READ_TOKEN = "huggingface-read-token"
- To perform transcription or translation with a PEFT fine-tuned model:
python -m deployment.peft_speech_inference_cli --audio_file FILENAME --task TASK
- To use the optimized inference pipeline, with optional alignment and diarization:
python -m deployment.speech_inference_cli --audio_file FILENAME --task TASK --perform_diarization --perform_alignment
Flags:
🛳️ Step 4: Deployment
- To deploy your fine-tuned model as a REST API endpoint, follow these instructions; a rough sketch is shown below.
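As an illustration only, and not the project's own deployment module, a fine-tuned model could be exposed as a REST endpoint along these lines. The FastAPI app, the /transcribe route, and the temporary-file handling are assumptions; the ModelOptimization and SpeechTranscriptionPipeline classes come from the inference step above:
# A minimal sketch of a REST endpoint (assumed wiring, not the project's deployment module)
import tempfile
from fastapi import FastAPI, File, UploadFile
from deployment.speech_inference import SpeechTranscriptionPipeline, ModelOptimization

MODEL_NAME = "your-finetuned-model"
HUGGINGFACE_READ_TOKEN = " "

# Convert and load the model once at startup
model_optimizer = ModelOptimization(model_name=MODEL_NAME)
model_optimizer.convert_model_to_optimized_format()
model = model_optimizer.load_transcription_model()

app = FastAPI()

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...), task: str = "transcribe"):
    # Save the uploaded audio to a temporary file so the pipeline can read it from disk
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(await file.read())
        audio_path = tmp.name
    pipeline = SpeechTranscriptionPipeline(
        audio_file_path=audio_path,
        task=task,
        huggingface_read_token=HUGGINGFACE_READ_TOKEN
    )
    return {"transcription": str(pipeline.transcribe_audio(model=model))}

# Run with: uvicorn main:app --host 0.0.0.0 --port 8000 (assuming this file is saved as main.py)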
Contributing
Contributions are welcome and encouraged.
Before contributing, please take a moment to review our Contribution Guidelines for important information on how to contribute to this project.
If you're unsure about anything or need assistance, don't hesitate to reach out to us or open an issue to discuss your ideas.
We look forward to your contributions!
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact
For any enquiries, please reach out to me at keviinkibe@gmail.com.