Pingala Shunya

A comprehensive speech transcription package by Shunya Labs supporting ct2 (CTranslate2) and transformers backends. It delivers high-quality transcription through a unified API and advanced features.

Overview

Pingala Shunya provides a unified interface for transcribing audio files using state-of-the-art backends optimized by Shunya Labs. Whether you want the high-performance CTranslate2 optimization or the flexibility of Hugging Face transformers, Pingala Shunya delivers exceptional results with the shunyalabs/pingala-v1-en-verbatim model.

Features

  • Shunya Labs Optimized: Built by Shunya Labs for superior performance
  • CT2 Backend: High-performance CTranslate2 optimization (default)
  • Transformers Backend: Hugging Face models and latest research
  • Auto-Detection: Automatically selects the best backend for your model
  • Unified API: Same interface across all backends
  • Word-Level Timestamps: Precise timing for individual words
  • Confidence Scores: Quality metrics for transcription segments and words
  • Voice Activity Detection (VAD): Filter out silence and background noise
  • Language Detection: Automatic language identification
  • Multiple Output Formats: Text, SRT subtitles, and WebVTT
  • Streaming Support: Process segments as they are generated (see the sketch after this list)
  • Advanced Parameters: Full control over all backend features
  • Rich CLI: Command-line tool with comprehensive options
  • Error Handling: Comprehensive error handling and validation
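
For example, the streaming support listed above means segments can be handled as soon as they are decoded instead of after the whole file finishes. A minimal sketch, assuming the ct2 backend yields segments lazily (as faster-whisper does):

from pingala_shunya import PingalaTranscriber

transcriber = PingalaTranscriber()
segments, info = transcriber.transcribe_file("long_audio.wav")
for segment in segments:  # with the ct2 backend, segments arrive as they are decoded
    print(f"[{segment.start:.2f}s] {segment.text}", flush=True)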

Installation

Standard Installation (All Backends Included)

pip install pingala-shunya

This installs all dependencies including:

  • faster-whisper ≥ 0.10.0 (CT2 backend)
  • transformers == 4.52.4 (Transformers backend)
  • ctranslate2 == 4.4.0 (GPU acceleration)
  • librosa ≥ 0.10.0 (Audio processing)
  • torch ≥ 1.9.0 & torchaudio ≥ 0.9.0 (PyTorch)
  • datasets ≥ 2.0.0 & numpy ≥ 1.21.0
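
To confirm the pinned versions resolved correctly after installation, a quick check using only the standard library:

from importlib.metadata import version  # available on Python 3.8+

for pkg in ("faster-whisper", "transformers", "ctranslate2", "librosa", "torch"):
    print(f"{pkg}: {version(pkg)}")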

Development Installation

# Complete installation with development tools
pip install "pingala-shunya[complete]"

This adds development tools: pytest, black, flake8, and mypy.

Requirements

  • Python 3.8 or higher
  • CUDA-compatible GPU (recommended for optimal performance)
  • PyTorch and torchaudio

Unlike many transcription tools, pingala-shunya does not require FFmpeg to be installed on the system: audio is decoded with the Python library PyAV, which bundles the FFmpeg libraries in its package.
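
To verify that PyAV can decode your audio without a system FFmpeg, a quick sanity check (assumes the av package, which faster-whisper installs):

import av  # PyAV bundles the FFmpeg libraries, no system install needed

with av.open("audio.wav") as container:
    stream = container.streams.audio[0]
    print(f"codec: {stream.codec_context.name}, sample rate: {stream.rate} Hz")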

GPU Support and CUDA Installation

GPU Requirements

GPU execution requires the following NVIDIA libraries to be installed:

Important: The latest versions of ctranslate2 only support CUDA 12 and cuDNN 9. For CUDA 11 and cuDNN 8, downgrade to the 3.24.0 version of ctranslate2. For CUDA 12 and cuDNN 8, use ctranslate2==4.4.0 (already included in pingala-shunya):

pip install --force-reinstall ctranslate2==4.4.0
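
To confirm that your ctranslate2 build can actually see the GPU, a one-off check using ctranslate2's public API:

import ctranslate2

# 0 means ctranslate2 cannot see a CUDA device (no GPU, or a wrong CUDA/cuDNN pairing)
print(f"CUDA devices visible to ctranslate2: {ctranslate2.get_cuda_device_count()}")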

CUDA Installation Methods

Method 1: Docker (Recommended)

The easiest way is to use the official NVIDIA CUDA Docker image:

docker run --gpus all -it nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04
pip install pingala-shunya

Method 2: pip Installation (Linux only)

Install CUDA libraries via pip:

pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*

export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`

Method 3: Manual Download (Windows & Linux)

Download pre-built CUDA libraries from Purfview's repository. Extract and add to your system PATH.

Performance Benchmarks

Based on faster-whisper benchmarks transcribing 13 minutes of audio:

GPU Performance (NVIDIA RTX 3070 Ti 8GB)

| Backend | Precision | Beam size | Time | VRAM Usage |
|---|---|---|---|---|
| transformers (SDPA) | fp16 | 5 | 1m52s | 4960MB |
| faster-whisper (CT2) | fp16 | 5 | 1m03s | 4525MB |
| faster-whisper (CT2) (batch_size=8) | fp16 | 5 | 17s | 6090MB |
| faster-whisper (CT2) | int8 | 5 | 59s | 2926MB |

CPU Performance (Intel Core i7-12700K, 8 threads)

| Backend | Precision | Beam size | Time | RAM Usage |
|---|---|---|---|---|
| faster-whisper (CT2) | fp32 | 5 | 2m37s | 2257MB |
| faster-whisper (CT2) (batch_size=8) | fp32 | 5 | 1m06s | 4230MB |
| faster-whisper (CT2) | int8 | 5 | 1m42s | 1477MB |

As these benchmarks show, the optimized CTranslate2 backend delivers the fastest transcription times with the lowest memory usage.
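
To measure throughput on your own hardware, a minimal timing harness built on the API shown below (adjust the audio path and settings to your setup):

import time
from pingala_shunya import PingalaTranscriber

transcriber = PingalaTranscriber(device="cuda", compute_type="float16")
start = time.perf_counter()
segments, info = transcriber.transcribe_file("audio.wav", beam_size=5)
text = " ".join(s.text for s in segments)  # consuming segments forces full decoding
print(f"{info.duration:.0f}s of audio transcribed in {time.perf_counter() - start:.1f}s")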

Supported Backends

ct2 (CTranslate2) - Default

  • Performance: Fastest inference with CTranslate2 optimization
  • Features: Full parameter control, VAD, streaming, GPU acceleration
  • Models: All compatible models, optimized for Shunya Labs models
  • Best for: Production use, real-time applications

transformers

  • Performance: Good performance with Hugging Face ecosystem
  • Features: Access to latest models, easy fine-tuning integration
  • Models: Any Seq2Seq model on Hugging Face Hub
  • Best for: Research, latest models, custom transformer models
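
To see which backend auto-detection chose, query get_model_info() (the same accessor used in Advanced Usage below):

from pingala_shunya import PingalaTranscriber

transcriber = PingalaTranscriber()  # auto-detects the backend
model_info = transcriber.get_model_info()
print(f"backend: {model_info['backend']}, model: {model_info['model_name']}")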

Supported Models

Default Model

  • shunyalabs/pingala-v1-en-verbatim - High-quality English transcription model by Shunya Labs

Shunya Labs Models

  • shunyalabs/pingala-v1-en-verbatim - Optimized for English verbatim transcription
  • More Shunya Labs models coming soon!

Custom Models (Advanced Users)

  • Any Hugging Face Seq2Seq model compatible with the automatic-speech-recognition pipeline
  • Local model directories or files (e.g., /path/to/local/model)

Quick Start

Basic Usage with Auto-Detection

from pingala_shunya import PingalaTranscriber

# Initialize with default Shunya Labs model and auto-detected backend
transcriber = PingalaTranscriber()

# Simple transcription
segments = transcriber.transcribe_file_simple("audio.wav")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Backend Selection

from pingala_shunya import PingalaTranscriber

# Explicitly choose backends with Shunya Labs model
transcriber_ct2 = PingalaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim", backend="ct2")
transcriber_tf = PingalaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim", backend="transformers")  

# Auto-detection (recommended)
transcriber_auto = PingalaTranscriber()  # Uses default Shunya Labs model with ct2

Advanced Usage with All Features

from pingala_shunya import PingalaTranscriber

# Initialize with specific backend and settings
transcriber = PingalaTranscriber(
    model_name="shunyalabs/pingala-v1-en-verbatim",
    backend="ct2",
    device="cuda", 
    compute_type="float16"
)

# Advanced transcription with full metadata
segments, info = transcriber.transcribe_file(
    "audio.wav",
    beam_size=10,                    # Higher beam size for better accuracy
    word_timestamps=True,            # Enable word-level timestamps
    temperature=0.0,                 # Deterministic output
    compression_ratio_threshold=2.4, # Filter out low-quality segments
    log_prob_threshold=-1.0,         # Filter by probability
    no_speech_threshold=0.6,         # Silence detection threshold
    initial_prompt="High quality audio recording",  # Guide the model
    hotwords="Python, machine learning, AI",        # Boost specific words
    vad_filter=True,                 # Enable voice activity detection
    task="transcribe"                # or "translate" for translation
)

# Print transcription info
model_info = transcriber.get_model_info()
print(f"Backend: {model_info['backend']}")
print(f"Model: {model_info['model_name']}")
print(f"Language: {info.language} (confidence: {info.language_probability:.3f})")
print(f"Duration: {info.duration:.2f} seconds")

# Process segments with all metadata
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
    if segment.confidence:
        print(f"Confidence: {segment.confidence:.3f}")
    
    # Word-level details
    for word in segment.words:
        print(f"  '{word.word}' [{word.start:.2f}-{word.end:.2f}s] (conf: {word.probability:.3f})")

Using Transformers Backend

# Use Shunya Labs model with transformers backend
transcriber = PingalaTranscriber(
    model_name="shunyalabs/pingala-v1-en-verbatim",
    backend="transformers"
)

segments = transcriber.transcribe_file_simple("audio.wav")

# Auto-detection will use ct2 by default for Shunya Labs models
transcriber = PingalaTranscriber()  # Uses ct2 backend (recommended)

Command-Line Interface

The package includes a comprehensive CLI supporting both backends:

Basic CLI Usage

# Basic transcription with auto-detected backend
pingala audio.wav

# Specify backend explicitly  
pingala audio.wav --backend ct2
pingala audio.wav --backend transformers

# Use Shunya Labs model with different backends
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim --backend ct2
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim --backend transformers

# Save to file
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim -o transcript.txt

# Use CPU for processing
pingala audio.wav --device cpu

Advanced CLI Features

# Word-level timestamps with confidence scores (ct2)
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim --word-timestamps --show-confidence --show-words

# Voice Activity Detection (ct2 only)
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim --vad --verbose

# Language detection with different backends
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim --detect-language --backend ct2

# SRT subtitles with word-level timing
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim --format srt --word-timestamps -o subtitles.srt

# Transformers backend with Shunya Labs model
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim --backend transformers --verbose

# Advanced parameters (ct2)
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim \
  --beam-size 10 \
  --temperature 0.2 \
  --compression-ratio-threshold 2.4 \
  --log-prob-threshold -1.0 \
  --initial-prompt "This is a technical presentation" \
  --hotwords "Python,AI,machine learning"

CLI Options Reference

| Option | Description | Backends | Default |
|---|---|---|---|
| --model | Model name or path | All | shunyalabs/pingala-v1-en-verbatim |
| --backend | Backend selection | All | auto-detect |
| --device | Device: cuda, cpu, auto | All | cuda |
| --compute-type | Precision: float16, float32, int8 | All | float16 |
| --beam-size | Beam size for decoding | All | 5 |
| --language | Language code (e.g., 'en') | All | auto-detect |
| --word-timestamps | Enable word-level timestamps | ct2 | False |
| --show-confidence | Show confidence scores | All | False |
| --show-words | Show word-level details | All | False |
| --vad | Enable VAD filtering | ct2 | False |
| --detect-language | Language detection only | All | False |
| --format | Output format: text, srt, vtt | All | text |
| --temperature | Sampling temperature | All | 0.0 |
| --compression-ratio-threshold | Compression ratio filter | ct2 | 2.4 |
| --log-prob-threshold | Log probability filter | ct2 | -1.0 |
| --no-speech-threshold | No speech threshold | All | 0.6 |
| --initial-prompt | Initial prompt text | All | None |
| --hotwords | Hotwords to boost | ct2 | None |
| --task | Task: transcribe, translate | All | transcribe |

Backend Comparison

| Feature | ct2 | transformers |
|---|---|---|
| Performance | Fastest | Good |
| GPU Acceleration | Optimized | Standard |
| Memory Usage | Lowest | Moderate |
| Model Support | Any model | Any HF model |
| Word Timestamps | Full support | Limited |
| VAD Filtering | Built-in | No |
| Streaming | True streaming | Batch only |
| Advanced Params | All features | Basic |
| Latest Models | Updated | Latest |
| Custom Models | CTranslate2 | Any format |

Recommendations

  • Production/Performance: Use ct2 with Shunya Labs models
  • Latest Research Models: Use transformers
  • Real-time Applications: Use ct2 with VAD
  • Custom Transformer Models: Use transformers

Performance Optimization

Backend Selection Tips

# Real-time, production, and maximum-accuracy use: ct2 with the Shunya Labs model
transcriber = PingalaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim", backend="ct2")

# Research, latest models, or as an alternative backend: transformers
transcriber = PingalaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim", backend="transformers")

Hardware Recommendations

| Use Case | Model | Backend | Hardware |
|---|---|---|---|
| Real-time | shunyalabs/pingala-v1-en-verbatim | ct2 | GPU 4GB+ |
| Production | shunyalabs/pingala-v1-en-verbatim | ct2 | GPU 6GB+ |
| Maximum Quality | shunyalabs/pingala-v1-en-verbatim | ct2 | GPU 8GB+ |
| Alternative | shunyalabs/pingala-v1-en-verbatim | transformers | GPU 4GB+ |
| CPU-only | shunyalabs/pingala-v1-en-verbatim | any | 8GB+ RAM |

GPU Optimization

# Maximum performance on GPU - FP16 precision
transcriber = PingalaTranscriber(
    model_name="shunyalabs/pingala-v1-en-verbatim",
    device="cuda",
    compute_type="float16"  # Fastest GPU performance
)

# Memory constrained GPU - INT8 quantization
transcriber = PingalaTranscriber(
    model_name="shunyalabs/pingala-v1-en-verbatim", 
    device="cuda",
    compute_type="int8_float16"  # Lower memory usage
)

# Batched processing of audio segments for higher throughput
segments, info = transcriber.transcribe_file(
    "audio.wav",
    batch_size=8,  # Process multiple segments in parallel
    beam_size=5
)
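
For processing many files, reuse a single transcriber so the model loads only once. A sketch using the documented API (the paths are illustrative):

from pathlib import Path
from pingala_shunya import PingalaTranscriber

transcriber = PingalaTranscriber(device="cuda", compute_type="float16")
for audio in sorted(Path("recordings").glob("*.wav")):
    segments = transcriber.transcribe_file_simple(str(audio))
    # write one plain-text transcript next to each audio file
    audio.with_suffix(".txt").write_text(" ".join(s.text for s in segments))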

CPU Optimization

# Control CPU threads before the model is created (adjust to your CPU)
import os
os.environ["OMP_NUM_THREADS"] = "4"

from pingala_shunya import PingalaTranscriber

# Optimized CPU settings
transcriber = PingalaTranscriber(
    model_name="shunyalabs/pingala-v1-en-verbatim",
    device="cpu",
    compute_type="int8"  # Lower memory, faster on CPU
)

Memory Optimization Tips

  • GPU VRAM: Use int8_float16 compute type to reduce memory usage by ~40% (see the check below)
  • System RAM: Use int8 compute type on CPU to reduce memory usage
  • Batch Size: Increase batch size if you have sufficient memory for faster processing
  • Model Size: Consider smaller models for memory-constrained environments
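
To see how much VRAM you have to work with before choosing a compute type, torch (already a dependency) can report it:

import torch

if torch.cuda.is_available():
    # mem_get_info requires a reasonably recent PyTorch build
    free, total = torch.cuda.mem_get_info()
    print(f"free VRAM: {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB")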

Performance Comparison Tips

When comparing against other implementations:

  • Use the same beam size (the default is 5 in pingala-shunya)
  • Compare runs at a similar Word Error Rate (WER)
  • Set a consistent thread count: OMP_NUM_THREADS=4 python script.py
  • Ensure similar transcription quality metrics

Troubleshooting

Common CUDA Issues

Issue: RuntimeError: No CUDA capable device found

# Check CUDA availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# If False, install CUDA toolkit or use CPU
transcriber = PingalaTranscriber(device="cpu")

Issue: CUDA out of memory

# Solution 1: Use INT8 quantization
transcriber = PingalaTranscriber(compute_type="int8_float16")

# Solution 2: Reduce batch size
segments, info = transcriber.transcribe_file("audio.wav", batch_size=1)

# Solution 3: Use CPU
transcriber = PingalaTranscriber(device="cpu")

Issue: cuDNN/cuBLAS library not found

# Install CUDA libraries via pip (Linux)
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*

# Or use Docker
docker run --gpus all -it nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04

Issue: ctranslate2 version compatibility

# For CUDA 12 + cuDNN 8 (default in pingala-shunya)
pip install ctranslate2==4.4.0

# For CUDA 11 + cuDNN 8
pip install ctranslate2==3.24.0
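
When debugging version mismatches, print what is actually installed (a quick Python check):

import ctranslate2
import torch

print(f"ctranslate2 {ctranslate2.__version__}")
# torch.version.cuda is None on CPU-only builds
print(f"torch {torch.__version__} (built for CUDA {torch.version.cuda})")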

Model Loading Issues

Issue: Model download fails

# Use local model path
transcriber = PingalaTranscriber(model_name="/path/to/local/model")

# Or specify different Shunya Labs model
transcriber = PingalaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim")

Issue: Alignment heads error (word timestamps)

# This is handled automatically with fallback to no word timestamps
# Word timestamps are supported with Shunya Labs models
transcriber = PingalaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim")
segments, info = transcriber.transcribe_file("audio.wav", word_timestamps=True)

Examples

See example.py for comprehensive examples:

# Run with default backend (auto-detected)
python example.py audio.wav

# Test specific backends with Shunya Labs model
python example.py audio.wav --backend ct2
python example.py audio.wav --backend transformers  

# Test Shunya Labs model with different backends
python example.py audio.wav shunyalabs/pingala-v1-en-verbatim

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

About Shunya Labs

Visit Shunya Labs to learn more about our AI research and products. Contact us at 0@shunyalabs.ai for questions or collaboration opportunities.
