
pingala-shunya
Speech transcription package by Shunya Labs with ct2 and transformers backends
A comprehensive speech transcription package by Shunya Labs supporting ct2 (CTranslate2) and transformers backends. Get superior transcription quality with a unified API and advanced features.
Pingala Shunya provides a unified interface for transcribing audio files using state-of-the-art backends optimized by Shunya Labs. Whether you want high-performance CTranslate2 optimization or the flexibility of Hugging Face transformers, Pingala Shunya delivers exceptional results with the shunyalabs/pingala-v1-en-verbatim model.
pip install pingala-shunya
This installs all required dependencies, including ctranslate2 and PyAV.
# Complete installation with development tools
pip install "pingala-shunya[complete]"
This adds development tools: pytest, black, flake8, and mypy.
Unlike many other transcription tools, FFmpeg does not need to be installed on the system: audio is decoded with the Python library PyAV, which bundles the FFmpeg libraries in its package.
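As a quick illustration (not part of the package API; pingala-shunya handles this internally), PyAV can open and decode audio on its own:
# Sketch: PyAV opens and decodes audio without a system-wide FFmpeg install
import av

with av.open("audio.wav") as container:  # placeholder file name
    stream = container.streams.audio[0]
    print(f"codec={stream.codec_context.name}, sample_rate={stream.rate}")
    for frame in container.decode(stream):  # frames are decoded lazily
        pass  # each frame holds raw PCM samples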
GPU execution requires the following NVIDIA libraries to be installed:
- cuBLAS for CUDA 12
- cuDNN 9 for CUDA 12
Important: The latest versions of ctranslate2 only support CUDA 12 and cuDNN 9. For CUDA 11 and cuDNN 8, downgrade to version 3.24.0 of ctranslate2. For CUDA 12 and cuDNN 8, use ctranslate2==4.4.0 (already included in pingala-shunya):
pip install --force-reinstall ctranslate2==4.4.0
The easiest way is to use the official NVIDIA CUDA Docker image:
docker run --gpus all -it nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04
pip install pingala-shunya
Install CUDA libraries via pip:
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
Alternatively, download pre-built cuBLAS and cuDNN libraries from Purfview's repository, then extract them and add the directory to your system PATH.
Based on faster-whisper benchmarks transcribing 13 minutes of audio:
On GPU:

| Backend | Precision | Beam size | Time | VRAM Usage |
|---|---|---|---|---|
| transformers (SDPA) | fp16 | 5 | 1m52s | 4960MB |
| faster-whisper (CT2) | fp16 | 5 | 1m03s | 4525MB |
| faster-whisper (CT2, batch_size=8) | fp16 | 5 | 17s | 6090MB |
| faster-whisper (CT2) | int8 | 5 | 59s | 2926MB |
On CPU:

| Backend | Precision | Beam size | Time | RAM Usage |
|---|---|---|---|---|
| faster-whisper (CT2) | fp32 | 5 | 2m37s | 2257MB |
| faster-whisper (CT2, batch_size=8) | fp32 | 5 | 1m06s | 4230MB |
| faster-whisper (CT2) | int8 | 5 | 1m42s | 1477MB |
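To reproduce a comparable measurement on your own hardware, a minimal timing sketch (the audio path is a placeholder; only the documented package API is used):
import time
from pingala_shunya import PingalaTranscriber

transcriber = PingalaTranscriber()  # default Shunya Labs model with ct2 backend
start = time.perf_counter()
segments = list(transcriber.transcribe_file_simple("audio.wav"))  # materialize all segments
print(f"{len(segments)} segments in {time.perf_counter() - start:.1f}s")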
Pingala Shunya delivers superior performance through the optimized CTranslate2 backend and efficient memory usage.
Supported models:
- shunyalabs/pingala-v1-en-verbatim - High-quality English transcription model by Shunya Labs, optimized for English verbatim transcription
- /path/to/local/model - Local model directory or file

from pingala_shunya import PingalaTranscriber
# Initialize with default Shunya Labs model and auto-detected backend
transcriber = PingalaTranscriber()
# Simple transcription
segments = transcriber.transcribe_file_simple("audio.wav")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
from pingala_shunya import PingalaTranscriber
# Explicitly choose backends with Shunya Labs model
transcriber_ct2 = PingalaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim", backend="ct2")
transcriber_tf = PingalaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim", backend="transformers")
# Auto-detection (recommended)
transcriber_auto = PingalaTranscriber() # Uses default Shunya Labs model with ct2
from pingala_shunya import PingalaTranscriber
# Initialize with specific backend and settings
transcriber = PingalaTranscriber(
    model_name="shunyalabs/pingala-v1-en-verbatim",
    backend="ct2",
    device="cuda",
    compute_type="float16"
)
# Advanced transcription with full metadata
segments, info = transcriber.transcribe_file(
    "audio.wav",
    beam_size=10,                     # Higher beam size for better accuracy
    word_timestamps=True,             # Enable word-level timestamps
    temperature=0.0,                  # Deterministic output
    compression_ratio_threshold=2.4,  # Filter out low-quality segments
    log_prob_threshold=-1.0,          # Filter by probability
    no_speech_threshold=0.6,          # Silence detection threshold
    initial_prompt="High quality audio recording",  # Guide the model
    hotwords="Python, machine learning, AI",        # Boost specific words
    vad_filter=True,                  # Enable voice activity detection
    task="transcribe"                 # or "translate" for translation
)
# Print transcription info
model_info = transcriber.get_model_info()
print(f"Backend: {model_info['backend']}")
print(f"Model: {model_info['model_name']}")
print(f"Language: {info.language} (confidence: {info.language_probability:.3f})")
print(f"Duration: {info.duration:.2f} seconds")
# Process segments with all metadata
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
    if segment.confidence:
        print(f"Confidence: {segment.confidence:.3f}")
    # Word-level details
    for word in segment.words:
        print(f"  '{word.word}' [{word.start:.2f}-{word.end:.2f}s] (conf: {word.probability:.3f})")
# Use Shunya Labs model with transformers backend
transcriber = PingalaTranscriber(
    model_name="shunyalabs/pingala-v1-en-verbatim",
    backend="transformers"
)
segments = transcriber.transcribe_file_simple("audio.wav")
# Auto-detection will use ct2 by default for Shunya Labs models
transcriber = PingalaTranscriber() # Uses ct2 backend (recommended)
The package includes a comprehensive CLI supporting both backends:
# Basic transcription with auto-detected backend
pingala audio.wav
# Specify backend explicitly
pingala audio.wav --backend ct2
pingala audio.wav --backend transformers
# Use Shunya Labs model with different backends
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim --backend ct2
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim --backend transformers
# Save to file
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim -o transcript.txt
# Use CPU for processing
pingala audio.wav --device cpu
# Word-level timestamps with confidence scores (ct2)
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim --word-timestamps --show-confidence --show-words
# Voice Activity Detection (ct2 only)
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim --vad --verbose
# Language detection with different backends
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim --detect-language --backend ct2
# SRT subtitles with word-level timing (a Python SRT sketch follows these examples)
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim --format srt --word-timestamps -o subtitles.srt
# Transformers backend with Shunya Labs model
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim --backend transformers --verbose
# Advanced parameters (ct2)
pingala audio.wav --model shunyalabs/pingala-v1-en-verbatim \
--beam-size 10 \
--temperature 0.2 \
--compression-ratio-threshold 2.4 \
--log-prob-threshold -1.0 \
--initial-prompt "This is a technical presentation" \
--hotwords "Python,AI,machine learning"
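For SRT output from the Python API rather than the CLI, a minimal sketch built only on the segment fields shown in this README (start, end, text); the srt_timestamp helper is illustrative, not part of the package:
from pingala_shunya import PingalaTranscriber

def srt_timestamp(seconds: float) -> str:
    # Convert seconds to the HH:MM:SS,mmm format required by SRT
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

transcriber = PingalaTranscriber()
segments = transcriber.transcribe_file_simple("audio.wav")
with open("subtitles.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n{seg.text.strip()}\n\n")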
| Option | Description | Backends | Default |
|---|---|---|---|
| --model | Model name or path | All | shunyalabs/pingala-v1-en-verbatim |
| --backend | Backend selection | All | auto-detect |
| --device | Device: cuda, cpu, auto | All | cuda |
| --compute-type | Precision: float16, float32, int8 | All | float16 |
| --beam-size | Beam size for decoding | All | 5 |
| --language | Language code (e.g., 'en') | All | auto-detect |
| --word-timestamps | Enable word-level timestamps | ct2 | False |
| --show-confidence | Show confidence scores | All | False |
| --show-words | Show word-level details | All | False |
| --vad | Enable VAD filtering | ct2 | False |
| --detect-language | Language detection only | All | False |
| --format | Output format: text, srt, vtt | All | text |
| --temperature | Sampling temperature | All | 0.0 |
| --compression-ratio-threshold | Compression ratio filter | ct2 | 2.4 |
| --log-prob-threshold | Log probability filter | ct2 | -1.0 |
| --no-speech-threshold | No speech threshold | All | 0.6 |
| --initial-prompt | Initial prompt text | All | None |
| --hotwords | Hotwords to boost | ct2 | None |
| --task | Task: transcribe, translate | All | transcribe |
| Feature | ct2 | transformers |
|---|---|---|
| Performance | Fastest | Good |
| GPU Acceleration | Optimized | Standard |
| Memory Usage | Lowest | Moderate |
| Model Support | Any model | Any HF model |
| Word Timestamps | Full support | Limited |
| VAD Filtering | Built-in | No |
| Streaming | True streaming | Batch only |
| Advanced Params | All features | Basic |
| Latest Models | Updated | Latest |
| Custom Models | CTranslate2 | Any format |
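Because the ct2 backend streams segments as they are decoded while transformers returns results in a batch, results can be consumed incrementally; a sketch assuming segments are yielded lazily (as in faster-whisper, which CTranslate2-based backends typically follow):
from pingala_shunya import PingalaTranscriber

transcriber = PingalaTranscriber(backend="ct2")
# If segments are yielded as decoding progresses, each line appears
# before the whole file has finished transcribing.
for segment in transcriber.transcribe_file_simple("audio.wav"):
    print(f"[{segment.start:.2f}s] {segment.text}", flush=True)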
Recommended configurations:
- Real-time and production use: ct2 with Shunya Labs models
- Streaming with silence filtering: ct2 with VAD
- Latest or custom Hugging Face models: transformers
# Real-time/Production: Use ct2 with Shunya Labs model
transcriber = PingalaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim", backend="ct2")
# Maximum accuracy: Use Shunya Labs model with ct2
transcriber = PingalaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim", backend="ct2")
# Alternative backend: Use transformers with Shunya Labs model
transcriber = PingalaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim", backend="transformers")
# Research/Latest models: Use transformers backend
transcriber = PingalaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim", backend="transformers")
| Use Case | Model | Backend | Hardware |
|---|---|---|---|
| Real-time | shunyalabs/pingala-v1-en-verbatim | ct2 | GPU 4GB+ |
| Production | shunyalabs/pingala-v1-en-verbatim | ct2 | GPU 6GB+ |
| Maximum Quality | shunyalabs/pingala-v1-en-verbatim | ct2 | GPU 8GB+ |
| Alternative | shunyalabs/pingala-v1-en-verbatim | transformers | GPU 4GB+ |
| CPU-only | shunyalabs/pingala-v1-en-verbatim | any | 8GB+ RAM |
# Maximum performance on GPU - FP16 precision
transcriber = PingalaTranscriber(
    model_name="shunyalabs/pingala-v1-en-verbatim",
    device="cuda",
    compute_type="float16"  # Fastest GPU performance
)
# Memory constrained GPU - INT8 quantization
transcriber = PingalaTranscriber(
    model_name="shunyalabs/pingala-v1-en-verbatim",
    device="cuda",
    compute_type="int8_float16"  # Lower memory usage
)
# Batched decoding: segments of one file are processed in parallel
segments, info = transcriber.transcribe_file(
    "audio.wav",
    batch_size=8,  # Number of segments decoded in parallel
    beam_size=5
)
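To transcribe many files, reuse a single transcriber so the model is loaded only once; the file list below is a placeholder:
from pingala_shunya import PingalaTranscriber

transcriber = PingalaTranscriber()  # model loads once, reused across files
for path in ["meeting.wav", "interview.wav"]:  # placeholder paths
    segments = transcriber.transcribe_file_simple(path)
    for segment in segments:
        print(f"{path} [{segment.start:.2f}s] {segment.text}")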
# Optimized CPU settings
transcriber = PingalaTranscriber(
    model_name="shunyalabs/pingala-v1-en-verbatim",
    device="cpu",
    compute_type="int8"  # Lower memory, faster on CPU
)
# Control CPU threads for best performance
import os
os.environ["OMP_NUM_THREADS"] = "4" # Adjust based on your CPU
Memory-saving tips:
- Use the int8_float16 compute type on GPU to reduce memory usage by ~40%
- Use the int8 compute type on CPU to reduce memory usage

When comparing against other implementations, run them with the same thread settings:
OMP_NUM_THREADS=4 python script.py
Issue: RuntimeError: No CUDA capable device found
# Check CUDA availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
# If False, install CUDA toolkit or use CPU
transcriber = PingalaTranscriber(device="cpu")
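To avoid the error entirely, the device can be chosen up front; a small sketch using torch only for the availability check:
import torch
from pingala_shunya import PingalaTranscriber

# Pick CUDA when present, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
transcriber = PingalaTranscriber(device=device)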
Issue: CUDA out of memory
# Solution 1: Use INT8 quantization
transcriber = PingalaTranscriber(compute_type="int8_float16")
# Solution 2: Reduce batch size
segments, info = transcriber.transcribe_file("audio.wav", batch_size=1)
# Solution 3: Use CPU
transcriber = PingalaTranscriber(device="cpu")
Issue: cuDNN/cuBLAS library not found
# Install CUDA libraries via pip (Linux)
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
# Or use Docker
docker run --gpus all -it nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04
Issue: ctranslate2 version compatibility
# For CUDA 12 + cuDNN 8 (default in pingala-shunya)
pip install ctranslate2==4.4.0
# For CUDA 11 + cuDNN 8
pip install ctranslate2==3.24.0
Issue: Model download fails
# Use local model path
transcriber = PingalaTranscriber(model_name="/path/to/local/model")
# Or specify the Shunya Labs model explicitly
transcriber = PingalaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim")
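Another option is to pre-fetch the model with huggingface_hub and pass the local path (this assumes the model is hosted on the Hugging Face Hub; snapshot_download is resumable across retries):
from huggingface_hub import snapshot_download
from pingala_shunya import PingalaTranscriber

# Download the model files once, then load from the local cache path
local_path = snapshot_download("shunyalabs/pingala-v1-en-verbatim")
transcriber = PingalaTranscriber(model_name=local_path)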
Issue: Alignment heads error (word timestamps)
# This is handled automatically with fallback to no word timestamps
# Word timestamps are supported with Shunya Labs models
transcriber = PingalaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim")
segments, info = transcriber.transcribe_file("audio.wav", word_timestamps=True)
See example.py for comprehensive examples:
# Run with default backend (auto-detected)
python example.py audio.wav
# Test specific backends with Shunya Labs model
python example.py audio.wav --backend ct2
python example.py audio.wav --backend transformers
# Test Shunya Labs model with different backends
python example.py audio.wav shunyalabs/pingala-v1-en-verbatim
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Visit Shunya Labs to learn more about our AI research and products. Contact us at 0@shunyalabs.ai for questions or collaboration opportunities.