Maivi - My AI Voice Input 🎤
Real-time voice-to-text transcription with hotkey support
Maivi (My AI Voice Input) is a cross-platform desktop application that turns your voice into text using state-of-the-art AI models. Simply press Alt+Q (Option+Q on macOS) to start recording, and press again to stop. Your transcription appears in real-time and is automatically copied to your clipboard.

✨ Features
- 🎤 Hotkey Recording - Toggle recording with Alt+Q (Option+Q on macOS)
- ⚡ Real-time Transcription - See text appear as you speak
- 📋 Clipboard Integration - Automatic copy to clipboard
- 🪟 Floating Overlay - Live transcription in a sleek overlay window
- 🔄 Smart Chunk Merging - Advanced overlap-based merging eliminates duplicates
- 💻 CPU-Only - No GPU required (though GPU acceleration is supported)
- 🌍 High Accuracy - Powered by NVIDIA Parakeet TDT 0.6B model (~6-9% WER)
- 🚀 Fast - ~0.36x RTF (processes 7s audio in 2.5s on CPU)
🚀 Quick Start
Installation
CPU-only (recommended; ~100 MB download instead of 2 GB+ of CUDA packages):
pip install maivi --extra-index-url https://download.pytorch.org/whl/cpu
Or with GPU support (if you have NVIDIA GPU):
pip install maivi --extra-index-url https://download.pytorch.org/whl/cu121
Standard install (may download large CUDA files):
pip install maivi
System Requirements
Linux:
sudo apt-get install portaudio19-dev python3-pyaudio
macOS:
Grant Maivi microphone, Accessibility, and Input Monitoring permissions the first time you run it (System Settings → Privacy & Security). No additional Homebrew packages are required for audio capture.
Windows:
- PortAudio is usually included with PyAudio
Usage
GUI Mode (Recommended):
maivi
Press Alt+Q (Option+Q on macOS) to start recording, press Alt+Q again to stop. The transcription will appear in a floating overlay and be copied to your clipboard.
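The toggle control flow can be sketched in plain Python (a minimal illustration, not Maivi's actual implementation; in the real app a global-hotkey library invokes the toggle and the final text goes to the system clipboard):

```python
# Minimal sketch of toggle-style recording control. In Maivi the
# Alt+Q hotkey triggers press_hotkey(), and on_stop would copy the
# final text to the clipboard; both are simplified away here.
class ToggleRecorder:
    def __init__(self, on_stop):
        self.recording = False
        self.on_stop = on_stop  # callback receiving the final transcription

    def press_hotkey(self, text_so_far=""):
        """First press starts recording; second press stops and delivers text."""
        self.recording = not self.recording
        if not self.recording:
            self.on_stop(text_so_far)

results = []
rec = ToggleRecorder(on_stop=results.append)
rec.press_hotkey()               # first press: start recording
rec.press_hotkey("hello world")  # second press: stop, deliver text
print(results)  # ['hello world']
```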
CLI Mode:
maivi-cli
maivi-cli --show-ui
maivi-cli --window 10 --slide 5 --show-ui
Controls:
- Alt+Q (Option+Q on macOS) - Start/stop recording (toggle mode)
- Esc - Exit application
📖 How It Works
Maivi uses a streaming architecture:
- Sliding Window Recording - Captures audio in overlapping 7-second chunks every 3 seconds
- Real-time Transcription - Each chunk is transcribed by the NVIDIA Parakeet model
- Smart Merging - Chunks are merged using overlap detection (4-second overlap)
- Live Updates - The UI updates in real-time as transcription progresses
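The sliding-window schedule can be sketched as the chunk boundaries it produces (illustrative only, using the default 7 s window and 3 s slide):

```python
def chunk_boundaries(total_seconds, window=7.0, slide=3.0):
    """Yield (start, end) times of overlapping chunks over a recording."""
    start = 0.0
    while start < total_seconds:
        yield (start, min(start + window, total_seconds))
        start += slide

bounds = list(chunk_boundaries(13.0))
print(bounds)
# [(0.0, 7.0), (3.0, 10.0), (6.0, 13.0), (9.0, 13.0), (12.0, 13.0)]
```

Note that consecutive full chunks overlap by window − slide = 7 − 3 = 4 seconds, which is the 4-second overlap the merger relies on.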
Why Overlapping Chunks?
Chunk 1: "hello world how are you"
Chunk 2:             "how are you doing today"
                      ^^^^^^^^^^^
                      Overlap detected → merge!
Result:  "hello world how are you doing today"
This approach ensures:
- ✅ No words cut mid-syllable
- ✅ Context preserved for better accuracy
- ✅ Seamless merging without duplicates
- ✅ Fast processing (no queue buildup)
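The merge step above can be sketched as a longest suffix/prefix word match (a simplified model of the idea, not Maivi's actual chunk-merger code):

```python
def merge_chunks(prev_words, next_words, min_overlap=1):
    """Merge two word lists by finding the longest suffix of prev_words
    that equals a prefix of next_words (the overlap region), then
    appending only the unseen tail of next_words."""
    best = 0
    for n in range(min(len(prev_words), len(next_words)), min_overlap - 1, -1):
        if prev_words[-n:] == next_words[:n]:
            best = n
            break
    return prev_words + next_words[best:]

a = "hello world how are you".split()
b = "how are you doing today".split()
print(" ".join(merge_chunks(a, b)))
# hello world how are you doing today
```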
⚙️ Configuration
Chunk Parameters
maivi-cli --window 7.0 --slide 3.0 --delay 2.0
--window: Chunk size in seconds (default: 7.0)
- Larger = better quality, slower processing
--slide: Slide interval in seconds (default: 3.0)
- Smaller = more overlap, higher CPU usage
- Rule: must be greater than window × 0.36 (the RTF) to avoid queue buildup
--delay: Processing start delay in seconds (default: 2.0)
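The queue-buildup rule is simple arithmetic: a chunk must finish transcribing (window × RTF) before the next chunk arrives (every slide seconds). A quick sanity check, assuming the ~0.36 RTF quoted above:

```python
RTF = 0.36  # approximate real-time factor on CPU (see Performance section)

def queue_safe(window, slide, rtf=RTF):
    """True if one chunk finishes transcribing before the next one arrives."""
    return window * rtf < slide

print(queue_safe(7.0, 3.0))   # True  (2.52 s < 3 s)
print(queue_safe(10.0, 3.0))  # False (3.6 s > 3 s)
```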
Advanced Options
maivi-cli --speed 1.5
maivi-cli --show-ui --ui-width 50
maivi-cli --no-pause-breaks
maivi-cli --output-file transcription.txt
📦 Building Executables
Maivi can be packaged as standalone executables for easy distribution:
pip install maivi[build]
pyinstaller --onefile --windowed \
--name maivi \
--add-data "src/maivi:maivi" \
src/maivi/__main__.py
Pre-built executables are available in Releases.
🏗️ Development
Setup Development Environment
git clone https://github.com/MaximeRivest/maivi.git
cd maivi
pip install -e .[dev]
pytest
Project Structure
maivi/
├── src/maivi/
│ ├── __init__.py
│ ├── __main__.py # GUI entry point
│ ├── core/
│ │ ├── streaming_recorder.py
│ │ ├── chunk_merger.py
│ │ └── pause_detector.py
│ ├── gui/
│ │ └── qt_gui.py
│ ├── cli/
│ │ ├── cli.py
│ │ ├── server.py
│ │ └── terminal_ui.py
│ └── utils/
├── tests/
├── docs/
├── pyproject.toml
├── README.md
└── LICENSE
🐛 Troubleshooting
"No overlap found" warnings
This is expected behavior when there are long pauses (5+ seconds of silence). The system adds "..." gap markers to indicate the pause.
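The gap-marker behavior can be modeled in a few lines (a simplified sketch, not the actual merger code):

```python
def join_with_gap(prev_words, next_words, overlap_found):
    """When no overlap is detected between chunks (e.g. after a long
    pause), join them with a "..." marker instead of merging."""
    if overlap_found:
        return prev_words + next_words  # real code would drop the overlap here
    return prev_words + ["..."] + next_words

print(" ".join(join_with_gap(["see", "you"], ["next", "topic"], overlap_found=False)))
# see you ... next topic
```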
Queue buildup (transcription continues after stopping)
Check that the per-chunk processing time stays below the slide interval:
- Processing time: window_seconds × 0.36 (RTF)
- Must be less than: slide_seconds
- Default: 7 × 0.36 = 2.52 s < 3 s ✅
Model download issues
The first run downloads the NVIDIA Parakeet model (~600MB) from HuggingFace. If download fails:
- Check internet connection
- Verify HuggingFace is accessible
- Clear the cache: rm -rf ~/.cache/huggingface/
Qt/GUI crashes
If the GUI crashes on Linux, first verify that Qt imports correctly:
python -c "from PySide6 import QtWidgets; print('Qt OK')"
If Qt is the problem, fall back to the terminal UI:
maivi-cli --show-ui
📊 Performance
Memory:
- Model: ~2GB RAM
- Audio buffer: ~1MB
- Total: ~2.5GB RAM
CPU:
- Idle: <5% CPU
- Recording: 30-40% of 1 core
- Transcription: 100% of 1 core (during processing)
Latency:
- First transcription: 2s (start delay)
- Updates: Every 3s (slide interval)
- Completion: 1-3s after recording stops
Accuracy:
- Model WER: ~5-8%
- Overlap merging: <1% word loss
- Total effective WER: ~6-9%
🗺️ Roadmap
v0.2 - Platform Support:
v0.3 - Features:
v0.4 - Optimization:
📄 License
MIT License - see LICENSE file for details.
🙏 Acknowledgments
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
💬 Support
Made with ❤️ by Maxime Rivest