pyannote-cpp-node
Node.js native bindings for whisper.cpp transcription/VAD plus the pyannote speaker diarization pipeline.
pyannote-cpp-node is now the single package for both:
- the whisper-cpp-node API: WhisperContext, VadContext, transcribe, transcribeAsync, getGpuDevices
- the pipeline API: Pipeline, PipelineSession, processAgc
Platform support:
- darwin-arm64: full pipeline (CoreML + Metal acceleration)
- win32-x64: full pipeline (Vulkan GPU + optional OpenVINO acceleration)
- darwin-x64, win32-ia32, Linux: pipeline not available
On both supported pipeline platforms, getCapabilities().pipeline is true.
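As a rough sketch, the platform matrix above can be expressed as a check on platform/arch pairs. The helper below is hypothetical (not part of the package); getCapabilities().pipeline remains the authoritative runtime check.

```typescript
// Hypothetical helper mirroring the platform matrix above.
// getCapabilities().pipeline is the authoritative runtime check.
function pipelineSupported(platform: string, arch: string): boolean {
  return (
    (platform === 'darwin' && arch === 'arm64') || // CoreML + Metal
    (platform === 'win32' && arch === 'x64') // Vulkan + optional OpenVINO
  );
}

// Usage: pipelineSupported(process.platform, process.arch)
```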
The integrated pipeline combines Whisper transcription and optional speaker diarization into a single API (transcriptionOnly: true skips diarization).
Given 16 kHz mono PCM audio (Float32Array), it produces transcript segments shaped as below. In streaming mode, diarization emits cumulative segments events, while transcriptionOnly: true emits incremental segments events. finalize() returns all segments in both modes.
The speaker field is a global speaker label (SPEAKER_00, SPEAKER_01, ...), "UNKNOWN" when diarization could not assign a speaker, or an empty string ("") when transcriptionOnly is true.
The API supports three modes: offline batch processing (transcribeOffline), one-shot streaming (transcribe), and incremental streaming (createSession + push/finalize). All three modes support transcription-only operation via transcriptionOnly: true. All heavy operations are asynchronous and run on libuv worker threads.
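For orientation, here is a minimal formatting helper (hypothetical, not part of the package) that covers all three speaker states described above:

```typescript
// Segment shape as documented: speaker label, start (seconds),
// duration (seconds), and transcribed text.
interface AlignedSegment {
  speaker: string;
  start: number;
  duration: number;
  text: string;
}

// Render one segment; the speaker prefix is omitted when transcriptionOnly
// left `speaker` as an empty string.
function formatSegment(seg: AlignedSegment): string {
  const end = seg.start + seg.duration;
  const label = seg.speaker ? `[${seg.speaker}] ` : '';
  return `${label}${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`;
}
```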
This package carries over:
- whisper-cpp-node usage
- VadContext
- getGpuDevices()
It adds:
- segments events and audio chunk streaming
- Pipeline.load(), reused across offline/streaming/session modes
- onProgress callback for transcribeOffline reports Whisper, diarization, and alignment phases
- onSegment callback for transcribeOffline delivers each Whisper segment (start, end, text) as it's produced — enables live transcript preview and time-based loading bars
- processAgc() for offline audio normalization without the full pipeline
Model paths:
- segmentation model (segModelPath) — required on all platforms
- embedding model (embModelPath) — required unless transcriptionOnly is true
- PLDA model (pldaPath) — required unless transcriptionOnly is true
- Whisper model (whisperModelPath) — required on all platforms
- Silero VAD model (vadModelPath) — optional
- execution backend (backend) with one of: metal, vulkan, coreml, or openvino-hybrid
Backend asset requirements:
- backend: { type: 'coreml', segPath, embPath, whisperEncoderPath } uses CoreML .mlpackage / .mlmodelc assets on macOS
- backend: { type: 'openvino-hybrid', whisperEncoderPath, embPath } uses OpenVINO IR .xml assets on Windows
- backend: { type: 'metal', segPath, embPath } on macOS requires segmentation and embedding model paths
- backend: { type: 'vulkan' } on Windows does not need extra accelerator paths
Install:
npm install pyannote-cpp-node
pnpm add pyannote-cpp-node
The package installs a platform-specific native addon through optionalDependencies.
import {
WhisperContext,
createVadContext,
getCapabilities,
transcribeAsync,
} from 'pyannote-cpp-node';
const capabilities = getCapabilities();
console.log(capabilities);
const ctx = new WhisperContext({
model: './models/ggml-base.en.bin',
use_gpu: true,
no_prints: true,
});
const result = await transcribeAsync(ctx, {
fname_inp: './audio.wav',
language: 'en',
});
const vad = createVadContext({
model: './models/ggml-silero-v6.2.0.bin',
});
console.log(result.segments);
console.log(vad.getWindowSamples());
vad.free();
ctx.free();
import { Pipeline } from 'pyannote-cpp-node';
const pipeline = await Pipeline.load({
segModelPath: './models/segmentation.gguf',
embModelPath: './models/embedding.gguf',
pldaPath: './models/plda.gguf',
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
backend: {
type: 'coreml',
segPath: './models/segmentation.mlpackage',
embPath: './models/embedding.mlpackage',
whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
},
language: 'en',
});
const audio = loadAudioAsFloat32Array('./audio-16khz-mono.wav');
const result = await pipeline.transcribeOffline(audio);
for (const segment of result.segments) {
const end = segment.start + segment.duration;
console.log(
`[${segment.speaker}] ${segment.start.toFixed(2)}-${end.toFixed(2)} ${segment.text.trim()}`
);
}
pipeline.close();
import { Pipeline } from 'pyannote-cpp-node';
const pipeline = await Pipeline.load({
segModelPath: './models/segmentation.gguf',
embModelPath: './models/embedding.gguf',
pldaPath: './models/plda.gguf',
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
language: 'en',
backend: { type: 'vulkan' },
});
const audio = loadAudioAsFloat32Array('./audio-16khz-mono.wav');
const result = await pipeline.transcribeOffline(audio);
for (const segment of result.segments) {
const end = segment.start + segment.duration;
console.log(
`[${segment.speaker}] ${segment.start.toFixed(2)}-${end.toFixed(2)} ${segment.text.trim()}`
);
}
pipeline.close();
To use the Windows OpenVINO hybrid path instead, pass the OpenVINO assets through backend:
const pipeline = await Pipeline.load({
segModelPath: './models/segmentation.gguf',
embModelPath: './models/embedding.gguf',
pldaPath: './models/plda.gguf',
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
backend: {
type: 'openvino-hybrid',
whisperEncoderPath: './models/ggml-large-v3-turbo-encoder-openvino.xml',
embPath: './models/embedding-openvino.xml',
},
});
const macPipeline = await Pipeline.load({
segModelPath: './models/segmentation.gguf',
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
language: 'en',
transcriptionOnly: true,
backend: {
type: 'coreml',
segPath: './models/segmentation.mlpackage',
embPath: './models/embedding.mlpackage',
whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
},
});
const windowsPipeline = await Pipeline.load({
segModelPath: './models/segmentation.gguf',
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
language: 'en',
transcriptionOnly: true,
backend: { type: 'vulkan' },
});
const result = await macPipeline.transcribe(audio);
for (const segment of result.segments) {
const end = segment.start + segment.duration;
// No speaker label - segment.speaker is empty string
console.log(`${segment.start.toFixed(2)}-${end.toFixed(2)} ${segment.text.trim()}`);
}
macPipeline.close();
windowsPipeline.close();
Pipeline
class Pipeline {
static async load(config: PipelineConfig): Promise<Pipeline>;
async transcribeOffline(audio: Float32Array, onProgress?: (phase: number, progress: number) => void, onSegment?: (start: number, end: number, text: string) => void): Promise<TranscriptionResult>;
async transcribe(audio: Float32Array): Promise<TranscriptionResult>;
setLanguage(language: string): void;
setDecodeOptions(options: DecodeOptions): void;
createSession(): PipelineSession;
async setExecutionBackend(backend: BackendConfig): Promise<void>;
cancel(): void;
close(): void;
get isClosed(): boolean;
}
static async load(config: PipelineConfig): Promise<Pipeline>
Validates model paths and loads all models into a shared cache on a background thread. The accelerator assets are selected by config.backend, which is required and has no default. On macOS, backend: { type: 'coreml', ... } loads CoreML segmentation, embedding, and Whisper encoder assets, while backend: { type: 'metal', ... } uses Metal with segmentation and embedding model paths. On Windows x64, backend: { type: 'vulkan' } loads the Vulkan path, and backend: { type: 'openvino-hybrid', ... } also loads OpenVINO IR models for the Whisper encoder and embedding model. Models are loaded once and reused across all subsequent transcribe(), transcribeOffline(), and createSession() calls. Models are freed only when close() is called.
- backend is required in every Pipeline.load() call
- coreml requires segPath, embPath, and whisperEncoderPath
- openvino-hybrid requires whisperEncoderPath and embPath
- metal requires segPath and embPath
- vulkan does not require extra accelerator model paths
async transcribeOffline(audio: Float32Array, onProgress?, onSegment?): Promise<TranscriptionResult>
Runs Whisper on the entire audio buffer in a single whisper_full() call, then runs offline diarization and WhisperX-style speaker alignment. In transcription-only mode, diarization and speaker alignment are skipped, and segments have an empty speaker field. This is the fastest mode for batch processing — no streaming infrastructure is involved.
If cancellation is requested with pipeline.cancel() or pipeline.close() while this call is in flight, the promise rejects with Error("Operation cancelled").
The optional onProgress callback receives (phase, progress) updates:
| Phase | Value | Meaning |
|---|---|---|
| 0 | 0–100 | Whisper transcription progress (percentage) |
| 1 | 0 | Diarization started |
| 2 | 0 | Speaker alignment started |
const result = await pipeline.transcribeOffline(audio, (phase, progress) => {
if (phase === 0) console.log(`Transcribing: ${progress}%`);
if (phase === 1) console.log('Running diarization...');
if (phase === 2) console.log('Aligning speakers...');
});
The optional onSegment callback receives (start, end, text) for each Whisper segment as it's produced during transcription. Times are in seconds. This enables live transcript preview before diarization and alignment complete.
const result = await pipeline.transcribeOffline(audio, undefined, (start, end, text) => {
console.log(`[${start.toFixed(2)}-${end.toFixed(2)}] ${text}`);
});
Both callbacks can be used simultaneously:
const result = await pipeline.transcribeOffline(
audio,
(phase, progress) => {
if (phase === 0) updateProgressBar(progress);
},
(start, end, text) => {
appendToTranscriptPreview(start, end, text);
},
);
async transcribe(audio: Float32Array): Promise<TranscriptionResult>
Runs one-shot transcription (+ diarization unless transcriptionOnly is set) using the streaming pipeline internally (pushes 1-second chunks then finalizes).
If cancellation is requested with pipeline.cancel() or pipeline.close() while this call is in flight, the promise rejects with Error("Operation cancelled").
setLanguage(language: string): void
Updates the Whisper decode language for subsequent transcribe() calls. This is a convenience shorthand for setDecodeOptions({ language }).
setDecodeOptions(options: DecodeOptions): void
Updates one or more Whisper decode options for subsequent transcribe() calls. Only the fields you pass are changed; others retain their current values. See DecodeOptions for available fields.
async setExecutionBackend(backend: BackendConfig): Promise<void>
Switches the inference backend at runtime. Tears down and reloads the entire model cache with the new backend configuration. The promise resolves when the new models are ready.
Supported backends:
- macOS: metal and coreml
- Windows: vulkan and openvino-hybrid
Pass one of these BackendConfig variants:
type BackendConfig =
| { type: 'metal'; gpuDevice?: number; flashAttn?: boolean; segPath: string; embPath: string }
| { type: 'vulkan'; gpuDevice?: number; flashAttn?: boolean }
| {
type: 'coreml';
gpuDevice?: number;
flashAttn?: boolean;
segPath: string;
embPath: string;
whisperEncoderPath: string;
}
| {
type: 'openvino-hybrid';
gpuDevice?: number;
flashAttn?: boolean;
whisperEncoderPath: string;
embPath: string;
openvinoDevice?: string;
openvinoCacheDir?: string;
};
Warning: This is a heavy operation (~5-6s on Intel iGPU). It fully tears down and rebuilds the model cache. Treat it as a one-time configuration change, not something to call in a loop. See Warnings and Known Issues for Intel iGPU limitations.
// macOS: switch to Metal
await pipeline.setExecutionBackend({
type: 'metal',
segPath: './models/segmentation.mlpackage',
embPath: './models/embedding.mlpackage',
});
// macOS: switch to CoreML
await pipeline.setExecutionBackend({
type: 'coreml',
segPath: './models/segmentation.mlpackage',
embPath: './models/embedding.mlpackage',
whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
});
// Windows: switch to OpenVINO hybrid
await pipeline.setExecutionBackend({
type: 'openvino-hybrid',
whisperEncoderPath: './models/ggml-large-v3-turbo-encoder-openvino.xml',
embPath: './models/embedding-openvino.xml',
});
// Windows: switch back to Vulkan
await pipeline.setExecutionBackend({ type: 'vulkan' });
createSession(): PipelineSession
Creates an independent streaming session for incremental processing. This method takes no arguments; native segment/audio callbacks are wired internally.
cancel(): void
Requests cooperative cancellation of the active transcribe() or transcribeOffline() call, if any. Native work stops at the next supported cancellation point, and the in-flight promise rejects with Error("Operation cancelled").
close(): void
Requests cancellation of any active model-level operation, then releases native resources. Safe to call multiple times.
get isClosed(): boolean
Returns true after close().
PipelineSession (extends EventEmitter)
class PipelineSession extends EventEmitter {
async push(audio: Float32Array): Promise<PushResult>;
cancel(): void;
setLanguage(language: string): void;
setDecodeOptions(options: DecodeOptions): void;
async finalize(): Promise<TranscriptionResult>;
close(): Promise<void>;
get isClosed(): boolean;
on<K extends keyof PipelineSessionEvents>(
event: K,
listener: (...args: PipelineSessionEvents[K]) => void
): this;
}
interface PipelineSessionEvents {
segments: [segments: AlignedSegment[]];
audio: [audio: Float32Array];
error: [error: Error];
}
async push(audio: Float32Array): Promise<PushResult>
Pushes an arbitrary number of samples into the streaming pipeline.
interface PushResult {
vad: boolean[]; // per-window VAD (true = speech, false = silence)
rms: number[]; // per-window RMS of input audio, aligned 1:1 with vad
}
- vad contains per-frame VAD booleans (one per 512-sample window)
- rms contains the RMS energy of each corresponding window (post-AGC when agcEnabled is true, raw input otherwise)
If the session is cancelled before or during the call, the promise rejects with Error("Operation cancelled").
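A small sketch (hypothetical helper, not part of the package) showing how the vad and rms arrays from one push() might be summarized for a level meter or speech indicator:

```typescript
// Summarize one PushResult ({ vad, rms } as documented above):
// count the speech windows and average the per-window RMS energy.
function summarizePush(vad: boolean[], rms: number[]) {
  const speechWindows = vad.filter(Boolean).length;
  const avgRms = rms.length
    ? rms.reduce((a, b) => a + b, 0) / rms.length
    : 0;
  return { windows: vad.length, speechWindows, avgRms };
}
```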
setLanguage(language: string): void
Updates the Whisper decode language on the live streaming session. Takes effect on the next Whisper decode run. Thread-safe — the change is pushed to the C++ pipeline immediately.
setDecodeOptions(options: DecodeOptions): void
Updates one or more Whisper decode options on the live streaming session. Takes effect on the next Whisper decode run. Thread-safe — changes are pushed to the C++ pipeline immediately. Only the fields you pass are changed; others retain their current values.
async finalize(): Promise<TranscriptionResult>
Flushes all stages, runs final recluster + alignment, and returns the definitive result. finalize() always returns all accumulated segments regardless of mode. In diarization mode this is the final re-aligned output, and in transcription-only mode this is the union of all incremental segments emissions.
If the session is cancelled before or during the call, the promise rejects with Error("Operation cancelled").
type TranscriptionResult = {
segments: AlignedSegment[];
/** Silence-filtered audio when VAD model is loaded. Timestamps align to this audio. */
filteredAudio?: Float32Array;
};
cancel(): void
Requests cooperative cancellation of the active streaming operation, if any. After calling cancel(), the current push() or finalize() promise rejects with Error("Operation cancelled").
close(): Promise<void>
Requests cancellation of any active push() / finalize() work, then releases native session resources. Safe to call multiple times.
get isClosed(): boolean
Returns true after close().
'segments'
Emitted after each Whisper transcription result. Behavior depends on mode:
- With diarization (default): each emission is the cumulative, re-aligned transcript so far; earlier segments may be relabeled, and finalize() is the definitive output.
- transcriptionOnly: true: each emission contains only the new segments from the latest Whisper result. Earlier segments never change, so incremental delivery is safe. Accumulate across emissions to build the full transcript.
// With diarization (default): cumulative, re-aligned output
session.on('segments', (segments: AlignedSegment[]) => {
// `segments` contains the latest full speaker-labeled transcript so far
const latest = segments[segments.length - 1];
if (latest) {
const end = latest.start + latest.duration;
console.log(`[${latest.speaker}] ${latest.start.toFixed(2)}-${end.toFixed(2)} ${latest.text.trim()}`);
}
});
// With transcriptionOnly: incremental output, accumulate manually
const allSegments: AlignedSegment[] = [];
session.on('segments', (newSegments: AlignedSegment[]) => {
allSegments.push(...newSegments);
for (const seg of newSegments) {
const end = seg.start + seg.duration;
console.log(`${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
}
});
'audio'
Emitted in real-time with silence-filtered PCM chunks (Float32Array) as the pipeline processes audio.
session.on('audio', (chunk: Float32Array) => {
// `chunk` is silence-filtered audio emitted for streaming consumers
});
processAgc
function processAgc(audio: Float32Array, sampleRate?: number): Float32Array;
Standalone WebRTC AGC2 audio normalization. Processes the entire audio buffer through the adaptive gain controller and returns the gain-normalized result.
- audio: 16 kHz mono PCM samples in [-1.0, 1.0]
- sampleRate: sample rate in Hz (default: 16000)
- Returns: Float32Array with normalized audio
This is a synchronous, stateless operation — a new AGC instance is created and destroyed per call. For streaming use, AGC is integrated into the pipeline via agcEnabled: true.
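To verify the effect of normalization, an RMS helper (not part of the package, just a sketch) can be used to compare a buffer before and after processAgc:

```typescript
// RMS energy of a mono PCM buffer in [-1.0, 1.0]; compare the value
// before and after AGC normalization to confirm the gain change.
function rmsEnergy(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}
```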
import { processAgc } from 'pyannote-cpp-node';
const normalized = processAgc(quietAudio);
// normalized has higher RMS energy than quietAudio
export interface PipelineConfig {
// === Required Model Paths ===
/** Path to segmentation GGUF model */
segModelPath: string;
/** Path to Whisper GGUF model */
whisperModelPath: string;
/** Path to embedding GGUF model (required unless transcriptionOnly is true) */
embModelPath?: string;
/** Path to PLDA GGUF model (required unless transcriptionOnly is true) */
pldaPath?: string;
// === Optional Model Paths ===
/** Path to Silero VAD model (optional, enables silence compression) */
vadModelPath?: string;
/**
* Transcription-only mode - skip speaker diarization (default: false).
* When true, embModelPath, pldaPath, and backend embedding assets are not required.
*/
transcriptionOnly?: boolean;
/**
* Enable WebRTC AGC2 audio normalization at the pipeline ingress (default: false).
* When true, audio is gain-normalized before silence filtering and downstream processing.
*/
agcEnabled?: boolean;
/** Required execution backend configuration */
backend: BackendConfig;
/** Suppress whisper.cpp log output (default: false) */
noPrints?: boolean;
// === Whisper Decode Options ===
/** Number of threads for Whisper inference (default: 4) */
nThreads?: number;
/** Language code (e.g., 'en', 'zh'). Omit for auto-detect. (default: 'en') */
language?: string;
/** Translate non-English speech to English (default: false) */
translate?: boolean;
/** Auto-detect spoken language. Overrides 'language' when true. (default: false) */
detectLanguage?: boolean;
// === Sampling ===
/** Sampling temperature. 0.0 = greedy deterministic. (default: 0.0) */
temperature?: number;
/** Temperature increment for fallback retries (default: 0.2) */
temperatureInc?: number;
/** Disable temperature fallback. If true, temperatureInc is ignored. (default: false) */
noFallback?: boolean;
/** Beam search size. -1 uses greedy decoding. >1 enables beam search. (default: -1) */
beamSize?: number;
/** Best-of-N sampling candidates for greedy decoding (default: 5) */
bestOf?: number;
// === Thresholds ===
/** Entropy threshold for decoder fallback (default: 2.4) */
entropyThold?: number;
/** Log probability threshold for decoder fallback (default: -1.0) */
logprobThold?: number;
/** No-speech probability threshold (default: 0.6) */
noSpeechThold?: number;
// === Context ===
/** Initial prompt text to condition the decoder (default: none) */
prompt?: string;
/** Don't use previous segment as context for next segment (default: true) */
noContext?: boolean;
/** Suppress blank outputs at the beginning of segments (default: true) */
suppressBlank?: boolean;
/** Suppress non-speech tokens (default: false) */
suppressNst?: boolean;
}
export type BackendConfig =
| {
/** Metal backend on macOS */
type: 'metal';
/** GPU device index */
gpuDevice?: number;
/** Enable Flash Attention */
flashAttn?: boolean;
/** Path to segmentation model directory */
segPath: string;
/** Path to embedding model directory */
embPath: string;
}
| {
/** Vulkan backend on Windows */
type: 'vulkan';
/** GPU device index */
gpuDevice?: number;
/** Enable Flash Attention */
flashAttn?: boolean;
}
| {
/** CoreML backend on macOS */
type: 'coreml';
/** GPU device index */
gpuDevice?: number;
/** Enable Flash Attention */
flashAttn?: boolean;
/** Path to segmentation CoreML .mlpackage directory */
segPath: string;
/** Path to embedding CoreML .mlpackage directory */
embPath: string;
/** Path to Whisper encoder CoreML .mlmodelc directory */
whisperEncoderPath: string;
}
| {
/** OpenVINO hybrid backend on Windows */
type: 'openvino-hybrid';
/** GPU device index */
gpuDevice?: number;
/** Enable Flash Attention */
flashAttn?: boolean;
/** Path to Whisper encoder OpenVINO IR (.xml) */
whisperEncoderPath: string;
/** Path to embedding OpenVINO IR (.xml) */
embPath: string;
/** OpenVINO device target (default: 'GPU') */
openvinoDevice?: string;
/** OpenVINO model cache directory */
openvinoCacheDir?: string;
};
export interface DecodeOptions {
/** Language code (e.g., 'en', 'zh'). Omit for auto-detect. */
language?: string;
/** Translate non-English speech to English */
translate?: boolean;
/** Auto-detect spoken language. Overrides 'language' when true. */
detectLanguage?: boolean;
/** Number of threads for Whisper inference */
nThreads?: number;
/** Sampling temperature. 0.0 = greedy deterministic. */
temperature?: number;
/** Temperature increment for fallback retries */
temperatureInc?: number;
/** Disable temperature fallback. If true, temperatureInc is ignored. */
noFallback?: boolean;
/** Beam search size. -1 uses greedy decoding. >1 enables beam search. */
beamSize?: number;
/** Best-of-N sampling candidates for greedy decoding */
bestOf?: number;
/** Entropy threshold for decoder fallback */
entropyThold?: number;
/** Log probability threshold for decoder fallback */
logprobThold?: number;
/** No-speech probability threshold */
noSpeechThold?: number;
/** Initial prompt text to condition the decoder */
prompt?: string;
/** Don't use previous segment as context for next segment */
noContext?: boolean;
/** Suppress blank outputs at the beginning of segments */
suppressBlank?: boolean;
/** Suppress non-speech tokens */
suppressNst?: boolean;
}
export interface AlignedSegment {
/** Global speaker label (e.g., SPEAKER_00). "UNKNOWN" when diarization could not assign a speaker. Empty string when transcriptionOnly is true. */
speaker: string;
/** Segment start time in seconds. */
start: number;
/** Segment duration in seconds. */
duration: number;
/** Transcribed text for this segment. */
text: string;
}
export interface TranscriptionResult {
/** Transcript segments. Speaker-labeled when diarization is enabled; speaker is empty string in transcription-only mode. */
segments: AlignedSegment[];
/**
* Silence-filtered audio (16 kHz mono Float32Array).
* Present when a VAD model is loaded (`vadModelPath` in config).
* Silence longer than 2 seconds is compressed to 2 seconds.
* All segment timestamps are aligned to this audio —
* save it directly and timestamps will sync correctly.
*/
filteredAudio?: Float32Array;
}
import { Pipeline } from 'pyannote-cpp-node';
async function runOffline(audio: Float32Array) {
const pipeline = await Pipeline.load({
segModelPath: './models/segmentation.gguf',
embModelPath: './models/embedding.gguf',
pldaPath: './models/plda.gguf',
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
backend: {
type: 'coreml',
segPath: './models/segmentation.mlpackage',
embPath: './models/embedding.mlpackage',
whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
},
});
// Runs Whisper on full audio at once + offline diarization
const result = await pipeline.transcribeOffline(audio);
for (const seg of result.segments) {
const end = seg.start + seg.duration;
console.log(`[${seg.speaker}] ${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
}
pipeline.close();
}
When a VAD model is provided, transcribeOffline automatically compresses silence longer than 2 seconds down to 2 seconds before running Whisper and diarization. The filtered audio is returned alongside segments so you can save it with correctly aligned timestamps.
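Because timestamps align to the filtered audio, a segment can be mapped directly to sample indices in filteredAudio. The helper below is a hypothetical sketch assuming the documented 16 kHz sample rate:

```typescript
// Convert a segment's start/duration (seconds) into sample indices within
// the 16 kHz filteredAudio buffer, e.g. to extract that segment's audio.
function segmentSampleRange(
  start: number,
  duration: number,
  sampleRate = 16000,
): { begin: number; end: number } {
  const begin = Math.round(start * sampleRate);
  const end = Math.round((start + duration) * sampleRate);
  return { begin, end };
}
```

With a real result, something like result.filteredAudio?.subarray(range.begin, range.end) would then hold exactly that segment's samples.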
import { Pipeline } from 'pyannote-cpp-node';
import { writeFileSync } from 'node:fs';
async function runOfflineWithVAD(audio: Float32Array) {
const pipeline = await Pipeline.load({
segModelPath: './models/segmentation.gguf',
embModelPath: './models/embedding.gguf',
pldaPath: './models/plda.gguf',
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
backend: {
type: 'coreml',
segPath: './models/segmentation.mlpackage',
embPath: './models/embedding.mlpackage',
whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
},
vadModelPath: './models/ggml-silero-v6.2.0.bin', // enables silence filtering
});
const result = await pipeline.transcribeOffline(audio);
// Save the silence-filtered audio — timestamps in result.segments align to this
if (result.filteredAudio) {
// filteredAudio is 16 kHz mono Float32Array with silence compressed
writeFileSync('./output-filtered.pcm', Buffer.from(result.filteredAudio.buffer));
console.log(`Filtered: ${audio.length} -> ${result.filteredAudio.length} samples`);
}
for (const seg of result.segments) {
const end = seg.start + seg.duration;
console.log(`[${seg.speaker}] ${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
}
pipeline.close();
}
import { Pipeline } from 'pyannote-cpp-node';
async function runOfflineWithCallbacks(audio: Float32Array) {
const pipeline = await Pipeline.load({
segModelPath: './models/segmentation.gguf',
embModelPath: './models/embedding.gguf',
pldaPath: './models/plda.gguf',
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
backend: {
type: 'coreml',
segPath: './models/segmentation.mlpackage',
embPath: './models/embedding.mlpackage',
whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
},
});
const result = await pipeline.transcribeOffline(
audio,
// Progress callback — phase 0 is Whisper (0-100%), phase 1 is diarization, phase 2 is alignment
(phase, progress) => {
if (phase === 0) updateProgressBar(progress);
if (phase === 1) showStatus('Identifying speakers...');
if (phase === 2) showStatus('Aligning speakers to transcript...');
},
// Segment callback — each Whisper segment as it's produced (before diarization)
(start, end, text) => {
appendToLivePreview(`[${start.toFixed(2)}-${end.toFixed(2)}] ${text}`);
},
);
console.log(`Done: ${result.segments.length} speaker-labeled segments`);
pipeline.close();
}
import { Pipeline } from 'pyannote-cpp-node';
async function runOneShot(audio: Float32Array) {
const pipeline = await Pipeline.load({
segModelPath: './models/segmentation.gguf',
embModelPath: './models/embedding.gguf',
pldaPath: './models/plda.gguf',
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
backend: {
type: 'coreml',
segPath: './models/segmentation.mlpackage',
embPath: './models/embedding.mlpackage',
whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
},
});
// Uses streaming pipeline internally (push 1s chunks + finalize)
const result = await pipeline.transcribe(audio);
for (const seg of result.segments) {
const end = seg.start + seg.duration;
console.log(`[${seg.speaker}] ${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
}
pipeline.close();
}
import { Pipeline } from 'pyannote-cpp-node';
async function runStreaming(audio: Float32Array) {
const pipeline = await Pipeline.load({
segModelPath: './models/segmentation.gguf',
embModelPath: './models/embedding.gguf',
pldaPath: './models/plda.gguf',
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
backend: {
type: 'coreml',
segPath: './models/segmentation.mlpackage',
embPath: './models/embedding.mlpackage',
whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
},
});
const session = pipeline.createSession();
// Diarization mode (default): each event is cumulative and may relabel earlier segments
session.on('segments', (segments) => {
const latest = segments[segments.length - 1];
if (latest) {
const end = latest.start + latest.duration;
console.log(`[live][${latest.speaker}] ${latest.start.toFixed(2)}-${end.toFixed(2)} ${latest.text.trim()}`);
}
});
session.on('audio', (chunk) => {
console.log(`silence-filtered audio chunk: ${chunk.length} samples`);
});
const chunkSize = 16000;
for (let i = 0; i < audio.length; i += chunkSize) {
const chunk = audio.slice(i, Math.min(i + chunkSize, audio.length));
const { vad, rms } = await session.push(chunk);
if (vad.length > 0) {
const speechFrames = vad.filter(Boolean).length;
const avgRms = rms.reduce((a, b) => a + b, 0) / rms.length;
console.log(`VAD: ${vad.length} frames (${speechFrames} speech), avg RMS: ${avgRms.toFixed(4)}`);
}
}
const finalResult = await session.finalize();
console.log(`Final segments: ${finalResult.segments.length}`);
await session.close();
pipeline.close();
}
import { Pipeline, type AlignedSegment } from 'pyannote-cpp-node';
async function runStreamingTranscriptionOnly(audio: Float32Array) {
const pipeline = await Pipeline.load({
segModelPath: './models/segmentation.gguf',
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
transcriptionOnly: true,
backend: {
type: 'coreml',
segPath: './models/segmentation.mlpackage',
embPath: './models/embedding.mlpackage',
whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
},
});
const session = pipeline.createSession();
// Transcription-only: each event has only NEW segments
const allSegments: AlignedSegment[] = [];
session.on('segments', (newSegments) => {
allSegments.push(...newSegments);
for (const seg of newSegments) {
const end = seg.start + seg.duration;
console.log(`${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
}
});
const chunkSize = 16000;
for (let i = 0; i < audio.length; i += chunkSize) {
const chunk = audio.slice(i, Math.min(i + chunkSize, audio.length));
await session.push(chunk);
}
const finalResult = await session.finalize();
console.log(`Final segments from finalize(): ${finalResult.segments.length}`);
console.log(`Accumulated from incremental events: ${allSegments.length}`);
await session.close();
pipeline.close();
}
Use cancel() to stop in-flight native work before tearing down resources. This is useful for long offline or streaming runs that should exit promptly on Ctrl+C.
const pipeline = await Pipeline.load(config);
const session = pipeline.createSession();
process.once('SIGINT', async () => {
session.cancel();
pipeline.cancel();
await session.close().catch(() => {});
pipeline.close();
process.exit(130);
});
import { Pipeline } from 'pyannote-cpp-node';
const pipeline = await Pipeline.load({
segModelPath: './models/segmentation.gguf',
embModelPath: './models/embedding.gguf',
pldaPath: './models/plda.gguf',
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
backend: {
type: 'coreml',
segPath: './models/segmentation.mlpackage',
embPath: './models/embedding.mlpackage',
whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
flashAttn: true,
gpuDevice: 0,
},
// Decode strategy
nThreads: 8,
language: 'ko',
translate: false,
detectLanguage: false,
temperature: 0.0,
temperatureInc: 0.2,
noFallback: false,
beamSize: 5,
bestOf: 5,
// Thresholds and context
entropyThold: 2.4,
logprobThold: -1.0,
noSpeechThold: 0.6,
prompt: 'Meeting transcript with technical terminology.',
noContext: true,
suppressBlank: true,
suppressNst: false,
});
import { Pipeline } from 'pyannote-cpp-node';
const pipeline = await Pipeline.load({
segModelPath: './models/segmentation.gguf',
embModelPath: './models/embedding.gguf',
pldaPath: './models/plda.gguf',
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
backend: {
type: 'coreml',
segPath: './models/segmentation.mlpackage',
embPath: './models/embedding.mlpackage',
whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
},
language: 'en',
});
// First transcription in English
const result1 = await pipeline.transcribe(englishAudio);
// Switch to Korean for the next transcription
pipeline.setLanguage('ko');
const result2 = await pipeline.transcribe(koreanAudio);
// Or update multiple decode options at once
pipeline.setDecodeOptions({
language: 'zh',
temperature: 0.2,
beamSize: 5,
});
const result3 = await pipeline.transcribe(chineseAudio);
pipeline.close();
import { Pipeline } from 'pyannote-cpp-node';
// Start with Metal
const pipeline = await Pipeline.load({
segModelPath: './models/segmentation.gguf',
embModelPath: './models/embedding.gguf',
pldaPath: './models/plda.gguf',
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
backend: {
type: 'metal',
segPath: './models/segmentation.mlpackage',
embPath: './models/embedding.mlpackage',
},
});
// Switch to CoreML
await pipeline.setExecutionBackend({
type: 'coreml',
segPath: './models/segmentation.mlpackage',
embPath: './models/embedding.mlpackage',
whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
});
const result1 = await pipeline.transcribeOffline(audio);
// Switch back to Metal
await pipeline.setExecutionBackend({
type: 'metal',
segPath: './models/segmentation.mlpackage',
embPath: './models/embedding.mlpackage',
});
const result2 = await pipeline.transcribeOffline(audio);
pipeline.close();
setExecutionBackend(backend) switches the inference backend at runtime.
- macOS: metal and coreml
- Windows: vulkan and openvino-hybrid
- openvino-hybrid uses OpenVINO for the Whisper encoder and embedding model, and Vulkan for everything else
- gpuDevice and flashAttn are configured inside the backend object

const pipeline = await Pipeline.load({
segModelPath: './models/segmentation.gguf',
embModelPath: './models/embedding.gguf',
pldaPath: './models/plda.gguf',
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
language: 'en',
backend: {
type: 'openvino-hybrid',
whisperEncoderPath: './models/ggml-large-v3-turbo-encoder-openvino.xml',
embPath: './models/embedding-openvino.xml',
},
});
// Switch to OpenVINO-hybrid at runtime
await pipeline.setExecutionBackend({
type: 'openvino-hybrid',
whisperEncoderPath: './models/ggml-large-v3-turbo-encoder-openvino.xml',
embPath: './models/embedding-openvino.xml',
});
const result = await pipeline.transcribeOffline(audio);
// Switch back to Vulkan
await pipeline.setExecutionBackend({ type: 'vulkan' });
Streaming sessions also support runtime changes:
const session = pipeline.createSession();
session.on('segments', (segments) => {
console.log(segments);
});
// Push English audio
await session.push(englishChunk);
// Switch language mid-stream — takes effect on the next Whisper decode
session.setLanguage('ko');
await session.push(koreanChunk);
const result = await session.finalize();
await session.close();
The pipeline returns this JSON shape:
{
"segments": [
{
"speaker": "SPEAKER_00",
"start": 0.497000,
"duration": 2.085000,
"text": "Hello world"
}
]
}
When transcriptionOnly is true, the speaker field is an empty string:
{
"segments": [
{
"speaker": "",
"start": 0.497000,
"duration": 2.085000,
"text": "Hello world"
}
]
}
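The segment objects above map naturally onto a small TypeScript type. Below is a minimal sketch: the interface name AlignedSegment appears in the streaming example earlier in this README, while the formatting helper is hypothetical and only illustrates consuming the shape shown above.

```typescript
// Sketch of the segment shape returned by the pipeline (see the JSON above).
interface AlignedSegment {
  speaker: string;  // "SPEAKER_XX", "UNKNOWN", or "" in transcription-only mode
  start: number;    // seconds
  duration: number; // seconds
  text: string;
}

// Hypothetical helper: render a segment as a one-line transcript entry.
function formatSegment(seg: AlignedSegment): string {
  const label = seg.speaker ? `[${seg.speaker}] ` : '';
  const end = (seg.start + seg.duration).toFixed(2);
  return `${label}${seg.start.toFixed(2)}-${end} ${seg.text.trim()}`;
}
```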
- Format: Float32Array
- Sample rate: 16000 Hz
- Sample range: [-1.0, 1.0]

All API methods expect decoded PCM samples; file decoding/resampling is handled by the caller.
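For reference, here is a minimal sketch of converting 16-bit integer PCM (the usual WAV sample format) into the expected Float32Array range. Decoding the container and resampling to 16 kHz mono remain the caller's job; the helper name is hypothetical.

```typescript
// Hypothetical helper: scale Int16 PCM samples into [-1.0, 1.0].
function int16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / 32768; // 32768 = 2^15, so -32768 maps to -1.0
  }
  return out;
}
```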
Offline mode (transcribeOffline)

1. VAD filtering (when vadModelPath is provided)
2. Single whisper_full() call on the filtered audio
3. Diarization
4. Alignment

In transcription-only mode, steps 3 (diarization) and 4 (alignment) are skipped.
Streaming mode (transcribe / createSession)

The streaming pipeline runs in 8 stages, including AGC (when agcEnabled: true) and final event emission (segments updates + audio chunk streaming).

In transcription-only mode, steps 5 (alignment) and 6 (recluster) are skipped, and segments are emitted with an empty speaker field. Each segments event contains only the new segments from that Whisper call (incremental), unlike diarization mode, which re-emits all segments after each recluster (cumulative).
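The cumulative-versus-incremental distinction determines how a consumer should handle segments events. A self-contained sketch, with plain string arrays standing in for segment lists (no real session involved):

```typescript
// Diarization mode: each event carries ALL segments so far, so the
// consumer should REPLACE its state with the latest payload.
let cumulativeState: string[] = [];
for (const event of [['a'], ['a', 'b'], ['a', 'b', 'c']]) {
  cumulativeState = event;
}

// Transcription-only mode: each event carries only NEW segments, so the
// consumer should APPEND each payload.
const incrementalState: string[] = [];
for (const event of [['a'], ['b'], ['c']]) {
  incrementalState.push(...event);
}
// Both strategies converge on the same final transcript.
```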
Avoid the following patterns, especially on Intel iGPUs:

- repeated Pipeline.load() / pipeline.close() cycles in the same process (crashes after ~8-10 cycles)
- frequent setExecutionBackend() calls (each call tears down and rebuilds the full GPU context)

The pipeline enforces exclusive access to its GPU resources. Only one of the following can be active at a time:
- an incremental streaming session (createSession())
- a one-shot streaming transcription (transcribe())
- an offline transcription (transcribeOffline())
- a backend switch (setExecutionBackend())

Attempting to start a second operation while one is active throws an error. Close the current session or wait for the current operation to complete before starting the next one.
If you need to stop the current operation early, call cancel() and wait for the in-flight promise to reject before starting the next operation.
// CORRECT: sequential operations
const session = pipeline.createSession();
// ... push audio, finalize ...
await session.close();
const result = await pipeline.transcribeOffline(audio); // OK — session is closed
// ERROR: concurrent operations
const session1 = pipeline.createSession();
const session2 = pipeline.createSession(); // throws: "A session is already active"
await pipeline.transcribeOffline(audio); // throws: "Model is busy"
createSession() borrows pre-loaded models and GPU contexts from the cache.

// SAFE: load once, create many sessions sequentially
const pipeline = await Pipeline.load(config);
for (const file of files) {
const session = pipeline.createSession();
// ... push audio, finalize ...
await session.close(); // cheap — no GPU teardown
}
pipeline.close(); // once, at shutdown
// DANGEROUS on Intel iGPU: repeated load/close cycles
for (const file of files) {
const pipeline = await Pipeline.load(config); // creates Vulkan context
await pipeline.transcribe(audio);
pipeline.close(); // destroys Vulkan context; driver leaks until a crash around the ~8th cycle
}
| Platform | Low-level (Whisper/VAD) | Pipeline (Transcription + Diarization) |
|---|---|---|
| macOS arm64 (Apple Silicon) | Supported | Supported (CoreML + Metal) |
| Windows x64 | Supported | Supported (Vulkan + optional OpenVINO) |
| macOS x64 (Intel) | Supported | Not tested |
| Linux | Not supported | Not supported |
MIT