pyannote-cpp-node

Node.js native bindings for whisper.cpp transcription/VAD plus the pyannote speaker diarization pipeline.

Overview

pyannote-cpp-node is now the single package for:

  • low-level whisper.cpp APIs: WhisperContext, VadContext, transcribe, transcribeAsync, getGpuDevices
  • high-level pyannote pipeline APIs: Pipeline, PipelineSession
  • standalone audio processing: processAgc

Platform support:

  • darwin-arm64: full pipeline (CoreML + Metal acceleration)
  • win32-x64: full pipeline (Vulkan GPU + optional OpenVINO acceleration)
  • unsupported: darwin-x64, win32-ia32, Linux

On both supported pipeline platforms, getCapabilities().pipeline is true.
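
A quick way to gate pipeline usage on this flag (a minimal sketch using the exported getCapabilities):

import { getCapabilities } from 'pyannote-cpp-node';

if (!getCapabilities().pipeline) {
  // Pipeline APIs are unavailable here; fall back to the low-level whisper.cpp APIs
  throw new Error('pipeline not supported on this platform');
}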

The integrated pipeline combines Whisper transcription and optional speaker diarization into a single API (transcriptionOnly: true skips diarization).

Given 16 kHz mono PCM audio (Float32Array), it produces transcript segments shaped as below. In streaming mode, diarization emits cumulative segments events, while transcriptionOnly: true emits incremental segments events. finalize() returns all segments in both modes.

  • speaker label (SPEAKER_00, SPEAKER_01, ...), "UNKNOWN" when diarization could not assign a speaker, or empty string ("") when transcriptionOnly is true
  • segment start/duration in seconds
  • segment text

The API supports three modes: offline batch processing (transcribeOffline), one-shot streaming (transcribe), and incremental streaming (createSession + push/finalize). All three modes support transcription-only operation via transcriptionOnly: true. All heavy operations are asynchronous and run on libuv worker threads.
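
As a compact map of the three modes (a sketch that assumes a pipeline loaded as in the Quick Start and audio as 16 kHz mono Float32Array):

// 1. Offline batch: fastest for complete recordings
const offline = await pipeline.transcribeOffline(audio);

// 2. One-shot streaming: same result shape, streaming internals
const oneShot = await pipeline.transcribe(audio);

// 3. Incremental streaming: push chunks, then finalize
const session = pipeline.createSession();
await session.push(audio);
const streamed = await session.finalize();
await session.close();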

Features

  • Low-level whisper.cpp transcription API compatible with prior whisper-cpp-node usage
  • Built-in Silero VAD via VadContext
  • GPU device enumeration via getGpuDevices()
  • Integrated transcription + diarization in one pipeline
  • Speaker-labeled transcript segments with sentence-level text
  • Offline mode: runs Whisper on the full audio at once + offline diarization (fastest for batch)
  • One-shot mode: streaming pipeline with automatic chunking
  • Streaming mode: incremental push/finalize with real-time segments events and audio chunk streaming
  • Transcription-only mode: skips speaker diarization entirely; only the segmentation and Whisper models (plus the optional VAD model) are required
  • Deterministic output for the same audio/models/config
  • CoreML-accelerated inference on macOS
  • Shared model cache: all models loaded once during Pipeline.load(), reused across offline/streaming/session modes
  • Runtime backend switching: switch inference backends at runtime on macOS and Windows
  • Progress reporting: optional onProgress callback for transcribeOffline reports Whisper, diarization, and alignment phases
  • Real-time segment streaming: optional onSegment callback for transcribeOffline delivers each Whisper segment (start, end, text) as it's produced — enables live transcript preview and time-based loading bars
  • WebRTC AGC2 audio normalization: adaptive gain control at the pipeline ingress, using an RNN-based VAD for speech-aware gain adjustment
  • Standalone AGC API: processAgc() for offline audio normalization without the full pipeline
  • TypeScript-first API with complete type definitions

Requirements

  • macOS Apple Silicon or Windows x64
  • Node.js >= 18
  • Model files:
    • Segmentation GGUF (segModelPath) — required on all platforms
    • Embedding GGUF (embModelPath) — required unless transcriptionOnly is true
    • PLDA GGUF (pldaPath) — required unless transcriptionOnly is true
    • Whisper GGUF (whisperModelPath) — required on all platforms
    • Optional Silero VAD model (vadModelPath)
  • Required backend config (backend) with one of: metal, vulkan, coreml, or openvino-hybrid
    • Accelerator-specific paths now live inside backend:
      • backend: { type: 'coreml', segPath, embPath, whisperEncoderPath } uses CoreML .mlpackage / .mlmodelc assets on macOS
      • backend: { type: 'openvino-hybrid', whisperEncoderPath, embPath } uses OpenVINO IR .xml assets on Windows
      • backend: { type: 'metal', segPath, embPath } on macOS requires segmentation and embedding model paths
      • backend: { type: 'vulkan' } on Windows does not need extra accelerator paths

Installation

npm install pyannote-cpp-node
pnpm add pyannote-cpp-node

The package installs a platform-specific native addon through optionalDependencies.

Low-Level Quick Start

import {
  WhisperContext,
  createVadContext,
  getCapabilities,
  transcribeAsync,
} from 'pyannote-cpp-node';

const capabilities = getCapabilities();
console.log(capabilities);

const ctx = new WhisperContext({
  model: './models/ggml-base.en.bin',
  use_gpu: true,
  no_prints: true,
});

const result = await transcribeAsync(ctx, {
  fname_inp: './audio.wav',
  language: 'en',
});

const vad = createVadContext({
  model: './models/ggml-silero-v6.2.0.bin',
});

console.log(result.segments);
console.log(vad.getWindowSamples());

vad.free();
ctx.free();

Quick Start

macOS (Apple Silicon)

import { Pipeline } from 'pyannote-cpp-node';

const pipeline = await Pipeline.load({
  segModelPath: './models/segmentation.gguf',
  embModelPath: './models/embedding.gguf',
  pldaPath: './models/plda.gguf',
  whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
  backend: {
    type: 'coreml',
    segPath: './models/segmentation.mlpackage',
    embPath: './models/embedding.mlpackage',
    whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
  },
  language: 'en',
});

const audio = loadAudioAsFloat32Array('./audio-16khz-mono.wav'); // user-supplied decoder (see Audio Format Requirements)
const result = await pipeline.transcribeOffline(audio);

for (const segment of result.segments) {
  const end = segment.start + segment.duration;
  console.log(
    `[${segment.speaker}] ${segment.start.toFixed(2)}-${end.toFixed(2)} ${segment.text.trim()}`
  );
}

pipeline.close();

Windows (x64)

import { Pipeline } from 'pyannote-cpp-node';

const pipeline = await Pipeline.load({
  segModelPath: './models/segmentation.gguf',
  embModelPath: './models/embedding.gguf',
  pldaPath: './models/plda.gguf',
  whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
  language: 'en',
  backend: { type: 'vulkan' },
});

const audio = loadAudioAsFloat32Array('./audio-16khz-mono.wav');
const result = await pipeline.transcribeOffline(audio);

for (const segment of result.segments) {
  const end = segment.start + segment.duration;
  console.log(
    `[${segment.speaker}] ${segment.start.toFixed(2)}-${end.toFixed(2)} ${segment.text.trim()}`
  );
}

pipeline.close();

To use the Windows OpenVINO hybrid path instead, pass the OpenVINO assets through backend:

const pipeline = await Pipeline.load({
  segModelPath: './models/segmentation.gguf',
  embModelPath: './models/embedding.gguf',
  pldaPath: './models/plda.gguf',
  whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
  backend: {
    type: 'openvino-hybrid',
    whisperEncoderPath: './models/ggml-large-v3-turbo-encoder-openvino.xml',
    embPath: './models/embedding-openvino.xml',
  },
});

Transcription-only mode

const macPipeline = await Pipeline.load({
  segModelPath: './models/segmentation.gguf',
  whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
  language: 'en',
  transcriptionOnly: true,
  backend: {
    type: 'coreml',
    segPath: './models/segmentation.mlpackage',
    embPath: './models/embedding.mlpackage',
    whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
  },
});

const windowsPipeline = await Pipeline.load({
  segModelPath: './models/segmentation.gguf',
  whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
  language: 'en',
  transcriptionOnly: true,
  backend: { type: 'vulkan' },
});

const result = await macPipeline.transcribe(audio);

for (const segment of result.segments) {
  const end = segment.start + segment.duration;
  // No speaker label - segment.speaker is empty string
  console.log(`${segment.start.toFixed(2)}-${end.toFixed(2)} ${segment.text.trim()}`);
}

macPipeline.close();
windowsPipeline.close();

API Reference

Pipeline

class Pipeline {
  static async load(config: PipelineConfig): Promise<Pipeline>;
  async transcribeOffline(audio: Float32Array, onProgress?: (phase: number, progress: number) => void, onSegment?: (start: number, end: number, text: string) => void): Promise<TranscriptionResult>;
  async transcribe(audio: Float32Array): Promise<TranscriptionResult>;
  setLanguage(language: string): void;
  setDecodeOptions(options: DecodeOptions): void;
  createSession(): PipelineSession;
  async setExecutionBackend(backend: BackendConfig): Promise<void>;
  cancel(): void;
  close(): void;
  get isClosed(): boolean;
}

static async load(config: PipelineConfig): Promise<Pipeline>

Validates model paths and loads all models into a shared cache on a background thread. The accelerator assets are selected by config.backend, which is required and has no default. On macOS, backend: { type: 'coreml', ... } loads CoreML segmentation, embedding, and Whisper encoder assets, while backend: { type: 'metal', ... } uses Metal with segmentation and embedding model paths. On Windows x64, backend: { type: 'vulkan' } loads the Vulkan path, and backend: { type: 'openvino-hybrid', ... } also loads OpenVINO IR models for the Whisper encoder and embedding model. Models are loaded once and reused across all subsequent transcribe(), transcribeOffline(), and createSession() calls. Models are freed only when close() is called.

  • backend is required in every Pipeline.load() call
  • coreml requires segPath, embPath, and whisperEncoderPath
  • openvino-hybrid requires whisperEncoderPath and embPath
  • metal requires segPath and embPath
  • vulkan does not require extra accelerator model paths

async transcribeOffline(audio: Float32Array, onProgress?, onSegment?): Promise<TranscriptionResult>

Runs Whisper on the entire audio buffer in a single whisper_full() call, then runs offline diarization and WhisperX-style speaker alignment. In transcription-only mode, diarization and speaker alignment are skipped, and segments have an empty speaker field. This is the fastest mode for batch processing — no streaming infrastructure is involved.

If cancellation is requested with pipeline.cancel() or pipeline.close() while this call is in flight, the promise rejects with Error("Operation cancelled").

The optional onProgress callback receives (phase, progress) updates:

Phase  Value  Meaning
0      0-100  Whisper transcription progress (percentage)
1      0      Diarization started
2      0      Speaker alignment started

const result = await pipeline.transcribeOffline(audio, (phase, progress) => {
  if (phase === 0) console.log(`Transcribing: ${progress}%`);
  if (phase === 1) console.log('Running diarization...');
  if (phase === 2) console.log('Aligning speakers...');
});

The optional onSegment callback receives (start, end, text) for each Whisper segment as it's produced during transcription. Times are in seconds. This enables live transcript preview before diarization and alignment complete.

const result = await pipeline.transcribeOffline(audio, undefined, (start, end, text) => {
  console.log(`[${start.toFixed(2)}-${end.toFixed(2)}] ${text}`);
});

Both callbacks can be used simultaneously:

const result = await pipeline.transcribeOffline(
  audio,
  (phase, progress) => {
    if (phase === 0) updateProgressBar(progress);
  },
  (start, end, text) => {
    appendToTranscriptPreview(start, end, text);
  },
);

async transcribe(audio: Float32Array): Promise<TranscriptionResult>

Runs one-shot transcription (+ diarization unless transcriptionOnly is set) using the streaming pipeline internally (pushes 1-second chunks then finalizes).

If cancellation is requested with pipeline.cancel() or pipeline.close() while this call is in flight, the promise rejects with Error("Operation cancelled").

setLanguage(language: string): void

Updates the Whisper decode language for subsequent transcribe() calls. This is a convenience shorthand for setDecodeOptions({ language }).
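
For example, the two calls below are interchangeable:

pipeline.setLanguage('ko');
// equivalent to:
pipeline.setDecodeOptions({ language: 'ko' });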

setDecodeOptions(options: DecodeOptions): void

Updates one or more Whisper decode options for subsequent transcribe() calls. Only the fields you pass are changed; others retain their current values. See DecodeOptions for available fields.

async setExecutionBackend(backend: BackendConfig): Promise<void>

Switches the inference backend at runtime. Tears down and reloads the entire model cache with the new backend configuration. The promise resolves when the new models are ready.

  • macOS: supports metal and coreml
  • Windows: supports vulkan and openvino-hybrid

Pass one of these BackendConfig variants:

type BackendConfig =
  | { type: 'metal'; gpuDevice?: number; flashAttn?: boolean; segPath: string; embPath: string }
  | { type: 'vulkan'; gpuDevice?: number; flashAttn?: boolean }
  | {
      type: 'coreml';
      gpuDevice?: number;
      flashAttn?: boolean;
      segPath: string;
      embPath: string;
      whisperEncoderPath: string;
    }
  | {
      type: 'openvino-hybrid';
      gpuDevice?: number;
      flashAttn?: boolean;
      whisperEncoderPath: string;
      embPath: string;
      openvinoDevice?: string;
      openvinoCacheDir?: string;
    };

Warning: This is a heavy operation (~5-6s on Intel iGPU). It fully tears down and rebuilds the model cache. Treat it as a one-time configuration change, not something to call in a loop. See Warnings and Known Issues for Intel iGPU limitations.

// macOS: switch to Metal
await pipeline.setExecutionBackend({
  type: 'metal',
  segPath: './models/segmentation.mlpackage',
  embPath: './models/embedding.mlpackage',
});

// macOS: switch to CoreML
await pipeline.setExecutionBackend({
  type: 'coreml',
  segPath: './models/segmentation.mlpackage',
  embPath: './models/embedding.mlpackage',
  whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
});

// Windows: switch to OpenVINO hybrid
await pipeline.setExecutionBackend({
  type: 'openvino-hybrid',
  whisperEncoderPath: './models/ggml-large-v3-turbo-encoder-openvino.xml',
  embPath: './models/embedding-openvino.xml',
});

// Windows: switch back to Vulkan
await pipeline.setExecutionBackend({ type: 'vulkan' });

createSession(): PipelineSession

Creates an independent streaming session for incremental processing. This method takes no arguments; native segment/audio callbacks are wired internally.

cancel(): void

Requests cooperative cancellation of the active transcribe() or transcribeOffline() call, if any. Native work stops at the next supported cancellation point, and the in-flight promise rejects with Error("Operation cancelled").
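
Callers can distinguish this rejection from real failures by its message (a sketch, using the message string documented above):

try {
  await pipeline.transcribeOffline(audio);
} catch (err) {
  if (err instanceof Error && err.message === 'Operation cancelled') {
    // expected after pipeline.cancel() or pipeline.close(); continue shutdown
  } else {
    throw err; // a genuine error
  }
}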

close(): void

Requests cancellation of any active model-level operation, then releases native resources. Safe to call multiple times.

get isClosed(): boolean

Returns true after close().

PipelineSession (extends EventEmitter)

class PipelineSession extends EventEmitter {
  async push(audio: Float32Array): Promise<PushResult>;
  cancel(): void;
  setLanguage(language: string): void;
  setDecodeOptions(options: DecodeOptions): void;
  async finalize(): Promise<TranscriptionResult>;
  close(): Promise<void>;
  get isClosed(): boolean;
  on<K extends keyof PipelineSessionEvents>(
    event: K,
    listener: (...args: PipelineSessionEvents[K]) => void
  ): this;
}
interface PipelineSessionEvents {
  segments: [segments: AlignedSegment[]];
  audio: [audio: Float32Array];
  error: [error: Error];
}

async push(audio: Float32Array): Promise<PushResult>

Pushes an arbitrary number of samples into the streaming pipeline.

interface PushResult {
  vad: boolean[];  // per-window VAD (true = speech, false = silence)
  rms: number[];   // per-window RMS of input audio, aligned 1:1 with vad
}

  • vad contains per-frame VAD booleans, one per 512-sample window (32 ms at 16 kHz)
  • rms contains the RMS energy of each corresponding window (post-AGC when agcEnabled is true, raw input otherwise)
  • Both arrays are the same length
  • The pipeline pre-fills 10 seconds of silence during initialization, so results are returned immediately from the first push (~0.94s step size)
  • Chunk size is flexible; not restricted to 16,000-sample pushes

If the session is cancelled before or during the call, the promise rejects with Error("Operation cancelled").

setLanguage(language: string): void

Updates the Whisper decode language on the live streaming session. Takes effect on the next Whisper decode run. Thread-safe — the change is pushed to the C++ pipeline immediately.

setDecodeOptions(options: DecodeOptions): void

Updates one or more Whisper decode options on the live streaming session. Takes effect on the next Whisper decode run. Thread-safe — changes are pushed to the C++ pipeline immediately. Only the fields you pass are changed; others retain their current values.
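
For example, adjusting decode settings on a live session (a sketch using documented DecodeOptions fields):

// Takes effect on the next Whisper decode run; unspecified fields keep their current values
session.setDecodeOptions({ temperature: 0.2, beamSize: 5 });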

async finalize(): Promise<TranscriptionResult>

Flushes all stages, runs final recluster + alignment, and returns the definitive result. finalize() always returns all accumulated segments regardless of mode. In diarization mode this is the final re-aligned output, and in transcription-only mode this is the union of all incremental segments emissions.

If the session is cancelled before or during the call, the promise rejects with Error("Operation cancelled").

type TranscriptionResult = {
  segments: AlignedSegment[];
  /** Silence-filtered audio when VAD model is loaded. Timestamps align to this audio. */
  filteredAudio?: Float32Array;
};

cancel(): void

Requests cooperative cancellation of the active streaming operation, if any. After calling cancel(), the current push() or finalize() promise rejects with Error("Operation cancelled").

close(): Promise<void>

Requests cancellation of any active push() / finalize() work, then releases native session resources. Safe to call multiple times.

get isClosed(): boolean

Returns true after close().

Event: 'segments'

Emitted after each Whisper transcription result. Behavior depends on mode:

  • With diarization (default): each emission contains all segments re-aligned against the latest speaker clustering. Earlier segments may get updated speaker labels as more data arrives. The final emission after finalize() is the definitive output.
  • With transcriptionOnly: true: each emission contains only the new segments from the latest Whisper result. Earlier segments never change, so incremental delivery is safe. Accumulate across emissions to build the full transcript.

// With diarization (default): cumulative, re-aligned output
session.on('segments', (segments: AlignedSegment[]) => {
  // `segments` contains the latest full speaker-labeled transcript so far
  const latest = segments[segments.length - 1];
  if (latest) {
    const end = latest.start + latest.duration;
    console.log(`[${latest.speaker}] ${latest.start.toFixed(2)}-${end.toFixed(2)} ${latest.text.trim()}`);
  }
});

// With transcriptionOnly: incremental output, accumulate manually
const allSegments: AlignedSegment[] = [];
session.on('segments', (newSegments: AlignedSegment[]) => {
  allSegments.push(...newSegments);
  for (const seg of newSegments) {
    const end = seg.start + seg.duration;
    console.log(`${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
  }
});

Event: 'audio'

Emitted in real time with silence-filtered PCM chunks (Float32Array) as the pipeline processes audio.

session.on('audio', (chunk: Float32Array) => {
  // `chunk` is silence-filtered audio emitted for streaming consumers
});

processAgc

function processAgc(audio: Float32Array, sampleRate?: number): Float32Array;

Standalone WebRTC AGC2 audio normalization. Processes the entire audio buffer through the adaptive gain controller and returns the gain-normalized result.

  • audio: 16 kHz mono PCM samples in [-1.0, 1.0]
  • sampleRate: sample rate in Hz (default: 16000)
  • Returns a new Float32Array with normalized audio

This is a synchronous, stateless operation — a new AGC instance is created and destroyed per call. For streaming use, AGC is integrated into the pipeline via agcEnabled: true.

import { processAgc } from 'pyannote-cpp-node';

const normalized = processAgc(quietAudio);
// normalized has higher RMS energy than quietAudio
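
For streaming use, the same normalization is enabled at the pipeline ingress via agcEnabled (a sketch; pipelineConfig is a stand-in for a config object like those in the Quick Start):

// Enables WebRTC AGC2 before silence filtering and downstream processing
const pipeline = await Pipeline.load({ ...pipelineConfig, agcEnabled: true });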

Types

export interface PipelineConfig {
  // === Required Model Paths ===
  /** Path to segmentation GGUF model */
  segModelPath: string;

  /** Path to Whisper GGUF model */
  whisperModelPath: string;

  /** Path to embedding GGUF model (required unless transcriptionOnly is true) */
  embModelPath?: string;

  /** Path to PLDA GGUF model (required unless transcriptionOnly is true) */
  pldaPath?: string;

  // === Optional Model Paths ===
  /** Path to Silero VAD model (optional, enables silence compression) */
  vadModelPath?: string;

  /**
   * Transcription-only mode - skip speaker diarization (default: false).
   * When true, embModelPath, pldaPath, and backend embedding assets are not required.
   */
  transcriptionOnly?: boolean;

  /**
   * Enable WebRTC AGC2 audio normalization at the pipeline ingress (default: false).
   * When true, audio is gain-normalized before silence filtering and downstream processing.
   */
  agcEnabled?: boolean;

  /** Required execution backend configuration */
  backend: BackendConfig;

  /** Suppress whisper.cpp log output (default: false) */
  noPrints?: boolean;

  // === Whisper Decode Options ===
  /** Number of threads for Whisper inference (default: 4) */
  nThreads?: number;

  /** Language code (e.g., 'en', 'zh'). Set detectLanguage for auto-detect. (default: 'en') */
  language?: string;

  /** Translate non-English speech to English (default: false) */
  translate?: boolean;

  /** Auto-detect spoken language. Overrides 'language' when true. (default: false) */
  detectLanguage?: boolean;

  // === Sampling ===
  /** Sampling temperature. 0.0 = greedy deterministic. (default: 0.0) */
  temperature?: number;

  /** Temperature increment for fallback retries (default: 0.2) */
  temperatureInc?: number;

  /** Disable temperature fallback. If true, temperatureInc is ignored. (default: false) */
  noFallback?: boolean;

  /** Beam search size. -1 uses greedy decoding. >1 enables beam search. (default: -1) */
  beamSize?: number;

  /** Best-of-N sampling candidates for greedy decoding (default: 5) */
  bestOf?: number;

  // === Thresholds ===
  /** Entropy threshold for decoder fallback (default: 2.4) */
  entropyThold?: number;

  /** Log probability threshold for decoder fallback (default: -1.0) */
  logprobThold?: number;

  /** No-speech probability threshold (default: 0.6) */
  noSpeechThold?: number;

  // === Context ===
  /** Initial prompt text to condition the decoder (default: none) */
  prompt?: string;

  /** Don't use previous segment as context for next segment (default: true) */
  noContext?: boolean;

  /** Suppress blank outputs at the beginning of segments (default: true) */
  suppressBlank?: boolean;

  /** Suppress non-speech tokens (default: false) */
  suppressNst?: boolean;
}

export type BackendConfig =
  | {
      /** Metal backend on macOS */
      type: 'metal';
      /** GPU device index */
      gpuDevice?: number;
      /** Enable Flash Attention */
      flashAttn?: boolean;
      /** Path to segmentation model directory */
      segPath: string;
      /** Path to embedding model directory */
      embPath: string;
    }
  | {
      /** Vulkan backend on Windows */
      type: 'vulkan';
      /** GPU device index */
      gpuDevice?: number;
      /** Enable Flash Attention */
      flashAttn?: boolean;
    }
  | {
      /** CoreML backend on macOS */
      type: 'coreml';
      /** GPU device index */
      gpuDevice?: number;
      /** Enable Flash Attention */
      flashAttn?: boolean;
      /** Path to segmentation CoreML .mlpackage directory */
      segPath: string;
      /** Path to embedding CoreML .mlpackage directory */
      embPath: string;
      /** Path to Whisper encoder CoreML .mlmodelc directory */
      whisperEncoderPath: string;
    }
  | {
      /** OpenVINO hybrid backend on Windows */
      type: 'openvino-hybrid';
      /** GPU device index */
      gpuDevice?: number;
      /** Enable Flash Attention */
      flashAttn?: boolean;
      /** Path to Whisper encoder OpenVINO IR (.xml) */
      whisperEncoderPath: string;
      /** Path to embedding OpenVINO IR (.xml) */
      embPath: string;
      /** OpenVINO device target (default: 'GPU') */
      openvinoDevice?: string;
      /** OpenVINO model cache directory */
      openvinoCacheDir?: string;
    };

export interface DecodeOptions {
  /** Language code (e.g., 'en', 'zh'). Omit for auto-detect. */
  language?: string;
  /** Translate non-English speech to English */
  translate?: boolean;
  /** Auto-detect spoken language. Overrides 'language' when true. */
  detectLanguage?: boolean;
  /** Number of threads for Whisper inference */
  nThreads?: number;
  /** Sampling temperature. 0.0 = greedy deterministic. */
  temperature?: number;
  /** Temperature increment for fallback retries */
  temperatureInc?: number;
  /** Disable temperature fallback. If true, temperatureInc is ignored. */
  noFallback?: boolean;
  /** Beam search size. -1 uses greedy decoding. >1 enables beam search. */
  beamSize?: number;
  /** Best-of-N sampling candidates for greedy decoding */
  bestOf?: number;
  /** Entropy threshold for decoder fallback */
  entropyThold?: number;
  /** Log probability threshold for decoder fallback */
  logprobThold?: number;
  /** No-speech probability threshold */
  noSpeechThold?: number;
  /** Initial prompt text to condition the decoder */
  prompt?: string;
  /** Don't use previous segment as context for next segment */
  noContext?: boolean;
  /** Suppress blank outputs at the beginning of segments */
  suppressBlank?: boolean;
  /** Suppress non-speech tokens */
  suppressNst?: boolean;
}

export interface AlignedSegment {
  /** Global speaker label (e.g., SPEAKER_00). "UNKNOWN" when diarization could not assign a speaker. Empty string when transcriptionOnly is true. */
  speaker: string;

  /** Segment start time in seconds. */
  start: number;

  /** Segment duration in seconds. */
  duration: number;

  /** Transcribed text for this segment. */
  text: string;
}

export interface TranscriptionResult {
  /** Transcript segments. Speaker-labeled when diarization is enabled; speaker is empty string in transcription-only mode. */
  segments: AlignedSegment[];
  /**
   * Silence-filtered audio (16 kHz mono Float32Array).
   * Present when a VAD model is loaded (`vadModelPath` in config).
   * Silence longer than 2 seconds is compressed to 2 seconds.
   * All segment timestamps are aligned to this audio —
   * save it directly and timestamps will sync correctly.
   */
  filteredAudio?: Float32Array;
}

Usage Examples

import { Pipeline } from 'pyannote-cpp-node';

async function runOffline(audio: Float32Array) {
  const pipeline = await Pipeline.load({
    segModelPath: './models/segmentation.gguf',
    embModelPath: './models/embedding.gguf',
    pldaPath: './models/plda.gguf',
    whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
    backend: {
      type: 'coreml',
      segPath: './models/segmentation.mlpackage',
      embPath: './models/embedding.mlpackage',
      whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
    },
  });

  // Runs Whisper on full audio at once + offline diarization
  const result = await pipeline.transcribeOffline(audio);

  for (const seg of result.segments) {
    const end = seg.start + seg.duration;
    console.log(`[${seg.speaker}] ${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
  }

  pipeline.close();
}

Offline transcription with silence filtering

When a VAD model is provided, transcribeOffline automatically compresses silence longer than 2 seconds down to 2 seconds before running Whisper and diarization. The filtered audio is returned alongside segments so you can save it with correctly aligned timestamps.

import { Pipeline } from 'pyannote-cpp-node';
import { writeFileSync } from 'node:fs';

async function runOfflineWithVAD(audio: Float32Array) {
  const pipeline = await Pipeline.load({
    segModelPath: './models/segmentation.gguf',
    embModelPath: './models/embedding.gguf',
    pldaPath: './models/plda.gguf',
    whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
    backend: {
      type: 'coreml',
      segPath: './models/segmentation.mlpackage',
      embPath: './models/embedding.mlpackage',
      whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
    },
    vadModelPath: './models/ggml-silero-v6.2.0.bin', // enables silence filtering
  });

  const result = await pipeline.transcribeOffline(audio);

  // Save the silence-filtered audio — timestamps in result.segments align to this
  if (result.filteredAudio) {
    // filteredAudio is 16 kHz mono Float32Array with silence compressed
    writeFileSync('./output-filtered.pcm', Buffer.from(result.filteredAudio.buffer, result.filteredAudio.byteOffset, result.filteredAudio.byteLength));
    console.log(`Filtered: ${audio.length} -> ${result.filteredAudio.length} samples`);
  }

  for (const seg of result.segments) {
    const end = seg.start + seg.duration;
    console.log(`[${seg.speaker}] ${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
  }

  pipeline.close();
}

Offline transcription with progress and live transcript preview

import { Pipeline } from 'pyannote-cpp-node';

async function runOfflineWithCallbacks(audio: Float32Array) {
  const pipeline = await Pipeline.load({
    segModelPath: './models/segmentation.gguf',
    embModelPath: './models/embedding.gguf',
    pldaPath: './models/plda.gguf',
    whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
    backend: {
      type: 'coreml',
      segPath: './models/segmentation.mlpackage',
      embPath: './models/embedding.mlpackage',
      whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
    },
  });

  const result = await pipeline.transcribeOffline(
    audio,
    // Progress callback — phase 0 is Whisper (0-100%), phase 1 is diarization, phase 2 is alignment
    (phase, progress) => {
      if (phase === 0) updateProgressBar(progress);
      if (phase === 1) showStatus('Identifying speakers...');
      if (phase === 2) showStatus('Aligning speakers to transcript...');
    },
    // Segment callback — each Whisper segment as it's produced (before diarization)
    (start, end, text) => {
      appendToLivePreview(`[${start.toFixed(2)}-${end.toFixed(2)}] ${text}`);
    },
  );

  console.log(`Done: ${result.segments.length} speaker-labeled segments`);
  pipeline.close();
}

One-shot transcription (streaming internals)

import { Pipeline } from 'pyannote-cpp-node';

async function runOneShot(audio: Float32Array) {
  const pipeline = await Pipeline.load({
    segModelPath: './models/segmentation.gguf',
    embModelPath: './models/embedding.gguf',
    pldaPath: './models/plda.gguf',
    whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
    backend: {
      type: 'coreml',
      segPath: './models/segmentation.mlpackage',
      embPath: './models/embedding.mlpackage',
      whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
    },
  });

  // Uses streaming pipeline internally (push 1s chunks + finalize)
  const result = await pipeline.transcribe(audio);

  for (const seg of result.segments) {
    const end = seg.start + seg.duration;
    console.log(`[${seg.speaker}] ${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
  }

  pipeline.close();
}

Streaming transcription

import { Pipeline } from 'pyannote-cpp-node';

async function runStreaming(audio: Float32Array) {
  const pipeline = await Pipeline.load({
    segModelPath: './models/segmentation.gguf',
    embModelPath: './models/embedding.gguf',
    pldaPath: './models/plda.gguf',
    whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
    backend: {
      type: 'coreml',
      segPath: './models/segmentation.mlpackage',
      embPath: './models/embedding.mlpackage',
      whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
    },
  });

  const session = pipeline.createSession();
  // Diarization mode (default): each event is cumulative and may relabel earlier segments
  session.on('segments', (segments) => {
    const latest = segments[segments.length - 1];
    if (latest) {
      const end = latest.start + latest.duration;
      console.log(`[live][${latest.speaker}] ${latest.start.toFixed(2)}-${end.toFixed(2)} ${latest.text.trim()}`);
    }
  });

  session.on('audio', (chunk) => {
    console.log(`silence-filtered audio chunk: ${chunk.length} samples`);
  });

  const chunkSize = 16000;
  for (let i = 0; i < audio.length; i += chunkSize) {
    const chunk = audio.slice(i, Math.min(i + chunkSize, audio.length));
    const { vad, rms } = await session.push(chunk);
    if (vad.length > 0) {
      const speechFrames = vad.filter(Boolean).length;
      const avgRms = rms.reduce((a, b) => a + b, 0) / rms.length;
      console.log(`VAD: ${vad.length} frames (${speechFrames} speech), avg RMS: ${avgRms.toFixed(4)}`);
    }
  }

  const finalResult = await session.finalize();
  console.log(`Final segments: ${finalResult.segments.length}`);

  await session.close();
  pipeline.close();
}

Streaming transcription (transcription-only)

import { Pipeline, type AlignedSegment } from 'pyannote-cpp-node';

async function runStreamingTranscriptionOnly(audio: Float32Array) {
  const pipeline = await Pipeline.load({
    segModelPath: './models/segmentation.gguf',
    whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
    transcriptionOnly: true,
    backend: {
      type: 'coreml',
      segPath: './models/segmentation.mlpackage',
      embPath: './models/embedding.mlpackage',
      whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
    },
  });

  const session = pipeline.createSession();

  // Transcription-only: each event has only NEW segments
  const allSegments: AlignedSegment[] = [];
  session.on('segments', (newSegments) => {
    allSegments.push(...newSegments);
    for (const seg of newSegments) {
      const end = seg.start + seg.duration;
      console.log(`${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
    }
  });

  const chunkSize = 16000;
  for (let i = 0; i < audio.length; i += chunkSize) {
    const chunk = audio.slice(i, Math.min(i + chunkSize, audio.length));
    await session.push(chunk);
  }

  const finalResult = await session.finalize();
  console.log(`Final segments from finalize(): ${finalResult.segments.length}`);
  console.log(`Accumulated from incremental events: ${allSegments.length}`);

  await session.close();
  pipeline.close();
}

Graceful shutdown

Use cancel() to stop in-flight native work before tearing down resources. This is useful for long offline or streaming runs that should exit promptly on Ctrl+C.

const pipeline = await Pipeline.load(config);
const session = pipeline.createSession();

process.once('SIGINT', async () => {
  session.cancel();
  pipeline.cancel();
  await session.close().catch(() => {});
  pipeline.close();
  process.exit(130);
});

Custom Whisper decode options

import { Pipeline } from 'pyannote-cpp-node';

const pipeline = await Pipeline.load({
  segModelPath: './models/segmentation.gguf',
  embModelPath: './models/embedding.gguf',
  pldaPath: './models/plda.gguf',
  whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
  backend: {
    type: 'coreml',
    segPath: './models/segmentation.mlpackage',
    embPath: './models/embedding.mlpackage',
    whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
    flashAttn: true,
    gpuDevice: 0,
  },

  // Decode strategy
  nThreads: 8,
  language: 'ko',
  translate: false,
  detectLanguage: false,
  temperature: 0.0,
  temperatureInc: 0.2,
  noFallback: false,
  beamSize: 5,
  bestOf: 5,

  // Thresholds and context
  entropyThold: 2.4,
  logprobThold: -1.0,
  noSpeechThold: 0.6,
  prompt: 'Meeting transcript with technical terminology.',
  noContext: true,
  suppressBlank: true,
  suppressNst: false,
});

Changing language at runtime

import { Pipeline } from 'pyannote-cpp-node';

const pipeline = await Pipeline.load({
  segModelPath: './models/segmentation.gguf',
  embModelPath: './models/embedding.gguf',
  pldaPath: './models/plda.gguf',
  whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
  backend: {
    type: 'coreml',
    segPath: './models/segmentation.mlpackage',
    embPath: './models/embedding.mlpackage',
    whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
  },
  language: 'en',
});

// First transcription in English
const result1 = await pipeline.transcribe(englishAudio);

// Switch to Korean for the next transcription
pipeline.setLanguage('ko');
const result2 = await pipeline.transcribe(koreanAudio);

// Or update multiple decode options at once
pipeline.setDecodeOptions({
  language: 'zh',
  temperature: 0.2,
  beamSize: 5,
});
const result3 = await pipeline.transcribe(chineseAudio);

pipeline.close();

Switching execution backend at runtime (macOS)

import { Pipeline } from 'pyannote-cpp-node';

// Start with Metal
const pipeline = await Pipeline.load({
  segModelPath: './models/segmentation.gguf',
  embModelPath: './models/embedding.gguf',
  pldaPath: './models/plda.gguf',
  whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
  backend: {
    type: 'metal',
    segPath: './models/segmentation.mlpackage',
    embPath: './models/embedding.mlpackage',
  },
});

// Switch to CoreML
await pipeline.setExecutionBackend({
  type: 'coreml',
  segPath: './models/segmentation.mlpackage',
  embPath: './models/embedding.mlpackage',
  whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
});
const result1 = await pipeline.transcribeOffline(audio);

// Switch back to Metal
await pipeline.setExecutionBackend({
  type: 'metal',
  segPath: './models/segmentation.mlpackage',
  embPath: './models/embedding.mlpackage',
});
const result2 = await pipeline.transcribeOffline(audio);

pipeline.close();

Execution Backends

setExecutionBackend(backend) switches the inference backend at runtime.

  • On macOS: supports metal and coreml
  • On Windows: supports vulkan and openvino-hybrid
  • openvino-hybrid uses OpenVINO for the Whisper encoder and embedding model, and Vulkan for everything else
  • gpuDevice and flashAttn are configured inside the backend object

const pipeline = await Pipeline.load({
  segModelPath: './models/segmentation.gguf',
  embModelPath: './models/embedding.gguf',
  pldaPath: './models/plda.gguf',
  whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
  language: 'en',
  backend: {
    type: 'openvino-hybrid',
    whisperEncoderPath: './models/ggml-large-v3-turbo-encoder-openvino.xml',
    embPath: './models/embedding-openvino.xml',
  },
});

// Switch to OpenVINO-hybrid at runtime
await pipeline.setExecutionBackend({
  type: 'openvino-hybrid',
  whisperEncoderPath: './models/ggml-large-v3-turbo-encoder-openvino.xml',
  embPath: './models/embedding-openvino.xml',
});
const result = await pipeline.transcribeOffline(audio);

// Switch back to Vulkan
await pipeline.setExecutionBackend({ type: 'vulkan' });

Streaming sessions also support runtime changes:

const session = pipeline.createSession();

session.on('segments', (segments) => {
  console.log(segments);
});

// Push English audio
await session.push(englishChunk);

// Switch language mid-stream — takes effect on the next Whisper decode
session.setLanguage('ko');
await session.push(koreanChunk);

const result = await session.finalize();

await session.close();

JSON Output Format

The pipeline returns this JSON shape:

{
  "segments": [
    {
      "speaker": "SPEAKER_00",
      "start": 0.497000,
      "duration": 2.085000,
      "text": "Hello world"
    }
  ]
}

When transcriptionOnly is true, the speaker field is an empty string:

{
  "segments": [
    {
      "speaker": "",
      "start": 0.497000,
      "duration": 2.085000,
      "text": "Hello world"
    }
  ]
}

Audio Format Requirements

  • Input must be Float32Array
  • Sample rate must be 16000 Hz
  • Audio must be mono
  • Recommended amplitude range: [-1.0, 1.0]

All API methods expect decoded PCM samples; file decoding/resampling is handled by the caller.
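
As one way to implement the loadAudioAsFloat32Array helper used in the examples, here is a minimal sketch. It assumes the file is already a 16 kHz mono 16-bit PCM WAV with a canonical 44-byte header; real-world inputs need proper RIFF chunk parsing and resampling (e.g., via ffmpeg):

import { readFileSync } from 'node:fs';

function loadAudioAsFloat32Array(path: string): Float32Array {
  const buf = readFileSync(path);
  const sampleCount = Math.floor((buf.length - 44) / 2); // 16-bit samples after the header
  const audio = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    // Convert little-endian int16 to float in [-1.0, 1.0]
    audio[i] = buf.readInt16LE(44 + 2 * i) / 32768;
  }
  return audio;
}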

Architecture

Offline mode (transcribeOffline)

  1. VAD silence filter (optional — compresses silence >2s to 2s when vadModelPath provided)
  2. Single whisper_full() call on filtered audio
  3. Offline diarization (segmentation → powerset → embeddings → PLDA → AHC → VBx) on filtered audio
  4. WhisperX-style alignment (speaker assignment by maximum segment overlap)
  5. Return segments + filtered audio bytes (timestamps aligned to filtered audio)

In transcription-only mode, steps 3 (diarization) and 4 (alignment) are skipped.

Streaming mode (transcribe / createSession)

The streaming pipeline runs in 8 stages:

  1. AGC2 normalization (optional — enabled via agcEnabled: true)
  2. VAD silence filter (optional compression of long silence)
  3. Audio buffer (stream-safe FIFO with timestamp tracking)
  4. Segmentation (speech activity over rolling windows)
  5. Transcription (Whisper sentence-level segments)
  6. Alignment (segment-level speaker assignment by overlap)
  7. Finalize (flush + final recluster + final alignment)
  8. Callback/event emission (segments updates + audio chunk streaming)

In transcription-only mode, stage 6 (alignment) and the final recluster in stage 7 are skipped, and segments are emitted with an empty speaker field. Each segments event contains only the new segments from that Whisper call (incremental), unlike diarization mode, which re-emits all segments after each recluster (cumulative).

Performance

  • Offline transcription + diarization: ~12x real-time (30s audio in 2.5s)
  • Diarization only: 39x real-time
  • Integrated streaming transcription + diarization: ~14.6x real-time
  • 45-minute Korean meeting test (6 speakers): 2713s audio in 186s (≈14.6x real-time)
  • Each Whisper segment maps 1:1 to a speaker-labeled segment (no merging)
  • Speaker confusion rate: 2.55%

Warnings and Known Issues

Intel Integrated GPU (Iris Xe) - Vulkan driver memory leak

  • Intel Iris Xe (12th/13th gen) Vulkan drivers have a known memory leak when GPU contexts are repeatedly created and destroyed
  • This is a confirmed Intel driver bug (Intel internal tracking ID: 14022504159), not a bug in this library
  • Affects: repeated Pipeline.load() / pipeline.close() cycles in the same process (crashes after ~8-10 cycles)
  • Affects: repeated setExecutionBackend() calls (each call tears down and rebuilds the full GPU context)
  • Does NOT affect: creating/closing sessions (sessions borrow cached GPU contexts, no new Vulkan allocations)
  • Does NOT affect: NVIDIA or AMD discrete GPUs, or Intel Core Ultra (newer gen) integrated GPUs
  • Workaround: load the pipeline once at application startup and reuse it. Close only at shutdown.
  • Reference: https://community.intel.com/t5/Developing-Games-on-Intel/Memory-leaks-on-Intel-Iris-Xe-graphics/td-p/1585566

setExecutionBackend is a heavy operation

  • Each call fully tears down and reloads the model cache (Whisper context, GGML models, Vulkan/OpenVINO backends)
  • Takes ~5-6 seconds on Intel Iris Xe
  • Treat it as a one-time configuration change, not something to call repeatedly
  • On Intel iGPU: limit to 1-2 switches per process lifetime to avoid the driver leak

One operation at a time

The pipeline enforces exclusive access to its GPU resources. Only one of the following can be active at a time:

  • A streaming session (createSession())
  • A one-shot transcription (transcribe())
  • An offline transcription (transcribeOffline())
  • A backend switch (setExecutionBackend())

Attempting to start a second operation while one is active throws an error. Close the current session or wait for the current operation to complete before starting the next one.

If you need to stop the current operation early, call cancel() and wait for the in-flight promise to reject before starting the next operation.

// CORRECT: sequential operations
const session = pipeline.createSession();
// ... push audio, finalize ...
await session.close();
const result = await pipeline.transcribeOffline(audio); // OK — session is closed

// ERROR: concurrent operations
const session1 = pipeline.createSession();
const session2 = pipeline.createSession(); // throws: "A session is already active"
await pipeline.transcribeOffline(audio);   // throws: "Model is busy"

Session creation is cheap

  • createSession() borrows pre-loaded models and GPU contexts from the cache
  • No new Vulkan backends or model loads occur
  • Close the session when done, then create another — safe to repeat unlimited times

// SAFE: load once, create many sessions sequentially
const pipeline = await Pipeline.load(config);
for (const file of files) {
  const session = pipeline.createSession();
  // ... push audio, finalize ...
  await session.close(); // cheap — no GPU teardown
}
pipeline.close(); // once, at shutdown

// DANGEROUS on Intel iGPU: repeated load/close cycles
for (const file of files) {
  const pipeline = await Pipeline.load(config); // creates Vulkan context
  await pipeline.transcribe(audio);
  pipeline.close(); // destroys Vulkan context; the driver leak crashes after ~8-10 cycles
}

Platform Support

Platform                     Low-level (Whisper/VAD)  Pipeline (Transcription + Diarization)
macOS arm64 (Apple Silicon)  Supported                Supported (CoreML + Metal)
Windows x64                  Supported                Supported (Vulkan + optional OpenVINO)
macOS x64 (Intel)            Supported                Not tested
Linux                        Not supported            Not supported

License

MIT
