@tricoteuses/transcription-videos (latest: 1.1.1 on npm)

Obtains the transcription of Assemblée/Sénat videos from an .m3u8 video URL.
tricoteuses-transcription-videos

Node.js/TypeScript pipeline to transcribe French Parliament videos (with speaker diarization), from either a .m3u8 URL or a WAV file extracted via ffmpeg.

  • Output: a JSON array in the Compte-Rendu format of the Assemblée's open data:

    [
      {
        "code_grammaire": "PAROLE_GENERIQUE",
        "ordre_absolu_seance": "4",
        "orateurs": {
          "orateur": {
            "nom": "speaker A",
            "id": "",
            "qualite": ""
          }
        },
        "texte": {
          "_": "Merci monsieur le rapporteur général."
        }
      },
      {
        "code_grammaire": "PAROLE_GENERIQUE",
        "ordre_absolu_seance": "5",
        "orateurs": {
          "orateur": {
            "nom": "speaker D",
            "id": "",
            "qualite": ""
          }
        },
        "texte": {
          "_": "Merci monsieur le président, mesdames et messieurs, ..."
        }
      }
    ]
    

    Timestamps are in milliseconds; speakers are identified by letters (A, B, C…).

  • Plug-and-play architecture via providers: currently AssemblyAI and Deepgram. You can plug additional models later without changing application code.
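The segment objects shown above can be typed as follows (a sketch: the interface name is ours, field names are taken from the JSON example):

```typescript
// Sketch of the Compte-Rendu segment shape shown above.
// The interface name is ours; the fields mirror the JSON example.
interface CompteRenduSegment {
  code_grammaire: string;                 // e.g. "PAROLE_GENERIQUE"
  ordre_absolu_seance: string;            // stringified sequence number
  orateurs: {
    orateur: {
      nom: string;                        // e.g. "speaker A"
      id: string;                         // empty until matched to a deputy
      qualite: string;
    };
  };
  texte: { _: string };                   // transcribed utterance
}

// Convenience accessor for the speaker label of a segment.
function speakerOf(seg: CompteRenduSegment): string {
  return seg.orateurs.orateur.nom;
}
```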

Table of Contents

  • Prerequisites
  • Installation
  • Usage (single video)
  • Usage (batch by reunion UIDs)
  • Usage - Transcription Live
  • Tests
  • Code Architecture
  • Swap providers later
  • Docker
  • License

Prerequisites

  • Node.js ≥ 20
  • npm
  • ffmpeg available in your PATH (to extract audio from .m3u8):
    ffmpeg -version
    
  • Assemblée dataset prepared:
    • Must contain: Agenda__nettoye/ for the target legislature.

Installation

npm install
cp .env.example .env   # add your provider API keys

Usage (single video, useful for model testing)

1) Select your provider in .env

Set TRANSCRIPTION_PROVIDER to deepgram or assemblyai.

2) From a .m3u8 URL (audio extraction + transcription)

# create the output folders if needed
mkdir -p ./out ./audios

npm run transcribe -- \
  --m3u8 "https://videos-an.vodalys.com/.../master.m3u8" \
  --out ./audios/reunion.wav \
  --ss 0 \
  --t 800 \
  --save ./out/transcript-{model_name}.json

CLI Options

  • --ss: start offset (seconds)
  • --t: duration (seconds)
  • --out: WAV path; if omitted, defaults to os.tmpdir()
  • --save: output JSON path (default: ./transcript.json)
  • --lang: language code (e.g., fr) to force the language; otherwise uses the .env default
  • --diarize: set to false to disable diarization (enabled by default)
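The --out fallback described above can be sketched as (the temp file name is our choice, not the script's actual default):

```typescript
import { tmpdir } from "node:os";
import { join } from "node:path";

// Resolve the WAV output path: use --out when given, otherwise fall
// back to a file under os.tmpdir(). The file name here is our choice.
function resolveWavPath(outFlag?: string): string {
  return outFlag ?? join(tmpdir(), "reunion.wav");
}
```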

3) From an existing WAV audio file

npm run transcribe -- \
  --file C:/path/to/reunion.wav \
  --save ./out/transcript-{model_name}.json

Usage (batch by reunion UIDs, useful in prod)

Process only specific Assemblée reunion UIDs using the dataset loaders. For each UID:

  • read reunion.urlVideo,
  • extract audio to ./audios/<uid>.wav (skip ffmpeg if the WAV already exists),
  • transcribe + diarize the full video with the current provider,
  • write segments to $ASSEMBLEE_DATA_DIR/Videos_<ROMAN_LEGISLATURE>_nettoye/<uid>/transcript.json
    (+ info.json with basic metadata).
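The per-UID flow above can be sketched as follows (a sketch only: the helper names and the roman-numeral mapping are assumptions; the real loaders live in src/scripts/transcribe_reunions.ts):

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Convert a legislature number to the roman numeral used in the
// Videos_<ROMAN_LEGISLATURE>_nettoye directory name.
function toRoman(n: number): string {
  const pairs: [number, string][] = [
    [1000, "M"], [900, "CM"], [500, "D"], [400, "CD"], [100, "C"],
    [90, "XC"], [50, "L"], [40, "XL"], [10, "X"], [9, "IX"],
    [5, "V"], [4, "IV"], [1, "I"],
  ];
  let out = "";
  for (const [value, numeral] of pairs) {
    while (n >= value) { out += numeral; n -= value; }
  }
  return out;
}

// Compute paths for one reunion and decide whether ffmpeg must run:
// extraction is skipped when the WAV already exists, unless --reextract.
function planReunion(opts: {
  uid: string;
  legislature: number;
  dataDir: string;
  audioDir?: string;
  reextract?: boolean;
}) {
  const wavPath = join(opts.audioDir ?? "./audios", `${opts.uid}.wav`);
  const transcriptPath = join(
    opts.dataDir,
    `Videos_${toRoman(opts.legislature)}_nettoye`,
    opts.uid,
    "transcript.json",
  );
  const needsExtraction = opts.reextract === true || !existsSync(wavPath);
  return { wavPath, transcriptPath, needsExtraction };
}
```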

CLI Options

  • --dataDir: Absolute path to Assemblée dataset (or as 1st positional). Required.
  • -l, --legislature: Legislature number (e.g., 16 or 17).
  • -s, --fromSession: Session number to start from (Sénat only).
  • --uids: Comma-separated UIDs (e.g., uid1,uid2).
  • --uid: Repeatable UID flag (can be used multiple times).
  • --lang, --language: Language code (e.g., fr).
  • --diarize: Enable diarization (default: true).
  • --no-diarize: Disable diarization.
  • --keepWav: Keep extracted WAV files (default: true).
  • --no-keepWav: Delete WAV after successful transcription.
  • --audioDir: Directory for WAV files (default: ./audios).
  • --reextract: Force re-extraction even if WAV exists (default: false).
  • --ss: Start offset (seconds).
  • --t: Duration (seconds).
  • -p, --provider: Transcription provider (assemblyai | deepgram).

Examples

Transcribe all reunions from the 17th legislature (max 50):

npm run transcribe:reunions ../assemblee-data -- --legislature 17 --provider assemblyai --max 50

Force re-extraction and set a start offset to trim the WAV:

npm run transcribe:reunions -- --dataDir /abs/path/assemblee-data -l 17 --uid RUANR5... \
  --reextract true --ss 796

# Transcribe one AN
npm run transcribe:reunions ../assemblee-data -- --legislature 17 --transcriptsDir ../assemblee-data/transcripts --audioDir ../assemblee-data/audios --provider deepgram --chambre AN --uid RUANR5L17S2025IDC453375
# Transcribe one SN
npm run transcribe:reunions ../senat-data -- --fromSession 2025 --transcriptsDir ../senat-data/transcripts --audioDir ../senat-data/audios --provider deepgram --chambre SN --uid RUSN20251016IDODDF-900

Usage - Transcription Live

Live transcription continuously transcribes an HLS .m3u8 stream, with automatic retries and a clean stop when the stream ends.
It is designed for one job per live (e.g. one Kubernetes pod per debate).

Basic CLI usage

npm run transcribe:live -- \
  --url "https://videos-an.vodalys.com/live/.../index.m3u8" \
  --out ./live-transcripts/live-$(date +%s).ndjson

Live CLI options

  • --url (required): HLS .m3u8 live URL
  • --out: NDJSON output file (default: ./live-transcripts/live-<timestamp>.ndjson)
  • --lang: language code (default from .env)
  • --diarize / --no-diarize: enable/disable diarization (default: enabled)
  • --provider: transcription provider (deepgram, assemblyai, …)
  • --model: provider-specific model (optional)
  • --punctuate / --no-punctuate: enable/disable punctuation
  • --maxMinutes: stop automatically after N minutes (POC / safety)

Output format (NDJSON)

The output file is append-only, one JSON object per line:

{"type":"meta","msg":"live transcription start","url":"..."}
{"type":"segment","start_ms":123400,"end_ms":127800,"speaker":"Speaker A","text":"Hello everyone"}
{"type":"segment","start_ms":128000,"end_ms":132200,"speaker":"Speaker B","text":"Thank you"}
{"type":"meta","msg":"session ended","durSec":6400}

This format allows streaming ingestion, retries without duplicates, and easy replay.
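A consumer replaying the NDJSON file can drop already-seen segments, for example keyed on start_ms (a sketch; the dedupe key is our choice):

```typescript
// Parse NDJSON live output and keep each segment once, keyed on start_ms.
// The line shapes follow the example above; the dedupe key is our choice.
type LiveLine =
  | { type: "meta"; msg: string; [k: string]: unknown }
  | { type: "segment"; start_ms: number; end_ms: number; speaker: string; text: string };

function dedupeSegments(ndjson: string) {
  const seen = new Set<number>();
  const segments: Extract<LiveLine, { type: "segment" }>[] = [];
  for (const raw of ndjson.split("\n")) {
    const line = raw.trim();
    if (!line) continue;                  // tolerate blank lines
    const obj = JSON.parse(line) as LiveLine;
    if (obj.type !== "segment" || seen.has(obj.start_ms)) continue;
    seen.add(obj.start_ms);
    segments.push(obj);
  }
  return segments;
}
```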

Error handling & retries

  • On stream errors or disconnects, the script retries automatically with a short backoff.
  • If a session ends too quickly, it is retried until the minimum valid duration is reached.
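That retry behaviour can be sketched as follows (the constants and function names are illustrative, not the script's actual values):

```typescript
// Retry a live session with a short, growing backoff. A session shorter
// than minValidSec counts as a failure and is retried. Constants here
// are illustrative, not the script's actual values.
async function runWithRetries(
  session: () => Promise<number>,   // resolves with session duration (s)
  opts = { maxRetries: 5, backoffMs: 2000, minValidSec: 10 },
): Promise<number> {
  for (let attempt = 0; ; attempt++) {
    try {
      const durSec = await session();
      if (durSec >= opts.minValidSec) return durSec;   // clean end
      throw new Error(`session too short (${durSec}s)`);
    } catch (err) {
      if (attempt >= opts.maxRetries) throw err;       // give up
      await new Promise((r) => setTimeout(r, opts.backoffMs * (attempt + 1)));
    }
  }
}
```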

Production integration

Typical flow:

  • API detects a new DebatDirect
  • A job/pod is started for this live
  • transcribe:live runs for this single stream
  • Segments are pushed incrementally to the API
  • When the live ends, the job exits and the debate is marked TERMINE

Rule: 1 live = 1 process.

Tests

Provider benchmark on fixtures

An opt-in integration test compares deepgram, assemblyai and mistral on fixture videos and prints a ranking based on:

  • average WER (word error rate, lower is better)
  • average transcription latency in milliseconds (tie-breaker)
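WER is the word-level edit distance divided by the reference word count. A minimal version of what the benchmark measures:

```typescript
// Word error rate: Levenshtein distance over words / reference length.
// Minimal illustration of the benchmark's primary metric; the actual
// test may normalize text differently.
function wer(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  // dp[i][j] = edit distance between ref[0..i) and hyp[0..j)
  const dp = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                        // deletion
        dp[i][j - 1] + 1,                                        // insertion
        dp[i - 1][j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1),  // substitution
      );
    }
  }
  return ref.length === 0 ? 0 : dp[ref.length][hyp.length] / ref.length;
}
```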

Run it with:

RUN_PROVIDER_BENCHMARK_TEST=true \
PROVIDER_BENCHMARK_MAX_FIXTURES=2 \
npm run test -- tests/integration/providers.fixtures.benchmark.test.ts

Required API keys:

  • DEEPGRAM_API_KEY
  • ASSEMBLYAI_API_KEY
  • MISTRAL_API_KEY

The test is skipped by default unless RUN_PROVIDER_BENCHMARK_TEST=true.

Code Architecture

src/
├─ config/
│  └─ env.ts                   # .env loading & validation
├─ types/
│  └─ transcription.ts         # common types (segments in ms, speakers, metadata)
├─ providers/
│  ├─ TranscriptionProvider.ts # generic interface
│  ├─ assemblyai.ts            # AssemblyAI implementation
│  ├─ deepgram.ts              # Deepgram implementation
│  ├─ mistral.ts               # Mistral implementation
│  └─ index.ts                 # provider factory based on .env
├─ utils/
│  ├─ ffmpeg.ts                # .m3u8 → WAV mono 16k extraction
│  └─ transcribe.ts            # single function used by scripts/services
└─ scripts/
   ├─ transcribe_reunions.ts
   └─ transcribe_live.ts

Swap providers later

Application code always calls:

const result = await transcribeVideo({
  filePath: '/tmp/reunion.wav',
  language: 'fr',
  diarize: true,
});

To add another provider:

  • Create src/providers/myProvider.ts implementing TranscriptionProvider.
  • Add a case in src/providers/index.ts and a .env value (TRANSCRIPTION_PROVIDER=myProvider).
  • Map the new API’s response to the same types (segments in ms, letter speakers).
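A new provider would look roughly like this (a sketch: the interface shape is inferred from this README, not copied from src/providers/TranscriptionProvider.ts, and the API response format is invented):

```typescript
// Hypothetical provider sketch. The real interface lives in
// src/providers/TranscriptionProvider.ts; the shape here is inferred
// from the segment format described in this README.
interface Segment {
  start_ms: number;
  end_ms: number;
  speaker: string;   // letter: "A", "B", …
  text: string;
}

interface TranscriptionProvider {
  transcribe(opts: {
    filePath: string;
    language?: string;
    diarize?: boolean;
  }): Promise<Segment[]>;
}

// Map a hypothetical API response (seconds + numeric speaker ids)
// to the common shape (milliseconds + letter speakers).
class MyProvider implements TranscriptionProvider {
  async transcribe(): Promise<Segment[]> {
    const apiResponse = [{ start: 1.2, end: 3.4, speakerId: 0, words: "Bonjour" }];
    return apiResponse.map((u) => ({
      start_ms: Math.round(u.start * 1000),
      end_ms: Math.round(u.end * 1000),
      speaker: String.fromCharCode(65 + u.speakerId), // 0 → "A"
      text: u.words,
    }));
  }
}
```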

Docker

Build the image

docker build -t transcriber:dev .

Run (replace ABSOLUTE_PATH_TO_ASSEMBLEE_DATA with the absolute path to your Assemblée dataset)

docker run \
  --env-file .env \
  -e LEGISLATURE=17 \
  -v "/ABSOLUTE_PATH_TO_ASSEMBLEE_DATA:/app/assemblee-data" \
  transcriber:dev

License

AGPL-3.0-or-later

Keywords

Assemblée nationale

Package last updated on 30 Apr 2026
