@tricoteuses/transcription-videos (latest: 1.1.1 on npm)

Obtains the transcription of Assemblée/Sénat videos from an .m3u8 video URL.
tricoteuses-transcription-videos

Node.js/TypeScript pipeline to transcribe French Parliament videos (with speaker diarization), from either a .m3u8 URL or a WAV file extracted via ffmpeg.

  • Output: a JSON array in the Compte-Rendu format of the Assemblée's open data:

    [
      {
        "code_grammaire": "PAROLE_GENERIQUE",
        "ordre_absolu_seance": "4",
        "orateurs": {
          "orateur": {
            "nom": "speaker A",
            "id": "",
            "qualite": ""
          }
        },
        "texte": {
          "_": "Merci monsieur le rapporteur général."
        }
      },
      {
        "code_grammaire": "PAROLE_GENERIQUE",
        "ordre_absolu_seance": "5",
        "orateurs": {
          "orateur": {
            "nom": "speaker D",
            "id": "",
            "qualite": ""
          }
        },
        "texte": {
          "_": "Merci monsieur le président, mesdames et messieurs, ..."
        }
      }
    ]
    

    Timestamps are in milliseconds; speakers are identified by letters (A, B, C…).

  • Plug-and-play architecture via providers: currently AssemblyAI and Deepgram. You can plug additional models later without changing application code.
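The segment objects shown above can be typed as follows (a sketch: the interface name is ours, field names are taken from the JSON example):

```typescript
// Sketch of the Compte-Rendu segment shape shown above.
// The interface name is ours; the fields mirror the JSON example.
interface CompteRenduSegment {
  code_grammaire: string;                 // e.g. "PAROLE_GENERIQUE"
  ordre_absolu_seance: string;            // stringified sequence number
  orateurs: {
    orateur: {
      nom: string;                        // e.g. "speaker A"
      id: string;                         // empty until matched to a deputy
      qualite: string;
    };
  };
  texte: { _: string };                   // transcribed utterance
}

// Convenience accessor for the speaker label of a segment.
function speakerOf(seg: CompteRenduSegment): string {
  return seg.orateurs.orateur.nom;
}
```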

Table of Contents

  • Prerequisites
  • Installation
  • Usage (single video)
  • Usage (batch by reunion UIDs)
  • Usage - Transcription Live
  • Tests
  • Code Architecture
  • Swap providers later
  • Docker
  • License

Prerequisites

  • Node.js ≥ 20
  • npm
  • ffmpeg available in your PATH (to extract audio from .m3u8):
    ffmpeg -version
    
  • Assemblée dataset prepared:
    • Must contain: Agenda__nettoye/ for the target legislature.

Installation

npm install
cp .env.example .env   # add your provider API keys

Usage (single video, useful for model testing)

1) Select your provider in .env

Set TRANSCRIPTION_PROVIDER to deepgram or assemblyai.

2) From a .m3u8 URL (audio extraction + transcription)

# create the output folders if needed
mkdir -p ./out ./audios

npm run transcribe -- \
  --m3u8 "https://videos-an.vodalys.com/.../master.m3u8" \
  --out ./audios/reunion.wav \
  --ss 0 \
  --t 800 \
  --save ./out/transcript-{model_name}.json

CLI Options

  • --ss: start offset (seconds)
  • --t: duration (seconds)
  • --out: WAV path; if omitted, defaults to os.tmpdir()
  • --save: output JSON path (default: ./transcript.json)
  • --lang: language code (e.g., fr) to force the language; otherwise uses the .env default
  • --diarize: set to false to disable diarization (enabled by default)
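The --out fallback described above can be sketched as (the temp file name is our choice, not the script's actual default):

```typescript
import { tmpdir } from "node:os";
import { join } from "node:path";

// Resolve the WAV output path: use --out when given, otherwise fall
// back to a file under os.tmpdir(). The file name here is our choice.
function resolveWavPath(outFlag?: string): string {
  return outFlag ?? join(tmpdir(), "reunion.wav");
}
```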

3) From an existing WAV audio file

npm run transcribe -- \
  --file C:/path/to/reunion.wav \
  --save ./out/transcript-{model_name}.json

Usage (batch by reunion UIDs, useful in prod)

Process only specific Assemblée reunion UIDs using the dataset loaders. For each UID:

  • read reunion.urlVideo,
  • extract audio to ./audios/<uid>.wav (skip ffmpeg if the WAV already exists),
  • transcribe + diarize the full video with the current provider,
  • write segments to $ASSEMBLEE_DATA_DIR/Videos_<ROMAN_LEGISLATURE>_nettoye/<uid>/transcript.json
    (+ info.json with basic metadata).
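The per-UID flow above can be sketched as follows (a sketch only: the helper names and the roman-numeral mapping are assumptions; the real loaders live in src/scripts/transcribe_reunions.ts):

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Convert a legislature number to the roman numeral used in the
// Videos_<ROMAN_LEGISLATURE>_nettoye directory name.
function toRoman(n: number): string {
  const pairs: [number, string][] = [
    [1000, "M"], [900, "CM"], [500, "D"], [400, "CD"], [100, "C"],
    [90, "XC"], [50, "L"], [40, "XL"], [10, "X"], [9, "IX"],
    [5, "V"], [4, "IV"], [1, "I"],
  ];
  let out = "";
  for (const [value, numeral] of pairs) {
    while (n >= value) { out += numeral; n -= value; }
  }
  return out;
}

// Compute paths for one reunion and decide whether ffmpeg must run:
// extraction is skipped when the WAV already exists, unless --reextract.
function planReunion(opts: {
  uid: string;
  legislature: number;
  dataDir: string;
  audioDir?: string;
  reextract?: boolean;
}) {
  const wavPath = join(opts.audioDir ?? "./audios", `${opts.uid}.wav`);
  const transcriptPath = join(
    opts.dataDir,
    `Videos_${toRoman(opts.legislature)}_nettoye`,
    opts.uid,
    "transcript.json",
  );
  const needsExtraction = opts.reextract === true || !existsSync(wavPath);
  return { wavPath, transcriptPath, needsExtraction };
}
```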

CLI Options

  • --dataDir: Absolute path to Assemblée dataset (or as 1st positional). Required.
  • -l, --legislature: Legislature number (e.g., 16 or 17).
  • -s, --fromSession: Session number to start from (Sénat only).
  • --uids: Comma-separated UIDs (e.g., uid1,uid2).
  • --uid: Repeatable UID flag (can be used multiple times).
  • --lang, --language: Language code (e.g., fr).
  • --diarize: Enable diarization (default: true).
  • --no-diarize: Disable diarization.
  • --keepWav: Keep extracted WAV files (default: true).
  • --no-keepWav: Delete WAV after successful transcription.
  • --audioDir: Directory for WAV files (default: ./audios).
  • --reextract: Force re-extraction even if WAV exists (default: false).
  • --ss: Start offset (seconds).
  • --t: Duration (seconds).
  • -p, --provider: Transcription provider (assemblyai | deepgram).

Examples

Transcribe all reunions from the 17th legislature (max 50):

npm run transcribe:reunions ../assemblee-data -- --legislature 17 --provider assemblyai --max 50

Force re-extraction and set a start offset to trim the WAV:

npm run transcribe:reunions -- --dataDir /abs/path/assemblee-data -l 17 --uid RUANR5... \
  --reextract true --ss 796

# Transcribe one AN
npm run transcribe:reunions ../assemblee-data -- --legislature 17 --transcriptsDir ../assemblee-data/transcripts --audioDir ../assemblee-data/audios --provider deepgram --chambre AN --uid RUANR5L17S2025IDC453375
# Transcribe one SN
npm run transcribe:reunions ../senat-data -- --fromSession 2025 --transcriptsDir ../senat-data/transcripts --audioDir ../senat-data/audios --provider deepgram --chambre SN --uid RUSN20251016IDODDF-900

Usage - Transcription Live

Live transcription continuously transcribes an HLS .m3u8 stream, with automatic retries and a clean stop when the stream ends.
It is designed for one job per live (e.g. one Kubernetes pod per debate).

Basic CLI usage

npm run transcribe:live -- \
  --url "https://videos-an.vodalys.com/live/.../index.m3u8" \
  --out ./live-transcripts/live-$(date +%s).ndjson

Live CLI options

  • --url (required): HLS .m3u8 live URL
  • --out: NDJSON output file (default: ./live-transcripts/live-<timestamp>.ndjson)
  • --lang: language code (default from .env)
  • --diarize / --no-diarize: enable/disable diarization (default: enabled)
  • --provider: transcription provider (deepgram, assemblyai, …)
  • --model: provider-specific model (optional)
  • --punctuate / --no-punctuate: enable/disable punctuation
  • --maxMinutes: stop automatically after N minutes (POC / safety)

Output format (NDJSON)

The output file is append-only, one JSON object per line:

{"type":"meta","msg":"live transcription start","url":"..."}
{"type":"segment","start_ms":123400,"end_ms":127800,"speaker":"Speaker A","text":"Hello everyone"}
{"type":"segment","start_ms":128000,"end_ms":132200,"speaker":"Speaker B","text":"Thank you"}
{"type":"meta","msg":"session ended","durSec":6400}

This format allows streaming ingestion, retries without duplicates, and easy replay.
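A consumer replaying the NDJSON file can drop already-seen segments, for example keyed on start_ms (a sketch; the dedupe key is our choice):

```typescript
// Parse NDJSON live output and keep each segment once, keyed on start_ms.
// The line shapes follow the example above; the dedupe key is our choice.
type LiveLine =
  | { type: "meta"; msg: string; [k: string]: unknown }
  | { type: "segment"; start_ms: number; end_ms: number; speaker: string; text: string };

function dedupeSegments(ndjson: string) {
  const seen = new Set<number>();
  const segments: Extract<LiveLine, { type: "segment" }>[] = [];
  for (const raw of ndjson.split("\n")) {
    const line = raw.trim();
    if (!line) continue;                  // tolerate blank lines
    const obj = JSON.parse(line) as LiveLine;
    if (obj.type !== "segment" || seen.has(obj.start_ms)) continue;
    seen.add(obj.start_ms);
    segments.push(obj);
  }
  return segments;
}
```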

Error handling & retries

  • On stream errors or disconnects, the script retries automatically with a short backoff.
  • If a session ends too quickly, it is retried until the minimum valid duration is reached.
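That retry behaviour can be sketched as follows (the constants and function names are illustrative, not the script's actual values):

```typescript
// Retry a live session with a short, growing backoff. A session shorter
// than minValidSec counts as a failure and is retried. Constants here
// are illustrative, not the script's actual values.
async function runWithRetries(
  session: () => Promise<number>,   // resolves with session duration (s)
  opts = { maxRetries: 5, backoffMs: 2000, minValidSec: 10 },
): Promise<number> {
  for (let attempt = 0; ; attempt++) {
    try {
      const durSec = await session();
      if (durSec >= opts.minValidSec) return durSec;   // clean end
      throw new Error(`session too short (${durSec}s)`);
    } catch (err) {
      if (attempt >= opts.maxRetries) throw err;       // give up
      await new Promise((r) => setTimeout(r, opts.backoffMs * (attempt + 1)));
    }
  }
}
```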

Production integration

Typical flow:

  • API detects a new DebatDirect
  • A job/pod is started for this live
  • transcribe:live runs for this single stream
  • Segments are pushed incrementally to the API
  • When the live ends, the job exits and the debate is marked TERMINE

Rule: 1 live = 1 process.

Tests

Provider benchmark on fixtures

An opt-in integration test compares deepgram, assemblyai and mistral on fixture videos and prints a ranking based on:

  • average WER (word error rate, lower is better)
  • average transcription latency in milliseconds (tie-breaker)
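WER is the word-level edit distance divided by the reference word count. A minimal version of what the benchmark measures:

```typescript
// Word error rate: Levenshtein distance over words / reference length.
// Minimal illustration of the benchmark's primary metric; the actual
// test may normalize text differently.
function wer(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  // dp[i][j] = edit distance between ref[0..i) and hyp[0..j)
  const dp = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                        // deletion
        dp[i][j - 1] + 1,                                        // insertion
        dp[i - 1][j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1),  // substitution
      );
    }
  }
  return ref.length === 0 ? 0 : dp[ref.length][hyp.length] / ref.length;
}
```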

Run it with:

RUN_PROVIDER_BENCHMARK_TEST=true \
PROVIDER_BENCHMARK_MAX_FIXTURES=2 \
npm run test -- tests/integration/providers.fixtures.benchmark.test.ts

Required API keys:

  • DEEPGRAM_API_KEY
  • ASSEMBLYAI_API_KEY
  • MISTRAL_API_KEY

The test is skipped by default unless RUN_PROVIDER_BENCHMARK_TEST=true.

Code Architecture

src/
├─ config/
│  └─ env.ts                   # .env loading & validation
├─ types/
│  └─ transcription.ts         # common types (segments in ms, speakers, metadata)
├─ providers/
│  ├─ TranscriptionProvider.ts # generic interface
│  ├─ assemblyai.ts            # AssemblyAI implementation
│  ├─ deepgram.ts              # Deepgram implementation
│  ├─ mistral.ts               # Mistral implementation
│  └─ index.ts                 # provider factory based on .env
├─ utils/
│  ├─ ffmpeg.ts                # .m3u8 → WAV mono 16k extraction
│  └─ transcribe.ts            # single function used by scripts/services
└─ scripts/
   ├─ transcribe_reunions.ts
   └─ transcribe_live.ts

Swap providers later

Application code always calls:

const result = await transcribeVideo({
  filePath: '/tmp/reunion.wav',
  language: 'fr',
  diarize: true,
});

To add another provider:

  • Create src/providers/myProvider.ts implementing TranscriptionProvider.
  • Add a case in src/providers/index.ts and a .env value (TRANSCRIPTION_PROVIDER=myProvider).
  • Map the new API’s response to the same types (segments in ms, letter speakers).
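A new provider would look roughly like this (a sketch: the interface shape is inferred from this README, not copied from src/providers/TranscriptionProvider.ts, and the API response format is invented):

```typescript
// Hypothetical provider sketch. The real interface lives in
// src/providers/TranscriptionProvider.ts; the shape here is inferred
// from the segment format described in this README.
interface Segment {
  start_ms: number;
  end_ms: number;
  speaker: string;   // letter: "A", "B", …
  text: string;
}

interface TranscriptionProvider {
  transcribe(opts: {
    filePath: string;
    language?: string;
    diarize?: boolean;
  }): Promise<Segment[]>;
}

// Map a hypothetical API response (seconds + numeric speaker ids)
// to the common shape (milliseconds + letter speakers).
class MyProvider implements TranscriptionProvider {
  async transcribe(): Promise<Segment[]> {
    const apiResponse = [{ start: 1.2, end: 3.4, speakerId: 0, words: "Bonjour" }];
    return apiResponse.map((u) => ({
      start_ms: Math.round(u.start * 1000),
      end_ms: Math.round(u.end * 1000),
      speaker: String.fromCharCode(65 + u.speakerId), // 0 → "A"
      text: u.words,
    }));
  }
}
```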

Docker

Build the image

docker build -t transcriber:dev .

Run (replace ABSOLUTE_PATH_TO_ASSEMBLEE_DATA with the absolute path to your Assemblée dataset)

docker run \
  --env-file .env \
  -e LEGISLATURE=17 \
  -v "/ABSOLUTE_PATH_TO_ASSEMBLEE_DATA:/app/assemblee-data" \
  transcriber:dev

License

AGPL-3.0-or-later

Keywords

Assemblée nationale

Package last updated on 30 Apr 2026
