🚀 DAY 5 OF LAUNCH WEEK:Introducing Webhook Events for Alert Changes.Learn more →

Book a Demo Install Sign in

convokit

Package Overview

Advanced tools

Install Socket

Detect and block malicious and high-risk dependencies

Install

convokit

A flexible TypeScript framework for ingesting, processing, and exporting chat/conversation data for LLM training and analysis.

latest

Source

npm

Version: 1.0.2

Version published: 7 months ago

Maintainers: 1

Created: 7 months ago

Source

ConvoKit: Flexible Conversation Processing & Export Toolkit

ConvoKit is a TypeScript-based framework for ingesting, normalizing, filtering, sampling, formatting, and exporting chat or conversation data for LLM training and analysis. It provides:

A provider registry to plug in new data sources (Discord, Slack, custom exports, etc.).
A plugin registry for formatters, converters, and filters to transform and export data to ChatML, Gemini JSONL, custom context formats, and more.
A fully configurable, extensible pipeline: ingest → normalize → filter → importance‑score → sample → format → export.

ConvoKit saves you time building data preprocessing pipelines and lets you focus on models and prompts.

Key Features
What It Can & Cannot Do
Who Should Use It
Installation
Quick Start
Configuration
CLI Usage
Provider Registry
- Built‑in Providers
- Writing Your Own Provider
Plugin Registry
- Formatters
- Converters
- Filters
- Writing Your Own Plugin
Contributing
License

Key Features

Dynamic Provider Loading
Automatically discover and load data providers from your project’s providers folder.
Normalized Conversation Format
All data converges to a ConvoKitConversation interface: metadata + message arrays.
Context Formatting
Generate a single, line-delimited training string (CKContext) with options for time‑gaps, new‑conversation markers, and importance scoring.
Turn‑List Conversion
Break context into turn lists (CKTurnListConversation) for sampling or LLM‑specific export.
Weighted Sampling
Sample by conversation importance to focus on high‑value exchanges.
Export Plugins
Export to ChatML JSONL, Gemini JSONL, or add your own converter for other LLM formats.
Filter Plugins
Drop unwanted messages (e.g. links‑only, emoji‑only, code‑only) via a simple plugin API.

What It Can & Cannot Do

Can:

Ingest JSON exports from Discord (via DiscordChatExporter), or any custom source you add via the Provider Registry.
Normalize and filter conversations by message content, length, or custom rules.
Score message & conversation importance automatically based on time, length, and frequency.
Sample highly‑important conversations for training budgets.
Export to popular LLM chat formats (ChatML, Gemini), or easily extendable.

Cannot:

Perform LLM inference or model training directly. - Yet ;)
Resolve references across conversations (thread linking across channels).
Guarantee perfect import schema for every data source—you may need to write a provider to handle custom formats.
Handle binary or non‑JSON data without extending a provider to preprocess it.

Who Should Use It

NLP / ML Engineers preparing chat‑based LLM fine‑tuning or analysis datasets.
Bot / Chat Service Developers needing to transform raw chat logs into structured training data.
Researchers studying conversation dynamics or designing importance‑based sampling strategies.
Community Contributors eager to add support for new platforms or export formats.

Possibly upcomming features

Personality Generate a deep and comprehensive personality prompt based off your output ck_context
Fine-tuning Fine-tune models with exported training data (Currently mainly looking at Gemini) (Contributions welcome!)
Model Testing Test your fine-tuned model via the terminal (Currently mainly looking at Gemini) (Contributions welcome!)
Unit Tests Adding unit tests would help keep everything maintainable and stable (or so i've heard)

Installation

# Install globally (recommended for CLI use)
npm install -g convokit

# Or install locally in your project
npm install convokit

Quick Start (Using the Library)

import { ConvoKit, loadConfig, getConfig } from 'convokit';
import { config } from 'dotenv';

config();
await loadConfig();

async function run() {
  const ck = new ConvoKit();
  await ck.loadProviders(); // This will load all included providers, and the providers in the LocalProvidersDir if set (in config)
  // We also automatically load all included plugins & the plugins in LocalPluginsDir if set (in config)
  const convoData = await ck.processDataFromProviders();

  const context = await ck.parseToContext({ targetUsers: getConfig().targetUsers });
  await ck.convertToCKTurnList();
  await ck.getWeightedSample(getConfig().sampleSize);
  const chatml = await ck.exportToChatML(getConfig().systemPrompt);
  const gemini = await ck.exportToGemini(getConfig().systemPrompt);
  // Do whatever you want with the outputs
}
run();

Make sure you have set up providers and dir structure first

Configuration

By default, ConvoKit reads convokit.config.json or environment variables - Here is an example config file

{
  "inputDataDirName": "InputData",
  "outputDataDirName": "OutputData",
  "targetUsers": [
    { "providerId": "discord", "id": "YOUR_DISCORD_USER_ID" }
  ],
  "sampleSize": 5000,
  "systemPrompt": "You are a helpful assistant.",
  "minImportanceChat": 120,
  "minImportanceMessage": 100,
  "enableDebugging": false,
  "enablePerformanceStats": false,
  "shouldMergeConsecutiveMessages": true,
  "enableWarnings": true,
  "anonymizeProviderConversationIds": false,
  "localProvidersDir": "LocalProviders",
  "localPluginsDir": "LocalPlugins",
}

Key	Description
inputDataDirName	Directory containing raw chat exports (relative to project root).
outputDataDirName	Directory to write formatted outputs.
targetUsers	JSON array mapping each provider to a target user ID for context generation.
sampleSize	Number of conversations to sample by importance.
systemPrompt	System prompt used in ChatML/Gemini exports.
minImportanceChat (optional)	Minimum average importance score for a conversation (default: 120).
minImportanceMessage (optional)	Minimum importance score for a single message (default: 100).
enableDebugging (optional)	Enable or disable debug-level logs.
enablePerformanceStats (optional)	Enable or disable performance stats (timers).
shouldMergeConsecutiveMessages (optional)	Merge consecutive messages when converting to CKTurnList.
enableWarnings (optional)	Toggle the display of warning messages.
anonymizeProviderConversationIds (optional)	Anonymize provider conversation IDs to protect sensitive data.
localProviderDirectory (optional)	Directory name of where to load custom providers from.
localPluginDirectory (optional)	Directory name of where to load custom plugins from. (Contains a folder for each plugin type (formatters, filters, converters)! )

Directory Structure

In your convokit.config.json file you set a inputDataDirName, in here you will need to have a directory for each provider. In there you should store the exported data.

Example for use with the Discord provider, with inputDataDirName set to InputData:

convokit/
├── index.ts
├── convokit.config.json
├── ... other files and folders
└── InputData
    └── discord
        └── Direct Messages - fishylunar [000000000000000].json

Note: the filenames of the exported data doesnt matter, but the extension does.

CLI Usage

ConvoKit provides a command-line interface (CLI) for running the processing pipeline without writing TypeScript code. Ensure you have a valid convokit.config.json file in your project root or have set the corresponding environment variables.

Running Commands:

# If installed globally
convokit <command> [options]

# If installed locally, using npx
npx convokit <command> [options]

# Or via package.json script
# "scripts": { "ck": "convokit" }
# npm run ck -- <command> [options]

Common Options:

-p, --providers <ids>: Specify a comma-separated list of provider IDs (e.g., discord,telegram) to process data from. If omitted, ConvoKit will attempt to use data from all providers found in your inputDataDirName that are registered.
-o, --output <file>: Specify an output file path to save the results of commands like context or export. If omitted, results are generated but not saved to a file (stats/logs will still be shown).

Commands:

create-config (alias: cfg): Creates an example convokit.config.json file in the current directory. Run this first if you don't have a config file.
```
convokit create-config
```
providers: Lists all registered providers (built-in and local) found by ConvoKit, including their ID, name, version, and expected input directory/extension. Useful for verifying provider setup and getting IDs for the --providers option.
```
convokit providers
```
plugins: Lists all registered plugins (formatters, converters, filters), including built-in and local ones. Shows plugin ID, name, and version. Useful for finding the <converter_id> for the export command.
```
convokit plugins
```

context: Processes data from specified (or all) providers and generates the CKContext output based on your configuration (targetUsers, importance scores, etc.).

# Generate context from all providers and save to context.txt
convokit context -o context.txt

# Generate context using only 'discord' provider data and save
convokit context --providers discord -o discord_context.txt

# Generate context from all providers and save to context.json including stats
convokit context -o context.json --stats

export <converter_id>: Runs the full pipeline: loads data, processes it, generates context, converts to turn list, performs weighted sampling (using sampleSize from config), and finally exports the data using the specified <converter_id>.

# Export data using the 'chatml' converter, save to chatml_export.jsonl
convokit export chatml -o chatml_export.jsonl

# Export using 'gemini' converter from 'telegram' provider only, save output
convokit export gemini --providers telegram -o telegram_gemini.jsonl

Example Workflow:

# 1. Create a config file if you don't have one
convokit create-config
# (Edit convokit.config.json with your settings: input dir, target users, etc.)

# 2. Check which providers are available
convokit providers
# Output might show: ID: discord, ID: telegram

# 3. Check available export formats (converters)
convokit plugins
# Output might show Converters: ID: chatml, ID: gemini

# 4. Run the full export pipeline for ChatML using all providers
convokit export chatml -o training_data.jsonl

# 5. (Alternative) Generate only the CKContext for analysis
convokit context -o analysis_context.json

Provider Registry

ConvoKit discovers providers from providers via ProviderRegistry. Each provider must:

Implement ConvoKitProvider with Test() and Convert().
Export a static ProviderInfo object.
Register itself via ProviderRegistry.register(id, ProviderClass, ProviderInfo).

Built‑in Providers

Discord (providers/discord.ts): Reads JSON exports from DiscordChatExporter.
Telegram (providers/telegram.ts): Reads JSON exports from the Telegram Desktop app.

Contributions are more than welcome! <3

Writing Your Own Provider

Create /providers/MyPlatform.ts.

To make a local provider, put the MyPlatform.ts file in the LocalProvidersDir you specified in your config. If you are contributing and making a provider to be included in ConvoKit, put it in /providers/MyPlatform.ts

Define your data schema, compatibility check, and conversion:

export const ProviderInfo = {
  name: "MyPlatform Exporter",
  description: "Imports MyPlatform chat JSON.",
  version: "1.0.0",
  author: "You",
  InputDataInfo: { directoryName: "MyPlatform", fileExtension: ".json" }
};

export class Provider implements ConvoKitProvider {
  constructor(private raw: any) {}
  Test(): boolean {
    // return true if raw matches your schema
  }
  Convert(): ConvoKitConversation {
    // transform raw → ConvoKitConversation
  }
}

// Self-register
ProviderRegistry.register("myplatform", Provider, ProviderInfo);

Place your exports in InputData/MyPlatform/*.json.
Run ck.loadProviders() and ck.processDataFromProviders() to include your data.

Plugin Registry

Plugins extend ConvoKit’s pipeline at three points:

Formatters (formatters)
Converters (converters)
Filters (filters)

They self‑register via PluginRegistry.registerFormatter/Converter/Filter().

Formatters

Context Formatter (id: context): Builds the CKContext string with importance and markers.

Converters

ChatML Converter (id: chatml): Exports LLM chatml JSONL.
Gemini Converter (id: gemini): Exports Gemini‑style JSONL.

Filters

LinkOnlyFilter (id: link-only): Excludes messages that are URLs only.

Writing Your Own Plugin

Formatters

export class MyFormatter implements FormatterPluginClass {
  PluginInfo = { id: "myfmt", name: "...", type: "formatter", version: "1.0.0" };
  apply(data, options) { /* return CKContextResult */ }
}
PluginRegistry.registerFormatter(MyFormatter);

Converters

export class MyConverter implements ConverterPluginClass {
  PluginInfo = { id: "myconv", name: "...", type: "converter", version: "1.0.0" };
  async apply(convs, prompt) { /* return string[] */ }
}
PluginRegistry.registerConverter(MyConverter);

Filters

export class MyFilter implements FilterPluginClass {
  PluginInfo = { id: "myfilter", name: "...", type: "filter", version: "1.0.0" };
  filterType: 'MUST' | 'MUST_NOT' = 'MUST_NOT';
  apply(content) { /* return boolean */ }
}
PluginRegistry.registerFilter(MyFilter);

Contributing

Contributions are very welcome!

Suggest a feature via GitHub Issues.
Report bugs or raise PRs to fix them.
Add new providers (Slack, Teams, custom exports).
Write plugins for new formats or filters.

License

This project is licensed under the MIT License.
Feel free to use, modify, and distribute as you see fit!

Keywords

FAQs

What is convokit?

Is convokit well maintained?

Package last updated on 20 Apr 2025

Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

convokit

ConvoKit: Flexible Conversation Processing & Export Toolkit

Table of Contents

Key Features

What It Can & Cannot Do

Who Should Use It

Possibly upcomming features

Installation

Quick Start (Using the Library)

Configuration

Directory Structure

CLI Usage

Provider Registry

Built‑in Providers

Writing Your Own Provider

Plugin Registry

Formatters

Converters

Filters

Writing Your Own Plugin

Contributing

License

Keywords

Related posts

convokit

ConvoKit: Flexible Conversation Processing & Export Toolkit

Table of Contents

Key Features

What It Can & Cannot Do

Who Should Use It

Possibly upcomming features

Installation

Quick Start (Using the Library)

Configuration

Directory Structure

CLI Usage

Provider Registry

Built‑in Providers

Writing Your Own Provider

Plugin Registry

Formatters

Converters

Filters

Writing Your Own Plugin

Contributing

License

Keywords

Related posts

ENISA Becomes a CVE Root, Expanding Its Role in Europe’s Vulnerability Ecosystem

Introducing Socket Scanning for OpenVSX Extensions