convokit

A flexible TypeScript framework for ingesting, processing, and exporting chat/conversation data for LLM training and analysis.

Version: 1.0.2 (latest, npm)
Maintainers: 1

ConvoKit: Flexible Conversation Processing & Export Toolkit

ConvoKit is a TypeScript-based framework for ingesting, normalizing, filtering, sampling, formatting, and exporting chat or conversation data for LLM training and analysis. It provides:

  • A provider registry to plug in new data sources (Discord, Slack, custom exports, etc.).
  • A plugin registry for formatters, converters, and filters to transform and export data to ChatML, Gemini JSONL, custom context formats, and more.
  • A fully configurable, extensible pipeline: ingest → normalize → filter → importance‑score → sample → format → export.

ConvoKit saves you time building data preprocessing pipelines and lets you focus on models and prompts.

Table of Contents

  • Key Features
  • What It Can & Cannot Do
  • Who Should Use It
  • Installation
  • Quick Start
  • Configuration
  • CLI Usage
  • Provider Registry
    • Built‑in Providers
    • Writing Your Own Provider
  • Plugin Registry
    • Formatters
    • Converters
    • Filters
    • Writing Your Own Plugin
  • Contributing
  • License

Key Features

  • Dynamic Provider Loading
    Automatically discover and load data providers from your project’s providers folder.

  • Normalized Conversation Format
    All data converges to a ConvoKitConversation interface: metadata + message arrays.

  • Context Formatting
    Generate a single, line-delimited training string (CKContext) with options for time‑gaps, new‑conversation markers, and importance scoring.

  • Turn‑List Conversion
    Break context into turn lists (CKTurnListConversation) for sampling or LLM‑specific export.

  • Weighted Sampling
    Sample by conversation importance to focus on high‑value exchanges.

  • Export Plugins
    Export to ChatML JSONL, Gemini JSONL, or add your own converter for other LLM formats.

  • Filter Plugins
    Drop unwanted messages (e.g. links‑only, emoji‑only, code‑only) via a simple plugin API.
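
To make the normalized format and the turn-list conversion more concrete, here is a minimal sketch. The type names (SketchMessage, SketchConversation, SketchTurn) and the toTurnList helper are hypothetical and only illustrate the idea of "metadata + message arrays" being collapsed into turns; ConvoKit's real ConvoKitConversation and CKTurnListConversation interfaces may look different.

```typescript
// Hypothetical shapes for illustration only; not ConvoKit's real interfaces.
interface SketchMessage {
  authorId: string;
  content: string;
  timestamp: number; // Unix epoch milliseconds
}

interface SketchConversation {
  metadata: { providerId: string; conversationId: string };
  messages: SketchMessage[];
}

interface SketchTurn {
  authorId: string;
  content: string;
}

// Collapse consecutive messages from the same author into a single turn,
// joining their contents with a newline.
function toTurnList(conversation: SketchConversation): SketchTurn[] {
  const turns: SketchTurn[] = [];
  for (const message of conversation.messages) {
    const last = turns[turns.length - 1];
    if (last !== undefined && last.authorId === message.authorId) {
      last.content += "\n" + message.content;
    } else {
      turns.push({ authorId: message.authorId, content: message.content });
    }
  }
  return turns;
}
```

Merging here is unconditional; in ConvoKit the comparable behaviour is toggled by the shouldMergeConsecutiveMessages config key.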

What It Can & Cannot Do

Can:

  • Ingest JSON exports from Discord (via DiscordChatExporter), or any custom source you add via the Provider Registry.
  • Normalize and filter conversations by message content, length, or custom rules.
  • Score message & conversation importance automatically based on time, length, and frequency.
  • Sample highly‑important conversations for training budgets.
  • Export to popular LLM chat formats (ChatML, Gemini), or to other formats by adding your own converter.

Cannot:

  • Perform LLM inference or model training directly (not yet, anyway ;)).
  • Resolve references across conversations (thread linking across channels).
  • Guarantee perfect import schema for every data source—you may need to write a provider to handle custom formats.
  • Handle binary or non‑JSON data without extending a provider to preprocess it.

Who Should Use It

  • NLP / ML Engineers preparing chat‑based LLM fine‑tuning or analysis datasets.
  • Bot / Chat Service Developers needing to transform raw chat logs into structured training data.
  • Researchers studying conversation dynamics or designing importance‑based sampling strategies.
  • Community Contributors eager to add support for new platforms or export formats.

Possible Upcoming Features

  • Personality: Generate a deep and comprehensive personality prompt based on your output ck_context.
  • Fine-tuning: Fine-tune models with exported training data (currently mainly looking at Gemini; contributions welcome!).
  • Model Testing: Test your fine-tuned model via the terminal (currently mainly looking at Gemini; contributions welcome!).
  • Unit Tests: Adding unit tests would help keep everything maintainable and stable (or so I've heard).

Installation

# Install globally (recommended for CLI use)
npm install -g convokit

# Or install locally in your project
npm install convokit

Quick Start (Using the Library)

import { ConvoKit, loadConfig, getConfig } from 'convokit';
import { config } from 'dotenv';

config();
await loadConfig();

async function run() {
  const ck = new ConvoKit();
  await ck.loadProviders(); // This will load all included providers, and the providers in the LocalProvidersDir if set (in config)
  // We also automatically load all included plugins & the plugins in LocalPluginsDir if set (in config)
  const convoData = await ck.processDataFromProviders();

  const context = await ck.parseToContext({ targetUsers: getConfig().targetUsers });
  await ck.convertToCKTurnList();
  await ck.getWeightedSample(getConfig().sampleSize);
  const chatml = await ck.exportToChatML(getConfig().systemPrompt);
  const gemini = await ck.exportToGemini(getConfig().systemPrompt);
  // Do whatever you want with the outputs
}
run();

Make sure you have set up your providers and directory structure first.

Configuration

By default, ConvoKit reads its settings from convokit.config.json or from environment variables. Here is an example config file:

{
  "inputDataDirName": "InputData",
  "outputDataDirName": "OutputData",
  "targetUsers": [
    { "providerId": "discord", "id": "YOUR_DISCORD_USER_ID" }
  ],
  "sampleSize": 5000,
  "systemPrompt": "You are a helpful assistant.",
  "minImportanceChat": 120,
  "minImportanceMessage": 100,
  "enableDebugging": false,
  "enablePerformanceStats": false,
  "shouldMergeConsecutiveMessages": true,
  "enableWarnings": true,
  "anonymizeProviderConversationIds": false,
  "localProvidersDir": "LocalProviders",
  "localPluginsDir": "LocalPlugins"
}
| Key | Description |
| --- | --- |
| inputDataDirName | Directory containing raw chat exports (relative to project root). |
| outputDataDirName | Directory to write formatted outputs. |
| targetUsers | JSON array mapping each provider to a target user ID for context generation. |
| sampleSize | Number of conversations to sample by importance. |
| systemPrompt | System prompt used in ChatML/Gemini exports. |
| minImportanceChat (optional) | Minimum average importance score for a conversation (default: 120). |
| minImportanceMessage (optional) | Minimum importance score for a single message (default: 100). |
| enableDebugging (optional) | Enable or disable debug-level logs. |
| enablePerformanceStats (optional) | Enable or disable performance stats (timers). |
| shouldMergeConsecutiveMessages (optional) | Merge consecutive messages when converting to CKTurnList. |
| enableWarnings (optional) | Toggle the display of warning messages. |
| anonymizeProviderConversationIds (optional) | Anonymize provider conversation IDs to protect sensitive data. |
| localProvidersDir (optional) | Name of the directory to load custom providers from. |
| localPluginsDir (optional) | Name of the directory to load custom plugins from. It must contain one folder per plugin type (formatters, filters, converters). |
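
As a rough illustration of how the optional keys and their documented defaults could be resolved, here is a minimal sketch. SketchConfig and resolveConfig are hypothetical names, not part of ConvoKit's API, and the sampleSize validation is an assumption added for the example; only the default values (120 and 100) come from the table above.

```typescript
// Hypothetical helper for illustration; ConvoKit's own loadConfig() may behave differently.
interface SketchTargetUser {
  providerId: string;
  id: string;
}

interface SketchConfig {
  inputDataDirName: string;
  outputDataDirName: string;
  targetUsers: SketchTargetUser[];
  sampleSize: number;
  systemPrompt: string;
  minImportanceChat?: number;
  minImportanceMessage?: number;
}

type ResolvedSketchConfig = Required<SketchConfig>;

// Fill in the documented defaults for the optional importance thresholds.
function resolveConfig(raw: SketchConfig): ResolvedSketchConfig {
  // Assumption: a sample size must be a positive whole number.
  if (!Number.isInteger(raw.sampleSize) || raw.sampleSize <= 0) {
    throw new Error("sampleSize must be a positive integer");
  }
  return {
    ...raw,
    minImportanceChat: raw.minImportanceChat ?? 120,
    minImportanceMessage: raw.minImportanceMessage ?? 100,
  };
}
```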

Directory Structure

Your convokit.config.json file sets an inputDataDirName. Inside that directory, create one subdirectory per provider and store that provider's exported data there.

Example for use with the Discord provider, with inputDataDirName set to InputData:

convokit/
├── index.ts
├── convokit.config.json
├── ... other files and folders
└── InputData
    └── discord
        └── Direct Messages - fishylunar [000000000000000].json

Note: the filenames of the exported data don't matter, but the file extension does.
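
The layout rule above (one directory per provider, matched by file extension) can be sketched as a small pure function. groupInputFiles and SketchInputInfo are hypothetical names used only for this example; the directoryName and fileExtension fields mirror the InputDataInfo object shown in the provider section, and the case-sensitive extension match is an assumption.

```typescript
// Hypothetical helper for illustration; not how ConvoKit necessarily scans its input directory.
interface SketchInputInfo {
  directoryName: string;
  fileExtension: string;
}

// Given paths relative to the input data directory, keep for each provider
// only the files that sit in its directory and end with its extension.
function groupInputFiles(
  relativePaths: string[],
  providers: Record<string, SketchInputInfo>
): Record<string, string[]> {
  const result: Record<string, string[]> = {};
  for (const providerId of Object.keys(providers)) {
    const info = providers[providerId];
    if (info === undefined) continue;
    const prefix = info.directoryName + "/";
    result[providerId] = relativePaths.filter(
      (p) => p.startsWith(prefix) && p.endsWith(info.fileExtension)
    );
  }
  return result;
}
```

With this rule, a file named "discord/anything at all.json" is picked up for the discord provider, while "discord/notes.txt" is ignored.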

CLI Usage

ConvoKit provides a command-line interface (CLI) for running the processing pipeline without writing TypeScript code. Ensure you have a valid convokit.config.json file in your project root or have set the corresponding environment variables.

Running Commands:

# If installed globally
convokit <command> [options]

# If installed locally, using npx
npx convokit <command> [options]

# Or via package.json script
# "scripts": { "ck": "convokit" }
# npm run ck -- <command> [options]

Common Options:

  • -p, --providers <ids>: Specify a comma-separated list of provider IDs (e.g., discord,telegram) to process data from. If omitted, ConvoKit will attempt to use data from all providers found in your inputDataDirName that are registered.
  • -o, --output <file>: Specify an output file path to save the results of commands like context or export. If omitted, results are generated but not saved to a file (stats/logs will still be shown).

Commands:

  • create-config (alias: cfg): Creates an example convokit.config.json file in the current directory. Run this first if you don't have a config file.
    convokit create-config
    
  • providers: Lists all registered providers (built-in and local) found by ConvoKit, including their ID, name, version, and expected input directory/extension. Useful for verifying provider setup and getting IDs for the --providers option.
    convokit providers
    
  • plugins: Lists all registered plugins (formatters, converters, filters), including built-in and local ones. Shows plugin ID, name, and version. Useful for finding the <converter_id> for the export command.
    convokit plugins
    
  • context: Processes data from specified (or all) providers and generates the CKContext output based on your configuration (targetUsers, importance scores, etc.).
    # Generate context from all providers and save to context.txt
    convokit context -o context.txt
    
    # Generate context using only 'discord' provider data and save
    convokit context --providers discord -o discord_context.txt
    
    # Generate context from all providers and save to context.json including stats
    convokit context -o context.json --stats
    
  • export <converter_id>: Runs the full pipeline: loads data, processes it, generates context, converts to turn list, performs weighted sampling (using sampleSize from config), and finally exports the data using the specified <converter_id>.
    # Export data using the 'chatml' converter, save to chatml_export.jsonl
    convokit export chatml -o chatml_export.jsonl
    
    # Export using 'gemini' converter from 'telegram' provider only, save output
    convokit export gemini --providers telegram -o telegram_gemini.jsonl
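
The weighted sampling step in the export pipeline can be pictured as importance-proportional sampling without replacement. The sketch below is one common way to do that and is not taken from ConvoKit's source; SketchScored and weightedSample are hypothetical names, and the optional random parameter exists only to make the example deterministic in tests.

```typescript
// Hypothetical sketch of importance-weighted sampling without replacement.
interface SketchScored<T> {
  item: T;
  importance: number;
}

function weightedSample<T>(
  pool: SketchScored<T>[],
  sampleSize: number,
  random: () => number = Math.random
): T[] {
  // Entries with non-positive importance can never be drawn.
  const remaining = pool.filter((entry) => entry.importance > 0);
  const picked: T[] = [];
  while (picked.length < sampleSize && remaining.length > 0) {
    const total = remaining.reduce((sum, entry) => sum + entry.importance, 0);
    let threshold = random() * total;
    // Walk the pool until the running threshold drops below zero.
    const found = remaining.findIndex((entry) => {
      threshold -= entry.importance;
      return threshold < 0;
    });
    // Fall back to the last entry to guard against floating-point drift.
    const index = found === -1 ? remaining.length - 1 : found;
    const [chosen] = remaining.splice(index, 1);
    if (chosen !== undefined) {
      picked.push(chosen.item);
    }
  }
  return picked;
}
```

Each draw removes the chosen conversation from the pool, so a conversation is never sampled twice and asking for more items than exist simply returns the whole pool.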
    

Example Workflow:

# 1. Create a config file if you don't have one
convokit create-config
# (Edit convokit.config.json with your settings: input dir, target users, etc.)

# 2. Check which providers are available
convokit providers
# Output might show: ID: discord, ID: telegram

# 3. Check available export formats (converters)
convokit plugins
# Output might show Converters: ID: chatml, ID: gemini

# 4. Run the full export pipeline for ChatML using all providers
convokit export chatml -o training_data.jsonl

# 5. (Alternative) Generate only the CKContext for analysis
convokit context -o analysis_context.json

Provider Registry

ConvoKit discovers providers in the providers folder via ProviderRegistry. Each provider must:

  • Implement ConvoKitProvider with Test() and Convert().
  • Export a static ProviderInfo object.
  • Register itself via ProviderRegistry.register(id, ProviderClass, ProviderInfo).

Built‑in Providers

  • Discord (providers/discord.ts): Reads JSON exports from DiscordChatExporter.
  • Telegram (providers/telegram.ts): Reads JSON exports from the Telegram Desktop app.

Contributions are more than welcome! <3

Writing Your Own Provider

  • Create MyPlatform.ts.

For a local provider, place the file in the directory set as localProvidersDir in your config. If you are contributing a provider to be included in ConvoKit itself, create it at /providers/MyPlatform.ts.

  • Define your data schema, compatibility check, and conversion:
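// Assumption: these names are exported from the package root; adjust the import path to your setup.
import { ConvoKitProvider, ConvoKitConversation, ProviderRegistry } from 'convokit';

export const ProviderInfo = {
  name: "MyPlatform Exporter",
  description: "Imports MyPlatform chat JSON.",
  version: "1.0.0",
  author: "You",
  InputDataInfo: { directoryName: "MyPlatform", fileExtension: ".json" }
};

export class Provider implements ConvoKitProvider {
  constructor(private raw: any) {}
  Test(): boolean {
    // Return true if raw matches your schema (placeholder check: raw is a non-null object)
    return typeof this.raw === "object" && this.raw !== null;
  }
  Convert(): ConvoKitConversation {
    // Transform raw → ConvoKitConversation (replace this placeholder with your mapping)
    throw new Error("Convert() is not implemented yet");
  }
}

// Self-register
ProviderRegistry.register("myplatform", Provider, ProviderInfo);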
  • Place your exports in InputData/MyPlatform/*.json.
  • Run ck.loadProviders() and ck.processDataFromProviders() to include your data.
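
The self-registration pattern used by providers can be sketched with a small stand-alone registry. Everything below (SketchProviderRegistry, findCompatible, JsonArrayProvider) is hypothetical and only mirrors the register(id, ProviderClass, ProviderInfo) call shown above; it is not ConvoKit's actual ProviderRegistry implementation.

```typescript
// Hypothetical stand-alone registry illustrating the self-registration pattern.
interface SketchProvider {
  Test(): boolean;
  Convert(): unknown;
}

interface SketchProviderInfo {
  name: string;
  version: string;
}

type SketchProviderClass = new (raw: unknown) => SketchProvider;

interface SketchRegistryEntry {
  ProviderClass: SketchProviderClass;
  info: SketchProviderInfo;
}

class SketchProviderRegistry {
  private static entries = new Map<string, SketchRegistryEntry>();

  static register(id: string, ProviderClass: SketchProviderClass, info: SketchProviderInfo): void {
    if (SketchProviderRegistry.entries.has(id)) {
      throw new Error('Provider "' + id + '" is already registered');
    }
    SketchProviderRegistry.entries.set(id, { ProviderClass, info });
  }

  // Return the first registered provider whose Test() accepts the raw data.
  static findCompatible(raw: unknown): { id: string; provider: SketchProvider } | undefined {
    let match: { id: string; provider: SketchProvider } | undefined;
    SketchProviderRegistry.entries.forEach((entry, id) => {
      if (match !== undefined) return;
      const provider = new entry.ProviderClass(raw);
      if (provider.Test()) {
        match = { id, provider };
      }
    });
    return match;
  }
}

// Example provider that accepts any JSON array and registers itself on load.
class JsonArrayProvider implements SketchProvider {
  private raw: unknown;
  constructor(raw: unknown) {
    this.raw = raw;
  }
  Test(): boolean {
    return Array.isArray(this.raw);
  }
  Convert(): unknown {
    return { messages: this.raw };
  }
}

SketchProviderRegistry.register("json-array", JsonArrayProvider, {
  name: "JSON Array",
  version: "1.0.0",
});
```

Keeping Test() cheap matters in a design like this, because every registered provider may be asked to inspect each input file before one of them converts it.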

Plugin Registry

Plugins extend ConvoKit’s pipeline at three points:

  • Formatters (formatters)
  • Converters (converters)
  • Filters (filters)

They self‑register via PluginRegistry.registerFormatter/Converter/Filter().

Formatters

  • Context Formatter (id: context): Builds the CKContext string with importance and markers.

Converters

  • ChatML Converter (id: chatml): Exports ChatML‑style JSONL.
  • Gemini Converter (id: gemini): Exports Gemini‑style JSONL.
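
For orientation, a chat-style JSONL export generally means one JSON object per line, each holding a list of role/content messages. The sketch below assumes a layout with a top-level messages array and a leading system message; that layout, along with the names SketchChatTurn and toChatJsonl, is an assumption for illustration, and the exact output of ConvoKit's chatml and gemini converters may differ.

```typescript
// Hypothetical converter sketch; the output layout is an assumption, not ConvoKit's documented format.
interface SketchChatTurn {
  role: "user" | "assistant";
  content: string;
}

// Produce one JSON line per conversation, prefixed with the system prompt.
function toChatJsonl(conversations: SketchChatTurn[][], systemPrompt: string): string[] {
  return conversations.map((turns) => {
    const messages: { role: string; content: string }[] = [
      { role: "system", content: systemPrompt },
    ];
    for (const turn of turns) {
      messages.push({ role: turn.role, content: turn.content });
    }
    // JSON.stringify escapes newlines inside content, so each record stays on one line.
    return JSON.stringify({ messages });
  });
}
```

Returning a string[] matches the converter signature shown in the plugin examples, where apply(convs, prompt) returns string[].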

Filters

  • LinkOnlyFilter (id: link-only): Excludes messages that are URLs only.
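
A check for "URLs only" messages could look like the sketch below. isLinkOnly is a hypothetical function, and treating only http/https tokens as links is an assumption; the built-in LinkOnlyFilter may use a different rule.

```typescript
// Hypothetical sketch of a link-only check; not the built-in filter's actual logic.
function isLinkOnly(content: string): boolean {
  const tokens = content
    .trim()
    .split(/\s+/)
    .filter((token) => token.length > 0);
  // An empty message is not considered link-only.
  if (tokens.length === 0) return false;
  // Assumption: only http:// and https:// tokens count as links.
  return tokens.every((token) => /^https?:\/\/\S+$/i.test(token));
}
```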

Writing Your Own Plugin

  • Formatters

    export class MyFormatter implements FormatterPluginClass {
      PluginInfo = { id: "myfmt", name: "...", type: "formatter", version: "1.0.0" };
      apply(data, options) { /* return CKContextResult */ }
    }
    PluginRegistry.registerFormatter(MyFormatter);
    
  • Converters

    export class MyConverter implements ConverterPluginClass {
      PluginInfo = { id: "myconv", name: "...", type: "converter", version: "1.0.0" };
      async apply(convs, prompt) { /* return string[] */ }
    }
    PluginRegistry.registerConverter(MyConverter);
    
  • Filters

    export class MyFilter implements FilterPluginClass {
      PluginInfo = { id: "myfilter", name: "...", type: "filter", version: "1.0.0" };
      filterType: 'MUST' | 'MUST_NOT' = 'MUST_NOT';
      apply(content) { /* return boolean */ }
    }
    PluginRegistry.registerFilter(MyFilter);
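
One plausible reading of filterType is that a MUST filter has to match for a message to be kept, while a MUST_NOT filter has to not match. The sketch below implements that reading; it is an assumption about the semantics, not confirmed ConvoKit behaviour, and SketchFilter and passesFilters are hypothetical names.

```typescript
// Hypothetical sketch of combining filters; the MUST / MUST_NOT semantics are assumed.
interface SketchFilter {
  filterType: "MUST" | "MUST_NOT";
  apply(content: string): boolean;
}

// Keep a message only if every MUST filter matches and no MUST_NOT filter matches.
function passesFilters(content: string, filters: SketchFilter[]): boolean {
  return filters.every((filter) =>
    filter.filterType === "MUST" ? filter.apply(content) : !filter.apply(content)
  );
}
```

Under this reading, a link-only filter declared as MUST_NOT drops messages that consist solely of links and keeps everything else.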
    

Contributing

Contributions are very welcome!

  • Suggest a feature via GitHub Issues.
  • Report bugs or raise PRs to fix them.
  • Add new providers (Slack, Teams, custom exports).
  • Write plugins for new formats or filters.

License

This project is licensed under the MIT License.
Feel free to use, modify, and distribute as you see fit!

Keywords

llm


Package last updated on 20 Apr 2025
