
A flexible TypeScript framework for ingesting, processing, and exporting chat/conversation data for LLM training and analysis.
ConvoKit is a TypeScript-based framework for ingesting, normalizing, filtering, sampling, formatting, and exporting chat or conversation data for LLM training and analysis. It saves you time building data preprocessing pipelines and lets you focus on models and prompts. It provides:
Dynamic Provider Loading
Automatically discover and load data providers from your project’s providers folder.
Normalized Conversation Format
All data converges to a ConvoKitConversation interface: metadata + message arrays.
Context Formatting
Generate a single, line-delimited training string (CKContext) with options for time‑gaps, new‑conversation markers, and importance scoring.
Turn‑List Conversion
Break context into turn lists (CKTurnListConversation) for sampling or LLM‑specific export.
Weighted Sampling
Sample by conversation importance to focus on high‑value exchanges.
Export Plugins
Export to ChatML JSONL, Gemini JSONL, or add your own converter for other LLM formats.
Filter Plugins
Drop unwanted messages (e.g. links‑only, emoji‑only, code‑only) via a simple plugin API.
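The weighted sampling feature listed above can be illustrated with a small sketch. This is not ConvoKit's implementation: the `ScoredConversation` type, the `weightedSample` function, and the injectable `rng` parameter are hypothetical names, and drawing without replacement in proportion to importance is an assumed strategy.

```typescript
// Hypothetical shape for this sketch; ConvoKit's real types may differ.
interface ScoredConversation {
  id: string;
  importance: number; // assumed to be non-negative
}

// Draw up to `size` conversations without replacement, with probability
// proportional to importance. `rng` is injectable so runs can be reproduced.
function weightedSample(
  pool: ScoredConversation[],
  size: number,
  rng: () => number = Math.random
): ScoredConversation[] {
  // Conversations with zero importance can never be drawn, so drop them early.
  const remaining = pool.filter((c) => c.importance > 0);
  const picked: ScoredConversation[] = [];
  while (picked.length < size && remaining.length > 0) {
    const total = remaining.reduce((sum, c) => sum + c.importance, 0);
    let threshold = rng() * total;
    let index = remaining.length - 1; // fallback guards against float drift
    for (let i = 0; i < remaining.length; i++) {
      threshold -= remaining[i].importance;
      if (threshold < 0) {
        index = i;
        break;
      }
    }
    picked.push(remaining.splice(index, 1)[0]);
  }
  return picked;
}
```

Passing a seeded or fixed `rng` makes the selection deterministic, which helps when comparing exports between runs.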
Can:
Cannot:
# Install globally (recommended for CLI use)
npm install -g convokit
# Or install locally in your project
npm install convokit
import { ConvoKit, loadConfig, getConfig } from 'convokit';
import { config } from 'dotenv';
config();
await loadConfig();
async function run() {
  const ck = new ConvoKit();
  await ck.loadProviders(); // This will load all included providers, and the providers in the LocalProvidersDir if set (in config)
  // We also automatically load all included plugins & the plugins in LocalPluginsDir if set (in config)
  const convoData = await ck.processDataFromProviders();
  const context = await ck.parseToContext({ targetUsers: getConfig().targetUsers });
  await ck.convertToCKTurnList();
  await ck.getWeightedSample(getConfig().sampleSize);
  const chatml = await ck.exportToChatML(getConfig().systemPrompt);
  const gemini = await ck.exportToGemini(getConfig().systemPrompt);
  // Do whatever you want with the outputs
}
run();
Make sure you have set up your providers and directory structure first.
By default, ConvoKit reads convokit.config.json or environment variables. Here is an example config file:
{
  "inputDataDirName": "InputData",
  "outputDataDirName": "OutputData",
  "targetUsers": [
    { "providerId": "discord", "id": "YOUR_DISCORD_USER_ID" }
  ],
  "sampleSize": 5000,
  "systemPrompt": "You are a helpful assistant.",
  "minImportanceChat": 120,
  "minImportanceMessage": 100,
  "enableDebugging": false,
  "enablePerformanceStats": false,
  "shouldMergeConsecutiveMessages": true,
  "enableWarnings": true,
  "anonymizeProviderConversationIds": false,
  "localProvidersDir": "LocalProviders",
  "localPluginsDir": "LocalPlugins"
}
| Key | Description |
|---|---|
| inputDataDirName | Directory containing raw chat exports (relative to project root). |
| outputDataDirName | Directory to write formatted outputs. |
| targetUsers | JSON array mapping each provider to a target user ID for context generation. |
| sampleSize | Number of conversations to sample by importance. |
| systemPrompt | System prompt used in ChatML/Gemini exports. |
| minImportanceChat (optional) | Minimum average importance score for a conversation (default: 120). |
| minImportanceMessage (optional) | Minimum importance score for a single message (default: 100). |
| enableDebugging (optional) | Enable or disable debug-level logs. |
| enablePerformanceStats (optional) | Enable or disable performance stats (timers). |
| shouldMergeConsecutiveMessages (optional) | Merge consecutive messages when converting to CKTurnList. |
| enableWarnings (optional) | Toggle the display of warning messages. |
| anonymizeProviderConversationIds (optional) | Anonymize provider conversation IDs to protect sensitive data. |
| localProvidersDir (optional) | Name of the directory to load custom providers from. |
| localPluginsDir (optional) | Name of the directory to load custom plugins from. It contains one folder per plugin type (formatters, filters, converters). |
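To make the required keys and the two documented defaults concrete, here is a minimal sketch of typing and normalizing a config object. The `ConvoKitConfigSketch` interface and the `withDefaults` function are hypothetical and are not ConvoKit exports; the `sampleSize` check is an assumed sanity check, not documented behavior.

```typescript
// Hypothetical typing of some documented keys; not ConvoKit's own types.
interface TargetUser {
  providerId: string;
  id: string;
}

interface ConvoKitConfigSketch {
  inputDataDirName: string;
  outputDataDirName: string;
  targetUsers: TargetUser[];
  sampleSize: number;
  systemPrompt: string;
  minImportanceChat?: number;
  minImportanceMessage?: number;
}

// Apply the defaults stated in the table (120 and 100).
function withDefaults(raw: ConvoKitConfigSketch): Required<ConvoKitConfigSketch> {
  // Assumption: a sample size only makes sense as a positive integer.
  if (!Number.isInteger(raw.sampleSize) || raw.sampleSize <= 0) {
    throw new Error("sampleSize must be a positive integer");
  }
  return {
    ...raw,
    minImportanceChat: raw.minImportanceChat ?? 120,
    minImportanceMessage: raw.minImportanceMessage ?? 100,
  };
}
```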
In your convokit.config.json file you set an inputDataDirName. That directory needs one subdirectory per provider, and each subdirectory holds the exported data for that provider.
Example for use with the Discord provider, with inputDataDirName set to InputData:
convokit/
├── index.ts
├── convokit.config.json
├── ... other files and folders
└── InputData
    └── discord
        └── Direct Messages - fishylunar [000000000000000].json
Note: the filenames of the exported data do not matter, but the file extension does.
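The layout rule above (one directory per provider, matched by extension rather than filename) can be sketched as a simple path filter. The `InputDataInfo` interface mirrors the object that providers declare, but `selectInputFiles` is a hypothetical helper, and matching the extension case-insensitively is an assumption.

```typescript
// Hypothetical mirror of the InputDataInfo object a provider declares.
interface InputDataInfo {
  directoryName: string;
  fileExtension: string; // includes the leading dot, e.g. ".json"
}

// Keep only paths that sit directly inside <inputDir>/<directoryName>/
// and end with the expected extension. The filename itself is ignored.
function selectInputFiles(
  paths: string[],
  inputDir: string,
  info: InputDataInfo
): string[] {
  const prefix = `${inputDir}/${info.directoryName}/`;
  const ext = info.fileExtension.toLowerCase();
  return paths.filter((p) => {
    if (!p.startsWith(prefix)) return false;
    const rest = p.slice(prefix.length);
    // Assumption: nested folders are skipped and the extension check ignores case.
    return !rest.includes("/") && rest.toLowerCase().endsWith(ext);
  });
}
```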
ConvoKit provides a command-line interface (CLI) for running the processing pipeline without writing TypeScript code. Ensure you have a valid convokit.config.json file in your project root or have set the corresponding environment variables.
Running Commands:
# If installed globally
convokit <command> [options]
# If installed locally, using npx
npx convokit <command> [options]
# Or via package.json script
# "scripts": { "ck": "convokit" }
# npm run ck -- <command> [options]
Common Options:
-p, --providers <ids>: Specify a comma-separated list of provider IDs (e.g., discord,telegram) to process data from. If omitted, ConvoKit will attempt to use data from all providers found in your inputDataDirName that are registered.
-o, --output <file>: Specify an output file path to save the results of commands like context or export. If omitted, results are generated but not saved to a file (stats/logs will still be shown).
Commands:
create-config (alias: cfg): Creates an example convokit.config.json file in the current directory. Run this first if you don't have a config file.
convokit create-config
providers: Lists all registered providers (built-in and local) found by ConvoKit, including their ID, name, version, and expected input directory/extension. Useful for verifying provider setup and getting IDs for the --providers option.
convokit providers
plugins: Lists all registered plugins (formatters, converters, filters), including built-in and local ones. Shows plugin ID, name, and version. Useful for finding the <converter_id> for the export command.
convokit plugins
context: Processes data from specified (or all) providers and generates the CKContext output based on your configuration (targetUsers, importance scores, etc.).
# Generate context from all providers and save to context.txt
convokit context -o context.txt
# Generate context using only 'discord' provider data and save
convokit context --providers discord -o discord_context.txt
# Generate context from all providers and save to context.json including stats
convokit context -o context.json --stats
export <converter_id>: Runs the full pipeline: loads data, processes it, generates context, converts to turn list, performs weighted sampling (using sampleSize from config), and finally exports the data using the specified <converter_id>.
# Export data using the 'chatml' converter, save to chatml_export.jsonl
convokit export chatml -o chatml_export.jsonl
# Export using 'gemini' converter from 'telegram' provider only, save output
convokit export gemini --providers telegram -o telegram_gemini.jsonl
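To give a feel for the last two stages of the export pipeline, the sketch below merges consecutive messages into turns and serializes one conversation as a JSONL line. Everything here is illustrative: the `Message` and `Turn` types, `toTurns`, and `toChatMLLine` are hypothetical, mapping the target user to the `assistant` role is an assumption, and the `{"messages": [...]}` layout is the common chat fine-tuning convention rather than ConvoKit's confirmed output.

```typescript
interface Message {
  authorId: string;
  content: string;
}

interface Turn {
  role: "user" | "assistant";
  content: string;
}

// Merge consecutive messages by the same author into a single turn.
// Assumption: the target user becomes "assistant", everyone else "user".
function toTurns(messages: Message[], targetUserId: string): Turn[] {
  const turns: Turn[] = [];
  let lastAuthor: string | undefined;
  for (const m of messages) {
    const role = m.authorId === targetUserId ? "assistant" : "user";
    const last = turns[turns.length - 1];
    if (last && lastAuthor === m.authorId) {
      last.content += "\n" + m.content;
    } else {
      turns.push({ role, content: m.content });
    }
    lastAuthor = m.authorId;
  }
  return turns;
}

// Serialize one conversation as a single JSONL line with a leading system prompt.
function toChatMLLine(turns: Turn[], systemPrompt: string): string {
  return JSON.stringify({
    messages: [{ role: "system", content: systemPrompt }, ...turns],
  });
}
```

An export file is then one such line per sampled conversation, joined with newline characters.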
Example Workflow:
# 1. Create a config file if you don't have one
convokit create-config
# (Edit convokit.config.json with your settings: input dir, target users, etc.)
# 2. Check which providers are available
convokit providers
# Output might show: ID: discord, ID: telegram
# 3. Check available export formats (converters)
convokit plugins
# Output might show Converters: ID: chatml, ID: gemini
# 4. Run the full export pipeline for ChatML using all providers
convokit export chatml -o training_data.jsonl
# 5. (Alternative) Generate only the CKContext for analysis
convokit context -o analysis_context.json
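The CKContext produced in step 5 is described as a single line-delimited string with optional time-gap markers. A rough sketch of that idea follows. The `TimedMessage` type, the `buildContext` function, and the `[TIME GAP]` marker text are all made up for illustration; ConvoKit's actual format and options may look different.

```typescript
interface TimedMessage {
  author: string;
  content: string;
  timestamp: number; // milliseconds since the Unix epoch
}

// Join messages into one line-delimited string, inserting a marker line
// whenever the pause since the previous message exceeds `gapMs`.
function buildContext(messages: TimedMessage[], gapMs: number): string {
  const lines: string[] = [];
  let previous: TimedMessage | undefined;
  for (const m of messages) {
    if (previous && m.timestamp - previous.timestamp > gapMs) {
      lines.push("[TIME GAP]"); // hypothetical marker text
    }
    // Flatten embedded newlines so each message stays on one line.
    lines.push(`${m.author}: ${m.content.replace(/\n/g, " ")}`);
    previous = m;
  }
  return lines.join("\n");
}
```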
ConvoKit discovers providers from the providers folder via ProviderRegistry. Each provider must:
Implement ConvoKitProvider with Test() and Convert().
Export a ProviderInfo object.
Call ProviderRegistry.register(id, ProviderClass, ProviderInfo).
Included providers:
Discord (providers/discord.ts): Reads JSON exports from DiscordChatExporter.
Telegram (providers/telegram.ts): Reads JSON exports from the Telegram Desktop app.
Contributions are more than welcome! <3
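The provider contract (a class with Test() and Convert() that registers itself under an ID) can be sketched in isolation. Nothing below is ConvoKit code: `MiniRegistry`, `ProviderLike`, `Conversation`, and `EchoProvider` are stand-ins, and the `{ lines: ["name: text"] }` export schema is invented for the example.

```typescript
// Stand-in types for this sketch; the real ConvoKit interfaces may differ.
interface Conversation {
  metadata: Record<string, string>;
  messages: { author: string; content: string }[];
}

interface ProviderLike {
  Test(): boolean;
  Convert(): Conversation;
}

type ProviderCtor = new (raw: unknown) => ProviderLike;

class MiniRegistry {
  private static providers = new Map<string, ProviderCtor>();

  static register(id: string, ctor: ProviderCtor): void {
    if (this.providers.has(id)) throw new Error(`Duplicate provider id: ${id}`);
    this.providers.set(id, ctor);
  }

  // Run the named provider on a raw export; return null when Test() rejects it.
  static convert(id: string, raw: unknown): Conversation | null {
    const ctor = this.providers.get(id);
    if (!ctor) throw new Error(`Unknown provider id: ${id}`);
    const provider = new ctor(raw);
    return provider.Test() ? provider.Convert() : null;
  }
}

// Hypothetical provider for a made-up schema: { lines: ["name: text", ...] }.
class EchoProvider implements ProviderLike {
  constructor(private raw: unknown) {}

  private lines(): string[] | null {
    const value = (this.raw as { lines?: unknown } | null)?.lines;
    const valid =
      Array.isArray(value) &&
      value.every((l) => typeof l === "string" && l.includes(": "));
    return valid ? (value as string[]) : null;
  }

  Test(): boolean {
    return this.lines() !== null;
  }

  Convert(): Conversation {
    const messages = (this.lines() ?? []).map((line) => {
      const split = line.indexOf(": ");
      return { author: line.slice(0, split), content: line.slice(split + 2) };
    });
    return { metadata: { source: "echo" }, messages };
  }
}

// Self-register, in the same spirit as ProviderRegistry.register(...).
MiniRegistry.register("echo", EchoProvider);
```

Keeping Test() cheap and side-effect free lets a registry try a provider against a file before committing to a full conversion.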
Create /providers/MyPlatform.ts.
To make a local provider, put the MyPlatform.ts file in the localProvidersDir you specified in your config. If you are contributing a provider to be included in ConvoKit, put it in /providers/MyPlatform.ts.
export const ProviderInfo = {
  name: "MyPlatform Exporter",
  description: "Imports MyPlatform chat JSON.",
  version: "1.0.0",
  author: "You",
  InputDataInfo: { directoryName: "MyPlatform", fileExtension: ".json" }
};
export class Provider implements ConvoKitProvider {
  constructor(private raw: any) {}
  Test(): boolean {
    // return true if raw matches your schema
  }
  Convert(): ConvoKitConversation {
    // transform raw → ConvoKitConversation
  }
}
// Self-register
ProviderRegistry.register("myplatform", Provider, ProviderInfo);
Put your exported data in InputData/MyPlatform/*.json.
Call ck.loadProviders() and ck.processDataFromProviders() to include your data.
Plugins extend ConvoKit’s pipeline at three points: formatters, converters, and filters.
They self‑register via PluginRegistry.registerFormatter/Converter/Filter().
Included plugins:
Context formatter (id: context): Builds the CKContext string with importance and markers.
ChatML converter (id: chatml): Exports ChatML JSONL.
Gemini converter (id: gemini): Exports Gemini‑style JSONL.
Link-only filter (id: link-only): Excludes messages that are URLs only.
Formatters
export class MyFormatter implements FormatterPluginClass {
  PluginInfo = { id: "myfmt", name: "...", type: "formatter", version: "1.0.0" };
  apply(data, options) { /* return CKContextResult */ }
}
PluginRegistry.registerFormatter(MyFormatter);
Converters
export class MyConverter implements ConverterPluginClass {
  PluginInfo = { id: "myconv", name: "...", type: "converter", version: "1.0.0" };
  async apply(convs, prompt) { /* return string[] */ }
}
PluginRegistry.registerConverter(MyConverter);
Filters
export class MyFilter implements FilterPluginClass {
  PluginInfo = { id: "myfilter", name: "...", type: "filter", version: "1.0.0" };
  filterType: 'MUST' | 'MUST_NOT' = 'MUST_NOT';
  apply(content) { /* return boolean */ }
}
PluginRegistry.registerFilter(MyFilter);
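As a worked example of the filter shape, the sketch below implements a links-only check and one possible way to combine filters. The `MessageFilter` interface, `linkOnly`, and `keepMessage` are hypothetical names, and the reading of `MUST` and `MUST_NOT` (a message is kept when every `MUST` filter matches and no `MUST_NOT` filter matches) is an assumption about the semantics, not confirmed ConvoKit behavior.

```typescript
type FilterType = "MUST" | "MUST_NOT";

interface MessageFilter {
  filterType: FilterType;
  apply(content: string): boolean;
}

// Matches messages that consist of nothing but one or more URLs.
const linkOnly: MessageFilter = {
  filterType: "MUST_NOT",
  apply: (content) => {
    const parts = content.trim().split(/\s+/).filter(Boolean);
    return parts.length > 0 && parts.every((p) => /^https?:\/\/\S+$/i.test(p));
  },
};

// Assumed semantics: a message survives when every MUST filter matches
// and no MUST_NOT filter matches.
function keepMessage(content: string, filters: MessageFilter[]): boolean {
  return filters.every((f) =>
    f.filterType === "MUST" ? f.apply(content) : !f.apply(content)
  );
}
```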
Contributions are very welcome!
This project is licensed under the MIT License.
Feel free to use, modify, and distribute as you see fit!