
create-llm
The fastest way to start training your own Language Model. Create production-ready LLM training projects in seconds.
CLI tool for scaffolding LLM training projects
Similar to create-next-app, but for training custom language models.
npm Package • Documentation • Report Bug • Request Feature
npx @theanikrtgiri/create-llm my-awesome-llm
cd my-awesome-llm
pip install -r requirements.txt
python training/train.py
Training a language model from scratch normally means wiring up data pipelines, a tokenizer, model code, training loops, and evaluation yourself. create-llm provides all of this in one command.
Choose from 4 templates optimized for different use cases. Each project ships with everything you need out of the box: sensible defaults, intelligent configuration (the vocabulary size is auto-detected from your tokenizer), and optional plugin integrations such as wandb and huggingface.
# Using npx (recommended - no installation needed)
npx @theanikrtgiri/create-llm my-llm
# Or install globally
npm install -g @theanikrtgiri/create-llm
create-llm my-llm
npx @theanikrtgiri/create-llm
You'll be prompted for the project name, template, and tokenizer type.
# Specify everything upfront
npx @theanikrtgiri/create-llm my-llm --template tiny --tokenizer bpe --skip-install
**nano**: for learning and quick experiments

- Parameters: ~1M
- Hardware: any CPU (2GB RAM)
- Time: 1-2 minutes
- Data: 100+ examples
- Use: learning, testing, demos
**tiny**: for prototyping and small projects

- Parameters: ~6M
- Hardware: CPU or basic GPU (4GB RAM)
- Time: 5-15 minutes
- Data: 1,000+ examples
- Use: prototypes, small projects
**small**: for production applications

- Parameters: ~100M
- Hardware: RTX 3060+ (12GB VRAM)
- Time: 1-3 hours
- Data: 10,000+ examples
- Use: production, real apps
**base**: for research and high-quality models

- Parameters: ~1B
- Hardware: A100 or multi-GPU
- Time: 1-3 days
- Data: 100,000+ examples
- Use: research, high-quality models
npx @theanikrtgiri/create-llm my-llm --template tiny --tokenizer bpe
cd my-llm
pip install -r requirements.txt
Place your text files in data/raw/:
# Example: Download Shakespeare
curl https://www.gutenberg.org/files/100/100-0.txt > data/raw/shakespeare.txt
# Or add your own files
cp /path/to/your/data.txt data/raw/
Tip: Start with at least 1MB of text for meaningful results
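As a quick sanity check on that guideline, a short stdlib-only script (assuming the `data/raw/` layout shown in the project structure below) can report how much raw text you have:

```python
import os

def corpus_size_bytes(raw_dir: str) -> int:
    """Sum the sizes of all files under a directory tree."""
    total = 0
    for root, _dirs, files in os.walk(raw_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

if __name__ == "__main__":
    size = corpus_size_bytes("data/raw")
    print(f"{size / 1_000_000:.2f} MB of raw text")
    if size < 1_000_000:
        print("Consider adding more data: ~1MB+ is recommended")
```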
python tokenizer/train.py --data data/raw/
This creates a vocabulary from your data.
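For intuition about what tokenizer training does, here is a toy sketch of the core BPE loop (count the most frequent adjacent symbol pair, then merge it); the real `tokenizer/train.py` implementation is more involved:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of tokenized words."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# One BPE merge on a toy corpus: 'l','o' is the most frequent pair
corpus = [list("lower"), list("lowest"), list("low")]
pair = most_frequent_pair(corpus)
corpus = merge_pair(corpus, pair)
```

Repeating this loop until the vocabulary reaches the target size is, in essence, how a BPE vocabulary is built from your data.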
python data/prepare.py
This tokenizes and prepares your data for training.
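Conceptually, preparation maps text to token ids and packs them into fixed-length training blocks. A simplified stand-in, using a toy character-level vocabulary instead of the trained tokenizer:

```python
def prepare_blocks(token_ids, block_size):
    """Pack a flat stream of token ids into fixed-length training blocks,
    dropping the ragged tail."""
    n = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, n, block_size)]

# Toy character-level "tokenizer": map each character to an integer id
text = "hello world, hello llm"
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
ids = [vocab[ch] for ch in text]
blocks = prepare_blocks(ids, block_size=8)
```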
# Basic training
python training/train.py
# With live dashboard
python training/train.py --dashboard
# Then open http://localhost:5000
# Resume from checkpoint
python training/train.py --resume checkpoints/checkpoint-1000.pt
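Resuming works because each checkpoint records the training step alongside model state. A minimal, framework-free sketch of the pattern (the real project saves PyTorch `.pt` checkpoints; the JSON files and the `train` loop here are purely illustrative):

```python
import json
import os

def save_checkpoint(path, step, state):
    """Persist the current step and training state."""
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)

def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)

def train(max_steps, resume_path=None, ckpt_dir="."):
    """Run (or resume) a toy training loop, checkpointing every 1000 steps."""
    step, state = 0, {"loss": None}
    if resume_path and os.path.exists(resume_path):
        ckpt = load_checkpoint(resume_path)
        step, state = ckpt["step"], ckpt["state"]
    while step < max_steps:
        step += 1
        state["loss"] = 1.0 / step  # stand-in for a real training step
        if step % 1000 == 0:
            save_checkpoint(
                os.path.join(ckpt_dir, f"checkpoint-{step}.json"), step, state
            )
    return step, state
```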
python evaluation/evaluate.py --checkpoint checkpoints/checkpoint-best.pt
python evaluation/generate.py \
--checkpoint checkpoints/checkpoint-best.pt \
--prompt "Once upon a time" \
--temperature 0.8
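The `--temperature` flag rescales the logits before sampling: values below 1 sharpen the distribution toward the most likely token, values above 1 flatten it. A self-contained sketch of the technique:

```python
import math
import random

def sample_with_temperature(logits, temperature=0.8, rng=random):
    """Softmax over logits/temperature, then sample one index."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

token = sample_with_temperature([2.0, 1.0, 0.1], temperature=0.8)
```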
python chat.py --checkpoint checkpoints/checkpoint-best.pt
# To Hugging Face
python deploy.py --to huggingface --repo-id username/my-model
# To Replicate
python deploy.py --to replicate --model-name my-model
my-llm/
├── data/
│ ├── raw/ # Your training data goes here
│ ├── processed/ # Tokenized data (auto-generated)
│ ├── dataset.py # PyTorch dataset classes
│ └── prepare.py # Data preprocessing script
│
├── models/
│ ├── architectures/ # Model implementations
│ │ ├── gpt.py # GPT architecture
│ │ ├── nano.py # 1M parameter model
│ │ ├── tiny.py # 6M parameter model
│ │ ├── small.py # 100M parameter model
│ │ └── base.py # 1B parameter model
│ ├── __init__.py
│ └── config.py # Configuration loader
│
├── tokenizer/
│ ├── train.py # Tokenizer training script
│ └── tokenizer.json # Trained tokenizer (auto-generated)
│
├── training/
│ ├── train.py # Main training script
│ ├── trainer.py # Trainer class
│ ├── callbacks/ # Training callbacks
│ └── dashboard/ # Live training dashboard
│
├── evaluation/
│ ├── evaluate.py # Model evaluation
│ └── generate.py # Text generation
│
├── plugins/ # Optional integrations
├── checkpoints/ # Saved models (auto-generated)
├── logs/ # Training logs (auto-generated)
│
├── llm.config.js # Main configuration file
├── requirements.txt # Python dependencies
├── chat.py # Interactive chat interface
├── deploy.py # Deployment script
└── README.md # Project documentation
Everything is controlled via llm.config.js:
module.exports = {
  // Model architecture
  model: {
    type: 'gpt',
    size: 'tiny',
    vocab_size: 10000,  // Auto-detected from tokenizer
    max_length: 512,
    layers: 4,
    heads: 4,
    dim: 256,
    dropout: 0.2,
  },

  // Training settings
  training: {
    batch_size: 16,
    learning_rate: 0.0006,
    warmup_steps: 500,
    max_steps: 10000,
    eval_interval: 500,
    save_interval: 2000,
  },

  // Plugins
  plugins: [
    // 'wandb',
    // 'huggingface',
  ],
};
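The `warmup_steps` setting implies a learning-rate schedule: the rate ramps up linearly over the first 500 steps and then decays toward zero. The decay shape below (cosine) is a common choice, not something the project documents, so treat it as an assumption:

```python
import math

def lr_at(step, base_lr=0.0006, warmup_steps=500, max_steps=10000):
    """Linear warmup to base_lr, then cosine decay to zero by max_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

With the config values above, the rate is 0 at step 0, peaks at 0.0006 at step 500, and falls back to 0 by step 10000.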
npx @theanikrtgiri/create-llm [project-name] [options]
| Option | Description | Default |
|---|---|---|
| `--template <name>` | Template to use (`nano`, `tiny`, `small`, `base`, `custom`) | Interactive |
| `--tokenizer <type>` | Tokenizer type (`bpe`, `wordpiece`, `unigram`) | Interactive |
| `--skip-install` | Skip npm/pip installation | `false` |
| `-y, --yes` | Skip all prompts, use defaults | `false` |
| `-h, --help` | Show help | - |
| `-v, --version` | Show version | - |
# Interactive mode (recommended for first time)
npx @theanikrtgiri/create-llm
# Quick start with defaults
npx @theanikrtgiri/create-llm my-project
# Specify everything
npx @theanikrtgiri/create-llm my-project --template nano --tokenizer bpe --skip-install
# Skip prompts
npx @theanikrtgiri/create-llm my-project -y
Minimum Data Requirements: match the per-template minimums above (100+ examples for nano, up to 100,000+ for base).

Data Quality: clean, deduplicated text gives better results than sheer volume.

Avoid Overfitting: watch eval loss (reported every `eval_interval` steps) and stop early or add data if it diverges from training loss.

Optimize Training: enable `mixed_precision: true`, and set `gradient_accumulation_steps` if you hit out-of-memory errors.

Troubleshooting:

- "Vocab size mismatch detected"
- "Position embedding index error" (sequences too long): raise `max_length` in the config if you need longer sequences
- "Model may be too large for dataset"
- "CUDA out of memory": lower `batch_size` in `llm.config.js`, enable `mixed_precision: true`, or use `gradient_accumulation_steps`
- "Training loss not decreasing"
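Gradient accumulation, the out-of-memory workaround mentioned above, trades time for memory: several small micro-batches are processed, their gradients averaged, and one optimizer step taken, so the effective batch size is `batch_size` times the accumulation steps. A framework-free sketch on a one-parameter least-squares model:

```python
def accumulate_and_step(w, micro_batches, lr=0.1, accum_steps=4):
    """Average gradients over `accum_steps` micro-batches, then take one
    SGD step -- equivalent to one step on the combined large batch."""
    grad = 0.0
    for i, batch in enumerate(micro_batches, start=1):
        # Loss: mean squared error of w against targets t; d/dw = 2*(w - t)
        g = sum(2 * (w - t) for t in batch) / len(batch)
        grad += g / accum_steps  # scale so gradients average, not sum
        if i % accum_steps == 0:
            w -= lr * grad       # optimizer step on the accumulated gradient
            grad = 0.0
    return w

w = accumulate_and_step(0.0, [[1.0, 1.0]] * 4, lr=0.1, accum_steps=4)
```

With four micro-batches of two examples each, this matches a single SGD step on the combined eight-example batch.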
See DEVELOPMENT.md for development setup and guidelines.
We welcome contributions! See CONTRIBUTING.md for guidelines.
| Area | Description | Difficulty |
|---|---|---|
| Bug Fixes | Fix issues and improve stability | Easy |
| Documentation | Improve guides and examples | Easy |
| New Templates | Add BERT, T5, custom architectures | Medium |
| Plugins | Integrate new services | Medium |
| Testing | Increase test coverage | Medium |
| i18n | Internationalization support | Hard |
MIT © Aniket Giri
See LICENSE for more information.
Built with Node.js (the CLI) and Python (the training stack).
If you find this project useful, please consider giving it a star!