@synsci/cli-linux-x64
npm package, version 1.1.57, 1 maintainer

GRPO/RL Training Skill

Expert-level guidance for Group Relative Policy Optimization with TRL

📁 Skill Structure

grpo-rl-training/
├── SKILL.md                              # Main skill documentation (READ THIS FIRST)
├── README.md                             # This file
├── templates/
│   └── basic_grpo_training.py            # Production-ready training template
└── examples/
    └── reward_functions_library.py       # 20+ reward function examples

🚀 Quick Start

  • Read SKILL.md - Comprehensive guide with all concepts and patterns
  • Copy templates/basic_grpo_training.py - Start with working code
  • Browse examples/reward_functions_library.py - Pick reward functions for your task
  • Modify for your use case - Adapt dataset, rewards, and config

💡 What's Inside

SKILL.md (Main Documentation)

  • Core GRPO concepts and algorithm fundamentals
  • Complete implementation workflow (dataset → rewards → training → deployment)
  • 10+ reward function examples with code
  • Hyperparameter tuning guide
  • Training insights (loss behavior, metrics, debugging)
  • Troubleshooting guide
  • Production best practices

Templates

  • basic_grpo_training.py: Minimal, production-ready training script
    • Uses Qwen 2.5 1.5B Instruct
    • 3 reward functions (covering format and correctness)
    • LoRA for efficient training
    • Fully documented and ready to run

Examples

  • reward_functions_library.py: 20+ battle-tested reward functions
    • Correctness rewards (exact match, fuzzy match, numeric, code execution)
    • Format rewards (XML, JSON, strict/soft)
    • Length rewards (ideal length, min/max)
    • Style rewards (reasoning quality, citations, repetition penalty)
    • Combined rewards (multi-objective optimization)
    • Preset collections for common tasks
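
As a rough illustration of the strict/soft format rewards and the combined rewards listed above, here is a minimal sketch. It is not the code from reward_functions_library.py: the `<reasoning>`/`<answer>` tag names, the function names, and the weights are all assumptions chosen for the example. The functions take a list of completion strings and return one float per completion, which is the shape TRL expects from a custom reward function.

```python
import re

# Hypothetical output format: <reasoning>...</reasoning> followed by <answer>...</answer>.
STRICT_PATTERN = re.compile(
    r"\s*<reasoning>.+?</reasoning>\s*<answer>.+?</answer>\s*", re.DOTALL
)
TAGS = ("<reasoning>", "</reasoning>", "<answer>", "</answer>")


def strict_format_reward(completions, **kwargs):
    """1.0 only when the whole completion matches the expected layout."""
    return [1.0 if STRICT_PATTERN.fullmatch(text) else 0.0 for text in completions]


def soft_format_reward(completions, **kwargs):
    """Partial credit: fraction of the four tags that appear exactly once."""
    return [sum(text.count(tag) == 1 for tag in TAGS) / len(TAGS) for text in completions]


def combine(weighted_funcs):
    """Build one reward function as a weighted sum of several others."""
    def combined(completions, **kwargs):
        totals = [0.0] * len(completions)
        for weight, func in weighted_funcs:
            for i, score in enumerate(func(completions, **kwargs)):
                totals[i] += weight * score
        return totals
    return combined


# Example weights (assumed): equal emphasis on strict and soft formatting.
format_reward = combine([(0.5, strict_format_reward), (0.5, soft_format_reward)])
```

The soft variant gives the policy a gradient toward the format before it can satisfy the strict check, which is one reason to pair the two rather than rely on a single all-or-nothing reward.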

📖 Usage for Agents

When this skill is loaded in your agent's context:

  • Always read SKILL.md first before implementing
  • Start simple - Use a length-based reward to validate the setup
  • Build incrementally - Add one reward function at a time
  • Reference examples - Copy patterns from reward_functions_library.py
  • Monitor training - Watch reward metrics (not loss!)
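
A length-based reward for the "start simple" step might look like the sketch below. The function name, the `target_words` parameter, and the linear falloff are assumptions made for illustration, not the skill's own implementation. The signature (a list of completions plus `**kwargs`, returning one float per completion) follows TRL's custom reward function convention, and the helper accepts both plain-string and chat-style completions.

```python
from typing import Any


def _completion_text(completion: Any) -> str:
    """Return the text of a completion in either plain or chat format."""
    if isinstance(completion, str):
        return completion
    # Chat-style completions are assumed to be a list of {"role", "content"} dicts.
    return "".join(message.get("content", "") for message in completion)


def length_reward(completions, target_words: int = 50, **kwargs) -> list[float]:
    """Score each completion in [0, 1] by closeness to a target word count."""
    rewards = []
    for completion in completions:
        n_words = len(_completion_text(completion).split())
        # Linear falloff: 1.0 at the target, 0.0 at zero words
        # or at twice the target and beyond.
        rewards.append(max(0.0, 1.0 - abs(n_words - target_words) / target_words))
    return rewards
```

Because the score depends only on word count, it is easy to verify by hand, so any surprise in the logged reward points at the training setup instead of the reward logic.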

🎯 Common Use Cases

| Task Type | Recommended Rewards | Template |
| --- | --- | --- |
| Math reasoning | MATH_REASONING_REWARDS preset | basic_grpo_training.py |
| Code generation | CODE_GENERATION_REWARDS preset | Modify dataset in template |
| Summarization | SUMMARIZATION_REWARDS preset | Adjust prompts + rewards |
| Q&A | QA_REWARDS preset | Use fuzzy match + citations |

⚠️ Critical Reminders

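  • Loss typically goes UP during training - This is normal: the reported loss largely tracks the KL divergence from the reference model, which grows as the policy changes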
  • Use 3-5 reward functions - Single rewards often fail
  • Test rewards before training - Debug each function independently
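  • Monitor reward_std - Should stay > 0.1; if every completion in a group gets the same reward, the group-relative advantages are zero and the step carries no learning signal (a sign of mode collapse)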
  • Start with num_generations=4-8 - Scale up if GPU allows
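
To see why reward_std matters, the sketch below normalizes rewards within one group of sampled completions, which is the core idea behind GRPO's group-relative advantage. It is a simplified illustration in plain Python, not TRL's implementation: the function names and the `eps` value are assumptions, and the population standard deviation is used for simplicity. The 0.1 threshold is the one quoted above.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_enough_spread(rewards, min_std=0.1):
    """True when the group's reward spread exceeds the monitoring threshold."""
    return pstdev(rewards) > min_std


# Every completion scored the same: all advantages are zero, nothing to learn from.
collapsed = group_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed scores: better completions get positive advantages, worse ones negative.
varied = group_advantages([0.0, 0.0, 1.0, 1.0])
```

The same check can be run on the outputs of a candidate reward function over a handful of hand-written completions before training starts, which covers the "test rewards before training" reminder as well.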

🔗 External Resources

  • TRL Documentation
  • DeepSeek R1 Paper
  • Open R1 Implementation
  • Unsloth (2-3x faster)

📝 Version

v1.0.0 - Initial release (January 2025)

👨‍💻 Maintained By

Synthetic Sciences

For questions or improvements, see https://orchestra.com

License: MIT

Last Updated: January 2025

Package last updated on 11 Feb 2026