GRPO

Supervised Fine-Tuning (SFT) is the stepping stone that gives a model domain knowledge and tone, but it does not teach the model to handle complex preference tradeoffs: telling safe generations from toxic ones, adhering to a required output format, or correcting its own logic errors mid-reasoning. To align small models with human intent, safety guidelines, and logical correctness, we run a Preference Alignment phase. This article details the mechanics of reinforcement learning for LLM alignment. We compare the mathematical objectives of DPO (Direct Preference Optimization) and KTO (Kahneman-Tversky Optimization), and dissect GRPO (Group Relative Policy Optimization), the breakthrough algorithm powering DeepSeek-R1 that removes the separate critic model and frees up over 50% of training memory. ...
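To make the "group relative" part of GRPO concrete before going further, here is a minimal sketch, assuming the commonly described formulation in which several completions are sampled for one prompt and each completion's reward is normalized by the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the toy rewards are illustrative assumptions, not details taken from this article.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion relative to the other completions in its group.

    rewards: scalar rewards for completions sampled from the SAME prompt.
    eps:     small constant (assumed value) that avoids division by zero
             when every completion in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: four completions for one prompt, rewarded 1.0 if correct else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where all completions score the same carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because the baseline is the group mean rather than a learned value estimate, this step needs no critic network, which is where the memory saving comes from. Completions above the group average receive positive advantages, those below receive negative ones, and a uniformly scored group yields all zeros.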