DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

GRPO

GRPO [9] is the RL algorithm that we use to train DeepSeek-R1-Zero and DeepSeek-R1. It was originally proposed to simplify the training process and reduce the resource consumption of proximal policy optimization (PPO) [31], which is widely used in the RL stage of LLMs [32]. The pipeline of GRPO is shown in Extended Data Fig. 2. For each question q, GRPO samples …
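The excerpt stops before it describes what happens to the samples, so the following is only a minimal sketch of the group-relative idea that gives GRPO its name, based on how the algorithm is generally described rather than on the truncated text. It assumes that each sampled output for a question has already received a scalar reward, and that an output's advantage is its reward normalized by the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output relative to the other samples for the same question.

    `rewards` holds one scalar reward per sampled output (hypothetical values).
    `eps` is an assumed stabilizer that avoids division by zero when every
    reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers to one question:
# two judged correct (reward 1.0) and two judged incorrect (reward 0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no relative signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because the baseline comes from the group itself, no separately trained value model is needed in this sketch, which is one plausible reading of the reduced resource consumption relative to PPO that the excerpt mentions.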
