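Is the whole internet teaching you how to deploy DeepSeek-R1 with ollama?
For a programmer that is hardly enough. This post walks you step by step through building a DeepSeek-R1 demo from Qwen2.5-0.5B, a rule-based reward model, and GRPO.
Simple enough that even my girlfriend can follow along~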

Key innovations

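If DeepSeek-V3 showed that careful optimization can deliver solid results on a small budget, then R1-Zero demonstrated capabilities such as self-verification, reflection, and generating long CoT (Chain of Thought), confirming that an LLM's reasoning ability can be incentivized purely through RL, without SFT.

The final R1 model is trained with two RL stages, which aim to discover improved reasoning patterns and align with human preferences, and two SFT stages, which seed the model's reasoning and non-reasoning capabilities. Pipelines like this for building better models may be where the industry is heading.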

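DeepSeek also showed that the reasoning patterns of larger models can be distilled into smaller models, and that this performs better than applying reinforcement learning directly to the small models. The open-source DeepSeek-R1 and its API will help the research community distill better small models in the future.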

ps: I will not reproduce the detailed comparisons here; you can read them directly on GitHub: https://github.com/deepseek-ai/DeepSeek-R1

How it works, in brief

The general LLM training process

Pretrain
SFT (supervised fine-tuning)
RL (reinforcement learning): fine-tuning with reinforcement learning from human feedback

DeepSeek-R1-Zero

Pretrain
RL
R1-Zero goes straight to RL without SFT, which leads to problems such as endless repetition, poor readability, and language mixing.

DeepSeek-R1

Pretrain
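SFT stage 1: cold start
- What is done: the base model is fine-tuned on several thousand high-quality CoT examples, enforcing a standard output format (e.g. <think>reasoning process</think>) and improving logical flow and readability.
- Data source: outputs of DeepSeek-R1-Zero are collected, presented in a readable form, and then post-processed by human annotators to refine the results. Only data that can be evaluated with rule-based rewards is included.
RL stage 1: reasoning-oriented, the key step that produces the chain of thought
- GRPO (Group Relative Policy Optimization, a policy optimization algorithm) is used to run reinforcement learning on the SFT model.
- Reward model: rule-based rewards that enforce answer accuracy and language consistency, with reward functions designed for tasks that have fixed answers, such as math and coding.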

ps: this is also why DeepSeek is so strong at math and code.
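To make the "group relative" part of GRPO concrete, here is a minimal sketch of how per-completion advantages could be computed from rule-based rewards: each reward is normalized against the other completions sampled for the same prompt. The function name, the `eps` value, and the use of the population standard deviation are assumptions for illustration; real implementations may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the other completions of the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group (an assumption)
    # eps (hypothetical value) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored by rule-based rewards.
rewards = [2.0, 0.0, 0.0, 2.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group mean get a positive advantage and those below get a negative one, so no separate value model is needed to provide a baseline.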

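SFT stage 2: merged data
- Goal: align reasoning data and non-reasoning data
- Data sources
  - Reasoning data
    - Outputs from the RL stage 1 checkpoint.
    - More data is added to expand the dataset. Some of it is scored by a generative reward model: the ground-truth answer and the model prediction are fed into DeepSeek-V3 for judgment.
    - In addition, because model output is sometimes messy and hard to read, chains of thought with mixed languages, long paragraphs, or code blocks are filtered out. For each prompt, multiple responses are sampled and only the correct ones are kept.
    - In total, about 600k reasoning-related training samples
  - Non-reasoning data:
    - Writing, factual QA, self-cognition, translation, and so on, reusing part of the DeepSeek-V3 supervised fine-tuning dataset
    - In total, about 200k training samples unrelated to reasoning
RL stage 2: general RL alignment
- RLHF (Reinforcement Learning from Human Feedback) is used, incorporating human-preference reward models (helpfulness and harmlessness) to ensure the model is safe and useful on open-domain tasks.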
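The filtering step in SFT stage 2 (sample several responses per prompt and keep only the correct ones) can be sketched roughly as follows. This is only an illustration under stated assumptions: `keep_correct_samples`, the `max_chars` readability cutoff, and the record layout are hypothetical, and the answer extraction simply reuses the tag-based idea from the demo code in this post.

```python
def extract_answer(text):
    # Take whatever sits between <answer> and </answer>.
    return text.split("<answer>")[-1].split("</answer>")[0].strip()


def keep_correct_samples(prompt, samples, reference, max_chars=2000):
    """Return SFT records for the samples whose final answer matches the reference."""
    kept = []
    for sample in samples:
        if len(sample) > max_chars:  # hypothetical stand-in for a readability filter
            continue
        if extract_answer(sample) == str(reference):
            kept.append({"prompt": prompt, "completion": sample})
    return kept


samples = [
    "<think>5 * 60 = 300</think><answer>300</answer>",
    "<think>1 minute is 60 seconds</think><answer>60</answer>",
]
records = keep_correct_samples("How many seconds are in 5 minutes?", samples, 300)
print(len(records))
```

Only the first sample survives here, because its extracted answer equals the reference; the second is dropped as incorrect.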

Implementing the reasoning process

We use Qwen/Qwen2.5-0.5B-Instruct as the base model, swulling/gsm8k_chinese as the dataset, and GRPOTrainer as the trainer. The core code is as follows:

import re

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GRPOConfig, GRPOTrainer

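# The prompt is kept in Chinese to match the Chinese dataset.
# It reads: "Generate output in the following format:"
SYSTEM_PROMPT = """
按照如下格式生成：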
<think>
...
</think>
<answer>
...
</answer>
"""


def process_data(data):
    return data.map(
        lambda x: {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": x["question_zh-cn"]},
            ],
            "answer": x["answer_only"],
        }
    )


def extract_answer(text):
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()


def correctness_reward(prompts, completions, answer, **kwargs):
    # Reward for whether the generated answer is correct
    responses = [completion[0]["content"] for completion in completions]
    extracted_responses = [extract_answer(r) for r in responses]
    # Log the first sample of each batch for inspection
    print(
        f"Question:\n{prompts[0][-1]['content']}",
        f"\nAnswer:\n{answer[0]}",
        f"\nModel output:\n{responses[0]}",
        f"\nExtracted answer:\n{extracted_responses[0]}",
    )
    return [
        2.0 if response == str(ans) else 0.0
        for response, ans in zip(extracted_responses, answer)
    ]


def digit_reward(completions, **kwargs):
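    """Reward for whether the generated answer is a number.

    Rewarding only on correctness is a harsh condition: the reward becomes
    sparse and the model struggles to converge. So a smaller reward is given
    when the answer, even if wrong, is at least a number (for math problems).
    """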
    responses = [completion[0]["content"] for completion in completions]
    extracted_responses = [extract_answer(r) for r in responses]
    return [0.5 if response.isdigit() else 0.0 for response in extracted_responses]


def hard_format_reward(completions, **kwargs):
    # Strict format reward: the whole completion must follow the template exactly
    pattern = r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets "." match newlines, so multi-line reasoning can match
    matches = [re.match(pattern, response, re.DOTALL) for response in responses]
    return [0.5 if match else 0.0 for match in matches]


def soft_format_reward(completions, **kwargs):
    # Loose format reward: tags must be present and in order, whitespace is flexible
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets "." match newlines, so multi-line reasoning can match
    matches = [re.match(pattern, response, re.DOTALL) for response in responses]
    return [0.5 if match else 0.0 for match in matches]


def mark_reward(completions, **kwargs):
    # Tag reward (eases the sparsity of the format rewards)
    def mark_num(text):
        reward = 0
        if text.count("<think>\n") == 1:
            reward += 0.125
        if text.count("</think>\n") == 1:
            reward += 0.125
        if text.count("<answer>\n") == 1:
            reward += 0.125
        if text.count("</answer>\n") == 1:
            reward += 0.125
        return reward

    responses = [completion[0]["content"] for completion in completions]
    return [mark_num(response) for response in responses]


if __name__ == "__main__":
    model_name = "Qwen/Qwen2.5-0.5B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir="./model")

    # To train with LoRA, uncomment the following
    # from peft import LoraConfig, TaskType, get_peft_model
    # lora_config = LoraConfig(
    #     r=8,
    #     lora_alpha=256,
    #     target_modules=[
    #         "q_proj",
    #         "k_proj",
    #         "v_proj",
    #         "o_proj",
    #         "gate_proj",
    #         "up_proj",
    #         "down_proj",
    #     ],
    #     lora_dropout=0.1,
    #     task_type=TaskType.CAUSAL_LM,
    # )
    # # Train with LoRA
    # model = get_peft_model(model, lora_config)
    model.cuda()

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    ds = load_dataset("swulling/gsm8k_chinese", cache_dir="./dataset")
    data = process_data(ds["train"])

    output_dir = "output"

    training_args = GRPOConfig(
        output_dir=output_dir,
        learning_rate=5e-6,
        adam_beta1=0.9,
        adam_beta2=0.99,
        weight_decay=0.1,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        logging_steps=1,
        bf16=True,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_generations=16,
        max_prompt_length=256,
        max_completion_length=200,
        num_train_epochs=1,
        save_steps=100,
        max_grad_norm=0.1,
        log_on_each_node=False,
        use_vllm=False,
        report_to="tensorboard",
    )

    trainer = GRPOTrainer(
        model=model,
        processing_class=tokenizer,
        reward_funcs=[
            mark_reward,
            soft_format_reward,
            hard_format_reward,
            digit_reward,
            correctness_reward,
        ],
        args=training_args,
        train_dataset=data,
    )
    trainer.train()
    trainer.save_model(output_dir)
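Before starting a training run, it can help to check by hand what the reward functions return for a few completions. The sketch below uses simplified copies of three of the rewards above (the logging `print` is dropped and the regex-based format rewards are left out) on made-up completions in the conversational format; the sample texts and the way the scores are summed are assumptions for illustration only.

```python
def extract_answer(text):
    return text.split("<answer>")[-1].split("</answer>")[0].strip()


def correctness_reward(completions, answer, **kwargs):
    extracted = [extract_answer(c[0]["content"]) for c in completions]
    return [2.0 if e == str(a) else 0.0 for e, a in zip(extracted, answer)]


def digit_reward(completions, **kwargs):
    extracted = [extract_answer(c[0]["content"]) for c in completions]
    return [0.5 if e.isdigit() else 0.0 for e in extracted]


def mark_reward(completions, **kwargs):
    tags = ["<think>\n", "</think>\n", "<answer>\n", "</answer>\n"]
    return [
        sum(0.125 for tag in tags if c[0]["content"].count(tag) == 1)
        for c in completions
    ]


# Made-up completions: correct, well-formed but wrong, and unformatted.
good = "<think>\n5 * 60 = 300\n</think>\n<answer>\n300\n</answer>\n"
wrong = "<think>\n1 minute is 60 seconds\n</think>\n<answer>\n60\n</answer>\n"
unformatted = "The answer is 300."
completions = [
    [{"role": "assistant", "content": text}] for text in (good, wrong, unformatted)
]
answer = ["300", "300", "300"]

# Sum the per-function scores for each completion.
totals = [
    sum(scores)
    for scores in zip(
        correctness_reward(completions, answer),
        digit_reward(completions),
        mark_reward(completions),
    )
]
print(totals)  # → [3.0, 1.0, 0.0]
```

The well-formed but wrong completion still earns 1.0 from the digit and tag rewards, which is exactly the denser signal the auxiliary rewards are meant to provide.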

Conclusion

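After training, the model turns out not to perform very well (output translated from Chinese):

>>> How many seconds are in 5 minutes?
<think>
1 minute is 60 seconds, so 5 minutes = 60 seconds
</think>
<answer>
60
</answer>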

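Compare with the real DeepSeek-R1 (output translated from Chinese):

>>> How many seconds are in 5 minutes?
<think>
First, I need to convert 5 minutes into seconds.
I know that 1 minute equals 60 seconds.
So 5 minutes is 5 times 60 seconds.
The calculation gives 5 minutes equals 300 seconds.
</think>
To convert 5 minutes into seconds, follow these steps:
**Step 1: Know how many seconds are in a minute**
\[ 1 \text{ minute} = 60 \text{ seconds} \]
**Step 2: Calculate how many seconds are in 5 minutes**
\[ 5 \text{ minutes} = 5 \times 60 \text{ seconds} = 300 \text{ seconds} \]
**Final answer:**
\[
\boxed{300}
\]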

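This is probably because the reward design is not thorough enough, and of course the small base model also plays a part.
That is precisely what shows DeepSeek's success has many ingredients: carefully curated datasets, a large model, and meticulous optimization are all indispensable.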
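As one example of where the reward design could be tightened: `str.isdigit()` in `digit_reward` only accepts non-negative integers, so answers such as "-5" or "3.5" get no partial credit. The sketch below shows one possible, more tolerant check. It is not part of the original code; `is_number` and its handling of thousands separators are assumptions for illustration.

```python
import math


def is_number(text):
    """Return True if text parses as a finite number (commas are ignored)."""
    cleaned = text.replace(",", "").strip()
    try:
        value = float(cleaned)
    except ValueError:
        return False
    # Reject "nan" and "inf", which float() would otherwise accept
    return math.isfinite(value)


for candidate in ["300", "-5", "3.5", "1,200", "300 seconds"]:
    print(candidate, candidate.isdigit(), is_number(candidate))
```

Whether such a change actually improves training is untested here; it only widens the set of answers that count as numeric.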

References

RLHF: https://arxiv.org/pdf/2203.02155
CoT: https://arxiv.org/pdf/2201.11903
DeepSeekMath (GRPO): https://arxiv.org/pdf/2402.03300
DeepSeek-R1: https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf
DeepSeek-V3: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
GitHub: https://github.com/wyf3/llm_related/tree/main/deepseek_learn/deepseek_r1_train
Video: https://b23.tv/DqAbFBR
