
Reinforcement Learning from Human Feedback (RLHF)

Learn how Reinforcement Learning from Human Feedback (RLHF) optimizes AI performance by aligning models with human values, enabling safer and smarter AI.

Reinforcement Learning from Human Feedback (RLHF) is an advanced machine learning technique that refines artificial intelligence models by incorporating direct human input into the training loop. Unlike standard supervised learning, which relies solely on static labeled datasets, RLHF introduces a dynamic feedback mechanism where human evaluators rank or rate the model's outputs. This process allows the AI to capture complex, subjective, or nuanced goals—such as "helpfulness," "safety," or "creativity"—that are difficult to define with a simple mathematical loss function. RLHF has become a cornerstone in the development of modern large language models (LLMs) and generative AI, ensuring that powerful foundation models align effectively with human values and user intent.

Core Components of RLHF

The RLHF process typically follows a three-step pipeline designed to bridge the gap between raw predictive capability and behavior that matches human expectations.

  1. Supervised Fine-Tuning (SFT): The workflow typically begins with a pre-trained foundation model. Developers perform an initial fine-tuning pass on a smaller, high-quality demonstration dataset (for example, expert-written question-answer pairs). This step establishes a baseline policy, teaching the model the general format and tone required for the task.
  2. Reward Model Training: This phase is the distinguishing feature of RLHF. Human annotators review multiple outputs generated by the model for the same input and rank them from best to worst. This data labeling effort generates a dataset of preferences. A separate neural network, called the reward model, is trained on this comparison data to predict a scalar score that reflects human judgment; a minimal sketch of this step appears after this list. Tools available on the Ultralytics Platform can streamline the management of such annotation workflows.
  3. Reinforcement Learning Optimization: Finally, the original model acts as an AI agent within a reinforcement learning environment. Using the reward model as a guide, optimization algorithms like Proximal Policy Optimization (PPO) adjust the model's parameters to maximize the expected reward. This step aligns the model's policy with the learned human preferences, encouraging behaviors that are helpful and safe while discouraging toxic or nonsensical outputs.
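
The reward-model step can be sketched in a few lines of PyTorch. Everything below is a simplified, hypothetical placeholder rather than a production recipe: the RewardModel class, the embedding dimension, and the random "chosen" and "rejected" tensors stand in for embeddings of real model outputs ranked by annotators. The core idea is the pairwise loss, which pushes the network to assign a higher scalar score to the human-preferred response than to the rejected one.

import torch
import torch.nn as nn

# Hypothetical reward model: maps a response embedding to a single scalar score
class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.score(embedding).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Placeholder batch: embeddings of responses annotators preferred vs. rejected
chosen = torch.randn(8, 128)
rejected = torch.randn(8, 128)

# Pairwise (Bradley-Terry style) loss: the preferred output should score higher
loss = -torch.nn.functional.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

In practice, the reward model is often initialized from the fine-tuned model itself, and the comparison batches come from the ranking workflow described in step 2.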

Real-World Applications

RLHF has proven critical in deploying AI systems that require high safety standards and a nuanced understanding of human interaction.

  • Conversational AI and Chatbots: The most prominent application of RLHF is steering chatbots to be helpful, harmless, and honest. By penalizing outputs that are biased, factually incorrect, or dangerous, RLHF helps mitigate hallucination in large language models and reduces the risk of algorithmic bias. This ensures that virtual assistants refuse harmful instructions while still serving legitimate queries effectively.
  • Robotics and Physical Control: RLHF is not limited to text; it also extends to AI in robotics, where defining a perfect reward function for complex physical tasks is extremely difficult. For example, as a robot learns to navigate a crowded warehouse, human supervisors can indicate which trajectories are safe and which cause disruption (a sketch of such preference data follows this list). This feedback shapes the robot's control policy more effectively than simple deep reinforcement learning based only on goal completion.
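
To make the robotics example concrete, the snippet below shows one way such preference data might be recorded. The TrajectoryPreference structure and the example waypoints are hypothetical; they only illustrate the kind of pairwise comparisons a reward model over trajectories would later be trained on.

from dataclasses import dataclass

@dataclass
class TrajectoryPreference:
    """One human judgment comparing two candidate robot trajectories."""

    preferred: list[tuple[float, float]]  # waypoints the supervisor approved
    rejected: list[tuple[float, float]]  # waypoints judged unsafe or disruptive
    note: str = ""

# The supervisor prefers a wide detour around a busy aisle over a shortcut through it
preferences = [
    TrajectoryPreference(
        preferred=[(0.0, 0.0), (1.5, 2.0), (3.0, 2.0), (4.0, 0.5)],
        rejected=[(0.0, 0.0), (2.0, 0.2), (4.0, 0.5)],
        note="Shortcut passes too close to pickers in aisle 3.",
    )
]

# These comparisons become the training data for a reward model over trajectories
print(f"Collected {len(preferences)} trajectory comparison(s).")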

RLHF vs. Standard Reinforcement Learning

Distinguishing RLHF from traditional reinforcement learning (RL) helps clarify its specific role.

  • Standard RL: In a traditional setup, the reward function is hard-coded into the environment. In a video game, for example, the environment provides an explicit signal (+1 for winning, -1 for losing). The agent optimizes its behavior within this well-defined Markov Decision Process (MDP).
  • RLHF: In many real-world scenarios, such as writing a story or driving politely, "success" is subjective. RLHF addresses this by replacing the hard-coded reward with a learned reward model based on human preferences. This makes it possible to optimize for abstract notions such as "quality" or "appropriateness" that cannot be programmed explicitly; a minimal sketch of this substitution follows the list.
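
The contrast can be shown in code. Everything below is a toy placeholder: a small policy network, a random state, and a stand-in reward_model representing a network already trained on human preference data. The structural point is that the training loop asks the learned reward model, not the environment, to score the policy's action, then nudges the policy toward higher scores with a simple policy-gradient update; production RLHF systems typically use PPO with extra safeguards such as a KL penalty against the supervised baseline.

import torch
import torch.nn as nn

state_dim, num_actions = 16, 4

# Toy policy network: maps a state to a distribution over discrete actions
policy = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, num_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stand-in for a reward model already trained on human preference comparisons
reward_model = nn.Sequential(nn.Linear(state_dim + num_actions, 32), nn.ReLU(), nn.Linear(32, 1))

state = torch.randn(1, state_dim)
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()

# The learned reward model, not a hard-coded environment signal, scores the action
action_one_hot = torch.nn.functional.one_hot(action, num_actions).float()
reward = reward_model(torch.cat([state, action_one_hot], dim=-1)).detach()

# REINFORCE-style update: raise the log-probability of actions the reward model favors
loss = -(dist.log_prob(action) * reward.squeeze()).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()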

Integrating Perception with the Feedback Loop

In vision applications, an RLHF-aligned agent typically relies on computer vision (CV) to perceive the state of its environment before acting. A robust detector such as YOLO26 serves as the perception layer, producing structured observations (for example, "obstacle detected at 3 meters") that the policy network uses to select an action.

The following Python example illustrates a simplified concept where a YOLO model provides the environmental state. In a full RLHF loop, the "reward" signal would come from a model trained on human feedback regarding the agent's decisions based on this detection data.

from ultralytics import YOLO

# Load YOLO26n to act as the perception layer for an intelligent agent
model = YOLO("yolo26n.pt")

# The agent observes the environment (an image) to determine its state
results = model("https://ultralytics.com/images/bus.jpg")

# In an RL context, the 'state' is derived from detections
# A reward model (trained via RLHF) would evaluate the action taken based on this state
detected_objects = len(results[0].boxes)

print(f"Agent Observation: Detected {detected_objects} objects.")
# Example output: Agent Observation: Detected 4 objects.
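
To close the loop conceptually, the fragment below continues the example above: the detection-derived state drives a decision, and a preference-trained reward model scores that decision. The choose_action policy and preference_score functions are hypothetical stubs standing in for components learned from human feedback; they are not part of the Ultralytics API.

def choose_action(num_objects: int) -> str:
    """Hypothetical policy: slow down when the scene is crowded."""
    return "slow_down" if num_objects >= 3 else "proceed"

def preference_score(num_objects: int, action: str) -> float:
    """Stand-in for a reward model trained on human preference data."""
    # Annotators in this example consistently prefer cautious behavior in crowded scenes
    return 1.0 if num_objects >= 3 and action == "slow_down" else 0.2

detected_objects = 4  # value produced by the detection example above
action = choose_action(detected_objects)
reward = preference_score(detected_objects, action)
print(f"Action: {action}, reward-model score: {reward}")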

By pairing strong perception models with policies refined through human feedback, developers can build systems that are both capable and firmly aligned with AI safety principles. Ongoing research into scalable oversight approaches such as Constitutional AI continues to advance the field, aiming to ease the bottleneck of large-scale human annotation while preserving high model performance.
