Steering Vectors

Explore how steering vectors enable real-time control of neural networks without retraining. Learn to apply activation engineering with Ultralytics YOLO26.

Steering vectors are directions within the hidden activation space of a neural network that correspond to high-level concepts, such as "politeness," "truthfulness," or specific visual features. By adding these vectors to, or subtracting them from, the model's internal states during the forward pass, developers can shift the model's behavior in a controlled way without updating any underlying weights. This technique, rooted in activation engineering, provides low-overhead, inference-time control over deep learning systems ranging from large language models to vision architectures.

How Steering Vectors Work

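To create a steering vector, researchers typically use a method called Contrastive Activation Addition (CAA). This involves passing a set of contrastive pairs through the network, for example a prompt that asks the model to be "helpful" paired with a prompt that asks it to be "harmful." Computing the difference between the activations of each pair at a chosen layer, and averaging that difference over many samples, isolates the geometric direction in activation space that represents the concept.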
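The extraction step described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not a reference implementation: the toy `nn.Sequential` network, the random "contrast pairs," and the helper name `capture_activations` are all assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a real network (hypothetical; any nn.Module works the same way)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
layer = model[1]  # the layer whose output activations are read


def capture_activations(model, layer, inputs):
    """Run `inputs` through `model` and return the output of `layer`."""
    captured = []
    # The hook only records the output; it returns None, so the output is unchanged
    handle = layer.register_forward_hook(
        lambda mod, inp, out: captured.append(out.detach())
    )
    try:
        with torch.no_grad():
            model(inputs)
    finally:
        handle.remove()
    return captured[0]


# Contrast pairs: row i of `positive` is paired with row i of `negative`.
# Random tensors are used here; real pairs would be prompts or images
# with and without the target concept.
positive = torch.randn(32, 8) + 1.0
negative = torch.randn(32, 8) - 1.0

pos_acts = capture_activations(model, layer, positive)  # shape (32, 16)
neg_acts = capture_activations(model, layer, negative)  # shape (32, 16)

# Steering vector: mean over pairs of (positive activation - negative activation)
steering_vector = (pos_acts - neg_acts).mean(dim=0)  # shape (16,)
```

Averaging over many pairs is what cancels out sample-specific noise and leaves the shared concept direction.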

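During real-time inference, this vector is added to, or subtracted from, the hidden states at a specific layer using simple PyTorch tensor addition. Adjusting the strength of the vector lets practitioners fine-tune how strongly the behavior is injected.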
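The injection step and the role of the strength coefficient can be illustrated as follows. This is a sketch under stated assumptions: the toy network, the random placeholder vector, and the helper names `make_steering_hook` and `run_steered` are hypothetical and chosen only for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy network and placeholder vector (assumptions for illustration only)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
layer = model[1]
steering_vector = torch.randn(16)  # normally pre-computed from contrast pairs


def make_steering_hook(vector, strength):
    """Build a forward hook that adds `strength * vector` to the layer output."""

    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the layer's output
        return output + strength * vector

    return hook


def run_steered(model, layer, x, vector, strength):
    """Run one forward pass with the steering hook attached, then detach it."""
    handle = layer.register_forward_hook(make_steering_hook(vector, strength))
    try:
        with torch.no_grad():
            return model(x)
    finally:
        handle.remove()


x = torch.randn(2, 8)
with torch.no_grad():
    baseline = model(x)

amplified = run_steered(model, layer, x, steering_vector, strength=2.0)
suppressed = run_steered(model, layer, x, steering_vector, strength=-2.0)
unchanged = run_steered(model, layer, x, steering_vector, strength=0.0)
```

A positive strength pushes activations toward the concept, a negative strength pushes them away, and a strength of zero reproduces the unsteered output.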

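Differentiating Steering Vectors from Related Concepts

To understand where steering vectors fit in the broader machine learning landscape, it helps to distinguish them from similar methods: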

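Task Vectors: Task vectors operate in weight space, merging capabilities by modifying the actual model weights after training. Steering vectors operate strictly in activation space at runtime and never touch the original weights.
Representation Engineering (RepE): RepE is an overarching framework for reading and controlling a model's internal cognitive states, studied in depth by organizations such as the Center for AI Safety. Steering vectors are the specific mathematical tool used in RepE's control phase.
Prompt Engineering: Prompt engineering tries to guide behavior by modifying the user's input text or image. Steering vectors bypass the input bottleneck and directly manipulate the model's internal processing.
Fine-Tuning: Traditional alignment methods, such as Reinforcement Learning from Human Feedback (RLHF), permanently change the model through gradient descent. This requires substantial compute, often managed through cloud tools such as the Ultralytics Platform. Steering vectors avoid this training overhead entirely.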
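The weight-space versus activation-space distinction in the first item can be made concrete with a small sketch. Everything here is an assumption for illustration: a single `nn.Linear` layer stands in for a model, and random noise stands in for the effect of fine-tuning.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

base = nn.Linear(4, 4)
finetuned = copy.deepcopy(base)
with torch.no_grad():
    # Stand-in for fine-tuning: perturb the weights slightly
    finetuned.weight.add_(0.05 * torch.randn_like(finetuned.weight))

# Task vector: a difference of WEIGHTS, applied by editing parameters
base_state = base.state_dict()
tuned_state = finetuned.state_dict()
task_vector = {name: tuned_state[name] - base_state[name] for name in base_state}

edited = copy.deepcopy(base)
with torch.no_grad():
    for name, param in edited.named_parameters():
        param.add_(task_vector[name])

# Steering vector: added to ACTIVATIONS at runtime; parameters stay untouched
weights_before = copy.deepcopy(base.state_dict())
steering_vector = torch.full((4,), 0.1)
handle = base.register_forward_hook(lambda mod, inp, out: out + steering_vector)

x = torch.randn(1, 4)
with torch.no_grad():
    steered_out = base(x)
handle.remove()

with torch.no_grad():
    plain_out = base(x)
```

After this runs, `edited` carries permanently changed parameters, while `base` has the same parameters as before and returns its original output once the hook is removed.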

Real-World Applications in AI

The ability to steer models dynamically has enabled significant advances in modern AI pipelines:

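Enhancing AI Safety: By isolating steering vectors associated with "refusal" or "harmlessness," engineers can push a model toward refusing malicious instructions. Supported by alignment research at OpenAI and interpretability research at Anthropic, steering specific features can drastically change an AI's conversational style and help enforce safety guardrails.
Controlling Reasoning Models: Recent research on advanced thinking architectures shows that steering vectors can modulate internal reasoning chains. During complex problem solving, practitioners can increase a model's tendency to express uncertainty or to correct its own mistakes.
Mitigating AI Bias: By extracting a vector that represents a specific social bias, developers can subtract that direction during generation. This can reduce the bias and improve fairness without retraining, and similar interventions may also lower the likelihood of LLM hallucinations.
Steering Computer Vision Systems: In vision models, steering vectors can be applied to feature maps to artificially increase the network's sensitivity to critical objects. For example, an object detection model can be steered to prioritize finding pedestrians in adverse weather conditions.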
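For the vision case, one plausible way to build and apply a steering vector is per channel, so that a single vector broadcasts across every spatial location of a feature map. The sketch below is an assumption about how this might be done, not a documented Ultralytics procedure, and the random tensors stand in for real feature maps captured from a convolutional layer.

```python
import torch

torch.manual_seed(0)

# Hypothetical feature maps from a conv layer: (batch, channels, height, width)
pos_maps = torch.randn(8, 16, 20, 20) + 0.5  # e.g. images containing the target
neg_maps = torch.randn(8, 16, 20, 20)        # e.g. images without it

# Average over batch and spatial dims to get one value per channel
channel_direction = pos_maps.mean(dim=(0, 2, 3)) - neg_maps.mean(dim=(0, 2, 3))

# Reshape to (1, C, 1, 1) so it broadcasts over every image and pixel location
steering_vector = channel_direction.view(1, -1, 1, 1)

# Apply to a new feature map with an adjustable strength
feature_map = torch.randn(2, 16, 20, 20)
strength = 1.5
steered = feature_map + strength * steering_vector
```

Because the vector has shape `(1, C, 1, 1)`, it does not depend on the input resolution, which matters for detectors that accept images of varying size.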

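Applying Steering Vectors with PyTorch

The following is a runnable example that applies an activation steering intervention to an Ultralytics YOLO26 model during the forward pass. By using PyTorch forward hooks, you can inject a custom vector directly into a hidden layer.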

import torch
from ultralytics import YOLO

# Load the recommended Ultralytics YOLO26 model for state-of-the-art vision tasks
model = YOLO("yolo26n.pt")


# Define a hook function to steer the internal activations
def steer_activations_hook(module, input, output):
    # Create a steering vector matching the output shape (for demonstration purposes)
    # In practice, this vector is pre-computed via Contrastive Activation Addition (CAA)
    steering_vector = torch.ones_like(output) * 0.1

    # Add the steering vector to the model's hidden states to alter behavior at inference
    return output + steering_vector


# Attach the hook to a middle layer (e.g., layer index 5) to inject the vector
handle = model.model.model[5].register_forward_hook(steer_activations_hook)

# Run inference on an image with the dynamically steered activations
try:
    results = model("https://ultralytics.com/images/bus.jpg")
finally:
    # Always remove the hook, even if inference fails, to restore the unsteered model
    handle.remove()