
Capsule Networks (CapsNet)

Explore Capsule Networks (CapsNets) and how they preserve spatial hierarchies to solve the "Picasso problem" in AI. Learn about dynamic routing and vector neurons.

Capsule Networks, often abbreviated as CapsNets, represent an advanced architecture in the field of deep learning designed to overcome specific limitations found in traditional neural networks. Introduced by Geoffrey Hinton and his team, CapsNets attempt to mimic the biological neural organization of the human brain more closely than standard models. Unlike a typical convolutional neural network (CNN), which excels at detecting features but often loses spatial relationships due to downsampling, a Capsule Network organizes neurons into groups called "capsules." These capsules encode not just the probability of an object's presence, but also its specific properties, such as orientation, size, and texture, effectively preserving the hierarchical spatial relationships within visual data.

The Limitations of Traditional CNNs

To understand the innovation of CapsNets, it is helpful to look at how standard computer vision models operate. A conventional CNN uses layers of feature extraction followed by pooling layers—specifically max pooling—to reduce computational load and achieve translational invariance. This means a CNN can identify a "cat" regardless of where it sits in the image.

However, this process often discards precise location data, leading to the "Picasso problem": a CNN might classify a face correctly even if the mouth is on the forehead, simply because all the necessary features are present. CapsNets address this by removing pooling layers and replacing them with a process that respects the spatial hierarchies of objects.
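
The core of this information loss is easy to demonstrate. In the minimal NumPy sketch below, two toy 4x4 feature maps contain the same strong activation at different positions inside each 2x2 pooling window, and max pooling maps both to exactly the same output; the array sizes and values are illustrative assumptions rather than activations from a real network.

import numpy as np

def max_pool_2x2(x):
    # Naive 2x2 max pooling with stride 2 on a square feature map.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two 4x4 feature maps with the same activations in different positions
# inside each 2x2 pooling window.
a = np.zeros((4, 4)); a[0, 0] = 1.0; a[2, 2] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0; b[3, 3] = 1.0

print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True: the positional information is gone

A capsule would instead carry that positional detail forward in the orientation of its output vector, which is what the next section describes.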

How Capsule Networks Work

The core building block of this architecture is the capsule, a nested set of neurons that outputs a vector rather than a scalar value. In vector mathematics, a vector has both magnitude and direction. In a CapsNet:

  • Magnitude (Length): Represents the probability that a specific entity exists in the current input.
  • Direction (Orientation): Encodes the instantiation parameters, such as the object's pose estimation, scale, and rotation.

Capsules in lower layers (detecting simple shapes like edges) predict the output of capsules in higher layers (detecting complex objects like eyes or tires). This communication is managed by an algorithm called "dynamic routing" or "routing by agreement." If a lower-level capsule's prediction aligns with the higher-level capsule's state, the connection between them is strengthened. This allows the network to recognize objects from different 3D viewpoints without requiring the massive data augmentation usually needed to teach CNNs about rotation and scale.
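
The sketch below illustrates, in plain NumPy, the two mechanisms just described: each lower-level capsule turns its pose vector into a "vote" for every higher-level capsule through a learned transformation matrix, and routing by agreement then iteratively concentrates the coupling coefficients on the votes that agree with each other. The capsule counts, vector dimensions, three routing iterations, and function names are assumptions chosen for readability, not a reproduction of the reference CapsNet implementation.

import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Squash nonlinearity: shrinks a vector's length into (0, 1) while
    # preserving its direction, so the length can be read as a probability.
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def routing_by_agreement(u, W, num_iterations=3):
    # u: outputs of lower-level capsules, shape (num_lower, dim_lower)
    # W: learned transformation matrices, shape (num_lower, num_higher, dim_higher, dim_lower)
    u_hat = np.einsum("ijkl,il->ijk", W, u)            # each lower capsule's "vote" for each higher capsule
    num_lower, num_higher, _ = u_hat.shape
    b = np.zeros((num_lower, num_higher))              # routing logits
    for _ in range(num_iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients (softmax over higher capsules)
        s = (c[..., None] * u_hat).sum(axis=0)         # weighted sum of votes per higher capsule
        v = squash(s)                                  # higher-capsule output vectors
        b += np.einsum("ijk,jk->ij", u_hat, v)         # agreement: votes aligned with v gain routing weight
    return v

# Toy example: 6 lower capsules (dim 8) voting for 3 higher capsules (dim 16)
rng = np.random.default_rng(0)
u = rng.standard_normal((6, 8))
W = rng.standard_normal((6, 3, 16, 8)) * 0.1
v = routing_by_agreement(u, W)
print(np.linalg.norm(v, axis=-1))  # each length < 1: presence probability of that entity

Note that the softmax in the routing loop runs over the higher-level capsules, so each lower-level capsule distributes a fixed budget of agreement among the possible parents it could belong to.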

Key Differences: CapsNets vs. Convolutional Neural Networks (CNNs)

While both architectures are foundational to computer vision (CV), they differ in how they process and represent visual data:

  • Scalar vs. Vector: CNN neurons use scalar outputs to signify feature presence. CapsNets use vectors to encode presence (length) and pose parameters (orientation).
  • Routing vs. Pooling: CNNs use pooling to downsample data, often losing location details. CapsNets use dynamic routing to preserve spatial data, making them highly effective for tasks requiring precise object tracking.
  • Data Efficiency: Because capsules implicitly understand 3D viewpoints and affine transformations, they can often generalize from less training data compared to CNNs, which may require extensive examples to learn every possible rotation of an object.

Real-World Applications

Although CapsNets are generally more computationally expensive than optimized models such as YOLO26, they offer clear advantages in specific domains:

  1. Medical Image Analysis: In healthcare, the precise orientation and morphology of a lesion are critical. Researchers have applied Capsule Networks to brain tumor segmentation, where the model must separate a tumor from surrounding tissue based on the subtle spatial hierarchies that a standard CNN might overlook. You can explore related research on Capsule Networks in medical imaging.
  2. Overlapping Digit Recognition: CapsNets achieved state-of-the-art results on the MNIST dataset specifically in scenarios where digits overlap. Because the network tracks the "pose" of each digit, it can disentangle two overlapping numbers (e.g., a '3' on top of a '5') as distinct objects rather than merging them into a single confused feature map.
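
As context for the overlapping-digit benchmark mentioned in the second point, the short sketch below builds a MultiMNIST-style sample by shifting one 28x28 image and merging it with another through a pixel-wise maximum. The placeholder stroke images and the 6-pixel shift are illustrative assumptions; the actual benchmark overlays real MNIST digits.

import numpy as np

def make_overlapping_pair(digit_a, digit_b, shift=6):
    # Build a MultiMNIST-style sample: shift the second 28x28 digit a few
    # pixels to the right and merge the two images with a pixel-wise maximum,
    # so strokes from both digits share the same canvas.
    shifted = np.zeros_like(digit_b)
    shifted[:, shift:] = digit_b[:, :-shift]
    return np.maximum(digit_a, shifted)

# Placeholder 28x28 images standing in for two MNIST digits.
digit_a = np.zeros((28, 28)); digit_a[4:24, 10:14] = 1.0   # a rough vertical stroke
digit_b = np.zeros((28, 28)); digit_b[12:16, 4:24] = 1.0   # a rough horizontal stroke

overlapped = make_overlapping_pair(digit_a, digit_b)
print(overlapped.shape, overlapped.max())  # (28, 28) 1.0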

Practical Context and Implementation

Capsule Networks are primarily a classification architecture. While they are theoretically robust, modern industrial applications often favor high-speed CNNs or Transformer models for real-time performance. Even so, it is still valuable to understand the classification benchmark (MNIST) on which Capsule Networks were validated.

The following example demonstrates how to train a modern YOLO classification model on MNIST using the ultralytics package, a task analogous to the primary benchmark used to validate Capsule Networks.

from ultralytics import YOLO

# Load a YOLO26 classification model (optimized for speed and accuracy)
model = YOLO("yolo26n-cls.pt")

# Train the model on the MNIST dataset
# This dataset helps evaluate how well a model learns handwritten digit features
results = model.train(data="mnist", epochs=5, imgsz=32)

# Run inference on a sample image of a handwritten digit
# The model predicts the digit class (0-9); replace the placeholder path with your own image
predictions = model("path/to/handwritten_digit.png")

Capsules and the Future of Vision AI

The principles behind Capsule Networks continue to influence research on AI safety and explainability. By explicitly modeling part-whole relationships, capsules offer a "glass box" alternative to the "black box" nature of deep neural networks, making decision processes more interpretable. Future work aims to combine the spatial robustness of capsules with the inference speed of architectures such as YOLO11 or the newer YOLO26 to improve 3D object detection and robotic systems. Researchers are also exploring matrix capsules with EM routing to further reduce the computational cost of the agreement algorithm.

For developers looking to manage datasets and train models efficiently, the Ultralytics Platform provides a unified environment to annotate data, train in the cloud, and deploy models that balance the speed of CNNs with the accuracy required for complex vision tasks.
