
Capsule Networks (CapsNets)

Explore Capsule Networks (CapsNets) and how they preserve spatial hierarchies to solve the "Picasso problem" in AI. Learn about dynamic routing and vector neurons.

Capsule Networks, often abbreviated as CapsNets, represent an advanced architecture in the field of deep learning designed to overcome specific limitations found in traditional neural networks. Introduced by Geoffrey Hinton and his team, CapsNets attempt to mimic the biological neural organization of the human brain more closely than standard models. Unlike a typical convolutional neural network (CNN), which excels at detecting features but often loses spatial relationships due to downsampling, a Capsule Network organizes neurons into groups called "capsules." These capsules encode not just the probability of an object's presence, but also its specific properties, such as orientation, size, and texture, effectively preserving the hierarchical spatial relationships within visual data.

The Limitations of Traditional CNNs

To understand the innovation of CapsNets, it is helpful to look at how standard computer vision models operate. A conventional CNN uses layers of feature extraction followed by pooling layers—specifically max pooling—to reduce computational load and achieve translational invariance. This means a CNN can identify a "cat" regardless of where it sits in the image.

However, this process often discards precise location data, leading to the "Picasso problem": a CNN might classify a face correctly even if the mouth is on the forehead, simply because all the necessary features are present. CapsNets address this by removing pooling layers and replacing them with a process that respects the spatial hierarchies of objects.
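
To make this loss of location concrete, the minimal NumPy sketch below (toy 4x4 feature maps with hypothetical activation values) shows 2x2 max pooling mapping two inputs, whose features sit in different positions, to exactly the same output. This discarded position information is what the Picasso problem exploits.

import numpy as np


def max_pool_2x2(x):
    # Naive 2x2 max pooling with stride 2 over a 2-D feature map
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


# Two toy feature maps: the activation sits at a different spot within each 2x2 window
a = np.array([[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]])
b = np.array([[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]])

# Both inputs pool to the same 2x2 map, so the exact feature positions are lost
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True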

How Capsule Networks Work

The core building block of this architecture is the capsule, a nested set of neurons that outputs a vector rather than a scalar value. In vector mathematics, a vector has both magnitude and direction. In a CapsNet:

  • Magnitude (Length): Represents the probability that a specific entity exists in the current input.
  • Direction (Orientation): Encodes the instantiation parameters, such as the object's pose estimation, scale, and rotation.

Capsules in lower layers (detecting simple shapes like edges) predict the output of capsules in higher layers (detecting complex objects like eyes or tires). This communication is managed by an algorithm called "dynamic routing" or "routing by agreement." If a lower-level capsule's prediction aligns with the higher-level capsule's state, the connection between them is strengthened. This allows the network to recognize objects from different 3D viewpoints without requiring the massive data augmentation usually needed to teach CNNs about rotation and scale.
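
The sketch below is a minimal NumPy illustration of these two ideas, not a full CapsNet implementation: the "squash" non-linearity preserves a capsule's direction while compressing its length into (0, 1) so it can act as an existence probability, and a small routing-by-agreement loop increases the coupling between a lower capsule and a higher capsule whenever their outputs agree. The layer sizes in the toy example are hypothetical.

import numpy as np


def squash(v, axis=-1, eps=1e-8):
    # Scale a vector so its length lies in (0, 1) while preserving its direction
    norm_sq = np.sum(v**2, axis=axis, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * (v / np.sqrt(norm_sq + eps))


def routing_by_agreement(u_hat, iterations=3):
    # u_hat: predictions from lower capsules for each higher capsule,
    #        shape (num_lower, num_higher, dim_higher)
    num_lower, num_higher, _ = u_hat.shape
    b = np.zeros((num_lower, num_higher))  # routing logits, start uniform
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients (softmax)
        s = (c[..., None] * u_hat).sum(axis=0)                # weighted sum of predictions
        v = squash(s)                                         # higher-level capsule outputs
        b += (u_hat * v[None]).sum(axis=-1)                   # agreement strengthens the link
    return v


# Toy example: 8 lower capsules predict for 3 higher capsules of dimension 4
v = routing_by_agreement(np.random.randn(8, 3, 4))
print(np.linalg.norm(v, axis=-1))  # each length behaves like an existence probability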

Key Differences: CapsNets vs. CNNs

While both architectures are foundational to computer vision (CV), they differ in how they process and represent visual data:

  • Scalar vs. Vector: CNN neurons use scalar outputs to signify feature presence. CapsNets use vectors to encode presence (length) and pose parameters (orientation).
  • Routing vs. Pooling: CNNs use pooling to downsample data, often losing location details. CapsNets use dynamic routing to preserve spatial data, making them highly effective for tasks requiring precise object tracking.
  • Data Efficiency: Because capsules implicitly understand 3D viewpoints and affine transformations, they can often generalize from less training data compared to CNNs, which may require extensive examples to learn every possible rotation of an object.

Real-World Applications

Although CapsNets are often more computationally expensive than optimized models such as YOLO26, they offer clear advantages in specific domains:

  1. Medical Image Analysis: In the medical field, the precise orientation and shape of an abnormal lesion are critical. Researchers have applied Capsule Networks to brain tumor segmentation, where the model must distinguish the tumor from surrounding tissue based on subtle spatial hierarchies that standard convolutional neural networks (CNNs) tend to smooth away. Related research on Capsule Networks in medical imaging is available for further exploration.
  2. Overlapping Digit Recognition: CapsNets achieved state-of-the-art results on the MNIST dataset specifically in scenarios where digits overlap. Because the network tracks the "pose" of each digit, it can disentangle two overlapping numbers (e.g., a '3' on top of a '5') as distinct objects rather than merging them into a single confused feature map.

Practical Context and Implementation

Capsule Networks are primarily classification architectures. While they offer theoretical robustness, modern industrial applications often favor fast CNNs and Transformers for real-time performance. Even so, it is useful to understand the classification benchmarks, such as MNIST, on which Capsule Networks are evaluated.

The example below shows how to train a modern YOLO classification model on the MNIST dataset using the ultralytics package. This parallels the primary benchmark task used to validate Capsule Networks.

from ultralytics import YOLO

# Load a YOLO26 classification model (optimized for speed and accuracy)
model = YOLO("yolo26n-cls.pt")

# Train the model on the MNIST dataset
# This dataset helps evaluate how well a model learns handwritten digit features
results = model.train(data="mnist", epochs=5, imgsz=32)

# Run inference on a sample digit image (placeholder path)
# The model predicts the digit class (0-9)
prediction = model("path/to/digit.jpg")
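
After inference, the predicted digit can be read from the returned Results object. The short sketch below assumes the standard probs interface of the ultralytics package and the prediction variable from the example above.

# Read the top-1 class index and its human-readable name from the result
top1_idx = prediction[0].probs.top1
print(f"Predicted digit: {prediction[0].names[top1_idx]}")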

The Future of Capsules and Vision AI

The principles behind Capsule Networks continue to influence research on AI safety and interpretability. By explicitly modeling part-whole relationships, capsules offer a "glass box" alternative to the "black box" nature of deep neural networks, making decisions more transparent. Future work explores combining the spatial robustness of capsules with the inference speed of architectures such as YOLO11 and the newer-generation YOLO26 to improve performance in 3D object detection and robotics. Researchers are also investigating matrix capsules with EM routing to reduce the computational cost of the agreement algorithm.

For developers looking to manage datasets and train models efficiently, the Ultralytics Platform provides a unified environment to annotate data, train in the cloud, and deploy models that balance the speed of CNNs with the accuracy required for complex vision tasks.
