Capsule Networks (CapsNet)

Explore Capsule Networks (CapsNets) and how they preserve spatial hierarchies to solve the "Picasso problem" in AI. Learn about dynamic routing and vector neurons.

Capsule Networks, often abbreviated as CapsNets, represent an advanced architecture in the field of deep learning designed to overcome specific limitations found in traditional neural networks. Introduced by Geoffrey Hinton and his team, CapsNets attempt to mimic the biological neural organization of the human brain more closely than standard models. Unlike a typical convolutional neural network (CNN), which excels at detecting features but often loses spatial relationships due to downsampling, a Capsule Network organizes neurons into groups called "capsules." These capsules encode not just the probability of an object's presence, but also its specific properties, such as orientation, size, and texture, effectively preserving the hierarchical spatial relationships within visual data.

The Limitation of Traditional CNNs

To understand the innovation of CapsNets, it is helpful to look at how standard computer vision models operate. A conventional CNN uses layers of feature extraction followed by pooling layers—specifically max pooling—to reduce computational load and achieve translational invariance. This means a CNN can identify a "cat" regardless of where it sits in the image.

However, this process often discards precise location data, leading to the "Picasso problem": a CNN might classify a face correctly even if the mouth is on the forehead, simply because all the necessary features are present. CapsNets address this by removing pooling layers and replacing them with a process that respects the spatial hierarchies of objects.
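The loss of position under max pooling can be seen in a toy case. The sketch below is pure Python with hypothetical 4x4 activation maps (the `max_pool_2x2` helper and the map values are illustrative assumptions, not part of any library): a feature that fires at two different positions inside the same pooling window produces the same pooled output.

```python
def max_pool_2x2(grid):
    """Apply non-overlapping 2x2 max pooling to a 2D list with even dimensions."""
    pooled = []
    for r in range(0, len(grid), 2):
        row = []
        for c in range(0, len(grid[0]), 2):
            window = (grid[r][c], grid[r][c + 1],
                      grid[r + 1][c], grid[r + 1][c + 1])
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical activation maps for one feature detector.
# The feature fires at (row 1, col 1) in the first map and (row 0, col 0) in the second.
feature_at_1_1 = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
feature_at_0_0 = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

pooled_a = max_pool_2x2(feature_at_1_1)
pooled_b = max_pool_2x2(feature_at_0_0)

print(pooled_a)              # [[9, 0], [0, 0]]
print(pooled_b)              # [[9, 0], [0, 0]]
print(pooled_a == pooled_b)  # True: the exact position inside the window is gone
```

Stacking several pooling layers widens the region over which position is forgotten, which is how misplaced parts can still add up to a confident "face" prediction.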

How Capsule Networks Work

The core building block of this architecture is the capsule, a nested set of neurons that outputs a vector rather than a scalar value. In vector mathematics, a vector has both magnitude and direction. In a CapsNet:

  • Magnitude (Length): Represents the probability that a specific entity exists in the current input.
  • Direction (Orientation): Encodes the instantiation parameters, such as the object's pose, scale, and rotation.
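For a vector's length to act as a probability, it has to stay between 0 and 1. The original dynamic-routing paper does this with a "squash" nonlinearity; the text above does not spell it out, so the following is a minimal pure-Python sketch of that function under this assumption, with illustrative helper names.

```python
import math


def squash(s):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    norm = math.sqrt(sq_norm)
    # length becomes sq_norm / (1 + sq_norm); dividing by norm keeps the direction
    scale = sq_norm / (1.0 + sq_norm) / norm
    return [scale * x for x in s]


def length(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


weak = squash([0.1, 0.0])      # short input: entity probably absent
strong = squash([30.0, 40.0])  # long input: entity probably present

print(round(length(weak), 4))    # 0.0099
print(round(length(strong), 4))  # 0.9996
print(round(strong[0] / strong[1], 2))  # 0.75, same ratio as 30 / 40
```

Short vectors are pushed toward zero length and long vectors toward length one, while the ratio between components (the encoded pose) is unchanged.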

Capsules in lower layers (detecting simple shapes like edges) predict the output of capsules in higher layers (detecting more complex parts like eyes or tires). This communication is managed by an algorithm called "dynamic routing" or "routing by agreement." If a lower-level capsule's prediction aligns with the higher-level capsule's output, the connection between them is strengthened; if it disagrees, the connection is weakened. This helps the network recognize objects from different 3D viewpoints with less reliance on the heavy data augmentation usually needed to teach CNNs about rotation and scale.
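Routing by agreement can be sketched in a few lines of pure Python. This is a simplified illustration, not a full CapsNet layer: it assumes the prediction vectors are already computed (real networks obtain them through learned weight matrices), uses the squash nonlinearity from the original dynamic-routing paper, and all names and values are hypothetical.

```python
import math


def squash(s):
    """Shrink vector s to a length in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0] * len(s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(predictions, iterations=3):
    """predictions[i][j] is lower capsule i's predicted output for higher capsule j."""
    n_lower, n_upper = len(predictions), len(predictions[0])
    dim = len(predictions[0][0])
    logits = [[0.0] * n_upper for _ in range(n_lower)]
    for _ in range(iterations):
        # Each lower capsule splits its vote across the higher capsules.
        coupling = [softmax(row) for row in logits]
        outputs = []
        for j in range(n_upper):
            total = [0.0] * dim
            for i in range(n_lower):
                for d in range(dim):
                    total[d] += coupling[i][j] * predictions[i][j][d]
            outputs.append(squash(total))
        # Agreement (dot product) raises the logit; disagreement lowers it.
        for i in range(n_lower):
            for j in range(n_upper):
                logits[i][j] += sum(
                    p * o for p, o in zip(predictions[i][j], outputs[j])
                )
    return coupling, outputs


# Three lower capsules, two higher capsules, 2D vectors (hypothetical values).
# Predictions for higher capsule 0 agree; predictions for capsule 1 conflict.
predictions = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.9, 0.1], [-1.0, 0.0]],
    [[1.0, 0.1], [0.0, 1.0]],
]

coupling, outputs = route(predictions)
for i, row in enumerate(coupling):
    print(f"lower capsule {i} coupling:", [round(c, 3) for c in row])
```

In this toy setup every lower capsule ends up routing most of its output to higher capsule 0, where the predictions agree, and that capsule's output vector is the longer of the two.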

Key Differences: CapsNets vs. CNNs

While both architectures are fundamental to computer vision (CV), they differ in how they process and represent visual data:

  • Scalar vs. Vector: CNN neurons use scalar outputs to signify feature presence. CapsNets use vectors to encode presence (length) and pose parameters (orientation).
  • Routing vs. Pooling: CNNs use pooling to downsample data, often losing location details. CapsNets use dynamic routing to preserve spatial data, which suits tasks that depend on the precise arrangement of an object's parts.
  • Data Efficiency: Because capsules implicitly understand 3D viewpoints and affine transformations, they can often generalize from less training data compared to CNNs, which may require extensive examples to learn every possible rotation of an object.

Real-World Applications

While CapsNets are often more computationally expensive than optimized models such as YOLO26, they offer distinct advantages in specialized domains:

  1. Medical Image Analysis: In healthcare, the precise orientation and shape of an anomaly are critical. Researchers have applied CapsNets to brain tumor segmentation, where the model must distinguish a tumor from surrounding tissue based on subtle spatial hierarchies that standard CNNs might smooth over. You can explore related research on capsule networks in medical imaging.
  2. Overlapping Digit Recognition: In the original dynamic-routing paper, CapsNets reported state-of-the-art results on the MNIST dataset and performed considerably better than a comparable CNN on MultiMNIST, a variant in which digits overlap. Because the network tracks the "pose" of each digit, it can disentangle two overlapping numbers (e.g., a '3' on top of a '5') as distinct objects rather than merging them into a single confused feature map.
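Because each class capsule reports presence through its own vector length, two digits can be "present" at once. The original dynamic-routing paper trains for this with a per-class margin loss; the text above does not describe it, so the sketch below is an assumption-labeled illustration using that paper's default constants (0.9, 0.1, and 0.5) and hypothetical capsule lengths.

```python
def margin_loss(lengths, present, m_plus=0.9, m_minus=0.1, down_weight=0.5):
    """Sum of per-class margin losses over capsule lengths.

    lengths[k] is the output length of class capsule k; present[k] is True
    when class k appears in the image. Several classes may be present.
    """
    total = 0.0
    for cap_length, is_present in zip(lengths, present):
        if is_present:
            # Penalize a present class whose capsule is shorter than m_plus.
            total += max(0.0, m_plus - cap_length) ** 2
        else:
            # Penalize an absent class whose capsule is longer than m_minus.
            total += down_weight * max(0.0, cap_length - m_minus) ** 2
    return total


# Ten digit capsules; the image contains an overlapping '3' and '5'.
present = [k in (3, 5) for k in range(10)]

# Hypothetical output lengths: both digits found vs. the '5' missed.
both_found = [0.05] * 10
both_found[3], both_found[5] = 0.95, 0.92
five_missed = [0.05] * 10
five_missed[3] = 0.95

print(margin_loss(both_found, present))
print(round(margin_loss(five_missed, present), 4))
```

Unlike a softmax over classes, this loss does not force the class scores to compete, so a long '3' capsule and a long '5' capsule can coexist.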

Practical Context and Implementation

Capsule Networks are primarily a classification architecture. While they offer theoretical robustness, modern industrial applications often favor high-speed CNNs or Transformers for real-time performance. However, it is useful to understand the classification benchmarks used for CapsNets, such as MNIST.

The following example shows how to train a modern YOLO classification model on the MNIST dataset using the ultralytics package. This mirrors the primary benchmark task used to validate Capsule Networks; the model itself is a YOLO classifier, not a CapsNet.

from ultralytics import YOLO

# Load a YOLO26 classification model (optimized for speed and accuracy)
model = YOLO("yolo26n-cls.pt")

# Train the model on the MNIST dataset
# This dataset helps evaluate how well a model learns handwritten digit features
results = model.train(data="mnist", epochs=5, imgsz=32)

# Run inference on a sample image
# "path/to/digit.png" is a placeholder; point it at a real image of a handwritten digit
predict = model("path/to/digit.png")

# Print the index of the most likely class (digits 0-9)
print(predict[0].probs.top1)

The Future of Capsules and Vision AI

The principles behind Capsule Networks continue to influence research into AI safety and interpretability. By explicitly modeling part-whole relationships, capsules offer a "glass box" alternative to the "black box" nature of deep neural networks, making decisions more explainable. Future developments aim to combine the spatial robustness of capsules with the inference speed of architectures such as YOLO11 or the newer YOLO26 to improve performance in 3D object detection and robotics. Researchers are also exploring matrix capsules with EM routing to further reduce the computational cost of the agreement algorithm.

For developers looking to manage datasets and train models efficiently, the Ultralytics Platform provides a unified environment to annotate data, train in the cloud, and deploy models that balance the speed of CNNs with the accuracy required for complex vision tasks.
