
Understanding promptable concept segmentation

Explore promptable concept segmentation, how it differs from traditional methods, and how related models such as YOLOE-26 enable open-vocabulary capabilities.

Vision AI is advancing rapidly and is being widely used to analyze images and videos in real-world environments. For instance, applications ranging from traffic management systems to retail analytics are being integrated with computer vision models.

In many of these applications, vision models, such as object detection models, are trained to recognize a predefined set of objects, including vehicles, people, and equipment. During training, these models are shown many labeled examples so they can learn how each object appears and how to distinguish it from others in a scene.

For segmentation tasks, models go a step further by producing precise pixel-level outlines around these objects, enabling systems to understand exactly where each object is located in an image.

This works well as long as the system only needs to recognize what it was trained on. However, in real-world environments, that is rarely the case. 

Visual scenes are typically dynamic. New objects and visual concepts appear, conditions change, and users often want to segment objects that weren’t part of the original training setup.

These limitations are especially clear when it comes to segmentation. As Vision AI continues to evolve, there is a growing need for more flexible segmentation models that can adapt to new concepts without repeated retraining. That is why promptable concept segmentation (PCS) is gaining attention.

Instead of relying on a fixed list of object categories, users can describe what they want to segment using text, visual prompts, or example images. A PCS model can then identify and segment all regions that match the described concept, even if that concept wasn’t explicitly included during training.

In this article, we’ll explore how promptable concept segmentation works, how it differs from traditional approaches, and where it is used today.

What is promptable concept segmentation?

In most cases, segmentation models are trained to recognize a short list of object types. This works well when a Vision AI system only needs to detect and segment a specific set of objects.

In real-world applications, however, visual scenes are dynamic. New objects appear, task requirements change, and users often need to segment concepts that weren’t included in the original label set. Supporting these situations typically means collecting new high-quality data and annotations and retraining the model, which increases cost and slows deployment.

Promptable concept segmentation solves this issue by letting users tell the model what to look for instead of choosing from a fixed list of labels. Users describe the object or idea they are looking for, and the model highlights all matching areas in the image. This makes it much easier to connect a user’s intent to the actual pixels in an image.

Fig 1. A look at using concept prompts for segmentation (Source)

Guiding segmentation with different types of prompts

Models that support promptable concept segmentation are flexible because they can take different types of input. In other words, there is more than one way to tell the model what to look for: text descriptions, visual hints, or example images.

Here’s a closer look at each approach, followed by a short code sketch of the visual-prompt case:

  • Text prompts: Short phrases like “school bus” or “tumor region” can be used to describe the concept to segment. The model interprets the meaning of the words and identifies matching regions.
  • Visual prompts: These prompts use points, boxes, or rough sketches inside the image as hints. These cues guide where to look and help shape the final boundary.
  • Image exemplars: Reference images or small crops represent the concept of interest. The model searches for visually similar regions and segments them based on visual appearance.
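
To make the visual-prompt case concrete, here is a minimal sketch using the Segment Anything family through the Ultralytics Python package. It assumes the ultralytics package is installed and that a local image exists at path/to/image.jpg; the checkpoint name, point, and box coordinates are placeholders for illustration. A text-prompt example appears later in the YOLOE-26 section.

```python
from ultralytics import SAM

# Load a Segment Anything checkpoint (sam2.1_b.pt is one of the published weights).
model = SAM("sam2.1_b.pt")

# Visual prompt 1: a foreground point (x, y) with label 1 meaning "this is the object".
results_point = model("path/to/image.jpg", points=[[450, 300]], labels=[1])

# Visual prompt 2: a rough bounding box [x1, y1, x2, y2] around the object of interest.
results_box = model("path/to/image.jpg", bboxes=[100, 150, 520, 480])

# Each result carries pixel-level masks for the prompted region.
results_point[0].show()
```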

The difference between PCS and traditional segmentation

Before we dive into how promptable concept segmentation works, let’s first compare it to various traditional object segmentation methods.

PCS models are open-vocabulary and prompt-driven: they can segment new concepts described through prompts at inference time, which traditional segmentation can’t do. Traditional segmentation itself comes in several forms, each with its own assumptions and limitations.

Here’s a glimpse of some key types of traditional segmentation:

  • Semantic segmentation: Every pixel in the image is labeled as part of a category like road, building, or person. All pixels with the same label are grouped together, so the model doesn’t separate individual object instances.
  • Instance segmentation: The model identifies and segments individual objects, so two people or two cars are treated as separate items. 
  • Panoptic segmentation: This technique combines semantic and instance segmentation to provide a complete view of the scene, covering both background regions and individual objects. 

All of these approaches rely on a predefined list of object categories. They work well within that scope, but they don’t handle concepts outside it very well. When an object outside that list needs to be segmented, additional training data and model fine-tuning are usually required.

PCS aims to change that. Instead of being locked into predefined categories, it lets you describe what you want to segment in an image at inference time.
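
As a point of contrast, here is a minimal sketch of traditional, fixed-vocabulary instance segmentation using a pretrained Ultralytics YOLO segmentation model. The image path is a placeholder, and the checkpoint shown is trained on the COCO classes, so it can only segment categories from that list.

```python
from ultralytics import YOLO

# A conventional instance-segmentation model with a fixed, pretrained label set.
model = YOLO("yolo11n-seg.pt")

results = model("path/to/image.jpg")

# Every predicted mask maps back to one of the model's predefined class names;
# anything outside that vocabulary cannot be segmented without retraining.
for r in results:
    print(r.names)   # fixed class-id -> class-name mapping
    print(r.masks)   # pixel-level masks for detected instances of those classes
```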

The evolution of PCS models

Next, let’s walk through how segmentation models evolved toward promptable concept segmentation.

A popular foundation model that marked a shift in segmentation was SAM, or the Segment Anything Model, introduced by Meta AI in 2023. Instead of relying on predefined object categories, SAM allowed users to guide segmentation using simple visual prompts such as points or bounding boxes.

With SAM, users no longer had to select a label. They could simply indicate where an object was, and the model would generate a mask for it. This made segmentation more flexible, but users still needed to show the model where to look. 

SAM 2, released in 2024, built on this idea by handling more complex scenes and extending promptable segmentation to video. It improved robustness across different lighting conditions, object shapes, and motion, while still relying mainly on visual prompts to guide segmentation.

The SAM 3 model is the latest step in this evolution. Released in 2025, it is a unified model that combines visual understanding with language guidance, enabling consistent behavior across image and video segmentation tasks.

With SAM 3, users aren’t limited to pointing or drawing prompts. Instead, they can describe what they want to segment using text, and the model searches the image or video frames for regions that match that description. 

Segmentation is guided by concepts rather than fixed object categories, supporting open vocabulary use across different scenes and over time. In fact, SAM 3 operates over a large, learned concept space that is grounded in an ontology derived from sources such as Wikidata and expanded through large-scale training data.

Fig 2. An example of prompting SAM 3 and segmenting a single image (Source)

Compared to earlier versions that relied mostly on geometric prompts, SAM 3 represents a step toward more flexible, concept-driven segmentation. This makes it better suited for real-world applications where the objects or ideas of interest may change and can’t always be defined in advance.

Exploring how promptable visual segmentation works

So, how does promptable concept segmentation work? It builds on large pretrained vision and vision-language models, which are trained on massive collections of images and, in many cases, paired text. This training lets them learn general visual patterns and semantic meaning.

Most PCS models use transformer-based architectures, which process an entire image at once to understand how different regions relate to each other. A vision transformer extracts visual features from the image, while a text encoder converts words into numerical representations that the model can work with.

During training, these models can learn from different types of supervision, including pixel-level masks that define exact object boundaries, bounding boxes that roughly locate objects, and image-level labels that describe what appears in an image. Training using different kinds of labeled data helps the model capture both fine details and broader visual concepts.

At inference time, meaning when the model is actually used to make predictions, PCS follows a prompt-driven process. A user provides guidance through text descriptions, visual hints like points or boxes, or example images. The model encodes both the prompt and the image into a shared internal representation, or embedding space, and identifies regions that align with the described concept.

A mask decoder then converts this shared representation into precise pixel-level segmentation masks. Since the model links visual features with semantic meaning, it can segment new concepts even if they weren't explicitly included during training.
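
The sketch below illustrates this flow in a simplified form with PyTorch. It is a conceptual toy rather than any specific model’s implementation: random tensors stand in for the outputs of the pretrained image and text encoders, cosine similarity plays the role of matching the prompt against the shared embedding, and bilinear upsampling plus a threshold stands in for a learned mask decoder.

```python
import torch
import torch.nn.functional as F

D, H, W = 256, 32, 32                      # embedding dim and feature-map size (toy values)

# Stand-ins for encoder outputs: per-patch image features and one prompt embedding.
image_features = torch.randn(1, D, H, W)   # from a vision transformer
prompt_embedding = torch.randn(1, D)       # from a text encoder, e.g. for "school bus"

# Normalize both into the shared space, then score every patch against the prompt.
img = F.normalize(image_features, dim=1)
txt = F.normalize(prompt_embedding, dim=1)
similarity = torch.einsum("bdhw,bd->bhw", img, txt)          # (1, H, W) alignment map

# A real mask decoder refines this map with learned layers; here we simply
# upsample to image resolution and threshold it to get a coarse binary mask.
logits = F.interpolate(similarity.unsqueeze(1), size=(640, 640), mode="bilinear")
mask = logits.sigmoid() > 0.5
print(mask.shape)                           # torch.Size([1, 1, 640, 640])
```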

The output can often be refined by adjusting the prompt or adding extra guidance, which helps the model handle complex or ambiguous scenes. This iterative process makes it practical to improve results during deployment.

Promptable concept segmentation models are typically evaluated on how well they segment previously unseen concepts and how robustly they perform across different scenes. Benchmarks often focus on mask quality, generalization, and computational efficiency, reflecting real-world deployment requirements.

Real-world use cases of PCS

Next, let’s look at where promptable concept segmentation is already being used and starting to make a real impact.

Flexible image segmentation for medical imaging

Medical imaging involves many biological structures, diseases, and scan types, and new cases pop up every day. Traditional segmentation models struggle to keep up with this variety. 

PCS fits naturally into this space because it lets clinicians describe what they want to find instead of choosing from a short, rigid list. With text phrases or visual prompts, PCS can be used to segment organs or areas of concern directly, without retraining the model for each new task. This makes it easier to handle diverse clinical needs, reduces the need for manual mask drawing, and works across many imaging types.

One great example is MedSAM-3, which adapts the SAM 3 architecture for text-promptable PCS in medical imaging. This model can be prompted with explicit anatomical and pathological terms, such as organ names like liver or kidney and pathology-related concepts like tumor or lesion. Given a prompt, the model directly segments the corresponding region in the medical image.

MedSAM-3 also integrates multimodal large language models (MLLMs), which can reason over both text and images. These components operate in an agent-in-the-loop setup, where segmentation results are refined iteratively to improve accuracy in more challenging cases.

Fig 3. A MedSAM-3 pipeline for text-prompted tumor segmentation in medical images (Source)

MedSAM-3 performs well across X-ray, MRI, CT, ultrasound, and video data, highlighting how PCS can enable more flexible and efficient medical imaging workflows in real clinical settings.

Adaptive segmentation for robotic surgery and automation

Robotic surgery relies on vision systems to track tools and understand rapidly changing surgical scenes. Instruments move quickly, lighting varies, and new tools can appear at any time, which makes predefined label systems hard to maintain.

With PCS, surgeons or automated systems can use text prompts such as “gripper,” “scalpel,” or “camera tool” to indicate what should be segmented in an image. This lets robots track tools, guide cameras, and follow surgical steps in real time, while reducing manual labeling and making systems easier to adapt to different procedures.

Fig 4. Segmentation of surgical instruments used during robotic surgery (Source)

Open-vocabulary segmentation with Ultralytics YOLOE-26

Another interesting state-of-the-art model related to promptable concept segmentation is Ultralytics YOLOE-26, which brings open-vocabulary, prompt-driven segmentation to the Ultralytics YOLO model family.

YOLOE-26 is built on the Ultralytics YOLO26 architecture, supports open-vocabulary instance segmentation, and allows users to guide segmentation in several ways.

It supports text prompts, where short, visually grounded phrases can specify the target object, as well as visual prompts, which provide additional guidance based on image cues. In addition, YOLOE-26 includes a prompt-free mode for zero-shot inference, where the model detects and segments objects from a built-in vocabulary without requiring user prompts.
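
Here is a minimal sketch of the text-prompt workflow using the YOLOE interface in the Ultralytics Python package. The image path is a placeholder, and the checkpoint name follows existing YOLOE releases; the same pattern of passing class names together with their text embeddings via set_classes is expected to carry over to YOLOE-26 weights.

```python
from ultralytics import YOLOE

# Load an open-vocabulary YOLOE segmentation checkpoint
# (swap in a YOLOE-26 weight file when one is available to you).
model = YOLOE("yoloe-11l-seg.pt")

# Text prompts: describe the concepts you want segmented.
names = ["school bus", "traffic cone"]
model.set_classes(names, model.get_text_pe(names))

# The model now detects and segments only regions matching those prompts.
results = model.predict("path/to/image.jpg")
results[0].show()
```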

YOLOE-26 is great for applications such as video analytics, robotics perception, and edge-based systems, where object categories may change but low latency and reliable throughput remain essential. It is also especially useful for data labeling and dataset curation, as it streamlines workflows by automating parts of the annotation process.

Pros and cons of promptable concept segmentation

Here are some of the main benefits of using promptable concept segmentation:

  • Faster iteration and prototyping: New segmentation tasks can be tested quickly by changing prompts rather than rebuilding datasets or retraining models, which speeds up experimentation and development.
  • Adaptability across domains: The same PCS model can often be applied to different domains, such as medical imaging, robotics, or video analytics, with minimal changes to the workflow.
  • Interactive refinement: Users can iteratively adjust prompts or add guidance to improve results, making it easier to handle ambiguous scenes or edge cases without retraining.

While PCS has clear advantages, here are some limitations to consider:

  • Prompt sensitivity: Small changes in how a prompt is written or provided can affect the output. Prompts that are too vague or too specific may lead to incomplete or incorrect segmentation.
  • Less predictable behavior: Since the model interprets prompts rather than selecting from fixed labels, results can vary more across scenes and inputs, which may be an issue for tightly controlled pipelines.
  • Ambiguous concept interpretation: Some concepts are subjective or loosely defined, which can lead to inconsistent segmentation results between users or across images.
  • Limited reliability for highly specific targets: Prompt-based models are generally less reliable for narrowly defined, instance-specific tasks, such as defect detection, where precise, consistent identification of subtle features is required.

Choosing between promptable and traditional segmentation

As you explore promptable segmentation, you might wonder which applications it is best suited for and when a traditional computer vision model like YOLO26 is a better fit for the problem you are trying to solve. Promptable segmentation works well for general objects, but it’s not a good fit for use cases that require very precise and consistent results.

Defect detection is a good example. In manufacturing, defects are often tiny and subtle, such as small scratches, dents, misalignments, or surface irregularities. They can also vary widely depending on materials, lighting, and production conditions. 

These issues are difficult to describe with a simple prompt and even harder for a general-purpose model to detect reliably. Overall, prompt-based models tend to miss defects or produce unstable results, while models trained specifically on defect data are much more reliable for real-world inspection systems.

Key takeaways

Promptable concept segmentation makes vision systems easier to adapt to the real world, where new objects and ideas show up all the time. Instead of being locked into fixed labels, users can simply describe what they want to segment and let the model do the rest, which saves time and reduces manual work. While it still has limitations, PCS is already changing how segmentation is used in practice and is likely to become a core part of future vision systems.

Explore more about AI by visiting our GitHub repository and joining our community. Check out our solution pages to learn about AI in robotics and computer vision in manufacturing. Discover our licensing options to get started with Vision AI today!
