Explore visual prompting to guide AI models with points and boxes. Learn how Ultralytics YOLO and SAM enable precise segmentation and faster data annotation.
Visual prompting is an emerging technique in computer vision where users provide spatial or visual cues—such as points, bounding boxes, or scribbles—to guide an AI model's focus toward specific objects or regions within an image. Unlike traditional prompt engineering, which relies primarily on text descriptions, visual prompting allows for more precise and intuitive interaction with Artificial Intelligence (AI) systems. This method leverages the capabilities of modern foundation models to perform tasks like segmentation and detection without requiring extensive retraining or large labeled datasets. By effectively "pointing" at what matters, users can adapt general-purpose models to novel tasks on the fly, bridging the gap between human intent and machine perception.
At its core, visual prompting works by injecting spatial information directly into the model's processing pipeline. When a user clicks on an object or draws a box, these inputs are converted into coordinate-based embeddings that the neural network integrates with the image features. This process is central to interactive architectures like the Segment Anything Model (SAM), where the model predicts masks based on geometric prompts.
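To make this concrete, the following is a simplified sketch of how a point prompt could be mapped to a positional embedding with random Fourier features, in the spirit of SAM's prompt encoder. The embedding size, projection matrix, and normalization are illustrative assumptions, not SAM's actual weights or code.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 256

# Random projection matrix standing in for the (frozen) positional-encoding weights
gaussian_matrix = rng.normal(scale=1.0, size=(2, embed_dim // 2))


def encode_point(x: float, y: float, image_size=(1080, 810)) -> np.ndarray:
    """Map a pixel coordinate to a fixed-length vector a mask decoder could attend to."""
    # Normalize to [0, 1], then shift to [-1, 1] so the encoding is resolution-independent
    coords = np.array([x / image_size[1], y / image_size[0]]) * 2.0 - 1.0
    projected = coords @ gaussian_matrix  # shape: (embed_dim // 2,)
    return np.concatenate([np.sin(2 * np.pi * projected), np.cos(2 * np.pi * projected)])


point_embedding = encode_point(300, 350)
print(point_embedding.shape)  # (256,)
```

In SAM, coordinate encodings of this kind are combined with learned embeddings that distinguish foreground from background clicks before the mask decoder attends to them together with the image features.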
The flexibility of visual prompting allows for several interaction types, illustrated by the box-prompt sketch after this list:

- Point prompts: single clicks that mark a pixel as belonging to the target object (foreground) or, with a negative label, to the background.
- Box prompts: bounding boxes drawn around the target to constrain the region of interest.
- Scribble prompts: rough strokes or partial masks that loosely trace the object's shape.
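A box prompt, for example, maps onto the bboxes argument of the Ultralytics SAM predictor (a point prompt is shown in the fuller example further below). The coordinates here are placeholder values chosen for illustration.

```python
from ultralytics import SAM

# Load a SAM checkpoint and segment the object enclosed by an [x1, y1, x2, y2] box
model = SAM("sam2.1_b.pt")
results = model("https://ultralytics.com/images/bus.jpg", bboxes=[34, 230, 290, 880])
results[0].show()
```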
Recent research presented at CVPR 2024 highlights how visual prompting significantly reduces the time required for data annotation, as human annotators can correct model predictions in real time with simple clicks rather than manually tracing polygons.
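One practical way to exploit this is Ultralytics' auto_annotate utility, which runs a detector to propose boxes and then passes each box to SAM as a visual prompt to generate segmentation labels. The paths and checkpoint names below are placeholders to adapt to your own dataset, and exact defaults may differ between ultralytics releases.

```python
from ultralytics.data.annotator import auto_annotate

# A detector proposes bounding boxes; SAM turns each box prompt into a polygon label.
auto_annotate(
    data="path/to/unlabeled/images",  # folder of images to label
    det_model="yolo11n.pt",  # detection model that generates the box prompts
    sam_model="sam_b.pt",  # SAM checkpoint that converts boxes into masks
    output_dir="path/to/generated/labels",  # where the segmentation labels are written
)
```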
While both techniques aim to guide model behavior, it is important to distinguish visual prompting from text-based methods. Text-to-image generation or zero-shot detection relies on natural language processing (NLP) to interpret semantic descriptions (e.g., "find the red car"). However, language can be ambiguous or insufficient for describing precise spatial locations or abstract shapes.
Visual prompting resolves this ambiguity by grounding the instruction in the pixel space itself. For instance, in medical image analysis, it is far more accurate for a radiologist to click on a suspicious nodule than to attempt to describe its exact coordinates and irregular shape via text. Often, the most powerful workflows combine both approaches—using text for semantic filtering and visual prompts for spatial precision—a concept known as multi-modal learning.
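As a rough sketch of such a combined workflow, an open-vocabulary detector like YOLO-World can resolve a text description into boxes, which then act as visual prompts for SAM. The weights files and class name here are illustrative choices, not a prescribed pipeline.

```python
from ultralytics import SAM, YOLOWorld

image = "https://ultralytics.com/images/bus.jpg"

# Text prompt: an open-vocabulary detector finds boxes matching a semantic description
detector = YOLOWorld("yolov8s-world.pt")
detector.set_classes(["bus"])
detections = detector(image)

# Visual prompt: the detected boxes become geometric prompts for SAM's mask decoder
boxes = detections[0].boxes.xyxy.tolist()
if boxes:
    segmenter = SAM("sam2.1_b.pt")
    masks = segmenter(image, bboxes=boxes)
    masks[0].show()
```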
The adaptability of visual prompting has led to its rapid adoption across diverse industries, from medical imaging to large-scale data annotation pipelines.
The Ultralytics ecosystem supports visual prompting workflows, particularly through models like FastSAM and SAM. These models allow developers to pass point or box coordinates programmatically to retrieve segmentation masks.
The following example demonstrates how to use the ultralytics package to apply a point prompt to an image, instructing the model to segment the object located at specific coordinates.
from ultralytics import SAM

# Load the Segment Anything Model (SAM)
model = SAM("sam2.1_b.pt")

# Apply a visual point prompt to the image
# The 'points' argument accepts [x, y] coordinates
# labels: 1 indicates a foreground point (include), 0 indicates background
results = model("https://ultralytics.com/images/bus.jpg", points=[[300, 350]], labels=[1])

# Display the segmented result
results[0].show()
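FastSAM, mentioned above, accepts the same style of geometric prompts with a lighter backbone, which makes it attractive for latency-sensitive use. Below is a minimal sketch with an illustrative box prompt.

```python
from ultralytics import FastSAM

# FastSAM accepts the same geometric prompts with a lighter, faster backbone
model = FastSAM("FastSAM-s.pt")

# Box prompt in [x1, y1, x2, y2] format; coordinates are illustrative
results = model("https://ultralytics.com/images/bus.jpg", bboxes=[34, 230, 290, 880])
results[0].show()
```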
Visual prompting represents a shift towards "promptable" computer vision, where models are no longer static "black boxes" but interactive tools. This capability is essential for active learning loops, where models rapidly improve by incorporating user feedback.
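A minimal sketch of such a feedback loop, using the same SAM API as above: an initial foreground click is refined by adding a background click (label 0) to exclude a region the first mask wrongly included. The coordinates are illustrative.

```python
from ultralytics import SAM

model = SAM("sam2.1_b.pt")
image = "https://ultralytics.com/images/bus.jpg"

# First pass: a single foreground click
initial = model(image, points=[[300, 350]], labels=[1])

# Correction: keep the foreground click and add a background click (label 0)
# to push the mask away from a region the first prediction wrongly included
refined = model(image, points=[[300, 350], [450, 120]], labels=[1, 0])
refined[0].show()
```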
For developers looking to integrate these capabilities into production, the Ultralytics Platform offers tools to manage datasets and deploy models that can handle dynamic inputs. As research progresses, we expect to see even tighter integration between visual prompts and large language models (LLMs), enabling systems that reason about visual inputs with the same fluency with which they currently handle text.