
What is YOLOE? Taking computer vision models further

Learn how YOLOE lets you find objects using a simple prompt or photo. It enables smarter, faster computer vision without retraining or fine-tuning models.

Object detection is a key computer vision task, where the goal is to identify and locate objects in images or videos. Computer vision is the field of artificial intelligence (AI) that enables machines to understand and interpret visual data. For example, object detection can help identify a car in a picture or spot a person in a video feed.

One of the most well-known series of models supporting computer vision tasks like object detection is the YOLO (You Only Look Once) model series. Designed for speed and accuracy, YOLO models have continually improved over time. For instance, one of the latest versions, Ultralytics YOLO11, performs well in real-world situations, providing accurate results even in more complex environments.

Driving this progress further, a new model called YOLOE aims to expand the capabilities of YOLO models. Unlike traditional models that require retraining to recognize new objects, YOLOE can follow simple text or image prompts to detect objects it hasn’t seen before, making it much more adaptable to changing environments.

In this article, we’ll take a closer look at what makes YOLOE unique, how it compares to previous YOLO models, and how you can start using it today. Let's get started!

An overview of YOLOE

YOLOE is a computer vision model that takes object detection a step further. It was introduced in March 2025 by researchers at Tsinghua University. What sets YOLOE apart from traditional models is its use of open-vocabulary detection. 

While most models are trained to recognize a fixed list of objects, YOLOE allows you to specify what to look for using a short description or an example image. For example, if you're searching for a “green backpack,” you can either type that description or show the model a photo, and YOLOE will locate it within the scene.

Additionally, even without any prompt, YOLOE can detect many everyday objects on its own. This ability to recognize objects it has never seen before is called zero-shot detection. It’s particularly useful in dynamic environments where the task or objects of interest may change unexpectedly.

Fig 1. A look at YOLOE’s capabilities.

Key features of YOLOE

YOLOE supports a wide range of features designed to enhance its performance in real-world applications. With its ability to handle both structured and unstructured input, YOLOE opens up new possibilities for object detection and segmentation. 

Here are some of the key features the model brings to the table:

  • Prompt-based detection: YOLOE can search for objects based on a short text prompt or an example image. This means you don’t need to retrain the model every time your task changes; just describe or show the model what you’re looking for.
  • Instance segmentation: In addition to drawing bounding boxes around objects, YOLOE can outline their exact shape using instance segmentation. This is especially helpful when objects are overlapping or when you need to know the precise boundaries of an object (see the short code sketch after this list).
  • Prompt-free object recognition: YOLOE can recognize objects even without specific instructions. It uses a set of pre-learned descriptions to quickly identify objects, making the process faster and more efficient.
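
To illustrate the instance segmentation feature mentioned above, here is a minimal sketch of reading object outlines from a prediction. The image path and the “green backpack” prompt are placeholder values, and the full prompt workflow is covered step by step later in this article:

from ultralytics import YOLOE

# Load a YOLOE segmentation model and set a text prompt (placeholder values)
model = YOLOE("yoloe-11l-seg.pt")
prompt = ["green backpack"]
model.set_classes(prompt, model.get_text_pe(prompt))

results = model.predict("path/to/your/image.jpg")

# Segmentation models return per-object masks alongside bounding boxes
if results[0].masks is not None:
    print(f"Found {len(results[0].masks)} object outline(s)")
    print(results[0].masks.xy[0])  # polygon points tracing the first object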

Comparing YOLOE with other YOLO models

Now that we have a better understanding of what YOLOE is, let's take a look at some similar models in the YOLO family.

As computer vision has progressed, so have the YOLO models. For instance, Ultralytics YOLOv8 brought support for new tasks like segmentation and classification, while later versions, like Ultralytics YOLO11, have focused on improving accuracy and performance for a wider range of tasks.

YOLO-World, released in January 2024, introduced the ability to use written prompts, letting users describe the objects they want to find. While YOLO-World was a strong option for zero-shot detection, it lacked features like instance segmentation and visual prompt support.

YOLOE builds on YOLO-World by adding these capabilities, making it a more flexible and capable tool for real-world computer vision applications.

Fig 2. YOLO-World and YOLOE both support zero-shot detection.

Using YOLOE with the Ultralytics Python package

Whether you want to detect specific objects or explore everything in an image, getting started with YOLOE is simple. This model is supported by the Ultralytics Python package, making it easy to integrate into your projects. Next, let’s walk through how to use it.

Installing the Ultralytics package

The first step is to install the Ultralytics Python package using a package manager such as pip. You can do this by running the following command in your terminal or command prompt:
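
pip install ultralytics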

Once the package is installed, you'll have everything you need to load the model, make predictions, and experiment with different detection modes. If you encounter any issues during installation, the official Ultralytics documentation offers a helpful troubleshooting section.

There are a few different ways to use YOLOE to run predictions. Running predictions means using the trained model to identify and locate objects within images or videos. These different methods allow you to customize how you interact with the model based on your specific needs.

Let’s discuss each of these methods, one at a time.

Detecting specific objects with text or image prompts

YOLOE can detect objects based on a short text description. For example, if you are looking for a horse in motion, you can use a prompt like "horse walking."

To get started, load the pre-trained YOLOE model and set your prompt (the description of what you want the model to look for), as shown in the code snippet below.

from ultralytics import YOLOE

# Load the pre-trained YOLOE segmentation model
model = YOLOE("yoloe-11l-seg.pt")

# Describe what to look for, then pass the matching text embeddings to the model
prompt = ["horse walking"]
model.set_classes(prompt, model.get_text_pe(prompt))

Once your model and prompt are set, you can run the model on an image or video. Replace the file path in the code with the path to your image or video file:

# Run the model on your image and display the annotated result
results = model.predict("path/to/your/image.jpg")
results[0].show()

This will display the image with the detected object clearly marked based on your prompt. You can change the prompt to search for different objects, such as "red suitcase," "bicycle," or "zebra," depending on what you're looking for.

Fig 3. An example of using YOLOE to detect specific objects using a text prompt.

Similarly, you can use an image to prompt YOLOE with the Ultralytics Python package. In visual prompt mode, the model uses the image to find similar-looking items in another scene. This is particularly useful for objects that are difficult to describe or lack clear labels. 

To explore the code for this in more detail, you can check out the Ultralytics documentation.
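
As a rough sketch, based on the Ultralytics documentation, visual prompting looks something like this: you mark an example object in an image with a bounding box, and YOLOE searches for similar-looking objects. The box coordinates and file path below are placeholder values.

import numpy as np
from ultralytics import YOLOE
from ultralytics.models.yolo.yoloe import YOLOEVPSegPredictor

model = YOLOE("yoloe-11l-seg.pt")

# A visual prompt: a box around an example object (placeholder coordinates)
# and a class index to assign to whatever matches it
visual_prompts = dict(
    bboxes=np.array([[100, 200, 300, 400]]),
    cls=np.array([0]),
)

results = model.predict(
    "path/to/your/image.jpg",
    visual_prompts=visual_prompts,
    predictor=YOLOEVPSegPredictor,
)
results[0].show()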

General object detection using YOLOE

In some cases, you might not know exactly what to search for, or you may not be looking for a particular object. That’s where the prompt-free mode comes in handy. 

With this option, you don’t need to type a description or provide an example image. YOLOE simply analyzes images on its own and detects everything it can recognize, such as people, animals, furniture, or everyday objects.

It is a helpful way to explore a scene without giving the model any specific instructions. Whether you are scanning a crowded room or reviewing footage with lots of activity, prompt-free mode gives you a quick look at what is present in an image. 

You can use the following code to run YOLOE in prompt-free mode. First, the model is loaded, then it processes the image and automatically detects the objects in it. Finally, the results are displayed, and the detected objects are highlighted. 

Be sure to replace the file path with the actual path to your image.

from ultralytics import YOLOE

# Load the prompt-free YOLOE model (note the "-pf" suffix)
model = YOLOE("yoloe-11l-seg-pf.pt")

# Detect everything the model recognizes in the image, then display the result
results = model.predict("path/to/image.jpg")
results[0].show()
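
If you'd rather have a quick text summary than an annotated image, you can also read the detected class labels off the results. Here is a small sketch using the standard Ultralytics results API:

# Print a human-readable label for each detected object
for box in results[0].boxes:
    print(results[0].names[int(box.cls)])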

The image shown below is an example of what YOLOE can detect in prompt-free mode.

Fig 4. Using YOLOE in prompt-free mode.

Real-time applications of YOLOE

YOLOE’s ability to respond to both text and image prompts makes it a reliable tool for real-time applications. Its flexibility is particularly useful in fast-paced environments where timing and accuracy are essential. 

Let’s explore some real-world examples of how YOLOE can be used.

Improving baggage handling: Real-time luggage detection

In busy airports, locating specific luggage can be challenging, especially when dealing with missing bags. YOLOE can streamline this process by scanning live video feeds and quickly identifying items based on simple prompts like “red bag.”

If a bag is missing or misplaced, staff can simply change the prompt to search for a different item, such as a “black suitcase.” This ability to adapt instantly helps airport staff locate the right luggage without reviewing hours of footage or retraining the model, making baggage handling and missing-luggage resolution much faster and more efficient.
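
To make that adaptability concrete, here is a minimal sketch of swapping prompts at runtime, reusing the set_classes call from earlier. The video path and prompt text are placeholder values:

from ultralytics import YOLOE

model = YOLOE("yoloe-11l-seg.pt")

# Start by searching for one item...
prompt = ["red bag"]
model.set_classes(prompt, model.get_text_pe(prompt))
results = model.predict("path/to/terminal_footage.mp4")

# ...then switch targets on the fly, with no retraining needed
prompt = ["black suitcase"]
model.set_classes(prompt, model.get_text_pe(prompt))
results = model.predict("path/to/terminal_footage.mp4")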

Monitoring public spaces with YOLOE

Surveillance footage of public spaces, such as crowded markets and cafés, often includes a mix of people, objects, and activities that change throughout the day. YOLOE can analyze this footage in real-time using prompt-free mode, automatically detecting items like bags, tables, or bicycles without needing specific instructions.

Fig 5. YOLOE can detect various objects in a busy public space.

This is particularly useful for security teams that need to spot unattended items or track crowd movement. YOLOE’s ability to detect multiple objects at once makes it easier to manage public spaces during events or busy periods, helping teams stay informed and responsive.

Pros and cons of YOLOE

Here are some of the key benefits of using YOLOE for computer vision applications:

  • Real-time performance: YOLOE is optimized for fast and efficient processing, enabling real-time detection, even in dynamic environments like live video streams or busy public spaces.
  • Scalability: YOLOE is scalable and works well for a wide variety of applications, from security and surveillance to retail, healthcare, and autonomous vehicles.
  • Easy to use: Since YOLOE is supported by the Ultralytics Python package, it’s easy to integrate into your existing computer vision projects.

However, there are a few limitations to keep in mind when using YOLOE. Here are a couple of factors to consider:

  • Requires sufficient training data: Although YOLOE supports zero-shot detection, its performance on unseen objects depends on how well it generalizes from its training data. In some cases, it may need additional data or fine-tuning to perform well in highly specialized tasks.
  • Sensitive to input quality: The model’s accuracy can be affected by low-quality images or video. Blurry or poorly lit input can reduce the model's ability to accurately detect objects, so high-quality input is important for optimal performance.

Key takeaways

YOLOE brings more flexibility to computer vision by allowing users to guide detection with text or image prompts. It works well in real-world situations where scenes change quickly and retraining is not an option.

From baggage handling to public space monitoring, YOLOE adapts to new tasks with ease. As AI becomes more accessible, models like YOLOE are helping more industries use vision technology in practical, efficient ways.

Join our community and explore our GitHub repository to learn more about AI innovations. Discover the latest advancements in areas like AI in retail and computer vision in healthcare on our solution pages. Check out our licensing options and get started with computer vision today!
