Vision AI

Florence-2: Microsoft's latest vision-language model

Meet Florence-2, Microsoft's visual language model that offers improved object detection, segmentation, and zero-shot performance with great efficiency.

ABAbirami Vina

6 min readJuly 26, 2024

Florence-2 vision-language model by Microsoft

In June 2024, Microsoft introduced Florence-2, a multi-modal visual language model (VLM) that is designed to handle a wide range of tasks including object detection, segmentation, image captioning, and grounding. Florence-2 sets a new benchmark for zero-shot performance, meaning it can perform tasks without prior specific training, and boasts a smaller model size than other state-of-the-art vision-language models.

It’s more than just another model, Florence-2's versatility and improved performance have the potential to significantly impact various industries by improving accuracy and reducing the need for extensive training. In this article, we will explore the innovative features of Florence-2, compare its performance with other VLMs, and discuss its potential applications.

Link to this sectionWhat is Florence-2?#

Florence-2 can handle a variety of tasks within a single unified framework. The model's impressive capabilities are partly thanks to its massive training dataset called FLD-5B. FLD-5B includes 5.4 billion annotations across 126 million images. This comprehensive dataset was created specifically to enable Florence-2 with the capabilities needed to handle a wide range of vision tasks with high accuracy and efficiency.

Here’s a closer look at the tasks Florence-2 supports:

Object Detection: It can identify and locate objects within images with high precision.
Segmentation: This task involves dividing an image into meaningful segments for easier analysis and interpretation.
Image Captioning: Florence-2 is capable of generating descriptive captions for images that provide context and details.
Visual Grounding: The model can associate specific phrases or words in a caption with the corresponding regions in the image.
Zero-shot Performance: It can perform tasks without specific training.

Diagram of how Florence-2 was trained

Fig 1. Understanding How Florence-2 Was Trained.

The model supports both text-based and region-based tasks. Special location tokens are added to the model's vocabulary for tasks involving specific regions of an image. These tokens help the model understand different shapes, such as rectangles around objects (box representation), four-sided shapes (quad box representation), and many-sided shapes (polygon representation). The model is trained using a method called cross-entropy loss, which helps it learn by comparing its predictions to the correct answers and adjusting its internal parameters accordingly.

Link to this sectionCreating the FLD-5B dataset#

The FLD-5B dataset includes different types of annotations: text descriptions, pairs of regions and text, and combinations of text, phrases, and regions. It was created through a two-step process involving data collection and annotation. Images were sourced from popular datasets like ImageNet-22k, Object 365, Open Images, Conceptual Captions, and LAION. The annotations in the FLD-5B dataset are mostly synthetic, meaning they were automatically generated rather than manually labeled.

Diagram of creating the FLD-5B dataset

Fig 2. Creating the FLD-5B Dataset.

Initially, specialist models skilled at specific tasks, like object detection or segmentation, created these annotations. Then, a filtration and enhancement process was used to make sure that the annotations were detailed and accurate. After removing any noise, the dataset went through iterative refinement, where Florence-2's outputs were used to continuously update and improve the annotations.

Link to this sectionUnderstanding Florence-2’s model architecture#

Florence-2's model architecture follows a sequence-to-sequence learning approach. This means that the model processes an input sequence (like an image with a text prompt) and generates an output sequence (like a description or a label) in a step-by-step manner. In the sequence-to-sequence framework, each task is treated as a translation problem: the model takes an input image and a task-specific prompt and generates the corresponding output.

Diagram of the Florence-2 vision-language model architecture

Fig 3. Florence-2’s Vision-Language Model Architecture.

At the core of the model architecture is a multi-modality encoder-decoder transformer, which combines an image encoder and a multi-modality encoder-decoder. The image encoder, called DaViT (Data-efficient Vision Transformer), processes input images by converting them into visual token embeddings - compact representations of the image that capture both spatial (where things are) and semantic (what things are) information. These visual tokens are then combined with text embeddings (representations of the text), allowing the model to seamlessly merge textual and visual data.

Link to this sectionComparing Florence-2 with other VLMs#

Florence-2 stands out from other visual language models due to its impressive zero-shot capabilities. Unlike models like PaliGemma, which rely on extensive fine-tuning to adapt to various tasks, Florence-2 works well right out of the box. Also, Florence-2 is able to compete with larger models like GPT-4V and Flamingo, which often have many more parameters but don't always match Florence-2's performance. For example, Florence-2 achieves better zero-shot results than Kosmos-2, despite Kosmos-2 having over twice the number of parameters.

In benchmark tests, Florence-2 has shown remarkable performance in tasks like COCO captioning and referring expression comprehension. It outperformed models like PolyFormer and UNINEXT in object detection and segmentation tasks on the COCO dataset. It is a highly competitive choice for real-world applications where both performance and resource efficiency are crucial.

Link to this sectionApplications of Florence-2#

Florence-2 can be used in many different industries, such as entertainment, accessibility, education, etc. Let’s walk through a few examples to get a better understanding.

Link to this sectionApplications of image captioning#

When you are on a streaming platform trying to decide what to watch, you might read a summary of a movie to help you choose. What if the platform could also provide a detailed description of the movie poster? Florence-2 can make that possible through image captioning, which generates descriptive text for images. Florence-2 can generate detailed descriptions of movie posters, making streaming platforms more inclusive for visually impaired users. By analyzing the visual elements of a poster, such as characters, scenery, and text, Florence-2 can create detailed descriptions that convey the poster's content and mood. The image below shows the level of detail that Florence-2 can provide in its description.

Example of an image caption generated by Florence-2

Fig 4. An example of an image caption generated by Florence-2.

Here are some other examples of where image captioning can be helpful:

E-commerce: Image captioning can provide detailed descriptions of product images, helping customers understand product features and details more clearly.
Travel and Tourism: It can provide detailed descriptions of landmarks and attractions in travel guides and apps.
Education: Image captioning can label and describe educational images and diagrams, aiding in teaching and learning.
Real Estate: It can provide detailed descriptions of property images that highlight features and amenities for potential buyers.

Link to this sectionUsing visual grounding while cooking#

Florence-2 can also be used to enrich culinary experiences. For instance, an online cookbook could use Florence-2 to visually ground and label parts of a complex recipe image. Visual grounding helps here by linking specific parts of the image to corresponding descriptive text. Each ingredient and step can be accurately labeled and explained, making it easier for home cooks to follow the recipe and understand each component's role in the dish.

Example of visual grounding using Florence-2

Fig 5. An example of visual grounding using Florence-2.

Link to this sectionRegion-based OCR for financial documents#

OCR with region-based processing, which focuses on extracting text from specific areas within a document, can come in handy when it comes to fields like accounting. Designated areas of financial documents can be analyzed to automatically extract important information such as transaction details, account numbers, and due dates. By reducing the need for manual data entry, it minimizes errors and speeds up processing times. Financial institutions can use it to streamline tasks such as invoice processing, receipt reconciliation, and check clearing, leading to faster transactions and better customer service.

Example of OCR with region using Florence-2

Fig 6. An example of extracting OCR with region using Florence-2.

Link to this sectionRegion-based segmentation in industrial applications#

Region-based segmentation, which involves dividing an image into meaningful parts for focused analysis and detailed inspection, can fuel industrial applications that improve precision and efficiency in various processes. By focusing on specific areas within an image, this technology allows for detailed inspection and analysis of components and products. With respect to quality control, it can identify defects or inconsistencies in materials, such as cracks or misalignments, ensuring that only top-quality products reach the market.

Example of region-based segmentation using Florence-2

Fig 7. An example of segmentation based on regions using Florence-2.

It also improves automated assembly lines by guiding robotic arms to specific parts and optimizing the placement and assembly of components. Similarly, in inventory management, it helps track and monitor the condition and location of goods, leading to more efficient logistics and reduced downtime. Overall, region-based segmentation boosts accuracy and productivity, leading to cost savings and higher product quality in industrial settings.

Link to this sectionKey takeaways#

We are starting to see a trend where AI models are becoming lighter while still maintaining high performance. Florence-2 marks a major step forward in terms of visual language models. It can handle various tasks like object detection, segmentation, image captioning, and grounding with impressive zero-shot performance. Despite its smaller size, Florence-2 is efficient and multi-functional, which makes it extremely useful in terms of applications across different industries. Models like Florence-2 are bringing more possibilities to the table, expanding the potential for AI innovations.

Explore more on AI by visiting our GitHub repository and joining our community. Check out our solutions pages to read about AI applications in manufacturing and agriculture. 🚀