
Multi-Modal Learning

Explore the power of multi-modal learning in AI! Discover how models integrate diverse data types for richer, more practical problem-solving.

Multi-modal learning is a sophisticated approach in artificial intelligence (AI) that trains algorithms to process, understand, and correlate information from multiple distinct types of data, or "modalities." Unlike traditional systems that specialize in a single input type—such as text for translation or pixels for image recognition—multi-modal learning mimics human cognition by integrating diverse sensory inputs like visual data, spoken audio, textual descriptions, and sensor readings. This holistic approach allows machine learning (ML) models to develop a deeper, context-aware understanding of the world, leading to more robust and versatile predictions.

How Multi-Modal Learning Works

The core challenge in multi-modal learning is translating different data types into a shared mathematical space where they can be compared and combined. This process generally involves three main stages: feature extraction, alignment, and fusion.

  1. Feature Extraction: Specialized neural networks process each modality independently. For instance, convolutional neural networks (CNNs) or Vision Transformers (ViTs) might extract features from images, while Recurrent Neural Networks (RNNs) or Transformers process text.
  2. Embedding Alignment: The model learns to map these diverse features into a shared high-dimensional vector space. In this space, the vector for the word "cat" and the vector for an image of a cat are brought close together. Techniques like contrastive learning, popularized by papers such as OpenAI's CLIP, are essential here (a minimal sketch follows this list).
  3. Data Fusion: Finally, the information is integrated to perform a task. Fusion can happen early (merging raw data), late (combining final predictions), or through intermediate hybrid approaches that use attention mechanisms to dynamically weigh the importance of each modality.
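
To make the alignment stage concrete, here is a minimal PyTorch sketch of a CLIP-style contrastive objective. The random tensors stand in for image and text encoder outputs, and the batch size, embedding dimension, and fixed temperature are illustrative assumptions rather than details of any particular model.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for encoder outputs: 4 image embeddings and their 4 matching
# text embeddings, L2-normalized into a shared 512-dimensional space
image_embeds = F.normalize(torch.randn(4, 512), dim=-1)
text_embeds = F.normalize(torch.randn(4, 512), dim=-1)

# Cosine-similarity logits for every image/text pair, scaled by a
# temperature (learnable in real models; fixed here for illustration)
temperature = 0.07
logits = image_embeds @ text_embeds.T / temperature

# Matching pairs lie on the diagonal: the symmetric cross-entropy loss
# pulls them together and pushes mismatched pairs apart in both directions
targets = torch.arange(4)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(f"Contrastive alignment loss: {loss.item():.4f}")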

Real-World Applications

Multi-modal learning is the engine behind many of today's most impressive AI breakthroughs, bridging the gap between distinct data silos to solve complex problems.

Multi-Modal Object Detection with Ultralytics

Standard object detectors rely on predefined categories, whereas multi-modal approaches such as YOLO-World let users detect objects through open-vocabulary text prompts. This showcases the power of fusing textual concepts with visual features in the Ultralytics framework.

The following Python code snippet shows how to use a pre-trained YOLO-World model to detect objects based on custom text inputs.

from ultralytics import YOLOWorld

# Load a pretrained YOLO-World model (Multi-Modal: Text + Vision)
model = YOLOWorld("yolov8s-world.pt")

# Define custom text prompts (modalities) for the model to identify
model.set_classes(["person", "bus", "traffic light"])

# Run inference: The model aligns the text prompts with visual features
results = model.predict("https://ultralytics.com/images/bus.jpg")

# Show the results
results[0].show()
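
Beyond rendering the image, the returned Results object can also be inspected programmatically. A brief sketch continuing from the snippet above (the print formatting is illustrative):

# Iterate over detected boxes and map class indices back to the text prompts
for box in results[0].boxes:
    class_name = results[0].names[int(box.cls)]
    print(f"{class_name}: {float(box.conf):.2f}")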

Distinguishing Key Terms

To make sense of the modern AI landscape, it is helpful to distinguish "multi-modal learning" from related concepts:

  • Multi-Modal Models: "Multi-modal learning" refers to the methodology and field of study, whereas a "multi-modal model" (such as those from Google) is the concrete outcome or software product of that training process.
  • Single-Modal AI: Traditional computer vision is typically single-modal, focused solely on visual data. Ultralytics YOLO models are cutting-edge computer vision tools for detecting objects, but they generally process only visual input unless used as part of a larger multi-modal pipeline.
  • Large Language Models (LLMs): Traditional LLMs are single-modal, trained only on text. However, the industry is shifting toward "Large Multimodal Models" (LMMs), which can process both images and text and are commonly built with frameworks such as PyTorch and TensorFlow.

Future Outlook

The trajectory of multi-modal learning points toward systems exhibiting characteristics of Artificial General Intelligence (AGI). By grounding language in visual and physical reality, these models are moving beyond statistical correlation toward genuine reasoning. Research from institutions such as MIT CSAIL and the Stanford Center for Research on Foundation Models continues to push the boundaries of how machines perceive and interact with complex, multi-sensory environments.

At Ultralytics, we are integrating these advances into the Ultralytics platform, enabling users to manage data, train models, and deploy solutions that take advantage of every available modality, from the speed of YOLO26 to the versatility of open-vocabulary detection.
