
Prompt Caching

Discover how prompt caching optimizes generative AI by reducing latency and costs. Learn how to speed up LLM inference and vision models like YOLO26.

Prompt caching is an advanced optimization strategy used primarily in generative AI to significantly reduce costs and improve response times during inference. In the realm of Large Language Models (LLMs), processing text requires converting inputs into numerical sequences known as tokens. Often, a large portion of the input data—such as a detailed system instruction, a lengthy legal document, or a codebase—remains static across many different user queries. Instead of re-processing these unchanging sections for every new request, prompt caching stores the pre-computed mathematical states (often called the Key-Value cache) in memory. This allows the inference engine to skip redundant calculations, focusing computational power only on the new, dynamic parts of the user's prompt.
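As a concrete illustration, the minimal sketch below uses the Hugging Face transformers library with a small GPT-2 checkpoint (an assumed setup for demonstration purposes) to pre-compute the Key-Value cache for a static prefix once and reuse it when a new query arrives. Production inference engines apply the same principle at far larger scale, keeping the cache in GPU memory.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Static context shared by every request (e.g., a long system instruction).
static_prefix = "You are an assistant that answers questions about a fixed legal document. "
prefix_ids = tokenizer(static_prefix, return_tensors="pt").input_ids

# Pre-compute the Key-Value cache for the static prefix once.
with torch.no_grad():
    cached_kv = model(prefix_ids, use_cache=True).past_key_values

# Only the new, dynamic tokens are processed; the prefix states come from the cache.
query_ids = tokenizer("Summarize the key obligations.", return_tensors="pt").input_ids
with torch.no_grad():
    output = model(query_ids, past_key_values=cached_kv, use_cache=True)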

Mechanics and Benefits

The fundamental mechanics of prompt caching rely on the Transformer architecture, whose causal attention means the computed states for a prefix depend only on the tokens within that prefix. By identifying the repetitive prefix of a prompt, the system can load the corresponding attention mechanism states directly from high-speed memory.

  • Reduced Latency: Caching dramatically lowers the inference latency, specifically the Time to First Token (TTFT). This ensures that real-time applications, such as interactive chatbots, feel instantaneous to the user.
  • Cost Efficiency: Since Cloud Computing providers often bill based on compute duration or token processing, skipping the heavy lifting for the static context leads to substantial savings.
  • Increased Throughput: By freeing up GPU resources, servers can handle a higher volume of concurrent requests, making the entire model serving infrastructure more scalable.
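At the serving layer, the cache lookup itself can be pictured as a dictionary keyed by a hash of the static prefix. The sketch below is purely illustrative: run_prefill is a hypothetical placeholder for the engine-specific call that produces the attention states, and the Python dictionary stands in for GPU-resident cache storage.

import hashlib

kv_cache_store = {}  # maps prefix hash -> pre-computed attention states


def run_prefill(prefix: str) -> dict:
    # Placeholder: a real engine would return GPU Key-Value tensors here.
    return {"prefix_tokens": len(prefix.split())}


def get_prefix_states(prefix: str) -> dict:
    """Return cached states for a prefix, computing them only on a cache miss."""
    key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
    if key not in kv_cache_store:
        kv_cache_store[key] = run_prefill(prefix)  # miss: pay the prefill cost once
    return kv_cache_store[key]  # hit: skip the prefill, which is what lowers TTFT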

Real-World Applications

Prompt caching is transforming industries that rely on heavy data context.

  1. Coding Assistants: In software development, tools like GitHub Copilot utilize vast amounts of context from the user's open files and repository structure. By caching the embeddings of the codebase, the model can provide real-time code completion suggestions without re-analyzing the entire project file structure for every keystroke.
  2. Legal and Medical Analysis: Professionals often query AI Agents against massive static documents, such as case law archives or patient history records. Using Retrieval-Augmented Generation (RAG), the system retrieves relevant chunks of text. Prompt caching ensures that the foundational context of these retrieved documents does not need to be recomputed for follow-up questions, streamlining the Question Answering workflow.
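As a rough sketch of the second scenario, the snippet below memoizes a hypothetical prefill step per retrieved document, so follow-up questions about the same document reuse the cached states instead of recomputing them; prefill_document and answer are illustrative placeholders rather than a real API.

from functools import lru_cache


@lru_cache(maxsize=32)
def prefill_document(document: str) -> dict:
    # Placeholder: a real system would compute and store Key-Value states here.
    return {"doc_tokens": len(document.split())}


def answer(document: str, question: str) -> str:
    states = prefill_document(document)  # cache hit on every follow-up question
    # Only the new question is processed on top of the cached document states.
    return f"(answered using {states['doc_tokens']} cached document tokens) {question}"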

Relevance in Computer Vision

While traditionally associated with text, the concept of caching is vital in multi-modal Computer Vision (CV). Models like YOLO-World allow users to detect objects using open-vocabulary text prompts. When a user defines a list of classes (e.g., "person, backpack, car"), the model computes text embeddings for these classes. Caching these embeddings prevents the model from needing to re-encode the text prompts for every single video frame, enabling high-speed Real-Time Inference.

Distinguishing Between Related Terms

  • Vs. Prompt Engineering: Prompt engineering is the human effort of designing the optimal text input to guide the model. Prompt caching is a backend computational optimization that stores the model's processed representation of that text.
  • Vs. Prompt Tuning: Prompt tuning is a Transfer Learning technique that updates specific Model Weights (soft prompts) to adapt a model to a task. Caching does not change the model's parameters; it only memorizes activation states during runtime.

Code Example: Caching Text Embeddings in Vision

The following Python snippet demonstrates the concept of "caching" a prompt in a vision context using the ultralytics package. By setting the classes once on a YOLO-World model, the text embeddings are computed and stored (persisted), allowing the model to efficiently run prediction on multiple images without re-processing the text description.

from ultralytics import YOLOWorld

# Load a YOLO-World model capable of open-vocabulary detection
model = YOLOWorld("yolov8s-world.pt")

# "Cache" the prompt: Define classes once.
# The model computes and stores text embeddings for these specific terms.
model.set_classes(["helmet", "reflective vest", "gloves"])

# Run inference repeatedly. The text prompt is not re-computed for each call.
# This mimics the efficiency gains of prompt caching in LLMs.
results_1 = model.predict("construction_site_1.jpg")
results_2 = model.predict("construction_site_2.jpg")

For managing datasets and deploying these optimized models, the Ultralytics Platform provides a comprehensive environment for annotating data, training state-of-the-art models like YOLO26, and monitoring deployment performance across various Edge AI devices.
