Semantic Caching

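See how semantic caching reduces AI latency and cost. Learn how it works in LLM and vision pipelines with a practical Ultralytics YOLO26 example.

Semantic caching is an advanced optimization technique used primarily with generative AI and large language models (LLMs). It stores and retrieves responses based on the meaning (semantics) of a query rather than its exact text. By recognizing when a new prompt asks the same underlying question as one already answered, semantic caching avoids calling the AI model again, which cuts processing time and API costs significantly.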

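How Semantic Caching Works

Unlike traditional caching, which requires an exact string match, a semantic cache converts each incoming query into a high-dimensional numerical vector known as an embedding. When a user submits a prompt, a system using Redis semantic caching or a similar in-memory store runs a vector search, comparing the new vector against previously stored vectors in a vector database.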

This comparison relies on mathematical distance metrics, most commonly cosine similarity. If the similarity score between the new query and a cached query exceeds a predefined threshold (e.g., 0.95), it registers as a "cache hit." The system instantly returns the stored response, entirely skipping the inference engine. If the score falls below the threshold, it results in a "cache miss," prompting the model to generate a new response and store the new embedding-answer pair for future interactions. This workflow is highly effective in modern cloud architectures for scaling AI applications.
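The hit-or-miss workflow described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production design: the `SemanticCache` class, the `answer` helper, and the `generate` callback are hypothetical names introduced here, embeddings are assumed to arrive as plain lists of floats, and the lookup is a linear scan rather than a real vector search.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical in-memory cache of (embedding, response) pairs."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def lookup(self, query_vec: Vector) -> Optional[str]:
        """Return the best cached response if its similarity exceeds the threshold."""
        best_score, best_response = -1.0, None
        for vec, response in self._entries:
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score > self.threshold:
            return best_response  # cache hit
        return None  # cache miss

    def store(self, query_vec: Vector, response: str) -> None:
        """Save a new embedding-answer pair for future lookups."""
        self._entries.append((list(query_vec), response))


def answer(
    query_vec: Vector,
    cache: SemanticCache,
    generate: Callable[[Vector], str],
) -> Tuple[str, bool]:
    """Return (response, was_cache_hit); call the model only on a miss."""
    cached = cache.lookup(query_vec)
    if cached is not None:
        return cached, True
    response = generate(query_vec)  # stand-in for the expensive model call
    cache.store(query_vec, response)
    return response, False
```

In a real deployment the linear scan would typically be replaced by an approximate nearest-neighbor index, and the threshold would be tuned per application: too low and unrelated questions share an answer, too high and the cache rarely hits.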

Real-World Applications

Semantic caching is crucial for deploying cost-effective AI solutions across a range of domains.

Customer Support Chatbots: In an IT support desk, hundreds of users might ask variations of the same question (e.g., "How do I reset my password?" vs. "Forgot password steps"). Semantic caching recognizes these intents as identical, ensuring the model only computes the answer once. This drastically lowers inference latency and reduces token usage for API management solutions.
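Visual Discovery and RAG: In multimodal pipelines, platforms use feature extraction to cache the embeddings of reference images. When a user uploads an image to find visually similar items, the system can immediately retrieve semantically matching cached results. Because large visual inputs no longer have to be encoded repeatedly, this speeds up visual recommendation systems considerably. Developers frequently integrate tools such as LangChain to orchestrate these caching layers.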
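To make the visual discovery case more concrete, the sketch below caches reference embeddings once and ranks them against a query embedding. It is only an illustration: `ReferenceEmbeddingCache`, its methods, and the item IDs are hypothetical, and the short vectors stand in for embeddings that a feature extractor would produce in practice.

```python
import math
from typing import Dict, List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class ReferenceEmbeddingCache:
    """Hypothetical cache mapping item IDs to precomputed unit-length embeddings."""

    def __init__(self) -> None:
        self._items: Dict[str, List[float]] = {}

    def add(self, item_id: str, embedding: Sequence[float]) -> None:
        """Encode once, store normalized, and reuse for every later query."""
        self._items[item_id] = normalize(embedding)

    def most_similar(self, query: Sequence[float], k: int = 3) -> List[Tuple[str, float]]:
        """Return up to k (item_id, similarity) pairs, most similar first."""
        q = normalize(query)
        scored = [
            (item_id, sum(a * b for a, b in zip(q, vec)))
            for item_id, vec in self._items.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


# Toy 3-dimensional vectors used purely for illustration.
catalog = ReferenceEmbeddingCache()
catalog.add("sneaker", [0.9, 0.1, 0.0])
catalog.add("boot", [0.6, 0.8, 0.0])
catalog.add("hat", [0.0, 0.0, 1.0])

top_matches = catalog.most_similar([1.0, 0.05, 0.0], k=2)
```

Normalizing at insertion time means each query costs one normalization plus a dot product per cached item, and only the query image has to be encoded at request time.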

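Simulating Semantic Caching in Vision

The following Python snippet shows how to simulate the core mechanism of a semantic cache using PyTorch and the ultralytics package. By using an Ultralytics YOLO26 classification model to compute the similarity between a previously cached image and a new query image, the system can decide whether a full inference pass is needed.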

import torch
from ultralytics import YOLO

# Load an Ultralytics YOLO26 classification model for embedding generation
model = YOLO("yolo26n-cls.pt")

# Extract the embedding for a previously 'cached' reference image
cached_embed = model.embed("reference_shoe.jpg")[0].flatten()

# Extract the embedding for a new user query image
new_embed = model.embed("user_uploaded_shoe.jpg")[0].flatten()

# Calculate cosine similarity to check for a semantic cache hit
similarity = torch.nn.functional.cosine_similarity(cached_embed, new_embed, dim=0)

# Apply a threshold to determine if the images are semantically equivalent
if similarity > 0.90:
    print(f"Cache hit! Similarity: {similarity.item():.2f}. Returning cached response.")
else:
    print(f"Cache miss! Similarity: {similarity.item():.2f}. Running full inference.")

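For teams that want to manage datasets and deploy highly optimized computer vision models that integrate seamlessly with advanced caching architectures, Ultralytics Platform provides an intuitive, end-to-end environment for training, tracking, and serving models at scale.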
