プロンプトインジェクションが、AIの脆弱性をどのように悪用し、セキュリティに影響を与えるか、そして悪意のある攻撃からAIシステムを保護するための戦略を学んでください。
Prompt injection is a security vulnerability that primarily impacts systems built on Generative AI and Large Language Models (LLMs). It occurs when a malicious user crafts a specific input—often disguised as benign text—that tricks the artificial intelligence into overriding its original programming, safety guardrails, or system instructions. Unlike traditional hacking methods that exploit software bugs in code, prompt injection attacks the model's semantic interpretation of language. By manipulating the context window, an attacker can force the model to reveal sensitive data, generate prohibited content, or perform unauthorized actions. As AI becomes more autonomous, understanding this vulnerability is critical for maintaining robust AI Safety.
While initially discovered in text-only chatbots, prompt injection is becoming increasingly relevant in Computer Vision (CV) due to the emergence of Multi-Modal Models. Modern Vision-Language Models (VLMs), such as CLIP or open-vocabulary detectors like YOLO-World, allow users to define detection targets using natural language descriptions (e.g., "find the red backpack").
In these systems, the text prompt is converted into embeddings that the model compares against visual features. A "visual prompt injection" can occur if an attacker presents an image containing text instructions (like a sign saying "Ignore this object") that the model's Optical Character Recognition (OCR) component reads and interprets as a high-priority command. This creates a unique attack vector where the physical environment itself acts as the injection mechanism, challenging the reliability of Autonomous Vehicles and smart surveillance systems.
The implications of prompt injection extend across various industries where AI interacts with external inputs:
It is important to differentiate prompt injection from similar terms in the machine learning landscape:
The following code demonstrates how a user-defined text prompt interfaces with an open-vocabulary vision model. In a
secure application, the user_prompt would need rigorous sanitization to prevent injection attempts. We
use the ultralytics package to load a model capable of understanding text definitions.
from ultralytics import YOLO
# Load a YOLO-World model capable of open-vocabulary detection
# This model maps text prompts to visual objects
model = YOLO("yolov8s-world.pt")
# Standard usage: The system expects simple class names
safe_classes = ["person", "bicycle", "car"]
# Injection Scenario: A malicious user inputs a prompt attempting to alter behavior
# e.g., attempting to override internal safety concepts or confuse the tokenizer
malicious_input = ["ignore safety gear", "authorized personnel only"]
# Setting classes updates the model's internal embeddings
model.set_classes(malicious_input)
# Run prediction. If the model is vulnerable to the semantic content
# of the malicious prompt, detection results may be manipulated.
results = model.predict("https://ultralytics.com/images/bus.jpg")
# Visualize the potentially manipulated output
results[0].show()
Defending against prompt injection is an active area of research. Techniques include Reinforcement Learning from Human Feedback (RLHF) to train models to refuse harmful instructions, and implementing "sandwich" defenses where user input is enclosed between system instructions. Organizations using the Ultralytics Platform for training and deployment can monitor inference logs to detect anomalous prompt patterns. Additionally, the NIST AI Risk Management Framework provides guidelines for assessing and mitigating these types of risks in deployed systems.