
F1-Score

Discover the importance of the F1-Score in machine learning, and learn how to balance precision and recall for optimal model evaluation.

The F1-Score is a critical performance metric in machine learning that combines precision and recall into a single value: their harmonic mean. It is particularly useful for evaluating classification models where the dataset is imbalanced or where false positives and false negatives carry different costs. Unlike plain accuracy, which can be misleading when one class dominates the dataset, the F1-Score provides a more balanced view of a model's ability to identify relevant instances correctly while minimizing errors. Because the harmonic mean penalizes extreme values, a high score is only achieved when both precision and recall are reasonably high, making the F1-Score a staple metric in fields ranging from medical diagnostics to information retrieval.
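To make the harmonic-mean definition concrete, here is a minimal sketch in plain Python (the numbers are illustrative, not taken from any benchmark) showing how the F1-Score reacts to an imbalance between precision and recall:

def f1_score(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; defined as 0 when both are 0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A model with high precision but poor recall is penalized heavily
print(f1_score(0.95, 0.20))  # ~0.33, far below the arithmetic mean of 0.575

Because the harmonic mean is dominated by the smaller of the two values, the only way to reach a high F1-Score is to keep precision and recall high at the same time.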

Why F1-Score Matters in Machine Learning

In many real-world scenarios, simply knowing the percentage of correct predictions (accuracy) is insufficient. For example, in anomaly detection, normal cases far outnumber anomalies. A model that predicts "normal" for every single input might achieve 99% accuracy but would be useless for detecting actual issues. The F1-Score addresses this by balancing two competing metrics:

  • Precision: This measures the quality of positive predictions. It answers the question, "Of all the instances the model labeled as positive, how many were actually positive?"
  • Recall: This measures how completely the model captures the positive class. It answers, "Of all the actual positive instances, how many did the model correctly identify?"

Because there is often a trade-off—improving precision tends to lower recall and vice versa—the F1-Score acts as a unified metric to find an optimal balance point. This is crucial when tuning models using hyperparameter optimization to ensure robust performance across diverse conditions.
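To make the trade-off, and the contrast with accuracy, concrete, here is a minimal sketch with made-up counts for an imbalanced anomaly-detection task (the numbers are purely illustrative):

# Hypothetical confusion-matrix counts: 20 real anomalies among 1,000 samples
tp, fp, fn, tn = 8, 2, 12, 978

precision = tp / (tp + fp)                           # 0.80: most flagged cases are real anomalies
recall = tp / (tp + fn)                              # 0.40: yet more than half the anomalies are missed
accuracy = (tp + tn) / (tp + fp + fn + tn)           # 0.986: looks excellent despite the misses
f1 = 2 * precision * recall / (precision + recall)   # ~0.53: exposes the weak recall

print(f"Precision={precision:.2f} Recall={recall:.2f} Accuracy={accuracy:.3f} F1={f1:.2f}")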

Real-World Applications

The utility of the F1-Score extends across various industries where the cost of error is significant.

  • Medical Diagnostics: In AI in healthcare, specifically for tasks like tumor detection, a false negative (missing a tumor) is life-threatening, while a false positive (flagging benign tissue) causes unnecessary anxiety. The F1-Score helps researchers optimize models like YOLO26 to ensure that the system is sensitive enough to catch diseases without overwhelming doctors with false alarms.
  • Information Retrieval and Search: Search engines and document classification systems use F1-Score to evaluate relevance. Users want to see all relevant documents (high recall) but do not want to wade through irrelevant results (high precision). A high F1-Score indicates the engine is effectively retrieving the right information without clutter.
  • Spam Filtering: Email services use text classification to segregate spam. The system must catch spam emails (recall) but crucially must not label important work emails as junk (precision). The F1-Score serves as the primary benchmark for these filters.

Calculating F1-Score with Ultralytics

Modern computer vision frameworks simplify the calculation of these metrics. When training object detection models, the F1-Score is automatically computed during the validation phase. The Ultralytics Platform visualizes these metrics in real-time charts, allowing users to see the curve of F1-Score against different confidence thresholds.

Here is how you can access validation metrics, including components of the F1-Score, using the Python API:

from ultralytics import YOLO

# Load a pre-trained YOLO26 model
model = YOLO("yolo26n.pt")

# Validate the model on a dataset (metrics are computed automatically)
# This returns a metrics object containing precision, recall, and mAP
metrics = model.val(data="coco8.yaml")

# Print the Mean Average Precision (mAP50-95), which correlates with F1 performance
print(f"mAP50-95: {metrics.box.map}")

# Access precision and recall arrays to manually inspect the balance
print(f"Precision: {metrics.box.p}")
print(f"Recall: {metrics.box.r}")

F1-Score vs. Related Metrics

Understanding how the F1-Score differs from other evaluation criteria is essential for selecting the right tool for your project.

  • Difference from Accuracy: Accuracy treats all errors equally. F1-Score is superior for imbalanced datasets because it focuses on the performance of the positive class (the minority class of interest).
  • Relation to mAP: Mean Average Precision (mAP) is the standard for comparing object detection models across all confidence thresholds. However, the F1-Score is often used to determine the optimal confidence threshold for deployment: you might pick the threshold where the F1 curve peaks (see the sketch after this list).
  • Confusion Matrix: The confusion matrix provides the raw counts (True Positives, False Positives, etc.) from which the F1-Score is derived. While the matrix gives granular detail, the F1-Score provides a single summary statistic for quick comparison.
  • ROC-AUC: The Area Under the Curve (AUC) measures separability across all thresholds. F1-Score is generally preferred over ROC-AUC when you have a highly skewed class distribution (e.g., fraud detection where fraud is rare).
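As a concrete illustration of threshold selection, here is a minimal sketch using scikit-learn with made-up labels and confidence scores; it sweeps the candidate thresholds, computes F1 at each, and reports the peak:

import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical ground-truth labels and predicted confidence scores
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.05, 0.90, 0.55, 0.30])

# precision and recall have one more entry than thresholds, so drop the last point
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-16)

best = np.argmax(f1)
print(f"Best threshold: {thresholds[best]:.2f} (F1 = {f1[best]:.2f})")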

Improving Your F1-Score

If your model suffers from a low F1-Score, several strategies can help. Data augmentation can increase the variety of positive examples, helping the model generalize better. Employing transfer learning from robust foundation models allows the network to leverage pre-learned features. Additionally, adjusting the confidence threshold during inference can manually shift the balance between precision and recall to maximize the F1-Score for your specific use case.
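For the threshold-adjustment strategy in particular, here is a minimal sketch reusing the model from the earlier example (the image path and confidence value are placeholders, not recommendations):

from ultralytics import YOLO

model = YOLO("yolo26n.pt")

# Lowering conf favors recall (more detections, more false positives);
# raising it favors precision. A common choice is the value where the
# validation F1 curve peaks.
results = model.predict("path/to/image.jpg", conf=0.4)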
