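K-Means Clustering

Learn about K-Means Clustering, a key unsupervised learning algorithm for grouping data into clusters. Explore its process, applications, and comparisons!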

K-Means Clustering is a fundamental and widely used algorithm in the field of unsupervised learning designed to uncover hidden structures within unlabeled data. Its primary objective is to partition a dataset into distinct subgroups, known as clusters, such that data points within the same group are as similar as possible, while those in different groups are distinct. As a cornerstone of data mining and exploratory analysis, K-Means empowers data scientists to automatically organize complex information into manageable categories without the need for predefined labels or human supervision.

How the Algorithm Works

K-Means is iterative and relies on a distance metric to group the training data. It organizes items into K clusters, where each item belongs to the cluster with the nearest mean, or centroid. Each iteration reduces the within-cluster variance, that is, the sum of squared distances between every point and its assigned centroid, although the result is a local optimum that depends on the starting centroids. The workflow generally follows these steps:

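  1. Initialization: The algorithm selects K initial points as cluster centers. These centroids can be chosen at random or with an optimized method such as k-means++, which speeds up convergence.
  2. Assignment: Each data point in the dataset is assigned to its nearest centroid according to a chosen distance metric, most commonly Euclidean distance.
  3. Update: Each centroid is recalculated as the average (mean) of all data points assigned to its cluster.
  4. Iteration: Steps 2 and 3 are repeated until the centroids no longer move significantly or a maximum number of iterations is reached.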
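
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by any particular library: the function name `kmeans`, its parameters (`max_iter`, `tol`, `seed`), and the plain random initialization are assumptions made for this example.

```python
import numpy as np


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Illustrative K-Means loop; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1 (initialization): pick k distinct data points as starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)

    for _ in range(max_iter):
        # Step 2 (assignment): Euclidean distance from every point to every centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3 (update): move each centroid to the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)

        # Step 4 (iteration): stop once the centroids have effectively stopped moving.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    # Final assignment so the returned labels match the returned centroids.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, distances.argmin(axis=1)


points = np.array(
    [[10, 10], [12, 11], [100, 100], [102, 101], [10, 12], [101, 102]], dtype=float
)
centroids, labels = kmeans(points, k=2)
print("Labels:", labels)
print("Centroids:\n", centroids)
```

On this toy data the two groups are far apart, so the loop settles on one centroid near (10.7, 11) and one at (101, 101); which of them gets label 0 depends on the random initialization.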

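Determining the right number of clusters (K) is a critical part of using this algorithm. Practitioners often use techniques such as the Elbow method, or analyze the Silhouette score, to evaluate how well separated the resulting clusters are.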
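
As a rough sketch of both techniques, the snippet below fits Scikit-learn's `KMeans` for several values of K on synthetic data and records the inertia (the quantity plotted in an elbow chart) and the silhouette score. The three-blob dataset, the range of K, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated 2D blobs (illustrative assumption).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [0, 10]])
data = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances
    silhouettes[k] = silhouette_score(data, model.labels_)  # in [-1, 1], higher is better
    print(f"k={k}  inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")

# Pick the K with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
print(f"Highest silhouette score at k={best_k}")
```

Inertia always shrinks as K grows, so the elbow method looks for the K after which the drop flattens out. The silhouette score instead peaks at a specific K, which makes it easier to pick automatically. On real data the two criteria can disagree and are best read together.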

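Real-World Applications in AI

K-Means clustering is highly versatile and is widely used across industries for data simplification and data preprocessing.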

  • Image Compression and Color Quantization: In computer vision (CV), K-Means helps reduce the file size of images by clustering pixel colors. By grouping thousands of colors into a smaller palette of dominant colors, the algorithm shrinks the color space (a form of vector quantization) while preserving the visual structure of the image. The same idea can serve as a preprocessing step that simplifies input data before training object detection models.
  • Customer Segmentation: Businesses leverage clustering to group customers based on purchasing history, demographics, or website behavior. This allows for targeted marketing strategies, a key component of AI in retail solutions. By identifying high-value shoppers or churn risks, companies can tailor their messaging effectively.
  • Anomaly Detection: By learning the structure of "normal" data clusters, systems can identify outliers that fall far from any centroid. This is valuable for fraud detection in finance and anomaly detection in network security, helping to flag suspicious activities that deviate from standard patterns.
  • Anchor Box Generation: Historically, object detectors like older YOLO versions utilized K-Means to calculate optimal anchor boxes from training datasets. While modern models like YOLO26 utilize advanced anchor-free methods, understanding K-Means remains relevant to the evolution of detection architectures.
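
To make the color quantization use case concrete, here is a small sketch that reduces a synthetic image to a four-color palette. The image is generated with NumPy rather than loaded from disk, and the palette size `n_colors` is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 32x32 RGB image: noisy pixels around four base colors (illustrative data).
rng = np.random.default_rng(0)
base_colors = np.array([[220, 30, 30], [30, 200, 60], [40, 60, 210], [240, 230, 50]])
choice = rng.integers(0, len(base_colors), size=(32, 32))
noisy = base_colors[choice] + rng.normal(scale=8, size=(32, 32, 3))
image = np.clip(noisy, 0, 255).astype(np.uint8)

# Treat every pixel as a point in RGB space and cluster into a small palette.
pixels = image.reshape(-1, 3).astype(float)
n_colors = 4
kmeans = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)

# Replace each pixel with its cluster centroid (the palette color).
palette = np.rint(kmeans.cluster_centers_).astype(np.uint8)
quantized = palette[kmeans.labels_].reshape(image.shape)

print("Unique colors before:", len(np.unique(image.reshape(-1, 3), axis=0)))
print("Unique colors after: ", len(np.unique(quantized.reshape(-1, 3), axis=0)))
```

After quantization each pixel can be stored as a small palette index instead of three color channels, which is where the size reduction comes from.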

Implementation Example

While deep learning frameworks like the Ultralytics Platform handle complex training pipelines, K-Means is often used for analyzing dataset statistics. The following Python snippet clusters 2D coordinates, which stand in for object centroids, using the popular Scikit-learn library.

import numpy as np
from sklearn.cluster import KMeans

# Simulated coordinates of detected objects (e.g., from YOLO26 inference)
points = np.array([[10, 10], [12, 11], [100, 100], [102, 101], [10, 12], [101, 102]])

# Initialize K-Means to find 2 distinct groups (clusters)
kmeans = KMeans(n_clusters=2, random_state=0, n_init="auto").fit(points)

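# Output the cluster label (0 or 1) assigned to each point
print(f"Cluster Labels: {kmeans.labels_}")
# Example output: Cluster Labels: [1 1 0 0 1 0] -> points near (10,10) share one label, points near (100,100) the other
# (which group is numbered 0 or 1 is arbitrary and can differ between runs or library versions)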
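
Building on the same data, a fitted model can also report its centroids, label unseen points, and measure how far a point lies from its nearest centroid. The sketch below repeats the setup so it runs on its own. The `new_points` values and the distance `threshold` are hypothetical, and `n_init=10` is used here only so the snippet also runs on older Scikit-learn releases.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[10, 10], [12, 11], [100, 100], [102, 101], [10, 12], [101, 102]])
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(points)

# Learned centroids, one row per cluster.
print(kmeans.cluster_centers_)

# Assign previously unseen points to the nearest learned centroid.
new_points = np.array([[11, 9], [99, 103], [55, 200]])
new_labels = kmeans.predict(new_points)

# transform() returns the distance from each point to every centroid;
# the row-wise minimum is the distance to the assigned centroid.
nearest_distance = kmeans.transform(new_points).min(axis=1)

# Hypothetical threshold: flag points far from every centroid as potential outliers.
threshold = 20.0
is_outlier = nearest_distance > threshold
print(new_labels, nearest_distance.round(1), is_outlier)
```

The first two new points fall within a few units of a centroid, while (55, 200) lies more than 100 units from both and is flagged. A fixed threshold is the simplest option. In practice it would usually be derived from the distribution of distances in the training data.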

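Comparison with Related Algorithms

It is important to distinguish K-Means from other algorithms with similar names or functions so that the right tool is chosen for a project.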

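  • K-Means vs. K-Nearest Neighbors (KNN): The two are often confused because both names contain a "K". K-Means is an unsupervised algorithm for clustering unlabeled data. K-Nearest Neighbors (KNN), by contrast, is a supervised learning algorithm that relies on labeled data and is used mainly for image classification and regression, predicting from the majority class (or average value) of a point's nearest neighbors.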
  • K-Means vs. DBSCAN: While both cluster data, K-Means assumes roughly spherical clusters and requires the number of clusters to be defined beforehand. DBSCAN groups data by density, can find clusters of arbitrary shape, and labels sparse points as noise instead of forcing them into a cluster. This makes DBSCAN a better fit for spatial data with irregular structure or an unknown number of clusters, although it needs its own density parameters to be tuned.
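
The difference with DBSCAN is easiest to see on non-spherical data. The following sketch compares the two on Scikit-learn's synthetic "two moons" dataset and scores each against the known grouping. The `eps` and `min_samples` values are hand-picked assumptions for this dataset, not general recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: non-spherical clusters (synthetic data).
X, true_labels = make_moons(n_samples=400, noise=0.05, random_state=0)

# K-Means needs the number of clusters up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN needs a neighborhood radius and a density threshold instead
# (illustrative values tuned by hand for this dataset).
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
kmeans_ari = adjusted_rand_score(true_labels, kmeans_labels)
dbscan_ari = adjusted_rand_score(true_labels, dbscan_labels)
print(f"K-Means ARI: {kmeans_ari:.2f}")
print(f"DBSCAN  ARI: {dbscan_ari:.2f}")
print("Points DBSCAN marked as noise:", int(np.sum(dbscan_labels == -1)))
```

K-Means splits the plane with a straight boundary and cuts across both moons, while DBSCAN follows each curved band. On compact, blob-like data the ranking can easily reverse, and K-Means is typically faster.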
