Seeding Reference¶
K-Means++¶
K-Means++ is the original careful seeding algorithm for K-Means, proposed by Arthur and Vassilvitskii in 2007 [kmpp].
It chooses the first center uniformly at random, then computes the squared distance of every datapoint to the already chosen centers and uses these distances as weights when sampling the next center. These steps are repeated until k centers are chosen.
Choosing a good seeding speeds up convergence for K-Means, but an extra time cost is incurred computing all the distances.
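For illustration, a minimal NumPy sketch of the D^2 sampling loop just described (not part of this package; the function name and structure are illustrative only):

    import numpy as np

    def kmeanspp_sketch(X, k, rng=None):
        # Illustrative k-means++ (D^2) seeding; not this package's implementation.
        if rng is None:
            rng = np.random.default_rng()
        n = X.shape[0]
        centers = [X[rng.integers(n)]]          # first center: uniform at random
        for _ in range(k - 1):
            C = np.asarray(centers)
            # Squared distance of every point to its closest chosen center.
            d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1)
            # Sample the next center with probability proportional to D^2.
            centers.append(X[rng.choice(n, p=d2 / d2.sum())])
        return np.asarray(centers)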
K-Means Markov Chain Monte Carlo¶
KMC^2 was proposed as an improvement over K-Means++ in 2016 [kmc2]. While K-Means++ requires k full passes over the dataset, KMC^2 replaces the D^2 sampling step with Markov Chain Monte Carlo sampling. The runtime is no longer tied to the number of datapoints, while new centers are still likely to be chosen far from the current centers.
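A sketch of a single Metropolis step of that chain, assuming the uniform proposal used in this implementation (illustrative only; kmc2() below implements the full algorithm):

    import numpy as np

    def kmc2_chain_step_sketch(X, centers, x_idx, rng):
        # One Metropolis step with a uniform proposal: candidates far from the
        # current centers are accepted with higher probability.
        y_idx = rng.integers(X.shape[0])
        d2 = lambda i: ((centers - X[i]) ** 2).sum(axis=1).min()
        dx, dy = d2(x_idx), d2(y_idx)
        if dx == 0 or rng.random() < dy / dx:
            return y_idx        # accept the candidate
        return x_idx            # keep the current state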
afkmc2.kmc2(X, k, m=200)¶
KMC^2 seeding as described by Bachem, Lucic, Hassani and Krause (2016).
Runtime: O(mk^2d)
Parameters:
- X (np.array) – Datapoints. Shape: (n, d)
- k (int) – Number of cluster centers.
- m (int) – Length of the Markov chain. Default: 200
Returns: Cluster centers for seeding. Shape: (k, d)
Return type: np.array
Example: seeds = afkmc2.kmc2(X, 3)
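A fuller usage sketch, assuming the seeds are handed to scikit-learn's KMeans (scikit-learn is not required by this package; the blob data is made up for illustration):

    import numpy as np
    import afkmc2
    from sklearn.cluster import KMeans

    # Synthetic data: three Gaussian blobs in 2-D.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                   for c in ([0, 0], [5, 5], [0, 5])])

    seeds = afkmc2.kmc2(X, 3)                       # shape (3, 2), as documented above
    model = KMeans(n_clusters=3, init=seeds, n_init=1).fit(X)
    print(model.cluster_centers_)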
Assumption Free KMC^2¶
AFKMC^2 is an improvement proposed by the same authors [afkmc2]. While KMC^2 requires assumptions about the data-generating distribution (in our implementation uniformity), this algorithm works without such assumptions. It uses the D^2-sampling distribution with respect to the first center c_1 as a proposal distribution, which can approximate nonuniform distributions. This means an added runtime cost of O(nd) to calculate the initial distribution, but a performance improvement for nonuniform samples.
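A sketch of that O(nd) preprocessing step; the mixture of the D^2 term with a uniform term follows the paper [afkmc2], and the function name is illustrative:

    import numpy as np

    def afkmc2_proposal_sketch(X, c1):
        # Proposal built once from the squared distances to the first center c1:
        # half D^2 mass, half uniform mass (as in [afkmc2]).
        d2 = ((X - c1) ** 2).sum(axis=1)
        return 0.5 * d2 / d2.sum() + 0.5 / X.shape[0]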
afkmc2.afkmc2(X, k, m=200)¶
AFKMC^2 seeding as described by Bachem, Lucic, Hassani and Krause (2016).
Runtime: O(nd + mk^2d)
Parameters:
- X (np.array) – Datapoints. Shape: (n, d)
- k (int) – Number of cluster centers.
- m (int) – Length of the Markov chain. Default: 200
Returns: Cluster centers for seeding. Shape: (k, d)
Return type: np.array
Example: seeds = afkmc2.afkmc2(X, 3)
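A usage sketch showing the chain-length trade-off (a longer chain approximates exact D^2 sampling more closely at the cost of more distance evaluations); the data and values are illustrative:

    import numpy as np
    import afkmc2

    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 16))

    seeds = afkmc2.afkmc2(X, 10, m=500)   # longer chain than the default m=200

    # Quantization error of the seeding itself
    # (sum of squared distances to the nearest seed).
    err = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(-1).min(axis=1).sum()
    print(err)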
Cached AFKMC^2¶
The author of this package proposed this slight runtime improvement for AFKMC^2 (it could also be applied to KMC^2). Since the first O(nd) pass already calculates all distances between X and c_1, we can save ourselves at least k*m distance calculations by storing these results and reusing them in the Markov chain. This comes with an additional space cost of O(nk).
The savings are highest for small datasets but can yield significant runtime improvement for very large/high-dimensional ones as well.
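A sketch of the caching idea under the assumptions above (names are illustrative; afkmc2_c() below is the actual implementation):

    import numpy as np

    def min_d2_with_cache_sketch(X, d2_c1, idx, later_centers):
        # d2_c1 holds every point's squared distance to c1, computed once during
        # the O(nd) proposal pass, e.g. d2_c1 = ((X - c1) ** 2).sum(axis=1).
        # Inside the chain, d(x, c1)^2 becomes a lookup and only the centers
        # chosen after c1 need fresh distance computations.
        if len(later_centers) == 0:
            return d2_c1[idx]
        d2_rest = ((np.asarray(later_centers) - X[idx]) ** 2).sum(axis=1).min()
        return min(d2_c1[idx], d2_rest)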
afkmc2.afkmc2_c(X, k, m=200)¶
Cached AFKMC^2 seeding, based on AFKMC^2 as described by Bachem, Lucic, Hassani and Krause (2016). A caching addition prevents duplicate work, mainly for small datasets.
Additional space cost during execution: O(nk)
Runtime: O(nd + mk^2d)
Parameters:
- X (np.array) – Datapoints. Shape: (n, d)
- k (int) – Number of cluster centers.
- m (int) – Length of the Markov chain. Default: 200
Returns: Cluster centers for seeding. Shape: (k, d)
Return type: np.array
Example: seeds = afkmc2.afkmc2_c(X, 3)
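A short usage sketch for a setting where the cache tends to pay off (high-dimensional data); shapes and values are illustrative:

    import numpy as np
    import afkmc2

    rng = np.random.default_rng(2)
    X = rng.normal(size=(5000, 512))      # distance evaluations are expensive in 512-D

    seeds = afkmc2.afkmc2_c(X, 20)        # extra O(nk) memory for the cached distances
    print(seeds.shape)                    # (20, 512)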
References¶
[kmpp] Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms (pp. 1027-1035). Society for Industrial and Applied Mathematics.
[kmc2] Bachem, O., Lucic, M., Hassani, S. H., & Krause, A. (2016). Approximate K-Means++ in Sublinear Time. In AAAI (pp. 1459-1467).
[afkmc2] Bachem, O., Lucic, M., Hassani, H., & Krause, A. (2016). Fast and Provably Good Seedings for k-Means. In Advances in Neural Information Processing Systems (pp. 55-63).