[1811.00102] On the Persistence of Clustering Solutions and True Number of Clusters in a Dataset
Our simulations demonstrate the efficacy of this simple approach, and experimental results show that our method outperforms many other existing methods in the literature.

Abstract: Clustering algorithms typically provide clustering solutions with a
prespecified number of clusters. The lack of a priori knowledge of the true
number of underlying clusters in a dataset makes it important to have a
metric for comparing clustering solutions with different numbers of clusters.
This article quantifies a notion of persistence of clustering solutions that
enables comparing solutions with different numbers of clusters. The persistence
relates to the range of data-resolution scales over which a clustering solution
persists; it is quantified in terms of the maximum over the two-norms of all the
associated cluster-covariance matrices. Thus we associate a persistence value
with each element in a set of clustering solutions with different numbers of
clusters. We show that for datasets where the natural clusters are a priori known,
the clustering solutions that identify the natural clusters are the most
persistent; in this way, this notion can be used to identify solutions with the
true number of clusters. Detailed experiments on a variety of standard and
synthetic datasets demonstrate that the proposed persistence-based indicator
outperforms existing approaches, such as the gap-statistic method, $X$-means,
$G$-means, $PG$-means, dip-means, and the information-theoretic method, in
accurately identifying clustering solutions with the true number of clusters.
Interestingly, our method can be explained in terms of the phase-transition
phenomenon in the deterministic annealing algorithm, where the number of
distinct cluster centers changes (bifurcates) with respect to an annealing
parameter.
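The building block of the abstract's persistence quantity — the maximum, over clusters, of the 2-norm of each cluster's covariance matrix (for a symmetric positive-semidefinite covariance, the 2-norm equals its largest eigenvalue) — can be sketched in a few lines. The snippet below is an illustrative pure-Python computation for 2-d data; it is not the paper's full persistence measure $v(k)$, which also accounts for the resolution range over which a solution persists, and all function names are ours.

```python
import math

def cov2x2(points):
    """Sample covariance (sxx, sxy, syy) of a list of 2-d points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return sxx, sxy, syy

def spectral_norm_2x2(sxx, sxy, syy):
    """2-norm of a symmetric 2x2 matrix = its largest eigenvalue."""
    mean = (sxx + syy) / 2.0
    diff = (sxx - syy) / 2.0
    return mean + math.sqrt(diff * diff + sxy * sxy)

def max_cluster_cov_norm(clusters):
    """Maximum over clusters of the 2-norm of the cluster covariance --
    the quantity the abstract associates with a clustering solution."""
    return max(spectral_norm_2x2(*cov2x2(c)) for c in clusters)
```

For two tight, well-separated clusters this value is small, while merging them into one cluster inflates the covariance and hence the norm — consistent with the intuition that a solution resolving the natural clusters persists over a wider range of scales.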

Figure 1: Illustration of a mixture of nine Gaussian distributions arranged in three superclusters of three distributions each. In (a1) the three superclusters are well separated from each other, while in (b1) they are closer together. Observe that in (a2), for a large range of resolution scales (radii r) within the blue annulus, each supercluster appears as a single cluster, and only for the small range of resolution scales depicted by the green annulus is each Gaussian distribution identifiable separately. In other words, over a large range of resolutions only the three superclusters are distinguishable from one another, while over a smaller range each Gaussian distribution is identified separately. In (b1), since the three superclusters are closer to each other, the range of resolution scales within the blue annulus is reduced, indicating the existence of three natural clusters in (a1) and nine natural clusters in (b1). (Introduction and Related Work)

Figure 2: Evaluation of our method on a variety of synthetic datasets: (a) low-variance Gaussian, kt = 4; (b) high-variance Gaussian, kt = 4; (c) ComboSetting, kt = 8; (d) Synthetic-15 (S15), kt = 15; (e) concentric rings, kt = 3; (f) spirals, kt = 3. Our method predicts the correct number of clusters in each of these scenarios. (Persistence of a Clustering Solution and its Quantification)

Figure 3: Two circular clusters of radius R with uniformly distributed data points. (Persistence of a Clustering Solution and its Quantification)

Figure 4: Performance of our method on high-dimensional datasets: (a) Wisconsin (d = 9, N = 681, kt = 2); (b) Yeast (d = 8, N = 1484, kt = 10); (c) Glass (d = 9, N = 214, kt = 6); (d) Leaves (d = 64, N = 1600, kt = 100); (e) Wine (d = 13, N = 178, kt = 3); (f) Iris (d = 4, N = 150, kt = 3); (g) Banknote Authentication (d = 4, N = 1372, kt = 2); (h) Thyroid (d = 5, N = 215, kt = 3). Note that, except for the Iris dataset, v(kt) is the maximum in all the plots. (Persistence of a Clustering Solution and its Quantification)
Figure 5: (a) Synthetic 2-d data with N = 5000 vectors and kt = 15 Gaussian clusters (Fränti and Virmajoki 2006) with different degrees of overlap. The corresponding v(k) versus k plots for both cases show a peak at k = 15. (b) The v(k) versus k plot indicates 3 as the true number of clusters. (c) The v(k) versus k plot indicates 9 as the true number of clusters. (d) The v(k) versus k plot indicates kt = 100 for the Birch1 dataset. (Persistence of a Clustering Solution and its Quantification)

Figure 7: The optimal clustering at each value of k ∈ {1, 2, 3, 4, 5, 6} for the dataset in Figure ??. (Additional Experimental Results)

Figure 6: The v(k) versus k plot for (a) the Wisconsin breast cancer dataset, which comprises N = 699 vectors in d = 9 dimensions with kt = 2; (b) the Banknote Authentication dataset with N = 1372, d = 4, and kt = 2. The v(k) versus k plot gives the correct estimate of the true number of clusters in both scenarios. (c) The v(k) versus k plot for the Iris dataset, which contains N = 150 vectors in d = 4 dimensions from 3 classes. As noted earlier, 2 of the 3 classes have considerable overlap with one another. The plot estimates kt = 2. Also note that v(3) is considerably high, suggesting the possibility of 3 clusters or a hierarchical structure in the dataset. (Additional Experimental Results)

Figure 8: The v(k) versus k plot for the analytically calculated v(k) at various values of k for the dataset in Figure ??. As shown, k = 2 is a more natural number of clusters than k ∈ {3, 4, 5, 6}. (Proof for v(2) > v(k) ∀ k ∈ {3, 4, 5, 6})
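The model-selection rule these plots illustrate — read off the k at which v(k) peaks — reduces to an argmax over the computed persistence values. A minimal sketch, where the v(k) values are hypothetical and chosen only for illustration (not taken from the paper):

```python
def estimate_true_k(v):
    """Return the number of clusters k whose persistence v(k) is largest."""
    return max(v, key=v.get)

# Hypothetical persistence values, for illustration only:
v = {2: 0.6, 3: 1.9, 4: 0.8, 5: 0.5, 6: 0.4}
k_star = estimate_true_k(v)  # picks the peak of the v(k) versus k curve
```

In the paper's experiments this peak coincides with the true kt on all the listed datasets except Iris, where overlapping classes make k = 2 the most persistent solution.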