OTclust is an R package for computing the mean partition of an ensemble of clustering results by optimal transport alignment (OTA) and for assessing uncertainty at the level of both the partition and the individual clusters. Uncertainty is measured by identifying set relationships between clusters across multiple clustering results. Functions are provided to compute the Covering Point Set (CPS), Cluster Alignment and Points based (CAP) separability, and the Wasserstein distance between partitions.
Here, we illustrate the usage of OTclust for an ensemble
clustering based on a simulated toy example,
# load the package and the simulated toy data set.
library(OTclust)
data(sim1)
# the number of clusters.
C = 4
# generate an ensemble of perturbed partitions.
# if perturb_method is 1, partitions are perturbed by bootstrap resampling; if it is 0, they are perturbed by adding Gaussian noise.
ens.data = ensemble(sim1$X, nbs=100, clust_param=C, clustering="kmeans", perturb_method=1)
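To make the idea of a bootstrap-perturbed ensemble concrete, here is a minimal base R sketch. It is not how OTclust builds its ensemble internally; the helper name `bootstrap_kmeans_ensemble`, its arguments, and the choice to assign every original point to its nearest fitted center are all assumptions made for illustration.

```r
# Hypothetical helper, not part of OTclust: build an ensemble of k-means
# partitions, each fitted on a bootstrap resample of the rows of X.
bootstrap_kmeans_ensemble <- function(X, k, n_rep = 20) {
  X <- as.matrix(X)
  n <- nrow(X)
  labels <- matrix(NA_integer_, nrow = n, ncol = n_rep)
  for (b in seq_len(n_rep)) {
    idx <- sample.int(n, n, replace = TRUE)        # bootstrap resample of rows
    fit <- kmeans(X[idx, , drop = FALSE], centers = k, nstart = 5)
    # squared distance from every original point to every fitted center
    d2 <- sapply(seq_len(k), function(j) {
      rowSums(sweep(X, 2, fit$centers[j, ])^2)
    })
    labels[, b] <- max.col(-d2, ties.method = "first")  # nearest center
  }
  labels  # one column of cluster labels per perturbed partition
}

set.seed(1)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
ens_sketch <- bootstrap_kmeans_ensemble(toy, k = 2, n_rep = 10)
dim(ens_sketch)  # 100 points by 10 partitions
```

Each column is one perturbed partition of the same points. Cluster labels are arbitrary from column to column, which is why partitions have to be aligned before they can be compared or averaged.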
To find a consensus partition, the function
# calculate baseline method for comparison.
kcl = kmeans(sim1$X,C)
# align clustering results for convenience of comparison.
# (ota holds the mean-partition result obtained from ens.data.)
compar = align(cbind(sim1$z,kcl$cluster,ota$meanpart))
lab.match = lapply(compar$weight,function(x) apply(x,2,which.max))
kcl.algnd = match(kcl$cluster,lab.match[[1]])
ota.algnd = match(ota$meanpart,lab.match[[2]])
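The relabeling pattern used above (take `which.max` over the columns of a weight matrix, then `match` the labels against the result) can be tried in isolation. The sketch below uses a plain contingency table as a stand-in for the weight matrices returned by `align`; that substitution, and the helper name `relabel_to_reference`, are assumptions for illustration and not the package's computation.

```r
# Hypothetical helper, not part of OTclust: relabel `part` so that its
# cluster labels agree with those of `ref` as far as possible.
# Assumes both label vectors use the integers 1..K with no empty cluster.
relabel_to_reference <- function(part, ref) {
  w <- table(part, ref)            # rows: labels of part, columns: labels of ref
  best <- apply(w, 2, which.max)   # for each ref cluster, the part label that overlaps most
  match(part, best)                # map each part label to the ref cluster it matches (NA if none)
}

ref  <- c(1, 1, 2, 2, 3, 3)
part <- c(3, 3, 1, 1, 2, 2)        # same grouping as ref, labels permuted
relabel_to_reference(part, ref)
#> [1] 1 1 2 2 3 3
```

If two reference clusters pick the same label in `part`, `match` returns the first one, and labels that no reference cluster picks become `NA`. That is worth checking before plotting aligned results side by side.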
Here, as cluster-wise uncertainty measures, we briefly introduce the topological relationship statistics of the mean partition, cluster alignment and points based (CAP) separability, and covering point sets (CPS). Detailed definitions of these statistics can be found in [1]. If you want to carry out CPS analysis, please see the next two sections.
# distance between ground truth and each partition
wassDist(sim1$z,kmeans(sim1$X,C)$cluster) # baseline method
#> [1] 0.2506715
wassDist(sim1$z,ota$meanpart) # mean partition by OTclust
#> [1] 0.2501118
# Topological relationships between mean partition and ensemble clusters
t(ota$match)
#> C1 C2 C3 C4
#> match 99 88 90 89
#> split 0 0 0 0
#> merge 0 0 0 0
#> l.c. 1 12 10 11
# Cluster Alignment and Points based (CAP) separability
ota$cap
#> C1 C2 C3 C4
#> C1 0.0000000 0.9129447 0.9969566 1.0000000
#> C2 0.9129447 0.0000000 1.0000000 0.9992862
#> C3 0.9969566 1.0000000 0.0000000 0.9519190
#> C4 1.0000000 0.9992862 0.9519190 0.0000000
# Covering Point Set(CPS)
otplot(sim1$X, ota$cps[lab.match[[2]][1],], legend.labels=c('','CPS'), add.text=FALSE, title='CPS for C1')
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_text()`).
otplot(sim1$X, ota$cps[lab.match[[2]][2],], legend.labels=c('','CPS'), add.text=FALSE, title='CPS for C2')
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_text()`).
otplot(sim1$X, ota$cps[lab.match[[2]][3],], legend.labels=c('','CPS'), add.text=FALSE, title='CPS for C3')
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_text()`).
otplot(sim1$X, ota$cps[lab.match[[2]][4],], legend.labels=c('','CPS'), add.text=FALSE, title='CPS for C4')
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_text()`).
The red area in each of the above plots indicates the covering point set (CPS) of the corresponding cluster. CPS analysis is described in detail in the next section.
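The topological relationship table printed by `t(ota$match)` above can be summarized per cluster. The sketch below retypes the printed counts by hand and converts them to match rates. Reading the column totals as the number of ensemble partitions (100, matching `nbs=100`) is our interpretation of the output, not something the package states here.

```r
# Counts copied from the printed t(ota$match) output above.
topo <- matrix(c(99, 88, 90, 89,
                  0,  0,  0,  0,
                  0,  0,  0,  0,
                  1, 12, 10, 11),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("match", "split", "merge", "l.c."),
                               c("C1", "C2", "C3", "C4")))

colSums(topo)                                   # 100 for every cluster
match_rate <- topo["match", ] / colSums(topo)   # share of "match" outcomes per cluster
match_rate                                      # 0.99 0.88 0.90 0.89
```

In this output C1 is matched in 99 of 100 cases, while C2, C3, and C4 are matched in roughly 90 percent, with the remainder falling in the l.c. row.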
The functions that are going to be used in this section are
# CPS analysis on selection of visualization methods
data(vis_pollen)
c=visCPS(vis_pollen$vis, vis_pollen$ref)
After the computation, we have the return list c, which would be the
input of function
Furthermore, if you want to see the statistics, you can simply view
the return of
In this section, the relevant functions are
# CPS Analysis on validation of clustering result
data(YAN)
y=clustCPS(YAN, k=7, l=FALSE, pre=FALSE, noi="after", cmethod="kmeans", dimr="PCA", vis="tsne")
#> Warning in min(ref): no non-missing arguments to min; returning Inf
#> sigma summary: Min. : 0.323162264525782 |1st Qu. : 0.686532727791371 |Median : 0.840637685950217 |Mean : 0.832540338898672 |3rd Qu. : 0.996223616580691 |Max. : 1.26695806934483 |
#> Epoch: Iteration #100 error is: 14.3176613957954
#> Epoch: Iteration #200 error is: 0.543317480280311
#> Epoch: Iteration #300 error is: 0.472106374779122
#> Epoch: Iteration #400 error is: 0.467064919717036
#> Epoch: Iteration #500 error is: 0.432161905030954
#> Epoch: Iteration #600 error is: 0.424053192821421
#> Epoch: Iteration #700 error is: 0.424050206450443
#> Epoch: Iteration #800 error is: 0.424050205418211
#> Epoch: Iteration #900 error is: 0.424050205417626
#> Epoch: Iteration #1000 error is: 0.424050205417624
# visualization of the results
mplot(y,4)
cplot(y,4)
# point-wise stability assessment
p=pplot(y)
p$v
If you want to try other clustering method rather than
[1] Jia Li, Beomseok Seo, and Lin Lin. “Optimal transport, mean partition, and uncertainty assessment in cluster analysis.” Statistical Analysis and Data Mining: The ASA Data Science Journal 12.5 (2019): 359-377.
[2] Lixiang Zhang, Lin Lin, and Jia Li. “CPS analysis: self-contained validation of biomedical data clustering.” Bioinformatics 36.11 (2020): 3516-3521.