Title: | Mean Partition, Uncertainty Assessment, Cluster Validation and Visualization Selection for Cluster Analysis |
---|---|
Description: | Provides the mean partition for ensemble clustering by optimal transport alignment (OTA), uncertainty measures for both partition-wise and cluster-wise assessment, and multiple visualization functions to show uncertainty, for instance, the membership heat map and the plot of the covering point set. A partition refers to an overall clustering result. Jia Li, Beomseok Seo, and Lin Lin (2019) <doi:10.1002/sam.11418>. Lixiang Zhang, Lin Lin, and Jia Li (2020) <doi:10.1093/bioinformatics/btaa165>. |
Authors: | Lixiang Zhang [aut, cre], Beomseok Seo [aut], Lin Lin [aut], Jia Li [aut] |
Maintainer: | Lixiang Zhang <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.6 |
Built: | 2025-01-31 05:41:18 UTC |
Source: | https://github.com/lixiangzhang/otclust |
This function aligns an ensemble of partitions with a reference partition by optimal transport.
align(data)
data |
– a numeric matrix of horizontally stacked cluster labels. Each column contains cluster labels for all the data points according to one clustering result. The reference clustering result is put in the first column, and the first cluster must be labeled as 1. |
a list of alignment results.
distance |
Wasserstein distances between the reference partition and the others. |
numcls |
the number of clusters for each partition. |
statistics |
average tightness ratio, average coverage ratio, 1-average jaccard distance. |
cap |
cluster alignment and points based (CAP) separability. |
id |
switched labels. |
cps |
covering point set. |
match |
topological relationship statistics between the reference partition and the others. |
Weight |
weight matrix. |
data(sim1)
# the number of clusters.
C = 4
# calculate baseline method for comparison.
kcl = kmeans(sim1$X,C)
# align clustering results for convenience of comparison.
compar = align(cbind(sim1$z,kcl$cluster))
Covering Point Set Analysis for validating clustering results. It conducts alignment among different results and then calculates the covering point set. The return contains several statistics which can be directly used as input for mplot or cplot. If you want to design your own workflow, you can use function CPS instead.
clustCPS( data, k, l = TRUE, pre = TRUE, noi = "after", cmethod = "kmeans", dimr = "PCA", vis = "tsne", ref = NULL, nPCA = 50, nEXP = 100 )
data |
– data given in a matrix format, where rows are samples, and columns are variables. |
k |
– number of clusters. |
l |
– logical. If TRUE, log-transformation will be carried out on the data. |
pre |
– logical. If TRUE, pre-dimension reduction will be carried out based on the variance. |
noi |
– adding noise before or after the dimension reduction, choosing between "before" and "after", default "after". |
cmethod |
– clustering method, choosing from "kmeans" and "mclust", default "kmeans". |
dimr |
– dimension reduction technique, choose from "none" and "PCA", default "PCA". |
vis |
– the visualization method to be used, such as "tsne" and "umap", default "tsne". Also, you can provide your own visualization coordinates in a numeric matrix of two columns. |
ref |
– optional, clustering result in a vector format, with the first cluster labeled as 1. If provided, it is used as the reference; otherwise a reference is generated. |
nPCA |
– number of principal components to use, default 50. |
nEXP |
– number of perturbed clustering results for CPS Analysis, default 100. |
a list used for mplot or cplot, in which tight_all is the overall tightness, member is the matrix used for the membership plot, set is the matrix for the covering point set plot, tight is the vector of cluster-wise tightness, vis is the visualization coordinates, ref is the reference labels and topo is the topological relationship between clusters for point-wise uncertainty assessment.
# CPS Analysis on validation of clustering result
data(YAN)
# Suppose you generate the visualization coordinates on your own
x1=matrix(seq(1,nrow(YAN),1),ncol=1)
x2=matrix(seq(1,nrow(YAN),1),ncol=1)
# Using nEXP=50 for illustration, usually use nEXP greater than 100
y=clustCPS(YAN[,1:100], k=7, l=FALSE, pre=FALSE, noi="after",vis=cbind(x1,x2), nEXP = 50)
# visualization of the results
mplot(y,4)
Output the Covering Point Set plot of the required cluster. The return of clustCPS, visCPS or CPS can be directly used as the input.
cplot(result, k)
result |
– the return from function clustCPS, visCPS or CPS. |
k |
– the index of the cluster whose covering point set plot you want to see. |
covering point set plot of the required cluster.
# CPS analysis on selection of visualization methods
data(vis_pollen)
c=visCPS(vis_pollen$vis, vis_pollen$ref)
# visualization of the results
mplot(c,2)
cplot(c,2)
Covering Point Set Analysis of given clustering results. It conducts alignment among different results and then calculates the covering point set. The return contains several statistics which can be directly used as input for mplot or cplot. By using this function you can design your own workflow instead of using clustCPS, see vignette for more details.
CPS(ref, vis, pert)
ref |
– the reference clustering result in a vector, the first cluster is labeled as 1. |
vis |
– the visualization coordinates in a numeric matrix of two columns. |
pert |
– a collection of clustering results in a matrix format, each column represents one clustering result. |
a list used for mplot or cplot, in which tight_all is the overall tightness, member is the matrix used for the membership heat map, set is the matrix for the covering point set plot, tight is the vector of cluster-wise tightness, vis is the visualization coordinates, ref is the reference labels and topo is the topological relationship between clusters for point-wise uncertainty assessment.
# CPS analysis on selection of visualization methods
data(vis_pollen)
k1=kmeans(vis_pollen$vis,max(vis_pollen$ref))$cluster
k2=kmeans(vis_pollen$vis,max(vis_pollen$ref))$cluster
k=cbind(as.matrix(k1,ncol=1),as.matrix(k2,ncol=1))
c=CPS(vis_pollen$ref, vis_pollen$vis, pert=k)
# visualization of the results
mplot(c,2)
cplot(c,2)
Generate multiple clustering results (that is, partitions) based on multiple versions of perturbed data using a specified baseline clustering method.
ensemble(data, nbs, clust_param, clustering = "kmeans", perturb_method = 1)
data |
– data that will be perturbed. |
nbs |
– the number of clustering partitions to be generated. |
clust_param |
– parameters for pre-defined clustering methods. If clustering is "kmeans", "Mclust", "hclust", this is an integer indicating the number of clusters. For "dbscan", a numeric indicating epsilon. For "HMM-VB", a list of parameters. |
clustering |
– baseline clustering methods. User specified functions or example methods included in package ("kmeans", "Mclust", "hclust", "dbscan", "HMM-VB") can be used. Refer to the Detail. |
perturb_method |
– adding noise is |
a matrix of cluster labels of the ensemble partitions. Each column is cluster labels of an individual clustering result.
data(sim1)
# the number of clusters.
C = 4
ens.data = ensemble(sim1$X[1:10,], nbs=10, clust_param=C, clustering="kmeans", perturb_method=1)
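The idea of building an ensemble from perturbed data can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper name `ensemble_sketch` is hypothetical, the baseline method is fixed to `kmeans`, and the perturbation is Gaussian noise with variance 0.01 times the average variable variance, as described for `perturb`.

```r
# Hypothetical helper, not part of OTclust: build an ensemble of k-means
# partitions from Gaussian-perturbed copies of the data.
ensemble_sketch <- function(data, nbs, k) {
  data <- as.matrix(data)
  # Noise standard deviation: sqrt(0.01 * average variance of the variables).
  noise_sd <- sqrt(0.01 * mean(apply(data, 2, var)))
  sapply(seq_len(nbs), function(i) {
    noisy <- data + matrix(rnorm(length(data), mean = 0, sd = noise_sd),
                           nrow = nrow(data))
    kmeans(noisy, centers = k)$cluster
  })
}

set.seed(1)
# Toy data: two well-separated groups of 20 points in two dimensions.
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))
ens <- ensemble_sketch(X, nbs = 5, k = 2)
dim(ens)  # one row per data point, one column per partition
```

Each column of `ens` has the same layout as the matrix returned by `ensemble`: the cluster labels of one individual clustering result.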
This function calculates the Jaccard similarity matrix between two partitions.
jaccard(x, y)
x, y |
– vectors of cluster labels. |
a matrix of Jaccard similarity between clusters in two partitions.
x=c(1,2,3)
y=c(3,2,1)
jaccard(x,y)
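To make the returned matrix concrete, the following base R sketch computes a cluster-by-cluster Jaccard similarity, assuming the usual definition: the size of the intersection of two clusters divided by the size of their union. The helper name `jaccard_sketch` is hypothetical, and the package's own `jaccard` may differ in details such as row and column ordering.

```r
# Hypothetical helper, not the package's jaccard(): pairwise Jaccard
# similarity between the clusters of two label vectors.
jaccard_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  counts <- table(x, y)        # counts[i, j] = points in cluster i of x and cluster j of y
  size_x <- rowSums(counts)    # cluster sizes in x
  size_y <- colSums(counts)    # cluster sizes in y
  # Union size = size of cluster i + size of cluster j - overlap.
  union_size <- outer(size_x, size_y, "+") - counts
  unclass(counts / union_size)
}

x <- c(1, 1, 2, 2, 3)
y <- c(1, 1, 1, 2, 2)
J <- jaccard_sketch(x, y)
J  # rows index clusters of x, columns index clusters of y
```

In this toy case, cluster 1 of `x` shares two points with cluster 1 of `y`, and their union has three points, so the first entry is 2/3.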
Output the membership heat map of the required cluster. The return of clustCPS, visCPS or CPS can be directly used as the input.
mplot(result, k)
result |
– the return from function clustCPS, visCPS or CPS. |
k |
– the index of the cluster whose membership heat map you want to see. |
membership heat map of the required cluster.
# CPS analysis on selection of visualization methods
data(vis_pollen)
c=visCPS(vis_pollen$vis, vis_pollen$ref)
# visualization of the results
mplot(c,2)
cplot(c,2)
This function calculates the mean partition of an ensemble of partitions by optimal transport alignment and uncertainty/stability measures.
otclust(ensemble, idx = NULL)
ensemble |
– a matrix of ensemble partition. Use |
idx |
– an integer indicating the index of reference partition in |
a list of alignment results.
idx |
the index of reference partition. |
avedist |
average distances between each partition and all ensemble partitions. |
meanpart |
a list of mean partition. |
distance |
Wasserstein distances between mean partition and the others. |
numcls |
the number of clusters for each partition. |
statistics |
average tightness ratio, average coverage ratio, 1-average jaccard distance. |
cap |
cluster alignment and points based (CAP) separability. |
id |
switched labels. |
cps |
covering point set. |
match |
topological relationship statistics between the reference partition and the others. |
Weight |
weight matrix. |
data(sim1)
# the number of clusters.
C = 4
ens.data = ensemble(sim1$X[1:100,], nbs=10, clust_param=C, clustering="kmeans", perturb_method=1)
# find mean partition and uncertainty statistics.
ota = otclust(ens.data)
This function plots a partition on a two-dimensional reduced space.
otplot( data, labels, convex.hull = F, title = "", xlab = "", ylab = "", legend.title = "", legend.labels = NULL, add.text = T )
data |
– coordinates matrix of data. |
labels |
– cluster labels in a vector, the first cluster is labeled as 1. |
convex.hull |
– logical. If it is |
title |
– title |
xlab |
– xlab |
ylab |
– ylab |
legend.title |
– legend title |
legend.labels |
– legend labels |
add.text |
– default True |
none
data(sim1)
# the number of clusters.
C = 4
ens.data = ensemble(sim1$X[1:50,], nbs=50, clust_param=C, clustering="kmeans", perturb_method=1)
# find mean partition and uncertainty statistics.
ota = otclust(ens.data)
# calculate baseline method for comparison (same 50 rows as the ensemble).
kcl = kmeans(sim1$X[1:50,],C)
# align clustering results for convenience of comparison.
compar = align(cbind(sim1$z[1:50],kcl$cluster,ota$meanpart))
lab.match = lapply(compar$weight,function(x) apply(x,2,which.max))
kcl.algnd = match(kcl$cluster,lab.match[[1]])
ota.algnd = match(ota$meanpart,lab.match[[2]])
# plot the result on two dimensional space.
otplot(sim1$X[1:50,],ota.algnd,convex.hull=FALSE,title='Mean partition') # mean partition by OTclust
Perturb data by adding Gaussian noise, bootstrap resampling or mix-up. Gaussian noise has mean 0 and variance 0.01*average variance of all variables. The mix-up lambda is 0.9.
perturb(data, method = 0)
data |
– data that will be perturbed. |
method |
– adding noise is |
the perturbed data.
data(vis_pollen)
perturb(as.matrix(vis_pollen$vis),method=0)
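The Gaussian perturbation described above can be sketched in base R as follows. This is an illustration rather than the package's code: the helper name `add_gaussian_noise` is hypothetical, and only the noise option is shown (bootstrap resampling and mix-up are omitted).

```r
# Hypothetical helper, not the package's perturb(): add zero-mean Gaussian
# noise whose variance is 0.01 times the average per-variable variance.
add_gaussian_noise <- function(data) {
  data <- as.matrix(data)
  avg_var <- mean(apply(data, 2, var))
  noise <- rnorm(length(data), mean = 0, sd = sqrt(0.01 * avg_var))
  data + matrix(noise, nrow = nrow(data), ncol = ncol(data))
}

set.seed(42)
X <- matrix(rnorm(200, sd = 3), ncol = 2)
X_noisy <- add_gaussian_noise(X)
# Ratio of the empirical noise variance to the average variable variance;
# it should be close to 0.01.
noise_ratio <- var(as.vector(X_noisy - X)) / mean(apply(X, 2, var))
noise_ratio
```

Because the noise scale is tied to the data's own variance, the perturbation stays small relative to the spread of the variables regardless of their units.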
Output both the numerical and graphical point-wise uncertainty assessment for individual points. The return of clustCPS, visCPS or CPS can be directly used as the input.
pplot(result, method = 0)
result |
– the return from function clustCPS, visCPS or CPS. |
method |
– method for calculating point-wise uncertainty. Using posterior probability matrix is |
a list, in which P is the posterior probability matrix that each sample belongs to the reference clusters, point_stab is the point-wise stability for each sample and v is the visualization of the point-wise stability.
# CPS analysis on selection of visualization methods
data(vis_pollen)
k1=kmeans(vis_pollen$vis,max(vis_pollen$ref))$cluster
k2=kmeans(vis_pollen$vis,max(vis_pollen$ref))$cluster
k=cbind(as.matrix(k1,ncol=1),as.matrix(k2,ncol=1))
c=CPS(vis_pollen$ref, vis_pollen$vis, pert=k)
# Point-wise Uncertainty Assessment
pplot(c)
Preprocessing for dimension reduction based on variance. It deletes every variable whose variance is smaller than 0.5 times the mean variance of all variables.
preprocess(data, l = TRUE, pre = TRUE)
data |
– data that needs to be processed |
l |
– logical. If TRUE, log-transformation will be carried out on the data. |
pre |
– logical. If TRUE, pre-dimension reduction will be carried out based on the variance. |
the processed data.
data(YAN)
preprocess(YAN,l=FALSE,pre=TRUE)
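The variance-based filtering rule can be sketched in base R. The helper name `variance_filter` is hypothetical and the sketch covers only the variance step, not the log-transformation, so it should be read as an illustration of the rule rather than the package's implementation.

```r
# Hypothetical helper, not the package's preprocess(): drop variables whose
# variance is smaller than 0.5 times the mean variance of all variables.
variance_filter <- function(data) {
  data <- as.matrix(data)
  v <- apply(data, 2, var)
  # Keep a column unless its variance falls below the threshold.
  data[, v >= 0.5 * mean(v), drop = FALSE]
}

set.seed(7)
# Two high-variance columns and one nearly constant column.
X <- cbind(rnorm(50, sd = 5), rnorm(50, sd = 5), rnorm(50, sd = 0.1))
X_kept <- variance_filter(X)
ncol(X_kept)  # the low-variance third column is removed
```

Using `drop = FALSE` keeps the result a matrix even when only one variable survives the filter.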
A dataset containing 5000 samples and 2 features.
sim1
A matrix with 5000 rows and 2 variables
A dataset containing the visualization coordinates and the true cluster labels of 301 cells.
vis_pollen
A list containing two components.
vis |
visualization coordinates of cells. |
ref |
true labels of cells. |
https://www.nature.com/articles/nbt.2967
Covering Point Set Analysis on the visualization results. Use K-Nearest Neighbor to generate a collection of results for CPS Analysis. The return contains several statistics which can be directly used as input for mplot or cplot.
visCPS(vlab, ref, nEXP = 100)
vlab |
– the coordinates generated by one visualization method in a numeric matrix of two columns. |
ref |
– the true labels in a vector format, the first cluster is labeled as 1. |
nEXP |
– number of perturbed results for CPS Analysis. |
a list used for mplot or cplot, in which tight_all is the overall tightness, member is the matrix used for the membership heat map, set is the matrix for the covering point set plot, tight is the vector of cluster-wise tightness, vis is the visualization coordinates, ref is the reference labels and topo is the topological relationship between clusters for point-wise uncertainty assessment.
# CPS analysis on selection of visualization methods
data(vis_pollen)
c=visCPS(vis_pollen$vis, vis_pollen$ref)
# visualization of the results
mplot(c,2)
cplot(c,2)
This function calculates the Wasserstein distance between two partitions.
wassDist(x, y)
x, y |
– vectors of cluster labels. |
a distance between 0 and 1.
x=c(1,2,3)
y=c(3,2,1)
wassDist(x,y)
A dataset containing 124 cells with their 3840 genes.
YAN
A matrix with 124 rows and 3840 variables
https://www.nature.com/articles/nsmb.2660