Class MEC<T>
- Type Parameters:
T
- the data type of model input objects.
- All Implemented Interfaces:
Serializable, Comparable<MEC<T>>
The clustering criterion of minimum entropy clustering (MEC) is based on the conditional entropy H(C | x), where C is the cluster label and x is an observation. According to Fano's inequality, we can estimate C with a low probability of error only if the conditional entropy H(C | X) is small. MEC also generalizes the criterion by replacing Shannon's entropy with Havrda-Charvat's structural α-entropy. Interestingly, the minimum entropy criterion based on structural α-entropy is equal to the probability of error of the nearest neighbor method when α = 2. To estimate p(C | x), MEC employs Parzen density estimation, a nonparametric approach.
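As a rough illustration of the two entropy measures named above, the following self-contained Java sketch evaluates Shannon's entropy and the Havrda-Charvat structural α-entropy of a discrete posterior p(C | x). The class and method names are hypothetical and not part of Smile. The α-entropy is assumed to follow the usual definition (2^(1-α) - 1)^(-1) (Σ p^α - 1), which for α = 2 reduces to 2(1 - Σ p²).

```java
final class AlphaEntropySketch {
    /** Shannon entropy, in bits, of a discrete distribution p. */
    static double shannon(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0.0) {
                h -= pi * Math.log(pi) / Math.log(2.0);
            }
        }
        return h;
    }

    /**
     * Havrda-Charvat structural alpha-entropy (assumed definition),
     * valid for alpha > 0 and alpha != 1.
     */
    static double havrdaCharvat(double[] p, double alpha) {
        double sum = 0.0;
        for (double pi : p) {
            if (pi > 0.0) {
                sum += Math.pow(pi, alpha);
            }
        }
        return (sum - 1.0) / (Math.pow(2.0, 1.0 - alpha) - 1.0);
    }

    public static void main(String[] args) {
        double[] confident = {0.9, 0.1}; // the cluster label is nearly certain
        double[] uncertain = {0.5, 0.5}; // the cluster label is a coin flip
        System.out.println("Shannon:   " + shannon(confident) + " vs " + shannon(uncertain));
        System.out.println("alpha = 2: " + havrdaCharvat(confident, 2.0)
                + " vs " + havrdaCharvat(uncertain, 2.0));
    }
}
```

Both measures are zero when the posterior puts all mass on one cluster and grow as the posterior flattens, which is why minimizing either one favors confident assignments.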
MEC is an iterative algorithm that starts from an initial partition produced by another clustering method, e.g. k-means, CLARANS, or hierarchical clustering. Note that a random initialization is NOT appropriate.
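The iterative idea can be sketched on toy data as follows. This is a brute-force illustration under stated assumptions, not Smile's implementation: p(C | x) is approximated by label frequencies among the points within a fixed radius (a uniform-kernel stand-in for the Parzen estimate), and each sweep moves one point at a time to the label that lowers the average entropy most. All class and method names are hypothetical.

```java
import java.util.Arrays;

final class MinEntropySketch {
    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Average Shannon entropy (nats) of the labels inside each point's radius-neighborhood. */
    static double conditionalEntropy(double[][] x, int[] y, int k, double radius) {
        double total = 0.0;
        for (double[] xi : x) {
            int[] count = new int[k];
            int n = 0; // at least 1, because xi is its own neighbor
            for (int j = 0; j < x.length; j++) {
                if (distance(xi, x[j]) <= radius) {
                    count[y[j]]++;
                    n++;
                }
            }
            for (int c : count) {
                if (c > 0) {
                    double p = (double) c / n;
                    total -= p * Math.log(p);
                }
            }
        }
        return total / x.length;
    }

    /** Greedy refinement of an initial partition; stops when a sweep gains at most tol. */
    static int[] refine(double[][] x, int[] init, int k, double radius, double tol) {
        int[] y = init.clone();
        double entropy = conditionalEntropy(x, y, k, radius);
        while (true) {
            for (int i = 0; i < x.length; i++) {
                int best = y[i];
                double bestH = conditionalEntropy(x, y, k, radius);
                for (int c = 0; c < k; c++) {
                    y[i] = c; // tentatively relabel point i
                    double h = conditionalEntropy(x, y, k, radius);
                    if (h < bestH - 1e-12) {
                        bestH = h;
                        best = c;
                    }
                }
                y[i] = best; // keep the label with the lowest entropy
            }
            double next = conditionalEntropy(x, y, k, radius);
            if (entropy - next <= tol) {
                return y; // converged
            }
            entropy = next;
        }
    }

    public static void main(String[] args) {
        double[][] x = {
            {0, 0}, {0, 1}, {1, 0}, {1, 1},
            {10, 10}, {10, 11}, {11, 10}, {11, 11}
        };
        int[] init = {0, 0, 0, 1, 1, 1, 1, 1}; // point (1, 1) starts in the wrong cluster
        System.out.println(Arrays.toString(refine(x, init, 2, 2.0, 1e-4)));
        // prints [0, 0, 0, 0, 1, 1, 1, 1]
    }
}
```

Because a single all-encompassing cluster also has zero entropy, a criterion of this kind can only refine a sensible starting partition, which is consistent with the warning against random initialization.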
References
- Haifeng Li, Keshu Zhang, and Tao Jiang. Minimum Entropy Clustering and Applications to Gene Expression Analysis. CSB, 2004.
-
Field Summary
Modifier and Type | Field | Description
final double | entropy | The conditional entropy as the objective function.
final double | radius | The range of neighborhood.
Fields inherited from class smile.clustering.PartitionClustering
k, OUTLIER, size, y
-
Constructor Summary
-
Method Summary
Methods inherited from class smile.clustering.PartitionClustering
run, seed
-
Field Details
-
entropy
public final double entropy
The conditional entropy as the objective function.
-
radius
public final double radius
The range of neighborhood.
-
-
Constructor Details
-
MEC
Constructor.
- Parameters:
entropy
- the conditional entropy of clusters.
radius
- the neighborhood radius.
nns
- the data structure for neighborhood search.
k
- the number of clusters.
y
- the cluster labels.
-
-
Method Details
-
compareTo
- Specified by:
compareTo
in interface Comparable<T>
-
fit
Clustering the data.
- Type Parameters:
T
- the data type.
- Parameters:
data
- the observations.
distance
- the distance function.
k
- the number of clusters. Note that this is just a hint. The final number of clusters may be less.
radius
- the neighborhood radius.
- Returns:
- the model.
-
fit
public static <T> MEC<T> fit(T[] data, RNNSearch<T, T> nns, int k, double radius, int[] y, double tol)
Clustering the data.
- Type Parameters:
T
- the data type.
- Parameters:
data
- the observations.
nns
- the neighborhood search data structure.
k
- the number of clusters. Note that this is just a hint. The final number of clusters may be less.
radius
- the neighborhood radius.
y
- the initial clustering labels, which could be produced by any other clustering method.
tol
- the tolerance of convergence test.
- Returns:
- the model.
-
predict
Cluster a new instance.
- Parameters:
x
- a new instance.
- Returns:
- the cluster label. Note that it may be PartitionClustering.OUTLIER.
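One plausible way such a prediction could work is a majority vote among the training points that lie within the neighborhood radius, returning an outlier marker when no training point is close enough. The sketch below illustrates that idea only; it is not taken from Smile's source. The class, the method signature, and the value chosen for the `OUTLIER` stand-in are all assumptions.

```java
final class RadiusVoteSketch {
    /** Hypothetical stand-in for PartitionClustering.OUTLIER; the value is an assumption. */
    static final int OUTLIER = Integer.MAX_VALUE;

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Returns the most frequent label within radius of x, or OUTLIER if x has no neighbors. */
    static int predict(double[][] train, int[] labels, int k, double radius, double[] x) {
        int[] votes = new int[k];
        int found = 0;
        for (int i = 0; i < train.length; i++) {
            if (distance(train[i], x) <= radius) {
                votes[labels[i]]++;
                found++;
            }
        }
        if (found == 0) {
            return OUTLIER; // no training point is close enough to vote
        }
        int best = 0;
        for (int c = 1; c < k; c++) {
            if (votes[c] > votes[best]) {
                best = c;
            }
        }
        return best;
    }
}
```

Callers of a method like this should compare the result against the outlier constant before using it as an array index.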
-
toString
- Overrides:
toString
in class PartitionClustering
-