Class SIB
All Implemented Interfaces:
Serializable, Comparable&lt;CentroidClustering&lt;double[], SparseArray&gt;&gt;
In analogy to K-Means, SIB's update formulas are essentially the same as those of the EM algorithm for estimating a finite Gaussian mixture model, with the regular Euclidean distance replaced by the Kullback-Leibler divergence, which is clearly a better dissimilarity measure for co-occurrence data. However, the common batch update rule of K-Means (assigning all instances to their nearest centroids, then updating the centroids) does not work for SIB, which must proceed sequentially: each instance is reassigned (if a better cluster is found) and the affected centroids are updated immediately. A likely reason is that the K-L divergence is very sensitive, so under a batch update rule the centroids may change significantly in each iteration.
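As an illustrative sketch (not Smile's internal code; the class and method names here are hypothetical), the Kullback-Leibler divergence that replaces the Euclidean distance can be computed for two probability vectors as follows:

```java
// Illustrative sketch, not Smile's implementation.
// Kullback-Leibler divergence D(p || q) = sum_i p_i * log(p_i / q_i)
// for probability vectors p and q (entries nonnegative, summing to 1).
// Terms with p_i == 0 contribute 0 by the usual convention.
public class KL {
    public static double kl(double[] p, double[] q) {
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) {
                d += p[i] * Math.log(p[i] / q[i]);
            }
        }
        return d;
    }
}
```

Note that KL is asymmetric and is zero only when the two distributions coincide, which is why it serves as a dissimilarity measure rather than a true metric.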
Note that this implementation differs slightly from the original paper, in which a weighted Jensen-Shannon divergence is the criterion for assigning a randomly picked sample to a different cluster. In our experience this does not work well in some cases, probably because the weighted JS divergence gives too much weight to clusters that are much larger than a single sample. This implementation therefore uses the regular, unweighted Jensen-Shannon divergence instead.
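The unweighted Jensen-Shannon divergence used as the reassignment criterion can be sketched as below; this is an illustration under the standard definition, not Smile's internal code, and the names are hypothetical:

```java
// Illustrative sketch, not Smile's implementation.
// Unweighted Jensen-Shannon divergence:
//   JS(p, q) = 0.5 * D(p || m) + 0.5 * D(q || m),  m = (p + q) / 2,
// where D is the Kullback-Leibler divergence. JS is symmetric and
// bounded above by log(2) (using the natural logarithm).
public class JS {
    static double kl(double[] p, double[] q) {
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) d += p[i] * Math.log(p[i] / q[i]);
        }
        return d;
    }

    public static double js(double[] p, double[] q) {
        double[] m = new double[p.length];
        for (int i = 0; i < p.length; i++) {
            m[i] = 0.5 * (p[i] + q[i]);
        }
        return 0.5 * kl(p, m) + 0.5 * kl(q, m);
    }
}
```

Unlike the weighted variant in the original paper, this form treats the sample and the cluster centroid symmetrically, regardless of cluster size.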
References
- N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. 1999.
- N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. ACM SIGIR, 2002.
- Jaakko Peltonen, Janne Sinkkonen, and Samuel Kaski. Sequential information bottleneck for finite data. ICML, 2004.
Field Summary
Fields inherited from class smile.clustering.CentroidClustering
centroids, distortion
Fields inherited from class smile.clustering.PartitionClustering
k, OUTLIER, size, y
Constructor Summary
SIB(double distortion, double[][] centroids, int[] y)
    Constructor.
Method Summary
protected double distance(double[] x, SparseArray y)
    The distance function.
static SIB fit(SparseArray[] data, int k)
    Clustering data into k clusters up to 100 iterations.
static SIB fit(SparseArray[] data, int k, int maxIter)
    Clustering data into k clusters.
Methods inherited from class smile.clustering.CentroidClustering
compareTo, predict, toString
Methods inherited from class smile.clustering.PartitionClustering
run, seed
Constructor Details

SIB
public SIB(double distortion, double[][] centroids, int[] y)
Constructor.
Parameters:
distortion - the total distortion.
centroids - the centroids of each cluster.
y - the cluster labels.
Method Details

distance
protected double distance(double[] x, SparseArray y)
Description copied from class: CentroidClustering
The distance function.
Specified by:
distance in class CentroidClustering&lt;double[], SparseArray&gt;
Parameters:
x - an observation.
y - the other observation.
Returns:
the distance.
fit
static SIB fit(SparseArray[] data, int k)
Clustering data into k clusters up to 100 iterations.
Parameters:
data - the sparse normalized co-occurrence dataset, in which each row is an observation whose entries sum to 1.
k - the number of clusters.
Returns:
the model.
fit
static SIB fit(SparseArray[] data, int k, int maxIter)
Clustering data into k clusters.
Parameters:
data - the sparse normalized co-occurrence dataset, in which each row is an observation whose entries sum to 1.
k - the number of clusters.
maxIter - the maximum number of iterations.
Returns:
the model.