Record Class CentroidClustering<T,U>
- Type Parameters:
T
- the type of centroids.
U
- the type of observations. Usually, T and U are the same, but in the case of SIB they are different.
- Record Components:
name
- the clustering algorithm name.
centers
- the cluster centroids or medoids.
distance
- the distance function.
group
- the cluster labels of data.
proximity
- the squared distance between data points and their respective cluster centers.
size
- the number of data points in each cluster.
distortions
- the average squared distance of data points within each cluster.
- All Implemented Interfaces:
Serializable, Comparable<CentroidClustering<T,U>>
Variations of k-means include restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (k-means++) or allowing a fuzzy cluster assignment (fuzzy c-means), etc.
Most k-means-type algorithms require the number of clusters to be specified in advance, which is considered to be one of the biggest drawbacks of these algorithms. Furthermore, the algorithms prefer clusters of approximately similar size, as they will always assign an object to the nearest centroid. This often leads to incorrectly cut borders of clusters (which is not surprising since the algorithm optimizes cluster centers, not cluster borders).
-
Constructor Summary
Constructors
CentroidClustering(String name, T[] centers, ToDoubleBiFunction<T, U> distance, int[] group, double[] proximity)
- Constructor.
CentroidClustering(String name, T[] centers, ToDoubleBiFunction<T, U> distance, int[] group, double[] proximity, int[] size, double[] distortions)
- Creates an instance of a CentroidClustering record class.
-
Method Summary
T center(int i)
- Returns the center of the i-th cluster.
T[] centers()
- Returns the value of the centers record component.
int compareTo(CentroidClustering<T,U> o)
ToDoubleBiFunction<T,U> distance()
- Returns the value of the distance record component.
double distortion()
- Returns the average squared distance between data points and their respective cluster centers.
double[] distortions()
- Returns the value of the distortions record component.
final boolean equals(Object o)
- Indicates whether some other object is "equal to" this one.
int[] group()
- Returns the value of the group record component.
int group(int i)
- Returns the cluster label of the i-th data point.
final int hashCode()
- Returns a hash code value for this object.
static <T> CentroidClustering<T,T> init(String name, T[] data, int k, ToDoubleBiFunction<T, T> distance)
- Returns a random clustering based on the K-Means++ algorithm.
int k()
- Returns the number of clusters.
String name()
- Returns the value of the name record component.
int predict(U x)
- Classifies a new observation.
double[] proximity()
- Returns the value of the proximity record component.
double proximity(int i)
- Returns the distance of the i-th data point to its cluster center.
double radius(int i)
- Returns the radius of the i-th cluster.
static double[][] seeds(double[][] data, int k)
- Selects random samples as seeds for various algorithms.
int[] size()
- Returns the value of the size record component.
int size(int i)
- Returns the size of the i-th cluster.
final String toString()
- Returns a string representation of this record class.
-
Constructor Details
-
CentroidClustering
public CentroidClustering(String name, T[] centers, ToDoubleBiFunction<T, U> distance, int[] group, double[] proximity)
Constructor.
- Parameters:
name
- the clustering algorithm name.
centers
- the cluster centroids or medoids.
distance
- the distance function.
group
- the cluster labels of data.
proximity
- the squared distance of each data point to its nearest cluster center.
-
CentroidClustering
public CentroidClustering(String name, T[] centers, ToDoubleBiFunction<T, U> distance, int[] group, double[] proximity, int[] size, double[] distortions)
Creates an instance of a CentroidClustering record class.
- Parameters:
name
- the value for the name record component
centers
- the value for the centers record component
distance
- the value for the distance record component
group
- the value for the group record component
proximity
- the value for the proximity record component
size
- the value for the size record component
distortions
- the value for the distortions record component
-
-
Method Details
-
k
public int k()
Returns the number of clusters.
- Returns:
- the number of clusters.
-
distortion
public double distortion()
Returns the average squared distance between data points and their respective cluster centers. This is the within-cluster sum of squares (WCSS) divided by the number of data points.
- Returns:
- the distortion.
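As a rough illustration of the average described above, the following sketch computes it from an array of per-point squared distances such as the proximity component. The class and method names (`DistortionSketch`, `averageSquaredDistance`) are hypothetical and not part of this API, and the sketch assumes the input values are already squared distances.

```java
// Hypothetical helper for illustration only; not part of CentroidClustering.
final class DistortionSketch {
    /**
     * Averages per-point squared distances to their assigned cluster centers.
     * Assumption: each entry of proximity is already a squared distance.
     */
    static double averageSquaredDistance(double[] proximity) {
        if (proximity.length == 0) {
            throw new IllegalArgumentException("proximity must not be empty");
        }
        double sum = 0.0; // sum of squared distances over all data points
        for (double d2 : proximity) {
            sum += d2;
        }
        return sum / proximity.length; // normalize by the number of data points
    }

    public static void main(String[] args) {
        double[] proximity = {1.0, 4.0, 0.0, 3.0};
        System.out.println(averageSquaredDistance(proximity));
    }
}
```

Dividing by the number of points makes values comparable across data sets of different sizes, whereas the raw sum grows with the number of observations.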
-
compareTo
- Specified by:
compareTo in interface Comparable<T>
-
toString
Returns a string representation of this record class. The representation contains the name of the class, followed by the name and value of each of the record components.
-
center
Returns the center of the i-th cluster.
- Parameters:
i
- the index of the cluster.
- Returns:
- the cluster center.
-
group
public int group(int i)
Returns the cluster label of the i-th data point.
- Parameters:
i
- the index of the data point.
- Returns:
- the cluster label.
-
proximity
public double proximity(int i)
Returns the distance of the i-th data point to its cluster center.
- Parameters:
i
- the index of the data point.
- Returns:
- the distance to the cluster center.
-
size
public int size(int i)
Returns the size of the i-th cluster.
- Parameters:
i
- the index of the cluster.
- Returns:
- the cluster size.
-
radius
public double radius(int i)
Returns the radius of the i-th cluster.
- Parameters:
i
- the index of the cluster.
- Returns:
- the cluster radius.
-
predict
Classifies a new observation.
- Parameters:
x
- a new observation.
- Returns:
- the cluster label.
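This page does not say how the label is chosen. One plausible reading, consistent with the class description's remark that objects are assigned to the nearest centroid, is an arg-min over the centers. The sketch below shows that idea under that assumption; `NearestCenterSketch`, `nearest`, and `squaredDistance` are hypothetical names, not part of this API.

```java
import java.util.function.ToDoubleBiFunction;

// Hypothetical sketch of nearest-center assignment; not the library's code.
final class NearestCenterSketch {
    /**
     * Returns the index of the center with the smallest distance to x,
     * or -1 if there are no centers. Ties go to the lowest index.
     */
    static <T, U> int nearest(T[] centers, ToDoubleBiFunction<T, U> distance, U x) {
        int label = -1;
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double d = distance.applyAsDouble(centers[i], x);
            if (d < best) {
                best = d;
                label = i;
            }
        }
        return label;
    }

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        ToDoubleBiFunction<double[], double[]> dist = NearestCenterSketch::squaredDistance;
        System.out.println(nearest(centers, dist, new double[]{9.0, 8.0}));
    }
}
```

The distance function takes the center first and the observation second, mirroring the `ToDoubleBiFunction<T, U>` component, so centers and observations may have different types.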
-
init
public static <T> CentroidClustering<T,T> init(String name, T[] data, int k, ToDoubleBiFunction<T, T> distance)
Returns a random clustering based on the K-Means++ algorithm. Many clustering methods, e.g. k-means, need an initial clustering configuration as a seed.
K-Means++ is based on the intuition of spreading the k initial cluster centers away from each other. The first cluster center is chosen uniformly at random from the data points that are being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to the squared distance from that point to its closest already-chosen center.
The exact algorithm is as follows:
1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.
4. Repeat Steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
References:
- D. Arthur and S. Vassilvitskii. "K-means++: the advantages of careful seeding". ACM-SIAM Symposium on Discrete Algorithms, 1027-1035, 2007.
- A. D. Peterson, A. P. Ghosh and R. Maitra. "A systematic evaluation of different methods for initializing the K-means clustering algorithm". 2010.
- Type Parameters:
T
- the type of input object.
- Parameters:
name
- the clustering algorithm name.
data
- data objects array of size n.
k
- the number of medoids.
distance
- the distance function.
- Returns:
- the initial clustering.
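The seeding steps listed for this method can be sketched in plain Java as follows. This is a minimal illustration, not the library's implementation: `KMeansPlusPlusSketch` and `seedIndices` are hypothetical names, the explicit `Random` argument is an assumption made for reproducibility, and the sketch assumes `distance` returns a plain (unsquared) distance, which it then squares. It returns the indices of the chosen centers rather than a full clustering.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.ToDoubleBiFunction;

// Hypothetical sketch of K-Means++ seeding; not part of CentroidClustering.
final class KMeansPlusPlusSketch {
    /** Returns the indices of k data points chosen as initial centers. */
    static <T> int[] seedIndices(T[] data, int k, ToDoubleBiFunction<T, T> distance, Random rng) {
        int n = data.length;
        if (k < 1 || k > n) {
            throw new IllegalArgumentException("k must be in [1, n]");
        }
        int[] chosen = new int[k];
        double[] d2 = new double[n]; // D(x)^2: squared distance to nearest chosen center
        Arrays.fill(d2, Double.POSITIVE_INFINITY);

        // Step 1: the first center is chosen uniformly at random.
        chosen[0] = rng.nextInt(n);

        for (int c = 1; c < k; c++) {
            // Step 2: refresh D(x)^2 against the most recently chosen center.
            T last = data[chosen[c - 1]];
            double total = 0.0;
            for (int i = 0; i < n; i++) {
                double d = distance.applyAsDouble(data[i], last);
                d2[i] = Math.min(d2[i], d * d);
                total += d2[i];
            }
            if (total == 0.0) {
                // Every point coincides with a chosen center; fall back to uniform.
                chosen[c] = rng.nextInt(n);
                continue;
            }
            // Step 3: sample an index with probability proportional to D(x)^2.
            double r = rng.nextDouble() * total;
            double cumulative = 0.0;
            int next = -1;
            for (int i = 0; i < n; i++) {
                if (d2[i] > 0.0) {
                    next = i; // also serves as a fallback if rounding prevents a break
                    cumulative += d2[i];
                    if (r < cumulative) {
                        break;
                    }
                }
            }
            chosen[c] = next;
        }
        return chosen; // Step 4 is the loop above; Step 5 (k-means itself) is out of scope
    }
}
```

Because an already chosen point has D(x)^2 equal to zero, it cannot be drawn again unless all remaining points are duplicates of chosen centers, so the selected indices are distinct for data without duplicates.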
-
seeds
public static double[][] seeds(double[][] data, int k)
Selects random samples as seeds for various algorithms.
- Parameters:
data
- samples to select seeds from.
k
- the number of seeds.
- Returns:
- the seeds.
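The page does not specify how the samples are drawn. As one possible interpretation, the sketch below picks k distinct rows uniformly at random with a partial Fisher-Yates shuffle. Sampling without replacement, the `Random` parameter, and the names `SeedsSketch` and `randomSeeds` are all assumptions made for illustration, not documented behavior of this method.

```java
import java.util.Random;

// Hypothetical sketch of random seed selection; not the library's code.
final class SeedsSketch {
    /** Returns copies of k distinct rows of data, chosen uniformly at random. */
    static double[][] randomSeeds(double[][] data, int k, Random rng) {
        int n = data.length;
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("k must be in [0, n]");
        }
        int[] index = new int[n];
        for (int i = 0; i < n; i++) {
            index[i] = i;
        }
        double[][] seeds = new double[k][];
        for (int i = 0; i < k; i++) {
            // Partial Fisher-Yates: swap a random not-yet-used index into slot i.
            int j = i + rng.nextInt(n - i);
            int tmp = index[i];
            index[i] = index[j];
            index[j] = tmp;
            seeds[i] = data[index[i]].clone(); // copy so callers can mutate seeds safely
        }
        return seeds;
    }
}
```

Cloning each row keeps later updates to the seeds from altering the input samples; whether the actual method copies or shares rows is not stated here.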
-
hashCode
public final int hashCode()
Returns a hash code value for this object. The value is derived from the hash code of each of the record components.
-
equals
Indicates whether some other object is "equal to" this one. The objects are equal if the other object is of the same class and if all the record components are equal. All components in this record class are compared with Objects::equals(Object,Object).
-
name
Returns the value of the name record component.
- Returns:
- the value of the name record component
-
centers
Returns the value of the centers record component.
- Returns:
- the value of the centers record component
-
distance
Returns the value of the distance record component.
- Returns:
- the value of the distance record component
-
group
public int[] group()
Returns the value of the group record component.
- Returns:
- the value of the group record component
-
proximity
public double[] proximity()
Returns the value of the proximity record component.
- Returns:
- the value of the proximity record component
-
size
public int[] size()
Returns the value of the size record component.
- Returns:
- the value of the size record component
-
distortions
public double[] distortions()
Returns the value of the distortions record component.
- Returns:
- the value of the distortions record component
-