Record Class CentroidClustering<T,U>

java.lang.Object
java.lang.Record
smile.clustering.CentroidClustering<T,U>
Type Parameters:
T - the type of centroids.
U - the type of observations. Usually, T and U are the same, but for SIB (sequential information bottleneck) they differ.
Record Components:
name - the clustering algorithm name.
centers - the cluster centroids or medoids.
distance - the distance function.
group - the cluster labels of data.
proximity - the squared distance between data points and their respective cluster centers.
size - the number of data points in each cluster.
distortions - the average squared distance of data points within each cluster.
All Implemented Interfaces:
Serializable, Comparable<CentroidClustering<T,U>>

public record CentroidClustering<T,U>(String name, T[] centers, ToDoubleBiFunction<T,U> distance, int[] group, double[] proximity, int[] size, double[] distortions) extends Record implements Comparable<CentroidClustering<T,U>>, Serializable
Centroid-based clustering represents each cluster by its center and groups similar data points around it. A cluster center is not necessarily a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign each object to the nearest cluster center, such that the sum of squared distances from the objects to their assigned centers is minimized.

Variations of k-means include restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (k-means++) or allowing a fuzzy cluster assignment (fuzzy c-means), etc.

Most k-means-type algorithms require the number of clusters to be specified in advance, which is considered to be one of the biggest drawbacks of these algorithms. Furthermore, the algorithms prefer clusters of approximately similar size, as they will always assign an object to the nearest centroid. This often leads to incorrectly cut borders of clusters (which is not surprising since the algorithm optimizes cluster centers, not cluster borders).
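
As a rough illustration of the objective described above, the following sketch assigns each point to its nearest center under squared Euclidean distance and sums the resulting squared distances. The class and method names (`NearestCenterSketch`, `assign`, `sumOfSquares`) are hypothetical and not part of this library; the choice of squared Euclidean distance is an assumption made for the example.

```java
/** Hypothetical sketch of the k-means objective; not part of the library. */
final class NearestCenterSketch {
    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns, for each data point, the index of its nearest center. */
    static int[] assign(double[][] data, double[][] centers) {
        int[] group = new int[data.length];
        for (int i = 0; i < data.length; i++) {
            double best = Double.POSITIVE_INFINITY;
            for (int j = 0; j < centers.length; j++) {
                double d = squaredDistance(data[i], centers[j]);
                if (d < best) {
                    best = d;
                    group[i] = j;
                }
            }
        }
        return group;
    }

    /** Total squared distance of the points to their assigned centers. */
    static double sumOfSquares(double[][] data, double[][] centers, int[] group) {
        double total = 0.0;
        for (int i = 0; i < data.length; i++) {
            total += squaredDistance(data[i], centers[group[i]]);
        }
        return total;
    }
}
```

Because every point goes to its nearest center regardless of how large or dense a cluster is, the boundary between two clusters always lies halfway between their centers, which is the border-cutting behavior noted above.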

  • Constructor Summary

    Constructors
    Constructor
    Description
    CentroidClustering(String name, T[] centers, ToDoubleBiFunction<T,U> distance, int[] group, double[] proximity)
    Constructor.
    CentroidClustering(String name, T[] centers, ToDoubleBiFunction<T,U> distance, int[] group, double[] proximity, int[] size, double[] distortions)
    Creates an instance of a CentroidClustering record class.
  • Method Summary

    Modifier and Type
    Method
    Description
    T
    center(int i)
    Returns the center of the i-th cluster.
    T[]
    centers()
    Returns the value of the centers record component.
    int
    compareTo(CentroidClustering<T,U> o)
     
    ToDoubleBiFunction<T,U>
    distance()
    Returns the value of the distance record component.
    double
    distortion()
    Returns the average squared distance between data points and their respective cluster centers.
    double[]
    distortions()
    Returns the value of the distortions record component.
    final boolean
    equals(Object o)
    Indicates whether some other object is "equal to" this one.
    int[]
    group()
    Returns the value of the group record component.
    int
    group(int i)
    Returns the cluster label of the i-th data point.
    final int
    hashCode()
    Returns a hash code value for this object.
    static <T> CentroidClustering<T,T>
    init(String name, T[] data, int k, ToDoubleBiFunction<T,T> distance)
    Returns a random clustering based on the K-Means++ algorithm.
    int
    k()
    Returns the number of clusters.
    String
    name()
    Returns the value of the name record component.
    int
    predict(U x)
    Classifies a new observation.
    double[]
    proximity()
    Returns the value of the proximity record component.
    double
    proximity(int i)
    Returns the squared distance of the i-th data point to its cluster center.
    double
    radius(int i)
    Returns the radius of the i-th cluster.
    static double[][]
    seeds(double[][] data, int k)
    Selects random samples as seeds for various algorithms.
    int[]
    size()
    Returns the value of the size record component.
    int
    size(int i)
    Returns the size of the i-th cluster.
    String
    toString()
    Returns a string representation of this record class.

    Methods inherited from class java.lang.Object

    clone, finalize, getClass, notify, notifyAll, wait, wait, wait
  • Constructor Details

    • CentroidClustering

      public CentroidClustering(String name, T[] centers, ToDoubleBiFunction<T,U> distance, int[] group, double[] proximity)
      Constructor.
      Parameters:
      name - the clustering algorithm name.
      centers - the cluster centroids or medoids.
      distance - the distance function.
      group - the cluster labels of data.
      proximity - the squared distance of each data point to its nearest cluster center.
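
This constructor omits `size` and `distortions`, which the record component descriptions define as the number of points per cluster and the average squared distance within each cluster. The sketch below shows one way such values could be derived from `group` and `proximity`. It is an illustration under that reading, not the library's own code; `ClusterStatsSketch` and its methods are hypothetical names, and the arrays here have exactly `k` entries, which may differ from the layout the library uses.

```java
/** Hypothetical sketch: per-cluster statistics from labels and proximities. */
final class ClusterStatsSketch {
    /** Counts the data points assigned to each of the k clusters. */
    static int[] sizes(int[] group, int k) {
        int[] size = new int[k];
        for (int label : group) {
            size[label]++;
        }
        return size;
    }

    /** Averages the squared distances (proximity) within each cluster. */
    static double[] distortions(int[] group, double[] proximity, int k) {
        int[] size = sizes(group, k);
        double[] result = new double[k];
        for (int i = 0; i < group.length; i++) {
            result[group[i]] += proximity[i];
        }
        for (int j = 0; j < k; j++) {
            if (size[j] > 0) {          // leave empty clusters at 0.0
                result[j] /= size[j];
            }
        }
        return result;
    }
}
```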
    • CentroidClustering

      public CentroidClustering(String name, T[] centers, ToDoubleBiFunction<T,U> distance, int[] group, double[] proximity, int[] size, double[] distortions)
      Creates an instance of a CentroidClustering record class.
      Parameters:
      name - the value for the name record component
      centers - the value for the centers record component
      distance - the value for the distance record component
      group - the value for the group record component
      proximity - the value for the proximity record component
      size - the value for the size record component
      distortions - the value for the distortions record component
  • Method Details

    • k

      public int k()
      Returns the number of clusters.
      Returns:
      the number of clusters.
    • distortion

      public double distortion()
      Returns the average squared distance between data points and their respective cluster centers. This is the within-cluster sum-of-squares (WCSS) divided by the number of data points.
      Returns:
      the distortion.
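
To make the relationship between an average squared distance and a sum of squares concrete, the following sketch computes both from an array of squared distances. `DistortionSketch` is a hypothetical helper and does not reproduce how the library computes its return value.

```java
/** Hypothetical sketch: sum versus average of squared distances. */
final class DistortionSketch {
    /** Within-cluster sum of squares: the total of all squared distances. */
    static double wcss(double[] proximity) {
        double total = 0.0;
        for (double p : proximity) {
            total += p;
        }
        return total;
    }

    /** Average squared distance: the sum divided by the number of points. */
    static double averageDistortion(double[] proximity) {
        return wcss(proximity) / proximity.length;
    }
}
```

The two quantities differ only by the factor n, so on a fixed data set they rank clusterings identically.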
    • compareTo

      public int compareTo(CentroidClustering<T,U> o)
      Specified by:
      compareTo in interface Comparable<CentroidClustering<T,U>>
    • toString

      public String toString()
      Returns a string representation of this record class. The representation contains the name of the class, followed by the name and value of each of the record components.
      Specified by:
      toString in class Record
      Returns:
      a string representation of this object
    • center

      public T center(int i)
      Returns the center of the i-th cluster.
      Parameters:
      i - the index of the cluster.
      Returns:
      the cluster center.
    • group

      public int group(int i)
      Returns the cluster label of the i-th data point.
      Parameters:
      i - the index of the data point.
      Returns:
      the cluster label.
    • proximity

      public double proximity(int i)
      Returns the squared distance of the i-th data point to its cluster center.
      Parameters:
      i - the index of the data point.
      Returns:
      the squared distance to the cluster center.
    • size

      public int size(int i)
      Returns the size of the i-th cluster.
      Parameters:
      i - the index of the cluster.
      Returns:
      the cluster size.
    • radius

      public double radius(int i)
      Returns the radius of the i-th cluster.
      Parameters:
      i - the index of the cluster.
      Returns:
      the cluster radius.
    • predict

      public int predict(U x)
      Classifies a new observation.
      Parameters:
      x - a new observation.
      Returns:
      the cluster label.
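
Classifying a new observation in a centroid-based model usually means returning the index of the nearest center. The sketch below shows that idea with the same functional type the record uses for `distance`. `PredictSketch.nearest` is a hypothetical helper; whether the library breaks ties the same way (lowest index wins) is not stated on this page.

```java
import java.util.function.ToDoubleBiFunction;

/** Hypothetical sketch of nearest-center classification. */
final class PredictSketch {
    /**
     * Returns the index of the center closest to x.
     * Ties are resolved in favor of the lowest index (an assumption).
     */
    static <T, U> int nearest(T[] centers, ToDoubleBiFunction<T, U> distance, U x) {
        int label = -1;
        double best = Double.POSITIVE_INFINITY;
        for (int j = 0; j < centers.length; j++) {
            double d = distance.applyAsDouble(centers[j], x);
            if (d < best) {
                best = d;
                label = j;
            }
        }
        return label;
    }
}
```

The separate type parameters mirror the record's own `T` and `U`: the distance function takes a center as its first argument and an observation as its second.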
    • init

      public static <T> CentroidClustering<T,T> init(String name, T[] data, int k, ToDoubleBiFunction<T,T> distance)
      Returns a random clustering based on the K-Means++ algorithm. Many clustering methods, e.g. k-means, need an initial clustering configuration as a seed.

      K-Means++ is based on the intuition of spreading the k initial cluster centers away from each other. The first cluster center is chosen uniformly at random from the data points being clustered. Each subsequent cluster center is then chosen from the remaining data points with probability proportional to the squared distance from that point to its closest already-chosen center.

      The exact algorithm is as follows:

      1. Choose one center uniformly at random from among the data points.
      2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
      3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.
      4. Repeat Steps 2 and 3 until k centers have been chosen.
      5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
      This seeding method yields a considerable improvement in the final error of k-means. Although the initial selection takes extra time, the k-means iterations typically converge quickly after this seeding, so the overall computation time is often lower as well.
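
Steps 1 to 4 of the seeding procedure can be sketched as follows. This is a minimal illustration rather than the implementation behind `init`: the class `KMeansPlusPlusSketch`, its `seed` method, and the explicit `Random` parameter are hypothetical, and the sketch assumes the supplied function returns a plain (not squared) distance, which it then squares to obtain the sampling weights. Step 5, running k-means from these centers, is left out.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical sketch of K-Means++ seeding (steps 1 to 4). */
final class KMeansPlusPlusSketch {
    /** Returns the indices of k data points chosen as initial centers. */
    static <T> int[] seed(T[] data, int k, ToDoubleBiFunction<T, T> distance, Random rng) {
        int n = data.length;
        if (k < 1 || k > n) {
            throw new IllegalArgumentException("k must be in [1, n]");
        }
        int[] chosen = new int[k];
        double[] d2 = new double[n];   // squared distance to nearest chosen center
        Arrays.fill(d2, Double.POSITIVE_INFINITY);

        // Step 1: pick the first center uniformly at random.
        chosen[0] = rng.nextInt(n);

        for (int c = 1; c < k; c++) {
            // Step 2: update D(x)^2 using the most recently chosen center.
            T last = data[chosen[c - 1]];
            double total = 0.0;
            for (int i = 0; i < n; i++) {
                double d = distance.applyAsDouble(data[i], last);
                d2[i] = Math.min(d2[i], d * d);
                total += d2[i];
            }

            // Step 3: sample an index with probability d2[i] / total.
            double r = rng.nextDouble() * total;
            double cumulative = 0.0;
            int pick = -1;
            for (int i = 0; i < n; i++) {
                if (d2[i] <= 0.0) {
                    continue;          // already a center, or a duplicate of one
                }
                cumulative += d2[i];
                pick = i;              // fallback: last point with positive weight
                if (r < cumulative) {
                    break;
                }
            }
            if (pick < 0) {
                throw new IllegalArgumentException("fewer than k distinct points");
            }
            chosen[c] = pick;          // Step 4: repeat until k centers are chosen
        }
        return chosen;
    }
}
```

A point that is already a center has weight zero and is never selected again, so the k returned indices are distinct whenever the data contain at least k distinct points.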
      References:
      1. D. Arthur and S. Vassilvitskii. "K-means++: the advantages of careful seeding". ACM-SIAM Symposium on Discrete Algorithms, 1027-1035, 2007.
      2. Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra. "A systematic evaluation of different methods for initializing the K-means clustering algorithm". 2010.
      Type Parameters:
      T - the type of input object.
      Parameters:
      name - the clustering algorithm name.
      data - data objects array of size n.
      k - the number of clusters, i.e. the number of medoids to select from the data.
      distance - the distance function.
      Returns:
      the initial clustering.
    • seeds

      public static double[][] seeds(double[][] data, int k)
      Selects random samples as seeds for various algorithms.
      Parameters:
      data - samples to select seeds from.
      k - the number of seeds.
      Returns:
      the seeds.
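
The page does not say how the random samples are drawn. One common approach, shown here purely as an assumption, is to sample k distinct rows without replacement using a partial Fisher-Yates shuffle. `RandomSeedsSketch.sample` and its `Random` parameter are hypothetical and do not describe the library's method.

```java
import java.util.Random;

/** Hypothetical sketch: draw k distinct rows uniformly at random. */
final class RandomSeedsSketch {
    static double[][] sample(double[][] data, int k, Random rng) {
        int n = data.length;
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("k must be in [0, n]");
        }
        int[] index = new int[n];
        for (int i = 0; i < n; i++) {
            index[i] = i;
        }
        double[][] seeds = new double[k][];
        for (int i = 0; i < k; i++) {
            // Swap a random not-yet-used index into position i.
            int j = i + rng.nextInt(n - i);
            int tmp = index[i];
            index[i] = index[j];
            index[j] = tmp;
            // Copy the row so later changes to a seed do not alter the data.
            seeds[i] = data[index[i]].clone();
        }
        return seeds;
    }
}
```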
    • hashCode

      public final int hashCode()
      Returns a hash code value for this object. The value is derived from the hash code of each of the record components.
      Specified by:
      hashCode in class Record
      Returns:
      a hash code value for this object
    • equals

      public final boolean equals(Object o)
      Indicates whether some other object is "equal to" this one. The objects are equal if the other object is of the same class and if all the record components are equal. All components in this record class are compared with Objects::equals(Object,Object).
      Specified by:
      equals in class Record
      Parameters:
      o - the object with which to compare
      Returns:
      true if this object is the same as the o argument; false otherwise.
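
Several components of this record are arrays, and `Objects.equals` compares arrays by reference rather than by content. As a consequence, two records that hold separate arrays with identical elements would be expected to compare as not equal. The sketch below demonstrates this standard Java behavior with a small hypothetical record, `Labels`, and shows a content-based comparison using `Arrays.equals`.

```java
import java.util.Arrays;

/** Demonstrates default record equality when a component is an array. */
final class RecordArrayEqualityDemo {
    /** Hypothetical record with one array component. */
    record Labels(String name, int[] group) {}

    /** Content-based comparison that looks inside the array. */
    static boolean sameContent(Labels a, Labels b) {
        return a.name().equals(b.name()) && Arrays.equals(a.group(), b.group());
    }

    public static void main(String[] args) {
        Labels a = new Labels("demo", new int[]{0, 1, 1});
        Labels b = new Labels("demo", new int[]{0, 1, 1});
        Labels c = new Labels("demo", a.group());   // shares a's array

        System.out.println(a.equals(b));        // false: different array objects
        System.out.println(a.equals(c));        // true: same array reference
        System.out.println(sameContent(a, b));  // true: same elements
    }
}
```

When two clusterings need to be compared by value, comparing the individual arrays with `Arrays.equals` is the safer choice.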
    • name

      public String name()
      Returns the value of the name record component.
      Returns:
      the value of the name record component
    • centers

      public T[] centers()
      Returns the value of the centers record component.
      Returns:
      the value of the centers record component
    • distance

      public ToDoubleBiFunction<T,U> distance()
      Returns the value of the distance record component.
      Returns:
      the value of the distance record component
    • group

      public int[] group()
      Returns the value of the group record component.
      Returns:
      the value of the group record component
    • proximity

      public double[] proximity()
      Returns the value of the proximity record component.
      Returns:
      the value of the proximity record component
    • size

      public int[] size()
      Returns the value of the size record component.
      Returns:
      the value of the size record component
    • distortions

      public double[] distortions()
      Returns the value of the distortions record component.
      Returns:
      the value of the distortions record component