Package smile.clustering
Class PartitionClustering
java.lang.Object
smile.clustering.PartitionClustering
- All Implemented Interfaces:
Serializable
- Direct Known Subclasses:
CentroidClustering, DBSCAN, DENCLUE, MEC, SpectralClustering
Partition clustering. Partition methods classify the observations
into distinct non-overlapping groups.
Field Summary
- static final int OUTLIER: Cluster label for outliers or noises.
- final int k: The number of clusters.
- final int[] y: The cluster labels of data.
- final int[] size: The number of observations in each cluster.
Constructor Summary
- PartitionClustering(int k, int[] y): Constructor.
Method Summary
- static <T extends PartitionClustering & Comparable<? super T>> T
run(int runs, Supplier<T> clustering): Runs a clustering algorithm multiple times and returns the best one (e.g. smallest distortion).
- static <T> double[]
seed(T[] data, T[] medoids, int[] y, ToDoubleBiFunction<T,T> distance): Initializes the cluster membership of the input objects with the K-Means++ algorithm.
- String
toString()
Field Details

OUTLIER
public static final int OUTLIER
Cluster label for outliers or noises.
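For instance, density-based subclasses such as DBSCAN assign this label to noise points. A minimal sketch, assuming DBSCAN.fit(double[][] data, int minPts, double radius) as in Smile 2.x; the data and parameter values are made up:

import smile.clustering.DBSCAN;
import smile.clustering.PartitionClustering;

public class OutlierExample {
    public static void main(String[] args) {
        double[][] x = {
            {0.0, 0.0}, {0.1, 0.1}, {0.2, 0.0},
            {5.0, 5.0}, {5.1, 4.9}, {20.0, 20.0}   // the last point is isolated
        };
        DBSCAN<double[]> model = DBSCAN.fit(x, 3, 1.0);
        for (int i = 0; i < model.y.length; i++) {
            if (model.y[i] == PartitionClustering.OUTLIER) {
                System.out.println("observation " + i + " is noise");
            }
        }
    }
}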
k
public final int k
The number of clusters.

y
public final int[] y
The cluster labels of data.

size
public final int[] size
The number of observations in each cluster.
Constructor Details

PartitionClustering
public PartitionClustering(int k, int[] y)
Constructor.
- Parameters:
k - the number of clusters.
y - the cluster labels.
Method Details

toString
public String toString()

seed
public static <T> double[] seed(T[] data, T[] medoids, int[] y, ToDoubleBiFunction<T,T> distance)
Initializes the cluster membership of the input objects with the K-Means++ algorithm. Many clustering methods, e.g. k-means, need an initial clustering configuration as a seed. K-Means++ is based on the intuition of spreading the k initial cluster centers away from each other. The first cluster center is chosen uniformly at random from the data points being clustered; each subsequent center is then chosen from the remaining data points with probability proportional to the squared distance between the point and its closest existing cluster center.
The exact algorithm is as follows:
- Choose one center uniformly at random from among the data points.
- For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
- Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D²(x).
- Repeat Steps 2 and 3 until k centers have been chosen.
- Now that the initial centers have been chosen, proceed using standard k-means clustering.
References:
- D. Arthur and S. Vassilvitskii. "K-means++: The Advantages of Careful Seeding". ACM-SIAM Symposium on Discrete Algorithms, 1027-1035, 2007.
- A. D. Peterson, A. P. Ghosh, and R. Maitra. "A Systematic Evaluation of Different Methods for Initializing the K-means Clustering Algorithm". 2010.
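To make the D²(x) weighting in the steps above concrete, here is an illustrative standalone sketch of the seeding loop. It is not the library's implementation; Euclidean distance, the class and method names, and returning center indices are all choices made for this example:

import java.util.Arrays;
import java.util.Random;

public class KMeansPlusPlusSketch {
    // Returns the indices of k centers chosen by k-means++ seeding.
    public static int[] seedIndices(double[][] data, int k, Random rng) {
        int n = data.length;
        int[] centers = new int[k];
        double[] d2 = new double[n];               // D(x)^2: squared distance to the nearest chosen center
        Arrays.fill(d2, Double.POSITIVE_INFINITY);
        centers[0] = rng.nextInt(n);               // step 1: first center uniformly at random

        for (int c = 1; c < k; c++) {
            double sum = 0.0;
            for (int i = 0; i < n; i++) {          // step 2: refresh D(x)^2 against the latest center
                double dist2 = 0.0;
                for (int j = 0; j < data[i].length; j++) {
                    double diff = data[i][j] - data[centers[c - 1]][j];
                    dist2 += diff * diff;
                }
                d2[i] = Math.min(d2[i], dist2);
                sum += d2[i];
            }
            double r = rng.nextDouble() * sum;     // step 3: sample proportional to D(x)^2
            int next = 0;
            double acc = d2[0];
            while (acc <= r && next < n - 1) {
                next++;
                acc += d2[next];
            }
            centers[c] = next;                     // already-chosen points have weight 0
        }
        return centers;                            // step 4 is the surrounding loop; step 5 is ordinary k-means
    }
}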
- Type Parameters:
T - the type of input object.
- Parameters:
data - the data objects, an array of size n.
medoids - an array of size k to store the cluster medoids on output.
y - an array of size n to store the cluster labels on output.
distance - the distance function.
- Returns:
an array of size n storing the distance of each observation to its nearest medoid.
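A minimal usage sketch of seed, with made-up data and a hand-rolled Euclidean distance lambda:

import java.util.Arrays;
import java.util.function.ToDoubleBiFunction;
import smile.clustering.PartitionClustering;

public class SeedExample {
    public static void main(String[] args) {
        double[][] data = {
            {0.0, 0.1}, {0.2, 0.0}, {9.0, 9.1},
            {9.2, 8.9}, {5.0, 5.1}, {4.9, 5.2}
        };
        int k = 3;
        double[][] medoids = new double[k][];   // filled with the chosen medoids on output
        int[] y = new int[data.length];         // filled with the cluster labels on output
        ToDoubleBiFunction<double[], double[]> distance = (a, b) -> {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        };
        double[] nearest = PartitionClustering.seed(data, medoids, y, distance);
        System.out.println("labels:    " + Arrays.toString(y));
        System.out.println("distances: " + Arrays.toString(nearest));
    }
}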
run
public static <T extends PartitionClustering & Comparable<? super T>> T run(int runs, Supplier<T> clustering)
Runs a clustering algorithm multiple times and returns the best one (e.g. the one with the smallest distortion).
- Type Parameters:
T - the data type.
- Parameters:
runs - the number of runs.
clustering - the clustering algorithm.
- Returns:
the best model.
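A usage sketch, assuming KMeans.fit(double[][] data, int k) as in Smile 2.x; the data and the number of runs are made up:

import java.util.Arrays;
import smile.clustering.KMeans;
import smile.clustering.PartitionClustering;

public class RunExample {
    public static void main(String[] args) {
        double[][] x = {
            {0.0, 0.1}, {0.2, 0.0}, {9.0, 9.1},
            {9.2, 8.9}, {5.0, 5.1}, {4.9, 5.2}
        };
        // Try 10 random restarts and keep the model with the smallest distortion.
        KMeans model = PartitionClustering.run(10, () -> KMeans.fit(x, 2));
        System.out.println("k = " + model.k);
        System.out.println("cluster sizes = " + Arrays.toString(model.size));
        System.out.println("labels = " + Arrays.toString(model.y));
    }
}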