kmeansfit

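Fits a k-means clustering model to the training data. K-means is an unsupervised learning algorithm that partitions observations into a specified number of clusters.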

Syntax

parameters = kmeansfit(X)

parameters = kmeansfit(X,options)

Inputs

X
Training data, with one sample per row and one feature per column.
Type: double
Dimension: vector | matrix
options
Type: struct
n_clusters
Number of clusters to find (default: 8).
Type: integer
Dimension: scalar
init
Method used to initialize the centroids.
'k-means++' (default): selects initial cluster centers that tend to lie far apart from one another, which speeds up convergence.
'random': chooses n_clusters observations (rows) at random from the data as the initial centroids.
Type: char
Dimension: string
n_init
Number of times the k-means algorithm is run with different centroid seeds. The final result is the run with the lowest inertia among the n_init runs (default: 10).
Type: integer
Dimension: scalar
max_iter
Maximum number of iterations of the k-means algorithm for a single run (default: 300).
Type: integer
Dimension: scalar
tol
Relative tolerance with regard to inertia to declare convergence (default: 1e-4).
Type: double
Dimension: scalar
random_state
Seed for the random number generation used in centroid initialization. Set this parameter to make the results reproducible.
Type: integer
Dimension: scalar
algorithm
K-means algorithm to use.
'full': classical EM-style algorithm.
'elkan': a more efficient variant of the classical algorithm that uses the triangle inequality; it does not currently support sparse data.
'auto' (default): chooses 'elkan' for dense data and 'full' for sparse data.
Type: char
Dimension: string
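
The 'k-means++' option above names a well-known seeding scheme: the first center is drawn uniformly from the data, and each later center is drawn with probability proportional to its squared distance from the nearest center already chosen. The sketch below illustrates that idea in plain Python (standard library only). It is not the kmeansfit implementation; the names kmeanspp_seed and sq_dist are hypothetical, and the seed argument only plays the role that random_state plays in the options.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points (sequences of numbers)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points, k, seed=None):
    """Pick k initial centers from points using k-means++ style sampling.

    Illustrative helper only (hypothetical name), not part of kmeansfit.
    """
    rng = random.Random(seed)        # analogous in spirit to random_state
    centers = [rng.choice(points)]   # first center: uniform at random
    while len(centers) < k:
        # Squared distance from every point to its nearest chosen center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # All points coincide with a chosen center: fall back to uniform.
            centers.append(rng.choice(points))
            continue
        # Points far from all current centers are more likely to be picked.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


# Example: two well-separated groups of points, two centers requested.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(kmeanspp_seed(points, k=2, seed=0))
```

Because an already chosen point has zero weight, the second center is never a repeat of the first while other distinct points remain. Fixing the seed makes the selection repeatable, which is the purpose of random_state.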

Outputs

parameters
Contains all the values passed to kmeansfit as options, plus the following key-value pairs.
Type: struct
cluster_centers
Coordinates of the cluster centers, one row per cluster.
Type: double
Dimension: matrix
labels
Cluster label assigned to each point in the training data.
Type: double
Dimension: vector
inertia
Sum of squared distances of samples to their closest cluster center.
Type: double
Dimension: scalar
n_iter
Number of iterations run.
Type: integer
Dimension: scalar
n_samples
Number of rows in the training data.
Type: integer
Dimension: scalar
n_features
Number of columns in the training data.
Type: integer
Dimension: scalar
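
The inertia defined above can be recomputed by hand from the points, their labels, and the cluster centers. The following plain-Python sketch shows the arithmetic. The function name is hypothetical, and the labels are assumed to be 0-based indices into the list of centers for the purpose of this illustration; the indexing convention used by kmeansfit itself is not stated here.

```python
def inertia(points, labels, centers):
    """Sum of squared distances from each point to its assigned center.

    Assumes labels[i] is a 0-based index into centers (illustration only).
    """
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        # Squared Euclidean distance between the point and its center.
        total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total


# Three 2-D points, two clusters.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centers = [(0.0, 1.0), (10.0, 0.0)]
labels = [0, 0, 1]
print(inertia(points, labels, centers))  # 1 + 1 + 0 = 2.0
```

A lower inertia means the points sit closer to their centers, which is why the n_init runs are compared on this value.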

Example

Usage of kmeansfit with options

rand('seed', 2);
XTrain = rand(14, 5);
XTest  = rand(2, 5);

options = struct;
options.n_clusters = 2; 
parameters = kmeansfit(XTrain, options);
> parameters
parameters = struct [
  algorithm: auto
  cluster_centers: [Matrix] 2 x 5
  0.25669  0.37129  0.78008  0.28967  0.55561
  0.57313  0.57262  0.30554  0.31799  0.40330
  init: k-means++
  interia: 2.4113899
  ...

Comments

If the algorithm stops before fully converging (because of tol or max_iter), labels and cluster_centers will not be consistent: the cluster_centers will not be the means of the points in each cluster. In addition, the labels are reassigned after the last iteration so that they are consistent with the predictions made on the training set.

The output 'parameters' should be passed as input to the kmeanspredict function.
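
The inconsistency described above can be seen on a tiny one-dimensional data set. The plain-Python sketch below takes a set of centers as they might look after an early stop (the values are made up for illustration), assigns each point to its nearest center, and then compares the centers with the per-cluster means. The helper names are hypothetical and the code does not reproduce kmeansfit internals.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Label each point with the 0-based index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def cluster_means(points, labels, k):
    """Mean of the points in each cluster (assumes no cluster is empty)."""
    dims = len(points[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for p, label in zip(points, labels):
        counts[label] += 1
        for d in range(dims):
            sums[label][d] += p[d]
    return [tuple(s / counts[j] for s in sums[j]) for j in range(k)]


points = [(0.0,), (1.0,), (4.0,), (5.0,)]
centers = [(0.0,), (3.0,)]      # hypothetical centers after an early stop

labels = assign(points, centers)
means = cluster_means(points, labels, k=2)

print(labels)            # [0, 0, 1, 1]
print(means)             # [(0.5,), (4.5,)]
print(means == centers)  # False: the centers are not the cluster means
```

Here the labels already match a nearest-center prediction, but the centers still differ from the cluster means. One more update step would move the centers to the means, which is the state a fully converged run ends in.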