Clustering

Hierarchical Clustering

Hierarchical Clustering algorithm derived from the R package ‘amap’ [Amap].

class mlpy.HCluster(method='euclidean', link='complete')

Hierarchical Cluster.

Initialize Hierarchical Cluster.

Parameters:
method : string (‘euclidean’)

the distance measure to be used

link : string (‘single’, ‘complete’, ‘mcquitty’, ‘median’)

the agglomeration method to be used

Example:

>>> import numpy as np
>>> import mlpy
>>> x = np.array([[ 1. ,  1.5],
...               [ 1.1,  1.8],
...               [ 2. ,  2.8],
...               [ 3.2,  3.1],
...               [ 3.4,  3.2]])
>>> hc = mlpy.HCluster()
>>> hc.compute(x)
>>> hc.ia
array([-4, -1, -3,  2])
>>> hc.ib
array([-5, -2,  1,  3])
>>> hc.heights
array([ 0.2236068 ,  0.31622776,  1.4560219 ,  2.94108844])
>>> hc.cut(0.5)
array([0, 0, 1, 2, 2])
compute(x)

Compute Hierarchical Cluster.

Parameters:
x : ndarray

A 2-dimensional array (samples x features).

Returns:
self.ia : ndarray (1-dimensional vector)

first element of each merge step (see the description of merge below)

self.ib : ndarray (1-dimensional vector)

second element of each merge step (see the description of merge below)

self.heights : ndarray (1-dimensional vector)

a set of n-1 non-decreasing real values: the clustering heights, that is, the value of the criterion associated with the clustering method for each particular agglomeration.

Element i of merge describes the merging of clusters at step i of the clustering. If an element j is negative, then observation -j was merged at this stage. If j is positive then the merge was with the cluster formed at the (earlier) stage j of the algorithm. Thus negative entries in merge indicate agglomerations of singletons, and positive entries indicate agglomerations of non-singletons.
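
As an illustration, the ia and ib values from the example above (hard-coded here rather than recomputed) can be decoded step by step; note that the observation numbers are 1-based, following the convention of R's hclust:

>>> ia = [-4, -1, -3,  2]
>>> ib = [-5, -2,  1,  3]
>>> lab = lambda j: "obs %d" % -j if j < 0 else "step %d" % j
>>> ["%s + %s" % (lab(i), lab(j)) for i, j in zip(ia, ib)]
['obs 4 + obs 5', 'obs 1 + obs 2', 'obs 3 + step 1', 'step 2 + step 3']

Here step 1 merges the fourth and fifth samples (the two closest points), step 2 merges the first two samples, step 3 attaches the third sample to the cluster formed at step 1, and step 4 joins the two remaining clusters.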

cut(ht)

Cuts the tree into several groups by specifying the cut height.

Parameters:
ht : float

height where the tree should be cut

Returns:
cl : ndarray (1-dimensional vector)

group memberships. Groups are numbered 0, ..., N-1, where N is the number of groups obtained at the given height
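
Roughly speaking, the cut height is compared against the merge heights in hc.heights: every merge whose height lies below ht is applied, so the number of groups is the number of samples minus the number of merges below ht. A small check using the values from the example above (hard-coded rather than recomputed):

>>> import numpy as np
>>> heights = np.array([ 0.2236068 ,  0.31622776,  1.4560219 ,  2.94108844])
>>> 5 - np.sum(heights < 0.5)   # 5 samples, 2 merges below ht=0.5 -> 3 groups, as in hc.cut(0.5)
3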

[Amap] amap: Another Multidimensional Analysis Package, http://cran.r-project.org/web/packages/amap/index.html

k-means

class mlpy.Kmeans(k, init='std', seed=0)

k-means algorithm.

Initialization.

Parameters:
k : int (>1)

number of clusters

init : string (‘std’, ‘plus’)

initialization algorithm

  • ‘std’ : randomly selected
  • ‘plus’ : k-means++ algorithm (see the seeding sketch after this parameter list)
seed : int (>=0)

random seed
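
The ‘plus’ option refers to k-means++ seeding, in which the first center is chosen at random and each further center is drawn with probability proportional to the squared distance from the nearest center already chosen. The following NumPy function is only an illustrative sketch of that seeding scheme, not mlpy's internal implementation (kmeanspp_seeds is a hypothetical name):

import numpy as np

def kmeanspp_seeds(x, k, rng=np.random):
    # illustrative k-means++ seeding sketch, not part of mlpy
    # choose the first center uniformly at random among the points
    centers = [x[rng.randint(len(x))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # draw the next center with probability proportional to d2
        centers.append(x[rng.choice(len(x), p=d2 / d2.sum())])
    return np.array(centers)

For example, kmeanspp_seeds(x, 3, np.random.RandomState(0)) would return three rows of x to serve as initial means.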

Example:

>>> import numpy as np
>>> import mlpy
>>> x = np.array([[ 1. ,  1.5],
...               [ 1.1,  1.8],
...               [ 2. ,  2.8],
...               [ 3.2,  3.1],
...               [ 3.4,  3.2]])
>>> kmeans = mlpy.Kmeans(k=3, init="plus", seed=0)
>>> kmeans.compute(x)
array([1, 1, 2, 0, 0], dtype=int32)
>>> kmeans.means
array([[ 3.3 ,  3.15],
       [ 1.05,  1.65],
       [ 2.  ,  2.8 ]])
>>> kmeans.steps
2

New in version 2.2.0.

compute(x)

Compute Kmeans.

Parameters:
x : ndarray

a 2-dimensional array (number of points x dimensions)

Returns:
cls : ndarray (1-dimensional vector)

cluster membership. Clusters are in 0, ..., k-1

Attributes:
Kmeans.means : 2d ndarray float (k x dim)

the cluster means (one row per cluster)

Kmeans.steps : int

number of steps
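
Continuing the Kmeans session above, the returned memberships can be checked against the means attribute: each point is assigned to its nearest mean (this is only a consistency check, not part of the mlpy API):

>>> np.argmin(((x[:, np.newaxis] - kmeans.means) ** 2).sum(axis=2), axis=1)
array([1, 1, 2, 0, 0])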

k-medoids

class mlpy.Kmedoids(k, dist, maxloops=100, rs=0)

k-medoids algorithm.

Initialize Kmedoids.

Parameters:
k : int

Number of clusters/medoids

dist : class

class with a .compute(x, y) method which returns the distance between x and y (a minimal example of such a class is sketched after this parameter list)

maxloops : int

maximum number of loops

rs : int

random seed
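
As noted above, the dist argument can be any object exposing a compute(x, y) method that returns the distance between two samples. A minimal Euclidean distance class, given here only as an illustrative sketch (EuclideanDist is not an mlpy class), could look like:

import numpy as np

class EuclideanDist:
    # hypothetical distance class usable as the dist argument of Kmedoids
    def compute(self, x, y):
        # Euclidean distance between two 1-dimensional vectors
        return np.sqrt(((np.asarray(x) - np.asarray(y)) ** 2).sum())

An instance could then be passed as mlpy.Kmedoids(k=3, dist=EuclideanDist()), in place of the mlpy.Dtw instance used in the example below.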

Example:

>>> import numpy as np
>>> import mlpy
>>> x = np.array([[ 1. ,  1.5],
...               [ 1.1,  1.8],
...               [ 2. ,  2.8],
...               [ 3.2,  3.1],
...               [ 3.4,  3.2]])
>>> dtw = mlpy.Dtw(onlydist=True)
>>> km = mlpy.Kmedoids(k=3, dist=dtw)
>>> km.compute(x)
(array([4, 0, 2]), array([3, 1]), array([0, 1]), 0.072499999999999981)

Samples 4, 0, and 2 are the medoids and represent clusters 0, 1, and 2, respectively.

  • cluster 0: samples 4 (medoid) and 3
  • cluster 1: samples 0 (medoid) and 1
  • cluster 2: sample 2 (medoid)

New in version 2.0.8.

compute(x)

Compute Kmedoids.

Parameters:
x : ndarray

A 2-dimensional array (samples x features).

Returns:
m : ndarray (1-dimensional vector)

medoids indexes

n : ndarray (1-dimensional vector)

non-medoids indexes

cl : ndarray (1-dimensional vector)

cluster membership for non-medoids. Groups are in 0, ..., k-1

co : double

total cost of configuration
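
If a single membership vector over all samples is needed, it can be assembled from the returned medoid and non-medoid indexes; the sketch below hard-codes the values from the example above rather than recomputing them:

>>> import numpy as np
>>> m, n, cl = np.array([4, 0, 2]), np.array([3, 1]), np.array([0, 1])
>>> labels = np.empty(5, dtype=int)
>>> labels[m] = np.arange(len(m))   # medoid i represents cluster i
>>> labels[n] = cl                  # non-medoids take their assigned cluster
>>> labels
array([1, 1, 2, 0, 0])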