Clustering with Hypertools

The cluster feature performs clustering analysis on the data (an array, dataframe, or list) and returns a list of cluster labels.

The default clustering method is K-Means (argument ‘KMeans’); MiniBatchKMeans, AgglomerativeClustering, Birch, FeatureAgglomeration, SpectralClustering, and HDBSCAN are also supported.

Note that if a list of arrays is passed, the arrays are stacked and clustering is performed across all of them together (not within each array separately).
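The stacking behavior described above can be sketched with scikit-learn directly. This is only an illustration of the idea, not hypertools' own code: the two arrays, their sizes, and the names first, second, and labels_per_array are made up for the example, and scikit-learn's KMeans stands in for whatever hyp.cluster does internally.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two hypothetical arrays with the same number of columns
first = rng.normal(loc=0.0, size=(40, 5))
second = rng.normal(loc=3.0, size=(60, 5))
arrays = [first, second]

# Stack all arrays into one matrix and fit a single model on it,
# so clusters are found across the arrays rather than within each one
stacked = np.vstack(arrays)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(stacked)
labels = model.labels_

# Labels follow the row order of the stacked matrix, so they can be
# split back into one chunk per input array if needed
boundaries = np.cumsum([len(a) for a in arrays])[:-1]
labels_per_array = np.split(labels, boundaries)
print([len(part) for part in labels_per_array])
```

Because a single model is fit on the stacked rows, a given label (say, 0) refers to the same cluster in every input array.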

Import Packages

import hypertools as hyp
from collections import Counter

%matplotlib inline

Load your data

We will load one of the sample datasets. This dataset consists of 8,124 samples of mushrooms with various text features.

geo = hyp.load('mushrooms')
mushrooms = geo.get_data()

We can peek at the first few rows of the dataframe using the pandas method head().

mushrooms.head()
bruises cap-color cap-shape cap-surface gill-attachment gill-color gill-size gill-spacing habitat odor ... ring-type spore-print-color stalk-color-above-ring stalk-color-below-ring stalk-root stalk-shape stalk-surface-above-ring stalk-surface-below-ring veil-color veil-type
0 t n x s f k n c u p ... p k w w e e s s w p
1 t y x s f k b c g a ... p n w w c e s s w p
2 t w b s f n b c m l ... p n w w c e s s w p
3 t w x y f n n c u p ... p k w w e e s s w p
4 f g x s f k b w g n ... e n w w e t s s w p

5 rows × 22 columns

Obtain cluster labels

To obtain cluster labels, simply pass the data to hyp.cluster. Since we have not specified a desired number of clusters, the default of 3 clusters is used (labels 0, 1, and 2). Likewise, since we have not specified a clustering algorithm, K-Means is used by default.

labels = hyp.cluster(mushrooms)
set(labels)
{0, 1, 2}

We can further examine the number of datapoints assigned to each label.

Counter(labels)
Counter({0: 1296, 1: 5067, 2: 1761})
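Raw counts are easier to compare as proportions. The sketch below reuses the counts printed above (hard-coded here so it runs on its own) and reports each label's share of the 8,124 samples. Keep in mind that K-Means label numbers are arbitrary, so a different run may assign the same groups different numbers.

```python
from collections import Counter

# Counts copied from the Counter output shown above
counts = Counter({0: 1296, 1: 5067, 2: 1761})
total = sum(counts.values())  # total number of samples

# Share of samples assigned to each label, largest cluster first
for label, n in counts.most_common():
    print(f"label {label}: {n} samples ({n / total:.1%})")
```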

Specify number of cluster labels

You can also specify the desired number of clusters by setting the n_clusters argument to an integer, as below. When we pass 10 to n_clusters, 10 cluster labels are assigned.

Since we have not specified a clustering algorithm, K-Means is again used by default.

labels_10 = hyp.cluster(mushrooms, n_clusters = 10)
set(labels_10)
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

Different clustering models

You may prefer to use a clustering model other than K-Means. To do so, simply pass a string to the cluster argument specifying the desired clustering algorithm.

In this case, we specify the clustering model (HDBSCAN). Note that n_clusters is not passed here: HDBSCAN determines the number of clusters from the data rather than taking it as an input.

labels_HDBSCAN = hyp.cluster(mushrooms, cluster='HDBSCAN')
geo = hyp.plot(mushrooms, '.', hue=labels_10, title='K-means clustering')
geo = hyp.plot(mushrooms, '.', hue=labels_HDBSCAN, title='HDBSCAN clustering')
[Figures: two scatter plots of the mushrooms data, one colored by the K-means labels and one colored by the HDBSCAN labels]
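Because HDBSCAN chooses the number of clusters itself, it can be useful to count how many it found. The hdbscan library marks points it treats as noise with the label -1, so that label should be left out of the count. The sketch below assumes hyp.cluster passes those labels through unchanged, and it uses a made-up list (example_labels) rather than the real mushroom output.

```python
from collections import Counter

# Hypothetical labels for illustration only, not the real mushroom output
example_labels = [0, 0, 1, -1, 2, 1, -1, 0]

counts = Counter(example_labels)
n_noise = counts.get(-1, 0)                    # points left unassigned (label -1)
n_clusters = len(set(example_labels) - {-1})   # distinct clusters, excluding noise
print(n_clusters, n_noise)
```

The same two lines could be applied to labels_HDBSCAN to see how many clusters were found and how many points were treated as noise.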