The cluster feature performs clustering analysis on the data (an array, dataframe, or list) and returns a list of cluster labels.
The default clustering method is K-Means (argument 'KMeans'); MiniBatchKMeans, AgglomerativeClustering, Birch, FeatureAgglomeration, SpectralClustering, and HDBSCAN are also supported.
Note that if a list of arrays is passed, the arrays will be stacked and clustering will be performed across all of them together (not within each array separately).
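To make the stacking behavior concrete, the sketch below mimics it with plain Python lists. It does not call hypertools: `stack_rows` and `split_labels` are hypothetical helpers, the labels are made up, and the code assumes that labels come back as one flat sequence in stacked row order, which the text above does not state explicitly.

```python
from itertools import accumulate

def stack_rows(datasets):
    """Concatenate several row-lists into one list of rows (hypothetical helper)."""
    return [row for data in datasets for row in data]

def split_labels(labels, datasets):
    """Cut a flat label list back into one chunk per input dataset.

    Assumes the labels are ordered like the stacked rows (an assumption).
    """
    ends = list(accumulate(len(data) for data in datasets))
    starts = [0] + ends[:-1]
    return [labels[s:e] for s, e in zip(starts, ends)]

# Two toy datasets with 3 and 2 rows, respectively.
a = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1]]
b = [[5.2, 4.9], [0.1, 0.1]]

stacked = stack_rows([a, b])       # all 5 rows would be clustered together
flat_labels = [0, 0, 1, 1, 0]      # made-up labels, one per stacked row
per_dataset = split_labels(flat_labels, [a, b])
print(len(stacked), per_dataset)
```

Because the rows are pooled before clustering, a given label refers to the same cluster regardless of which input array a row came from.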
import hypertools as hyp
from collections import Counter
%matplotlib inline
We will load one of the sample datasets. This dataset consists of 8,124 samples of mushrooms with various text features.
geo = hyp.load('mushrooms')
mushrooms = geo.get_data()
We can peek at the first few rows of the dataframe using the pandas function head().
mushrooms.head()
| | bruises | cap-color | cap-shape | cap-surface | gill-attachment | gill-color | gill-size | gill-spacing | habitat | odor | ... | ring-type | spore-print-color | stalk-color-above-ring | stalk-color-below-ring | stalk-root | stalk-shape | stalk-surface-above-ring | stalk-surface-below-ring | veil-color | veil-type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | t | n | x | s | f | k | n | c | u | p | ... | p | k | w | w | e | e | s | s | w | p |
1 | t | y | x | s | f | k | b | c | g | a | ... | p | n | w | w | c | e | s | s | w | p |
2 | t | w | b | s | f | n | b | c | m | l | ... | p | n | w | w | c | e | s | s | w | p |
3 | t | w | x | y | f | n | n | c | u | p | ... | p | k | w | w | e | e | s | s | w | p |
4 | f | g | x | s | f | k | b | w | g | n | ... | e | n | w | w | e | t | s | s | w | p |
5 rows × 22 columns
To obtain cluster labels, simply pass the data to hyp.cluster. Since we have not specified a desired number of clusters, the default of 3 clusters is used (labels 0, 1, and 2). Additionally, since we have not specified a desired clustering algorithm, K-Means is used by default.
labels = hyp.cluster(mushrooms)
set(labels)
{0, 1, 2}
We can further examine the number of datapoints assigned to each label.
Counter(labels)
Counter({0: 1296, 1: 5067, 2: 1761})
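The counts above can also be read as proportions, which makes the imbalance between clusters easier to see. The snippet below is a small sketch that reuses the counts printed above rather than calling hyp.cluster again, since K-Means assignments may differ from run to run; `counts`, `total`, and `fractions` are illustrative names.

```python
from collections import Counter

# Counts copied from the output above (not recomputed here).
counts = Counter({0: 1296, 1: 5067, 2: 1761})

# The counts should add up to the number of samples in the dataset (8,124).
total = sum(counts.values())

# Fraction of all datapoints that fall into each cluster.
fractions = {label: n / total for label, n in sorted(counts.items())}

for label, frac in fractions.items():
    print(f"cluster {label}: {frac:.1%}")
```

In this run, cluster 1 holds well over half of the samples, so the three clusters are far from evenly sized.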
You can also specify the number of desired clusters by setting the n_clusters argument to an integer, as below. We can see that when we pass the int 10 to n_clusters, 10 cluster labels are assigned.
Since we have not specified a desired clustering algorithm, K-Means is again used by default.
labels_10 = hyp.cluster(mushrooms, n_clusters=10)
set(labels_10)
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
You may prefer to use a clustering model other than K-Means. To do so, simply pass a string to the cluster argument specifying the desired clustering algorithm.
In this case, we specify only the clustering model (HDBSCAN). HDBSCAN infers the number of clusters from the data, so n_clusters is not set here.
labels_HDBSCAN = hyp.cluster(mushrooms, cluster='HDBSCAN')
geo = hyp.plot(mushrooms, '.', hue=labels_10, title='K-means clustering')
geo = hyp.plot(mushrooms, '.', hue=labels_HDBSCAN, title='HDBSCAN clustering')
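One point worth keeping in mind when reading HDBSCAN labels: the algorithm marks points it considers noise with the label -1, so the label set is not necessarily a clean range like the K-Means output above. The sketch below summarizes such a label list. The labels are made up for illustration, and `summarize_labels` is a hypothetical helper, not part of hypertools.

```python
from collections import Counter

def summarize_labels(labels, noise_label=-1):
    """Return (number of clusters, number of noise points, cluster sizes)."""
    counts = Counter(labels)
    # Remove the noise bucket so it is not counted as a cluster.
    n_noise = counts.pop(noise_label, 0)
    return len(counts), n_noise, dict(sorted(counts.items()))

# Made-up labels standing in for something like labels_HDBSCAN.
example_labels = [0, 0, 1, -1, 2, 2, 2, -1, 1, 0]

n_found, n_noise, sizes = summarize_labels(example_labels)
print(n_found, n_noise, sizes)
```

The same helper works on K-Means labels, where the noise count is simply zero.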