Clustering with Hypertools
==========================
The ``cluster`` feature performs clustering analysis on the data (an array,
dataframe, or list) and returns a list of cluster labels.

The default clustering method is K-Means (argument ``'KMeans'``);
MiniBatchKMeans, AgglomerativeClustering, Birch, FeatureAgglomeration,
SpectralClustering and HDBSCAN are also supported.

Note that, if a list is passed, the arrays will be stacked and
clustering will be performed *across* all lists (not within each list).

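
Because the labels describe the stacked data, it can be handy to split them
back into one group per input array. The sketch below shows one way to do
this in plain Python. ``split_labels`` is a hypothetical helper, not part of
hypertools, and it assumes the labels come back as one flat sequence in the
same order as the stacked rows.

.. code:: ipython3

    def split_labels(labels, arrays):
        """Split a flat label sequence into one list per input array."""
        lengths = [len(a) for a in arrays]
        if sum(lengths) != len(labels):
            raise ValueError("labels must have one entry per stacked row")
        groups, start = [], 0
        for n in lengths:
            groups.append(list(labels[start:start + n]))
            start += n
        return groups

    # Two toy "arrays" (3 rows and 2 rows) and made-up labels
    # for the 5 stacked rows.
    arrays = [[[0.0, 0.1], [0.2, 0.1], [5.0, 5.1]],
              [[5.2, 4.9], [0.1, 0.0]]]
    stacked_labels = [0, 0, 1, 1, 0]

    per_array = split_labels(stacked_labels, arrays)
    print(per_array)

The first group holds the labels for the three rows of the first array, and
the second group holds the labels for the two rows of the second array.
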
Import Packages
---------------

.. code:: ipython3

    import hypertools as hyp
    from collections import Counter

    %matplotlib inline

Load your data
--------------
We will load one of the sample datasets. This dataset consists of 8,124
samples of mushrooms with various text features.

.. code:: ipython3

    geo = hyp.load('mushrooms')
    mushrooms = geo.get_data()

We can peek at the first few rows of the dataframe using the pandas
method ``head()``:

.. code:: ipython3

    mushrooms.head()

.. csv-table::
   :header: "", "bruises", "cap-color", "cap-shape", "cap-surface", "gill-attachment", "gill-color", "gill-size", "gill-spacing", "habitat", "odor", "...", "ring-type", "spore-print-color", "stalk-color-above-ring", "stalk-color-below-ring", "stalk-root", "stalk-shape", "stalk-surface-above-ring", "stalk-surface-below-ring", "veil-color", "veil-type"

   0, t, n, x, s, f, k, n, c, u, p, ..., p, k, w, w, e, e, s, s, w, p
   1, t, y, x, s, f, k, b, c, g, a, ..., p, n, w, w, c, e, s, s, w, p
   2, t, w, b, s, f, n, b, c, m, l, ..., p, n, w, w, c, e, s, s, w, p
   3, t, w, x, y, f, n, n, c, u, p, ..., p, k, w, w, e, e, s, s, w, p
   4, f, g, x, s, f, k, b, w, g, n, ..., e, n, w, w, e, t, s, s, w, p

5 rows × 22 columns

Obtain cluster labels
---------------------
To obtain cluster labels, simply pass the data to ``hyp.cluster``. Since
we have not specified a desired number of clusters, the default of 3
clusters is used (labels 0, 1, and 2). Additionally, since we have not
specified a desired clustering algorithm, K-Means is used by default.

.. code:: ipython3

    labels = hyp.cluster(mushrooms)
    set(labels)

.. parsed-literal::

    {0, 1, 2}

We can further examine the number of datapoints assigned to each label.

.. code:: ipython3

    Counter(labels)

.. parsed-literal::

    Counter({0: 1296, 1: 5067, 2: 1761})

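
Raw counts are often easier to compare as proportions. The following is a
minimal sketch that reuses the counts printed above. It hard-codes them so
that it runs on its own. Your own counts and label numbering may differ from
run to run.

.. code:: ipython3

    from collections import Counter

    # Counts copied from the output above; replace with Counter(labels)
    # when working with your own clustering result.
    counts = Counter({0: 1296, 1: 5067, 2: 1761})

    total = sum(counts.values())
    fractions = {label: n / total for label, n in sorted(counts.items())}

    for label, frac in fractions.items():
        print(f"cluster {label}: {frac:.1%}")

Here ``total`` equals 8,124, the number of samples in the dataset, so every
mushroom received exactly one label.
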
Specify number of cluster labels
--------------------------------
You can also specify the number of desired clusters by setting the
``n_clusters`` argument to an integer, as below. When we pass the
integer 10 to ``n_clusters``, 10 cluster labels are assigned.

Since we have not specified a desired clustering algorithm, K-Means is
again used by default.

.. code:: ipython3

    labels_10 = hyp.cluster(mushrooms, n_clusters=10)
    set(labels_10)

.. parsed-literal::

    {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

Different clustering models
---------------------------
You may prefer to use a clustering model other than K-Means. To do so,
pass a string naming the desired clustering algorithm to the ``cluster``
argument.

In this case, we specify only the clustering model (HDBSCAN). HDBSCAN
determines the number of clusters from the data, so ``n_clusters`` is
not passed.

.. code:: ipython3

    labels_HDBSCAN = hyp.cluster(mushrooms, cluster='HDBSCAN')

.. code:: ipython3

    # Color each point by its cluster label to compare the two results.
    geo = hyp.plot(mushrooms, '.', hue=labels_10, title='K-means clustering')
    geo = hyp.plot(mushrooms, '.', hue=labels_HDBSCAN, title='HDBSCAN clustering')

.. image:: cluster_files/cluster_20_0.png

.. image:: cluster_files/cluster_20_1.png
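
Unlike K-Means, HDBSCAN can leave some points unassigned. The ``hdbscan``
package marks such noise points with the label ``-1``. Assuming hypertools
passes those labels through unchanged (worth checking against your installed
version), a quick summary of an HDBSCAN result could look like the sketch
below. The label list is made up for illustration and is not output from the
mushrooms dataset.

.. code:: ipython3

    from collections import Counter

    # Hypothetical HDBSCAN-style labels; -1 denotes noise.
    # Replace with labels_HDBSCAN to summarize a real result.
    example_labels = [0, 0, 1, -1, 2, 1, -1, 0]

    counts = Counter(example_labels)
    n_noise = counts.pop(-1, 0)      # points not assigned to any cluster
    n_clusters_found = len(counts)   # remaining keys are real clusters

    print(f"{n_clusters_found} clusters, {n_noise} noise points")

Separating the noise count from the cluster count avoids treating ``-1`` as
if it were an ordinary cluster when comparing against the K-Means result.
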