Clustering with Hypertools
==========================

The cluster feature performs clustering analysis on the data (an array,
dataframe, or list) and returns a list of cluster labels.

The default clustering method is K-Means (argument 'KMeans'), with
MiniBatchKMeans, AgglomerativeClustering, Birch, FeatureAgglomeration,
SpectralClustering and HDBSCAN also supported.

Note that, if a list is passed, the arrays will be stacked and
clustering will be performed *across* all lists (not within each list).

Import Packages
---------------

.. code:: ipython3

    import hypertools as hyp
    from collections import Counter

    %matplotlib inline

Load your data
--------------

We will load one of the sample datasets. This dataset consists of 8,124
samples of mushrooms with various text features.

.. code:: ipython3

    geo = hyp.load('mushrooms')
    mushrooms = geo.get_data()

We can peek at the first few rows of the dataframe using the pandas
function ``head()``.

.. code:: ipython3

    mushrooms.head()

.. raw:: html
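
    <div>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>bruises</th><th>cap-color</th><th>cap-shape</th><th>cap-surface</th><th>gill-attachment</th>
          <th>gill-color</th><th>gill-size</th><th>gill-spacing</th><th>habitat</th><th>odor</th>
          <th>...</th>
          <th>ring-type</th><th>spore-print-color</th><th>stalk-color-above-ring</th><th>stalk-color-below-ring</th><th>stalk-root</th>
          <th>stalk-shape</th><th>stalk-surface-above-ring</th><th>stalk-surface-below-ring</th><th>veil-color</th><th>veil-type</th>
        </tr>
      </thead>
      <tbody>
        <tr><th>0</th><td>t</td><td>n</td><td>x</td><td>s</td><td>f</td><td>k</td><td>n</td><td>c</td><td>u</td><td>p</td><td>...</td><td>p</td><td>k</td><td>w</td><td>w</td><td>e</td><td>e</td><td>s</td><td>s</td><td>w</td><td>p</td></tr>
        <tr><th>1</th><td>t</td><td>y</td><td>x</td><td>s</td><td>f</td><td>k</td><td>b</td><td>c</td><td>g</td><td>a</td><td>...</td><td>p</td><td>n</td><td>w</td><td>w</td><td>c</td><td>e</td><td>s</td><td>s</td><td>w</td><td>p</td></tr>
        <tr><th>2</th><td>t</td><td>w</td><td>b</td><td>s</td><td>f</td><td>n</td><td>b</td><td>c</td><td>m</td><td>l</td><td>...</td><td>p</td><td>n</td><td>w</td><td>w</td><td>c</td><td>e</td><td>s</td><td>s</td><td>w</td><td>p</td></tr>
        <tr><th>3</th><td>t</td><td>w</td><td>x</td><td>y</td><td>f</td><td>n</td><td>n</td><td>c</td><td>u</td><td>p</td><td>...</td><td>p</td><td>k</td><td>w</td><td>w</td><td>e</td><td>e</td><td>s</td><td>s</td><td>w</td><td>p</td></tr>
        <tr><th>4</th><td>f</td><td>g</td><td>x</td><td>s</td><td>f</td><td>k</td><td>b</td><td>w</td><td>g</td><td>n</td><td>...</td><td>e</td><td>n</td><td>w</td><td>w</td><td>e</td><td>t</td><td>s</td><td>s</td><td>w</td><td>p</td></tr>
      </tbody>
    </table>
    <p>5 rows × 22 columns</p>
    </div>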
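
Every column shown above holds single-letter text codes, whereas clustering algorithms such as K-Means operate on numbers. One common way to bridge that gap is one-hot (dummy) encoding, where each (column, value) pair becomes its own 0/1 column. The sketch below illustrates the idea with the standard library only. The three-row table and the ``one_hot`` helper are hypothetical, and this is not necessarily how hypertools converts the data internally.

```python
# Hypothetical miniature of the mushrooms table: two text columns, three rows.
rows = [
    {"bruises": "t", "cap-color": "n"},
    {"bruises": "t", "cap-color": "y"},
    {"bruises": "f", "cap-color": "g"},
]


def one_hot(rows):
    """Return (names, matrix) with one 0/1 column per (feature, value) pair.

    Illustrative helper only; not part of hypertools.
    """
    # Collect every distinct (feature, value) pair; sort for a stable column order.
    columns = sorted({(key, value) for row in rows for key, value in row.items()})
    # A cell is 1 when the row has that value for that feature, otherwise 0.
    matrix = [
        [1 if row.get(key) == value else 0 for key, value in columns]
        for row in rows
    ]
    names = [f"{key}_{value}" for key, value in columns]
    return names, matrix


names, matrix = one_hot(rows)
print(names)
for encoded_row in matrix:
    print(encoded_row)
```

Each row of ``matrix`` is now a numeric vector, so distances between samples are well defined.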

Obtain cluster labels
---------------------

To obtain cluster labels, simply pass the data to ``hyp.cluster``. Since
we have not specified a desired number of clusters, the default of 3
clusters is used (labels 0, 1, and 2). Additionally, since we have not
specified a desired clustering algorithm, K-Means is used by default.

.. code:: ipython3

    labels = hyp.cluster(mushrooms)
    set(labels)

.. parsed-literal::

    {0, 1, 2}

We can further examine the number of datapoints assigned to each label.

.. code:: ipython3

    Counter(labels)

.. parsed-literal::

    Counter({0: 1296, 1: 5067, 2: 1761})

Specify number of cluster labels
--------------------------------

You can also specify the number of desired clusters by setting the
``n_clusters`` argument to an integer, as below. We can see that when we
pass the int 10 to ``n_clusters``, 10 cluster labels are assigned. Since
we have not specified a desired clustering algorithm, K-Means is used by
default.

.. code:: ipython3

    labels_10 = hyp.cluster(mushrooms, n_clusters=10)
    set(labels_10)

.. parsed-literal::

    {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

Different clustering models
---------------------------

You may prefer to use a clustering model other than K-Means. To do so,
simply pass a string to the ``cluster`` argument specifying the desired
clustering algorithm. In this case, we specify the clustering model
(HDBSCAN). No ``n_clusters`` value is passed in the call below: HDBSCAN
is density-based and determines the number of clusters from the data.

.. code:: ipython3

    labels_HDBSCAN = hyp.cluster(mushrooms, cluster='HDBSCAN')

.. code:: ipython3

    geo = hyp.plot(mushrooms, '.', hue=labels_10, title='K-means clustering')
    geo = hyp.plot(mushrooms, '.', hue=labels_HDBSCAN, title='HDBSCAN clustering')

.. image:: cluster_files/cluster_20_0.png

.. image:: cluster_files/cluster_20_1.png
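
The two plots compare the clusterings visually. We can also compare two labelings numerically by counting how often each pair of labels occurs on the same sample, reusing ``Counter`` as above. The sketch below uses short, made-up label lists in place of ``labels_10`` and ``labels_HDBSCAN``, and the ``cross_tab`` helper is hypothetical. The ``-1`` entry stands in for a "noise" label, which density-based methods commonly use for unassigned points.

```python
from collections import Counter

# Hypothetical label lists; a real run would yield one label per sample.
labels_a = [0, 0, 1, 1, 2, 2, 2]
labels_b = [5, 5, 5, 3, 3, -1, 3]  # -1 is an assumed "noise" marker


def cross_tab(a, b):
    """Count co-occurrences of (label in a, label in b) across samples.

    Illustrative helper only; not part of hypertools.
    """
    if len(a) != len(b):
        raise ValueError("label lists must have the same length")
    return Counter(zip(a, b))


table = cross_tab(labels_a, labels_b)
for (label_a, label_b), count in sorted(table.items()):
    print(f"a={label_a:>2}  b={label_b:>2}  n={count}")
```

If the two labelings largely agree, most of the counts fall on a few label pairs. A table spread across many pairs suggests the algorithms partition the data differently.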