{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Clustering with Hypertools" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cluster feature performs clustering analysis on the data (an arrray, dataframe, or list) and returns a list of cluster labels. \n", "\n", "The default clustering method is K-Means (argument 'KMeans') with MiniBatchKMeans, AgglomerativeClustering, Birch, FeatureAgglomeration, SpectralClustering and HDBSCAN also supported.\n", "\n", "Note that, if a list is passed, the arrays will be stacked and clustering will be performed *across* all lists (not within each list)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import Packages" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import hypertools as hyp\n", "from collections import Counter\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load your data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will load one of the sample datasets. This dataset consists of 8,124 samples of mushrooms with various text features." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "geo = hyp.load('mushrooms')\n", "mushrooms = geo.get_data()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can peek at the first few rows of the dataframe using the pandas function `head()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mushrooms.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Obtain cluster labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To obtain cluster labels, simply pass the data to `hyp.cluster`. Since we have not specified a desired number of cluster, the default of 3 clusters is used (labels 0, 1, and 2). Additionally, since we have note specified a desired clustering algorithm, K-Means is used by default." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labels = hyp.cluster(mushrooms)\n", "set(labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can further examine the number of datapoints assigned each label." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Counter(labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Specify number of cluster labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also specify the number of desired clusters by setting the `n_clusters` argument to an integer number of clusters, as below. We can see that when we pass the int 10 to n_clusters, 10 cluster labels are assigned. \n", "\n", "Since we have note specified a desired clustering algorithm, K-Means is used by default." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labels_10 = hyp.cluster(mushrooms, n_clusters = 10)\n", "set(labels_10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Different clustering models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may prefer to use a clustering model other than K-Means. To do so, simply pass a string to the cluster argument specifying the desired clustering algorithm.\n", "\n", "In this case, we specify both the clustering model (HDBSCAN) and the number of clusters (10)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "labels_HDBSCAN = hyp.cluster(mushrooms, cluster='HDBSCAN')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "geo = hyp.plot(mushrooms, '.', hue=labels_10, title='K-means clustering')\n", "geo = hyp.plot(mushrooms, '.', hue=labels_HDBSCAN, title='HCBSCAN clustering')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 1 }