{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Normalization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`normalize` is a helper function that z-scores your data. This is useful if your features (columns) are scaled differently within or across datasets. By default, hypertools normalizes the columns *across* all datasets passed, but you can also normalize the columns *within* each individual dataset. Alternatively, you can normalize each row. The function returns an array or list of arrays in which the columns or rows are z-scored (the output type matches the input type)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import packages" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import hypertools as hyp\n", "import numpy as np\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate synthetic data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we generate two sets of synthetic data. We draw the points for each set at random from its own multivariate normal distribution, so the two sets differ in their means and covariances."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [], "source": [ "# random matrices used to build the two covariance matrices\n", "x1 = np.random.randn(10, 10)\n", "x2 = np.random.randn(10, 10)\n", "\n", "# x times its transpose is symmetric positive semidefinite, so it is a valid covariance\n", "c1 = np.dot(x1, x1.T)\n", "c2 = np.dot(x2, x2.T)\n", "\n", "# means: all zeros for the first set, all tens for the second\n", "m1 = np.zeros([1, 10])\n", "m2 = 10 + m1\n", "\n", "# draw 100 samples (rows) of 10 features (columns) for each set\n", "data1 = np.random.multivariate_normal(m1[0], c1, 100)\n", "data2 = np.random.multivariate_normal(m2[0], c2, 100)\n", "\n", "data = [data1, data2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "geo = hyp.plot(data, '.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normalizing (specified columns or rows)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To specify the type of normalization, pass one of the following options as a string to the `normalize` argument, as shown in the examples below.\n", "\n", "+ 'across' - columns z-scored across passed lists (default)\n", "+ 'within' - columns z-scored within passed lists\n", "+ 'row' - rows z-scored" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalizing 'across'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you normalize 'across', all of the data are stacked, and the columns of the combined dataset are normalized. The data are then split back into the original separate elements." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "norm = hyp.normalize(data, normalize='across')\n", "geo = hyp.plot(norm, '.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalizing 'within'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you normalize 'within', the columns of each element of the data are normalized separately.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "norm = hyp.normalize(data, normalize = 'within')\n", "geo = hyp.plot(norm, '.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalizing by 'row'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "norm = hyp.normalize(data, normalize = 'row')\n", "geo = hyp.plot(norm, '.')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }