hypertools.tools.format_data

hypertools.tools.format_data(x, vectorizer='CountVectorizer', semantic='LatentDirichletAllocation', corpus='wiki', ppca=True, text_align='hyper')

Formats data into a list of numpy arrays

This function accepts numerical arrays, dataframes, text, or a mixed list of these, and returns each dataset as a numpy array. Text is converted to a numerical matrix using the vectorizer and semantic models. If ppca is True, missing values (NaNs) in the numerical data are filled in using probabilistic PCA (PPCA).

Parameters
x : numpy array, dataframe, string or (mixed) list

The data to convert

vectorizer : str, dict, class or class instance

The vectorizer to use. Built-in options are 'CountVectorizer' or 'TfidfVectorizer'. To change the default parameters, pass a dictionary, e.g. {'model': 'CountVectorizer', 'params': {'max_features': 10}}. See http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text for details. You can also supply your own vectorizer as a class or a class instance; in either case it must have a fit_transform method (see http://scikit-learn.org/stable/data_transforms.html). If a class, pass any parameters as a dictionary to vectorizer_params. If a class instance, no parameters can be passed.

semantic : str, dict, class or class instance

The text model used to transform text data. Built-in options are 'LatentDirichletAllocation' or 'NMF' (default: 'LatentDirichletAllocation'). To change the default parameters, pass a dictionary, e.g. {'model': 'NMF', 'params': {'n_components': 10}}. See http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition for details on the two model options. You can also supply your own text model as a class or a class instance; in either case it must have a fit_transform method (see http://scikit-learn.org/stable/data_transforms.html). If a class, pass any parameters as a dictionary to text_params. If a class instance, no parameters can be passed.

corpus : list (or list of lists) of text samples, or 'wiki', 'nips', 'sotus'

Text used to fit the semantic model (optional). If set to 'wiki', 'nips' or 'sotus' and the default semantic and vectorizer models are used, a pretrained model is loaded, which can save considerable fitting time.

ppca : bool

If True, uses probabilistic PCA (PPCA) to fill in missing values (default: True)

text_align : str

Alignment algorithm to use when both text and numerical data are passed (default: 'hyper', i.e. hyperalignment). If the numerical arrays have the same shape and the text data contain the same number of samples, the text and numerical data are automatically aligned to a common space. Example use case: an array of movie frames (frames by pixels) and a text description of each frame. In this case, the movie and text are aligned to the same space.
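The dictionary form of the vectorizer and semantic parameters can be sketched as follows. The helper names text_model_kwargs and format_text are hypothetical and not part of hypertools. Passing the text itself as corpus is an assumption based on the corpus description above (a list of text samples), chosen so that the models are fit on the supplied text rather than on a built-in corpus.

```python
def text_model_kwargs(max_features=10, n_components=10):
    """Build keyword arguments that override the default text-model parameters.

    Hypothetical helper: it only assembles the {'model': ..., 'params': ...}
    dictionaries described in the vectorizer and semantic entries.
    """
    return {
        "vectorizer": {
            "model": "CountVectorizer",
            "params": {"max_features": max_features},
        },
        "semantic": {
            "model": "NMF",
            "params": {"n_components": n_components},
        },
    }


def format_text(docs, **kwargs):
    """Format a list of text samples (hypothetical wrapper).

    hypertools is imported lazily so that this sketch can be loaded
    even when the package is not installed.
    """
    import hypertools as hyp

    # Assumption: fit the models on the passed text instead of 'wiki'.
    return hyp.tools.format_data(docs, corpus=docs, **kwargs)


# Example call (requires hypertools to be installed):
# docs = ["the cat sat on the mat", "dogs chase cats", "the mat is red"]
# arrays = format_text(docs, **text_model_kwargs(max_features=10, n_components=2))
```

Keeping the parameter dictionaries in one helper makes it easy to reuse the same text-model settings across several calls.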

Returns
data : list of numpy arrays

A list of formatted arrays
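Examples

A minimal sketch of formatting numerical data that contains missing values. The helper names make_data_with_gaps and format_numeric, the array size, and the positions of the NaNs are illustrative assumptions. Only the hyp.tools.format_data call and its ppca argument come from the signature above.

```python
import numpy as np


def make_data_with_gaps(n_rows=20, n_cols=4, seed=0):
    """Create a random (n_rows x n_cols) array with two missing entries."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_rows, n_cols))
    x[3, 1] = np.nan   # illustrative missing value
    x[10, 2] = np.nan  # illustrative missing value
    return x


def format_numeric(x, ppca=True):
    """Pass a numerical array through format_data (hypothetical wrapper).

    hypertools is imported lazily so that this sketch can be loaded
    even when the package is not installed.
    """
    import hypertools as hyp

    # With ppca=True, missing values are filled in using PPCA.
    return hyp.tools.format_data(x, ppca=ppca)


# Example call (requires hypertools to be installed):
# x = make_data_with_gaps()
# data = format_numeric(x)  # a list of numpy arrays
```

To see whether any missing values remain, inspect the returned arrays with np.isnan.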