hypertools.tools.format_data

hypertools.tools.format_data(x, vectorizer='CountVectorizer', semantic='LatentDirichletAllocation', corpus='wiki', ppca=True, text_align='hyper')

Formats data into a list of numpy arrays

This function converts numpy arrays, dataframes, text, or mixed lists of these into a list of numpy arrays. Text is vectorized and then transformed with the semantic model, and missing values in numerical data are filled in using PPCA when ppca is True.

Parameters:

x (numpy array, dataframe, string or (mixed) list): The data to convert.
vectorizer (str, dict, class or class instance): The vectorizer to use. Built-in options are 'CountVectorizer' or 'TfidfVectorizer'. To change default parameters, set to a dictionary, e.g. {'model': 'CountVectorizer', 'params': {'max_features': 10}}. See http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text for details. You can also specify your own vectorizer model as a class or class instance. With either option, the class must have a fit_transform method (see http://scikit-learn.org/stable/data_transforms.html). If a class, pass any parameters as a dictionary to vectorizer_params. If a class instance, no parameters can be passed.
semantic (str, dict, class or class instance): Text model to use to transform text data. Built-in options are 'LatentDirichletAllocation' or 'NMF' (default: LDA). To change default parameters, set to a dictionary, e.g. {'model': 'NMF', 'params': {'n_components': 10}}. See http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition for details on the two model options. You can also specify your own text model as a class or class instance. With either option, the class must have a fit_transform method (see http://scikit-learn.org/stable/data_transforms.html). If a class, pass any parameters as a dictionary to text_params. If a class instance, no parameters can be passed.
corpus (list (or list of lists) of text samples, or 'wiki', 'nips', 'sotus'): Text to use to fit the semantic model (optional). If set to 'wiki', 'nips' or 'sotus' and the default semantic and vectorizer models are used, a pretrained model will be loaded, which can save a lot of time.
ppca (bool): Performs PPCA to fill in missing values (default: True).
text_align (str): Alignment algorithm to use when both text and numerical data are passed. If numerical arrays have the same shape, and the text data contains the same number of samples, the text and numerical data are automatically aligned to a common space. Example use case: an array of movie frames (frames by pixels) and text descriptions of the frame. In this case, the movie and text will be automatically aligned to the same space (default: hyperalignment).
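
The dictionary form of the vectorizer and semantic arguments can be sketched as follows. This is a minimal illustration, assuming hypertools and scikit-learn are installed; the helper names model_spec and format_text are hypothetical and not part of hypertools. The corpus argument is left at its default.

```python
def model_spec(model, **params):
    """Build the {'model': ..., 'params': ...} dictionary form.

    Hypothetical helper for illustration; not part of hypertools.
    """
    return {"model": model, "params": params}


def format_text(samples):
    """Format a list of text samples with non-default model parameters.

    Hypothetical wrapper; assumes hypertools is installed.
    """
    import hypertools as hyp  # imported lazily, only when this is called

    return hyp.tools.format_data(
        samples,
        # Same dictionaries as the examples given for each parameter
        vectorizer=model_spec("CountVectorizer", max_features=10),
        semantic=model_spec("NMF", n_components=10),
    )
```

Calling format_text(["first document", "second document"]) would then pass both dictionaries through to format_data.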

Returns:

data (list of numpy arrays): A list of formatted arrays.
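
Because the return value is a list of arrays, one quick way to check a result is to look at the shape of each array. A minimal sketch; describe is a hypothetical helper, not part of hypertools, and it relies only on the shape attribute that numpy arrays provide.

```python
def describe(data):
    """Return the shape of every array in a list such as the one format_data returns.

    Hypothetical helper for illustration; not part of hypertools.
    """
    return [tuple(arr.shape) for arr in data]
```

For example, describe(hypertools.tools.format_data(x)) lists one shape per formatted array.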