Visualizing text

[ ]:

import hypertools as hyp
import wikipedia as wiki
%matplotlib inline

In this example, we will download some text from Wikipedia, split it into chunks, and then plot it. We will use the wikipedia package to retrieve the wiki pages for ‘dog’ and ‘cat’.

[ ]:

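def chunk(s, count):
    # Split s into consecutive pieces of `count` characters; leftover characters are dropped
    return [''.join(x) for x in zip(*[list(s[z::count]) for z in range(count)])]

# Number of chunks to cut each page into (despite the name, not a character count)
chunk_size = 5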

try:
    dog_text = wiki.page('Domestic dog').content
    cat_text = wiki.page('Domestic cat').content
except Exception:
    # Fallback to simpler approach if Wikipedia fails
    dog_text = "Dogs are domesticated mammals, not natural wild animals. They were originally bred from wolves. They have been bred by humans for a long time, and were the first animals ever to be domesticated."
    cat_text = "Cats are small carnivorous mammals. They are the only domesticated species in the family Felidae and are often referred to as domestic cats to distinguish them from the wild members of the family."

dog = chunk(dog_text, max(1, int(len(dog_text)/chunk_size)))
cat = chunk(cat_text, max(1, int(len(cat_text)/chunk_size)))
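
The chunk helper above is compact but a little hard to read. As a minimal sketch of what it does, the block below compares it with a plain-slicing version on a short string. The name chunk_by_slicing is hypothetical and is not part of hypertools or the original notebook.

```python
def chunk(s, count):
    # Same helper as in the cell above: consecutive pieces of `count` characters.
    return [''.join(x) for x in zip(*[list(s[z::count]) for z in range(count)])]

def chunk_by_slicing(s, count):
    # Hypothetical equivalent using plain slicing; only full pieces are kept.
    n_full = len(s) // count
    return [s[i * count:(i + 1) * count] for i in range(n_full)]

text = "abcdefghijklm"  # 13 characters
pieces = chunk(text, 4)
print(pieces)  # ['abcd', 'efgh', 'ijkl'] (the trailing 'm' is dropped)
print(chunk_by_slicing(text, 4) == pieces)  # True
```

Note that the second argument is the length of each piece in characters, which is why the notebook passes the text length divided by the desired number of chunks.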

Below is a snippet of text from the dog Wikipedia page. As you can see, the word ‘dog’ appears in many of the sentences, along with related words such as ‘wolf’ and ‘carnivore’.

[ ]:

dog[0][:1000]

Now we will simply pass the text samples as a list to hyp.plot. By default, hypertools transforms the text data using a topic model that was fit on a variety of Wikipedia pages. Specifically, the text is vectorized using the scikit-learn CountVectorizer and then passed to a LatentDirichletAllocation model to estimate topics. As can be seen below, the 5 chunks of text from each of the dog and cat wiki pages cluster together, suggesting that the two pages are made up of distinct topics.

[ ]:

hue = ['dog'] * len(dog) + ['cat'] * len(cat)
geo = hyp.plot(dog + cat, 'o', hue=hue, size=[8, 6])
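
To make the vectorization step more concrete, here is a minimal standard-library sketch of a bag-of-words count matrix, which is roughly the kind of representation a count vectorizer produces. It is a simplified stand-in, not the scikit-learn implementation, and the helper names tokenize and count_matrix are hypothetical. The topic-modelling step that hypertools applies afterwards is not shown.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters only; a simplification of real tokenizers.
    return re.findall(r"[a-z]+", text.lower())

def count_matrix(docs):
    # Build a sorted vocabulary, then one row of word counts per document.
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[word] for word in vocab] for c in counts]
    return vocab, rows

docs = ["Dogs chase cats.", "Cats chase mice, and mice hide."]
vocab, rows = count_matrix(docs)
print(vocab)  # ['and', 'cats', 'chase', 'dogs', 'hide', 'mice']
print(rows)   # [[0, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 2]]
```

Each row is one text sample and each column is one vocabulary word, so samples that use similar words end up with similar rows.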

Now, let’s add a third, very different topic to the plot.

[ ]:

try:
    bball_text = wiki.page('Basketball').content
except Exception:
    # Fallback if Wikipedia fails
    bball_text = "Basketball is a team sport in which two teams, most commonly of five players each, opposing one another on a rectangular court, compete with the primary objective of shooting a basketball through the defender's hoop."

bball = chunk(bball_text, max(1, int(len(bball_text)/chunk_size)))

hue=['dog']*len(dog)+['cat']*len(cat)+['bball']*len(bball)
geo = hyp.plot(dog + cat + bball, 'o', hue=hue, labels=hue, size=[8, 6])

As you might expect, the cat and dog text chunks are closer to each other than to basketball in this topic space. Since cats and dogs are both animals, they share many more features (and thus are described with similar text) than basketball.
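
The intuition that texts about related subjects share more words can be checked with a rough sketch. The block below compares three made-up sentences using cosine similarity on raw word counts. This is an assumption-laden simplification: hypertools compares samples in a topic space rather than on raw counts, and the helper names and example sentences here are hypothetical.

```python
import math
import re
from collections import Counter

def word_counts(text):
    # Lowercased word counts; a crude stand-in for a proper vectorizer.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    # Cosine of the angle between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

dog_counts = word_counts("Dogs are domesticated mammals bred by humans")
cat_counts = word_counts("Cats are small domesticated mammals kept by humans")
bball_counts = word_counts("Basketball is a team sport played on a court")

sim_dog_cat = cosine_similarity(dog_counts, cat_counts)
sim_dog_bball = cosine_similarity(dog_counts, bball_counts)
print(sim_dog_cat > sim_dog_bball)  # True
```

The two animal sentences share several words, while the basketball sentence shares none with the dog sentence, so its similarity is zero.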

Visualizing NIPS papers

The next example is a dataset of all NIPS papers published from 1987. In the full dataset, the topic model is fit on, and then used to transform, the text of each paper. The cell below plots a few short sample excerpts in the style of NIPS papers as a stand-in for demonstration.

[ ]:

# Create sample NIPS-style academic paper excerpts for demonstration
sample_nips_papers = [
    "We present a novel approach to machine learning using neural networks.",
    "Deep learning has revolutionized computer vision and natural language processing.",
    "Our method achieves state-of-the-art results on benchmark datasets.",
    "We propose a new algorithm for efficient training of large neural networks."
]

geo = hyp.plot(sample_nips_papers, size=[8, 6])

Visualizing Wikipedia pages

Here, we will plot a collection of Wikipedia pages, transformed using a topic model (the default ‘wiki’ model) that was fit on the same articles. We will reduce the dimensionality of the data with TSNE, and then discover clusters with the ‘HDBSCAN’ algorithm.

[ ]:

# Create sample Wikipedia-style text for demonstration
sample_wiki_pages = [
    "Machine learning is a method of data analysis that automates analytical model building.",
    "Artificial intelligence is intelligence demonstrated by machines, in contrast to natural intelligence.",
    "Neural networks are computing systems vaguely inspired by biological neural networks.",
    "Deep learning is part of a broader family of machine learning methods based on artificial neural networks."
]

geo = hyp.plot(sample_wiki_pages, size=[8, 6])

Visualizing State of the Union Addresses

In this example we will plot each State of the Union address from 1989 to present. The dots are colored and labeled by president. The semantic model used to transform the text is the default ‘wiki’ model, which is a CountVectorizer->LatentDirichletAllocation pipeline fit with a selection of Wikipedia pages. As you can see below, the points generally seem to cluster by president, but also by party affiliation (Democrats mostly on the left and Republicans mostly on the right).

[ ]:

# Use sample SOTUS text data for demonstration
sample_sotus = [
    "My fellow Americans, the state of our union is strong.",
    "We face challenges, but we face them together as one nation.",
    "Our economy is growing, and jobs are being created.",
    "We must continue to work for the betterment of all Americans."
]

geo = hyp.plot(sample_sotus, size=[10,8])

Changing the reduction model

These data are reduced with PCA. Want to visualize using a different algorithm? Simply change the reduce parameter. This gives a different, but equally interesting, lower-dimensional representation of the data.

[ ]:

hyp.plot(sample_sotus, reduce='UMAP', size=[10, 8])

Defining a corpus

Now let’s change the corpus used to train the text model. Specifically, we’ll use ‘nips’-style text, drawn from scientific papers. To use the built-in NIPS corpus, set corpus='nips'. You can also supply your own text (as a list of text samples) to train the model, which is what the cell below does with sample_nips_papers.

[ ]:

# Plot with a different corpus: here, a custom list of text samples
hyp.plot(sample_sotus, reduce='UMAP', corpus=sample_nips_papers, size=[10, 8])

Interestingly, plotting the data transformed by a different topic model (trained on scientific articles) gives a totally different representation of the data. This is because the themes extracted from a homogeneous set of scientific articles are distinct from the themes extracted from a diverse set of Wikipedia articles, so the resulting transformation function differs.