The picture at the start of this chapter is a so-called "word cloud" that was generated by examining all of the words returned from a Twitter search of the term "data science" (using a web application at http://www.jasondavies.com). These colorful word clouds are fun to look at, but they also contain some useful information. The geometric arrangement of words in the figure is partly random and partly designed and organized to please the eye; the same is true of the colors. The font size of each word, however, conveys some measure of its importance in the "corpus" of words that was presented to the word cloud graphics program. Corpus, from the Latin word meaning "body," is the term text analysts use to refer to a body of text material, often consisting of one or more documents. When thinking about a corpus of textual data, a set of documents could really be anything: web pages, word processing documents on your computer, a set of tweets, or government reports. In most cases, text analysts think of a collection of documents, each of which contains some natural language text, as a corpus if they plan to analyze all the documents together.
The word cloud on the previous page shows that "Data" and "Science" are certainly important terms that came from the Twitter search, but the search results contained dozens of less important, but perhaps equally interesting, words. We see words like algorithms, molecules, structures, and research, all of which could make sense in the context of data science. We also see other terms, like #christian, Facilitating, and Coordinator, that don't seem to have the same obvious connection to our original search term "data science." This small example illustrates one of the fundamental challenges of natural language processing and the closely related area of search: ensuring that the analysis of text produces results that are relevant to the task the user has in mind.
In this chapter we will use some new R packages to extend our
abilities to work with text and to build our own word cloud from
data retrieved from Twitter. If you have not yet worked through the preceding chapter, "String Theory," you should probably do so before continuing, as we build on the skills developed there.
Depending upon where you left off after the previous chapter, you will need to retrieve and pre-process a set of tweets, using some of the code you already developed as well as some new code. At the end of the previous chapter, we provided sample code for the TweetFrame() function, which takes a search term and a maximum tweet limit and returns a time-sorted dataframe containing tweets. Although there are a number of comments in that code, there are really only three lines of functional code, thanks to the power of the twitteR package to retrieve data from Twitter for us. For the activities below, we are still working with the dataframe that we retrieved in the previous chapter using this command:
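For reference, the function can be sketched roughly as follows. This is a minimal reconstruction, not the exact code from the previous chapter; it assumes the twitteR package is installed and that you have already authenticated with Twitter:

```r
library(twitteR)

# Minimal sketch of TweetFrame(): retrieve tweets matching a search
# term and return them as a time-sorted dataframe.
TweetFrame <- function(searchTerm, maxTweets) {
  # Query Twitter for up to maxTweets matching tweets
  tweetList <- searchTwitter(searchTerm, n = maxTweets)
  # Convert the list of status objects into a dataframe
  tweetDF <- twListToDF(tweetList)
  # Return the rows sorted in ascending order of creation time
  return(tweetDF[order(as.integer(tweetDF$created)), ])
}
```

The sorting step uses the `created` column that twListToDF() produces, so the oldest tweet ends up in the first row of the result.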
tweetDF <- TweetFrame("#solar", 100)
This yields a dataframe, tweetDF, that contains 100 tweets with the hashtag #solar, presumably mostly about solar energy and related "green" topics. Before beginning our work with the two new R packages, we can improve the quality of our display by taking out a lot of the junk that would not make sense to show in the word cloud. To accomplish this, we have authored another function that strips out extra spaces, gets rid of all URL strings, takes out the retweet header if one exists in the tweet, removes hashtags, and eliminates references to other people's tweet handles. For all of these transformations, we have used string replacement functions from the stringr package that was introduced in the previous chapter.
example of one of these transformations, consider this command,