A local way to create a word cloud

Jul 03 2013 | Filed under: information analysis, Information Science

There's a lot of debate about whether word clouds are useful visualizations, but they do give you some insight into the prevalence of words in a body of text.

Wordle is a way to do this online, but you have to paste your text into a box or provide a URL to an RSS or Atom feed. That won't work for some content because of sensitivity, and it also won't work for batches of files. Here's a way I stumbled upon when I was trying to do some other analysis/visualizations.

Needed:

  • R
  • packages: tm (and all of its many dependencies), wordcloud, RColorBrewer (optional)
  • data

I originally had a 470-page PDF that I wanted to get a feel for. It was divided into sections, each on a specific program. I ended up saving out each section as RTF by hand, although I could have done it programmatically in some other program. I tried saving directly to plain text, but that failed. I then used DocFrac to batch convert the RTF files to plain text (probably not necessary, but I wasn't sure). So then I had a directory of plain text files.

Load the packages, then read the whole directory into R:

library(tm)
library(wordcloud)
library(RColorBrewer)

thingy <- Corpus(DirSource("plaintext"), readerControl = list(reader = readPlain))
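A quick check that everything came in; length() just counts the documents in the corpus:

# how many documents made it into the corpus?
length(thingy)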

Make a copy in case of emergencies, but mostly to use at a later step:

thingy.orig <- thingy

Then I did a bunch of clean-up tasks pretty typical for this type of work. (There is an easier way to do it, but this way works. I didn't take into account the order of these steps, which I probably should have; lowercasing before removing stop words, for example, would catch more of them, since the built-in stop word list is all lowercase.)

# remove stop words
exchanger <- function(x) removeWords(x, stopwords("english"))
thingy <- tm_map(thingy, exchanger)

# stem
exchanger <- function(x) stemDocument(x)
thingy <- tm_map(thingy, exchanger)

# lower case
exchanger <- function(x) tolower(x)
thingy <- tm_map(thingy, exchanger)

# remove numbers
exchanger <- function(x) removeNumbers(x)
thingy <- tm_map(thingy, exchanger)

# use gsub to remove punctuation
# I used this instead of the built-in removePunctuation because I wanted it
# to leave a space where the punctuation had been
exchanger <- function(x) gsub('[[:punct:]]', ' ', x)
thingy <- tm_map(thingy, exchanger)
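One caveat: newer versions of tm are pickier about transformations that return plain character vectors and will complain that the result is not a text document. If you hit that, wrapping the non-tm functions in content_transformer() keeps the corpus structure intact. A sketch of the same clean-up under that assumption (with lowercasing moved before stop word removal, per the ordering note above):

# same clean-up for newer tm versions: functions that aren't tm's own
# transformations get wrapped in content_transformer()
thingy <- tm_map(thingy, content_transformer(tolower))
thingy <- tm_map(thingy, removeWords, stopwords("english"))
thingy <- tm_map(thingy, removeNumbers)
thingy <- tm_map(thingy, content_transformer(function(x) gsub('[[:punct:]]', ' ', x)))
thingy <- tm_map(thingy, stemDocument)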

In addition to the regular stop words, I removed a list of words that were particular to this text.

# custom list (fill in the words particular to your own text)
mystop <- c(" "," ")
exchanger <- function(x) removeWords(x, mystop)
thingy <- tm_map(thingy, exchanger)

Next you should take a peek at at least one specific document.

inspect(thingy[3])

And you can add more terms to the custom list or do a gsub or whatever.

Then you need to build a document-term matrix.

# create document-term matrix (one row per document, one column per term)
dtm <- DocumentTermMatrix(thingy)

And it needs to be in plain matrix format for the next step. I understand there's a limit to how big these can be; there are some options to make them smaller (one is sketched below).

dtm.mat <- as.matrix(dtm)
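One such option, if the full matrix is too big to handle, is tm's removeSparseTerms(), which drops terms that appear in only a small fraction of the documents before you convert. A sketch; the 0.8 threshold is just an example, tune it to taste:

# drop terms absent from more than 80% of the documents, then convert
dtm.small <- removeSparseTerms(dtm, 0.8)
dtm.mat <- as.matrix(dtm.small)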

Then you need a vector of all the words, ideally in order of decreasing frequency, plus the frequencies themselves.

v <- sort(colSums(dtm.mat), decreasing=TRUE)
myNames <- names(v)
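It's worth glancing at the top of that vector before drawing anything; the most frequent stems are also where candidates for the custom stop word list tend to turn up:

# the most frequent (stemmed) terms and their counts
head(v, 20)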

Then you pick your ColorBrewer color scheme and draw your cloud:

pal <- brewer.pal(12, "Paired")
# set a seed so the layout is reproducible
set.seed(300)
wordcloud(myNames, v, scale = c(8, .3), min.freq = 7,
          random.order = TRUE, random.color = TRUE,
          rot.per = .15, colors = pal)
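If you'd rather write the cloud to a file than to the plot window, wrapping the call in a graphics device works; a minimal sketch (the filename and dimensions are arbitrary):

# send the cloud to a PNG instead of the screen
png("wordcloud.png", width = 800, height = 800)
set.seed(300)
wordcloud(myNames, v, scale = c(8, .3), min.freq = 7,
          random.order = TRUE, random.color = TRUE,
          rot.per = .15, colors = pal)
dev.off()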

Now here's the issue: our words are still stemmed. Well, there is a stem completion function, stemCompletion, which goes through and finds the most prevalent complete version of each word in a dictionary. In the examples, people often use the original corpus as the dictionary... the problem is that it's very, very, very slow. In fact, it can take days. I have some options I'm considering to get around this:

  1. try the wordnet package
  2. use Python NLTK to lemmatize prior to doing the R steps (lemmas are real words, unlike stems)
  3. change the script so it doesn't iterate so much

stemCompletion ran for a week on my computer before an update forced a reboot... I haven't had a chance to try the other methods yet.
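For reference, the call looks roughly like the sketch below, using the unstemmed copy of the corpus saved at the beginning (thingy.orig) as the dictionary; that lookup against the whole corpus is exactly what makes it so slow:

# complete each stemmed term against the original, unstemmed corpus;
# "prevalent" picks the most frequent matching completion
completed <- stemCompletion(myNames, dictionary = thingy.orig, type = "prevalent")

If it ever finishes, completed can stand in for myNames in the wordcloud() call.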
