Using R to expand Twitter URLS

So this should be so simple and obvious that it's not worth a post, but I keep forgetting how to do everything so I'm gonna put this here to remind myself.

Here's where I am. I have a list of 4011 tweets with #agu12 or #agu2012 hashtag. A lot of these are coding as "pointers" - their main function is to direct readers' attention somewhere else. So I got to wondering: where?  Are they directing people to online versions of the posters? Are they just linking to more NASA press releases?  % going to a .edu?

Of course all the URLs are shortened and there are services you can use to expand, but in R, it's already right there in the TwitteR package as


This uses the API . All you have to do is plug in the URL. Cool!

So here was my original plan: find the tweets with urls, extract the urls, expand them, profit! And I was going to do all this in R. But then it got a little ridiculous.
So instead I: used open refine to find all the urls, then assigned IDs to all the records, and then used filtering and copy and pasting to get them all in two columns ID, URL.

Issues: non-printing characters (Excel has a clean command), extra spaces (trim - didn't really work so I did a find and replace), random commas (some needed to be there), random other punctuation (find and replace), #sign

The idea in R was to do a for loop to iterate through each url, expand it, append it to a vector (or concatenate, whichever), then add that to the dataframe and do stats on it or maybe just export to Excel and monkey with it there.

For loop, fine, append - not for love nor money despite the fact that I have proof that I successfully did it in my Coursera class. I don't know. And the API was failing for certain rows. For a couple of rows, I found more punctuation. Then I found the rest of the issues were really all about length. They don't expect shortened urls to be long (duh)!  So then I had to pick a length, and only send ones shorter than that (50) to the api. I finally gave up with the stupid append, and I just printed them to the screen and copied them over to Excel. Also I cheated with how long the for loop had to be - I should have been able to just say the number of rows in the frame but meh.
Anyhow, this worked:

 setwd("~/ mine")
library("twitteR", lib.loc="C:/Users/Christina/Documents/R/win-library/3.0")
#get the data
data <- read.csv("agu12justurl.csv", colClasses = "character")
#check it out
#test a single one
#this was for me trying to append, sigh
full.vec <- vector(mode="character")
#create a vector to put the new stuff in, then I'll append to the data frame, I hope
#check the for loop 
 for (i in 1:200){print(data.sub$url[i])}
#that works
for (i in 1:3){print(decode_short_url(data.sub$url[i]))}
#that works - good to know, though, that if it can't be expanded it comes back null

#appending to the vector is not working, but printing is so will run with that 
for (i in 1:1502){ if(nchar(data$url[i])>50){
 } else {

If anyone wants to tell me what I'm doing wrong with the append, it would be appreciated. I'm sure it must be obvious.

So what's the answer? Not sure. I'll probably do a post on string splitting and counting... OR I'll be back in Open Refine. How do people only ever work in one tool?


3 responses so far

  • Christina Pikas says:

    Aw darn. I could just have easily done this

    I just tried it and it worked. oh well! I don't want to abuse the api so if I want that data in R, I'll probably just import it.

  • Bob O'H says:

    The problem with append is that you're trying to use it rather than vectorise the for() loop.

    Offhand (and on my iPhone) I can't remember if if() vectorises. If not, just use sapply():

    Shorten=function(URL) ifelse(!nchar(URL)>50) decode_short_url(URL), URL)
    full.url=sapply(data$url, Shorten)

    This will also be quicker: the way you wrote it, R has to repeatedly allocate memory to lengthening vectors.

    • Christina Pikas says:

      That's brilliant. So elegant! It didn't occur to me to approach it that way.