## Filling in missing data: books

May 17 2018 Published by under bibliometrics, information analysis

Photo by Janko Ferlič on Unsplash

My normal area for information analysis and bibliometrics is technology - not even really science, but technology. The current project I'm obsessing over (it's actually really cool and interesting and fun) spans political science, history, philosophy, and even some sociology and criminal justice. So I played all my reindeer games on the journal articles, but when it came to analyzing the abstracts, a lot were missing. I wanted to ignore these, but they included whole collections from extremely relevant journals and also nearly all of the books. For the journals, I filled in what I could from Microsoft Academic (no endorsement intended).

Books though.

I hoped I could use WorldCat - but my library doesn't have a cataloging subscription with them, so I don't think that's an option? I hoped LoC would work - I could download ALL records from before about 2014, which would help a lot, but wow, I can't really deal with that much data right now. OpenLibrary - no.

Amazon - they've changed the rules a lot, and it wasn't clear to me that, even if I took all the time to figure it out, I wouldn't actually be breaking the current terms of service (they have changed).

I asked the wonderful group on the Code4Lib list - one of them (thanks Mark!) pointed to Harvard's API and records. Woo-hoo! I hope they log my stats to justify their expense. I hope my messing around doesn't cause them any problems.

I'm not a cataloger, although I played one for 4 months about 15 years ago. And I don't know MODS or Dublin Core (although Dublin Core did exist when I was in library school). I pulled a couple of records and poked through them to see where what I wanted was located. Originally, I pulled the title from the record, but that proved to be too messy.

The data I needed to fill in:

• gathered from reference managers of a team of analysts over a number of years (as many as 9)
• searched in general (WoS, Scopus) and specific (World Political Science Abs, PAIS, etc) databases
• gathered in EndNote, de-duped, exported as tab-delimited

Then, in R:

• limited to books and book sections
• for titles, removed punctuation, made lower case, removed numbers
• for authors, split to take first name provided of the first author (mostly the last name)
• for ISBNs, removed ISSNs, kept the first one, and took out the hyphens (yes, to be more thorough, I should have OR'd them)
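The title/author/ISBN cleanup steps can be sketched like this - the column names and the sample record are made up for illustration, not the real EndNote export fields:

```r
# made-up example record from a tab-delimited export
refs <- data.frame(TI = "The Logic of Violence: 2nd ed.",
                   AU = "Smith, Jane; Doe, John",
                   SN = "978-0-521-67004-3",
                   stringsAsFactors = FALSE)

# titles: lower case, no punctuation, no numbers
refs$holdti <- gsub("[[:digit:]]", "", gsub("[[:punct:]]", "", tolower(refs$TI)))
# authors: first name provided of the first author (usually the last name)
refs$hold1au <- sapply(strsplit(refs$AU, "[,;] ?"), `[`, 1)
# ISBNs: strip the hyphens
refs$holdisbn2 <- gsub("-", "", refs$SN)

refs$holdti    # "the logic of violence nd ed"
refs$hold1au   # "Smith"
refs$holdisbn2 # "9780521670043"
```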

getharvabs <- function(citeNo){
  #if we have an ISBN, search by identifier; otherwise by title and first author
  if (!is.na(missingbooks1$holdisbn2[citeNo])){
    query<-paste0("identifier=",missingbooks1$holdisbn2[citeNo],"&limit=1")
    holding<-GET(harvardurl, query=query)
    print(holding$status_code)
    if (holding$status_code==200){
      bookContent <- rawToChar(holding$content)
      booktxt<-fromJSON(bookContent)
      if (booktxt$pagination$numFound==0){
        citerow<-data.frame(id=citeNo, TI=missingbooks1$X4[citeNo], AB=NA)
      } else {
        #the MODS field is literally named "#text", so it needs backticks
        if (is.null(booktxt$items$mods$abstract$`#text`)){
          booktxt$items$mods$abstract$`#text`<-NA
        } else {
          booktxt$items$mods$abstract$`#text`<-as.character(booktxt$items$mods$abstract$`#text`)
        }
        citerow<-data.frame(id=citeNo, TI=missingbooks1$X4[citeNo],
                            AB=booktxt$items$mods$abstract$`#text`)
      }
    }
  } else {
    query<-paste0("title=",missingbooks1$holdti[citeNo],
                  "&name=",missingbooks1$hold1au[citeNo],"&limit=1")
    holding<-GET(harvardurl, query=query)
    print(holding$status_code)
    if (holding$status_code==200){
      bookContent <- rawToChar(holding$content)
      booktxt<-fromJSON(bookContent)
      if (booktxt$pagination$numFound==0){
        citerow<-data.frame(id=citeNo, TI=missingbooks1$X4[citeNo], AB=NA)
      } else {
        if (is.null(booktxt$items$mods$abstract$`#text`)){
          booktxt$items$mods$abstract$`#text`<-NA
        } else {
          booktxt$items$mods$abstract$`#text`<-as.character(booktxt$items$mods$abstract$`#text`)
        }
        citerow<-data.frame(id=citeNo, TI=missingbooks1$X4[citeNo],
                            AB=booktxt$items$mods$abstract$`#text`)
      }
    }
  }
  print("this is citerow")
  print(citeNo)
  print(citerow)
  return(citerow)
}

holdbookabs<-data.frame(id=as.integer(0),TI=as.character("test"), AB=as.character("test"))

for (i in 1:length(missingbooks1$holdti)){
  temp<-getharvabs(i)
  holdbookabs<-rbind(holdbookabs,temp)
  Sys.sleep(2)
}

This choked a couple of times. Maybe I don't always have the right book - but then again, if it's on the same subject... meh. I also considered getting the LCSH, but decided against it. Too much time spent on this already.

## Getting abstracts back from Microsoft Academic's Inverted Index

May 14 2018 Published by under bibliometrics, information analysis

Some time ago, I posted about using Microsoft Academic to fill in missing data from other searches. Jamie and I were going to do a package to wrap the API, but bureaucracy more or less killed our enthusiasm (well, not his - that would be impossible). Here I am obsessing over a really, really cool bibliometrics project, and I have lots of citations missing abstracts. I'm sort of thinking I won't be able to do much with the books, even though catalogs seem to have descriptions for a lot of them (happy to take suggestions). I've already looked at using other sources, so I'm back at Academic. I pulled out my script, found I'd lost my keys, retrieved new keys, found there's a new endpoint URL, updated that, and hit go...

Ladies and gentlemen, they moved the abstracts... no more paragraph; you now get an "inverted index." People who studied information retrieval may know what that is, but in this case it's a list of terms, each with a numeric vector of the locations where the term appears. Stop words are included, so "the" might have 20 locations and "sassafras" just 1.

Here it is.
Jamie helped with the strategy and the rest comes from lots of searches:

library("httr")
library("tm")
library("jsonlite")
library("reshape2")
library("plyr")
library("glue")

setwd("I:/Christina's/")
options(stringsAsFactors = FALSE)

#keys - put yours in:
msakey1<-""
msakey2<-""

#current 05142018 url
#https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate[?expr][&model][&count][&offset][&orderby][&attributes]
apiurl<-"https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate?"

#sample search left in
searchexpr<-"Ti='identity as a variable'"
apiattrib<-"Ti,Y,AA.AuN,AA.AfN,C.CN,J.JN,E"

#test on one to see how it works
testcite <- GET(apiurl,
                query = list(expr = searchexpr, count = 1, attributes = apiattrib),
                add_headers("Ocp-Apim-Subscription-Key"= msakey1))
#could look for errors first
testcite$status_code

#get the json out into usable format
#comes out raw so need to make into text

testciteContent <- rawToChar(testcite$content)
test<-fromJSON(testciteContent)
test$entities$AA
test$entities$AA[[1]]$AuN
#this will get a ; separated vector
paste(test$entities$AA[[1]]$AuN, collapse = ';')
test$entities$AA[[1]]$AfN
test$entities$J$JN
test$entities$Y
test$entities$Ti

###################
#use the following to get an abstract from the inverted index
holdE<-fromJSON(test$entities$E)
testII<-holdE$IA$InvertedIndex
#testII %>% do.call(rbind, .)  #an earlier attempt; melt works better
testII.m<-melt(testII)
testII.m<-unique(testII.m)
testII.m<-arrange(testII.m,value)  #arrange() returns the sorted frame, so assign it
ab<-collapse(testII.m$L1, sep=" ")  #glue's collapse(), later renamed glue_collapse()
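The unwind step can also be sketched in base R alone. The toy index below is made up, but it has the same shape as what Academic returns - each term mapped to a vector of word positions (0-based):

```r
# toy inverted index: term -> positions in the abstract
inv <- list(the=c(0,6), quick=1, brown=2, fox=3,
            jumps=4, over=5, lazy=7, dog=8)

# flatten to parallel vectors of terms and positions,
# then read the terms off in position order
positions <- unlist(inv, use.names = FALSE)
terms <- rep(names(inv), lengths(inv))
ab <- paste(terms[order(positions)], collapse = " ")
ab  # "the quick brown fox jumps over the lazy dog"
```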

####################



## OAuth in TwitteR much easier now, whew!

Mar 18 2015 Published by under Uncategorized

Not like I should be messing with this at this point, but I wanted to retrieve a tweet to provide evidence for a point. Anyway, instead of the like-50-step process in the past, you now follow the instructions in the twitteR README: http://cran.r-project.org/web/packages/twitteR/README.html - with the exception that you now put your access token and secret in the *single command*, too, like so:

setup_twitter_oauth(consumer_key, consumer_secret, access_token=NULL, access_secret=NULL)


Then you can just search or whatever. Wow!

Very nice. How much time did I spend playing with the old method?


## Polar and Ellipsoid Graphs in iGraph in R

Mar 12 2015 Published by under Uncategorized

I'm still working to do some additional graphs for the project mentioned in this earlier post. It was too crowded with the Fruchterman Reingold layout, so my customer suggested we do a circular layout with one category in the center and the remaining on the outer ring. I said sure! But when I went to do it, I found only star layout (one in the center) and ring layout. No polar layout. I tried a few things but finally broke down and asked. Quick perfect answer on StackOverflow (as often happens).

That led to this:

But hey, still pretty jammed up. So what about an ellipse? Sure!

What's that equation again?

$\frac {x^2}{a^2} + \frac {y^2}{b^2} =1$

But that's a hard way to do it when I need x and y values in a matrix. This looks better:

$x = a \cos(\theta) , y=b \sin(\theta)$

And this is how I did it.

ellip.layout <- function(a, b, theta) {
  cbind(a*cos(theta), -b*sin(theta))
}

systems <- which(V(g)$category == "System")
comp <- which(V(g)$category != "System")

a <- ifelse(V(g)$category == "System", 4, 5)
b <- ifelse(V(g)$category == "System", 0.5, 1)

theta <- rep.int(0, vcount(g)) #creates a blank vector
theta[systems] <- (1:length(systems)-1) * 2 * pi / length(systems)
theta[comp] <- (1:length(comp)-1) * 2 * pi / length(comp)

layout<- ellip.layout(a,b,theta)

plot.igraph(g, layout=layout, asp=0)

Originally I was getting the outer ring to be a circle anyway, but then I asked the mailing list and it was a matter of setting asp (aspect ratio) to 0.
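A quick sanity check that the parametric form really lands every node on its ellipse - repeating ellip.layout so the snippet stands alone:

```r
# the layout function from above, repeated so this runs by itself
ellip.layout <- function(a, b, theta) {
  cbind(a*cos(theta), -b*sin(theta))
}

theta <- (0:9) * 2 * pi / 10       # ten evenly spaced angles
xy <- ellip.layout(4, 0.5, theta)  # a=4, b=0.5, as for the inner ring

# every point should satisfy x^2/a^2 + y^2/b^2 = 1
check <- xy[,1]^2/4^2 + xy[,2]^2/0.5^2
all(abs(check - 1) < 1e-12)  # TRUE
```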

Here's where I ended up:

ETA: If you do labels, there's a neat trick to make them always outside the circle. See here: https://gist.github.com/kjhealy/834774


## Exporting high resolution graphs from RStudio

Feb 12 2015 Published by under Off Topic

This may not be obvious until you look into it, but apparently the default export from RStudio - if you use the nifty little tool in the Plots tab on the lower right hand side - is 72 dpi. This is fine for showing on web pages, typically, but is not enough for print, particularly if you're submitting to a journal or something like that. There's lots of advice out there, but I found it somewhat confusing.

RStudio Interface for Windows from RStudio.com

I found these posts helpful:

• http://blog.revolutionanalytics.com/2009/01/10-tips-for-making-your-r-graphics-look-their-best.html
• https://danieljhocking.wordpress.com/2013/03/12/high-resolution-figures-in-r/
• http://www.r-bloggers.com/exporting-nice-plots-in-r/

I think someone I was reading just got out of RStudio and did his work in the standard interface. Really, there's no need for that. I also read somewhere that Cairo is not really used any more? There is a way to export to pdf from RStudio and check a box to use Cairo...

Here's what I did.

library("Cairo")

CairoPDF(file="something.pdf", width=11, height=8.5, family="Helvetica", pointsize=11)

set.seed(1337)
plot.igraph(g, layout=layout.fruchterman.reingold, edge.arrow.size=0.4,
            edge.color="black", vertex.size=V(g)$degree,
            vertex.label.dist=V(g)$vertex.label.dist, vertex.label.color="black",
            vertex.label.family="sans", edge.curved=TRUE,
            vertex.label.cex=V(g)$vertex.label.cex, edge.lty=E(g)$edge.lty,
            vertex.frame.color=V(g)$frame.color)

dev.off()

A couple of notes:

• I found I needed to increase the arrowhead size
• I needed to decrease the font size
• I needed to set a seed so I was only changing one thing at a time as I experimented
• When I did png, my dotted lines didn't look so dotted anymore. I didn't feel like messing with that further

Cairo(file="something.png", type="png", units="in", width=10, height=7, pointsize=12, dpi=300)

set.seed(1337)
plot.igraph(g, layout=layout.fruchterman.reingold, edge.arrow.size=0.1,
            edge.color="black", vertex.size=V(g)$degree,
            vertex.label.dist=V(g)$vertex.label.dist, vertex.label.color="black",
            vertex.label.family="sans", edge.curved=TRUE,
            vertex.label.cex=V(g)$vertex.label.cex, edge.lty=E(g)$edge.lty,
            vertex.frame.color=V(g)$frame.color)

dev.off()

## Using more of the possible dimensions in a network graph

Jan 30 2015 Published by under information analysis

When doing bibliometrics, or social network analysis or any kind of network graph, there are only so many different ways to convey information.

• Size of nodes
• Shape of nodes (including pictures)
• Color of nodes
• Border of nodes (or multiple borders)
• Labels (node or edge)
• Edge weight
• Edge color
• Arrows
• Shading areas around/behind nodes
• Layout or arrangement of nodes

Of these, I almost always size nodes by degree (connections to other nodes), do thickness of lines by their weight, and do some sort of energy or spring layout.

If I do some sort of clustering or community detection or even want to call out components, I'll do that with node color.

My normal things are easy in any package that will graph networks. I was working on a project where we were looking at the maturity of a particular part of an industry. As part of this, we wanted to know if the necessary component systems were available from multiple suppliers, if those suppliers had relationships with different system integrators, and if their offerings were operational or just for lab or testing purposes.

We could have done a graph for each subsystem, but they wanted this graph to really just be one slide in a fairly small deck. I tried various approaches in Gephi and NetDraw and wasn't excited. So back to R and igraph. In the end (anonymized):

Resulting graph - minus any labels.

I used:

• node shape for if a component or a system integrator
• color for type of component
• size for degree
• line dashed or dotted for if it was in operation or not

I really wanted to show different shapes for each category but igraph only has like 6 default ones and they don't look all that different from each other. NetDraw has more. I tried to use raster images - but I'm on a windows machine and I found all that very confusing.

One unfortunate thing about this graph is that I had to list companies multiple times if they had offerings in multiple categories.

Customer seemed to like it.

I'm not going to take the time to anonymize all the code but here are some key pieces - ask if there's anything I figured out that you don't immediately see how to do.
I started with a spreadsheet (3 of us librarians were adding data)
nodetable tab:
id label category

edgetable tab:
source target yes/no notes

These I imported into gephi (super easy)... and then tried all sorts of stuff... and then exported into graphml
#read in the graph
g<-read.graph("g.graphml", format="graphml")

#shape nodes. these work, but you can't have n/a, so there has to be a default.
#also, there is an easier way
for (i in 1:101) ifelse(V(g)[i]$Category=='category',
                        V(g)[i]$shape<-'circle',
                        V(g)[i]$shape<-'square')

#color of nodes - a simple number will draw from the palette. see below
for (i in 1:101) if(V(g)[i]$Category=="category"){V(g)[i]$color<-1}

#calculate and keep the degree. i use it again for label placement (not shown)
#and to bold some labels (not shown)
V(g)$degree<-degree(g, mode="all")

#when I tested the graphing, the isolates were all mixed in and messed up all the labels.
#subgraph to show isolates separately
gi<-induced.subgraph(g, V(g)$degree==0)
gnoni<-induced.subgraph(g, V(g)$degree!=0)

#make dotted lines for not operational
for (i in 1:76) ifelse(E(gnoni)[i]$"operational"=="yes",
                       E(gnoni)[i]$edge.lty<-1,
                       E(gnoni)[i]$edge.lty<-2)

#prettier colors
library("RColorBrewer", lib.loc="~/R/win-library/3.1")
mypalette<-brewer.pal(6,"Paired")
palette(mypalette)

#legend definitions
colors <- c('gray40', 1,2,3,4,5,6)
#labels is a character vector of the category names (anonymized here)

#plot graph, keep device open
plot.igraph(gnoni, layout=layout.fruchterman.reingold, edge.arrow.size=0.1,
            edge.color="black", vertex.size=V(gnoni)$degree,
            vertex.label.dist=V(gnoni)$vertex.label.dist, vertex.label.color="black",
            vertex.label.family="sans", edge.curved=TRUE, vertex.label.cex=0.8,
            edge.lty=E(gnoni)$edge.lty)

#put legends on - isolates are just shown as a legend so they are neatly lined up
#could have been done by plotting points
legend("bottomright", legend=labels, fill=colors, border="black", cex=0.7, inset=c(-0.1,0))
legend("topleft", legend=V(gi)$label, pch=19, col=V(gi)$color, cex=0.7, bty="n", y.intersp=0.5)
legend("topright", legend=c("Yes", "No"), lty=c(1,2), cex=0.7, inset=c(-0.02,0))
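As the comment in the code says, there is an easier way than looping: ifelse() is vectorized, so one call can set a whole attribute at once. A sketch on a plain character vector standing in for V(g)$Category, so it runs without igraph - with a real graph it would be V(g)$shape <- ifelse(...):

```r
# stand-in for V(g)$Category
Category <- c("category", "other", "category", "other")

# one vectorized call replaces the for loop over vertices
shape <- ifelse(Category == "category", "circle", "square")
shape  # "circle" "square" "circle" "square"
```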

## I'm a coding fool... use of the nuggets mentioned last post

Jul 18 2014 Published by under information analysis, Uncategorized

Self-efficacy FTW... may I never crash!

Last post I mentioned determining if a string is an element of a character vector and opening Excel files in R. Here's what I was doing and how it worked.

I had a directory of xls files downloaded from a research database, one per organization, showing the high-level subject categories in which they published. The subject categories were actually narrower than what I needed (think condensed matter, high energy, AMO, chemical physics, when I need "physics"). I needed to rank the organizations by articles published in these composite categories, each of which included maybe 10 or so of the categories from the database.

Originally, I was going to open all the files and just count them up, but whoa, that would suck. R for the win!

First, run RStudio as either 32- or 64-bit, depending on which Java you have installed.

Next, get the list of files. I had saved them in a directory with other things, too, so I needed to search by pattern. I had already set my working directory to the data directory, for good or bad.

fileList<-list.files(pattern = "subject")

Get the list of their categories for my composite category (physics here). This was just a list:

physics <- read.delim("physics.txt", header=F)

Make sure words are characters and not factors, numbers are just numbers. By trial and error and inspection I figured out that there was a non-numeric character in the count field.

Here's the function:
countPhysics<- function(file){
  #this creates a vector to hang on to the counts
  phys.hold <- vector(mode="numeric")
  #read in this organization's file (assuming xlsx::read.xlsx here - the Java dependency mentioned above)
  physfile<-read.xlsx(file, sheetIndex=1)
  #this is so i can make sure i just have numbers in the count field
  pattern<-"[[:digit:]]+"
  #this finds matching records and then keeps just the part we want
  m<-regexpr(pattern, physfile$Count)
  physfile$Count<-as.numeric(regmatches(physfile$Count, m))
  #one of these days i'll just import right the first time instead of this part
  physfile$Analysis.Value<-as.character(physfile$Analysis.Value)
  for (j in 1:length(physfile$Count)){
    #physics came in via read.delim with header=F, so the list is its first column
    if (is.element(physfile$Analysis.Value[[j]], physics$V1)) {
      phys.hold[j]<-physfile$Count[[j]]
    } else phys.hold[j]<-0
  }
  total<- sum(phys.hold)
  return(c(file, total))
}


So you run this like so:

physicsResult <-sapply(fileList,countPhysics)

I transposed it and then pasted it into an Excel file I was working on, but this is essentially the answer. I did the same thing for the other categories separately, but obviously I should have checked each line for matches against each of my categories before moving to the next line, and then output a frame. Oh well.
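That check-every-row-at-once idea can be sketched with a vectorized membership test - the data frame and category list here are made up stand-ins for one organization's parsed file:

```r
# made-up stand-ins for one organization's file and the composite category list
counts <- data.frame(Analysis.Value = c("condensed matter", "optics", "economics"),
                     Count = c(120, 45, 10),
                     stringsAsFactors = FALSE)
physics.cats <- c("condensed matter", "optics", "high energy")

# sum the counts for rows whose category is in the composite list -
# no per-row loop needed
total <- sum(counts$Count[counts$Analysis.Value %in% physics.cats])
total  # 165
```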


## Getting blog post content in a usable format for content analysis

Jul 04 2014 Published by under dissertation

So in the past I've just gone to the blogs and saved down posts in whatever format to code (as in analyze). One participant with lots of equations asked me to use screenshots if my method didn't show them adequately.

This time I have just a few blogs of interest, and I want to go all the way back, and I'll probably do some quantitative stuff as well as just coding at the post level - for example, just indicating if the post discusses their own work, other scholarly work, a method (like this post!), a book review, career advice, whatever. Maybe I'll also select some to go deeper, but it isn't content analysis like linguists or others do at the word level.

Anyway, lots of ways to get the text of web pages. I wanted to do it in R completely, and I ended up getting the content there, but I found python to work much better for parsing the actual text out of the yucky tags and scripts galore.

I had *a lot* of help with this. I lived on StackOverflow, got some suggestions at work and on friendfeed (thanks Micah!), and got a question answered on StackOverflow (thanks alecxe!). I tried some books in Safari but meh?

I've had success with this on blogger and wordpress blogs. Last time when I was customizing a perl script to pull the commenter urls out every blog was so different from the others that I had to do all sorts of customization. These methods require very little change from one to the next. Plus I'm working on local copies when I'm doing the parsing so hopefully having as little impact as possible (now that I know what I'm doing - I actually got myself blocked from my own blog earlier because I sent so many requests with no user agent)

So using R to get the content of the archive pages. Largest reasonable archive pages possible instead of pulling each post individually, which was my original thought. One blog seemed to be doing an infinite scroll but when you actually looked at the address block it was still doing the blogurl/page/number format.  I made a csv file with the archive page urls in one column and the file name in another. I just filled down for these when they were of the format I just mentioned.

Read them into R. Then I had this function (getURL is from the RCurl package):

library("RCurl")

getFullContent<-function(link, fileName){
  UserAgent <- "pick something"
  temp <- getURL(link, timeout = 8, ssl.verifypeer = FALSE, useragent = UserAgent)
  nameout <- paste(fileName, ".htm", sep="")
  write(temp, file=nameout)
}

I ended up doing it in chunks. If you're doing this function with one, it's like:

getFullContent("http://scientopia.org/blogs/christinaslisrant/","archivep1" )

More often I did a few:

mapply(getFullContent,archiveurls$url[1:20],archiveurls$name[1:20])

So I moved the things around to put them in a folder.

Then this is the big help I got from StackOverflow. Here's how I ended up with a spreadsheet.

from bs4 import BeautifulSoup
import os, os.path

# from http://stackoverflow.com/questions/24502139/using-beautifulsoup-to-pull-multiple-posts-from-a-single-blog-archive-page-with
# this is the file to write out to
posts_file = open ("haposts.txt","w")

def pullcontent(filename):

    soup = BeautifulSoup(open(filename))
    posts = []
    for post in soup.find_all('div', class_='post'):
        title = post.find('h3', class_='post-title').text.strip()
        author = post.find('span', class_='post-author').text.replace('Posted by', '').strip()
        content = post.find('div', class_='post-body').p.text.strip()
        date = post.find_previous_sibling('h2', class_='date-header').text.strip()

        posts.append({'title': title,
                      'author': author,
                      'content': content,
                      'date': date})
    #print posts
    posts = str(posts)
    posts_file.write(posts)

# this is from http://stackoverflow.com/questions/13014862/parse-each-file-in-a-directory-with-beautifulsoup-python-save-out-as-new-file?rq=1

for filename in os.listdir("files"):
    #include the directory, or open() won't find the file
    pullcontent(os.path.join("files", filename))

print ("All done!")

posts_file.close()


So then I pasted it into Word, put in some line breaks and tabs, and pasted into Excel. I think I could probably go from that file or the data directly into Excel, but this works.

Really very minor tweaking between blogs. For most of them I don't actually need an author, but I added in the url using something like this:

url = post.find('h2').a.get('href')

The plan is to import this into something like nvivo or atlas.ti for the analysis. Of course it would be very easy to load it in to R as a corpus and then do various textmining operations.
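As a taste of that last option, a term-frequency count needs only base R - the posts vector here is made up, and tm's Corpus and TermDocumentMatrix would be the fuller version:

```r
# made-up stand-ins for a few extracted post bodies
posts <- c("open access and open data",
           "data about data is metadata",
           "access to data")

# a minimal term-frequency table: lower-case, split on non-letters, tabulate
words <- unlist(strsplit(tolower(posts), "[^a-z]+"))
freq <- sort(table(words), decreasing = TRUE)
freq["data"]  # 4
```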


## Using R TwitteR to Get User Information

I'm gonna keep stating the obvious, because this took me a few hours to figure out. Maybe not working continuously, but still.

So, I have like more than 6000 tweets from one year of AGU alone, so I'm gonna have to sample somehow. Talking this over with my advisor, he suggested that we have to find some reasonable way to stratify and then do random sampling within the strata. I haven't worked all the details out yet - or really any of them - but I started gathering user features I could base the decision on. Number of tweets with the hashtag was super quick in Excel. But I was wondering if they were new to Twitter, if they tweeted a lot, and if they had a lot of followers. That's all available through the API using the twitteR package by Jeff Gentry. Cool.

So getUser() is the function to use. I made up a list of the unique usernames in Excel and imported that in. Then I went to loop through.


library("twitteR", lib.loc="C:/Users/Christina/Documents/R/win-library/3.0")
#get the data (read the username list into "data")
USERdata<-vector()
userInfo<-function(USER){
  temp<-getUser(USER, cainfo="cacert.pem")
  USERdata<-c(USER, temp$created, temp$statusesCount, temp$followersCount)
  return(USERdata)
}
#test for users 4-6
tweeter.info<-sapply(data$user[4:6], userInfo)

But that was sorta sideways... I had a column for each user... weird. Bob O'H helped me figure out how to transpose that, and I did, but it was still sorta weird.

So then I tried this way:
get.USER.info<-function(startno, stopno){
  # set up the vectors first
  n<-stopno-startno+1
  USER.name<-character(n)
  USER.created<-numeric(n)
  USER.posts<-numeric(n)
  USER.foll<-numeric(n)
  for (i in startno:stopno) {
    thing<-getUser(data$user[i], cainfo="cacert.pem")
    USER.name[i]<-data$user[i]
    USER.created[i]<-thing$created
    USER.posts[i]<-thing$statusesCount
    USER.foll[i]<-thing$followersCount
  }
  return(data.frame(username=USER.name, created=USER.created,
                    posts=USER.posts, followers=USER.foll,
                    stringsAsFactors=FALSE))
}

So that was cool, until it wasn't. I mean, it turns out that 2% of the users have deleted their accounts, or block me, or are private, or something. So it didn't recover from that error, and I tried to test for is.null() and is.na() but it failed... So then I went back to the mailing list, and there was a suggestion to use try(), but eek. Then I noticed that if you have a pile to look through, you're actually supposed to use

lookupUsers(users, includeNA=FALSE, ...)

And I did, and I wanted to keep the NAs so that I could align with my other data later... but once again, no way to get the NAs out. And it's an object that's a pile of lists... which I was having trouble wrapping my little mind around (others have no issues). So I went back and used that command again, and this time said to skip the NAs (the not-found users). Then, I think from the mailing list or maybe from Stack Overflow, I had gotten the idea to use unlist. So here's what I did then:

easy.tweeters.noNA<-lookupUsers(data$user, cainfo="cacert.pem")
#check how many fewer this was
length(easy.tweeters.noNA)
#1247, so there were 29 accounts missing, hrm

testbigdf<-data.frame()
for (i in 1:1247){
  holddf<-twListToDF(easy.tweeters.noNA[i])
  testbigdf<-rbind(testbigdf, holddf)
}
And that created a lovely dataframe with all kinds of goodies in it. I guess I'll have to see what I want to do about the 29 accounts.

I really would have been happier if it was more graceful with users that weren't found.

Also, note: for every single command you have to use the cainfo="cacert.pem" thingy... Every time, every command.

ALSO, I had figured out oauth, but the twitter address went from http:// to https:// and so that was broken, but I fixed it. I hope I don't have to reboot my computer soon! (Yeah, I saved my credentials to a file, but I don't know... )


## Using R to expand Twitter URLS

So this should be so simple and obvious that it's not worth a post, but I keep forgetting how to do everything so I'm gonna put this here to remind myself.

Here's where I am. I have a list of 4011 tweets with the #agu12 or #agu2012 hashtag. A lot of these are coded as "pointers" - their main function is to direct readers' attention somewhere else. So I got to wondering: where? Are they directing people to online versions of the posters? Are they just linking to more NASA press releases? What percentage go to a .edu?

Of course all the URLs are shortened, and there are services you can use to expand them, but in R it's already right there in the twitteR package as

decode_short_url

This uses the longapi.org API. All you have to do is plug in the URL. Cool!

So here was my original plan: find the tweets with urls, extract the urls, expand them, profit! And I was going to do all this in R. But then it got a little ridiculous.
So instead I used Open Refine to find all the urls, then assigned IDs to all the records, and then used filtering and copy-and-pasting to get them all into two columns: ID, URL.

Issues: non-printing characters (Excel has a CLEAN command), extra spaces (TRIM - didn't really work, so I did a find and replace), random commas (some needed to be there), random other punctuation (find and replace), # signs.

The idea in R was to do a for loop to iterate through each url, expand it, append it to a vector (or concatenate, whichever), then add that to the dataframe and do stats on it or maybe just export to Excel and monkey with it there.

For loop, fine. Append - not for love nor money, despite the fact that I have proof I successfully did it in my Coursera class. I don't know. And the API was failing for certain rows. For a couple of rows, I found more punctuation. Then I found the rest of the issues were really all about length - they don't expect shortened urls to be long (duh)! So then I had to pick a length and only send ones shorter than that (50) to the API. I finally gave up on the stupid append, and I just printed the results to the screen and copied them over to Excel. Also, I cheated with how long the for loop had to be - I should have been able to just use the number of rows in the frame, but meh.
Anyhow, this worked:

setwd("~/ mine")
library("twitteR", lib.loc="C:/Users/Christina/Documents/R/win-library/3.0")
#get the data
data <- read.csv("agu12justurl.csv", colClasses = "character")
#check it out
str(data)
#test a single one
decode_short_url(data$url[2])

#this was for me trying to append, sigh
#create a vector to put the new stuff in, then I'll append to the data frame, I hope
full.vec <- vector(mode="character")

#check the for loop
for (i in 1:200){print(data.sub$url[i])}
#that works
for (i in 1:3){print(decode_short_url(data.sub$url[i]))}
#that works - good to know, though, that if it can't be expanded it comes back null

#appending to the vector is not working, but printing is, so will run with that
for (i in 1:1502){
  if(nchar(data$url[i])>50){
    urlhold<-data$url[i]
  } else {
    urlhold<-decode_short_url(data$url[i])
  }
  print(urlhold)
  #append(full.vec,urlhold)
}


If anyone wants to tell me what I'm doing wrong with the append, it would be appreciated. I'm sure it must be obvious.
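For what it's worth about the append: append(), like most R functions, returns a new vector rather than modifying its argument, so the result has to be assigned back. A sketch with made-up strings in place of the expanded urls:

```r
full.vec <- vector(mode="character")
for (urlhold in c("http://a.example", "http://b.example")) {
  # assign the result back; a bare append(full.vec, urlhold) is discarded
  full.vec <- append(full.vec, urlhold)
}
full.vec  # "http://a.example" "http://b.example"
```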

So what's the answer? Not sure. I'll probably do a post on string splitting and counting... OR I'll be back in Open Refine. How do people only ever work in one tool?

• I'm a science and engineering librarian and information scientist at a university-affiliated research center. I have a BS in Physics, an MLS, and a PhD in Information Studies. Nothing here represents my employer.
