Filling in missing data: books

May 17 2018 Published by under bibliometrics, information analysis

Photo by Janko Ferlič on Unsplash

My normal area to do information analysis and bibliometrics is technology - not even really science, but technology. Current project I'm obsessing on (it's actually really cool and interesting and fun) spans political science, history, philosophy, and even some sociology and criminal justice. So I played all my reindeer games on the journal articles, but when it comes to analyzing the abstracts, a lot were missing. I wanted to ignore these, but they were whole collections from extremely relevant journals and also nearly all of the books. The journals I filled in what i could from Microsoft Academic (no endorsement intended).

Books though.

I hoped I could use WorldCat - my library doesn't have a cataloging subscription with them so I don't think so? I hoped LoC - I could download ALL records before like 2014 which would help a lot, but wow, I can't really deal with that much data for right now. OpenLibrary- no.

Amazon - they've changed the rules a lot and it wasn't clear to me that - if I took all the time to figure it out, if I wouldn't be actually breaking the current terms of service (they have changed).

I asked the wonderful group on the Code4Lib list - one of them (thanks Mark!) pointed to Harvard's api and records. Woo-hoo! I hope they log my stats to justify their expense. I hope my messing around doesn't cause them any problems.

I'm not a cataloger although played one for 4 months 15 years ago. And I don't know MODS or Dublin Core (although it did exist when i was in library school). I pulled a couple of records and poked through them to see where what I wanted was located. Originally, I pulled the title from the record, but that proved to be too messy.

The data I needed to fill in:

  • gathered from reference managers of a team of analysts over a number of years (as many as 9)
  • searched in general (WoS, Scopus) and specific (World Political Science Abs, PAIS, etc) databases
  • gathered in EndNote, de-duped, exported in tab del

in R

  • limited to books and book sections
  • for titles, removed punctuation, made lower case, removed numbers
  • for authors, split to take first name provided of the first author (mostly the last name)
  • for ISBN, removed ISSNs, kept first one, took out - (yes to be more thorough, I should have or'd them)


getharvabs <- function(citeNo){
  if (!$holdisbn2[citeNo])){
    holding<-GET(harvardurl, query=query)
    if (holding$status_code==200){
      bookContent <- rawToChar(holding$content)
      if (booktxt$pagination$numFound==0){
        } else {
                              AB=booktxt$items$mods$abstract$`#text`) }
    } else {
      holding<-GET(harvardurl, query=query)
      if (holding$status_code==200){
        bookContent <- rawToChar(holding$content)
        if (booktxt$pagination$numFound==0){
          citerow<-data.frame(id=citeNo, TI=missingbooks1$X4[citeNo], AB=NA)
          } else {
  print("this is citerow")

holdbookabs<-data.frame(id=as.integer(0),TI=as.character("test"), AB=as.character("test"))

for (i in 1:length(missingbooks1$holdti)){

This choked a couple of times. Maybe I don't always have the right book - but then again, if it's on the same subject. Meh.
I also considered getting the LCSH, but decided against. Too much time spent on this already.

No responses yet

Getting abstracts back from Microsoft Academic's Inverted Index

May 14 2018 Published by under bibliometrics, information analysis

Some time ago, I posted about using Microsoft Academic to fill in missing data from other searches. Jamie and I were going to do a package to wrap the API, but bureaucracy more or less killed our enthusiasm (well, not his, that would be impossible).

Here I am obsessing over a really, really cool bibliometrics project, and have lots of citations missing abstracts. I'm sort of thinking I won't be able to do much with the books even though catalogs seem to have descriptions for a lot of them (happy to take suggestions). I've already looked at using other sources, so I'm back at Academic.

Pulled out my script. Found I lost my keys, retrieved new keys and found there's a new endpoint url, updated that and hit go. ....

Ladies and gentlemen, they moved the abstracts... no more paragraph, you now get an "inverted index." People who studied information retrieval may know what that is, but in this case it's a list of terms with each having a numeric vector of locations the term appears. Stop words are included so "the" might have 20 locations and "sassafras" has 1.

Here it is. Jamie helped with the strategy and the rest comes from lots of searches:





options(stringsAsFactors = FALSE)

#keys- put yours in:

#current 05142018 url


#sample search left in
searchexpr<-"Ti='identity as a variable'"

#test on one to see how it works
testcite <- GET(apiurl, 
         query = list(expr = searchexpr,count = 1, attributes = apiattrib), add_headers("Ocp-Apim-Subscription-Key"= msakey1))

#could look for errors first

#get the json out into usable format
#comes out raw so need to make into text

testciteContent <- rawToChar(testcite$content)
#this will get a ; separated vector
paste(test$entities$AA[[1]]$AuN, collapse = ';')


#use the following to get an abstract from the inverted index


testII %>%, .)




ab<-collapse(testII.m$L1, sep=" ")


2 responses so far

OAuth in TwitteR much easier now, whew!

Mar 18 2015 Published by under Uncategorized

Not like I should be messing with this at this point, but I wanted to retrieve a tweet to provide evidence for a point. Anyway, instead of the like 50 step process in the past, you now follow the instructions in the TwitterR readme: with the exception of you now put your access token and secret in the *single command* now, too, like so:

setup_twitter_oauth(consumer_key, consumer_secret, access_token=NULL, access_secret=NULL)

Then you can just search or whatever. Wow!

Very nice. How much time did I spend playing with the old method?

Comments are off for this post

Polar and Ellipsoid Graphs in iGraph in R

Mar 12 2015 Published by under Uncategorized

I'm still working to do some additional graphs for the project mentioned in this earlier post. It was too crowded with the Fruchterman Reingold layout, so my customer suggested we do a circular layout with one category in the center and the remaining on the outer ring. I said sure! But when I went to do it, I found only star layout (one in the center) and ring layout. No polar layout. I tried a few things but finally broke down and asked. Quick perfect answer on StackOverflow (as often happens).

That led to this:

Polar Layout

But hey, still pretty jammed up. So what about an ellipse? Sure!

What's that equation again?

 \frac {x^2}{a^2} + \frac {y^2}{b^2} =1


But that's a hard way to do it when I need x and y values in a matrix. This looks better:

x = a \cos(\theta) , y=b \sin(\theta)

And this is how I did it.

ellip.layout <- function(a,b, theta) {
cbind(a*cos(theta), -b*sin(theta))

systems <- which(V(g)$category == "System")
comp <- which(V(g)$category != "System")

a<- ifelse(V(g)$category == "System",4,5)
b<- ifelse(V(g)$category == "System",0.5,1)

theta <-, vcount(g)) #creates a blank vector
theta[systems] <- (1:length(systems)-1) * 2 * pi / length(systems)
theta[comp] <- (1:length(comp)-1) * 2 * pi / length(comp)

layout<- ellip.layout(a,b,theta)

plot.igraph(g, layout=layout, asp=0)

Originally I was getting the outer ring to be a circle anyway, but then I asked the mailing list and it was a matter of setting asp (aspect ratio) to 0.

Here's where I ended up:

EllipseETA: If you do labels, there's a neat trick to make them always outside the circle. See here:

Comments are off for this post

Exporting high resolution graphs from RStudio

Feb 12 2015 Published by under Off Topic

This may not be obvious until you look into it but apparently the default export from RStudio -  if you use the nifty little tool in plots tab on the lower right hand side -  is 72dpi. This is fine for showing on web pages, typically, but is not enough for print. Particularly if you're submitting to a journal or something like that. There's lots of advice, but I found it somewhat confusing.

RStudio Interface for Windows from

RStudio Interface for Windows from

I found these posts helpful:


I think someone I was reading just got out of RStudio and did his work in the standard interface. Really, there's no need for that. I also read somewhere that Cairo is not really used any more? There is a way to export to pdf from RStudio and check a box to use Cairo...

Here's what I did.

CairoPDF(file="something.pdf", width=11, height=8.5, family="Helvetica", pointsize=11)


plot.igraph(g, layout=layout.fruchterman.reingold, edge.arrow.size=0.4, edge.color="black", vertex.size=V(g)$degree, vertex.label.dist=V(g)$vertex.label.dist, vertex.label.color="black","sans",edge.curved=TRUE, vertex.label.cex=V(g)$vertex.label.cex, edge.lty=E(g)$edge.lty, vertex.frame.color=V(g)$frame.color)

A couple of notes:

  • I found I needed to increase the arrowhead size
  • I needed to decrease the font size
  • I needed to set a seed so I was only changing one thing at a time as I experimented
  • When I did png, my dotted lines didn't look so dotted anymore. I didn't feel like messing with that further

Cairo(file="something.png", type="png", units="in", width=10, height=7, pointsize=12, dpi=300)


plot.igraph(g, layout=layout.fruchterman.reingold, edge.arrow.size=0.1, edge.color="black", vertex.size=V(g)$degree, vertex.label.dist=V(g)$vertex.label.dist, vertex.label.color="black","sans",edge.curved=TRUE, vertex.label.cex=V(g)$vertex.label.cex, edge.lty=E(g)$edge.lty, vertex.frame.color=V(g)$frame.color)

One response so far

Using more of the possible dimensions in a network graph

Jan 30 2015 Published by under information analysis

When doing bibliometrics, or social network analysis or any kind of network graph, there are only so many different ways to convey information.

  • Size of nodes
  • Shape of nodes (including pictures)
  • Color of nodes
  • Border of nodes (or multiple borders)
  • Labels (node or edge)
  • Edge weight
  • Edge color
  • Arrows
  • Shading areas around/behind nodes
  • Layout or arrangement of nodes

Of these, I almost always size nodes by degree (connections to other nodes), do thickness of lines by their weight, and do some sort of energy or spring layout.

If I do some sort of clustering or community detection or even want to call out components, I'll do that with node color.

My normal things are easy in any package that will graph networks. I was working on a project where we were looking at the maturity of a particular part of an industry. As part of these, we wanted to know if the necessary component systems were available from multiple suppliers and if those suppliers had relationships with different system integrators and if their things were operational or were just for lab or testing purposes.

We could have done a graph for each sub system but they wanted this graph to really just be one slide in a fairly small deck. I tried various approaches in Gephi and NetDraw and wasn't excited. So back to R and iGraph.  In the end (anonymized) :


Resulting graph - minus any labels.

I used:

  • node shape for if a component or a system integrator
  • color for type of component
  • size for degree
  • line dashed or dotted for if it was in operation or not

I really wanted to show different shapes for each category but igraph only has like 6 default ones and they don't look all that different from each other. NetDraw has more. I tried to use raster images - but I'm on a windows machine and I found all that very confusing.

One unfortunate thing about this graph is that I had to list companies multiple times if they had offerings in multiple categories.

Customer seemed to like it.

I'm not going to take the time to anonymize all the code but here are some key pieces - ask if there's anything I figured out that you don't immediately see how to do.
I started with a spreadsheet (3 of us librarians were adding data)
nodetable tab:
id label category

edgetable tab:
source target yes/no notes

These I imported into gephi (super easy)... and then tried all sorts of stuff... and then exported into graphml
#read in the graph
g<-read.graph("g.graphml", format="graphml")

#shape nodes. these work, but you can't have n/a. so there has to be a default. also, there is an easier way
for (i in 1:101)ifelse(V(g)[i]$Category=='category', V(g)[i]$shape<-'circle', V(g)[i]$shape<-'square')

#color of nodes - a simple number will draw from the palette. see below
for (i in 1:101)if(V(g)[i]$Category=="category"){V(g)[i]$color<-1}

#calculate and keep the degree. i use it again for label placement (not shown) and to bold some labels (not shown)
V(g)$degree<-degree(g, mode="all")

#when I tested the graphing, the isolates were all mixed in and messed up all the labels.
#subgraph to show isolates separately

#make dotted lines for not operational
for (i in 1:76) ifelse (E(gnoni)[i]$"operational"=="yes", E(gnoni)[i]$edge.lty<-1,E(gnoni)[i]$edge.lty<-2)

#prettier colors
library("RColorBrewer", lib.loc="~/R/win-library/3.1")

#legend definitions
colors < - c('gray40', 1,2,3,4,5,6) labels <- vector of categories

#plot graph keep device open
plot.igraph(gnoni, layout=layout.fruchterman.reingold, edge.arrow.size=0.1, edge.color="black", vertex.size=V(gnoni)$degree, vertex.label.dist=V(gnoni)$vertex.label.dist, vertex.label.color="black","sans",edge.curved=TRUE, vertex.label.cex=0.8, edge.lty=E(gnoni)$edge.lty)

#put legends on - isolates are just shown as a legend so they are neatly lined up
#could have been done by plotting points

legend("bottomright",legend=labels, fill=colors, border="black", cex=0.7, inset=c(-0.1,0))
legend("topleft", legend=V(gi)$label, pch=19, col=V(gi)$color, cex=0.7, bty="n", y.intersp=0.5)
legend("topright", legend=c("Yes", "No"), lty=c(1,2), cex=0.7,inset=c(-0.02,0)) Continue Reading »

2 responses so far

I'm a coding fool... use of the nuggets mentioned last post

Jul 18 2014 Published by under information analysis, Uncategorized

Self-efficacy FTW... may I never crash!

Last post I mentioned determining if a string is an element of a character vector and opening Excel files in R. Here's what I was doing and how it worked.

I had a directory of xls files downloaded from a research database, one per organization, showing the high level subject categories in which they published. The subject categories were actually narrower than what I needed (think like condensed matter, high energy, amo, chemical physics and I need "physics"). I needed to rank the organizations by articles published in these composite categories that included maybe 1o or so of the categories from the database.

Originally, I was going to open all the files and just count them up, but whoa, that would suck. R for the win!

First, run RStudio with either 32- or 64- bit depending on which Java you have installed.

Next, get the list of files. I had saved them in a directory with other things, too, so I needed to search by pattern. I had already set my working directory to the data directory, for good or bad.

fileList<-list.files(pattern = "subject")

Get the list of their categories for my composite category (physics here). This was just a list:

physics <- read.delim("physics.txt", header=F)

Make sure words are characters and not factors, numbers are just numbers. By trial and error and inspection I figured out that there was a non-numeric character in the count field.

Here's the function:
countPhysics<- function(file){
  #this creates a vector to hang on to the numbers of counts
  phys.hold <- vector(mode="numeric")
  #this is so i can make sure i just have numbers in the count field
  #this finds matching records and then keeps just the part we want
  #one of these days i'll just import right the first time instead of this part
  for (j in 1:length(physfile$Count)){
    if (is.element(physfile$Analysis.Value[[j]],physics)) {
      phys.hold[j]<- (physfile$Count[[j]])}
    else phys.hold[j]<-0
  total<- sum(phys.hold)  

So you run this like so:

physicsResult <-sapply(fileList,countPhysics)

I transposed it and then pasted it into an excel file I was working on but this is essentially the answer. I did the same thing for the other categories separately, but obviously I should have checked each line for matching each/all of my categories before moving to the next line and then outputted a frame. Oh well.

Comments are off for this post

Getting blog post content in a usable format for content analysis

Jul 04 2014 Published by under dissertation

So in the past I've just gone to the blogs and saved down posts in whatever format to code (as in analyze). One participant with lots of equations asked me to use screenshots if my method didn't show them adequately.

This time I have just a few blogs of interest and I want to go all the way back, and I'll probably do some quantitative stuff as well as just coding at the post level. For example just indicating if the post discusses their own work, other scholarly work, a method (like this post!), a book review... , career advice, whatever. Maybe I'll also select some to go deeper but it isn't content analysis like linguists or others do at the word level.

Anyway, lots of ways to get the text of web pages. I wanted to do it in R completely, and I ended up getting the content there, but I found python to work much better for parsing the actual text out of the yucky tags and scripts galore.

I had *a lot* of help with this. I lived on StackOverflow, got some suggestions at work and on friendfeed (thanks Micah!), and got a question answered on StackOverflow (thanks alecxe!). I tried some books in Safari but meh?

I've had success with this on blogger and wordpress blogs. Last time when I was customizing a perl script to pull the commenter urls out every blog was so different from the others that I had to do all sorts of customization. These methods require very little change from one to the next. Plus I'm working on local copies when I'm doing the parsing so hopefully having as little impact as possible (now that I know what I'm doing - I actually got myself blocked from my own blog earlier because I sent so many requests with no user agent)

So using R to get the content of the archive pages. Largest reasonable archive pages possible instead of pulling each post individually, which was my original thought. One blog seemed to be doing an infinite scroll but when you actually looked at the address block it was still doing the blogurl/page/number format.  I made a csv file with the archive page urls in one column and the file name in another. I just filled down for these when they were of the format I just mentioned.

Read them into R. Then had the function:
 UserAgent <- "pick something"
 temp <- getURL(link, timeout = 8, ssl.verifypeer = FALSE, useragent = "UserAgent")
 nameout <- paste(fileName, ".htm", sep="") 
 write (temp,file=nameout)

I ended up doing it in chunks.  if you're doing this function with one it's like:

getFullContent("","archivep1" )

More often I did a few:


So I moved the things around to put them in a folder.

Then this is the big help I got from StackOverflow. Here's how I ended up with a spreadsheet.

from bs4 import BeautifulSoup
import os, os.path

# from
# this is the file to write out to
posts_file = open ("haposts.txt","w")

def pullcontent(filename):

    soup = BeautifulSoup(open(filename))
    posts = []
    for post in soup.find_all('div', class_='post'):
        title = post.find('h3', class_='post-title').text.strip()
        author = post.find('span', class_='post-author').text.replace('Posted by', '').strip()
        content = post.find('div', class_='post-body').p.text.strip()
        date = post.find_previous_sibling('h2', class_='date-header').text.strip()

        posts.append({'title': title,
             'author': author,
             'content': content,
             'date': date})
    #print posts
    posts = str(posts)
    posts_file.write (posts)

# this is from

for filename in os.listdir("files"):

print ("All done!")


So then I pasted it into word, put in some line breaks and tabs and pasted into excel.  I think I could probably go from that file or the data directly into Excel, but this works.

Really very minor tweaking between blogs. Most I don't actually need an author for but I added in the url using something like this:

url = post.find('h2').a.get('href')

The plan is to import this into something like nvivo or atlas.ti for the analysis. Of course it would be very easy to load it in to R as a corpus and then do various textmining operations.

Comments are off for this post

Using R TwitteR to Get User Information

I'm gonna keep stating the obvious, because this took me a few hours to figure out. Maybe not working continuously, but still.

So, I have like more than 6000 tweets from one year of AGU alone, so I'm gonna have to sample somehow. Talking this over with my advisor, he suggested that we have to find some reasonable way to stratify and then do random within the stratification. I haven't worked all the details out yet - or really any of them - but I started gathering user features I could base the decision on. Number of tweets with the hashtag was super quick in Excel. But I was wondering if they were new to Twitter, if they tweeted a lot, and if they had a lot of followers. That's all available through the api and using the TwitteR package by Jeff Gentry.  Cool.

So getUser() is the function to use. I made up a list of the unique usernames in Excel and imported that in. Then I went to loop through.

library("twitteR", lib.loc="C:/Users/Christina/Documents/R/win-library/3.0")
#get the data
 data USERdata<-vector()
 temp<-getUser(USER, cainfo="cacert.pem")
 #test for users 4-6<-sapply(data$user[4:6],userInfo)

But that was sorta sideways... I had a column for each user... sorta weird. Bob O'H helped me figure out how to transpose that and I did, but it was sorta weird.

So then I tried this way:<-function(startno,stopno){
# set up the vectors first
for (i in startno:stopno) {
thing<-getUser(data$user[i], cainfo="cacert.pem")[i]<-data$user[i]

return(data.frame(,created=USER.created, posts=USER.posts,followers=USER.foll, stringsAsFactors=FALSE))

So that was cool, until it wasn't. I mean, turns out that 2% of the users have deleted their accounts or block me or are private or something. So it didn't recover from that error and I tried to test for is.null() and is.NA() but it failed....
So then I went back to the mailing list and there was a suggestion to user try() but eek.
So then I noticed that if you have a pile to look through you're actually supposed to use
lookupUsers(users, includeNA=FALSE, ...)
And I did, and I wanted to keep the NA so that I could align with my other data later... but once again, no way to get the NAs out. And it's an object that's a pile of lists... which I was having trouble wrapping my little mind around (others have no issues).
So I went back and used that command again, and this time said to skip the NA (the not found users). Then I think from the mailing list or maybe from Stack Overflow? I had gotten the idea to use unlist. So here's what I did then:
easy.tweeters.noNA<-lookupUsers(data$user, cainfo="cacert.pem")
#check how many fewer this was
#1247 so there were 29 accounts missing hrm
for (i in 1:1247){holddf<-twListToDF(easy.tweeters.noNA[i])

And that created a lovely dataframe with all kinds of goodies for it. I guess I'll have to see what I want to do about the 29 accounts.

I really would have been happier if it was more graceful with users that weren't found.

Also, not for every single command you have to user the cainfo="cacert.pem" thingy... Every time, every command.

ALSO, I had figured out oauth, but the twitter address went from http:// to https:// and so that was broken, but I fixed it. I hope I don't have to reboot my computer soon! (Yeah, I saved my credentials to a file, but I don't know... )

Comments are off for this post

My ongoing struggle with the Twitter API, R, … copy paste

I’m posting this in hopes that someone with experience in any/all of the above or maybe Perl, can point out that I’m doing something stupid or have overlooked something obvious. If nothing else, you might read this to see what not to try.

Here’s the issue: it’s totally obvious that I need to look at the other tweets that were sent by #agu10 tweeters (the ones not marked with the hash tag) if I want to understand how Twitter was used at the meeting. But it’s now five months later and there are 860 of them (although I would be fine with looking at the most prolific non-institutional tweeters).

I first looked at the Twitter API and I tried just adding terms to URLs and got the recent timelines for a user at a time but I couldn’t see a way to get a user’s timeline for a set period of time (the conference time period +a week on each end, or so).

I asked two experts and they both said that you couldn’t combine the user timeline with a time period.

Darn. So my next idea was to see if I could actually access someone’s timeline that far back through the regular interface. I tried one of the more prolific tweeters and I could. Ok, so if I can pull down all of their tweets, then I could pick out the ones I wanted. Or, even better, I could also look at the evolution of the social network over time. Did people meet at the meeting and then continue to tweet at each other or are these people only connected during the actual meeting?  Did the network exist in the same way before the meeting?

I was looking for ways to automate this a bit and I noticed that there were things already built for Perl and for R. I used Perl with a lot of handholding to get the commenter network for an earlier paper and I used R for both that same article and in lieu of STATA for my second semester of stats. I’m not completely comfortable with either one and I don’t always find the help helpful. I decided to start with R.

The main package is twitteR by Jeff Gentry. I updated my installation of R and installed and loaded that package and the dependencies. First thing I did was to get my own standard timeline:

testtweets <- userTimeline("cpikas")

Then I typed out the first few to see what I got (like when you’re using DIALOG)


And I saw my tweets in the format:


[1]”username: text”

I checked the length of that and got 18 – the current timeline was 18 items. I tried the same thing substituting user id but that didn’t work. So then I tried to retrieve 500 items and that worked fine, too.

testlonger <- userTimeline ("cpikas", n=500)

Great. Now, let me see the dates so I can cut off the ones I want. Hm. Ok, let’s see, how to get the other columns. What type of object is this anyhow? The manual is no help. I tried some things with object$field. No joy. Tried to edit. no joy – it was upset about the < in the image url. And it was also telling me that the object was of type S4. The manual said it wasn’t but I can’t argue if that’s what it’s reading. I somehow figured out it was a list. I tried object name [[1]][2]  - null. Then I eventually tried


Hrumph. It says 1 slot. So as far as i can tell, it’s a deprecated object type and it didn’t retrieve or keep all of the other information needed to narrow by date.

When googling around, I ran across this entry by Heuristic Andrew on text mining twitter data with R. I didn’t try his method with the xml package yet (may try that). I did try the package that was listed in the comments tm.plugin.webcorpus by Mario Annau. That does get the whole tweet and put the things in slots the right way (object$author), but it looks like you can only do a word search. Oh wait, this just worked:

testTM <- getMeta.twitter('from:cpikas')

But that’s supposed to default to 100 per page, 100 things returned and it only returned 7 for me. I guess the next thing to try is the XML version unless someone reading this has a better idea?

edit: forgot the copy paste. When I tried to just look at the tweets i wanted on screen and then copy them into a text document it crashed firefox. who knows why

18 responses so far