Filling in missing data: books

May 17 2018 · Published under bibliometrics, information analysis

Photo by Janko Ferlič on Unsplash

My normal area for information analysis and bibliometrics is technology - not even really science, but technology. The current project I'm obsessing over (it's actually really cool and interesting and fun) spans political science, history, philosophy, and even some sociology and criminal justice. So I played all my reindeer games on the journal articles, but when it came to analyzing the abstracts, a lot were missing. I wanted to ignore these, but they were whole collections from extremely relevant journals, plus nearly all of the books. For the journals, I filled in what I could from Microsoft Academic (no endorsement intended).

Books though.

I hoped I could use WorldCat, but my library doesn't have a cataloging subscription with them, so I don't think that will work? I hoped LoC - I could download ALL the records from before about 2014, which would help a lot, but wow, I can't really deal with that much data right now. OpenLibrary - no.

Amazon - they've changed the rules a lot, and it wasn't clear to me that, even if I took all the time to figure it out, I wouldn't actually be breaking the current terms of service (they have changed).

I asked the wonderful group on the Code4Lib list, and one of them (thanks Mark!) pointed to Harvard's API and records. Woo-hoo! I hope they log my stats to justify their expense. I hope my messing around doesn't cause them any problems.

I'm not a cataloger, although I played one for 4 months about 15 years ago. And I don't know MODS or Dublin Core (although Dublin Core did exist when I was in library school). I pulled a couple of records and poked through them to see where what I wanted was located. Originally, I pulled the title from the record, but that proved to be too messy.
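
To get oriented, something like this pulls a single record to poke through (a sketch only: the endpoint URL is my best reading of the LibraryCloud documentation, and the ISBN is just a placeholder):

library(httr)
library(jsonlite)

# LibraryCloud v2 items endpoint, JSON flavor (check the docs if this has moved)
harvardurl <- "https://api.lib.harvard.edu/v2/items.json"

# pull one record by ISBN and look at the structure of what comes back
holding <- GET(harvardurl, query = list(identifier = "9780691158662", limit = 1))
booktxt <- fromJSON(rawToChar(holding$content))
str(booktxt, max.level = 3)

The abstract, when there is one, is buried at items$mods$abstract$`#text`, which is what the function below digs out.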

The data I needed to fill in:

  • gathered from reference managers of a team of analysts over a number of years (as many as 9)
  • searched in general (WoS, Scopus) and specific (World Political Science Abstracts, PAIS, etc.) databases
  • gathered in EndNote, de-duped, exported as tab-delimited

in R

  • limited to books and book sections
  • for titles, removed punctuation, made lower case, removed numbers
  • for authors, split to take the first name provided for the first author (mostly the last name)
  • for ISBNs, removed ISSNs, kept the first one, took out the hyphens (yes, to be more thorough, I should have OR'd them) - see the sketch after this list
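
Roughly, that cleaning looks like the sketch below. The read step and the X1/X11 column names are placeholders - only X4 (the title column), holdti, hold1au, and holdisbn2 are the names the code further down actually uses.

# read the tab-delimited EndNote export (file and column names are placeholders)
missingbooks1 <- read.delim("endnote_export.txt", stringsAsFactors = FALSE)

# titles: lower case, strip punctuation and numbers
missingbooks1$holdti <- gsub("[[:punct:]]|[[:digit:]]", "",
                             tolower(missingbooks1$X4))

# authors: first name provided for the first author (usually the surname)
missingbooks1$hold1au <- sapply(strsplit(missingbooks1$X1, "[,;[:space:]]+"),
                                function(x) x[1])

# ISBNs: drop ISSNs, keep the first ISBN, remove the hyphens
missingbooks1$holdisbn2 <- sapply(strsplit(missingbooks1$X11, ";"), function(x){
  x <- gsub("-", "", trimws(x))
  x <- x[nchar(x) >= 10]   # ISSNs come out to 8 characters; ISBNs are 10 or 13
  if (length(x) == 0) NA_character_ else x[1]
})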

 

# assumes httr and jsonlite are loaded and harvardurl points at the
# LibraryCloud items endpoint (see above)
getharvabs <- function(citeNo){
  # default row: the title we already have, no abstract
  citerow <- data.frame(id = citeNo,
                        TI = missingbooks1$X4[citeNo],
                        AB = NA)
  if (!is.na(missingbooks1$holdisbn2[citeNo])){
    # search by ISBN when we have one
    query <- paste0("identifier=", missingbooks1$holdisbn2[citeNo], "&limit=1")
    holding <- GET(harvardurl, query = query)
    print(holding$status_code)
    if (holding$status_code == 200){
      bookContent <- rawToChar(holding$content)
      booktxt <- fromJSON(bookContent)
      if (booktxt$pagination$numFound > 0){
        # keep the abstract only if the MODS record has one
        if (is.null(booktxt$items$mods$abstract$`#text`)){
          booktxt$items$mods$abstract$`#text` <- NA
        } else {
          booktxt$items$mods$abstract$`#text` <- as.character(booktxt$items$mods$abstract$`#text`)
        }
        citerow <- data.frame(id = citeNo,
                              TI = missingbooks1$X4[citeNo],
                              AB = booktxt$items$mods$abstract$`#text`)
      }
    }
  } else {
    # no ISBN: fall back to title plus first author
    # (titles with spaces may need URLencode() here)
    query <- paste0("title=", missingbooks1$holdti[citeNo],
                    "&name=", missingbooks1$hold1au[citeNo],
                    "&limit=1")
    holding <- GET(harvardurl, query = query)
    print(holding$status_code)
    if (holding$status_code == 200){
      bookContent <- rawToChar(holding$content)
      booktxt <- fromJSON(bookContent)
      if (booktxt$pagination$numFound > 0){
        if (is.null(booktxt$items$mods$abstract$`#text`)){
          booktxt$items$mods$abstract$`#text` <- NA
        } else {
          booktxt$items$mods$abstract$`#text` <- as.character(booktxt$items$mods$abstract$`#text`)
        }
        citerow <- data.frame(id = citeNo,
                              TI = missingbooks1$X4[citeNo],
                              AB = booktxt$items$mods$abstract$`#text`)
      }
    }
  }
  print("this is citerow")
  print(citeNo)
  print(citerow)
  return(citerow)
}


holdbookabs<-data.frame(id=as.integer(0),TI=as.character("test"), AB=as.character("test"))


for (i in 1:length(missingbooks1$holdti)){
  temp<-getharvabs(i)
  holdbookabs<-rbind(holdbookabs,temp)
  Sys.sleep(2)
}

This choked a couple of times. Maybe I don't always have the right book - but then again, if it's on the same subject, meh.
I also considered getting the LCSH, but decided against. Too much time spent on this already.
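
About those chokes: wrapping the call in tryCatch would let the loop keep going and just note which record failed (a sketch, not what I actually ran):

for (i in 1:length(missingbooks1$holdti)){
  temp <- tryCatch(getharvabs(i),
                   error = function(e){
                     message("record ", i, " failed: ", conditionMessage(e))
                     data.frame(id = i, TI = missingbooks1$X4[i], AB = NA)
                   })
  holdbookabs <- rbind(holdbookabs, temp)
  Sys.sleep(2)
}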
