Reflection on librarians and bibliometrics

(by Christina Pikas) Mar 25 2017

I only attended a few of the sessions related to bibliometrics at ACRL2017, but I talked with a bunch of new people and I saw what was presented. This confirmed my view that:

Librarians are sophisticated and nuanced producers and consumers of bibliometrics

Last fall I sat in a presentation by Ludo Waltman (slides pdf) in which he talked of librarians as "citizen" bibliometricians who are, essentially, only able to use the JIF and h-index uncritically because those are all we understand, and only able to look up metrics when they are carefully handed to us by vendors.

I was irate, because I do not see myself in that category at all. Nor do I see my colleagues at CMU, NYU, UIC, Northwestern, ARL, NIH, Cornell, and CWRU that way. But maybe I live in a bubble?

No. It was clear from ACRL that there are librarians everywhere who follow these things, care deeply about them, think critically about them, and who are sophisticated in creating and using metrics.

So I'm blowing a raspberry at your "citizen" bibliometrics. We are not the citizenry; we are the insurgency. OK, that's too dramatic. Let's say we're the army of a less-well-funded ally that nevertheless has lots of troops in the battle.

2 responses so far

ACRL2017: When Tradition and Reality Collide: Metrics, Impact and Beyond

(by Christina Pikas) Mar 25 2017

Friday morning Abigail Goben, Meg Smith, and I presented at the Association of College and Research Libraries conference in Baltimore. I am not an academic librarian but I do serve researchers. I would say that SLA is probably more appropriate for librarians serving researchers in government, industry, and other settings. This was local, though!

The polls yielded some interesting feedback.

  • Our audience members were overall fairly experienced in metrics, with some experts. They knew most of the terms we threw out.
  • Many of their libraries have informal support for metrics, with a few having formal support.
  • Librarians sometimes have an uneasy role with metrics:
    • Frustrated by inappropriate use, or use by uninformed people
    • Difficulty working with researchers and with administration: who should pay for the tools? who should do the work?
    • Librarian as neutral party vs. metrics used for competition
  • Many organizations do have RIM (research information management) systems, but they mostly sit with the office of research or the provost's office. There is a need for more guidance on how librarians can work with these offices.

No responses yet

Getting SXSW Schedule Information

(by Christina Pikas) Mar 13 2017

I asked around to see if anyone had this already - but no luck. So I did it myself, posted it to GitHub, and am now posting it here.

This gets a csv file with the basics of the interactive panels in it. You'd think a conference like this would offer this info in various formats? Maybe I just didn't look hard enough.

Also, it was a chance to test rvest vs. Beautiful Soup. It actually worked pretty well.

library("httr", lib.loc="~/R/win-library/3.3")
library("rvest", lib.loc="~/R/win-library/3.3")

#get the session page urls from the schedule page
#(these lines were lost to formatting - the url and the link selector below are
# placeholders; substitute the real schedule page and check the selector)
schedulepage <- read_html("http://schedule.sxsw.com/")
sessionpages <- schedulepage %>%
  html_nodes(".event-name a") %>%
  html_attr("href")

#test all the individual pieces
session<- sessionpages[2] %>%
  read_html()

name<- session %>%
  html_node(".event-name") %>%
  html_text()

date<- session %>%
  html_node(".event-date") %>%
  html_text()

speaker<- session %>%
  html_node("h4") %>%
  html_text()

organization<- session %>%
  html_node("h5") %>%
  html_text()

description <- session %>%
  html_node("p") %>%
  html_text()

#wrap the pieces in a function (the function header was lost to formatting)
getsessiondetails <- function(url){
  session<- url %>%
    read_html()
  name<- session %>%
    html_node(".event-name") %>%
    html_text()
  date<- session %>%
    html_node(".event-date") %>%
    html_text()
  speaker<- session %>%
    html_node("h4") %>%
    html_text()
  organization<- session %>%
    html_node("h5") %>%
    html_text()
  description <- session %>%
    html_node("p") %>%
    html_text()
  item<-c(url, name,date, speaker,organization, description)
  return (item)
}

#initialize the holder, then loop over the session pages
sessioninfo <- NULL
for (i in 1:length(sessionpages)) sessioninfo<- rbind(sessioninfo, getsessiondetails(sessionpages[i]))
write.csv(sessioninfo,file = "sxswinteractivepanels.csv")

No responses yet

Focusing on counts erodes research libraries' competitiveness

(by Christina Pikas) Dec 05 2016

by @glennoble

For many years, research libraries (mainly those in academic institutions but also in other research centers) have been all about counting collections: how many volumes owned? how many journals licensed? Bigger is better. Millions of volumes.

This pressure, combined with continual downward budgetary pressure and the global doubling of scientific output every nine years, has led most libraries to take some shortcuts to get more coverage (more volume and more volumes). In place of carefully curated abstracting and indexing services - necessarily specific to certain domains of knowledge, which help explore and identify sources of information but do not provide physical access - many libraries are licensing massive collections from Eb and PQ that hugely boost the numbers. They are also licensing massive "discovery" systems that, in my opinion, completely fail to improve discovery. We librarians have told our vendors that our most important users are the undergraduates who need any few articles on a topic to quickly pad their bibliographies. Vendor offerings that make that process easier are welcomed. So we cancel Inspec, BIOSIS, GEOBASE, and similar to feed the beast of more and more content. The vendors who provide access to formerly very useful databases (cough Aerospace cough) more or less eviscerate them to also give more: higher counts, faster, broader... and cheaper (no - lol - never cheaper for *libraries*).

Yet, as everyone has said before me, we are living in times of information abundance, not scarcity. We know we cannot survive with the library-as-pocketbook model. Some of our value comes from working with users as partners in their research. We work to understand what their information problem entails and to help them (teach them, do it for them, or provide tools so they can) find the information they need. We should also be building and licensing systems for the most sophisticated users on our faculties and in our research centers. We should strive for precision and also for the serendipity of unexpected, very relevant articles. We should save the time of the reader. What is the value of millions of responses to a web query if your answer is on page 10? New researchers should be taught to be more sophisticated in their searching (I honestly think chemistry may be the only field that does this well), instead of accepting good enough or random iteration around a theme.

The best services and tools respect the researcher's precious time. They help the researcher get better information more quickly, with more context and confidence. This is how we compete with the ubiquity of information freely available on the internet. It's something we do and something we can do quite well... but we need to stop these collection practices now, before it's too late.


*These are my opinions and do not necessarily reflect those of my immediate organization or my parent institution. Any specific products are mentioned to clarify my meaning. No endorsement should be inferred.

No responses yet

Getting article metadata from MS Academic: some R code

(by Christina Pikas) Nov 27 2016

As promised, I went back and did this myself instead of relying on a partner in crime (earlier referred to as an SME, but he outed himself). It's funny because I had his code, but he did things differently than I do, so I needed to do it myself.

On my first mostly successful run, I ended up with about 44% of the rows missing metadata. I discovered fairly quickly that tm's removePunctuation was, of course (in retrospect), closing up intraword dashes instead of leaving a space. You can have it ignore those, but you can't have it leave a space instead. I first did some finding and replacing in Excel, which got me down to 32%. Then I was like, duh, just do the gsub for [[:punct:]] and see if that's better. I hope I haven't used up my quota!
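To make the difference concrete, here's a quick sketch (the example title is invented):

```r
library("tm")

# removePunctuation deletes punctuation outright, so intraword dashes close up:
# "state-of-the-art methods" becomes "stateoftheart methods"
removePunctuation("state-of-the-art methods")

# preserve_intra_word_dashes keeps the dash, but there's still no way to ask
# for a space in place of the punctuation
removePunctuation("state-of-the-art methods", preserve_intra_word_dashes = TRUE)

# a plain gsub replaces every punctuation character with a space instead:
# "state of the art methods"
gsub("[[:punct:]]", " ", "state-of-the-art methods")
```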

Here's the code. Sign up for your key here. Also note: not affiliated, not endorsing.

#microsoft academic to try to find affiliations for article titles

library("httr", lib.loc="~/R/win-library/3.3")
library("tm", lib.loc="~/R/win-library/3.3")
library("jsonlite", lib.loc="~/R/win-library/3.3")


#don't forget the following or you will regret it
options(stringsAsFactors = FALSE)

# api info
# (these lines were lost to formatting; the endpoint and attribute list below
# are reconstructed from the post and the API docs of the time - verify them)
apiurl<-"https://westus.api.cognitive.microsoft.com/academic/v1.0/evaluate"
apiattrib<-"Ti,Y,AA.AuN,AA.AfN,J.JN,C.CN,E"

msakey1<-"put yours here"


#test on one to see how it works (the search expression is a placeholder)
searchexpr<-"Ti='some lower cased punctuation free title'"
testcite <- GET(apiurl, 
         query = list(expr = searchexpr,count = 1, attributes = apiattrib), add_headers("Ocp-Apim-Subscription-Key"= msakey1))

#get the json out into usable format
#could look for errors first

#comes out raw so need to make into text
testciteContent <- rawToChar(testcite$content)
test<-fromJSON(testciteContent)

#this will get a ; separated vector
paste(test$entities$AA[[1]]$AuN, collapse = ';')


# initiate a dataframe
# for each title, go out and search using that title
# could add in a warn_for_status(r)  when status is not 200 (happy)
# if status !200 go to the next one,  if status =200
# extract ti, y, authors (paste), affil (paste), jn, cn, and out of entities VFN, V, FP LP DOI D
# write them to the data frame
#1904 is the length of my article title list

CitesOut<- data.frame(ti = rep(NA,1904),
                      y = integer(1904),
                      au = rep(NA,1904),
                      af = rep(NA,1904),
                      jn = rep(NA,1904),
                      cn = rep(NA,1904),
                      vfn = rep(NA,1904),
                      v = rep(NA,1904),
                      fp = rep(NA,1904),
                      lp = rep(NA,1904),
                      doi = rep(NA,1904),
                      abs = rep(NA,1904),
                      stringsAsFactors = FALSE)

getMScites<- function(citeNo){
  searchexpr<- paste0("Ti='",TitlesToFindf[citeNo],"'")
  holding<- GET(apiurl,
               query = list(expr = searchexpr,count = 1, attributes = apiattrib), 
               add_headers("Ocp-Apim-Subscription-Key"= msakey1))
  print(paste("cite number", citeNo,"status is:", holding$status_code))
  holdingContent <- rawToChar(holding$content)
  holdC<-fromJSON(holdingContent)
  if (holding$status_code == 200 & length(holdC$entities) > 0) {
    #the core fields
    cciterow<-data.frame(
      ti=ifelse(is.null(holdC$entities$Ti),NA,holdC$entities$Ti),
      y=ifelse(is.null(holdC$entities$Y),NA,holdC$entities$Y),
      au=ifelse(is.null(holdC$entities$AA[[1]]$AuN),NA,paste(holdC$entities$AA[[1]]$AuN, collapse = ';')),
      af=ifelse(is.null(holdC$entities$AA[[1]]$AfN),NA,paste(holdC$entities$AA[[1]]$AfN, collapse = ';')),
      jn=ifelse(is.null(holdC$entities$J$JN),NA,holdC$entities$J$JN),
      cn=ifelse(is.null(holdC$entities$C$CN),NA,holdC$entities$C$CN),
      stringsAsFactors = FALSE)
    #the extended metadata comes back as a json string in E
    eholdC<-fromJSON(holdC$entities$E)
    eciterow<-data.frame(
      vfn=ifelse(is.null(eholdC$VFN),NA,eholdC$VFN),
      v=ifelse(is.null(eholdC$V),NA,eholdC$V),
      fp=ifelse(is.null(eholdC$FP),NA,eholdC$FP),
      lp=ifelse(is.null(eholdC$LP),NA,eholdC$LP),
      doi=ifelse(is.null(eholdC$DOI),NA,eholdC$DOI),
      abs=NA, #the abstract is an inverted index in E - skipping it here
      stringsAsFactors = FALSE)
    citerow<-cbind(cciterow,eciterow, stringsAsFactors=FALSE)
  } else {
    #no hit or an error - a row of NAs keeps the data frame aligned
    citerow<-data.frame(ti=NA,y=NA,au=NA,af=NA,jn=NA,cn=NA,vfn=NA,v=NA,fp=NA,lp=NA,doi=NA,abs=NA)
  }
  print("this is citerow")
  print(citerow)
  return(citerow)
}

#try one title
searchexpr<- paste0("Ti='",TitlesToFindf[4],"'")
testcite<- GET(apiurl,
               query = list(expr = searchexpr,count = 1, attributes = apiattrib), 
               add_headers("Ocp-Apim-Subscription-Key"= msakey1))



#prepare title list
## IMPORTANT - all the titles have to be lower case and there can't be any punctuation
TitlesToFind <- read.delim("~/DataScienceResearchInitiative/TitlesToFind.csv", header=FALSE)

#lower case (this line was lost to formatting)
TitlesToFindl<- tolower(TitlesToFind$V1)

#gsub leaves a space where removePunctuation would close up intraword dashes
TitlesToFindf<- gsub("[[:punct:]]"," ",TitlesToFindl)


#use the sys.sleep so you don't get an error for too many requests too quickly
for (i in 21:1904){
  CitesOut[i,]<- getMScites(i)
  Sys.sleep(1)
}

#rerun the ones that came back empty - missCites is the vector of those row
#numbers - writing each result back to its own row this time
for (i in 1:length(missCites)) {
  CitesOut[missCites[i],]<- getMScites(missCites[i])
  Sys.sleep(1)
}
Edited to fix formatting. Also, the missing cites were writing to the wrong rows... sigh.

No responses yet

Retrieving article metadata from Microsoft Academic Scholar

(by Christina Pikas) Nov 19 2016

In the ongoing saga of doing information analysis and bibliometrics of some sort in computer science... now I need affiliations. As a reminder, I did the first bit of this work in Inspec* because it has high-quality metadata, but then I discovered, after reviewing results with SMEs, that it was totally missing a bunch of important conferences in the field - most notably some big ones from ACM. So I searched DBLP using their API, ACM's Guide to the Computing Literature, arXiv, and CiteSeer and found a bunch more interesting articles. I de-duplicated against the Inspec set and then did topic modelling using whatever I had (abstract, title, and keywords, where available). Well, ACM doesn't export abstracts and DBLP doesn't even have them.

And then I got all turned around after linking the article titles back to the topics and working with the SMEs to name and select the interesting topics... so, oops... now I had a list of ~2000 bare titles and no other information, but I actually needed to give a list of top organizations and top venues for these interesting topics... Uh-oh.

Of course Google Scholar doesn't have an API. AMiner does, but a quick check had it returning 0 results for my first few titles through the web interface. CiteSeer, I don't even know. What to do? Ah-ha: Microsoft Academic Search* does have an API, but it's not all that comprehensive yet... oh wait, it actually IS quite good in computer science.

Ideally, there would already be an rOpenSci package to search it, but the only package I found was for some of the other Microsoft Cognitive Services APIs. The main Academic Knowledge site makes it very easy to sign up to make up to 10k requests a month for free. There's even a console you can use to test your queries separately from your code.
So what's the problem, you ask? Just iterate through searching for each title, pull down JSON for just the fields you need (C.CN, J.JN, AA.AfN), parse into a data frame, then tot them up... Yet our searches were not getting any results... until we happened on a StackOverflow question: you need to lower case the title and remove all punctuation prior to searching.
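A minimal sketch of that normalization step (the title here is invented):

```r
# the API's Ti expression matches normalized titles: lower case, no punctuation
ti <- "Deep Learning: Methods and Applications"
tinorm <- gsub("[[:punct:]]", " ", tolower(ti))
searchexpr <- paste0("Ti='", tinorm, "'")
# searchexpr is now "Ti='deep learning  methods and applications'"
```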

A SME at work ended up doing the actual coding for this part but I'm going to try to reproduce it on my own to make sure I have it. When I do, I will surely share.

Long story but: 1) it would be lovely to have a package to use for this API 2) MAG does work fine for this purpose for this topic 3) be sure to lower case and remove punctuation as a first step

*no endorsement intended

3 responses so far

Special Libraries, Information Fluency, & Post-Truth

(by Christina Pikas) Nov 19 2016

Lots of librarians are communicating online about providing resources and information in these fraught times. Some are creating displays and programs to support diverse populations. Others are crafting statements of support and/or opposition. More practically, some are stocking their reference desks with safety pins and extra scarves (to be used by women who have had their hijab snatched off).

But these activities are more useful, perhaps, in public or academic libraries.

In the past few days, "fake" news and propaganda have been receiving a lot of attention. (Hear a rundown from On The Media - clearly a very liberal news source, but transparent about it.) As noted on this blog, it is not really possible to insist that our patrons/customers/users use only our licensed sources. To be quite honest, even if we could, that alone isn't enough to ensure they are getting the best information. We think that because our people all have at least college degrees, they are experts in, or at least competent at, critical thinking.

I think, though, that the media environment isn't what it was when many of them were in school. We take the click bait and we see headlines repeated so often on Facebook that maybe we start to believe?

So, now, how do special libraries train and support their organizations in the post-truth world? I have been asked, and have accordingly scheduled, training that discusses critically evaluating resources; however, that is NOT at all attractive to busy professionals. The only training I offer that is well attended is problem-oriented and explicitly related to doing the scientific and technical work (no surprise here to my library school professors!). Otherwise, short, on-point training at the point of need is also well accepted.

Integrate aspects of critical thinking and evaluating resources into every bit of training you do. If your user base qualifies for a free web account for the Washington Post (.gov, .mil, & .edu), make that information available even if you provide access through another source. Do show how to find news information in other topical sessions. For example, a session on aerospace engineering could cover things like society news sources and Aviation Week.

If your organization has an internal newsletter and/or social media site, link early and often to reputable sources.

Are you integrated into strategic processes (never as much as you would like, I know!)? What information is your leadership getting and from where? The very highest levels of your organization won't typically attend your classes - can you brief their assistants? Can you make this information available to their mobile devices?

No responses yet

Using bibliometrics to make sense of research proposals

(by Christina Pikas) Nov 01 2016

This was presented at the Bibliometrics & Research Assessment Symposium held at NIH on October 31, 2016.

No responses yet

DBLP > EndNote using R

(by Christina Pikas) Oct 17 2016

I'm doing a study in which I'm mapping the landscape for an area of Computer Science. I did the initial work in Inspec, and once I found the best search (hint: use a classification code and then the term), I was pretty happy with the results. When I showed it to my SMEs, however, they fairly quickly noticed I was missing some big-name ACM conferences in the field. I've contacted Inspec about those being missing from the database, but in the meantime, oops! What else is missing?

The more comprehensive databases in CS are things like the ACM Guide to Computing Literature, CiteSeer, and DBLP... ACM is very difficult to be precise with, and you can only export either all the references or one at a time... CiteSeer was giving me crazy results... DBLP had good results but, once again, exports one at a time.
So here's how to use DBLP's API through R and then get the results into EndNote (using X7 desktop)

#getting stuff faster from dblp
options(stringsAsFactors = FALSE)
library("httr", lib.loc="~/R/win-library/3.3")
library("jsonlite", lib.loc="~/R/win-library/3.3")
library("XML", lib.loc="~/R/win-library/3.3")
library("plyr", lib.loc="~/R/win-library/3.3")
library("dplyr", lib.loc="~/R/win-library/3.3")


# for publication queries


# Parameter	Description	Default	Example
# q The query string to search for, as described on a separate page.		...?q=test+search
# format The result format of the search. Recognized values are "xml", "json", and "jsonp".	xml	...?q=test&format=json
# h Maximum number of search results (hits) to return. For bandwidth reasons, this number is capped at 1000.	30	...?q=test&h=100
# f The first hit in the numbered sequence of search results (starting with 0) to return. In combination with the h parameter, this parameter can be used for pagination of search results.	0	...?q=test&h=100&f=300
# c Maximum number of completion terms (see below) to return. For bandwidth reasons, this number is capped at 1000.	10	...?q=test&c=0

# (the request url was lost to formatting; this is dblp's publication search
# endpoint, and the query values below are placeholders)
dblpurl<-"http://dblp.org/search/publ/api"
raw.result<- GET(dblpurl, query = list(q = "your search terms", format = "json", h = 1000))

this.raw.content <- rawToChar(raw.result$content)

#parse the json - the hit list sits under result$hits$hit
this.content.list<-fromJSON(this.raw.content, simplifyVector = FALSE)

this.content.frame<- ldply(this.content.list$result$hits$hit, data.frame)

#update to be sure to use the correct field names - except for author because still need to combine later
#two word ones have to be made into one word - for R - have to edit later
#ReferenceType has to be first to import multiple types in one file others order doesn't matter
#(the flattened author column names were lost to formatting; check
# names(this.content.frame) and adjust the info.authors.author* names to match)
content.frame3<- data.frame(ReferenceType = this.content.frame$info.type,
                            Title = this.content.frame$info.title,
                            author = this.content.frame$info.authors.author,
                            author1 = this.content.frame$info.authors.author.1, 
                            author.2 = this.content.frame$info.authors.author.2, 
                            author.3 = this.content.frame$info.authors.author.3, 
                            author4 = this.content.frame$info.authors.author.4, 
                            author5 = this.content.frame$info.authors.author.5, 
                            author6 = this.content.frame$info.authors.author.6, 
                            SecondaryTitle = this.content.frame$info.venue, 
                            Pages = this.content.frame$info.pages, Year = this.content.frame$info.year, 
                            URL = this.content.frame$info.url, 
                            Volume = this.content.frame$info.volume, Number = this.content.frame$info.number, 
                            SecondaryAuthor = NA, #source column name lost; fill in if dblp returns an editor field
                            Publisher = this.content.frame$info.publisher)

#want to get all authors together and get it basically in the format for TR. 
# first get all authors together separated by ; 
# example:  data <- within(data,  id <- paste(F, E, D, C, sep="")

content.frame4<- within(content.frame3, Author<- paste(author,author1,author.2, author.3, author4, author5, author6, sep="; " ))

content.frame4$Author<-gsub("NA; ","",content.frame4$Author)
#also catch a trailing NA
content.frame4$Author<-gsub("; NA","",content.frame4$Author)


#remove NA from other fields (this line was lost to formatting)
content.frame4[is.na(content.frame4)] <- ""

#now drop unwanted columns using df <- subset(df, select = -c(a,c) )

content.frame5<-subset(content.frame4, select = -c(author,author1,author.2, author.3, author4, author5, author6))

#add in a gsub for the correct reference types
content.frame5$ReferenceType<-gsub("Conference and Workshop Papers","Conference Paper", content.frame5$ReferenceType)
content.frame5$ReferenceType<-gsub("Parts in Books or Collections","Book Section", content.frame5$ReferenceType)
content.frame5$ReferenceType<-gsub("Books and Theses","Book", content.frame5$ReferenceType)
content.frame5$ReferenceType<-gsub("Journal Articles","Journal Article", content.frame5$ReferenceType)

#need tab delimited no rownames and update column names to have the necessary spaces

correctnames<- c("Reference Type","Title", "Secondary Title", "Pages", "Year",  "URL", "Volume", "Number", "Secondary Author", "Publisher", "Author")

# if only one type of reference, put *Generic at the top of the file; this version appends and adds the vector of correct column names
#write.table(content.frame5,"dblptestnew.txt",append = T, quote=F,sep = "\t",row.names = F,col.names=correctnames, fileEncoding = "UTF-8")

#if multiple types use this one
write.table(content.frame5,"dblp30wrangl.txt", quote=F,sep = "\t",row.names = F,col.names=correctnames, fileEncoding = "UTF-8")

(this is also on GitHub because WP keeps messing up the code)

After you have this file, import it into EndNote using the built-in tab-delimited import option, with UTF-8 translation.

4 responses so far

Unpacking Societies Publishing With For Profit Companies

(by Christina Pikas) Aug 06 2016

This week, Jon Tennant went off on a riff on Wiley and the poor experience he had with a particular journal published for a society by Wiley.

First - I'm not affiliated and so very much not endorsing any companies, etc.

Second - I'm on record saying some things are worth paying for and I still feel that way.

I've reviewed for a Wiley-published society journal but not published with one. The ScholarOne interface is like whoa, yuck, but that is, by the way, actually a TR (Thomson Reuters) product. My interactions with the editorial staff have been very professional and pleasant.

I've also been helping a colleague navigate ScholarOne to submit to a Taylor and Francis journal. It has been more than a year and we're still going back and forth with them. E-mails to the editor go unanswered. One reviewer was just like "this isn't science" and didn't do any more reviewing. The other has provided detailed feedback, which the authors have appreciated.

Over the years, I've seen plenty of organizations think they can just do it all themselves. Why, though, should they not outsource to vendors who already have the infrastructure? I mean, OJS is just ugly. Free CMSes are plentiful, but just because you can put articles online cheaply doesn't mean they'll work with the rest of the ecosystem.

From what I can tell of what Tennant said, his real problem is with the society and the editors, not with the platform.

The other thing to think about: if the society had to pay the intermediate vendors themselves (Atypon, etc.) and manage those relationships, would that really be cheaper than an all-in-one package? Maybe? Not sure.

Remember, too, that journals are sometimes expensive because the society sees them as a revenue stream, so it can pay expensive executives and lobbyists and maybe fund a scholarship here or there.

If you're part of a society trying to make the decision, you'll likely have the numbers to help - but I don't think the decision is as obvious as everyone thinks.

No responses yet
