Archive for the 'bibliometrics' category

Metaknowledge (python) and Bibliometrix (R) - more or less comprehensive bibliometrics packages for standard data science environments

Aug 28 2017 Published by under bibliometrics, Uncategorized

I thought for sure I had mentioned Metaknowledge here before, but I can't find it, so I must have misremembered. ...

There are tons of tools for bibliometrics, and a lot of people just code their own for simplicity's sake, even if they eventually visualize their results in an off-the-shelf network analysis tool. Sci2, VOSviewer, and CiteSpace are all close to comprehensive, freely available, and pretty easy to use. What need is there for another product? If you want to keep everything in the rest of your workflow, or experiment with new algorithms that aren't available in the tools above, then these two packages are good options.

When I was doing the longitudinal clustering for citation trajectories, I inadvertently saved the 6,666 (I know!) records from 1980-2015 in the regular WoS* plain-text format instead of the more useful tab-delimited one. I quite easily pulled out the pub year, accession number, times cited, and other simple fields using R ... it's just that now, when I actually want to follow up with some natural language processing on the titles and abstracts, I realize my kludge won't work for either the title or the abstract. So I fooled with it a couple of different ways before heading out to see if there was anything new for processing these files, since they were such a hassle to get in the first place. It turns out there is a new, fairly comprehensive R package: Bibliometrix. I had already experimented with Metaknowledge in Python. The extensive instructions (paywall) are very helpful, but I really just wanted to stay in R.
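For what it's worth, the kludge was along these lines (a sketch rather than my exact code, assuming the standard two-letter WoS field tags and a hypothetical savedrecs.txt export):

#single-line fields pull out cleanly with the two-letter tags
recs <- readLines("savedrecs.txt")
pubyear <- sub("^PY ", "", grep("^PY ", recs, value = TRUE))
accession <- sub("^UT ", "", grep("^UT ", recs, value = TRUE))
timescited <- as.integer(sub("^TC ", "", grep("^TC ", recs, value = TRUE)))
#but titles (TI) and abstracts (AB) wrap onto indented continuation lines,
#so this approach silently drops everything after the first line - hence the problem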

What follows is a general intro to these two tools and my observations.

Bibliometrix

http://www.bibliometrix.org/

This package appears to be quite new, with recent releases. The first thing I tried - reading in a directory full of WoS export files - was like magic. In a snap, I had a data frame with everything in the right columns.

Literally:

#isi2df() comes from bibliometrix; ldply() from plyr
library(bibliometrix)
library(plyr)

#all of the WoS plain-text export files saved to one directory
filenames <- list.files("directory", full.names=TRUE)

#read one export file and convert it to a data frame
getWoSdf<-function(filename){
 holdrecs<-readLines(filename)
 recsdf<-isi2df(holdrecs)
 return(recsdf)
}

#run that over every file and stack the results into one data frame
WoSall<- ldply(filenames, getWoSdf)

It seems like BibTeX files are preferred over this format, but it was plenty quick for the 500 records per file I had. A nice feature is that it reports progress every hundred records.

Another nice thing is that there are pre-built basic summary/descriptive functions. It exports the standard networks, and it also does co-word analysis with a pretty neat visualization.

Multiple Correspondence Analysis (MCA) using keywords
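For reference, that part of the workflow looked roughly like this for me (a sketch assuming the function names in the version I used - biblioAnalysis, biblioNetwork, networkPlot, conceptualStructure - so check the current docs, since the package is moving fast):

library(bibliometrix)

#WoSall is the data frame built from the WoS exports above
res <- biblioAnalysis(WoSall, sep = ";")
summary(res, k = 10, pause = FALSE)  #most productive authors, sources, countries, etc.
plot(res, k = 10, pause = FALSE)

#keyword co-occurrence (co-word) network
NetMatrix <- biblioNetwork(WoSall, analysis = "co-occurrences", network = "keywords", sep = ";")
networkPlot(NetMatrix, n = 30, Title = "Keyword co-occurrences", type = "fruchterman")

#the MCA keyword map shown above
conceptualStructure(WoSall, field = "ID", method = "MCA", minDegree = 5, k.max = 5)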

This tool doesn't do anything to help you clean the data or pick out any weirdnesses. The visualizations shown aren't super pretty, but it's quite easy to use another R graphing tool with the data.

MetaKnowledge

http://networkslab.org/metaknowledge/

I worked through the journal article, but using my own WoS data. For WoS data, everything worked as expected, and I was able to quickly get really nice results. You can also download a Jupyter notebook with their sample data to work through the process. A neat thing you don't see every day: it will break results down by male/female authorship, guessing gender from first names with a popular algorithm. It also does Reference Publication Year Spectroscopy (meh), and besides extracting all of the standard networks you might want, it has ways to extract text for text mining.

Some negatives, based on my brief experiments: I couldn't get Scopus data to work, for whatever reason. It also doesn't really facilitate any sort of cleaning, and data sets that come out of WoS still have issues. The graph examples were not very pretty, and the way they had you use one of the graphing tools produced some sort of error. Still, it's quite easy to export the data and substitute your favorite of the million graphing options out there.

 

* no endorsement intended, for real.

 

2 responses so far

Animating a graph

Apr 21 2017 Published by under bibliometrics, information analysis

What I really want is an interactive network graph that my viewer can click on, slide things, flip switches, etc. I'll get there someday. In the meantime, this was a fairly easy way to show the evolution of a network over time.

The network is a company network generated in Quid*. I used the time filter and exported a PNG every 3-5 years. The point is that there has been a lot of growth and, in particular, clusters of companies have formed.

Here it is:

Probably need to click to enlarge to see the animation

ImageMagick is an awesome open source image tool. There have been other ways to get at it from R or Python, but the magick package is the easiest I've seen. The vignettes and a blog post from rOpenSci were helpful.

 

library("magick", lib.loc="~/R/win-library/3.3")
setwd("I:/Christina's/person")
newcomp_files<-list.files("folder", pattern="*.png", full.names=TRUE)
newcomp<-lapply(newcomp_files, image_read)
#need image_join because the above produces a list instead of a magick vector 
#dispose deletes previous image
newcomp_animate<-image_animate(image_join(newcomp), fps = 0.5, dispose = "previous")
image_write(newcomp_animate, 'newcomp_animate_82-.gif')

 
Note: for the final version you see, I also used magick to crop (not great) and to annotate each frame with the year. I slowed the frame rate down quite a bit.
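That cleanup step was along these lines (a sketch with made-up crop geometry and year labels - assuming one year per frame - so check the magick vignette for the exact arguments):

#crop each frame to the plot area and stamp the year in a corner
#(the geometry string and the years vector here are placeholders)
years <- seq(1982, 2017, by = 5)
newcomp_labeled <- mapply(function(img, yr) {
  img <- image_crop(img, "800x600+100+50")
  image_annotate(img, as.character(yr), size = 40, color = "black", location = "+20+20")
}, newcomp, years, SIMPLIFY = FALSE)

#slower frame rate: one frame every two seconds
newcomp_animate <- image_animate(image_join(newcomp_labeled), fps = 0.5, dispose = "previous")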
 

I tried morphing, too; it was pretty slow and really not worth it.

*not affiliated and not endorsing

No responses yet

Who are my researchers citing? A quick and dirty way to find out

Apr 21 2017 Published by under bibliometrics, Collection Development

This probably shouldn't warrant a post, but a few very experienced and up-to-date librarians didn't immediately know how, so it's probably worth sharing. Also, it turns out that Scopus* has hidden or removed a lot of the useful stuff and made it confusing even to know where to enter the search.**

In Scopus, search for your researchers. In my case, an affiliation search works.***

 

The affiliation search is a tab on the home page, but there's no way to know it's a tab - it's just an underlined word. You then click around on the page until you find some place to type your query.

From the results list - and mind you, I do this exact search regularly, so yours might not be so precise - go ahead and click through to all the documents. If you're at NIH, then crap, because you have more than 10,000 journal articles per year, so you have to do some major slicing and dicing. I just limited to 2016 and journal articles, just because.

Then you look for the "..." menu, but you realize it's grayed out and you can't actually click on it.

So then you click to select all, and then you click on "..." and you see "View references."

From here, you can list the top sources and, theoretically, analyze them. They're not completely clean, though: my set had JGR abbreviated as well as spelled out, plus the specific sections, and likewise with ApJ. So how quick and how dirty is ok? For collection development, you're probably fine just reading the list off. Otherwise you could export and then use OpenRefine or similar to clean - or stay in R, as sketched below.
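A quick-and-dirty normalization pass in R might look something like this (hypothetical file and column names; the variants you collapse will depend on your own export):

#hypothetical: refs.csv exported from Scopus with a "Source" column of cited journal names
refs <- read.csv("refs.csv", stringsAsFactors = FALSE)

#collapse a few known variants onto one label (extend as you spot more)
refs$Source[grepl("journal of geophysical research|^jgr", tolower(refs$Source))] <- "JGR"
refs$Source[grepl("astrophysical journal|^apj", tolower(refs$Source))] <- "ApJ"

#then count the top cited sources
head(sort(table(refs$Source), decreasing = TRUE), 20)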

* Not affiliated, not endorsing!

** plus there's this thing in advanced search where it is forever putting in codes I do not want ... anyway ...

***hey, all the branding for my larger institution is gone? aw come on.

No responses yet

Nebulous Connections Presentation: Using bibliometrics to keep up with the Joneses

Apr 04 2017 Published by under bibliometrics

I presented this today, April 4, 2017, at the SLA Maryland event held at NASA Goddard's Library.

No responses yet

Reflection on librarians and bibliometrics

Mar 25 2017 Published by under bibliometrics

I only attended a few of the sessions related to bibliometrics at ACRL2017, but I talked with a bunch of new people and I saw what was presented. This confirmed my view that:

Librarians are sophisticated and nuanced producers and consumers of bibliometrics

Last fall I sat in a presentation by Ludo Waltman (slides pdf) in which he talked of librarians as "citizen" bibliometricians who are, essentially, only able to use the JIF and h-index uncritically because those are all we understand - only able to look up metrics when they're carefully handed to us by vendors.

I was irate, because I do not see myself in that category at all. Nor do I see my colleagues at CMU, NYU, UIC, Northwestern, ARL, NIH, Cornell, and CWRU that way. But maybe I live in a bubble?

No. It was clear from ACRL that there are librarians everywhere who follow these things, care deeply about them, think critically about them, and who are sophisticated in creating and using metrics.

So I'm blowing a raspberry at your "citizen" bibliometrics. We are not the citizenry; we are the insurgency.   Ok, that's too dramatic. Let's say we're the army of the less-well funded ally that nevertheless has lots of troops in the battle.

2 responses so far

ACRL2017: When Tradition and Reality Collide: Metrics, Impact and Beyond

Mar 25 2017 Published by under bibliometrics, Conferences

Friday morning Abigail Goben, Meg Smith, and I presented at the Association of College and Research Libraries conference in Baltimore. I am not an academic librarian but I do serve researchers. I would say that SLA is probably more appropriate for librarians serving researchers in government, industry, and other settings. This was local, though!

The polls yielded some interesting feedback.

  • Our audience members were overall fairly experienced in metrics, with some experts. They knew most of the terms we threw out.
  • Many of their libraries have informal support for metrics with a few libraries having formal support
  • Librarians sometimes have an uneasy role with metrics:
    • Frustrated with inappropriate use or use by uninformed people
    • Difficulty working with researchers and with administration: who should pay for the tools? who should do the work?
    • Librarian as neutral vs. metrics for competition
  • Many organizations do have RIM thingies (research information management systems), but they are mostly run out of the office of research or the provost's office. There is a need for more help with how librarians can work with these offices.

No responses yet

Getting article metadata from MS Academic: some R code

As promised, I went back and did this myself instead of relying on a partner in crime (earlier referred to as an SME, but he outed himself). It's funny because I had his code, but he did things differently than I do, so I needed to work through it myself.

On my first mostly successful run, I ended up with about 44% of the rows missing metadata. I discovered fairly quickly that tm's removePunctuation was, of course (in retrospect), closing up intraword dashes instead of leaving a space. You can have it ignore those, but you can't have it leave a space instead. I first did some finding and replacing in Excel, which got me down to 32%. Then I was like, duh, just do a gsub for [[:punct:]] and see if that's better. I hope I haven't used up my quota!
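To make the difference concrete (a made-up title):

library(tm)
removePunctuation("data-driven design of microfluidics")
# "datadriven design of microfluidics" - the dash closes up, so the title no longer matches
gsub("[[:punct:]]", " ", "data-driven design of microfluidics")
# "data driven design of microfluidics" - a space is left in place of the dash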

Here's the code. Sign up for your key here. Also note: not affiliated, not endorsing.

#microsoft academic to try to find affiliations for article titles

library("httr", lib.loc="~/R/win-library/3.3")
library("tm", lib.loc="~/R/win-library/3.3")
library("jsonlite", lib.loc="~/R/win-library/3.3")

setwd("~/DataScienceResearchInitiative")

#don't forget the following or you will regret it
options(stringsAsFactors = FALSE)

# api info https://dev.projectoxford.ai/docs/services/56332331778daf02acc0a50b/operations/565d753be597ed16ac3ffc03

# https://api.projectoxford.ai/academic/v1.0/evaluate[?expr][&model][&count][&offset][&orderby][&attributes]

#key:
msakey1<-"put yours here"

apiurl<-"https://api.projectoxford.ai/academic/v1.0/evaluate?expr="
searchexpr<-"Ti='example'"
apiattrib<-"Ti,Y,AA.AuN,AA.AfN,C.CN,J.JN,E"

#test on one to see how it works
testcite <- GET(apiurl, 
         query = list(expr = searchexpr,count = 1, attributes = apiattrib), add_headers("Ocp-Apim-Subscription-Key"= msakey1))

#get the json out into usable format
#could look for errors first
testcite$status_code

#comes out raw so need to make into text
testciteContent <- rawToChar(testcite$content)

test<-fromJSON(testciteContent)
test$entities$AA
test$entities$AA[[1]]$AuN
#this will get a ; separated vector
paste(test$entities$AA[[1]]$AuN, collapse = ';')

test$entities$AA[[1]]$AfN
test$entities$J$JN
test$entities$Y
test$entities$Ti

# initiate a dataframe
# for each title, go out and search using that title
# could add in a warn_for_status(r)  when status is not 200 (happy)
# if status !200 go to the next one,  if status =200
# extract ti, y, authors (paste), affil (paste), jn, cn, and out of entities VFN, V, FP LP DOI D
# write them to the data frame
#1904 is the length of my article title list

CitesOut<- data.frame(ti = rep(NA,1904),
                      y = integer(1904),
                      au = rep(NA,1904),
                      af = rep(NA,1904),
                      jn = rep(NA,1904),
                      cn = rep(NA,1904),
                      vfn = rep(NA,1904),
                      v = rep(NA,1904),
                      fp = rep(NA,1904),
                      lp = rep(NA,1904),
                      doi = rep(NA,1904),
                      abs = rep(NA,1904),
                      stringsAsFactors = FALSE)
  
getMScites<- function(citeNo){
  apiurl<-"https://api.projectoxford.ai/academic/v1.0/evaluate?expr="
  searchexpr<- paste0("Ti='",TitlesToFindf[citeNo],"'")
  apiattrib<-"Ti,Y,AA.AuN,AA.AfN,C.CN,J.JN,E"
  holding<-GET(apiurl,
               query = list(expr = searchexpr,count = 1, attributes = apiattrib), 
               add_headers("Ocp-Apim-Subscription-Key"= msakey1))
  print(paste("cite number", citeNo,"status is:", holding$status_code))
  print(holding$headers$`content-length`)
  holdingContent <- rawToChar(holding$content)
  holdC<-fromJSON(holdingContent)
  cciterow<-data.frame(
    ti=ifelse(is.null(holdC$entities$Ti),NA,holdC$entities$Ti),
    y=ifelse(is.null(holdC$entities$Y),NA,as.integer(holdC$entities$Y)), 
    au=ifelse(is.null(holdC$entities$AA[[1]]$AuN),NA,paste(holdC$entities$AA[[1]]$AuN, collapse = ';')),
    af=ifelse(is.null(holdC$entities$AA[[1]]$AfN),NA,paste(holdC$entities$AA[[1]]$AfN, collapse = ';')),
    jn=ifelse(is.null(holdC$entities$J$JN),NA,holdC$entities$J$JN),
    cn=ifelse(is.null(holdC$entities$C$CN),NA,holdC$entities$C$CN))
  print(cciterow)
  if(is.null(holdC$entities$E)){
    eciterow<-data.frame(
      vfn=NA,
      v=NA,
      fp=NA,
      lp=NA,
      doi=NA,
      abs=NA)
  } else {
    holdE<-fromJSON(holdC$entities$E)
    eciterow<-data.frame(
      vfn=ifelse(is.null(holdE$VFN),NA,holdE$VFN),
      v=ifelse(is.null(holdE$V),NA,holdE$V),
      fp=ifelse(is.null(holdE$FP),NA,holdE$FP),
      lp=ifelse(is.null(holdE$LP),NA,holdE$LP),
      doi=ifelse(is.null(holdE$DOI),NA,holdE$DOI),
      abs=ifelse(is.null(holdE$D),NA,holdE$D)
    )
  }
  print(eciterow)
  citerow<-cbind(cciterow,eciterow, stringsAsFactors=FALSE)
  print("this is citerow")
  print(citerow)
  return(citerow)
} 

#troubleshooting
apiurl<-"https://api.projectoxford.ai/academic/v1.0/evaluate?expr="
searchexpr<- paste0("Ti='",TitlesToFindf[4],"'")
apiattrib<-"Ti,Y,AA.AuN,AA.AfN,C.CN,J.JN,E"
troubleshoot<-GET(apiurl,
               query = list(expr = searchexpr,count = 1, attributes = apiattrib), 
               add_headers("Ocp-Apim-Subscription-Key"= msakey1))

troubleshoot$status_code
troubleshoot$headers$`content-length`

troubleshootcontent<-rawToChar(troubleshoot$content)  
troubleC<-fromJSON(troubleshootcontent)
troubleE<-fromJSON(troubleC$entities$E)

#prepare title list
## IMPORTANT - all the titles have to be lower case and there can't be any punctuation
TitlesToFind <- read.delim("~/DataScienceResearchInitiative/TitlesToFind.csv", header=FALSE)

TitlesToFindl<-apply(TitlesToFind,1,tolower)

TitlesToFindf<- gsub("[[:punct:]]"," ",TitlesToFindl)

head(TitlesToFindf)

#use the sys.sleep so you don't get an error for too many requests too quickly
for (i in 21:1904){
  temp<-getMScites(i)
  CitesOut[i,]<-temp
  Sys.sleep(2)}
write.csv(CitesOut,"MSdsCites.csv")

length(which(is.na(CitesOut$ti)))
length(which(is.na(CitesOut$abs)))

missCites<-which(is.na(CitesOut$ti))

for (i in 1:length(missCites)) {
  temp<-getMScites(missCites[i])
  CitesOut[missCites[i],]<-temp
  Sys.sleep(2)
}

Edited to fix formatting. Also, the missing cites were writing to the wrong rows ... sigh.

No responses yet

Retrieving article metadata from Microsoft Academic Scholar

Nov 19 2016 Published by under bibliometrics, information analysis

In the ongoing saga of doing information analysis and bibliometrics of some sort in computer science ... now I need affiliations. As a reminder, I did the first bit of this work in Inspec* because it has high-quality metadata, but then I discovered, after reviewing results with SMEs, that it was totally missing a bunch of important conferences in the field - most notably some big ones from ACM. So I searched DBLP using their API, ACM's Guide to the Computing Literature, arXiv, and CiteSeer and found a bunch more interesting articles. I de-duplicated against the Inspec set and then did topic modeling using whatever I had (abstract, title, and keywords). Well, ACM doesn't export abstracts, and DBLP doesn't even have them.

And then I got all turned around after linking the article titles back to the topics and working with the SMEs to name and select the interesting topics ... so, oops ... now I had a list of ~2000 bare titles and no other information, but I actually needed to give a list of top organizations and top venues for these interesting topics ... Uh-oh.

Of course Google Scholar doesn't have an API. AMiner does, but a quick check through the web interface returned 0 results for my first few titles. CiteSeer, I don't even know. What to do? Ah-ha: Microsoft Academic Search* does have an API, but it's not all that comprehensive yet ... oh wait - it actually IS quite good in computer science.

Ideally, there would already be an rOpenSci package for searching it, but the only package I found was for some of the other Microsoft Cognitive Services APIs. The main Academic Knowledge site makes it very easy to sign up for up to 10k requests a month for free. There's even a console you can use to test your queries separately from your code.
So what's the problem, you ask? Just iterate through the titles, search for each one, pull down JSON for just the fields you need (C.CN, J.JN, AA.AfN), parse it into a data frame, then tot them up ... Yet our searches were not getting any results ... until we happened on a StackOverflow question: you need to lowercase the titles and remove all punctuation prior to searching.
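The "tot them up" part is the easy bit once you have a data frame. Something like this would do it (assuming a hypothetical results data frame with af holding semicolon-separated affiliations and jn holding the venue):

#top organizations: split the semicolon-separated affiliation strings and count
affs <- unlist(strsplit(results$af[!is.na(results$af)], ";"))
head(sort(table(trimws(affs)), decreasing = TRUE), 10)

#top venues
head(sort(table(results$jn), decreasing = TRUE), 10)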

A SME at work ended up doing the actual coding for this part but I'm going to try to reproduce it on my own to make sure I have it. When I do, I will surely share.

Long story short: 1) it would be lovely to have an R package for this API; 2) MAG does work fine for this purpose and this topic; 3) be sure to lowercase and remove punctuation as a first step.

*no endorsement intended

3 responses so far

Using bibliometrics to make sense of research proposals

Nov 01 2016 Published by under bibliometrics, Conferences

This was presented at the Bibliometrics & Research Assessment Symposium held at NIH on October 31, 2016.

No responses yet

DBLP > EndNote using R

Oct 17 2016 Published by under bibliometrics

I'm doing a study in which I'm mapping the landscape for an area of computer science. I did the initial work in Inspec, and once I found the best search (hint: use a classification code and then the term), I was pretty happy with the results. When I showed it to my SMEs, however, they fairly quickly noticed I was missing some big-name ACM conferences in the field. I've contacted Inspec about those being missing from the database, but in the meantime, oops! What else is missing?

The more comprehensive databases in CS are things like the ACM Guide to Computing Literature, CiteSeer, and DBLP ... ACM is very difficult to be precise with, and you can either export all the references or one at a time ... CiteSeer was giving me crazy results ... DBLP had good results, but once again, you export one at a time.
So here's how to use DBLP's API through R and then get the results into EndNote (using X7 desktop):

#getting stuff faster from dblp
#https://www.r-bloggers.com/accessing-apis-from-r-and-a-little-r-programming/
options(stringsAsFactors = FALSE)
library("httr", lib.loc="~/R/win-library/3.3")
library("jsonlite", lib.loc="~/R/win-library/3.3")
library("XML", lib.loc="~/R/win-library/3.3")
library("plyr", lib.loc="~/R/win-library/3.3")
library("dplyr", lib.loc="~/R/win-library/3.3")


setwd("~/DataScienceResearchInitiative")



#http://dblp.org/search/publ/api for publication queries

url<-"http://dblp.org/"
path<-"search/publ/api"

# Parameters (from the DBLP API docs):
# q      the query string to search for                                               example: ...?q=test+search
# format result format: "xml", "json", or "jsonp" (default xml)                       example: ...?q=test&format=json
# h      maximum number of hits to return, capped at 1000 (default 30)                example: ...?q=test&h=100
# f      first hit to return, starting with 0; combine with h for paging (default 0)  example: ...?q=test&h=100&f=300
# c      maximum number of completion terms to return, capped at 1000 (default 10)    example: ...?q=test&c=0

raw.result<- GET("http://dblp.org/search/publ/api?q=wrangl")

this.raw.content <- rawToChar(raw.result$content)


#http://rpubs.com/jsmanij/131030
this.content.list<-xmlToList(this.raw.content)

this.content.frame<- ldply(this.content.list$hits, data.frame)


#rename to EndNote's field names - except for the authors, which still need to be combined later
#two-word field names have to be made into one word for R; the spaces get added back at the end
#ReferenceType has to be first to import multiple reference types in one file; the other columns can be in any order
content.frame3<- data.frame(ReferenceType = this.content.frame$info.type,
                            Title = this.content.frame$info.title, author = this.content.frame$info.authors.author,
                            author1 = this.content.frame$info.authors.author.1, 
                            author.2 = this.content.frame$info.authors.author.2, 
                            author.3 = this.content.frame$info.authors.author.3, 
                            author4 = this.content.frame$info.authors.author.4, 
                            author5 = this.content.frame$info.authors.author.5, 
                            author6 = this.content.frame$info.authors.author.6, 
                            SecondaryTitle = this.content.frame$info.venue, 
                            Pages = this.content.frame$info.pages, Year = this.content.frame$info.year, 
                             URL = this.content.frame$info.url, 
                            Volume = this.content.frame$info.volume, Number = this.content.frame$info.number, 
                            SecondaryAuthor = this.content.frame$info.author, 
                            Publisher = this.content.frame$info.publisher)
content.frame3<-distinct(content.frame3)


#want to get all authors together and get it basically in the format for TR. 
# first get all authors together separated by ; 
# http://stackoverflow.com/questions/6308933/r-concatenate-row-wise-across-specific-columns-of-dataframe
# example:  data <- within(data,  id <- paste(F, E, D, C, sep="")

content.frame4<- within(content.frame3, Author<- paste(author,author1,author.2, author.3, author4, author5, author6, sep="; " ))

# http://stackoverflow.com/questions/22854112/how-to-skip-a-paste-argument-when-its-value-is-na-in-r
content.frame4$Author<-gsub("NA; ","",content.frame4$Author)

content.frame4$Author<-gsub("NA$","",content.frame4$Author)


#remove NA from other fields

content.frame4[is.na(content.frame4)]<-""

#now drop unwanted columns using df <- subset(df, select = -c(a,c) )  from http://stackoverflow.com/questions/4605206/drop-data-frame-columns-by-name

content.frame5<-subset(content.frame4, select = -c(author,author1,author.2, author.3, author4, author5, author6))


#add in a gsub for the correct reference types
content.frame5$ReferenceType<-gsub("Conference and Workshop Papers","Conference Paper", content.frame5$ReferenceType)
content.frame5$ReferenceType<-gsub("Parts in Books or Collections","Book Section", content.frame5$ReferenceType)
content.frame5$ReferenceType<-gsub("Books and Theses","Book", content.frame5$ReferenceType)
content.frame5$ReferenceType<-gsub("Journal Articles","Journal Article", content.frame5$ReferenceType)


#need tab delimited no rownames and update column names to have the necessary spaces

correctnames<- c("Reference Type","Title", "Secondary Title", "Pages", "Year",  "URL", "Volume", "Number", "Secondary Author", "Publisher", "Author")

# if only one type of reference specify at top *Generic to top of file also add a vector of correct column names
#write("*Generic","dblptestnew.txt")
#write.table(content.frame5,"dblptestnew.txt",append = T, quote=F,sep = "\t",row.names = F,col.names=correctnames, fileEncoding = "UTF-8")

#if multiple types use this one
write.table(content.frame5,"dblp30wrangl.txt", quote=F,sep = "\t",row.names = F,col.names=correctnames, fileEncoding = "UTF-8")

(this is also on Git because WP keeps messing up the code)

After you have this file, import it into EndNote using the built-in tab-delimited import option, with UTF-8 text translation.

4 responses so far
