Archive for the 'bibliometrics' category

Filling in missing data: books

May 17 2018 Published under bibliometrics, information analysis

Photo by Janko Ferlič on Unsplash

I normally do information analysis and bibliometrics on technology - not even really science, but technology. The current project I'm obsessing over (it's actually really cool and interesting and fun) spans political science, history, philosophy, and even some sociology and criminal justice. So I played all my reindeer games on the journal articles, but when it came to analyzing the abstracts, a lot were missing. I wanted to ignore these, but they included whole collections from extremely relevant journals and also nearly all of the books. For the journals, I filled in what I could from Microsoft Academic (no endorsement intended).

Books though.

I hoped I could use WorldCat - my library doesn't have a cataloging subscription with them, so I don't think so? I hoped LoC - I could download ALL records from before about 2014, which would help a lot, but wow, I can't really deal with that much data right now. OpenLibrary - no.

Amazon - they've changed the rules a lot, and it wasn't clear to me that, even if I took all the time to figure it out, I wouldn't actually be breaking the current terms of service.

I asked the wonderful group on the Code4Lib list - one of them (thanks Mark!) pointed to Harvard's API and records. Woo-hoo! I hope they log my stats to justify their expense. I hope my messing around doesn't cause them any problems.

I'm not a cataloger, although I played one for 4 months 15 years ago. And I don't know MODS or Dublin Core (although it did exist when I was in library school). I pulled a couple of records and poked through them to see where what I wanted was located. Originally, I pulled the title from the record, but that proved to be too messy.

The data I needed to fill in:

  • gathered from reference managers of a team of analysts over a number of years (as many as 9)
  • searched in general (WoS, Scopus) and specific (World Political Science Abs, PAIS, etc) databases
  • gathered in EndNote, de-duped, exported as tab-delimited text

In R:

  • limited to books and book sections
  • for titles, removed punctuation, made lower case, removed numbers
  • for authors, split to take the first name given for the first author (mostly the last name)
  • for ISBN, removed ISSNs, kept the first one, took out the hyphens (yes, to be more thorough, I should have or'd them)
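The cleaning steps above could look something like this in base R. This is only a sketch, not the script I actually ran: the sample rows, the column names, and the helper names (clean_title, first_author_token, first_isbn) are all made up for illustration, and it assumes multiple identifiers in one field are separated by semicolons.

```r
# Made-up rows standing in for the EndNote export (column names are assumptions)
books <- data.frame(
  title  = c("The Politics of Identity: 2nd Edition", "War & Peace, 1914-1918"),
  author = c("Smith, Jane; Doe, John", "O'Brien, Pat"),
  isbn   = c("978-0-19-852663-6; 0198526636", "1234-5678; 978-1-4028-9462-6"),
  stringsAsFactors = FALSE
)

# Titles: lower case, punctuation to spaces, digits dropped, spaces squeezed
clean_title <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)
  x <- gsub("[[:digit:]]", "", x)
  trimws(gsub("\\s+", " ", x))
}

# Authors: keep whatever comes before the first comma, semicolon, or space
first_author_token <- function(x) {
  sub("[,; ].*$", "", trimws(x))
}

# ISBNs: strip hyphens, drop 8-character ISSNs, keep the first 10- or 13-character value
first_isbn <- function(x) {
  parts <- trimws(strsplit(x, ";")[[1]])
  ids <- gsub("-", "", parts)
  isbns <- ids[nchar(ids) %in% c(10, 13)]
  if (length(isbns) == 0) NA_character_ else isbns[1]
}

books$holdti <- clean_title(books$title)
books$holdau <- first_author_token(books$author)
books$holdisbn <- vapply(books$isbn, first_isbn, character(1), USE.NAMES = FALSE)
```

Keeping only the first ISBN is the shortcut admitted to in the list; searching on all of them would mean returning the whole isbns vector instead.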


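# Needs httr (GET) and jsonlite (fromJSON). The lines that set harvardurl and
# build `query` did not survive in this listing, so supply your own.
getharvabs <- function(citeNo){
  if (!is.na(missingbooks1$holdisbn2[citeNo])){
    # have an ISBN: `query` should search on it
    holding<-GET(harvardurl, query=query)
  } else {
    # no ISBN: `query` should search on the cleaned title and first author
    holding<-GET(harvardurl, query=query)
  }
  # default row, kept when the call fails or nothing is found
  citerow<-data.frame(id=citeNo, TI=missingbooks1$X4[citeNo], AB=NA)
  if (holding$status_code==200){
    # content comes back raw, so convert to text and then parse the JSON
    bookContent <- rawToChar(holding$content)
    booktxt <- fromJSON(bookContent)
    if (booktxt$pagination$numFound!=0){
      citerow<-data.frame(id=citeNo, TI=missingbooks1$X4[citeNo],
                          AB=booktxt$items$mods$abstract$`#text`)
    }
  }
  print("this is citerow")
  print(citerow)
  citerow
}

# seed data frame with a dummy row, then add one row per citation
holdbookabs<-data.frame(id=as.integer(0),TI=as.character("test"), AB=as.character("test"))

for (i in 1:length(missingbooks1$holdti)){
  holdbookabs<-rbind(holdbookabs, getharvabs(i))
}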

This choked a couple of times. Maybe I don't always have the right book - but then again, if it's on the same subject. Meh.
I also considered getting the LCSH, but decided against. Too much time spent on this already.


Getting abstracts back from Microsoft Academic's Inverted Index

May 14 2018 Published under bibliometrics, information analysis

Some time ago, I posted about using Microsoft Academic to fill in missing data from other searches. Jamie and I were going to do a package to wrap the API, but bureaucracy more or less killed our enthusiasm (well, not his, that would be impossible).

Here I am obsessing over a really, really cool bibliometrics project, and have lots of citations missing abstracts. I'm sort of thinking I won't be able to do much with the books even though catalogs seem to have descriptions for a lot of them (happy to take suggestions). I've already looked at using other sources, so I'm back at Academic.

Pulled out my script. Found I lost my keys, retrieved new keys and found there's a new endpoint url, updated that and hit go. ....

Ladies and gentlemen, they moved the abstracts... no more paragraph, you now get an "inverted index." People who studied information retrieval may know what that is, but in this case it's a list of terms, each with a numeric vector of the positions where the term appears. Stop words are included, so "the" might have 20 positions and "sassafras" has 1.

Here it is. Jamie helped with the strategy and the rest comes from lots of searches:





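library(httr)      # GET, add_headers
library(jsonlite)  # fromJSON
library(reshape2)  # melt
library(dplyr)     # %>%, arrange

options(stringsAsFactors = FALSE)

#keys- put yours in (assign to msakey1):

#current 05142018 url (assign to apiurl; apiattrib holds the attributes you want back):

#sample search left in
searchexpr<-"Ti='identity as a variable'"

#test on one to see how it works
testcite <- GET(apiurl,
         query = list(expr = searchexpr,count = 1, attributes = apiattrib), add_headers("Ocp-Apim-Subscription-Key"= msakey1))

#could look for errors first

#get the json out into usable format
#comes out raw so need to make into text
testciteContent <- rawToChar(testcite$content)
test <- fromJSON(testciteContent)

#this will get a ; separated vector
paste(test$entities$AA[[1]]$AuN, collapse = ';')

#use the following to get an abstract from the inverted index
#testII is the inverted index pulled out of the response: a named list with one
#vector of positions per term (the line that extracts it did not survive here)
#melt gives one row per term/position (columns value and L1), then sort by position
testII.m <- testII %>% melt(.) %>% arrange(value)

ab<-paste(testII.m$L1, collapse=" ")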
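If you just want to see the inverted-index trick on its own, here is a minimal base R sketch with no API call and no extra packages. The toy index is made up, and rebuild_abstract is a name I'm inventing for illustration; it assumes the index arrives as a named list with one vector of positions per term, which is how the description above reads.

```r
# Toy inverted index (made-up data): each term maps to the positions it occupies
inverted <- list(
  the = c(0, 4),
  cat = 1,
  sat = 2,
  on  = 3,
  mat = 5
)

rebuild_abstract <- function(ii) {
  # repeat each term once per position, then sort the terms by position
  terms <- rep(names(ii), lengths(ii))
  positions <- unlist(ii, use.names = FALSE)
  paste(terms[order(positions)], collapse = " ")
}

abstract <- rebuild_abstract(inverted)
```

It doesn't matter whether the positions start at 0 or 1, since only their order is used.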



Poster for METRICS2017: Methods for Bibliometric Institutional Profiles for the Practitioner

Oct 29 2017 Published under bibliometrics, Conferences

The poster:

I don't know if it is actually clear enough to read? This big PDF should work: Pikas Methods for Bibliometric Institutional Profiles for the Practitioner

The submission has a little more about my motivation in the poster: Pikas Institutional Profiles MET17 (pdf)

The scripts are here:

Here's a lovely map I had to cut from the poster for size. Viewers may not appreciate that it is actually very unusual for us to collaborate outside of the US.

Affiliations of co-authors, sized by number of articles.



Oct 27 2017 Published under bibliometrics, Conferences

Edwin Henneken, Alberto Accomazzi, Sergio Blanco-Cuaresma, August Muench, Lars Holm Nielsen - Asclepias – Capturing Software Citations in Astronomy

Asclepias project. Enabling software citation & discovery workflows. To "promote scientific software into an identifiable, citable, and preservable object. " Adding DOI based software citations to ADS. Tracking events.

Collaborative Codebase (GitHub) > Repository (Zenodo) > software broker (harvests repository events, software citations)

Example: software published in JOSS with 60 regular citations to the paper, but also deposited in Zenodo. Counting citations to every single version of the software gives a total of 100 citations.

Journals need to be able to accept software citations (actual citation to the software and not a related article). Just slapping a doi on it isn't enough.

End to end go from original proposal through all the data, papers, software, etc. and have analytics along the way.

Q: difficult to get people doing the right thing with the repositories? yes - but astro is amenable. long history of linking data

Q2: like bigger world of citing things not papers about things? yes

Eto Masaki - Increasing Source Documents of Rough Co-citation to Expand Co-citation Networks for Scientific Paper Searches

rough co-citation is a generation back from co-citation.

a + b cited together is co-citation... a + c cited together, so infer a relationship with b. This did improve information retrieval: it retrieved documents that didn't exist in the network.

Pei-Ying Chen (speaker), Erica Hayes, Stefanie Haustein, Vincent Larivière, Cassidy R. Sugimoto -  Politics of platforms: the ideological perspectives of social reference manager users on scholarly communication

Looking at Mendeley and Zotero - hypothesis that Zotero users will be more open to open data, etc., and Mendeley users will be more traditional because they're using a corporate platform.

Mendeley provided a stratified random sample of 26k users, response from about 1200. Zotero was an anonymous link advertised by Zotero at conferences.

In survey they didn't provide a category for librarians so they got a lot of "others"

From both groups: all advocate for open source software, all are adopters of new technologies, most advocate for open access.

Majority of both think peer review system is broken and publishers aren't necessary for scholarly communication.

Some similarities and differences, but no real clear support for their hypothesis, as far as I could tell.

Q: try to look at the contents of the library to see if more oa or paywall journals?


Ehsan Mohammadi, Mike Thelwall, Kristi Holmes - Interpret the meaning of academic tweets: A multi-disciplinary survey

Altmetrics - who uses twitter to communicate scholarly info, does twitter play an important role in communicating scholarly info, why, does it depend on discipline

twitter users who re/tweeted academic publications at least once using 4.5m twitter accounts

looked at personal web page urls 1.7 urls

using web mining, identified email addresses

sent online survey to 57k twitter users, got 2000 responses.

most respondents tweeting scholarly information were from the social sciences and humanities

most agree:

  • change way to read and disseminate sci info
  • twitter facilitates knowledge flows
  • reflects research impact
  • share academic findings with the general public

motivations for using and type of content shared depend on discipline, occupation and employment sector

They have a paper under review in a journal so stand by.


Philippe Mongeon - Is there a Matilda effect in academic patenting?

We know men publish more papers than women and their papers are more cited

Now for patenting. Only about 15% of inventors are women. Patent-paper pairs. Same discovery published in a paper and patent

are women less likely to be inventor than men when we control for: position on the byline, discipline, reputation, contribution

Previous studies: no gender difference (Haeussler & Sauermann, 2013), female more likely excluded from inventorship (Lissoni et al 2013)

all articles with 2 or more authors in wos 1991-2016, uspto patent applications 1986-2015

paper-patent pairs: paper within -1 to 5 years of the application, all inventors on the authors list

text similarity of title and abstract.

discipline - based on discipline of journals cited by the paper

attribution of gender - based on Wikipedia pages (Berube in preparation)

automatic disambiguation of authors

accumulated number of citations at time of app.

contributions - manual extraction, where there were statements coded conception, analysis, performed...

regression models...

turns out place in author list has much more impact than gender, but gender is significant for all but engineering.

When taking contribution into account (many fewer papers), conception role is important  - which makes sense.

Small effect of gender on the attribution of inventorship, gender gap occurring earlier in the research process




Oct 27 2017 Published under bibliometrics, Conferences

This event was held Friday October 27, 2017

Kate McCain  - Undercounting the gift givers: issues when tallying acknowledgements in life sciences research

ongoing research effort - she originally worked on this 20 years ago but has come back to it recently. Background - model organisms - useful to organize research around. Community databases, stock centers, community ethos wrt sharing.

Ways to focus research - by journal is often done, but she uses this model organism. She is looking at 1980-2004 during growth phase when there is more sharing because nascent research area. And she is looking at acknowledgements.

Compared to citations - acknowledged most likely to be alive.

Personal ack vs. funding - she's interested in personal ackn. "peer interactive communication"

May be lots of different places: end note, methods section, end of text with no section label, ... No control or standardization of how people are named, what granularity they are thanked for, etc.

WoS mostly gets funding ack, and only secondarily sweeps up some personal ack (if they are in the same block, which is not always the case).

Undercounting big deal: text extraction relying on formal ack section. personal name disambiguation. Sampling or single year studies.

Check her slides to see what she found where. She also categorized types of ack - animals, software, data, editing, etc.

Top 15 individuals listed - first few time periods dominated by University of Oregon - founders and suppliers of fish early on.

She then went through profiles of some individuals with the diversity of how they appeared.

Trends - fewer examples of thanking for research materials - have their own, get from repository, or get from stock center

questions: manually - yes? learn things to help automate - yes, but lots and lots and lots of ways to trip up. Also just picking up surnames is not enough because then get some citations mixed in, named equations/methods, etc.

Reminds me of:

questions: in the lab outside of the lab. also tracking people who are frequently acknowledged and not often co-authors/cited

questions: comment - collaboration - set up something from PMC data (already coded in XML), but only using ack section and not the Materials & Methods (M&M) section.


Isabelle Dorsch - Relative Visibility

How well known. She's comparing personal publication list and information services (like WoS).

Relative visibility (IS) = (d/r)*100
d= in information services, r=publication list
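As a quick worked example of the formula (my numbers, not hers): an author with 60 items on the personal publication list, 45 of which are found in the information service, has a relative visibility of 75. A tiny R sketch, with relative_visibility as a made-up function name:

```r
# Relative visibility: share of the personal publication list (r)
# that is found in the information service (d), as a percentage
relative_visibility <- function(d, r) {
  stopifnot(r > 0, d >= 0)
  (d / r) * 100
}

established <- relative_visibility(d = 45, r = 60)  # hypothetical established author
newcomer    <- relative_visibility(d = 2, r = 2)    # two papers, both indexed
```

The second call is the new-author case raised in the questions: two publications, both covered, gives 100%, which says more about the short list than about visibility.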

Previous work - Cronin & Stock, and ISSI board study

Issues - finding the personal list, is it up to date and complete, is it structured to be used at all, what types of documents to keep (novels? newspaper articles?), keep in press?

(*discussion of this on SIGMETRICS really found that a combined edited list is probably best, but these aren't universally available - list maintained by information service but updated by author)

Which information service matters (of course)  -  visibility to one field when author publishes in multiple. Conference paper coverage, book coverage, etc.

questions: new author - only two publications - 100% (they only looked at established authors). Very dependent on the database

Judit Bar-Ilan - CiteScore vs JIF and Other Journal Indicators

Criticisms of JIF but still heavily used. Standard definition. Criticisms like lack of transparency. Things in the numerator not included as "citable items" in the denominator. Also now offer a 5year JIF

Citescore - publication window 3 years. They count all items so no numerator/denominator coverage mismatch. Transparent - can see all the citations that are covered. Freely available. Some criticism that covers too many different document types

EigenFactor, SJR, pagerank type indicators - more weight to more influential sources

Article Influence - normalized - average journal is 1.

She looked at correlations - for those sources that appear on most sources.

Quite high - CS-JIF 0.94,

HOWEVER - Lancet is 5 in JIF, 314 in CS - so huge differences and she suspects due to notes, editorials, etc.

Top 20 by CS are almost all review journals (Annual Review of... , Progress in... )

Eigenfactor doesn't include journal self-citation, and doesn't correlate as well with others.

Note also that even though high correlation, there are these big differences.

question: comment - real correlation between size of journal and JIF, Eigenfactor is the only one that corrects for this.


Student papers

Zhao, Mao, & Kun Lu (speaking, not student) - An Exploratory Study on Co-word Network Simulation

Network centrality and other network measures for co-word network. Are they correlated. Are there differences in disciplines in these measures. Looking at generative process of a co-word network.

Q: co-word can mean 3 different things: words that appear in the text, co-descriptor - uses carefully assigned things, keywords plus - is another thing separately (not controlled, but titles of articles cited). Are you simulating second hand natural language assigned things.

Antoine Archambault, Philippe Mongeon (speaking), Vincent Larivière  - The concentration of journal use in Canadian universities

As Canadian universities have to cut big packages due to budgetary issues.

Evaluating - downloads statistics from the 28 universities (~300 Excel files, 5M lines), references (articles written by authors at these universities citing these journals) perceived importance of journals (what journals do you perceive important to your research, your teaching) 23 of 28 universities, 5,500 participants (of which 3k from their university so actually disappointing response)

Cleaning important journals - title disambiguation, manual validation, classification by major disciplinary area (AH, SS, BM, NSE) - WoS, NSF, Ulrich's, Google, also verified research journal and not newsletter, etc.

47k unique journals.

Priority journals - 80/20 rule - anything in top of 80% downloads, references, mentions (10% of subscriptions account for 80% of any of these measures)
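My reading of the priority-journal rule, sketched in R with made-up counts (the journal names, numbers, and the priority_journals helper are all hypothetical, and the speakers' actual method may differ): rank journals by a measure, keep those that make up the top 80% of its total, and call a journal "priority" if it makes the cut on any measure.

```r
# Made-up usage counts for six journals
downloads  <- c(A = 450, B = 300, C = 150, D = 50, E = 30, F = 20)
references <- c(A = 8,   B = 4,   C = 60,  D = 25, E = 2,  F = 1)

priority_journals <- function(counts, share = 0.8) {
  sorted <- sort(counts, decreasing = TRUE)
  cum_share <- cumsum(sorted) / sum(sorted)
  # keep journals up to and including the one that crosses the threshold
  n_keep <- which(cum_share >= share)[1]
  names(sorted)[seq_len(n_keep)]
}

top_downloads  <- priority_journals(downloads)
top_references <- priority_journals(references)

# priority on any measure
priority <- union(top_downloads, top_references)
# journals not in the top anywhere
not_priority <- setdiff(names(downloads), priority)
```

With three measures (downloads, references, mentions) you would union three sets instead of two.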

66% of the 47k journals are not in the top anywhere.

Broke out by publishers - Springer 80% of publications were not in anyone's top. Sage only 22% were not in anyone's

Only 41.6% overlap of core journals between universities

Correlation of cites, downloads, mentions (cites are super lengthy for institutions to do themselves can they just use downloads?) - answer is no. Have to use the 3 measures, not completely correlated.

Q: can you do some sort of demand driven acquisition

Q: are there libraries of record - keep even if don't use locally

Q: combining visibility presentation earlier with this.

Christine Meschede - Cross-Metric Compatibility of Altmetrics: Consistency of the Metrics from PlumX and

(posting before notes - battery going right now - will try to update)


Metaknowledge (python) and Bibliometrix (R) - more or less comprehensive bibliometrics packages for standard data science environments

Aug 28 2017 Published under bibliometrics, Uncategorized

I thought for sure I had mentioned Metaknowledge here before but I can't find it so I must have mis-remembered. ...

There are tons of tools for bibliometrics, and a lot of people really just code their own for simplicity's sake, even if they eventually visualize their results using an off-the-shelf network analysis tool or other. Sci2, VOSviewer, and CiteSpace are all close to comprehensive, freely available, and pretty easy to use. What need is there for another product? If you want to stay within the rest of your workflow or experiment with new algorithms that are not available in the above, then these two packages are good options.

When I was doing the longitudinal clustering for citation trajectories, I inadvertently saved the 6,666 (I know!) records from 1980-2015 in the regular WoS* format instead of a more useful tab delimited.  I quite easily pulled out the pub year, accession number, times cited, and other simple fields using R. ... it's just now when I actually want to follow up with some natural language processing on the titles and abstracts that I realize my kludge won't actually work for either the title or abstract. So I fooled with it a couple of different ways before heading out to see if there was anything new out there for processing these since they were such a hassle to get in the first place. It turns out there is a new, fairly comprehensive R package: Bibliometrix. I had already experimented with Metaknowledge in Python. The extensive instructions (paywall) are very helpful, but I really just wanted to stay in R.

What follows is a general intro to these two tools and my observations.


Bibliometrix

This package appears to be quite new with recent releases. The first thing I tried - reading in a directory full of WoS export files - was like magic. In a snap, I had a dataframe with everything in the right column.


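# needs plyr (for ldply); getWoSdf is a small helper, not shown here, that reads
# one WoS export file into a data frame with bibliometrix
filenames <- list.files("directory", full.names=TRUE)

WoSall<- ldply(filenames, getWoSdf)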

Seems like BibTeX files are preferred over this format, but it was plenty quick for the 500 records per file I had. A nice feature is that it tells you every hundred records that it's making progress.

A nice thing is that there are pre-built basic summary/descriptive functions. It exports the standard networks but it also does co-word with a pretty neat visualization.

Multiple Correspondence Analysis (MCA) using keywords

This tool doesn't do anything to help you clean the data or pick out any weirdnesses. The visualizations shown aren't super pretty, but it's quite easy to use another R graphing tool with the data.


Metaknowledge

I worked through the journal article but using my own WoS data. For WoS data, everything worked as expected and I was able to quickly get really nice results.  You can also download a Jupyter notebook with their sample data to work through the process. A neat thing you don't see every day is that it will break down by male/female by guessing using a popular algorithm.  It also does Reference Publication Year Spectroscopy (meh) and besides extracting all of the standard networks you might want, it also has ways to extract text for text mining.

Some negatives based on my brief experiments. I couldn't get Scopus data to work for whatever reason. Also, it doesn't really facilitate any sort of cleaning. Data sets that come out of WoS still have issues. The graph examples were not very pretty and some sort of error came out of the way they had you do one graphing tool. It's quite easy to export data or just substitute your favorite graph because there are a million.


* no endorsement intended, for real.



Animating a graph

Apr 21 2017 Published under bibliometrics, information analysis

What I really want is an interactive network graph that my viewer can click on, slide things, flip switches, etc. I'll get there someday. In the meantime, this was a fairly easy way to show the evolution of a network over time.

The network is a company network generated in Quid*. I used the time filter and exported a png every 3-5 years for this network. The point being that there has been a lot of growth and in particular clusters of companies.

Here it is:

Probably need to click to enlarge to see the animation

ImageMagick is an awesome open source image tool. There have been other ways to get to it from R or python, but the magick package is the easiest I've seen. The vignettes and a blog post from rOpenSci were helpful.


library("magick", lib.loc="~/R/win-library/3.3")
newcomp_files<-list.files("folder", pattern="*.png", full.names=TRUE)
newcomp<-lapply(newcomp_files, image_read)
#need image_join because the above produces a list instead of a magick vector 
#dispose deletes previous image
newcomp_animate<-image_animate(image_join(newcomp), fps = 0.5, dispose = "previous")
image_write(newcomp_animate, 'newcomp_animate_82-.gif')

Note: the final version you see also used magick to crop (not great) and to annotate with the year. I slowed the frame rate down quite a bit.

I tried morphing and it was pretty slow and really not worth it so much.

*not affiliated and not endorsing


Who are my researchers citing? A quick and dirty way to find out

Apr 21 2017 Published under bibliometrics, Collection Development

This probably shouldn't warrant a post, but a few very experienced and up-to-date librarians didn't immediately know so probably worth sharing. Also, it turns out that Scopus* has hidden or removed a lot of the useful stuff and made it confusing to even know where to enter the search.**

In Scopus, search for your researchers. In my case, an affiliation search works.***


The affiliation search is a tab from the home page. There's no way to know it's a tab. It's just an underlined word. You actually then click around on the page until you find some place to type your query.

From the results list - and mind you I do this exact search regularly so yours might not be so precise - go ahead and click on all the documents. If you're at NIH then crap, because you have more than 10,000 journal articles per year so you have to do some major slicing and dicing. I just limited to 2016 and journal articles just because.

Then you look for the "..." but then you realize it's grayed out and you can't actually click on it.

So then you click to highlight all, and then you click on "..." and you see view references.

From here, you can list the top sources and, theoretically, analyze them. They're not completely clean though. My set had JGR as well as the spelled out and the specific ones. Likewise with ApJ. So how quick and how dirty is ok? For collections development, you're probably cool with reading off. Otherwise you could export and then use OpenRefine or similar to clean.
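If you do export and would rather stay in R than open OpenRefine, a rough sketch of collapsing the variants might look like the following. The source titles, the lookup patterns, and normalize_source are all made up for illustration, and folding the specific sections into the parent title is a judgment call that may not suit every collection decision.

```r
# Made-up list of cited source titles as they might come out of an export
cited_sources <- c(
  "JGR",
  "Journal of Geophysical Research",
  "Journal of Geophysical Research: Space Physics",
  "ApJ",
  "Astrophysical Journal",
  "Icarus"
)

# Hypothetical lookup: regular expression on the lower-cased title -> canonical title
canonical <- c(
  "^(jgr|journal of geophysical research)" = "Journal of Geophysical Research",
  "^(apj|astrophysical journal)"           = "Astrophysical Journal"
)

normalize_source <- function(x) {
  out <- x
  for (pattern in names(canonical)) {
    out[grepl(pattern, tolower(x))] <- canonical[[pattern]]
  }
  out
}

# tally the cleaned titles, most cited first
source_counts <- sort(table(normalize_source(cited_sources)), decreasing = TRUE)
```

Anything that doesn't match a pattern passes through unchanged, so you can keep adding patterns as you spot new variants.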

* Not affiliated, not endorsing!

** plus - this thing in advanced search in which it is forever putting in codes I do not want ... anyway...

***hey, all the branding for my larger institution is gone? aw come on.


Nebulous Connections Presentation: Using bibliometrics to keep up with the Joneses

Apr 04 2017 Published under bibliometrics

I presented this today, April 4, 2017 at the SLA Maryland event held at NASA Goddard's Library


Reflection on librarians and bibliometrics

Mar 25 2017 Published under bibliometrics

I only attended a few of the sessions related to bibliometrics at ACRL2017, but I talked with a bunch of new people and I saw what was presented. This confirmed my view that:

Librarians are sophisticated and nuanced producers and consumers of bibliometrics

Last fall, I sat in a presentation by Ludo Waltman (slides pdf) in which he talked of librarians as "citizen" bibliometricians who are, essentially, only able to uncritically use the JIF and h-index because they are all we understand. We are only able to look up metrics when carefully handed to us by vendors.

I was irate, because I do not see myself in that category at all. Nor do I see my colleagues at CMU, NYU, UIC, Northwestern, ARL, NIH, Cornell, and CWRU that way. But maybe I live in a bubble?

No. It was clear from ACRL that there are librarians everywhere who follow these things, care deeply about them, think critically about them, and who are sophisticated in creating and using metrics.

So I'm blowing a raspberry at your "citizen" bibliometrics. We are not the citizenry; we are the insurgency.   Ok, that's too dramatic. Let's say we're the army of the less-well funded ally that nevertheless has lots of troops in the battle.
