Retrieving article metadata from Microsoft Academic Scholar

Nov 19 2016 Published by under bibliometrics, information analysis

In the ongoing saga of doing information analysis and bibliometrics of some sort in computer science... now I need affiliations. As a reminder, I did the first bit of this work in Inspec* because it has high-quality metadata, but after reviewing the results with SMEs I discovered that it was missing a bunch of important conferences in the field - most notably some big ones from ACM. So I searched DBLP (using their API), ACM's Guide to the Computing Literature, arXiv, and CiteSeer and found a bunch more interesting articles. I de-duplicated against the Inspec set and then did topic modelling using whatever I had (abstract, title, and keywords). Well, ACM doesn't export abstracts and DBLP doesn't even have them.

And then I got all turned around after linking the article titles back to the topics and working with the SMEs to name and select the interesting topics.... so, oops... now I had a list of ~2000 titles and no other information, but what I actually needed was a list of top organizations and top venues for these interesting topics... Uh-oh.

Of course Google Scholar doesn't have an API. Aminer does, but a quick check had it returning 0 results for my first few titles through the web interface. CiteSeer, I don't even know. What to do? Ah-ha, Microsoft Academic Search* does have an API, but it's not all that comprehensive yet... oh wait - it actually IS quite good in computer science.

Ideally, there would already be an rOpenSci package to search it but the only package I found was for using some of the other Microsoft Cognitive Services APIs. The main Academic Knowledge site makes it very easy to sign up to make up to 10k requests a month for free. There's even a console you can use to test your queries separately from your code.
So what's the problem, you ask? Just iterate through searching for each title, pull down JSON for just the fields you need (C.CN for conference name, J.JN for journal name, AA.AfN for author affiliation), parse into a data frame, then tot them up.... Yet our searches were not getting any results... until we happened on a StackOverflow question... You need to lower-case the title and remove all punctuation before searching.
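Until I can reproduce the real code, here is a rough Python sketch of the steps above, not what my colleague actually wrote. The function names are made up for illustration, the `Ti='...'` expression and the `expr`/`attributes`/`count` parameters are how I understand the Academic Knowledge evaluate call to work, and the response shape (a list of entities with `AA`, `C`, and `J` fields) is an assumption. It doesn't call the API; it only builds the query string and tallies results you have already parsed from JSON.

```python
import re
from collections import Counter
from urllib.parse import urlencode


def normalize_title(title):
    """Lower-case a title and swap punctuation for spaces, the key first step."""
    no_punct = re.sub(r"[^\w\s]", " ", title.lower())
    # Collapse the runs of whitespace left behind by removed punctuation.
    return " ".join(no_punct.split())


def build_query(title, attributes=("Ti", "C.CN", "J.JN", "AA.AfN"), count=1):
    """Build the query string for one title search.

    Assumption: the service takes an `expr` like Ti='...' plus a
    comma-separated `attributes` list. Check the API docs before relying on it.
    """
    params = {
        "expr": "Ti='{}'".format(normalize_title(title)),
        "attributes": ",".join(attributes),
        "count": count,
    }
    return urlencode(params)


def tally(entities):
    """Count organizations and venues across already-parsed result entities.

    Assumed entity shape: {"AA": [{"AfN": ...}], "C": {"CN": ...}, "J": {"JN": ...}}
    """
    orgs, venues = Counter(), Counter()
    for entity in entities:
        # Use a set so each organization counts once per paper.
        orgs.update({a["AfN"] for a in entity.get("AA", []) if a.get("AfN")})
        venue = entity.get("C", {}).get("CN") or entity.get("J", {}).get("JN")
        if venue:
            venues[venue] += 1
    return orgs, venues
```

Counting each organization once per paper is a choice on my part; if you want to weight by number of authors, drop the set and count every affiliation.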

A SME at work ended up doing the actual coding for this part but I'm going to try to reproduce it on my own to make sure I have it. When I do, I will surely share.

Long story, but: 1) it would be lovely to have a package to use for this API; 2) MAG (the Microsoft Academic Graph behind the API) does work fine for this purpose for this topic; 3) be sure to lower-case and remove punctuation as a first step.

*no endorsement intended
