Parsing citations for dabblers

Jun 26 2016 Published by under information analysis

Warning! This post is more about questions than answers!

Wouldn't it be nice to be able to grab references from the end of an article - say the article is in pdf, even - and have them in a usable tagged format? I'm surely not the only one to consider this. In fact, everyone seems to do it: CiteSeer, the ACM Digital Library, and others. Deborah Fitchett had success figuring this out.

My incentive is a bit different. I'm looking at a pile of proposals and I want to know what citations they have in common. Everyone cites themselves, of course, but we think there are several schools of thought that we should be able to identify.

My general thought was

  1. extract the bibliography
  2. parse
  3. label each citation with firstauthorlastnamepubyear - even if there are multiple works by a single author in the same year, I think this should be good enough? It's pretty rare to do a volte-face mid-year.
  4. make a matrix of proposers x citation labels
  5. Graph and profit
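
Steps 3 to 5 could be sketched roughly like this in Python. This is only a minimal sketch, not what I actually ran: it assumes the parsing step has already produced a first-author surname and a year for each reference, and the proposal names, sample data, and function names (`citation_label`, `build_matrix`, `shared_citations`) are all made up for illustration.

```python
from itertools import combinations


def citation_label(first_author_surname, year):
    """Build a label such as 'smith2010' from a surname and a publication year."""
    # Keep letters only, so "O'Brien" and "O Brien" both become "obrien".
    surname = "".join(ch for ch in first_author_surname.lower() if ch.isalpha())
    return f"{surname}{year}"


def build_matrix(citations_by_proposal):
    """Return (proposals, labels, matrix); matrix[i][j] is 1 if proposal i cites label j."""
    proposals = sorted(citations_by_proposal)
    labels = sorted({lab for labs in citations_by_proposal.values() for lab in labs})
    column = {lab: j for j, lab in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in proposals]
    for i, proposal in enumerate(proposals):
        for lab in citations_by_proposal[proposal]:
            matrix[i][column[lab]] = 1
    return proposals, labels, matrix


def shared_citations(citations_by_proposal):
    """Count the labels each pair of proposals has in common (edge weights for a graph)."""
    edges = {}
    for a, b in combinations(sorted(citations_by_proposal), 2):
        overlap = len(set(citations_by_proposal[a]) & set(citations_by_proposal[b]))
        if overlap:
            edges[(a, b)] = overlap
    return edges


# Hypothetical parsed input: (first-author surname, year) pairs per proposal.
parsed = {
    "proposal_A": [("Smith", 2010), ("O'Brien", 2012)],
    "proposal_B": [("Smith", 2010), ("Lee", 2015)],
    "proposal_C": [("Lee", 2015), ("O'Brien", 2012), ("Smith", 2010)],
}
labelled = {p: [citation_label(s, y) for s, y in refs] for p, refs in parsed.items()}
proposals, labels, matrix = build_matrix(labelled)
edges = shared_citations(labelled)
```

The `edges` dictionary is an edge list in disguise: each key is a pair of proposals and each value is the number of citation labels they share, which is the sort of thing a graphing tool can take as a weighted network.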

There are several ways to extract the bibliography. I realize now that I probably should have done something with Beautiful Soup or the like, and even if not, some tools actually take in a whole document and find the bibliography themselves. Anyway, I have them.
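
For what it's worth, one crude way to pull the bibliography out of a document that has already been converted to plain text might look like the sketch below. It is not how I did it. It assumes the reference list sits under a heading on its own line (such as "References" or "Bibliography"), that it runs to the end of the text, and that entries are separated by blank lines; the function names and the sample text are hypothetical.

```python
import re

# A heading line consisting only of a references-style word, optionally followed by a colon.
HEADING = re.compile(
    r"^[ \t]*(?:references|bibliography|works cited|literature cited)[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_bibliography(text):
    """Return the text after the last references-style heading, or '' if none is found."""
    matches = list(HEADING.finditer(text))
    if not matches:
        return ""
    return text[matches[-1].end():].strip()


def split_entries(bibliography):
    """Split on blank lines and collapse each entry's internal whitespace."""
    chunks = re.split(r"\n\s*\n", bibliography)
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


# Hypothetical document text for illustration.
sample = (
    "Intro text mentioning references in passing.\n\n"
    "References\n\n"
    "Smith, J. (2010). A title.\n  Journal of Things.\n\n"
    "Lee, K. (2015). Another title.\n"
)
entries = split_entries(extract_bibliography(sample))
```

Real proposals will break these assumptions in all sorts of ways (appendices after the references, numbered entries with no blank lines between them), so treat this as a starting point only.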

Now for parsing. Here is a list of parsers that may still be around and usable in some fashion (this is a helpful listing):

  • ParsCit - this seems really to be the most popular and the most successful
  • FreeCite - a web service from Brown University libraries
  • Using CrossRef
  • ParaCite / ParaTools from Southampton - perl modules
  • eta: AnyStyle.io - I tried the web interface and it worked pretty well

ParsCit is the most popular, so I thought I would give it a go. The page is not terribly hopeful about running it on Windows. Well... so I did request and receive an Ubuntu VM to play with... hoo-boy, the instructions are fairly off-putting and really sort of condescending (along the lines of "if you know how to work in Unix, this is trivial").

So now, instead, I'm playing with the Brown library service and RCurl to see what I can do. Of course, then I have to deal with what I get back. Meh.

If I get it all figured out, I'll be sure to report back.

 

Edited: Richard Urban reminded me about AnyStyle.io so I added it. I think I will try to get an API key so I can reach it from R using RCurl. Because of time constraints... right now I'm just extracting the network using Excel 🙁
