Clustering articles using Carrot2

Sep 11 2012 · Filed under bibliometrics, Information Science

I did a very basic intro to using some social network analysis tools for bibliometrics here. This post is another brief how-I-did-it for people with skills similar to mine. In other words, if you're a computer programmer or the like, this will be too basic.

I've been getting into information analysis more and more at work. I've tried to learn this stuff and keep up over time because I know it's very useful, but it's taken this long to get any real uptake at my place of work. Typically I'll use metadata from either WoS or Scopus, or from a collection of specific research databases filtered through RefWorks. Once I have the metadata from the research database, I'll often run it through Vantage Point for some cleaning and to make matrices (co-authorship, author x keyword, etc.). More recently, I've been using Sci2 for some of this.

All of the tools I've used so far work with metadata, but I get a lot of requests for mining the text itself. I know of tons of tools for that, but I think they all take a little more commitment than I'm willing to give right now (learning to program, for example). Some things can be done in R, but I haven't really tried that either, as there's still a steep learning curve.

Anyway, a software developer buddy of mine from work (well, he's really a lot more than that - he does rapid prototyping for human language technologies) has recommended Carrot2 a bunch of times, and I now have a project that gives me an excuse to give it a whirl. We're mapping an area of research to pick out the key institutions, key authors, and key venues... but also looking at the major areas of research. That could be done with author co-citation analysis (ACA) or bibliographic coupling, but another way is to cluster the articles based on their abstracts - that's what I used Carrot2 for. One reason not to use Sci2 with WoS data for ACA or bib coupling here is that for this particular research area I was having a very hard time getting a nice clean, tight search in WoS, whereas some social sciences databases were yielding great results. As I was just telling a group at work, a lot depends on your starting set - if you do a crap search with lots of noise, then your bibliometrics aren't reliable and can be embarrassing.

Carrot2 out of the box will cluster search engine results from Bing, Google, and PubMed. If you download it, you can incorporate it into various programming thingies, and you can also use the document clustering workbench on any XML file or feed (like RSS). They have a very simple XML input format, and you use an XSLT to get your base file or feed to look like that. I exported my records from RefWorks in their XML and started reading up on XSLT... after some playing around I had an epiphany - I could just make a custom export format to get the input format directly from RefWorks!
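(If you do want to go the XSLT route, a minimal stylesheet would look roughly like the sketch below. The input side - reference, id, title, url, abstract - is just a stand-in for whatever elements your exported XML actually uses; the output side is the Carrot2 format shown further down.)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- wrap everything in Carrot2's searchresult envelope -->
  <xsl:template match="/">
    <searchresult>
      <query>your subject here</query>
      <xsl:apply-templates select="//reference"/>
    </searchresult>
  </xsl:template>

  <!-- one Carrot2 document per exported reference -->
  <xsl:template match="reference">
    <document id="{id}">
      <title><xsl:value-of select="title"/></title>
      <url><xsl:value-of select="url"/></url>
      <snippet><xsl:value-of select="abstract"/></snippet>
    </document>
  </xsl:template>
</xsl:stylesheet>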

I started from the RW xml but could have gone from scratch. In the output style editor, bibliography settings:

reference list title: <?xml version="1.0" ?>\n<searchresult>\n<query></query>\n

text after bibliography: </searchresult>

Then I defined only the Generic reference type and set all the other ref types to use it:

refid: precede with <document id="

follow with ">

Do basically the same for Title, Primary (precede/follow with <title> and </title>); Abstract (precede/follow with <snippet> and </snippet>, since that's what Carrot2 calls it); and URL (precede/follow with <url> and </url>).

Then, as the last element, add text to output: </document>

You end up with:

<?xml version="1.0" ?>
<searchresult>
<query>paste your subject here, it's supposed to help</query>
<document id="ID">
<title>article title</title>
<url>url</url>
<snippet>abstract</snippet>
</document>
<document id="ID">
<title>article title</title>
<url>url</url>
<snippet>abstract</snippet>
</document>
</searchresult>

More or less, that is. I had some spaces I needed to remove. There was also one weird character that caused an error.
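(Incidentally, if you'd rather not chase the stray spaces by hand, a little whitespace-normalizing identity transform like the one below will squeeze them out. The weird character probably still has to come out manually, though - if it's something the XML parser rejects, nothing will run until it's gone.)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- copy every element and attribute through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- collapse runs of whitespace inside text nodes -->
  <xsl:template match="text()">
    <xsl:value-of select="normalize-space()"/>
  </xsl:template>
</xsl:stylesheet>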

Then in the Carrot2 workbench you select XML as the source, point it at the file, and shazaam! You get three different visualizations and you can export the clusters. One of my biggest clusters was just "copyright Sage" (boilerplate from the abstracts), but the clustering can be tuned and you can add to the stopword list if you want. I still want to play with the tuning and the different clustering methods.

 
