Archive for the 'bibliometrics' category

Notes from International Symposium on Science of Science 2016 (#ICSS2016) - Day 1

This conference was held at the Library of Congress March 22 and 23, 2016. The conference program is at: http://icss.ist.psu.edu/program.html

I had the hardest time remembering the hashtag, so you may want to also search for variants with more or fewer Cs and Ss.

This conference was only one track but it was jam-packed and the days were pretty long. On the first day, my notes were by hand and my tweets were by phone (which was having issues). The second day I brought a power strip along and then took notes and tweeted by laptop.

One thing I want to do here is to gather the links to the demo tools and data sets that were mentioned with some short commentary where appropriate. I do wish I could have gotten myself together enough to submit something, but what with the dissertation and all. (and then I'm only a year late on a draft of a paper and then I need to write up a few articles from the dissertation and and and and...)

Maryann Feldman SciSIP Program Director

As you would expect, she talked about funding in general and the program. There are dear colleague letters. She really wants to hear from researchers in writing - send her a one-pager to start a conversation. She funded the meeting.

Katy Börner Indiana University

She talked about her Mapping exhibit - they're working on the next iteration and are also looking for venues for the current one. She is interested in information analysis/visualization literacy (hence her MOOC and all her efforts with Sci2 and so on). One thing she's trying now is a weather report format. She showed an example.

She did something with descriptive models of the global scientific food web: where are the sources and where are the sinks of citations?

Something more controversial was her idea of collective allocation of funding. Give each qualified PI a pot of money that they *must* allocate to other projects. So instead of a small body of reviewers, everyone in the field would be a reviewer. If a top PI received more than a certain amount, they would have to re-allocate the excess to other projects.

I'm not sure I got this quote exactly but it was something like:

There's an upcoming conference at the National Academy of Sciences on Modeling Sci Tech Innovations, May 16-18.

They have a data enclave at Indiana with research data that they and their affiliates can use. (I guess Larivière also has, and has inherited, a big pile o' data? This has been a thought of mine... getting data into a format so I could have it lying around if I wanted to play with it.)

Filippo Radicchi Indiana University

He spoke about sleeping beauties in science. These are the articles that receive few citations for many years and then are re-discovered and start anew. This is based on this article. Turns out the phenomenon occurs fairly regularly and across disciplines. In some cases it's a model that then is more useful when computing catches up. In other cases it's when something gets picked up by a different discipline. One case is something used to make graphene. He's skeptical one of the top articles in this category is actually being read by people who cite it because it's only available in print in German from just a few libraries! (However, a librarian in the session *had* gotten a copy for a staff member who could read German).

I would love to take his 22M-article data set and try longitudinal k-means on it. If sleeping beauty is found often, what are the other typical shapes beyond the standard one?

He also touched on his work with movies - apparently using an oft-overlooked section of IMDB that provides information on references (uses the same framing as x, adopts the cinematography style of y, remakes z... I don't know exactly, but relationships).

Carl Bergstrom University of Washington

The first part of his talk reviewed Eigenfactor work, which should be very familiar to this audience (well, except that a speaker on the second day had no idea it's a newish measure that has since been adopted by JCR - he should update his screenshot - anyhoo).

Then he went on to discuss a number of new projects they're working on. Slides are here.

While ranking journals carries a certain level of controversy, they continued on to rank authors (ew?) and, most recently, articles, which required some special steps.

Cooler, I think, was the next work discussed: a mapping technique for reducing a busy graph to find patterns. "Good maps simplify and highlight relevant structures." Their method did well when compared to other methods and made it possible to compare over years. Nice graphic showing the emergence of neuroscience. They then did a hierarchical version. Also pretty cool. I'd have to see it in more detail, but it looks like a better option than the pruning and path methods I've seen used to do similar things. This hierarchical map is now being used as a recommendation engine - see babel.eigenfactor.org. I'll have to test it out to see.

Then (it was a very full talk) women vs. men: men self-cite more, which means they have higher h-indices.

Jacob Foster UCLA (Sociology)

If the last talk seemed packed, this one was like whoa. He talked really, really fast and did not slow down. The content was pretty heavy duty, too. It could be that the rest of the room basically knew it all so it was all review. I have read all the standard STS stuff, but it was fast.

He defines science as "the social production of collective intelligence."

Rumsfeld unknown unknowns... he's more interested in unknown knowns. (what do you know but do not know you know... you know? 🙂 )

Ecological rationality - the rationality of choices depends on context - vs. rational choice theory, which is based just on rules, not context.

Think of scientists as ants. Complex sociotechnical system. Information processing problem, using Marr's Levels.

  • computational level: what does the system do (e.g.: what problems does it solve or overcome) and similarly, why does it do these things
  • algorithmic/representational level: how does the system do what it does, specifically, what representations does it use and what processes does it employ to build and manipulate the representations
  • implementational/physical level: how is the system physically realised (in the case of biological vision, what neural structures and neuronal activities implement the visual system)

https://en.wikipedia.org/wiki/David_Marr_(neuroscientist)#Levels_of_analysis

Apparently less studied in humans is the representational to hardware. ... ? (I have really, really bad handwriting.)

Science leverages and tunes basic information processing (?)... cluster attention.

(incidentally, it's totally weird that Google Scholar doesn't know about "american sociological review" or ASR? I ended up browsing)
Foster, J. G., Rzhetsky, A., & Evans, J. A. (2015). Tradition and Innovation in Scientists’ Research Strategies. American Sociological Review, 80, 875-908. doi: 10.1177/0003122415601618

Scientists try various strategies to optimize between tradition (more likely to be accepted) and innovation (bigger pay offs). More innovative papers get more citations but conservative efforts are rewarded with more cumulative citations.

Rzhetsky, A., Foster, I. T., Foster, J. G., & Evans, J. A. (2015). Choosing experiments to accelerate collective discovery. PNAS, 112, 14569–14574. doi: 10.1073/pnas.1509757112

This article looked at chemicals in PubMed. Innovative meant new chemicals; traditional meant staying in the neighborhood of old ones. They found that scientists spend a lot of time in the neighborhood of established, important chemicals when they could advance science better by looking elsewhere. (hmmm, but... hm.)

The next bit of work I didn't get a citation for - not even enough to search - but they looked at JSTOR and word overlap. Probabilistic distribution of terms. Joint probability. (Maybe this article? pdf.) It looked at linguistic similarity (maybe?) and then export/import of citations. So ecology kept to itself while the social sciences were integrated. I asked about how different social science fields use the same word with vastly different meanings - I mentioned Fleck. He responded that it was true, but that often there is productive ambiguity in a new field misusing or misinterpreting another field's concept (e.g., capital). I'm probably less convinced about this one, but I would need to read further.

Panel 1: Scientific Establishment

  • George Santangelo - NIH portfolio management. Meh.
  • Maryann Feldman - geography and Research Triangle Park
  • Iulia Georgescu, Veronique Kiermer, Valda Vinson - publishers who, apparently, want what might already be available? Who are unwilling (except PLOS) or unable to share data/information quid pro quo in return for things. Who are skeptical (except for PLOS) that anything could be done differently? That's my take. Maybe others in the room found it more useful.

Nitesh Chawla University of Notre Dame

(scanty notes here - not feedback on the talk)

Worked with ArnetMiner data to predict h-indices.

Paper: http://arxiv.org/abs/1412.4754 

It turns out that, according to them, venue is key. So despite all of the articles that found poor correlation between JIF and an individual paper's likelihood of being cited... they say it's actually a pretty good predictor when combined with the researcher's authority. Yuck!

Janet Vertesi Princeton University

Perked up when I realized who she is - she's the one who studied the Rover teams! Her book is Seeing Like a Rover. Her dissertation is also available online, but everyone should probably go buy the book. She looked at more of a meso level of knowledge; she's really interested in teams. She found that different teams - even teams with overlapping membership - managed knowledge differently. The way instrument time (or really spacecraft maneuvering so you can use your instrument time) was handled was very different. A lot had to do with the move in the '90s toward faster, better, cheaper (example: MESSENGER). She used co-authoring networks in ADS and did community detection. Co-authorship shows team membership, since it's the same casts of characters writing together. This field is very different from others in that publications are in mind while the instruments are being designed.

She compared Discovery-class missions - Mars Exploration Rover: collectivist, integrated; everyone must give a go-ahead for decisions; MESSENGER: design system working groups (oh, my handwriting!)

vs. flagship missions - Cassini: hierarchical, separated. Divided up sections of the spacecraft. Conflict and competition. Used WWII as a metaphor (?!!). No sharing even among subteams before release. Clusters are related to team/instrument.

A new PI is working to merge across teams - this did show in the evolution of the network to a certain extent.

Galileo is another flagship example: it breaks down into separate clusters, not coordinated.

Organization of teams matters.

I admitted my fan girl situation and asked about the engineers. She only worked with scientists because she's a foreign national (may not mean anything to my readers who aren't in this world but others will be nodding their heads).  She is on a team for an upcoming mission so will see more then. She also has a doctoral student who is a citizen who may branch off and study some of these things.

Ying Ding Indiana University

She really ran out of time in the end. I was interested in her presentation but she flew past the meaty parts.

Ignite Talks (15 seconds per slide, 2 minutes overall, or similar)

  • Filippo Menczer - http://scholarometer.indiana.edu/ - tool to view more information about authors and their networks. Browser extension.
  • Caleb Smith
  • Orion Penner - many of us were absolutely transfixed that he dropped his note pages on the floor as he finished. It was late in the day!  He has a few articles on predicting future impact (example). On the floor.
  • Charles Ayoubi
  • Michael Rose
  • Jevin West
  • Jeff Alstott - awesome timing, left 15 for a question and 15 for its answer. Audience didn't play along.

Lee Giles Penn State University

It was good to save his talk for last. A lot going on besides keeping CiteSeer up and running. They do make their data and their algorithms freely available (see: http://csxstatic.ist.psu.edu/about). This includes extracting references. They also are happy to add in new algorithms that make improvements and work in their system. They accept any kind of document that works in their parsers, so typically journal articles and conference papers.

RefSeer - recommends cites you should add

TableSeer - extracts tables (didn't mention and there wasn't time to ask... he talked a lot about this for chemistry... I hope he's working with the British team doing the same?)

There are also tools to extract formulas, plots, equations, and acknowledgements, and to recommend collaborators (0 for me, sniff). See his site for links.

 

 


Preliminary thoughts on longitudinal k-means for bibliometric trajectories

I read with great interest Baumgartner and Leydesdorff's article* on group-based trajectory modeling of bibliometric trajectories, and I immediately wanted to try it. She used SAS or something like that, though, and I wanted R. I fooled around with it for a while last year but couldn't get it going in the R package for GBTM.**

Later, I ran across a way to do k-means clustering for longitudinal data - for trajectories! Cool. I actually understand the math a lot better, too.

Maybe I should mention what I mean by trajectories in this case. When you look at citations per year for articles in science, there's a typical shape: a peak at year 2-3 (depending on the field), then it slacks off and is pretty flat. Turns out there are a few other typical shapes you see regularly. One is the sleeping beauty - it goes along and then gets rediscovered and all of a sudden has another peak - maybe it turns out to be useful for computational modeling once computers catch up. Another is the workhorse paper that just continues to be useful over time and takes a steady strain - maybe it's a really nice review of a phenomenon. There may be 5 different shapes? I don't think anyone knows yet, for sure.

So instead of my other dataset I was playing with last year, with like 1000 articles from MPOW, I'm playing with articles from MPOW that were published between 1948 and 1979 and that were identified in a 1986 article as citation classics. 22 articles. I downloaded the full records for their citing articles and then ran an R script to pull out the PY of the citing articles (I also pulled out the cited references and did a fractional times cited count, but that's another story). I cut off the year the article was published, and then kept the next 35 years for each of the articles. That runs up to 2015 for a couple, but I don't think that will matter a lot as we're a ways into 2016 now.
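The kml part of that boils down to something like the sketch below - not my exact script; the file name, the id column, and the yr1-yr35 column names are made-up stand-ins for a table with one row per article and 35 columns of citations per year.

library(kml)

cites <- read.csv("citation_classics_trajectories.csv")   # hypothetical file name
traj  <- as.matrix(cites[, paste0("yr", 1:35)])           # hypothetical column names
cld22 <- cld(traj, idAll = as.character(cites$id), time = 1:35)

# the raw trajectories, before any clustering
matplot(t(traj), type = "l", lty = 1,
        xlab = "Years since publication", ylab = "Citations per year")

# partition into 2 through 6 clusters, 20 random restarts each (the package defaults)
kml(cld22, nbClusters = 2:6, nbRedrawing = 20)

plot(cld22, 3)   # trajectories colored by the 3-cluster partition
plot(cld22, 4)   # and by the 4-cluster partition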

Loaded it into R, plotted the trajectories straight off:

[Figure: the raw trajectories] Looks like a mess and there are only 22!

Let's look at 3 clusters:

[Figure: 3-cluster solution] Ok, so look at the percentiles. 4% is one article. This is a very, very famous article. You can probably guess it if you know MPOW. Then the green cluster is probably the workhorses. The majority are the standard layout.

Let's look at 4 clusters:

[Figure: 4-cluster solution] You still have the one crazy one here. Like 5 workhorses. The rest are variations on the normal spike: some have a really sharp spike and then not much after (these were the latest ones in the set - there wasn't enough distance yet to see what they would do); others have a normal spike and then stay pretty flat.

So I let it do the default and calculate with 2, 3, 4, 5, 6 clusters. When you get above 4, you just add more singletons. The article on kml*** says there's no absolute way to identify the best number of clusters but they give you a bunch of measurements and if they all agree, Bob's your uncle.
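In kml, those measurements come with the fitted object; continuing the hypothetical cld22 sketch from above, the quality plot below is the sort of thing you get from:

plotAllCriterion(cld22)   # Calinski-Harabasz and friends, rescaled so bigger is better
getClusters(cld22, 4)     # cluster assignment (A, B, C, D) for each article at k = 4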

[Figure: cluster quality criteria] Bigger is better (they normalize and flip some of them so you can look at them like this). Well, nuts. So the criteria that look at the compactness of the clusters divided by how far apart they're spaced (the first 3, I think?) completely disagree with the fourth, which is just something like distance from centroids. I don't know. I probably have to look at that section again.

Looking at the data, it doesn't make sense at all to do 5 or 6. Does 4 add information over 3? I think so, really. Of course, with this package you can do different distance measures, different starting points, and different numbers of iterations.

What practical purpose does this serve? Dunno. I really think it's worth giving workhorse papers credit. A good paper that continues to be useful... makes a real contribution, in my mind. But is there any way to tell that apart from a mediocre paper with a lower spike, short of waiting 35 years? Dunno.

 

* Baumgartner, S. E., & Leydesdorff, L. (2014). Group-based trajectory modeling (GBTM) of citations in scholarly literature: dynamic qualities of “transient” and “sticky knowledge claims”. Journal of the Association for Information Science and Technology, 65(4), 797-811. doi: 10.1002/asi.23009 (see arXiv)

** Interesting articles on it. It's from criminology and looks at recidivism. Package.

*** Genolini, C., Alacoque, X., Sentenac, M., & Arnaud, C. (2015). kml and kml3d: R Packages to Cluster Longitudinal Data. Journal of Statistical Software, 65(4), 1-34. Retrieved from http://www.jstatsoft.org/v65/i04/


Bibliometrics: Getting an accurate (+/-) count of articles published by an organization in a year

Dec 02 2015 Published under bibliometrics

As part of a benchmarking activity, I'm comparing our scholarly output with a few of our peers. Since we're comparing and we're normalizing by number of technical professional staff, we're not particularly concerned with being absolutely comprehensive.

Our strategy is to use Web of Science* and Scopus*, and to use their organization profile pages (instead of developing our own searches for each organization, which would be more comprehensive but perhaps not evenly so). I export these records into the (now) EndNote* client and de-dup from there. As previously mentioned, I need to have a more powerful way to de-dup.

Here are some strange findings that really do impact the results and that should be attended to if you want to get a decent number:

  • Book chapters for books with only one author. I do get the idea of a book chapter or two from the same book counting as individual items. In this case, coming from WoS, there were >50 chapters (including "introduction to section III"), all from the same author. I will count that as 1 book, not as 50+ articles.
  • Errata: should be easy to weed out, yet apparently they still came through. I definitely don't think these should be counted as separate contributions!
  • The case of OSA and the 6 conference papers.

Take a look at this screenshot:

Screenshot of OSA search results

Results from OSA search (https://www.osapublishing.org/search.cfm?q=Instantaneous%20multiplex%20imaging%20in%20reacting%20flows&meta=1&cj=0&cc=1)

There are 6 identical papers showing as having been presented at 6 different conferences. Scopus exported these as 6 different contributions. When I contacted OSA, they told me that this was correct, as the paper was presented at one of these conferences, which were all collocated. I'm sure the author only thinks of this as one paper, but if I had accepted the number uncritically or not de-duplicated, I wouldn't have noticed.

It's worth pointing out, too, that articles with Greek characters in their titles come through to EndNote* in several different ways, making them appear distinct to the de-duplication algorithm.
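If you end up matching titles yourself, even something this crude helps collapse the Greek-letter variants - a sketch with made-up titles, and definitely not what EndNote does internally:

# normalize titles before comparing: lowercase, spell out a few Greek letters,
# and squash punctuation/spacing differences
normalize_title <- function(x) {
  x <- tolower(x)
  x <- gsub("\u03b1", "alpha", x, fixed = TRUE)   # α
  x <- gsub("\u03b2", "beta",  x, fixed = TRUE)   # β
  x <- gsub("\u03b3", "gamma", x, fixed = TRUE)   # γ
  x <- gsub("[^a-z0-9]+", " ", x)
  trimws(x)
}

titles <- c("β-Carotene uptake in reacting flows",
            "Beta carotene uptake in reacting flows.")
duplicated(normalize_title(titles))   # FALSE TRUE - the second one is flagged as a dupe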

FWIW, I do know that typically bibliometric projects only use journal articles and one database to avoid these problems; however, a large portion of our output is in engineering fields in which conferences play an important role. I also know that I'm probably unfairly undercounting in computer science by using these databases. I don't think that unfairly targets any one of the individual organizations in the benchmarking study.

Another issue of concern is the conference paper that is expanded and reprinted in a journal (I count that) vs. meeting abstracts that appear in a journal (I do not count these).

 

 

*Not affiliated, not an endorsement, yadda, yadda


Slides from Leveraging Data to Lead

Nov 20 2015 Published under bibliometrics, Conferences, libraries

This was a great conference put on by Maryland SLA. I tweeted a bit using the hashtag #datamdsla.

Here are my slides. Not awesome, but I did find some nice pictures 🙂

 


Which are the bestest? Top articles from a diverse organization - part 1

Apr 01 2015 Published under bibliometrics, information analysis

In which Christina goes into the weeds, yet not really thoroughly enough... anyhoo.

So MPOW is approaching an anniversary and we're looking at retrospectives of all sorts. What are the top articles we've published in the literature? What do you mean by top? Ok, so let's say that top means most cited - just for argument's sake. Is it really fair to compare a biomed article to an aerospace engineering article? An article published last week (ok, if in a special issue it might come complete with citations attached) with one published 5 years ago? 10? 20? Review articles with .... you see where this is going.

I had thought to normalize by 5- or 10-year periods and use the subject categories in WoS. But... 1) there are a lot of them, 2) they overlap, 3) argh. And take Acoustics, for example: JASA covers biomed stuff like hearing and it also covers underwater sound... but they're not cited the same... at... all. The Acoustics category covers medical journals, physics journals, and maybe some math and engineering (I'd have to look again to be sure).

At the same time, the nice folks on SIGMETRICS had an argument starting last weekend and going through the beginning of the week on various normalization schemes. One of the complaints against the impact factor is that it's an average, and averages don't work on skewed distributions. And the WoS categories suck.

So... what I'm trying to do now is both fractional counting (and I'm checking to make sure I know what that is, but I think you don't get credit for 1 citation, you get credit for 1/(total things cited by the citing article), so a citation from a review article is worth a lot less than one from a regular article because it may be like +1/200 vs. +1/30) and normalizing by percentile. Not even the normal percentile, but this Hazen (1914) percentile. Tricky.
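To make sure I have the two ideas straight, here's a toy version with made-up numbers (not MPOW data):

# fractional counting: each citing paper contributes 1/(length of its reference list)
ref_list_lengths <- c(200, 30, 45)   # hypothetical citing papers (review, regular, regular)
sum(1 / ref_list_lengths)            # about 0.06, versus a whole count of 3

# Hazen (1914) percentiles: rank i out of n maps to 100 * (i - 0.5) / n,
# so nothing lands at exactly 0 or 100
tc <- c(5, 12, 40, 7, 300)           # hypothetical times-cited values
100 * (rank(tc) - 0.5) / length(tc)  # 10 50 70 30 90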

I'll be sure to share the script once I've got it. So far the method looks like:

  1. Find my org, relevant time period, articles only in WoS.
  2. Sort by cited, pull off the most cited or all the ones cited more than x or something. Save them down in plain text full record (probably don't need citations?)
  3. Then for each of the top, click on Times Cited. Export them all down in Tab del Windows UTF-8
  4. Move them over to data folder
  5. Run an R script (to be shared when I'm sure it's right) to get the new TCs and stick them into the file from step 2 (a rough sketch of the fractional-counting piece is below the note)

*note: if your thingy was cited more than 500 times, you can't export them all at once. Also this would not be practical if you have someone with like thousands of citations. If you do, I would just take the plunge and call that one of the best. We only had 5 over 500.
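For what it's worth, the R piece of step 5 might boil down to something like this per top paper - a sketch, not the actual script, and it assumes the tab-delimited full-record export includes the NR (cited reference count) column and whatever file name you used in step 4:

citing <- read.delim("data/top_paper_01_citers.txt", quote = "",
                     fileEncoding = "UTF-8", stringsAsFactors = FALSE)   # hypothetical file name
whole_tc <- nrow(citing)                       # plain times cited
frac_tc  <- sum(1 / citing$NR, na.rm = TRUE)   # each citer weighted by 1/(its reference count)
c(whole = whole_tc, fractional = round(frac_tc, 2))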

Next, I'll put them into the ISI.exe script and then the i3 script from here.  See what happens.

As for normalizing by year, I was thinking about maybe omitting a couple of years or so and then doing 5-year bins 3 times and then doing 10-year bins. Not sure. Willing to take advice. It's a 75-year history, but there was a similar paper done in 1986, so I only agreed to go back to 1980. Before a certain point - no longer necessarily 1973 - the affiliations/addresses aren't there. One very nice retiree I had the pleasure to meet just died, and I found that he was listed in Garfield's top cited articles. His work on polar gases is not coming up in the search, so it's definitely not complete that far back.


Sizing bars in a bar chart in R

Jun 24 2014 Published under bibliometrics, information analysis

Yet another stupid thing... but I did it so here's to remembering how.

I wanted to show, of all the places my place of work (MPOW) published in the past 5 years, what the impact factors were and how many articles were in each venue. (Yes, caveat the IF, but this is in response to a request.)

So I have a citation manager collection with the articles we've written, collected through database alerts in all the major databases. I exported that and cleaned up the journal names in VantagePoint (not affiliated, yadda, yadda... use OpenRefine if you don't have VP), and then laboriously a co-worker and I went through and added the IFs. Then I created a shortened name for each journal (woulda been easier if I had kept the official abbreviations) by first replacing journal with j, transactions with trans, proceedings with proc, letters with let, etc., and then using this in Excel:

=IF(LEN(A2)>25,LEFT(A2,25),A2)

Then I copied values and saved as CSV: short name, number of articles published, IF.

Here's how I graphed it.

library(ggplot2)

# the CSV from above: short journal name, number of articles, impact factor
# (file name here is a placeholder)
mpow2010.2014jnl <- read.csv("mpow-journals-2010-2014.csv")

w <- mpow2010.2014jnl$total.articles                    # bar widths = articles per journal
pos <- 0.5 * (cumsum(w) + cumsum(c(0, w[-length(w)])))  # center position of each bar
x <- seq_along(w)                                       # 1:257 in my case; only used for the fill
y <- mpow2010.2014jnl$Impact.Factor
my.labs <- mpow2010.2014jnl$Jnl.Short                   # short names, same row order as w and y

p <- ggplot() +
  geom_bar(aes(x = pos, width = w, y = y, fill = x), stat = "identity") +
  scale_x_continuous(labels = my.labs, breaks = pos)

p + ylab("Impact Factor") + xlab("Journal") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        legend.position = "none")

The labels were a wonky afterthought. Not elegant... and I wanted to get rid of the legend. Note I just made a dummy vector for x and then added the labels later. It worked... not pretty.

I would like to thank the kind folks on Stack Overflow.

Here it is - intentionally smushed so I don't give away too much. I hope. I can remove image if there's an issue, lmk, don't sue.

Where MPOW published 2010-2014 June


I ended up going back and zooming in on pieces. I shoulda made a function so I could just look at whatever part I wanted.. sigh.
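Something like this untested sketch is what I had in mind - the same plot wrapped in a function that takes a range of journals (by position in the data frame) so you can zoom in on whatever slice you want:

plot_jnl_range <- function(dat, from, to) {
  dat <- dat[from:to, ]
  w   <- dat$total.articles
  pos <- 0.5 * (cumsum(w) + cumsum(c(0, w[-length(w)])))
  ggplot() +
    geom_bar(aes(x = pos, width = w, y = dat$Impact.Factor, fill = seq_along(w)),
             stat = "identity") +
    scale_x_continuous(labels = dat$Jnl.Short, breaks = pos) +
    ylab("Impact Factor") + xlab("Journal") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
          legend.position = "none")
}

plot_jnl_range(mpow2010.2014jnl, 1, 40)   # just the first 40 journals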

 


Instrument bibliographies, data citation, searching for data

Jan 11 2013 Published under bibliometrics, Information Science

My place of work builds spacecraft and instruments that fly on other folks' spacecraft. So one of the things that we need to do is to come up with a list of publications that use our data. It's the same thing with telescopes, and it ends up being a lot more difficult than you might expect. There are a few articles on how to do it from ADS staff and from librarians at ESO (Uta Grothkopf and co-authors), STScI, and other telescopes. It turns out that you really have to do a full-text search to get closer to comprehensive. ADS has a full-text search covering some things, but I asked the experts in the Physics-Math-Astro Division of SLA and many of them also use the full-text searches on the journal publisher pages (which are of varying quality). I found that Google Scholar was the only thing that got book chapters. This is all pretty complicated if your instrument has a pretty common name or a name that is a very common word.

Other suggestions were to use funding data from Web of Science or elsewhere (soon to be part of CrossRef data), but that really only gets the science team for the instrument. Our main point is to find out who is downloading the data from the NASA site (or elsewhere?) and doing good science with it.

Heather Piwowar has done a lot of research on data citation (is it done, how do you find it), but I believe mostly with life sciences data. Joe Hourclé has also presented several times on data citation and there is the DataCite organization to work on this issue. But this is all future stuff. Right now it's the wild west.


ASIST2012: Metrics 2012

Oct 31 2012 Published under bibliometrics, Conferences

I attended two days of the ASIST annual meeting. I'm actually quite bummed because I was sure I wasn't going to get to any conferences for a long time because of the twins, but this one was local so I thought I could go. Unfortunately, superstorm Sandy shut down daycare and also caused Baltimore to shut the streets down 🙁 I did make it in for a workshop on Friday and most of the day on Sunday. Lesson learned: ASIST is more than happy to arrange a room for any mothers who need to pump, but you do need to ask in advance.

Metrics 2012: Symposium on Informetric and Scientometric Research

This is the second time this was held. It's a nice small group that discusses works in progress and recently completed work.

Kate McCain presented on "Assessing obliteration by incorporation" - I've been trying to remember the name for this phenomenon. This is when some concept becomes so ingrained in the field that the original article is no longer cited - there's no citation when the concept is mentioned. A similar idea is "palimpsestic syndrome" - that's when a newer article or review article is cited for a concept instead of the original source, because that's where the person read about it and they're unaware of the originator of the idea. The way to find bibliometric evidence is to look for the noun phrases and whether they appear with references. In the past this has been done 3 ways:

  • using only the metadata and citations from WoS (record level searching)
  • using the metadata and citations but then pulling the f/t and reviewing that (record level searching and text analysis)
  • now using the f/t

The problem with the first two ways is that you miss a lot of things that use the concept but not in the metadata, only somewhere down in the full text.  She looked for "bounded rationality" using JSTOR's collection - Drexel's subscription. This is somewhat limiting because they only have some JSTOR collections and the coverage of psychology is not good.

Dietmar Wolfram talked about journal similarity by incoming citing journal topicality. He did this for LIS because it annoys us all that the MIS and health informatics journals top the category in JCR - when they really maybe should be in another category. This seemed to work except for small journals or orphan journals (ones that are not cited).

Other things of interest:

  • lots of use of Gephi
  • lots of student interest
  • open position at Drexel if anyone is interested


Clustering articles using Carrot2

Sep 11 2012 Published under bibliometrics, Information Science

I did a very basic intro to using some social network analysis tools for bibliometrics here. This post will also be a brief "how I did something" for people with similar skills to mine. In other words, if you're a computer programmer or the like, this will be too basic.

I've been getting into information analysis more and more at work. I've tried to learn this stuff and keep up as time goes on because I know that it's a very useful tool but it's taken this long to get any real uptake at my place of work. Typically I'll use metadata from either WoS or Scopus or from a collection of specific research databases filtered through RefWorks. Once I have the metadata from the research database I'll often run it through Vantage Point for some cleaning and to make matrices (co-authorship, author x keyword, etc). More recently, I've been using Sci2 for some of this.

All of the tools I use so far work with metadata but I get a lot of calls for doing mining with the text. I do know of tons of tools to do this but I think they all take a little more commitment than I'm willing to give right now (learning to program, for example). Some things can be done in R, but I really haven't tried that either as there is still a steep learning curve.

Anyway, a software developer (well, he's really a lot more than that - he does rapid prototyping for human language technologies) buddy of mine from work has recommended Carrot2 a bunch of times. I now have a project that gives me an excuse to give it a whirl. We're mapping an area of research to pick out the key institutions, key authors, key venues... but also looking at the major areas of research. This could be done with author co-citation analysis or bibliographic coupling, but another way is to cluster the articles based on their abstracts - I used Carrot2 for this. A reason not to use Sci2 with WoS data to do ACA or bib coupling is that for this particular research area I was having a very hard time getting a nice clean, tight search in WoS, whereas some social sciences databases were yielding great results. As I was just telling a group at work, a lot depends on your starting set - if you do a crap search with lots of noise, then your bibliometrics aren't reliable and can be embarrassing.

Carrot2 out of the box will cluster search engine results from Bing, Google, and PubMed. If you download it, you can incorporate it into various programming thingies, and you can also use the document clustering workbench on any XML file or feed (like RSS). They have a very simple XML input format, and you use an XSLT to get your base file or feed to look like that. I exported my records from RefWorks in their XML and started reading up on XSLT... after some playing around I had an epiphany - I could just make a custom export format to get the input format directly from RefWorks!

I started from the RW XML but could have gone from scratch. In the output style editor, bibliography settings:

reference list title: <?xml version="1.0" ?>\n<searchresult>\n<query></query>\n

text after bibliography: </searchresult>

Then I only defined generic and all the ref types use that:

refid precede with <document id="

follow with ">

Basically do the same for title, primary; abstract (call it snippet); and url

Then add text to output: </document>

You end up with

<?xml version="1.0" ?>
<searchresult>
<query>paste your subject here, it's supposed to help</query>
<document id="ID">
<title>article title</title>
<url>url</url>
<snippet>abstract</snippet>
</document>
<document id="ID">
<title>article title</title>
<url>url</url>
<snippet>abstract</snippet>
</document>
</searchresult>

More or less, that is. I had some spaces I needed to remove. There was also one weird character that caused an error.

Then in the Carrot2 workbench you select XML, identify the location of the file, and shazaam! You get 3 different visualizations and you can export the clusters. One of my biggest clusters was the copyright Sage boilerplate, but it can be tuned and you can add to the stopword list if you want. I still want to play with the tuning and the different clustering methods.

 


Another tilt at a holy grail: identifying emerging research areas by mapping the literature

Jul 10 2011 Published under bibliometrics, Information Science, STS

Technology surprise, disruptive technologies, or being caught unaware when managing a research portfolio or research funding are some of the fears that keep research managers and research funders up at night. Individual scientists might see some interesting things at conferences and might keep a mental note, but unless they can see the connection to their own work, they will likely not bring it back. Even if they do, they might not be able to get funding for their new idea if the folks with the bucks don’t see where it’s going. Consequently, there are lots of different ways to do technology forecasting and the like. One of the main ways has been to mine the literature. Of course, as with anything using the literature, you’re looking at some time delay. After all, a journal article may appear three years or more after the research was started.

I’ve been party to a bunch of conversations about this and I’ve also dabbled in the topic so I was intrigued when I saw this article in my feed from the journal. Plus, it uses a tool I’ve had a lot of success with recently in my work, Sci2.

Citation: Guo, H., Weingart, S., & Börner, K. (2011). Mixed-indicators model for identifying emerging research areas. Scientometrics. DOI: 10.1007/s11192-011-0433-7

The data set they are using is all the articles from PNAS and Scientometrics over a period of thirty years from 1980 to 2010. They’re using the information from Thomson Reuters Web of Science, not the full text of the articles.

Their indicators of emerging areas are:

  • Addition of lots of new authors
  • Increased interdisciplinarity of citations
  • Bursts of new phrases/words

This differs from other work, like that of Chen and Zitt and others, which clusters on citation or co-citation networks and also looks at nodes with high betweenness centrality for turning points.

The addition of new authors per year is pretty straightforward, but the other two methods deserve some description. For the interdisciplinarity indicator, each cited article is given a score based on its journal’s location on the UCSD map of science. Then a Rao-Stirling diversity score is calculated for each article to provide an interdisciplinarity-of-citations measure. For each pair of citations in the reference list, the probability of the first being in its discipline, the probability of the second being in its discipline, and the great-circle distance between the two disciplines are used for the score (the map is on a sphere, hence not Euclidean distance). The limitations are pretty numerous. First, the map covers only journals, only 16k of them, and only journals that were around 2001-2005 (think of how many journals have come out in the last few years). Articles with more than 50% of their citations going to things not on the map were dropped. They mentioned areas with a lot of citations to monographs, but I would think the bigger problem would be conferences. Newer research areas might have problems finding a home in established journals or might be too new for journals and might only be appearing at conferences.
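Stripped of the map itself, the Rao-Stirling score for a single article's reference list is a sum of p_i * p_j * d_ij over pairs of disciplines. A toy version, with made-up shares and distances standing in for the great-circle distances on the UCSD map:

p <- c(physics = 0.5, biology = 0.3, infosci = 0.2)   # share of the article's references per discipline
d <- matrix(c(0.0, 0.8, 0.6,
              0.8, 0.0, 0.4,
              0.6, 0.4, 0.0), nrow = 3, byrow = TRUE,
            dimnames = list(names(p), names(p)))      # pairwise distances between disciplines
rao_stirling <- sum(outer(p, p) * d)                  # zero diagonal, so this sums over i != j
rao_stirling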

For word bursts, I know from using Sci2 that they’re using Kleinberg’s (2002, $, free pdf on CiteSeer) algorithm, but I don’t believe they state that in the article. Their description of their implementation of the algorithm is here. I’ve been curious about it but haven’t had the time to read the original article.

 

In general, this article is pretty cool because it’s completely open science. You can go use their tool, use their dataset, and recreate their entire analysis – they encourage you to do so. However, I’m not completely convinced that their indicators work for detecting emerging research areas, given the plots of one of their methods against another. Their datasets might be to blame. They look at two individual journals, so any given new research area might not catch on in a particular journal if there’s a conservative editor or something. PNAS is a general journal, so the articles probably first appeared in more specialized journals (they would have to be big time before hitting Science, Nature, or PNAS). Also, the interesting thing with h-index and impact factor (the two emerging areas looked at for Scientometrics) is not their coverage in the information science disciplines, but h-index’s emergence from the physics literature and the coverage of both in biology, physics, and other areas. If you look across journals, your dataset quickly becomes huge, but you might get a better picture. Impact factor was first introduced decade(s) before the start of their window, but socially it’s become so controversial because of the funding and the promotion and tenure tied to it – I believe a more recent phenomenon. Coding journals by discipline has often been done using JCR’s subject categories (lots of disagreement there), but others have mapped science journals by looking at citation networks. It would be computationally extremely expensive, but very cool, to actually have a subdisciplinary category applied to each article – not taken from the journal in which it was published. Also, someone really needs to do more of this for areas with important conference papers, like engineering and CS.

