Another tilt at a holy grail: identifying emerging research areas by mapping the literature

Jul 10 2011, filed under bibliometrics, Information Science, STS

Technology surprise, disruptive technologies, or being caught unaware when managing a research portfolio or research funding are some of the fears that keep research managers and research funders up at night. Individual scientists might see some interesting things at conferences and might keep a mental note, but unless they can see the connection to their own work, they will likely not bring those ideas back. Even if they do, they might not be able to get funding for a new idea if the folks with the bucks don't see where it's going. Consequently, there are lots of different ways to do technology forecasting and the like. One of the main approaches has been to mine the literature. Of course, as with anything using the literature, you're looking at some time delay; after all, a journal article may appear three years or more after the research was started.

I’ve been party to a bunch of conversations about this and I’ve also dabbled in the topic so I was intrigued when I saw this article in my feed from the journal. Plus, it uses a tool I’ve had a lot of success with recently in my work, Sci2.

Citation: Guo, H., Weingart, S., & Börner, K. (2011). Mixed-indicators model for identifying emerging research areas. Scientometrics. DOI: 10.1007/s11192-011-0433-7

The data set they are using is all the articles from PNAS and Scientometrics over a period of thirty years from 1980 to 2010. They’re using the information from Thomson Reuters Web of Science, not the full text of the articles.

Their indicators of emerging areas are:

  • Addition of lots of new authors
  • Increased interdisciplinarity of citations
  • Bursts of new phrases/words
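The first indicator amounts to tracking, per year, how many authors have not appeared in any earlier year of the dataset. A minimal sketch of that counting step (the tuple data shape is my assumption, not the paper's pipeline; in practice the names would be parsed from Web of Science author and year fields):

```python
from collections import defaultdict

def new_authors_per_year(records):
    """Count authors appearing for the first time in each year.

    records: iterable of (year, [author names]) tuples.
    Assumes author names are already disambiguated, which is
    a nontrivial problem in real bibliometric data.
    """
    seen = set()
    counts = defaultdict(int)
    for year, authors in sorted(records):
        for author in authors:
            if author not in seen:
                seen.add(author)
                counts[year] += 1
    return dict(counts)

records = [
    (1980, ["Smith, J", "Jones, A"]),
    (1981, ["Smith, J", "Lee, K"]),
    (1982, ["Lee, K", "Park, M", "Nguyen, T"]),
]
print(new_authors_per_year(records))  # {1980: 2, 1981: 1, 1982: 2}
```

A spike in this count for a topic or venue is then read as a sign that new people are flowing into the area.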

This differs from other work, such as that of Chen, Zitt, and others, which clusters on citation or co-citation networks and also looks for nodes with high betweenness centrality as turning points.

The addition of new authors per year is pretty straightforward, but the other two methods deserve some description. For disciplinarity, each cited article is given a score based on its journal's location on the UCSD map of science. Then a Rao-Stirling diversity score is calculated for each article to provide an interdisciplinarity-of-citations measure. For each pair of citations in the reference list, the score combines the probability of the first falling in its discipline, the probability of the second falling in its discipline, and the great-circle distance between the two disciplines (the map is on a sphere, hence not Euclidean distance). The limitations are pretty numerous. First, the map covers only journals (about 16,000 of them), and only journals that were around in 2001-2005 (think of how many journals have come out in the last few years). Articles with more than 50% of their citations going to things not on the map were dropped. They mentioned areas with a lot of citations to monographs, but I would think the bigger problem would be conferences. Newer research areas might have trouble finding a home in established journals, or might be too new for journals and might only be appearing at conferences.
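A minimal sketch of the Rao-Stirling calculation described above, with a toy distance lookup standing in for the great-circle distances on the UCSD map (all names and numbers here are illustrative):

```python
def rao_stirling(proportions, distance):
    """Rao-Stirling diversity: sum of p_i * p_j * d(i, j) over
    unordered pairs of disciplines cited by an article.

    proportions: dict mapping discipline -> share of the reference list
    distance: function returning the distance between two disciplines
    """
    disciplines = list(proportions)
    total = 0.0
    for i, a in enumerate(disciplines):
        for b in disciplines[i + 1:]:
            total += proportions[a] * proportions[b] * distance(a, b)
    return total

# Toy distances in place of the UCSD map's great-circle distances
toy = {frozenset(pair): d for pair, d in [
    (("physics", "biology"), 1.0),
    (("physics", "infosci"), 0.8),
    (("biology", "infosci"), 0.5),
]}
shares = {"physics": 0.5, "biology": 0.3, "infosci": 0.2}
score = rao_stirling(shares, lambda a, b: toy[frozenset((a, b))])
print(round(score, 4))  # 0.5*0.3*1.0 + 0.5*0.2*0.8 + 0.3*0.2*0.5 = 0.26
```

Some formulations sum over ordered pairs instead, which simply doubles the value; either way, an article citing distant disciplines in balanced proportions scores higher than one citing a single field.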

For word bursts, I know from using Sci2 that they're using Kleinberg's (2002, $, free pdf on CiteSeer) algorithm, but I don't believe they state that in the article. Their description of their implementation of the algorithm is here. I've been curious about it but haven't had the time to read the original article.
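For a flavor of how Kleinberg's burst detection works, here is a stripped-down two-state sketch of the batch version: a word is modeled as being emitted either at its base rate or at an elevated "burst" rate, and entering the burst state carries a cost, so only sustained elevations get flagged. This is my paraphrase of the idea, not Sci2's actual implementation, and the parameter names only loosely follow the paper:

```python
import math

def kleinberg_bursts(r, d, s=2.0, gamma=1.0):
    """Two-state sketch of Kleinberg-style batch burst detection.

    r[t]: documents containing the word in period t
    d[t]: total documents in period t
    State 0 emits at the overall base rate p0, state 1 at s * p0;
    moving up into the burst state costs gamma * ln(n), moving down
    is free. Returns the cheapest state sequence (1 = burst).
    """
    n = len(r)
    p0 = sum(r) / sum(d)              # base rate over the whole window
    p1 = min(s * p0, 0.9999)          # elevated (burst) rate
    trans = gamma * math.log(n)       # cost of entering the burst state

    def cost(t, p):                   # negative binomial log-likelihood
        return -(r[t] * math.log(p) + (d[t] - r[t]) * math.log(1.0 - p))

    # Viterbi search over the two states
    best = [cost(0, p0), cost(0, p1) + trans]
    back = []
    for t in range(1, n):
        prev = best
        c0 = 0 if prev[0] <= prev[1] else 1          # cheapest way into state 0
        c1 = 1 if prev[1] <= prev[0] + trans else 0  # cheapest way into state 1
        best = [prev[c0] + cost(t, p0),
                min(prev[1], prev[0] + trans) + cost(t, p1)]
        back.append((c0, c1))

    # Trace the cheapest state sequence backwards
    state = 0 if best[0] <= best[1] else 1
    states = [state]
    for choices in reversed(back):
        state = choices[state]
        states.append(state)
    return states[::-1]

print(kleinberg_bursts([1, 1, 10, 10, 1, 1], [100] * 6))
# [0, 0, 1, 1, 0, 0]: the two high-count periods are flagged as a burst
```

Kleinberg's full formulation allows a whole hierarchy of burst states rather than just two, which is what lets it distinguish mild from intense bursts.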


In general, this article is pretty cool because it's completely open science. You can go use their tool, use their dataset, and recreate their entire analysis, and they encourage you to do so. However, I'm not completely convinced that their indicators work for detecting emerging research areas, given the plots of one of their methods against another. Their datasets might be to blame. They look at two individual journals, so any given new research area might not catch on in a particular journal if there's a conservative editor or something. PNAS is a general journal, so the articles probably first appeared in more specialized journals (they would have to be big time before hitting Science, Nature, or PNAS).

Also, the interesting thing with the h-index and the impact factor (the two emerging areas examined for Scientometrics) is not their coverage in the information science disciplines, but the h-index's emergence from the physics literature and the coverage of both in biology, physics, and other areas. If you look across journals, your dataset quickly becomes huge, but you might get a better picture. The impact factor was first introduced decade(s) before the start of their window, but socially it has become so controversial because of the funding, promotion, and tenure decisions tied to it, which I believe is a more recent phenomenon.

Coding journals by discipline has often been done using JCR's subject categories (lots of disagreement there), but others have mapped science journals by looking at citation networks. It would be computationally extremely expensive, but very cool, to have a subdisciplinary category applied to each article itself, not taken from the journal in which it was published. Also, someone really needs to do more of this for areas with important conference papers, like engineering and CS.

10 responses so far

  • Bashir says:

    This is very interesting. Wouldn't conference proceedings be very useful for this? In my field conference presentations (posters, etc) seem to better represent what is current and offer a preview of what will be in the journals in a year or so. The main society conference does have an extensive catalog going back a decade or so.

    • Christina Pikas says:

      It very much depends on the discipline, but yes, for many, conference proceedings would be very important.

  • Valeria de Paiva says:

    UCSD map of science? link, please?
    interesting post indeed. thanks!

    • Christina Pikas says:

      The citation for the UCSD map is: Klavans, R., & Boyack, K. W. (2009). Toward a consensus map of science. Journal of the American Society for Information Science and Technology, 60(3), 455–476. DOI: 10.1002/asi.20991

      • Christina Pikas says:

        If you look at that article, they refer to the UCSD map as coming from: Klavans, R., & Boyack, K.W. (2008). Thought leadership: A new indicator for national and institutional comparison. Scientometrics, 75, 239–250.

  • Dave C. says:

    As you point out, new discoveries seem to take a few years before reaching real-world settings. There is publication lag (the time elapsed between an article being accepted and published), and there is a lag between an article being published and the ideas making their way into real-world settings. Einstein experienced this phenomenon: after publishing his first big paper on special relativity, there was just silence for a couple of years. Science is inherently cautious and slow moving. It is nice to know that work is being done on exploring new ways of minimizing the lag.

  • TimC says:

    Hi there, I am just finding out about the Sci2 tool, but have a generic question for you - I know it can read proprietary file formats (like .isi etc.), but it seems to convert them to tables, so I wondered if I could just load up .csv files and run the same analyses? For example, can I load up a .csv file with all the same fields as a .nsf file and run the same analysis?

    • Christina Pikas says:

      Hi - As far as I know, yes, you can import csv files. For some reason you don't get as many options as if you import isi files. You can do a lot of things but you don't get that whole extra menu. I'm not the expert though.

      • TimC says:

        Thanks - that was the impression I got looking quickly through the documentation. Weird. Same with CiteSpace - no .csv support AT ALL!

        • TimC says:

          Follow-up request - I have a very large dataset that is in Excel only, but contains everything I need to work with. I wondered if anybody might be able to help out with a macro for word co-occurrence? A long shot, but thanks in advance anyway.