Technology surprise, disruptive technologies, or being caught unaware when managing a research portfolio or research funding are some of the fears that keep research managers and research funders up at night. Individual scientists might see some interesting things at conferences and might keep a mental note, but unless they can see the connection to their own work, will likely not bring it back. Even if they do, they might not be able to get funding for their new idea if the folks with the bucks don’t see where it’s going. Consequently, there are lots of different ways to do technology forecasting and the like. One of the main ways has been to mine the literature. Of course, as with anything using the literature you’re looking at some time delay. After all, a journal article may appear three years or more after the research was started.
I’ve been party to a bunch of conversations about this and I’ve also dabbled in the topic so I was intrigued when I saw this article in my feed from the journal. Plus, it uses a tool I’ve had a lot of success with recently in my work, Sci2.
Citation: Guo, H., Weingart, S., & Börner, K. (2011). Mixed-indicators model for identifying emerging research areas Scientometrics DOI: 10.1007/s11192-011-0433-7
The data set they are using is all the articles from PNAS and Scientometrics over a period of thirty years from 1980 to 2010. They’re using the information from Thomson Reuters Web of Science, not the full text of the articles.
Their indicators of emerging areas are:
- Addition of lots of new authors
- Increased interdisciplinarity of citations
- Bursts of new phrases/words
This differs from other work from like Chen and Zitt and others that cluster on citation or co-citation networks and also look at nodes with high betweenness centrality for turning points.
The addition of new authors per year is pretty straight forward, but the other two methods deserve some description. For disciplinarity, each cited article is given a score based on its journal’s location on the UCSD map of science. Then a Rao-Stirling diversity score is calculated for each article to provide a interdisciplinarity of citations. For each pair of citations in the reference list, the probability of the first being in a discipline, the probability of the second in its discipline and then the great circle distance between the two disciplines are used for the score (the map is on a sphere hence not Euclidean distance). The limitations are pretty numerous. First the map is only journals, only 16k journals, and only journals that were around 2001-2005 (think of how many journals have come out in the last few years). Articles with more than 50% of the citations not going to things on the map were dropped. They mentioned areas with a lot of citations to monographs, but I would think the bigger problem would be conferences. Newer research areas might have problems finding a home in established journals or might be too new for journals and might only be appearing at conferences.
For word bursts, I know from using Sci2 that they’re using Kleinburg’s (2002, $,free pdf on citeseer) algorithm, but I don’t believe they state that in the article. Their description of their implementation of the algorithm is here. I’ve been curious about it but haven’t had the time to read the original article.
In general, this article is pretty cool because it’s completely open science. You can go use their tool, use their dataset, and recreate their entire analysis – they encourage you to do so. However, I’m not completely convinced that their areas work for detecting emerging research areas given the plots of one of their methods against another. Their datasets might be to blame. They look at two individual journals so any given new research area might not catch on in a particular journal if there’s a conservative editor or something. PNAS is a general journal so the articles probably first appeared in more specialized journals (they would have to be big time before hitting Science, Nature, or PNAS). Also, the interesting thing with h-index and impact factor (the two emerging areas looked at for Scientometrics) is not their coverage in the information science disciplines, but h-index’s emergence from the physics literature and the coverage of both in biology, physics, and other areas. If you look across journals, your dataset quickly becomes huge, but you might get a better picture. Impact factor was first introduced decade(s) before the start of their window, but socially it’s become so controversial because of the funding tied to it and promotion and tenure tied to it – I believe a more recent phenomenon. Coding journals by discipline has often been done using JCR’s subject categories (lots of disagreement there) but others have mapped science journals by looking at citation networks. It would be computationally extremely expensive, but very cool to actually have a subdisciplinary category applied to each article – not taken from the journal in which it was published. Also, someone really needs to do more of this for areas with important conference papers like engineering and cs.