Here's my slides. Not awesome but I did find some nice pictures 🙂
In which Christina goes into the weeds, yet not really thoroughly enough... anyhoo.
So MPOW is approaching an anniversary and we're looking at retrospectives of all sorts. What are the top articles we've published in the literature? What do you mean by top? Ok, so let's say that top means most cited - just for argument's sake. Is it really fair to compare a biomed article to an aerospace engineering article? An article published last week (ok, if in a special issue it might come complete with citations attached) with one published 5 years ago? 10? 20? Review articles with .... you see where this is going.
I had thought to normalize by 5 or 10 year periods and use the subject categories in WoS. But... 1) there are a lot of them 2) they overlap 3) argh. And things like Acoustics, for example. JASA covers biomed like hearing stuff and it covers underwater sound... but they're not cited the same... at...all. The acoustics category covers medical journals, physics journals, and maybe some math and engineering (I'd have to look again to be sure).
At the same time, the nice folks there on SIGMETRICS had an argument starting last weekend and going through the beginning of the week on various normalization schemes. One of the complaints against the impact factor is that it's an average and averages don't work on skewed distributions. And the WoS categories suck.
So... what I'm trying to do now is both fractional counting (and I'm checking to make sure I know what that is, but I think you don't get credit for 1 citation you get credit for 1/(total things cited by citing article) so like a citation from a review article is worth a lot less than one from a regular article because it may be like +1/200 vs. +1/30). And then I'm normalizing by percentile. Not even normal percentile but this Hazen(1914) percentile. Tricky.
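Here's a minimal sketch of those two steps, just to pin down the arithmetic. It assumes fractional counting works the way I described it above (each citing article hands out one point, split evenly over its reference list) and uses the Hazen plotting position 100 * (rank - 0.5) / n. Giving ties their average rank is my assumption, and the function names and toy data are made up for illustration. This is not the ISI.exe or i3 script.

```python
from collections import defaultdict


def fractional_counts(citing_refs):
    """citing_refs maps each citing article to the list of things it cites.

    Each citing article contributes a total weight of 1, split evenly
    across its (de-duplicated) reference list.
    """
    scores = defaultdict(float)
    for refs in citing_refs.values():
        unique = set(refs)
        if not unique:
            continue
        for cited in unique:
            scores[cited] += 1.0 / len(unique)
    return dict(scores)


def hazen_percentiles(scores):
    """Hazen percentile: 100 * (rank - 0.5) / n, where rank 1 is the lowest score.

    Tied scores share their average rank (an assumption, see the lead-in).
    """
    n = len(scores)
    values = list(scores.values())
    result = {}
    for key, value in scores.items():
        below = sum(1 for v in values if v < value)
        equal = sum(1 for v in values if v == value)
        avg_rank = below + (equal + 1) / 2.0
        result[key] = 100.0 * (avg_rank - 0.5) / n
    return result


# Toy data: a "review" citing four things and an "article" citing two.
toy = {"review": ["A", "B", "C", "D"], "article": ["A", "B"]}
counts = fractional_counts(toy)
percentiles = hazen_percentiles(counts)
```

With the toy data, A and B each get 1/4 + 1/2 = 0.75 and C and D get 0.25, so the citation from the longer reference list really is worth less. In practice you'd compute the percentiles within a reference set (say, same field and same publication window) rather than across everything.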
I'll be sure to share the script once I've got it. So far the method looks like:
*note: if your thingy was cited more than 500 times, you can't export them all at once. Also this would not be practical if you have someone with like thousands of citations. If you do, I would just take the plunge and call that one of the best. We only had 5 over 500.
Next, I'll put them into the ISI.exe script and then the i3 script from here. See what happens.
As for normalizing by year. I was thinking about maybe omitting a couple of years or so and then doing 5 year bins 3 times and then doing 10 year bins. Not sure. Willing to take advice. It's a 75 year history, but there was a similar paper done in 1986 so I only agreed to go back to 1980. Before a certain time - no longer necessarily 1973 - the affiliation/address aren't there. One very nice retiree I had the pleasure to meet just died and I found that he was listed in Garfield's top cited articles. His work on polar gases is not coming up in the search so it's definitely not complete that far back.
Yet another stupid thing... but I did it so here's to remembering how.
I wanted to show of all the places my place of work (MPOW) published in the past 5 years, what their impact factors were and how many in each venue. (yes, caveat the IF but this is in response to a request)
So I have a citation manager collection with the articles we've written, collected through database alerts in all the major databases. I exported that and cleaned up the journal names in VantagePoint (not affiliated, yadda, yadda... use Open Refine if you don't have VP), and then laboriously a co-worker and I went through and added the IFs. Then I created a shortened name for each journal (woulda been easier if I kept the official abbr) by first replacing journal with j, transactions with trans, proceedings with proc, letters with let, etc. Then using the Excel
Then copied values and then saved as CSV: short name, number of articles published, IF.
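If you'd rather script the shortening than do it by hand, here's a rough sketch of the replacements described above. Only the four substitutions I listed are included (the "etc." is up to you), and lowercasing everything first is an assumption I added so the matching is simple. The function name is made up.

```python
import re

# Word -> abbreviation pairs taken from the post; extend as needed.
ABBREVIATIONS = [
    ("journal", "j"),
    ("transactions", "trans"),
    ("proceedings", "proc"),
    ("letters", "let"),
]


def shorten(name):
    """Return a shortened, lowercased journal name."""
    short = name.lower()
    for word, abbr in ABBREVIATIONS:
        # \b keeps us from touching words that merely contain the pattern
        short = re.sub(r"\b" + word + r"\b", abbr, short)
    # collapse any doubled spaces left behind
    return re.sub(r"\s+", " ", short).strip()


example = shorten("IEEE Transactions on Plasma Science")
```

Keeping the official abbreviations would still be the better route if you have them.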
Here's how I graphed it.
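library(ggplot2)

# bar width = number of articles published in each journal
w <- mpow2010.2014jnl$total.articles
# center of each bar: halfway between its left and right edges
pos <- 0.5 * (cumsum(w) + cumsum(c(0, w[-length(w)])))
# one value per journal (257 of them here), only used for the fill color
x <- c(1:257)
y <- mpow2010.2014jnl$Impact.Factor
my.labs <- mpow2010.2014jnl.s$Jnl.Short

p <- ggplot() +
  geom_bar(aes(x = pos, width = w, y = y, fill = x), stat = "identity") +
  scale_x_continuous(labels = my.labs, breaks = pos)

# rotate the journal labels and drop the legend
p + ylab("Impact Factor") + xlab("Journal") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        legend.position = "none")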
The labels were a wonky afterthought. not elegant... and I wanted to get rid of the legend. Note I just made a blank vector for x and then added the labels later. It worked... not pretty.
I would like to thank the kind folks on stackoverflow:
Here it is - intentionally smushed so I don't give away too much. I hope. I can remove image if there's an issue, lmk, don't sue.
I ended up going back and zooming in on pieces. I shoulda made a function so I could just look at whatever part I wanted.. sigh.
My place of work builds spacecraft and instruments that fly on other folks' spacecraft. So one of the things that we need to do is to come up with a list of publications that use our data. It's the same thing with telescopes and it ends up being a lot more difficult than you might expect. There are a few articles on how to do it from ADS staff and from librarians from ESO (Uta Grothkopf and co-authors), STScI, and other telescopes. It turns out that you really have to do a full text search to get closer to comprehensive. ADS has a fulltext search covering some things, but I asked the experts in the Physics-Math-Astro Division of SLA and many of them also use the fulltext searches on the journal publisher pages (which are of varying quality). I found that Google Scholar was the only thing that got book chapters. This is all pretty complicated if your instrument has a pretty common name or a name that is a very common word.
Other suggestions were to use funding data from Web of Science or elsewhere (soon to be part of CrossRef data), but that really only gets the science team for the instrument. Our main point is to find out who is downloading the data from the NASA site (or elsewhere?) and doing good science with it.
Heather Piwowar has done a lot of research on data citation (is it done, how do you find it), but I believe mostly with life sciences data. Joe Hourclé has also presented several times on data citation and there is the DataCite organization to work on this issue. But this is all future stuff. Right now it's the wild west.
I attended two days of the ASIST annual meeting. I'm actually quite bummed because I was sure I wasn't going to get to any conferences for a long time because of the twins, but this one was local so I thought I could go. Unfortunately, the superstorm Sandy shut down daycare and also caused Baltimore to shut the streets down 🙁 I did make it in for a workshop on Friday and most of the day on Sunday. Lesson learned - ASIST is more than happy to arrange a room for any mothers who need to pump, but you do need to ask in advance.
Metrics 2012: Symposium on Informetric and Scientometric Research
This is the second time this was held. It's a nice small group that discusses works in progress and recently completed work.
Kate McCain presented on "Assessing obliteration by incorporation" - I've been trying to remember the name for this phenomenon. This is when some concept becomes so ingrained in the field the original article is no longer cited - there's no citation for when the concept is mentioned. A similar idea is "palimpsestic syndrome" - that's when a newer article or review article is cited for a concept instead of the original source because that's where the person read about it and they're unaware of the originator of the idea. The way to find bibliometric evidence is to look for noun phrases with references. In the past this has been done 3 ways:
The problem with the first two ways is that you miss a lot of things that use the concept but not in the metadata, only somewhere down in the full text. She looked for "bounded rationality" using JSTOR's collection - Drexel's subscription. This is somewhat limiting because they only have some JSTOR collections and the coverage of psychology is not good.
Dietmar Wolfram talked about journal similarity by incoming citing journal topicality. He did this for LIS because it annoys us all that the MIS and health informatics journals top the category in JCR - when they really maybe should be in another category. This seemed to work except for small journals or orphan journals (ones that are not cited).
Other things of interest:
I did a very basic intro to using some social network analysis tools for bibliometrics here. This post will also be a brief "how I did something" for people with similar skills to mine. In other words, if you're a computer programmer or the like, this will be too basic.
I've been getting into information analysis more and more at work. I've tried to learn this stuff and keep up as time goes on because I know that it's a very useful tool but it's taken this long to get any real uptake at my place of work. Typically I'll use metadata from either WoS or Scopus or from a collection of specific research databases filtered through RefWorks. Once I have the metadata from the research database I'll often run it through Vantage Point for some cleaning and to make matrices (co-authorship, author x keyword, etc). More recently, I've been using Sci2 for some of this.
All of the tools I use so far work with metadata but I get a lot of calls for doing mining with the text. I do know of tons of tools to do this but I think they all take a little more commitment than I'm willing to give right now (learning to program, for example). Some things can be done in R, but I really haven't tried that either as there is still a steep learning curve.
Anyway, a software developer (well he's really a lot more than that - he does rapid prototyping for human language technologies) buddy of mine from work has recommended Carrot2 a bunch of times. I now have a project that gives me an excuse to give it a whirl. We're mapping an area of research to pick out the key institutions, key authors, key venues... but also looking at the major areas of research. This could be done with author co-citation or bibliographic coupling, but another way is to cluster the articles based on their abstracts - I used Carrot2 for this. A reason not to use Sci2 with WoS data to do ACA or bib coupling is that for this particular research area I was having a very hard time getting a nice clean tight search in WoS, whereas some social sciences databases were yielding great results. As I was just telling a group at work, a lot depends on your starting set - if you do a crap search with lots of noise, then your bibliometrics aren't reliable and can be embarrassing.
Carrot2 out of the box will cluster search engine results from Bing, Google, and PubMed. If you download it, you can incorporate it into various programming thingies, and you can also use the document clustering workbench on any xml file or feed (like rss). They have a very simple xml input format and you use an xslt to get your base file or feed to look like that. I exported my records from RefWorks in their xml and I started reading up on XSLT...after some playing around I had an epiphany - I could just make a custom export format to get the input format directly from RefWorks!
I started from the RW xml but could have gone from scratch. In the output style editor, bibliography settings:
reference list title: <?xml version="1.0" ?>\n<searchresult>\n<query></query>\n
text after bibliography: </searchresult>
Then I only defined generic and all the ref types use that:
refid precede with <document id="
follow with ">
Basically do the same for title, primary; abstract (call it snippet); and url
Then add text to output: </document>
You end up with
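<?xml version="1.0" ?>
<searchresult>
<query>paste your subject here, it's supposed to help</query>
<document id="...">
<title>...</title>
<snippet>...</snippet>
<url>...</url>
</document>
</searchresult>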
More or less, that is. I had some spaces I needed to remove. There was also one weird character that caused an error.
Then in Carrot2 workbench you select XML and then identify the location of the file and shazaam! You get 3 different visualizations and you can export the clusters. One of my biggest was copyright Sage but it can be tuned and you can add to the stopword list if you want. I still want to play with the tuning and the different clustering methods.
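If you don't have RefWorks, or just don't want to fight with an output style, the same input file can be built with a few lines of script. This is only a sketch: the element names (searchresult, query, document, title, snippet, url) are the ones from the format described above, but the record dictionaries and the function name are made up for illustration. A side benefit of using an XML library is that it escapes characters like & and < for you, which may or may not be what tripped me up with that one weird character.

```python
import xml.etree.ElementTree as ET


def to_carrot2_xml(records, query=""):
    """Build the simple Carrot2-style input document as a string.

    records: list of dicts with 'id', 'title', 'abstract' and 'url' keys
    (hypothetical field names, rename to match your export).
    """
    root = ET.Element("searchresult")
    ET.SubElement(root, "query").text = query
    for rec in records:
        doc = ET.SubElement(root, "document", id=str(rec["id"]))
        ET.SubElement(doc, "title").text = rec.get("title", "")
        # the abstract goes in the element Carrot2 calls "snippet"
        ET.SubElement(doc, "snippet").text = rec.get("abstract", "")
        ET.SubElement(doc, "url").text = rec.get("url", "")
    return '<?xml version="1.0" ?>\n' + ET.tostring(root, encoding="unicode")


sample = [
    {"id": 1, "title": "Clustering & mapping", "abstract": "We cluster abstracts.", "url": "http://example.org/1"},
    {"id": 2, "title": "Another paper", "abstract": "More text here.", "url": "http://example.org/2"},
]
xml_text = to_carrot2_xml(sample, query="science mapping")
```

Write xml_text out to a file and point the workbench at it the same way.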
Technology surprise, disruptive technologies, or being caught unaware when managing a research portfolio or research funding are some of the fears that keep research managers and research funders up at night. Individual scientists might see some interesting things at conferences and might keep a mental note, but unless they can see the connection to their own work, will likely not bring it back. Even if they do, they might not be able to get funding for their new idea if the folks with the bucks don’t see where it’s going. Consequently, there are lots of different ways to do technology forecasting and the like. One of the main ways has been to mine the literature. Of course, as with anything using the literature you’re looking at some time delay. After all, a journal article may appear three years or more after the research was started.
I’ve been party to a bunch of conversations about this and I’ve also dabbled in the topic so I was intrigued when I saw this article in my feed from the journal. Plus, it uses a tool I’ve had a lot of success with recently in my work, Sci2.
Citation: Guo, H., Weingart, S., & Börner, K. (2011). Mixed-indicators model for identifying emerging research areas. Scientometrics. DOI: 10.1007/s11192-011-0433-7
The data set they are using is all the articles from PNAS and Scientometrics over a period of thirty years from 1980 to 2010. They’re using the information from Thomson Reuters Web of Science, not the full text of the articles.
Their indicators of emerging areas are:
This differs from other work from like Chen and Zitt and others that cluster on citation or co-citation networks and also look at nodes with high betweenness centrality for turning points.
The addition of new authors per year is pretty straightforward, but the other two methods deserve some description. For disciplinarity, each cited article is given a score based on its journal’s location on the UCSD map of science. Then a Rao-Stirling diversity score is calculated for each article to provide a measure of the interdisciplinarity of its citations. For each pair of citations in the reference list, the probability of the first being in a discipline, the probability of the second in its discipline and then the great circle distance between the two disciplines are used for the score (the map is on a sphere hence not Euclidean distance). The limitations are pretty numerous. First the map is only journals, only 16k journals, and only journals that were around 2001-2005 (think of how many journals have come out in the last few years). Articles with more than 50% of the citations not going to things on the map were dropped. They mentioned areas with a lot of citations to monographs, but I would think the bigger problem would be conferences. Newer research areas might have problems finding a home in established journals or might be too new for journals and might only be appearing at conferences.
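To make the diversity score a little more concrete, here's a sketch of a Rao-Stirling calculation for one article's reference list: the sum over pairs of different disciplines of p_i * p_j * d_ij, with d_ij as the great circle distance. Everything specific here is an assumption for illustration: the coordinates are invented (they are not the UCSD map), I'm summing over ordered pairs (some write-ups sum over unordered pairs, which halves the number), and this is not the authors' code.

```python
import math
from collections import Counter


def great_circle(a, b):
    """Central angle in radians between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    # haversine formula on a unit sphere
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * math.asin(min(1.0, math.sqrt(h)))


def rao_stirling(ref_disciplines, positions):
    """ref_disciplines: one discipline label per cited reference.
    positions: discipline label -> (lat, lon) on the map.
    """
    counts = Counter(ref_disciplines)
    total = sum(counts.values())
    score = 0.0
    for i, ci in counts.items():
        for j, cj in counts.items():
            if i != j:
                # share of refs in i, share in j, distance between them
                score += (ci / total) * (cj / total) * great_circle(positions[i], positions[j])
    return score


# Invented coordinates, a quarter of the way around the sphere apart.
positions = {"physics": (0.0, 0.0), "biology": (0.0, 90.0)}
mixed = rao_stirling(["physics", "physics", "biology", "biology"], positions)
single = rao_stirling(["physics", "physics", "physics"], positions)
```

A reference list that stays inside one discipline scores 0, and the score goes up both as the references spread out more evenly and as the disciplines sit farther apart on the map.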
For word bursts, I know from using Sci2 that they’re using Kleinberg’s (2002, $, free pdf on citeseer) algorithm, but I don’t believe they state that in the article. Their description of their implementation of the algorithm is here. I’ve been curious about it but haven’t had the time to read the original article.
In general, this article is pretty cool because it’s completely open science. You can go use their tool, use their dataset, and recreate their entire analysis – they encourage you to do so. However, I’m not completely convinced that their areas work for detecting emerging research areas given the plots of one of their methods against another. Their datasets might be to blame. They look at two individual journals so any given new research area might not catch on in a particular journal if there’s a conservative editor or something. PNAS is a general journal so the articles probably first appeared in more specialized journals (they would have to be big time before hitting Science, Nature, or PNAS). Also, the interesting thing with h-index and impact factor (the two emerging areas looked at for Scientometrics) is not their coverage in the information science disciplines, but h-index’s emergence from the physics literature and the coverage of both in biology, physics, and other areas. If you look across journals, your dataset quickly becomes huge, but you might get a better picture. Impact factor was first introduced decade(s) before the start of their window, but socially it’s become so controversial because of the funding tied to it and promotion and tenure tied to it – I believe a more recent phenomenon. Coding journals by discipline has often been done using JCR’s subject categories (lots of disagreement there) but others have mapped science journals by looking at citation networks. It would be computationally extremely expensive, but very cool to actually have a subdisciplinary category applied to each article – not taken from the journal in which it was published. Also, someone really needs to do more of this for areas with important conference papers like engineering and cs.
Waiting for time to craft the perfect post apparently means 3 weeks without a post. Anyway, I'm so very grateful to a colleague at work who has me working on a project with him to do some science mapping/scientometrics, etc. Fun stuff I don't get to do all the time even though it seems like I spend a lot of time thinking about how to do it, training on how to do it, learning about it, etc.
I'm using some of the usual suspects, research databases (Inspec, WoS, Scopus- working surprisingly well for the topic), RefWorks to gather citations, VantagePoint to clean up and rank things (conference venues, people, affiliations, etc)... but now I've totally gone off on a Sci2 trip.
I did some bib coupling networks and then did pathfinder pruning and then graphed the bad boy... but the thing that had me dancing around and sharing with my co-workers was my beautiful geographic plot. I got a Yahoo api ID and then used the Yahoo geocoder on the correspondence addresses and then mapped where the articles were from. It would have been better if I had rolled up the locations and then had a column with a summary number of publications for that location... but no matter. I used times cited. I'll try to post a picture later.
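For next time, the roll-up I skipped is only a few lines. This is a sketch under my own assumptions: the record fields ("location", "times_cited") are made-up names for whatever comes out of your cleaned export, and the only normalization is trimming whitespace and lowercasing, so "Laurel, MD" and "Laurel MD" would still count as two places.

```python
from collections import defaultdict


def roll_up(records):
    """Summarize records by location: number of publications and total times cited."""
    summary = defaultdict(lambda: {"publications": 0, "times_cited": 0})
    for rec in records:
        # light normalization so trivially different spellings merge
        key = " ".join(rec["location"].split()).lower()
        summary[key]["publications"] += 1
        summary[key]["times_cited"] += rec.get("times_cited", 0)
    return {k: dict(v) for k, v in summary.items()}


sample = [
    {"location": "Laurel, MD", "times_cited": 3},
    {"location": "laurel,  md", "times_cited": 5},
    {"location": "Boulder, CO", "times_cited": 1},
]
by_place = roll_up(sample)
```

Then you geocode each location once and size the dots on the map by the publications column instead of (or alongside) times cited.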
This should be good enough for a conference poster. Hm. That's July 1 for ASIST. But then I'd have to reveal the topic... well, we'll see.
I’ve been seeing a lot about Sci2, a tool for scientometrics, social network analysis, science of science… all sorts of stuff from Indiana University. After watching some videos and reading through the very detailed instructions, I figured I would give it a try. So in a little while this afternoon, I got it going on my computer and worked through some of the workflows.
Pretty cool and pretty straight forward. Probably not as easy as, say, NetDraw for working with the graph. A lot more powerful. It also can do huge data sets and do things like co-word that the other programs don’t do out of the box (you need to get a helper program or write your own). It also takes the output directly from both Scopus and Web of Science without any messing around.
After using some of their sample data, I tried some of my own. That did cause a little bit of frustration. My data are in RefWorks – this is the research output from a particular department at MPOW. It made sense to try BibTeX. Turned out that it wasn’t able to import because there were lots of extra @ , < , >, and continuation lines didn’t work. I made a custom export format to get rid of the notes, abstract, and affiliation fields. Then it turned out that there were <sup> in the titles and @ in the page number fields (dunno why for the latter). After I got it all cleaned up it imported and I was able to clean it up and “preprocess” the data. Then it turned out that I needed to have allocated more memory for the file size so I arbitrarily cut off half the data and tried again. It was calculating the bursts of keywords over time when I noticed it was getting really late so I left it going and headed home. I’ll post a pretty picture when I have one.
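I did that cleanup by hand, but something like the following would probably have saved me some time. It's a sketch that only covers the two problems I actually hit: it strips HTML-style tags such as <sup> (keeping the text inside them) and removes any @ that isn't the start of an entry. It assumes each entry starts on its own line like @article{...}, and it would mangle email addresses, so only use it after dropping notes and affiliations. The function name is made up.

```python
import re


def clean_bibtex(text):
    """Remove stray markup and @ signs that can stop a BibTeX import."""
    cleaned = []
    for line in text.splitlines():
        # drop tags like <sup> and </sup> but keep the enclosed text
        line = re.sub(r"</?[A-Za-z][^>]*>", "", line)
        # an @ is only legitimate at the start of an entry, e.g. @article{
        if not re.match(r"\s*@\w+\s*\{", line):
            line = line.replace("@", "")
        cleaned.append(line)
    return "\n".join(cleaned)


raw = "@article{x1,\n  title = {CO<sup>2</sup> uptake},\n  pages = {12@-15},\n}"
fixed = clean_bibtex(raw)
```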
This is definitely worth a try. I’m all about lots of handholding and their wiki is very useful that way.
Update 5/18/2011: A product marketing manager from Wiley has informed me that she has made this article free to view for 30 days because of this blog post. Thanks!
Update 6/22/2011: The first author contacted me to let me know the full citation since the article is fully published now.
Science mapping is using the network formed from the links between articles (citations, co-authorship), patents, or other information things, to understand the structure of science (Börner, Chen, & Boyack, 2003), look for turning points or bursty periods in which lots of publishing happens, locate research specialties (Morris & Martens, 2008), locate important research institutes, look for geographic concentrations, and trace the history of an idea. It’s been around for a while, but it’s gotten even more popular with better visualization techniques and more powerful computers. It’s particularly hot right now with the two major data providers, Thomson Reuters and Elsevier, heavily marketing analysis products. Elsevier sponsored a webinar last week in their Research Connect series on this topic and had more than 500 people dialed in (slides will be posted but I haven’t been notified that they are yet).
In addition to a bunch of different really expensive commercial products that cover some aspects of this process, there are a ton of tools available from universities and research centers that are either free or inexpensive for non-profit or educational use. This article reviews the field – the general techniques – but mostly reviews the tools. For more on the whys and whats refer to the ARIST articles cited earlier or, really, read articles that cite them or are cited by them.
Cobo, M., López-Herrera, A., Herrera-Viedma, E., & Herrera, F. (2011). Science mapping software tools: Review, analysis, and cooperative study among tools. Journal of the American Society for Information Science and Technology, 62, 1382–1402. DOI: 10.1002/asi.21525
This article is pretty readable and really useful – with the proliferating tools, it’s nice to compare them without having to install them all. They review: Bibexcel, CiteSpace, CoPalRed, IN-SPIRE, Leydesdorff’s software, Network Workbench, Sci2, VantagePoint, and VOSviewer. Hey CiteSpace now takes data from ADS – that’s cool! IN-SPIRE is kind of weird – different – since it doesn’t do the thing with bibliographic data. Likewise Leydesdorff’s software is more a series of utilities to deal with bibliographic data. VantagePoint is commercial and somewhat expensive (I have it at work) but it’s really cool how it does data cleanup. They also provide helpful tables to compare the tools.
Interestingly, they tested things like I typically use them: all at the same time. They use one to clean, another to analyze, and a third to visualize. Since Sci2 is free, powerful, and well documented, it looks like a good bet. I’ve tried CiteSpace before but it looks like it has improved a lot since then.
Börner, K., Chen, C., & Boyack, K. W. (2003). Visualizing knowledge domains. Annual Review of Information Science and Technology, 37, 179–255. doi: 10.1002/aris.1440370106
Morris, S. A., & Van der Veer Martens, B. (2008). Mapping research specialties. Annual Review of Information Science and Technology, 42, 213–295. doi: 10.1002/aris.2008.1440420113