As part of a benchmarking activity, I'm comparing our scholarly output with that of a few of our peers. Since we're comparing across organizations and normalizing by the number of technical professional staff, we're not particularly concerned with being absolutely comprehensive.
Our strategy is to use Web of Science* and Scopus*, relying on the profile page each provides for an organization (instead of developing our own searches for each, which would be more comprehensive, but perhaps not evenly so). I export these records into the (now) EndNote* client and de-duplicate from there. As previously mentioned, I need a more powerful way to de-duplicate.
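For what it's worth, the kind of "more powerful" de-duplication I have in mind can be sketched in a few lines of Python. This is a hypothetical standalone script, not anything EndNote provides; the threshold and the record titles are made up for illustration:

```python
import difflib
import re

def title_key(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial
    formatting differences don't make two records look distinct."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def dedup(titles, threshold=0.9):
    """Keep the first occurrence of each title; drop later titles whose
    normalized form is a near-match (ratio >= threshold) of a kept one."""
    kept = []
    for t in titles:
        k = title_key(t)
        if not any(
            difflib.SequenceMatcher(None, k, title_key(u)).ratio() >= threshold
            for u in kept
        ):
            kept.append(t)
    return kept

records = [
    "Optical Properties of Thin Films",
    "Optical properties of thin films.",  # same paper, different formatting
    "A Different Paper Entirely",
]
print(dedup(records))  # the second variant is dropped
```

Fuzzy matching on a normalized title catches the formatting-level duplicates that exact-match de-duplication misses, though the threshold needs tuning by hand.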
Here are some strange findings that really do impact the results and that should be attended to if you want a decent number:
- Book chapters from books with only one author. I do get the idea of a chapter or two from the same book counting as individual items. In this case, though, WoS returned more than 50 chapters (including "Introduction to Section III"), all by the same author and from the same book. I will count that as 1 book, not as 50+ articles.
- Errata: these should be easy to weed out, yet they still came through. I definitely don't think an erratum should be counted as a separate contribution!
- The case of OSA and the 6 conference papers.
Take a look at this screenshot:
There are 6 identical papers showing as having been presented at 6 different conferences, and Scopus exported them as 6 different contributions. When I contacted OSA, they told me this was correct: the paper was presented at one of several conferences that were all co-located. I'm sure the author thinks of this as one paper, but if I had accepted the number uncritically, or hadn't de-duplicated, I wouldn't have noticed.
It's worth pointing out, too, that articles with Greek characters in their titles come through to EndNote* encoded in several different ways, making them appear distinct to the de-duplication algorithm.
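One way to tame the Greek-character problem, sketched below under the assumption that the variants are things like precomposed vs. compatibility Unicode forms and spelled-out letter names, is to build a canonical comparison key before de-duplicating. The letter-name mapping here is a hypothetical starter list, not anything the databases publish:

```python
import unicodedata

# Hypothetical mapping from Greek letters to the spelled-out names some
# exports use instead; extend as needed.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "μ": "mu"}

def comparison_key(title: str) -> str:
    """Canonicalize a title: NFKC composes combining characters and folds
    lookalikes (e.g. the micro sign to Greek mu), then Greek letters are
    replaced with spelled-out names, then everything is casefolded."""
    t = unicodedata.normalize("NFKC", title)
    for letter, name in GREEK.items():
        t = t.replace(letter, name)
    return t.casefold()

# The same title exported two ways now yields one key, not two.
variants = ["β-Carotene uptake", "beta-Carotene uptake"]
print({comparison_key(v) for v in variants})
```

Comparing on this key rather than the raw title makes the Greek-letter variants collapse into one record, at the cost of maintaining the mapping table by hand.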
FWIW, I do know that bibliometric projects typically use only journal articles and a single database to avoid these problems; however, a large portion of our output is in engineering fields, where conference papers play an important role. I also know that I'm probably unfairly undercounting Computer Science by using these databases, but I don't think that unfairly targets any one organization in the benchmarking study.
Another issue of concern is the conference paper that is expanded and reprinted in a journal (I count that) vs. the meeting abstract that appears in a journal (I do not count those).
*Not affiliated, not an endorsement, yadda, yadda