(by Christina Pikas) Oct 27 2017

This event was held Friday October 27, 2017

Kate McCain  - Undercounting the gift givers: issues when tallying acknowledgements in life sciences research

Ongoing research effort - she originally worked on this 20 years ago but has come back to it recently. Background - model organisms - useful to organize research around. Community databases, stock centers, community ethos wrt sharing.

Ways to focus research - by journal is often done, but she uses this model organism. She is looking at 1980-2004 during growth phase when there is more sharing because nascent research area. And she is looking at acknowledgements.

Compared to citations - acknowledged most likely to be alive.

Personal ack vs. funding - she's interested in personal ackn. "peer interactive communication"

May be lots of different places: end note, methods section, end of text with no section label, ... No control or standardization of how people are named, what granularity they are thanked for, etc.

WoS mostly gets funding ack, and only secondarily sweeps up some personal ack (if they are in the same block, which is not always the case).

Undercounting big deal: text extraction relying on formal ack section. personal name disambiguation. Sampling or single year studies.

Check her slides to see what she found where. She also categorized types of ack - animals, software, data, editing, etc.

Top 15 individuals listed - first few time periods dominated by University of Oregon - founders and suppliers of fish early on.

She then went through profiles of some individuals with the diversity of how they appeared.

Trends - fewer examples of thanking for research materials - have their own, get from repository, or get from stock center

questions: manually - yes? learn things to help automate - yes, but lots and lots and lots of ways to trip up. Also just picking up surnames is not enough because then get some citations mixed in, named equations/methods, etc.

Reminds me of:

questions: in the lab outside of the lab. also tracking people who are frequently acknowledged and not often co-authors/cited

questions: comment - collaboration - set up something from PMC data (already coded in XML), but only using ack section and not the Materials & Methods (M&M) section.
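To make the undercounting point from that last comment concrete, here is a minimal sketch, assuming the PMC articles are in JATS-style XML with an `<ack>` element in the back matter. The sample article, the names in it, and the `ack_paragraphs` helper are all made up for illustration; this is not what the speaker or the commenter actually ran.

```python
import xml.etree.ElementTree as ET

# Tiny hypothetical article in JATS-like XML; not real data.
SAMPLE = """<article>
  <body>
    <sec sec-type="methods"><title>Materials and Methods</title>
      <p>Fish were a gift from A. Example.</p>
    </sec>
  </body>
  <back>
    <ack><title>Acknowledgements</title>
      <p>We thank B. Sample for comments on the manuscript.</p>
    </ack>
  </back>
</article>"""


def ack_paragraphs(xml_text):
    """Return the text of every paragraph found inside an <ack> element."""
    root = ET.fromstring(xml_text)
    return [
        "".join(p.itertext()).strip()
        for ack in root.iter("ack")
        for p in ack.iter("p")
    ]


ack_only = ack_paragraphs(SAMPLE)
print(ack_only)
```

Reading only the `<ack>` block picks up the thanks for comments but never sees the gift of fish mentioned in the methods section, which is exactly the kind of acknowledgement the talk says gets missed.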


Isabelle Dorsch - Relative Visibility

How well known. She's comparing personal publication list and information services (like WoS).

Relative visibility (IS) = (d/r)*100
d = number of the author's publications found in the information service, r = number of publications on the personal publication list
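The formula is simple enough to sketch in a few lines. The function name and the example counts are mine, just for illustration:

```python
def relative_visibility(found_in_service, on_publication_list):
    """Relative visibility = (d / r) * 100.

    d: publications found in the information service
    r: publications on the author's personal publication list
    """
    if on_publication_list <= 0:
        raise ValueError("the publication list must contain at least one item")
    return found_in_service / on_publication_list * 100


# Hypothetical author: 30 of 40 listed publications are indexed.
established = relative_visibility(30, 40)
# Hypothetical new author with only two publications, both indexed.
newcomer = relative_visibility(2, 2)
print(established, newcomer)
```

The second case is the one raised in the questions: with a very short publication list the measure jumps to 100% easily, so it says little for new authors.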

Previous work - Cronin & Stock, and ISSI board study

Issues - finding the personal list, is it up to date and complete, is it structured to be used at all, what types of documents to keep (novels? newspaper articles?), keep in press?

(*discussion of this on SIGMETRICS really found that a combined edited list is probably best, but these aren't universally available - list maintained by information service but updated by author)

Which information service matters (of course)  -  visibility to one field when author publishes in multiple. Conference paper coverage, book coverage, etc.

questions: new author - only two publications - 100% (they only looked at established authors). Very dependent on the database

Judit Bar-Ilan - CiteScore vs JIF and Other Journal Indicators

Criticisms of JIF but still heavily used. Standard definition. Criticisms like lack of transparency. Things counted in the numerator that are not included as "citable items" in the denominator. Also now offer a 5-year JIF

CiteScore - publication window 3 years. They count all items so no numerator/denominator coverage mismatch. Transparent - can see all the citations that are covered. Freely available. Some criticism that it covers too many different document types

EigenFactor, SJR, PageRank type indicators - more weight to more influential sources

Article Influence - normalized - average journal is 1.

She looked at correlations - for those journals that appear in most of the sources.

Quite high - CS-JIF 0.94,

HOWEVER - Lancet is 5 in JIF, 314 in CS - so huge differences and she suspects due to notes, editorials, etc.

Top 20 by CS are almost all review journals (Annual Review of... , Progress in... )

Eigenfactor doesn't include journal self-citation, and doesn't correlate as well with others.

Note also that even though high correlation, there are these big differences.

question: comment - real correlation between size of journal and JIF, Eigenfactor is the only one that corrects for this.


Student papers

Zhao, Mao, & Kun Lu (speaking, not student) - An Exploratory Study on Co-word Network Simulation

Network centrality and other network measures for co-word network. Are they correlated. Are there differences in disciplines in these measures. Looking at generative process of a co-word network.

Q: co-word can mean 3 different things: words that appear in the text, co-descriptor - uses carefully assigned things, KeyWords Plus - is another thing separately (not controlled, but titles of articles cited). Are you simulating second hand natural language assigned things?

Antoine Archambault, Philippe Mongeon (speaking), Vincent Larivière  - The concentration of journal use in Canadian universities

As Canadian universities have to cut big packages due to budgetary issues.

Evaluating - download statistics from the 28 universities (~300 Excel files, 5M lines); references (articles written by authors at these universities citing these journals); perceived importance of journals (what journals do you perceive important to your research, your teaching) - 23 of 28 universities, 5,500 participants (of which 3k from their university so actually disappointing response)

Cleaning important journals - title disambiguation, manual validation, classification by major disciplinary area (AH, SS, BM, NSE) - WoS, NSF, Ulrich's, Google, also verified research journal and not newsletter, etc.

47k unique journals.

Priority journals - 80/20 rule - anything in top of 80% downloads, references, mentions (10% of subscriptions account for 80% of any of these measures)
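Here is how I understand the priority-journal cut, as a minimal sketch. The journal names and counts are invented, and the exact tie-breaking and threshold handling the authors used are not in my notes, so treat `priority_journals` as an assumption rather than their method.

```python
def priority_journals(usage, share=0.80):
    """Return journals that, ranked by usage, cumulatively reach `share` of the total."""
    total = sum(usage.values())
    running = 0
    chosen = []
    for journal, count in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break  # the journals already chosen cover the requested share
        chosen.append(journal)
        running += count
    return chosen


# Hypothetical counts for six journals on two of the three measures.
downloads = {"J1": 500, "J2": 300, "J3": 100, "J4": 50, "J5": 30, "J6": 20}
references = {"J1": 10, "J3": 80, "J2": 5, "J6": 5}

# A journal is "priority" if it makes the cut on any measure.
core = set(priority_journals(downloads)) | set(priority_journals(references))
not_in_any_top = set(downloads) - core
print(sorted(core), sorted(not_in_any_top))
```

Taking the union across measures is why the three measures all have to be collected: J3 would be dropped on downloads alone but is kept because of how heavily it is cited.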

66% of the 47k journals are not in the top anywhere.

Broke out by publishers - for Springer, 80% of journals were not in anyone's top. For Sage, only 22% were not in anyone's top.

Only 41.6% overlap of core journals between universities

Correlation of cites, downloads, mentions (cites are super lengthy for institutions to do themselves - can they just use downloads?) - answer is no. Have to use the 3 measures, not completely correlated.

Q: can you do some sort of demand driven acquisition

Q: are there libraries of record - keep even if don't use locally

Q: combining visibility presentation earlier with this.

Christine Meschede - Cross-Metric Compatibility of Altmetrics: Consistency of the Metrics from PlumX and

(posting before notes - battery going right now - will try to update)


Brief notes from Maryland SLA's Storytelling with Data

(by Christina Pikas) Oct 20 2017

This one-day meeting/course/workshop/seminar (?) was held at the University of Maryland (go Terps!) on October 12, 2017. As with all events planned by my local SLA chapter, it was very well organized and run. The speakers were all excellent. Amazingly, the parking was close and pre-paid. The food was great, too.

Keith Marzullo - the dean of the iSchool - gave some welcoming remarks. He was so positive and seemed to really get the point of the day.

The opening keynote was by Ya-Ling Lu from the National Institutes of Health library (not NLM but the campus library). I have mostly heard her speak tag-teaming with Chris Belter on bibliometrics techniques but it was wonderful to have the opportunity to hear a long presentation just by her on visualization. She talked about having a low floor - starting at the beginning - and a high ceiling - keep learning and improving.

She talked about learning design and how choices convey emotion and meaning. Her example was from Picture This: How Pictures Work by Molly Bang


It was amazing to see how simple rectangles and triangles, their color, size, and location really told the story.

She also provided examples of developing information products. The first was to celebrate the life and career of someone retiring. She needed data and visualizations and a story for people, research, and leadership.

A second example was graphing how she spends her day to try to find more time for the things she wants to do.

Finally, she skipped over an example of how she successfully fought a traffic ticket using data and visualizations.

Oh, and she often uses Excel for her visualizations - even when she can make them in R or Matlab.


Jessie Sigman from University of Maryland spoke next about using Cytoscape and Gephi to do graphs showing coverage of agricultural topics across research databases.

Vendor updates were provided by the sponsoring companies: Clarivate, Ebsco, and Cambridge University Press. CUP is doing a neat new thing that's sort of like Morgan & Claypool - it's like a monographic series, but the volumes are 40-70 pages. Peer reviewed and series are edited like journals.

David Durden and Joseph Koivisto of University of Maryland spoke next about the different stories that can be told with repository usage data. So it turns out that DSpace has separate data for the content (say PDF) and the metadata and integrating this mess to get a real, accurate picture of how the system is being used is a bit of a bitch. It's indexed by Solr, but Solr doesn't keep the same index number for the content - it assigns its own. Google Analytics does a lot, but maybe not the right things. RAMP, a project out of Montana State University, helps with Google data but also has shortcomings. Things based on Google do the best they can to filter out bots. HOWEVER, if it's a bot a professor on campus wrote to analyze data, then that's a great use to track. Also Google doesn't capture the full text downloads.


Brynne Norton from NASA Goddard spoke of a cool visualization using interlibrary loan data. Standard statistics are just things like time to get requests filled and % of requests filled. The data are horribly messy, with some citations lacking even an article title. She compiled the article titles using a series of regex searches and searched them through the Web of Science GUI. Yeah, the GUI. Apparently you can OR about 500 articles at a time! (as an aside: yes, there is indeed a WoS API, but you cannot use it for this purpose. You are only allowed to search for yourself. I know.) Then she loaded the results into VOSviewer and did a topic map. It was really cool and she narrated how it showed certain areas they might consider collecting in.
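The batching step could look something like this. It is a minimal sketch and not her actual process: the `or_queries` helper, the made-up titles, and the `TI=(...)` title field tag are my assumptions, and the limit of 500 is just the "about 500" mentioned in the talk.

```python
def or_queries(titles, batch_size=500):
    """Group titles into OR-ed title queries of at most `batch_size` titles each."""
    queries = []
    for start in range(0, len(titles), batch_size):
        batch = titles[start:start + batch_size]
        # Strip embedded double quotes so each title stays one quoted phrase.
        quoted = " OR ".join('"{}"'.format(t.replace('"', "")) for t in batch)
        queries.append("TI=({})".format(quoted))
    return queries


# 1,200 hypothetical article titles pulled out of ILL requests.
titles = ["Title number {}".format(i) for i in range(1, 1201)]
queries = or_queries(titles)
print(len(queries))
```

Each string in `queries` would then be pasted into the search box one at a time and the result sets combined before exporting.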


Sally Gore did the closing keynote and boy is she awesome. I highly recommend librarians sign up for her webinar when SLA schedules it. She was also super encouraging. She spoke of how she figured out how to do these amazing infographics on her own - she even uses PowerPoint and sometimes draws her own icons. She recommended books by Stephanie Evergreen to learn design. I have more notes, but they're at work and I'm trying to get this published - so I'll add to this if I find anything else I wanted to note.

The closing remarks were actually terrible. The guy who gave them had not actually attended any of the day or really read the descriptions of the speakers. His comments were like on research data management which is irrelevant to the day's topic. Boo.

But then we drank wine and had some more food so it was ok 🙂


Metaknowledge (python) and Bibliometrix (R) - more or less comprehensive bibliometrics packages for standard data science environments

(by Christina Pikas) Aug 28 2017

I thought for sure I had mentioned Metaknowledge here before but I can't find it so I must have mis-remembered. ...

There are tons of tools for bibliometrics and a lot of people really just code their own for simplicity's sake even if they eventually visualize their results using an off the shelf network analysis tool or other. Sci2, VOSviewer, and CiteSpace are all close to comprehensive, freely available, and pretty easy to use tools. What need is there for another product? If you want to keep the analysis in the same environment as the rest of your workflow, or experiment with new algorithms that are not available in the above, then these two packages are good options.

When I was doing the longitudinal clustering for citation trajectories, I inadvertently saved the 6,666 (I know!) records from 1980-2015 in the regular WoS* format instead of a more useful tab delimited.  I quite easily pulled out the pub year, accession number, times cited, and other simple fields using R. ... it's just now when I actually want to follow up with some natural language processing on the titles and abstracts that I realize my kludge won't actually work for either the title or abstract. So I fooled with it a couple of different ways before heading out to see if there was anything new out there for processing these since they were such a hassle to get in the first place. It turns out there is a new, fairly comprehensive R package: Bibliometrix. I had already experimented with Metaknowledge in Python. The extensive instructions (paywall) are very helpful, but I really just wanted to stay in R.
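For anyone wondering why the title and abstract are the hard part: in the WoS plain text export, as I understand it, each field starts with a two-letter tag, long values wrap onto continuation lines that start with spaces, and `ER` closes a record. Here is a minimal sketch of handling that by hand. The sample record is entirely made up, the tag layout is my assumption about the format, and both packages do this for you far more robustly.

```python
# Hypothetical one-record export; not real data.
SAMPLE = """FN Hypothetical Export
VR 1.0
PT J
AU Example, A
   Sample, B
TI A made-up title that wraps
   across two lines
PY 2015
UT WOS:000000000000001
ER

EF
"""


def parse_wos(text):
    """Parse tagged plain text into a list of {tag: [values]} records."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        code = line[:2]
        if code == "ER":  # end of record
            records.append(current)
            current, tag = {}, None
        elif code in ("FN", "VR", "EF"):  # file header and end-of-file markers
            continue
        elif code == "  ":  # continuation of the previous tag
            if tag is not None:
                current[tag].append(line.strip())
        else:  # a new tag starts
            tag = code
            current.setdefault(tag, []).append(line[3:].strip())
    return records


records = parse_wos(SAMPLE)
title = " ".join(records[0]["TI"])  # wrapped text is re-joined with spaces
authors = records[0]["AU"]  # one author per line, so keep the list
print(title, authors)
```

The catch is that continuation lines mean different things for different tags: for `TI` and `AB` they are wrapped text to be joined, while for `AU` each line is a separate author.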

What follows is a general intro to these two tools and my observations.


Bibliometrix

This package appears to be quite new with recent releases. The first thing I tried - reading in a directory full of WoS export files - was like magic. In a snap, I had a dataframe with everything in the right column.


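library(plyr)  # provides ldply

# list every WoS export file in the directory
filenames <- list.files("directory", full.names=TRUE)

# getWoSdf is a helper (not shown here) that reads one export file into a data frame
WoSall <- ldply(filenames, getWoSdf)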

Seems like BibTeX files are preferred over this format, but it was plenty quick for the 500 records per file I had. A nice feature is that it tells you every hundred records that it's making progress.

A nice thing is that there are pre-built basic summary/descriptive functions. It exports the standard networks but it also does co-word with a pretty neat visualization.

Multiple Correspondence Analysis (MCA) using keywords

This tool doesn't do anything to help you clean the data or pick out any weirdnesses. The visualizations shown aren't super pretty, but it's quite easy to use another R graphing tool with the data.


Metaknowledge

I worked through the journal article but using my own WoS data. For WoS data, everything worked as expected and I was able to quickly get really nice results. You can also download a Jupyter notebook with their sample data to work through the process. A neat thing you don't see every day is that it will break down by male/female by guessing using a popular algorithm. It also does Reference Publication Year Spectroscopy (meh) and besides extracting all of the standard networks you might want, it also has ways to extract text for text mining.

Some negatives based on my brief experiments. I couldn't get Scopus data to work for whatever reason. Also, it doesn't really facilitate any sort of cleaning. Data sets that come out of WoS still have issues. The graph examples were not very pretty and following their instructions for one of the graphing tools produced some sort of error. It's quite easy to export the data or just substitute your favorite graphing tool because there are a million.


* no endorsement intended, for real.



Data point on if we need expensive specialty research databases

(by Christina Pikas) Aug 17 2017

***NOTE: Not endorsing any commercial products***

A search for "quantum computing" in the Abstract, Title, Keywords in Scopus yields 6,415 records

A search for DE (quantum computing) in Inspec yields 20,403.   By DE I mean descriptor - "quantum computing" has been in Inspec's thesaurus since 1999. Use the correct formatting for your platform.

One really quick insight: the US is the top publishing country in Scopus and China is the top in Inspec.


Notes from 1.5 days of: Collections as Data: Hack-to-Learn

(by Christina Pikas) Aug 10 2017

You guys - this post has been in draft since May 22, 2017! I'm just posting it...

Collections as Data: Hack-to-Learn was a fabulous workshop put on by the Library of Congress, George Washington University Libraries, and George Mason University Libraries. It was a neat gathering of interesting and talented people, nifty data, and very cool tools.  It didn't hurt either that it was in a beautiful conference room with a view of the Capitol the first day and at the renovated Winston Churchill Center at GWU the second. A lot of it was geared toward metadata librarians and digital humanities librarians, but I felt welcomed. Readers of this blog will know that I really want to bring these tools to more public services/liaison/etc., librarians so it was good.

Unfortunately, I had to leave mid-day on day 2 because of a family emergency 🙁 (everybody is ok) but here are some interesting tidbits to save and share.

Data Sets:

LoC MARC Records

Have you heard that LoC freed a ton of their cataloging data? FREE. Should have always been freely available. Actually this is only up to December 2013 and the remainder are still under paid subscription ... but ... still! People are already doing cool things with it (neat example). We had a part of this that the organizers had kindly already done some editing on.

Phyllis Diller Gag File

This was a sort of poorly formatted csv of several drawers of the file. Hard not to just sit and chuckle instead of analyzing

Eleanor Roosevelt's My Day Columns

Apparently Roosevelt wrote these from the 1930s to her death in 1962. Originally she wrote them 5 days a week but tapered to 3 when her health failed. They are a few paragraphs and more or less dryly list her activities.

End of Term Tumblr Archive (no link)

This was archived as part of the efforts to capture the outgoing administration's stuff before it disappeared. It was a very interesting collection of things from museums to astronauts.


Somewhere in here we covered TEI - I had no idea this existed. How cool. So like when you're doing transcripts of interviews you can, for example, keep the erm, uh, coughs... or ignore depending on the level of analysis?  TEI lets you annotate texts with all sorts of detail and make it linked data for entities, etc.


  • OpenRefine - more detailed use and examples of reconciliation
  • Voyant - very, very cool tool to at least do preliminary analysis of text. NB: installing on my work Windows machine was a bit rough. I ended up getting a Linux VM and it works well/easily. The visualizations are great. There is a limit on the number of texts you can import at a time.
  • MALLET - did you think this one was too hard and required java or some such? Turns out there's a command line one anyone can use. We did topic models for some of the sets. I think I will probably stay with the way I've been doing them in R because seems like they're easier to understand.
  • Gephi - yeah, again, and I still can't get along with it. I have to face that it's just me.
  • Carto - a cool mapping tool

Also, on day 2 someone suggested spaCy instead of NLTK for natural language processing in Python. This is another thing I couldn't get working for anything on my Windows box from work. I don't know if there is something being blocked or what. Installs and works beautifully on the Linux machine, though.




Peer Review Review Paper for Review by Peers

(by Christina Pikas) Aug 10 2017

Jon Tennant led a crowd of us to write a new article that is out on F1000 Research:

Tennant JP, Dugan JM, Graziotin D et al. A multi-disciplinary perspective on emergent and future innovations in peer review [version 1; referees: 1 approved with reservations]. F1000Research 2017, 6:1151
(doi: 10.12688/f1000research.12037.1)

Peer review of research articles is a core part of our scholarly communication system. In spite of its importance, the status and purpose of peer review is often contested. What is its role in our modern digital research and communications infrastructure? Does it perform to the high standards with which it is generally regarded? Studies of peer review have shown that it is prone to bias and abuse in numerous dimensions, frequently unreliable, and can fail to detect even fraudulent research. With the advent of Web technologies, we are now witnessing a phase of innovation and experimentation in our approaches to peer review. These developments prompted us to examine emerging models of peer review from a range of disciplines and venues, and to ask how they might address some of the issues with our current systems of peer review. We examine the functionality of a range of social Web platforms, and compare these with the traits underlying a viable peer review system: quality control, quantified performance metrics as engagement incentives, and certification and reputation. Ideally, any new systems will demonstrate that they out-perform current models while avoiding as many of the biases of existing systems as possible. We conclude that there is considerable scope for new peer review initiatives to be developed, each with their own potential issues and advantages. We also propose a novel hybrid platform model that, at least partially, resolves many of the technical and social issues associated with peer review, and can potentially disrupt the entire scholarly communication system. Success for any such development relies on reaching a critical threshold of research community engagement with both the process and the platform, and therefore cannot be achieved without a significant change of incentives in research environments.


We welcome comments, reviews, etc.


The degree, being a professional, some options... stop whining

(by Christina Pikas) May 07 2017

Notable library school lecturer and bird, Gavia Libraria, has had it with this whole whiny business of fussing that the library school people attended 5, 10, 15, 20, 30... years ago didn't offer x or y or z.

If you know anything about our field, or any other profession, you know that the degree is very much a beginning. It gets you the foundation, the jargon, and a method for acquiring more. It is not, nor could it ever be the end. That's just crazy.

Now, there are no doubt library school classes that are better than others. Also ones that are more relevant than others for the thing you need to do today. I am one of the few who actually believes in "core" classes because there are some standard things you should be able to at least look up how to do if you all of a sudden become a solo librarian. Like a reference interview. Searching a library catalog, a research database, and the web. Cataloging a book. Unjamming a stapler, a copier, and a toilet 🙂

The other piece of her argument is that libraries often think grass is greener and want to hire someone in with the needed skills instead of developing staff to meet new needs. This is pretty crap if you're the existing staff who is running as fast as she can to move into a new position.

So you're now out of school, what next? Are you in your professional associations? Are you engaging online? Are you attending training given by the vendors - free online! Are you reading? Are you thinking critically?

I, for one, went to an awesome library school that gave me an excellent foundation. I also work very hard to learn all the time.

Here's an example: I am forever trying to learn Python and R and bibliometrics. I can do some things but I'm very slow. My boss didn't tell me to learn. I'm not mad at my library school for not teaching me. I don't even know if either one really existed when I was undergrad (we used Mathematica and Pascal but I'm old).

So here I diverge greatly from some of my librarian colleagues who really hate code academies and MOOCs. I think they are awesome for people like me. I've done a couple Google ones. Some R ones and some Python ones. I abandoned two on machine learning.  The R ones from JHSPH were fine but the one that was really good was the intro to programming with Python from MIT on EdX. It was by far the best platform and the best instruction.  I'm messing with a Michigan one on data science now and it is not as good at all.

Anyway, pull up your big girl pants and get to work learning. While you're at it, work with your leadership on where you see things going and how to best meet the need. If you're in some level of management, work with your librarians to make sure they are moving ahead and advocate for them to upper management.

Remember, I'm pulling for you. We're all in this together. Keep your stick on the ice. Good luck with the staplers.


I will be going to the March for Science (DC)

(by Christina Pikas) Apr 21 2017

I, like many others, have been more than a bit put out by the DC organizers. When I first heard of the march I was all for it. Then I heard an organizer on Science Friday. OMG. Really? "if they just knew science was cool they would fund it!" Um. No.

And the absolute mess they made of trying to be inclusive and support diversity - although they have come a long way.

Anyway, I will be going and I will be bringing my children. So, fwiw, they will have the opportunity. Wish us luck!


Animating a graph

(by Christina Pikas) Apr 21 2017

What I really want is an interactive network graph that my viewer can click on, slide things, flip switches, etc. I'll get there someday. In the meantime, this was a fairly easy way to show the evolution of a network over time.

The network is a company network generated in Quid*. I used the time filter and exported a png every 3-5 years for this network. The point being that there has been a lot of growth and in particular clusters of companies.

Here it is:

Probably need to click to enlarge to see the animation

ImageMagick is an awesome open source image tool. There have been other ways to get to it from R or python, but the magick package is the easiest I've seen. The vignettes and a blog post from rOpenSci were helpful.


library(magick)
# pattern is a regular expression, so match the .png extension explicitly
newcomp_files <- list.files("folder", pattern="\\.png$", full.names=TRUE)
newcomp <- lapply(newcomp_files, image_read)
# need image_join because lapply produces a list instead of a magick vector
# fps = 0.5 shows each frame for 2 seconds; dispose controls how the previous frame is cleared
newcomp_animate <- image_animate(image_join(newcomp), fps = 0.5, dispose = "previous")
image_write(newcomp_animate, 'newcomp_animate_82-.gif')

Note: the final version you see also used magick to crop (not great) and to annotate with the year. I slowed the frame rate down quite a bit.

I tried morphing and it was pretty slow and really not worth it so much.

*not affiliated and not endorsing


Who are my researchers citing? A quick and dirty way to find out

(by Christina Pikas) Apr 21 2017

This probably shouldn't warrant a post, but a few very experienced and up-to-date librarians didn't immediately know so probably worth sharing. Also, it turns out that Scopus* has hidden or removed a lot of the useful stuff and made it confusing to even know where to enter the search.**

In Scopus, search for your researchers. In my case, an affiliation search works.***


The affiliation search is a tab from the home page. There's no way to know it's a tab. It's just an underlined word. You actually then click around on the page until you find some place to type your query.

From the results list - and mind you I do this exact search regularly so yours might not be so precise - go ahead and click on all the documents. If you're at NIH then crap, because you have more than 10,000 journal articles per year so you have to do some major slicing and dicing. I just limited to 2016 and journal articles just because.

Then you look for the "..." but then you realize it's grayed out and you can't actually click on it.

So then you click to highlight all, and then you click on "..." and you see view references.

From here, you can list the top sources and, theoretically, analyze them. They're not completely clean though. My set had JGR as well as the spelled out and the specific ones. Likewise with ApJ. So how quick and how dirty is ok? For collections development, you're probably cool with reading off. Otherwise you could export and then use OpenRefine or similar to clean.
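If you do export, the quick-and-dirty cleanup can be as small as an alias table plus a count. This is a minimal sketch: the alias table, the `normalize` helper, and the list of cited sources are all made up for illustration, and it does nothing about the section-specific titles.

```python
from collections import Counter

# Hypothetical alias table: abbreviation (lowercase, no periods) -> preferred title.
ALIASES = {
    "jgr": "Journal of Geophysical Research",
    "apj": "Astrophysical Journal",
}


def normalize(source):
    """Map known abbreviations to one preferred title; otherwise just tidy whitespace."""
    key = " ".join(source.lower().replace(".", "").split())
    return ALIASES.get(key, " ".join(source.split()))


def top_sources(sources, n=10):
    """Count cited sources after normalization and return the n most common."""
    return Counter(normalize(s) for s in sources).most_common(n)


# Made-up source titles as they might appear in an exported reference list.
cited = ["JGR", "Journal of Geophysical Research", "ApJ",
         "Astrophysical Journal", "J.G.R.", "Icarus"]
top = top_sources(cited, n=2)
print(top)
```

For anything beyond a handful of variants, clustering in OpenRefine is less work than maintaining the alias table by hand.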

* Not affiliated, not endorsing!

** plus - this thing in advanced search in which it is forever putting in codes I do not want ... anyway...

***hey, all the branding for my larger institution is gone? aw come on.

