Metaknowledge (python) and Bibliometrix (R) - more or less comprehensive bibliometrics packages for standard data science environments

(by Christina Pikas) Aug 28 2017

I thought for sure I had mentioned Metaknowledge here before, but I can't find it, so I must have misremembered. ...

There are tons of tools for bibliometrics, and a lot of people just code their own for simplicity's sake, even if they eventually visualize their results using an off-the-shelf network analysis tool. Sci2, VOSviewer, and CiteSpace are all close to comprehensive, freely available, and pretty easy to use. What need is there for another product? If you want to stay in the rest of your data science workflow, or experiment with new algorithms that are not available in the above, then these two packages are good options.

When I was doing the longitudinal clustering for citation trajectories, I inadvertently saved the 6,666 (I know!) records from 1980-2015 in the regular WoS* format instead of a more useful tab-delimited one. I quite easily pulled out the pub year, accession number, times cited, and other simple fields using R... It's only now, when I actually want to follow up with some natural language processing on the titles and abstracts, that I realize my kludge won't work for either of those fields. So I fooled with it a couple of different ways before heading out to see if there was anything new for processing these, since they were such a hassle to get in the first place. It turns out there is a new, fairly comprehensive R package: Bibliometrix. I had already experimented with Metaknowledge in Python. Its extensive instructions (paywall) are very helpful, but I really just wanted to stay in R.
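Since the WoS plain-text format is the sticking point here, a minimal sketch of what parsing it involves may help. The two-letter field tags (TI, AB, PY, TC, UT) and the ER record terminator are from the real format; the parser itself and the sample record are illustrative only, not how either of the packages below does it:

```python
# Minimal parser for the WoS tagged plain-text ("savedrecs") format.
# Each field starts with a two-letter tag (TI, AB, PY, TC, UT, ...);
# continuation lines are indented; "ER" ends a record.
def parse_wos(text):
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if line.startswith("ER"):                 # end of record
            if current:
                records.append(current)
            current, tag = {}, None
        elif line[:1] == " " and tag:             # indented continuation of previous field
            current[tag] += " " + line.strip()
        elif len(line) >= 3 and line[2] == " " and line[:2].isalnum() and line[:2].isupper():
            tag = line[:2]                        # new field
            current[tag] = line[3:].strip()
    return records

sample = """PT J
TI A title that wraps
   onto a second line
AB An abstract sentence.
PY 2015
TC 12
UT WOS:000123456700001
ER"""

recs = parse_wos(sample)
```

With 6,666 records you'd read each export file and concatenate the resulting lists, then the titles and abstracts are just ordinary strings ready for text processing.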

What follows is a general intro to these two tools and my observations.


Bibliometrix

This package appears to be quite new, with recent releases. The first thing I tried - reading a directory full of WoS export files - worked like magic. In a snap, I had a dataframe with everything in the right columns.


library(plyr)  # for ldply

# getWoSdf reads one WoS plain-text export file into a data frame;
# ldply row-binds the results from each file
filenames <- list.files("directory", full.names = TRUE)
WoSall <- ldply(filenames, getWoSdf)

Seems like BibTeX files are preferred over this format, but it was plenty quick for the 500 records per file I had, and it helpfully reports progress every hundred records.

Another nice thing is the pre-built summary/descriptive functions. It exports the standard networks, and it also does co-word analysis with a pretty neat visualization.

Multiple Correspondence Analysis (MCA) using keywords

This tool doesn't do anything to help you clean the data or flag any weirdness. The visualizations shown aren't super pretty, but it's quite easy to hand the data to another R graphing tool.


Metaknowledge

I worked through the journal article, but using my own WoS data. Everything worked as expected, and I was able to quickly get really nice results. You can also download a Jupyter notebook with their sample data to work through the process. A neat thing you don't see every day: it will break authors down by male/female, guessing from names using a popular algorithm. It also does Reference Publication Year Spectroscopy (meh), and besides extracting all of the standard networks you might want, it has ways to extract text for text mining.

Some negatives, based on my brief experiments: I couldn't get Scopus data to work, for whatever reason. It also doesn't really facilitate any sort of cleaning - data sets that come out of WoS still have issues. The graph examples were not very pretty, and one of the graphing tools they walk you through produced some sort of error. Still, it's quite easy to export the data and substitute your favorite of the million graphing options.


* no endorsement intended, for real.



Data point on if we need expensive specialty research databases

(by Christina Pikas) Aug 17 2017

***NOTE: Not endorsing any commercial products***

A search for "quantum computing" in the Abstract, Title, Keywords in Scopus yields 6,415 records

A search for DE (quantum computing) in Inspec yields 20,403. By DE I mean descriptor - "quantum computing" has been in Inspec's thesaurus since 1999. (Use the correct field syntax for your platform.)
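For concreteness, the two searches look roughly like this. The Scopus line uses real advanced-search syntax; the Inspec line assumes the Engineering Village flavor of the platform, so treat the field code as an example rather than gospel:

```
TITLE-ABS-KEY("quantum computing")    Scopus: abstract, title, keywords

("quantum computing" WN CV)           Inspec on Engineering Village:
                                      controlled vocabulary (descriptor) field
```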

One really quick insight: US is top publisher in Scopus and China is in Inspec.


Notes from 1.5 days of: Collections as Data: Hack-to-Learn

(by Christina Pikas) Aug 10 2017

You guys - this post has been in draft since May 22, 2017! I'm just posting it...

Collections as Data: Hack-to-Learn was a fabulous workshop put on by the Library of Congress, George Washington University Libraries, and George Mason University Libraries. It was a neat gathering of interesting and talented people, nifty data, and very cool tools. It didn't hurt either that it was in a beautiful conference room with a view of the Capitol the first day, and at the renovated Winston Churchill Center at GWU the second. A lot of it was geared toward metadata librarians and digital humanities librarians, but I felt welcomed. Readers of this blog will know that I really want to bring these tools to more public services/liaison/etc. librarians, so that was good.

Unfortunately, I had to leave mid-day on day 2 because of a family emergency 🙁 (everybody is ok) but here are some interesting tidbits to save and share.

Data Sets:

LoC MARC Records

Have you heard that LoC freed a ton of their cataloging data? FREE. It should have always been freely available. Actually, this only goes up to December 2013 and the remainder is still under paid subscription... but... still! People are already doing cool things with it (neat example). We had a portion of this that the organizers had kindly already done some editing on.

Phyllis Diller Gag File

This was a somewhat poorly formatted csv of several drawers of the file. Hard not to just sit and chuckle instead of analyzing.

Eleanor Roosevelt's My Day Columns

Apparently Roosevelt wrote these from the 1930s until her death in 1962. Originally she wrote them 5 days a week, but tapered to 3 when her health failed. They are a few paragraphs each and more or less dryly list her activities.

End of Term Tumblr Archive (no link)

This was archived as part of the efforts to capture the outgoing administration's stuff before it disappeared. It was a very interesting collection of things from museums to astronauts.


Somewhere in here we covered TEI - I had no idea this existed. How cool. When you're doing transcripts of interviews you can, for example, keep the erm, uh, coughs... or ignore them, depending on the level of analysis. TEI lets you annotate texts with all sorts of detail and mark up entities for linked data, etc.
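Here's a tiny, made-up fragment of what that looks like for a spoken transcript. The element names (<u>, <pause>, <vocal>) are from TEI's real module for transcribed speech; the content and attribute values are invented:

```xml
<u who="#interviewee">
  <!-- an utterance, keeping the hesitation, pause, and cough -->
  er, <pause/> I started at the library in
  <date when="1999">ninety-nine</date>
  <vocal><desc>coughs</desc></vocal>
  and stayed twelve years.
</u>
```

An analysis that doesn't care about the disfluencies can simply ignore those elements rather than re-transcribing.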


  • OpenRefine - more detailed use and examples of reconciliation
  • Voyant - very, very cool tool for at least preliminary analysis of text. NB: installing on my work Windows machine was a bit rough; I ended up getting a Linux VM and it works well/easily. The visualizations are great. There is a limit on the number of texts you can import at a time.
  • MALLET - did you think this one was too hard and required Java or some such? Turns out there's a command-line version anyone can use. We did topic models for some of the sets. I'll probably stay with the way I've been doing them in R, because those seem easier to understand.
  • Gephi - yeah, again, and I still can't get along with it. I have to face that it's just me.
  • Carto - a cool mapping tool

Also, on day 2 someone suggested spaCy instead of NLTK for natural language processing in Python. This is another thing I couldn't get working for anything on my Windows box from work - I don't know if something is being blocked or what. It installs and works beautifully on the Linux machine, though.




Peer Review Review Paper for Review by Peers

(by Christina Pikas) Aug 10 2017

Jon Tennant led a crowd of us to write a new article that is out on F1000 Research:

Tennant JP, Dugan JM, Graziotin D et al. A multi-disciplinary perspective on emergent and future innovations in peer review [version 1; referees: 1 approved with reservations]. F1000Research 2017, 6:1151 (doi: 10.12688/f1000research.12037.1)

Peer review of research articles is a core part of our scholarly communication system. In spite of its importance, the status and purpose of peer review is often contested. What is its role in our modern digital research and communications infrastructure? Does it perform to the high standards with which it is generally regarded? Studies of peer review have shown that it is prone to bias and abuse in numerous dimensions, frequently unreliable, and can fail to detect even fraudulent research. With the advent of Web technologies, we are now witnessing a phase of innovation and experimentation in our approaches to peer review. These developments prompted us to examine emerging models of peer review from a range of disciplines and venues, and to ask how they might address some of the issues with our current systems of peer review. We examine the functionality of a range of social Web platforms, and compare these with the traits underlying a viable peer review system: quality control, quantified performance metrics as engagement incentives, and certification and reputation. Ideally, any new systems will demonstrate that they out-perform current models while avoiding as many of the biases of existing systems as possible. We conclude that there is considerable scope for new peer review initiatives to be developed, each with their own potential issues and advantages. We also propose a novel hybrid platform model that, at least partially, resolves many of the technical and social issues associated with peer review, and can potentially disrupt the entire scholarly communication system. Success for any such development relies on reaching a critical threshold of research community engagement with both the process and the platform, and therefore cannot be achieved without a significant change of incentives in research environments.


We welcome comments, reviews, etc.


The degree, being a professional, some options... stop whining

(by Christina Pikas) May 07 2017

Notable library school lecturer and bird Gavia Libraria has had it with this whole whiny business of fussing that the library school people attended 5, 10, 15, 20, 30... years ago didn't offer x or y or z.

If you know anything about our field, or any other profession, you know that the degree is very much a beginning. It gets you the foundation, the jargon, and a method for acquiring more. It is not, nor could it ever be, the end. That's just crazy.

Now, there are no doubt library school classes that are better than others, and ones that are more relevant than others for the thing you need to do today. I am one of the few who actually believes in "core" classes, because there are some standard things you should at least be able to look up how to do if you all of a sudden become a solo librarian. Like a reference interview. Searching a library catalog, a research database, and the web. Cataloging a book. Unjamming a stapler, a copier, and a toilet 🙂

The other piece of her argument is that libraries often think the grass is greener and want to hire someone in with the needed skills instead of developing staff to meet new needs. This is pretty crap if you're the existing staff member who is running as fast as she can to move into a new position.

So you're now out of school, what next? Are you in your professional associations? Are you engaging online? Are you attending training given by the vendors - free online! Are you reading? Are you thinking critically?

I, for one, went to an awesome library school that gave me an excellent foundation. I also work very hard to learn all the time.

Here's an example: I am forever trying to learn Python and R and bibliometrics. I can do some things, but I'm very slow. My boss didn't tell me to learn. I'm not mad at my library school for not teaching me. I don't even know if either one really existed when I was an undergrad (we used Mathematica and Pascal, but I'm old).

So here I diverge greatly from some of my librarian colleagues who really hate code academies and MOOCs. I think they are awesome for people like me. I've done a couple of Google ones, some R ones, and some Python ones. I abandoned two on machine learning. The R ones from JHSPH were fine, but the one that was really good was the intro to programming with Python from MIT on edX. It was by far the best platform and the best instruction. I'm messing with a Michigan one on data science now and it is not nearly as good.

Anyway, pull up your big girl pants and get to work learning. While you're at it, work with your leadership on where you see things going and how to best meet the need. If you're in some level of management, work with your librarians to make sure they are moving ahead and advocate for them to upper management.

Remember, I'm pulling for you. We're all in this together. Keep your stick on the ice. Good luck with the staplers.


I will be going to the March for Science (DC)

(by Christina Pikas) Apr 21 2017

I, like many others, have been more than a bit put out by the DC organizers. When I first heard of the march I was all for it. Then I heard an organizer on Science Friday. OMG. Really? "if they just knew science was cool they would fund it!" Um. No.

And the absolute mess they made of trying to be inclusive and support diversity - although they have come a long way.

Anyway, I will be going and I will be bringing my children. So, fwiw, they will have the opportunity. Wish us luck!


Animating a graph

(by Christina Pikas) Apr 21 2017

What I really want is an interactive network graph that my viewer can click on, slide things, flip switches, etc. I'll get there someday. In the meantime, this was a fairly easy way to show the evolution of a network over time.

The network is a company network generated in Quid*. I used the time filter and exported a png every 3-5 years for this network. The point being that there has been a lot of growth and, in particular, clusters of companies forming.

Here it is:

Probably need to click to enlarge to see the animation

ImageMagick is an awesome open source image tool. There have been other ways to get to it from R or python, but the magick package is the easiest I've seen. The vignettes and a blog post from rOpenSci were helpful.


library(magick)

# read in the exported png frames
newcomp_files <- list.files("folder", pattern = "*.png", full.names = TRUE)
newcomp <- lapply(newcomp_files, image_read)

# image_join is needed because lapply produces a list instead of a magick vector;
# dispose = "previous" deletes the previous frame before drawing the next
newcomp_animate <- image_animate(image_join(newcomp), fps = 0.5, dispose = "previous")
image_write(newcomp_animate, "newcomp_animate_82-.gif")

Note: the final version you see also used magick to crop (not great) and to annotate with the year. I slowed the frame rate down quite a bit.

I tried morphing, but it was pretty slow and really not worth it.

*not affiliated and not endorsing


Who are my researchers citing? A quick and dirty way to find out

(by Christina Pikas) Apr 21 2017

This probably shouldn't warrant a post, but a few very experienced and up-to-date librarians didn't immediately know it, so it's probably worth sharing. Also, it turns out that Scopus* has hidden or removed a lot of the useful stuff and made it confusing even to know where to enter the search.**

In Scopus, search for your researchers. In my case, an affiliation search works.***


The affiliation search is a tab from the home page. There's no way to know it's a tab - it's just an underlined word. You then click around on the page until you find some place to type your query.

From the results list - and mind you, I do this exact search regularly, so yours might not be so precise - go ahead and click on all the documents. If you're at NIH, then crap, because you have more than 10,000 journal articles per year, so you have to do some major slicing and dicing. I just limited to 2016 and journal articles.

Then you look for the "..." but realize it's grayed out and you can't actually click on it.

So you click to highlight all, then click on "..." and you see View references.

From here, you can list the top sources and, theoretically, analyze them. They're not completely clean, though: my set had JGR as the abbreviation, spelled out, and as the specific journal sections, and likewise with ApJ. So how quick and how dirty is ok? For collections development, you're probably fine just reading the list off. Otherwise, you could export and then use OpenRefine or similar to clean.
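If reading off isn't clean enough, even a hand-rolled alias table in Python gets you partway before reaching for OpenRefine. This is just a sketch - the alias table and sample data are invented for illustration (JGR and ApJ being the kinds of variants I saw):

```python
from collections import Counter

# Hypothetical alias table mapping variants you spot to one canonical name
ALIASES = {
    "JGR": "Journal of Geophysical Research",
    "J GEOPHYS RES": "Journal of Geophysical Research",
    "ApJ": "Astrophysical Journal",
    "ASTROPHYS J": "Astrophysical Journal",
}

def normalize(source):
    """Map a cited-source string to its canonical journal name."""
    key = source.strip()
    # try an exact match first, then a case-insensitive one, else keep as-is
    return ALIASES.get(key, ALIASES.get(key.upper(), key))

# e.g. a column of cited sources exported from the database
cited = ["JGR", "J Geophys Res", "ApJ", "Icarus", "Astrophys J"]
counts = Counter(normalize(s) for s in cited)
```

For a real export you'd build the alias table iteratively, adding entries as the top of the count list reveals new variants.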

* Not affiliated, not endorsing!

** plus - this thing in advanced search in which it is forever putting in codes I do not want ... anyway...

***hey, all the branding for my larger institution is gone? aw come on.


What *are* guides for?

(by Christina Pikas) Apr 17 2017

Libraries of all flavors have these things that gather together resources on a topic for their users/patrons/customers. In a huge portion of academic libraries, these are database-driven lists of online resources managed by a web content management system separate from the rest of their web pages. Other libraries do these on blog software or their regular web content management system.

These things have been around forever, from mimeographed or even typeset bibliographies available as handouts to these web things. The old thinking was that the only people who learned from guides were their creators. But that's not fair - they can be a real lifesaver if you find the perfect one that addresses your need when you actually need it.

Obviously, that's also the major problem: what to include, what to say about the things that are included, and how to get the guide where it's needed, when it's needed. Of course there have been a million studies and there are best practices, but you still see a lot of variety in the wild.

Some guides are for classes, in response to a particular assignment, so they're targeted. A lot of guides are for entire fields of study: physics, geosciences (not even breaking atmospheric sciences out from geology), etc. Certain fields have standard tasks students must deal with - in business, for example, industry and company research. There are basic steps to be taken, and for each step there is a preferred resource.

How about those general guides? What to include, in what order, with what verbiage? There's no point in writing too much if it causes the guide not to be read. Yet you want to make clear exactly which situations a resource will help in and how it can be used most efficiently.

Or maybe not for that last part - tips and tricks can go on a blog and in classes or demonstrations.  If you get the chance... and if you can get that training opportunity to the people when and where they need it.

On LSW there was some discussion about this recently, and a member brought up a screenshot of a famously poor guide that, if printed, would have been 36 feet long at 75 dpi. Another guide had 38 tabs (basically a menu across the top in this software everyone uses), and many of the tabs had dropdown menus.

At MPOW we have to have ours on SharePoint, and we are not actually allowed to edit or design them... but I really, really think these database-driven ones are often not the best information design to get the point across. I mean, there's no way to keep up with the URLs if you hand-code something, but at the same time, it's really awkward to make various pieces of content stand out. It's often difficult to embed training information, tips, and links in these things. In addition, resources are often listed in alphabetical order, which may not really make sense depending on what they are.

For my news guide, I went off the rails and divided it differently: by time, by location, by topic. But I don't actually know that ours are any use, either. Theoretically our pages were tested, and users were able to "find an e-book" (they weren't asked to find a particular e-book, mind you).

My professor for my business reference class made a point of saying that guides (and training) should be problem-oriented... So maybe we should leave the lists of resources to the A-Z lists (with topic tags?) and rework guides to be problem-based? We do try to make our training problem-based, not just a laundry list... but alas...



Nebulous Connections Presentation: Using bibliometrics to keep up with the Joneses

(by Christina Pikas) Apr 04 2017

I presented this today, April 4, 2017, at the SLA Maryland event held at NASA Goddard's Library.

