Metaknowledge (python) and Bibliometrix (R) - more or less comprehensive bibliometrics packages for standard data science environments

Aug 28 2017. Published under bibliometrics, Uncategorized

I thought for sure I had mentioned Metaknowledge here before, but I can't find it, so I must have misremembered. ...

There are tons of tools for bibliometrics, and a lot of people just code their own for simplicity's sake, even if they eventually visualize their results in an off-the-shelf network analysis tool. Sci2, VOSviewer, and CiteSpace are all close to comprehensive, freely available, and pretty easy to use. What need is there for another product? If you want to stay within the rest of your data science workflow, or experiment with new algorithms that aren't available in those tools, then these two packages are good options.

When I was doing the longitudinal clustering for citation trajectories, I inadvertently saved the 6,666 (I know!) records from 1980-2015 in the regular WoS* format instead of a more useful tab-delimited format. I quite easily pulled out the pub year, accession number, times cited, and other simple fields using R. ... It's just now, when I actually want to follow up with some natural language processing on the titles and abstracts, that I realize my kludge won't work for either the title or the abstract. So I fooled with it a couple of different ways before heading out to see if there was anything new for processing these files, since they were such a hassle to get in the first place. It turns out there is a new, fairly comprehensive R package: Bibliometrix. I had already experimented with Metaknowledge in Python. The extensive instructions (paywall) are very helpful, but I really just wanted to stay in R.

What follows is a general intro to these two tools and my observations.


Bibliometrix appears to be quite new, with recent releases. The first thing I tried - reading in a directory full of WoS export files - was like magic. In a snap, I had a dataframe with everything in the right columns.


library(plyr)  # for ldply()

filenames <- list.files("directory", full.names = TRUE)

# getWoSdf parses a single WoS plain-text export file into a data frame
WoSall <- ldply(filenames, getWoSdf)

BibTeX files seem to be preferred over this format, but it was plenty quick for the 500 records per file I had. A nice feature is that it reports progress every hundred records.

There are also pre-built basic summary/descriptive functions. It exports the standard networks, and it also does co-word analysis with a pretty neat visualization.

[Figure: Multiple Correspondence Analysis (MCA) using keywords]
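For a sense of what the co-word analysis above starts from, here is a minimal illustration - in Python for brevity, even though the package is R, and with invented keyword lists: the core object is a count of how often pairs of keywords appear on the same record, which is the co-occurrence matrix the package builds before mapping it.

```python
from collections import Counter
from itertools import combinations

# Invented keyword lists, one per record (think WoS DE/ID fields split on ';')
records = [
    ["bibliometrics", "citation analysis", "clustering"],
    ["bibliometrics", "text mining"],
    ["citation analysis", "clustering", "text mining"],
]

# Co-word analysis at its core: count keyword pairs co-occurring on a record
cooc = Counter()
for kws in records:
    for a, b in combinations(sorted(set(kws)), 2):
        cooc[(a, b)] += 1
```

Everything downstream (MCA, clustering, the map) is a projection of this pair-count table.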

This tool doesn't do anything to help you clean the data or flag any weirdnesses. The visualizations shown aren't super pretty, but it's quite easy to feed the data into another R graphing tool.


For metaknowledge, I worked through the journal article, but using my own WoS data. Everything worked as expected, and I was able to quickly get really nice results. You can also download a Jupyter notebook with their sample data to work through the process. A neat thing you don't see every day: it will break the data down by male/female, guessing gender with a popular algorithm. It also does Reference Publication Year Spectroscopy (meh), and besides extracting all of the standard networks you might want, it has ways to extract text for text mining.
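To make Reference Publication Year Spectroscopy concrete: at its simplest, it is a histogram of the publication years of the cited references, in which the peaks point at historically influential works. A bare-bones sketch with invented cited-reference strings (real RPYS also detrends the counts against a moving median, which I skip here, and this is not metaknowledge's actual API):

```python
import re
from collections import Counter

# Invented cited-reference strings in the WoS CR style
cited_refs = [
    "GARFIELD E, 1955, SCIENCE, V122, P108",
    "SMALL H, 1973, J AM SOC INFORM SCI, V24, P265",
    "GARFIELD E, 1955, SCIENCE, V122, P108",
]

# RPYS at its core: count cited references by publication year
years = Counter()
for ref in cited_refs:
    m = re.search(r"\b(1[89]\d{2}|20\d{2})\b", ref)  # first plausible year
    if m:
        years[int(m.group(1))] += 1
```

Plot `years` sorted by year and the spikes are the "spectrogram".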

Some negatives, based on my brief experiments: I couldn't get Scopus data to work, for whatever reason. It also doesn't really facilitate any sort of cleaning, and data sets that come out of WoS still have issues. The example graphs were not very pretty, and one of the suggested graphing approaches produced an error. Still, it's quite easy to export the data and substitute your favorite graphing tool, because there are a million of them.


* no endorsement intended, for real.


2 responses so far

  • John McLevey says:

    Hi Christina! Nice to meet you.

    Thanks for trying metaknowledge. It's nice to see it mentioned on your blog. Also, thanks to you, I've just discovered the new bibliometrix R package. It looks great! Can't wait to dig into it.

    Sorry to hear you had trouble with the scopus data. I'm not sure what's going on there. I'll look into it and see if I can figure out what the problem is.

    I agree that the graphs are not especially attractive. This, unfortunately, is a limitation of matplotlib, not of metaknowledge. As a longtime user of ggplot2 (and, more recently, great libraries like ggnetwork and ggnet2), this is frustrating for me too. The graphing tools are vastly superior in R.

    If you are interested, I've just posted a couple of long-overdue blog posts that show how to produce better-looking graphs with metaknowledge, though the approach they use introduces other types of limitations. So far there is a post on RecordCollections and one on networks. I'll put more up once the new teaching term is under way.

    The warning you got also comes from matplotlib and networkx, by the way. It's not actually an issue with the metaknowledge code itself.

    And of course you are right that metaknowledge doesn't have built-in tools for cleaning WoS records. This was a difficult decision we made when designing the package. We decided not to get too deep into cleaning methods for the early releases because Python is really good at manipulating strings already. With a bit of Python knowledge, you can do a lot of cleaning outside of metaknowledge. That said, there will likely be methods for cleaning WoS data in future releases.

    Thanks again! Great to see metaknowledge show up on your blog.


    • Christina Pikas says:

      Thanks for a great package! Looking forward to your blog posts and any cleaning you add.