Archive for the 'Uncategorized' category

Metaknowledge (python) and Bibliometrix (R) - more or less comprehensive bibliometrics packages for standard data science environments

Aug 28 2017 Published by under bibliometrics, Uncategorized

I thought for sure I had mentioned Metaknowledge here before, but I can't find it, so I must have misremembered. ...

There are tons of tools for bibliometrics, and a lot of people just code their own for simplicity's sake, even if they eventually visualize the results in an off-the-shelf network analysis tool. Sci2, VOSviewer, and CiteSpace are all close to comprehensive, freely available, and pretty easy to use. So what need is there for another product? If you want to integrate bibliometrics into the rest of your workflow, or experiment with new algorithms that are not available in the tools above, then these two packages are good options.

When I was doing the longitudinal clustering for citation trajectories, I inadvertently saved the 6,666 (I know!) records from 1980-2015 in the regular WoS* format instead of a more useful tab-delimited one. I quite easily pulled out the pub year, accession number, times cited, and other simple fields using R. ... It's only now, when I actually want to follow up with some natural language processing on the titles and abstracts, that I realize my kludge won't work for either the title or the abstract. So I fooled with it a couple of different ways before heading out to see if there was anything new for processing these files, since they were such a hassle to get in the first place. It turns out there is a new, fairly comprehensive R package: Bibliometrix. I had already experimented with Metaknowledge in Python. The extensive instructions (paywall) are very helpful, but I really just wanted to stay in R.
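The likely reason a line-by-line kludge breaks on titles and abstracts is that in the plain-text WoS format those fields wrap onto continuation lines. A minimal sketch of handling that in Python, assuming the standard two-letter field tags (the sample record below is made up):

```python
def parse_wos(lines):
    """Parse plain-text WoS field-tagged records into a list of dicts."""
    records, current, tag = [], {}, None
    for line in lines:
        if line.startswith("ER"):        # "ER" marks the end of a record
            records.append(current)
            current, tag = {}, None
        elif line[:2].strip():           # a new two-letter field tag
            tag = line[:2]
            current[tag] = line[3:].strip()
        elif tag:                        # continuation line: glue onto the last field
            current[tag] += " " + line.strip()
    return records

sample = [
    "PY 2015",
    "TI A title that wraps",
    "   onto a second line",
    "ER",
]
recs = parse_wos(sample)
# recs[0]["TI"] == "A title that wraps onto a second line"
```

The point is just that the parser has to carry state (the current tag) across lines; a one-line-per-field assumption silently truncates any wrapped title or abstract.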

What follows is a general intro to these two tools and my observations.

Bibliometrix

http://www.bibliometrix.org/

This package appears to be quite new, with recent releases. The first thing I tried, reading in a directory full of WoS export files, worked like magic. In a snap, I had a dataframe with everything in the right columns.

Literally:

library(bibliometrix)  # provides isi2df()
library(plyr)          # provides ldply()

filenames <- list.files("directory", full.names = TRUE)

getWoSdf <- function(filename) {
  holdrecs <- readLines(filename)  # read one WoS export file
  recsdf <- isi2df(holdrecs)       # parse the plain-text records into a data frame
  return(recsdf)
}

WoSall <- ldply(filenames, getWoSdf)  # parse every file and row-bind the results

It seems like BibTeX files are preferred over this format, but it was plenty quick for the 500 records per file I had. A nice feature is that it reports progress every hundred records.

There are also pre-built basic summary/descriptive functions. It exports the standard networks, and it also does co-word analysis with a pretty neat visualization.

Multiple Correspondence Analysis (MCA) using keywords

This tool doesn't do anything to help you clean the data or pick out any weirdnesses. The visualizations shown aren't super pretty, but it's quite easy to use another R graphing tool with the data.
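Under the hood, a co-word network is just keyword co-occurrence counting: every pair of keywords that appears on the same record becomes a weighted edge. A toy stdlib-Python sketch of the idea (the data is made up, and this is not bibliometrix's actual code):

```python
from itertools import combinations
from collections import Counter

def coword_edges(keyword_lists):
    """Count how often each pair of keywords co-occurs on a record."""
    edges = Counter()
    for kws in keyword_lists:
        # sorted(set(...)) dedupes within a record and makes pairs canonical
        for a, b in combinations(sorted(set(kws)), 2):
            edges[(a, b)] += 1
    return edges

papers = [["bibliometrics", "citation analysis", "r"],
          ["bibliometrics", "r"],
          ["citation analysis", "text mining"]]
edges = coword_edges(papers)
# edges[("bibliometrics", "r")] == 2
```

The resulting edge list drops straight into igraph or any other network tool.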

MetaKnowledge

http://networkslab.org/metaknowledge/

I worked through the journal article, but using my own WoS data. Everything worked as expected and I was able to quickly get really nice results. You can also download a Jupyter notebook with their sample data to work through the process. A neat thing you don't see every day is that it will break the author set down by male/female, guessing from names using a popular algorithm. It also does Reference Publication Year Spectroscopy (meh), and besides extracting all of the standard networks you might want, it has ways to extract text for text mining.
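That kind of gender guess is, at heart, a first-name lookup against large name-frequency tables. A toy illustration in Python (the table and helper below are entirely made up for illustration; metaknowledge's actual implementation differs):

```python
# Illustrative name table; real tools use large census-derived frequency data.
NAME_TABLE = {"mary": "female", "john": "male"}

def guess_gender(author):
    """Guess gender from a WoS-style 'Last, First M.' author string."""
    given = author.split(",")[-1].strip().split()   # the given-name part
    first = given[0].lower().rstrip(".") if given else ""
    return NAME_TABLE.get(first, "unknown")

guess_gender("Smith, Mary J.")  # "female"
```

Initials-only records come back "unknown", which is why such breakdowns are always partial.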

Some negatives, based on my brief experiments: I couldn't get Scopus data to work, for whatever reason. It also doesn't really facilitate any sort of cleaning, and data sets that come out of WoS still have issues. The graph examples were not very pretty, and one of the graphing approaches from the article threw an error for me. Still, it's quite easy to export the data or substitute your favorite graphing tool, because there are a million of them.

 

* no endorsement intended, for real.

 

2 responses so far

Who do I want to rescue me?

Mar 09 2016 Published by under Uncategorized

DM has continued a meme - who do you want to rescue you?

These are not ranked, necessarily:

  1. Dr. Who
  2. Paw Patrol
  3. Mark Watney
  4. the guys in the Scott Lynch books (men and women both, not just Locke)
  5. Kvothe

 

Comments are off for this post

What's old is new again

Feb 10 2016 Published by under Uncategorized

Everybody's back to starting up online communities for their publishing platforms. IEEE with Collabratec. ACS with ChemWorx. Science has one, too.

Seems like everyone did this 15 years ago. The only difference now seems to be the addition of authoring tools. We'll see.

(I posted about ChemWorx before)

Comments are off for this post

How to get unbound by non-forward thinking users...

Jan 16 2016 Published by under information policy, Uncategorized

Last post I described a system that was stuck because of its own commitment to user-driven development. They're really stuck. So what are possible ways out, particularly for a government system?

I really don't know, particularly for a government system, but that doesn't mean I can't think about it.

One thought was that maybe they need to make their case more clearly. How could they describe the projects better to make them more attractive in the rankings? This is probably impossible, and maybe even insulting, since they no doubt tried very hard to get their point across in the past. They seemed frustrated. Of course, they could hire a consultant to tell them exactly what they already knew - some people will listen to consultants.

I was wondering if acquisition rules would allow them to set aside something like 20% of the budget for their own projects - ones they thought were best but that weren't necessarily voted on by the users. This would work for things that were less expensive or could be piloted.

Part of the problem is that the system may need major redesign and re-architecting. Some of the pieces can be kept, but they need to be integrated. That would have to wait for the next major version. Maybe if the key software underneath has to be upgraded, they could use that as a reason to do some of these things?

Sigh. I don't know. It sure is easier just to dream of a cool system.

Comments are off for this post

2015 in Review

Jan 04 2016 Published by under Admin, Uncategorized

Well, 2015 was sort of a meh year for me. Definitely on the blog.

January: Using more of the possible dimensions in a network graph - I was glad I shared this and glad I was able to make it work in the first place.

February: So... um... what if I'm still enjoying it? - about my dissertation.

March: Polar and Ellipsoid Graphs in iGraph in R

April: Which are the bestest? Top articles from a diverse organization - part 1 - never did part 2 AND still need to write this up for publication

May: ACS and Just Accepted Manuscripts

June: Notes from a presentation on library spaces by Keith Webster

July: none

August: Why special librarians should be active on their organization's intranet social media - the title of the post is not really descriptive. This is a research blogging post about the use of social media on a company's intranet.

September: The smart phone and parenting children - two articles

October: "Theory" for the immigrant to social sciences

November: Citation Manager Frustration - I actually had 3 posts on the same date, but this is the most important. I really don't like the way things are going with citation managers. As an update: the folks from RefWorks did contact me and I described a bunch of the issues. I think they'll have other ways to solve the same problems I'm encountering than what I proposed but they definitely seemed interested.

December: Bibliometrics: Getting an accurate (+/-) count of articles published by an organization in a year

I'm shocked that I posted at least something every month but July.

 

Comments are off for this post

ACS and Just Accepted Manuscripts

A colleague posted on Chminf-l asking about the American Chemical Society's Just Accepted Manuscripts program. Most of the immediate responses were to explain the program, which is not what she asked. Here's the site's description:

"Just Accepted" manuscripts are peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society is posting just accepted, unredacted manuscripts as a service to the research community in order to expedite the dissemination of scientific information as soon as possible after acceptance. "Just Accepted" manuscripts appear in full as PDF documents accompanied by an HTML abstract. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). The manuscripts posted on the "Just Accepted" Web site are not the final scientific version of record; the ASAP (As Soon As Publishable) article (which has been technically edited and formatted) represents the final scientific article of record. The "Just Accepted" manuscript is removed from the Web site upon publication of the ASAP article, and the ASAP article has the same DOI as the "Just Accepted" manuscript. The DOI remains constant to ensure that citations to "Just Accepted" manuscripts link to the final scientific article of record when it becomes available.

The FAQ explains that this is opt-in and these copies will be removed when the ASAP and final versions are live.

Chemistry is kind of a funny field when you talk about scholarly communication and sharing (see, in particular, Theresa Velden's dissertation research on this). Journals are dominated by ACS, with RSC and the other scholarly publishers following. In some areas, like synthetic chemistry, there's a real reluctance to even share at meetings, no desire to post pre-prints, and tight control over data access. In more computational and analytic areas, it's a little more relaxed.

Pre-print server efforts in chemistry have been mostly unsuccessful. For one thing, the journals will not take articles posted elsewhere first. Second, there's a big tension around priority (the move to first-to-file may change the patent side, but there are still recognition issues).

With all that, there are still efforts to require self-archiving broadly across fields and to have disciplinary pre-print servers. The big publishers, rolling in dough from the subscriptions of all the ACS-accredited programs, do not want to see these archives and self-archiving succeed, even though it's been shown in physics that self-archiving doesn't harm subscriptions.

Anyway, as I said on the list, this is a pretty smart move by ACS. It solves the problem of getting the science out there sooner, but still with peer review, and on the hosted platform. This version disappears and the doi points you to the official version when available so they keep the traffic in house. I'm sure the embargoes go from official publication, too, so this is more time the publisher has to disseminate the content and get attention before government funders and institutional repositories can share it.

I think it will be accepted by chemists because it is from ACS and it is after peer review. We'll see, though, if there are any typos and whatnot that offend people.

 

Edit to add: Thurston Miller points to a few viewpoint papers in Journal of Physical Chemistry Letters on OA (the papers themselves are not OA).

2 responses so far

OAuth in TwitteR much easier now, whew!

Mar 18 2015 Published by under Uncategorized

Not that I should be messing with this at this point, but I wanted to retrieve a tweet to provide evidence for a point. Anyway, instead of the 50-step process of the past, you now follow the instructions in the twitteR README: http://cran.r-project.org/web/packages/twitteR/README.html - except that you now put your access token and secret in the single command, too, like so:

setup_twitter_oauth(consumer_key, consumer_secret, access_token=NULL, access_secret=NULL)

Then you can just search or whatever. Wow!

Very nice. How much time did I spend playing with the old method?

Comments are off for this post

Polar and Ellipsoid Graphs in iGraph in R

Mar 12 2015 Published by under Uncategorized

I'm still working on some additional graphs for the project mentioned in this earlier post. The Fruchterman-Reingold layout was too crowded, so my customer suggested a circular layout with one category in the center and the rest on an outer ring. I said sure! But when I went to do it, I found only a star layout (one node in the center) and a ring layout - no polar layout. I tried a few things but finally broke down and asked. Quick, perfect answer on StackOverflow (as often happens).

That led to this:

Polar Layout

But hey, still pretty jammed up. So what about an ellipse? Sure!

What's that equation again?

\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1

 

But that's a hard way to do it when I need x and y values in a matrix. This looks better:

x = a\cos(\theta), \quad y = b\sin(\theta)

And this is how I did it.

library(igraph)

# place each vertex on an ellipse: x = a*cos(theta), y = -b*sin(theta)
# (y is negated so the layout matches screen coordinates)
ellip.layout <- function(a, b, theta) {
  cbind(a * cos(theta), -b * sin(theta))
}

systems <- which(V(g)$category == "System")
comp <- which(V(g)$category != "System")

# inner ellipse for "System" vertices, outer ellipse for everything else
a <- ifelse(V(g)$category == "System", 4, 5)
b <- ifelse(V(g)$category == "System", 0.5, 1)

theta <- rep.int(0, vcount(g))  # one angle per vertex
theta[systems] <- (seq_along(systems) - 1) * 2 * pi / length(systems)
theta[comp] <- (seq_along(comp) - 1) * 2 * pi / length(comp)

layout <- ellip.layout(a, b, theta)

plot.igraph(g, layout = layout, asp = 0)  # asp = 0 keeps the ellipse from collapsing into a circle

Originally the outer ring was coming out as a circle anyway, but I asked the mailing list and it was a matter of setting asp (the aspect ratio) to 0.
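For the record, the layout is nothing more than the parametric ellipse evaluated at evenly spaced angles. The same computation sketched in stdlib Python (the n, a, and b values here are illustrative):

```python
import math

def ellipse_layout(n, a, b):
    """Place n nodes evenly around an ellipse with semi-axes a and b."""
    step = 2 * math.pi / n           # angle between consecutive nodes
    return [(a * math.cos(i * step),
             -b * math.sin(i * step))  # negate y, as in the R version
            for i in range(n)]

pts = ellipse_layout(4, a=5, b=1)
# pts[0] is (5.0, -0.0); pts[1] is approximately (0, -1)
```

Handing one such coordinate list per category to the plot is all the "polar/ellipse layout" amounts to.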

Here's where I ended up:

Ellipse

ETA: If you do labels, there's a neat trick to make them always sit outside the circle. See here: https://gist.github.com/kjhealy/834774

Comments are off for this post

Ebook Explosion

Dec 17 2014 Published by under Information Science, libraries, Uncategorized

Seems like all the publishers and all the societies are trying to get into the eBook game. The newest announcement is from AAS (using IOP as its publisher). Considering that a lot of these domains - like computer science, with ACM's new ebook line - are not particularly known for monographs but rather for conference proceedings and journal articles, it seems kinda weird.

Someone mentioned that maybe it was due to the ebook aggregators' demand-driven acquisition plans - but I think it's just the opposite. Many major publishers have jacked up prices (pdf) on EBL and Ebrary recently, all to push libraries into licensing "big deal" bundles of the entire front list or entire subject categories. And it is super attractive to buy from the publishers, because their books often come without DRM, as PDFs (one big publisher even offers a whole book as a single pdf; most do one pdf per chapter), with ways to view online, easily findable using Google, and with nice MARC records for adding to the catalog.

The ebook aggregators have nasty DRM. They have concurrent-user rules. They have special rules for anything considered a textbook. We have to log in with our enterprise login (which isn't my lab's day-to-day login), and the data about which books we view is tied to our identities. The new prices run as high as 30-40% of the cover price for a 1-day loan. That's right: the customer can look, and maybe print a couple of pages, for 24 hours, and the library is charged a third of the cover price of the book.
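The arithmetic is what stings: at short-term-loan fees of 30-40% of list, a handful of loans costs more than buying the book outright. A back-of-envelope sketch (all numbers are illustrative):

```python
list_price = 100.00   # hypothetical cover price
fee_rate = 0.35       # one 1-day short-term loan at 35% of list
loans = 3

total_fees = loans * fee_rate * list_price
# total_fees == 105.0: three 1-day loans already exceed the purchase price
```

So any title that gets even modest repeat use would have been cheaper to buy.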

But for the society and publisher own pages, what seems like a one time purchase has now become yet another subscription. If you buy the 2014 front list will you not feel the pressure to buy the 2015 and 2016 publications?

Aggregators had seemed like part of the answer, but not so much at these prices - though the ones we've licensed do get used a lot at MPOW. We've already mugged all the other budgets for our journal habit, so where does the money for these new things come from? The print budget was gone ages ago, and the reference budget was also raided.

Comments are off for this post

What I want/need in a programming class

Aug 08 2014 Published by under Off Topic, Uncategorized

Abigail Goben (Hedgehog Librarian) has a recent blog post discussing some of the shortcomings she's identified in the various coding courses she's taken online and the self-study she has done.

I think my view overlaps hers but is not the same. Instead of trying to compare and contrast, I'll say what I've seen and what I need.

I'm probably pretty typical for my age: I had BASIC programming in elementary and high school. It was literally BASIC, like

10 print "hello"
20 goto 10

I think we did something with graphics in high school, but it was more BASIC. In college, they felt very strongly that physics majors should learn to code, so I took the Pascal course for non-CS majors in my freshman year. That was almost like the BASIC programming: no functions, no objects... kinda do this, do this, do this... turn it in. I never saw any connection whatsoever with my coursework in physics, and I never understood why I would use it instead of the Mathematica we had to use in diffeq.

In the workforce, I did some self-study JavaScript (before it was cool), HTML, CSS - not programming, obviously. Then I needed data for an independent study I was doing, and my mentor for that study wrote a little Perl script to fetch web pages and pull out links. The script she wrote broke with any modification to the website template, so after waiting for her to fix it for me, I ended up fixing it myself... which I should have done to start with. ... In the second stats class, another student and I asked if we could use R instead of Stata - he was going back to a country with less research funding, and I was going to work independently. But then we just used the regression functions already written and followed along from a book. Elsewhere in the workforce I've read a lot about R, some co-workers and I worked through a book, and I did the Codecademy class on Python.

All of these classes were - or could have been - taught in interactive mode: what the various data types are, how you get data in and back out again, how you write a for loop. Nobody really goes into any depth about lists in R, and they pop up all over the place. I couldn't even get Python installed on my computer at first by myself, because everyone teaching me was on a Mac. (BTW, use ActivePython and ActivePerl if you're on Windows - not affiliated, but they just work.)

The R class on Coursera (the same one she complains about) and the Johns Hopkins data science sequence there were the first that really made me write functions. What a difference. I really appreciated them for that.

So here's what I think:

People new to programming - truly new - need to understand the basics of how any program works, including data types, getting data in and out, and for loops, but also architectural things like functions and objects. They probably need to spend some time with pseudocode, just getting through the practice.

Then if you're not new to programming but new to a language, that's a different course. There you say: this is how this language varies, this is what it does well, here's where it falls down.

Then there needs to be a course on software design, engineering, or process that covers version control and how to use it; how to adequately document your code; how to write programs in a computationally efficient way; the difference between doing things in memory or not; and what integrated development environments are and when you would use one. This is what I need right now.

If it's something basic, I can follow a recipe I read off of Stack Overflow, but I know nothing about efficiency. Why use sapply vs. a for loop? Is there a better way to load the data in? Why is it slow - is it slower than I should expect? I love RStudio - love, love, love! But I tried something like that for Python and could never get it to work. I'm still learning git, but I don't really understand the process even though I can go through the steps.
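The sapply-vs-for-loop question has a direct analogue in any language: an explicit loop versus a built-in that does the looping for you. A Python sketch of the comparison (the two functions are mine, purely for illustration):

```python
def squares_loop(n):
    out = []
    for i in range(n):
        out.append(i * i)   # explicit loop, like for() in R
    return out

def squares_map(n):
    # built-in iteration, like sapply() in R
    return list(map(lambda i: i * i, range(n)))

squares_loop(5) == squares_map(5)  # True: same answer either way
```

The answers are identical; what courses rarely explain is when the built-in version is actually faster, clearer, or neither, which is exactly the gap I mean.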

Anyhow, that's more about me than anything, but I think I'm probably pretty typical. There's a huge gap in the middle of what's being taught, and a lot of people need the very basics of programming almost independent of any specific language.

6 responses so far
