Scientopia Fundraising

(by Christina Pikas) Oct 26 2018

Quick off-topic comment - if you'd like to support Scientopia, you can find a button on my right hand sidebar to do so.


Filling in missing data: books

(by Christina Pikas) May 17 2018

Photo by Janko Ferlič on Unsplash

My normal area for information analysis and bibliometrics is technology - not even really science, but technology. The current project I'm obsessing over (it's actually really cool and interesting and fun) spans political science, history, philosophy, and even some sociology and criminal justice. So I played all my reindeer games on the journal articles, but when it came to analyzing the abstracts, a lot were missing. I wanted to ignore these, but they were whole collections from extremely relevant journals and also nearly all of the books. For the journals, I filled in what I could from Microsoft Academic (no endorsement intended).

Books though.

I hoped I could use WorldCat - my library doesn't have a cataloging subscription with them so I don't think so? I hoped LoC - I could download ALL records before like 2014 which would help a lot, but wow, I can't really deal with that much data for right now. OpenLibrary- no.

Amazon - they've changed the rules a lot, and it wasn't clear to me that, even if I took all the time to figure it out, I wouldn't actually be breaking the current terms of service.

I asked the wonderful group on the Code4Lib list - one of them (thanks Mark!) pointed to Harvard's api and records. Woo-hoo! I hope they log my stats to justify their expense. I hope my messing around doesn't cause them any problems.

I'm not a cataloger, although I played one for 4 months 15 years ago. And I don't know MODS or Dublin Core (although it did exist when I was in library school). I pulled a couple of records and poked through them to see where what I wanted was located. Originally, I pulled the title from the record, but that proved to be too messy.

The data I needed to fill in:

  • gathered from reference managers of a team of analysts over a number of years (as many as 9)
  • searched in general (WoS, Scopus) and specific (World Political Science Abs, PAIS, etc) databases
  • gathered in EndNote, de-duped, exported as tab-delimited text

In R:

  • limited to books and book sections
  • for titles, removed punctuation, made lower case, removed numbers
  • for authors, split to take first name provided of the first author (mostly the last name)
  • for ISBN, removed ISSNs, kept the first one, took out the hyphens (yes, to be more thorough, I should have or'd them)
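
Roughly, the cleanup steps above can be sketched in base R like this. It is not the script used for the project: the function names (clean_title, first_author_key, first_isbn), the toy inputs, and the assumption that authors arrive as "Last, First" separated by semicolons are all made up for illustration.

```r
# Sketch only: helper names and input formats are assumptions, not from the post.

# Titles: lower case, punctuation and numbers removed, whitespace squeezed
clean_title <- function(ti) {
  ti <- tolower(ti)
  ti <- gsub("[[:punct:]]", " ", ti)
  ti <- gsub("[[:digit:]]", "", ti)
  ti <- gsub("[[:space:]]+", " ", ti)
  trimws(ti)
}

# Authors: first name-part given for the first author (mostly the last name).
# Assumes authors are separated by ";" and written "Last, First".
first_author_key <- function(au) {
  first <- sub(";.*$", "", au)
  tolower(sub("[, ].*$", "", trimws(first)))
}

# ISBN: split the field, take out hyphens, drop anything that is not
# 10 or 13 characters long (which removes 8-character ISSNs), keep the first
first_isbn <- function(field) {
  ids <- unlist(strsplit(field, "[;,[:space:]]+"))
  ids <- gsub("-", "", ids)
  ids <- ids[grepl("^([0-9]{9}[0-9Xx]|[0-9]{13})$", ids)]
  if (length(ids) == 0) NA_character_ else ids[1]
}

clean_title("Democracy in America, Vol. 2")            # "democracy in america vol"
first_author_key("Arendt, Hannah; Scott, James C.")    # "arendt"
first_isbn("0022-3816; 978-0-300-07815-2; 0300078153") # "9780300078152"
```

Keeping only the first ISBN is the shortcut mentioned in the list; a more thorough version would return all of them and try each in turn.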


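# needs httr (GET) and jsonlite (fromJSON)
# harvardurl and the query lists are not shown in this copy of the post
getharvabs <- function(citeNo){
  if (!is.na(missingbooks1$holdisbn2[citeNo])){
    # have an ISBN, so search on that
    holding<-GET(harvardurl, query=query)
    if (holding$status_code==200){
      bookContent <- rawToChar(holding$content)
      booktxt <- fromJSON(bookContent)
      if (booktxt$pagination$numFound==0){
        citerow<-data.frame(id=citeNo, TI=missingbooks1$X4[citeNo], AB=NA)
      } else {
        citerow<-data.frame(id=citeNo, TI=missingbooks1$X4[citeNo],
                            AB=booktxt$items$mods$abstract$`#text`) }
    }
  } else {
    # no ISBN, so presumably search on the cleaned title and author instead
    holding<-GET(harvardurl, query=query)
    if (holding$status_code==200){
      bookContent <- rawToChar(holding$content)
      booktxt <- fromJSON(bookContent)
      if (booktxt$pagination$numFound==0){
        citerow<-data.frame(id=citeNo, TI=missingbooks1$X4[citeNo], AB=NA)
      } else {
        citerow<-data.frame(id=citeNo, TI=missingbooks1$X4[citeNo],
                            AB=booktxt$items$mods$abstract$`#text`) }
    }
  }
  print("this is citerow")
  print(citerow)
  return(citerow)
}

holdbookabs<-data.frame(id=as.integer(0),TI=as.character("test"), AB=as.character("test"))

for (i in 1:length(missingbooks1$holdti)){
  # loop body is missing from this copy; presumably it appended each row
  holdbookabs<-rbind(holdbookabs, getharvabs(i))
}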

This choked a couple of times. Maybe I don't always have the right book - but then again, if it's on the same subject. Meh.
I also considered getting the LCSH, but decided against. Too much time spent on this already.


Getting abstracts back from Microsoft Academic's Inverted Index

(by Christina Pikas) May 14 2018

Some time ago, I posted about using Microsoft Academic to fill in missing data from other searches. Jamie and I were going to do a package to wrap the API, but bureaucracy more or less killed our enthusiasm (well, not his, that would be impossible).

Here I am obsessing over a really, really cool bibliometrics project, and have lots of citations missing abstracts. I'm sort of thinking I won't be able to do much with the books even though catalogs seem to have descriptions for a lot of them (happy to take suggestions). I've already looked at using other sources, so I'm back at Academic.

Pulled out my script. Found I lost my keys, retrieved new keys and found there's a new endpoint url, updated that and hit go. ....

Ladies and gentlemen, they moved the abstracts... no more paragraph, you now get an "inverted index." People who studied information retrieval may know what that is, but in this case it's a list of terms, each with a numeric vector of the locations where the term appears. Stop words are included so "the" might have 20 locations and "sassafras" has 1.

Here it is. Jamie helped with the strategy and the rest comes from lots of searches:

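library(httr)      #GET, add_headers
library(jsonlite)  #fromJSON
library(reshape2)  #melt

options(stringsAsFactors = FALSE)

#keys- put yours in:
msakey1<-""

#current 05142018 url
#apiurl (the endpoint) and apiattrib (the attribute list) are not shown in this copy of the post

#sample search left in
searchexpr<-"Ti='identity as a variable'"

#test on one to see how it works
testcite <- GET(apiurl,
         query = list(expr = searchexpr,count = 1, attributes = apiattrib), add_headers("Ocp-Apim-Subscription-Key"= msakey1))

#could look for errors first

#get the json out into usable format
#comes out raw so need to make into text

testciteContent <- rawToChar(testcite$content)
test <- fromJSON(testciteContent)
#this will get a ; separated vector
paste(test$entities$AA[[1]]$AuN, collapse = ';')

#use the following to get an abstract from the inverted index
#testII is the inverted index list pulled out of test (that line did not survive in this copy)
#melt gives one row per term and location: value is the location, L1 is the term
testII.m <- melt(testII)
#sort by location to put the terms back in reading order
testII.m <- testII.m[order(testII.m$value), ]

ab<-paste(testII.m$L1, collapse=" ")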
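
For anyone who wants to try the idea without keys or an API call, here is a small self-contained sketch in base R. The toy index and the function name rebuild_abstract are made up for illustration, and it assumes the index is a named list with one numeric vector of locations per term, as described above.

```r
# Sketch only: toy data and function name are assumptions, not from the API.
# Each name is a term; each vector holds the locations where that term appears.
inv_index <- list(
  the = c(0, 4),
  cat = 1,
  sat = 2,
  on  = 3,
  mat = 5
)

rebuild_abstract <- function(index) {
  # repeat each term once per location, so terms and positions line up
  terms <- rep(names(index), lengths(index))
  positions <- unlist(index, use.names = FALSE)
  # sort the terms by location and glue them back into one string
  paste(terms[order(positions)], collapse = " ")
}

rebuild_abstract(inv_index)  # "the cat sat on the mat"
```

Because only the relative order of the locations matters, it makes no difference whether the service counts from 0 or 1.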



Proliferation of free databases of literature

(by Christina Pikas) Apr 27 2018

Google Scholar by now is the big player in free places to search across the literature. There are some oldies that are more specific in coverage: CiteSeer, DBLP, etc. And of course big government specialty: PubMed, ERIC, TRID...

But now there are lots of other choices, which is mostly a good thing:

Microsoft Academic is back being actively developed. There's an API which allows some free usage and you can download the whole thing.

BASE - this has been around for a bit and people do really like it

Lens - also looks at patents

1findr - has a free and a paid version. They have a commercial product to help you locate open access when available

Dimensions - also free and paid versions


Maybe SHARE is also in this category? And ScienceOpen?


As I say a lot, the problem isn't having enough; it's more one of precision and efficiency. Also access to quality clean data in quantity.


Follow Aaron Tay specifically, and in general - he's keeping an eye on the space.


Editing to add: Ryan Regier (also keeping an eye on the space and worth a follow) adds:

Science Open

Semantic Scholar



Maybe also could add Mendeley - although it's built differently.


The MacRoberts cite this blog (sorta)

(by Christina Pikas) Jan 22 2018

So a quick check of Twitter led me to this very true tweet of Jason's:


I've read a bunch by the authors (and cited it - particularly the part that deals with informal scholarly communication), but I hadn't read the current one:

MacRoberts, M.H. and MacRoberts, B.R. (in press). The mismeasure of science: Citation analysis. Journal of the Association for Information Science and Technology. doi: 10.1002/asi.23970

It deserves in depth treatment, but check this out:

[screen shot showing them referencing this blog]

eta: It's a cite to a comment by Croft on current url: (old does redirect). (so not to me reviewing their article, but to a commenter, but still)


An ephemeral platform, used for other than ephemeral, and the death of Storify

(by Christina Pikas) Dec 15 2017

As I say in my dissertation and elsewhere, informal scholarly communication in social media is both ephemeral and archival. Maybe this is new because some online traces intended to be for a limited number of recipients for immediate use have longer life and wide reach. Some utterances in social media live on well after the originator intended (for good and bad). But maybe it's not entirely new as certainly letters among scientists have been preserved (some of these were no doubt sent specifically for preservation purposes).

I've long been a fan of blogs for personal knowledge management, that is, thinking through readings, partial results, tutorials for how to do things. Blogs are easily searched, archived, migrated, shared, and don't enforce an artificial or proprietary structure found in other tools. However, I also know that long-term bloggers who have established a readership through careful, well-edited posts impose new barriers on themselves for using their blogs for this purpose. I found in my studies that some superstar bloggers almost entirely stopped blogging because they didn't want to post anything incomplete or partial and there were too many other things to do.

I think this has been one of the motivating factors for the use of Twitter for long threads of stories and analysis. Twitter has great reach and immediacy, and interactivity... but at the expense of search (although it is certainly better than it was) and preservation. Who of us hasn't dug through our likes and RTs to try to find something interesting we saw ages ago?

We're using a platform specifically built for ephemeral communication for communication that should be saved and preserved.

So individuals who value this knowledge management function, or who appreciate careful analysis or good storytelling serialized over 10s of tweets have adopted Storify to gather and order and preserve and contextualize the pieces. Storify added tools to make it a bit easier. Instead of Storify, you could embed individual tweets (this embedding function also calls back to Twitter so really doesn't preserve). You could <eek> screenshot. You could even just write it up and quote the text.

And Storify is going away this Spring. We do have notice, luckily, but we still have a problem. We need to back our stuff up - we need to back other people's stuff up. Not everything is of the same value to the originator as it is to someone else.

My plea - and it will go unheard - is to put things back into blogs which you then tweet. Or back your useful tweets up to a blog?

FWIW, I'm trying to capture AGU meeting tweets and I'll load them into FigShare ... but the odds of some researcher capturing and saving your stuff are actually quite slim.

This post was inspired by a tweet that has a thread and interesting points by her interlocutors:



More evidence for the discovery layer as pile of crap metaphor

(by Christina Pikas) Dec 04 2017

this Cambridge University Report (pdf) via Aaron Tay

page 16:

The key insight was the expectation from users that the simple iDiscover search function would automatically return a list of results as sophisticated and relevant as they would expect from other, more powerful search platforms. This led to frustration when, for example, a search for a journal title returned a number of articles and other results before the link to the journal holdings and links to online access. At this point, when asked what they would do next, many of our participants answered by saying that they would start using another search tool.


Some of the problems were a mismatch with the user's perception of the tool (as a catalog):

page 18

“Book reviews above books just don’t make sense!” (Archaeology PhD student)
“When looking for a book, you’ll end up with a random science article.” (English undergraduate student)
“If you search for a title that only has a few words in it, even if you type it in correctly, other less relevant titles will come up first.” (Education MEd student)


page 22

When asked what was most important to them in terms of platforms used to search for information resources, the words ‘relevance’ and ‘relevant’ were used by a large number of our participants. This was directly linked to a desire for seamless, efficient searches which yielded appropriate and useful results, without the need to use pre- or post-search options to limit or refine them. People were often frustrated at the lack of percieved [sic] relevancy in the initial results list, after having used the main iDiscover search function

[lol, we had a vendor here to help us get our enterprise search going many moons ago... they said "relevance is dead!" I was like "nope!"]


No, vendor, we don't want a pile of crap actually

(by Christina Pikas) Dec 02 2017

Large Copper Dung Beetle (Kheper nigroaeneus) on top of its dung ball

Yes, I have posted about this a number of times, and no this will probably not be too different.   Our vendors have swept up the little competition and then redone their boutique databases to make them - generally - work like piles of crap.

So there are two massive 3rd party aggregators that sell massive piles of crap. Don't get me wrong, these are super attractive to libraries who can then say: look at all these titles we cover! Look at how much content we have! The problem is that with our current state of information abundance, with lots of big package deals, with more and more open access, and with informal scholarly sharing < cough >, getting the full text of recent articles from big name journals really isn't a thing.

The thing is efficient, precise, thorough, appropriate information at the right time and place. I say: I need exactly information on this thing! The aggregators go: here's a massive pile of crap!  I'm like, well I don't need a pile of crap, I need exactly this thing. System returns: here's another pile of crap!

Look at the Aerospace database, for example. Used to be the only real database that covered hypersonics and was thorough at all at covering AIAA and NASA technical reports. It was CSA when I got to know it. Compendex, in comparison, is just adding AIAA stuff this year and isn't going back to the 60s. CSA databases got sold to ProQuest. I have no idea what the hell they've done with it because every time I do a search I end up with trade pubs and press releases - even when I go through the facets to try to get rid of them.

CSA used to have a computer science database, too. The current computer collection in ProQuest doesn't even allow affiliation searching. Also, a search I did there yesterday - for a fairly large topic - didn't return *any* conference papers. For CS. Really.

This is not to pick on PQ, ok maybe it is, but their competitors really aren't any better.


At the same time, we keep having people tell us at my larger organization, that we *must* get/have a discovery layer. Let me just tell you again, that we did a lot of testing, and they did not provide us *any* value over the no additional cost search of a 3rd party aggregator. They are super expensive, and really just give you - guess what - all your stuff in a huge pile of crap. I hear nothing but complaints from my colleagues who have to deal with these. The supposition was that we wanted a Google interface. Ok, maybe a sensible quick search is fine, but that only works when you, like Google, have extremely sophisticated information retrieval engines under the hood. Saying - hey we cover the same journals as your fancy well-indexed database but without the pesky indexing and also lumped together with things like newspapers, press releases, and trade pubs... is not really effective. It's a pile of crap.

You may say, "But think of the children!" The poor freshman dears who can't search to save their lives and who just need 3-5 random articles after they've already written their paper just to fill in their bibliography due in the morning....

Is that really who and what we're supporting? Should we rather train them in scholarly research and how to get the best information? And anyway, for my larger institution, we hardly have any freshmen at all.

No, vendors, we do not want a large pile of crap, but thanks for offering!


Welcoming Confessions of a Science Librarian to Scientopia!

(by Christina Pikas) Nov 28 2017

I'm pleased to point to John's new home here: 

His first post rounding up best science books is live already. We'll get him linked from the home page - but check it out!


Providing real, useful intellectual access to reference materials from current library pages

(by Christina Pikas) Nov 13 2017

Those of us who study or teach about scientific information have this model of how it goes:

(this image is lifted from my dissertation, fwiw, and it more or less reproduces Garvey & Griffith, 1967; Garvey & Griffith, 1972)

Conference papers (in many fields) are supposed to be more cutting edge - really understandable to people in the field with a deep understanding but who need that icing on the cake of what's new. Journal articles are for more or less after substantive parts of the work are complete and take a while for review and publication (letters journals are supposed to be much faster), and then monographs and textbooks are more for when the information is more stable. More recently, there's a category of shorter books that are sort of like extended reviews but are faster than monographs. Morgan & Claypool, Foundations and Trends, and the new series coming from Cambridge University Press (no endorsement here) are examples. (Note the model omits things like protocols, videos, and datasets).

Reference books are even slower moving. They are used to look up fairly stable information. Here are some examples:

  • encyclopedias (and not just World Book, but Kirk-Othmer, Ullmann's, and technical encyclopedias)
  • dictionaries
  • handbooks (not just for engineers!)
  • directories
  • gazetteers (well, maybe less so for the sciences), maps
  • guidebooks (like in geology, biology)
  • sometimes things like catalogs...

You may think, hey, all I really need are the journal articles and Google and maybe Wikipedia. Or at least publishers and librarians think you're thinking that. And reference books are sort of disappearing. It doesn't make any sense to devote precious real estate to the print versions and the online versions are super expensive and also often not used.

The thing is that these tools are really still needed and they have condensed very useful information down into small(er) packages. If you're concerned about efficiency and authority then starting with a reference book is probably a good idea if you want an overview or to look up a detail.

The publishers don't want to lose our money so they're taking a few different approaches. Some are making large topical digital libraries that combine journal articles, book chapters, and reference materials. This can be really good - you can look up information on a topic when you're reading a journal article or look up a definition, etc. You can start with an overview from an encyclopedia and then dive deeper to learn what's new. The problem from a librarian and user point of view is that the best information may come from multiple different publishers and you just won't get that. You won't get a recommendation for someone else's product.

Another thing publishers are doing is to make reference materials more dynamic. First, they can charge you more and more frequently. Second, even if the updates are quite small, it makes the resource more attractive to potential users to have a recent date updated. One publisher in particular has commissioned sort of a portal approach that gathers materials from various places and has commissioned new overviews.

There's a tool to sort of search across more traditional reference materials, but... meh.

Of course if you have a well-developed model of what type of reference tool will have your needed information, then you can use the catalog (subjects like engineering - handbooks, engineering - encyclopedias). Back in the day, I wrote about how senior engineers gathered and created their own handbooks from pieces they'd found useful over time.

So here's where librarians come in. I've never taught the basic undergrad science welcome-to-the-library class (I attended one <cough> years ago), so I really don't know if they go over these distinctions or not. So that leaves our guides to try to get people to the best source of information. Guides that are merely laundry lists of tools by format/type are frowned upon because they are generally not useful. That's what we used to do though: here's a list of dictionaries, here's a list of encyclopedias... etc. What we try to do more now is make them problem based. Somewhat easier in like business: need to understand an industry? need to look up a company? Also maybe in materials science and/or chemistry (although SciFinder and Reaxys' way of doing properties may be supplanting).

Ok, so beyond the difficulty of expressing the value of each of these tools and in which situations they are useful, we have the affordances of our websites and the tools that produce them. Most are database driven now, which makes sense because you don't want to have to go a million places to update a url. Except... one reference might be useful for one purpose in one guide, and another in another, and then how do you get that to display? How do you balance chatty to educate when needed versus quick links for when not?

Also, do you list a digital library collection of handbooks or, more commonly, monographs mixed with handbooks, as a database? As what?

The reviews and overviews and encyclopedias... do you call them out separately? By series?

Users sometimes happen upon reference books from web searches - but that's mostly things like encyclopedias. If they need an equation or a property... well, if they're an engineer they probably know exactly what handbook... so then, I guess, if they don't have their own copy, they would use the catalog and get to the current edition which we may have online. Getting a phase diagram or other property of a material - I'm guessing users would probably start online but for some materials we have entire references (like titanium, aluminum... and then things like hydrazine).

I'm thinking we could have on an engineering guide, a feed from the catalog with engineering - handbooks? Likewise a feed physics-handbooks?  What about things like encyclopedia of optics. Call out "major reference works" and then catalog feed of [subject] - handbooks|encyclopedias|etc....

OR.. hey... what about the shelf display model:

But, instead of all books, just the books for that guide that match [guide name] -- encyclopedia|dictionary|handbook, etc.

What other methods can we use?

