What do you do: publisher wants a higher increase because you use their journals more

(by Christina Pikas) Dec 06 2018

Here's the situation: my larger place of work's electronic resources folks are processing renewals when one digital library comes back and says: your normal increase would be 3%, but since your usage increased 24% since last year, it's a 6% increase. Ladies and gentlemen: we do not have an extra 3% to give them, nor did we agree to a usage-based subscription fee (we think).

First thought was that the pricing model is AFU - what publisher on earth actually wants more money for more usage?

Second thought: whoa - a 24% increase in one year is weird for something that is basically the same and a topic of perennial interest. It's not CRISPR-R-Us or all-AI-all-the-time or anything super hot... just a topic that all of the sci/tech divisions of the larger institution are interested in to some extent. No new programs in the area and no dramatic changes to the content on the platform. So maybe the logs are messed up?

The electronic resources folks got two bits of usage data for us: a file with all the IPs that have downloaded articles 2016-2018 and a file with article identification info and date-time stamp (to the minute only) of when it was downloaded 2016-2018.

First thing to check - are these all our IPs (we've gotten usage logs of other universities in the past from other vendors)? Yes. And interestingly, it splits 1/3, 1/3, 1/3 among the 3 major sci/tech divisions of my larger institution.

Second thing - are they double counting? We can see by visual examination that there are 2 entries with the same time stamp for a single article. There are also a bunch that are 1 minute off. Question: is the double counting consistent across years or does it happen significantly more often in 2018?

So I imported the log into R, then counted up all the records that had the same article identifier as the record below them and a time stamp that was either identical or one minute off. Well, bummer. Turns out the double counting was worse in 2016, the year with the lowest usage. (FWIW, we had a 2% increase in usage from 2016 to 2017, and the data show an 18% increase from 2017 to 2018, not 24%.)
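For the curious, a minimal sketch of that check, assuming the log has been read into a data frame dl with columns article and timestamp (my column names, not the vendor's):

library(dplyr)
library(lubridate)

# toy stand-in for the vendor log: article id + download time (to the minute)
dl <- data.frame(
  article = c("a1", "a1", "a1", "a2", "a2", "a2"),
  timestamp = ymd_hm(c("2017-03-01 10:00", "2017-06-05 14:12",
                       "2018-03-01 10:00", "2018-03-01 10:00",
                       "2018-03-01 10:01", "2018-09-04 09:00"))
)

# flag records whose predecessor is the same article within 1 minute
dl <- dl %>%
  arrange(article, timestamp) %>%
  mutate(dup = article == lag(article) &
           abs(as.numeric(difftime(timestamp, lag(timestamp), units = "mins"))) <= 1)

# compare the share of likely double counts by year
dl %>%
  group_by(year = year(timestamp)) %>%
  summarise(doubles = sum(dup, na.rm = TRUE), total = n())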

Third thing - is this one of the publishers that's weird about reserves?  I'm not at all fluent in that since I don't work with students, but we can see articles with 230 downloads all clustering around the beginnings of semesters (August-September, January). So... ?
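A quick way to eyeball that reserves pattern, reusing the toy dl sketch from above: count downloads per article per month and see whether big counts pile up in August-September and January.

# months with reserve-like spikes float to the top
dl %>%
  count(article, month = floor_date(timestamp, "month"), sort = TRUE)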

Another thought: we tell people to just send around the link and everyone should download their own copy. Sounds like a bad idea for this publisher!

This isn't a done deal but I'm not sure beyond looking at the license what we can do. We're definitely going to push back. Wish us luck!

2 responses so far

Scientopia Fundraising

(by Christina Pikas) Oct 26 2018

Quick off-topic comment - if you'd like to support Scientopia, you can find a button on my right hand sidebar to do so.

No responses yet

Filling in missing data: books

(by Christina Pikas) May 17 2018

Photo by Janko Ferlič on Unsplash

My normal area for information analysis and bibliometrics is technology - not even really science, but technology. The current project I'm obsessing over (it's actually really cool and interesting and fun) spans political science, history, philosophy, and even some sociology and criminal justice. So I played all my reindeer games on the journal articles, but when it came to analyzing the abstracts, a lot were missing. I wanted to ignore these, but they included whole collections from extremely relevant journals and also nearly all of the books. For the journals, I filled in what I could from Microsoft Academic (no endorsement intended).

Books though.

I hoped I could use WorldCat, but my library doesn't have a cataloging subscription with them, so I don't think so. I hoped LoC would work - I could download ALL records from before about 2014, which would help a lot, but wow, I can't really deal with that much data right now. OpenLibrary - no.

Amazon - they've changed the rules a lot, and it wasn't clear to me that, even if I took all the time to figure it out, I wouldn't actually be breaking the current terms of service (they have changed).

I asked the wonderful group on the Code4Lib list - one of them (thanks Mark!) pointed to Harvard's API and records. Woo-hoo! I hope they log my stats to justify their expense. I hope my messing around doesn't cause them any problems.

I'm not a cataloger, although I played one for 4 months some 15 years ago. And I don't know MODS or Dublin Core (although they did exist when I was in library school). I pulled a couple of records and poked through them to see where what I wanted was located. Originally, I pulled the title from the record, but that proved to be too messy.

The data I needed to fill in:

  • gathered from reference managers of a team of analysts over a number of years (as many as 9)
  • searched in general (WoS, Scopus) and specific (World Political Science Abs, PAIS, etc) databases
  • gathered in EndNote, de-duped, exported as tab-delimited

in R

  • limited to books and book sections
  • for titles, removed punctuation, made lower case, removed numbers
  • for authors, split to take the first name provided for the first author (mostly the last name)
  • for ISBNs, removed ISSNs, kept the first one, took out hyphens (yes, to be more thorough, I should have OR'd them all) - a sketch of these steps follows below
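Here's a rough sketch of that cleanup. The column names (X4 for title, X1 for authors, X11 for ISBN/ISSN) are my assumptions about the tab-delimited EndNote export; yours may differ.

# titles: lower case, strip punctuation and digits
missingbooks1$holdti <- gsub("[[:punct:]0-9]", "", tolower(missingbooks1$X4))

# first author: keep what comes before the first comma or semicolon
missingbooks1$hold1au <- sub("[,;].*$", "", missingbooks1$X1)

# identifiers: drop hyphens, keep the first one listed, and throw out
# 8-digit ISSNs (real ISBNs are 10 or 13 digits)
ids <- sub("[ ;].*$", "", gsub("-", "", missingbooks1$X11))
missingbooks1$holdisbn2 <- ifelse(nchar(ids) %in% c(10, 13), ids, NA)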

 

# assumes library(httr) and library(jsonlite) are loaded; harvardurl should
# point to Harvard's item search endpoint - I believe LibraryCloud's
# https://api.lib.harvard.edu/v2/items.json, but check their documentation

getharvabs <- function(citeNo){
  # fallback row in case the request fails or nothing is found
  citerow <- data.frame(id = citeNo,
                        TI = missingbooks1$X4[citeNo],
                        AB = NA)
  # prefer the ISBN lookup; otherwise fall back to title + first author
  if (!is.na(missingbooks1$holdisbn2[citeNo])){
    query <- paste0("identifier=", missingbooks1$holdisbn2[citeNo], "&limit=1")
  } else {
    query <- paste0("title=", missingbooks1$holdti[citeNo],
                    "&name=", missingbooks1$hold1au[citeNo],  # original said "&name" with no =
                    "&limit=1")
  }
  holding <- GET(harvardurl, query = query)
  print(holding$status_code)
  if (holding$status_code == 200){
    # content comes back raw, so convert to text before parsing the JSON
    bookContent <- rawToChar(holding$content)
    booktxt <- fromJSON(bookContent)
    if (booktxt$pagination$numFound > 0){
      # the MODS record may or may not carry an abstract
      ab <- booktxt$items$mods$abstract$`#text`
      if (is.null(ab)) ab <- NA
      citerow <- data.frame(id = citeNo,  # one branch originally had id = i, a bug
                            TI = missingbooks1$X4[citeNo],
                            AB = as.character(ab))
    }
  }
  print("this is citerow")
  print(citeNo)
  print(citerow)
  return(citerow)
}


# seed data frame so rbind has something to bind to (drop this dummy row later)
holdbookabs<-data.frame(id=as.integer(0), TI="test", AB="test")


# loop over all the missing books, pausing between requests to be polite
for (i in seq_along(missingbooks1$holdti)){
  temp<-getharvabs(i)
  holdbookabs<-rbind(holdbookabs,temp)
  Sys.sleep(2)
}

This choked a couple of times. Maybe I don't always have the right book - but then again, if it's on the same subject, close enough. Meh.
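One way to keep an occasional bad response from killing the whole run is to wrap the call - a sketch, not what I actually ran:

# fault-tolerant version of the loop: log the failure, fill with NA, move on
for (i in seq_along(missingbooks1$holdti)){
  temp <- tryCatch(getharvabs(i),
                   error = function(e){
                     message("record ", i, " failed: ", conditionMessage(e))
                     data.frame(id = i, TI = missingbooks1$X4[i], AB = NA)
                   })
  holdbookabs <- rbind(holdbookabs, temp)
  Sys.sleep(2)
}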
I also considered getting the LCSH, but decided against. Too much time spent on this already.

No responses yet

Getting abstracts back from Microsoft Academic's Inverted Index

(by Christina Pikas) May 14 2018

Some time ago, I posted about using Microsoft Academic to fill in missing data from other searches. Jamie and I were going to do a package to wrap the API, but bureaucracy more or less killed our enthusiasm (well, not his, that would be impossible).

Here I am obsessing over a really, really cool bibliometrics project, and have lots of citations missing abstracts. I'm sort of thinking I won't be able to do much with the books even though catalogs seem to have descriptions for a lot of them (happy to take suggestions). I've already looked at using other sources, so I'm back at Academic.

Pulled out my script. Found I'd lost my keys, retrieved new keys, and found there's a new endpoint URL; updated that and hit go....

Ladies and gentlemen, they moved the abstracts... no more paragraph; you now get an "inverted index." People who have studied information retrieval may know what that is, but in this case it's a list of terms, each with a numeric vector of the positions at which the term appears. Stop words are included, so "the" might have 20 positions and "sassafras" just 1.
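To make that concrete, here's a toy version (my own illustration, not Academic's exact response format): each term maps to the zero-based positions where it occurs, so rebuilding the text just means dropping each term back into its slots.

# toy inverted index for "the cat sat the mat"
inv <- list("the" = c(0, 3), "cat" = 1, "sat" = 2, "mat" = 4)

# one slot per position; fill each term into its positions (R is 1-based)
words <- character(max(unlist(inv)) + 1)
for (term in names(inv)) words[inv[[term]] + 1] <- term
paste(words, collapse = " ")
# [1] "the cat sat the mat"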

Here it is. Jamie helped with the strategy and the rest comes from lots of searches:

 

library("httr")

library("tm")
library("jsonlite")
library("reshape2")
library("plyr")
library("glue")

setwd("I:/Christina's/")

options(stringsAsFactors = FALSE)



#keys- put yours in:
msakey1<-""
msakey2<-""

#current 05142018 url
#https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate[?expr][&model][&count][&offset][&orderby][&attributes]

apiurl<-"https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate?"

#sample search left in
searchexpr<-"Ti='identity as a variable'"
apiattrib<-"Ti,Y,AA.AuN,AA.AfN,C.CN,J.JN,E"

#test on one to see how it works
testcite <- GET(apiurl,
                query = list(expr = searchexpr, count = 1, attributes = apiattrib),
                add_headers("Ocp-Apim-Subscription-Key" = msakey1))


#could look for errors first
testcite$status_code

#get the json out into usable format
#comes out raw so need to make into text

testciteContent <- rawToChar(testcite$content)
  
test<-fromJSON(testciteContent)
test$entities$AA
test$entities$AA[[1]]$AuN
#this will get a ; separated vector
paste(test$entities$AA[[1]]$AuN, collapse = ';')

test$entities$AA[[1]]$AfN
test$entities$J$JN
test$entities$Y
test$entities$Ti

###################
#use the following to get an abstract from the inverted index

# E comes back as a JSON string inside the JSON, so parse it a second time
holdE<-fromJSON(test$entities$E)

# named list: each term maps to a vector of its positions in the abstract
testII<-holdE$IA$InvertedIndex

# quick exploratory look at the structure
testII %>% do.call(rbind, .)

# melt to long form: one row per (position, term) pair, then de-dupe
testII.m<-melt(testII)
testII.m<-unique(testII.m)

# sort by position - and assign the result (the unassigned arrange() was a bug)
testII.m<-arrange(testII.m,value)

# stitch the terms back together into the abstract
# (collapse() is from glue; newer versions call it glue_collapse())
ab<-collapse(testII.m$L1, sep=" ")

####################

2 responses so far

Proliferation of free databases of literature

(by Christina Pikas) Apr 27 2018

Google Scholar by now is the big player in free places to search across the literature. There are some oldies that are more specific in coverage: CiteSeer, DBLP, etc. And of course the big government specialty databases: PubMed, ERIC, TRID...

But now there are lots of other choices, which is mostly a good thing:

Microsoft Academic is back being actively developed. There's an API which allows some free usage and you can download the whole thing.

BASE https://www.base-search.net/about/en/ - this has been around for a while and people really do like it

Lens https://www.lens.org - also looks at patents

1findr www.1findr.com - has a free and a paid version. They have a commercial product to help you locate open access when available

Dimensions https://app.dimensions.ai - also free and paid versions

 

Maybe SHARE https://share.osf.io/ is also in this category? And ScienceOpen https://www.scienceopen.com/?

 

As I say a lot, the problem isn't a lack of content; it's more one of precision and efficiency. Also access to quality, clean data in quantity.

 

Follow Aaron Tay, on this specifically and in general - he's keeping an eye on the space.

 

Editing to add: Ryan Regier (also keeping an eye on the space and worth a follow) adds:

Science Open

Semantic Scholar

Scilit

Core

Maybe also could add Mendeley - although it's built differently.

No responses yet

The MacRoberts cite this blog (sorta)

(by Christina Pikas) Jan 22 2018

So a quick check of Twitter led me to this very true tweet of Jason's:

 

I've read a bunch by the authors (and cited them - particularly the work that deals with informal scholarly communication), but I hadn't read the current one:

MacRoberts, M.H. and MacRoberts, B.R. (in press). The mismeasure of science: Citation analysis. Journal of the Association for Information Science and Technology. doi: 10.1002/asi.23970

It deserves in depth treatment, but check this out:

Screen shot showing them referencing this blog. ETA: It's a cite to a comment by Croft at the current URL: http://christinaslisrant.scientopia.org/2010/01/09/very-quick-note-on-things-that-are-used-but-not-cited/ (the old URL does redirect). (So not to me reviewing their article, but to a commenter - but still!)

No responses yet

An ephemeral platform, used for other than ephemeral, and the death of Storify

(by Christina Pikas) Dec 15 2017

As I say in my dissertation and elsewhere, informal scholarly communication in social media is both ephemeral and archival. Maybe this is new in that some online traces intended for a limited number of recipients and for immediate use end up having a longer life and wider reach. Some utterances in social media live on well after the originator intended (for good and bad). But maybe it's not entirely new, as letters among scientists have certainly been preserved (some of these were no doubt sent specifically for preservation purposes).

I've long been a fan of blogs for personal knowledge management, that is, thinking through readings, partial results, tutorials for how to do things. Blogs are easily searched, archived, migrated, shared, and don't enforce an artificial or proprietary structure found in other tools. However, I also know that long-term bloggers who have established a readership through careful, well-edited posts impose new barriers on themselves for using their blogs for this purpose. I found in my studies that some superstar bloggers almost entirely stopped blogging because they didn't want to post anything incomplete or partial and there were too many other things to do.

I think this has been one of the motivating factors for the use of Twitter for long threads of stories and analysis. Twitter has great reach and immediacy, and interactivity... but at the expense of search (although it is certainly better than it was) and preservation. Who among us hasn't dug through our likes and RTs to try to find something interesting we saw ages ago?

We're using a platform specifically built for ephemeral communication for communication that should be saved and preserved.

So individuals who value this knowledge management function, or who appreciate careful analysis or good storytelling serialized over tens of tweets, have adopted Storify to gather, order, preserve, and contextualize the pieces. Storify added tools to make it a bit easier. Instead of Storify, you could embed individual tweets (though the embedding function calls back to Twitter, so it doesn't really preserve anything). You could <eek> screenshot. You could even just write it up and quote the text.

And Storify is going away this Spring. We do have notice, luckily, but we still have a problem. We need to back our stuff up - we need to back other people's stuff up. Not everything is of the same value to the originator as it is to someone else.

My plea - and it will go unheard - is to put things back into blogs which you then tweet. Or back your useful tweets up to a blog?

FWIW, I'm trying to capture AGU meeting tweets and I'll load them into FigShare... but the odds of some researcher capturing and saving your stuff are actually quite slim.

This post was inspired by a tweet that has a thread and interesting points by her interlocutors:

 

3 responses so far

More evidence for the discovery layer as pile of crap metaphor

(by Christina Pikas) Dec 04 2017

this Cambridge University Report (pdf) via Aaron Tay

page 16:

The key insight was the expectation from users that the simple iDiscover search function would automatically return a list of results as sophisticated and relevant as they would expect from other, more powerful search platforms. This led to frustration when, for example, a search for a journal title returned a number of articles and other results before the link to the journal holdings and links to online access. At this point, when asked what they would do next, many of our participants answered by saying that they would start using another search tool.

 

Some of the problems stemmed from a mismatch with users' perception of the tool (as a catalog):

page 18

“Book reviews above books just don’t make sense!” (Archaeology PhD student)
“When looking for a book, you’ll end up with a random science article.” (English undergraduate student)
“If you search for a title that only has a few words in it, even if you type it in correctly, other less relevant titles will come up first.” (Education MEd student)

 

page 22

When asked what was most important to them in terms of platforms used to search for information resources, the words ‘relevance’ and ‘relevant’ were used by a large number of our participants. This was directly linked to a desire for seamless, efficient searches which yielded appropriate and useful results, without the need to use pre- or post-search options to limit or refine them. People were often frustrated at the lack of percieved [sic] relevancy in the initial results list, after having used the main iDiscover search function

[lol, we had a vendor here to help us get our enterprise search going many moons ago... they said "relevance is dead!" I was like "nope!"]

One response so far

No, vendor, we don't want a pile of crap actually

(by Christina Pikas) Dec 02 2017

Large Copper Dung Beetle (Kheper nigroaeneus) on top of its dung ball https://commons.wikimedia.org/wiki/File:Large_Copper_Dung_Beetle_(Kheper_nigroaeneus)_on_top_of_its_dung_ball_(12615241475).jpg

Yes, I have posted about this a number of times, and no, this will probably not be too different. Our vendors have swept up the smaller competition and then redone their boutique databases to make them - generally - work like piles of crap.

So there are two massive 3rd party aggregators that sell massive piles of crap. Don't get me wrong, these are super attractive to libraries, who can then say: look at all these titles we cover! Look at how much content we have! The problem is that with our current state of information abundance - lots of big package deals, more and more open access, and informal scholarly sharing < cough > - getting the full text of recent articles from big name journals really isn't the hard part.

The real need is efficient, precise, thorough, appropriate information at the right time and place. I say: I need exactly this piece of information! The aggregators go: here's a massive pile of crap! I'm like, well, I don't need a pile of crap, I need exactly this thing. System returns: here's another pile of crap!

Look at the Aerospace database, for example. It used to be the only real database that covered hypersonics and was at all thorough at covering AIAA and NASA technical reports. It was CSA's when I got to know it. Compendex, in comparison, is only adding AIAA content this year and isn't going back to the 1960s. The CSA databases got sold to ProQuest. I have no idea what the hell they've done with it, because every time I do a search I end up with trade pubs and press releases - even when I go through the facets to try to get rid of them.

CSA used to have a computer science database, too. The current computer collection in ProQuest doesn't even allow affiliation searching. Also, a search I did there yesterday - for a fairly large topic - didn't return *any* conference papers. For CS. Really.

This is not to pick on PQ - ok, maybe it is - but their competitors really aren't any better.

 

At the same time, people at my larger organization keep telling us that we *must* get/have a discovery layer. Let me just tell you again that we did a lot of testing, and they did not provide us *any* value over the no-additional-cost search of a 3rd party aggregator. They are super expensive, and really just give you - guess what - all your stuff in a huge pile of crap. I hear nothing but complaints from my colleagues who have to deal with these. The supposition was that we wanted a Google interface. Ok, maybe a sensible quick search is fine, but that only works when you, like Google, have extremely sophisticated information retrieval engines under the hood. Saying - hey, we cover the same journals as your fancy well-indexed database, but without the pesky indexing, and also lumped together with things like newspapers, press releases, and trade pubs... is not really effective. It's a pile of crap.

You may say, "But think of the children!" The poor freshman dears who can't search to save their lives and who just need 3-5 random articles after they've already written their paper just to fill in their bibliography due in the morning....

Is that really who and what we're supporting? Should we rather train them in scholarly research and how to get the best information? And anyway, for my larger institution, we hardly have any freshmen at all.

No, vendors, we do not want a large pile of crap, but thanks for offering!

2 responses so far

Welcoming Confessions of a Science Librarian to Scientopia!

(by Christina Pikas) Nov 28 2017

I'm pleased to point to John's new home here: http://confessions.scientopia.org/ 

His first post rounding up best science books is live already. We'll get him linked from the home page - but check it out!

2 responses so far
