Archive for the 'information retrieval' category

Proliferation of free databases of literature

Apr 27 2018 Published by under information retrieval

Google Scholar is by now the big player among free places to search across the literature. There are some oldies that are more specific in coverage: CiteSeer, DBLP, etc. And of course there are the big government specialty databases: PubMed, ERIC, TRID...

But now there are lots of other choices, which is mostly a good thing:

Microsoft Academic is back being actively developed. There's an API which allows some free usage and you can download the whole thing.

BASE - this has been around for a bit and people do really like it

Lens - also looks at patents

1findr - has a free and a paid version. They have a commercial product to help you locate open access when available

Dimensions - also has free and paid versions


Maybe SHARE is also in this category? And ScienceOpen?


As I say a lot, the problem isn't finding enough; it's more one of precision and efficiency. There's also the matter of access to quality clean data in quantity.


Follow Aaron Tay specifically, and in general - he's keeping an eye on the space.


Editing to add: Ryan Regier (also keeping an eye on the space and worth a follow) adds:

ScienceOpen

Semantic Scholar



Maybe also could add Mendeley - although it's built differently.

Data point on if we need expensive specialty research databases

Aug 17 2017 Published by under information retrieval, libraries

***NOTE: Not endorsing any commercial products***

A search for "quantum computing" in the Abstract, Title, Keywords in Scopus yields 6,415 records

A search for DE (quantum computing) in Inspec yields 20,403.   By DE I mean descriptor - "quantum computing" has been in Inspec's thesaurus since 1999. Use the correct formatting for your platform.

One really quick insight: the US is the top publishing country in Scopus, while China is the top in Inspec.

Getting article metadata from MS Academic: some R code

As promised, I went back and did this myself instead of relying on a partner in crime (earlier referred to as an SME but he outed himself). It's funny because I had his code, but he did things differently than I do them so I needed to do it myself.

On the first mostly successful run I ended up with about 44% of the rows missing the metadata. I discovered fairly quickly that using tm's removePunctuation was, of course (in retrospect), closing up intraword dashes instead of leaving a space. You can have it ignore those, but you can't have it go ahead and leave a space. I first did some finding and replacing in Excel, but that only got me down to 32%. Then I was like, duh, just do the gsub for [[:punct:]] and see if that's better. I hope I haven't used my quota!
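To make the punctuation problem concrete, here is a minimal base R sketch. The sample titles and the clean_title helper are made up for illustration, and collapsing the doubled spaces is an extra step that is not in the code from this post. The "closed up" version only imitates the behaviour described above; it does not call tm.

```r
# made-up example titles, not from the real title list
titles <- c("A State-of-the-Art Review: Quantum Computing",
            "Human-computer interaction (HCI), revisited")

# deleting punctuation closes up hyphenated words
closed_up <- gsub("[[:punct:]]", "", tolower(titles))

# replacing punctuation with a space keeps the words apart
clean_title <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)
  x <- gsub("\\s+", " ", x)   # extra step: collapse the doubled spaces
  trimws(x)
}
spaced <- clean_title(titles)

print(closed_up)
print(spaced)
```

In the first title, "state-of-the-art" turns into one long token when the punctuation is deleted, which is presumably why a title lookup would miss it.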

Here's the code. Sign up for your key here. Also note: not affiliated, not endorsing.

#microsoft academic to try to find affiliations for article titles

library(httr)
library(tm)
library(jsonlite)


#don't forget the following or you will regret it
options(stringsAsFactors = FALSE)

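# api info
# NOTE: the definitions of apiurl and apiattrib are missing from this copy of the post
apiurl<-"put the API evaluate endpoint here"
# assumed from the fields extracted further down; check against the API documentation
apiattrib<-"Ti,Y,AA.AuN,AA.AfN,J.JN,C.CN,E"

msakey1<-"put yours here"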


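#test on one to see how it works
#searchexpr isn't defined yet at this point in this copy of the post, so set one by hand
searchexpr<-"Ti='put a lower case title with no punctuation here'"
testcite <- GET(apiurl, 
         query = list(expr = searchexpr,count = 1, attributes = apiattrib), add_headers("Ocp-Apim-Subscription-Key"= msakey1))

#get the json out into usable format
#could look for errors first

#comes out raw so need to make into text
testciteContent <- rawToChar(testcite$content)
#then parse the json (this line is missing from this copy; restored because test is used below)
test <- fromJSON(testciteContent)

#this will get a ; separated vector
paste(test$entities$AA[[1]]$AuN, collapse = ';')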


# initiate a dataframe
# for each title, go out and search using that title
# could add in a warn_for_status(r)  when status is not 200 (happy)
# if status !200 go to the next one,  if status =200
# extract ti, y, authors (paste), affil (paste), jn, cn, and out of entities VFN, V, FP LP DOI D
# write them to the data frame
#1904 is the length of my article title list

CitesOut<- data.frame(ti = rep(NA,1904),
                      y = integer(1904),
                      au = rep(NA,1904),
                      af = rep(NA,1904),
                      jn = rep(NA,1904),
                      cn = rep(NA,1904),
                      vfn = rep(NA,1904),
                      v = rep(NA,1904),
                      fp = rep(NA,1904),
                      lp = rep(NA,1904),
                      doi = rep(NA,1904),
                      abs = rep(NA,1904),
                      stringsAsFactors = FALSE)
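
getMScites<- function(citeNo){
  # NOTE: parts of this function are missing from this copy of the post. The start of
  # the GET call, the fromJSON steps and the if/else are a reconstruction; verify them.
  na<-function(x) if (is.null(x)) NA else x
  searchexpr<- paste0("Ti='",TitlesToFindf[citeNo],"'")
  holding<-GET(apiurl,
               query = list(expr = searchexpr,count = 1, attributes = apiattrib), 
               add_headers("Ocp-Apim-Subscription-Key"= msakey1))
  print(paste("cite number", citeNo,"status is:", holding$status_code))
  holdingContent <- rawToChar(holding$content)
  holdC<-fromJSON(holdingContent)
  cciterow<-data.frame(
    ti=na(holdC$entities$Ti), y=na(holdC$entities$Y),
    au=ifelse(is.null(holdC$entities$AA[[1]]$AuN),NA,paste(holdC$entities$AA[[1]]$AuN, collapse = ';')),
    af=ifelse(is.null(holdC$entities$AA[[1]]$AfN),NA,paste(holdC$entities$AA[[1]]$AfN, collapse = ';')),
    jn=na(holdC$entities$J$JN), cn=na(holdC$entities$C$CN), stringsAsFactors=FALSE)
  # assumption: E (the extended metadata) comes back as a JSON string
  if (is.null(holdC$entities$E)) {
    eciterow<-data.frame(vfn=NA,v=NA,fp=NA,lp=NA,doi=NA,abs=NA)
  } else {
    e<-fromJSON(holdC$entities$E)
    eciterow<-data.frame(vfn=na(e$VFN),v=na(e$V),fp=na(e$FP),lp=na(e$LP),doi=na(e$DOI),abs=na(e$D),stringsAsFactors=FALSE)
  }
  citerow<-cbind(cciterow,eciterow, stringsAsFactors=FALSE)
  print("this is citerow")
  print(citerow)
  citerow
}
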
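#try a single title on its own
#(the start of this GET call is missing from this copy; restored to match the test call earlier)
searchexpr<- paste0("Ti='",TitlesToFindf[4],"'")
testcite <- GET(apiurl,
               query = list(expr = searchexpr,count = 1, attributes = apiattrib), 
               add_headers("Ocp-Apim-Subscription-Key"= msakey1))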



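#prepare title list
## IMPORTANT - all the titles have to be lower case and there can't be any punctuation
TitlesToFind <- read.delim("~/DataScienceResearchInitiative/TitlesToFind.csv", header=FALSE)

#the lower casing line is missing from this copy; this assumes the titles are in the first column
TitlesToFindl<- tolower(TitlesToFind$V1)

TitlesToFindf<- gsub("[[:punct:]]"," ",TitlesToFindl)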


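#use the sys.sleep so you don't get an error for too many requests too quickly
#NOTE: the loop bodies are missing from this copy; the bodies, the pause length and
#the definition of missCites below are a reconstruction
for (i in 21:1904){
  CitesOut[i,]<-getMScites(i)
  Sys.sleep(2)
}

#go back for the rows that came back empty, writing each one to its own row
missCites<-which(is.na(CitesOut$ti))
for (i in seq_along(missCites)) {
  CitesOut[missCites[i],]<-getMScites(missCites[i])
  Sys.sleep(2)
}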

edited: to fix formatting. also the missing cites were writing to the wrong rows.. sigh.

Notes from Dan Russell Advanced Skills for Investigative Searching

This class was held 4/15/2016 at University of Maryland at the Journalism School, hosted by the Future of Information Alliance. Some information is here. Slides are here. Updated Tip Sheet is here.

I've previously taken his MOOC and enjoyed tips on his blog but things change so quickly it was good to get an update.

Of course I didn't bring my laptop so... these are from handwritten notes.

  • Capitalization doesn't matter, except for OR, where it's crucial. Don't use AND; it doesn't do anything.
  • Diacriticals do matter. e and é are basically interchangeable but a and å are not. (it does offend native speakers of countries that use these....)
  • If you need to search for emoji you'll have to use Baidu. This is relevant searching for businesses in Japan, for example
  • filetype: works for any extension. If you're looking for datasets you might use filetype:csv. Regular Google searches don't search docs; you'll need to search them separately.
  • site:  it's different if you use,, or . To be most general use that . after the : acts like a * if there are subdomains
  • There is no NOT. Instead use -<term>.  No space between the minus and the term.
  • Synonyms are automatic. Use quotes around a single term to search it verbatim (also turns off spell check for that term). If quotes are around a phrase, it does not do a verbatim search.
  • There are no stop words
  • inurl:   ... this is useful if pages have a certain format like profile pages on Google Plus
  • If you want to get an advanced search screen, click on the gear to select it. The gear is in the upper right hand corner. That's the only way to get limiting by region (region limiting isn't always domain), number search, and language search. Some advanced search things can also be gotten by using dropdown boxes after searching or by using things like inurl: and filetype:
  • related:<url> gets you sites with term overlap (not linking/linked similarity).
  • Google custom search engine  - lets you basically OR a bunch of site: searches to always search across them.
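Several of the operators above can be combined in one query string. Here is a small R sketch of that. The build_query helper and the example.org / example.edu domains are made up for illustration, and it only pastes text together; it doesn't call any Google service.

```r
# hypothetical helper: combine terms with site:, filetype: and -term operators
build_query <- function(terms, sites = character(0), filetype = NULL,
                        exclude = character(0)) {
  parts <- terms
  if (length(sites) > 0) {
    # OR has to be upper case
    parts <- c(parts, paste(paste0("site:", sites), collapse = " OR "))
  }
  if (!is.null(filetype)) {
    parts <- c(parts, paste0("filetype:", filetype))
  }
  if (length(exclude) > 0) {
    # no space between the minus and the term
    parts <- c(parts, paste0("-", exclude))
  }
  paste(parts, collapse = " ")
}

q <- build_query("rainfall data",
                 sites = c("example.org", "example.edu"),
                 filetype = "csv",
                 exclude = "forecast")
print(q)
```

A custom search engine does the site: part for you every time; the sketch is just the by-hand version.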

Image Search

  • Tabs across the top of results for topic clusters found
  • Search by image - click on camera and then point to or upload image. Can drag an image in or control click on an image. After search can then add in terms to narrow to domain.
  • Example - find a tool in the basement, take a picture on a white background with it in a normal orientation, then search to find it in catalogs, etc.
  • Crop images to the salient bit.
  • On mobile devices the standard search is actually a Google appliance search - not as powerful. Open Chrome and search from there if you need more.

Other notes

  • Things are changing all the time because of adversarial arrangements with optimization people.
  • link:   was removed this week.
  • results are an estimate. When you narrow you sometimes get more results because it starts by searching only the first tier of resources. First tier has millions of results in it - and the ones that have been assessed as highest quality. If it doesn't find enough in the first tier - like when you narrow a lot - it will bump down to the second tier with like billions more results
  • consider using alerts.
  • to find any of these services - just Google for them
  • Google Trends is interesting. You can narrow by time or region. Also look for suggestions when searching. You can search for an entity or for a search term. Remember that trends are worldwide.
  • Google Correlate - example: Spanish tourism authorities want to know what UK tourists are looking for. Find the search for Spain and tourism, and see what keywords used by UK searchers correlate.
  • Country versions are more than just languages. Consider using a different country version to get a different point of view.
  • Wikipedia country versions are useful for national heroes and also controversial subjects (example: Armenian genocide)
  • define   (apparently no : needed)

I think all librarians should probably take his class. Good stuff.

Search as Conversation

Mar 11 2016 Published by under information retrieval

Not a new idea but seemingly ignored by research databases, no?

I just read: Beyond algorithms: Optimizing the search experience. Making search smarter through better human-computer interaction. by Daniel Tunkelang (yes, I am turning into his fan girl, but it's because he does have interesting things to say!) posted in October 2015.

I immediately wanted to bookmark, tweet, e-mail, print and wave it around... yes, this.

Some search tools - like Google and supposedly* like Siri with whatever lies beneath - do take a series of queries together to try to answer a bigger question.

Our databases do have facets. Some also have type ahead or auto suggest but the results are often hilarious and are not using query understanding techniques but just matching terms off a frequency list.

The one search box but then segment the experience - I think this is where bento is trying to go... but doesn't really? We can for sure do better.

Anyway. Read the blog post.


*my Siri has gotten stupider. It really has. It used to provide better results.

Searching Scopus by Date Added to the Database

In my previous post, I complained that my metrics weren't comparable over the course of a few months, even for articles published in 2009.

I looked in the instructions, and I couldn't find anything that discussed searching by date added to the database. I looked at all the fields on the detailed view and there wasn't anything to help. No accession number. No date added. Hmph.

So I started to think about the alerts I had set up. When you click through "view all new results in Scopus", you get a search like so:

(AFFIL(my place of work)) AND ORIG-LOAD-DATE AFT 1390059048 AND ORIG-LOAD-DATE BEF 1390674349

Huh. So I wondered... can you just find the right AFT and search in advanced search for that? Yup. Sure can!

What are these crazy numbers though? (most people will know right away - I didn't, and I should have). So I looked around - no I didn't have any from that time period to use. I chatted with the Scopus help and they insisted 1) can't search on that field (I told them I already proved you could) 2) it was part of the alert system and not part of the database (????) 3) they couldn't give me the numbers for the time period I want, because you can't search for them anyway.

So then I asked LSW, and the brilliant Deborah and the as brilliant but time-delayed Meg told me it was Unix time - seconds since 1/1/1970. Stephanie also provided me with a search string from early January (thank you!). I read about that in R, but Deborah even linked to an online converter and boom - Bob's your uncle.

So, if you want to find articles added to the database before or after a certain time, convert the time to Unix time and then use it with ORIG-LOAD-DATE AFT or ORIG-LOAD-DATE BEF in an advanced search, as in the alert search above.
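If you'd rather not use an online converter, the conversion can be sketched in base R. The helper names (to_unix, from_unix, load_date_query) are made up for illustration, and treating the timestamps as UTC is my assumption. The two numbers are the ones from the alert search above.

```r
# date-time string (UTC assumed) to seconds since 1970-01-01
to_unix <- function(date_string) {
  as.numeric(as.POSIXct(date_string, tz = "UTC"))
}

# and back again, to see what a number from an alert means
from_unix <- function(secs) {
  format(as.POSIXct(secs, origin = "1970-01-01", tz = "UTC"),
         "%Y-%m-%d %H:%M:%S", tz = "UTC")
}

# build the piece to AND onto an advanced search
load_date_query <- function(after, before) {
  sprintf("ORIG-LOAD-DATE AFT %.0f AND ORIG-LOAD-DATE BEF %.0f",
          to_unix(after), to_unix(before))
}

print(from_unix(1390059048))
print(load_date_query("2014-01-18 15:30:48", "2014-01-25 18:25:49"))
```

Read as UTC, the two numbers in the alert work out to a one week window in late January 2014, which fits a weekly alert.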

adding 5/7: I was contacted by Scopus - I would like to post detailed information from the e-mail but haven't gotten permission. She did verify that this search will work, but only so far back. That information isn't kept indefinitely. Also, you can use RECENT(n) where (n) is the number of days. You can AND that on to any advanced search.

Looking at ROI or at least financial justification for SDI or alerting services

Dec 30 2013 Published by under information retrieval, libraries

Used to be that special libraries were the ones always asked to show their return on investment and justify budget expenditures - it was obvious that universities needed libraries and they were judged on the number of volumes (like that was ever a sensible metric for the value of a school!). In the past decade or so public libraries have been under more pressure to show ROI and they do so by showing economic benefits to the community from having library services in an area (there are also many more dimensions used in a nuanced way - see iPac's work). There's a nice (if a tad dated) review by Cory Lown and Hilary Davis here:

The term SDI - selective dissemination of information - was coined in the 60's but no doubt librarians have always performed this function. Whether formally or informally, we are asked to keep a look out for things of interest to our customers/users/patrons, etc., and bring them to their attention. Formally, we might have a budget we charge our time or even resources to and we do a detailed reference interview in which we establish the topic but also the type of information desired, the treatment, time period, frequency, and some gauge of whether the person would rather have some junk but be less likely to miss something (high recall) or is ok with being more likely to miss something but wants only things that are likely to be very relevant (high precision).
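For anyone who hasn't run into the recall/precision trade-off, here's a tiny R sketch. The record numbers are made up and precision_recall is just an illustrative helper: precision is the share of what the alert sent that was relevant, and recall is the share of the relevant items that the alert actually caught.

```r
# retrieved: ids the alert sent; relevant: ids the customer would have wanted
precision_recall <- function(retrieved, relevant) {
  hits <- length(intersect(retrieved, relevant))
  c(precision = hits / length(retrieved),
    recall = hits / length(relevant))
}

relevant <- c(1:9, 25)   # ten relevant records (made-up ids)

# broad alert: 20 records sent, catches 9 of the 10 relevant ones
broad <- precision_recall(retrieved = 1:20, relevant = relevant)

# narrow alert: 5 records sent, all relevant, but half are missed
narrow <- precision_recall(retrieved = 1:5, relevant = relevant)

print(broad)
print(narrow)
```

The broad alert here has more junk but misses little; the narrow one sends no junk but misses half. The sketch doesn't guard against an empty result set.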

With this information the librarian might just set up a search in a database, tweak it a bit, and then walk away. She might have the results come to her and weed them every week. Alternatively, she might actually schedule some time to read the news and look out for things and then write up summaries.  Informally, it might be just that a librarian knows a customer's interests and if/when she sees something, she forwards it.

Once a database alert has been set up and is running, further intervention is only needed if the database vendor changes or if there's a request. The problem with this is that the end customer can (and often will) forget how they came to get this useful little email every week. We found when we needed to clean up our Dialog account that there were alerts from the former librarian who died in maybe 2002 or so (before I got here). They were super useful to the users and they passed them around within their group, but we were able to re-write them using our current site license to the database and save that money. If there wasn't a bill, we wouldn't have known and certainly those engineers had forgotten.

So what if one of those alerts had a gem in it that the recipient wouldn't have heard about otherwise and that caused them to start a new research program or innovate on a project or save money or .... ? Would the library, or more importantly, the people who pay for the library ever hear about it? No.

The informal mode, in which we keep an eye out for customers, can be really hit or miss. Sometimes there's all kinds of interesting things going on and other times there's nothing. Maybe we point out 100 things of interest for 1 home run. Maybe allowing ourselves the time to look - to read the news, the industry newsletters, the science magazines (such as society member magazines like Physics Today, Eos, etc.) - isn't do-able. That's a huge problem. It looks like you're doing nothing but fooling around on the internet. When you do send something good, they might be like "great - send one this good every week!" or "more like this!"

We were going to start up sector newsletters here, but it's really not sustainable because you have to look around and read for a while to see new and interesting things worth alerting people on. Sure, it's super useful but how many hours go into each home run? The bosses very much appreciate these tips they get, but they do not want to pay for the time for people to look for the information.

My old boss used to say that we needed to be just-in-time not just-in-case and that's total baloney. Libraries by definition are just-in-case. These alerting services are just-in-case. Metrics like number of alerts sent out are not useful. Stories of times these alerts were truly useful and used are great - but you have to hear them and record them.

My library has lost some big battles in justifying our existence so I am clearly not that effective at this. It's a sticky question, I think. My blog posts always peter out like Crichton novels, but oh well. Happy New Year - hopefully we'll still be providing library services in the new year after we're re-organized again, sigh.


How not to support advanced users

Oct 29 2011 Published by under information retrieval, interfaces

At first I wasn’t going to name names, but it seems like this won’t make sense unless I do.

Over the years Cambridge Scientific Abstracts became CSA and then now is just part of ProQuest. The old peachy tan-colored interface always supported advanced searching. When the tabbed olive colored interface came out a few years ago, some of the advanced search features were a little buried, but you could still find them (I blogged about it then, but was corrected by someone who showed me where they were). The databases I’ve always used on CSA are very specialized. I use Aerospace and High Technology the most, but I also use Oceanic Abstracts and Meteorological and Geoastrophysical Abstracts. For my own work, I also use LISA.

I find that for topics like missile design, including hypersonics and propellant formulations, and spacecraft design, Aerospace and High Technology does much better than the general databases like Compendex. Oceanic Abstracts is a great complement to GEOBASE (and GeoRef, but meh) on other topics I research.

I have search alerts set up in these various databases. Some I review and forward to my customers whereas others I keep for my own use. The alerts take advantage of the advanced searching available and are tweaked over time to be as precise as possible.

So now that we’re all moving to the new ProQuest interface, it was time to translate my searches to the new format. Luckily, ProQuest has a help page that takes you from the searches in the old interface to the new. I have to say, though, that there are pieces missing. I found in Illumina (the olive colored interface), I could just use kw to get the primary fields out of the record and leave off the references. In the new interface, I had to list all of the fields individually. Also, I had a real problem nesting all of the searches I needed to do. Long story short, I did manage to figure out some satisfactory searches for the alerts.

Now, here’s what actually prompted me to write this post. I am an advanced user and I do have a lot of experience with different interfaces. When I do find a problem in the interface, I’ll report it – particularly if it’s keeping me from performing some task.

In the new interface, if you have something more than the basic search, it often will not let you see the last few pages of results.

For example, in Aerospace (the name now leaves off high tech, let’s hope it still covers the same content):

propellant friction sensitivity – is just fine and you can see all the results

propellant AND “friction sensitivity” – either done through the basic search screen or done through the advanced search, will not let you see the third page. It gives an error.

Fine, so I reported this to their help desk. They replied a week later and we’ve been exchanging e-mails ever since. They’ve assumed I was technologically inept, that my computer was broken, that my library had set up something wrong with the database, that our network was messed up, and that we had a proxy server causing errors. I sent them the error messages from the screen. I sent them screenshots. I tried the same search on three browsers and got another librarian to try from her computer. We could all replicate the problem. They said they visited my library’s web page and couldn’t find a link to the database. Well, *my library* doesn’t have an external web presence – at all! Further, I had already given them the direct URL and told them at least three times that I wasn’t going through a proxy server because I was on campus.  They wanted a screenshot of the search screen (?!?) so I sent that.

Yesterday morning, I got another e-mail. Upon further investigation, they found that this was… a known error… and that technical services was working to fix it. The workaround is to re-sort the records until I have seen them all.

Do they have any idea how mad that makes me? How much time I spent proving I was seeing what they already knew was happening?  Did they even check their knowledge base or did they decide to screw with me for three weeks before even checking?

I’ve had it, but damn it, I need that stinking database for my work and there’s no other real option. GRRR.

Is this how to treat your advanced users?  The first search string I sent them should have clued them in (it’s not the one above, it’s much longer). Plus, they asked and I told them I was a librarian when I submitted the report.

Research Database Vendors Should Know

Research database vendors - the ones who design the interfaces that the end users use - should know that data export is not a trivial addition. Rather it is an essential part of their product.

Over and over and over again, librarians complain about one interface that works one day and doesn't work the next. The one that doesn't output the DOI unless you select complete format. The one that all of a sudden stopped exporting the journal name. The interfaces that don't work with any known citation manager. The ones that download a text file with 3 random fields instead of directly exporting the full citation and abstract.

But you blow us off and you act like it's not important.

Well. I was just talking to a faculty member at another institution - even though a particular database is most appropriate for her research area and she finds interesting papers there, she now refuses to use it because it doesn't export to EndNote right. She's tired of the frustration and she is tired of finding that she has to edit everything she's imported so she's just given it up.

Once again librarians are telling you something and you need to listen. Researchers and faculty are super busy. They will not keep using your product if it makes their life harder. If they don't use your product then we'll stop subscribing. That's all there is to it.

Lamenting the poor support of the expert user

Apr 01 2011 Published by under information retrieval

For some project I’m on at work, I had cause to look up “flow” and how to basically support/encourage that in information systems. Ben Bederson’s 2004 piece in Ubiquity seems to be one of the standard articles.* Being in the flow is

When we are fully engaged and in control of an activity, we sometimes sense that time passes more quickly and we feel immersed in that activity to the exclusion of all else.

Bederson emphasizes these characteristics of flow from Csikszentmihalyi:

  • Challenge and require skill
  • Concentrate and avoid interruption
  • Maintain control
  • Speed and feedback
  • Transformation of time

Flow doesn’t happen when it’s all easy and floating past you, it’s when you’re engaged and challenged. Bederson talks about Emacs and Adobe Photoshop- they’re difficult for newbies to learn, but once you’re an expert you can immerse yourself in your work and not interrupt your flow.

Taking this to the obvious next step (for me), how can/do/should interfaces support scientists, scholars, and librarians in the flow of literature-based discovery/analysis and writing?

I guess scientists think more of this flow when they’re doing the bench work or maybe designing a study (or do they?)… but there’s no reason that you can’t get into the flow when you’re searching and reading the literature or writing. I’ve had it happen where I’m reading an article and it gives me a million research ideas and it is really great and really useful. What about the actual search process?

Let’s look at those features with the current systems:

Challenge and require skill. For the most part current information retrieval systems advertise that they require no skill and that they aren’t challenging at all. Some that do pose a challenge are more frustrating than powerful. Librarians and others who do a lot of online searching know that there is skill involved in getting the best results, even from the simplest Google interface.

Concentrate and avoid interruption. For the most part, this seems out of the system. Few of the information retrieval systems pop up distracting windows. It would be nice, however, if systems didn’t time you out. That’s miserable: you go to follow a lead and look at an article, but when you come back you need to click to start a new session.

Maintain control. Well, this is something that few systems now really emphasize. You can do a fielded search on many of the different systems, but they don’t all have proximity operators or really advanced functions. Some do, but it’s hidden (if you’re an advanced searcher you’ll find it, but still). They also seem to hide the controlled vocabulary – if they have one. Some systems have added automatic stemming (generally a good thing) or even some mild automatic query expansion, but they have to let you turn this off. It’s easy to turn off in EngineeringVillage2, but I’m not sure about other tools.

Speed and feedback. Most are pretty fast – except for Leadership Directories (why on earth is that thing sooo sloooow?). Feedback varies. Certainly faceted presentation of search results is a great addition that really provides a ton of feedback of how your query was interpreted in the system. Spelling suggestions from EbscoHost have been hilarious (did you mean caber warfare? um no)  - that’s feedback, but meh.

Transformation of time. The one system that has all the power and control, is a challenge and requires skill, doesn’t let me forget the time at all! Dialog classic, I’m lookin’ at you! What about other systems? When I’m on the hunt and I’m finding great stuff I can get pretty immersed – I have to remember to write down my steps so I don’t wander too far from my path.


In blog posts about the design of a new overlay for the library catalog for the larger institution of which MPOW is a division, Jonathan Rochkind talks about the design of a system he believes provides the magical combination of utility for the newbie or casual searcher as well as powerful features for the advanced searcher. Don’t get me wrong, I’m not saying the library catalog could ever really be an immersive place, but it is heartening that the team paid such careful attention to what the users said more than the tripe we’re fed about the simplistic search box.

I think the system that will do the best for the advanced searcher will offer the power and precision of an expert search - not just dropdown fields, but the ability to use au= AND ti= AND su= ( OR ), etc., within a box. Stemming is fine but you need to be able to turn it off. It’s fine to expand on a term (like a system that OR’s on the alternate spellings of Qadaffi would be most useful) but you need to be able to turn that on and off within a search, term by term (I used a system that literally used metaphone for every term in the query and *you couldn’t turn it off*). I think the underlying data has to be very high quality: accurate citations, a tight thesaurus that is consistently applied, coverage of the appropriate resources… while I’m dreaming, authority control on the author names. There has to be support for using the subject indexing (locating the right term). Once the search results have been retrieved, there should be some visualization and analysis tools. Faceted presentation of search results is a start, and visualization of the citation network is good, but there could be some other analysis tools. Jumping off from linked authors or subject terms or citations is useful, too.
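As a rough illustration of term-by-term control over expansion, here is a small R sketch. The function names and the little spelling table are made up, and this isn't how any particular vendor does it; it only shows that the on/off switch can sit with each term rather than with the whole query.

```r
# expand one term into an OR group only when asked to and when variants exist
expand_term <- function(term, variants = list(), expand = TRUE) {
  alts <- variants[[term]]
  if (!expand || is.null(alts)) {
    return(term)
  }
  paste0("(", paste(c(term, alts), collapse = " OR "), ")")
}

# AND the terms together, with a separate expansion switch for each one
build_expert_query <- function(terms, expand_flags, variants) {
  pieces <- mapply(function(term, flag) expand_term(term, variants, flag),
                   terms, expand_flags, USE.NAMES = FALSE)
  paste(pieces, collapse = " AND ")
}

# illustrative spelling table, not a real authority file
spellings <- list(Qadaffi = c("Gaddafi", "Qaddafi", "Kadafi"))

expanded <- build_expert_query(c("Qadaffi", "sanctions"), c(TRUE, FALSE), spellings)
verbatim <- build_expert_query(c("Qadaffi", "sanctions"), c(FALSE, FALSE), spellings)
print(expanded)
print(verbatim)
```

The point of the second call is that the searcher, not the system, decides when the alternate spellings get OR'd in.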

Will all this lead to flow? Not sure. It’s still very separate from the writing process and the synthesis of information. A project at MPOW is looking at how to integrate search into the report writing process (hints of what they did in this paper, pdf). I’m not convinced this is going to work but it’s worth some thought. Others have integrated sense making with information retrieval, not sure how this goes with flow either.

The perfect system (of course varies by user, by task, etc.) may very well integrate the query formulation (reference interview-like support), retrieval, sense making, writing, fact checking, visualization….

Back to the title of the piece. The primary users of research databases in the sciences really aren’t the 18 year old undergrads. Why do we keep hearing from proud database vendors that their products are optimized for such users? As in the flow article, as well as some of my favorite articles by Soergel and Bates (individually – no co-authorships), it’s ok to require the user to learn, to adapt to the system. Don’t feed me a line about the benefit of no information on the screen and an empty box while you take away the features that work for me, one of your heaviest users.


* clearly there was a ton of stuff before this – enough to be summarized in a popular book by Csikszentmihalyi , but this article pops up when you apply the idea to interfaces


Bederson, B.B. (2004) Interfaces for staying in the flow. Ubiquity 5, 27. DOI: 10.1145/1074068.1074069 (goofy citation, but this is all I can get from the ACM DL and his website)
