Archive for the 'information retrieval' category

Discovery layers a little way on

One of the reasons Google Scholar is so attractive is that it covers every area of research. Another is that it’s quick. Another is that it’s free. But it doesn’t necessarily go that far back, it’s unclear exactly what it covers, it lags publication (and we don’t know by how long), and it doesn’t support very sophisticated searches. Plus there’s no controlled vocabulary behind it, so you don’t get all the results that are relevant. And of course no substructure searching 🙂

Library tools such as catalogs and research databases have other sets of problems. The research databases with powerful search tools and really good indexing under a good controlled vocabulary tend to cover a fairly narrow research area: physics, medicine, chemistry. Other tools that cover a broad group of topics probably don’t have indexing as good as Google Scholar’s, but they offer full text. Catalogs are typically miserable to search. It’s very hard to find what you’re looking for in most catalogs, particularly if you’re doing a subject search and not looking for a known item.

The narrowness is a particular problem in newer disciplinary or interdisciplinary areas. That’s one of the reasons libraries started licensing federated search products, maybe like 10 years ago? I should probably explain again, although I’m fairly certain I must have before. A federated search takes your query, translates it for a set of pre-selected databases, sends it out, and compiles a list of results. Compare this to something like Google, which has already gone around to all the websites, crawled them, stored the results, and then created optimized indexes and whatnot. So you begin to see why federated searches seem really slow. Plus the search language ends up being lowest common denominator, with only a limited number of fields. What’s worse is that out of the box the results pages weren’t very well done (there’s an add-on that we have that improves this a lot).
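To make the translate-dispatch-merge idea concrete, here’s a minimal sketch in Python. The connectors and their hit formats are entirely invented stand-ins for real licensed databases; the point is only the shape of the process.

```python
# A minimal sketch of what a federated search does, with made-up
# connectors standing in for real licensed databases.
from concurrent.futures import ThreadPoolExecutor

# Each "connector" translates the common query into that database's own
# (often lowest-common-denominator) syntax, then runs the search.
CONNECTORS = {
    "inspec": lambda q: [f"Inspec hit for '{q}'"],
    "compendex": lambda q: [f"Compendex hit for '{q}'"],
}

def federated_search(query):
    """Send the query to every backend in parallel and merge the results.

    The slowest backend sets the pace, which is why federated search
    feels slow compared to searching a pre-built index like Google's.
    """
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(conn, query)
                   for name, conn in CONNECTORS.items()}
        results = []
        for name, fut in futures.items():
            results.extend((name, hit) for hit in fut.result())
    return results
```

Note that the merged list here has no relevance ranking across sources, which is another classic federated-search weakness.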

So, a few vendors managed to negotiate deals with the database producers to – yes – index their content in advance, and show the results in a slick interface. These things are being called discovery layers. You would only get the results of the databases you pay for from the original vendor. (Well, that brings up a question – for something like, say, Inspec, we pay the database producer fee plus a markup for the interface provider… I wonder if you just pay the first part? Dunno.) Anyhow, you get the speed of something that is indexed in advance and the benefits of having the underlying databases. Typically they’ll suck up your catalog and institutional repository, too.

Your reaction is probably like mine was: how do you get all of the underlying databases to sell to you? Without them, it falls apart.

So that brings us up to the current part of the story. I’ve mentioned how Ebsco is really on a power-grabbing mission. They own a bunch of databases. They are also developing one of these discovery layers. Well, one of the competing discovery layer vendors wrote a letter to all of their customers saying Ebsco had pulled its information out. Iris has all the details in her post – so I won’t repeat them, but apparently none of the discovery layer vendors are providing information to any of the others. That leaves us with crawl-and-cache for one big vendor and federating the competition’s, no matter which vendor we pick.

Other things called discovery layers are actually just overlays for the catalog. That’s what ours is going to be. That nibbles away at one part of the problem but really doesn’t approach the elephant in the room.

Sigh, a bit depressing, but now that I ponder the whole thing, I’m not sure how much of a loss it is. There are lots of research efforts on using multiple ontologies to deal with scientific data coming from multiple sources. Maybe we can figure out something better – something that really uses the controlled vocabulary.

3 responses so far

Research database data export kvetch

I'm sure I must have bitched about this before, but argh!  I don't understand, when you have nice structured data all in clean little fields, how you could so horribly and repeatedly screw up exporting to citation managers. The worst part is, even after a database has it right, they'll often screw it up when nothing else has changed.

I do A LOT of searching of research databases. Like I probably spend a quarter to three quarters of every day at work searching in some research database or another.  As I mentioned in my post on packaging results, I'll typically export results from the various places to RefWorks and then use that to compile. I'll then export from RefWorks to APA annotated with abstract for my report. I also maintain a listing of articles written by MPOW and I export from research databases to RefWorks to populate that.

After I've updated the listing of articles, I'm usually so frustrated I can barely see straight.

Here's a list of things that make me most angry:

  • the DOI field not being exported
  • a period being added to the end of the DOI field  (WTF?!?)
  • being added to the front of the DOI in the DOI field
  • the DOI being routed to the Links field with the in front of it
  • NTIS database results coming out as Journal Articles - it's NTIS, all of the entries are by definition technical reports even if the content was presented at a meeting or whatever
  • NTIS database results not transferring the report number
  • Conference papers never seem to come in right from anywhere

Yes, I do global edits to fix all of these but the period, but that just adds more time.
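Global edits like these could just as well be scripted. Here’s a rough Python sketch of cleanup for the DOI problems listed above; the record dict and field names ("DOI", "Links") are hypothetical stand-ins for whatever your citation manager actually exports.

```python
import re

def clean_doi(record):
    """Normalize the kinds of DOI export damage listed above.

    'record' is a hypothetical dict of citation-manager-style fields.
    """
    doi = record.get("DOI", "")
    # Strip a stray trailing period (WTF?!?)
    doi = doi.rstrip(".")
    # Strip any resolver prefix glued onto the front of the DOI
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)\s*", "", doi, flags=re.I)
    # Rescue a DOI that was routed to the Links field instead
    if not doi and "Links" in record:
        m = re.search(r"10\.\d{4,9}/\S+", record["Links"])
        if m:
            doi = m.group(0).rstrip(".")
    record["DOI"] = doi
    return record
```

Of course, the real fix would be for the databases to export clean data in the first place.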

I used to recommend EngineeringVillage2 for good data export – but they've "fixed" it so now it does a few of these things. My current best data source is Web of Science – now that we've got an export-to-RefWorks button. It's the *cleanest* data export you'll find. People complain about it for analytics or JIF purposes – but really, take a look at the competition!

2 responses so far

I'm thankful for... citation linkers!

Nov 22 2010 Published by under information retrieval

So there are these great research databases, like PubMed, that basically tell you, "A solution exists!"  Or really, there exists an article described with these metadata* that might answer your question.  So how do you get from knowing that an article exists to having it open on your screen?

Or, if you're looking at said article and it's got a fascinating reference to another article... how do you go from knowing that article exists to having it on your screen?

We at fancy research institutions have this awesome tool that goes from "there exists an article" to having it on your screen. There are 3-4 major vendors of this kind of product. Ours is SFX from Ex Libris** and we've branded it FindIt. We have tons of e-journals, books, conference papers, etc. So we load our holdings into this database, including coverage years and all. For the big deals, I believe we can just pick off the right package. Then, when you're in a research database you click on the happy little FindIt button, and it does! If you're in PubMed and you have the LibX toolbar, you can click on the PubMed ID. Or, you can go to the FindIt page and paste in the PMID or the citation. SHAZAM! Even if our access is through an aggregator, ADS, or JSTOR, you can get to it from there.
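Under the hood, citation linkers like SFX are OpenURL resolvers: the database packs the citation metadata into a URL and the resolver matches it against the library's holdings. Here's a rough Python sketch of what such a link looks like; the base URL is made up, and real resolvers accept more fields than shown.

```python
from urllib.parse import urlencode

# Hypothetical base URL for an institution's link resolver
BASE = "https://findit.example.edu/sfx"

def openurl(citation):
    """Build an OpenURL (Z39.88-2004) query from citation metadata.

    The resolver matches the metadata against the library's holdings
    and returns links to the full text.
    """
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
    }
    for field, value in citation.items():
        params[f"rft.{field}"] = value
    return BASE + "?" + urlencode(params)

link = openurl({"jtitle": "The Lancet", "volume": "358",
                "spage": "646", "date": "2001", "issn": "0140-6736"})
```

This is why the same happy little button works from PubMed, an aggregator, or JSTOR: they all emit the same kind of URL.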

FindIt – an example of a fabulous citation linker landing page

AND, if that isn't enough to be thankful for, ours is blinged up with the Xerxes add-on so it calls APIs from Scopus and WoS – you can see right from there if the article has been cited and click through to the citing articles. You can also see if it's on the shelf in the library (important for books).***

Are these perfect? No, far from it... but they are *so* much better than not having them. Thank you!

*data about data. Like author, title, journal title, year, volume, page, doi, ISSN, MeSH or keywords...

** not affiliated/not an endorsement.

*** for more details on Xerxes and Umlaut, see Jonathan Rochkind's blog or the code4lib wiki

2 responses so far

A federated chemical information search?

There’s been an ongoing thread on the ChmInf-List. It started by discussing Reaxys (what used to be Beilstein, Gmelin, and a property database, from Elsevier, sold as a site license, I think without limits on concurrent users) and whether it was worth it given that searches of Beilstein on CrossFire Commander were dropping. It then morphed into a discussion of SciFinder (the primary way to search Chemical Abstracts, the primary literature database in chemistry; SciFinder Scholar is the academic version, licensed by concurrent users, while SciFinder itself is typically licensed by selling packages of searches called tasks). The thread ranged on to discussions of privacy (ceded by requiring individual registrations for the web version of SciFinder) and licenses for the two, then the fact that the license for SciFinder Scholar says institutions can’t share usage information with each other, and the fact that SciFinder is required for your chemistry major to be accredited by the ACS…

Along the way, someone mentioned that there was a limited version of Chem Abstracts available to search on other services (DIALOG) – it doesn’t have the structures and maybe some other things. I don’t know how I got on this topic – but I then suggested that if the license would allow it, it would be really cool to design a federated search specifically to deal with chemical properties.

Now I’ve been as skeptical of federated search as the next librarian and I also am a big fan of sophisticated searches that take advantage of the power of the local interface. But I also know that it’s not about the thrill of the chase, it’s about solving a problem. The ideal system is the one that gives you the best solution, the quickest, and also teaches you and gives you confidence in the answer.

What would be excellent is if you could federate a search across:

  • SciFinder
  • Reaxys
  • ChemNetBase (includes CRC handbook and the Combined Chemical Dictionary and other things)
  • ASM Handbooks and Phase Diagrams (this would of course be the materials science ASM)
  • SpringerMaterials (includes Landolt-Boernstein)
  • ChemSpider
  • maybe the CSA Materials Databases
  • whatever other appropriate stuff like if you have Knovel or BioRad’s Know it All U or whatever

MetaLib (our federated search provider) has a fairly crappy interface out of the box and required a lot of fixing – and even with that fixing, you couldn’t just throw the same solution at this search. First, this is a dream anyway, because hardly any of these chemistry sources can be federated. Second, even if they could be, would the fields be there, or would we only have title, author, subject? And what would you display in the results beyond the value (if possible) and the citation?

So anyway, I suggested this and a retired professor posted to the list essentially yelling at me and others on the list for daring to

  1. discuss Scifinder on an open list (he said it should be on scholartalk but HELLO, I don’t have access to scholar!)
  2. call SciFinder a database (well it is, so there)
  3. suggest that SciFinder could be federated

I’m amazed and thankful for the chemical information pioneers who made what we have today possible, but I am nowhere near content to leave it as such. By tying our hands about what we can discuss, they’re hurting themselves in the end, because a large innovative community should be a resource not a threat.

Enough for now – none of the comments on this particular idea came from any official representatives of CAS or Elsevier. CAS’ policies are clear in the licensing agreements.

4 responses so far

Let’s Help ScienceBlogging: What design features are useful in a science blog aggregator?

First, the great news: Bora, Dave, and Anton got together and developed a website to aggregate science blog postings. It’s at . This is really still in its first stages and they plan to continue to add to it and refine it as they go.

Here’s a screenshot (I’m guessing the page will look different over time so this way you can see it as I see it).

It's got three columns, and the top five stories from each source. The title links to the source as do the story titles. There’s also a blogroll – an alphabetically arranged list of the sources.

The sources are a combination of blog networks like this one, Discover’s, Nature’s, etc., and some news feeds. Some of the sources are in other languages (Brazilian Portuguese, German, Chinese, French).

It’s clear from the design (and the delighted reactions) that this is meant as a place to go to read a diverse collection of science posts – to get a sampling. It doesn’t link to independent blogs – except when they are aggregated by “All-geo”. It also doesn’t have any way to export the contents or really explore the contents besides browsing the titles on the front page. If you mouse over the article titles you do get a snippet of information.

What features could help the current setup?

  • some way to expand and read a snippet without mousing over. People with twitchy hands might not do well with that
  • some indication of the blog name where the article comes from
  • separately, a page providing information about each source – I know some of these, but I’m assuming a lot of people don’t
  • an opml file or some way to export the rss feeds to your reader (you could, of course, visit the original site or just keep coming back)
  • I’m not sure what order things are on the page. Maybe they should be in some categories? Some explicit organization? (Blake makes that comment here)
  • Blake also makes the comment that these various aggregators have different rates, so 5 posts might stay there for a while or there might be 5 posts an hour – it’s hard to see how to deal with that.

Could independent blogs be added, and how?

This post from Dave puts forth some ideas for adding “science blogs”. The first problem is defining what counts as a science blog. I faced this in both of my previous studies, and I solved it two different ways. In one I was very strict: self-identified scientists posting mostly on scientific topics. In the other I was broader: the above, plus scientists posting on life in science, plus everyone else blogging about science.

What no one mentions on that post is: what is science? Are social sciences included? Librarianship? Areas of the humanities like anthropology, archaeology, communication, history? It’s really hard. Science librarians yes, others no? Well, then we’d lose Dorothea. So academic librarians? Then I’d drop off 🙂

First, selection and maintenance

  • Nature Blogs takes nominations and then requires two members to confirm. They require:
      1. composed mostly of original material - no press releases or lists of links
      2. primarily concerned with scientific research
      3. updated (on average) at least once a fortnight
  • Other suggestions – like from Jonathan Eisen on twitter – were to take nominations and have a curator say yes or no. This could be way, way too overwhelming and there could easily be hurt feelings if someone didn’t get included and they thought they should.
  • A variation on that is to have one or a few committees. Maybe for each subject area.
  • Maintenance is also an issue – keep dead blogs? Use an automated link checker? Manually go back and check if the person is still blogging and still blogging about science? How often? Have a way for visitors to report. (Oh and for heaven’s sake, Nature won’t let me change my url from blogspot – let the bloggers update their urls).

I sort of think the Nature way pretty much works. It’s crowd sourced, so less load. But the maintenance stuff needs to be added.

Second, organization

  • There needs to be some organization scheme. It might go deeper (with sub categories) in areas where there are a lot of bloggers
  • The organization scheme could have a couple of different facets (topical/subject – chemistry, gender, work setting – industry)
  • Should be able to look at an aggregation on each subject category, and export rss feeds from that category
  • Some of the others aggregate around what journal article or molecule is being discussed – this might be too hard and there might not be enough content to do that.
  • There could be some organization around links. See who links to this blog, see who has commented on this blog – but that would also take a lot of work.

Personally, I’m not so much interested in links to press releases and mainstream media – the bloggers pick up things like that that are interesting (I pick up some from the information industry). I’ve already spent way too long on this for incremental help to the founders – they have already done an amazing job. Maybe some information-architect-y or user-experience person might weigh in?

5 responses so far

The danger of using only sources with recent coverage

… is well documented. Consider, for example, the tragic case of the JHU researcher who only searched Medline 1966 forward and so missed an association between the intervention in a study and lung toxicity that had been reported in earlier literature [1-2]. In biomedicine, there is a huge emphasis on recency – and for good reason, science moves fast. The cited half-life (a measure of how far back citations generally go) is way shorter in medical fields than in, say, math (where it's always >10 years). The engineering databases that I use most frequently go back to 1898 and 1884. I also use Web of Science and Chem Abstracts, which go back to the very early 1900s (~1908).

But anyhow, Biochembelle, re-tweeted by Scicurious, pointed to an editorial from Nature Reviews Microbiology [3] that says youngsters today aren’t getting the proper baseline literature because they’re relying on PubMed and Google Scholar. They cite the subject area of bacteriophage biology – developed well before the Medline era. Some researchers in this area have created their own bibliography of articles prior to PubMed, but they are concerned about losing access to the publications as they are moved out of the library to storage.

There are like a ton of things wrong with these statements. First, have they tried Biological Abstracts? As far as I can tell it goes back to at least 1917 (my parent institution has it stored because we have the online version, BIOSIS, and we have the backfile). Second, libraries typically don’t move journal runs off site unless they have the electronic equivalent, or at least until they’ve shown that there’s very little if any usage. Many scholars wish more were moved off site – they get free scanning and electronic delivery of those articles instead of having to photocopy them themselves! Libraries are also buying electronic backfiles – don’t assume that just because it’s old, we don’t have it online! In fact, some pre-1923 biology texts are freely available in the Biodiversity Heritage Library.

My points in a nutshell:

  • yes, it is very dangerous to rely on incomplete resources like GoogleScholar
  • yes, it is very dangerous to only use recent information
  • if you’re at a research institution, you don’t HAVE to rely on PubMed and GoogleScholar, you have access to other resources and it’s no one's fault but your own if you don’t ask your librarian what to use

[1] McLellan, F. (2001). 1966 and all that – when is a literature search done? Lancet, 358(9282), 646. doi:10.1016/S0140-6736(01)05826-3

[2] Ramsay, S. (2001). Johns Hopkins takes responsibility for volunteer's death. Lancet, 358(9277), 213. doi:10.1016/S0140-6736(01)05449-6

[3] Raiders of the lost articles. Nature Reviews Microbiology, 8, 610. doi:10.1038/nrmicro2435

12 responses so far

Disciplined tagging or how Stack Overflow plans to control their vocabulary

Carol H tweeted this blog post today from Stack Overflow, the wildly popular question and answer site for IT, CS, software dev, etc. Essentially, if you get stuck, you submit a question and you provide subject tags to help people find it. Answering questions gets you reputation points.

A collection of user-generated tags becomes a “folksonomy” (to use a worn out term), but typically in social software sites, the choice of the tag is completely up to the user so you get multiple versions of the same term (US, United States, USA, usa, U.S.A., etc), you have meta terms (to-do, to-read), and sometimes some unpleasant stuff. LIS researchers in information organization have done a ton of papers on these things and people who do taxonomies for a living sometimes use them to help determine “preferred” terms.

So, according to this blog post, SO seeds new sites with a few sample terms, and they started by letting everyone add new terms. Then they allowed moderators to merge terms. Then they required higher and higher reputation scores to be able to add new terms. But the terms were still getting out of control. So this is cool: they now have wiki scope notes and synonyms for terms.
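The synonym mechanism boils down to mapping each user-supplied variant onto one preferred term. Here's a small Python sketch in that spirit; the particular mappings are invented examples, not SO's actual synonym list.

```python
# A sketch of synonym-based tag normalization, in the spirit of what
# Stack Overflow describes; the mappings here are invented examples.
SYNONYMS = {
    "js": "javascript",
    "ecmascript": "javascript",
    "c sharp": "c#",
}

def normalize_tags(raw_tags):
    """Map each user-supplied tag to its preferred term and dedupe,
    keeping a stable order."""
    seen, result = set(), []
    for tag in raw_tags:
        cleaned = tag.strip().lower()
        canonical = SYNONYMS.get(cleaned, cleaned)
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result
```

This is exactly the "preferred term" plus "use for" structure from controlled vocabulary work, just crowd-maintained.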

My CS colleague from work (hi Jack!) gives me a hard time – generically as a librarian – that I think all of the vocabularies should be determined in advance and human-assigned, etc. He thinks these things should be emergent and machine-assigned where possible. Obviously neither of us entirely subscribes to either of these views. If you have the luxury of the funding and time to have a good controlled vocabulary and human (machine-aided) indexing, your information system will be easier and better to search (better recall, better precision, more user satisfaction). However, it’s hardly ever the case that you have all of these things, and even if you do, user-suggested terms are important for adding to and maintaining your CV.

OK, one of my ongoing jokes is how CS keeps reinventing LIS (well, indeed, they’ve taken over the term “information science” in some places) – so now Stack Overflow has reinvented the taxonomy (not quite a thesaurus, though, right? Because no BT or NT, just UF and U, lol)

Edit 8/7: Promoting this from the comments. Joe Hourclé tells us that they've addressed some of the issues discussed here (I doubt they read this though 🙂 ) see:

4 responses so far

Well, sometimes you just have to Google it

So there I was, trying all kinds of librarian ninja tricks on the fanciest, most expensive research databases money can buy (SciFinder, Reaxys, Inspec...) and no joy. I couldn't find what I needed. I'm perfectly willing to admit that I don't know all that much chemistry, but usually I do OK since I work with one chemist quite a bit. Finally I gave up and googled it. After a few tries, I found, way down in the results, an article about something else (like, I needed a chemical in an aqueous solution and it had the chemical in alcohol), but the snippet drew my eye. Sure enough – it had a table with my data in it. An ACS journal from 1945.

The data I needed were not the focus of the paper - they were there sort of as a calibration or reference type thingy - to show what the setup would do with no alcohol. So it's absolutely right that the document wouldn't have come up in my search, because technically the article didn't match. That's why the full text search worked.

It could be that I could have located the info using SpringerImages (but it's an ACS article) or using CSA's deep indexing (is Illustrata still around? I did try Aerospace & High Tech). Lesson learned.

8 responses so far

SIGIR 2010 is going on right now

Jul 21 2010 Published by under information retrieval

in Geneva. This is the ACM Special Interest Group on Information Retrieval. Besides the academic types, this huge conference pulls a lot from the search engine industry and there’s a lot of interesting stuff.

The twitter tag is #sigir2010 and there’s some bloggy coverage. (Heh, today’s keynote – is the Cranfield paradigm outdated – um, yes, if it was ever dated!) Daniel Tunkelang is blogging (I think he’s at Google now). A search on Google Blog Search yields a few more.

3 responses so far

Using the fact that sometimes scientists look at the pictures first

I was happy to see that the authors published this article in PLoS ONE. I was following their work a while ago, but had lost track (plus, when asked, the last author implied that they had moved on to new projects). So here's the citation, and then I'll summarize and comment.

Divoli, A., Wooldridge, M., & Hearst, M. (2010). Full text and figure display improves bioscience literature search. PLoS ONE, 5(4). doi:10.1371/journal.pone.0009619

The authors created a prototype information system that used Lucene to index the metadata for open access biomed articles, the full text, and the captions for images and tables. The interface is set up to allow you to use one search box and then radio buttons to select full text and abstracts, figure captions, or tables. In the first, the results are sort of like the standard metadata and abstract with key word in context excerpts and extracted images. For figure captions, you can either have a grid of figures, or a list. For tables, you get a citation, the table caption, and the table. The article spends a good deal of time discussing design decisions, providing a tutorial for creating your own.

To build the prototype, they got the XML from PubMed Central and pulled out authors, images, captions, abstracts... They made different sizes of the images for quick retrieval later. They then indexed different fields with different weights depending on what you select to search. They then got a group of biologists (n=20, although number isn't really important for qualitative studies) and ran them through a study. The participants provided the query and looked at it in each view, thinking aloud about their reactions and steps. They were then asked a few questions about each interface.
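The field-weighting idea is worth making concrete. The real prototype used Lucene; here's a toy Python stand-in showing the same principle, with invented weights and invented documents.

```python
# A toy stand-in for the Lucene-style field weighting described above;
# the weights and documents are invented for illustration.
WEIGHTS = {"caption": 3.0, "abstract": 2.0, "body": 1.0}

def score(doc, query_terms):
    """Score a document by counting query-term hits in each field,
    multiplied by that field's weight."""
    total = 0.0
    for field, weight in WEIGHTS.items():
        text = doc.get(field, "").lower()
        total += weight * sum(text.count(t.lower()) for t in query_terms)
    return total

docs = [
    {"caption": "protein folding pathway", "body": "methods ..."},
    {"abstract": "protein expression", "body": "protein protein"},
]
ranked = sorted(docs, key=lambda d: score(d, ["protein"]), reverse=True)
```

Changing the weight set per search mode is how one index can serve the full-text view, the figure-caption view, and the table view.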

The majority of the participants would choose to use this type of interface for at least some of their searching. They seemed to like the full text search, but were not quite as sure about the table search. Some thought it would be useful for getting right to the results, but several didn't think they would use it.

Now for some commentary...

I was somewhat critical in the post I linked to above, but I really think this is promising stuff. The authors point out that this is very dependent on access to the full text and also won't be universally useful. There are plenty of search situations in which the images wouldn't be used, but they should be an option. Since my earlier post, CSA has added "deep indexing" to more of their files. It's not the same as their dedicated Illustrata product, which is more like BioText.

Publishers have the full text, so some of them are also making the images and tables available outside of the article. For example, both ACS and RSC have added images to their RSS feeds. ScienceDirect has a figures-and-tables tab on their articles – which is nice for scanning to see if the article is relevant. PLoS ONE lets you look through a list of the tables and images and download a PPT or high-quality image.

Springer Images also lets you search the tables and captions to get pictures. It also indexes the context of the reference to the image in the text. You also get a link to the article and excerpts like on Google Books. My colleague at work pointed out that it is useful for finding phase diagrams.


But more than all of that, there's been a lot of talk recently about disaggregating the journal article or even doing away with the whole and just using the pieces. If so, maybe this is an intermediate step.

3 responses so far
