Can’t a machine do that?

Sep 05 2009 Published by under Information Science

...'cause I thought I heard of a software and I know people at x conference said and seems like....

I get this all the time. Most recently, I gave a pretty detailed presentation of some analysis I had done, and as soon as I finished I got the question: can you demo the tool that provided these answers for our boss? Another time that sticks out in memory: a customer saying to me, oh, find out what software or algorithm P (a male member of our group) uses to give all of that helpful info. Just as I thought, P had a reminder set up so that every Wednesday morning he opened the web page, searched, and scrolled through the new entries. There was no software. There was P and his knowledge of what a ton of different groups around the lab needed (alas, P was laid off, er, retired).

Similarly, ever since I wrote an article on blog searching for competitive intelligence, I've kept browsing the alternatives for retrieval and analysis of blogs specifically, but also of internet and literature content more broadly. I've also been chatting with some folks who do geolocation (pdf) and sentiment analysis of blogs, and very briefly with a computational linguistics guy who asked some very pertinent and probing questions about what research databases, human indexers, and librarians actually do, or could do, better than Google or a smarter version of Google. Beyond the whole reference interview - which IS important and part of the answer - there's a horrible mismatch between what people think content/information analysis tools can do and what they actually deliver in the way of automatic categorization, summarization, question answering, sentiment analysis, etc.

If you need more than my experience, consider the disastrous search engine described by Tunkelang (among others), or the mismatch between people's hopes (and the hype) about Wolfram Alpha and the reality (sometimes funny). Another example is the overhyping of Scopus as having 95% precision and recall for author identification (first of all, those aren't even the right metrics; second, no freakin' way). Scopus author and institutional identifications are often wildly inaccurate, IMO, and I hope they don't sue me for saying so!

There are lots of researchers working in this field - thousands - and lots of companies providing these services to businesses and government. Ideally, you'd have some sort of dashboard so that for a new product or campaign you could see how it plays in Peoria. You might also be able to figure out what features or new products are needed. Other types of dashboards would show what research is being funded, how your researchers rank, and what your competitor is getting ready to do based on job announcements, press releases, etc. Don't get me wrong, people do sell very expensive products that do each of these things (some in the six figures US$ or higher), but they're really not at the point where they replace an analyst; they just make analysts better. See this discussion by someone who has used a bunch of the social media monitoring technologies (via enro).

In practice - and I do this for a living, although I don't consider myself an expert - I use these tools when I can (not the expensive ones), but the biggest part of what I do is use them to find "interesting" places in the graph that I can then investigate to see why, or what's going on. A lot of creativity is needed in searching and in understanding how different people use different words to describe the same thing. Then, once you get a bunch of content that might have the answer, you need to visualize it and run some of these tools against it to find out where to look. And then you read and think and discuss and question. Ultimately, the systems right now are at the point where they can say: look around here, that's probably where what you want to know is. They're not at the point of saying: 42. IMHO.
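To make that division of labor concrete, here's a toy sketch of the "look around here" idea - entirely hypothetical, not any of the commercial products mentioned above. A crude keyword-based sentiment score stands in for a real model, and documents whose scores deviate from the pack get flagged for a human to actually read:

```python
from statistics import mean, stdev

# Hypothetical cue-word lists -- a stand-in for a real sentiment model.
NEGATIVE = {"recall", "defect", "lawsuit", "failure", "delay"}
POSITIVE = {"award", "breakthrough", "record", "praise", "growth"}

def crude_score(text):
    """Positive cue words minus negative ones. Even a real model's score
    would only be a pointer for the analyst, not an answer."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def flag_for_review(docs, threshold=1.0):
    """Return the docs whose score sits more than `threshold` standard
    deviations from the mean -- the 'look around here' list for a human."""
    scores = [crude_score(d) for d in docs]
    if len(scores) < 2 or stdev(scores) == 0:
        return list(docs)  # too little signal to rank; a human reads it all
    mu, sigma = mean(scores), stdev(scores)
    return [d for d, s in zip(docs, scores) if abs(s - mu) > threshold * sigma]

docs = [
    "quarterly growth and record praise for the new line",
    "routine update, nothing notable",
    "supplier delay and a product recall announced",
    "minor blog chatter about the campaign",
    "another quiet week of posts",
]
for d in flag_for_review(docs):
    print(d)
```

The outliers (the very positive and very negative items) get surfaced; everything in between stays in the pile. Deciding whether a flagged item actually matters, and why, is still the analyst's job.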

One response so far

  • travc says:

    There is a similar dynamic in some of the research I've been doing (using sensor networks to do some particular field bio observations). The engineering/CS side often looks for completely automated solutions. The bio side doesn't trust automated classification (for pretty good reason, most of the time). Fortunately, a lot of people actually grok that it makes the most sense to have automated systems trim the data down to a manageable size, which humans then deal with (making all those complex judgment calls).
    "Human in the loop" is the term of art we've adopted.