Archive for the 'comps' category

Actual comps response: Information Retrieval

Aug 09 2009 Published by under comps, information retrieval

Now that I'm not scared to look at my responses...  This one doesn't look so bad, so I'm sharing.  Please do keep in mind that this was written in 2 hours, by a tired person, with tired fingers!


Christina K. Pikas

Comps Information Retrieval (Minor)

July 20, 2009

Question F2: Design an information retrieval system for scientists that covers full-text peer-reviewed articles as well as blogs and wikis

0. Introduction

Today, scientists use more than just the peer-reviewed journal literature in their work, but our information retrieval systems such as our library research databases and online catalogs only cover the journal literature, conference papers, books by title, and to a limited extent, protocols and technical reports. We must support a more comprehensive view of information to stay relevant and to support scientific work. Accordingly, this essay describes the design of an information retrieval system to provide access to this broader variety of scientific information a user might need. The essay starts with an outline of the design elements of the system and continues by identifying the major issues in designing such a system. It ends with a discussion of the best ways to evaluate the system.

1. The design elements of the system

An information retrieval system must take an input that is a representation of a user's information need, match that representation to representations of information in the system, present search results, and take feedback. A modern system must go further. It should support learning and exploration, provide the users with a workspace, and be standards-compliant and interoperable with the ecology of other information systems used by the expected users. This section describes the following design elements:

• Coverage
• Query formulation for expected search types
• Representation of information/internal information organization
• Matching
• Presentation of search results
• Working with search results
• Interoperability

1.1 Coverage

This section describes the types of information that should be included in the system and the sources for that information.

1.1.1. Types of information

There are many different types of information a scientist needs to do her work. Some types of information might include:
• research articles written on the subject to refine her research question
• protocols and textbooks to learn about research methods
• catalog information for suppliers for lab consumables
• grant submission information
• policy information for her research subjects or for conducting certain types of research
• conference and event information
• contact information for potential collaborators, mentors, or assistants/technicians
• information to keep up with colleagues' work or new advances in the field

With such diverse and complicated information needs, of which those listed above are a small sample, this system should probably start by covering external information such as the research literature, publicly available blogs and wikis, protocols, and grant resources. Organization specific information such as animal research boards would add value, but would require customization for each location. Likewise, full integration of personal information management tools would be ideal, but might be too difficult for the initial system.

1.1.2 Sources of information

Research information

The vast majority of the research literature is available online, for a fee, with a license. No institution, no matter how wealthy, has access to all of the information its scientists need. Frequently, abstracts and tables of contents are available for free even for journals to which the institution does not subscribe. Research articles are well covered by research databases such as Medline, Inspec, and Compendex. Most research databases and digital libraries can be added to a federated search using a web service or a Z39.50 connection. The major exception to this rule is in chemistry: SciFinder (Chemical Abstracts) cannot be federated. The source for the research information should be all of the research databases in science and the digital libraries as well as general searches.

Blogs and wikis

A general web search of all blogs and wikis will not be useful as it will be very difficult to avoid introducing lots of noise. Nature Publishing Group has created a directory of science blogs. Each blog must be nominated by another blogger who is already in the system and then reviewed by the system managers. Likewise, the ResearchBlogging site run by Dave Munger and sponsored by Seed requires a review of the blog before it can use the logo and have posts included in the listing. Listing in a vetted directory like these should be required for inclusion in the system. Once a source is included, there should also be a way to report abuse or flag a blog as inappropriate.

People information

A useful system should help link scientists to other people. There are directories from the professional societies, but these are probably not open for use. Likewise, COS offers an opt-in product, but it might not be available for federation. Other sources are lists of grant recipients, university or research institution directories, authors of published articles, and site users/profile creators.

1.2 Query formulation

Query formulation can be quite difficult, particularly in a system that covers such a diversity of information. For example, in blog searching many of the searches are filtering searches to set up an alert or to find a new blog to follow instead of ad hoc searches or navigational searches. This system should support these filtering searches, to enable the user to set up alerts on new topics or find continuing resources (like journals and blogs) to follow. Most searchers still use 2-3 keywords to "teleport" and then refine and iterate their search. Ideally, the system could know what information you already have in your personal stores and use that to show only novel information and to get a better representation of your information need. Also, search could be initiated from within Microsoft Office documents and web browsers by highlighting text or a drawing of a chemical, and then right clicking or otherwise invoking the retrieval tool.

1.2.1 Keyword search

A simple keyword box and a guided keyword search are expected in most interfaces. The system can support query formulation by spell-checking and auto-completing the search. It should also be able to recognize if the term input is a chemical, organism, person, or other category of controlled information and ask the user if she wants to search using that index.

1.2.2 Query by example

The user could input a document or other information and get more like that or get citing or cited documents.

1.2.3 Known item search

The user might want to retrieve a known document to get access to the full text or to see how it is represented in the system. The system should allow the user to enter a citation or even just a PMID or DOI to retrieve an item.
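As a rough illustration of how the search box could tell a known-item lookup from a free-text citation, here is a minimal Python sketch. The function name `classify_known_item` and both patterns are assumptions made for this example; the patterns are deliberately simplified and would miss some valid identifiers.

```python
import re

# Assumed, simplified patterns: real DOIs and PMIDs have more edge cases.
DOI_PATTERN = re.compile(r"^(?:doi:\s*)?(10\.\d{4,9}/\S+)$", re.IGNORECASE)
PMID_PATTERN = re.compile(r"^(?:pmid:\s*)?(\d{1,8})$", re.IGNORECASE)


def classify_known_item(query: str) -> tuple[str, str]:
    """Return (kind, value), where kind is 'doi', 'pmid', or 'citation'."""
    text = query.strip()
    doi = DOI_PATTERN.match(text)
    if doi:
        # DOIs are case-insensitive, so normalize to lower case.
        return ("doi", doi.group(1).lower())
    pmid = PMID_PATTERN.match(text)
    if pmid:
        return ("pmid", pmid.group(1))
    # Anything else is handed on to citation parsing (not shown here).
    return ("citation", text)
```

A bare number is treated as a PMID here, which is itself a design choice; a real system might ask the user to confirm before retrieving.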

1.2.4 Browse

The system should allow the user to browse by any controlled vocabulary used including those representing people, chemicals, organisms, institutions, journals, and so forth.

1.3 Representation of information

Some of the information representations would just be the full text of the document, but machine aided indexing or entity extraction using some of the many controlled vocabularies and information organization systems used within science would be helpful. An early decision would be if the system is primarily a federated search or if it spiders and caches. A federated search will have the freshest results, but will be slow to display the results to the user. It also relies on the native search of the targeted source. A spider and cache model would be preferable for speed and the amount of pre-processing required to support entity extraction and other requirements listed in this essay, but one should not underestimate the policy negotiations required to make that possible (Summon seems to be doing this, so that might break some barriers). A federated search would not require the system to store representations of the information objects whereas a spider and cache model would.
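The speed trade-off of the federated model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `federated_search` is a hypothetical name, and each source is represented by a plain function standing in for a web service or Z39.50 call. The query is fanned out to every source at once, and any source that has not answered by the deadline is skipped rather than holding up the display.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def federated_search(query, sources, timeout_seconds=2.0):
    """Send the query to every source in parallel.

    sources maps a source name to a callable(query) -> list of hits.
    Returns (merged, skipped): hits tagged with their source name, and
    the names of sources that failed or missed the deadline.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = {pool.submit(fn, query): name for name, fn in sources.items()}
    done, not_done = wait(list(futures), timeout=timeout_seconds)

    merged = []
    skipped = [futures[f] for f in not_done]  # too slow to show
    for future in done:
        name = futures[future]
        try:
            merged.extend((name, hit) for hit in future.result())
        except Exception:
            skipped.append(name)  # the source's native search failed

    # Do not block on slow sources; cancel any work that has not started.
    pool.shutdown(wait=False, cancel_futures=True)
    merged.sort(key=lambda pair: pair[0])  # stable order by source name
    return merged, sorted(skipped)
```

Note that the hits come back in whatever order each source's native search produces, which is the dependence on the target described above. A spider and cache model would replace these live calls with a lookup in a local index.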

1.4 Matching

There are several well-accepted information retrieval models that would work well here including Boolean, vector space, probabilistic, and language models. Their exact details are beyond the scope of this essay.
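To make one of these models concrete without going into full detail, the following is a toy Python sketch of vector space matching with TF-IDF weights and cosine similarity. The whitespace tokenizer, the particular IDF formula, and the function names are assumptions chosen for brevity, not a recommendation for the production system.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer for illustration: lower-case and split on whitespace.
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return (one sparse TF-IDF vector per document, the IDF table)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return indexes of matching documents, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms that occur in no document get zero weight.
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in scored if score > 0]
```

For example, ranking the query "protein folding" against a document that contains both words, a document that contains only "protein", and an unrelated blog post returns the first two in that order and omits the third.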

1.5 Presentation of search results

The presentation of the search results must have enough information to enable the user to judge the relevance of the results, understand how the system interpreted the query in order to provide feedback, and should further support the exploration of the information space.

1.5.1 How scientists judge relevance

Scientists judge relevance by topical measures, both direct and indirect, as well as by other measures such as novelty, timeliness or recency, authority, and subject discipline. Authority is judged using the author's name and affiliation and the publication venue. The number of citations an article has received is also a proxy for utility and authority.

Full text availability is important. In a large JHU libraries study, we found that users want to have a representation of how long it will take to get the information in the results (i.e., available immediately online, walk over to the library, 2-3 days from elsewhere in the institution, 2-3 weeks for ILL).

For web pages, the url is often used to make a first guess about the relevance. Key-word-in-context snippets are also helpful.

For items that are not "citable" some representation of usage or in-links can be used to indicate potential utility of the article. Likewise, for wikis the number and frequency of edits can show if the article is controversial or active, and the number of comments received for blog posts sometimes is informative.

Enabling the users to rate items, and then showing a user rating in stars next to the item can be useful.

1.5.2 Faceted presentation of search results

Faceted presentation of search results enables the user to explore the information space by showing the categories occurring in the retrieved set and how many times they occur. The user can then use these to further narrow the set.
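The counting behind a faceted display is simple, as this small Python sketch suggests. The record layout (dictionaries with fields such as `source_type` and `organism`) and the function names are hypothetical, made up for the example.

```python
from collections import Counter, defaultdict


def _as_values(value):
    # A field may hold one value or several; treat both the same way.
    if value is None:
        return []
    return value if isinstance(value, (list, tuple, set)) else [value]


def facet_counts(results, facet_fields):
    """Count how often each value of each facet field occurs in the set."""
    counts = defaultdict(Counter)
    for record in results:
        for field in facet_fields:
            counts[field].update(_as_values(record.get(field)))
    # most_common() lists the most frequent values first.
    return {field: counts[field].most_common() for field in facet_fields}


def narrow(results, field, value):
    """Keep only the records that carry the chosen facet value."""
    return [r for r in results if value in _as_values(r.get(field))]
```

Calling `facet_counts` on the retrieved set gives the category counts to display, and `narrow` applies the user's choice; recomputing the counts on the narrowed set keeps the display consistent as the user explores.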

1.5.3 Pivot browsing

Each person's name and controlled vocabulary term should link to allow the user to create a new search using that term.

1.6 Working with search results

The ideal system should allow the user to annotate search results, save some or all of them within the system or export them to another system. The search results should be sortable by any field and the user should be able to search within the results. The system should also allow users to share their work with others in the system in small group areas.

1.7 Interoperability

This system cannot be designed as an isolated system; it must take information from other information systems and be able to export information out. At a minimum, it should be interoperable with the following systems:
• the library catalog and WorldCat - for availability information as well as intellectual access information
• OpenURL resolvers and electronic entitlement systems
• usage statistic compilation software (COUNTER and SUSHI compliant)
• interlibrary loan software
• bibliographic management or pdf management software

2. Design issues

The primary design issues unique to this system are traceable to the diversity of information covered. There are known issues associated with searching research databases that cover journal articles and conference papers. There are known issues with searching the web, and some unresolved issues with searching social computing technologies. By making these sources all available in the same tool, we compound these issues and add issues related to conveying authority (and how assessment of authority differs in the various sources), conveying freshness/recency/timeliness, and mixing structured and unstructured data. It is clear that this is an open issue, because the current federated searches and overlay discovery tools do not do this well and they typically only search library catalogs, institutional repositories, research databases, and digital libraries. We should also address whether scientists want blog and wiki content to be surfaced in the same tool as the research literature. Some may prefer to keep this content separate.

A secondary design issue is getting access to the content and most likely mixing federated content with spidered content.

A third issue is getting enough user feedback from practicing scientists who are quite busy. It is more straightforward to get LIS or CS graduate students to test a system, but this system should be co-developed with and tested by the scientists who form the end user group.

3. Evaluation

A user orientation is critical to the success of this system. There should be formative evaluations while the system is being developed and summative evaluations at key points in the process. The first evaluations can be with paper prototypes, asking potential users to give subjective feedback on the design of the system. Other evaluations can be done with limited functionality prototypes and then beta and production systems.

3.1 Experimental or lab evaluations

During the development process, early evaluations might require the use of assigned tasks so that the designers can be certain that the users test the desired features and functionality. Participants should be scientists and should work on a topic of their own interest, provided the topic is one the system can support. Once a training period is complete, the interaction with the system can be monitored to see which paths people follow and how they use it. A survey can be given at the end to surface any participant complaints.

3.2 Naturalistic or field evaluations

The final evaluation of the system should be in actual use by the expected users for an extended period of time. While they are using it, the users can provide feedback in the form of e-mails and comment forms, the developers can unobtrusively capture system logs, and there can be a survey near the end of the trial period. From the system logs, the developers can see what types of queries are being used and the path that users follow. The navigation paths and time for each stage along with the documents the users save can show the developers how various features are working.
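As a sketch of the kind of log analysis meant here, the following Python function rebuilds each session's navigation path and the time spent at each stage. The event format (session id, seconds since the session started, action name) is an assumption for this example; real system logs would need parsing and anonymizing first.

```python
from collections import defaultdict


def session_paths(events):
    """Rebuild each session's path with the time spent at each stage.

    events is an iterable of (session_id, seconds_since_start, action).
    Returns {session_id: [(action, seconds_until_next_action), ...]};
    the last action in a session has a duration of None.
    """
    by_session = defaultdict(list)
    for session_id, timestamp, action in events:
        by_session[session_id].append((timestamp, action))

    paths = {}
    for session_id, steps in by_session.items():
        steps.sort()  # order the events by time within the session
        path = []
        for (start, action), following in zip(steps, steps[1:] + [None]):
            duration = None if following is None else following[0] - start
            path.append((action, duration))
        paths[session_id] = path
    return paths
```

Combined with the list of documents each user saved, paths like these let the developers see, for instance, whether users who narrow by facet end up saving more items than users who only reformulate their query.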

4. Conclusion

An information retrieval system that covers social computing technology content along with peer-reviewed content is quite complex. This essay has provided a high-level overview of the requirements for such a system given what we know about scientists' information needs and information retrieval system design and use.


Comps day 4

Jul 20 2009 Published by under comps

This was the last day of comps - now it's down to waiting for the results. The essays were emailed out to the readers immediately at 4pm when I finished.

This day's questions were much more attractive.

First exam was Information Retrieval - a minor area. I had to pick one of three:

- something about a system to present science so that engineers can use it to make technology?

- design and tell how to evaluate an information retrieval system that has both peer-reviewed scientific literature and blogs and wikis of interest to scientists

- exploratory search, what is it, how to support it, etc

I picked the second, but the third was really attractive, too.

The second exam was Research Methods - a minor area. I had to pick one of these two:

- validity in qualitative research what is it and why is it important

- sna vs. bibliometrics, how does one inform the other

I did the validity one because I'm not 100% sure I could really divide sna from bibliometrics as far as the actual methods go. Theory is certainly different but whether the nodes are papers, people, or countries, who cares?

Each of these responses was 1200-1600 words. That's a lot of typing!

If I pass, then I advance to candidacy - or, rather, I can submit my paperwork to advance. Chatting with my advisor today, seems like I probably should put that off, because it starts the next counter. I've got another year until I have to, and when I advance, I've got 4 years to graduate. (I really, really, really hope it doesn't take me that long!)  So now I need to get working on my dissertation proposal (develop a topic first!)


Comps day 3

Jul 17 2009 Published by under comps

Sigh. TGIF and TG I have 2 days free before the next 2.  I really felt like I was very, very familiar with these articles but I had a really hard time.

I think when I look back on my response, I'll see it really wanders and doesn't make a good case.

Here are the questions. Choose one of:

- Olson and Olson (2000) Distance matters is a seminal [sic- for feminists, I know] paper. Describe why they said distance matters. There have been a lot of technological advancements since 2000, does distance still matter?


- Researchers argue Social network analysis is either a theory base or a methodology. Give the reasons for each of these, then pick one the other or both and say why you think that. (paraphrasing).

So my problems were that I remembered a lot about distance matters - one of my readers mentions this paper frequently, but I wasn't sure that I remembered all of the reasons they gave and that I could differentiate points made in this paper from ones I'd read elsewhere. The second problem is that I'm totally not familiar with the term "theory base."  I have a lot of opinions on SNA, but ...

So I picked the first. Now that I'm home, I  pulled the paper up (big mistake) and CRAP, forgot there is a huge aspect of technology readiness. Pretty important to the paper. 

My response might be good enough (I hope), but I'm sure it is not great. Now off to think of something else before I review information retrieval and research methods for Monday!


Comps day 2

Jul 16 2009 Published by under comps

So this was the dreaded STS. Crap. Good questions, but really, really not sure about my answers.  I did have the guts to shush people talking loudly outside the door and they gave me dirty looks. Oh well!

Q 1 choice between:

- relationship btwn science - technology - society. specifically, relationship btwn science and technology, science and society, technology and society... and how does all this influence sci-tech policy


- pick some major themes out of current sts research and describe them. say how you can use them in your research

picked the first

Q2 choice between

- what is a scientist? what do scientists have in common that distinguishes them from non-scientists?  is the boundary between scientists and non-scientists socially constructed?  how has that changed over time or how is it different over different cultures. what are the practical, political, and rhetorical advantages and disadvantages to drawing this boundary between scientists and non scientists?


- what aspects of STS are useful to librarians and information professionals?...

the second went on a little bit longer, but I picked the first.

I was so done at 4 hours - I wish I had run out of time so I could give that as an excuse. I think my answers weren't so hot.  I did a word count and the first was 1600 words and the second was 1300. I really felt that I repeated myself a lot.



Comps day 1

Jul 15 2009 Published by under comps

Communication, a major area, so 4 hours to write 2 essays, in a row, without a break. (sure I could get up and get water, etc., but I didn't want to waste time).

I got there pretty early, which was good. First thing, had to get someone to open the door. The student worker at the desk had no clue what I was asking her. The proctor (the associate dean) came through with keys and asked if I had checked to see that everything was set up - I hadn't because I couldn't get in. She let me in - no computer, no printer. No IT person (she's on vacation this week). Trotted down a couple of flights of stairs, luckily other IT support person there, but knew nothing about it. Very quickly, he found a laptop labeled "comps" and handed it to me (basically wiped clean and wireless disabled). He even gave me a power cord and a separate keyboard (I might bring a mouse tomorrow). Got it upstairs and it wasn't cleaned off. Had the last person's exam. So I called him and he came up and cleaned it to the proctor's satisfaction. (he was very fast, very helpful, yay!)

With all that, I started the exam on time! (apparently, my questions were awol 24 hours out. Luckily they got hold of my committee chair and he sent them or I'd have to reschedule and who knows when that would be what with work and all)

From memory. First essay: pick 1 of

- Rogers' (2003) Diffusion of Innovations - describe, how diffusion of innovations changed from pre-web, web 1.0, to web 2.0. Does this theory work for most or all ICT innovations?


- Role of gatekeeper in Allen and Cohen (1969). How changed.

(I did pick 1 but should have picked 2 - after all of our recent talk on if "gatekeeper" is good or bad - and one of the readers is one of the people who sees it as I do)

Second essay: pick 1 of

- Scientists have a lot of communication technologies. Paradigm shift or same old same old?

- [edit] propose an innovative service for scientists taking into account their information behavior (or something like that)

I got up and paced a little and stretched between the first and the second.  At about 3:35 in, people started talking and laughing outside the room. I started to get up and snarl at them, but decided against it. Then at like 3:45 they got really loud. Doesn't take much to break your concentration at that point.

Trying to study for tomorrow - well not study, but freshen up.  I'll be really burned this weekend and then have to go back for the last 4 hours on Monday!

[edited 7/16 to add the second question]


Test essay 9: Forces acting on scientists to share and not to share

Jun 29 2009 Published by under comps, open access

This is the final one of my practice essays before taking the real comps test at the end of July.  I need to correct the record, though. Apparently although all of these questions came from my advisor, he didn't write them all. These were ones proposed by committee members and rejected for inclusion in the exam. (the gaps in numbering you see are two essays that didn't go well). This particular question might be by my advisor with an ok from the two STS committee members. I didn't have any STS questions to practice with so he came up with this one - which I think is an excellent question.


Discuss the forces that move scientists towards open sharing of information and the countervailing forces that prevent scientists from sharing information or encourage them to actively guard information. You may want to distinguish between information on research problems and hypotheses, raw data / data sets, information on methods and apparatus, and information on results. Consider the role of technology in your answer.

I know, right? 🙂

My answer:

0. Introduction

In the past two decades, much controversy and discussion has centered on public access to scientific information, the cost of scholarly journals, and information sharing within science. There are many strong forces that encourage scientists to share and equally many countervailing forces that discourage scientists from sharing. This essay describes these forces and the role of technology. The essay ends by considering the role of various mandates in supporting information and data sharing in science.

1. Forces That Encourage Scientists to Share

There are many forces acting on scientists to encourage them to share information and data. These include:

  • wider recognition
  • finding collaborators and making information available within collaborations
  • making scientific information available to the public, scientists not in research institutions, and for data mining or serendipitous location
  • for generalized reciprocity, in order to get data
  • to increase the speed of science or creatively solve problems

1.1 Recognition

Science runs on reputation and recognition; that is, promotion, tenure, winning grants, and attracting graduate students all depend on successful publication of research results in prestigious journals and the citedness of those journal publications. Research has shown that there are many correlates to higher citation rates outside of the quality of the document. These include:

  • article is on the cover
  • article is discussed in the media
  • article is a review article
  • article is longer
  • article is in a more prestigious publication
  • article is open access.

This final correlate is somewhat disputed as there are studies showing both that open access does favorably impact citedness, immediacy, and usage as well as studies showing no statistically significant correlation between open access and citedness over the long term. Even if open access is not significant, we can see that being on the cover and being discussed in the media are both ways that the research is brought to the attention of other scholars. The point is that information sharing with the media increases article citation and recognition of the scientists.

Likewise the sharing of data, workflows, and algorithms in disciplinary repositories can lead to greater recognition of the scientist and his or her lab. Deposits to disciplinary repositories are signed, so high quality results are attributable to their source. The technology of the repository and standards for information structures within repositories make the shared information findable and useful.

1.2 Collaboration

In addition to recognition for promotion, tenure, and grants, recognition can also help in finding new collaborators and in sharing information within collaborations. By seeing the contributions of a person to a data, workflow, or e-print repository, a scientist looking for collaborators can judge the relevance of that person's experience and can also assess his or her expertise in an area.

Once scientists are in a collaboration, open and free information sharing is necessary for trust and to make the project work. This seems obvious but it must still be stated as the lack of information sharing within collaborations is frequently listed as a reason collaborations fail.

1.3 Making information available outside of the invisible college

Despite the frequent mentions in the literature that scientists do not want to consider the societal impact of their research (Polanyi, Merton) and do not want to communicate with the public (Weigold), recent surveys indicate that 75% of scientists do communicate with the media about their research and most scientists want their research to be useful and used. Forces moving scientists toward open information sharing include making information available and useful:

  • to scientists who cannot afford toll access to the literature
  • to scientists outside of the particular research area
  • for data mining
  • to the public.

1.3.1 Scientists without toll access

Scientists who are not in large research institutions do not have the same access to the literature because many of the abstracting and indexing services and journals are extremely expensive. Scientists publish in open access journals, post e-prints to their web pages or repositories, or respond favorably to reprint requests to make their research available to these scientists. There is some altruism involved, but the point of publication is to make the results of research available, so sharing of publications does this.

1.3.2 Serendipitous Finds

Scientists who share information in places indexed by major search engines enable serendipitous discovery by researchers outside of the invisible college. Scientists within the research area likely know what labs are doing which work and have access to new research results. Scientists outside of the research area might happen upon this work when looking for something else.

1.3.3 Data mining

In many if not most or all areas of science, computational methods that leverage large collections of data are being used to make new knowledge. Scientists are encouraged to share information and data without restrictive licenses to enable these new uses.

1.3.4. The Public

Open sharing of information with the public can have a positive impact on the government funding of research as well as showing return on past investments in science. Besides getting government funding, scientists can be altruistic, too. While rare, there are often-touted examples of parents researching the biomedical literature to assist in the diagnosis and treatment of their sick children.

1.4 Reciprocity

Scientists might share data for specific or generalized reciprocity. In other words, scientists might share data in order to get data from another scientist in particular or in hopes of getting data in the future from some other scientist.

1.5 Speeding Up Science

Scientists might want to share information openly to get feedback and to solve problems and to speed up the cycle of science. Posting of data or publications on a web page prior to official publication makes that information usable sooner and to a larger group. Many scientific instruments output electronic information. This information can be shared in real time via the web to allow multiple simultaneous diverse uses.

2. Forces That Prevent or Discourage Scientists from Sharing

There are many forces acting on scientists to discourage them from sharing information and data. These include:

  • fear of being scooped or ideas being stolen
  • Ingelfinger-type rules preventing information sharing prior to publication
  • intellectual property concerns of the organization
  • sensitivities of information regarding human subjects or national security
  • concern over misuse of information by anti-science groups
  • effort required to describe or format information for deposit or reuse

2.1 Being Scooped

Scientists sometimes do not want to share data until they have "wrung" all of the possible publishable science out of it. The concerns are that another scientist will publish the same information more quickly without the expense of gathering the data or that another scientist will find different information in the data that the original scientist missed (Birnholtz).

Indeed, a cited form of misbehavior in peer review is that the reviewer who is a competitor might use information in the submitted article or might hold up publication of an article until his or her own article is published first.

Some conferences and small workshops do not consider information shared to be "published" and there are guidelines on how this information can be used. Nevertheless, attendees might act on the information and might publish first.

2.2 Ingelfinger Rules

The Ingelfinger rule from the New England Journal of Medicine states that the journal will not publish any information previously presented in any venue or discussed with the media. Similarly, many journals have an embargo on discussing findings with the media until the date of publication of the journal or the posting of the article on the journal's web page in "early view." Scientists might not share information if they fear that by sharing they will not be able to publish in a prestigious journal. Some of these rules were strengthened after the cold fusion episode in which the scientists held press conferences before peer review of their work. Subsequent peer review and evaluation by other scientists found that their results were not reproducible.

2.3 Intellectual Property Concerns

Scientists might be prevented from sharing data or publishing if their organization intends to patent their discovery. Discussing a discovery or publishing the results starts a clock for patent application or can prevent a patent from being filed.

2.4 Sensitivities

Scientists who work with human subjects or with national security information might be discouraged from sharing due to sensitivities about protecting the privacy of the subjects or concerns over export control or classified information. There are ways to anonymize human subjects data but this still presents a barrier. Likewise, scientific facts should not be classified, but the sensitivities of the research funder trump the forces encouraging the scientists to share information.

2.5 Concern Over Misuse

Open sharing of research using animal experimentation or stem cell research has endangered the physical security of the researchers. By publishing in obscure disciplinary journals, the information is available to other scientists but less likely to attract attention from anti-science groups who have reacted violently in the past. Short of these violent reactions, scientists might be concerned that their research will not be understood.

2.6 Effort Required

Finally, a force acting against the sharing of data is the effort required to describe and make data accessible for wider use. In some fields it is quite easy and straightforward to share data in pre-existing, established, and well-supported repositories. In other fields there might not be any repositories, or the repositories that exist might be fragmented, with uneven funding and support (Borgman). It is often easier to save the data on a CD-ROM in a box under your desk than to properly document it and find a place to store it online.

3. Mandates

Despite the competing forces that encourage and discourage scientists from sharing information, there are mandates to share information coming from several sources. First, funding bodies may require submission of a journal publication to an open repository as a condition of accepting the grant and funders of big science projects require them to make resulting data freely available online. Second, research institutions may mandate submission of publications to the institutional repository for all of their scholars. Third and most successfully so far, groups of journals in a research area might require that the supporting data be submitted to a repository at the time of publication.

5. Conclusion

There are many competing forces moving scientists to share and not to share data, workflows, and publications. The salience of each of these depends on a number of factors not discussed explicitly above but including:

  • the norms and the culture of the research area (sub-discipline)
  • the existence of standards and established infrastructure to support sharing
  • the funding source for the research
  • the scientist's employer
  • the scientist's place in his or her field (in other words, an established scientist might be less concerned with being scooped)

Information scientists can support information sharing by removing barriers related to finding a repository and making the deposit of data or publications. Likewise, we can address ways to secure information such that only those who should have access do. We can also help scientists discuss information sharing with publishers and other scientists to make these concerns explicit and to remove unnecessary barriers.

Technology has facilitated information sharing and discovery, but it alone does not address the cultural and social barriers to information sharing. Ultimately, understanding the social aspects of science along with the technological requirements for information sharing is needed to encourage scientists to share.


Comps reading & university scientist patenting

Jun 06 2009 Published by under comps, information policy, Information Science

This post serves a few purposes.  First, Bill Hooker asked about the
importance of patents to university researchers [*]. Bill also posted
a nice summary of AUTM's statistics (later in DrugMonkey's comment
stream there were other discussions of biotech spinoffs).  There are
certainly some patents that could make
you obscenely wealthy, but these are not all that common (1:2000?
1:10,000?) and I think the consensus is that most tech transfer offices
break even (or slightly better), but do not provide massive revenue
streams for their universities.  So why do universities have
tech transfer offices?  The first reason is that universities
have a social obligation - in particular if they are publicly funded
and land-grant - to serve the community.  Universities
create the knowledge that can make the world a better place, but they
themselves can't go into vaccine production or widget production.
 Even after something is patented, it will take some support
from the scientist or engineer, and possibly a lot of work to actually
get it into production - particularly if it's a highly regulated
industry. Why would companies spend all of that money and take that
risk if they didn't have first mover advantage?  So
universities must patent so that they can get companies to take these
ideas, invest in them to get them into actual products, and actually
make them available for society to make the world a better
place. Second, it's a matter of prestige for the university.
They are competing for research funding, employees, and students, so
this is one way to prove that they're doing cutting edge work.
 Third, even though you can publish to prevent other people
from patenting, the safest way to make sure you can continue to use a
technology is to protect it with a patent, and then not to prevent any
future work in any licenses.

The second idea is that I'm all into STS - but I think theories and
studies that neglect the structural issues of society and the lab's
place in the world miss really important factors.

Third idea is that I need to integrate ideas from STS better - in
particular since I don't have any practice questions for STS AT ALL and
this is a major area for me. (BTW - do you or someone you know have an
MS/MA or PhD in Science Studies, Philosophy of Science, or
Science and Technology Studies? If so, could you forward me
any essay exam questions I could use for practice?  All of the
websites for the programs say "they're available at the office or from
your advisor" so I would be eternally grateful)

This article is on my comps reading list and is apparently where I got
some of the ideas I spouted above and on Bill's blog and friendfeed:
Kleinman, D. L. (1998). Untangling Context: Understanding a University
Laboratory in the Commercial World. Science, Technology, & Human
Values, 23(3), 285-314. [**]
Kleinman argues that the approaches used in actor-network theory and
social worlds emphasize agency to the exclusion of social structural
constraints on actors.  These ignored constraints include
those imposed by institutions and the distribution of resources.
 These methods also basically start at time 0, with the actors
co-constructing everything - but we know that there are existing stable
attributes of the world in which the lab and scientists are situated.
 So the author has it both ways: structures are constructed,
but "actors will confront structures that are already constructed and
that these structures will shape practices".  There is a
tradition of studying laboratories (a whole area in STS) - but many of
these, including the most famous by Latour and Woolgar [***], study
labs that are insulated from the application of their science.
 Kleinman studies a lab that not only researches the basic
science, but also the application of the science, and has
collaborations with the university tech transfer office as well as
agro-businesses.  Kleinman acts as a participant observer in
the lab (like a tech who actually screens soil samples and what not as
well as an ethnographer) for a while and then just an observer.

He was in the lab at a really key point.  Patenting of rDNA
technology, a Supreme Court decision allowing the patenting of genes
and things, the increasing importance of corporate funding of
university biotechnology labs (even if still a small % of the
university's total research funding), the Bayh-Dole act, and
the commercialization and standardization of biological research
materials all impact the way this lab does science. (p294-5). (as an
aside - interesting background on biocontrol - what this lab studies -
contrasted with chemicals developed after WW2 - also compare with
studies used in diffusion of innovations research).  The
acceptance of a chemical fungicide that performs about the same
function as the lab's biocontrol agent means that the agent is used as
a yardstick - the new agent must be as effective, easy to use, and
inexpensive as the existing.  The lab was using commercially
prepared polymerase - which it turned out, was contaminated so was
messing up their results.  This is an example of how the lab
is dependent on a limited number of companies for standardized resources
that it sometimes treats like black boxes, trusting the specifications from the

The lab discussed making its own instead of buying to avoid
contamination issues, but the polymerase was proprietary as was PCR.
 Even though there is an exemption for experimental use for
basic science, the owner of PCR and the polymerase had sued university
and government labs and scientists for patent infringement.
 It's not whether or not the university and lab could
successfully defend against such a lawsuit, it's whether they have the time
and resources to spend doing so.  Also, if they do lose, or
win but the basic research right is narrowed, it could really cost them
for future work. It's easier to pay than to defend a moral right -
particularly when your goal is to do science.  

The lab regularly applies for patents, but the author didn't see
evidence of them changing what the lab studies to increase the
number of patents. What follows that is a discussion like my point
above - the evidence does not back up the idea that patent protection
is critical for innovation - but people believe that it is, and to
collaborate with industry university labs in biotech need to patent and
license.  The leader of the lab also uses patents to attract
industry notice and research funding.  Unfortunately,
university tech transfer officials might not be in harmony with the lab
- they might grant an exclusive license which would impede the lab from
working with other companies.

So compare this with Polanyi's [****]
discussion of the invisible hand,
in which there's a marketplace for ideas in science, and scientists
work on those that are most promising and to which they can make the
biggest contribution.  In this case, the milieu of the lab has
a lot to do with the research problems selected, particularly if the
university tech transfer office writes the licenses carelessly.
 Likewise, the patenting of the inputs to the lab makes them
more expensive, means there are fewer suppliers, and that the
experiments might need to be done differently.

Nevermind the fact that discussing or disclosing an invention to anyone
besides the inventor starts certain clocks and that inventors should
already have official disclosures and maybe provisional applications in
via their tech transfer office prior to discussing an invention at any
meeting.  Also nevermind the idea that you might intentionally
publish either formally or informally to prevent someone else from
[**] Please note that
>80% of the people reading this blog will know 100% more about
this area of science than either I or the author of the piece - but
that's not the point!
[***] Latour, B., & Woolgar, S. (1986). Laboratory life:
The construction of scientific facts. Princeton, N.J.:
Princeton University Press.
[****] Polanyi, M. (2000). The republic of science: Its
political and economic theory. Minerva: A Review of Science,
Learning & Policy, 38, 1-21.


Test essay 5: Design a system to find potential collaboration partners for scientists

This is the 5th of the test essays in preparation for comps.
 This question was posed by my advisor.  I opened it
and went, "wow." It's sort of like the perfect storm of question.
 When I first finished it, I thought I did really well, but
now it seems less than completely satisfying.  So here's the essay written in 2 hours, timer started prior to opening the question.

Question (IR 1)
Informal interpersonal communication is very important among
scientists. Describe a retrieval system to identify collaborators.
Include the following in your answer:

a. Knowledge representation to enhance matching queries

b. Search features and fields

c. Relevance criteria that scientists might apply

0. Introduction

With increasing complexity in scientific work, collaborations spanning
discipline, department, institution, and country are ever more
important to exploit the expertise available and to make the most
scientific progress. Yet it can be complex to identify potential
collaborators outside of a scientist’s particular research
area even within an institution and particularly internationally. This
essay describes a retrieval system to assist scientists in identifying
potential collaborators. The essay starts by describing reasons a
scientist might be looking for collaborators and what a good system
suggestion would look like. It continues by describing the knowledge
representation system required to support retrieval. It then describes
the required search features and fields. The essay ends by describing
relevance criteria scientists might apply.

1. Problem Space

Finding potential collaborators is not a trivial problem, in particular
for scientists entering a new field or searching outside of their
primary field. This system seeks to represent the user’s
collaboration needs as a series of queries, to represent the pool of
collaborators in the system, and to match the two.

Scientists might be looking for collaborators for several reasons.
These include looking for similar interests or methods or looking for
complementary interests or methods. Looking for similar interests or
methods will enable the scientists to form common ground more quickly,
and to immediately work on new problems; however, it might be more
typical that the collaborator is expected to fill in the missing puzzle
piece, and to bring a missing but needed expertise or resources (genes,
reagents, equipment, etc.) to the collaboration.

Once scientists identify potential collaborators on topic, they must
also assess how well that potential collaborator’s
personality, working style, or authority fits in with the existing
team. In other words, even if the potential collaborator has the ideal
interests or skills, he might not be a good collaborator if he is a
jerk. This system will not address these aspects directly; however, the
system could link out to social networking sites or social computing
sites so that the user can assess personality and work style.

1.1 Similar Interests

A scientist might be looking for another scientist who:

  • is interested in precisely the same research problem
  • uses the same model organism or chemical
  • studies the same system or entity
  • comes from the same research paradigm, approach, or school
    of thought
  • or who uses the same experimental equipment

1.2 Similar Skills or Methods

In addition to interests, scientists might be seeking collaborators who
have similar skills, who use similar experimental methods, or who use
similar mathematical, statistical, or computational methods.

1.3 Complementary Interests

An example of complementary interests comes from ecology: when one
scientist is interested in a particular organism and another is
interested in the ecosystem, they might be able to share data or
collaborate when that organism is sighted in the ecosystem. Zimmerman
and Van House both describe the sharing of biology data between
scientists with complementary interests.

1.4 Complementary Skills or Methods

In this case, the scientist might be seeking a collaborator who can
bring a necessary expertise to the collaboration. For example, the
scientist might have decided that a certain data reduction or
computational method is required to deal with the data, so will search
for that needed expertise, regardless of the subject area in which the
potential collaborator typically works. Statisticians in biomedical
research areas typically appear as co-authors on papers with very
diverse research participants, diverse diseases or health issues, and
so forth.

2. Knowledge Representation

In this system, knowledge representation is how we describe a
collaborator so that the system can retrieve relevant collaborator
profiles and so the user can assess relevance using the criteria
described in section 4, below. The user’s knowledge representation
and the representation of his or her query are described below in
section 3, Search Features and Fields. Representation of the potential
collaborators also includes the derivation of the information to
complete the profile.

2.1 Representation of a Potential Collaborator

As described in section 1 above and further described below in section
4, Relevance Criteria, users of the system might be looking for similar
or complementary interests or skills. The representation should include
fields typically found in a CV and others:

  • Name
  • Language proficiency
  • Current location
  • Willingness to travel
  • Citizenship, and clearances (if US)
  • Education (what degree, from where, with which advisor, at
    what time)
  • Employment (institution, location, lab or research group)
  • Model organisms used
  • Organisms studied
  • Chemicals or systems studied
  • Approaches
  • Equipment expertise
  • Techniques/Skills
  • Articles Written (these should be described with the
    typical citation fields, and also descriptors)
  • Conferences attended
  • Society memberships
  • Classes taught

Not all of these fields will be relevant to all users. Most of these
fields should be tied to controlled vocabularies to enhance retrieval.
The controlled vocabularies should be those commonly used in each field
so the contents of the field will include the term, but also a semantic
link to a “namespace” or vocabulary.
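
As a minimal sketch of this idea, each controlled term can carry both its label and the vocabulary it came from. The class names, the handful of fields chosen, and the namespace identifiers ("ncbi-taxonomy", "mesh") are illustrative assumptions, not part of the design above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    """A controlled-vocabulary term plus the vocabulary it comes from."""
    label: str       # human-readable term, e.g. "Danio rerio"
    namespace: str   # hypothetical vocabulary identifier, e.g. "ncbi-taxonomy"


@dataclass
class Profile:
    """A small, illustrative subset of the collaborator fields listed above."""
    name: str
    location: str = ""
    languages: list[str] = field(default_factory=list)
    model_organisms: list[Term] = field(default_factory=list)
    techniques: list[Term] = field(default_factory=list)

    def terms(self) -> set[Term]:
        """All controlled terms in the profile, regardless of field."""
        return set(self.model_organisms) | set(self.techniques)


# Made-up example record.
example = Profile(
    name="A. Researcher",
    location="Baltimore, MD",
    languages=["English"],
    model_organisms=[Term("Danio rerio", "ncbi-taxonomy")],
    techniques=[Term("Polymerase Chain Reaction", "mesh")],
)
```

Keeping the namespace next to the label means two profiles only match on a term when they mean the same thing in the same vocabulary, which is the point of the semantic link.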

2.2 Derivation of the Representation Information

As in literature, characterization can be done through an explicit
description of the character, what the character says, or through the
character’s actions. In other words, we can derive the
representation of the scientist through

  • a profile the scientist creates;
  • mining of the articles, data sets, protocols, and
    other deliverables the scientist creates;
  • and compiling information on previous collaborations,
    guest lectures given, and conferences and workshops attended.

More experienced scientists will have a lot more information in their
profile than younger or less experienced scientists. For this reason,
it is equally valid that the user asserts expertise in his or her
profile as well as substantiating that expertise with published
The system should mine the research literature, have user-created
profiles, mine society web sites and membership directories (when
allowed), and mine university and research lab web sites.

3. Search Features and Fields

The system has to support query formulation in several ways to be
effective. The system should allow standard fielded keyword searching,
browsing by various fields, matching or recommendations by similarity,
and alerting.

3.1 Keyword or Guided Keyword Searching

Most users expect to find a screen into which they can just enter a few
keywords and get some results. This system should provide that feature,
and if possible should attempt to map input terms to appropriate
fields. For example, if someone enters a chemical name, the system
should return scientists that study that chemical, that use methods
which require that chemical as matrix, reagent, or whatever, and so
forth. The search results should provide facets to allow the user to
further refine the search by other criteria.
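
One way this could work, sketched under simplifying assumptions: the system checks the keyword against each searchable field, reports which field matched, and counts results per facet value. The records, field names, and substring matching are all hypothetical stand-ins for whatever the real index would do.

```python
from collections import Counter

# Hypothetical toy records; field names loosely follow section 2.1.
PROFILES = [
    {"name": "Scientist A", "chemicals_studied": ["acetonitrile"],
     "techniques": ["HPLC"], "location": "US"},
    {"name": "Scientist B", "chemicals_studied": [],
     "techniques": ["HPLC", "acetonitrile extraction"], "location": "DE"},
    {"name": "Scientist C", "chemicals_studied": ["benzene"],
     "techniques": ["NMR"], "location": "US"},
]
SEARCH_FIELDS = ["chemicals_studied", "techniques"]


def keyword_search(keyword, profiles=PROFILES):
    """Return (profile, matched_fields) pairs for a case-insensitive substring match."""
    needle = keyword.lower()
    hits = []
    for profile in profiles:
        matched = [f for f in SEARCH_FIELDS
                   if any(needle in value.lower() for value in profile[f])]
        if matched:
            hits.append((profile, matched))
    return hits


def facet_counts(hits, facet):
    """Count result profiles per value of one facet, for a refine-by list."""
    return Counter(profile[facet] for profile, _ in hits)


hits = keyword_search("acetonitrile")
```

Here Scientist A matches because the chemical is studied and Scientist B because a method uses it; `facet_counts(hits, "location")` then gives the counts a refine-by-location list would show.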

Guided keyword searching would allow the user to input a term, and
require the system to locate that term within a field. As in other
research databases, the user should be able to look up the appropriate
term in an index, or just enter any keyword that comes to mind. The
system should offer search suggestions and spell checking.

3.2 Browsing

Browsing is underrated by many information retrieval practitioners but
it is an important way to search for information. In this system,
browsing is particularly important because users might not know how a
method is described in a different discipline. Once the user clicks or
inputs any facet to browse, the system could provide a list, and then
display facets by which the user could narrow. For example, show me
only those who are within 100 miles of my current location.
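
The distance facet in that example could be sketched with the haversine great-circle formula. This assumes each profile stores a latitude/longitude pair; the profile records and coordinates below are approximate and made up for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, approximate


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(profiles, here, miles=100):
    """Keep only profiles whose location is within `miles` of `here` (lat, lon)."""
    return [p for p in profiles if miles_between(*here, *p["latlon"]) <= miles]


# Hypothetical profiles with approximate coordinates.
profiles = [
    {"name": "Lab in Washington, DC", "latlon": (38.91, -77.04)},
    {"name": "Lab in Boston", "latlon": (42.36, -71.06)},
]
baltimore = (39.29, -76.61)
nearby = within_radius(profiles, baltimore)
```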

Rashmi Sinha described something called pivot browsing, where from a
particular record found through any method, you can then start a new
browse by clicking on a field. In other words, you could browse other
profiles using the same term from the same controlled vocabulary and
maybe get recommendations from other vocabularies.

3.3 Matching by Similarity

A scientist who is represented in the system should be able to find
another scientist listed in the system by highlighting portions of his
or her profile and asking the system to locate similar scientists.
There could be some prioritization system, too, if multiple fields are
selected. For example, it is most important that they have the same
citizenship as me, but it is also important that they study this
organism, and it would be nice if they use this method, and I could
meet them at an upcoming meeting or workshop because they’ve
attended that meeting in the past.
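
A rough sketch of such a prioritization, assuming each selected field holds a set of terms and the user assigns each field a numeric importance. The Jaccard overlap and the weights are illustrative choices, not something the design above prescribes.

```python
def weighted_similarity(mine, theirs, weights):
    """Weighted average of per-field Jaccard overlaps.

    mine, theirs: dicts mapping field name -> set of terms
    weights: dict mapping field name -> importance (larger = more important)
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for field_name, weight in weights.items():
        a = mine.get(field_name, set())
        b = theirs.get(field_name, set())
        union = a | b
        overlap = len(a & b) / len(union) if union else 0.0
        score += weight * overlap
    return score / total


# Hypothetical searcher and candidates.
me = {"citizenship": {"US"}, "organisms": {"zebrafish"},
      "methods": {"confocal microscopy"}}
candidates = {
    "X": {"citizenship": {"US"}, "organisms": {"zebrafish"},
          "methods": {"flow cytometry"}},
    "Y": {"citizenship": {"DE"}, "organisms": {"zebrafish"},
          "methods": {"confocal microscopy"}},
}
# "Most important", "also important", "would be nice".
weights = {"citizenship": 3, "organisms": 2, "methods": 1}

ranked = sorted(candidates,
                key=lambda k: weighted_similarity(me, candidates[k], weights),
                reverse=True)
```

With these weights candidate X ranks above Y: X matches on the two most important fields, while Y matches on organism and method but not citizenship.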

Similarly, a user could point to a research article or protocol, and
ask the system to locate a scientist relevant to the methods used or
the topic studied. This match might occur by looking at citation of
that section of the document, or even better, the system could extract
the methods or topics from the document, and then match these in the

3.4 Alerting

The system should also be configured to alert users to new profiles
matching their search profile. This alerting can happen within the
system with a message displayed on login, or e-mails or RSS feeds can
be sent out to the user if they prefer.

4. Relevance Criteria

There are many criteria a scientist might use to assess the relevance
of a potential collaborator using their profile in the system.

4.1 Topical Relevance

Topical relevance describes the match of the features of the searcher
with the features of the retrieved scientist. A potential collaborator
will be directly relevant if they are interested in the same research
problem, they use the same approaches, they have the needed skills,
they use the same research or analysis methods, and so forth. A
potential collaborator will also be relevant if they bring new and
necessary skills and expertise to the collaboration – this is
similar to the topical relevance idea of novelty.

4.2 Other Relevance Criteria

There are other important relevance criteria in addition to topical
matching. These include availability, policy, seniority, and access to resources.

4.2.1 Availability

An ideal collaborator will have time available at the appropriate
portions of the project cycle to devote to the issue. While this seems
relatively straightforward, the new project might be more important to
the scientist than existing projects, so he or she might be willing to
oversubscribe or accept projects even when fully scheduled. On the
other hand, existing requirements such as teaching a course or serving
on a committee might severely restrict the scientist’s
ability to travel to equipment or allocate the time required.

4.2.2 Policy

Policy criteria are multidimensional. Depending on the funding source,
the scientist’s citizenship and location might be extremely
important. For example, a national government might require that all
investigators come from that country. Likewise, a funding body might
call the research that results from a collaboration export controlled,
so team members could not discuss results without export clearances.
Other policies that could impact this are requirements for the
protection of privacy of participants (health information, student
information, etc), the treatment of animals, or the treatment of
intellectual property. Potential collaborators with experience and
education in the funding body’s requirements for privacy, for
example, will be more attractive than collaborators who need to spend a
great deal of time in online compliance courses and certifications.

4.2.3 Seniority

Potential collaborators might be very senior, running their own lab, or
might be new post doctoral researchers. More than topic matching on
expertise, these criteria rely on mentoring possibilities and project
management skills.

4.2.4 Access to Resources

For some “big science” projects, the primary
relevance criterion might be access to large, very expensive equipment,
but it is also important for smaller projects. Access to resources
could mean model organisms, human subjects, astronomical datasets,
computing power, ecosystems, or spectrometers, for example. A scientist
might be skilled in the use of some equipment, but might only be
relevant if he or she can access the equipment for the collaboration.
An organism or ecosystem might be of interest, but a scientist might
need to find a local scientist who has physical access to go to the

5. Conclusion

In this essay I have described desirable features for a system to
recommend potential collaborators for a scientist. I described the
problem space, how the collaborators would be represented and how that
information is obtained, how search should be facilitated, and how the
users will most likely judge relevance.

In this system, I have omitted discussion of probably one of the most
important things in collaboration, chemistry – as in e-dating
sites. In other words, whether the two potential collaborators can get
along or if they will have personality and working style clashes that
will impede the work. The system could facilitate the scientists
chatting online, but meeting outside of the system is probably more
important in determining if the collaboration will work. Additionally,
some might suggest a rating system for collaborators, but this has so
many political and ethical implications it could prevent the system
from working or indeed, ever being released for use. Unless someone has
committed scientific fraud which is well documented, they should stay
in the system and be judged based on their actions and outputs which
are less controversial and less likely to cause unpleasantness.


Comps readings: community detection

Jun 02 2009 Published by under comps, Information Science, online communities

Last set of comps readings, I talked about sense of community:
belonging, having influence, fulfillment of needs, and emotional
support.  Now, let's talk about the physics version of
"community" - cohesive subgroups.  These are groups of nodes in a
graph that are more connected to each other than to other parts of the
graph. Clumpy spots.  If you read old Wasserman and Faust, you'll
probably think of cliques, cores, and lambda sets... somehow these
didn't do it for me - literally, when I was trying to locate
communities in science blog networks, it didn't work.  If you have a
computer science or maybe even sociology background you'll probably
just look at some sort of clustering (agglomerative or divisive).  The
hot thing for the past few years comes from physicists and that's
what's covered here.  I did other posts on SNA articles, so those are
mostly elsewhere. (BTW - if you ever take stats for the social sciences and
can substitute R for Stata, do so and take the time to learn it. The
igraph package for R has all of the coolest community detection
thingies in it) (note, too, that these readings are not necessarily for
the dabbler in bibliometrics or social network analysis!)

Newman, M. E. J., & Girvan, M. (2004). Finding and evaluating
community structure in networks. Physical Review E (Statistical,
Nonlinear, and Soft Matter Physics), 69(2), 26113-21.
This article, like the ones from Barabasi, sort of kicked off this
flurry of research.  They use a divisive clustering technique
- so they start with the whole network, and break the connections with
the highest betweenness.  [Figure: bowtie.png]
See how if you remove that one line, you completely break up the
thing? That line has
high betweenness. So they calculate that for all of the lines using
whatever method, then take the line with the highest out, then
re-calculate and remove, and again. They then go on to talk about the
actual algorithm to use to efficiently do all of this betweenness
calculating and give some examples.  There's a lot in this
article, though, because they next talk about how to figure out when
you're done and if you've got decent communities. This measure is
modularity (see the article for the definition), but basically it's 0
if random and 1 is the maximum. If you calculate Q at each step, then
you can stop when it's highest. Note that any given node can only be in
one community, unfortunately. (in real life, people are nearly always
in multiple communities)
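
To make the remove-recalculate-repeat loop concrete, here is a small pure-Python sketch. It is not the authors' efficient algorithm: it uses a plain Brandes-style betweenness calculation, it assumes an unweighted, undirected graph with no self-loops, and the toy graph (two triangles joined by one line) is only my guess at what a bowtie figure would show.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness for an unweighted, undirected graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist, sigma, preds = {s: 0}, {s: 1.0}, {s: []}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, pushing path credit onto edges.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += credit
                delta[v] += credit
    # Each undirected path was counted once from each endpoint.
    return {e: c / 2.0 for e, c in score.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in group:
                group.add(v)
                stack.extend(adj[v] - group)
        seen |= group
        groups.append(group)
    return groups


def modularity(original_adj, groups):
    """Q = sum over groups of (fraction of edges inside) - (fraction of degree)^2."""
    m = sum(len(nbrs) for nbrs in original_adj.values()) / 2.0
    q = 0.0
    for group in groups:
        inside = sum(1 for v in group for w in original_adj[v] if w in group) / 2.0
        degree = sum(len(original_adj[v]) for v in group)
        q += inside / m - (degree / (2.0 * m)) ** 2
    return q


def girvan_newman(adj):
    """Remove the highest-betweenness line repeatedly; keep the split with the highest Q."""
    work = {v: set(nbrs) for v, nbrs in adj.items()}
    best_groups = components(work)
    best_q = modularity(adj, best_groups)
    while any(work.values()):
        scores = edge_betweenness(work)
        u, v = max(scores, key=scores.get)
        work[u].discard(v)
        work[v].discard(u)
        groups = components(work)
        q = modularity(adj, groups)
        if q > best_q:
            best_q, best_groups = q, groups
    return best_groups, best_q


# Toy graph: triangles a-b-c and d-e-f joined by the single line c-d.
bowtie = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
groups, q = girvan_newman(bowtie)
```

On this toy graph the c-d line has the highest betweenness, so it goes first, and Q peaks right after that removal with the two triangles as the communities. Modularity is always measured against the original graph, not the one with lines removed.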

Reichardt, J., & Bornholdt, S. (2006). When are networks truly
modular? Physica D, 224(1-2), 20-26. doi: 10.1016/j.physd.2006.09.009
They review Newman and Girvan and suggest a new way that groups
connected nodes and separates non-connected
nodes.  They go through a process and end up
with an equation that's apparently like a Hamiltonian
for a q-state Potts spin glass (dunno, ask a physicist if you need more
info on that!).  This method allows for overlapping
communities because there could be times when you could move a node
from one community to the next without increasing the energy.
 They compared it for some standard graphs and it did better
than N-G. Instead of just stopping by maximizing modularity, they
compare the modularity to a random graph with the same degree
distribution.
Reichardt, J., & Bornholdt, S. (2007). Clustering of sparse
data via network communities-a prototype study of a large online
market. Journal of Statistical Mechanics: Theory and
Experiment, P06016. doi:10.1088/1742-5468/2007/06/P06016
In this one they test the spin glass community detection method on
the German version of eBay to look for market segmentation. The network
has bidders as nodes, and if they bid on the same item there is an
edge.  The spin glass method was successful at pulling out
clusters and using odds ratios, the authors showed that these clusters
corresponded to groupings of subject categories. The Q was much higher
than it would be for a random graph.
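
The network construction described here (bidders as nodes, an edge when two bidders bid on the same item) can be sketched as below. The bid records are made up, and counting shared items as an edge weight is my own assumption; the description above only says that an edge exists.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (bidder, item) records; no real auction data is reproduced.
bids = [
    ("ann", "camera"), ("bob", "camera"),
    ("bob", "lens"), ("cat", "lens"),
    ("dan", "stamp"),
]

# Group bidders by the item they bid on.
bidders_by_item = defaultdict(set)
for bidder, item in bids:
    bidders_by_item[item].add(bidder)

# Link every pair of bidders who share an item; weight = number of shared items.
edge_weight = defaultdict(int)
for bidders in bidders_by_item.values():
    for a, b in combinations(sorted(bidders), 2):
        edge_weight[(a, b)] += 1
```

Here ann-bob and bob-cat become edges, while dan, who bid alone, stays unconnected. A community detection method would then run on this bidder network.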


Comps readings: virtual communities

May 31 2009 Published by under comps, Information Science, online communities

Sunday morning I was all set to do another essay - just had to pick a
question source and question - when my mother in law called to say she
would be stopping by at about the same time I would be finishing up the
2 hour window, leaving no time for emergency house cleaning (no, I
haven't grown out of that yet despite being married for >10
years). So here are a few readings on "community" which I'll drop like
a hot potato and then run to clean the house.

Both Wellman and Rheingold dispute the idea that we're all "Bowling Alone"
and assert that virtual communities appearing in computer-mediated
communication are real communities, but what does "community" look
like online? Is the implementation of a "community" software tool
enough? We're in a second wave of all sorts of vendors offering their
own online communities - this was also done in the 90s. Are these
communities? Only when they succeed? Never? It depends? On what? At
the same time, there are
lots of articles coming out in the physics literature on mathematical
ways to identify cohesive subgroups in networks and they
call this process identifying communities.  Are they
identifying communities or only cohesive subgroups? Could you develop
an algorithm to locate a community?  How would you test what
you found to see if it's really a community (or maybe it's a group of
people all disputing a knowledge claim, what Collins called a core
set)?  Is a binary yes or no enough or do we need to know what
participants feel and why?

Blanchard, A. L., & Horan, T. (1998). Virtual Communities and
Social Capital. Social Science Computer Review, 16(3), 293-307.

This article is more or less a direct response to Putnam's Bowling
Alone. His thesis was that increasing online activity led to
decreasing community participation and civic engagement and that this
low participation hurts the community as a whole. Blanchard and Horan
look at three possible outcomes of online communities: 1) that online
communities enhance f2f communities, 2) that online communities detract
from f2f communities, or 3) that they are unrelated. Since this was
written, social capital has been defined (and operationalized) at an
individual level, a group level, and a societal level. Putnam really
looks at the societal level. They quote him
describing it as "the features of social organization such as networks,
norms, and social trust that facilitate coordination and cooperation
for mutual benefit."  When they define virtual communities,
they differentiate between online places for physical communities (my
neighborhood has a Yahoo! Group) and online-only communities.

Networks in virtual communities might be larger and more geographically
dispersed. They might also encourage participation by some
who might not participate f2f. Norms in communities
include reciprocity - doing favors and having favors returned.
The idea in this article is that generalized reciprocity (not
direct, where Mary does a favor for Bob and Bob returns it, but
indirect: Mary does a favor for Bob, Sue sees it, and Sue does a
favor for Mary) is increased in virtual
communities because helping acts are visible (see, however, Wasko
& Faraj, discussed on my old blog - they found that reciprocity
didn't really explain any variance in contribution to a professional
virtual community).  Blanchard and Horan also discuss lurking
as a negative social norm, akin to free riding (see, however, various
discussions by Nonnecke and Preece as well as those by Lave and Wenger
on legitimate peripheral participation).  With respect to
trust, it might be increased by increased social identity in virtual
groups and decreased social cues (less stereotyping by physical
attributes), but it will be decreased by flaming, trolls, and deception.

Blanchard, A. L. (2004). Blogs as Virtual Communities: Identifying a
Sense of Community in the Julie/Julia Project. Into the Blogosphere:
Rhetoric, Community, and Culture of Weblogs. Retrieved from

When I talk about blogs as communities, I mean like between blogs, or
collections of blogs, or bloggers linking to each other and commenting
on each other's blogs. In this paper, Blanchard looks at a
community that formed within the comments of a single blog (that became
a book, and isn't there a movie coming out?). The comments in
this blog were like a forum and sometimes wandered from the topic of
the post and had a life of their own. She asks
whether this is truly a community or only a virtual settlement.
 Virtual settlement comes from a paper in JCMC by Jones. It is
defined as when there is "a) a minimal number of b) public interactions
c) with a variety of communicators in which  d) there is a
minimal level of sustained membership over a period of time."
A community, on the other hand, has a sense of community, which includes
a) feelings of membership, b) feelings of influence, c) integration and
fulfillment of needs, and d) shared emotional connection.
This "sense of community" comes from f2f research on
communities (the next article discusses measuring it in virtual
situations). She did a survey of the commenters after the blog had been
around for 11 months. Some respondents who commented
frequently felt strongly that it was a community, while others, who kind
of read it like they would a newspaper, thought not (oh, really? 🙂 ).

Blanchard, A. L. (2007). Developing a Sense of Virtual Community
Measure. CyberPsychology & Behavior, 10(6), 827-830. DOI:
This one was done a few years later (obviously) and she was trying to
develop a valid and repeatable sense of community measure for virtual
communities. In previous work, people pretty much just adapted the f2f
sense of community, but it turns out that community might feel
different in virtual settings than f2f. This measure was developed like
others - f2f scales were modified, and new questions were added to
address things that are different in virtual settings. There was a
pilot, and then it was tested with other groups (total n=256, 7 Usenet
groups and listservs). She used factor analysis with maximum likelihood
factoring and a promax rotation. Once things were dropped
that didn't load where they were supposed to, the internal reliability
coefficient for the SOVC scale was 0.93. Tested with the
groups, it explained 53% of the variation while the standard sense of
community only explained 46% (better, but eh).
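
For a feel of what an "internal reliability coefficient" measures, here is a small Python sketch. The notes above don't say which coefficient was reported, so this assumes the usual one for survey scales, Cronbach's alpha. The response data are invented and have nothing to do with the actual SOVC items.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents' item scores.

    `responses` is a list of rows, one per respondent, each row holding
    that person's score on every item of the scale.
    """
    k = len(responses[0])  # number of items in the scale
    item_variances = [
        variance([row[i] for row in responses]) for i in range(k)
    ]
    total_variance = variance([sum(row) for row in responses])
    # Alpha rises as the items vary together rather than independently.
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers from five respondents to a three-item scale (1-5).
toy_responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
    [3, 3, 3],
]
alpha = cronbach_alpha(toy_responses)
print(round(alpha, 2))
```

Values close to 1 mean the items move together, which is why dropping the items that "didn't load where they were supposed to" would be expected to push the coefficient up.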
