This is the 5th of the test essays in preparation for comps.
This question was posed by my advisor. I opened it
and went, "wow." It's sort of like the perfect storm of a question.
When I first finished it, I thought I did really well, but
now it seems less than completely satisfying. So here's the essay written in 2 hours, timer started prior to opening the question.
Question (IR 1)
Informal interpersonal communication is very important among
scientists. Describe a retrieval system to identify collaborators.
Include the following in your answer:
a. Knowledge representation to enhance matching queries
b. Search features and fields
c. Relevance criteria that scientists might apply
With increasing complexity in scientific work, collaborations spanning
discipline, department, institution, and country are ever more
important to exploit the expertise available and to make the most
scientific progress. Yet it can be complex to identify potential
collaborators outside of a scientist’s particular research
area even within an institution and particularly internationally. This
essay describes a retrieval system to assist scientists in identifying
potential collaborators. The essay starts by describing reasons a
scientist might be looking for collaborators and what a good system
suggestion would look like. It continues by describing the knowledge
representation system required to support retrieval. It then describes
the required search features and fields. The essay ends by describing
relevance criteria scientists might apply.
1. Problem Space
Finding potential collaborators is not a trivial problem, in particular
for scientists entering a new field or searching outside of their
primary field. This system seeks to represent the user’s
collaboration needs as a series of queries, to represent the pool of
collaborators in the system, and to match the two.
Scientists might be looking for collaborators for several reasons.
These include looking for similar interests or methods or looking for
complementary interests or methods. Looking for similar interests or
methods will enable the scientists to form common ground more quickly,
and to immediately work on new problems; however, it might be more
typical that the collaborator is expected to fill in the missing puzzle
piece, and to bring a missing but needed expertise or resources (genes,
reagents, equipment, etc.) to the collaboration.
Once scientists identify potential collaborators on topic, they must
also assess how well that potential collaborator’s
personality, working style, or authority fits in with the existing
team. In other words, even if the potential collaborator has the ideal
interests or skills, he might not be a good collaborator if he is a
jerk. This system will not address these aspects directly; however, the
system could link out to social networking sites or social computing
sites so that the user can assess personality and work style.
1.1 Similar Interests
A scientist might be looking for another scientist who:
- is interested in precisely the same research problem
- uses the same model organism or chemical
- studies the same system or entity
- comes from the same research paradigm, approach, or school
- uses the same experimental equipment
1.2 Similar Skills or Methods
In addition to interests, scientists might be seeking collaborators who
have similar skills, who use similar experimental methods, or who use
similar mathematical, statistical, or computational methods.
1.3 Complementary Interests
An example of complementary interests comes from ecology. When one
scientist is interested in a particular organism and another is
interested in the system, they might be able to share data or
collaborate when that organism is sighted in the ecosystem. Zimmerman
and Van House both describe the sharing of biology data between
scientists with complementary interests.
1.4 Complementary Skills or Methods
In this case, the scientist might be seeking a collaborator who can
bring a necessary expertise to the collaboration. For example, the
scientist might have decided that a certain data reduction or
computational method is required to deal with the data, so will search
for that needed expertise, regardless of the subject area in which the
potential collaborator typically works. Statisticians in biomedical
research areas typically appear as co-authors on papers with very
diverse research participants, diverse diseases or health issues, and
2. Knowledge Representation
In this system, knowledge representation is how we describe a
collaborator so that the system can retrieve relevant collaborator
profiles and so the user can assess relevance using the criteria
described in section 4, below. The user’s knowledge representation
and the representation of his or her query are described below in
section 3, Search Features and Fields. Representation of the potential
collaborators also includes the derivation of the information to
complete the profile.
2.1 Representation of a Potential Collaborator
As described in section 1 above and further described below in section
4, Relevance Criteria, users of the system might be looking for similar
or complementary interests or skills. The representation should include
fields typically found in a CV and others:
- Language proficiency
- Current location
- Willingness to travel
- Citizenship, and clearances (if US)
- Education (what degree, from where, with which advisor, at
- Employment (institution, location, lab or research group)
- Model organisms used
- Organisms studied
- Chemicals or systems studied
- Equipment expertise
- Articles written (these should be described with the
typical citation fields, and also descriptors)
- Conferences attended
- Society memberships
- Classes taught
Not all of these fields will be relevant to all users. Most of these
fields should be tied to controlled vocabularies to enhance retrieval.
The controlled vocabularies should be those commonly used in each
discipline, so the contents of a profile field will include not only the
term but also a semantic link to a “namespace” or vocabulary.
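As a minimal sketch of this representation, assuming Python and a small subset of the fields listed above, each controlled term can carry both its label and the vocabulary it belongs to. The class names, field names, and namespace strings below are hypothetical, chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    """A controlled-vocabulary term plus the vocabulary it comes from."""
    label: str      # the term as displayed, e.g. "Mus musculus"
    namespace: str  # hypothetical vocabulary identifier, e.g. "taxonomy"


@dataclass
class Profile:
    """A small, illustrative subset of the collaborator fields."""
    name: str
    location: str = ""
    languages: list[str] = field(default_factory=list)
    model_organisms: list[Term] = field(default_factory=list)
    equipment: list[Term] = field(default_factory=list)

    def terms_in(self, namespace: str) -> list[Term]:
        """Return every term on this profile drawn from one vocabulary."""
        all_terms = self.model_organisms + self.equipment
        return [t for t in all_terms if t.namespace == namespace]


# Hypothetical example profile.
example = Profile(
    name="A. Researcher",
    model_organisms=[Term("Mus musculus", "taxonomy")],
    equipment=[Term("mass spectrometer", "instruments")],
)
print(example.terms_in("taxonomy"))
```

Keeping the namespace next to the term lets the same label be distinguished across disciplines that use different vocabularies.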
2.2 Derivation of the Representation Information
As in literature, characterization can be done through an explicit
description of the character, what the character says, or through the
character’s actions. In other words, we can derive the
representation of the scientist through
- a profile the scientist creates;
- mining of the articles, data sets, protocols, and
other deliverables the scientist creates;
- and compiling information on previous collaborations,
guest lectures given, and conferences and workshops attended.
More experienced scientists will have a lot more information in their
profiles than younger or less experienced scientists. For this reason,
it is equally valid that the user asserts expertise in his or her
profile as well as substantiating that expertise with published
The system should mine the research literature, have user-created
profiles, mine society web sites and membership directories (when
allowed), and mine university and research lab web sites.
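One way to combine these sources, sketched here in Python under the assumption that each source yields a plain list of expertise terms, is to record which sources support each term. The source names and terms are hypothetical, and a real system would normalize terms against a controlled vocabulary rather than simply lowercasing them.

```python
from collections import defaultdict


def merge_evidence(*sources):
    """Combine (source_name, terms) pairs into term -> set of source names."""
    provenance = defaultdict(set)
    for source_name, terms in sources:
        for term in terms:
            # Lowercasing is a stand-in for real vocabulary normalization.
            provenance[term.lower()].add(source_name)
    return dict(provenance)


# Hypothetical evidence for one scientist.
merged = merge_evidence(
    ("self_profile", ["Bayesian statistics", "R"]),
    ("publications", ["bayesian statistics"]),
    ("society_directory", ["R"]),
)
for term, supported_by in sorted(merged.items()):
    print(term, sorted(supported_by))
```

Keeping the provenance lets the user see whether a piece of expertise is self-asserted, mined from outputs, or both.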
3. Search Features and Fields
The system has to support query formulation in several ways to be
effective. The system should allow standard fielded keyword searching,
browsing by various fields, matching or recommendations by similarity,
3.1 Keyword or Guided Keyword Searching
Most users expect to find a screen into which they can just enter a few
keywords and get some results. This system should provide that feature,
and if possible should attempt to map input terms to appropriate
fields. For example, if someone enters a chemical name, the system
should return scientists who study that chemical, scientists who use
methods that require that chemical as a matrix or reagent, and so
forth. The search results should provide facets to allow the user to
further refine the search by other criteria.
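The mapping of a keyword to fields, and the facets offered afterward, could be sketched as follows. This is an illustration in Python over toy data, not a real search engine: the profiles, field names, and the choice of country as the facet are all assumptions.

```python
from collections import Counter

# Hypothetical toy profiles: each searchable field holds a set of terms.
PROFILES = [
    {"name": "Scientist A", "country": "US",
     "chemicals_studied": {"caffeine"}, "method_reagents": set()},
    {"name": "Scientist B", "country": "UK",
     "chemicals_studied": set(), "method_reagents": {"caffeine"}},
    {"name": "Scientist C", "country": "US",
     "chemicals_studied": {"ethanol"}, "method_reagents": set()},
]
SEARCHABLE_FIELDS = ("chemicals_studied", "method_reagents")


def keyword_search(keyword, profiles=PROFILES):
    """Return (profile name, matching field) pairs for one keyword."""
    term = keyword.strip().lower()
    hits = []
    for profile in profiles:
        for field_name in SEARCHABLE_FIELDS:
            if term in profile[field_name]:
                hits.append((profile["name"], field_name))
    return hits


def facet_counts(hits, profiles=PROFILES, facet="country"):
    """Count matching profiles per facet value, for refinement links."""
    matched = {name for name, _ in hits}
    return Counter(p[facet] for p in profiles if p["name"] in matched)


hits = keyword_search("Caffeine")
print(hits)
print(facet_counts(hits))
```

Returning the matching field alongside each hit lets the results page explain why a scientist was retrieved, for example as someone who studies the chemical versus someone who uses it as a reagent.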
Guided keyword searching would allow the user to input a term, and
require the system to locate that term within a field. As in other
research databases, the user should be able to look up the appropriate
term in an index, or just enter any keyword that comes to mind. The
system should offer search suggestions and spell checking.
3.2 Browsing
Browsing is underrated by many information retrieval practitioners but
it is an important way to search for information. In this system,
browsing is particularly important because users might not know how a
method is described in a different discipline. Once the user clicks or
inputs any facet to browse, the system could provide a list, and then
display facets by which the user could narrow. For example, show me
only those who are within 100 miles of my current location.
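The distance facet in that example could be sketched with the haversine great-circle formula. The following Python is a rough illustration that assumes each profile stores a latitude and longitude; the profile names and coordinates are hypothetical and approximate.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, approximate


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(profiles, here, radius_miles=100.0):
    """Keep only the profiles located within radius_miles of `here`."""
    lat, lon = here
    return [p for p in profiles
            if haversine_miles(lat, lon, p["lat"], p["lon"]) <= radius_miles]


# Hypothetical profiles with approximate coordinates.
profiles = [
    {"name": "Baltimore lab", "lat": 39.29, "lon": -76.61},
    {"name": "Boston lab", "lat": 42.36, "lon": -71.06},
]
washington_dc = (38.90, -77.04)
nearby = within_radius(profiles, washington_dc)
print([p["name"] for p in nearby])
```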
Rashmi Sinha described something called pivot browsing, where from a
particular record found through any method, you can then start a new
browse by clicking on a field. In other words, you could browse other
profiles using the same term from the same controlled vocabulary and
maybe get recommendations from other vocabularies.
3.3 Matching by Similarity
A scientist who is represented in the system should be able to find
another scientist listed in the system by highlighting portions of his
or her profile and asking the system to locate similar scientists.
There could be some prioritization system, too, if multiple fields are
selected. For example, it is most important that they have the same
citizenship as me, but it is also important that they study this
organism, and it would be nice if they use this method, and I could
meet them at an upcoming meeting or workshop because they’ve
attended that meeting in the past.
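One possible form for such a prioritization, sketched in Python, is a weighted average of per-field overlap scores. Jaccard overlap and the particular weights are my assumptions for illustration, not the only reasonable choice, and the field names and values are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two term sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def weighted_similarity(me, other, weights):
    """Weighted average of per-field Jaccard scores, between 0 and 1."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = sum(w * jaccard(me.get(f, ()), other.get(f, ()))
                for f, w in weights.items())
    return score / total


def rank(me, candidates, weights):
    """Sort candidates from most to least similar to the searcher."""
    return sorted(candidates,
                  key=lambda c: weighted_similarity(me, c, weights),
                  reverse=True)


# Hypothetical searcher, priorities, and candidates.
me = {"citizenship": {"US"}, "organisms": {"zebrafish"},
      "methods": {"confocal microscopy"}, "meetings": {"Meeting X"}}
weights = {"citizenship": 4, "organisms": 3, "methods": 2, "meetings": 1}
candidates = [
    {"name": "B", "citizenship": {"UK"}, "organisms": {"zebrafish"},
     "methods": {"confocal microscopy"}, "meetings": {"Meeting X"}},
    {"name": "A", "citizenship": {"US"}, "organisms": {"zebrafish"}},
]
print([c["name"] for c in rank(me, candidates, weights)])
```

With these weights, a candidate who matches on citizenship and organism outranks one who matches on every field except citizenship, which reflects the stated priority order.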
Similarly, a user could point to a research article or protocol, and
ask the system to locate a scientist relevant to the methods used or
the topic studied. This match might occur by looking at citation of
that section of the document, or even better, the system could extract
the methods or topics from the document, and then match these in the
3.4 Alerting
The system should also be configured to alert users to new profiles
matching their search profile. This alerting can happen within the
system with a message displayed on login, or e-mails or RSS feeds can
be sent out to the user if they prefer.
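The matching step behind such alerts could be sketched as a saved search run against each batch of new profiles. This Python is illustrative only: the saved-search structure, user names, and field names are hypothetical, and delivery by login message, e-mail, or RSS is left out.

```python
def matches(saved_search, profile):
    """True when the profile contains every term the saved search requires."""
    return all(
        set(required) <= set(profile.get(field_name, ()))
        for field_name, required in saved_search["criteria"].items()
    )


def alerts_for_new_profiles(saved_searches, new_profiles):
    """Map each user to the names of new profiles matching their search."""
    alerts = {}
    for search in saved_searches:
        names = [p["name"] for p in new_profiles if matches(search, p)]
        if names:
            alerts.setdefault(search["user"], []).extend(names)
    return alerts


# Hypothetical saved searches and newly added profiles.
saved_searches = [
    {"user": "user1", "criteria": {"organisms": {"zebrafish"}}},
    {"user": "user2", "criteria": {"methods": {"mass spectrometry"}}},
]
new_profiles = [
    {"name": "New Scientist", "organisms": {"zebrafish", "medaka"}},
]
print(alerts_for_new_profiles(saved_searches, new_profiles))
```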
4. Relevance Criteria
There are many criteria a scientist might use to assess the relevance
of a potential collaborator using their profile in the system.
4.1 Topical Relevance
Topical relevance describes the match of the features of the searcher
with the features of the retrieved scientist. A potential collaborator
will be directly relevant if they are interested in the same research
problem, they use the same approaches, they have the needed skills,
they use the same research or analysis methods, and so forth. A
potential collaborator will also be relevant if they bring new and
necessary skills and expertise to the collaboration – this is
similar to the topical relevance idea of novelty.
4.2 Other Relevance Criteria
There are other important relevance criteria in addition to topical
matching. These include availability, policy, seniority, and access to
resources.
4.2.1. Availability
An ideal collaborator will have time available at the appropriate
portions of the project cycle to devote to the issue. While this seems
relatively straightforward, the new project might be more important to
the scientist than existing projects, so he or she might be willing to
oversubscribe or accept projects even when fully scheduled. On the
other hand, existing requirements such as teaching a course or serving
on a committee might severely restrict the scientist’s
ability to travel to equipment or allocate the time required.
4.2.2. Policy
Policy criteria are multidimensional. Depending on the funding source,
the scientist’s citizenship and location might be extremely
important. For example, a national government might require that all
investigators come from that country. Likewise, a funding body might
call the research that results from a collaboration export controlled,
so team members could not discuss results without export clearances.
Other policies that could impact this are requirements for the
protection of privacy of participants (health information, student
information, etc.), the treatment of animals, or the treatment of
intellectual property. Potential collaborators with experience and
education in the funding body’s requirements for privacy, for
example, will be more attractive than collaborators who need to spend a
great deal of time in online compliance courses and certifications.
4.2.3. Seniority
Potential collaborators might be very senior, running their own lab, or
might be new postdoctoral researchers. More than topic matching on
expertise, these criteria rely on mentoring possibilities and project
4.2.4. Access to Resources
For some “big science” projects, the primary
relevance criterion might be access to large, very expensive equipment,
but it is also important for smaller projects. Access to resources
could mean model organisms, human subjects, astronomical datasets,
computing power, ecosystems, or spectrometers, for example. A scientist
might be skilled in the use of some equipment, but might only be
relevant if he or she can access the equipment for the collaboration.
An organism or ecosystem might be of interest, but a scientist might
need to find a local scientist who has physical access to go to the
5. Conclusion
In this essay I have described desirable features for a system to
recommend potential collaborators for a scientist. I described the
problem space, how the collaborators would be represented and how that
information is obtained, how search should be facilitated, and how the
users will most likely judge relevance.
In this essay, I have omitted discussion of what is probably one of the
most important things in collaboration: chemistry, as on e-dating
sites. In other words, I have not addressed whether the two potential
collaborators can get along or whether they will have personality and
working style clashes that will impede the work. The system could
facilitate the scientists
chatting online, but meeting outside of the system is probably more
important in determining if the collaboration will work. Additionally,
some might suggest a rating system for collaborators, but this has so
many political and ethical implications it could prevent the system
from working or, indeed, from ever being released for use. Unless
someone has committed well-documented scientific fraud, they should stay
in the system and be judged based on their actions and outputs, which
are less controversial and less likely to cause unpleasantness.