Are we at a turning point for citing data?

Sep 13 2010 Published under scholarly communication

I was just stretching my legs and ran across the 24 August 2010 issue of EOS*, the member publication for AGU. On the front cover was this article:

Parsons, M. A., Duerr, R., and Minster, J.-B. (2010). Data Citation and Peer Review. Eos, 91(34), 297–298.

It's a decent article describing the issues also brought up in an AGU Town Hall and elsewhere (h/t Joe Hourclé). According to the authors, the AGU Council "asserts that the scientific community should recognize the value of data collection, preparation and description and that data 'publications' should 'be credited and cited like the products of any other scientific activity.'" The thing is that data centers ask for different types of citation: cite the data set (using the citation that is provided), mention the data source somewhere in the text, or cite some journal article discussing the gathering of the data**. The council also calls for peer review - but who knows what that means? It could mean checking data accuracy or just that the metadata correctly describes the data. The authors also suggest that proper citation standards must be figured out if we really want people to share their data. They point to the IPY model as an example. It has some strengths, like having editors and a DOI, but there are still open questions about when data sets count as scientifically identical if they're not in the same format or granularity. The authors continue with a section on peer review - but that seems a bit more dicey to me.

I think we're kind of getting to a turning point here - at least in a couple of fields. Gathering the data can be a full-time job, and these people need to get tenure, promotions, and grants. Sharing data is incredibly important, but who wants to share when a) they don't get credit for gathering the data and b) they don't get credit for sharing the data? People who want to use existing data sets might have trouble finding them and also need a standard way to give credit for them. We also need better linking from the journals to the data and vice versa (see the several efforts to assign DOIs - very promising). If you're reading the journal, you need to be able to find the data (particularly if supplemental data go away).
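
Just to make "a standard way to give credit" concrete, here's a little sketch (in Python, with a completely made-up data set, archive, and DOI) of the elements data centers commonly ask for in a data citation - creators, year, title, version, archive, and a persistent identifier:

    # A minimal sketch: assemble a data-set citation from common elements.
    # The field names and the example record below are entirely hypothetical.

    def format_data_citation(creators, year, title, version, archive, doi):
        """Return a human-readable citation string for a data set."""
        authors = "; ".join(creators)
        return (f"{authors} ({year}). {title}, Version {version}. "
                f"{archive}. doi:{doi}")

    print(format_data_citation(
        creators=["Researcher, A.", "Colleague, B."],
        year=2010,
        title="Hypothetical Sea Ice Extent Grids",
        version="1.0",
        archive="Example Data Center",
        doi="10.0000/example-data-set",
    ))

None of that is hard technically - the hard part is getting communities to agree on the elements and actually use them.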

All of these things are just getting more and more obvious and more important. The answers have to come from the various communities, although it would make sense if one community learned from another. It would also be great if the research databases linked to the data, too. I'm pretty sure ADS does this, but others might want to consider now how they would do it in addition to just having the citation.*** Oh, and yeah, citation managers should learn to deal with data sets, too!

*institutions still can't get this online, and we get like 3 issues at a time, 3 weeks late, but anyway.

** I have a real issue with this - I used a software package for R and the author wants some article cited that has nothing to do with my work or the development of the package. Grrr.

*** journal platforms do, but I mean indexing and abstracting services.

7 responses so far

  • August Muench says:

    hi, by "ADS" do you mean the NASA Astrophysics Data System? the links in ADS from the literature to the data are wonderful, but the curation required to add them is non-trivial, and so much more could be done. - gus

    • Christina Pikas says:

      Exactly. But the other high-end, value-added indexing services could really add this in and do us all a favor - Geobase and Inspec, as two examples.

  • Joe Hourclé says:

    Hmm ... I guess I should actually go through the many months worth of EOS & Physics Today collecting in my mailbox at work. (I haven't read the article yet, but I've been so busy the last few weeks, if I don't comment now, I'll likely forget)

    So, anyway, the issue of peer reviewing data. I'm of the opinion that *someone* should be able to go back, look over the processed form of the data, and verify that the processing (by the PI team, or whoever's doing it) was done correctly. We've had incidents where discontinuities were found when people switched from PPC Macintoshes to Intel Macintoshes ... and realized that the problem *wasn't* in the Intel code -- it had been wrong on the PPC side for *years*.

    At the 2008 AGU/SPD Joint Assembly, my talk was placed in a session on "Instrumentation and Data Products" ( http://www.agu.org/cgi-bin/sessions5?meeting=ja08&part=SP54A ), and I was amazed to hear what sort of things go into the calibration of the instruments -- and how they get things wrong. In the GOES talk they mentioned that a factor had been applied to all of the x-ray telescopes so that the images from later generations could be compared to the first one ... unfortunately, they've come to realize that the later generations were the correct ones; it was the first one that was flawed. So an 'X' class flare being 10^−4 W/m^2? Well, it's possible that an X1 is really only 70% of that (which would be an M7 flare, if I understand the scale correctly).
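
    (To put numbers on that: on the GOES scale an X1 flare is a peak flux of 10^-4 W/m^2 and an M1 is 10^-5 W/m^2, so 70% of an X1 lands at M7. Here's a little sketch of just the arithmetic -- the thresholds are the standard GOES ones; the 70% figure is only the example from the talk.)

        # Sketch: convert a GOES peak X-ray flux (W/m^2) into a flare class.
        # Standard thresholds: A >= 1e-8, B >= 1e-7, C >= 1e-6, M >= 1e-5, X >= 1e-4.

        def goes_class(flux):
            """Return a GOES class string (e.g. 'M7.0') for a peak flux in W/m^2."""
            for letter, base in [("X", 1e-4), ("M", 1e-5), ("C", 1e-6),
                                 ("B", 1e-7), ("A", 1e-8)]:
                if flux >= base:
                    return f"{letter}{flux / base:.1f}"
            return "below A"

        nominal_x1 = 1e-4                     # nominal X1 peak flux
        print(goes_class(nominal_x1))         # X1.0
        print(goes_class(0.7 * nominal_x1))   # M7.0 -- the "70% of an X1" case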

    Anyway ... the point being -- the calibration of instruments in many ways is an art; you have to make a number of assumptions, and those assumptions and methodology can result in biased information. (and by bias, I mean that it might tend towards mis-interpretation if not compensated for, not necessarily that it was done intentionally)

    So, in the AGU Town Hall, there was a discussion of the difference between simple quality assurance (making sure all of the metadata's in there, that the data isn't corrupted, etc.) vs. actual in-depth review of the methodology and processes used to generate the data that's being distributed, and that it'd been done correctly, without error. Data accuracy is a whole 'nother issue -- the instruments are getting more and more precise, but for many of them we don't have good ways to verify the accuracy without having multiple sensors observing the same event in the same/similar manner and cross-calibrating them.
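
    (That first level -- simple quality assurance -- is the part that's easy to automate. A rough sketch of the sort of check I mean, with made-up metadata field names:)

        # Rough sketch of "simple quality assurance": required metadata present,
        # file not corrupted (checksum matches). Field names here are made up.
        import hashlib
        import json

        REQUIRED_FIELDS = ["instrument", "start_time", "end_time",
                           "calibration_version", "units"]

        def qa_check(metadata_path, data_path, expected_sha256):
            """Return a list of problems found; an empty list means QA passed."""
            problems = []
            with open(metadata_path) as f:
                metadata = json.load(f)
            for field in REQUIRED_FIELDS:
                if field not in metadata:
                    problems.append(f"missing metadata field: {field}")
            with open(data_path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            if digest != expected_sha256:
                problems.append("checksum mismatch -- file may be corrupted")
            return problems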

    ...

    Oh ... and the DataCite folks sent around a draft to a couple of mailing lists about two weeks ago, with a request that people comment by the 20th. As best I can tell, they never put it on their website, though, so I can't easily link to it. I'll go and send it to the PAMNET mailing list.

    -Joe

    • Christina Pikas says:

      Joe - you've taught me just about everything I know about geosciences data 🙂 The review thing is really tricky, as is the question of how much documentation of data sets is really necessary before sharing. On the one hand, you need a lot to be able to trust the data. On the other hand, you don't want such a high barrier that the data aren't shared. Also, who does the reviewing? Would it end up having to be done by other scientists within the facility, because they're the ones who know what to look for? Maybe it's not peer review but internal review and quality control that's needed?

      WRT DataCite - I'm on ESI and I tweeted about it - DDC is a very odd system to use for discipline... I need to blog that! Thanks for sharing.

      • Joe Hourclé says:

        Sorry about the slow reply ... It's been a crazy couple of weeks, and I'm actually procrastinating on other stuff I should be doing right now. (besides just typing up grad-student friendly recipes for Scicurious)

        So ... the questions you had ...

        Much of the issue about data is how it's going to be used -- if you're going to use the data in exactly the same way that the PI intended, then the assumptions they made should be valid, and therefore they may not need to be documented. The problem is when some of their assumptions aren't valid for your type of study.

        Take for instance one of the scientists in our cubicle farm who decided she was going to look at multiple years' worth of data to see how the total intensity changed over time ... she graphed it, and it was a flat line. It turned out that as part of the calibration process, they normalized the total brightness of the images. Therefore, during a solar flare, it'd look as if the non-flaring parts of the image were darkening.

        For some types of studies, that's a perfectly legitimate thing to do; some instruments will even automatically reduce the exposure time during a flare, so they don't over-saturate ... but once you then normalize for the exposure time, the flare images will have more noise in them. Some will adjust the contrast radially -- in some wavelengths the solar limb (what we see as the edge of the circle) is significantly brighter than the rest of the disk, so they even things out to see different features.
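
        (A toy illustration of both effects, with made-up numbers rather than real instrument data -- total-brightness normalization makes the quiet parts of a flaring image look dimmer and flattens the total-intensity light curve, and a short flare exposure collects fewer photons and so carries more relative noise once you divide by exposure time:)

            # Toy illustration with made-up numbers (not real instrument data).
            import numpy as np

            quiet_disk = np.full((10, 10), 100.0)   # fake "image", arbitrary units
            flaring = quiet_disk.copy()
            flaring[4, 4] = 5000.0                  # one very bright "flare" pixel

            # Total-brightness normalization: scale every image to the same total.
            # The total-intensity light curve becomes a flat line, and quiet pixels
            # in the flaring image look dimmer even though nothing darkened.
            target = quiet_disk.sum()
            flare_norm = flaring * (target / flaring.sum())
            print(quiet_disk[0, 0], flare_norm[0, 0])   # 100.0 vs roughly 67

            # Exposure-time normalization: a short flare exposure collects fewer
            # photons, so the relative (Poisson) noise is larger after dividing
            # by exposure time, even though the mean signal comes out right.
            counts_long = 100.0 * 2.0 * 100     # flux * exposure * gain, made up
            counts_short = 100.0 * 0.05 * 100
            print(counts_long ** -0.5, counts_short ** -0.5)   # fractional noise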

        I'm not sure I can get the exact same image, but here's the same source processed in three different places (note: the first one is for today as of the time I write this, while the others are the 'latest' image):

        http://sdowww.lmsal.com/sdomedia/SunInTime/2010/09/28/l0304.jpg
        http://sdo.gsfc.nasa.gov/assets/img/latest/latest_1024_0304.jpg
        http://umbra.nascom.nasa.gov/images/latest_aia_304.gif

        And for comparison, here's a similar image from a different telescope:

        http://umbra.nascom.nasa.gov/images/latest_eit_304_full.gif

        Now ... if you're used to dealing with one of these, you might assume that it's been calibrated the way you're used to ... The third one is calibrated like the older SOHO/EIT telescope (the 4th one), to make the transition to the new data easier for people who are used to EIT. The PI team, however, came from the TRACE telescope, so they calibrate it more like TRACE images (the first one), and then there's the compromise (the second image).

        um ... I'm off on a tangent. Anyway, as I've been saying for years, the quality of the data depends on what it's being used for. The documentation should explain how it's been calibrated, but there are always going to be some assumptions made because of the time / discipline / etc. Specialists in the field might understand how the instrument design affects certain types of studies, but for someone trying to re-use the data 20 years from now ... are they going to know what the implications are? How about if they're from a completely different field?

        In some ways, data documentation is like documenting software -- the people who made it aren't the best people for the job, or even for judging if the documentation's complete (although they're okay for judging if it's wrong). Some of our scientists have recommended postdocs for the job. We've often had the mission scientists work on documentation when the instrument PI teams aren't. (but for some reason, they didn't appreciate my documentation that included a flowchart of software recommendations with the question 'are you a masochist?' ... I guess I have to leave these things for the scientists)

        ...

        As for who does the reviewing ... it's tricky. Some of the PI teams seem really reluctant to share the details of their instruments; I guess it's what gives them a competitive advantage in bidding for the next mission.

        Yet, my boss has mentioned that some of the best calibration of the data for some instruments has come from people other than the PI team. (We're having an issue with what level of calibration we're going to be serving to the public for SDO/AIA: there's one group that wants 'level 1', while the PI team wants to serve the more highly processed and non-reversible calibration they're calling 'level 1.5'. And they don't want to give us 'level 0' *at all*. Oh ... and did I mention the PI team is the same group that had the PPC issue?)

        In the case of solar images at least, we can take the images from one telescope, compare them to ones from another, and see if they're getting similar results ... it's not perfect, but it might catch some problems. For earth-sensing satellites, they can compare the remote-sensed values to 'ground truth' to try to verify the data, but I have no idea how often that's done or how easy it is.
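
        (A crude sketch of the kind of cross-check I mean, with synthetic arrays standing in for two co-aligned, co-temporal images -- real cross-calibration is a lot more involved than a ratio and a correlation:)

            # Crude cross-check between two telescopes looking at the same scene.
            # Synthetic arrays stand in for co-aligned, co-temporal images.
            import numpy as np

            rng = np.random.default_rng(1)
            scene = rng.gamma(shape=2.0, scale=50.0, size=(64, 64))   # fake sun

            image_a = scene + rng.normal(0.0, 2.0, scene.shape)          # telescope A
            image_b = scene * 0.85 + rng.normal(0.0, 2.0, scene.shape)   # B, mis-scaled

            ratio = np.median(image_b / image_a)
            corr = np.corrcoef(image_a.ravel(), image_b.ravel())[0, 1]
            print(f"median B/A ratio:  {ratio:.2f}")    # ~0.85 hints at a scale offset
            print(f"pixel correlation: {corr:.3f}")     # high if the structure matches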

        ...

        I'd say in the long run, there's a need for real peer review of the data, to make sure that the assumptions made in the published calibration are useful for more than just the PI team's work, and then a regular, ongoing quality-control effort to make sure that the data stays consistent.

        • Christina Pikas says:

          using postdocs to do the documentation makes a lot of sense - they know enough to be able to do it - but is it the best use of their time? I guess better than using the mission scientists.

          I guess the people who go to use this in 20 years will be the ones to tell if the documentation is complete, but then it will be too late.

          thanks for all of the info - sorry your comment got held up in moderation

          • Joe Hourclé says:

            I think it's posting URLs that force my comments to require moderation. Unfortunately, some topics are hard to discuss w/out reference materials.

            ...

            Anyway ... postdocs are the new cheap labor. I'd have to find it again, but there was an article earlier in the year discussing how, since there aren't a whole lot of research and teaching positions opening up, postdocs are doing a lot of crap work that used to be done by grad students rather than having a chance to work on their own science.

            In this particular case, I think postdoc is probably a good level -- they're not too entrenched in any given specialty, so hopefully they're not making lots of assumptions that they're not documenting; and doing the documentation helps them build a reputation in the field (you're trying to use InstrumentA data? You should talk to Dave; he knows all about InstrumentA).

            They have more experience in the field than someone still in grad school, so they'll hopefully write at the correct level for the intended audience.

            As they tend to be younger, they'll also be around to answer questions for longer (we hope). I recall hearing that for New Horizons (the probe to Pluto), they intentionally selected younger scientists as Co-Is because the mission was going to take so long from when the instruments were designed and built to "Phase E" (operations, i.e., data collection and analysis).