22 December 2013

The availability of research data

A recently published paper [1] (free preprint here) warns that research data becomes less accessible with time. The authors tried to retrieve email addresses from articles 2 to 22 years old, sent standard messages requesting the data sets and followed up on the responses. In most cases (63%) the addresses were not working or they received no response. The other outcomes were: no information on the status of the data (6%), claim of data loss (7%), refusal to share (4%) and receipt of data (19%).

For some reason, these results caught the attention of the media, leading to dire warnings that scientific data is "disappearing into old email addresses" (UPI) at an "alarming rate" (Slashdot), viz. "the odds of the requested data set still existing fell by 17 percent per year" (Washington Post).

Clearly, the data does not support such an apocalyptic view. The authors are somewhat responsible for this confusion: by combining the last two outcomes (refusal to share and receipt of data, 23% in total) they create a category labeled "extant data" (implying that the remaining 77% are "nonextant") and thus conflate two issues:
  1. The availability of data on casual request
  2. The physical existence of the data
Their study deals with the first point, and a reasonable conclusion would be that scientists are reluctant to do a substantial amount of work (putting together all the data from an experiment in a format intelligible to someone outside the group) simply on receiving an impersonal message. I guess phone calls would have been more effective.

To address the second issue, one would need to take at face value the 7% declared as lost and extrapolate the ratio (lost/responded) to all the requests. This approach has a few shortcomings:
  1. The number of "lost data" responses is very small.
  2. The respondents may have pretended that data was lost simply to make the request go away.
  3. The "lost data" category also includes information stored on "inaccessible storage media", meaning in practice "harder to retrieve" and not "lost forever".
  4. Finally, the authors exclude from their statistics papers for which the data was publicly available. Since this is exactly the solution they advocate, I find their choice very strange.
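For what it is worth, the lost/responded extrapolation can be worked out from the rounded percentages quoted above. This is only a rough sketch: the dictionary keys are my own labels, and I am assuming that "responded" means every outcome other than a dead address or no reply.

```python
# Outcome shares in percent of all requests, as quoted in this post.
# They are rounded, which is why they sum to 99 rather than 100.
outcomes = {
    "no_response_or_dead_address": 63,
    "status_unknown": 6,
    "reported_lost": 7,
    "refused_to_share": 4,
    "data_received": 19,
}

# Assumption: "responded" = everything except dead addresses / no reply.
responded = sum(
    share for name, share in outcomes.items()
    if name != "no_response_or_dead_address"
)

# Ratio of "lost" answers among those who answered at all.
lost_ratio = outcomes["reported_lost"] / responded

# The "extant data" category: refusal to share plus receipt of data.
extant = outcomes["refused_to_share"] + outcomes["data_received"]

print(f"responded: {responded}% of requests")
print(f"lost/responded: {lost_ratio:.1%}")
print(f"extant: {extant}% of requests")
```

Taken at face value, then, roughly one data set in five would be lost, a figure that is subject to all the caveats listed above.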
Mandating full data sharing is an interesting idea, and Vines's paper makes a good case for it, but the sky is not falling...

[1] T. H. Vines et al., The Availability of Research Data Declines Rapidly with Article Age, Current Biology (2013).

