PLoS done got it wrong

Feb 25 2014 | Careers

So PLoS has this new data sharing policy. It's, at best, heavy-handed and misguided. There are many reasons for this, but I'm going to focus on a practical issue.

They don't have community buy-in.

Seriously, they don't. Sure, there are plenty of people who are all open access, all the time. I get that, I really do. And I'm all for making my data freely available. But it's not as easy as PLoS would have us all believe. It's not as simple as a publisher saying "you must do this or you can't publish in our journals." Yes, there are publishers that require data to be deposited in various databases. The Protein Data Bank (PDB), for example. Pretty much any journal that will publish a protein structure requires that the structure be deposited in the PDB before publication. And sequences need to be deposited in places like GenBank etc. etc. etc.

Here's the difference. Those databases existed before journals imposed their policies. The various communities involved realized that these things were important and established the databases BEFORE journals really got involved. There was community buy-in already.

PLoS doesn't have that. The PLoS policy covers almost ALL data, much of which does not have a corresponding existing database. Not their problem? Actually, yes it is. Many people are somewhat agnostic about the whole open access thing. This new requirement is likely to result in many of them, consciously or subconsciously, deciding that publishing in PLoS is just not worth it.

And yes, I realize that PLoS isn't requiring all data to be deposited into databases. But that's the ideal isn't it? Common formats, user-friendly interfaces and all that. After all, the goal of the whole open access thing is to make it all freely and easily available to everyone, right? Right?

So what should they have done? Laid the groundwork. For a start, given everyone a lot more warning that they were going this route. At least a year. Two would have been better. Time for people to process what this all means and maybe try to do something. Then they should have worked with those sub-fields that publish regularly* in the PLoS journals to help develop needed databases. You might argue that's not their job, but if they're going to evangelize the whole open everything thing, they need to step up and shoulder some of the load they expect everyone else to carry.

But that's not the evangelist way, is it.

________
* Sounds like a job for... altmetrics! Or just someone at PLoS with a decent grasp of databases. Oh wait...

39 responses so far

  • "Here's the difference. Those databases existed before journals imposed their policies."

    It sounds to me a little like you're complaining that PLoS is showing up with a chicken when you'd rather have eggs. Repositories need deposits!

    • odyssey says:

      Think about it carefully. The deposits (data) exist. I have many, many GBs of data in my lab alone. Suitable repositories do not exist. Figshare and the like? Sure, I could dump a bunch of data in places like that to satisfy PLoS, but it wouldn't be in a form useful to anyone except people in my own sub-sub-field. That defeats the purpose of the whole open everything approach, don't you think?

      • Are you suggesting that it (in your example) is better to withhold the data than to put the data on Figshare et al.? That would be counterintuitive. Better to have the immediate potential to benefit those in your sub-sub-field than to wait until we get all the ducks in a row with domain-specific databases. Surely.

        • odyssey says:

          Deposit in Figshare etc. for those in my sub-sub-field? Why when they could just ask for it? (Not that any of them ever have.)

          • Ian Dworkin (@IanDworkin) says:

            What happens if you leave science, or retire (or attend the great Gordon Conference in the sky)? Or, more simply, what if the hard drives that the data are stored on fail? All of these could lead to the data being completely unretrievable by members of the community.

            If the data are made available in a public archive, even in a format that could only easily be used by members of your sub-field, then they could be used. Yes, other researchers may still need to try to contact you with specific questions. But even if you were unavailable, they might be able to partially or completely use the data anyway.

  • LM says:

    One complaint: I have no idea what data they want. I'm a biophysicist and was thinking about submitting to PLoS Computational Biology. Depending on how you interpret their policy, they could want everything (full source code, data files, and test cases), or they could just want the outputs. One of these is an order of magnitude more hassle than the other. And the lack of clarity is pretty annoying.

    • odyssey says:

      Indeed.

    • Ian Dworkin (@IanDworkin) says:

      For our simulation and agent-based model work, we supply source code (in situations where it is not available elsewhere), configuration files (with seeds, in case readers want to reproduce exact results) and parsing scripts. For most purposes, this is sufficient to regenerate the data exactly. Sometimes we also submit the output of runs if they take a long time (weeks to months).
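
      A minimal sketch of what such a reproducibility package could look like, assuming a toy agent-based model; the file name, parameters, and seed handling here are illustrative assumptions, not the actual setup described above:

      ```python
      # Illustrative sketch: a configuration file fixes every parameter,
      # including the random seed, so anyone with the source code can
      # regenerate the reported data exactly. All names are hypothetical.
      import json
      import random

      def run_simulation(config):
          """Toy agent-based run: each agent takes n_steps Gaussian random steps."""
          rng = random.Random(config["seed"])  # seed recorded in the config
          positions = [0.0] * config["n_agents"]
          for _ in range(config["n_steps"]):
              positions = [x + rng.gauss(0.0, config["step_sd"]) for x in positions]
          return positions

      if __name__ == "__main__":
          # The same config file that would be deposited alongside the paper.
          config = {"seed": 42, "n_agents": 100, "n_steps": 1000, "step_sd": 1.0}
          with open("run_config.json", "w") as fh:
              json.dump(config, fh, indent=2)

          results = run_simulation(config)
          # Because the seed is fixed, rerunning with the deposited config
          # reproduces this output exactly.
          print(sum(results) / len(results))
      ```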

  • MTomasson says:

    I really, really want to agree with my fave tweeps on this one, but I must play PLOS's advocate in a couple of ways. True about community buy-in, it seems from Twitter. But maybe the community of sci's wanting to get data published is large enough that PLOS can use this policy to define its own community a little better. If you don't want to post all your data, there are plenty of journals out there to go to.

    Also, I agree with some commenters who think this policy may help bring about some of the data storage resources and standards that are currently missing.

    And this is vague philosophy, but when someone gets such a poor reaction I always wonder whether, paradoxically, they may in fact be on to something. There is a problem in science publication now where papers are increasingly too expensive to repeat and opaque to re-analysis. Ignoring this problem is letting science change into something I don't recognize. "We had the money, you didn't, trust us on this." I take all of your and DrugMonkey's points... but I am not convinced the policy is a disaster of an idea. Maybe it needs clarification and refinement, but that may come.

    • odyssey says:

      We're not in total disagreement. PLoS is on to something. My take on it, though, is that they've approached this the wrong way. The existing databases (PDB, GenBank etc.) are amazingly useful. I use them almost daily. They're useful because the community bought into the idea, lobbied for the resources, and decided on things like useful formats. PLoS could have pushed that and generated something (or a bunch of things) really, really useful for the many. Under their current mandate, data can be dumped anywhere public, in less than useful forms, benefiting almost no one.

      • "Under their current mandate, data can be dumped anywhere public..." that's not strictly correct, although the definition of "accession number" is somewhat dubious. I suppose taking that loosely, one could post anything to a personal website and give it your own personal accession number. I doubt that is what is intended and the rules regarding that will be cleared up.

    • dr24hours says:

      I remain sympathetic to the open data/open access environment. But there are serious practicalities here. My simulations can produce volumes of data that are impossible for me to keep. I overwrite them all the time. I produce raw data from a run, interpret it, store the relevant bits, and then overwrite the rest, because I am going to produce terabytes of data while settling in on the right way to frame an outcome. If I had to keep all of that, I'd need access to far more storage space than I have now. Similarly, the public repository would need to be huge.

      Someone is going to make a lot of money getting already cash-strapped scis to pay them to store their data.

      • If you are simulating, the code and sufficient computing resources should be enough for anyone to reproduce the simulations you report. Hence you submit the source code needed to reproduce the simulations reported in the paper. I don't see that PLOS's new data mandate requires you to throw the data from all the tests, missteps, trial runs etc. into a repository.

  • Spiny Norman says:

    I'm sorry, MTomasson, but the availability of "raw" data has almost nothing to do with reproducibility.

    Lack of reproducibility in the vast majority of cases comes down to poor documentation of METHODS, not poor documentation of RESULTS. And this initiative does nothing to address that problem. What it does do is create yet another huge administrative burden (an unfunded mandate) for scientists. And it is a burden that will disproportionately harm smaller groups, and groups using non-standard, emerging, or diverse technical approaches, versus groups that are focused on a single approach and simply crank out data.

    In other words, it penalizes groups doing problem-based, hypothesis-driven work and rewards data factories (cough ENCODE, cough).

    I can see almost no way in which this is not harmful.

    • Availability of the "raw" data is a start and would allow the post-data-collection aspects of a study to be replicated and/or validated. But that is only one part of why Open Data advocates want to see greater openness in this area. Your data might be very useful to scientist X, who is interested in something you might never think of or aren't interested in pursuing. If they can take your data and do something useful with it, that is a net benefit, because that useful something is unlikely to get done if scientist X has to recreate your data. There is also a data stewardship issue here: data on scientist Y's hard drive are not secured, and data get lost all the time. When the public have paid for those data it is hard to argue that scientists shouldn't take steps to i) secure them and ii) make them openly available.

  • DrugMonkey says:

    MT-
    PLoS continuing to further isolate itself as the province of TruBelievers in OpenEverything is their right, of course. I just think it a shame when their central mission is so fantastic. This severely curtails the influence of the PONE idea about predicting impact before publication. A damn shame.

    • odyssey says:

      Right. They really missed an opportunity here. They could have worked with the community to better spread the idea of their central mission. Instead they chose to mandate. Really makes me wonder who's at the helm.

    • Ian Dworkin (@IanDworkin) says:

      I am at a loss about why you are treating those who advocate for open data policies as "whackaloons" and "true believers". While they may hold different starting assumptions from you, and are arguing for a different point of view, I have seen nothing from those commenting on your blog, or this one, to suggest that those advocating a point of view different from your own are being irrational. This is an extremely worthwhile conversation for practicing scientists to have. Why all of the name calling?

      • Odyssey says:

        My issues are not with sharing data. My issues are with the way PLoS has mandated this be done. I also have not called anyone names.

        • Ian Dworkin says:

          Sorry, that was referring to the comments by drugmonkey, who has been name-calling both on this blog and their own.

  • Maria says:

    In fact, you could go further with the concept of community buy-in: the PDB and its community were already in existence, and THEN they asked the journals to help them enforce the policy, not the other way around.

  • Theo Bloom says:

    Please do check out the FAQs on the PLOS data policy at http://www.plos.org/plos-data-policy-faq/. The main thing we are asking for is that the data be somewhere other than the authors' hard drive, and for everyone who reads the paper to know where that is. We do appreciate that this involves some work up front, when the paper is being prepared, rather than waiting for someone to request the data after publication.

    And just to respond to a couple of points in the main post: the policy has been under development for more than a year, has involved internal consultation with representatives of as many fields as we could engage, and has been open for public consultation since December last year - http://www.plos.org/data-access-for-the-open-access-literature-ploss-data-policy/. Clearly a lot of people have now taken notice of it who didn't before, and we're really grateful for all the thoughtful input. We'll keep adding to the FAQ as long as new questions arise.

    We have also been working for a number of years with the likes of Dryad, FigShare and other repositories, both those structured for particular data types and the more catch-all ones designed to deal with the problem that there will never be suitable community databases for all possible data types.

    • odyssey says:

      Theo,
      Thanks for commenting here. I have read the FAQ and I still find the whole thing less than satisfying. The NIH requirement for making data available is quoted as a rationale, yet that is really very different from PLoS's data sharing mandate. My data IS available to anyone who wants it. All they have to do is ask. That satisfies the requirements of the agencies that fund me. Why isn't that enough for PLoS?

      You've been working with repositories? Great. And have you been helping them identify sources of funds to deal with the data you're expecting them to host? Running a data-specific repository isn't cheap. I know; I've been associated with one for a few years.

      I'm also aware of at least three data-specific repositories in my general area that started, never took off, and have disappeared. Why? No community buy-in. Focusing on the repositories is all well and good, but it is a waste of time if you're not also working to get wider community buy-in. And that takes more than a post or two on your web site. People need to be convinced, not just told, that this is a worthwhile endeavor.

      No one has ever asked me for my primary data. Ever. Yet PLoS wants me to go to a large amount of effort, with associated expense ($'s I don't have), to post my data somewhere just in case? Just to save someone the effort of dropping me an email asking me for the data? Maybe there are people who are asked for their data all the time. Good for them. Don't you think maybe they'd have the sense to put their data somewhere readily accessible themselves without being told to? And people who refuse to share? They just won't publish in PLoS.

      So maybe you can enlighten me. Just what problem/need is this solving?

      • Ian Dworkin (@IanDworkin) says:

        While some repositories like Dryad may charge, others like FigShare do not. So that should not cost you anything.

        I am not sure why you seem to suggest (based on Tim Vines's comment above) that decline only happens after ten years. The data suggests the decline begins rapidly and increases in severity. It is not a sudden drop off.

        Even if it does only happen after ten years (regardless of government mandate), why would you not want other researchers to use your old (10+ years) data? Maybe only 0.1% of data sets will be truly valuable, but even that small amount may end up ultimately saving time and money.

        • Odyssey says:

          You really believe it costs the lab nothing to store data, even on a free site?

          You seem to be assuming that I don't take steps to ensure the security of my data. Like every scientist I know, I don't have just one hard drive in the lab. I'm not naive.

          • Ian Dworkin (@IanDworkin) says:

            Odyssey

            I of course have no information on how you store your data, and perhaps you have numerous redundant storage systems in play, which is great. I certainly was not suggesting that you are naive.

            However, it is generally clear (as numerous studies show, including the one Tim Vines linked to) that there is a problem in the community in terms of data loss. So why would we not make an effort to "fix the leak"?

            I am confused about your other comment
            "You really believe it costs the lab nothing to store data, even on a free site?"

            Of course getting the metadata organized may require some time, but in my experience (with a number of different data types, although they may be different from yours) it is a very small time investment relative to all of the other parts of a study, including collecting the data.

            Is there another cost you are referring to?

  • Tim Vines says:

    "But it's not as easy as PLoS would have us all believe. It's not as simple as a publisher saying "you must do this or you can't publish in our journals.""

    Our experience at Molecular Ecology suggests otherwise - we brought in a tough data sharing policy in 2010, and made it clear we were going to enforce it (on pain of the paper not getting published), and guess what happened? Nothing. Submissions have stayed the same, and we're still getting good quality papers. Sure, people who are afraid of others seeing their data have probably taken their papers elsewhere, but that's fine with us as those papers can't be validated and are thus less useful for the community anyway.

    • odyssey says:

      Where is the data being stored?

      • Tim Vines says:

        Most of the datasets are on Dryad, which preserves them in perpetuity for a one-time fee of c. $80. The rest are mostly in supplemental material and data-specific repositories (e.g. GenBank).

  • odyssey says:

    And Tim, to follow up on my earlier comment. It would be wonderful if we could keep all the data for all time. But that's simply not practical. I get that the loss of old data is a very bad thing, but how would you suggest we fund the storage of such vast amounts of information? Simply mandating that all data be kept and made available doesn't solve that issue.

  • Tim Vines says:

    It's actually remarkably cheap to preserve data: Dryad will look after 10GB for the foreseeable future for a one-time fee of under $100. Compare that to the cost of either recollecting the same datasets all over again, or missing out on the ability to answer an important research question because irreplaceable data have been lost, and it's a bargain. Heather Piwowar had an interesting letter in Nature about this:

    http://researchremix.wordpress.com/2011/05/19/nature-letter/

    We can go round and round on how much space we need to store all this data, and I acknowledge that some areas routinely produce a lot more than 10GB per paper. However, for most fields it seems that even the raw raw data and all the downstream files rarely get above 1GB.

    • odyssey says:

      10GB for $100? Dude, that ain't cheap. Not when you can buy a 3TB drive for that much.

      And while you might be correct that for many fields a typical paper might involve no more than 1GB of raw data, you are a) ignoring the often significant cost in time (and therefore $'s) to generate and organize the metadata to make it useful, b) ignoring the effects on those whose typical data are >>1GB, and c) expecting everyone to do this even though there are many of us who have never, ever been asked for our raw data.
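
      For a rough sense of the raw numbers being compared in this exchange, here is the back-of-the-envelope arithmetic, using only the figures quoted in the thread and ignoring that archival deposit also buys curation, redundancy, and permanence that a bare drive does not:

      ```python
      # Cost-per-GB comparison using the figures quoted above (thread's numbers,
      # not current prices).
      dryad_fee_usd = 100.0        # one-time fee quoted for a 10 GB deposit
      dryad_capacity_gb = 10.0
      drive_cost_usd = 100.0       # price quoted for a 3 TB consumer drive
      drive_capacity_gb = 3000.0

      print(f"Dryad deposit: ~${dryad_fee_usd / dryad_capacity_gb:.2f} per GB")      # ~$10.00
      print(f"Bare drive:    ~${drive_cost_usd / drive_capacity_gb:.3f} per GB")     # ~$0.033
      ```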

  • odyssey says:

    "Of course getting the metadata organized may require some time, but in my experience (with a number of different data types, although they may be different from yours) it is a very small time investment relative to all of the other parts of a study, including collecting the data."

    A typical paper from my lab might involve 100's of individual raw data files. Organizing the metadata in order to make it useful to anyone else would be a very significant cost in time and money.
