Why Scientists Need To Start Managing Data Better
Guillermo Vela, CEO of Nebulab — a local tech company building a data management platform for scientists — spoke to The University of Texas Health Science Center at San Antonio graduate students and postdoctoral fellows about the importance of implementing a data management plan.
“How many of you feel highly confident that you can find and understand your data from ten, five, or even three years ago with relative ease?” Vela asked trainees.
One pressing reason why scientists need to begin addressing data management needs today, Vela explained, is because science is experiencing a data deluge.
Over the last two decades, many lab instruments have become vastly more sophisticated and affordable, and as a result, many more labs are now generating data at unprecedented rates. Vela explained that even the best funded institutions will be unable to grow their IT infrastructures to meet modern research demands.
“We’re generating so much data that we don’t even know how to store or make shareable yet… for example, by 2025, the field of genomics alone is expected to generate as much as 40 exabytes of data per year, or roughly about twenty times more data than all of Twitter and YouTube combined,” he said.
Another reason why data management is so important is experimental reproducibility. As modern science becomes increasingly complex, keeping proper track of all experimental details is vital.
“In science, access to a file by itself is meaningless. We also need access to all the context necessary to be able to understand, validate, and potentially replicate what we’re looking at,” Vela said.
Vela explained that the true extent of irreproducibility in preclinical research came to light as pharmaceutical companies began looking for commercialization opportunities within academic research.
They saw this as a way to cut down on internal R&D spending, but unfortunately, pharmaceutical companies faced huge setbacks trying to validate promising research coming from academic institutions. Currently, between 50-90 percent of preclinical research is estimated to be irreproducible.
“Reproducibility is a huge issue for translational science—and sometimes the reasons for such are as common and inane as a graduate student leaving the lab and nobody being able to find or understand that person’s data.”
Vela suggested that proper data management should be of particular importance for translational research institutions such as The University of Texas Health Science Center at San Antonio because research that cannot be validated cannot be translated into clinical applications and therapies.
He also explained that most scientists today usually rely on their personal Google Drive and Dropbox accounts to store data. This means that when they leave the lab, the data often goes with them.
“[Improper data management is] not just a technology problem. It’s also a human problem. Lab Notebooks can still useful, but for most people, the notebook has become the equivalent of a post-it note,” he said.
“Technology alone will not solve our data management problems. We also need to be more aware of our habits, introduce better training, and strive to keep more detailed records.”
Lastly, Vela explained that it’s important for scientists to learn and practice proper data management because data management plans are now being required by funding agencies like the NIH and the NSF with every grant application.
Funding agencies and publishers alike see data management plans as a way to enhance research transparency, reproducibility, and accessibility, and funding eligibility will depend on them.
Here are some of his best tips. For additional references and help on how to make your own plan, visit this link.
1) Plan Ahead
Think about how much data you think you might generate, where you will be storing the data, and what those storage costs might be. Will data be stored on an institutional server or an external hard drive? Are you allowed to keep copies on your Dropbox or Google Drive, and what happens if you run out of free storage? Importantly, how do you plan on making primary data publicly available after publication? Public repositories might already exist for your type of research data. Alternatively, sites like Figshare, Dryad, or the Dataverse Project might offer an alternative.
2) Stick To A File Naming Convention
Do not label your files DSC200004, instead figure out a labeling convention that your entire lab will know so when you leave, they will be able to easily pull the files they need.
3) Add a Readme file
Keeping these within project and experiment folder is a simple way to help people figure out how to navigate your data more efficiently
4) Back Up Your Data
Always backup your data, and make sure your backups don’t live next to your original data. Vela mentioned the case of a researcher who had her data backups on an external hard drive which she kept in her backpack, along with her laptop. One day, her backpack was stolen, resulting in years of lost data. Don’t keep those external hard drives next to the computer you work on either. What if there was a chemical fire or an emergency at the lab? Cloud storage is a great way to keep backups.
5) Avoid Data Lock In
Make sure that the tools and software that you choose to use for data management offers an easy way to retrieve or export your data in case the manufacturer or product goes out of business or becomes obsolete.
6) Engage Administrators
Ask your PI, department head, and administrative offices how to handle data storage and long-term data preservation. Is there a lab or institutional plan already in place?
7) Using the Cloud
Many people are wary of the cloud but the reality is your data is usually safer in the cloud, much like your money is safer in the bank. It’s much easier to hack into your personal computer than it is to hack cloud providers. Likewise, your computer is much more likel
y to crash, whereas its almost impossible for cloud storage providers like Amazon, Google, or Rackspace to lose your data.
8) Address Privacy Concerns
When using cloud and software providers, be sure to read the terms of conditions. For example, some data type should only be stored on HIPAA-compliant cloud storage, and must be de-identified.