Background
Data Management and Reproducibility
In the context of bioinformatics and related areas of research, data management describes the plan and approaches used to manage one's data throughout its lifecycle. In a well-planned experiment, initiative, or effort, planning for the management of data should occur well before any data is ever collected! Data management for a project is closely linked to the biomedical data lifecycle:
Source: Longwood Research Data Management, Harvard University
Good data management practice supports reproducibility by making it easier for others (and your future self) to understand what data were collected, how they were processed, and how results were produced.
Tip: Data management planning should happen early—before data collection begins—so workflows, storage, documentation, and sharing expectations are clear.
A Brief Overview of Steps in The Biomedical Data Lifecycle
Material for this page was sourced from Longwood Research Data Management, Harvard University. Please visit their website for additional information, resources, and updates.
Plan and Design
Before you start a scientific experiment or project, you must plan and design your approach. Will everything go as planned, always? No! However, planning your approach ensures that the idea in your mind fits your timeframe, scope, and feasibility; essentially, you are making sure that the project can reasonably be completed with the time and resources available. A good project plan builds in sufficient time and resources for effective data management.

In this stage, planning for a project, initiative, or effort occurs, usually alongside the design of a data management plan. Many institutions, such as the National Institutes of Health (NIH), now require submission of a Data Management Plan with research grants. (Click here to view the Data Management and Sharing Plan page from the NIH.) The data management plan, to be discussed in later modules, is a well-thought-out plan covering who will manage the data; where it will be stored and analyzed; when it will be collected and for how long; why it is being collected or appropriate for collection; what types of data will be collected and what tools will be used to manage, store, and analyze them; and how the data will be used.
Collect and Create
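One concrete illustration of the naming considerations in this stage is a small helper that builds consistent, sortable filenames. This is a hypothetical sketch, not a standard: the fields, separator, and date format are assumptions to adapt to your own lab's conventions.

```python
from datetime import date

def make_filename(project, sample, assay, ext, run_date=None):
    """Build a consistent, sortable filename such as
    '2024-03-01_projX_sample07_rnaseq.fastq'.
    All field names here are illustrative; adapt to your conventions."""
    run_date = run_date or date.today().isoformat()  # ISO dates sort chronologically
    parts = [run_date, project, sample, assay]
    # Avoid spaces so the names work across filesystems and shell tools
    return "_".join(p.replace(" ", "-") for p in parts) + "." + ext

print(make_filename("projX", "sample07", "rnaseq", "fastq", run_date="2024-03-01"))
# 2024-03-01_projX_sample07_rnaseq.fastq
```

Leading with an ISO-formatted date means an ordinary alphabetical sort of the directory also sorts files by collection date, which makes later searching and analysis easier.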
Consider what is needed to collect your data, including software, hardware, and the training collaborators on your project will need. Consider how the data will be named so that it is easy to find and efficient to store, search, and analyze when you reach the next step. Much like an old-fashioned lab notebook, many folks in bioinformatics use a digital lab notebook, software, or another tool to record their data collection details and ensure transparency.

Analyze and Collaborate
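One documentation practice for this stage is recording the exact software environment an analysis ran in, so that collaborators (and your future self) can reconstruct it. A minimal sketch in Python; the package names passed in are illustrative assumptions:

```python
import platform
import importlib.metadata as md

def snapshot_environment(packages, outfile="environment.txt"):
    """Write the interpreter version and the installed versions of the
    given packages to a plain-text file for the project record."""
    lines = [f"python {platform.python_version()}"]
    for pkg in packages:
        try:
            lines.append(f"{pkg} {md.version(pkg)}")
        except md.PackageNotFoundError:
            # Record the gap rather than failing silently
            lines.append(f"{pkg} (not installed)")
    with open(outfile, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    return lines

# Illustrative package list; substitute the dependencies of your pipeline
print(snapshot_environment(["numpy", "pandas"]))
```

Committing a file like this alongside the README (or using a fuller tool such as a pinned requirements file or a container image) answers the "what dependencies does this need to run?" question before anyone has to ask it.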
In this stage, especially in bioinformatics, it is important for researchers and trainees to understand the reproducibility of their workflow or pipeline. Is your analysis being performed on a laptop or a high-performance computing system? What requirements or dependencies does your analysis need to run? Are you using a commercial product or open-source software? Is it a well-supported, well-documented tool that is popular in your community, or a one-off script written 5 years ago by a student who has since graduated? Who helped write this code, and are you appropriately citing all the packages and libraries used in your work? All of these questions and more should inform how you document your bioinformatics analysis. Again, much like an old-fashioned lab notebook, many folks in bioinformatics use a digital record of their analyses to ensure transparency and reproducibility. This includes a documented README file and, where appropriate, an INSTALL file, a TEST file, a LICENSE file, use of a style guide for code, and commenting in one's code.

Evaluate and Archive
Consider where you will need to store your data, for how long, and how frequently you will need to access it. Do you need HIPAA-compliant storage? FERPA compliance? Are there any data security issues to consider? Storage space and upload/download time also need to be considered. Relatively large datasets (think 1 TB or bigger) might require purchasing longer-term storage beyond what your institution or university provides. Hard drives are easily lost and can be physically corrupted by water and age; cloud-based storage, when appropriate, is a better way to build in redundancy and protect your data against decay and loss.

This step is critical but can be overlooked, especially when focusing on publication and dissemination! Consider where to store or archive your data long term. Is this at your institution, or should you pay to have it archived elsewhere?
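The archiving concerns above can be made concrete with checksums: recording a checksum for each file before archiving lets you verify later that nothing was corrupted in storage or transfer. A minimal sketch in Python; the `*.fastq` file pattern is an illustrative assumption:

```python
import hashlib
from pathlib import Path

def sha256sum(path, chunk_size=1 << 20):
    """Compute a SHA-256 checksum by streaming the file in chunks,
    so even very large datasets never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record checksums before archiving; re-run later and compare to detect corruption.
manifest = {p.name: sha256sum(p) for p in Path(".").glob("*.fastq")}
for name, checksum in sorted(manifest.items()):
    print(checksum, name)
```

Saving the resulting manifest alongside the archived data gives you a cheap integrity check whether the files live on an institutional server, a paid archive, or cloud storage.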
Share and Disseminate
Does your funding agreement require you to submit your data to a public repository? Sharing your data publicly, when possible, helps science by allowing others to reuse your data and replicate your analyses with new tools. There are studies demonstrating that papers that share data have a higher citation rate than papers that do not. Other items to consider are where and how to disseminate your work: at a conference, as a preprint, and/or in a journal? Where possible, reproducible scholarly works should be shared as open access, with licensing that details how the work can be used; funders such as the NIH and NSF sometimes require work to be shared as open access.

Publish and Reuse
This is typically the final stage of the data lifecycle in an effort, initiative, or project. Wrapping up "loose ends" means reviewing your product(s) to ensure access, reproducibility, and availability of your code. Is there a visible point of contact for your code? Is there a designated coordinator or leader who will be the point of contact for your work? Who will handle data or code requests in the future: 1 year, 5 years, 10 years down the road?