Databases

Classifying Databases used in Bioinformatics

A primary database is one that contains original data, or data that was submitted directly by the person or group who collected it.

One example is the Sequence Read Archive (SRA) from the National Center for Biotechnology Information (NCBI). When investigators generate sequencing data, they can submit it to the archive for others to reuse.

A secondary database contains derived, integrated, or otherwise curated/annotated information built from other sources and may also include expert analysis. Several NCBI databases link to one another and incorporate secondary data.
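The primary/secondary distinction can be sketched with a small in-memory SQLite database. This is a minimal illustration, not how any NCBI resource is actually built: the table names (`submitted_runs`, `organism_summary`), the accessions, and the read counts are all made up for the example. The first table holds records as a submitter provided them; the second is derived from the first.

```python
import sqlite3


def build_demo_database() -> sqlite3.Connection:
    """Create an in-memory database with a primary table and a derived table."""
    conn = sqlite3.connect(":memory:")

    # Primary data: rows stored as submitted (hypothetical accessions and counts).
    conn.execute(
        "CREATE TABLE submitted_runs ("
        " accession TEXT PRIMARY KEY,"
        " organism TEXT NOT NULL,"
        " read_count INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO submitted_runs VALUES (?, ?, ?)",
        [
            ("RUN0001", "Drosophila melanogaster", 1000),
            ("RUN0002", "Drosophila melanogaster", 3000),
            ("RUN0003", "Homo sapiens", 5000),
        ],
    )

    # Secondary data: a summary derived entirely from the primary table.
    conn.execute(
        "CREATE TABLE organism_summary AS"
        " SELECT organism, COUNT(*) AS run_count, SUM(read_count) AS total_reads"
        " FROM submitted_runs GROUP BY organism"
    )
    conn.commit()
    return conn


demo = build_demo_database()
summary = demo.execute(
    "SELECT organism, run_count, total_reads"
    " FROM organism_summary ORDER BY organism"
).fetchall()
for organism, run_count, total_reads in summary:
    print(organism, run_count, total_reads)
```

Nothing in `organism_summary` was submitted by anyone; it exists only because the primary records do, which is the defining feature of secondary data.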

There are other types of specialized or niche databases that serve specific purposes in the bioinformatics community:

A point solution database is designed to answer a specific biological question or support a defined workflow.
A deep database is designed to provide rich information for one species or one tightly focused biological system, e.g., FlyBase.
A broad database is designed to provide information organized across multiple species, usually by data type or biological system, e.g., UniProt.
A laboratory information management system (LIMS) is a database, usually with a graphical user interface, that helps manage samples, workflows, instruments, and associated metadata.
An advanced custom database is an in-house resource designed for the needs of a specific lab, clinic, core, or company. These are useful for specialized data sets or time-sensitive workflows and may incorporate data from public community resources.

Community Databases

A community database is built for, used by, and often developed by a scientific community with a shared interest. Because many groups rely on the same resource, community databases broaden utility and encourage adoption. Support for these resources often comes from grants and is intended to make public data reuse and provenance tracking possible.

The National Center for Biotechnology Information, NCBI, is a collection of resources for scientists (and the public!), including a number of databases that are critical for scientific research globally. Per their website, NCBI has a number of responsibilities:

"To carry out its diverse responsibilities, NCBI:

conducts research on fundamental biomedical problems at the molecular level using mathematical and computational methods

maintains collaborations with several NIH institutes, academia, industry, and other governmental agencies

fosters scientific communication by sponsoring meetings, workshops, and lecture series

supports training on basic and applied research in computational biology for postdoctoral fellows through the NIH Intramural Research Program

engages members of the international scientific community in informatics research and training through the Scientific Visitors Program

develops, distributes, supports, and coordinates access to a variety of databases and software for the scientific and medical communities

develops and promotes standards for databases, data deposition and exchange, and biological nomenclature"

A Note on NCBI's History

It is important for bioinformaticians to recognize the history of, and the resources dedicated to, NCBI and related resources such as EMBL. NCBI was created through a specific legislative process in the United States after scientists, in collaboration with the National Library of Medicine (NLM), identified molecular biology information management as a national need. NLM’s legislative chronology states that NCBI was established in 1988 under the Health Omnibus Programs Extension Act with strong support from Senator Claude Pepper [Source].

More information on the historical context for NCBI can be found in A brief history of NCBI’s formation and growth by Kent Smith.

NCBI Resources

NCBI has become one of the most influential community database ecosystems in biomedical science. Recent changes in the United States National Institutes of Health funding models require that funded investigators submit data generated by their projects to these repositories (as appropriate). These changes support data reuse, reproducibility, and participation in the scientific method at large.
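Data reuse often starts with a programmatic search. The sketch below builds (but does not send) a search URL for the `esearch` endpoint of NCBI's E-utilities, targeting the `sra` database. The helper name `build_sra_search_url` and the example query are illustrative assumptions; consult NCBI's E-utilities documentation for current usage guidelines before sending requests.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_sra_search_url(term: str, max_results: int = 5) -> str:
    """Return an esearch URL that would search the SRA database for `term`."""
    params = {
        "db": "sra",            # which Entrez database to search
        "term": term,           # the query, using Entrez search syntax
        "retmax": max_results,  # maximum number of record IDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_sra_search_url("Drosophila melanogaster[Organism]")
print(url)
```

Keeping URL construction separate from the network call makes the query easy to inspect, and the resulting string can be passed to whichever HTTP client you prefer.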

Remember - this is just an overview that introduces resources that exist and gives context for bioinformatics. You can study these topics more thoroughly, and you should let curiosity and due diligence guide how deep you go. You can find out more about NCBI resources, their purposes, and how to use them at the following links:

Interested in Historical Context or Resource Management?

While the current physical infrastructure and setup of NCBI resources are not available to the public, there are publications and other resources that capture the early design of some NCBI databases. For example, a paper describing the original Gene Expression Omnibus database schema was published in 2002 and describes the database setup at the time:

Edgar, R., Domrachev, M., & Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research, 30(1), 207-210. Link: academic.oup.com/nar/article/30/1/207/1332640

While it is no longer common to publish a database's data definition language or data manipulation language (for multiple reasons, including but not limited to security and privacy), many resources do publish regular updates, often in the same journal and under similar titles. See the examples below for more insights.

Sequence Read Archive Publications:

2010: Leinonen, R., Sugawara, H., Shumway, M., & International Nucleotide Sequence Database Collaboration. (2010). The sequence read archive. Nucleic acids research, 39(suppl_1), D19-D21. Link
2012: Kodama, Y., Shumway, M., & Leinonen, R. (2012). The Sequence Read Archive: explosive growth of sequencing data. Nucleic acids research, 40(D1), D54-D56. Link
2022: Katz, K., Shutov, O., Lapoint, R., Kimelman, M., Brister, J. R., & O’Sullivan, C. (2022). The Sequence Read Archive: a decade more of explosive growth. Nucleic acids research, 50(D1), D387-D390. Link
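
For readers unfamiliar with the terms used earlier in this section, the difference between data definition language (DDL) and data manipulation language (DML) can be shown in a few lines using Python's built-in `sqlite3` module. The `series` table and its contents are hypothetical and are not taken from any NCBI schema. Note that `SELECT` is often grouped with DML, though some references classify it separately as a query language.

```python
import sqlite3

ddl_demo = sqlite3.connect(":memory:")

# DDL: statements that define structure (tables, columns, constraints).
ddl_demo.execute(
    "CREATE TABLE series ("
    " series_id TEXT PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " sample_count INTEGER NOT NULL DEFAULT 0)"
)

# DML: statements that add, change, or read the rows inside that structure.
ddl_demo.execute(
    "INSERT INTO series (series_id, title) VALUES (?, ?)",
    ("SER1", "Example expression study"),
)
ddl_demo.execute(
    "UPDATE series SET sample_count = sample_count + 2 WHERE series_id = ?",
    ("SER1",),
)
row = ddl_demo.execute(
    "SELECT series_id, title, sample_count FROM series"
).fetchone()
print(row)
```

A published DDL script would expose the exact structure of a resource, which is part of why such details are now rarely shared in full.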