Information Retrieval in Bioinformatics

Estimated time to complete: 15–25 minutes

When people say they are “searching a database,” they may actually be doing two different things: querying a structured database or using an information retrieval system. These are related, but they are not the same.

Why this distinction matters

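In bioinformatics, we often use the word search very loosely. A researcher may say they “searched a database,” but the actual task could involve a structured query against known fields, a keyword search over free text, or a combination of the two.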

Core idea: A database query is usually designed for structured data with known fields. Information retrieval (IR) is designed to find relevant information from a collection of documents or records, often when the content is unstructured or only partly structured.

Keyword searching as motivation

A keyword search looks for words anywhere in a collection of records or documents. Keyword searching is very common in biology because users often know the concept they want, but they do not know the exact schema, field names, or query language behind the resource.

In a traditional relational database, retrieval is often done through a query language such as SQL. In contrast, keyword searching has long been central to information retrieval systems, especially for text-heavy collections.
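The contrast can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `genes` table, its three columns, and the toy rows are invented for this example, and the `keyword_search` helper is a hypothetical name, not part of any real resource.

```python
import sqlite3

# Toy records in a hypothetical "genes" table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (symbol TEXT, organism TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO genes VALUES (?, ?, ?)",
    [
        ("ptxA", "Bordetella pertussis", "pertussis toxin subunit 1"),
        ("fhaB", "Bordetella pertussis", "filamentous hemagglutinin"),
        ("lacZ", "Escherichia coli", "beta-galactosidase"),
    ],
)

# Database query: the user must know the table, the field name,
# and the exact value stored in that field.
sql_hits = conn.execute(
    "SELECT symbol FROM genes WHERE organism = ? ORDER BY symbol",
    ("Bordetella pertussis",),
).fetchall()


def keyword_search(connection, keyword):
    """Return symbols of rows where the keyword appears in any field."""
    needle = keyword.lower()
    rows = connection.execute(
        "SELECT symbol, organism, description FROM genes ORDER BY symbol"
    )
    return [row[0] for row in rows if any(needle in field.lower() for field in row)]


# Keyword search: the user supplies a word and no schema knowledge.
kw_hits = keyword_search(conn, "toxin")

print(sql_hits)  # [('fhaB',), ('ptxA',)]
print(kw_hits)   # ['ptxA']
```

The SQL query only works if the user already knows that an `organism` field exists and how its values are spelled. The keyword version asks for nothing but a word, at the cost of scanning every field.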

Why not just use SQL?

  • Every database may use a different schema.
  • Most users do not know SQL.
  • Even SQL varies across database systems.
  • Not all data resources are relational databases.

Why keyword search is attractive

  • It lowers the barrier to entry.
  • It lets users begin with concepts, not field names.
  • It works well for articles, abstracts, and descriptive text.
  • It supports exploratory searching.

What is information retrieval (IR)?

Information retrieval (IR) is the science of obtaining information relevant to a user’s need from a collection of information resources. In practice, IR often treats information as a collection of documents rather than as rows and columns in a highly structured table.

Those documents may be unstructured or semi-structured. They may contain titles, abstracts, free text, metadata, controlled vocabulary terms, and identifiers. The goal is not simply to find exact matches in a field, but to retrieve items that are likely to be relevant.

IR asks: “Given this information need, which documents are most relevant?”
Database querying asks: “Given this structure, which records satisfy these exact conditions?”

How is IR different from querying a database?

Table: Simplified comparison of information retrieval and database querying
Feature           | Information Retrieval (IR)                         | Database Querying
------------------|----------------------------------------------------|-----------------------------------------------
Typical data      | Unstructured or semi-structured documents          | Structured records with defined fields
Organization      | Collection of documents, text, metadata, identifiers | Schema-defined tables, fields, and relationships
User input        | Keywords, phrases, concepts                        | Formal conditions and field-based constraints
Match style       | Approximate or relevance-based                     | Exact logical conditions
Output            | Ranked results by estimated relevance              | Records that satisfy the query
Typical challenge | Ambiguity in terms and meaning                     | Knowing the schema and query language
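
To make "ranked results by estimated relevance" concrete, here is a minimal sketch of relevance ranking. It uses a simplified TF-IDF-style weighting, which is one common scoring approach and not necessarily what any particular bioinformatics resource uses. The three documents, the `tokenize` and `rank` helpers, and the exact weighting formula are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical mini-collection of abstracts, invented for illustration.
DOCS = {
    "doc1": "Pertussis toxin and pertussis vaccine response in infants",
    "doc2": "Vaccine coverage and outbreak surveillance",
    "doc3": "Genome assembly of a soil bacterium",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, docs):
    """Return (doc_id, score) pairs with score > 0, best match first."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for term in tokenize(query):
            # Document frequency: how many documents contain this term.
            df = sum(1 for c in term_counts.values() if term in c)
            if df == 0:
                continue
            # Rarer terms get a higher weight (simplified IDF, an assumption).
            idf = math.log(n_docs / df) + 1.0
            score += counts[term] * idf
        if score > 0:
            scores[doc_id] = score
    # Sort by descending score, breaking ties by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


results = rank("pertussis vaccine", DOCS)
for doc_id, score in results:
    print(doc_id, round(score, 2))
```

Here `doc1` ranks above `doc2` because it contains both query terms and repeats the rarer one, while `doc3` is left out entirely. No document has to match the query exactly, which is the practical difference from the "exact logical conditions" column.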

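IR systems also handle problems that are less central in traditional database querying, such as ambiguous terms, multiple names, abbreviations, or spellings for the same concept, and ranking results by estimated relevance.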

IR in bioinformatics

In bioinformatics, IR becomes especially important when the target information is embedded in text, annotations, abstracts, article metadata, or descriptive records. Biological terms are often messy: the same concept may appear under multiple names, abbreviations, or spellings.

Example: semantic ambiguity

A user searching for information about Bordetella pertussis might search using:

  • pertussis
  • whooping cough
  • Bordetella pertussis
  • B. pertussis

A database system and an IR system may treat these terms very differently unless metadata, an ontology, a controlled vocabulary, or query expansion is used to connect them.

This is one reason why standardized keyword systems, taxonomies, and controlled vocabularies are valuable. They reduce ambiguity and help connect related terms during search and retrieval.
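
As a rough sketch of how a controlled vocabulary can connect related terms, the example below expands a query into its known variants before matching. The term group, the three records, and the `expand` and `search` helpers are hypothetical. A real system would draw on a curated vocabulary or ontology rather than a hand-written set.

```python
# Hypothetical group of related terms, taken from the example list above.
# A curated controlled vocabulary or ontology would normally supply this.
TERM_GROUPS = [
    {"pertussis", "whooping cough", "bordetella pertussis", "b. pertussis"},
]

# Toy records, invented for illustration.
RECORDS = {
    "rec1": "Whooping cough cases rose among unvaccinated children.",
    "rec2": "B. pertussis isolates were sequenced.",
    "rec3": "Influenza surveillance report.",
}


def expand(query):
    """Return every known variant of the query term, or just the term itself."""
    q = query.strip().lower()
    for group in TERM_GROUPS:
        if q in group:
            return sorted(group)
    return [q]


def search(query, records, use_expansion=True):
    """Return ids of records whose text contains any of the search terms."""
    terms = expand(query) if use_expansion else [query.strip().lower()]
    return sorted(
        rec_id
        for rec_id, text in records.items()
        if any(term in text.lower() for term in terms)
    )


print(search("Bordetella pertussis", RECORDS, use_expansion=False))  # []
print(search("Bordetella pertussis", RECORDS))  # ['rec1', 'rec2']
```

Without expansion, the full species name matches nothing because neither relevant record spells it out. With expansion, the same query reaches both the common-name record and the abbreviated one.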

What's the difference?

In practice, many modern websites combine both ideas. A site may store data in a structured backend database but expose a keyword-based interface to users. The underlying system still relies on database design, but the user-facing front end behaves more like information retrieval, because users should not be expected to know a query language.

  1. Searching a resource and receiving a ranked list of results that you have to choose from is probably information retrieval.
  2. Getting records from a query where you type exact words, dates, identifiers, or phrases is likely to be querying an actual database (or a portion of it).
  3. Sometimes, both can happen in the same resource.
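
One way a single resource can support both styles is a keyword front end that translates free text into a query against a structured backend. The sketch below is only an assumption about how such a layer might look: the `articles` table, its rows, and the `keyword_front_end` helper are invented for illustration.

```python
import sqlite3

# Hypothetical structured backend, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, year INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [
        (1, 2019, "Pertussis vaccine effectiveness in adolescents"),
        (2, 2021, "Whooping cough outbreak investigation"),
        (3, 2021, "Vaccine cold chain logistics"),
    ],
)


def keyword_front_end(connection, user_text):
    """Translate free-text keywords into a parameterized SQL query.

    Every keyword must appear somewhere in the title. The characters
    % and _ in user input would act as LIKE wildcards; a fuller
    implementation would escape them.
    """
    words = user_text.lower().split()
    if not words:
        return []
    where = " AND ".join("LOWER(title) LIKE ?" for _ in words)
    params = [f"%{word}%" for word in words]
    sql = f"SELECT id FROM articles WHERE {where} ORDER BY id"
    return [row[0] for row in connection.execute(sql, params)]


# Keyword-style access: the user types words and never sees the schema.
print(keyword_front_end(conn, "vaccine pertussis"))  # [1]

# Field-based access to the same backend: an exact condition on a known field.
exact = [row[0] for row in conn.execute(
    "SELECT id FROM articles WHERE year = ? ORDER BY id", (2021,)
)]
print(exact)  # [2, 3]
```

Both calls read the same table. The difference lies in what the user has to know: a few words in the first case, the `year` field and its exact value in the second.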