Reading: Information Retrieval in Bioinformatics

Why this distinction matters

In bioinformatics, we often use the word search very loosely. A researcher may say they “searched a database,” but the actual task could involve:

running a structured query against a database with a defined schema,
typing keywords into a search box,
retrieving articles or records ranked by relevance, or
looking across semi-structured documents with metadata and text.

Core idea: A database query is usually designed for structured data with known fields. Information retrieval (IR) is designed to find relevant information in a collection of documents or records, often when the content is unstructured or only partly structured.

Keyword searching as motivation

A keyword search looks for words anywhere in a collection of records or documents. Keyword searching is very common in biology because users often know the concept they want, but they do not know the exact schema, field names, or query language behind the resource.

In a traditional relational database, retrieval is often done through a query language such as SQL. In contrast, keyword searching has long been central to information retrieval systems, especially for text-heavy collections.

Why not just use SQL?

Every database may use a different schema.
Most users do not know SQL.
Even SQL varies across database systems.
Not all data resources are relational databases.

Why keyword search is attractive

It lowers the barrier to entry.
It lets users begin with concepts, not field names.
It works well for articles, abstracts, and descriptive text.
It supports exploratory searching.

What is information retrieval (IR)?

Information retrieval (IR) is the science of obtaining information relevant to a user’s need from a collection of information resources. In practice, IR often treats information as a collection of documents rather than as rows and columns in a highly structured table.

Those documents may be unstructured or semi-structured. They may contain titles, abstracts, free text, metadata, controlled vocabulary terms, and identifiers. The goal is not simply to find exact matches in a field, but to retrieve items that are likely to be relevant.

IR asks: “Given this information need, which documents are most relevant?”

Database querying asks: “Given this structure, which records satisfy these exact conditions?”
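The contrast between the two questions can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustration under stated assumptions: the genes table, its column names, its three rows, and the keyword_search helper are all hypothetical and do not come from any real resource.

```python
import sqlite3

# Hypothetical toy table; the schema and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (symbol TEXT, organism TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO genes VALUES (?, ?, ?)",
    [
        ("ptxA", "Bordetella pertussis", "pertussis toxin subunit 1"),
        ("fhaB", "Bordetella pertussis", "filamentous hemagglutinin"),
        ("lacZ", "Escherichia coli", "beta-galactosidase"),
    ],
)

# Database-style query: the user must know the field name and the exact value.
exact = conn.execute(
    "SELECT symbol FROM genes WHERE organism = ? ORDER BY symbol",
    ("Bordetella pertussis",),
).fetchall()


def keyword_search(conn, keyword):
    """Keyword-style search: look for the word in any text field, ignoring case."""
    pattern = f"%{keyword.lower()}%"
    return conn.execute(
        "SELECT symbol FROM genes "
        "WHERE lower(symbol) LIKE ? OR lower(organism) LIKE ? "
        "OR lower(description) LIKE ? "
        "ORDER BY symbol",
        (pattern, pattern, pattern),
    ).fetchall()


toxin_hits = keyword_search(conn, "toxin")
print(exact)       # rows whose organism field equals the given value
print(toxin_hits)  # rows that mention "toxin" in any field
```

In this sketch the first query only works if the user already knows that an organism field exists and how its values are spelled. The keyword helper needs no such knowledge, but it also cannot tell which field a match came from or how relevant it is.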

How is IR different from querying a database?

Table: Simplified comparison of information retrieval and database querying
Feature	Information Retrieval (IR)	Database Querying
Typical data	Unstructured or semi-structured documents	Structured records with defined fields
Organization	Collection of documents, text, metadata, identifiers	Schema-defined tables, fields, and relationships
User input	Keywords, phrases, concepts	Formal conditions and field-based constraints
Match style	Approximate or relevance-based	Exact logical conditions
Output	Ranked results by estimated relevance	Records that satisfy the query
Typical challenge	Ambiguity in terms and meaning	Knowing the schema and query language

IR systems also handle problems that are less central in traditional database querying, such as:

approximate searching by keywords,
ranking results by likely relevance, and
language variation or synonymy in the way users describe concepts.
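As a rough illustration of ranking by likely relevance, the sketch below scores a tiny, made-up collection with a simplified TF-IDF weighting (term count multiplied by a smoothed inverse document frequency). The documents, the rank function, and the exact weighting formula are assumptions chosen for brevity; production IR systems use more refined scoring.

```python
import math
from collections import Counter

# Hypothetical mini-collection; the texts are invented for illustration.
documents = {
    "doc1": "pertussis toxin structure and function",
    "doc2": "whooping cough vaccine trial in infants",
    "doc3": "pertussis vaccine and pertussis outbreak surveillance",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def rank(query, documents):
    """Return (doc_id, score) pairs sorted by a simple TF-IDF score, best first."""
    tokens = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    n_docs = len(documents)
    scores = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        score = 0.0
        for term in tokenize(query):
            # Document frequency: how many documents contain the term.
            df = sum(1 for doc_words in tokens.values() if term in doc_words)
            if df == 0:
                continue
            idf = math.log(1 + n_docs / df)  # smoothed inverse document frequency
            score += counts[term] * idf
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


results = rank("pertussis vaccine", documents)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Here doc3 ranks first because it matches both query terms and mentions one of them twice. Note that doc2 is retrieved only through the word "vaccine": the phrase "whooping cough" contributes nothing to its score, which is exactly the synonymy problem listed above.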

IR in bioinformatics

In bioinformatics, IR becomes especially important when the target information is embedded in text, annotations, abstracts, article metadata, or descriptive records. Biological terms are often messy: the same concept may appear under multiple names, abbreviations, or spellings.

Example: synonymy and name variation

A user searching for information about Bordetella pertussis, the bacterium that causes whooping cough (pertussis), might enter any of the following:

pertussis
whooping cough
Bordetella pertussis
B. pertussis

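A database system and an IR system may treat these terms very differently. An exact-match query for “pertussis,” for example, will not return a record that mentions only “whooping cough.” Metadata, ontologies, controlled vocabularies, and query expansion help link such variants.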

This is one reason why standardized keyword systems, taxonomies, and controlled vocabularies are valuable. They reduce ambiguity and help connect related terms during search and retrieval.
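One way to picture how a controlled vocabulary connects related terms is a small query-expansion sketch. Everything here is hypothetical: RELATED_TERMS stands in for a real vocabulary or ontology, the records are invented, and substring matching is a deliberate simplification.

```python
# Hypothetical term groups standing in for a controlled vocabulary.
RELATED_TERMS = [
    {"pertussis", "whooping cough", "bordetella pertussis", "b. pertussis"},
]


def expand_query(query):
    """Return the query plus every related term from the group it belongs to."""
    normalized = query.strip().lower()
    for group in RELATED_TERMS:
        if normalized in group:
            return sorted(group)
    return [normalized]  # unknown terms are searched as typed


def search(query, records, expand=True):
    """Return ids of records whose text contains the query or a related term."""
    terms = expand_query(query) if expand else [query.strip().lower()]
    return sorted(
        rec_id
        for rec_id, text in records.items()
        if any(term in text.lower() for term in terms)
    )


# Invented records for illustration.
records = {
    "r1": "Whooping cough cases rose among infants.",
    "r2": "Genome of B. pertussis strain sequenced.",
    "r3": "Escherichia coli growth curves.",
}

plain_hits = search("pertussis", records, expand=False)
expanded_hits = search("pertussis", records)
print(plain_hits)
print(expanded_hits)
```

Without expansion, the record that mentions only "whooping cough" is missed. With expansion, the same one-word query also retrieves it, because the vocabulary links the variant names.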

What’s the difference?

In practice, many modern websites combine both ideas. A site may store its data in a structured back-end database but expose a keyword-based interface to users. The underlying system still relies on database design, while the user-facing front end behaves more like information retrieval, so users are not expected to know a query language.

Searching a resource and receiving a ranked list of results to choose from is probably information retrieval.
Getting records from a query in which you specify exact words, dates, identifiers, or phrases is likely querying an actual database (or a portion of it).
Sometimes, both can happen in the same resource.