Information Retrieval Models

LIS 4/5523: Online Information Retrieval

Dr. Manika Lamba

Introduction

Information Retrieval

A process in which sets of records or documents are searched to find items that may help to satisfy an information need

  • IR is concerned with:

    • representation
    • storage
    • organization
    • accessing of information objects

Model of IR System

Blair, 1990

Model of IR System

Chowdhury, 2010

Model of Information Retrieval

Information Retrieval (Cont.)

  • Information retrieval concerns a range of concepts
    • User Group
      • types of knowledge
      • context & information environment
    • Information Need
    • Information Sources
    • Information System
      • system capabilities/IR techniques used
      • how information organized
    • Results of the Query
    • User Selection & Evaluation (Relevance)

Information Retrieval (Cont.)

  • The central problem of IR is how to represent documents for retrieval
  • To be more successful, document representations must be used in ways similar to how ordinary language is used
    • document representations should take context into account
    • document representations should take users into account

Information Retrieval (Cont.)

  • Most IR is based on techniques introduced in the 1960s
    • Primarily text-based retrieval
    • Makes use of inverted indexes and index terms
  • IR is no longer just a library problem
    • Used in businesses, everyday settings
    • Used in search engines
  • As a result of these broader uses, users expect high standards of retrieval
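The inverted index mentioned above can be sketched in a few lines of Python. The tiny corpus here (documents A–C, mirroring the simple-match example later in these slides) is purely illustrative:

```python
from collections import defaultdict

docs = {
    "A": "data information",
    "B": "data information",
    "C": "information retrieval",
}

# Build an inverted index: each index term maps to the set of
# documents that contain it (its postings list).
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(dict(index))
```

Retrieval then never scans documents directly: a query term is simply looked up in the index to get its postings list.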

Taxonomy of IR Techniques

  • We can divide IR techniques into basic classes
    • Exact Match: where the set of retrieved documents contains only documents whose representations match exactly with the query
    • Partial Match: where documents are retrieved when their representations match the query only in part (though some of the retrieved documents may still match the query exactly)

Traditional IR Model

Simple Match Model

Request = Information Data

Document A = data, information

Document B = data, information

Document C = information, retrieval

Advantages: simple process; widespread; familiar

Disadvantages: single-descriptor requests are less effective in large databases
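The simple match above can be expressed as a set-containment test (a sketch; the request and document contents are taken from the slide):

```python
# Simple match: retrieve documents whose terms include every request term.
request = {"information", "data"}

docs = {
    "A": {"data", "information"},
    "B": {"data", "information"},
    "C": {"information", "retrieval"},
}

# A and B contain both request terms; C matches only "information".
matches = sorted(d for d, terms in docs.items() if request <= terms)
print(matches)  # -> ['A', 'B']
```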

Boolean IR Model

  • Boolean Retrieval
    • one step above the basic model
  • Named after George Boole, who developed Boolean algebra in the mid-1800s
  • Most familiar IR technique, used in OPACs and online databases
  • Uses AND, OR, and NOT to allow more complex queries to the IR system
  • Based on what is called set theory

Operators: OR

OR = build up concepts

  • Synonyms or equivalent terms
  • Spelling variants
  • Related terms

OR: How it Works

  • Retrieves documents that contain ANY of the terms, in any combination
  • Produces large sets/more documents

Operators: AND

  • AND = combine words/concepts blocks

  • Only documents that contain ALL words/concept blocks

  • Produces smaller set/fewer documents

Operators: NOT

  • NOT = used to exclude words/concepts from a set
  • ONLY documents that DO NOT include excluded terms
  • Produces smaller, more specific sets/fewer documents
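The three operators map directly onto set operations over postings lists — a minimal sketch using the same toy documents as the simple-match slide:

```python
# Postings lists: term -> set of documents containing it
postings = {
    "data": {"A", "B"},
    "information": {"A", "B", "C"},
    "retrieval": {"C"},
}

# OR = set union: any document containing either term (larger set)
or_set = postings["data"] | postings["retrieval"]

# AND = set intersection: only documents containing both terms (smaller set)
and_set = postings["data"] & postings["information"]

# NOT = set difference: exclude documents containing the second term
not_set = postings["information"] - postings["retrieval"]
```

Note that every document is either in the result set or not — there is no ranking, which is one of the weaknesses discussed below.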

Boolean IR Model (Cont.)

  • Example of Boolean Search

Problems with Boolean Model

  • Need to know the order (precedence) in which the system processes the operators
  • Seems very simple to users but is really fairly complex
  • May miss potentially relevant documents
  • Does not rank retrieved documents
  • Concepts within documents are difficult to show
  • So why do we continue to use them?

Term Weighting

  • Weighted IR (probabilistic IR)
    • makes use of inverted index and index terms
    • easier to assign weights if automatically indexed
    • each term in the index has a weight or value attached to it
      • weight reflects its relative importance in the document
    • weight is determined by use of term frequency
      • defined as the number of occurrences of a term in the document
      • the more frequently a term appears in a document, the more likely it is to be an important concept within the document
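A minimal sketch of frequency-based weighting, using an invented two-document corpus. The tf-idf scaling shown (term frequency times inverse document frequency) is the standard refinement of the raw-frequency weight described above:

```python
import math
from collections import Counter

docs = {
    "A": "data information data data".split(),
    "B": "information retrieval".split(),
}

N = len(docs)

# Document frequency: how many documents contain each term.
df = Counter()
for terms in docs.values():
    df.update(set(terms))

def weight(term, doc_id):
    """tf-idf: raw term frequency scaled by how rare the term is overall."""
    tf = docs[doc_id].count(term)
    idf = math.log(N / df[term])
    return tf * idf

# "data" occurs 3 times in A and in only one document, so it gets a high
# weight; "information" occurs in every document, so its idf (and weight)
# is zero — frequent-everywhere terms carry little discriminating power.
```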

Example of Weighted IR

Term Weighting (Cont.)

Other IR Models

Clustering (such as faceted search, tag clouds)

Clustering (Cont.)

Vector Space IR

Advantages

  • more useful documents retrieved
  • levels of relevance are shown

Disadvantages

  • can be very complex
  • difficult to explain to users
  • was not very feasible for OPACs without a GUI interface
  • subjective, dependent upon user
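In the vector space model, documents and queries are vectors of term weights, and relevance is the cosine of the angle between them. A sketch with made-up weights (the three axes stand for the terms data, information, retrieval):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Axes: (data, information, retrieval); weights are invented for illustration.
query = (1, 1, 0)
doc_a = (2, 1, 0)   # heavy on "data"
doc_c = (0, 1, 1)   # about "information retrieval"

# Documents can now be ranked by score — the "levels of relevance"
# listed under Advantages.
scores = {"A": cosine(query, doc_a), "C": cosine(query, doc_c)}
```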

Other IR Models

  • Semantic or Linguistic Model (NLP)

attempts to get at the “concepts” contained in the information object or the surrogate

  • syntactic analysis
  • free text searching
  • paragraph indexing
  • discourse analysis
  • Passage Retrieval

Newer IR Models

  • User Profiles

    • uses heuristics (rules of thumb)
    • uses process models
  • Intelligent Agents (e.g. Microsoft Cortana)

    • autonomous
    • able to learn
    • customizable
  • Web Search Engines

  • Data Mining/Text Extraction Methods

Example of Conversational Retrieval System

Topic Modeling

  • Soft clustering method based on probabilistic IR algorithms, which can be used for classification in downstream tasks such as improving recommendation systems

  • Used to infer hidden themes in a collection of documents - provides an automatic means to organize, understand and summarize large collections of textual information

  • Based on statistical and machine learning techniques to mine meaningful information from a vast corpus of unstructured data and document content

  • Infers abstract topics based on “similar patterns of word usage in each document”
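The "similar patterns of word usage" idea can be made concrete with a toy collapsed Gibbs sampler for LDA, the best-known probabilistic topic model. Everything here (the four mini-documents, the choice of two topics, the hyperparameters) is invented for illustration:

```python
import random
from collections import defaultdict

random.seed(0)

# Four tiny documents with two apparent themes (illustrative only).
docs = [
    "data information retrieval index".split(),
    "data storage index retrieval".split(),
    "topic model word document".split(),
    "topic word distribution document".split(),
]

K = 2                  # number of topics (chosen by hand here)
alpha, beta = 0.1, 0.01
V = len({w for d in docs for w in d})

ndk = [[0] * K for _ in docs]                # doc -> topic counts
nkw = [defaultdict(int) for _ in range(K)]   # topic -> word counts
nk = [0] * K                                 # topic totals
z = []                                       # topic assignment per token

# Random initialization of every token's topic.
for d, doc in enumerate(docs):
    zd = []
    for w in doc:
        t = random.randrange(K)
        zd.append(t)
        ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    z.append(zd)

# Collapsed Gibbs sampling: resample each token's topic in turn,
# conditioned on all the other assignments.
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
            weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                       for k in range(K)]
            t = random.choices(range(K), weights=weights)[0]
            z[d][i] = t
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

# Per-document topic proportions — the inferred "hidden themes".
for d in range(len(docs)):
    total = sum(ndk[d])
    print(d, [round(c / total, 2) for c in ndk[d]])
```

Documents that use similar words end up concentrated on the same topic, which is exactly the soft clustering the slide describes.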

Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) = Information Retrieval (IR) + Large Language Models (LLMs)

RAG is a Generative IR technique that helps AI models generate better, more accurate, and up-to-date responses by retrieving relevant information from external sources before generating an answer

Why Do We Need RAG?

Traditional LLMs (like GPT-4) have a fixed knowledge base from training data. But:
- They don’t know new information after training.
- They hallucinate (make up facts).
- They struggle with specific or niche knowledge (e.g., latest research papers).

  • Solution: RAG helps by searching for relevant documents and using them to generate accurate answers!
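The retrieve-then-generate loop can be sketched with a toy keyword-overlap retriever and a stub standing in for the LLM call. Both `retrieve` and `generate` are invented names for this sketch, not a real API:

```python
def retrieve(query, corpus, k=1):
    """Toy retriever: rank passages by how many lowercase words
    they share with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def generate(query, passages):
    """Stub for the LLM step: a real system would prompt the model
    with the query plus the retrieved passages as grounding context."""
    return f"Answer to {query!r}, grounded in: {' '.join(passages)}"

corpus = [
    "Passage listing the latest Nobel Prize in Physics winners.",
    "Inverted indexes map terms to the documents that contain them.",
    "Boolean retrieval combines postings sets with AND, OR, and NOT.",
]

query = "Who won the Nobel Prize in Physics?"
print(generate(query, retrieve(query, corpus)))
```

Production systems replace the keyword retriever with dense vector search over an embedding index, but the pipeline shape — retrieve first, then generate — is the same.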

How Does RAG Work?

Imagine you ask an AI:

🗣️ “Who won the Nobel Prize in Physics this year?”

Without RAG (LLM Only):

🤖 “I don’t know. My training data only goes up to 2023.”

With RAG (LLM + IR):

🤖 (Searches the web → Finds latest Nobel Prize winners → Summarizes the results)

“The 2024 Nobel Prize in Physics was awarded to [Winner’s Name] for [Reason].”

IR systems You Use

  • Online catalogs

  • Online databases

  • Web Search Engines

Requirements for Successful Retrieval

The indexer has selected (perhaps among others) the concepts that patrons will want

What If?

  • Indexer picks a different topic
  • Indexer and patron use different terms for the same concept
  • Patrons cannot articulate just what their question or information need is

The Dance

Indexer

describes doc

predicts use

Patron

describes need

predicts doc

Some Important Questions

  • What patron attributes can we know?
  • What document attributes can we know?
  • How can we use this knowledge to open the bottleneck between patrons in need and the documents that might be of use?

Indexing Factors Affecting IR Performance

  1. Indexing

     1.1 Consistency

     1.2 Subject Expertise

     1.3 Indexing Expertise

  2. Type of Knowledge

     2.1 Searching Experience

     2.2 Domain Knowledge

  3. Affective/Cognitive

     3.1 Motivation Level

     3.2 Emotional State

Other Factors to Consider

  • Use of Standards/Rules (code)
    • Depends on form of index/abstract
    • Depends on criteria of employer
      • Pages allocated
      • Format used
      • Order
  • Depends on Resources/Audience

How Does this Relate to Searching?

Types of IR Systems

  • Pre-Coordinate systems
    • printed indexes and catalogs
    • OPACs
  • Post-Coordinate systems
  • Computer retrieval systems (Databases)
  • Online retrieval systems (Internet and Web)
  • Smart phones
  • Tablet, computers
  • Others you use? Can we consider social media an IR system? Can we consider all chatbots IR systems, or just the modern ones powered by LLMs?

Cautions

  • We should take a more global view of IR
  • Users (including information professionals) need to know which IR model or technique the system is using for retrieval
  • Users also need to know how information systems are structured and how objects are represented in the system