Basic Concepts of Information Retrieval

LIS 4/5523: Online Information Retrieval

Dr. Manika Lamba

Introduction

  • Data and information generation in every discipline in the universe of knowledge has seen staggering growth

  • Storing, managing, querying, & retrieving huge amounts of data & information require sophisticated procedures & advanced technologies

  • Nowadays, information collections are web-based and online; they are vast and growing at an exponential rate

Information Retrieval

Definition

A process in which sets of records or documents are searched to find items which may help to satisfy an information need

Information Retrieval

Information Retrieval includes:

  • Search engines have been developed based on the concepts, principles, and techniques developed by IR

Brief History of Information Retrieval

  • The System for the Mechanical Analysis and Retrieval of Text (SMART) was developed by Gerard Salton at Cornell University in the 1960s. This system incorporated many important concepts such as the vector space model, relevance feedback, and Rocchio classification
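
The vector space model named above can be illustrated with a minimal sketch. It is not SMART's actual implementation: the whitespace tokenizer, the `log(N/df) + 1` weighting, and the function names `tf_idf_vectors`, `cosine`, and `rank` are all assumptions chosen for brevity. Documents and the query become weighted term vectors, and documents are ranked by cosine similarity to the query.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace only.
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return ({term: weight} per document, idf table) using tf * idf weights."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # One common idf variant (an assumption, not the only choice).
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (doc_index, score) pairs, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms unseen in the collection get weight 0.
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scores = [(i, cosine(q_vec, vec)) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Invented sample documents, used only for illustration.
    sample_docs = [
        "information retrieval systems",
        "cooking pasta at home",
        "retrieval of medical literature",
    ]
    for doc_index, score in rank("information retrieval", sample_docs):
        print(doc_index, round(score, 3))
```

A document that shares no terms with the query scores 0, while a document that shares only some terms scores lower than one that shares all of them.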

Brief History of Information Retrieval

  • J.W. Sammon (1969) proposed integrating a visualization interface into an IR system in his famous paper “A nonlinear mapping for data structure analysis”

  • First online systems: NLM’s AIM-TWX and MEDLINE; Lockheed’s Dialog; SDC’s ORBIT

  • During 1966-67, F.W. Lancaster evaluated MEDLARS (the Medical Literature Analysis and Retrieval System)

Brief History of Information Retrieval

  • The ACM SIGIR Conference started in 1978 and subsequently emerged as the apex conference on IR systems

  • Belkin, Oddy, and Brooks introduced the concept of Anomalous State of Knowledge (ASK) for information retrieval in 1982

  • The OKAPI model was formulated in 1982-88; it is a set-oriented, ranked-output design for probabilistic retrieval of textual material using an inverted index

  • A major breakthrough came in 1989, when Tim Berners-Lee proposed the World Wide Web at CERN

  • The TREC conference started in 1992 as part of the TIPSTER Text program; it was sponsored by the US Department of Defense and the National Institute of Standards and Technology (NIST)

Brief History of Information Retrieval

  • PageRank algorithm was developed at Stanford University by Larry Page and Sergey Brin in 1996

  • In 1998, Google Inc. was founded; it now dominates the search engine domain

  • Google personalized search started in 2005

  • Multimedia IR (Smeulders, Lew, Sebe) was integrated into search in 2010

  • Semantic models such as Word2Vec and GloVe first appeared in 2013-2014

  • Google introduced BERT in 2018

  • Conversational IR was introduced in assistants such as Alexa and Siri in 2020-2021

  • Retrieval-Augmented Generation (RAG) in 2022-2023

  • Latent Semantic Indexing (LSI) gained huge popularity on the WWW and was widely used in Search Engine Optimization (SEO)

  • Latent Dirichlet Allocation (LDA), a generative topic model in NLP, was developed by David Blei, Andrew Ng, and Michael Jordan in 2003
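
Of the algorithms in this timeline, PageRank is compact enough to sketch. The version below is a simplified power iteration under stated assumptions, not the production algorithm: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages with no outgoing links, and the function name `pagerank` are all illustrative choices.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to a score; scores sum to 1.
    """
    # Include pages that are only ever link targets.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Invented four-page link graph, used only for illustration.
    graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

In this toy graph, page "a" is linked from two pages and ends up with the highest score, while "d", which nothing links to, keeps only the baseline share.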

Who are the Users?

A user is a person who uses information and/or information systems in some meaningful way

A user can be:

  • End-user: seeks, evaluates, and uses information for a personal question or problem
  • System-user: end user who exploits information systems at some level
  • Information professional: facilitates end-user information seeking and use
  • Computerized system, software program

User’s Information Needs

Users are motivated to seek information in a given situation to:

  • answer a question
  • solve a problem
  • complete a task
  • learn about a subject
  • verify a fact
  • just for fun

User’s Information Needs

Typical user questions:

  • What
  • When
  • Where
  • Why
  • How

Information Needs

Two broad categories of searches:

  • Known item search
  • Subject or topic search
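
The difference between the two categories can be made concrete with a small sketch. Everything here is hypothetical: the `Record` fields, the invented catalog entries, and the function names `known_item_search` and `subject_search` are assumptions for illustration. A known-item search looks for one specific item by attributes the user already knows, whereas a subject search looks for anything about a topic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A metadata record for one item; the fields are illustrative."""
    title: str
    author: str
    subjects: tuple = ()


# Invented sample records, used only for illustration.
CATALOG = [
    Record("Cataloging Basics", "A. Author", ("cataloging", "metadata")),
    Record("Metadata in Practice", "B. Writer", ("metadata", "digital libraries")),
    Record("Search Interfaces", "C. Scholar", ("information retrieval",)),
]


def known_item_search(catalog, title=None, author=None):
    """Find a specific item by exact (case-insensitive) title and/or author."""
    if title is None and author is None:
        return []
    results = []
    for record in catalog:
        if title is not None and record.title.lower() != title.lower():
            continue
        if author is not None and record.author.lower() != author.lower():
            continue
        results.append(record)
    return results


def subject_search(catalog, topic):
    """Find every item whose subject terms or title mention the topic."""
    topic = topic.lower()
    return [
        record
        for record in catalog
        if any(topic in subject.lower() for subject in record.subjects)
        or topic in record.title.lower()
    ]


if __name__ == "__main__":
    print(known_item_search(CATALOG, title="Cataloging Basics"))
    print(subject_search(CATALOG, "metadata"))
```

The known-item search typically returns one record or none, while the subject search returns a set of candidates that the user still has to judge for relevance.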

Information Retrieval Systems

A specialized system for the description, storage, and retrieval of information representations: primarily information objects (text, images) and their surrogates (metadata, records). It operates by matching queries (representations of an information need) with data (representations of information objects)
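
The matching step described above can be sketched with an inverted index. This is a minimal illustration under stated assumptions, not how any particular system works: the whitespace tokenizer, the Boolean AND semantics, and the names `build_index` and `match` are all hypothetical choices.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to the set of ids of documents whose text contains it.

    documents: dict mapping a document id to its text (the stored representation).
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Simplifying assumption: lowercase and split on whitespace only.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def match(index, query):
    """Return ids of documents that contain every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # One posting set per query term; unknown terms match nothing.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


if __name__ == "__main__":
    # Invented sample surrogates, used only for illustration.
    documents = {
        1: "online information retrieval",
        2: "image retrieval methods",
        3: "library catalogs and metadata",
    }
    index = build_index(documents)
    print(match(index, "retrieval"))
    print(match(index, "information retrieval"))
```

The query is reduced to the same term representation as the documents, so matching becomes a lookup plus a set intersection rather than a scan of every document.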

Components of IR systems

The knowledge system in which an IR system is embedded generally consists of three main components:

  1. people in their role as information-processors

  2. documents in their role as carriers of information

  3. topics as representations

Model of IR System

Types of IR Systems

Based on the type of service provided, IR systems can be categorized as:

    • Printed indexes and catalogs
    • OPACs
    • ChatGPT

What Information Can You Find Online?

  • Bibliographic citations
  • Full-text documents
  • Directory of reference sources
  • Numeric data
  • Images
  • Multimedia files

Issues to Consider

How is access to the Internet changing our users’ …………….?

  • Expectations

  • Ways of engaging with information

  • Brains

  • Information needs/seeking activities and behaviors

  • Other thoughts?

Putting It All Together