LIS 4/5523: Online Information Retrieval
A process in which sets of records or documents are searched to find items which may help to satisfy the information need
IR is concerned with the representation, storage, organization, and accessing of information objects
Blair, 1990
Chowdhury, 2010
User Group
Information Need
Information Sources
Information System
Results of the Query
User Selection & Evaluation (Relevance)
Most IR is based on techniques introduced in the 1960s
IR is no longer just a library problem
As a result of these evolved uses, high standards of retrieval are expected by users
We can divide IR techniques into basic classes
Simple Match Model
Request = Information Data
Document A = data, information
Document B = data, information
Document C = information, retrieval
Advantages: simple process; widespread; familiar
Disadvantages: single descriptor requests less effective in large databases
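The simple match idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three documents and their descriptors come from the slide, while the helper names `simple_match` and `full_matches` are hypothetical and not part of any IR system.

```python
# Toy collection from the slide: each document is represented by its descriptors.
documents = {
    "A": {"data", "information"},
    "B": {"data", "information"},
    "C": {"information", "retrieval"},
}

def simple_match(request, docs):
    """Hypothetical helper: list the request terms each document shares."""
    terms = {t.lower() for t in request.split()}
    return {name: sorted(terms & descriptors) for name, descriptors in docs.items()}

def full_matches(request, docs):
    """Hypothetical helper: documents whose descriptors contain every request term."""
    terms = {t.lower() for t in request.split()}
    return sorted(name for name, descriptors in docs.items() if terms <= descriptors)

matches = simple_match("Information Data", documents)
for name, shared in matches.items():
    print(name, shared)

print(full_matches("Information Data", documents))
```

Documents A and B share both request terms, while C shares only "information". With a single-term request such as "Information", every document matches, which is the weakness the slide notes for large databases.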
Boolean operators (AND, OR, and NOT) allow more complex queries to the IR system
Set Theory
OR = build up concepts
AND = combine words/concepts blocks
Only documents that contain ALL words/concept blocks
Produces smaller set/fewer documents
Example of Boolean Search
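One way to picture Boolean search is as set operations over an inverted index. The sketch below assumes a tiny hypothetical index built from the three toy documents used earlier in the deck (A, B, C); the variable names are illustrative only.

```python
# Hypothetical inverted index: term -> set of documents containing that term.
index = {
    "information": {"A", "B", "C"},
    "data": {"A", "B"},
    "retrieval": {"C"},
}

# OR builds up a concept: union of the posting sets (larger result).
data_or_retrieval = index["data"] | index["retrieval"]

# AND combines concept blocks: intersection of the posting sets (smaller result).
information_and_retrieval = index["information"] & index["retrieval"]

# NOT excludes a term: set difference.
information_not_data = index["information"] - index["data"]

print(sorted(data_or_retrieval))          # ['A', 'B', 'C']
print(sorted(information_and_retrieval))  # ['C']
print(sorted(information_not_data))       # ['C']
```

Here OR widens the result to all three documents, while AND and NOT each narrow it to document C.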

Weighted IR (probabilistic IR)
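As a rough sketch of the weighting idea, each query term can carry a weight, and documents are ranked by the total weight of the terms they contain instead of being simply matched or not. The weights, the documents, and the `score` function below are assumptions for illustration. This is not a full probabilistic model.

```python
# Toy collection (assumed): document -> descriptors.
documents = {
    "A": {"data", "information"},
    "B": {"data", "information"},
    "C": {"information", "retrieval"},
}

# Hypothetical query weights: "retrieval" matters more to this searcher.
query_weights = {"information": 1, "retrieval": 3}

def score(descriptors, weights):
    """Sum the weights of the query terms that appear in the document."""
    return sum(w for term, w in weights.items() if term in descriptors)

# Rank by descending score, breaking ties by document name.
ranking = sorted(
    ((name, score(d, query_weights)) for name, d in documents.items()),
    key=lambda pair: (-pair[1], pair[0]),
)
print(ranking)  # [('C', 4), ('A', 1), ('B', 1)]
```

Unlike a Boolean AND, documents A and B are still returned, just ranked below C.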

Advantages
Disadvantages
Semantic or Linguistic Model (NLP)
attempts to get at the “concepts” contained in the information object or the surrogate
- free text searching
- paragraph indexing
- discourse analysis
User Profiles
Intelligent Agents (e.g. Windows Cortana)
Web Search Engines
Data Mining/Text Extraction Methods

Topic Modeling
Soft clustering method based on a probabilistic IR algorithm, which can be used for classification in downstream tasks such as improving recommendation systems
Used to infer hidden themes in a collection of documents - provides an automatic means to organize, understand and summarize large collections of textual information
Uses statistical and machine learning techniques to mine meaningful information from a large corpus of unstructured data and document content
Infers abstract topics based on “similar patterns of word usage in each document”
Retrieval Augmented Generation (RAG) = Information Retrieval (IR) + Large Language Models (LLMs)
RAG is a Generative IR technique that helps AI models generate better, more accurate, and up-to-date responses by retrieving relevant information from external sources before generating an answer
Why Do We Need RAG?
Traditional LLMs (like GPT-4) have a fixed knowledge base from training data. But:
- They don’t know new information after training.
- They hallucinate (make up facts).
- They struggle with specific or niche knowledge (e.g., latest research papers).
Imagine you ask an AI:
🗣️ “Who won the Nobel Prize in Physics this year?”
Without RAG (LLM Only):
🤖 “I don’t know. My training data only goes up to 2023.”
With RAG (LLM + IR):
🤖 (Searches the web → Finds latest Nobel Prize winners → Summarizes the results)
“The 2024 Nobel Prize in Physics was awarded to [Winner’s Name] for [Reason].”
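The retrieve-then-generate flow can be sketched as follows. Everything here is an assumption for illustration: `SOURCES` stands in for an external collection, retrieval is plain word overlap, not a production ranking method, and `generate` is a placeholder where a real system would call an LLM.

```python
import re

# Hypothetical external sources; a real system would query a search index or the web.
SOURCES = [
    "Online catalogs let patrons search a library's holdings.",
    "Boolean search combines terms with AND, OR, and NOT.",
    "Topic models infer hidden themes in a collection of documents.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question, sources, k=1):
    """IR step: rank sources by word overlap with the question, keep the top k."""
    q = tokenize(question)
    ranked = sorted(sources, key=lambda s: len(q & tokenize(s)), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Augmentation step: place the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def generate(prompt):
    """Placeholder for the LLM call (assumption: no real model is contacted)."""
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"

question = "How does Boolean search combine terms?"
passages = retrieve(question, SOURCES)
prompt = build_prompt(question, passages)
print(prompt)
print(generate(prompt))
```

The key design point is that the model only sees what the retrieval step hands it, so answer quality depends on retrieval quality as much as on the LLM.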
Online catalogs
Online databases
Web Search Engines







Indexer has selected (perhaps among others) the concept that the patrons will want


Indexer picks a different topic
Indexer and patron use different terms for the same concept
Patrons cannot articulate just what the question state is
Indexer
describes doc
predicts use
Patron
describes doc
predicts doc
What patron attributes can we know?
What document attributes can we know?
How can we use this knowledge to open the bottleneck between patrons in need and the documents that might be of use?
1.1 Consistency
1.2 Subject Expertise
1.3 Indexing Expertise
2.1 Searching Experience
2.2 Domain Knowledge
3.1 Motivation Level
3.2 Emotional State
Use of Standards/Rules (code)
Depends on Resources/Audience