LIS 5043: Organization of Information
A system for choosing or highlighting some characteristics (attributes), together with a specification of the rules for selection (codes)
This implies a trade-off: if some characteristics are highlighted, other characteristics are left behind
ENTITIES: objects or concepts
ATTRIBUTES: characteristics of entities
DIACHRONIC: stable across time
SYNCHRONIC: changes across time
Indexer has selected (perhaps among others) the concept that the patrons will want
Indexer picks a different topic
Indexer and patron use different terms for the same concept
Patrons cannot articulate just what the question state is
Indexer
describes doc
predicts use
Patron
describes doc
predicts doc
What patron attributes can we know?
What document attributes can we know?
How can we use this knowledge to open the bottleneck between patrons in need and the documents that might be of use?
1.1 Consistency
1.2 Subject Expertise
1.3 Indexing Expertise
2.1 Searching Experience
2.2 Domain Knowledge
3.1 Motivation Level
3.2 Emotional State
Use of Standards/Rules (code)
Depends on Resources/Audience
A process in which sets of records or documents are searched to find items which may help to satisfy the information need
IR is concerned with:
representation
storage
organization
accessing of information objects
User Group
Information Need
Information Sources
Information System
Results of the Query
User Selection & Evaluation (Relevance)
Most IR is based on techniques introduced in the 1960s
IR is no longer just a library problem
As a result of these evolved uses, users expect high standards of retrieval
We can divide IR techniques into basic classes
Simple Match Model
Request = Information Data
Document A = data, information
Document B = data, information
Document C = information, retrieval
Advantages: simple process; widespread; familiar
Disadvantages: single-descriptor requests are less effective in large databases
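The simple match example above can be sketched in a few lines of Python. This is only an illustration: the `documents` dictionary and the `simple_match` function are hypothetical names, and the sketch assumes that a document matches only when it carries every descriptor in the request.

```python
# Hypothetical index: each document is represented by a set of descriptors,
# copied from the slide's example.
documents = {
    "Document A": {"data", "information"},
    "Document B": {"data", "information"},
    "Document C": {"information", "retrieval"},
}


def simple_match(request, docs):
    """Return names of documents whose descriptors include every request term.

    Assumption: "match" means all request terms must be present.
    """
    terms = {term.lower() for term in request.split()}
    return sorted(name for name, descriptors in docs.items() if terms <= descriptors)


matches = simple_match("Information Data", documents)
print(matches)  # ['Document A', 'Document B']
```

Document C is left out because it lacks the descriptor "data", even though it shares "information" with the request.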
AND, OR, and NOT operators allow more complex queries to the IR system
Set Theory
Example of Boolean Search
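Because Boolean retrieval rests on set theory, the three operators map directly onto set intersection, union, and difference. The sketch below assumes a small hypothetical inverted index (`index`) that reuses the descriptors from the simple match example; the names are illustrative only.

```python
# Hypothetical inverted index: descriptor -> set of document ids.
index = {
    "data": {"A", "B"},
    "information": {"A", "B", "C"},
    "retrieval": {"C"},
}


def postings(term):
    """Return the set of documents indexed under a term (empty if unknown)."""
    return index.get(term, set())


# information AND retrieval -> set intersection
and_result = postings("information") & postings("retrieval")

# data OR retrieval -> set union
or_result = postings("data") | postings("retrieval")

# information NOT data -> set difference
not_result = postings("information") - postings("data")

print(sorted(and_result))  # ['C']
print(sorted(or_result))   # ['A', 'B', 'C']
print(sorted(not_result))  # ['C']
```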
Weighted IR (probabilistic IR)
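The slide does not name a specific weighting scheme, so the sketch below swaps in TF-IDF, one common way to weight terms, purely as an illustration of ranked (rather than yes/no) retrieval. It is not a full probabilistic model. The toy corpus `docs` and the functions `idf`, `score`, and `rank` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: document id -> list of tokens.
docs = {
    "A": "data information data".split(),
    "B": "data information".split(),
    "C": "information retrieval".split(),
}


def idf(term, corpus):
    """Smoothed inverse document frequency: rarer terms get larger weights."""
    df = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log((1 + len(corpus)) / (1 + df)) + 1


def score(query_terms, tokens, corpus):
    """Sum of term frequency times IDF over the query terms."""
    tf = Counter(tokens)
    return sum(tf[term] * idf(term, corpus) for term in query_terms)


def rank(query_terms, corpus):
    """Return (document id, score) pairs, best match first."""
    scored = [(name, score(query_terms, tokens, corpus)) for name, tokens in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank(["data", "retrieval"], docs)
print([name for name, _ in ranking])  # ['A', 'C', 'B']
```

Unlike the Boolean model, every document receives a score, so the system can present results in order of estimated usefulness instead of as an unordered set.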
Topic modeling itself is a soft clustering method, but its output can be used for classification in downstream tasks such as information retrieval and improving recommendation systems
It is used to infer the hidden themes in a collection of documents and thus provides an automatic means to organize, understand and summarize large collections of textual information
It uses statistical and machine learning techniques to mine meaningful information from a vast corpus of unstructured data, including the content of individual documents
It infers abstract topics based on “similar patterns of word usage in each document”. These topics are simply groups of words from the collection that best represent the information it contains
Semantic or Linguistic Model (NLP)
attempts to get at the “concepts” contained in the information object or the surrogate
syntactic analysis
free text searching
paragraph indexing
discourse analysis
Passage Retrieval
User Profiles
Intelligent Agents (e.g. Windows Cortana)
Web Search Engines
Data Mining/Text Extraction Methods
Retrieval-Augmented Generation (RAG) = Information Retrieval (IR) + Large Language Models (LLMs)
RAG is a technique that helps AI models generate better, more accurate, and up-to-date responses by retrieving relevant information from external sources before generating an answer
Why Do We Need RAG?
Traditional LLMs (like GPT-4) have a fixed knowledge base from training data. But:
- They don’t know new information after training.
- They hallucinate (make up facts).
- They struggle with specific or niche knowledge (e.g., latest research papers).
Imagine you ask an AI:
🗣️ “Who won the Nobel Prize in Physics this year?”
Without RAG (LLM Only):
🤖 “I don’t know. My training data only goes up to 2023.”
With RAG (LLM + IR):
🤖 (Searches the web → Finds latest Nobel Prize winners → Summarizes the results)
“The 2024 Nobel Prize in Physics was awarded to [Winner’s Name] for [Reason].”
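The retrieve-then-generate flow can be sketched as follows. This is a minimal illustration, not a production pipeline: retrieval is plain word overlap over a hypothetical in-memory `corpus`, and `generate` is a stand-in stub where a real system would call an LLM. All function names are assumptions.

```python
def tokenize(text):
    """Lowercase a string and split it into a set of words, dropping ? and . marks."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())


def retrieve(question, corpus, k=2):
    """IR step: return the k passages sharing the most words with the question."""
    query = tokenize(question)
    ranked = sorted(corpus, key=lambda doc: len(query & tokenize(doc)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages):
    """Combine the retrieved passages and the question into one prompt."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {question}"


def generate(prompt):
    """Placeholder for the LLM step; a real system would send the prompt to a model."""
    return f"[LLM response to a prompt of {len(prompt)} characters]"


# Hypothetical external source the model was not trained on.
corpus = [
    "Boolean retrieval combines terms with AND, OR, and NOT.",
    "Topic modeling infers hidden themes in a collection of documents.",
    "Online catalogs are one kind of information retrieval system.",
]

question = "What does topic modeling infer?"
passages = retrieve(question, corpus, k=1)
prompt = build_prompt(question, passages)
answer = generate(prompt)
print(prompt)
```

The key design point is that the model answers from the retrieved passages placed in the prompt, so updating the external source updates the answers without retraining the model.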
Online catalogs
Online databases
Web Search Engines