7th May 2023
Data and information generation in every discipline in the universe of knowledge has seen staggering growth
Storing, managing, querying, & retrieving huge amounts of data & information require sophisticated procedures & advanced technologies
Nowadays, information collections are web-based and online; they are vast and growing at an exponential rate
Challenges to IR
The information collection is very heterogeneous
A user’s information need is complex in nature
Many complex models have been developed to understand human information needs, but this remains a problem area with many unanswered open questions (user behavior, query analysis):
How will an end user seek information?
How will they understand their information need?
How will they approach a system & express their information need?
A user’s information need evolves continuously with the medium & the collection itself
Users have very little time to retrieve complex material, formulate & refine their query
In such a situation, IR faces a huge challenge: a large, evolving, heterogeneous collection on one hand, and on the other, users with very complicated queries who have little time for retrieval. The challenges for IR are certainly huge
The size and number of documents have increased beyond what any traditional cataloging system can support
Libraries had limited scope for processing documents, handling different e-resources, or sharing heterogeneous data and information over the internet
Different disciplines (biotechnology, genetics, geoinformatics, etc.) started producing different types of data with computer support and in multiple file formats, all of which need to be indexed, stored, organized, and retrieved. These data are mostly semi-structured (video, audio) or unstructured (webpages, e-resources)
But that is not happening. Different organizations just started publishing & populating documents as and when they can produce information
Whereas in a classic IR system, documents need to be scanned and indexed, with certain information derived from them going into bibliographic elements (structured metadata)
For example, S.R. Ranganathan developed the chain indexing system based on Colon Classification
Classification of Documents
Information & Data Visualization
Ranking of Documents
Web Based IR System
Multimedia IR System
Distributed IR System
The importance of IR was felt when it became necessary to locate or obtain shared information without restrictions
An IR system can accept queries in natural language, match them against its indexed terms at the back end, and locate the expected documents through its term-document matrix.
After executing a query, the search engine presents the results in ranked order, as a specific ranking algorithm (e.g. PageRank) runs on the fetched results. Ideally, the most relevant documents rank higher than non-relevant ones.
As most IR systems (search engines) index documents on an incremental basis, web crawlers crawl the web pages in the hyperspace at certain time intervals, get the updated information, and index it. Thus, we get the latest information from the search space.
IR systems have opened up huge business opportunities through the web environment.
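The matching and incremental indexing described above can be sketched in a few lines. This is a minimal illustration, not how any particular search engine works: the toy documents, the function names (`tokenize`, `add_document`, `build_matrix`, `search`) and the raw term-count score are all assumptions made for the example. In particular, the count-based score is only a stand-in for a real ranking algorithm such as PageRank.

```python
from collections import Counter

# Hypothetical toy collection, invented for illustration only.
DOCS = {
    "d1": "information retrieval systems index documents",
    "d2": "web crawlers fetch and index web pages",
    "d3": "ranking algorithms order retrieved documents",
}


def tokenize(text):
    """Very naive analysis: lowercase and split on whitespace."""
    return text.lower().split()


def add_document(matrix, doc_id, text):
    """Incrementally index one document into {term: {doc_id: count}}."""
    for term, count in Counter(tokenize(text)).items():
        matrix.setdefault(term, {})[doc_id] = count


def build_matrix(docs):
    """Build a sparse term-document matrix from a whole collection."""
    matrix = {}
    for doc_id, text in docs.items():
        add_document(matrix, doc_id, text)
    return matrix


def search(matrix, query):
    """Score each document by the summed counts of the query terms it contains."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, count in matrix.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties broken by document ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


matrix = build_matrix(DOCS)
print(search(matrix, "index documents"))

# A newly crawled page can be indexed without rebuilding the whole matrix.
add_document(matrix, "d4", "search engines index crawled documents")
print(search(matrix, "index documents"))
```

Storing the matrix as a dictionary of dictionaries keeps only the non-zero cells, which is why adding a newly crawled document touches just the terms that document contains.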
System for the Mechanical Analysis and Retrieval of Text (SMART)
J.W. Sammon (1969) gave the idea of a visualization interface integrated with an IR system in his famous paper “A nonlinear mapping for data structure analysis”
First online systems: NLM’s AIM-TWX, MEDLINE; Lockheed’s Dialog; SDC’s ORBIT
During 1966-67, F.W. Lancaster evaluated MEDLARS (Medical Literature Analysis and Retrieval System)
The ACM SIGIR Conference started in 1978 and subsequently emerged as the apex conference in IR systems
Belkin, Oddy, and Brooks gave the concept of Anomalous State of Knowledge (ASK) for information retrieval in 1982
The OKAPI model was formulated in 1982-88; it is a set-oriented, ranked-output design for probabilistic retrieval of textual material using an inverted index
A major breakthrough came in 1989 when Tim Berners-Lee proposed the World Wide Web at CERN
The TREC conference started in 1992 as part of the TIPSTER text program, sponsored by the US Department of Defense and the National Institute of Standards and Technology (NIST)
PageRank algorithm was developed at Stanford University by Larry Page and Sergey Brin in 1996
Latent Dirichlet allocation (LDA), a generative topic model in NLP, was developed by David Blei, Andrew Ng, and Michael Jordan in 2003
In 1998, Google Inc. was founded; it now dominates the search engine domain
The present state of the web and the search engine environment did not evolve overnight; rather, it is the product of decades-long research
people in their role as information-processors
documents in their role as carriers of information
topics as representations
IRS does not inform the user on the subject of their inquiry, it merely informs them of the existence (or non-existence) and whereabouts of documents relating to their request (Lancaster)
This notion of IR has changed since full-text documents became available in bibliographic databases
IRS originally meant text retrieval systems, since they were dealing with textual documents
Many modern information retrieval systems deal with multimedia information comprising text, audio, images and video
Specific nature of audio, image and video information has called for the development of many new tools and techniques for information retrieval
Modern information retrieval deals with storage, organization and access to text, as well as multimedia information resources
An IRS is developed to help users discover relevant information from a storehouse containing a collection of documents
The idea of information retrieval assumes that there exist several documents or records comprising data that have been arranged in a suitable order for easy retrieval
The storehouse contains a great deal of bibliographic information, which is quite different from other kinds of information or data
Conventional database management systems, such as Access, Oracle, MySQL, etc., deal with structured data, where the arrangement/structuring of data is based on the specific attributes of data elements
The main objective of these databases is to enable the user to search for specific records that match one or more specific conditions or search criteria, usually specified by users in an online environment
Unlike a conventional database management system (DBMS), an IRS deals with unstructured data
The main purpose of designing an IRS is to answer users’ queries
Retrieved information can be represented in different forms: text along with video, audio, images, graphics, animations
Most IR research focuses more specifically on text retrieval – the computerized retrieval of machine-readable text without human indexing
But it has spread across other interesting areas such as:
Question Answering - QA systems can pull answers from an unstructured collection of natural language documents, e.g. ChatGPT, chatbots
Image Retrieval - It supports browsing, searching, and retrieving images from a large database. The database may contain only digital images, or images along with text
Music Retrieval - It is a small yet growing field of research with many real-world applications
An information system essentially ensures that users are satisfied with the service.
The system will be able to accomplish tasks, solve problems, and make decisions, based on the user needs. In short, an IRS should:
All operations pertaining to information retrieval revolve around the usefulness and relevance of documents.
The usefulness of a document depends on three major things: topical connectedness, applicability, and originality.
A resource is considered topically significant for a particular context, question, or task if it contains information that either directly answers the query or can be used, in combination with other information, to infer an answer or perform the task.
The appropriateness of the answer completely depends upon the user for a given context.
A document is original if it adds to the user’s knowledge.
Utility can be measured in monetary terms: “To what extent is the document useful for the user?”, “What are the recall and precision of the search engine?”
The term “relevance” can indicate utility or topical relevance or pertinence. Many IR systems focus on finding topically relevant documents, leaving further selection to the user.
Relevance is a matter of degree. Some documents are highly relevant and indispensable because they serve the user’s need; others may contribute little to the user’s requirements.
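Precision and recall, mentioned above as measures of a search engine, can be computed from two sets: the documents the system retrieved and the documents judged relevant. The sketch below uses the standard set-based definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant). The function name and the document IDs are hypothetical, and relevance is treated as binary here even though, as noted above, it is really a matter of degree.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for binary relevance judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search result and relevance judgments.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found, so the two measures can differ for the same result list.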
Analysis - Analyzing the available content in the information sources as well as the queries
Matching - Matching the user’s query with the available documents in order to retrieve relevant resources
To identify the information sources relevant to the areas of interest of the target user community
To analyze the contents of the sources (documents)
To represent the contents of analyzed sources in a way that matches users’ queries
To analyze users’ queries and represent them in a form that will be suitable for matching the database
To match the search statement with the stored database
To retrieve relevant information
To make continuous changes in all aspects of the system
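The functions listed above (analyze the sources, represent them, analyze and represent the query, match, retrieve) can be sketched as one small pipeline. This is one possible realization under stated assumptions, not a prescribed design: it uses TF-IDF weights and cosine similarity, which the notes do not mandate, and the stop-word list, the sample sources and all function names are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # tiny illustrative list


def analyze(text):
    """Analysis: lowercase, split on whitespace, drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def represent(tokens, idf):
    """Representation: sparse TF-IDF vector; terms unknown to the index are dropped."""
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def cosine(u, v):
    """Matching: cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_index(sources):
    """Analyze and represent every source document."""
    analyzed = {doc_id: analyze(text) for doc_id, text in sources.items()}
    n = len(analyzed)
    df = Counter(t for tokens in analyzed.values() for t in set(tokens))
    idf = {t: math.log(n / count) + 1.0 for t, count in df.items()}
    vectors = {doc_id: represent(tokens, idf) for doc_id, tokens in analyzed.items()}
    return idf, vectors


def retrieve(query, idf, vectors, top_k=3):
    """Represent the query like a document, match it, return the best hits."""
    q = represent(analyze(query), idf)
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in vectors.items()]
    scored = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_k]


# Hypothetical sources, invented for illustration.
SOURCES = {
    "s1": "classification of library documents",
    "s2": "ranking of web documents",
    "s3": "retrieval of multimedia images",
}

idf, vectors = build_index(SOURCES)
print(retrieve("web ranking", idf, vectors))
```

Representing the query with the same `analyze` and `represent` steps as the documents is what makes the two directly comparable at matching time. The last function in the list, continuous change, would correspond to rebuilding or updating the index as sources and user needs evolve.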
Q1. What do you understand by ‘Information Retrieval’? Discuss the various components and types of Information Retrieval System. (12.5 Marks)
Q2. ‘Evaluation is the best process to ascertain the merits and demerits of Information Storage and Retrieval System’. In light of the statement, discuss the criteria used for evaluation of an Information Retrieval System. (12.5 Marks)