LIS 4/5693: Information Retrieval and Text Mining
Information Retrieval (IR) is about finding the right documents
Natural Language Processing (NLP) is about understanding language
Text Mining is about discovering patterns and knowledge from large collections of text
Data and information generation in every discipline in the universe of knowledge has seen staggering growth
Storing, managing, querying, and retrieving huge amounts of data and information require sophisticated procedures and advanced technologies
Nowadays, information collections are web-based and online; they are vast and growing at an exponential rate

For several years, advances in Knowledge Discovery in Databases (KDD) have aimed at managing information efficiently
Data mining is a part of the KDD process which identifies the hidden patterns in large information repositories
It involves several information extraction techniques such as regression models, association rules, Bayesian methods, decision trees, neural networks, etc.
Data can be textual or non-textual in nature
Textual data are generated from various digital sources such as journals, newspapers, archives, social networks, blogs, forums, etc.
Definition
A process in which sets of records or documents are searched to find items which may help to satisfy an information need
Information Retrieval includes:

J.W. Sammon (1969) introduced the idea of a visualization interface integrated into an IR system in his famous paper “A nonlinear mapping for data structure analysis”
First online systems–NLM’s AIM-TWX, MEDLINE; Lockheed’s Dialog; SDC’s ORBIT

The ACM SIGIR Conference started in 1978 and subsequently emerged as the apex conference on IR systems
Belkin, Oddy, and Brooks gave the concept of Anomalous State of Knowledge (ASK) for information retrieval in 1982
The OKAPI model, formulated in 1982-88, is a set-oriented ranked-output design for probabilistic retrieval of textual material using an inverted index
A major breakthrough came in 1989, when Tim Berners-Lee proposed the World Wide Web at the CERN laboratory
The TREC conference started in 1992 as part of the TIPSTER text program, sponsored by the US Department of Defense and the National Institute of Standards and Technology (NIST)
PageRank algorithm was developed at Stanford University by Larry Page and Sergey Brin in 1996
In 1998, Google Inc. was founded; it now dominates the search engine domain
Google personalized search started in 2005
Multimedia IR (Smeulders, Lew, Sebe) was integrated into search in 2010
Semantic models such as Word2Vec and GloVe first appeared in 2013-2014
Google introduced BERT in 2018
Conversational IR was introduced in assistants such as Alexa and Siri in 2020-2021
Retrieval-Augmented Generation (RAG) in 2022-2023

Latent Semantic Indexing (LSI) gained huge popularity on the WWW and was widely used in Search Engine Optimization (SEO)
Latent Dirichlet Allocation (LDA), a generative topic model in NLP, was developed by David Blei, Andrew Ng, and Michael Jordan in 2003
A user is a person who uses information and/or information systems in some meaningful way
A user can be:
Users are motivated to seek information in a given situation to:
Typical user questions:
Two broad categories of searches:
A specialized system for the description, storage, and retrieval of information representations: primarily information objects (text, images) and their surrogates (metadata, records). Operates by matching queries (representations of information need) with data (representations of information objects)
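The query-to-document matching described above can be sketched with a tiny inverted index. This is only a minimal illustration, assuming whitespace tokenization and Boolean AND matching; the names `tokenize`, `build_index`, and `search`, and the sample documents, are hypothetical and not taken from any particular IR system.

```python
from collections import defaultdict

PUNCT = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (tok.strip(PUNCT).lower() for tok in text.split())
    return [tok for tok in tokens if tok]

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical document representations (surrogates) keyed by id.
docs = {
    1: "Information retrieval is about finding the right documents",
    2: "Text mining discovers patterns in large collections of text",
    3: "Retrieval of documents from large collections",
}
index = build_index(docs)
print(search(index, "retrieval documents"))
print(search(index, "large collections"))
```

Here the query and each document are both reduced to sets of terms, which is one simple form of "representation"; ranked models such as OKAPI replace the set intersection with a scoring function.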
The knowledge system in which an IR system is embedded generally consists of three main components:
people in their role as information-processors
documents in their role as carriers of information
topics as representations

Based on the different types of services, IR can be categorized as:
Polysemy: one word maps to many concepts, such as "bat"
Synonymy: one concept maps to many words, such as happy or joyful, car or automobile
Word order
Language is generative
Starbucks coffee is the best
The place I like most when I need to feed my caffeine addiction is the company from Seattle with branches everywhere
Many different ways to express a given idea
Frege's principle: The meaning of a sentence is completely determined by the meaning of its symbols and the syntax used to combine them
Language is a form of communication
Language is changing
Ill-formed input
Co-ordination, negation, etc.
Multilinguality
Sarcasm, irony, slang, jargon, etc.
Text Analytics: a set of linguistic, analytical, and predictive techniques to extract structure and meaning from unstructured documents
NLP: academic term for Text Analytics
Based on existing vocabulary of documents
Terms are extracted or derived from titles, abstracts, full text
Terms are in title, abstract, descriptor, full-text fields
Searcher inputs any term likely to occur in free text
Autocorrect 
Did you Mean 
Text Categorization
Terminology Extraction
Speech Recognition
Named Entity Recognition
Source: Markey Ch-7
user-supplied, folksonomy, tagging, social classification
Text mining is a process of automatically extracting information from text with the aim of generating new knowledge
It is a specialized interdisciplinary field combining techniques from linguistics, computer science, and statistics to build tools that can efficiently retrieve and extract information from digital text
It assists in the automatic classification of documents
In text mining, “words are attributes or predictors and documents are cases or records, together these form a sample of data that can feed in well-known learning methods” (Weiss et al., 2005)

Weiss et al. (2005). Overview of text mining. In: Text Mining: Predictive Methods for Analyzing Unstructured Information
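The view quoted from Weiss et al., with words as attributes and documents as cases, can be illustrated by building a small document-term matrix. This is a rough sketch assuming lowercase whitespace tokenization and raw term counts; the function name `document_term_matrix` and the three sample documents are made up for illustration.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (an assumed, very simple tokenizer)."""
    return text.lower().split()

def document_term_matrix(docs):
    """Return (vocabulary, rows): one row of term counts per document."""
    tokenized = [tokenize(doc) for doc in docs]
    # Columns (attributes) are the sorted unique words across all documents.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

# Hypothetical sample documents: each one becomes a case (row).
docs = [
    "text mining finds patterns",
    "mining text for new knowledge",
    "retrieval finds documents",
]
vocab, matrix = document_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row is then an ordinary feature vector, so the matrix could be passed to standard learning methods such as a classifier for automatic document categorization.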

