Natural Language Searching

LIS 4/5523: Online Information Retrieval

Dr. Manika Lamba

Introduction

  • Free text searching = flexibility + complexity
  • NLP is essential for modern IR
  • Conversational interfaces are shaping the future of library search

Document Indexing and Retrieval

  • Methods include
    • Boolean
    • Vector Space
    • Probabilistic
  • Rely on index terms
    • “bag of words”
    • stoplist + stemming
  • But text is “unstructured”
    • information may be “hidden”
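
The "bag of words" approach above can be sketched in a few lines of Python. This is a minimal illustration, not a production indexer: the stoplist, the suffix-stripping `crude_stem` helper, and the sample documents are all invented for the example, and a real system would use a proper stemmer such as Porter's.

```python
import re
from collections import defaultdict

# Tiny illustrative stoplist (an assumption; real stoplists are much longer).
STOPLIST = {"a", "an", "and", "the", "is", "of", "in", "to", "for"}

def crude_stem(word):
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, tokenize, drop stopwords, and stem: a 'bag of words'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPLIST]

def build_index(docs):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index

def boolean_and(index, query):
    """Return ids of documents that contain every query term."""
    terms = index_terms(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical mini-collection.
docs = {
    1: "Indexing the books of the world",
    2: "Searching library catalogs",
    3: "Catalog searching for beginners",
}
index = build_index(docs)
print(sorted(boolean_and(index, "searching catalogs")))  # [2, 3]
```

Word order and sentence structure are discarded at indexing time, which is one reason information can stay "hidden" from this kind of retrieval.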

Problems with Text

  • Polysemy: one word maps to many concepts (e.g., bat)
  • Synonymy: one concept maps to many words (e.g., happy or joyful, car or automobile)
  • Word order
  • Language is generative
    • Starbucks coffee is the best

    • The place I like most when I need to feed my caffeine addiction is the company from Seattle with branches everywhere

  • Many different ways to express a given idea
    • synonymy, paraphrase, metaphor, etc.
  • Frege's principle: The meaning of a sentence is completely determined by the meaning of its symbols and the syntax used to combine them
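
The two Starbucks sentences above show why term matching struggles with generative language. The sketch below compares their content words using Jaccard overlap; the stoplist and the helper names are assumptions made for the illustration.

```python
import re

# Small illustrative stoplist (an assumption, chosen for these two sentences).
STOPLIST = {"the", "is", "i", "my", "to", "from", "with", "when", "most"}

def content_words(text):
    """Return the set of lowercase tokens that are not stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOPLIST}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

short = "Starbucks coffee is the best"
paraphrase = (
    "The place I like most when I need to feed my caffeine addiction "
    "is the company from Seattle with branches everywhere"
)

print(jaccard(content_words(short), content_words(paraphrase)))  # 0.0
```

Both sentences express the same idea, yet they share no content words, so a purely term-based match between them fails.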

Problems with Text (Cont.)

  • Language is a form of communication
    • All communication has a *context*
      • time and place of utterance, the writer and the reader, and their background knowledge, intentions, and assumptions
  • Language is changing
  • Ill-formed input
  • Co-ordination, negation, etc.
  • Multilinguality
  • Sarcasm, irony, slang, jargon, etc.

Enter NLP/Text Analytics

  • Text Analytics: a set of linguistic, analytical, and predictive techniques to extract structure and meaning from unstructured documents

  • NLP: academic term for Text Analytics

    • analogous to “search” vs. “IR”
    • Text Analytics ≈ NLP ≈ Text Mining

Role of Natural Language Processing in Information Retrieval

Natural Language Searching

Natural Language Indexing

  • Based on existing vocabulary of documents

  • Terms are extracted or derived from titles, abstracts, full text

  • Terms are in title, abstract, descriptor, full-text fields

  • Searcher inputs any term likely to occur in free text

NLP Applications in Searching

  1. Word Prediction
    • Assistive technologies (TextHelp)
    • Google, Bing, Yahoo query suggestions
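
One simple way to think about query suggestions is prefix matching against past queries, ranked by frequency. The sketch below assumes a hypothetical query log; commercial search engines use far richer signals, so this is only an illustration of the idea.

```python
from collections import Counter

def suggest(prefix, query_log, k=3):
    """Return up to k past queries starting with prefix, most frequent first."""
    counts = Counter(q.lower() for q in query_log)
    prefix = prefix.lower()
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    # Sort by descending frequency, then alphabetically to break ties.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:k]]

# Hypothetical query log.
query_log = [
    "information retrieval", "information literacy", "information retrieval",
    "informatics", "interlibrary loan", "information retrieval",
    "information literacy",
]
print(suggest("info", query_log))
# ['information retrieval', 'information literacy', 'informatics']
```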

NLP Applications in Searching

  1. Spelling Correction
    • Autocorrect

    • “Did you mean” suggestions
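
A "did you mean" feature can be approximated by comparing the query term with a known vocabulary and proposing the closest entry. The sketch below uses `difflib.get_close_matches` from the Python standard library; the vocabulary and the 0.8 similarity cutoff are assumptions chosen for the example.

```python
import difflib

# Hypothetical vocabulary, e.g. drawn from the index terms of a collection.
VOCABULARY = ["retrieval", "library", "catalog", "information", "indexing"]

def did_you_mean(term, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary entry, or None if the term is already
    known or nothing is similar enough."""
    term = term.lower()
    if term in vocabulary:
        return None
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(did_you_mean("libary"))   # library
print(did_you_mean("catalog"))  # None (already spelled correctly)
```

Autocorrect applies the top suggestion automatically, whereas "did you mean" leaves the choice to the searcher.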

NLP Applications in Searching

  1. Text Categorization

    • News agencies: classifying incoming news stories
    • Search engines: classifying queries
    • Identifying spam emails
    • Routing email or documents to appropriate people
  2. Terminology Extraction

    • Differentiate between useful index terms and ‘noise’
    • Help lexicographers identify new terminology
    • Term extraction systems process scientific papers to identify terminology, possibly comparing it with a known list
  3. Speech Recognition

    • Spoken Dialogue System
    • iPhone Voice Search

NLP Applications in Searching

  1. Named Entity Recognition

    • Identification of key concepts (e.g., people, places, organizations)
    • Increase precision of IR (New companies in New York vs. Companies in New York)
    • Support navigation
    • Improve machine translation
    • Speech synthesis, auto-summarization, etc.
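
The "New York" example above can be illustrated with a gazetteer lookup, one of the simplest approaches to named entity recognition. The gazetteer entries and the longest-match strategy below are assumptions for this sketch; modern NER systems typically rely on trained statistical or neural models instead.

```python
# Hypothetical gazetteer: lowercase token tuples mapped to entity types.
GAZETTEER = {
    ("new", "york"): "PLACE",
    ("seattle",): "PLACE",
    ("starbucks",): "ORGANIZATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def tag_entities(text):
    """Scan left to right, preferring the longest gazetteer match at each position."""
    tokens = text.split()
    entities = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower().strip(".,;:!?") for t in tokens[i:i + n])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            # No entity starts at this token; move on.
            i += 1
    return entities

print(tag_entities("New companies in New York"))  # [('New York', 'PLACE')]
print(tag_entities("Companies in New York"))      # [('New York', 'PLACE')]
```

Once "New York" is recognized as a single place, the leading "New" in the first query is left as an ordinary modifier of "companies", which is the distinction that improves precision.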

NLP Applications in Searching

  1. Information Extraction
    • Identification of entities + relationships
    • Based on pre-defined structures
    • Results can be used for metadata retrieval, or stored in a database and queried
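
A minimal sketch of this pipeline: a pre-defined structure (here a single "works at" relation), a pattern that fills it from text, and a database that stores the results for querying. The regular expression, the table layout, and the sample sentences with made-up names are all assumptions for the illustration.

```python
import re
import sqlite3

# Hypothetical pattern for one relation: "<Person> works at <Organization>".
PATTERN = re.compile(
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) works at "
    r"([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
)

def extract_employment(text):
    """Return (person, organization) pairs found in the text."""
    return PATTERN.findall(text)

# Invented example text; the names are not real data.
text = "Maria Lopez works at Norman Public Library. Sam Lee works at State Archives."

# Store the extracted pairs in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employment (person TEXT, organization TEXT)")
conn.executemany("INSERT INTO employment VALUES (?, ?)", extract_employment(text))

# Query the structured data instead of the raw text.
rows = conn.execute(
    "SELECT person FROM employment WHERE organization = ?", ("State Archives",)
).fetchall()
print(rows)  # [('Sam Lee',)]
```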

Free Text Searching

Free Text Searching in Databases

  • Terms added at the discretion of the cataloger
  • Do not come from a controlled vocabulary or from the words of the document
  • Cataloger tries to match user’s terms (user warrant)
  • Not a frequent practice
  • Can be used in combination with controlled vocabulary or natural language indexing

User-Defined Tagging

  • Goes by many names, such as user-supplied terms, folksonomy, tagging, and social classification
  • It is really not a new practice but one that has recently become the buzz on the Web with the emergence of blogs and media sharing sites like Blogger, Flickr, YouTube, etc.
    • researchers in image retrieval have explored this idea
    • researchers in organization of information, thesauri development, indexing, subject representation have also explored this idea
  • To date, it has been used to tag images, web pages, blogs, library catalogs, etc.

Applications in LIS

  • Digital libraries and institutional repositories
  • Discovery systems and OPACs
  • Personalized recommendations

Future Directions

  • Multimodal Searching
  • Intelligent Research Assistants
  • Knowledge Graphs Integration
  • Multilingual and Cross-Lingual Search
  • More!!