Text Mining: Topic Modeling for Librarians

Dr. Manika Lamba

Assistant Professor
School of Library & Information Studies
University of Oklahoma, USA

IIM Bengaluru
26 August 2024

About Me

Norman, Oklahoma

My Department

How to Organize Data in Today’s World Using Machines

Data

Figure 1: World’s Technological Capacity to Store, Communicate, and Compute Information

Digital Trace Data

The past decade has witnessed a rapidly growing volume of digital data produced on the internet that describes human behavior and other objects of scholarly inquiry
Recent decades have witnessed not only an increase in the amount of text-based data but also an increase in the computing power needed to analyze it
Together, these two shifts have the potential to significantly expand the scope of research in many different fields

What to Read Next?

How to Make an Effective Recommendation System?

Human vs Machine

How Do Machines Learn?

Making the Machine Understand

Machines perform much better, and they do so at scale and speed!

Machine Learning Model

Topic Modeling

Definition

Topic

It is “a recurring pattern of co-occurring words” (Brett, 2012)
A topic can be defined as the main idea discussed in a text, i.e., the theme or subject of different granularity
Topics are simply groups of words from the collection of documents that best represent the information in the collection

Topic Modeling

It is “a method for finding and tracing clusters of words (called topics) in large bodies of texts” (Brett, 2012)
It is a text mining approach to understand, organize, process, extract, manage, and summarize knowledge

Introduction

It performs soft clustering, where it presumes that every document is composed of a mixture of topics
It makes an excellent tool for discovery and helps to uncover evidence already present in the text
It has been called an act of reading tea leaves (Chang et al., 2009) or the process of highlighting words (Brett, 2012) based on their topics
It uses statistical and machine learning techniques to mine meaningful information from the content of documents in a vast corpus of unstructured data
A subject expert (human-in-the-loop) is needed to label the topics
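
The soft-clustering idea can be written compactly. As a sketch in the usual notation of mixed-membership models such as LDA (the symbols \theta, \phi, and K are introduced here for illustration and do not come from the slides), the probability of word w in document d is a mixture over K topics:

```latex
p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\,\phi_{k,w},
\qquad \theta_{d,k} \ge 0, \quad \sum_{k=1}^{K} \theta_{d,k} = 1
```

Here \theta_{d,k} is the share of topic k in document d and \phi_{k,w} is the probability of word w under topic k. Hard clustering would correspond to forcing exactly one \theta_{d,k} to equal 1 for each document.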

Some Important Concepts

Term-Document Matrix (TDM)

It represents terms as a table or matrix of numbers for a given corpus
In a TDM, terms are rows and documents are columns, and each cell holds the number of times the term occurs in the document

Document-Term Matrix (DTM)

It represents documents as a table or matrix of numbers for a given corpus
It is the transpose of the TDM
In a DTM, each document is a row and each term is a column
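
To make the two matrices concrete, here is a minimal sketch in plain Python. The three-document corpus is made up for illustration, and real projects would typically rely on a text mining library rather than hand-rolled counting.

```python
from collections import Counter

# Toy corpus; the three documents are invented for illustration.
docs = [
    "library data library",
    "topic model data",
    "topic topic library",
]

# Tokenize on whitespace and build a sorted vocabulary.
tokenized = [doc.split() for doc in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})

# DTM: one row per document, one column per term, cells hold term counts.
counts = [Counter(tokens) for tokens in tokenized]
dtm = [[count[term] for term in vocab] for count in counts]

# TDM: the transpose of the DTM (one row per term, one column per document).
tdm = [list(column) for column in zip(*dtm)]

print("terms:", vocab)
for row in dtm:
    print(row)
```

Each row of `dtm` describes one document, while each row of `tdm` describes one term across all documents.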

TF-IDF (Term Frequency-Inverse Document Frequency)

It evaluates the relevance of a term to a document in a corpus and is the most popular weighting scheme in information retrieval (IR)

Term weighting is widely used in IR and in supervised machine learning tasks such as text classification
It identifies terms that are more discriminative than others and assigns a weight to each frequently occurring term
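
Assuming the weighting scheme described here is TF-IDF (term frequency-inverse document frequency), the sketch below shows one common variant, tf × log(N / df). Other variants add smoothing or normalization, and the toy corpus and function name are illustrative only.

```python
import math
from collections import Counter

# Toy tokenized corpus, invented for illustration.
docs = [
    ["library", "data", "library"],
    ["topic", "model", "data"],
    ["topic", "topic", "library"],
]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for doc_weights in weights:
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

In the second document, "model" receives the highest weight because it appears in no other document, which is the sense in which the weighting favors discriminative terms.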

What Happens in Topic Modeling?

It infers abstract topics based on “similar patterns of word usage in each document”
These topics are simply groups of words from the collection of documents that best represent the information in the collection
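
As a rough illustration of how topics can be inferred from word co-occurrence, the sketch below implements a toy collapsed Gibbs sampler for LDA, one of the algorithms named later in this deck. It is not the implementation used by any particular tool; the corpus, the hyperparameters `alpha` and `beta`, and the iteration count are assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=42):
    """Toy collapsed Gibbs sampler for LDA.

    docs is a list of token lists. Returns the (doc, topic) and
    (topic, word) count tables together with the vocabulary.
    """
    rng = random.Random(seed)  # fixed seed so the run can be reproduced
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]              # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                            # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Weight each topic by how much the document uses it and
                # how much the topic uses this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab

# Toy corpus, invented for illustration.
docs = [
    "library catalog metadata library".split(),
    "catalog metadata library archive".split(),
    "virus vaccine infection virus".split(),
    "vaccine infection virus trial".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)

for k, row in enumerate(topic_word):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print(f"topic {k}:", [w for _, w in top])
```

The output is only a list of word groups per topic; deciding what each group means remains a job for a human reader.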

How Does Topic Modeling Work?

Topic Analysis + Time

It helps identify topics within a context and trace how they evolve over time
For instance, a few documents within a topic may introduce content that departs from the original content; if many later documents share that new content, it is recognized as a new topic
Hence, as time progresses, topics evolve, new themes emerge, and old ones become obsolete
So, topic modeling not only helps librarians identify trending topics and fields related to their area of interest but also helps them spot new concepts and fields as they emerge
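
One simple way to follow topics over time is to average each topic's share across the documents published in each year. The sketch below assumes a topic model has already produced per-document topic proportions; the years, the proportions, and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical model output: (publication year, topic proportions) per document.
doc_topics = [
    (2019, [0.9, 0.1]),
    (2019, [0.7, 0.3]),
    (2020, [0.4, 0.6]),
    (2020, [0.2, 0.8]),
]

def topic_share_by_year(doc_topics):
    """Average the topic proportions of all documents published in each year."""
    grouped = defaultdict(list)
    for year, proportions in doc_topics:
        grouped[year].append(proportions)
    return {
        year: [sum(column) / len(rows) for column in zip(*rows)]
        for year, rows in sorted(grouped.items())
    }

trend = topic_share_by_year(doc_topics)
for year, shares in trend.items():
    print(year, [round(s, 2) for s in shares])
```

A topic whose average share rises from year to year is a candidate emerging theme, while one whose share falls may be fading.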

How to DO Topic Modeling?

Extract or retrieve the dataset (e.g., via web scraping or an API)
Prepare a corpus (such as converting files from PDF to plain text format)
Conduct text pre-processing (removing stopwords, tokenization, stemming, n-grams)
Perform exploratory analysis (word clouds, clustering)
Determine the number of topics (using perplexity, coherence, entropy, or the eyeball method)
Select an appropriate algorithm (such as LDA, STM, CTM)
Set a seed (so that the run can be reproduced with the same selected parameters)
Run the selected algorithm using proprietary or open-source tools (such as RapidMiner, TopicModelingTool) or programming languages (such as R or Python)
Iterate the whole process until the model fits the corpus well
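
The pre-processing step in the list above can be sketched in a few lines. This is a minimal illustration: the stopword list is deliberately tiny, `crude_stem` is a hypothetical stand-in for a real stemmer such as Porter's, and the function names are not taken from any library.

```python
import re

# A tiny illustrative stopword list; real projects use a much longer one.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def crude_stem(token):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, tokenize, drop stopwords, stem, and append bigrams."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [crude_stem(t) for t in tokens if t not in STOPWORDS]
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

print(preprocess("The librarians are indexing digital collections"))
```

The resulting token lists are what would then be turned into a document-term matrix and passed to the chosen algorithm.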

When to Use Topic Modeling

When you have a vast collection of text documents
When the collection belongs to a specific subject
When the collection has a similar type of documents, such as when all files in the collection are newspaper articles

When NOT to Use Topic Modeling

When you have a relatively small number of documents
When you know nothing about your collection; in this case, clustering is a better option than topic modeling
When the collection has a mixture of different types of documents, such as when the collection is composed of newspaper archives, journal articles, and ETDs

Available Tools and Packages

Out-of-Box Tools

R Libraries

Python Libraries

Algorithms

Topic Visualization

Open questions:

How do we use the output of the algorithm?
How should we visualize and navigate the topical structure?
What do the topics and document representations tell us about the texts?

The output of topic modeling is not entirely human-readable, and one way to understand the results is through visualization
“Topic models are meant to help interpret and understand texts, but it is still the researcher’s job to do the actual interpreting and understanding” (Blei, 2012)
“Be sure that you can understand the visualization as topic modeling tools are fallible” (Blei, 2012)

Case Studies

iArxiv
CORD-19
COVID-19
Topic Hex-Maps
COVID-19 Research

Manika Lamba. (2022). Visualizing the Pace of COVID-19 Research: An Experimental Study of All India Institute of Medical Sciences (AIIMS), New Delhi. In SIS Annual Convention 2022, New Delhi, India.

Case Studies

ETD Dashboard
LDA Vis
Bar Graph
Correlation Vis

Manika Lamba and Margam Madhusudhan. (2018). Metadata Tagging of Library and Information Science Theses: Shodhganga (2013-2017). In ETD2018 Taiwan Beyond the Boundaries of Rims and Oceans: Globalizing Knowledge with ETDs. Taipei, Taiwan.

Topic models do not model topics (Shadrova, 2021)

Topic modeling
- operates from relevantly unrealistic assumptions
- is non-deterministic
- cannot effectively be validated against a reasonable number of competing models
- does not lock into a well-defined linguistic interface
- does not scholarly model topics in the sense of themes or content (no longer true with BERTopic + LLMs)

Topic models do not model topics (Shadrova, 2021)

These features are intrinsic and make the interpretation of its results prone to
- apophenia: the human tendency to perceive random sets of elements as meaningful patterns
- confirmation bias: the human tendency to perceptually prefer patterns that are in alignment with pre-existing biases
While partial validation of the statistical model is possible, a conceptual validation would require an extended triangulation with other methods and human ratings, and clarification of whether statistical distinctivity of lexical co-occurrence correlates with conceptual topics in any reliable way

Application of Topic Modeling in Libraries

Applications in Library

Data
Use Cases
Use Cases (Cont.)
Use Cases (Cont.)

Topic modeling has been applied to numerous resources, such as
- annual meetings
- diary
- clinical notes
- case reports
- newspapers
- journals
- research articles
- preprints
- patents
- conferences
- chats
- online reviews
- MOOCs
- call for papers
- social media platforms
- RSS feed
- blogs
- open-ended survey responses
- emails
- digital libraries’ resources
- smart card data
- EZproxy daily log files
- data from library mobile apps
- virtual libraries’ resources
- reference questions
- library databases
- in-house journals
- institutional and digital repository resources
- theses and dissertations
- WebOPACs
- MOOC feedback, chats, and suggestions
- online library chats 
- forums
- emails
- syllabuses
- library’s social media platform accounts

1. Making Ontologies: Mehler and Waltinger used topic modeling to build a Dewey Decimal Classification (DDC)-based topic classification model in digital libraries

2. Automatic Subject Classification: Topic models can be used in libraries to index subject terms for documents

3. Bibliometrics: It can be used to study evolutionary pathways, citations, and trends to explore different hot and cold topics of research in a particular discipline

4. Altmetrics: It can be used to learn what people are saying about your library on social media and what topics they care about

5. Recommendation Service: It can be used to recommend electronic resources based on the reading or search habits of the users

6. Organization and Management of Resources: It can be used for metadata tagging of electronic resources and of the library's database, website, and repository resources

7. Better Searching and Information Retrieval of Resources: In digital libraries, it can help in providing a fast searching experience to users and better information retrieval of electronic resources
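
The recommendation use case (item 5) can be sketched by comparing resources through their topic proportions. The example below assumes a topic model has already assigned each resource a topic vector; the resource names, the vectors, and the `recommend` helper are hypothetical.

```python
import math

# Hypothetical topic proportions for four e-resources (rows sum to 1).
resource_topics = {
    "ebook_cataloging": [0.8, 0.1, 0.1],
    "article_metadata": [0.7, 0.2, 0.1],
    "report_vaccines": [0.1, 0.1, 0.8],
    "thesis_epidemics": [0.1, 0.2, 0.7],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def recommend(read_item, topics, top_n=1):
    """Rank the other resources by topical similarity to the one just read."""
    target = topics[read_item]
    scored = [
        (cosine(target, vector), name)
        for name, vector in topics.items()
        if name != read_item
    ]
    return [name for _, name in sorted(scored, reverse=True)[:top_n]]

print(recommend("ebook_cataloging", resource_topics))
```

Cosine similarity is only one possible choice here; measures designed for probability distributions, such as Jensen-Shannon divergence, are also commonly used to compare topic vectors.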

Text Mining for Information Professionals (2022)

Topic Modeling Demo in R

Data

The dataset consists of nearly 2,500 research abstracts from six years' worth of publications by University of North Carolina at Charlotte (UNCC) researchers in Social Science (across departments and colleges) and Computing & Informatics (the entire college, CCI).

Coconut Libtool

Introduction

Coconut Libtool is a web-based application that utilizes cutting-edge Natural Language Processing (NLP) technologies to analyze bibliographic data
It is designed as an open application intended specifically for textual analysis by LIS professionals
This application boasts a user-friendly design that eliminates the need for librarians to possess advanced computer skills or install any software on their devices

What Does it Do?

Upcoming Books

Thank You!

Important Links

If you have any questions, reach out to me at manika@ou.edu!
