Assistant Professor
School of Library & Information Studies
University of Oklahoma, USA
IIM Bengaluru
26 August 2024
The past decade has witnessed an increasingly voluminous amount of digital data produced on the internet that describes human behavior and other objects of scholarly inquiry
Recent decades have witnessed not only an increase in the amount of text-based data but also an increase in the computing power needed to analyze it
Together, these two shifts hold the potential to significantly expand the scope of research in many different fields
How to Make an Effective Recommendation System?
Machines perform much better, and they do so at scale and speed!
Topic
Topic Modeling
TF-IDF evaluates the relevance of a term to a document in a corpus and is the most popular weighting scheme in information retrieval (IR)
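The weighting scheme described above reads like TF-IDF (term frequency times inverse document frequency). The following is a minimal sketch, assuming the plain textbook variant where tf is the term count divided by document length and idf is log(N / df). Libraries typically use smoothed variants, so their numbers will differ. The toy corpus and the function name `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    Assumes the textbook variant: tf = count / document length,
    idf = log(N / df). This is an illustrative sketch, not a
    reproduction of any particular library's formula.
    """
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, made up for illustration.
corpus = [
    "library users search the catalog",
    "topic models summarize the corpus",
    "library chats reveal user topics",
]
weights = tf_idf(corpus)
```

In this toy corpus, "catalog" occurs in only one document while "the" occurs in two, so "catalog" receives the higher weight in the first document. That is the intuition behind the scheme: terms that are frequent in a document but rare in the corpus are treated as more relevant.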
Out-of-Box Tools
Python Libraries
BERTopic, 2022 . . . more!
Topic modeling:
- operates from relevantly unrealistic assumptions
- is non-deterministic
- cannot effectively be validated against a reasonable number of competing models
- does not lock into a well-defined linguistic interface
- does not scholarly model topics in the sense of themes or content (no longer true with BERTopic + LLMs)
These features are intrinsic and make interpretation of its results prone to
While partial validation of the statistical model is possible, a conceptual validation would require an extended triangulation with other methods and human ratings, and clarification of whether statistical distinctivity of lexical co-occurrence correlates with conceptual topics in any reliable way
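The non-determinism noted above can be made concrete with a toy example. The following is a minimal sketch of a collapsed Gibbs sampler for LDA, written from scratch for illustration; it is not the implementation of any specific library, and the function name `lda_gibbs`, the hyperparameter values, and the tiny corpus are all assumptions. Because every topic assignment is sampled, runs with different seeds may yield different assignments (and topic labels may swap), while fixing the seed makes a run reproducible.

```python
import random

def lda_gibbs(docs, n_topics=2, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    Returns the sampled topic index of every token, per document.
    Hyperparameters are illustrative assumptions, not recommendations.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)
    # Count tables: document-topic, topic-word, and tokens per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Start from a random topic for every token.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                probs = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + v * beta)
                    for t in range(n_topics)
                ]
                # Sample a new topic and add the token back.
                k = rng.choices(range(n_topics), weights=probs)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return z

# Hypothetical tokenized corpus, made up for illustration.
docs = [
    ["library", "catalog", "search", "library"],
    ["topic", "model", "corpus", "topic"],
    ["catalog", "search", "users"],
    ["model", "corpus", "words"],
]
run_a = lda_gibbs(docs, seed=1)
run_a_again = lda_gibbs(docs, seed=1)  # same seed: identical assignments
run_b = lda_gibbs(docs, seed=2)        # different seed: may differ
```

Comparing `run_a` with `run_b` over several seeds gives a rough feel for how stable the topics are, which is one small part of the validation problem described above.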
Topic modeling has been applied to numerous resources, such as
- patents
- conferences
- chats
- online reviews
- MOOCs
- calls for papers
- social media platforms
- RSS feeds
- blogs
- open-ended survey responses
- emails
- digital libraries’ resources
- smart card data
- EZproxy daily log files
- data from library mobile apps
- virtual libraries’ resources
- reference questions
- library databases
- in-house journals
- institutional and digital repository resources
- theses and dissertations
- WebOPACs
- MOOC feedback, chats, and suggestions
- online library chats
- forums
- emails
- syllabuses
- library’s social media platform accounts
1. Making Ontologies: Mehler and Waltinger used topic modeling to build a Dewey Decimal Classification (DDC)-based topic classification model in digital libraries
2. Automatic Subject Classification: It can be used in libraries to assign subject terms to documents
3. Bibliometrics: It can be used to study evolutionary pathways, citations, and trends to explore different hot and cold topics of research in a particular discipline
4. Altmetrics: It can be used to learn what people are saying about your library on social media and what topics they care about
5. Recommendation Service: It can be used to recommend electronic resources based on the reading or search habits of the users
6. Organization and Management of Resources: It can be used for metadata tagging of electronic resources and of the library's database, website, and repository resources
7. Better Searching and Information Retrieval of Resources: In digital libraries, it can help in providing a fast searching experience to users and better information retrieval of electronic resources
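The recommendation service in item 5 can be sketched as follows. This is a minimal illustration, assuming each resource already has a topic distribution produced by some topic model and that a user profile is expressed over the same topics; the resource titles, the numbers, and the helper names `cosine` and `recommend` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(user_profile, resources, top_n=2):
    """Rank resource titles by similarity of topic mix to the user profile."""
    ranked = sorted(
        resources,
        key=lambda title: cosine(user_profile, resources[title]),
        reverse=True,
    )
    return ranked[:top_n]

# Hypothetical topic distributions over three topics, made up for illustration.
resources = {
    "Intro to Text Mining": [0.8, 0.1, 0.1],
    "Library Outreach Handbook": [0.1, 0.8, 0.1],
    "Citation Analysis Primer": [0.1, 0.2, 0.7],
}
# Hypothetical profile, e.g. an average of topics in a user's reading history.
user_profile = [0.7, 0.2, 0.1]
picks = recommend(user_profile, resources)
```

Here `picks` ranks "Intro to Text Mining" first because its topic mix is closest to the profile. The same similarity ranking could also serve the search use case in item 7, with a query's topic mix in place of the user profile.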
The dataset consists of nearly 2,500 research abstracts from six years' worth of publications by University of North Carolina at Charlotte (UNCC) researchers in Social Science (across dept/college) and Computing & Informatics (the entire college CCI).
Important Links
If you have any questions, reach out to me at manika@ou.edu