
Exploring OCR Errors in Full-Text Large Documents: A Study of LIS Theses and Dissertations
Abstract
Errors introduced during optical character recognition (OCR) can reduce the accuracy of text mining and NLP analyses of large text documents. The methodology involves retrieving electronic theses and dissertations (ETDs) in the library and information science (LIS) discipline from the ProQuest Dissertations and Theses Global database and manually reviewing the full-text ETDs for OCR problems associated with converting PDF files into plain text.