Exploring OCR Errors in Full-Text Large Documents: A Study of LIS Theses and Dissertations

By 👩‍🔬Manika Lamba, Margam Madhusudhan in Article 2023

July 27, 2023


Abstract

Errors introduced during optical character recognition (OCR) can reduce the accuracy of text mining and NLP analyses of large text documents. This study retrieves electronic theses and dissertations (ETDs) in the library and information science (LIS) discipline from the ProQuest Dissertations and Theses Global database and manually reviews their full text for OCR problems that arise when PDF files are converted into plain text. It examines the factors that affect the quality of OCR output, including the quality of the original document. The findings identify five major types of scanning problems in the PDFs, which caused OCR errors such as joined words, misspellings, spacing between words, insertion of random characters, hyphenation, and formatting problems. To avoid these errors, it is important to use high-quality scans of documents published from the 1920s to the 1970s. Further research could focus on improving the accuracy of OCR technology for large text documents published before the 1980s.

