Reprocessing OCR files using Tesseract

With our preliminary goal for “dLOC as Data” rooted in facilitating access to the image and textual data for dLOC’s newspaper collection, we were aware that OCR quality varies widely across the newspapers in the Caribbean Newspaper Digital Library. Many of these newspapers were originally digitized over a decade ago, prior to the development and implementation of current OCR technology. By comparison, born-digital newspapers and those digitized more recently offer more accurate text and better overall access to the material. As part of the process of enhancing access to existing collections, and under the leadership of UF team member Laura Perry, we focused on newspapers that we assessed to be in need of reprocessing.

Once the list of newspaper titles for inclusion in the project was finalized, we started by running titles through SobekCM’s Management & Reporting Tool (SMaRT), part of the SobekCM software solution developed at UF. Using this tool, we generated an ad hoc report for each newspaper using its bibliographic identifier (BibID). These reports were then exported into spreadsheets for analysis with our Digital Metadata Steward (DMS), another tool developed at UF. We created sets of the items compiled in each spreadsheet, which we then fed directly into Tesseract, the OCR engine of choice for dLOC projects, through the DMS. Previously, OCR processing was performed using PrimeRecognition and occurred post-capture rather than as part of the digitization process, leading to a delay between the availability of a newspaper’s page scans and its OCR text files. Beyond improved and faster OCR output, Tesseract also handles Spanish and French well, which matters for a multilingual collection.
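To illustrate the multilingual point, here is a minimal Python sketch of how a single page scan might be passed to the Tesseract command-line tool with several language models at once. It is not the DMS integration described above, which we do not reproduce here. The function name, file names, and default language list are hypothetical; the `-l` option with `+`-joined language codes and the `txt` and `pdf` output configs are standard Tesseract command-line features, and the corresponding language data (`eng`, `fra`, `spa`) is assumed to be installed.

```python
def build_tesseract_command(image_path, output_base, languages=("eng", "fra", "spa")):
    """Build a Tesseract CLI call for one page scan.

    Hypothetical helper for illustration only: the file names and the
    default language list are assumptions, not the project's settings.
    """
    # Tesseract accepts several language models joined with "+".
    lang_arg = "+".join(languages)
    # The trailing "txt" and "pdf" configs request both a plain-text file
    # and a searchable PDF, written as <output_base>.txt and <output_base>.pdf.
    return ["tesseract", str(image_path), str(output_base), "-l", lang_arg, "txt", "pdf"]


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch works
    # even on a machine without Tesseract installed.
    print(" ".join(build_tesseract_command("page_0001.jpg", "page_0001")))
```

On a machine with Tesseract installed, the returned list could be handed to `subprocess.run(command, check=True)` to perform the OCR.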

As such, we used Tesseract to create a .txt file for each page of a newspaper issue and to combine the individual files into a .pdf of the entire issue. Once generated, the .txt files, along with image derivatives in the form of .jpeg and .jp2 files, are housed in the resources directory of UF’s repository system. Using batch scripts (.bat files), we created folders of compiled issues and copied the corresponding .txt, .pdf, and .xml files from our resources directory into these new folders. Aware of the sheer number of issues for some of the older newspapers, we knew that we would need to use FTP or a physical hard drive to transfer the new OCR files to FIU, a process that is still ongoing. Once part of FIU’s open access data repository, this newspaper data will be able to reach broader audiences and hopefully play a role in other research projects aimed at studying the Caribbean.
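The copying step can be sketched roughly as follows. We used .bat files for this; the Python version below is a stand-in for illustration, not our actual script. It assumes a hypothetical layout in which the resources directory holds one subfolder per issue, and the function and folder names are invented for the example.

```python
import shutil
from pathlib import Path

# File types named in the post; image derivatives (.jpeg, .jp2) are left behind.
EXTENSIONS = {".txt", ".pdf", ".xml"}


def collect_issue_files(resources_dir, export_dir):
    """Copy .txt, .pdf, and .xml files from each issue folder into
    export_dir/<issue folder name>/ and return the copied paths.

    Assumes (hypothetically) one subfolder per issue inside resources_dir.
    """
    resources_dir = Path(resources_dir)
    export_dir = Path(export_dir)
    copied = []
    for issue_dir in sorted(p for p in resources_dir.iterdir() if p.is_dir()):
        target = export_dir / issue_dir.name
        for source in sorted(issue_dir.iterdir()):
            if source.is_file() and source.suffix.lower() in EXTENSIONS:
                # Create the compiled-issue folder only when there is something to copy.
                target.mkdir(parents=True, exist_ok=True)
                shutil.copy2(source, target / source.name)
                copied.append(target / source.name)
    return copied
```

Calling `collect_issue_files("resources", "export")` would leave the original files in place and build a parallel set of folders ready for transfer.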
