Categories
Uncategorized

Exploring the Literature: Research & Data Ethics

Because dLOC as Data’s effort to demonstrate the potential of newspaper data involved selecting, using, and analyzing historical hurricane and storm data from Caribbean publications, we knew that compiling relevant works from the research and data ethics literature would be an important part of encouraging discussion of best practices and considerations for our own toolkit.

Across the literature we collected, which covers research and data ethics broadly as well as disaster-based research in particular, we found several recurring points for consideration. Alongside the methodology and scope of a research initiative, context, historical and otherwise, is another important aspect that researchers must consider. In terms of methodology, ethical practice must be a major consideration at every stage, from initial discussions of research design through data collection and analysis to the presentation of findings and post-collection procedures. Research and data collection should be purposeful, respectful, and coordinated for locals and outsiders alike, particularly in research exploring disaster response and history. Researchers should be able to state the scope and nature of their efforts explicitly while also accounting for the concerns and expectations of affected communities.

In terms of context, historical and geographic factors influence the form and scope of best practices for research and data collection in a particular region or community. For research initiatives focusing on post-colonial societies, acknowledging colonial entitlement is important in itself, given the influence of colonial rule on archival material and data collection practices. It also gives researchers an opportunity to confront colonial dynamics and take steps to fill archival gaps with materials and narratives that reflect the experiences of the communities involved. This practice in turn encourages researchers to recognize valued indigenous ways of understanding and defining major events, or natural disasters in the case of our own data. One work challenges the popular perception that data and numbers can ‘speak for themselves’, calling on researchers to instead consider factors that may influence interpretations of findings and to approach data with an intersectional lens.

As our own initiative focuses on hurricane data collected from Caribbean newspapers, in some cases published from colonialist perspectives or prior to formal independence from colonial rule, each of these considerations is valuable to our discussions and to the development of the thematic toolkit. Other potential points to consider include language, as used to describe and identify storms, and publication source, as a platform that elevates particular voices, a difference that may be evident between independent and national or state newspapers.


Reprocessing OCR files using Tesseract

With our preliminary goal for “dLOC as Data” rooted in facilitating access to the image and textual data for dLOC’s newspaper collection, we were aware of the varying OCR quality of newspapers within the Caribbean Newspaper Digital Library. Many of these newspapers were originally digitized over a decade ago, prior to the development and implementation of current OCR technology. As a result, born-digital newspapers and those included in recent digitization efforts offer more accurate text and are more accessible overall by comparison. As part of the process of enhancing access to existing collections, and under the leadership of UF team member Laura Perry, we focused on newspapers that we assessed to be in need of reprocessing.

Once the list of newspaper titles for inclusion in the project was finalized, we started by running titles through SobekCM’s Management & Reporting Tool (SMaRT), part of the SobekCM software solution developed at UF. Using this tool, we generated an ad hoc report for each newspaper using its bibliographic identifier (BibID). These reports were then exported into spreadsheets for analysis with our Digital Metadata Steward (DMS), another tool developed at UF. We created sets of the items compiled in each spreadsheet, which we then fed directly into Tesseract, the OCR engine of choice for dLOC projects, through the DMS. Previously, OCR processing was performed using PrimeRecognition and occurred post-capture rather than as part of the digitization process, leading to a delay between the availability of a newspaper’s page scans and its OCR text files. In addition to producing improved and faster OCR output, Tesseract handles Spanish and French well.
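For readers who want to try something similar outside of the DMS, the following is a minimal sketch of how Tesseract's command-line interface could be driven from Python to OCR the pages of one issue. It is not the DMS workflow itself. The language codes (`eng+spa+fra`), the function names, and the file layout are assumptions made for illustration, and the corresponding Tesseract language packs would need to be installed. The sketch relies on two Tesseract CLI features: the `txt` and `pdf` output configs, and the option of passing a text file that lists one image path per line to produce a multi-page result.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed language packs; adjust to whatever is installed locally.
LANGS = "eng+spa+fra"


def page_text_command(image: Path, out_dir: Path, langs: str = LANGS) -> list[str]:
    """Build the command that writes <out_dir>/<image stem>.txt for one page scan."""
    return ["tesseract", str(image), str(out_dir / image.stem), "-l", langs, "txt"]


def issue_pdf_command(page_list: Path, out_base: Path, langs: str = LANGS) -> list[str]:
    """Build the command that writes <out_base>.pdf from a file listing page images."""
    return ["tesseract", str(page_list), str(out_base), "-l", langs, "pdf"]


def ocr_issue(images: list[Path], out_dir: Path, issue_name: str) -> None:
    """Write one .txt per page, then one searchable .pdf for the whole issue."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)

    # One text file per page image.
    for image in images:
        subprocess.run(page_text_command(image, out_dir), check=True)

    # Tesseract accepts a plain-text list of image paths as its input file.
    page_list = out_dir / f"{issue_name}_pages.txt"
    page_list.write_text("\n".join(str(p) for p in images) + "\n", encoding="utf-8")
    subprocess.run(issue_pdf_command(page_list, out_dir / issue_name), check=True)
```

Calling `ocr_issue(sorted(Path("scans").glob("*.jpg")), Path("ocr_out"), "issue_0001")` would, under these assumptions, leave one `.txt` file per page and a single `issue_0001.pdf` in `ocr_out`. The folder and issue names here are hypothetical.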

We used Tesseract to create .txt files for each page of a newspaper issue and to combine the individual files into a .pdf of the entire issue. Once generated, .txt files, along with image derivatives in the form of .jpeg and .jp2 files, are housed in the resources directory of UF’s repository system. Using batch scripts (.bat files), we created folders of compiled issues and copied the corresponding .txt, .pdf, and .xml files from our resources directory into these new folders. Aware of the sheer number of issues for some of the older newspapers, we knew that we would need to use FTP or a physical hard drive to transfer the new OCR files to FIU, a process that is currently ongoing. Once part of FIU’s open access data repository, this newspaper data will be able to reach broader audiences and hopefully play a role in other research projects aimed at studying the Caribbean.
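The copy step can be sketched in a few lines. We used .bat files for this; the Python version below is an illustrative stand-in rather than our actual script. The function name and the assumption that each issue lives in its own subfolder of the resources directory are hypothetical. The sketch walks the resources directory, keeps only the .txt, .pdf, and .xml files, and mirrors the folder structure under a separate export directory, leaving the image derivatives behind.

```python
import shutil
from pathlib import Path

# File types to gather for transfer; image derivatives (.jpeg, .jp2) are skipped.
WANTED = {".txt", ".pdf", ".xml"}


def collect_issue_files(resources: Path, export: Path, wanted=WANTED) -> int:
    """Copy wanted files from resources into export, keeping the folder layout.

    Assumes export is not located inside resources.
    Returns the number of files copied.
    """
    copied = 0
    for src in sorted(resources.rglob("*")):
        if src.is_file() and src.suffix.lower() in wanted:
            dest = export / src.relative_to(resources)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)  # copy2 also preserves file timestamps
            copied += 1
    return copied
```

Mirroring the original layout keeps each issue's files together, which makes it easier to check the export against the source before transferring it by FTP or hard drive.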


Building an interactive map of select Caribbean publications with StoryMapJS

With the aim of further introducing users to the list of publications selected for the thematic toolkit, we decided to create an interactive map with StoryMapJS, a tool designed by Northwestern University’s Knight Lab. With its user-friendly layout and emphasis on geographic storytelling, this tool allowed us to build a visual display that both explores the individual histories of each publication and presents their relative regions of operation and influence to users. Each entry in the story map includes information about the origins, publishers, primary audiences, and publication frequency of a particular paper, among other historical details.

Though already accessible to users, this is an ongoing project that we will continue to develop as more detailed information is added to corresponding entries.


A new addition to the project team

With experience in digital library projects specializing in Latin America and given the objectives of “dLOC as Data: A Thematic Approach to Caribbean Newspapers”, I was recently selected as the Caribbean Data Curation Intern under the direction of the project lead. Starting off as a student library aide at the Latin American & Caribbean Collection (LACC) at the University of Florida during my sophomore year, I quickly became interested in archival work and assisted with several archival and research projects under the guidance of the LACC’s senior staff. Through these projects and an undergraduate fellowship with the Association of Research Libraries the following year, I began to concentrate on the cultural and educational roles of archives and similar institutions, particularly digital collections. As digital institutions such as dLOC become increasingly vital resources for students and researchers alike, access to and understanding of such resources become even more necessary.

With these interests and the nature of this internship in mind, I hope to explore different aspects of web development and data utilization in the context of libraries and other educational institutions. I also hope to better familiarize myself with the processes of data collection, enhancement, and diffusion, particularly within and concerning Latin America and the Caribbean. I look forward to contributing to this project and its objectives over the course of the next few months.