A look into ongoing digital newspaper projects

With the primary goal of digitizing regional newspapers to increase access to and awareness of marginalized historic material, the US Caribbean & Ethnic Florida Digital Newspaper Project (USCEFDNP) is part of the National Digital Newspaper Program (NDNP), a large-scale digitization effort managed by the Library of Congress and funded through the National Endowment for the Humanities (NEH). Launched in 2005, this initiative awards one institution per state or territory biennially with the funds to digitize 100,000 pages of content, which is then uploaded to Chronicling America, an open access digital library designed as a permanent host for this content. One of several ongoing newspaper-focused projects at the George A. Smathers Libraries of the University of Florida, the USCEFDNP is the result of a partnership with two other academic institutions: the University of Puerto Rico-Rio Piedras and, as of the latest award cycle, the University of the Virgin Islands. During the project’s previous three cycles, over 300,000 pages of content have been digitized and uploaded to Chronicling America, along with detailed essays that contextualize each publication. With a focus on ethnic histories in the region, the project’s current selections for digitization include three papers of Jewish, African American, and Hispanic origins. The scope and quality of the content hosted on Chronicling America led the NEH to hold a Data Challenge to promote interest among scholars in the use and potential of historical data, an aim similar to that of our own project. Through dLOC as Data, we hope to make additional Caribbean newspapers outside the US more amenable to text and data analysis.

Those interested in the USCEFDNP may contact Melissa Jerome, the Project Coordinator at the University of Florida, for more information.


Exploring the Literature: Research & Data Ethics

Because dLOC as Data’s effort to demonstrate the potential of newspaper data involves selecting, using, and analyzing historical hurricane and storm data from Caribbean publications, we compiled relevant works from the research and data ethics literature to encourage discussion of best practices and to inform our own toolkit.

Within the literature we collected, which covers research and data ethics broadly as well as disaster-based research in particular, we found several recurring points for consideration. In addition to the methodology and scope of a research initiative, researchers must consider its context, historical and otherwise. In terms of methodology, ethical behavior and practices must be a major consideration at every stage, from initial discussions of research design through data collection and analysis to the presentation of findings and post-collection procedures. Research and data collection should be purposeful, respectful, and coordinated for locals and outsiders alike, particularly in research exploring disaster response and history. Researchers should be able to state the scope and nature of their efforts explicitly, while accounting for the concerns and expectations of affected communities.

In terms of context, historical and geographic factors influence the form and scope of best research and data collection practices for a particular region or community. For research initiatives focusing on post-colonial societies, acknowledging colonial entitlement, an important step given the influence of colonial rule on archival material and data collection practices, gives researchers an opportunity to confront colonial dynamics and take steps to fill archival gaps with materials and narratives that reflect the experiences of the communities involved. This practice in turn encourages researchers to recognize indigenous ways of understanding and defining major events, or natural disasters in the case of our own data. One work challenges the popular perception that data and numbers can ‘speak for themselves’, calling instead for researchers to consider factors that may influence interpretations of findings and to approach data with an intersectional lens.

As our own initiative focuses on hurricane data collected from Caribbean newspapers, in some cases published from colonialist perspectives or prior to formal independence from colonial rule, each of these considerations is valuable to our discussions and the development of the thematic toolkit. Other potential points to consider include language, as utilized for describing and identifying storms, and publication source, as a platform for the elevation of particular voices, possibly evident between independent and national or state newspapers.


Reprocessing OCR files using Tesseract

With our preliminary goal for “dLOC as Data” rooted in facilitating access to the image and textual data for dLOC’s newspaper collection, we were aware of the varying OCR quality of newspapers within the Caribbean Newspaper Digital Library. Many of these newspapers were originally digitized over a decade ago, prior to the development and implementation of current OCR technology. By comparison, born-digital newspapers and those included in recent digitization efforts offer more accurate text and greater overall accessibility. As part of the process of enhancing access to existing collections, and under the leadership of UF team member Laura Perry, we focused on newspapers that we assessed to be in need of reprocessing.

Once the list of newspaper titles for inclusion in the project was finalized, we started by running titles through SobekCM’s Management & Reporting Tool (SMaRT), part of the SobekCM software solution developed at UF. Using this tool, we generated an ad hoc report of each newspaper using its bibliographic identifier (BibID). These reports were then exported into spreadsheets for analysis with our Digital Metadata Steward (DMS), another tool developed at UF. We created sets of the items compiled in each spreadsheet, which we then fed directly into Tesseract, the OCR engine of choice for dLOC projects, through the DMS. Previously, OCR processing was performed using PrimeRecognition and occurred post-capture rather than as part of the digitization process, leading to a delay between the availability of a newspaper’s page scans and its OCR text files. In addition to producing better OCR output more quickly, Tesseract handles multiple languages, including Spanish and French, well.

As such, we used Tesseract to create .txt files for each page of a newspaper issue and to combine the individual files into a .pdf of the entire issue. Once generated, .txt files, along with image derivatives in the form of .jpeg and .jp2 files, are housed in the resources directory of UF’s repository system. Using batch scripts (.bat files), we created folders of compiled issues and copied the corresponding .txt, .pdf, and .xml files from our resources directory into these new folders. Given the sheer number of issues for some of the older newspapers, we knew that we would need to use FTP or a physical hard drive to transfer the new OCR files to FIU, a process that is still ongoing. Once part of FIU’s open access data repository, this newspaper data will be able to reach broader audiences and hopefully play a role in other research projects aimed at studying the Caribbean.
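For readers who would rather not write .bat files, the copy step described above can be sketched in Python. This is only an illustration, not the script we ran: the function name `collect_issue_files` and the nested folder layout (one folder per title, then one per issue) are assumptions made for the example.

```python
import shutil
from pathlib import Path

# File types named in the post; image derivatives (.jpeg, .jp2) are skipped.
WANTED = {".txt", ".pdf", ".xml"}


def collect_issue_files(resources_dir, dest_dir, wanted=WANTED):
    """Copy OCR outputs into dest_dir, keeping each issue's relative folder path.

    The directory layout is hypothetical; adjust it to match your repository.
    """
    resources_dir, dest_dir = Path(resources_dir), Path(dest_dir)
    copied = []
    # sorted() lists every path up front, before anything is copied.
    for src in sorted(resources_dir.rglob("*")):
        if src.is_file() and src.suffix.lower() in wanted:
            target = dest_dir / src.relative_to(resources_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, target)  # copy2 also preserves timestamps
            copied.append(target)
    return copied
```

Calling `collect_issue_files("resources", "export")` would leave a folder tree under `export` that mirrors the source tree but holds only the text, PDF, and XML files, ready for transfer.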


Building an interactive map of select Caribbean publications with StoryMapJS

With the aim of further introducing users to the list of publications selected for the thematic toolkit, we decided to create an interactive map with StoryMapJS, a tool designed by Northwestern University’s Knight Lab. With its user-friendly layout and emphasis on geographic storytelling, this tool allowed us to build a visual display that both explores the individual histories of each publication and presents their relative regions of operation and influence to users. Each entry in the story map includes information about the origins, publishers, primary audiences, and publication frequency of a particular paper, among other historical details.

Though already accessible to users, this is an ongoing project that we will continue to develop as more detailed information is added to corresponding entries.


Newspaper Selection for dLOC as Data: A Thematic Approach to Caribbean Newspapers

As a researcher in Caribbean Studies, I have found the Digital Library of the Caribbean’s newspaper collection to be an essential resource. This access to digitized newspapers from dLOC’s Caribbean partners is even more important for researchers now due to COVID-19. dLOC’s Caribbean Newspaper Digital Library (CNDL) provides access to digitized versions of Caribbean newspapers, gazettes, and other research materials on newsprint currently held in archives, libraries, and private collections. CNDL has 300 newspapers and counting within the collection. dLOC also participates in the Chronicling America database by the Library of Congress and contributes Florida and Puerto Rican newspapers. With the aid of a recent Council on Library and Information Resources (CLIR) grant, the University of Florida (dLOC’s technical hub) will be digitizing an additional 800,000 pages of pre-1923 Caribbean newspapers and making them available on dLOC and Biblioteca Digital Puertorriqueña.

When conceptualizing the “dLOC as Data” project, our intent was to provide the text of select newspapers and make it available to users for bulk download. At that point, OCR had last been performed on the newspaper collection over a decade earlier, and the technology has improved significantly since. We recognize that even with these improvements there will still be inaccuracies, but part of the project includes providing documentation on these issues.

Since the dLOC as Data project is focused on hurricanes and tropical cyclones, we decided to narrow down the list to national newspapers across the numerous Caribbean countries represented in the collection. Our initial “wish list” had over 60 newspapers, but for the project we narrowed it down to 19 newspapers: Barbados Mercury and Bridge-town Gazette, Diario de La Marina (Cuba), Boletin Mercantil de Puerto Rico, Correspondencia de Puerto Rico, Revue-Express (Haiti), Revista de Cayo Hueso (Cuba), Le Nouvelliste (Haiti), Port of Spain Gazette (Trinidad), Le Matin (Haiti), Bohemia (Cuba), The Herald (St. Croix), Tribune (Bahamas), El Mundo (Puerto Rico), Noticias de Hoy (Cuba), Aruba Esso News, Barbados Advocate, Haiti Sun, Panama American, and Star (Dominica). Several of these newspapers, Diario de La Marina (Cuba) 1844-1961, Le Nouvelliste (Haiti) 1898-1979, and Tribune (Bahamas) 1915-2018, cover long time periods but might reveal gaps in the collection and in the reporting of hurricanes. It was important that the newspapers selected represent various time periods (late 1700s-1970s), include multiple languages, and offer coverage of particular storms across different countries.

Some of our guiding questions for the “dLOC as Data” project included: How were these hurricanes described when traveling across different nations? What stories of resilience can we find across these pages? How have the impacts of these disasters changed over time? What can we learn about climate change as well as disaster capitalism from this newspaper collection? Our project is inspired by the Colored Conventions Project Principles, and we seek to name specific people and places as a way to affirm their value and experience during these disasters. We recruited multidisciplinary scholars to help us think through ethical ways the data could be used while highlighting the accounts of people we come across through the project. Ultimately, it is our hope that by making the text available to users, along with some toolkits focusing on particular storms, we can help researchers contribute new narratives about Caribbean resilience and innovation.


A new addition to the project team

With experience in digital library projects specializing in Latin America and given the objectives of “dLOC as Data: A Thematic Approach to Caribbean Newspapers”, I was recently selected as the Caribbean Data Curation Intern under the direction of the project lead. Starting off as a student library aide at the Latin American & Caribbean Collection (LACC) at the University of Florida during my sophomore year, I quickly became interested in archival work and assisted with several archival and research projects under the guidance of the LACC’s senior staff. Through these projects and an undergraduate fellowship with the Association of Research Libraries the following year, I began to concentrate on the cultural and educational roles of archives and similar institutions, particularly digital collections. As digital institutions such as dLOC become increasingly vital resources for students and researchers alike, access to and understanding of such resources become even more necessary.

With these interests and the nature of this internship in mind, I hope to explore different aspects of web development and data utilization in the context of libraries and other educational institutions. I also hope to better familiarize myself with the processes of data collection, enhancement, and dissemination, particularly within and concerning Latin America and the Caribbean. As such, I look forward to contributing to this project and its objectives over the course of the next few months.


Finding the most mentioned storm in a newspaper run with AntConc and OpenRefine

Since one of the goals of the “dLOC as Data: A Thematic Approach to Caribbean Newspapers” project is to look at hurricanes and tropical cyclones across the Caribbean, we knew we would need to identify significant storms repeatedly mentioned in the newspaper runs. I found that the easiest way to do this was to use the Collocate tool in AntConc. OpenRefine has also proven to be a powerful tool for then filtering out these specific storm mentions. Below is documentation of my process so far using two very different newspapers.

The first newspaper corpus was a very rough OCR of the Barbados Mercury and Bridge-town Gazette, of which we had issues from the late 1700s to the mid 1800s. Since this is well before the time of formal storm naming systems, such as the one adopted by the U.S. Weather Bureau in the 1950s, culling out commonly mentioned storms proved to be a little more nuanced than just looking for the word hurricane mentioned alongside a formal name, such as Andrew or Katrina.

The first step was to load my files into AntConc. For this set, I searched for instances of the word “hurricane” using wildcards, to make sure I caught different forms of the word and any oddities introduced by the often inaccurate OCR. My exact search was “*hurric*.” This brought back 345 hits across 2,333 issues of the newspaper. The next part of our workflow for the project is to export the AntConc mention data into a .csv file so that we can more easily work with it in spreadsheet format and keep track of specific hurricane mentions.

Strategic use of wildcards on both sides of the word catches many forms of “hurricane” including “hurricaue” and “thehurricane.”
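The same wildcard idea can be approximated outside AntConc with a regular expression. The sketch below is an assumption-laden illustration rather than a description of AntConc’s internals: the 40-character context window, the function names, and the “issue” and “match” columns are hypothetical, and only “mention_text” mirrors the column name in our spreadsheet.

```python
import csv
import re

# Rough equivalent of the AntConc search *hurric*: any run of word
# characters that contains "hurric", regardless of case.
PATTERN = re.compile(r"\w*hurric\w*", re.IGNORECASE)


def find_mentions(issues, context=40):
    """issues maps an issue identifier to its OCR text; returns one row per hit."""
    rows = []
    for issue_id, text in issues.items():
        for m in PATTERN.finditer(text):
            start = max(0, m.start() - context)
            end = min(len(text), m.end() + context)
            rows.append({
                "issue": issue_id,
                "match": m.group(0),
                "mention_text": text[start:end].replace("\n", " "),
            })
    return rows


def write_csv(rows, path):
    """Save the hits so they can be opened in a spreadsheet or OpenRefine."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["issue", "match", "mention_text"])
        writer.writeheader()
        writer.writerows(rows)
```

Because the pattern allows extra characters on both sides, OCR slips such as “hurricaue” and run-together tokens such as “thehurricane” are caught along with the correct spelling.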

Collocations, in text analysis, are words that frequently appear close together. These can be bi-grams (such as “social media”) or tri-grams (such as “out to lunch”), or they can just be words that occur frequently near other words depending on the parameters set in the AntConc Collocation tool. 

When you navigate to the Collocate tab in AntConc, you are presented with the same search term, but with different options specific to the tool. You can set the parameters for collocation by choosing the number of words to consider on either side of the search term; I left this at the preset of 5 left and 5 right. You can also change how you would like your results to be displayed. The preset is “Sort by Stat,” or statistics (the statistics number refers to the value of a statistical measure between the search term and the collocate). I changed this to “Sort by Freq,” or frequency.

The top collocations of the word “hurricane” sorted by frequency. Stopwords, as expected, come out on top. 

The first words that appear are, unsurprisingly, common “stopwords.” In text analysis, stopwords are the most commonly used words, usually determiners or prepositions, that carry little information on their own. Once I scrolled down, I was able to identify more interesting words that might point me towards a date or particular event that was repeatedly mentioned. “October” was the first such word that stuck out to me, as the 23rd most frequent collocate (with 17 instances). After doing an in-text search of my data, I found that it was, indeed, often used in reference to The Great Hurricane of 1780. Phrases such as “…the Hurricane which took place on the 10th day of October, 1780…” occurred repeatedly. This wasn’t the only significant hurricane that took place in October, however. With some sleuthing, I also discovered that another repeatedly mentioned hurricane occurred in October of 1817.

Scrolling down the collocates for the Barbados Mercury, “october” stuck out to me.
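To make the idea of a collocation window concrete, here is a minimal Python sketch that counts the words within 5 tokens on either side of any token containing “hurric” and sorts them by frequency. It is not AntConc’s implementation: the tokenizer, the tiny stopword list, and the function name `collocates` are assumptions for illustration, and it computes raw frequency only, not AntConc’s statistical measure.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z]+")

# A deliberately tiny, illustrative stopword list; real lists are much longer.
STOPWORDS = frozenset({"the", "of", "and", "a", "in", "to", "on", "which", "was"})


def collocates(text, node="hurric", left=5, right=5, stopwords=frozenset()):
    """Count words appearing within the window around tokens containing node."""
    tokens = TOKEN.findall(text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if node in tok:
            window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
            # Skip stopwords and other forms of the search term itself.
            counts.update(w for w in window if w not in stopwords and node not in w)
    return counts.most_common()  # sorted by frequency, highest first
```

Passing `stopwords=STOPWORDS` pushes words like “the” and “of” out of the list, which is the programmatic version of scrolling past them by eye.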

Because this corpus was smaller, and there were only 345 instances of the word hurricane, this more labor-intensive detective work was possible. With a larger and more recent corpus, such as the Nassau Tribune, for which we have issues spanning from 1913-2018, a more systematic approach was needed. The same Collocate workflow turned out to work well here, since named storms were among the first collocates to stand out.

Scrolling down the collocates of the Nassau Tribune, I immediately recognized that “frances” must be a named storm.

Hurricane Frances, which took place in 2004, was mentioned 474 times in 3,655 issues of the newspaper. In order to get a closer look at this data, I loaded my spreadsheet into the data cleanup tool OpenRefine.

OpenRefine allows you to filter and facet your data in a way programs like Excel do not. Once I had the spreadsheet loaded, I clicked on the arrow next to “mention_text,” which brought up a dropdown. I then clicked on Text Filter. Doing this brought up a search box on the left of the screen that searches just the “mention_text” column. I then entered the word “frances” (this search is not case sensitive). The filter returned 580 matching rows. The count grew from 474 because OpenRefine searches the entire “mention_text” field, not just the 5 words on either side of each hurricane mention that the Collocate function in AntConc examined.

I found OpenRefine the easiest way to dive deeper into specific hurricane mentions, as it allows you to filter specific data very easily. 
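For anyone working in a script rather than in OpenRefine, the text filter step can be imitated with a few lines of Python. This is a rough analogue under stated assumptions, not OpenRefine’s own code: the helper names `load_rows` and `text_filter` are hypothetical, and only the “mention_text” column name comes from our spreadsheet.

```python
import csv


def load_rows(file_obj):
    """Read the exported spreadsheet into a list of dictionaries."""
    return list(csv.DictReader(file_obj))


def text_filter(rows, column, needle):
    """Keep rows whose column contains needle, ignoring case."""
    needle = needle.lower()
    return [row for row in rows if needle in (row.get(column) or "").lower()]
```

With the exported file open, `text_filter(load_rows(f), "mention_text", "frances")` returns every row whose full mention text contains the name, in any capitalization.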

Now it’s possible to easily browse all the mentions of that particular hurricane, and even to narrow them further by key words such as “damage” or “fatalities.” It is also possible to facet by date to see when the mentions took place, which helps distinguish between storms that share a name. There have actually been 6 Atlantic hurricanes named Frances since 1961.

Using the facet feature in OpenRefine to easily browse the dates of Hurricane Frances mentions.
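A date facet is essentially a count of rows per value. As a hedged sketch, the function below groups rows by the year at the start of a date field; the column name “date” and the assumption that dates begin with a four-digit year are both hypothetical and would need adjusting to match the actual spreadsheet.

```python
from collections import Counter


def facet_by_year(rows, date_column="date"):
    """Count rows per year, assuming each date value starts with YYYY."""
    years = Counter()
    for row in rows:
        value = (row.get(date_column) or "").strip()
        year = value[:4]
        # Rows without a usable year are grouped under "(blank)".
        years[year if len(year) == 4 and year.isdigit() else "(blank)"] += 1
    return dict(sorted(years.items()))
```

Run on the filtered Frances rows, a result with no years before the storm’s season would support reading all the mentions as references to a single storm.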

In the case of the Nassau Tribune, all 580 mentions of “Frances” alongside “Hurricane” appeared in 2005 or later. We can infer that these all refer to the 2004 hurricane, as it was the most devastating one (and, because of this, Frances is no longer in rotation for storm naming).