Data for Freshwater Inflows Book Available at GRIIDC

Dr. Paul Montagna, Harte Research Institute’s (HRI) Chair of HydroEcology, recently published a book, “Freshwater Inflows to Texas Bays and Estuaries”. The book was an update to a previous volume published in 1994 and represents years of work on Texas water quality and quantity. Published by Springer Nature, the book includes 17 chapters that are open access, and all associated data are stored and available for download in GRIIDC.

Publishing a book is no easy feat, and ensuring the associated data are publicly available added another layer of complexity. But for Montagna, making the data open access was important. Not only are 32 datasets stored with GRIIDC, but he also ensured that the datasets were cited in the book using the GRIIDC-issued digital object identifiers.

GRIIDC asked Montagna about his experience making the data publicly available for the book.

Why was it important for you to make the data from this book open access?

Two reasons, really. First, I’m a strong believer in making data available, and I especially want to make sure none of my data disappears when I retire; all data is an important legacy. Second, the purpose of this particular book was to synthesize data for management purposes, and too often the first step (assembling data) is a roadblock. So, making all the data used in the book available will be an important time saver for future studies.

You had to collaborate with multiple researchers to write this book. Did you encounter any difficulties getting them to share the data? If so, what were the difficulties?

Some weren’t used to the idea of data sharing, but most already had experience with data sharing, so it was not difficult. Also, I made it a requirement to participate in the project.

You provided data citations within each book chapter. Did you encounter any difficulties citing the data?

Yes, citing the data is critical to its discovery, and we made a point to cite each dataset in the chapter it supported. Also, I included a final appendix with all the data citations in one place to make it easier to find data. Citing data is easy because all book and journal publishers have a standard format for data citation. It is common today for publishers to request a declaration on the author’s data sharing policy.

Have you cited data before in other publications? How important do you think data citations are?

I’ve been including data citations in all my publications since 2013. Data citations are more important than ever because there is a lot of fraud and fake science, which is especially easy now with artificial intelligence, so providing the raw data is an important tool for maintaining scientific integrity. Also, data citations are a CV builder, because they can be in a section all their own. Some funders are asking for data citations as part of the CV for proposal submissions.

How do you think the knowledge that you were going to make the data publicly available influenced how you managed your data throughout the publication process?

Knowing we would archive data made everyone more cognizant of keeping track of metadata, and more careful with how it was stored. I like to use Excel as a notebook for data because you can maintain multiple sheets without having to make multiple files. It’s also easy to create a “metadata” sheet with all the information about the data, and definitions of the variables on all the sheets. Keeping track of metadata is not natural and it’s difficult to learn. I have always told students “write it down now” when they are creating data files and “you won’t remember what the units are or some other detail about the data in a few months.” Getting data right is actually very hard and easy to mess up.

However, there are two problems with Excel: 1) it will automatically format cells so that the wrong values are displayed or the cell value actually changes. For example, I often take samples over a depth range such as 3 to 10 cm, but if you type “3-10” in a cell, Excel will change it to “3/10/2025” and display it as “Mar-10.” There are also similar problems when sample identifications or station names have a period or dash in the value. So, I am careful to define cells as “text” when I know the cells will contain periods, dashes, slashes, or other symbols that may alter the meaning of the value. 2) the other problem is that Excel is a proprietary format that may change or disappear in the future. I had one experience with a bad legacy format because in the late 1980s and early 1990s I used Lotus-123 and Lotus-Symphony. A few years ago, I wanted to archive the data from the 1980s and 90s and found converters for 123 but not Symphony. Luckily, I can program and wrote a program to read the Symphony files and convert them to Excel format, but it took a couple of weeks to solve the problem. I know I should probably use a text-based format like “csv”, but it is just too inconvenient.

One more detail. I actually do all my data management, data analysis, and programming in SAS. I learned it in the early 1980s and have always used it. It is a very powerful tool with comprehensive analytical procedures and it is a mainstay in the financial, insurance, and health industries. It’s also very easy to move data to and from SAS, and I archive data in Excel, which is more commonly used. A few years ago, someone sent me 300 files in Paradox format, and I had no trouble converting them in SAS. But because SAS is a commercial package with annual licensing fees, use in academia is rare today (except in Business Schools).
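
The depth-range problem described above can be illustrated with a minimal sketch. It assumes Python's standard `csv` module rather than the SAS and Excel workflow Montagna uses, and the column names, station names, and values are hypothetical. The sketch writes a label such as “3-10” to a text-based CSV, reads it back unchanged as a string, and converts it to numbers only when explicitly asked.

```python
import csv
import io

# Hypothetical sample records; names and values are illustrative only.
rows = [
    {"station": "A-1", "depth_cm": "3-10", "value": "0.42"},
    {"station": "B.2", "depth_cm": "0-3", "value": "1.10"},
]

def write_csv(records):
    """Serialize records to CSV text; every field is stored as plain text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["station", "depth_cm", "value"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

def read_csv(text):
    """Read CSV text back; the csv module never reinterprets values as dates."""
    return list(csv.DictReader(io.StringIO(text)))

def parse_depth_range(label):
    """Split a 'top-bottom' label into two numbers, only on explicit request."""
    top, bottom = label.split("-")
    return float(top), float(bottom)

restored = read_csv(write_csv(rows))
print(restored[0]["depth_cm"])
print(parse_depth_range(restored[0]["depth_cm"]))
```

The file itself stays safe as text, but opening the same CSV in a spreadsheet program may still trigger automatic date conversion unless the column is imported as text.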

What has been your experience trying to find publicly available data?

Very mixed. Google Data is a huge step forward and a big help. But it is still very difficult to find data. For example, several times I have tried to find data that I put in a federal database, and I couldn't find it.

What is your advice for researchers who plan on sharing their data?

As the Nike slogan says, “Just do it.” Keep track of methods, sources, and variable names and units, so it is easy to construct the metadata. Be careful with Excel and don’t manipulate data in Excel.
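
One way to act on the advice about variable names and units is a simple completeness check, sketched here in Python. It is an illustration under stated assumptions, not a GRIIDC requirement or Montagna's own method: the data dictionary layout, the column names, and the `undocumented_columns` helper are all hypothetical.

```python
# Hypothetical data dictionary: one entry per column, with a description and units.
metadata = {
    "station": {"description": "Sampling station name", "units": "none"},
    "depth_cm": {"description": "Depth range of the sample", "units": "cm"},
}

def undocumented_columns(columns, metadata):
    """Return the columns that lack a metadata entry or a units field."""
    return [
        name
        for name in columns
        if name not in metadata or not metadata[name].get("units")
    ]

# Hypothetical column headers from a data file; "salinity" has no entry yet.
columns = ["station", "depth_cm", "salinity"]
print(undocumented_columns(columns, metadata))
```

Running a check like this each time a data file is created catches missing units while the details are still fresh.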

For more information on Montagna’s book, see the Harte Research Institute story.

Select datasets from the book are featured on the GRIIDC homepage and all datasets can be found on the GRIIDC search page by searching for the book's digital object identifier: https://doi.org/10.1007/978-3-031-70882-4 .