By Marlene Hanken, Data Analyst, Science
Welcome to The Data Corner, a place where this resident data nerd addresses some of the most common data issues and data-related topics facing the agriculture industry today!
This issue’s topic is Data Standardization…and what is data standardization anyway? Simply put, data standardization is a set of processes and standards that allow data sets from various sources and formats to be conformed into a common format and source for analysis and interpretation—the foundation for collaboration.
For those familiar with spreadsheets, picture taking two spreadsheets with differing column headings and unifying the data into a single spreadsheet in which no two fields contain redundant information. Let’s use the example of two different growers sharing commodity yields for the year. Perhaps both have fields containing the commodities and yield volumes, but one grower has planting information in the other columns and the other has harvest information. Does the resulting spreadsheet contain all columns from both (both planting and harvest info)? Or does it contain only a subset of the two, namely the fields they have in common? This is a common decision made by data managers (the people or entities combining the spreadsheets) and is an important part of data standardization.
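For readers who like to see it in code, here is a minimal Python sketch of that decision. Everything in it is an assumption made for illustration: the field names, the sample values, and the `combine` helper are hypothetical and are not taken from any real grower data or tool.

```python
# Hypothetical records from two growers; field names and values are made up.
grower_a = [{"commodity": "romaine lettuce", "yield_volume": 1200,
             "planting_date": "2023-03-01"}]
grower_b = [{"commodity": "celery", "yield_volume": 800,
             "harvest_date": "2023-07-15"}]


def combine(datasets, keep="all"):
    """Combine several lists of row dicts into one list of rows.

    keep="all"    -> every column from every source (gaps become None)
    keep="common" -> only the columns that every source shares
    """
    # Collect the set of column names used by each source.
    column_sets = [set().union(*(row.keys() for row in rows)) for rows in datasets]
    if keep == "all":
        columns = sorted(set().union(*column_sets))
    elif keep == "common":
        columns = sorted(set.intersection(*column_sets))
    else:
        raise ValueError("keep must be 'all' or 'common'")
    # Rebuild every row with the chosen columns, filling gaps with None.
    return [{col: row.get(col) for col in columns}
            for rows in datasets for row in rows]


everything = combine([grower_a, grower_b], keep="all")
shared_only = combine([grower_a, grower_b], keep="common")
print(sorted(everything[0]))   # all four column names
print(sorted(shared_only[0]))  # only the two shared column names
```

Keeping every column preserves more information but leaves blanks wherever a grower did not report a field. Keeping only the shared columns gives a complete table at the cost of dropping the planting and harvest details.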
But why is data standardization so important? When data isn’t standardized, the resulting analytics are like a blurry photo: obscured findings at best and completely incomprehensible at worst. A common example is the naming of commodities. Different operations may record the same commodity in nearly the same way, but unless the entries are identical, character for character, a computer cannot recognize them as the same. Even a leading or trailing space is enough to make a difference (e.g., “romaine lettuce” vs. “ romaine lettuce” vs. “romaine lettuce ”). Without a process to reconcile those three values as the same, the resulting analytics (for example, a count of how many commodities are being reported) would return three rather than one. Sound tedious? It is! That’s what makes data standardization and the attention of good data managers so important!
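The counting problem is easy to reproduce. The short Python sketch below counts the three spellings before and after a simple cleanup. The `clean` helper is hypothetical, and collapsing repeated spaces and lowercasing go a step beyond the leading and trailing spaces discussed here, so treat those extra steps as assumptions.

```python
# The three spellings from the example: identical except for stray spaces.
reported = ["romaine lettuce", " romaine lettuce", "romaine lettuce "]


def clean(name):
    # Drop leading/trailing whitespace, collapse repeated spaces, and
    # lowercase (the last two steps are assumptions added for illustration).
    return " ".join(name.split()).lower()


raw_count = len(set(reported))                    # distinct values as typed
clean_count = len({clean(n) for n in reported})   # distinct values after cleanup
print(raw_count, clean_count)  # prints: 3 1
```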
The previous example seems simple and straightforward, but let’s take another common example: one company reports a commodity as “romaine lettuce” and another as “lettuce, romaine.” Both commodity names are valid and identifiable by other industry participants and even consumers. However, just as in the first example, our count of commodities would indicate two when, in fact, they are the same and should be represented as only one commodity. So, which one does a data manager use? Herein lies the first obstacle in data standardization: nomenclature, meaning the values themselves (in our spreadsheet example, what is typed in a particular cell), can be difficult to normalize. Does the first commodity name submitted get used? Does the data manager work with both parties to see if they can agree to use one versus the other? Does the data manager just pick one, or introduce a completely new name that is neither of those options and simply inform the data providers? All these options, along with other solutions, have been employed by data managers. These solutions become increasingly difficult to implement as the number of participants (and their variant submitted values) increases. What does a data manager do with over a dozen differing values all meant to represent “romaine lettuce”?
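One common way to handle this, whichever name ends up being chosen, is a lookup table that maps every known variant to the single agreed-upon name. The Python sketch below is only an illustration under assumptions: the `ALIASES` table, the choice of “romaine lettuce” as the standard name, the `standardize` helper, and the sample submissions are all hypothetical, and none of it describes how any particular platform handles the problem.

```python
# Hypothetical alias table: each known variant maps to one agreed-upon name.
# Which name counts as "standard" is a decision for the data manager and
# the data providers; "romaine lettuce" is picked here only for illustration.
ALIASES = {
    "romaine lettuce": "romaine lettuce",
    "lettuce, romaine": "romaine lettuce",
}


def standardize(names, aliases):
    """Return (standardized names, submissions that need a human decision)."""
    standardized, unmatched = [], []
    for name in names:
        # Ignore stray spaces and capitalization before looking up the name.
        key = " ".join(name.split()).lower()
        if key in aliases:
            standardized.append(aliases[key])
        else:
            unmatched.append(name)  # unknown variant: flag it, do not guess
    return standardized, unmatched


# Made-up submissions, including one value the table does not know about.
submitted = ["romaine lettuce", "lettuce, romaine", "Romaine Lettuce ", "romaine hearts"]
standardized, unmatched = standardize(submitted, ALIASES)
print(len(set(standardized)))  # prints: 1
print(unmatched)               # prints: ['romaine hearts']
```

The table itself is the easy part. Deciding which name it should point to, and keeping it current as new variants arrive, is still the data manager's job.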
Most recently, the Western Growers’ Science team has encountered this type of data standardization issue in our Food Safety Data Sharing Project. This project features the GreenLink™ platform, a web-based tool developed by Western Growers and powered by statistics and analytics company Creme Global. Through GreenLink™, WG member operations that grow, harvest, process, and ship a variety of commodities are sharing their data with Western Growers. GreenLink™ allows a participant to visualize data from his or her individual operation as well as view anonymized aggregated data from all participants. With data from over 100 growers on multiple commodities across the production chain, data standardization is an important area of focus within GreenLink™.
To tackle the task, Western Growers assembled the Data Standardization Working Group. The group is composed of academic subject matter experts, Creme Global data scientists, Western Growers’ Science team, and others interested in data standardization. The objective of the working group is to assess challenges and opportunities and coordinate efforts around data standardization in the fresh produce industry. With standardized data, powerful and accurate analytics are possible, paving the way to a brighter future for Western Growers’ members so they can continue to do what they do best…growing the best medicine in the world.
Want to share your experiences with data standardization or have a data topic you want us to address in the next issue of The Data Corner? Email us at [email protected] today!