In the last decade, incredible improvements on the data collection side have helped the processing pipeline. However, the various datasets are managed and maintained by several organizations in disconnected systems. As data moves between these systems, information about it is lost, and the people doing the cleaning have no control over how it was collected.
Insight into how data moves through the various processing stages on its way to analysis will help us build better tools for data cleaning and achieve better analytics. In “Examining the Challenges in Development Data Pipeline,” we attempt to understand data cleaning processes in development through interviews with stakeholders from different types of organizations.
We interviewed 8 people from international development organizations and 5 people who had worked for the government in Pakistan. This gave us both a global perspective, as the people from development organizations had worked on projects in well over 100 different countries, as well as a more in-depth perspective from a specific country.
The goal is to compile the outstanding issues expressed by practitioners and point out the gaps with the biggest impact. This allows us to build a basic taxonomy and identify areas where support tools can help achieve better cleaning of development data.
Three Data Cleaning Challenges
Several cleaning challenges arise once development data has been collected. As we saw in our findings, some are trivial, while others require sophisticated solutions such as building machine learning prediction models. These solutions demand varying amounts of person-hours, which in some cases could be saved through automation but in other cases remain inescapable. Based on detailed comments from our interviews, we present our synthesis in this section.
In our opinion, three key areas have emerged as the sources of major frustration with development data, based on what participants pointed out as the most time-consuming or challenging parts of data cleaning for them.
1. Merging data between existing large data sources.
Merging data is the most frustrating process due to several factors. The most common reason is the name resolution problem, where the names of locations do not match up in datasets of the same region. This is often because those names are from a local language that has been transliterated to either English or French.
Moreover, different people transliterate differently, resulting in various spellings for the same location. This could be avoided by building a master location list as a supporting dataset that systems can use; however, even with some push for this, it continues to be an issue. The best way to solve it is to build NLP-based algorithms that can accurately match names across all sorts of spellings. Such matching could further be used for patient names, which are also a problem, especially for migrating populations with no other identifiers.
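As a minimal sketch of the name-resolution idea, plain edit-distance matching from Python’s standard library can already resolve many spelling variants against a master list. The location names below are hypothetical examples, and a production system would need phonetic or transliteration-aware NLP models rather than simple string similarity.

```python
# Sketch: resolve variant place-name spellings against an assumed master list.
from difflib import SequenceMatcher

MASTER_LOCATIONS = ["Muzaffargarh", "Sheikhupura", "Dera Ghazi Khan"]  # assumed list

def best_match(name, candidates, threshold=0.8):
    """Return the closest master-list name, or None if nothing is close enough."""
    scored = ((SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates)
    score, match = max(scored)
    return match if score >= threshold else None

# Variant transliterations from two datasets resolve to the same master entry.
print(best_match("Muzafargarh", MASTER_LOCATIONS))  # -> "Muzaffargarh"
print(best_match("Shekhupura", MASTER_LOCATIONS))   # -> "Sheikhupura"
```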
The second reason for merging issues is mapping codes and terminologies that differ across datasets. This is mainly due to a lack of standardization in this space, which continues to be problematic. For instance, one participant explained that different countries use different codes to denote the cause of death.
Another participant expressed concerns about the use of different numeric codes for equipment functionality statuses in various tracking systems. This is not limited to terminology: the units being used and the format in which dates are written also vary. All of this increases the complexity of merging the datasets and requires many hours to untangle the differences.
One solution is a machine learning model that predicts the unit or variable based on the distribution of values in the dataset. This is not straightforward, however, because changing reporting norms alter the distribution over time. A more deterministic approach is to look the information up via a side channel, such as reading through the documentation or asking the source partner directly when possible.
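A minimal sketch of distribution-based unit inference follows, using hypothetical reference ranges for body weight. The paper describes the idea, not this code, and a real model would have to adapt as reporting norms shift the distribution.

```python
# Sketch: guess a column's unit from where its values fall.
import statistics

# Assumed plausible ranges for adult body weight in each unit (they overlap,
# so ambiguous medians return the first matching unit).
UNIT_RANGES = {"kg": (35, 120), "lb": (80, 265)}

def infer_weight_unit(values):
    """Guess the unit whose expected range covers the observed median."""
    median = statistics.median(values)
    for unit, (low, high) in UNIT_RANGES.items():
        if low <= median <= high:
            return unit
    return None  # ambiguous: fall back to documentation or the source partner

print(infer_weight_unit([62, 70, 55, 81]))      # -> "kg"
print(infer_weight_unit([150, 172, 135, 198]))  # -> "lb"
```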
2. Validating data accuracy.
Data validation is a major concern in the development space, and it generally becomes the focus once a digital data collection system is in place. This was an especially hectic process for the Punjab government participants, who have to go through the data and manually follow up on every error they find. They recounted how they eyeball the data both in tabular form and in visualizations.
Then they have to call the source facility to validate any suspicious entries. This is a tedious process for them, and it could be made easier with validation algorithms that use the statistics of past entries to highlight data that needs attention. In this way, manual follow-up could remain part of the system while the cognitive load of finding the errors is reduced.
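A minimal sketch of such statistics-based flagging, assuming a hypothetical per-facility history of monthly case counts. The interviews describe the manual version of this process, not this implementation.

```python
# Sketch: flag a new report that deviates sharply from a facility's history.
import pandas as pd

def flag_suspicious(history: pd.Series, new_value: float, z_cutoff: float = 3.0) -> bool:
    """Flag a new report if it falls more than z_cutoff standard deviations
    from the facility's historical mean."""
    mean, std = history.mean(), history.std()
    if pd.isna(std) or std == 0:
        return new_value != mean  # no variation on record: any change is worth a look
    return abs(new_value - mean) / std > z_cutoff

# Monthly case counts previously reported by one facility (assumed data).
past = pd.Series([42, 38, 45, 40, 44, 39])
print(flag_suspicious(past, 41))   # False: in line with history
print(flag_suspicious(past, 410))  # True: likely a data-entry error, follow up
```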
Some international development organizations handle this problem differently. They build a sophisticated model that tries to fit the data to find any values outside the norms for a given facility or data point. One organization described how data was extracted by hand from PDFs, and, even with trained personnel, they would discover errors. They would then build models to predict the values, so they could go back and verify whether an error had been made during extraction. Sometimes the error was in the original PDF report itself, which is harder to fix given the lack of raw data. In those cases, they rely on the predicted values.
3. Extracting data from PDF reports.
Extracting data from PDF files is important in development analytics because a large amount of historical, and even recent, data is available only in the form of PDF reports. This is challenging not only because extracting the numbers is difficult, but also because the numbers are aggregated in different ways, with the details tangled in the text of the reports. Even with a tailored script for a specific set of reports, putting the numbers in the right context deterministically is tough and results in countless person-hours spent verifying the extraction.
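As a minimal sketch of the mechanical extraction step, pdfplumber (one of several libraries for this) can pull tables out of a report. The file name here is an assumption, and mapping the extracted numbers back to the right indicators and aggregation levels still requires report-specific logic.

```python
# Sketch: pull every table out of a PDF report.
import pdfplumber

rows = []
with pdfplumber.open("report.pdf") as pdf:  # hypothetical report file
    for page in pdf.pages:
        for table in page.extract_tables():
            rows.extend(table)

# Each row is a list of cell strings; verifying them against the report's
# text is where most of the person-hours in the interviews went.
for row in rows[:5]:
    print(row)
```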
Even though several other issues were commonly mentioned, these three areas account for the most frustration due to the lack of good supporting tools. This results in a significant amount of manual labor to extract, merge, and validate data in a complete workflow. There is a need for more tailored solutions to these problems that help users build complex models for automatic processing, thereby significantly reducing the burden of data cleaning.
On the other hand, common data correction problems such as replacing values, unit conversion, and removing duplicates are well researched and can be handled by tools like Wrangler and Google Refine. These tools make the work easy by letting the user demonstrate the intended task through an example conversion.
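For illustration, a minimal pandas sketch of the kinds of corrections these tools automate. The column names, values, and conversion are assumptions for the example, not taken from the paper or the tools.

```python
# Sketch: the routine corrections (normalize, convert, deduplicate).
import pandas as pd

df = pd.DataFrame({
    "district": ["Lahore", "lahore", "Multan", "Multan"],
    "weight_lb": [154, 154, 132, 176],
})

df["district"] = df["district"].str.title()   # normalize spellings
df["weight_kg"] = df["weight_lb"] * 0.453592  # unit conversion
df = df.drop_duplicates()                     # remove duplicate records
print(df)
```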
Data Collection Challenges Continue
Over time, data collection processes and the structure of data have evolved. As a result, legacy data issues arise when merging datasets or generating time-series trends. In part this is because the indicators being collected have changed, either due to a lack of data standards or because the standard was ignored during collection.
Moreover, data entry norms are changing. For instance, diseases that were not being reported are now being reported more commonly. One participant from a non-government organization explained this situation as follows: “But there are some things that are very difficult to be backward compatible because it’s not just the code you used to report a disease changed but the entire reporting culture around a disease is evolving”.
Lastly, the boundaries of districts and sub-districts change over time, making a one-to-one comparison of past data with current data impossible. This requires splitting the aggregates or predicting the numbers based on comparable datasets.
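A minimal sketch of splitting a historical aggregate across successor districts in proportion to a comparable dataset (here, assumed population shares). The paper names the need, not this particular method.

```python
# Sketch: apportion an old district's total across its successor districts.
OLD_DISTRICT_TOTAL = 12_000  # historical count for a district later split in two

# Assumed population shares of the two successor districts.
population_shares = {"District A": 0.6, "District B": 0.4}

estimated = {name: OLD_DISTRICT_TOTAL * share
             for name, share in population_shares.items()}
print(estimated)  # {'District A': 7200.0, 'District B': 4800.0}
```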
A lightly edited synopsis of “Examining the Challenges in Development Data Pipeline” by Fahad Pervaiz, Aditya Vashistha, and Richard Anderson at the University of Washington.