We live in the era of Big Data. That’s not a new statement; it’s not even an exciting one. But there’s an implicit finger wag in there when the statement is applied to ICT4D. It says, “if you aren’t collecting massive amounts of data, you’re not being innovative, and you can’t possibly be addressing the problems your project means to solve.”
But what if that’s not the case? What if more data isn’t always the answer, and the answer is instead the systematic and intentional use of the most necessary project data to improve field staff efficiency, inform project decision making, and provide useful insights for team management?
This has been CRS Laos’ experience with the first and second phases of the Learning and Engaging All in Primary School (LEAPS I/II) project, funded by USDA as part of the McGovern-Dole International Food for Education and Child Nutrition Program.
‘Nice-to-know’ vs. ‘Need-to-know’
There are an infinite number of fascinating questions one can ask about development interventions. The most obvious of these are:
- “Does it work?” (internal validity)
- “Will it work in another context?” (external validity)
- “Can you get the same result for less money?” (efficiency)
Yet it’s incredibly tempting to stray beyond these and other key questions about a development project, ranging from the esoteric – “does a 20% subsidy for school uniforms for rural school children have an impact on attendance rates for 3rd grade ethnic minority girls without access to electricity but whose parents are literate?” – to the mundane – “does a direct unconditional cash transfer increase household expenditures?”
With the widespread adoption of mobile data collection tools, the marginal cost of collecting an extra data point has shrunk dramatically. Filling out an additional virtual form, or adding a question to an existing form, has virtually no apparent cost, so there is a natural inclination to collect ever more data.
Data is cheap, but information is still very expensive.
Converting raw data into something useful and actionable still requires a great deal of effort. Consider LEAPS I – it started off with the right mindset, wanting to maximize the use of its mobile data collection devices to collect quality data.
At its height, the project was collecting tens of thousands of data points in over 300 project locations every month. While this provided the project with more than enough raw data, the project lacked sufficient data management systems and tools to extract meaningful information from much of it.
This meant that a great deal of time had to be spent combing through files to find exactly what was needed. LEAPS I worked to solve this problem, in part by establishing a clear distinction between ‘nice-to-know’ and ‘need-to-know’ data.
This reduced field staff workload substantially, improved data timeliness, and established processes to easily extract meaningful information for donor reports. While data is nearly free, turning it into useful information is not.
OK, the problem isn’t always too much data
Having said that, there is, of course, a lot we can do with large, robust data sets. Even if you’re only collecting exactly what you need to answer your most burning questions, donor requirements, agency initiatives, pilot studies, and other efforts can require a lot of data to provide meaningful answers.
Even after stripping the monitoring and evaluation system down to just ‘need-to-know’ data, LEAPS II is expected to collect between 10 and 15 million data points over five years. That means strong systems are needed to properly handle large volumes of data.
Staging Databases for Data Management
One of the biggest challenges of working with large amounts of data is data cleaning. A key but often under-utilized tool is a staging database.
A staging database is a storage area that sits between your source database (typically linked to a mobile data collection platform) and the final destination of your data (either tabular or visual representations used for reporting and analysis).
Staging databases offer a myriad of benefits, but, in particular, they allow a project to:
- Automate cleaning processes of the source data.
- Record cleaning processes made to the source data.
- Revert changes made to the source data, if necessary.
By applying rules to data as it is loaded from the source database into the staging database, you can save a great deal of time compared with modifying individual records by hand. For example, all test data can be removed at once during this step. The same approach can be used to bulk-correct data that falls within an acceptable validation range but is nevertheless incorrect, such as replacing all instances of 999 or 888 with Null.
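As a rough illustration, a rules-based load from source to staging might look like the minimal Python/pandas sketch below. The table and column names (raw_submissions, staging_submissions, enumerator, attendance, meals_served) are hypothetical and do not reflect the project’s actual platform or schema.

```python
# Minimal sketch: read records from a source table, apply bulk cleaning
# rules, and write the result to a separate staging table. All table and
# column names here are illustrative.
import sqlite3
import pandas as pd

SENTINELS = [888, 999]  # codes that pass validation but mean "no real value"

def load_to_staging(conn: sqlite3.Connection) -> pd.DataFrame:
    # Read from the source table; the source itself is never modified.
    df = pd.read_sql("SELECT * FROM raw_submissions", conn)

    # Rule 1: drop all test submissions in a single pass.
    df = df[~df["enumerator"].str.lower().str.startswith("test", na=False)]

    # Rule 2: bulk-replace sentinel codes (999/888) with nulls in the
    # numeric columns where they appear.
    df = df.replace({col: {s: None for s in SENTINELS}
                     for col in ["attendance", "meals_served"]})

    # Write the cleaned copy to a staging table, leaving the source intact.
    df.to_sql("staging_submissions", conn, if_exists="replace", index=False)
    return df
```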
Furthermore, by using a staging database to systematically clean data, you can generate metadata about the frequency of different kinds of errors. In LEAPS, for example, examining this metadata allowed the project to see which data collection forms produce the most data entry errors, and from which field staff. Knowing this can ultimately be used to provide targeted support and improve overall data quality.
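A sketch of how that kind of metadata could be generated is below, again with hypothetical column names (form_name, enumerator) rather than the project’s actual data structure.

```python
# Minimal sketch of cleaning metadata: count how many sentinel values each
# submission contained, then roll the counts up by form and enumerator.
import pandas as pd

SENTINELS = [888, 999]

def error_summary(raw: pd.DataFrame, value_cols: list[str]) -> pd.DataFrame:
    # Flag every cell that holds a sentinel code instead of a real value.
    flagged = raw[value_cols].isin(SENTINELS)

    # Aggregate so the forms and field staff with the most errors rise to
    # the top, ready for targeted follow-up and refresher training.
    return (
        raw.assign(error_count=flagged.sum(axis=1))
           .groupby(["form_name", "enumerator"], as_index=False)["error_count"]
           .sum()
           .sort_values("error_count", ascending=False)
    )
```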
Lastly, and most importantly, by separating your source database from your staging database, you significantly reduce the risk of a major loss of data. Changes in the staging database can be more easily reverted, and the original data is safely preserved.
In the case of LEAPS II, the staging database has dramatically reduced the amount of time that project staff spend cleaning data. It also gives the project peace of mind, knowing that data cleaning processes cannot lead to a permanent loss of data.
From Big Data to Big Visualization
Another challenge with large data sets is reducing them to meaningful information. Business intelligence (BI) solutions are an increasingly common remedy, facilitating reporting, analytics, and visualization of large data sets. These tools can be used to create graphics that update in real time, turning overwhelming amounts of data into actionable pieces of information.
LEAPS II uses data visualizations to identify problems, gain deeper understanding of events, and inform decision making and planning. Because the dashboards update automatically with the newest available data as it comes in from the field, the time between data collection and action is reduced.
For example, real-time data visualization has sped up responses to reported commodity damage, tracked long-term indicator progress, helped identify sites for pilot projects, and more.
This dashboard alerts the team to commodity losses and shows trends in losses geographically and over time, allowing the project to respond in near real time.
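As a rough illustration of the data behind such a view, the aggregation feeding a loss-monitoring dashboard might look something like the sketch below. The column names (report_date, district, qty_lost_kg) and the alert threshold are hypothetical, and no particular BI tool is assumed.

```python
# Minimal sketch: roll up reported commodity losses by district and month
# so a dashboard (or a simple alert) can refresh whenever new staging data
# arrives. Column names and the threshold are illustrative only.
import pandas as pd

ALERT_THRESHOLD_KG = 100  # hypothetical alert level, not a project figure

def loss_trends(staging: pd.DataFrame) -> pd.DataFrame:
    # Bucket each loss report by month so trends over time are visible.
    monthly = staging.assign(
        month=pd.to_datetime(staging["report_date"]).dt.to_period("M")
    )

    # Total reported losses per district per month.
    trends = (
        monthly.groupby(["district", "month"], as_index=False)["qty_lost_kg"]
               .sum()
    )

    # Flag district-months that exceed the alert threshold for follow-up.
    trends["alert"] = trends["qty_lost_kg"] > ALERT_THRESHOLD_KG
    return trends
```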
But all of this is only possible with strong data management. Even with highly efficient processes and concise, data-rich visualizations, the human mind can only process so much information at a time.
So focusing on what really matters, and determining the best ways to gather, analyze, and convey that information, needs to be at the heart of data management.
By John Mulqueen, a Monitoring, Evaluation, Accountability, and Learning Manager at Catholic Relief Services
Very nicely drafted. The use of appropriate tools and workflows would definitely reduce the resources required for data analysis and visualization. Great work by CRS, keep it up!
Great post! The only thing I would add is that nothing can replace getting clean data from the start. Staging databases are great for resolving some issues and identifying errors, but strong training and support of those who are collecting data saves time, money, and produces data that can lead to meaningful conclusions. Too often M&E systems fail to provide accurate answers to our questions because many variables are missing or inaccurately reported and there is no time or funding to recollect it. Clean data is better than more data. And clean data is essential for turning data into information.
Wonderful read with great insight. I would add that getting the human element right is also part of the equation. Beyond setting up a good system to collect clean data, it also helps to involve data collectors and get their buy-in on the purpose of data collection: not just what to do, but why they are doing it. Getting front-line data collectors invested in the outcome of any investigation and its potential impact typically leads to improved data quality from the onset.
Loved your article, thanks so much! I’m wondering, what are the most-used or recommended staging databases for development or humanitarian data?
We often use Azure.