⇓ More from ICTworks

How to Combine Real-Time Data Flows Into Unified Data Views

By Guest Writer on December 12, 2018

ona canopy data unifted view

We’ve developed a strong partnership with VillageReach, a team that stewards a logistics management system, OpenLMIS, and works to solve health care delivery challenges in low resource communities. Their team in Mozambique recently reached out to us with a problem that we’re seeing more and more frequently among clients: the need to automatically combine data from multiple sources in the same dashboard.

Usually we find the biggest problem our partners face usually isn’t around visualizing data; rather, it’s around how to pull together a bunch of real-time data from multiple sources so they can make sense of it. This means automating the process of ingesting the right data, in the right format, to generate a unified data view for use across all levels and verticals of a team.

Prior to this project, VillageReach staff in Mozambique had to manually collect raw data from two electronic systems – SELV and DHIS2 – four CSV data updates sent via email, and a series of JPEG images produced by a cold chain management system called NexLeaf. Their current approach was to manually compile and share via an Excel spreadsheet something that was very time consuming and limiting from a data analysis standpoint.

Generating a Unified Data View

Since we have been collaborating closely with VillageReach to develop Canopy (it’s currently being used to support the OpenLMIS v3 reporting stack), VillageReach wanted to see if we could assist them in using Canopy to automatically pull these data sources together into a real-time dashboard solution. We had a tight timeline, two months from start to finish, to see if what we were developing with Canopy was up to the challenge.

The platform involves a few components that perform specialized functions:

  • Canopy Flow is used to collect data from each of the 7 different sources and merge it together into a unified data view
  • Canopy EDW stores the data and generates the indicator values
  • Canopy Discover visualizes those indicators in an easy-to-understand interface with self-service capabilities like modifying charts or building custom visualizations

We began with a two-pronged approach. Our data engineering team dug into the systems with APIs (DHIS2, SELV, and NexLeaf) to understand how to retrieve data from them and how they structure data. At the same time, our data analyst team worked with VillageReach to catalog the complete list of indicators across the 7 different data sources on the current dashboard, capture any future ones, and come up with technical definitions mapped to the source systems.

Managing Different Data Sources and Flows

It was a lot of data to map but fortunately our partners were well prepared and excellent in helping us understand the context during this initial period (and throughout the project) which was key to help ensuring the project’s success. Some detail on the data sources:

Scheduling DHIS2 API Calls

DHIS2 contains data on vaccine coverage, utilization, and wastage. Its smallest unit of information is available at the granularity of facility and month. We use Canopy Flow to fetch data from a few DHIS2 APIs on a schedule.

Building APIs for SELV

SELV provides data stock levels and stock outs. Because SELV doesn’t have an API, we configured Canopy Flow to fetch facility data related to these indicators from the SELV database on a nightly basis. Although SELV only stores data at a monthly level granularity, we need to fetch it nightly because a submission could arrive at any point during the month.

Parsing CSVs from Emails for e-SISTFE

e-SISTAFE provides financial data that comes in at the granularity of province and month. e-SISTAFE does not have an API, so a Ministry of Health official generates a CSV extract containing its data. There are 3 other CSV files that Ministry of Health staff compile manually that contain data on community outreach and engagement, with the granularity of district and quarter.

To handle these different CSV files, we use Canopy Flow to read emails from a dedicated inbox. It parses email attachments that meet a specific set of criteria like the file type, name, columns, and data types, and rejects submissions that fail data validation checks. Because this comes in at no predictable or schedulable time, we use data in the email subject line to assign it to a reporting period.

Scraping JPEG Images from NextLeaf

NexLeaf provides JPEG images of charts that are at the granularity of district and month. Unfortunately, we can’t access the raw data from this version of NexLeaf, so Ben, a member of the VillageReach team, wrote a node.js app that runs on a nightly cron job. The app scrapes images off of the web from NexLeaf’s API, and stores them in a database. We then built a component to return those images via an HTML iFrame embedded in Canopy Discover.

Lessons Learned in Sourcing Data

Once we understood the source data and the indicators, we defined a schema for how this multisource data would be merged to generate all of the desired indicators. The process wasn’t easy. We had two key learnings along the way:

Manually-entered CSV Data is Tricky!

Because we had some in-country Ministry of Health staff who were working with older versions of Excel, we could not rely on data validation checks built into the software. We needed to control for casing, date formatting, etc. in Canopy itself; we needed to convert “MAPUTO CIDADE” or “cidade de maputo” to “Maputo Cidade,” for example.

As a result of this, we’ve been strongly recommending the use of a form-based data capture system (like Ona) instead of CSV files when it comes to manually-entered data. Note that CSVs with a standard schema and data structure are much easier to handle

Intimately Understand System APIs

It is important to thoroughly examine source system APIs and understand the level of effort it working with them. Just because the system has an API doesn’t mean that API gives you the information you need in the format you need it. APIs that produce JPEG images are obviously very different from those that produce JSON files, and so they need to be scoped differently

Data Visualization Success

unified data view dashboard

The end result: a single dashboard powered by Canopy Discover which combines data from 7 sources, and that updates itself automatically. Ministry of Health staff across Mozambique now have a central place to view vaccine supply chain indicators, including things like coverage rates, stockouts, vaccine utilization and wastage, and financial data.

By Clay Crosby of Ona and originally published as Dashboards Powered by Multiple Datasources

Filed Under: Data
More About: , , , , , , ,

Written by
This Guest Post is an ICTworks community knowledge-sharing effort. We actively solicit original content and search for and re-publish quality ICT-related posts we find online. Please suggest a post (even your own) to add to our collective insight.
Stay Current with ICTworksGet Regular Updates via Email

Sorry, the comment form is closed at this time.