Everyone is interested in the fancy outputs of mobile data collection: eye-catching and informative web-based data visualizations. Yet few people care about the mess of intermediate analytics and custom data engineering work needed to fuse multiple raw data sources into a single database and then visualize that data coherently.
The easiest way to handle the messy intermediate work is to create a custom solution, i.e. a transactional database combined with a common web framework. This seems like a reasonable solution that fulfills project requirements.
How harmful could custom data integration tools be? Quite harmful, unfortunately.
For unique problems that only need to be solved once, this solution makes sense. However, data integration problems are not unique. Problems such as connecting to other systems (e.g. DHIS2, OpenMRS, iHRIS), adjusting or cleaning raw data, and fine-tuning charts and other visualizations need to be solved in almost every project.
We all end up rebuilding the same thing over and over. Worse still, custom solutions become unmaintained, dated and eventually discarded abandonware after funding ceases.
At Ona, we are software solution builders and often see these patterns but do not have the time or the budget to do anything about them. Instead, we write good code – extensible with common standards and thorough testing – but this narrow-minded focus on technical excellence lets us overlook that we have spent more resources building out a series of separate web applications than we would have if we had invested in a long-term solution.
Open source data tools are the answer
What are our options? Going back to combine these separate applications into a single platform is untenable, because we will be forced to accommodate design decisions made for those single applications, which then becomes an endless game of whack-a-mole with every upstream code change.
We started looking for an answer by looking at our previous work. Every tool we’ve built breaks down into three components:
- Aggregation
- Merging
- Analytics and Visualization
Well-architected and well-maintained open source solutions exist for each of these components:
- Elastic, NiFi, and Kafka can solve the aggregation problem
- Storm, Flink, Spark, and Apex can solve the merging problem
- Hadoop, Druid, Pentaho, Pig, and Hive can solve the analytics problem
- Zeppelin, Superset, and ReDash can solve the visualization problem
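To make the breakdown above concrete, here is a toy sketch of the stages in plain Python. It does not use any of the tools listed; the source names, record fields, and function names are all hypothetical and chosen only for illustration. In a real deployment each function would be replaced by one of the open source components named above.

```python
from collections import Counter

# Hypothetical raw records from two collection tools with different field names.
SURVEY_ROWS = [
    {"facility": "Clinic A", "visit_type": "ANC"},
    {"facility": "Clinic B", "visit_type": "ANC"},
]
SMS_ROWS = [
    {"site_name": "Clinic A", "type": "immunization"},
]


def aggregate(survey_rows, sms_rows):
    """Aggregation: pull records from every source into one stream, tagged by origin."""
    for row in survey_rows:
        yield {"source": "survey", **row}
    for row in sms_rows:
        yield {"source": "sms", **row}


def merge(records):
    """Merging: map source-specific field names onto one shared schema."""
    for rec in records:
        if rec["source"] == "survey":
            yield {"facility": rec["facility"], "service": rec["visit_type"]}
        else:
            yield {"facility": rec["site_name"], "service": rec["type"]}


def analyze(records):
    """Analytics: count visits per facility, ready to hand to a charting layer."""
    return Counter(rec["facility"] for rec in records)


visits_per_facility = analyze(merge(aggregate(SURVEY_ROWS, SMS_ROWS)))
print(dict(visits_per_facility))
```

Because each stage only consumes and produces plain records, any one of them can be swapped out without touching the others, which is the property that makes off-the-shelf components attractive here.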
Open source data tools aren’t enough
Although these tools exist, they are hard to use. First, they are not packaged in an easy-to-use way, sometimes purposefully. Second, they are built to scale and solve harder technical problems rarely encountered in ICT4D to date.
They are built to operate with high availability, high scalability, high throughput, and low latency – making them challenging to set up and operate in developing country contexts.
So our new problem is: how can we make the awesome but complicated existing tools easier to use?
Enter the Enterprise Data Warehouse (EDW), the modular, open source tool we’re building to make integrating existing data tools simple. The EDW will fill a huge need in the ICT4D open source community and let us build better solutions faster while being adaptable to the changing needs of our partners.
Side note: building an EDW is a significant challenge (read: big investment), so much so that it is the competitive advantage of a number of companies.
EDW tools should be common in ICT4D
We see building an Enterprise Data Warehouse as a disruptive change for ICT4D. A well-tested industry standard software package will make it easy for groups to solve data problems they are facing today. At the same time, an EDW means faster integration with promising new technologies like artificial intelligence, comprehensive data security, and Internet of Things.
To accelerate this shift in the industry, Ona has released open source Ansible roles to automatically set up and configure data warehouse components.
Additionally, we offer our partners an EDW solution that addresses the unique needs of the ICT4D space including data sovereignty, aggregating data across heterogeneous data sources, mixed deployments that are both in the cloud and on-premise, limited hardware resources, and limited or unstable network connectivity.
Program monitoring based on real-time data is an important part of all the projects we are involved in, and will become essential to any project’s success in the increasingly network-connected future (e.g. Internet of Things). Building these tools the correct way, by custom building as little as possible, is the most cost-effective and, we believe, the best approach.
The EDW prepares us for what’s next
Two recent changes have shifted the balance and justify a more thoughtful approach to data systems tooling. First, an ecosystem of mature tools covering all the use cases we outlined above has only become available in the past year or so. Second, successful projects now require real-time feedback and are expected to be used at scale.
An Enterprise Data Warehouse built from open source components is the perfect solution for this job. It future-proofs our tools and our partners’ projects. We can operate efficiently from village-scale to planet-scale, and when we need to incorporate new technologies, it’s as close to plug-and-play as possible.
For example, a global organization that’s in the process of deploying what they expect to grow into a planet-scale SMS-reporting tool will eventually have datasets in the high terabytes and then petabytes. To handle a rapidly growing dataset of this size, process it in real time, and accomplish its mission effectively, this organization will need to use an EDW that can accommodate this growth and at the same time is capable of handling its current smaller-scale needs at a reasonable cost.
In addition to reducing duplicate work, our approach – implementing on top of open-standards-based data platforms – will mean solutions that cost less and give builders more flexibility. These are essential features of successful ICT4D projects, which supports our raison d’être at Ona: to build technology solutions that improve people’s lives.
Enterprise Data Warehouse Example
In the context of health systems, our streaming data architecture means we can already create a single pipeline that receives information from an electronic medical record system, enhances it with demographic data, and then visualizes indicators on a website, all without building custom software.
This is what it looks like:
Using industry standard data platforms lets us reconfigure and reuse the same system for different health use-cases or for any particular needs a client with data might have. We can also extend this system by adding machine learning tools and connecting them to existing platforms, products and data.
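As a rough illustration of the health pipeline described above (receive records from an electronic medical record system, enrich them with demographic data, then compute an indicator for display), here is a minimal sketch in plain Python. It is not Ona's implementation and uses no real EMR or streaming API: the patient IDs, field names, the `DEMOGRAPHICS` lookup, and the under-five indicator are all assumptions made up for the example.

```python
# Hypothetical demographic registry keyed by patient ID.
DEMOGRAPHICS = {
    "p1": {"age": 3, "district": "North"},
    "p2": {"age": 34, "district": "North"},
    "p3": {"age": 1, "district": "South"},
}


def emr_events():
    """Stand-in for a feed of visit records from an electronic medical record system."""
    yield {"patient_id": "p1", "diagnosis": "malaria"}
    yield {"patient_id": "p2", "diagnosis": "malaria"}
    yield {"patient_id": "p3", "diagnosis": "measles"}


def enrich(events, demographics):
    """Attach demographic fields to each event as it streams past."""
    for event in events:
        extra = demographics.get(event["patient_id"], {})
        yield {**event, **extra}


def under_five_cases_by_district(events):
    """Indicator: number of visits by children under five, per district."""
    counts = {}
    for event in events:
        age = event.get("age")
        if age is not None and age < 5:
            district = event["district"]
            counts[district] = counts.get(district, 0) + 1
    return counts


indicator = under_five_cases_by_district(enrich(emr_events(), DEMOGRAPHICS))
print(indicator)
```

The stages are generators, so records flow through one at a time rather than being loaded in bulk. Reusing the pipeline for a different health use case would mean changing only the lookup table or the indicator function.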
Most importantly, our clients can access the visualization and data ingestion platform themselves. They can play with the charts and data pipelines to discover uses we would have never imagined.
If you are interested in exploring how an Enterprise Data Warehouse could help address your needs please leave a comment below.
By Roger Wong and Peter Lubell-Doughtie of Ona.
We’d like to acknowledge the World Health Organization, Johns Hopkins University, VillageReach, the Bill and Melinda Gates Foundation, and Johnson & Johnson for supporting us in this work.
Hi, awesome project that would definitely help us at Root Capital. We are interested in integrating data from multiple different collection tools and creating a permission system to allow the enterprises we are collecting from to select what and with who they would like to share. Would love to have a more detailed discussion, if possible. Thanks!