How to Build a Data Pipeline That Handles Hundreds of Different Inputs

How many different file formats does your ETL system need to parse? For many data pipelines, a handful of well-defined formats will suffice. Things break, and at times require manual intervention, but not so often that a couple of engineers can't keep tabs on the system and keep it running relatively smoothly.

But what happens when you need to build a system that processes hundreds of disparate data sources? Think past your CSV and JSON files to EDI, HL7, COBOL files and even data buried in complex ASCII printouts. Parsing logic quickly becomes brittle when faced with a multitude of edge cases, errors abound, and engineers are soon frazzled.

So how do you build a robust parsing engine that supports a plethora of inputs and data structures? Harder still, how do you build one that copes with frequent errors in the data itself? That requires a well-defined process for bringing human assistance into the system, and for giving structured feedback to the data providers so they can improve their own data collection.

Massive disparities like these in the format and structure of data are the norm for any engineer working with datasets in the healthcare field.

Meet Chris Hartfield of Clover Health

Data engineer Chris Hartfield has experienced this first-hand at Clover Health, a company in San Francisco that's working to apply data science to give healthcare providers higher-quality signals so they can make meaningful interventions in patients' healthcare.

Clover Health processes massive amounts of healthcare data daily, so its data team has naturally run into all of the challenges described above in building data pipelines that handle hundreds of different inputs.

Building a data platform as sophisticated as Clover Health's has involved plenty of trial and error, and the team has learned from its mistakes to develop rich insights into data processing.

"What surprised me the most was that the majority of files we received are processed manually and that human error abounds in medical data. This created a unique challenge for us because not only did we have to build a parsing system to try and handle some of the human errors or odd file formats we would receive, but we also had to create an effective feedback loop between the vendors and throughout the company of when files were having issues," said Hartfield.

The extent to which Clover considered the human element on both sides of the data equation fascinated me. Chris continued, "Although our system can parse and extract data from these files in an automated fashion, our vendors still often generate these files by hand and not through automated versions. So there's a human element to this automated system that involves reaching out to vendors when files don't meet our expectations."
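To make the idea concrete, here is a minimal sketch of one way such a system could be laid out: a registry that maps each format to its own parser, plus a per-file report that collects problems in a form someone could send back to a vendor. This is not Clover Health's code. Every name in it (`PARSERS`, `register`, `FileReport`, `process_file`, the `member_id` field) is hypothetical, and only CSV and JSON are shown, since parsers for EDI, HL7 or COBOL files would need their own libraries.

```python
import csv
import io
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names here are hypothetical; this is an illustration, not Clover's system.
Record = Dict[str, str]
Parser = Callable[[str], List[Record]]

# One parser per format, so supporting a new input never touches existing ones.
PARSERS: Dict[str, Parser] = {}


def register(fmt: str) -> Callable[[Parser], Parser]:
    """Register a parser function under a format name."""
    def decorator(func: Parser) -> Parser:
        PARSERS[fmt] = func
        return func
    return decorator


@register("csv")
def parse_csv(text: str) -> List[Record]:
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


@register("json")
def parse_json(text: str) -> List[Record]:
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of records")
    return data


@dataclass
class FileReport:
    """Structured feedback for one file: good records plus readable issues."""
    vendor: str
    accepted: List[Record] = field(default_factory=list)
    issues: List[str] = field(default_factory=list)


def process_file(vendor: str, fmt: str, text: str, required: List[str]) -> FileReport:
    """Parse one file and collect problems instead of failing on the first."""
    report = FileReport(vendor=vendor)
    parser = PARSERS.get(fmt)
    if parser is None:
        report.issues.append(f"no parser registered for format {fmt!r}")
        return report
    try:
        records = parser(text)
    except (ValueError, csv.Error) as exc:  # JSONDecodeError is a ValueError
        report.issues.append(f"file could not be parsed: {exc}")
        return report
    for number, record in enumerate(records, start=1):
        if not isinstance(record, dict):
            report.issues.append(f"record {number}: not a key/value record")
            continue
        missing = [name for name in required if not record.get(name)]
        if missing:
            report.issues.append(f"record {number}: missing {', '.join(missing)}")
        else:
            report.accepted.append(record)
    return report
```

Calling `process_file("acme", "csv", text, ["member_id", "dob"])` returns a report whose `accepted` list holds the usable rows and whose `issues` list names each bad record. A person can review those issues and forward them to the vendor, which is the kind of feedback loop Hartfield describes.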

To learn how to apply the Clover team's lessons to make your own data pipelines and platform more robust, check out Chris Hartfield's talk.

Data Council Blog

Written by Pete Soderling
