No, I don’t want this to be yet another post underlining the importance of Data Engineering to Data Science; there are so many of them on the internet already!
I want to talk about what I feel is the single most important factor behind the failure of Data Science projects.
Ever since 2017, the likes of Gartner, IBM, and other business analysts have published multiple reports stating that more than 80% of Big Data, AI, and ML projects fail due to a lack of reliable data infrastructure.
The situation today is not very different, though you may disagree with me on this.
Most businesses, having embraced digital transformation, are sitting on huge amounts of varied data. All of them want to leverage this data to dig out insights for specific business problems.
Enter the Heroes: Data Scientists, who can swoosh their magic wands over the data to make it speak and reveal its secrets. But wait, where is the data?
Most of the time it’s sitting in different locations, in proprietary formats, enclosed in some application framework, lying buried under tons of unwanted data, or available but in a format that is practically unusable.
The Data Scientist now has to pocket the Data Science wand, swap the Data Magician’s (Hero’s) hat for a Data Engineer’s hat, and search for tools and techniques to collect all this data in one place, process it, enrich it, and bring it into a form on which the magic works.
IMO, Data Scientists don’t exactly enjoy doing data engineering, but they reluctantly agree. Sometimes they agree happily, because they feel they can do it or want to help. The intentions may be good, but they are not equipped with the required skills.
This is where things start going south.
Strong software engineering discipline and hygiene are the cornerstone of any solution, whether or not it is powered by AI and ML.
A strong understanding of software design and architecture, data structures, data modeling, the factors affecting the scalability and reliability of systems, and the cloud-based services that can accelerate the data infrastructure is a must.
This is what makes Data Engineering a specialized and distinct function.
Unfortunately, none of this is a prerequisite for being a pure Data Scientist. Data Science needs skills like applied mathematics, statistics, algorithms, and ML techniques.
Hence the two roles are not interchangeable, and forcing Data Scientists to do the engineering work is usually a recipe for failure.
Yet it keeps happening, for reasons like saving costs or the unavailability of appropriate skills.
As all of us know, AI, ML, and data projects are about 80% engineering and 20% Data Science Magic.
The Data Science Magic gets the WOW, but it’s the Data Engineering blood and sweat that makes the magic work, reliably and repeatedly.
It’s imperative to include Data Engineering in your star cast, and to invest in it up front, for a successful AI- or ML-based business solution or service offering.