Creating a robust infrastructure to utilize data streams
It is raining heavily in my city, and it reminds me of the data produced or recorded by even a moderately sized organization. Numerous data streams flood the business every day. To harness this force, the business has to judiciously build data lakes and dam the streams without harming the business environment, then use the gathered data to extract insights, predictions, and direction. Ok, let me stop this analogy before we head towards a cloudburst!
A smart organization builds a robust infrastructure to utilize this data.
- The first step is to collect the data where it is produced: recordings of client calls, readings from instrumentation and sensors, logs, or feeds from external systems. Instrumentation and infrastructure engineers work at this stage (a minimal collection sketch follows this list).
- The second step is to transform the data and store it in a usable format, structured or unstructured, cleaning and prepping it along the way. This is the data engineering team's job (see the cleaning sketch below).
- The third step is to aggregate and label the data: experimentation, A/B testing, building training data, and applying ML algorithms. This work falls to data analysts and data scientists (see the training sketch below).
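To make the first step concrete, here is a minimal collection sketch in Python, assuming a hypothetical sensor feed. `read_sensor`, the `rain-gauge-01` id, and the `landing/` directory are illustrative stand-ins for whatever instrumentation and landing zone your environment actually has.

```python
import json
import random
import time
from datetime import datetime, timezone
from pathlib import Path

LANDING_DIR = Path("landing")  # hypothetical raw landing zone of the "data lake"
LANDING_DIR.mkdir(exist_ok=True)


def read_sensor() -> dict:
    """Stand-in for a real instrumentation source (sensor, log tail, API)."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "sensor_id": "rain-gauge-01",
        "mm_per_hour": round(random.uniform(0, 50), 2),
    }


def collect(n_readings: int = 10) -> None:
    """Append raw readings untouched, as JSON lines; cleaning comes later."""
    out = LANDING_DIR / f"{datetime.now(timezone.utc):%Y%m%d}.jsonl"
    with out.open("a") as f:
        for _ in range(n_readings):
            f.write(json.dumps(read_sensor()) + "\n")
            time.sleep(0.1)  # simulate readings arriving over time


if __name__ == "__main__":
    collect()
```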
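For the second step, a small cleaning sketch, assuming pandas is available and that the raw input is the JSON-lines output of the collection sketch above; the `warehouse/` path and field names are likewise hypothetical.

```python
from pathlib import Path

import pandas as pd  # assumes pandas is installed


def clean(raw_path: str, out_path: str) -> pd.DataFrame:
    """Turn raw JSON-lines readings into a typed, deduplicated table."""
    df = pd.read_json(raw_path, lines=True)

    # Enforce types: malformed timestamps/values become NaT/NaN
    # instead of crashing the pipeline.
    df["ts"] = pd.to_datetime(df["ts"], errors="coerce", utc=True)
    df["mm_per_hour"] = pd.to_numeric(df["mm_per_hour"], errors="coerce")

    # Drop rows that failed parsing, plus exact duplicates from retried sends.
    df = df.dropna(subset=["ts", "mm_per_hour"]).drop_duplicates()

    # Store in a columnar format that analysts can query efficiently
    # (to_parquet requires pyarrow or fastparquet to be installed).
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out_path, index=False)
    return df


# Example usage, pointing at a file produced by the collection sketch:
# clean("landing/20240101.jsonl", "warehouse/readings.parquet")
```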
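For the third step, a rough labeling-and-training sketch using scikit-learn on synthetic aggregates. The `flood_risk` rule stands in for real labeling work (human annotation, experiment outcomes), and every column name here is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical aggregation step: hourly rainfall features per sensor.
rng = np.random.default_rng(0)
mean_mm = rng.uniform(0, 40, 500)
df = pd.DataFrame({
    "mean_mm": mean_mm,
    "peak_mm": mean_mm + rng.uniform(0, 40, 500),
})

# Labeling step: a simple rule stands in for real annotation work.
df["flood_risk"] = ((df["mean_mm"] > 25) & (df["peak_mm"] > 50)).astype(int)

# Hold out a test set, as you would before any A/B or model comparison.
X, y = df[["mean_mm", "peak_mm"]], df["flood_risk"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression().fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```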
Each of these steps requires a different type of training and expertise. The boundaries between the steps are porous, and the people working on them have overlapping skills, so it is imperative to assign the right person to the right role. Getting this wrong results in unhappy scientists and underutilized skills. Match the steps to the skills carefully.