Scalable Data Pipelines for Mastering & Integration - an ML Approach
Integrating multiple, diverse datasets for analytics is an essential part of a data scientist's work, because feature engineering on dirty data only produces faulty features. However, current tools do not make the process simpler. There is a wide variety of data attributes and formats to handle. Matching and deduplicating records in preparation for analytics remains a challenge, and unifying matching records into a single, definitive representation of an entity is both time-consuming and error-prone. As a result, preparing data for predictive analytics requires manual effort and occupies up to 60-70% of a data scientist's time.
In this talk, we discuss how data engineers and data scientists can augment their data preparation by leveraging machine learning. We talk about schema mapping: identifying attributes in disparate data sources that refer to the same values. We discuss data mastering and how it differs from a typical clustering or classification problem. We also elaborate on how to scale these approaches, and how machine learning can help.
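To make the data mastering idea concrete, here is a minimal sketch of one way it could be framed, not the method presented in the talk. It scores record pairs by string similarity, decides whether each pair matches, and then merges matching pairs into entities with a union-find structure. The function names, the sample records, and the 0.75 threshold are all assumptions made for illustration; the averaged-similarity rule in `is_match` is a stand-in for a trained classifier.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_features(r1, r2, fields):
    """One similarity feature per shared field of a record pair."""
    return [similarity(r1[f], r2[f]) for f in fields]


def is_match(features, threshold=0.75):
    # Assumption: a simple average-and-threshold rule stands in for a
    # trained pairwise classifier; 0.75 is an arbitrary illustrative value.
    return sum(features) / len(features) >= threshold


def cluster(records, fields, threshold=0.75):
    """Group record indices into entities via union-find over matched pairs."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair; this is quadratic, so it only suits small inputs.
    for i, j in combinations(range(len(records)), 2):
        if is_match(pair_features(records[i], records[j], fields), threshold):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical sample records, invented for this sketch.
records = [
    {"name": "Acme Corporation", "city": "New York"},
    {"name": "ACME Corp.", "city": "New York"},
    {"name": "Globex Inc", "city": "Springfield"},
]

entities = cluster(records, ["name", "city"])
print(entities)
```

The sketch hints at why mastering is not a plain classification problem: the match decision is made on pairs, but the desired output is a partition of records into entities, so pairwise decisions have to be reconciled into groups. It also hints at the scaling problem, since comparing all pairs grows quadratically with the number of records.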
Come see how ML can be leveraged for data preparation for analytics.