Automating Data Cleaning and Processing for Ingestion
Situation
As a customer success engineer, I was assigned to create a robust data cleaning and processing pipeline in Python that could be run both in Jupyter Notebooks and as a background data processing microservice. The team was shifting from a technical focus to a sales and account management focus, so the process had to be usable by both technical and nontechnical staff. The pipeline needed to:
Read data from a CSV file.
Execute a series of cleaning and processing steps.
Allow for custom analysis and manipulations.
Transform the data into a format suitable for storage, ingestion, and use as training and test data for an algorithm.
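To make those requirements concrete, here is a minimal sketch of the shape such a pipeline can take, assuming pandas is available; the step functions and the CSV path are illustrative placeholders, not the actual internal code:

```python
import pandas as pd


def drop_empty_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows that are entirely empty."""
    return df.dropna(how="all")


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names to lower snake_case."""
    return df.rename(columns=lambda c: str(c).strip().lower().replace(" ", "_"))


def run_pipeline(csv_path: str, steps) -> pd.DataFrame:
    """Read a CSV and apply each cleaning/processing step in order."""
    df = pd.read_csv(csv_path)
    for step in steps:
        df = step(df)
    return df


# The steps list is the extension point: custom analysis and manipulation
# functions can be slotted in per dataset without changing the pipeline itself.
cleaned = run_pipeline("customer_export.csv", [drop_empty_rows, standardize_columns])
```

Keeping each step as a plain function that takes and returns a DataFrame is what lets the same code run in a notebook cell or inside a background service.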
Task
The main objective was to develop a flexible and reusable pipeline that could handle various datasets and support the entire data preparation workflow in a Jupyter Notebook environment. The pipeline needed to ensure data quality and be adaptable to different analysis and manipulation requirements.
Action
Standardized the analysis done on data coming in from various sources. This meant working with both customers and our internal team to determine what data was needed and why, and ensuring we had the correct governance practices in place to pass vendor QA checks. I also had to gain a deep understanding of each data source so that we had consistent data definitions in place.
Defined a standard set of data cleaning and processing steps, applying transformations such as scaling and encoding, or performing calculations, so that data points were uniform across sources.
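As a hedged illustration of what such standard steps can look like, a scaling and encoding pass over a pandas DataFrame might be sketched as follows, assuming scikit-learn is available; the column lists are supplied per dataset and are not the actual internal names:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_numeric(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Scale the given numeric columns to zero mean and unit variance."""
    out = df.copy()
    out[columns] = StandardScaler().fit_transform(out[columns])
    return out


def encode_categorical(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """One-hot encode the given categorical columns and drop the originals."""
    dummies = pd.get_dummies(df[columns], prefix=columns)
    return pd.concat([df.drop(columns=columns), dummies], axis=1)
```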
Created and integrated interactive widgets so that users with minimal coding skills could perform analysis and answer questions.
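A minimal sketch of that kind of interaction, assuming ipywidgets in a notebook (the column-summary helper here is illustrative rather than the actual widget set), might look like:

```python
import pandas as pd
from ipywidgets import Dropdown, interact


def explore_column(df: pd.DataFrame) -> None:
    """Let a user pick a column from a dropdown and view its summary statistics."""
    def summarize(column):
        # describe() handles both numeric and categorical columns
        return df[column].describe()

    interact(summarize, column=Dropdown(options=list(df.columns), description="Column"))
```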
Converted the cleaned and processed DataFrame into formats suitable for storage and algorithm ingestion.
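For the storage and ingestion step, a sketch under the assumption that a columnar format such as Parquet was acceptable for storage and that a scikit-learn train/test split fed the algorithm (the target column and output path are placeholders) might be:

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def export_for_ingestion(df: pd.DataFrame, target: str, out_path: str):
    """Persist the processed data and return train/test splits for the algorithm."""
    df.to_parquet(out_path, index=False)  # writing Parquet requires pyarrow or fastparquet
    features = df.drop(columns=[target])
    labels = df[target]
    return train_test_split(features, labels, test_size=0.2, random_state=42)
```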
Trained staff on how to use the pipeline.
Result
The data cleaning and processing pipeline significantly streamlined the data preparation workflow. It enabled users to load, clean, and transform datasets efficiently within a Jupyter Notebook environment.
The interactive components allowed for flexible analysis and manipulations, catering to various user needs. The processed data was consistently of high quality and ready for algorithm training and testing, ultimately enhancing the accuracy and reliability of the machine learning models developed.
My work in this area also laid the groundwork for the data processing pipeline that removed repetitive cleaning and munging steps from the data science team's responsibility.
I earned a promotion to the data platform engineering team.