Data Science Pattern #1: The Progress Serializer

It saved me hundreds of hours as a data scientist

Martin Thoma
3 min read · Apr 1, 2023
Forgetting that a value might be None / NULL is one of the most common issues when cleaning data. Dividing by zero or expecting a list to be non-empty are other frequent pitfalls. Without a way to resume work, you have to start from the beginning every time one of them crashes your pipeline.

Data cleaning and preprocessing is a necessary and time-consuming task in data science. I typically do this in iterations with a subset of the original dataset. Once I think I have found all the issues, I run the cleaning code on the complete dataset.

There are several different ways to create the cleaned dataset. Assuming you can clean the data by looking at single elements in isolation, I’ll walk you through different approaches. Let’s go!

The Straightforward Approach

def clean_data(data):
    cleaned_data = []
    for element in data:
        cleaned = clean(element)
        cleaned_data.append(cleaned)
    return cleaned_data

If the clean function throws an exception, you lose all the work. You have to fix the issue and run clean_data from the start.
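To make the failure mode concrete, here is a hypothetical run (the division-based clean is a stand-in, not from the article): one bad element at the end of the dataset discards everything cleaned before it, because the partial results only ever lived in a local list.

```python
def clean(element):
    # Hypothetical cleaning step: raises ZeroDivisionError for 0.
    return 10 / element

def clean_data(data):
    cleaned_data = []
    for element in data:
        cleaned_data.append(clean(element))
    return cleaned_data

try:
    clean_data([1, 2, 5, 0])  # crashes on the last element...
except ZeroDivisionError:
    pass  # ...and the three already-cleaned values are lost with the local list
```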

If running clean_data is quick (e.g. less than 20 minutes), this is fine. If not, you might want to use the progress serializer pattern.

The Progress Serializer Pattern

The progress serializer pattern makes sure that no progress is lost, even if the data cleanup doesn't run to completion.
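One way to sketch the pattern: serialize each cleaned element to disk as soon as it is ready, and on restart skip everything that was already written. This is a minimal interpretation; the JSON-lines progress file and the clean stand-in are my assumptions, not from the original.

```python
import json
from pathlib import Path

def clean(element):
    # Hypothetical cleaning step; replace with your real logic.
    return element.strip().lower()

def clean_data(data, progress_file="cleaned.jsonl"):
    path = Path(progress_file)
    # Resume: count how many elements were already cleaned and serialized.
    done = 0
    if path.exists():
        with path.open() as fh:
            done = sum(1 for _ in fh)
    with path.open("a") as fh:
        for element in data[done:]:  # assumes data is a sequence with stable order
            cleaned = clean(element)
            # Serialize each result immediately, so a crash loses at most
            # the element currently being processed.
            fh.write(json.dumps(cleaned) + "\n")
    # Load the complete result back from disk.
    with path.open() as fh:
        return [json.loads(line) for line in fh]
```

If clean raises halfway through, you fix the bug and call clean_data again; it counts the lines already on disk and continues from there instead of starting over.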


Martin Thoma

I’m a Software Engineer with over 10 years of Python experience (Backend/ML/AI). Support me via https://martinthoma.medium.com/membership