Data Science Pattern #1: The Progress Serializer
It saved me hundreds of hours as a data scientist
Data cleaning and preprocessing are necessary and time-consuming tasks in data science. I typically do this in iterations with a subset of the original dataset. Once I think I have found all the issues, I run the cleaning code on the complete dataset.
There are several different ways to create the cleaned dataset. Assuming you can clean the data by looking at single elements in isolation, I’ll walk you through different approaches. Let’s go!
The straightforward approach
def clean_data(data):
    cleaned_data = []
    for element in data:
        cleaned = clean(element)
        cleaned_data.append(cleaned)
    return cleaned_data
If the clean function throws an exception, you lose all the work. You have to fix the issue and run clean_data from the start.
If running clean_data is quick (e.g. less than 20 minutes), that's fine. If not, you might want to use the progress serializer pattern.
The Progress Serializer Pattern
The progress serializer pattern ensures that no progress is lost, even if the data cleanup doesn't run to completion.
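Here is one way such a pattern could look, as a minimal sketch: each cleaned element is appended to a file on disk immediately, and on restart the already-serialized elements are skipped. The file format (JSON Lines), the file name, and the clean function are my assumptions for illustration, not part of the original.

```python
import json
import os

def clean(element):
    # Hypothetical cleaning step -- replace with your own logic.
    return element.strip().lower()

def clean_data_resumable(data, progress_path="cleaned.jsonl"):
    # Count how many elements were already cleaned in a previous run.
    done = 0
    if os.path.exists(progress_path):
        with open(progress_path) as f:
            done = sum(1 for _ in f)

    # Append each newly cleaned element to disk immediately, so an
    # exception never costs more than the current element's work.
    with open(progress_path, "a") as f:
        for element in data[done:]:
            cleaned = clean(element)
            f.write(json.dumps(cleaned) + "\n")

    # Load the full cleaned dataset back from disk.
    with open(progress_path) as f:
        return [json.loads(line) for line in f]
```

If clean throws partway through, you fix the bug and simply call clean_data_resumable again; it resumes after the last serialized element instead of starting over. This assumes the elements are cleaned in a stable order and that data supports slicing.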