Data machine: Cleaner | Python with Pets

Tags

Data cleaning, Data sets, Data machines, Validation

Summary

Write a function that takes a data set as a param. Some of the records in the data set might have errors. The function returns a data set with no errors.

Situation

You've got a data set (a list of dictionaries). Some of the records in the data set might have errors. You want to remove the records with errors before analysis.
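To make the situation concrete, here is a sketch of how such a data set might come about. It assumes the scores arrive as CSV text, which is an assumption for illustration only. The goat names and values are made up; only the field names Goat, Before, and After come from the example in this pattern.

```python
import csv
import io

# Hypothetical CSV text. The names and scores are invented for this sketch.
raw_csv = """Goat,Before,After
Bella,55,72
,60,65
Gus,abc,80
Daisy,48,130
"""

# DictReader makes one dictionary per row, with every value as a string.
raw_goat_scores = list(csv.DictReader(io.StringIO(raw_csv)))

for record in raw_goat_scores:
    print(record)
```

Every value is a string at this point, even the scores. The second record has no name, the third has a Before value that isn't a number, and the fourth has an After value that looks too high. Those are the kinds of problems a cleaner needs to catch.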

Action

Write a function that takes a data set (a list of dictionaries) as a param, and returns another data set with no errors.

Call it like this:

cleaned_goat_scores = clean_goat_scores(raw_goat_scores)

Here's an example:

def clean_goat_scores(raw_goat_scores):
    # Create a new list for the clean records.
    cleaned_goat_scores = []
    # Loop over raw records.
    for raw_record in raw_goat_scores:
        # Is the record OK?
        if is_record_ok(raw_record):
            # Yes, make a new record with the right data types.
            clean_record = {
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            }
            # Add the new record to the clean list.
            cleaned_goat_scores.append(clean_record)
    # Send the cleaned list back.
    return cleaned_goat_scores

Line 5 loops over the records. Line 7 calls is_record_ok, a function that returns True if the data is OK, and False if it isn't.

Here's an example of is_record_ok:

def is_record_ok(record):
    # Check name.
    goat_name = record['Goat']
    if goat_name == '' or goat_name is None:
        return False
    # Check Before value.
    before = record['Before']
    if not is_score_ok(before):
        return False
    # Check After value.
    after = record['After']
    if not is_score_ok(after):
        return False
    return True

It goes through each field, returning False if there's a problem.

Line 8 calls is_score_ok, a function that tests a numeric field. Here's an example:

def is_score_ok(score_in):
    # Is it a number?
    try:
        score_number = float(score_in)
    except Exception:
        return False
    # Check range. Written this way, NaN ("not a number") fails too.
    if not 0 <= score_number <= 100:
        return False
    # All OK.
    return True
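
To see the three functions work together, here is a sketch that runs them on a small data set. The functions are repeated in condensed form only so the sketch runs on its own; they are meant to behave the same way as the ones above for this data. The raw records are made up for illustration.

```python
def is_score_ok(score_in):
    # A score is OK if it converts to a number from 0 to 100.
    try:
        score_number = float(score_in)
    except Exception:
        return False
    return 0 <= score_number <= 100


def is_record_ok(record):
    # The name must not be empty, and both scores must be OK.
    goat_name = record['Goat']
    if goat_name == '' or goat_name is None:
        return False
    return is_score_ok(record['Before']) and is_score_ok(record['After'])


def clean_goat_scores(raw_goat_scores):
    # Keep only the OK records, converting scores to floats.
    cleaned_goat_scores = []
    for raw_record in raw_goat_scores:
        if is_record_ok(raw_record):
            cleaned_goat_scores.append({
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            })
    return cleaned_goat_scores


# Made-up raw data. Only the first and last records are valid.
raw_goat_scores = [
    {'Goat': 'Bella', 'Before': '55', 'After': '72'},
    {'Goat': '', 'Before': '60', 'After': '65'},
    {'Goat': 'Gus', 'Before': 'abc', 'After': '80'},
    {'Goat': 'Daisy', 'Before': '48', 'After': '130'},
    {'Goat': 'Pip', 'Before': '0', 'After': '100'},
]

cleaned_goat_scores = clean_goat_scores(raw_goat_scores)
for record in cleaned_goat_scores:
    print(record)
# Prints:
# {'Goat': 'Bella', 'Before': 55.0, 'After': 72.0}
# {'Goat': 'Pip', 'Before': 0.0, 'After': 100.0}
```

Three of the five records are dropped: one has no name, one has a Before value that isn't a number, and one has an After value above 100. The two that remain have their scores as floats, ready for analysis. Note that the raw list is left unchanged, since the cleaner builds a new list.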