Data machine: Cleaner

Summary

Write a function that takes a data set as a param. Some of the records in the data set might have errors. The function returns a data set with no errors.

Situation

You've got a data set (a list of dictionaries). Some of the records in the data set might have errors. You want to remove the errors before analysis.

Action

Write a function that takes a data set (a list of dictionaries) as a param, and returns another data set with no errors.

Call it like this:

  • cleaned_goat_scores = clean_goat_scores(raw_goat_scores)

Here's an example:

  1. def clean_goat_scores(raw_goat_scores):
  2.     # Create a new list for the clean records.
  3.     cleaned_goat_scores = []
  4.     # Loop over raw records.
  5.     for raw_record in raw_goat_scores:
  6.         # Is the record OK?
  7.         if is_record_ok(raw_record):
  8.             # Yes, make a new record with the right data types.
  9.             clean_record = {
  10.                 'Goat': raw_record['Goat'],
  11.                 'Before': float(raw_record['Before']),
  12.                 'After': float(raw_record['After'])
  13.             }
  14.             # Add the new record to the clean list.
  15.             cleaned_goat_scores.append(clean_record)
  16.     # Send the cleaned list back.
  17.     return cleaned_goat_scores

Line 5 loops over the records. Line 7 calls is_record_ok, a function that returns True if the data is OK, and False if it isn't.

Here's an example of is_record_ok:

  1. def is_record_ok(record):
  2.     # Check name.
  3.     goat_name = record['Goat']
  4.     if goat_name == '' or goat_name == None:
  5.         return False
  6.     # Check Before value.
  7.     before = record['Before']
  8.     if not is_score_ok(before):
  9.         return False
  10.     # Check After value.
  11.     after = record['After']
  12.     if not is_score_ok(after):
  13.         return False
  14.     return True

It goes through each field, returning False if there's a problem.

Line 8 calls a function that tests a numeric field. Here's an example:

  1. def is_score_ok(score_in):
  2.     # Is it a number?
  3.     try:
  4.         score_number = float(score_in)
  5.     except Exception:
  6.         return False
  7.     # Check range.
  8.     if score_number < 0 or score_number > 100:
  9.         return False
  10.     # All OK.
  11.     return True
Where referenced