Showing bad data

Seeing errors

Sometimes, errors in data can be hard to see. For example...

Reflect

Is there anything wrong with this?

Snor1ax is the BEST!

Adela

Should be Snorlax with an l (letter l), not Snor1ax with a 1 (the digit).

Right! When you're skimming many records, that can be hard to see.

It can help to print out which records have errors. Maybe show the Goat field of invalid records, like this:

  • Bad record:
  • Bad record:
  • Bad record: Bertha
  • Bad record: Bessie
  • Bad record: Boyd
  • Bad record: Bridgette
  • Bad record: Carrie
  • Bad record: Darell
  • Bad record: Deborah
  • Bad record: Gerald
  • Bad record: Johnnie
  • Bad record: Long
  • Bad record: Vincent

Two records are missing goat names. That's what the first two lines show you.

A coupla new lines

Here's code Ray wrote:

  • def clean_goat_scores(raw_goat_scores):
  •     # Create a new list for the clean records.
  •     cleaned_goat_scores = []
  •     # Loop over raw records.
  •     for raw_record in raw_goat_scores:
  •         # Is the record OK?
  •         if is_record_ok(raw_record):
  •             # Yes, make a new record with the right data types.
  •             clean_record = {
  •                 'Goat': raw_record['Goat'],
  •                 'Before': float(raw_record['Before']),
  •                 'After': float(raw_record['After'])
  •             }
  •             # Add the new record to the clean list.
  •             cleaned_goat_scores.append(clean_record)
  •     # Send the cleaned list back.
  •     return cleaned_goat_scores
Reflect

Add code to show the Goat field of invalid records.

Ethan

Something like:

  •     for raw_record in raw_goat_scores:
  •         # Is the record OK?
  •         if is_record_ok(raw_record):
  •             ...
  •             cleaned_goat_scores.append(clean_record)
  •         else:
  •             print('Bad record: ' + str(raw_record['Goat']))

Nice!

Make it a param

You could add a param to clean_goat_scores to control whether it shows bad records. We could even make it an optional param.

  • def clean_goat_scores(raw_goat_scores, show_bad_records):
  •     ...
  •     for raw_record in raw_goat_scores:
  •         # Is the record OK?
  •         if is_record_ok(raw_record):
  •             ...
  •         else:
  •             if show_bad_records:
  •                 # Show the record that has an issue.
  •                 print('Bad record: ' + str(raw_record['Goat']))
  •         ...

If you want to see the bad record goat names...

  • cleaned_goat_scores = clean_goat_scores(raw_goat_scores, True)

If you don't want to see them...

  • cleaned_goat_scores = clean_goat_scores(raw_goat_scores, False)

Make it optional

We can do one better, so if you call the function in the usual way...

  • cleaned_goat_scores = clean_goat_scores(raw_goat_scores)

... you don't see the messages. But you can pass the param if you want to see them.

  • cleaned_goat_scores = clean_goat_scores(raw_goat_scores, True)

Python supports optional parameters. You give the param a default value in the def line. If the caller leaves the param out, Python uses the default.
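
Here's a tiny sketch of the idea before we apply it to the cleaning function. The function describe_goat and its shout param are made up for this example; they aren't part of Ray's code.

```python
# Hypothetical example function, just to show how a default value works.
def describe_goat(name, shout=False):
    # shout is optional. It is False unless the caller passes something else.
    message = name + ' is a goat'
    if shout:
        message = message.upper() + '!'
    return message


# Param left out, so shout takes its default, False.
print(describe_goat('Bertha'))              # Bertha is a goat
# Param passed by position.
print(describe_goat('Bertha', True))        # BERTHA IS A GOAT!
# Param passed by name, which is often easier to read.
print(describe_goat('Bertha', shout=True))  # BERTHA IS A GOAT!
```

The default goes in the def line, after an equals sign. Params with defaults have to come after params without them.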

Here's the final code, with docstring. If it's called without the second param...

  • cleaned_goat_scores = clean_goat_scores(raw_goat_scores)

... you won't get the error report.

  • def clean_goat_scores(raw_goat_scores, show_bad_records=False):
  •     '''
  •     Clean score data. Optionally display the names of goats in invalid records.
  •  
  •     Parameters
  •     ----------
  •     raw_goat_scores : Data set (list of dictionaries)
  •         Data set with possible errors, and wrong types.
  •     show_bad_records : boolean, optional
  •         Identify bad records? The default is False.
  •  
  •     Returns
  •     -------
  •     cleaned_goat_scores : Data set (list of dictionaries)
  •         Valid records only.
  •  
  •     '''
  •     # Create a new list for the clean records.
  •     cleaned_goat_scores = []
  •     # Loop over raw records.
  •     for raw_record in raw_goat_scores:
  •         # Is the record OK?
  •         if is_record_ok(raw_record):
  •             # Yes, make a new record with the right data types.
  •             clean_record = {
  •                 'Goat': raw_record['Goat'],
  •                 'Before': float(raw_record['Before']),
  •                 'After': float(raw_record['After'])
  •             }
  •             # Add the new record to the clean list.
  •             cleaned_goat_scores.append(clean_record)
  •         else:
  •             if show_bad_records:
  •                 # Show the record that has an issue.
  •                 print('Bad record: ' + str(raw_record['Goat']))
  •  
  •     # Send the list back.
  •     return cleaned_goat_scores
Adela

Hey, that's cool!
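
If you want to try it out, here's a self-contained sketch you can paste into a file and run. Two things in it are assumptions, not part of the lesson: the sample data is made up, and is_record_ok is a stand-in that accepts a record when it has a goat name and both scores convert to numbers. The real check may use different rules.

```python
def is_record_ok(record):
    # Hypothetical stand-in for the lesson's validity check.
    # Assumed rule: a goat name is present, and both scores are numbers.
    if not str(record.get('Goat') or '').strip():
        return False
    try:
        float(record['Before'])
        float(record['After'])
    except (KeyError, TypeError, ValueError):
        return False
    return True


def clean_goat_scores(raw_goat_scores, show_bad_records=False):
    # Same logic as the final code above, without the docstring.
    cleaned_goat_scores = []
    for raw_record in raw_goat_scores:
        if is_record_ok(raw_record):
            cleaned_goat_scores.append({
                'Goat': raw_record['Goat'],
                'Before': float(raw_record['Before']),
                'After': float(raw_record['After'])
            })
        else:
            if show_bad_records:
                print('Bad record: ' + str(raw_record['Goat']))
    return cleaned_goat_scores


# Made-up sample data, not the course's data set.
raw_goat_scores = [
    {'Goat': 'Adelaide', 'Before': '3.5', 'After': '4.0'},
    {'Goat': 'Bertha', 'Before': 'lots', 'After': '4.2'},
    {'Goat': '', 'Before': '2.0', 'After': '2.5'},
]

# The usual call. Nothing is printed.
quiet = clean_goat_scores(raw_goat_scores)

# Ask for the report. The two bad records are printed, one per line.
noisy = clean_goat_scores(raw_goat_scores, show_bad_records=True)
```

Both calls return the same cleaned list. The param only changes what gets printed. Passing it by name (show_bad_records=True) is optional, but it tells the reader what the True is for.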

Summary

  • Data errors are hard to see sometimes.
  • You can add a param to a cleaning function to show bad records.
  • You can make the param optional.